Benchmarks for Document Image Analysis: A Survey of Datasets and Their Blind Spots

Abstract

A document-image benchmark is a grading instrument, and like any grading instrument it can only report on the material it contains. This survey reads the public document-analysis datasets from that angle. Rather than catalogue what each set includes, which the field has done well, we ask what the corpus as a whole omits, and whether the omission is patterned. It is. The public releases cluster on clean, printed, page-shaped text and thin out sharply as content becomes geometric, thin-lined, and heavily degraded. We place the canonical releases on a two-axis coverage matrix, content type against scan degradation, and use the digitisation of scanned raster well logs as the exposing case, because a strip log is the extreme corner of both axes at once: thin geometric curves under real scan degradation. The corner is empty. The consequence is the survey's central claim. A model can top a document-analysis leaderboard and still fail on a scanned log, because the leaderboard never scored the content the log is made of. To keep the argument anchored to real scale rather than to a toy example, we quote the size of the material the public sets do not reach, an archive of 136,771 scanned TIF files and 7,781 paired LAS files, and the 118 wells and 22 measurement columns of the Xeek/FORCE 2020 tutorial set that the petrophysical side of the problem lives in. This is a survey of the public field and credits the datasets it reads; the numbers used as scale markers are ours and are worked examples, not a new benchmark.

Reading a benchmark by its silences

The habit this survey pushes against is old and well documented. More than a decade ago Torralba and Efros took an unbiased look at dataset bias and showed that a recognition dataset is never a neutral sample of the world; each set carries a signature so strong that a classifier can often name the dataset an image came from, and a model's headline accuracy reflects the peculiarities of its benchmark as much as any general ability it has learned [1]. Their warning was about natural-image recognition, but it transfers cleanly to document analysis, where the benchmarks are, if anything, more narrowly sampled. The point is not that any dataset is badly built. It is that a score computed on a biased sample inherits that bias silently, and the bias is invisible in the score itself.

Recht and colleagues made the same failure mode concrete in a way that is worth carrying into any benchmark discussion. They rebuilt fresh test sets for ImageNet and CIFAR-10 by replicating the original collection process as faithfully as they could, then re-scored published models on the new sets, and found accuracy drops of eleven to fourteen points on ImageNet [5]. The models had not gotten worse; the new images were only slightly harder, drawn from the same intended distribution. A leaderboard number, in other words, can overstate how a model behaves the moment the input drifts even a little from the exact sample the benchmark froze. If a small, honest resample of the same distribution can move the score that much, a genuinely different content class, one the benchmark never sampled at all, is not a small drift but a cliff.

What the public document corpus actually samples

The public document-analysis landscape is a real achievement, and the survey credits it before it critiques it. At the born-digital end, PubLayNet assembled over 360,000 page images by automatically annotating the layout of more than a million PDF articles into text, titles, lists, tables, and figures, and it became a default pretraining corpus precisely because it is enormous and clean [4]. Its images are renders of pages: portrait, dense with type, and free of any imaging noise, because nothing was ever scanned. It is the purest possible sample of the clean-printed-text corner.

Moving toward realism, FUNSD introduced 199 real, fully annotated scanned forms with the explicit goal of studying noisy, degraded documents, and it is one of the sets that takes scan degradation seriously rather than assuming it away [3]. DIVA-HisDB pushes degradation further still, with pixel-level annotation of medieval manuscript pages where bleed-through, staining, and irregular layout make the task genuinely hard [2]. Both are exactly the kind of realism a scanned-log model needs on the degradation axis. On the content axis, though, both are still about text and page regions. The hard, noisy content they contain is degraded writing, not degraded geometry.

The one strand that reaches toward geometric content is chart and figure extraction. ChartOCR is representative: it extracts the thin geometric primitives that make up a chart, the keypoints and lines that a bar or a plotted curve reduce to, rather than blocks of type [6]. This is the closest the public field comes to the content a well-log is made of, and it matters that the closest neighbour is a chart-extraction paper rather than a mainstream document-layout benchmark. Thin-line geometric content exists in the corpus, but it lives in a small specialist corner, and it is almost always born-digital or lightly rendered rather than heavily scanned. The combination that a strip log demands, thin geometry and heavy degradation together, falls between the specialties.

How we mapped the corpus

The method is deliberately plain, because the question is structural rather than statistical. We took the canonical public releases and placed each one on a two-dimensional coverage matrix. One axis is content type, ordered from the content the field samples most to the content it samples least: printed text, tables, figures, handwriting, and thin-line geometric curves. The other axis is scan degradation, from born-digital clean through light and moderate scanning to heavy scan degradation. For each cell we asked a single question: how densely does the public corpus cover this content-and-degradation combination, from many strong releases down to none. We then marked where a scanned raster well-log sits, which is unambiguous. A log is thin geometric curves, and a field scan of one is heavily degraded, so it belongs in the bottom-right cell of the matrix.
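The mapping step can be sketched as a small lookup table. This is a minimal sketch under stated assumptions, not the survey's own tooling: every weight in COVERAGE is a hypothetical placeholder chosen only to mirror the pattern described above (dense at clean printed text, empty at the far corner), and the names CONTENT_TYPES, DEGRADATION_LEVELS, coverage_of, and empty_cells are introduced here for illustration.

```python
# Rows: content type, ordered from most-sampled to least-sampled.
CONTENT_TYPES = ["printed text", "tables", "figures", "handwriting", "thin-line curves"]

# Columns: scan degradation, from clean to heavy.
DEGRADATION_LEVELS = ["born-digital", "light scan", "moderate scan", "heavy scan"]

# Hypothetical coverage weights in [0, 1]:
# 1.0 = many strong public releases, 0.0 = no public release at all.
# These are illustrative placeholders, NOT measurements of any dataset.
COVERAGE = [
    [1.0, 0.9, 0.8, 0.7],  # printed text
    [0.8, 0.6, 0.4, 0.3],  # tables
    [0.7, 0.5, 0.3, 0.2],  # figures
    [0.6, 0.5, 0.5, 0.4],  # handwriting
    [0.5, 0.3, 0.1, 0.0],  # thin-line curves
]

# A scanned raster well log: thin geometric curves under heavy degradation.
STRIP_LOG_CELL = ("thin-line curves", "heavy scan")


def coverage_of(content, degradation):
    """Return the illustrative coverage weight for one matrix cell."""
    row = CONTENT_TYPES.index(content)
    col = DEGRADATION_LEVELS.index(degradation)
    return COVERAGE[row][col]


def empty_cells():
    """List every (content, degradation) cell with no public coverage."""
    return [
        (content, degradation)
        for row, content in enumerate(CONTENT_TYPES)
        for col, degradation in enumerate(DEGRADATION_LEVELS)
        if COVERAGE[row][col] == 0.0
    ]


if __name__ == "__main__":
    print("strip-log cell coverage:", coverage_of(*STRIP_LOG_CELL))
    print("empty cells:", empty_cells())
```

With these placeholder weights the only empty cell is the one the strip log occupies, which is the structural reading the matrix is meant to expose. Swapping in a different reading of the landscape only requires editing the COVERAGE rows.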

To keep the map honest about scale we carried three sourced anchors alongside it, as markers of the real material the benchmarks do not reach rather than as fitted data. The raster side of the problem is an archive of 136,771 scanned TIF files and 7,781 paired LAS files, which is far larger in raw count than the scanned sets surveyed here (FUNSD, for comparison, has 199 forms) and yet absent from all of the public sets. The petrophysical side, the vector measurements a digitiser ultimately has to produce, is exemplified by the Xeek/FORCE 2020 tutorial set with its 118 Norwegian Sea wells and 22 measurement columns. These numbers do not enter the coverage weights; they sit next to the matrix to fix the difference between the scale of the real task and the scale of the benchmarks that are supposed to prepare a model for it. The coverage weights themselves are an illustrative reading of the landscape, and we flag them as such wherever they appear.

Where the leaderboard goes quiet

A coverage matrix of the public document-image benchmark landscape, crossing content type (rows: printed text, tables, figures, handwriting, and thin-line geometric curves) against scan degradation (columns: born-digital clean through heavy scan). Each teal cell is filled in proportion to how densely the public sets cover that content-and-degradation combination, dense at the clean-printed-text corner and thinning toward the bottom-right. The one scarce-orange element is the empty cell that carries the argument: the intersection of thin-line geometric content with heavy scan degradation, which is exactly where a scanned strip log lives and exactly the cell the public sets leave blank. The lever raises the share of the evaluation drawn from that blind-spot cell, and as it rises the aggregate score of an illustrative leaderboard-topping model falls, because the benchmarks behind the leaderboard never contained the content a strip log is made of. The scale markers on the left are sourced: 136,771 scanned TIF files and 7,781 paired LAS files from the engagement archive, and the 118 wells and 22 measurement columns of the Xeek/FORCE 2020 tutorial set. The coverage weights and the two model competences are an illustrative reading of the landscape, not fitted measurements.

The atlas draws the survey's whole argument in one figure. Read across the top row and the corpus is dense: printed text is covered strongly at every degradation level, because that is what document analysis has always been about, and even a heavily scanned page of type has strong representation through sets like FUNSD and DIVA-HisDB [2][3]. Read down the left column and the coverage stays reasonable, because born-digital content of almost any type is cheap to generate and PubLayNet-scale sets exist for it [4]. The trouble is the diagonal. As you move toward thinner geometry and heavier degradation at once, the cells fade, and the single bottom-right cell, thin-line geometric curves under heavy scan degradation, is empty. That cell is drawn in the one scarce colour on the canvas because it carries the entire claim: it is exactly where a scanned strip log lives, and it is exactly where the public corpus is silent.

The lever makes the consequence legible. It sweeps the share of an evaluation that is drawn from that blind-spot cell, from a leaderboard-style mix that is nearly all clean printed text to one weighted toward strip-log content, and it tracks the aggregate score of an illustrative leaderboard-topping model as that share rises. The score falls, and it falls for a reason the matrix has already explained: the model was trained and graded on the top-left of the space, so its competence on the bottom-right was never built and never measured. The model competences in that read-out are illustrative and flagged on the canvas; only the scale markers, the 136,771 TIF and 7,781 LAS files and the 118 wells and 22 columns, are sourced. The figure is there to make one thing unmissable, that a benchmark's silence about a content class does not show up as a low score, it shows up as no score at all, until an evaluation that actually samples the cell forces the number down.
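The mechanism behind the lever can be written as a two-term weighted average. The sketch below assumes a simple linear mix of two competences, which is one plausible reading of the read-out rather than a confirmed description of how the figure is computed; the function name aggregate_score and the default values 0.95 and 0.20 are hypothetical, chosen only to show the direction of the effect.

```python
def aggregate_score(blind_share, covered_competence=0.95, blind_competence=0.20):
    """Aggregate score of a model on an evaluation mix.

    blind_share        -- fraction of the evaluation drawn from the blind-spot
                          cell (thin-line curves under heavy scan), in [0, 1].
    covered_competence -- hypothetical accuracy on content the benchmarks cover.
    blind_competence   -- hypothetical accuracy on the blind-spot content.

    Assumes the aggregate is a linear mix of the two competences.
    """
    if not 0.0 <= blind_share <= 1.0:
        raise ValueError("blind_share must lie in [0, 1]")
    return (1.0 - blind_share) * covered_competence + blind_share * blind_competence


if __name__ == "__main__":
    # Sweep the lever from a leaderboard-style mix to an all-strip-log mix.
    for share in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"blind-spot share {share:.2f} -> score {aggregate_score(share):.4f}")
```

At a share of zero the score equals the covered competence alone, so the blind-spot competence is invisible in the headline number however low it is. The number only moves once the evaluation actually samples the cell.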

What the empty cell costs in practice

The gap is not academic once you try to build the digitiser. Because no public set occupies the bottom-right cell, a team cannot train on it and, more quietly, cannot even validate against it with public data. The natural instinct is to reach for the nearest neighbour, and the atlas shows why each substitution fails on a different axis. A page-layout model brings clean-text competence and no geometry. A chart-extraction model brings thin-geometry competence but has almost never seen heavy scan degradation [6]. A historical-manuscript model brings degradation robustness but is tuned to writing, not curves [2]. Each covers one of the two axes the log needs and misses the other, which is the precise signature of a corpus that samples the edges of a space but not its far corner.
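The substitution argument can be checked mechanically with set arithmetic. This is a minimal sketch: the labels in BRINGS restate what the paragraph above says each model family brings, reduced to coarse tags, and the names REQUIRED_FOR_STRIP_LOG, BRINGS, missing_for_strip_log, and usable_substitutes are hypothetical helpers introduced for illustration, not part of any library.

```python
# The two competences a scanned strip log demands at once.
REQUIRED_FOR_STRIP_LOG = frozenset({"thin geometry", "heavy degradation"})

# What each nearest-neighbour model family brings, as coarse tags taken from
# the text; a deliberate simplification, not a measured capability profile.
BRINGS = {
    "page-layout": frozenset({"clean text"}),
    "chart-extraction": frozenset({"thin geometry"}),
    "historical-manuscript": frozenset({"heavy degradation"}),
}


def missing_for_strip_log(family):
    """Return the required competences that a model family does not bring."""
    return REQUIRED_FOR_STRIP_LOG - BRINGS[family]


def usable_substitutes():
    """Return the families that bring every competence a strip log needs."""
    return [name for name in BRINGS if not missing_for_strip_log(name)]


if __name__ == "__main__":
    for name in BRINGS:
        print(name, "is missing", sorted(missing_for_strip_log(name)))
    print("usable substitutes:", usable_substitutes())
```

Under these tags no single family is a usable substitute, even though the chart-extraction and historical-manuscript families together touch both required competences. That is the sense in which the corpus samples the edges of the space but not its far corner.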

This is also why the honest response to the blind spot is to synthesise the missing cell rather than to keep borrowing from the wrong ones, and why the scale markers matter to that decision. An archive of 136,771 scanned logs is enough raw material to make the problem real, but raw scans are not labelled ground truth, and the paired 7,781 LAS files cover only a fraction of them. The vector target the digitiser must hit, the kind of depth-aligned, multi-column measurement record the 118-well, 22-column Xeek/FORCE 2020 tutorial set represents, is the format a log has to become, not the format a document benchmark ever grades. A model measured only on public document sets has no signal at all about whether it can bridge that raster-to-vector gap on real scans, because every benchmark in the corpus stops at the edge of the cell where the bridge would be tested.

Discussion

Read as a whole, the coverage matrix says something more general than a note about well logs. It says that the shape of a benchmark corpus is a policy about what the field will get good at, made implicitly, and that extreme document types pay for that policy twice: once because there is nothing to train on, and again because there is nothing to be caught failing on. The dataset-bias literature named the first cost long ago [1], and the generalisation-gap work showed how easily a frozen test set flatters a model even within its own distribution [5]. The document-analysis corpus adds the sharper case, where the drift is not a slightly harder resample but a content class the corpus never sampled, and the score's silence about it is total.

The line between this survey and our applied writing is where our own work sits relative to this map. The survey is a reading of the public field's coverage, and its conclusion is a map with an empty corner, not a new dataset to fill it. Our engagement numbers are downstream of that conclusion: the 136,771-scan archive is why the problem is worth solving at scale, and the empty bottom-right cell is why VeerNet, our raster-log digitiser, had to be trained on synthetic data built to occupy that exact combination of thin geometry and degradation rather than on anything public. The survey explains why that was not a preference but a forced move. It also marks the limit of what any leaderboard can tell an operator considering an off-the-shelf document model: the leaderboard's silence about the log cell is not evidence that the model can handle logs. It is the absence of evidence either way, and on this task the two are easy to confuse.

Limitations

This is a survey and carries a survey's boundaries. It reads the canonical, most-cited public document-analysis releases and does not enumerate every dataset in a large and active field; the claim is about the dominant shape of the corpus, clean printed text over-sampled and thin degraded geometry under-sampled, not an exhaustive census, and a reader can surely name a specialist release that sits closer to the empty corner than the ones we treat. The coverage weights in the atlas are an illustrative reading of how densely the public sets populate each content-and-degradation cell rather than a per-image measurement, and the two model competences that drive the lever's score are illustrative values chosen to show the mechanism, not metrics we measured on any specific model; both are flagged as such on the canvas, and only the scale markers, the 136,771 TIF and 7,781 LAS files and the 118 wells and 22 columns, are sourced. The score the lever reports is therefore a demonstration of how an evaluation's content mix moves a headline number, not a fitted prediction of any particular model's drop. We also did not run a controlled cross-benchmark transfer study, so the claim that page-trained and chart-trained models fail on scanned logs rests on the structure of the coverage gap and on our own results rather than on a head-to-head measurement across every dataset named here; such a study would test the argument more rigorously than a coverage map can, and it would be a good thing for the field to have. Finally, the survey is fixed to the state of the public corpus at the time of writing, and later releases may reach further into the corner it draws as empty.

What to carry from the survey

A benchmark can only grade what it contains, so a corpus's omissions are invisible in its scores. The public document-analysis corpus over-samples clean printed text and under-samples thin geometric content under heavy degradation, and that bias is patterned, not random.
The dataset-bias literature (Torralba and Efros) established that a model's headline number reflects its benchmark as much as its ability, and the generalisation-gap work (Recht and colleagues) showed an 11 to 14 point accuracy drop on ImageNet from a mere honest resample of the same distribution. A genuinely unsampled content class is a cliff, not a drift.
On a coverage matrix of content type against scan degradation, the public releases are dense at the clean-printed-text corner and fade along the diagonal. The bottom-right cell, thin-line geometric curves under heavy scan degradation, is empty, and that is exactly where a scanned strip log lives.
The nearest public neighbours each cover one axis and miss the other: page-layout sets bring clean-text competence and no geometry, chart-extraction sets bring thin geometry but little degradation, and historical-manuscript sets bring degradation robustness but are tuned to writing. None covers the far corner a log needs.
Dataset representation, not model capacity, is the binding constraint for extreme document types. The real scale sits outside the benchmarks entirely (136,771 scanned TIF and 7,781 paired LAS files; the 118-well, 22-column Xeek/FORCE 2020 tutorial set), which is why the missing cell had to be synthesised rather than borrowed.

The smallest habit this survey would install is a question to ask before trusting any document model's leaderboard rank on a new content type: does the benchmark that produced that rank actually contain the cell my inputs fall into? If it does not, the rank is silent about my task rather than reassuring about it, and the honest next step is to build the missing cell and measure there.

References

[1] Torralba, A., and Efros, A. A. Unbiased Look at Dataset Bias. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011. Shows that recognition datasets carry a strong built-in bias and that a model's score reflects its benchmark as much as its own ability. https://ieeexplore.ieee.org/document/5995347

[2] Simistira, F., Seuret, M., Eichenberger, N., Garz, A., Liwicki, M., and Ingold, R. DIVA-HisDB: A Precisely Annotated Large Dataset of Challenging Medieval Manuscripts. International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016. A pixel-annotated dataset of degraded medieval manuscript pages for layout and text-line segmentation. https://ieeexplore.ieee.org/document/7814109

[3] Jaume, G., Ekenel, H. K., and Thiran, J.-P. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. arXiv preprint (2019). A set of 199 real, fully annotated scanned forms aimed at understanding noisy, degraded printed documents. https://arxiv.org/abs/1905.13538

[4] Zhong, X., Tang, J., and Jimeno Yepes, A. PubLayNet: largest dataset ever for document layout analysis. International Conference on Document Analysis and Recognition (ICDAR), 2019. Over 360,000 automatically annotated page images, drawn from more than a million born-digital PDF articles, for layout segmentation of text, titles, lists, tables, and figures. https://arxiv.org/abs/1908.07836

[5] Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet Classifiers Generalize to ImageNet? arXiv preprint (2019). Rebuilds fresh test sets by replicating the original collection process and finds large accuracy drops, evidence that a leaderboard number can overstate generalisation. https://arxiv.org/abs/1902.10811

[6] Luo, J., Li, Z., Wang, J., and Lin, C.-Y. ChartOCR: Data Extraction from Charts Images via a Deep Hybrid Framework. IEEE Winter Conference on Applications of Computer Vision (WACV), 2021. Extracts the thin geometric primitives of chart figures, one of the few document benchmarks aimed at line and curve structure rather than blocks of type. https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_Hybrid_WACV_2021_paper.html
