Public Benchmarks and Datasets for Document Image Segmentation

Abstract

We needed a model that could segment the curve traces in scanned raster well logs, and the standard first step in any computer-vision project is to look for a public dataset that matches the problem closely enough to train on or pretrain from. This paper is the result of that search. We catalogue the public document-image segmentation datasets that were available to us when we started, credit the groups that released them, and characterise what each one actually contains: PubLayNet and DocBank for born-digital page layout, the PRImA layout-analysis dataset and DIVA-HisDB for scanned and historical pages, and the cBAD baseline-detection set from the READ project. Every one of these is a serious, well-constructed release for the documents it targets. The finding is that none of them covers the image geometry that a well-log scan presents. Document benchmarks are built from page-shaped images with an aspect ratio below one, whereas a well-log strip is thousands of pixels wide and only a few hundred tall, an aspect ratio that runs from roughly 5:1 to nearly 27:1. That region of the property space is empty, and the emptiness is structural rather than incidental, which is why we ultimately synthesized our own training corpus rather than adapting any of the public sets.

Background and the public landscape

Document image analysis has a mature culture of shared benchmarks, and a team approaching a new segmentation problem inherits a real advantage from it. The releases below are the ones we considered, grouped by the kind of document they target. We are surveying the field here, not claiming any of these datasets as ours; the contribution of this paper is the map and the gap, and the credit for the datasets belongs entirely to the groups who built them.

Born-digital page layout

The two largest and most influential releases we evaluated are built from born-digital documents rather than scans. PubLayNet assembled over 360 thousand page images by automatically matching the XML and PDF representations of more than a million PubMed Central articles, labelling each page into text, titles, lists, tables, and figures, and it became a default pretraining corpus for layout models because of its sheer scale and clean labels [1]. DocBank took a complementary route, rendering arXiv papers from their LaTeX source so that the typesetting itself supplies token-level layout labels with no manual annotation, producing a weakly supervised set that is large, consistent, and tightly tied to the structure of academic writing [2]. Both are excellent, and both share a defining property: their images are renders of pages, so every sample is portrait-oriented, dense with type, and roughly the shape of a sheet of paper.

Scanned and historical pages

Where PubLayNet and DocBank are born-digital, two further releases address the harder problem of segmenting real scans. The PRImA layout-analysis dataset was assembled to evaluate layout methods on realistic contemporary documents, scanned magazine and journal pages with region-level ground truth, and it has served as a standard performance-evaluation set for that task for over a decade [3]. DIVA-HisDB pushes into a different and more difficult regime, with pixel-level layout annotation of challenging medieval manuscript scans, where degradation, bleed-through, and irregular layout make the segmentation genuinely hard [4]. These two introduce the noise, skew, and artefacts of real imaging that the born-digital sets do not have, which is exactly the kind of realism a scanned-log model needs. Their images are still pages, though, scanned at page proportions.

Line and baseline structure

A third strand of the field segments structure within the page rather than the page into regions. The cBAD competition and dataset from the READ project framed text-line baseline detection on archival documents as a benchmark task, supplying page scans with annotated baselines and a fixed evaluation protocol [5]. We looked at it carefully because a well-log curve and a text baseline are both thin, elongated structures that run the length of an image, and the segmentation machinery for one might transfer to the other. The analogy holds at the level of method but not at the level of data, for the same reason the other releases did not fit: cBAD is a page benchmark, and its images are page-shaped.

How we surveyed the field and measured the gap

Our method was deliberately simple, because the question is concrete. For each public release we recorded what it segments, how it is labelled, and, most importantly for our purposes, the dimensions and aspect ratios of its images. We then placed each dataset as a coverage region in a two-dimensional property space whose axes are image width in pixels and aspect ratio expressed as width over height. Against that same space we plotted the region our own data occupies. Our synthetic well logs span 3,200 to 12,800 pixels in width and 480 to 640 pixels in height, which puts their aspect ratio between 5:1 at the squarest and almost 27:1 at the most extreme. The comparison is therefore not a judgement about quality, since the public datasets are high quality; it is a geometric question about whether any of them reaches into the shape of image we actually have to process.
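The aspect-ratio bounds quoted above follow directly from the corpus dimensions. As a minimal sketch of that arithmetic (the function name is ours, chosen only for illustration), the squarest shape pairs the narrowest width with the tallest height, and the most extreme shape pairs the widest width with the shortest height:

```python
# Bounds of the synthetic well-log corpus, as stated in the text (pixels).
WIDTH_RANGE = (3200, 12800)
HEIGHT_RANGE = (480, 640)


def aspect_ratio_bounds(width_range, height_range):
    """Return (min, max) of width / height over the given ranges."""
    w_min, w_max = width_range
    h_min, h_max = height_range
    # Smallest ratio: narrowest image at its tallest.
    # Largest ratio: widest image at its shortest.
    return w_min / h_max, w_max / h_min


lo, hi = aspect_ratio_bounds(WIDTH_RANGE, HEIGHT_RANGE)
print(f"{lo:.1f}:1 to {hi:.1f}:1")  # prints "5.0:1 to 26.7:1"
```

Any portrait page, by contrast, has a ratio below 1.0 on the same definition, so the two ranges cannot overlap.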

We chose aspect ratio as the decisive axis for a specific reason. A segmentation model is not shape-agnostic. Receptive fields, the spatial-pooling cadence of an encoder, padding and tiling behaviour, and the statistics a model learns about where structure tends to lie are all conditioned on the proportions of the images it trained on. A model pretrained exclusively on portrait pages has never seen a feature that runs uninterrupted across twelve thousand pixels of width while the vertical extent stays small, and that is the defining situation in a well log. So the gap on the aspect-ratio axis is not cosmetic. It is the axis along which transfer is most likely to fail.
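To make the shape dependence concrete, the sketch below computes the feature-map grid an encoder would produce for a portrait page and for a well-log strip. It assumes a generic encoder with five stride-2 downsampling stages and a hypothetical 1024 by 1408 pixel page; neither figure is taken from a specific model or dataset discussed in this paper.

```python
import math


def feature_map_shape(width, height, num_stages=5):
    """Grid size (cols, rows) after num_stages stride-2 downsamplings.

    Assumption: each stage halves both dimensions, rounding up, as a
    stride-2 layer with "same"-style padding would.
    """
    factor = 2 ** num_stages
    return math.ceil(width / factor), math.ceil(height / factor)


page = feature_map_shape(1024, 1408)   # hypothetical portrait page
strip = feature_map_shape(12800, 480)  # widest, shortest well-log strip

print("page grid (cols, rows): ", page)
print("strip grid (cols, rows):", strip)
```

Under these assumptions the page collapses to a 32 by 44 grid while the strip becomes 400 by 15. The bottleneck of a page-trained encoder has therefore never been asked to represent a feature spanning hundreds of columns across only a handful of rows.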

Results

The atlas below draws the outcome of that survey directly. Each public document-segmentation benchmark is a teal coverage rectangle sitting where its images live in the width-versus-aspect plane, and the orange band is the region our well-log scans occupy.

A coverage atlas of the public document-image segmentation datasets that existed when we started, plotted against the image shapes a scanned well log actually takes. The horizontal axis is image width in pixels on a log scale; the vertical axis is the aspect ratio, width over height, also on a log scale. Each teal rectangle is the dimension envelope of one public release: PubLayNet and DocBank (PDF and arXiv page renders), the PRImA layout-analysis set (magazine and journal scans), DIVA-HisDB (historical manuscript scans), and the cBAD baseline-detection set from the READ project. They all cluster in the portrait page band, with an aspect ratio below one, because they are pictures of pages. The single orange dashed band is where our synthetic well logs live: 3,200 to 12,800 pixels wide and only 480 to 640 pixels tall, an aspect ratio that runs from 5:1 up to almost 27:1. Drag the probe along the synthetic range and the read-out counts how many public benchmarks cover that exact shape; it stays at zero from one end of the range to the other. That empty corner is the structural blind spot this paper is about, and the reason we had to build our own corpus. The well-log dimensions are the actual bounds of our synthetic corpus; the public-benchmark coverage rectangles are an illustrative reading of each dataset's typical page dimensions, not a per-image measurement.

Two things are visible at once. First, the public releases cluster tightly in the portrait page band, all with aspect ratios below one, which is the expected consequence of the fact that they are images of pages. PubLayNet and DocBank sit at the narrow, born-digital end; PRImA, DIVA-HisDB, and cBAD spread toward larger scan widths but stay at page proportions. Second, the well-log region floats high above all of them on the aspect axis, in a corner of the space that no rectangle reaches. The interactive read-out makes the consequence unambiguous: drag the probe anywhere along the synthetic-log range, from the squarest 5:1 shape to the most extreme 26.7:1 shape, and the count of public benchmarks covering that point never leaves zero. There is no width at which a portrait-page dataset and a well-log strip share the same proportions, because a page has an aspect ratio below one while a log strip starts at 5:1, so the two differ by at least a factor of five, and typically by an order of magnitude or more, on the dimension that matters most for segmentation.
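The logic of the probe read-out can be sketched in a few lines. The envelope numbers below are placeholders, not measurements of any named dataset: they encode only the one property the text relies on, that page images have an aspect ratio below one. The class and function names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Axis-aligned coverage rectangle in the width-versus-aspect plane."""
    name: str
    width: tuple   # (min, max) image width in pixels
    aspect: tuple  # (min, max) aspect ratio, width / height

    def covers(self, width, aspect):
        return (self.width[0] <= width <= self.width[1]
                and self.aspect[0] <= aspect <= self.aspect[1])


# PLACEHOLDER envelopes: hypothetical values, not dataset measurements.
PAGE_ENVELOPES = [
    Envelope("page-like envelope A", (500, 2500), (0.6, 0.95)),
    Envelope("page-like envelope B", (2000, 6000), (0.6, 0.95)),
]


def coverage_count(width, height, envelopes=PAGE_ENVELOPES):
    """Number of envelopes that contain an image of the given size."""
    aspect = width / height
    return sum(e.covers(width, aspect) for e in envelopes)


# Probe the synthetic well-log range stated in the text.
probes = [(w, h) for w in range(3200, 12801, 1600) for h in (480, 560, 640)]
counts = [coverage_count(w, h) for w, h in probes]
print("max coverage over the well-log range:", max(counts))
```

Because every probe has an aspect ratio of at least 5 and every page envelope tops out below 1, the count is zero at every probe whatever widths the envelopes span. That is the sense in which the gap is structural.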

This is the empirical core of the paper. The blind spot is not that public document datasets are too small, or too clean, or labelled for the wrong regions, though some of those are also true. It is that the public document-segmentation corpus is, almost by definition, built from pages, and a well log is not a page. The one structural property a model most needs to have seen during training, the extreme aspect-ratio geometry of the input, is the one property no public release supplies.

Discussion

The honest reading of this survey is that the public datasets did their job and ours was simply a different job. PubLayNet and DocBank remain the right starting points for anyone segmenting page layout [1][2], PRImA and DIVA-HisDB are the right sets for realistic and historical scans [3][4], and cBAD is the right benchmark for line and baseline structure on archival pages [5]. We credit all of them and would reach for them again on the problems they were built for. What the atlas establishes is that none of them transfers cleanly to the proportions of a scanned well log, and once that is clear the path forward is determined.

Because the gap is on the input geometry rather than on the labels or the scale, the most reliable way to close it is to generate the data ourselves with the geometry built in. This is exactly the situation domain randomization was introduced to handle: when real labelled data in the target regime is scarce or absent, train on randomized synthetic renders that span the variation you expect to meet, and a model can learn features that transfer to real images [6]. For us that meant rendering synthetic well logs across the full 3,200 to 12,800 pixel width range and the 480 to 640 pixel height range, with randomized curve shapes, grid backgrounds, and noise, so the model sees the extreme aspect ratios from the first epoch rather than meeting them for the first time at inference. The synthetic corpus is what VeerNet, our encoder-decoder for raster-log digitisation, was trained on, and the reason it exists is precisely the empty corner this atlas draws. We did not synthesize data because the public datasets were poor. We synthesized it because the shape we needed was simply not on the map.
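A minimal sketch of the randomized sampling step follows. It is not the generator used to train VeerNet: the width and height ranges come from the text, while the grid-spacing range, noise range, random-walk curve model, and all names are assumptions made for illustration. It samples a specification for one strip rather than rasterizing it, which keeps the example within the standard library.

```python
import random
from dataclasses import dataclass


@dataclass
class LogSpec:
    width: int          # pixels, sampled from the range given in the text
    height: int         # pixels, sampled from the range given in the text
    grid_spacing: int   # hypothetical: pixels between grid lines
    noise_sigma: float  # hypothetical: std. dev. of additive pixel noise
    curve_rows: list    # row index of the curve in each column


def sample_log_spec(rng):
    """Draw one randomized well-log specification (illustrative only)."""
    width = rng.randint(3200, 12800)
    height = rng.randint(480, 640)
    grid_spacing = rng.randint(20, 80)    # assumed range
    noise_sigma = rng.uniform(2.0, 12.0)  # assumed range

    # Curve shape: a random walk across the columns, clamped to the strip.
    row = rng.uniform(0.0, height - 1.0)
    curve_rows = []
    for _ in range(width):
        row = min(max(row + rng.gauss(0.0, 2.0), 0.0), height - 1.0)
        curve_rows.append(int(round(row)))

    return LogSpec(width, height, grid_spacing, noise_sigma, curve_rows)


rng = random.Random(0)  # fixed seed so the draw is reproducible
spec = sample_log_spec(rng)
print(spec.width, spec.height, round(spec.width / spec.height, 1))
```

A renderer would then draw the grid, the curve, and the noise into a `height` by `width` array and emit the curve pixels as the segmentation target. The point of the sketch is only that every sample is drawn from the extreme-aspect range by construction.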

It is worth saying where this leaves the relationship between the public field and our work. We are not adding a benchmark to the document-segmentation landscape; a synthetic, single-purpose well-log corpus is not a general benchmark and we make no such claim. We are pointing out, with the field's own datasets as the evidence, that a real and recurring class of images falls outside the geometry those datasets cover, and that for anyone working on logs, seismic strips, or other extreme-aspect scientific scans, the public document benchmarks are the right inspiration for method and the wrong source of data.

Limitations

This survey has clear boundaries. We catalogued the public document-image segmentation datasets that were available and relevant to us at the time of the work, and the field has more releases than the five we treat in depth; our claim is about the dominant, page-shaped character of the landscape, not an exhaustive enumeration. The coverage rectangles in the atlas are an illustrative reading of each dataset's typical image dimensions rather than a per-image measurement, so a small number of individual samples in any release may fall outside the box we drew for it; the separation on the aspect-ratio axis is large enough that this does not change the conclusion, but it is a simplification. The well-log dimensions we plot are the real bounds of our synthetic corpus, not a survey of every physical log that exists, and some archival logs scanned at unusual settings will sit elsewhere. Finally, our argument that page-trained models fail to transfer to well-log geometry is grounded in the structure of segmentation architectures and in our own results rather than in a controlled transfer study across every dataset named here; a systematic cross-dataset transfer benchmark would test it more rigorously than this catalogue does, and we would welcome one.

References

[1] Zhong, X., Tang, J., and Jimeno Yepes, A. PubLayNet: largest dataset ever for document layout analysis. International Conference on Document Analysis and Recognition (ICDAR), 2019. A corpus of over 360 thousand automatically annotated page images, derived from more than a million PubMed Central PDF articles, for layout segmentation of text, titles, lists, tables, and figures. https://arxiv.org/abs/1908.07836

[2] Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., and Zhou, M. DocBank: a benchmark dataset for document layout analysis. International Conference on Computational Linguistics (COLING), 2020. A weakly supervised, token-level layout dataset of arXiv page renders built from LaTeX source. https://arxiv.org/abs/2006.01038

[3] Antonacopoulos, A., Bridson, D., Papadopoulos, C., and Pletschacher, S. A realistic dataset for performance evaluation of document layout analysis. International Conference on Document Analysis and Recognition (ICDAR), 2009. The PRImA layout-analysis dataset of scanned contemporary magazine and journal pages with region-level ground truth. https://www.primaresearch.org/datasets/Layout_Analysis

[4] Simistira, F., Seuret, M., Eichenberger, N., Garz, A., Liwicki, M., and Ingold, R. DIVA-HisDB: a precisely annotated large dataset of challenging medieval manuscripts. International Conference on Frontiers in Handwriting Recognition (ICFHR), 2016. A pixel-level layout dataset of historical manuscript page scans. https://ieeexplore.ieee.org/document/7814134

[5] Diem, M., Kleber, F., Fiel, S., Grüning, T., and Gatos, B. cBAD: ICDAR2017 competition on baseline detection. International Conference on Document Analysis and Recognition (ICDAR), 2017. The READ project competition and dataset for text-line baseline detection on archival document page images. https://ieeexplore.ieee.org/document/8270258

[6] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017. The method of training on randomized synthetic renders so a model transfers to real images, which underpins our decision to synthesize a well-log corpus. https://arxiv.org/abs/1703.06907