The Public Well-Log Benchmark Landscape: What FORCE, Xeek, and Volve Standardised

A field cannot measure its own progress without a shared yardstick, and for a long stretch petrophysical machine learning did not have one. Models were trained and evaluated on proprietary well data that never left the operator that owned it, so a claimed result was almost impossible to reproduce and nearly meaningless to compare. What changed that was not a new algorithm but a small set of open data releases. This piece is a map of that landscape. It credits the releases that gave the field its common benchmarks, surveys what each of them actually standardised, and then places one more corpus on the map for context: the raster archive we built for digitising scanned logs. We want to be clear about authorship from the outset. The benchmarks below are not ours; they are the open releases that made the field legible, and the most useful thing we can do here is credit them precisely and then say honestly where our own corpus does and does not fit alongside them.

Why a shared benchmark mattered at all

The reason the open releases were transformative is structural rather than technical. Well-log data is commercially sensitive, expensive to acquire, and historically locked inside the company that drilled the well. That meant the natural state of the field was a collection of private leaderboards, each operator quietly evaluating its own models on its own data, with no way for an outside reader to check a result or for two groups to compare methods on equal footing. An open, labelled, well-documented dataset breaks that deadlock in one move: it turns a private claim into a public, falsifiable one. The same dynamic that public benchmarks brought to image classification and to natural-language processing is what these releases brought to subsurface machine learning, and the credit for the shift belongs to the people and institutions who chose to open the data, not to any one model that trained on it.

There is prior art for this even before the headline releases, and it deserves a mention. An early and influential example was an open, reproducible facies-classification exercise that published both the data and the code so that anyone could run it end to end [5]. It was small, but it established the template the larger competitions later followed: real well logs, an explicit label, a fixed evaluation, and nothing hidden. The releases we focus on below scaled that template up.

Volve: a whole field opened

The release that signalled how much was possible was Equinor's decision to open the Volve field data set in 2018 [1]. Volve is a decommissioned oil field in the Norwegian North Sea, and what Equinor published was not a curated benchmark slice but something closer to the raw contents of a working subsurface archive: well logs, seismic, production data, reports, and more, released for research and education. For machine learning specifically, Volve mattered less as a tidy labelled task and more as a demonstration that a major operator would release a complete, real field openly. It gave researchers genuine, messy, depth-aligned well data to work with, and it set a precedent that an open release could be comprehensive rather than token. Its strength is realism and breadth; its limitation, from a pure-benchmark point of view, is that it was not packaged as a single labelled prediction task with a fixed score, so different groups still had to make their own choices about exactly what to predict and how to evaluate it.

FORCE 2020 and Xeek: a benchmark with a fixed score

The release that most directly standardised a task was the FORCE 2020 lithology dataset and competition [2]. Here the contribution was precisely the thing Volve left open: a clearly defined prediction problem with a labelled target and a single agreed metric. The dataset assembled wells from the Norwegian continental shelf, each with a suite of wireline measurements as inputs and an interpreted lithology label as the target, and the task was to predict the lithology log from the measurement logs. By fixing the inputs, the label, the train and test split, and the scoring rule, FORCE 2020 turned lithology prediction into something every entrant solved on identical terms, which is the whole point of a benchmark.

The competition was run on the Xeek platform, which hosted the challenge and standardised its evaluation and leaderboard [3]. It is worth separating the two contributions even though they are often spoken of in one breath. The dataset is the corpus and the labels; the platform is the fixed evaluation, the held-out test set, and the public leaderboard that made every submission comparable. Both were necessary. A labelled corpus with no agreed evaluation is still a private leaderboard waiting to happen, and an evaluation platform with no open corpus has nothing to evaluate. Together they delivered what the field had been missing: a reproducible, public petrophysical ML task.
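The division of labour described above can be sketched in a few lines. This is a minimal illustration, not the platform's actual scoring code: plain per-sample accuracy stands in for whatever metric the competition used, and the well names, labels, team names, and the function `score_submission` are all hypothetical.

```python
from typing import Mapping, Sequence


def score_submission(predicted: Mapping[str, Sequence[str]],
                     held_out: Mapping[str, Sequence[str]]) -> float:
    """Return the fraction of depth samples whose predicted label matches the hidden one.

    Plain accuracy is an assumption made for illustration; it is a stand-in
    for an agreed metric, not the competition's real scoring rule.
    """
    if set(predicted) != set(held_out):
        raise ValueError("submission must cover exactly the held-out wells")
    correct = 0
    total = 0
    for well, truth in held_out.items():
        guess = predicted[well]
        if len(guess) != len(truth):
            raise ValueError(f"well {well}: expected {len(truth)} samples, got {len(guess)}")
        correct += sum(g == t for g, t in zip(guess, truth))
        total += len(truth)
    if total == 0:
        raise ValueError("held-out set is empty")
    return correct / total


# Hypothetical held-out labels: kept by the platform, never shown to entrants.
HELD_OUT = {"well_A": ["shale", "shale", "sand"], "well_B": ["sand", "lime"]}

# Two hypothetical submissions, scored by the same function on the same labels.
submission_1 = {"well_A": ["shale", "sand", "sand"], "well_B": ["sand", "lime"]}
submission_2 = {"well_A": ["shale", "shale", "sand"], "well_B": ["lime", "sand"]}

scores = {
    "team_1": score_submission(submission_1, HELD_OUT),
    "team_2": score_submission(submission_2, HELD_OUT),
}
# The leaderboard is just the scores, sorted from best to worst.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The design point is that entrants never choose the labels or the metric. Every submission passes through the same function against the same hidden data, which is what makes two numbers on the leaderboard comparable.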

A widely used teaching slice of this data makes its shape concrete. A tutorial on handling missing values walks through a subset of 118 wells from the Norwegian Sea, each described by 22 measurement columns, and a great deal of the practical value of the release is visible in that single sentence [4]. The data is vector LAS, meaning each curve is already a depth-indexed numeric series rather than a picture of one; it is multi-well, so a model can be trained on some wells and tested on others; and it is tabular and dense, with a manageable, named set of features per well. That is exactly the form that supervised learning consumes most readily, and it is why FORCE and Xeek became the default starting point for petrophysical ML.
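The multi-well property is what makes an honest split possible: whole wells are held out, so no depth sample from a test well leaks into training. A minimal sketch of such a well-wise split follows, using only the standard library. The column names `WELL`, `DEPTH`, and `GR`, the toy values, and the helper `split_by_well` are assumptions for illustration, not the dataset's documented schema or the competition's actual split.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_well(rows: List[Row], test_fraction: float = 0.2,
                  seed: int = 0, well_key: str = "WELL") -> Tuple[List[Row], List[Row]]:
    """Split tabular log samples so that each well lands wholly in train or in test."""
    wells = sorted({str(row[well_key]) for row in rows})
    if len(wells) < 2:
        raise ValueError("need at least two wells to hold one out")
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(wells)
    # Hold out at least one well, and always leave at least one for training.
    n_test = min(len(wells) - 1, max(1, round(len(wells) * test_fraction)))
    test_wells = set(wells[:n_test])
    train = [r for r in rows if str(r[well_key]) not in test_wells]
    test = [r for r in rows if str(r[well_key]) in test_wells]
    return train, test


# Toy table: 5 hypothetical wells, 4 depth samples each, one gamma-ray column.
rows = [{"WELL": f"W{w}", "DEPTH": 1000.0 + 0.5 * i, "GR": 50.0 + w + i}
        for w in range(5) for i in range(4)]

train_rows, test_rows = split_by_well(rows, test_fraction=0.2, seed=42)
train_wells = {r["WELL"] for r in train_rows}
held_out_wells = {r["WELL"] for r in test_rows}
```

Splitting by row instead of by well would put neighbouring depth samples from the same borehole on both sides of the split, which tends to inflate scores. Grouping by well avoids that.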

Where our corpus sits, and where it does not

Against that backdrop we can place our own corpus honestly, which is the only reason to put it on the same map. We built a raster archive derived from the public well records of the Railroad Commission of Texas, the state regulator that maintains a vast collection of scanned well logs from onshore Texas wells [6]. The archive we assembled from it contains 136,771 raster TIF files alongside 7,781 LAS files. The contrast with the open benchmarks is the entire point, so it is worth stating plainly rather than dressing up.

Figure: An atlas of the public well-log benchmarks that gave petrophysical ML its shared yardsticks, with our raster archive plotted as one more point on the same map. Each bubble is a corpus: the horizontal axis is its size on a log scale, the vertical axis is an illustrative reading of how richly it is labelled (expert-curated at the top, weak labels at the bottom), and the bubble area grows with size. The teal points are the open vector-LAS releases we credit: FORCE 2020 hosted on Xeek (118 Norwegian Sea wells, 22 measurement columns, per McDonald 2021 [4]) and Equinor's Volve field-data release [1]. The single orange point is our TXRRC-derived corpus (136,771 raster TIF files plus 7,781 LAS), which is roughly three orders of magnitude larger in raw count than the 118-well slice but raster-image first and weak-labelled rather than expert-curated, so it is complementary to the open releases rather than a replacement for them. The well, file, and column counts are real; the label-richness axis and the bubble sizes are an illustrative reading of modality and curation, not a measured score.

On raw count, 136,771 files against 118 wells, our corpus is roughly three orders of magnitude larger than the teaching slice, but that number flatters it, both because files and wells are different units and because the two corpora are not the same kind of thing. The FORCE 2020 and Volve data is vector LAS: each log is already a clean, depth-aligned numeric curve with named measurement columns and, in the FORCE case, an expert lithology label. Our 136,771 files are raster images: photographs, in effect, of paper logs, where the curve exists only as ink on a scan and has to be recovered before any of it becomes a numeric series at all. The 7,781 LAS files in our archive are the small minority for which a vector log already exists; the overwhelming bulk is image-first and weakly labelled. So our corpus trades the open benchmarks' greatest strengths, clean vectors and expert labels, for a different strength entirely: enormous scale and a faithful sample of the messy, scanned reality that most legacy well data actually lives in. The figure above draws that trade as a single map, with the open releases as the curated, vector-native yardsticks in teal and our raster archive as the large, weak-labelled outlier in orange.
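The size comparison in the paragraph above is simple arithmetic and can be checked directly from the counts quoted in this piece. The sketch below only restates those figures; the variable names are ours, and the ratio compares files with wells, which are different units, so it should be read as a rough scale indicator and nothing more.

```python
import math

# Counts quoted in the text above.
RASTER_TIF_FILES = 136_771
LAS_FILES = 7_781
TEACHING_SLICE_WELLS = 118

# Raster files per benchmark well (a files-to-wells ratio, so only indicative).
ratio = RASTER_TIF_FILES / TEACHING_SLICE_WELLS
orders_of_magnitude = math.log10(ratio)

# Share of the archive's files that already exist as vector LAS.
las_share = LAS_FILES / (RASTER_TIF_FILES + LAS_FILES)

print(f"raster files per benchmark well: {ratio:.0f}")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
print(f"LAS share of archive files: {las_share:.1%}")
```

The ratio comes out a little above one thousand to one, which is where "roughly three orders of magnitude" comes from, and the LAS files are only a few percent of the archive by file count.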

We are deliberately not claiming our corpus is a better benchmark. It is not even a benchmark in the FORCE sense, because it does not ship with a fixed task, a held-out test split, and an agreed score. What it is instead is complementary. The open vector-LAS releases are the right yardstick for the question of predicting one curve from others, or a lithology from a measurement suite, on clean data. Our raster corpus is the right starting material for a different question entirely: recovering a numeric log from a scanned image in the first place, which is the problem VeerNet, our encoder-decoder for raster-log digitisation, was built to solve. The two sit at opposite ends of the same pipeline. You need something like the raster corpus to turn a paper archive into vectors, and you need something like the FORCE benchmark to evaluate models on the vectors once you have them.

What the open releases actually standardised

If there is a single thread through this map, it is that the open releases standardised three things that the field could not have shared otherwise, and that none of those three is a model. The first is a corpus: real well data, openly released, that anyone can download and inspect. The second is a task: a fixed prediction problem with an explicit label, so that two groups are demonstrably solving the same thing. The third is an evaluation: a held-out split and an agreed metric, so that a number on a leaderboard means the same thing to everyone reading it. Volve contributed overwhelmingly on the first axis, opening a whole field's worth of data [1]. FORCE 2020 contributed on the first and second, adding a labelled task to an open corpus [2]. Xeek contributed on the third, supplying the fixed evaluation and the public leaderboard that made submissions comparable [3]. The earlier facies exercise prefigured all three at smaller scale [5]. Each is a piece of shared infrastructure, and the field's progress rests on them.
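The three-part reading above can be written down as a small data structure, which makes it easy to see why no single release was a complete benchmark on its own. This is a simplified sketch of the argument in this section, not a formal taxonomy: the `Axis` enum, the `CONTRIBUTIONS` table, and both helper functions are illustrative names, and Volve is recorded under the corpus axis only because that is where the text says its contribution overwhelmingly lies.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable


class Axis(Enum):
    CORPUS = "corpus"          # open data anyone can download and inspect
    TASK = "task"              # fixed prediction problem with an explicit label
    EVALUATION = "evaluation"  # held-out split and an agreed metric


# Which axes each release supplied, as described in the paragraph above.
CONTRIBUTIONS: Dict[str, FrozenSet[Axis]] = {
    "Volve": frozenset({Axis.CORPUS}),
    "FORCE 2020": frozenset({Axis.CORPUS, Axis.TASK}),
    "Xeek": frozenset({Axis.EVALUATION}),
}


def missing_axes(releases: Iterable[str]) -> FrozenSet[Axis]:
    """Return the axes that a combination of releases still leaves uncovered."""
    covered = set()
    for name in releases:
        covered |= CONTRIBUTIONS[name]
    return frozenset(set(Axis) - covered)


def is_complete_benchmark(releases: Iterable[str]) -> bool:
    """A benchmark in the sense used here needs all three axes covered."""
    return not missing_axes(releases)


volve_gaps = missing_axes(["Volve"])
force_alone_gaps = missing_axes(["FORCE 2020"])
force_on_xeek_complete = is_complete_benchmark(["FORCE 2020", "Xeek"])
```

Under this encoding FORCE 2020 together with Xeek covers all three axes, while Volve alone leaves the task and the evaluation open, which matches the account given in the sections above.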

Our contribution to this particular map is modest and we will not inflate it: one more corpus, of a different modality, plotted for context. The honest reading of the atlas is that the open releases set the coordinate system and we have added a single point to it, in a corner the vector benchmarks do not cover, the raster-native, weak-labelled, very-large-scale corner that is where most of the world's historical logs still sit. Crediting the releases that drew the map is the first and most important thing to say. Adding one point to it is the second.

Discussion

For a team starting petrophysical ML today, the practical guidance that falls out of this map is simple. If your data is already clean vector LAS and your question is curve-to-curve or curve-to-label prediction, start from the open benchmarks: FORCE 2020 on Xeek for a labelled, scored task [2][3], Volve when you want a fuller, more realistic field to work against [1], and the published tutorials and slices that make them approachable [4]. If instead your raw material is scanned paper logs, a regulatory raster archive like the Texas one is the starting point, but recognise that you are at the front of the pipeline and the digitisation problem comes first, before any of the open benchmarks become relevant to you. The two are not in competition. They are different stations on the same line, and the open releases are the reason the line has a shared map at all.

Key takeaways

Petrophysical ML got its shared yardsticks from open data releases, not new models: Equinor's Volve field data (2018), the FORCE 2020 lithology dataset, and the Xeek platform that hosted and scored it. We credit those releases as the standard-setters.
The open releases standardised three things, none of them a model: a public corpus (Volve), a labelled task (FORCE 2020), and a fixed evaluation with a leaderboard (Xeek). An earlier open facies exercise prefigured all three at smaller scale.
A widely used teaching slice of the FORCE / Xeek data is 118 Norwegian Sea wells with 22 measurement columns, in vector-LAS form: depth-aligned numeric curves, multi-well, dense, and expert-labelled. That clean, tabular shape is why it became the default benchmark.
Our TXRRC-derived corpus (136,771 raster TIF files plus 7,781 LAS) is roughly three orders of magnitude larger in raw count but raster-image first and weak-labelled. It trades the benchmarks' clean vectors and expert labels for scale and a faithful sample of scanned, legacy reality.
Our corpus is complementary, not a replacement: it has no fixed task or score, so it is not a benchmark in the FORCE sense. It is the raw material for the digitisation step (the problem VeerNet solves) that comes before the open benchmarks become relevant.

References

[1] Equinor. Volve field data set. Equinor open data release (2018). The full set of subsurface and production data from the decommissioned Volve field in the Norwegian North Sea, released for research and education. https://www.equinor.com/energy/volve-data-sharing

[2] Bormann, P., Aursand, P., Dilib, F., Manral, S., and Dischington, P. FORCE 2020 Well log and lithofacies dataset for machine learning competition. FORCE / GitHub (2020). The labelled North Sea and Norwegian Sea well-log corpus used in the FORCE 2020 lithology prediction contest. https://github.com/bolgebrygg/Force-2020-Machine-Learning-competition

[3] Xeek (Enthought). FORCE 2020: Predict lithology from well logs. Xeek competition platform (2020). The data-science challenge that hosted the FORCE 2020 lithology task and standardised its evaluation. https://xeek.ai/challenges/force-well-logs/overview

[4] McDonald, A. Using the missingno Python library to identify and visualise missing data prior to machine learning. Towards Data Science (2021). A tutorial on the FORCE 2020 / Xeek dataset slice of 118 Norwegian Sea wells with 22 measurement columns. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009

[5] Hall, B. Facies classification using machine learning. The Leading Edge, 35(10), 906-909 (2016). An early open, reproducible well-log facies-classification benchmark that helped set the template the later competitions followed. https://library.seg.org/doi/10.1190/tle35100906.1

[6] Railroad Commission of Texas. Well log and digital records, public well data. Texas RRC (accessed 2022). The state regulatory archive of scanned raster well logs from which our raster corpus is derived. https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/
