A Reader's Guide to the FORCE and Xeek Well-Log Competitions

A public competition is easy to read as a scoreboard and hard to read as a specification, and for the FORCE and Xeek 2020 well-log event the specification is the interesting half. Underneath the leaderboard sits a decision that the organisers made once, on everyone's behalf, and then held fixed: a single public set of 118 Norwegian Sea wells with 22 electrical-measurement feature columns, split into train and evaluation by well rather than by individual depth sample. That looks like a housekeeping detail. It is the entire point. This guide reads the competition from that angle, in the first person plural where the digitisation work is ours. Its claim is narrow and practical: what a public well-log competition standardises is not a model or even a metric but a leakage-safe way of measuring, and a team building its own digitiser has to copy that discipline before any number it produces means anything.

We should be clear about what this guide is not. It is not a recap of which architectures topped the board or by how much, and it is not a survey of what the public well-log corpus covers and fails to cover, which is a different question we have written about separately. It is a focused look at one mechanism, the split rule, and why the competition's choice of it is the part worth borrowing. Everything else the event provides, the fixed set, the fixed columns, the fixed metric, is in service of making that mechanism trustworthy and comparable across strangers.

The split rule is the specification, not the housekeeping

Start with what a split rule actually decides. When you divide a dataset into a training part and an evaluation part, you are declaring what the model is allowed to have seen. The score you report at the end is a claim about performance on data the model has not seen, and that claim is only as honest as the wall between the two parts. In a well-log set the natural unit of data looks like a row: one depth, with its 22 measurements. The tempting thing to do is shuffle those rows and deal them into train and validation at random, because that is what a generic tabular pipeline does by default and it feels fair.

It is not fair, and the reason is that depth samples inside a single well are not independent of each other. A reading at one depth is highly predictive of the reading a few centimetres above and below it, because rock changes gradually and the logging tool has real vertical resolution. Shuffle the rows and you scatter a well's depths across both sides of the wall. Now the training set contains depths that sit immediately adjacent to the depths you are scoring on, which means the model has effectively already seen a near-twin of every validation sample. The wall is still there on paper, but information walks straight through it. This is the textbook shape of data leakage: performance-inflating information from outside the legitimate training data, entering through a partition that does not respect the structure of the data [1].
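
To make the adjacency argument concrete, here is a minimal Python sketch that counts how many validation samples have a same-well neighbour sitting in training. The wells, their names, and the sample counts are synthetic stand-ins made up for illustration, not the competition data, and `adjacent_leak_fraction` is a hypothetical helper rather than anything the organisers published.

```python
import random

def adjacent_leak_fraction(rows, val_idx):
    """Share of validation rows whose same-well neighbour (one sample
    above or below) ended up in the training set."""
    val = {rows[i] for i in val_idx}
    train = set(rows) - val
    leaked = sum(
        1 for (well, k) in val
        if (well, k - 1) in train or (well, k + 1) in train
    )
    return leaked / len(val)

# Synthetic stand-in (assumption): 10 hypothetical wells, 200 depth samples each.
# Each row is (well name, depth index).
rows = [(f"W{w:02d}", k) for w in range(10) for k in range(200)]

# Row split: hold out 20% of individual depth samples at random.
rng = random.Random(0)
by_row = rng.sample(range(len(rows)), k=len(rows) // 5)

# Well split: hold out 20% of the wells, with every one of their samples.
held_out_wells = {"W08", "W09"}
by_well = [i for i, (well, _) in enumerate(rows) if well in held_out_wells]

row_leak = adjacent_leak_fraction(rows, by_row)
well_leak = adjacent_leak_fraction(rows, by_well)
print(f"by row:  {row_leak:.0%} of validation samples have a neighbour in training")
print(f"by well: {well_leak:.0%} of validation samples have a neighbour in training")
```

Under these assumptions most row-split validation samples should find a neighbour in training, while the by-well split has none, because a held-out well contributes nothing to the training side.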

A competition that splits by well closes that door. Hold out whole wells, and every depth you score on belongs to a well whose other depths are also held out, so the model has never seen anything from that well at all. The wall now follows the real seam in the data. That is the discipline the FORCE and Xeek event standardises, and it is why its numbers can be compared across teams who never spoke to each other: they were all measured against the same honest wall.
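
A by-well partition can be sketched in a few lines of standard-library Python. This is one possible implementation under our own assumptions, not the organisers' code: `split_by_well`, the `well_of` accessor, the seed, and the synthetic well names are all hypothetical.

```python
import random

def split_by_well(rows, well_of, held_out_share, seed=0):
    """Hold out whole wells: every row of a held-out well goes to validation."""
    wells = sorted({well_of(r) for r in rows})   # sort for reproducibility
    rng = random.Random(seed)
    rng.shuffle(wells)                           # shuffle wells, never rows
    n_held = max(1, round(held_out_share * len(wells)))
    held = set(wells[:n_held])
    train = [r for r in rows if well_of(r) not in held]
    val = [r for r in rows if well_of(r) in held]
    return train, val, held

# Synthetic stand-in (assumption): 118 hypothetical wells, 50 depth samples each.
rows = [(f"W{w:03d}", k) for w in range(118) for k in range(50)]
train, val, held = split_by_well(rows, well_of=lambda r: r[0], held_out_share=0.2)

# The wall follows the well boundary: no well appears on both sides.
overlap = {r[0] for r in train} & {r[0] for r in val}
print(len(held), "wells held out;", len(overlap), "wells on both sides")
```

The only thing that gets shuffled is the list of well names, so no depth sample can land on the opposite side of the wall from the rest of its well.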

Why the honest number is the lower one

The uncomfortable consequence is that doing it right lowers your score. This is not a bug to be engineered away; it is the whole value of the exercise. The literature on grouped data is blunt about it. When observations cluster into groups, a random split leaks the within-group dependence and produces an over-optimistic error estimate, and the fix is to hold out whole groups so the evaluation reflects the harder task the model will actually face on genuinely new groups [2]. The well is the group. Bergmeir and Benitez made the same point for dependent sequences generally: a naive random-split cross-validation gives an error estimate that is too good, because the independence it assumes is not there [3]. A well log is a dependent sequence in depth, so both arguments land on it directly.

Read that way, the gap between a row-shuffled score and a well-partitioned score is not noise. It measures by how much the leaky number was lying. The competition's contribution is to fix the split so that gap is designed out, and every entrant reports the honest, lower number by construction. A team scoring itself in private gets no such gift. It has to impose the discipline on itself, and the strong temptation, especially when a demo is due, is to reach for the shuffle that reports the flattering number.
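
The gap can be reproduced on a toy example. The sketch below is illustrative only: the wells are synthetic, each given a hidden per-well offset on the target (an assumption standing in for well-specific effects), and the model is a plain 1-nearest-neighbour lookup chosen because it makes the near-twin effect easy to see. None of it is competition data or a competition result, and because it reports an error rather than a score, the flattering number here is the lower error.

```python
import random

def make_wells(n_wells=10, n_depths=100, seed=0):
    """Synthetic wells: three smooth curves per well plus a hidden per-well
    offset on the target. All values are made up for illustration."""
    rng = random.Random(seed)
    rows = []
    for w in range(n_wells):
        bias = rng.gauss(0, 1)                        # hidden well-specific effect
        curves = [rng.gauss(0, 1) for _ in range(3)]  # per-well baselines
        for _ in range(n_depths):
            # Small steps: adjacent depths are near-twins of each other.
            curves = [c + rng.gauss(0, 0.02) for c in curves]
            rows.append((w, tuple(curves), sum(curves) + bias))
    return rows

def one_nn_mae(train, val):
    """Mean absolute error of a 1-nearest-neighbour predictor in feature space."""
    total = 0.0
    for _, x, y in val:
        nearest = min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[1], x)))
        total += abs(nearest[2] - y)
    return total / len(val)

rows = make_wells()

# Row split: shuffle depth samples and hold out 20% of them.
shuffled = rows[:]
random.Random(1).shuffle(shuffled)
cut = len(shuffled) // 5
row_val, row_train = shuffled[:cut], shuffled[cut:]

# Well split: hold out two whole wells (20% of the wells).
held = {8, 9}
well_val = [r for r in rows if r[0] in held]
well_train = [r for r in rows if r[0] not in held]

row_mae = one_nn_mae(row_train, row_val)
well_mae = one_nn_mae(well_train, well_val)
trust_gap = well_mae - row_mae   # how much the leaky estimate flatters the model
print(f"row-split MAE {row_mae:.3f}, well-split MAE {well_mae:.3f}, gap {trust_gap:.3f}")
```

In this toy the row-split error is small because the nearest training sample is usually a neighbouring depth from the same well, offset included, while the well-split error has to absorb the difference between wells. The size of the gap is an artefact of the synthetic setup and says nothing about how large it would be on real logs.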

What the FORCE and Xeek 2020 well-log competition standardises is not a model but a measurement discipline: one public set of 118 Norwegian Sea wells with 22 electrical-measurement feature columns, split by well rather than by row. Lever A toggles the split rule. The by-row setting shuffles individual depth samples so adjacent depths from the same well straddle train and validation, and the reported score is inflated because the model has effectively seen each validation depth's near-twin. The by-well setting holds out whole wells, so a validation depth is genuinely unseen and the score is honest. Lever B drags the held-out share of the 118 wells. The teal curve is the honest by-well score and the faint dashed curve is the inflated by-row score; the only element that argues is the orange bracket, the trust gap a row split hides, which collapses to nothing the moment you partition by well. The 118 wells, the 22 columns, and our 136,771 TIF and 7,781 LAS archive counts are sourced from the archive; the two score curves are illustrative geometry that dramatises the leakage mechanism, not measured competition results.

The exhibit is that gap made visible. Drag the share of the 118 wells held out for validation, and toggle between the two split rules. The teal line is the honest by-well score and the faint dashed line is the inflated by-row score; the orange bracket between them is the trust gap a row split hides, and it collapses to nothing the instant you switch to by-well, because an honest wall has nothing to inflate. The two score curves in the figure are illustrative geometry, drawn to dramatise the mechanism rather than to report competition results, and they are flagged as such on the canvas. What is sourced is the scale: the 118 wells, the 22 feature columns, and, as the yardstick, our own archive.

Reading the competition against our own scale

Here is where the guide turns from reading someone else's benchmark to using it. Our raster-log work runs against an archive of 136,771 scanned TIF files and 7,781 paired LAS files. Against that, a public set of 118 wells is tiny, and it would be easy to conclude the competition has nothing to teach a project operating two and three orders of magnitude larger. That conclusion gets the value backwards. When your own data dwarfs any public set, the public set stops being useful as a source of training volume and becomes useful as a source of measurement discipline. You are not going to learn geology from 118 wells you did not log. You are going to learn how to score, from a set where the honest way to score has already been fixed and defended in public.

The transfer is concrete. Our archive is organised by well and by scan, and every honest instinct the competition encodes says the wall in our own evaluation has to follow the well boundary too. A digitised curve from one well is dependent along depth in exactly the way the FORCE and Xeek set is, so a random split of our depth samples would leak in exactly the same way and report exactly the same flattering lie, just at larger scale. The competition is small enough to reason about completely and public enough to argue about openly, which makes it the right place to settle the principle. Then we carry the settled principle, not the data, to the archive.
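
A cheap guard for an archive organised by well and by scan is to audit any proposed split for wells that appear on both sides. The sketch below assumes each record carries a `well_id` field; the field names, file names, and the `shared_wells` helper are hypothetical and are not a description of our actual archive schema.

```python
def shared_wells(train_records, val_records):
    """Well ids present on both sides of the wall (empty set means no overlap)."""
    return {r["well_id"] for r in train_records} & {r["well_id"] for r in val_records}

# Hypothetical archive slice: two scans per well.
archive = [
    {"scan": "scan_0001.tif", "well_id": "W1"},
    {"scan": "scan_0002.tif", "well_id": "W1"},
    {"scan": "scan_0003.tif", "well_id": "W2"},
    {"scan": "scan_0004.tif", "well_id": "W2"},
    {"scan": "scan_0005.tif", "well_id": "W3"},
    {"scan": "scan_0006.tif", "well_id": "W3"},
]

# Splitting by scan cuts well W2 in half.
by_scan_train, by_scan_val = archive[:3], archive[3:]
leaky = shared_wells(by_scan_train, by_scan_val)

# Splitting by well keeps every scan of a well on one side.
by_well_val = [r for r in archive if r["well_id"] == "W3"]
by_well_train = [r for r in archive if r["well_id"] != "W3"]
clean = shared_wells(by_well_train, by_well_val)

print("by scan:", sorted(leaky), "| by well:", sorted(clean))
```

A check like this could run before any score is computed, so a leaky split fails loudly instead of reporting a flattering number.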

What borrowing the discipline actually looks like

Borrowing it is less about tooling and more about refusing an easy number. In practice it means three habits, all of which the competition models. Partition by well, never by row, so a held-out well is genuinely unseen and the score reflects the real task of reading a well the model has not encountered. Fix the evaluation set and the metric before you start tuning, the way the organisers fixed theirs, so you cannot quietly drift the target toward whatever your current model happens to be good at. And report the honest, lower number even when a shuffled split would look better on a slide, because the lower number is the one that survives contact with a new well.
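
One way to keep the evaluation set fixed before tuning starts is to derive each well's side of the wall from a hash of its identifier, so the assignment cannot drift between experiments. This is a sketch under our own assumptions, not how the organisers built their split: `is_held_out`, the salt string, the 20% share, and the well names are all hypothetical.

```python
import hashlib

def is_held_out(well_id, share=0.2, salt="eval-v1"):
    """Deterministically assign a well to the evaluation set.

    The decision depends only on the well id and the salt, so it is the
    same on every run and does not change when other wells are added."""
    digest = hashlib.sha256(f"{salt}:{well_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < share

# Hypothetical well names, sized like the public set.
wells = [f"W{w:03d}" for w in range(118)]
held = {w for w in wells if is_held_out(w)}

# Adding wells later leaves every earlier assignment untouched.
more_wells = wells + [f"X{w:03d}" for w in range(50)]
held_later = {w for w in more_wells if is_held_out(w)}

print(len(held), "of", len(wells), "wells held out")
print("earlier assignments unchanged:", held == {w for w in held_later if w in wells})
```

The held-out share is only approximately 20% under this scheme, which is the price of never having to store or reshuffle a split file. Changing the salt defines a new evaluation set, so it should be treated as a versioned decision rather than a tuning knob.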

None of these is exotic, and that is the point of reading a competition this way. The FORCE and Xeek event did not invent a scoring rule so much as commit, in public, to the one that leakage-aware evaluation has always demanded [1][2][3]. Its 118 wells and 22 columns are a small, legible instance of a discipline that scales to any archive, including one the size of ours. The competition's real export is not a dataset. It is a defensible answer to the question every digitiser eventually has to answer for itself: when you say your model scores this well, what exactly was it allowed to see, and are you sure?

What the competition standardises

The part of a public well-log competition worth reading is the split rule, not the leaderboard. FORCE and Xeek 2020 fix one public set of 118 Norwegian Sea wells with 22 electrical-measurement feature columns and split it by well, which is the whole measurement discipline in one decision.

Splitting by row leaks. Depth samples inside a well are dependent, so a random shuffle puts a validation depth's near-twin into training, and the model is scored on data it has effectively already seen. Splitting by well holds out whole wells, so the wall follows the real seam in the data.

The honest, well-partitioned score is the lower one, and the gap between it and the leaky row-split score measures how much the leaky number was overstating. The competition designs that gap out by construction; a team scoring itself in private has to impose the same wall on itself.

At our archive scale of 136,771 TIF and 7,781 paired LAS files, the value of a 118-well public set is not training volume but measurement discipline. You borrow the leakage-safe partition, not the data.

Borrowing it means three habits the event models: partition by well not by row, fix the evaluation set and metric before tuning, and report the honest lower number even when a shuffle would look better.

Limitations

This is a reading of one competition's scoring discipline, not a study of its results, and it should be held to that scope. The 118 wells, the 22 feature columns, and our 136,771 TIF and 7,781 LAS archive counts are sourced; the two score curves in the instrument are illustrative geometry chosen to show the leakage mechanism cleanly, not measured leaderboard outcomes, and the size of the gap in the figure is a drawn example rather than a reported result. The by-well partition removes the specific leakage that comes from depth-adjacency within a well, and nothing more: it does not fix a validation set that fails to represent the field's real variety, it does not account for the possibility that whole regions or tool vintages are absent from the 118 wells, and it says nothing about label quality. A leakage-safe split can still sit on top of an unrepresentative sample and report an honest number about the wrong distribution. The transfer to our archive is also an argument, not a measurement: we hold that the same dependence structure applies because a digitised curve is a depth sequence like the competition's logs, but the exact size of the leakage a row split would introduce on our data is not something this guide quantifies. And the discipline governs only how you measure, not whether the thing you measured is good enough to ship, which remains a separate judgement about whether an honestly scored curve is actually usable downstream.

The number that survives a new well

The habit this leaves us with is to distrust any score whose split we cannot name. A public well-log competition is worth reading precisely because it names its split out loud and defends it, and the split it chose is the one that makes the reported number a claim about new wells rather than about wells the model already half-knew. Copy that, and a private evaluation stops being a mirror the model poses in front of and becomes the closest thing we have to the field: a set of wells the model has not seen, scored under a wall that information cannot walk through, reported at the honest, lower value that is the only one worth trusting.

References

[1] Kaufman, S., Rosset, S., Perlich, C., and Stitelman, O. Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data 6, 4 (2012), article 15. The formal account of how information from outside the training data inflates measured performance, and why the remedy is a partition that respects the data's structure. https://dl.acm.org/doi/10.1145/2382577.2382579

[2] Roberts, D. R., Bahn, V., Ciuti, S., et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40, 8 (2017), pp. 913-929. Why grouped observations require holding out whole groups, because within-group samples are dependent and a random split leaks that dependence. https://onlinelibrary.wiley.com/doi/10.1111/ecog.02881

[3] Bergmeir, C., and Benitez, J. M. On the use of cross-validation for time series predictor evaluation. Information Sciences 191 (2012), pp. 192-213. That a naive random-split evaluation of dependent samples gives an over-optimistic error estimate, and the scheme has to respect the dependence to be trustworthy. https://www.sciencedirect.com/science/article/abs/pii/S0020025511006773

[4] McDonald, A. Using the missingno Python Library to Identify and Visualise Missing Data Prior to Machine Learning. Towards Data Science (2021). A worked tutorial on the Xeek/FORCE 2020 set of 118 Norwegian Sea wells and its electrical-measurement feature columns, the concrete public set this guide reads. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009
