Self-Supervised and Foundation-Model Approaches for Subsurface Signals: An Assessment

Abstract

The subsurface is not one data type, and the self-supervised methods that reshaped machine learning do not meet it as one. A seismic volume is an image; a scanned report is a document; a well log is neither. It is a set of one-dimensional, depth-indexed measurement traces, sampled at a fixed depth step down a borehole, and its defining property for a pretraining objective is not its resolution but its holes. Curves stop and start at tool changes; a bad-hole interval washes out; a tool was simply not run over a zone. This assessment asks a single question of the period-correct (2023-Q4) self-supervised toolkit: does the pretext task a given objective optimises actually fit a log curve, rather than the document or photograph these methods were designed on? We read masked-reconstruction and contrastive objectives against the structure a real log imposes (two curves in Track 3, NPHI and RHOB, and three in Tracks 1 and 2) and against a public pretraining corpus at the scale a subsurface team would actually assemble: 118 Norwegian-Sea wells across 22 electrical-measurement columns from the Xeek/FORCE 2020 set [7]. The finding is that transfer is uneven and that the unevenness is a property of the pretext, not of the parameter count. A masked-reconstruction objective that hides random spans of a trace and asks a network to rebuild them cannot, on real data, separate an intended training mask from an interval that was never logged, so its benefit falls off fastest precisely where logs are hardest. Contrastive and denoising objectives that key on depth-local structure degrade more gently. The corpus and signal figures here are sourced; the fitness ratings are a reasoned illustration and are labelled as such on the exhibit.

Why a log is not a picture

It is worth being concrete about what changes when the downstream object is a trace rather than a raster. A photograph has two spatial axes with roughly homogeneous statistics and a dense, complete grid of pixels. A well log has one axis, depth, along which the statistics are strongly non-stationary, and along which the observation itself is frequently absent. The 22-column Xeek/FORCE 2020 tutorial set is the honest picture of this: it holds 118 wells, and the reason the reference tutorial reaches for a missing-data visualisation library at all is that no single well carries every one of those 22 measurements over its full depth [7]. Some columns are present in a handful of wells. Many are shot through with null intervals. This is not dirty data to be cleaned before the real work; it is the structure of the signal, and a pretraining objective either respects it or fights it.
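To make the structure of absence concrete, the following is a minimal sketch of how the holes in a single curve might be summarised. It assumes nulls are encoded as NaN on a regular depth grid; the function names and the ten-sample density curve are hypothetical and are not drawn from the Xeek/FORCE 2020 files.

```python
import numpy as np

def missing_fraction(curve):
    """Fraction of depth samples in a curve that are null (NaN)."""
    curve = np.asarray(curve, dtype=float)
    return float(np.isnan(curve).mean())

def hole_intervals(curve):
    """Return (start, stop) index pairs of contiguous null runs, stop exclusive."""
    is_null = np.isnan(np.asarray(curve, dtype=float))
    # Pad with False so every run has a detectable start and end edge.
    padded = np.concatenate(([False], is_null, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), stops.tolist()))

# Hypothetical density curve: one washed-out interval, then a zone at the
# bottom where the tool was not run.
rhob = np.array([2.31, 2.35, np.nan, np.nan, 2.40,
                 2.42, 2.38, np.nan, np.nan, np.nan])
frac = missing_fraction(rhob)    # 0.5
holes = hole_intervals(rhob)     # [(2, 4), (7, 10)]
```

Run per column and per well, the interval list rather than the bare fraction is what a pretext has to contend with: the same missing fraction can be one long unlogged zone or many short washouts.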

The second difference is that the useful downstream task on a log is usually regression, not classification or a per-pixel mask. A petrophysicist wants a value, a bulk density or a neutron porosity at a depth, with a defensible error. That is what our own applied stack targets when it reconstructs the two Track 3 curves, NPHI and RHOB, and the three curves in Tracks 1 and 2. A representation that is excellent at telling two log images apart, or at reconstructing the look of a curve, has not discharged its debt until it carries into a curve-value regression with a usable residual. Transfer to a log is a downstream-regression claim, and it has to be measured on the value, not asserted from a low pretext loss.
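One way to measure on the value is to score predictions only at depths where a measurement exists. The sketch below assumes NaN marks an unlogged sample; `observed_rmse` and the example numbers are hypothetical, and RMSE is only one of several residual summaries a petrophysicist might prefer.

```python
import numpy as np

def observed_rmse(predicted, measured):
    """RMSE over depth samples where a measured value exists (non-NaN)."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    observed = ~np.isnan(measured)
    if not observed.any():
        raise ValueError("no observed samples to score against")
    residual = predicted[observed] - measured[observed]
    return float(np.sqrt(np.mean(residual ** 2)))

# Hypothetical values: five depths, two of which were never logged.
measured = np.array([2.30, np.nan, 2.40, 2.50, np.nan])
predicted = np.array([2.40, 2.10, 2.40, 2.40, 2.90])
score = observed_rmse(predicted, measured)
```

Predictions inside the unlogged depths do not enter the score at all, so a model cannot look better or worse for what it writes into a hole.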

The two objective families, read against a trace

The period's self-supervised methods fall, for this purpose, into two families that treat the input very differently. The first is masked reconstruction. Its lineage runs from the masked-token pretext that BERT made canonical [1] to the masked-image autoencoder that carried the same recipe into vision by hiding a large fraction of the input and rebuilding it [2], and on to the time-series forms that mask spans or whole patches of a 1D series and ask the network to fill them in [3] [6]. The pretext is generative: hide part of the signal, predict it. The second family is contrastive. Rather than reconstruct values, it learns a representation in which two views of the same signal are close and views of different signals are far, keying on the signal's own local context to define what counts as the same. TS2Vec builds a hierarchy of such contrasts across the scales of a series [4], and temporal neighborhood coding defines its positives by a signal's own neighbourhood so that the objective survives the non-stationarity that a log has in abundance [5].
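The contrastive idea can be sketched with a generic InfoNCE loss for a single anchor. This is an illustration of the family, not the TS2Vec or temporal-neighborhood-coding objective as published; the function name, the temperature of 0.1 and the two-dimensional embeddings are assumptions. For a log, the positive would be the embedding of a window close in depth to the anchor window and the negatives would come from distant depths or other wells.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding; vectors are L2-normalised here."""
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a, p, n = unit(anchor), unit(positive), unit(negatives)
    pos_logit = a @ p / temperature          # similarity to the positive view
    neg_logits = n @ a / temperature         # similarity to each negative
    logits = np.concatenate(([pos_logit], neg_logits))
    # Numerically stable log-softmax; the positive sits at index 0.
    logits = logits - logits.max()
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

anchor = [1.0, 0.0]
# Positive embedding close to the anchor: low loss.
loss_near = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive embedding far from the anchor, with a close negative: high loss.
loss_far = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Nothing in this loss asks for the value of a sample, which is why a hole affects it only through which windows are available to embed.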

Read against a complete, well-behaved trace, both families work, and masked reconstruction often works best, because rebuilding a smooth curve from context is a task a transformer does well. Read against a real log, the ordering is not so clean, and the reason is specific rather than general. A masked-reconstruction pretext manufactures its own supervision by choosing which samples to hide. On complete data that is free supervision. On a log where a quarter of the trace is genuinely absent, the objective faces two kinds of gap that look identical to it: the span it chose to mask, which has a ground-truth value it is being trained to recover, and the span that was never logged, which has no ground-truth value at all. Nothing in the raw objective distinguishes them, so a naive implementation either learns to reconstruct nulls as if they were signal or wastes capacity modelling the pattern of absence. The methods built for time series confront this directly, which is why the multivariate framework frames its pretext as denoising of masked values with an explicit accounting for which positions are observed [3]. That accounting is the single most important design detail to get right when the family is applied to logs.
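The distinction between the two kinds of gap can be sketched as follows. This is a minimal illustration, assuming NaN marks never-logged samples and uniform random masking; it is not the masking scheme of [3], and the function names, the 0.5 mask ratio and the zero fill are hypothetical choices.

```python
import numpy as np

def make_training_mask(observed, mask_ratio, rng):
    """Hide a random subset of observed samples; never-logged samples are not eligible."""
    observed_idx = np.flatnonzero(observed)
    n_hide = int(round(mask_ratio * observed_idx.size))
    hidden = np.zeros(observed.shape, dtype=bool)
    hidden[rng.choice(observed_idx, size=n_hide, replace=False)] = True
    return hidden

def masked_mse(prediction, target, hidden):
    """MSE over the deliberately hidden samples, all of which have ground truth."""
    if not hidden.any():
        return 0.0
    return float(np.mean((prediction[hidden] - target[hidden]) ** 2))

rng = np.random.default_rng(0)
trace = np.array([0.21, 0.23, np.nan, np.nan, 0.25, 0.24, 0.22, 0.20])
observed = ~np.isnan(trace)
hidden = make_training_mask(observed, mask_ratio=0.5, rng=rng)

# Network input: hidden and never-logged samples are both zero-filled, and a
# visibility flag goes in as a second channel so that a fill value is not
# mistaken for a measured zero.
visible = observed & ~hidden
model_input = np.stack([np.where(visible, trace, 0.0), visible.astype(float)])

# A stand-in prediction of all zeros; the loss touches only hidden samples.
loss = masked_mse(np.zeros_like(trace), trace, hidden)
```

The separation lives in the loss rather than in the input: both kinds of gap are invisible to the network, but only the spans it was made to forget contribute a gradient.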

Exhibit: Fitness of period-correct (2023-Q4) self-supervised objectives for a well log, a 1D, depth-indexed measurement trace rather than a document image or a natural photograph. Left panel: a grid rating four objective families (masked reconstruction, contrastive, autoregressive, denoising) against four fitness criteria a log trace imposes (respecting 1D depth order, tolerating holes in the trace, generalising across the many measurement channels, and transferring into a curve-value regression rather than only a reconstruction loss). Right panel, with the missing-data lever below it: effective fit against the fraction of the trace that is missing. Masked reconstruction, the one orange element, loses fit fastest, while contrastive and denoising objectives degrade more gently. The corpus scale (118 Norwegian-Sea wells, 22 log-feature columns) and the target-signal structure (2 curves in Track 3, 3 in Tracks 1-2) are sourced from the engagement archive (Xeek/FORCE 2020, McDonald); every fit rating and the missingness-to-fit response are illustrative and are flagged as such on the canvas.

The exhibit sets the assessment out as a grid and one lever. Each objective family is rated against four criteria a log trace imposes: whether it respects one-dimensional depth order, whether it tolerates holes in the trace, whether it generalises across the many measurement channels a log carries, and whether it transfers into a curve-value regression rather than only a reconstruction loss. The lever is the argument. As the fraction of the trace that is missing rises, the masked-reconstruction row, the only row drawn in orange, loses effective fit fastest, for exactly the reason above; the contrastive and denoising rows, which lean on depth-local context that survives a hole, fall away more slowly. The corpus and signal counts on the canvas are sourced from the Xeek/FORCE 2020 set and our own log structure [7]; the fitness ratings and the missingness response are a reasoned illustration and are flagged as such, because our archive does not hold a controlled pretext ablation on this corpus and we will not imply one it does not contain.

Where the foundation-model framing helps, and where it does not

The term foundation model carries an implicit promise: pretrain once at scale, adapt everywhere cheaply. For subsurface signals the promise has to be read against two facts. The first is scale. The natural-image and language foundation models draw their strength from corpora many orders of magnitude larger than 118 wells, and a subsurface team assembling a signal corpus is not going to reach that scale from any single basin. A foundation-model recipe whose entire advantage is pretraining scale is therefore weakest exactly where our regime lives. The second is transfer surface. A generic representation earns its keep when many downstream tasks share it. On logs, the downstream tasks do share a great deal, because the same handful of curves recur across wells and basins, so a representation learned on the 22 available columns has real reuse across the 118 wells and beyond. That is the honest case for pretraining here, and it is a case for a modest, domain-specific pretrained encoder rather than for importing a photograph-scale model whose inductive biases were tuned on a different object.

This is also where the patch-based time-series work becomes relevant rather than decorative. Treating a 1D series as a sequence of patches, and pretraining by masking whole patches, gives the masked-reconstruction family a unit of prediction that matches how a log actually varies, in intervals with shared character, rather than sample by sample [6]. A patch that falls entirely inside a logged interval is a clean reconstruction target; a patch that straddles a hole can be excluded rather than hallucinated. Patch-level masking does not remove the missing-data problem, but it gives an implementer a natural place to handle it, which sample-level masking does not.
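The exclusion rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the pretraining code of [6]: patches are non-overlapping, a patch is eligible only if every sample in it was logged, and the function names, patch length of 4 and mask ratio of 0.5 are hypothetical.

```python
import numpy as np

def eligible_patches(observed, patch_len):
    """One flag per non-overlapping patch: True when every sample was logged."""
    n_patches = observed.size // patch_len           # drop the ragged tail
    patches = observed[: n_patches * patch_len].reshape(n_patches, patch_len)
    return patches.all(axis=1)

def sample_patch_mask(observed, patch_len, mask_ratio, rng):
    """Choose patches to hide from the fully observed ones only."""
    eligible = np.flatnonzero(eligible_patches(observed, patch_len))
    n_hide = int(round(mask_ratio * eligible.size))
    return np.sort(rng.choice(eligible, size=n_hide, replace=False))

# Hypothetical 22-sample trace with one hole spanning samples 6 to 9.
observed = np.array([1, 1, 1, 1,  1, 1, 0, 0,  0, 0, 1, 1,
                     1, 1, 1, 1,  1, 1, 1, 1,  1, 1], dtype=bool)
flags = eligible_patches(observed, patch_len=4)
hidden_patches = sample_patch_mask(observed, 4, 0.5, np.random.default_rng(0))
```

Here patches 1 and 2 straddle the hole and are never chosen as reconstruction targets, so the network is not asked to rebuild values that do not exist. A softer rule, such as allowing a patch with a small fraction of nulls, is an equally plausible design choice.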

What we would actually reach for

Read as a decision rather than a survey, the assessment points somewhere specific for a signals stack. If the corpus is largely complete and the downstream task is value reconstruction, a masked-reconstruction pretext, ideally patch-level with explicit observed-position accounting [3] [6], is the strong first choice, because rebuilding a curve from context is precisely what it is good at. As the missing-data prevalence rises, and on the Xeek/FORCE 2020 set it rises far enough that a visualisation tool is the tutorial's opening move [7], the case tilts toward a contrastive objective that keys on the signal's own neighbourhood and does not manufacture supervision from spans that may or may not exist [4] [5]. The two are not exclusive. A defensible stack pretrains with a denoising masked objective that is honest about which positions are observed, and adds a contrastive term so the representation is not wholly dependent on the reconstruction target surviving the holes. What we would not do is assume that a large photograph-pretrained or document-pretrained foundation model transfers to a depth-indexed trace on the strength of its size, because the object it was built for is not this object, and the property that breaks the transfer, the holes, is the one a log has the most of.
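Written out, the combined objective might take the following form. This is a sketch of the shape of the loss, not a tuned recipe: $O$ is the set of depth samples that were actually logged, $M$ the subset deliberately hidden during pretraining, $x_t$ the measured value, $\hat{x}_t$ the reconstruction, $\mathcal{L}_{\mathrm{con}}$ any contrastive term from the second family, and $\lambda \ge 0$ a weighting that is an assumption of this illustration.

```latex
\mathcal{L}
  \;=\; \frac{1}{|M|} \sum_{t \in M} \left( \hat{x}_t - x_t \right)^2
  \;+\; \lambda \, \mathcal{L}_{\mathrm{con}},
  \qquad M \subseteq O
```

The constraint $M \subseteq O$ is the honesty condition: the reconstruction term is only ever evaluated where a ground-truth value exists, and the contrastive term keeps a training signal alive when $O$ is sparse.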

Limitations

This is an assessment of the published field applied to one signal type, and it inherits the limits of that framing. It does not re-implement or re-benchmark the methods it weighs; it reads their pretext tasks against the structure of a log and reasons about fit. The corpus and signal figures are the real ones, 118 Norwegian-Sea wells across 22 electrical-measurement columns from the Xeek/FORCE 2020 set, and two curves in Track 3 with three in Tracks 1 and 2 [7], but the fitness ratings in the exhibit and the missingness-to-fit response are a reasoned illustration, not a measured ablation, and they are labelled as such on the canvas. We did not run a controlled pretraining sweep on this corpus, so the assessment makes no numerical claim about how much a given pretext gains or loses at a given missing-data level; the ordering it argues, masked reconstruction strongest on complete traces and most fragile as holes accumulate, contrastive and denoising more robust to absence, is a prediction from how each objective manufactures its supervision, not a result we recorded. The scope is the period-correct (2023-Q4) objective families and stops at the close of that quarter, so later signal-specific foundation models are out of frame. A reader should take this as a map of which self-supervised objective fits a one-dimensional, hole-ridden log trace, and why, not as a substitute for running the pretext ablation on their own wells and their own target curve.

References

[1] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL (2019). The masked-token pretext that made masked reconstruction the default self-supervised objective. https://arxiv.org/abs/1810.04805

[2] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked Autoencoders Are Scalable Vision Learners. CVPR (2022). Masks a large fraction of the input and reconstructs it, the pretext this assessment tests against a 1D trace. https://arxiv.org/abs/2111.06377

[3] Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., and Eickhoff, C. A Transformer-based Framework for Multivariate Time Series Representation Learning. KDD (2021). A masked-value denoising pretext built directly for multivariate, unevenly observed time series. https://arxiv.org/abs/2010.02803

[4] Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., and Xu, B. TS2Vec: Towards Universal Representation of Time Series. AAAI (2022). Contrastive representation learning that keys on depth-local context rather than reconstructing raw values. https://arxiv.org/abs/2106.10466

[5] Tonekaboni, S., Eytan, D., and Goldenberg, A. Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding. ICLR (2021). A contrastive objective that defines positives by a signal's own neighbourhood, robust to non-stationarity. https://arxiv.org/abs/2106.00750

[6] Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. ICLR (2023). Patches a 1D series and pretrains by masking whole patches, the patch-masking variant this assessment weighs for logs. https://arxiv.org/abs/2211.14730

[7] McDonald, A. Using the missingno Python library to Identify and Visualise Missing Data Prior to Machine Learning. Towards Data Science (2021). The Xeek/FORCE 2020 well-log corpus, 118 Norwegian-Sea wells across 22 measurement columns, and the missing-data prevalence this assessment treats as the transfer obstacle. https://towardsdatascience.com/using-the-missingno-python-library-to-identify-and-visualise-missing-data-prior-to-machine-learning-34c8c5b5f009
