Basin-Targeted Seismic Foundation Models: Why Region-Specific Pretraining Wins

A family of masked-autoencoder foundation models pretrained on ~30 TB of Norwegian Continental Shelf seismic data outperforms generic vision models and a globally pretrained seismic baseline across four interpretation benchmarks — with the 2.5D multi-view variant emerging as the practical default.

By Tannistha Maiti · 9 min read

Foundation models trained on cat photos do not understand seismic amplitudes. A family of masked autoencoders pretrained on ~30 TB of Norwegian Continental Shelf data shows that basin-targeted pretraining — not global aggregation — is the strongest default for regional interpretation workflows.

Abstract

Automated seismic interpretation has inherited much of its model machinery from computer vision, but the physics that produces a seismic amplitude has little in common with the physics that produces a photographic pixel. That mismatch shows up as brittle transfer: state-of-the-art self-supervised vision encoders, dropped onto migrated post-stack volumes, underperform even modest seismic-specific baselines. This note summarises a family of foundation models pretrained exclusively on approximately 30 TB of 3D seismic from the Norwegian Continental Shelf (NCS), using masked autoencoders (MAE) [1] with Vision Transformer backbones in three tokenization variants — 2D, 2.5D multi-view, and 3D volumetric.

Across four geological interpretation benchmarks, basin-targeted pretraining consistently outperforms both generic vision models and a globally pretrained seismic baseline [2,3]. Gains are largest on amplitude-sensitive tasks such as flatspot mapping, where local acquisition and processing characteristics dominate the signal. The 2.5D multi-view formulation achieves the strongest average accuracy at a fraction of the compute of dense 3D tokenization, making it the practical default at repository scale. Learned embeddings also support interactive similarity search across full 3D cubes in seconds, opening a human-in-the-loop mapping mode with minimal labelling [5]. Pretrained weights are released openly.

Background

Seismic interpretation is a labour-bound bottleneck in subsurface workflows. A 3D survey arrives as a multi-terabyte amplitude cube; an interpreter spends weeks to months picking horizons, faults, and direct hydrocarbon indicators. The field has reached for deep learning to compress that cycle, but most production-grade architectures still trace their lineage to ImageNet-class natural-image backbones. The implicit assumption — that low-level visual primitives transfer cleanly from photographs to migrated amplitudes — is increasingly difficult to defend.

Seismic data is not an image. A pixel encodes reflected light intensity at a sensor; an amplitude sample encodes a band-limited estimate of an acoustic impedance contrast, modulated by acquisition geometry, processing flow, and overburden. The statistics differ as well: seismic amplitudes are signed and Gaussian-like, with spatial correlation structured along bedding rather than around object boundaries. Recent surveys of foundation models in seismic processing make the gap explicit: natural-image pretraining helps where edges and textures dominate, and fails where amplitude fidelity matters [4].

[Figure: masked autoencoder]

Two responses have emerged. The first is global seismic pretraining: aggregate everything available worldwide and train a single backbone [2,3]. The second, examined here, is basin-targeted pretraining: restrict the corpus to a coherent geological province and let the model specialise to its acquisition, processing, and stratigraphic conventions. The NCS — with decades of open public-domain surveys via the Norwegian Offshore Directorate disclosure regime — is an unusually good testbed for the second strategy.

Method

The pretraining corpus comprises approximately 30 TB of 3D post-stack migrated amplitude volumes from the NCS. Density-aware sampling is applied to mitigate the dominance of a small number of large overlapping surveys, but the corpus remains spatially imbalanced — dense across the North Sea, sparser in the Norwegian Sea, and thin in the Barents Sea. All variants share a Vision Transformer encoder–decoder architecture trained with the MAE objective of He et al. [1].
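The text does not specify how density-aware sampling is implemented. One common way to reduce the dominance of heavily surveyed areas is inverse-frequency weighting over coarse map cells; the sketch below illustrates that idea only. The function name `density_aware_weights`, the `cell_ids` labels, and the toy data are all hypothetical, not the authors' scheme.

```python
import numpy as np

def density_aware_weights(cell_ids: np.ndarray) -> np.ndarray:
    """Return sampling probabilities that down-weight crowded map cells.

    cell_ids[i] is a hypothetical integer label for the coarse map cell
    that candidate patch i falls in (an assumption for illustration).
    """
    # Count how many candidate patches share each cell.
    _, inverse, counts = np.unique(
        cell_ids, return_inverse=True, return_counts=True
    )
    # Inverse-frequency weight: each occupied cell gets equal total mass.
    weights = 1.0 / counts[inverse]
    return weights / weights.sum()

# Toy example: cell 0 is covered by many overlapping surveys,
# cells 1 and 2 by few.
cell_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
probs = density_aware_weights(cell_ids)

# Draw a small batch of candidate patches under the reweighted distribution.
rng = np.random.default_rng(0)
batch = rng.choice(len(cell_ids), size=4, replace=False, p=probs)
```

Under this weighting each occupied cell contributes the same total probability, so a single patch from a sparsely covered area is as likely to be drawn as the whole group of patches from an over-sampled one. Spatial imbalance between regions with many and few cells would still remain, consistent with the caveat above.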

Three tokenization regimes are compared. The 2D variant patchifies inline or crossline sections and treats each as an independent image. The 2.5D variant samples three orthogonal slices through a common centre voxel and concatenates their token sequences before encoding, giving the model a multi-view summary of local 3D structure at modest cost. The 3D variant patchifies a true volumetric block; due to memory and I/O constraints, it is trained with sparse pillar sampling rather than dense volumetric patches, which limits per-update batch diversity.

2.5D multi-view
  • Three orthogonal slices through a common voxel
  • Token sequences concatenated before encoding
  • Captures local 3D structure at 2D-like cost
  • Strongest average benchmark accuracy
3D volumetric
  • True volumetric patchification
  • Sparse pillar sampling due to memory/IO limits
  • Reduced per-update batch diversity
  • Dense regime remains an open question
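The 2.5D multi-view sampling can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a volume indexed as (inline, crossline, depth), a 32-sample window, and 8×8 patches; none of those values come from the source, and the helper names are hypothetical. The learned patch projection and any positional or view embeddings are omitted.

```python
import numpy as np

def patchify_2d(section: np.ndarray, patch: int) -> np.ndarray:
    """Split a 2D section into flattened non-overlapping patches."""
    h, w = section.shape
    assert h % patch == 0 and w % patch == 0
    tiles = section.reshape(h // patch, patch, w // patch, patch)
    # (rows, patch, cols, patch) -> (rows * cols, patch * patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def window(centre: int, size: int) -> slice:
    """Index window of length `size` centred on `centre`."""
    half = size // 2
    return slice(centre - half, centre + half)

def multiview_tokens(volume, centre, size, patch):
    """Concatenate tokens from three orthogonal slices through one voxel."""
    i, j, k = centre
    views = [
        volume[i, window(j, size), window(k, size)],  # inline section
        volume[window(i, size), j, window(k, size)],  # crossline section
        volume[window(i, size), window(j, size), k],  # depth/time slice
    ]
    return np.concatenate([patchify_2d(v, patch) for v in views], axis=0)

# Assumed axis order: (inline, crossline, depth). Random data as a stand-in.
volume = np.random.default_rng(0).standard_normal((64, 64, 64))
tokens = multiview_tokens(volume, centre=(32, 32, 32), size=32, patch=8)
print(tokens.shape)  # (48, 64)
```

Each view contributes 16 tokens here, so the encoder sees 48 tokens per sample. A dense 32×32×32 block cut into 8×8×8 patches would yield 64 tokens of 512 values each, which gives a rough sense of why the multi-view form is cheaper.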

An 85% masking ratio is applied to the flattened token sequence across all variants, higher than the 75% default reported in the original MAE work [1]. The reconstruction objective minimises mean squared error in pixel space over the masked patch positions:

MAE reconstruction loss over masked patch positions 𝓜:

$$
\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{p \in \mathcal{M}} \left\| \hat{x}_p - x_p \right\|_2^2
$$
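The masking step and the loss above translate directly into NumPy. The sketch below assumes tokens are already flattened patches of 64 values and uses a trivial stand-in for the decoder output; the function names and array sizes are illustrative assumptions, not the released implementation.

```python
import numpy as np

def random_mask(num_tokens: int, mask_ratio: float, rng) -> np.ndarray:
    """Boolean mask over tokens; True marks a masked (hidden) position."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask

def masked_mse(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Squared L2 error per patch, averaged over masked positions only."""
    per_patch = np.sum((pred - target) ** 2, axis=-1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
target = rng.standard_normal((200, 64))  # 200 tokens, 64 values per patch
mask = random_mask(len(target), mask_ratio=0.85, rng=rng)

# Stand-in "decoder": copies visible patches and predicts zeros elsewhere.
pred = np.where(mask[:, None], 0.0, target)
loss = masked_mse(pred, target, mask)
print(int(mask.sum()))  # 170 of 200 tokens are hidden
```

Only the masked positions contribute to the loss, so a model cannot lower it by copying visible patches through.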

Pretrained encoders are then evaluated on four downstream geological interpretation benchmarks spanning facies classification, fault identification, salt-body delineation, and flatspot mapping, with linear probing and light fine-tuning protocols. Baselines include a frozen DINOv2 natural-image encoder, a globally pretrained seismic foundation model [2], and a from-scratch supervised ViT.
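Linear probing means fitting only a linear classifier on top of frozen encoder features. The sketch below uses a closed-form ridge fit onto one-hot labels as a simple stand-in for the more usual softmax probe, with synthetic features in place of real encoder outputs. The function names, the `l2` value, and the data are assumptions for illustration only.

```python
import numpy as np

def add_bias(features: np.ndarray) -> np.ndarray:
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((len(features), 1))])

def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Fit a linear classifier on frozen features via ridge regression."""
    x = add_bias(features)
    y = np.eye(num_classes)[labels]  # one-hot targets
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def predict(weights, features):
    """Class with the highest linear score for each feature vector."""
    return np.argmax(add_bias(features) @ weights, axis=1)

# Synthetic stand-in for frozen encoder outputs: two well-separated classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
features = rng.standard_normal((400, 16)) + 4.0 * labels[:, None]

# The encoder is never updated; only the probe weights are fitted.
weights = fit_linear_probe(features[:300], labels[:300], num_classes=2)
accuracy = float(np.mean(predict(weights, features[300:]) == labels[300:]))
```

Because the encoder stays frozen, probe accuracy measures how linearly separable the pretrained representation already is. Light fine-tuning additionally updates encoder weights and is not shown here.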

Results

Three findings hold across the four benchmarks. First, natural-image foundation models do not transfer reliably to seismic — DINOv2 and comparable self-supervised vision encoders trail seismic-specific baselines on every task, with the gap widest on amplitude-driven problems. Second, seismic-domain pretraining is necessary but not sufficient: a globally aggregated seismic baseline beats all natural-image encoders on average, but is in turn outperformed by basin-targeted pretraining on the NCS test suite. Third, the 2.5D multi-view tokenization delivers the best average accuracy while using a fraction of the compute and memory of the 3D variant.

Pretraining regime, ranked by average benchmark accuracy

  1. Basin-targeted NCS pretraining
  2. Global seismic pretraining
  3. Natural-image self-supervised
  4. From-scratch supervised ViT

Gains are largest on flatspot mapping — a direct hydrocarbon indicator task that depends on accurate preservation of local amplitude relationships. This is consistent with the hypothesis that basin-targeted pretraining absorbs region-specific acquisition and processing signatures that a globally pooled corpus averages out. Facies classification and fault identification show smaller but consistent improvements; salt-body delineation, where geometry dominates over amplitude, shows the narrowest margins.

Beyond benchmark scores, the learned embeddings prove useful as a retrieval substrate. Cosine similarity search over indexed embeddings returns geologically analogous patches across full 3D cubes in seconds, supporting an interactive mapping mode in which an interpreter labels a handful of exemplars and propagates them via nearest-neighbour lookup. This is the practical mechanism by which a foundation model becomes a productivity layer rather than a one-shot classifier [5].

Interactive similarity search loop

  1. Interpreter labels exemplar: a handful of patches of interest
  2. Encode to embedding: frozen basin-targeted ViT encoder
  3. Cosine search over cube: seconds across full 3D volumes
  4. Propagate + refine: human-in-the-loop geological map
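The cosine search step of this loop can be sketched with a brute-force NumPy lookup. This assumes patch embeddings have already been computed and stored as rows of an array; the sizes, the random data, and the function names are hypothetical. A production index over full cubes would more likely use an approximate nearest-neighbour library, which is not shown.

```python
import numpy as np

def normalise(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_top_k(index: np.ndarray, query: np.ndarray, k: int):
    """Indices and scores of the k index rows most similar to the query."""
    scores = normalise(index) @ normalise(query)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Stand-in for precomputed patch embeddings (10,000 patches, 128 dims).
rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 128))

# Stand-in for an interpreter's exemplar: a lightly perturbed copy of row 1234.
query = index[1234] + 0.01 * rng.standard_normal(128)

top, scores = cosine_top_k(index, query, k=5)
# top[0] is the perturbed source row; the rest are its nearest neighbours.
```

With several labelled exemplars, one simple option is to average their normalised embeddings into a single query and then let the interpreter accept or reject the returned patches.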

Discussion

The headline result — that basin-targeted pretraining beats global aggregation — runs against the prevailing instinct in foundation-model work, which assumes more data is always better. For seismic, more data is better only when it shares the acquisition and processing conventions of the target province. A backbone pretrained across mixed vintages, contractors, and processing flows learns an averaged prior that is robust but blunt; a backbone pretrained on a single basin learns a sharper prior that pays off precisely where amplitude fidelity matters.

That has practical implications for operators. The default architecture for a regional interpretation workflow should be a basin-specialised encoder, not a globally pooled one. For data-rich provinces with public disclosure regimes — the NCS, the UK Continental Shelf, parts of the US Gulf — the corpus already exists. For data-poor regions, the case for federated or transfer-from-analogue strategies becomes the open research question.

Operator takeaway

For regional workflows in data-rich basins, a basin-specialised encoder should be the default. Globally pooled seismic backbones are a useful fallback when local pretraining data is unavailable — not the other way around.

Limitations are real and worth stating plainly. The corpus is geographically biased toward the North Sea. Evaluation is restricted to migrated post-stack amplitude volumes, leaving angle stacks and well-log integration to future work. Ground-truth labels embed interpreter subjectivity, which caps achievable scores and may mask small inter-model differences. The 3D variant was trained with sparse pillar sampling, so whether a fully dense 3D regime would close the gap with 2.5D is unresolved. No quantitative scaling law is established: accuracy rises monotonically from natural-image to global seismic to basin-targeted pretraining, but the models may not yet be in a saturation regime, and the relative contribution of corpus size versus corpus diversity remains open.

A final methodological caveat. Masked pixel reconstruction biases the encoder toward local texture statistics rather than higher-level structural or stratigraphic abstractions. Self-distillation and latent-prediction objectives — which optimise in representation space rather than pixel space — may be better matched to the relational signals that interpretation tasks ultimately demand. The right next step is not a bigger MAE; it is a different objective.

By the numbers

The quantitative spine of this work, in one frame.

  • ~30 TB: NCS 3D seismic pretraining corpus
  • 85%: MAE token masking ratio
  • 3: Tokenization variants (2D, 2.5D, 3D)
  • 4: Geological interpretation benchmarks
  • 1st: Basin-targeted vs. global + natural-image baselines

References

[1] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). Masked autoencoders are scalable vision learners. IEEE/CVF CVPR, 15979–15988.

[2] Sheng, H., Wu, X., Si, X., Li, J., Zhang, S., and Duan, X. (2025). Seismic foundation model: A next generation deep-learning model in geophysics. GEOPHYSICS, 90(2), IM59–IM79.

[3] Sansal, A., Lasscock, B., and Valenciano, A. (2025). Scaling seismic foundation models. First Break, 43, 69–74.

[4] Fuchs, F., Fernandez, M.R., Ettrich, N., and Keuper, J. (2025). Foundation models for seismic data processing: An extensive review. arXiv:2503.24166. https://arxiv.org/abs/2503.24166

[5] Waldeland, T.J., Forgaard, L., Ordonez, A., Wade, D., and Bugge, A.J. (2025). Interactive injectite mapping with minimal training data using self-supervised learning. 86th EAGE Conference, Extended Abstracts.

EarthScan
Continuous AI for explorers

info@earthscan.io

© 2026 Copyright. Earthscan