We thought we were finished the afternoon the vectorisation step returned a clean curve. The raster mask had become two columns of numbers, depth and value, sitting in memory exactly the way a curve should sit. Then a petrophysicist on the operator's team asked, reasonably, where the file was. Not the array. The file. The thing that opens in the software they had been driving since before any of us learned to spell convolution. And we did not have one, because nobody on the modelling side had written the program that turns a correct array into a correct file. That program turned out to be the entire deliverable, and this is the account of writing it.
The prediction was right and still unusable
By late 2023 our raster-log digitization system, VeerNet, did the hard part well. A scanned well log went in as a tall grayscale image and a vectorised curve came out: a sequence of depth-value pairs recovered from the predicted mask. On the multiclass logs that meant two curves per image, neutron porosity and bulk density, the NPHI and RHOB pair the operator cared most about. The numbers were good. The lowest mean absolute error we reached against the ground truth was 0.0132, close enough that a geoscientist reading the recovered curve against the original would trust it.
None of that mattered to the software on the petrophysicist's desk, because that software does not accept arrays. It accepts a Log ASCII Standard file, the plain text format the industry uses for digital logs, whose version 2.0 dates from 1992 [1]. A LAS file is not a CSV with a friendly extension. It has a defined grammar: a sequence of named sections, each introduced by a tilde, written in a fixed order, with rules about what goes in each one. Our array satisfied none of that grammar. We had built a model that was right and an output that was unreadable, and the gap between the two was a file writer nobody had budgeted for.
What a loader actually parses
The petrophysical tool never sees our model. It sees a text file, and it reads that file top to bottom expecting a Version section, then a Well section, then a Curve section, then an ASCII data block. Get the order or the grammar wrong and it stops at the offending line. The prediction quality is invisible until the file parses.
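A minimal sketch of that top-to-bottom read, in Python. It is not the operator's loader, only an illustration of the order check. The helper names and the sample values are hypothetical, and optional sections such as a parameter block are simply ignored here.

```python
CANONICAL = ["V", "W", "C", "A"]  # Version, Well, Curve, ASCII


def section_order(las_text):
    """Return the letter after each tilde, in the order the sections appear."""
    letters = []
    for line in las_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("~") and len(stripped) > 1:
            letters.append(stripped[1].upper())
    return letters


def has_canonical_order(las_text):
    """True when V, W, C, A each appear once and in that order.

    Any other section letters are skipped in this sketch.
    """
    required = [s for s in section_order(las_text) if s in CANONICAL]
    return required == CANONICAL


# Illustrative file content; every number below is made up.
SAMPLE = """~Version Information
 VERS.   2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
 WRAP.    NO : ONE LINE PER DEPTH STEP
~Well Information
 STRT.FT  1000.0 : START DEPTH
 STOP.FT  1001.0 : STOP DEPTH
 STEP.FT     0.5 : STEP
 NULL.   -999.25 : NULL VALUE
~Curve Information
 DEPT.FT   : DEPTH
 NPHI.V/V  : NEUTRON POROSITY
 RHOB.G/CC : BULK DENSITY
~ASCII
 1000.0  0.250  2.450
 1000.5  0.260  2.440
 1001.0  0.255  2.460
"""

print(section_order(SAMPLE))
print(has_canonical_order(SAMPLE))
```

The check says nothing about the values in the data block, which is the point: a file can fail here however good the curves inside it are.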
What we committed to build, and the rules it had to obey
The task was bounded and unforgiving: emit, for a single digitized log, a LAS file that the operator's own software would open without complaint and that carried our recovered curves on the right depth index. We had a strong reference for what right looked like, because the public Texas Railroad Commission dataset we trained against ships 7,781 LAS files alongside its 136,771 scanned TIF rasters [4]. Those LAS files were our specification by example. Whatever we wrote had to be indistinguishable, to a parser, from one of them.
Four rules governed the writer. The sections had to appear in the canonical order, Version then Well then Curve then ASCII. Every channel we wrote into the data block had to be declared, by mnemonic and unit, in the Curve section above it. The depth column had to be the first column and had to increase strictly from row to row. And anywhere we lacked a value we had to write the null sentinel the file declares in its own Well section, not a blank and not a zero, because a zero is a real porosity and a blank breaks the column count. Three of those four rules are about consistency between sections of the same file. The model has nothing to say about any of them.
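The null-sentinel rule is the easiest of the four to show in isolation. The sketch below assumes a sentinel of -999.25 and a four-decimal column format; both are illustrative choices, and `format_row` and `is_missing` are hypothetical helper names.

```python
import math

NULL_VALUE = -999.25  # assumed; must equal the NULL entry declared in the Well section


def is_missing(value):
    """Treat None and NaN as 'no measurement recovered at this depth'."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def format_row(depth, values, null=NULL_VALUE):
    """Format one ASCII-block row, writing the sentinel wherever a value is missing."""
    cells = [depth] + [null if is_missing(v) else v for v in values]
    return " ".join(f"{c:.4f}" for c in cells)


# A gap in NPHI becomes the sentinel, so the row keeps its three columns.
row = format_row(1000.5, [None, 2.44])

# Writing a blank instead loses a column once the row is split on whitespace.
blank_row = " ".join(["1000.5000", "", "2.4400"])

print(row, len(row.split()))
print(blank_row, len(blank_row.split()))
```

A genuine zero passes through untouched, which is why zero cannot double as the missing marker.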
Where the writer earned its keep, section by section
We did not hand-roll the byte layout. The Python community had already solved the grammar in lasio, the library that reads and writes LAS files to the letter of the standard, and leaning on it meant we inherited a writer that other people had already validated against thousands of real logs [2]. Our job was to populate it correctly, and the populating is where the subtlety lived.
The Version section was the only trivial one: declare LAS 2.0, declare that lines do not wrap. The Well section is where the file's own contract is written: the start depth, the stop depth, the depth step, and the null value, which we set to the conventional negative sentinel (-999.25 is the usual choice) so that any depth our model could not recover would read as missing rather than as a real measurement. The Curve section is the part the modelling instinct underrates, because it is pure bookkeeping: one entry per channel, in the exact order the columns will appear, each with a mnemonic and a unit. DEPT in feet first, then NPHI in volume per volume, then RHOB in grams per cubic centimetre. If that list and the data block ever disagree on count or order, the file is wrong in a way that produces plausible-looking nonsense rather than an error, which is the most dangerous failure of all.
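To make the bookkeeping concrete, here is a plain-Python sketch that lays out the three header sections as text. It is a stand-in for illustration only, since the real writer delegated the layout to lasio. The function name, the unit strings (FT, V/V, G/CC), the depths and the sentinel are all assumptions, and the sketch omits the well-identification entries, such as COMP and WELL, that a complete file carries.

```python
# Channels in the exact order the columns will appear: (mnemonic, unit, description).
CHANNELS = [
    ("DEPT", "FT", "DEPTH"),
    ("NPHI", "V/V", "NEUTRON POROSITY"),
    ("RHOB", "G/CC", "BULK DENSITY"),
]


def build_header(channels, start, stop, step, null=-999.25):
    """Return the Version, Well and Curve sections as LAS 2.0 style text."""
    if not channels or channels[0][0] != "DEPT":
        raise ValueError("the depth channel must be declared first")
    depth_unit = channels[0][1]
    lines = [
        "~Version Information",
        " VERS. 2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0",
        " WRAP. NO : ONE LINE PER DEPTH STEP",
        "~Well Information",
        f" STRT.{depth_unit} {start:.4f} : START DEPTH",
        f" STOP.{depth_unit} {stop:.4f} : STOP DEPTH",
        f" STEP.{depth_unit} {step:.4f} : STEP",
        f" NULL. {null} : NULL VALUE",
        "~Curve Information",
    ]
    # One Curve entry per channel, in column order.
    for mnemonic, unit, description in channels:
        lines.append(f" {mnemonic}.{unit} : {description}")
    return "\n".join(lines) + "\n"


# 300 rows at a 0.5 ft step starting from 1000 ft end at 1149.5 ft.
header = build_header(CHANNELS, start=1000.0, stop=1149.5, step=0.5)
print(header)
```

Driving both the Curve section and the data block from the same `CHANNELS` list is one way to keep their count and order from drifting apart.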
Then the ASCII block, the matrix itself. Each row is a depth followed by one value per declared channel, and we wrote 300 rows, the depth resolution our validation work standardised on, with each recovered curve interpolated onto that grid using SciPy [3]. The arithmetic that fills those cells is the curve recovery the rest of the pipeline did. The discipline that makes them a valid file is the writer's: every row the same width, every width equal to the channel count, the depth column monotonic, the gaps filled with the sentinel.
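The resampling step can be sketched without any dependencies. The version below uses plain linear interpolation as a stand-in for the SciPy routine; which interpolation scheme the pipeline actually used is not stated here, so treat the linear choice, the function name and the sample numbers as assumptions.

```python
def resample_linear(depths, values, n_rows=300):
    """Interpolate (depths, values) onto n_rows evenly spaced depths.

    depths must be strictly increasing and at least two samples long.
    """
    if len(depths) != len(values) or len(depths) < 2:
        raise ValueError("need two or more matching depth/value samples")
    if n_rows < 2:
        raise ValueError("n_rows must be at least 2")
    start, stop = depths[0], depths[-1]
    step = (stop - start) / (n_rows - 1)
    out_depths, out_values = [], []
    j = 0  # index of the left end of the current source interval
    for i in range(n_rows):
        # Pin the last row to the exact stop depth to avoid rounding drift.
        d = stop if i == n_rows - 1 else start + i * step
        while j < len(depths) - 2 and depths[j + 1] < d:
            j += 1
        d0, d1 = depths[j], depths[j + 1]
        t = (d - d0) / (d1 - d0)
        out_depths.append(d)
        out_values.append(values[j] + t * (values[j + 1] - values[j]))
    return out_depths, out_values


# Three made-up samples stretched to the 300-row grid.
grid, nphi = resample_linear([1000.0, 1010.0, 1020.0], [0.2, 0.3, 0.1])
print(len(grid), grid[0], grid[-1])
```

Because the output depths are generated rather than recovered, the grid is strictly increasing by construction, which takes care of the depth rule for the rows this step produces.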
4 named LAS sections
2 / 3 curves per multiclass log / per Tracks 1 and 2
300 depth rows in the ASCII matrix
The exhibit below is the writer made visible. It assembles the four sections live as you set how many curve channels to emit, two for a multiclass log or three for a Tracks 1 and 2 style log, and it runs the two checks the writer must pass before the file is allowed to call itself valid.
Drag the channel count and the ASCII matrix widens to match the Curve section, because the two are bound together by the count rule. Then flip the depth fault, which injects a single out-of-order depth sample, and watch the file fall straight to INVALID. That is not a cosmetic warning. It is the writer refusing to emit, because a non-monotonic depth index is exactly the thing a downstream loader trips over, and catching it in our code is far cheaper than catching it in the operator's.
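The depth check behind that verdict is short enough to sketch. The function names are hypothetical, and swapping two neighbouring samples is only one assumed way of producing an out-of-order depth; the exhibit's own fault injection may differ.

```python
def first_depth_fault(depths):
    """Return the index of the first depth that does not exceed its predecessor, else None."""
    for i in range(1, len(depths)):
        # 'not a > b' also rejects equal depths and NaN.
        if not depths[i] > depths[i - 1]:
            return i
    return None


def verdict(depths):
    """VALID only when the depth index increases strictly from row to row."""
    return "VALID" if first_depth_fault(depths) is None else "INVALID"


def inject_fault(depths, at):
    """Return a copy with sample `at` swapped with the one before it."""
    faulty = list(depths)
    faulty[at - 1], faulty[at] = faulty[at], faulty[at - 1]
    return faulty


# A made-up 300-row index at a 0.5 ft step.
depths = [1000.0 + 0.5 * i for i in range(300)]
faulty = inject_fault(depths, 150)

print(verdict(depths))
print(verdict(faulty), first_depth_fault(faulty))
```

Returning the offending index, rather than a bare boolean, makes the refusal actionable: the writer can report which row broke the index.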
The afternoon the file looked perfect and opened to nothing
The bug that taught us the most did not raise an exception. An early version of the writer produced a file that lasio itself parsed without complaint, that looked correct when we printed it, and that the operator's software opened to a completely flat curve. The depths were there. The values were there. The display showed a straight line at the null sentinel for the entire interval.
The cause was the count rule, violated invisibly. We had declared two curves in the Curve section but, through an off-by-one in how we stacked the recovered arrays, written only a depth column and a single value column into the ASCII block, so three declared channels (DEPT plus the two curves) sat above two columns of data. lasio did not object, because the file was still rectangular; it simply mapped our one value column onto the first declared curve and filled the second with nulls. The loader then drew the all-null second curve as a flat line, and our beautiful 0.0132 error curve was sitting in a column nobody was reading. The fix was one line. The lesson was that the model's accuracy is held hostage by an integer, the agreement between the channel count you declare and the number of columns you write, and that this is the single check most worth enforcing before the file ever leaves your process. It is now the first thing the writer asserts, and it is the check the exhibit foregrounds.
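A sketch of that assertion, together with one plausible way the stacking could lose a curve. The slice that drops RHOB is an invented stand-in for the real off-by-one, which is not reproduced here, and the helper names and sample values are hypothetical.

```python
def stack_columns(depths, curves):
    """Build rows of [depth, value_1, value_2, ...] from equal-length lists."""
    for curve in curves:
        if len(curve) != len(depths):
            raise ValueError("every curve must have one value per depth")
    return [[d] + [curve[i] for curve in curves] for i, d in enumerate(depths)]


def assert_count_rule(declared, rows):
    """Raise if any row's width differs from the number of declared channels."""
    for n, row in enumerate(rows):
        if len(row) != len(declared):
            raise ValueError(
                f"row {n} has {len(row)} columns, "
                f"but the Curve section declares {len(declared)}"
            )


declared = ["DEPT", "NPHI", "RHOB"]
depths = [1000.0, 1000.5, 1001.0]
nphi = [0.250, 0.260, 0.255]
rhob = [2.450, 2.440, 2.460]

good = stack_columns(depths, [nphi, rhob])
bad = stack_columns(depths, [nphi, rhob][:1])  # hypothetical slip: RHOB never stacked

assert_count_rule(declared, good)  # passes silently
try:
    assert_count_rule(declared, bad)
except ValueError as err:
    print("refusing to write:", err)
```

The `bad` matrix is still perfectly rectangular, which is why only a comparison against the declared channel list catches it.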
What the writer had to get right
- The deliverable is a spec-compliant LAS file, not a prediction array: four named sections in canonical order, Version then Well then Curve then ASCII, each obeying the 1992 standard the operator's software parses.
- The most dangerous bug is silent: if the channel count declared in the Curve section disagrees with the number of columns in the ASCII matrix, a tolerant parser will still load the file and quietly drop a curve, so the count check and a strictly increasing depth index are the two assertions the writer runs before it emits.
- Built against the 7,781 reference LAS files in the public Texas Railroad Commission dataset and carrying recovered curves at a best MAE of 0.0132 across 300 depth rows, the writer is what turned a correct two-curve array into a file a petrophysicist could actually open.
What writing the last program changed about how we scoped the next one
The habit this work left us with is to write the output file format on day one of any digitization project, before a single epoch of training, as an empty but valid LAS file that the operator's software will open. It feels backwards to produce the deliverable before producing anything to put in it. It is the most useful thing we now do, because it forces the depth index, the curve mnemonics, the unit strings, and the null sentinel to be decided while they are cheap to change, and it turns the model's job into filling a contract that already parses rather than inventing one at the end under deadline. The curve a petrophysicist loaded from this engagement read 0.0132 away from the truth on average, but it read at all only because a short, fussy writer placed it inside a grammar that the rest of the pipeline never had to think about. We had spent a year on the part that predicts. The part that no one talks about, the part that writes the file, is the part that decided whether any of it could be opened.
References
[1] Canadian Well Logging Society (1992). LAS Version 2.0: A Digital Standard for Logs, Log ASCII Standard. https://www.cwls.org/products/#products-las
[2] Kinverarity, W. (2021). lasio: Log ASCII Standard (LAS) files in Python. Software documentation. https://lasio.readthedocs.io/en/latest/
[3] Virtanen, P. et al. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17. https://www.nature.com/articles/s41592-019-0686-2
[4] Railroad Commission of Texas (2022). Digital Well Log Files and Imaged Records, Public Data Sets. https://www.rrc.texas.gov/resource-center/research/data-sets-available-for-download/