The parts of a machine-learning system that fail in production are almost never the model. The model is the part that got the attention, the sweeps, the loss-curve screenshots, the demo that convinced someone. What breaks at serve time is the plumbing around it: the code that reshapes an input before the network sees it, the code that decides what a missing value means, the code that writes the answer somewhere the next process can read it. This note is a checklist of those quiet details for VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned well logs for a Texas onshore operator, and it is deliberately not a description of the model. It is a description of everything standing between a model that scores well on a held-out split and a service you can point a pipeline at.
That these defects are quiet is the whole problem. They do not raise an error on the happy path. They pass every test written against clean, correctly shaped, sentinel-free input, which is exactly the input a demo runs on. Then a real scan arrives at an odd size, or with a -999 where a reading should be, or the write lands in a layout the reader downstream does not expect, and the output is wrong without anything having crashed. Sculley and colleagues gave the general version of this: the model is a small box in the middle of a large system, and the risk lives in the glue around it, the pipelines and data dependencies that no one owns [3]. The entries below are that glue, made specific.
The one you can compute exactly
Start with alignment, because it is the defect you can predict to the pixel rather than discover in a log file. VeerNet is an encoder-decoder in the U-Net family, and its encoder runs five stride-2 stages [1]. Each stride-2 stage halves the spatial dimensions of the tensor, so after five of them the feature map is the input divided by 2^5, which is thirty-two. That arithmetic is only clean if the input height and width are themselves divisible by thirty-two. If they are not, at least one halving does not land on an integer, the decoder's five upsampling stages cannot line their skip connections back up with the encoder's, and the network either refuses the input or, worse, silently mis-strides and hands back a curve shifted off its true depth.
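The halving can be traced by hand for any candidate size. The sketch below is a toy model, not VeerNet code: `trace_sizes` and `skips_line_up` are hypothetical helpers, and it assumes each stride-2 stage floors an odd size and each decoder stage exactly doubles (a padded stride-2 convolution may round up instead, which moves the mismatch but does not remove it).

```python
def trace_sizes(size, stages=5):
    """Return (encoder sizes, input first; decoder sizes, bottleneck first)."""
    encoder = [size]
    for _ in range(stages):
        encoder.append(encoder[-1] // 2)  # assumption: a stride-2 stage floors odd sizes
    decoder = [encoder[-1]]
    for _ in range(stages):
        decoder.append(decoder[-1] * 2)   # assumption: each upsampling stage doubles
    return encoder, decoder


def skips_line_up(size, stages=5):
    """True when every decoder size equals the encoder size it would be joined with."""
    encoder, decoder = trace_sizes(size, stages)
    # decoder[k] is paired with encoder[stages - k]
    return all(decoder[k] == encoder[stages - k] for k in range(stages + 1))


for height in (480, 500, 512):
    encoder, decoder = trace_sizes(height)
    print(height, encoder, decoder, skips_line_up(height))
```

Under these assumptions a height of 500 shrinks to 15 at the bottleneck and grows back to only 480, so the skip connections disagree from the first upsampling stage onward, while 480 and 512 round-trip exactly.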
Now put the real data against that constraint. The synthetic logs VeerNet trains and serves on are not a fixed size. They span a wide band: widths from 3200 to 12800 pixels and heights from 480 to 640 pixels. The ends of that band happen to sit on the grid, but only one value in every thirty-two inside it is a multiple of thirty-two. A 6180-pixel-wide scan is not; a 500-pixel-tall one is not. So at serve time, before the forward pass, every scan has to be padded up to the next multiple of thirty-two, which means adding thirty-two minus the remainder in pixels: 6180 leaves a remainder of 4 and needs 28 pixels to reach 6208, and 500 leaves a remainder of 20 and needs 12 to reach 512. That padding then has to be tracked so the output can be cropped back to the true extent afterward. Get the pad wrong and the answer is off by exactly the padding you forgot to remove. It is a two-line fix, and skipping it opens a whole class of subtly wrong outputs.
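A minimal sketch of the pad-and-crop bookkeeping, in plain Python so it runs without any array library. The function names are hypothetical, and two choices here are assumptions rather than anything the pipeline is known to do: padding goes on the bottom and right edges only, and the fill value is zero.

```python
def pad_amount(size, multiple=32):
    """Pixels needed to reach the next multiple (0 if already aligned)."""
    return (-size) % multiple


def pad_to_multiple(image, multiple=32, fill=0.0):
    """Pad a row-major 2D list on the bottom and right; return (padded, true_shape)."""
    height, width = len(image), len(image[0])
    pad_h = pad_amount(height, multiple)
    pad_w = pad_amount(width, multiple)
    padded = [row + [fill] * pad_w for row in image]
    padded += [[fill] * (width + pad_w) for _ in range(pad_h)]
    return padded, (height, width)


def crop_to_shape(image, shape):
    """Undo pad_to_multiple using the recorded true shape."""
    height, width = shape
    return [row[:width] for row in image[:height]]


print(pad_amount(6180), pad_amount(500))   # padding for the two example dimensions

scan = [[1.0] * 70 for _ in range(50)]     # small stand-in for a real scan
padded, true_shape = pad_to_multiple(scan)
print(len(padded), len(padded[0]))         # padded height and width
restored = crop_to_shape(padded, true_shape)
print(restored == scan)
```

Returning the true shape alongside the padded image is the design point: the crop cannot be forgotten if the caller has to carry the shape to get its output back.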
Alignment earns the worked example because it makes the shape of the whole ledger visible. The defect is not in the model, it is in the code that prepares the input; it never fires on a conveniently sized test image; and when it does fire it corrupts the output quietly rather than loudly. Every other entry has the same shape.
The sentinels that poison a normalisation
The second entry is missing values, and it is quiet in a more dangerous way, because floating-point arithmetic will help you lose the answer. Real log data arrives with holes, and those holes are not empty. They are filled with sentinels: -999 is a decades-old convention for a null reading in petrophysical files, and the ingestion path also sees NA and None depending on how the source was exported. If one of those slips past the guard and into a tensor, and you then normalise, the sentinel does not just contribute a wrong value. A -999 stays a number, and an extreme one; an NA or None that gets coerced to a float typically becomes NaN. Under IEEE 754 an arithmetic operation with a NaN operand returns NaN, so NaN propagates through essentially every operation after it [4]. A -999 that survives into a mean-and-standard-deviation normaliser drags the statistics for the whole scan; a NaN that survives into a division turns the entire feature map to NaN. Either way the corruption is not local to the bad pixel. It spreads, and the model produces confident garbage rather than an error.
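Both failure modes are easy to reproduce on a handful of numbers. The readings below are made-up illustrative values, not archive data, and `mean_std` is a hypothetical stand-in for whatever normaliser the pipeline uses.

```python
import math


def mean_std(values):
    """Population mean and standard deviation of a flat list."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


readings = [0.42, 0.45, 0.47, 0.44]      # illustrative clean values
with_sentinel = readings + [-999.0]      # one null reading left in place
with_nan = readings + [float("nan")]     # one missing value coerced to NaN

clean_mean, clean_std = mean_std(readings)
bad_mean, bad_std = mean_std(with_sentinel)
print(clean_mean, clean_std)             # statistics of the real readings
print(bad_mean, bad_std)                 # statistics dragged by a single -999

nan_mean, nan_std = mean_std(with_nan)
normalised = [(v - nan_mean) / nan_std for v in with_nan]
print(all(math.isnan(v) for v in normalised))   # every output is NaN, not just the bad one
```

Neither case raises an exception, which is the point: the statistics simply come out wrong, or come out NaN, and everything computed from them inherits the damage.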
So the guard has to be earlier than the model and it has to be exhaustive. Every sentinel form the source can emit, -999, NA, None, has to be recognised as missing before any arithmetic runs on it, replaced or masked deliberately rather than accidentally, and the decision recorded so a curve full of interpolated fill is not mistaken for a curve full of measurement. This is the same lesson alignment taught, one layer in: the input has to be made safe before the network sees it, and the place that gets it wrong is the code no one thought of as part of the model.
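One way such a guard might look, as a sketch under stated assumptions: the function names are hypothetical, the sentinel forms are the three named above plus their string spellings, and filling missing positions with 0.0 after normalisation is one possible policy rather than the pipeline's confirmed one.

```python
import math


def is_missing(value):
    """True for every sentinel form this sketch knows about: -999, NA, None, NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        text = value.strip()
        if text in {"NA", "None"}:
            return True
        value = float(text)   # unrecognised text raises ValueError, loudly
    number = float(value)
    return math.isnan(number) or number == -999.0


def guard(raw):
    """Return (values, mask): NaN where missing, mask True where a reading was measured."""
    mask = [not is_missing(v) for v in raw]
    values = [float(v) if ok else math.nan for v, ok in zip(raw, mask)]
    return values, mask


def normalise(values, mask, fill=0.0):
    """Normalise using measured readings only; missing positions get `fill` (an assumption)."""
    valid = [v for v, ok in zip(values, mask) if ok]
    if not valid:
        raise ValueError("no measured readings to normalise")
    mean = sum(valid) / len(valid)
    std = math.sqrt(sum((v - mean) ** 2 for v in valid) / len(valid)) or 1.0
    return [(v - mean) / std if ok else fill for v, ok in zip(values, mask)]


raw = [0.42, "NA", -999, None, "0.47", "-999", 0.44]
values, mask = guard(raw)
output = normalise(values, mask)
print(mask)
print(output)
```

The mask is the recorded decision: it travels with the data, so a downstream consumer can tell filled positions from measured ones.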
The write that has to match the reader
The third entry moves past the model entirely, to what happens after inference. A served prediction is not useful until it is written somewhere the next process can read it, and for this system that means an S3 write-back. The trap is that a write can succeed, return a clean status, and still be useless, because usefulness is not whether the bytes landed, it is whether the layout of what landed matches what the reader downstream expects: the key structure, the file format, the channel ordering, the metadata that says which depth range this curve covers. A grayscale output written in a layout the reader parses as something else is not a crash. It is a file that exists, that a dashboard counts as a successful inference, and that produces wrong curves for anyone who opens it. The write-back is a contract with a process you are not looking at when you write the serving code.
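The contract can be made checkable before any bytes leave the process. Everything concrete in the sketch below is hypothetical: the key pattern, the PNG content type, the field names, and the example depths are invented for illustration, since the real reader's layout is not given here, and the actual S3 upload call is deliberately left out.

```python
import re
from dataclasses import dataclass

# Hypothetical contract: what the downstream reader is assumed to parse.
KEY_PATTERN = re.compile(r"^predictions/[A-Za-z0-9_-]+/curve\.png$")
EXPECTED_CONTENT_TYPE = "image/png"
EXPECTED_CHANNELS = 1  # single-channel grayscale


@dataclass
class WriteBack:
    key: str
    content_type: str
    channels: int
    depth_top: float
    depth_bottom: float


def contract_violations(obj):
    """Return human-readable mismatches; an empty list means the layout matches."""
    problems = []
    if not KEY_PATTERN.match(obj.key):
        problems.append(f"key {obj.key!r} does not match the reader's key structure")
    if obj.content_type != EXPECTED_CONTENT_TYPE:
        problems.append(f"content type {obj.content_type!r} is not {EXPECTED_CONTENT_TYPE!r}")
    if obj.channels != EXPECTED_CHANNELS:
        problems.append(f"{obj.channels} channels written, reader expects {EXPECTED_CHANNELS}")
    if not obj.depth_top < obj.depth_bottom:
        problems.append("depth range is empty or inverted")
    return problems


good = WriteBack("predictions/well_0042/curve.png", "image/png", 1, 1200.0, 1850.0)
bad = WriteBack("out/well_0042.png", "image/png", 3, 1850.0, 1200.0)

print(contract_violations(good))
for problem in contract_violations(bad):
    print(problem)
```

A check like this would run just before the upload and refuse to write on any violation, so that "the write succeeded" and "the reader can use it" stop being separate facts.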
This is why the ledger is a ledger and not a single fix. Alignment is a contract with the encoder's arithmetic. Sentinel handling is a contract with floating-point arithmetic. The write-back is a contract with a downstream reader. None of them is negotiable at serve time, none of them is visible in a demo, and each of them turns correct model output into wrong service output if it is quietly broken.
Why GroupNorm and grayscale belong on the list too
Two architectural facts sit under the ledger because they decide what counts as safe input. VeerNet normalises with GroupNorm, which computes its statistics over groups of channels with a minimum group size of sixteen and, unlike batch normalisation, does not depend on other samples in the batch [2]. That batch-independence is what lets the model serve one scan at a time without its normaliser behaving differently than it did in training, a serving property and not a training footnote. And the input is single-channel grayscale, so the sentinel guard and the alignment pad both operate on exactly one plane, with no colour channels to hide a bad value in or pad inconsistently. These are the constraints that decide whether a given piece of input-preparation code is correct, and they are why the checklist has the exact entries it has.
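The batch-independence claim can be shown on toy numbers. This is a plain-Python sketch without learned scale or shift, not the layer VeerNet uses: `group_norm` and `batch_norm_train` are hypothetical helpers, the data is synthetic, and the comparison is against batch normalisation computing statistics over the current batch, as it does in training mode.

```python
import math


def _normalise(flat, eps=1e-5):
    """Return (mean, scale) so that (v - mean) * scale has ~zero mean, unit variance."""
    mean = sum(flat) / len(flat)
    variance = sum((v - mean) ** 2 for v in flat) / len(flat)
    return mean, 1.0 / math.sqrt(variance + eps)


def group_norm(sample, num_groups):
    """sample: list of channels (flat lists). Statistics come from this sample only."""
    per_group = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        channels = sample[g * per_group:(g + 1) * per_group]
        mean, scale = _normalise([v for ch in channels for v in ch])
        out.extend([[(v - mean) * scale for v in ch] for ch in channels])
    return out


def batch_norm_train(batch):
    """Per-channel statistics pooled over every sample in the batch."""
    out = [[None] * len(batch[0]) for _ in batch]
    for c in range(len(batch[0])):
        mean, scale = _normalise([v for sample in batch for v in sample[c]])
        for i, sample in enumerate(batch):
            out[i][c] = [(v - mean) * scale for v in sample[c]]
    return out


# 32 synthetic channels in 2 groups, so each group holds sixteen channels.
scan = [[float(c + i) for i in range(4)] for c in range(32)]
other = [[10.0 * v + 5.0 for v in ch] for ch in scan]

gn_alone = group_norm(scan, 2)
gn_in_batch = [group_norm(s, 2) for s in (scan, other)][0]
bn_alone = batch_norm_train([scan])[0]
bn_in_batch = batch_norm_train([scan, other])[0]

print(gn_alone == gn_in_batch)   # group norm: same output with or without a batch mate
print(bn_alone == bn_in_batch)   # batch statistics: output changes with the batch
```

The group normaliser never sees the rest of the batch, so there is nothing for a batch of one to change, which is what makes single-scan serving match training.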
The checklist is the deliverable
The habit all of this leaves us with is to treat the model as the easy part and the serving path as the thing that needs a written checklist. Before a VeerNet checkpoint is a service rather than a demo, the same short list runs every time: is the input padded up to a multiple of thirty-two and will the output be cropped back; is every sentinel form caught before any arithmetic; does the write-back layout match what the reader downstream parses. Not one of those questions is about whether the model is any good. They are about whether the plumbing around a good model is honest, and the difference between the two is the difference between a screenshot that convinced someone and a pipeline that runs unattended on scans no one has looked at.
Limitations
This is a serving checklist from one engagement, not a general theory of inference hygiene, and it should be read as the entries that mattered for this system rather than a complete taxonomy. The alignment stride of thirty-two follows exactly from five stride-2 encoder stages, and the dimension band, the GroupNorm minimum group size, the single-channel input, and the sentinel values are the real archive figures; the specific 6180-pixel raw dimension the instrument opens on is just a mid-band example chosen to sit off the thirty-two grid, and the padding it draws for every other dimension is exact arithmetic, not measured data. The three ledger entries are the defects this pipeline actually hit, not the only defects a serving path can have; a different architecture with a different downsampling factor would need a different alignment number, and a source that emits different sentinels would need a different guard. And a clean checklist buys correct plumbing, not a correct model. It says nothing about whether the digitised curve is accurate, only that a good curve will not be corrupted on its way out, which is a necessary condition for a service and nowhere near a sufficient one.
References
[1] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015, LNCS 9351, Springer, pp. 234-241. The encoder-decoder architecture whose repeated downsampling is why an input dimension must be divisible by a power of two. https://link.springer.com/chapter/10.1007/978-3-319-24574-4_28
[2] Wu, Y., and He, K. Group Normalization. ECCV 2018, pp. 3-19. The batch-independent normaliser whose group size and single-sample behaviour are serving-time properties. https://openaccess.thecvf.com/content_ECCV_2018/html/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.html
[3] Sculley, D., et al. Hidden Technical Debt in Machine Learning Systems. NIPS 2015. The argument that production ML risk lives in the glue, pipelines, and data dependencies around the model rather than the model itself. https://proceedings.neurips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[4] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019. The NaN propagation rule, that any arithmetic touching a NaN returns NaN, which is why one unguarded sentinel poisons everything downstream. https://ieeexplore.ieee.org/document/8766229