Eight Real Scans, One Reviewer: Standing Up Human-in-the-Loop Validation

The model could read a scanned well-log, and on the days everything went right it read it well. What it could not do, and what no segmentation network can do, was tell a customer that the curve it had traced was correct. Somebody has to sign that, and on this project that somebody was one reviewer with eight real field scans, the operator's original logs to check against, and a question that had no good answer if we ducked it: how does a person actually validate a model's output before it becomes a file someone drills against? This is the account of how we wired that person into the first production frontend, and why the wiring mattered more than the accuracy number we kept quoting.

A demo is not a delivery

VeerNet, the network we built to digitize raster well-logs, returned a three-class mask for every scan: background, curve one, curve two, one label per pixel. The peak goodness-of-fit we could show on a well-recovered curve was real, and it was the number that opened every conversation. But a peak is one curve on one scan, and the operator was not buying a peak. They were buying eight digitized logs they could load into a petrophysical workflow without first wondering which rows the model had invented. The gap between those two things, between a metric we could quote and a delivery we could stand behind, was the whole job, and it was a human-in-the-loop job, not a modelling one.

So we held a small line internally that turned out to matter for the rest of the engagement. No CSV would leave the building unread. Every one of the eight scans would be put in front of the reviewer, overlaid on the model's prediction, checked against the operator's own log, corrected where it was wrong, and signed off where it was right. That is an obvious-sounding rule until you try to honour it for a model whose two curve masks were not very good, and then it becomes the design problem.

The constraint that set the design

The honest picture of the model is the one we had to design around. Per region, the recall told the story plainly. The background mask recalled at 0.97, which is to say the model almost never missed the empty page. The two curve masks recalled at 0.37 and 0.32, which is to say the model found roughly a third of the curve pixels it should have found and left the rest for someone to notice ^[3]. If the reviewer treated every region the same, opening each one, comparing it pixel by pixel, and correcting it, the eight scans would have been a long, undifferentiated grind, and the part of the grind that mattered, the faint curves, would have been buried under the part that did not, the background the model already had right.
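To make the recall figures concrete, here is a minimal sketch of how a per-class recall can be computed from a labelled mask and a predicted mask. The label encoding (0 for background, 1 and 2 for the two curves) and the toy masks are assumptions for illustration, not the project's data or evaluation code.

```python
# Per-class recall: of the pixels that truly belong to a class,
# what fraction did the prediction assign to that class?
BACKGROUND, CURVE_ONE, CURVE_TWO = 0, 1, 2  # assumed label encoding


def per_class_recall(truth, pred, labels=(BACKGROUND, CURVE_ONE, CURVE_TWO)):
    hits = {c: 0 for c in labels}
    totals = {c: 0 for c in labels}
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            totals[t] += 1
            if t == p:
                hits[t] += 1
    # A class with no true pixels has undefined recall.
    return {c: (hits[c] / totals[c] if totals[c] else float("nan")) for c in labels}


# Toy 3x4 masks: the prediction keeps all background but drops most curve pixels.
truth = [
    [0, 0, 1, 0],
    [0, 1, 0, 2],
    [0, 1, 2, 0],
]
pred = [
    [0, 0, 1, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]

recall = per_class_recall(truth, pred)
for label, value in recall.items():
    print(label, round(value, 2))
```

The toy case mirrors the shape of the real result: background recall is perfect while the thin curve classes lose most of their pixels, which is exactly the imbalance the design has to route around.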

The goal we set was therefore not to make the reviewer faster in general. It was to spend the reviewer's attention where the model was weak and to stop spending it where the model was strong ^[1]. A region the model was confident about, and right about, should flow through without a person ever opening it. A region the model was unsure about should land in a queue, flagged, with the overlay ready and the operator log beside it. The reviewer should see the doubtful curves first and the settled background never. That is a routing problem, and recall was the signal we already had to route on.

Wiring the reviewer in

What we built was a confidence-sorted validation queue inside the first production frontend, and the sorting key was the model's own per-region recall used as a confidence proxy ^[2]. The frontend took each of the eight scans, split its prediction into the three regions, and asked one question of each: does this region clear the confidence bar? If it did, the region was auto-accepted and never reached the reviewer's worklist. If it did not, the region dropped into the reviewer queue, rendered as an overlay on the scan with the operator's log alongside, so the correction was a matter of dragging the trace back onto the ink rather than reconstructing it from nothing.

The bar was the lever, and it was a real decision rather than a default. Set it high and almost everything queues, including the background the model already had right, and the reviewer is back to the undifferentiated grind. Set it low and the curves slip through unchecked, which for a deliverable that gets drilled against is the failure that actually costs money. The setting we shipped put the background safely above the bar and both faint curves safely below it, so the eight scans entered as twenty-four regions, the eight background regions cleared on their own, and the sixteen curve regions went to the reviewer. That split is the engine of the next section, and it is what the instrument below lets you drive.
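The gate described above can be sketched in a few lines. This is an illustrative model only: the `Region` type, the function names, and the bar value of 0.80 are hypothetical (the shipped bar is not stated), and assigning 0.37 to curve one and 0.32 to curve two is an assumption about ordering. Only the three recall figures and the eight scans come from the engagement.

```python
from dataclasses import dataclass

# Per-region recall used as a confidence proxy (figures from the article;
# which curve carries which value is assumed).
RECALL = {"background": 0.97, "curve_one": 0.37, "curve_two": 0.32}


@dataclass(frozen=True)
class Region:
    scan_id: int
    kind: str
    confidence: float


def build_regions(n_scans=8):
    """Each scan contributes one background region and two curve regions."""
    return [Region(scan, kind, conf)
            for scan in range(1, n_scans + 1)
            for kind, conf in RECALL.items()]


def route(regions, bar):
    """Auto-accept regions at or above the bar; queue the rest,
    least confident first, so the reviewer sees the weakest fits first."""
    accepted = [r for r in regions if r.confidence >= bar]
    queue = sorted((r for r in regions if r.confidence < bar),
                   key=lambda r: r.confidence)
    return accepted, queue


BAR = 0.80  # hypothetical value between the curve recalls and the background recall
accepted, queue = route(build_regions(), BAR)
print(len(accepted), "auto-accepted;", len(queue), "queued for review")
```

Any bar strictly above 0.37 and at or below 0.97 produces the same 8/16 split, which is why the choice is about which side of the bar each class lands on rather than about a precise number.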

The hard part was not the queue. It was making the corrected output land somewhere trustworthy. A reviewer dragging a trace back onto faded ink produces a fixed mask, but the customer artefact is not a mask. It is a depth-indexed curve, the kind of record the LAS standard describes ^[4], delivered here as a CSV, so each correction had to flow back through the same indexing and interpolation the auto-accepted regions went through, or a hand-fixed curve would have read worse than a machine-accepted one. We learned that the cheapest way to lose the reviewer's trust was to show her a clean overlay and then deliver a CSV that did not match it because the correction had bypassed the assembly step. Once corrections and auto-accepts shared one path to the CSV, the overlay she signed and the file we shipped were the same thing.
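One way to picture the shared path is a single assembly function that both auto-accepted and reviewer-corrected masks must pass through. The sketch below is an assumption-laden illustration, not the production code: the function names, the mean-column trace, the linear gap filling, the depth parameters, and the pixel-to-unit `scale` are all hypothetical.

```python
import csv
import io

CURVE = 1  # assumed mask label for the curve being assembled


def trace_from_mask(mask, label):
    """For each pixel row (one depth step), return the mean column of the
    labelled pixels, or None where the row has no pixel of that label."""
    trace = []
    for row in mask:
        cols = [i for i, value in enumerate(row) if value == label]
        trace.append(sum(cols) / len(cols) if cols else None)
    return trace


def fill_gaps(trace):
    """Linearly interpolate rows with no curve pixel; hold edge gaps at the
    nearest known value."""
    known = [i for i, v in enumerate(trace) if v is not None]
    if not known:
        raise ValueError("mask contains no pixels for this curve")
    filled = list(trace)
    for i, v in enumerate(trace):
        if v is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            weight = (i - lo) / (hi - lo)
            filled[i] = trace[lo] + weight * (trace[hi] - trace[lo])
        else:
            filled[i] = trace[before[-1] if before else after[0]]
    return filled


def assemble_csv(mask, label, top_depth, depth_step, column_to_value):
    """The single path to the deliverable: mask -> trace -> filled trace -> CSV."""
    trace = fill_gaps(trace_from_mask(mask, label))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["depth", "value"])
    for i, column in enumerate(trace):
        writer.writerow([round(top_depth + i * depth_step, 3),
                         round(column_to_value(column), 3)])
    return buffer.getvalue()


model_mask = [
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],  # the model missed the curve on this row
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
corrected_mask = [row[:] for row in model_mask]
corrected_mask[1][1] = CURVE  # the reviewer drags the trace back onto the ink

scale = lambda column: column * 10.0  # hypothetical pixel-to-unit calibration
auto_csv = assemble_csv(model_mask, CURVE, 1000.0, 0.5, scale)
fixed_csv = assemble_csv(corrected_mask, CURVE, 1000.0, 0.5, scale)
print(fixed_csv)
```

Because the correction changes only the mask and never the file directly, the depth index and formatting of the two outputs are identical by construction; only the corrected row's value differs.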

What the eight scans showed

Run the eight scans through the gate and the arithmetic is small and clear. Twenty-four predicted regions enter. With the bar set where we shipped it, the eight background regions auto-accept and the sixteen curve regions queue, so the reviewer opens two-thirds of the regions and skips the third the model had already earned. The curves she corrects come out at a CSV mean absolute error of 0.11 and 0.12 against the operator's own log, which is the number that mattered: not how the mask scored in pixel space, but how far the delivered curve sat from the truth the customer could check.
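For readers who want the error metric spelled out, mean absolute error is the average of the absolute differences between the delivered curve and the reference at matching depths. The sketch below uses invented sample values purely to show the calculation; they are not the engagement's data, and the units or normalisation behind the 0.11 and 0.12 figures are not stated here.

```python
def mean_absolute_error(delivered, reference):
    """Average absolute difference between two curves sampled at the same depths."""
    if not delivered or len(delivered) != len(reference):
        raise ValueError("curves must be non-empty and the same length")
    return sum(abs(d - r) for d, r in zip(delivered, reference)) / len(delivered)


# Hypothetical values at four matching depths.
delivered = [0.50, 0.62, 0.71, 0.80]
reference = [0.45, 0.70, 0.60, 0.95]

mae = mean_absolute_error(delivered, reference)
print(round(mae, 4))
```

The metric is computed on the delivered curve rather than on the mask, so it measures what the customer can actually check against their own log.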

Figure: the human-in-the-loop story told as a routing funnel. Eight real field scans enter validation, each carrying three predicted regions (one background mask plus two curve masks), and the model's own recall on each region is the confidence the gate reads. Background recall is high at 0.97, so the background almost always clears the bar and auto-accepts; the two curve masks recall far lower at 0.37 and 0.32, so most curve predictions drop into the reviewer queue to be checked and corrected against the operator's own log. Moving the confidence bar changes the split: a higher bar routes more regions to the reviewer (slower, safer), a lower bar auto-accepts more (faster, but weaker fits slip through). The readout reports reviewer touches saved and the post-correction error the surviving CSV clears, a mean absolute error of 0.11 and 0.12 per curve against the log. The sourced figures are the engagement's own: 8 scans validated end to end by one reviewer, the three per-region recalls, and the two CSV error figures. The recall-to-routing gate and the touches-saved estimate are an illustrative model built on top of those sourced recalls.

The instrument makes the trade visible, because the trade is the point. Drag the confidence bar up and more regions queue, the reviewer works harder, and fewer doubtful curves escape unchecked. Drag it down and more regions auto-accept, the reviewer works less, and the risk of an unverified curve in the delivery rises. There is no setting that is free. The value of putting a person in the loop was never that the loop made the work disappear; it was that the loop let us choose, deliberately and per region, which work a person did and which the model was trusted to do alone, and to defend that choice to a customer who was going to drill against the result.
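The same trade can be read off as a small sweep. This sketch assumes the simplest possible gate, in which every region of a class carries that class's recall as its confidence, so the counts move in steps of eight; the bar values are arbitrary sample points, not settings from the project.

```python
# Recall figures from the article, used as per-class confidence.
RECALL = {"background": 0.97, "curve_one": 0.37, "curve_two": 0.32}
N_SCANS = 8
TOTAL_REGIONS = N_SCANS * len(RECALL)


def queued_count(bar):
    """Regions sent to the reviewer: every class whose confidence is below the bar."""
    return N_SCANS * sum(1 for conf in RECALL.values() if conf < bar)


for bar in (0.30, 0.35, 0.80, 0.99):
    queued = queued_count(bar)
    print(f"bar={bar:.2f}  queued={queued:2d}  auto-accepted={TOTAL_REGIONS - queued:2d}")
```

At the low end nothing is reviewed and every curve ships unchecked; at the high end the reviewer opens all twenty-four regions, including the background the model already had right.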

What stayed with us

The reflex we carried out of these eight scans is to measure a human-in-the-loop tool by where it sends attention, not by how much attention it removes. A digitizer that auto-accepts the wrong regions is worse than one that queues everything, because the second merely wastes a reviewer's afternoon while the first quietly ships a curve nobody looked at. Recall, the number we had been quoting as a model statistic, turned out to be most useful as a routing instruction: 0.97 on the background meant trust it, 0.37 and 0.32 on the curves meant do not, and the frontend's job was to act on that difference rather than average it away. Eight scans is a small number, and one reviewer is a fragile process, but the shape of the thing we built held when the project grew. The first file that left the building had been read by a person, and that, more than the peak fit, was what made it a delivery instead of a demo.

References

[1] Amershi, S. et al. (2019). Guidelines for Human-AI Interaction. CHI 2019. https://www.microsoft.com/en-us/research/publication/guidelines-for-human-ai-interaction/
[2] Settles, B. (2009). Active Learning Literature Survey. University of Wisconsin-Madison, Computer Sciences Technical Report 1648. https://minds.wisconsin.edu/handle/1793/60660
[3] Sorower, M. S. (2010). A Literature Survey on Algorithms for Multi-label Learning. Oregon State University. https://www.researchgate.net/publication/266888594_A_Literature_Survey_on_Algorithms_for_Multi-label_Learning
[4] Canadian Well Logging Society (1992). LAS Version 2.0: A Digital Standard for Logs, Log ASCII Standard. https://www.cwls.org/products/#products-las
