How Much Error Is Human? Negotiating Accuracy Thresholds With the Expert Your AI Assists

Before we reported a single recall number, we sat with the client's chief interpreter and agreed on what counts as a hit: 2, 4 or 6 cm of depth, and 1, 3 or 5 degrees of dip. Then we made the metric deliberately harsh. This is a method piece on co-designing evaluation with the domain expert, bounded from below by the sensor's own resolution floor, so that a claim like '75% recall' means something the expert will sign.

By Tannistha Maiti · 8 min read
EarthScan insight

The step nobody writes up sits between building the model and reporting how well it works. Before we quoted a single accuracy number to a mid-sized Middle East carbonate operator, we spent an afternoon with their chief interpreter arguing about centimetres. Not about the network, the loss, or the training data. About one question: when our model predicts that a fracture crosses the wellbore at a given depth, how close does that prediction have to be to the interpreter's own pick before we are allowed to call it correct? Two centimetres? Four? Six? The number we settled on decides everything downstream, because it defines the word "hit," and every recall, precision and F1 we ever report is computed from counts of hits under that definition.

That afternoon is the subject of this piece. Co-designing the evaluation metric with the domain expert is unglamorous, it produces no plots, and it is the difference between a number the interpreter signs and a number they quietly distrust.

Why the threshold is a decision, not a default

A detection model for borehole-image features does not output a category. It outputs geometry: for each planar feature it emits a depth, a dip and an azimuth, regressed against the interpreter's ground-truth pick. To score a prediction you have to compare two continuous quantities and decide whether they are close enough. "Close enough" is a tolerance, and a tolerance is a choice someone has to make and own.

The temptation is to pick a tolerance that flatters the model. Open the depth window to 9 cm and almost every prediction lands inside it; recall climbs and the slide looks good. The interpreter is not fooled by this, because they know their own picks are good to a couple of centimetres and a 9 cm hit is, to them, a miss they were told to count as a hit. So we inverted the incentive. We asked the interpreter to name the tightest window they would still be comfortable calling correct, and we agreed to report against that, not against whatever made the number largest.

The tested settings came out of that conversation. Depth tolerance: 2, 4 or 6 cm. Dip tolerance: 1, 3 or 5 degrees. These are not round numbers we invented; they are the granularity at which a human interpreter and a model can meaningfully be said to agree on a carbonate feature, and they became the axes we scored everything on.

The physical floor under the negotiation

There is a hard limit on how tight the interpreter is allowed to make the depth window, and it is not up for negotiation with anyone. It comes from the sensor.

The unrolled borehole image is a raster, and in this dataset one pixel of that raster corresponds to about 3 cm of true depth at low resolution. You cannot localise a feature more precisely than one raster cell, so a built-in ±3 cm depth uncertainty rides in every pick before any model runs. (We laid out that resolution ceiling in detail in a separate piece on the 1-pixel-equals-3-cm rule; the point here is what it does to the negotiation.) A depth tolerance of 2 cm asks the metric to resolve depth finer than the instrument itself can. It is not a harder test of the model; it is an impossible one, and any score reported against it is measuring quantisation noise, not skill. So when the interpreter reached for 2 cm to be maximally strict, we pushed back with the sensor floor: the harshest honest window is the one that sits at or above 3 cm, not below it.

This is the whole shape of the exercise. The tolerance is bounded above by what flatters the model and bounded below by what the sensor can physically resolve. The honest metric lives in the narrow band between the two, and it is the expert who has to place it there and sign it.
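The band described above can be sketched in a few lines of Python. This is a minimal illustration, not the engagement's code: the function name `admissible_depth_windows` is hypothetical, and treating 9 cm as the cutoff for a "flattering" window is an assumption (the article only uses 9 cm as an example of a window the interpreter would reject). The 3 cm floor is the pixel size quoted in the text.

```python
# One raster pixel is about 3 cm of depth at low resolution (figure from the article).
PIXEL_DEPTH_CM = 3.0


def admissible_depth_windows(candidates_cm, floor_cm=PIXEL_DEPTH_CM, flattering_cm=9.0):
    """Sort candidate depth tolerances into sub-floor, admissible, and flattering.

    flattering_cm is an assumed cutoff for illustration; in practice the upper
    bound is whatever the interpreter refuses to call a hit.
    """
    bands = {"sub_floor": [], "admissible": [], "flattering": []}
    for tol in candidates_cm:
        if tol < floor_cm:
            bands["sub_floor"].append(tol)      # finer than the sensor can resolve
        elif tol >= flattering_cm:
            bands["flattering"].append(tol)     # loose enough to inflate recall
        else:
            bands["admissible"].append(tol)     # the expert chooses from here
    return bands


bands = admissible_depth_windows([2, 4, 5, 6, 9])
print(bands)
```

The code only narrows the options. Choosing a value inside the admissible band is still the interpreter's call.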

[Figure: What counts as a hit, co-signed with the interpreter. Panel A: depth tolerance the expert signs (2 cm, marked sub-floor; 4 cm; 6 cm). Panel B: dip tolerance the expert signs (1°, 3°, 5°). Panel C: recall (%) against depth tolerance (cm) as the signed window tightens, with the sensor floor marked at 1 px = 3 cm and the 75% sign-off line reached at 5 cm. Headline values: 75% fracture recall at 5 cm depth tolerance; depth MAE 1.72 cm; dip MAE 1.71°; 16 wells.]
What a domain expert has to co-sign before an accuracy number means anything: the depth window (tested at 2, 4 and 6 cm) and the dip window (tested at 1, 3 and 5 degrees) that decide whether a prediction scores as a hit. Dial A tightens the depth window and recall re-scores against it; the model reaches the 75% fracture recall the expert signs only once the window opens to 5 cm on the 16-well dataset. The orange band is the only element that argues: at low sensor resolution one raster pixel is 3 cm of depth, so a plus-or-minus 3 cm error sits in every pick before any model runs, and any depth window below that floor asks the metric to resolve depth finer than the sensor can, which is why 2 cm is struck through. The contract is measured against a deliberately harsh metric with combined-fracture depth MAE 1.72 cm and dip MAE 1.71 degrees. The 5 cm equals 75% anchor, the 2/4/6 and 1/3/5 tested settings, the 3 cm floor and the MAEs are sourced from the engagement archive; the recall values plotted at the tighter windows are illustrative re-scores of the same predictions against a harsher window.

Making the metric deliberately harsh

Having agreed the floor, we did the opposite of what a vendor demo does. We made the metric as strict as the physics allowed rather than as loose as the marketing wanted. A hit had to land inside the signed depth window and the signed dip window at the same time, so a prediction that got depth right but dip wrong scored as a miss. We reported against that combined condition, on the interpreter's own picks, for every well.
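As a rough sketch of the combined condition, the Python below scores a hit only when depth and dip both fall inside their windows, then computes recall over the interpreter's picks. Everything here is illustrative: the `Plane` type, the greedy one-to-one matching, and the toy values are assumptions, not the engagement's actual scoring code or data, and azimuth is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    depth_cm: float  # depth along the wellbore, in centimetres
    dip_deg: float   # dip angle, in degrees


def is_hit(pred, pick, depth_tol_cm, dip_tol_deg):
    """A hit must satisfy the depth window AND the dip window."""
    return (abs(pred.depth_cm - pick.depth_cm) <= depth_tol_cm
            and abs(pred.dip_deg - pick.dip_deg) <= dip_tol_deg)


def recall(preds, picks, depth_tol_cm, dip_tol_deg):
    """Fraction of interpreter picks matched by a distinct prediction.

    Matching is greedy and one-to-one (an assumption): each pick takes the
    nearest-in-depth unused prediction that passes both windows.
    """
    used, hits = set(), 0
    for pick in picks:
        candidates = [i for i, p in enumerate(preds)
                      if i not in used and is_hit(p, pick, depth_tol_cm, dip_tol_deg)]
        if candidates:
            best = min(candidates, key=lambda i: abs(preds[i].depth_cm - pick.depth_cm))
            used.add(best)
            hits += 1
    return hits / len(picks) if picks else 0.0


# Made-up picks and predictions, for illustration only.
picks = [Plane(1000.0, 30.0), Plane(1050.0, 45.0), Plane(1100.0, 60.0),
         Plane(1150.0, 20.0), Plane(1200.0, 10.0)]
preds = [Plane(1001.5, 31.0), Plane(1053.0, 45.5), Plane(1100.5, 66.0),
         Plane(1170.0, 20.0), Plane(1201.0, 10.5)]

combined = recall(preds, picks, depth_tol_cm=5, dip_tol_deg=3)
depth_only = recall(preds, picks, depth_tol_cm=5, dip_tol_deg=float("inf"))
print(combined, depth_only)
```

In this toy data the third prediction is within half a centimetre in depth but six degrees off in dip. It counts as a miss under the combined condition and as a hit if the dip window is dropped, so the depth-only recall comes out higher.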

Under that harsh definition the fracture model reached about 75% recall at a 5 cm depth tolerance on the 16-well sinusoid dataset. That sentence is doing real work. The 75% is not a headline chosen to impress; it is what the model actually cleared once the interpreter had set a window they would defend, and it re-scores downward the moment you tighten the window toward the floor. A looser 9 cm window would have produced a bigger number that meant less. We chose the smaller, harder, signed number on purpose, because it is the one the interpreter could take to their own management without a caveat.

The underlying error was small enough to make that recall credible. Across the combined fracture set, depth predictions carried a mean absolute error of about 1.72 cm and dip a mean absolute error of about 1.71 degrees. Those MAEs are inside the tolerances the interpreter signed, which is exactly why the recall at 5 cm holds up: the model is not scraping past a generous threshold, it is landing comfortably inside a strict one.
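For reference, mean absolute error is the average of the absolute differences between predictions and picks over matched pairs. A minimal Python sketch follows; the helper name and the toy values are hypothetical and are not the engagement's data.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference over matched prediction/pick pairs."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up matched pairs (prediction, interpreter pick), for illustration only.
pred_depth_cm = [1001.5, 1053.0, 1100.5, 1201.0]
pick_depth_cm = [1000.0, 1050.0, 1100.0, 1200.0]
pred_dip_deg = [31.0, 45.5, 66.0, 10.5]
pick_dip_deg = [30.0, 45.0, 60.0, 10.0]

depth_mae = mean_absolute_error(pred_depth_cm, pick_depth_cm)
dip_mae = mean_absolute_error(pred_dip_deg, pick_dip_deg)
print(depth_mae, dip_mae)
```

An MAE is an average, so individual errors can still exceed the signed window even when the mean sits inside it. In the toy dip values one pair is six degrees off while the mean is two. That is why the MAE supports the recall figure but does not replace it.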

Two axes, not one slider

An accuracy negotiation is easy to reduce to a single depth slider, and that reduction hides half the contract. Dip has its own tolerance, and the interpreter has to sign that too. A fracture predicted at the right depth but two degrees off in dip is, geologically, a different feature; a stress interpretation built on it would point the wrong way. So the metric was a two-tolerance condition from the start, and we swept dip at 1, 3 and 5 degrees alongside depth. The instrument above keeps both dials on screen for that reason: "recall" is shorthand for "recall under a depth window and a dip window the expert co-signed," and dropping either axis quietly loosens the test.
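The two-axis sweep can be sketched as a small grid over the tested settings. The Python below is illustrative only: the per-pick errors are made up, `recall_at` is a hypothetical helper, and it assumes each interpreter pick has already been paired with its nearest prediction.

```python
from itertools import product

DEPTH_TOLS_CM = (2, 4, 6)   # tested depth windows from the article
DIP_TOLS_DEG = (1, 3, 5)    # tested dip windows from the article
SENSOR_FLOOR_CM = 3.0       # one pixel is about 3 cm at low resolution


def recall_at(errors, depth_tol_cm, dip_tol_deg):
    """errors: (abs depth error in cm, abs dip error in degrees), one per pick."""
    hits = sum(1 for d, a in errors if d <= depth_tol_cm and a <= dip_tol_deg)
    return hits / len(errors)


# Made-up absolute errors per interpreter pick, for illustration only.
errors = [(1.5, 2.0), (3.0, 0.5), (0.5, 4.0), (20.0, 0.0), (1.0, 0.5)]

grid = {(d, a): recall_at(errors, d, a)
        for d, a in product(DEPTH_TOLS_CM, DIP_TOLS_DEG)}

for d in DEPTH_TOLS_CM:
    note = " (below sensor floor)" if d < SENSOR_FLOOR_CM else ""
    row = "  ".join(f"{a} deg: {grid[(d, a)]:.0%}" for a in DIP_TOLS_DEG)
    print(f"{d} cm{note}  {row}")
```

Recall can only rise or stay flat as either window widens. Reading the grid along the dip axis shows what a depth-only slider hides: at a fixed depth window, the number still moves with the dip tolerance.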

The rule we took away

Report your headline metric against the tightest tolerance the domain expert will sign, floored by what the sensor can physically resolve, never against the tolerance that maximises the number. A recall you negotiated down to honesty travels further than a recall you inflated with a loose window.

What this buys you

The payoff is not statistical, it is institutional. When we reported 75% recall at 5 cm, the interpreter did not have to take our word for the threshold, because they had set it. The number arrived pre-endorsed. That changed how the result was received inside the operator: it stopped being an outside vendor's optimistic claim and became a measurement the client's own expert had defined the terms of.

It also disciplines the model team. Once the tolerance is fixed and signed, you cannot quietly relax it to rescue a disappointing run. The metric is a contract with an external party, and moving the goalposts is visible. That constraint made our own iteration honest, because every improvement had to show up against a window we were no longer allowed to touch.

None of this requires exotic machinery. It requires an afternoon, a domain expert who will argue with you about centimetres, and the discipline to floor the negotiation at the sensor's resolution rather than at whatever produces the friendliest slide. The model does the detection. The expert defines what detection means. Report the number scored against the expert's definition, not the one that flatters the model.

Limitations

The 75% recall at 5 cm is specific to the fracture model on this 16-well carbonate dataset and does not transfer to beddings, to horizontal wells, or to other reservoirs; a signed tolerance is a local agreement, not a universal constant. The recall values shown at depth windows tighter than 5 cm in the instrument are illustrative re-scores of the same predictions against a harsher window, included to show the shape of the trade-off; only the 5 cm equals 75% point, the 2/4/6 cm and 1/3/5 degree tested settings, the 3 cm sensor floor, and the 1.72 cm and 1.71 degree mean absolute errors are drawn from the engagement archive. The 3 cm pixel floor holds at low image-log resolution and would tighten on higher-resolution runs. And a negotiated metric is only as trustworthy as the interpreter who signs it: a co-designed threshold moves the question of trust onto the person, which is a feature when the expert is credible and a risk when they are not.

References

[1] Internal engagement archive, borehole-image fracture-detection progress reports and results tables (2022 to 2023). Sourced figures: depth tolerances tested 2/4/6 cm and dip tolerances 1/3/5 degrees; 1 pixel equals 3 cm at low resolution; fracture recall approximately 75% at 5 cm depth tolerance on 16 wells; combined-fracture depth MAE 1.72 cm and dip MAE 1.71 degrees.


© 2026 Copyright. Earthscan