Beating Commercial Tools With Open Methods

Most comparisons between a commercial tool and an open method are written as brand against brand. There is a feature grid, a logo, a column of green checkmarks, and somewhere near the bottom a line about enterprise pricing on request. That table answers a question nobody who has to ship actually asked. The question a technical buyer of raster well-log digitisation asks is narrower and harder: does the open pipeline read a curve well enough to trust, and what does it cost to run compared with the alternative? Those are two measured quantities, accuracy and cost, and the comparison worth publishing is the one that puts them side by side. This is that comparison for the pipeline behind VeerNet, the encoder-decoder EarthScan uses to lift curves off scanned paper logs, and it is deliberately not the argument about who owns the data. It is the plainer one about whether an open method reaches shippable accuracy at a fraction of the cost of a commercial licence.

One boundary first. The reason a bespoke pipeline can exist at all, and why the training set is what it is, is a separate story about owning your inputs, which this note assumes and does not retell. What it does instead is take the model that story produced and run the two-column benchmark: on the left, the accuracy the open method actually measured; on the right, what a year of running it costs against a per-seat licence. The open-method column is not free, and pretending it is would be its own kind of dishonesty; it costs a GPU rental and the hours to train. The point is that those costs have a shape, flat in the size of the team, that a per-seat licence does not, and the shape is the argument.

The left column is a measured fit, not a checkmark

Start with accuracy, because if the open method does not clear the bar, the cost column is irrelevant. A curve digitiser has to do two things: find the curve in the raster, and put it back at the right depth. We measure the second as a regression fit once the predicted mask is turned back into a depth-indexed curve, and the best coefficient of determination we recorded on the validation split was 0.9891. That is a tight fit for a task where the input is a photograph of a decades-old paper log, not a clean digital signal. We measure the first, the segmentation itself, with an overlap-style score, and the best F1 we recorded was 0.55. Those two numbers belong in different registers and it is worth being honest about both: the curve-fit is excellent, the raw mask overlap is middling, and the reason the pipeline still ships is that a good regression fit on the reconstructed curve tolerates a mask that is imperfect at the pixel boundary. A marketing table would print one of these and hide the other. A benchmark prints both.
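To make the two registers concrete, here is a minimal sketch of how a curve-fit R-squared and a pixel-wise mask F1 can diverge on the same prediction. It is a toy, not the pipeline's evaluation code: the helper names (`r_squared`, `mask_f1`, `curve_from_mask`, `draw_curve`), the synthetic sine curve, the three-pixel line thickness and the one-pixel shift are all assumptions chosen for illustration, and the curve is measured per pixel row rather than at calibrated depth.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination of a reconstructed curve against the reference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mask_f1(mask_true, mask_pred):
    """Pixel-wise F1 between two binary masks (identical to the Dice score)."""
    t = np.asarray(mask_true, dtype=bool)
    p = np.asarray(mask_pred, dtype=bool)
    tp = np.logical_and(t, p).sum()
    fp = np.logical_and(~t, p).sum()
    fn = np.logical_and(t, ~p).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def curve_from_mask(mask):
    """Collapse a mask to one column position per row: the mean of the set pixels."""
    cols = np.arange(mask.shape[1])
    return (mask * cols).sum(axis=1) / mask.sum(axis=1)

def draw_curve(centres, width, half_thickness=1):
    """Rasterise a curve as a band of pixels around each row's centre column."""
    mask = np.zeros((len(centres), width), dtype=bool)
    for row, c in enumerate(centres):
        mask[row, c - half_thickness : c + half_thickness + 1] = True
    return mask

# Hypothetical log track: 200 pixel rows, 100 columns, a smooth synthetic curve.
rows, width = 200, 100
centres = np.round(50 + 30 * np.sin(np.arange(rows) / 20)).astype(int)

mask_true = draw_curve(centres, width)
mask_pred = draw_curve(centres + 1, width)  # prediction is off by one pixel everywhere

f1 = mask_f1(mask_true, mask_pred)
r2 = r_squared(curve_from_mask(mask_true), curve_from_mask(mask_pred))
print(f"mask F1 = {f1:.3f}, curve R^2 = {r2:.4f}")
```

In this toy a uniform one-pixel shift costs a third of the mask overlap (F1 is exactly 2/3) while the reconstructed curve still fits the reference with an R-squared above 0.99, which is the mechanism the paragraph describes. The real gap between 0.55 and 0.9891 will have other contributors that the toy does not model.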

None of this accuracy comes from a proprietary component. The architecture is in the open encoder-decoder lineage that starts with U-Net ^[3], the losses are published, and the training set consists of synthetic logs we generated. That matters for the cost column, because it means the accuracy on the left is paid for entirely in compute and engineering time, not in a licence. The order-of-magnitude gains that reviews report for AI methods in upstream work are real and measured ^[2], and the point of an open pipeline is to capture them without renting them by the seat. The model trains on the 15,000-log multiclass set in 10 hours. That is the whole recurring build cost of the accuracy above: ten GPU-hours, repeatable, on hardware anyone can rent.

The right column has a shape

Now the cost, which is where the brand-against-brand table quietly cheats. A per-seat SaaS licence is priced at 1,200 USD per analyst per year in the market we compared against. That number looks small next to a GPU bill until you notice what it does as the team grows: it multiplies. Two analysts is 2,400 USD, ten is 12,000, twenty is 24,000, and the line goes through the origin and keeps climbing. The open method's cost does the opposite. One trained model serves the whole team, so the recurring cost is the GPU rent, 750 EUR a month for a high-end card or 1,800 EUR a month for an advanced one, and that rent is flat in the number of people reading logs. A flat line and a rising line cross somewhere, and where they cross is the entire comparison reduced to a single seat count.
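The two shapes can be written down in a few lines. This is a minimal sketch using only the prices quoted above; the function and constant names are hypothetical, and the two currencies are deliberately left unconverted so that no exchange-rate assumption enters here.

```python
# Prices quoted in the text. Rent is in EUR, the licence in USD; no conversion is applied.
GPU_RENT_EUR_PER_MONTH = {"high_end": 750, "advanced": 1800}
SEAT_PRICE_USD_PER_YEAR = 1200

def licence_cost_usd(seats: int) -> int:
    """Annual per-seat licence cost: a line through the origin."""
    return SEAT_PRICE_USD_PER_YEAR * seats

def open_method_rent_eur(tier: str, seats: int) -> int:
    """Annual GPU rent. `seats` is accepted but unused: the rent is flat in team size."""
    return GPU_RENT_EUR_PER_MONTH[tier] * 12

for seats in (1, 2, 10, 20):
    print(
        f"{seats:>2} seats: licence {licence_cost_usd(seats):>6} USD | "
        f"high-end rent {open_method_rent_eur('high_end', seats)} EUR | "
        f"advanced rent {open_method_rent_eur('advanced', seats)} EUR"
    )
```

The licence column grows with every row while both rent columns repeat the same figure, which is all "flat" and "linear" mean here.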

The scoreboard below draws exactly that. On the left it holds the accuracy verdict, the 0.9891 fit and the 0.55 overlap, so the cost is never read without the quality it buys. On the right it plots annual cost against team size: the open method as a flat rent you can toggle between the two tiers, the licence as a line that climbs one seat at a time. The orange marker sits at the crossover, the seat count past which the per-seat licence costs more than the fixed open-method rent for the whole team. Drag the team size and the crossover does not move, because it is a property of the two prices, not of where you happen to be standing on the axis. That fixed crossover is the number we think belongs in a published benchmark, in place of a logo wall.

The comparison an honest benchmark publishes: measured accuracy on the left, measured annual cost on the right, on one plate. The open-method pipeline reaches a peak curve-fit R-squared of 0.9891 and a peak segmentation F1 of 0.55, trained in 10 hours on the 15,000-log multiclass set. Its cost is a flat annual GPU rent, 750 EUR per month for a high-end card or 1,800 EUR per month for an advanced one, that does not grow with the team because one trained model serves every analyst. A per-seat SaaS licence at 1,200 USD per seat per year is a line through the origin: every analyst you add buys another seat. Drag the seat lever to sweep team size and toggle the rent tier; the plot draws the flat open-method line against the climbing licence line and the orange marker sits at the crossover seat, the point past which per-seat licensing overtakes the fixed rent. The R-squared, F1, GPU tiers, 10-hour training budget, and 1,200 USD seat price are sourced from the engagement archive; the rate of 1.06 USD per EUR that places both series on a single EUR axis is an illustrative input, and this is an engineering cost comparison, not a quote.

Why the crossover is the honest headline

The crossover is honest in a way a single price is not, because it refuses to pretend one number settles the question. On the high-end GPU tier the flat rent is 9,000 EUR a year (12 months at 750 EUR). At the illustrative rate of 1.06 USD per EUR, a 1,200 USD seat costs about 1,132 EUR, so the licence overtakes the rent at a little under eight analysts. On the advanced tier the rent is 21,600 EUR a year and the crossover sits near nineteen seats. Neither of those is a slogan. They are the two seat counts at which the recommendation flips, and a buyer with four analysts and a buyer with forty should reach opposite conclusions from the same scoreboard. A comparison that hides this behind "cheaper" or "more scalable" is hiding the one input the reader needs.
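The two crossover figures can be checked by hand. The worked equation below is a sketch under one stated assumption: the illustrative 1.06 rate is read as 1.06 USD per EUR, which is the reading that reproduces the seat counts quoted above. The symbols s*, R, P and x are introduced here for convenience and do not appear in the scoreboard.

```latex
% s^* : crossover seat count, where annual licence cost equals annual rent
% R   : monthly GPU rent in EUR
% P   : seat price, 1200 USD per seat per year
% x   : exchange rate, assumed 1.06 USD per EUR (illustrative)
\[
  s^{*} \;=\; \frac{12\,R}{P / x}
\]
\[
  s^{*}_{\text{high-end}}
    = \frac{12 \times 750}{1200 / 1.06}
    \approx \frac{9000}{1132.08}
    \approx 7.95,
  \qquad
  s^{*}_{\text{advanced}}
    = \frac{12 \times 1800}{1200 / 1.06}
    \approx \frac{21600}{1132.08}
    \approx 19.08
\]
```

Because seats come in whole numbers, under these inputs the licence first costs more than the rent at 8 seats on the high-end tier and at 20 seats on the advanced tier.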

It also keeps us honest about the open method's weaknesses, which the flat line does not erase. Below the crossover, for a small team, the per-seat licence is genuinely cheaper, and the open pipeline's fixed rent is a cost you pay whether one analyst uses it or none. The open method wins on cost only once the team is large enough that spreading a flat rent beats paying per head, and it wins on accuracy only because the measured fit happens to clear the bar. Change either fact, a smaller team or a task where 0.55 overlap is not good enough, and the recommendation changes. That contingency is not a weakness of the comparison. It is the comparison doing its job, which a checkmark grid structurally cannot.

What the two-column benchmark is for

The reason to publish accuracy against cost rather than brand against brand is that it survives contact with a real buyer. Someone evaluating a curve digitiser can take the left column to their petrophysicist and ask whether a 0.9891 curve fit is trustworthy for their logs, and take the right column to their finance team and ask where their seat count sits relative to the crossover. Both questions have answers grounded in measured numbers rather than a vendor's framing. Reported accuracy figures in this domain, like the ones the upstream-AI literature collects, are the currency of that first conversation ^[1], and a flat-versus-linear cost model is the currency of the second.

We are not claiming the open method wins every comparison. We are claiming it wins the one that is actually run, at a stateable set of conditions: a team past the crossover seat count, on a task where the measured accuracy clears the bar, willing to run a ten-hour training job on rented hardware in exchange for never paying per head again. State those conditions and the recommendation is defensible; hide them behind a brand comparison and it is just marketing with the polarity reversed.

Limitations

This is a comparison built from real accuracy numbers and real prices, but it is a model, and it simplifies. The peak R-squared of 0.9891 and the peak F1 of 0.55 are the best values we recorded, not guaranteed floors on an arbitrary operator's logs; a scan set with failure modes our synthetic training data did not cover could score lower, and the crossover argument assumes the accuracy bar is cleared. The cost side compares a flat GPU rent against a per-seat licence at list price, which ignores real complications on both sides: the open method also costs engineering time to run and maintain that the rent line does not show, and a real licence negotiation may discount the per-seat price at volume, which would push the crossover to a higher seat count. The scoreboard places both series on a single EUR axis using a fixed conversion of 1.06 USD per EUR, flagged in the exhibit as illustrative; a different rate shifts the crossover modestly. And the training figure of 10 hours on the 15,000-log set is the recurring build cost, not the full first-time cost of assembling the synthetic data and the pipeline, which is a one-off the flat rent does not capture. None of these change the shape of the argument, flat cost against linear cost, but they set the boundary on how literally any single crossover number should be read.
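How far these caveats move the crossover can be explored with a small sensitivity sketch. Everything beyond the quoted prices is an assumption for illustration: the function name `crossover_seats`, the reading of the rate as USD per EUR, the sample rates, the discount levels, and above all the `extra_fixed_eur_per_year` figure standing in for engineering time, which the source does not quantify.

```python
def crossover_seats(
    rent_eur_per_month: float,
    seat_usd_per_year: float = 1200.0,
    usd_per_eur: float = 1.06,              # illustrative rate, read as USD per EUR
    volume_discount: float = 0.0,           # hypothetical fraction off the list seat price
    extra_fixed_eur_per_year: float = 0.0,  # hypothetical engineering/maintenance cost
) -> float:
    """Seat count at which the annual licence cost equals the annual fixed cost."""
    fixed_eur = 12 * rent_eur_per_month + extra_fixed_eur_per_year
    seat_eur = seat_usd_per_year * (1.0 - volume_discount) / usd_per_eur
    return fixed_eur / seat_eur

baseline = crossover_seats(750)  # high-end tier, list price, 1.06 USD per EUR

print(f"baseline crossover: {baseline:.2f} seats")
for rate in (1.00, 1.06, 1.12):
    print(f"rate {rate:.2f} USD/EUR -> {crossover_seats(750, usd_per_eur=rate):.2f} seats")
for discount in (0.0, 0.1, 0.2):
    print(f"discount {discount:.0%} -> {crossover_seats(750, volume_discount=discount):.2f} seats")
# 5000 EUR/year is an arbitrary placeholder, not a measured engineering cost.
print(f"plus 5000 EUR/yr fixed -> {crossover_seats(750, extra_fixed_eur_per_year=5000):.2f} seats")
```

A volume discount and an added fixed cost both push the crossover to a higher seat count, and the exchange rate moves it in proportion, so the flat-against-linear shape survives while the specific seat number does not.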

Publishing the number instead of the logo

The habit this leaves us with is to distrust any tool comparison that will not state its crossover. A feature grid can be true and useless at the same time, because it answers a question of parity rather than a question of fit and price. The two-column benchmark forces the harder honesty: here is the accuracy we measured, here is what it costs against the alternative, and here is the exact seat count where the recommendation flips. That is a number a buyer can act on and a number we can be held to, which is more than a wall of checkmarks has ever offered either of us.

References

[1] Buah, E., et al. Machine-learning accuracy in CO2 storage engagement prediction. Energies 13, 6259 (2020). A measured, reported accuracy figure of the kind a cost-versus-accuracy comparison rests on. https://www.mdpi.com/1996-1073/13/23/6259

[2] Koroteev, D., and Tekic, Z. Artificial intelligence in oil and gas upstream: Trends, challenges, and scenarios for the future. Energy and AI 3 (2021), 100041. The order-of-magnitude speed and cost gains reported for AI methods in upstream workflows, and why the comparison that matters is a measured one. https://www.sciencedirect.com/science/article/pii/S2666546820300410

[3] Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015. The open encoder-decoder architecture the pipeline inherits, and the reason the accuracy column costs a GPU rental rather than a licence. https://arxiv.org/abs/1505.04597
