When the Data Can Speak for Itself
In complex operational environments, making a truly data-driven decision is harder than it sounds. This post explores why the two most common approaches, manual case selection and defaulting to large language models, both fall short, and what a more rigorous, traceable alternative looks like. The real constraint is almost never the method. It is the data foundation underneath it.
Yezid Arevalo
5/12/2026 · 3 min read
One of the problems I kept running into when working with large operational datasets was knowing how to ask the right question of the data before drawing any conclusions. The dataset was large. The variables were many. The conditions across cases were never quite the same. And the decision that needed to be made was real, with operational and financial consequences attached to it.
In practice, this kind of problem tends to get resolved in one of two ways, neither of which fully uses the data available.
The first is the manual path. An engineer identifies a handful of cases that feel comparable based on experience and local knowledge, runs the numbers across that limited sample, and draws a conclusion. This works up to a point. The engineer's judgement is genuinely valuable. But the cases selected tend to reflect what is already known, and what is already believed. Unconscious bias in sample selection is not a failure of competence. It is a feature of how human pattern recognition works. The result is a decision that is informed by data but not necessarily guided by it, and one that can miss the parts of the dataset where the real answer sits.
The second has become increasingly common. The question gets pushed to a large language model, and if the response sounds coherent, it informs the decision. There are many problems where this is a reasonable approach. Structured quantitative comparison of performance across a multidimensional operational dataset is not one of them. The risk is not that the model gives an obviously wrong answer. It is that the model gives a plausible answer constructed from context that may not match the problem at hand. The coherence of the response is not evidence of its relevance. There is also an energy cost to this that tends to go unacknowledged. Reaching for a large language model by default, for problems it is not well suited to, carries a consumption footprint that is entirely invisible to the user.
There is a cleaner, more direct path for this class of problem.
The core challenge in making a fair comparison across a large and varied dataset is defining what comparable actually means when the data has many dimensions. In two dimensions this is intuitive. In twenty or thirty it is not, and simplifying the comparison to a manageable number of manually selected variables discards information that may be decisive. The approach that addresses this properly is to let the data define the groups. Clustering across the full feature space groups cases by their similarity across all relevant dimensions simultaneously, without a preferred outcome and without the constraint of manual selection. Statistical testing within each cluster then determines whether the performance difference between options is real or within the range of noise. The result is traceable, free of the selection bias that manual sampling invites, and directly tied to the data rather than to the assumptions of the person asking the question.
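To make the idea concrete, here is a minimal sketch of the cluster-then-test workflow in Python, assuming a pandas DataFrame named `df` with a few numeric context features, an `option` column identifying the two technologies being compared, and a `performance` metric. The column names, number of clusters, and minimum group size are illustrative assumptions, not recommendations.

```python
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical context features that define similarity between cases.
context_features = ["depth", "formation_hardness", "flow_rate"]

# 1. Let the data define the groups: cluster on the full context feature space.
X = StandardScaler().fit_transform(df[context_features])
df["cluster"] = KMeans(n_clusters=8, random_state=0).fit_predict(X)

# 2. Within each cluster, test whether the performance gap between options is
#    real or within the range of noise.
for cluster_id, group in df.groupby("cluster"):
    a = group.loc[group["option"] == "A", "performance"].dropna()
    b = group.loc[group["option"] == "B", "performance"].dropna()
    if len(a) < 5 or len(b) < 5:
        continue  # too few comparable cases in this cluster to test rigorously
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print(f"cluster {cluster_id}: n_A={len(a)}, n_B={len(b)}, "
          f"delta={a.mean() - b.mean():.2f}, p={p_value:.3f}")
```

The skipped clusters are not a flaw in the method; they are the similarity-versus-statistical-power trade-off made visible, which is exactly where practitioner judgement comes in.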
This is not a complicated idea. What makes it work in practice is more demanding than the concept itself. Knowing which features define meaningful similarity in a given operational context requires domain knowledge. Handling data that mixes continuous measurements with categorical variables requires deliberate preprocessing choices. Deciding where to draw the line between similarity and statistical power, because more similar groups tend to be smaller and therefore harder to test rigorously, is a judgement call that the algorithm cannot make on its own. These are the points where operational experience and analytical rigour have to work together. The method provides the structure. The practitioner provides the context that makes the structure meaningful.
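As one example of those preprocessing choices, a common way to handle mixed data is to scale the continuous measurements and one-hot encode the categorical variables before clustering, so that no single variable dominates the distance calculation. The sketch below assumes hypothetical column names; alternatives such as Gower distance or k-prototypes are equally reasonable, and the choice should follow from the domain rather than from the library default.

```python
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["depth", "flow_rate"]        # hypothetical continuous measurements
categorical_cols = ["country", "well_type"]  # hypothetical categorical variables

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # scale continuous features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # encode categorical features
])

# Fewer clusters give larger, more testable groups; more clusters give tighter
# similarity. Where to set this is the judgement call the algorithm cannot make.
pipeline = Pipeline([
    ("prep", preprocess),
    ("cluster", KMeans(n_clusters=6, random_state=0)),
])
labels = pipeline.fit_predict(df[numeric_cols + categorical_cols])
```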
I was part of a team that applied exactly this approach across one of the largest operational datasets in the industry, spanning hundreds of thousands of cases across dozens of countries. The methodology worked. It produced conclusions that manual analysis had not reached, and challenged assumptions that had been treated as settled. The hardest part of the work was not the analysis. It was building and maintaining the data foundation that made the analysis possible in the first place.
That is almost always where the real constraint sits. Before asking how to interrogate data, the prior question is whether the data is in a condition to be interrogated. Quality, completeness, consistency of capture, and accessibility at scale are prerequisites, not infrastructure problems to be solved later. A rigorous analytical approach applied to a poorly governed data foundation will produce results that are methodologically sound and operationally misleading. The sequence matters.
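In practice, that prior question can be asked cheaply and early. The sketch below shows the kind of completeness and consistency check worth running before any clustering or testing, assuming a pandas DataFrame `df`; the column names and the 95% completeness threshold are illustrative assumptions.

```python
import pandas as pd

def foundation_report(df: pd.DataFrame, required_cols: list[str]) -> pd.DataFrame:
    """Summarise per-column presence, completeness, and value diversity."""
    rows = []
    for col in required_cols:
        present = col in df.columns
        rows.append({
            "column": col,
            "present": present,
            # Share of non-missing values; 1.0 means fully captured.
            "completeness": round(df[col].notna().mean(), 3) if present else 0.0,
            "distinct_values": df[col].nunique(dropna=True) if present else 0,
        })
    return pd.DataFrame(rows)

# Flag fields that are missing or too incomplete to support a defensible comparison.
report = foundation_report(df, ["depth", "flow_rate", "country", "performance"])
print(report[(~report["present"]) | (report["completeness"] < 0.95)])
```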
The reason this is worth reflecting on now is that the analytical methods to extract meaningful, defensible insight from large operational datasets have been available for some time. The gap in most organisations is not methodological. It is the data foundation underneath the method, and the discipline to build it properly before reaching for the analysis.
When that foundation is in place, decisions that felt too complex to make confidently become straightforward. Not because the problem became simpler, but because the data was finally in a position to answer it.
* Image adapted from Khvostichenko, Skoff, Arevalo et al., SPE-212446-MS, 2023.
Khvostichenko, D., Skoff, G., Arevalo, Y., Makarychev-Mikhailov, S. (2023). Apples to Apples: Impartial Assessment of Drilling Technologies Through Big Data and Machine Learning. SPE/IADC International Drilling Conference and Exhibition, Stavanger. SPE-212446-MS. https://doi.org/10.2118/212446-MS