Predicting oil 30,000 ft below - UX, ML, Statistics and the science of reservoir engineering
Today the most cost-efficient oil exploration happens in places no one could have imagined before. It is made possible by robust statistical methods supported by rapid advances in high-performance computing. This article describes the magic of those methods and how AI and ML support the calculations that allow company executives to make investment decisions with confidence.
10/31/2025 · 8 min read


I spent almost two and a half years collaborating with subject matter experts, people with PhDs in complex yet highly interesting areas such as computational fluid dynamics, quantum physics, statistics, reservoir engineering and geology. We were working on software to predict where, when and how much hydrocarbon to extract from beneath the surface of the Earth, mostly in deep-sea beds. The software is built around the concept of a model, where all numerical values are thrown into the mix and different statistical approaches are applied to arrive at possible outcomes.
Modelling is not new; policy makers, economists and meteorologists across the globe use models for a variety of purposes, from predicting GDP, population, stock markets and weather to long-term climate change. Much of this can be done in Excel; however, the advent of affordable high-performance computing allows us to truly realise the power of complex regression and stochastic statistical algorithms like never before. This is part two of a series of articles in which I explain how Design of Experiments and related experimental design methodologies deliver the economic calculations behind hydrocarbon exploration. I was also curious about how this differs from a typical artificial intelligence algorithm, so I have dedicated a section to a high-level comparison.
The idea was to build a model of models, with one statistical model feeding into another, and so on, repeating as many times as needed until a satisfactory answer is obtained that explains how the reservoir will behave. An interesting observation was how similar the statistical methods used were to those used for machine learning and artificial intelligence. For instance, the software would create proxy data to generate and refine plots that match historical production trends, and try to predict different events based on the geological profile of the reservoir. Since reservoir modelling involves multi-dimensional parameters (i.e. thousands of influencing factors on a single property), we were dealing with hyperspace. We are used to charts plotted on X, Y and Z axes, but imagine a plot that uses thousands of such axes.
More difficult than a rocket launch
The whole process helped me appreciate the basics behind machine learning and experience first-hand the complexity and patience associated with building such models. Beyond the high-performance computing involved, the most interesting learning was seeing the interplay between physics and statistics to arrive at different economic scenarios. I am not the first to write about this; several of the scientists I collaborated with over those two and a half years are already thought leaders in this field through published research, such as Rigorous Multi-Scenario Uncertainty Analysis by Erik Van Der Stein, or A Practical Approach to Select Representative Deterministic Models by G. Gao, Kefei Wang, Sean Jost, Shakir Shaikh, Carl Blom, Terence Wells and the dearest late Jeroen Vink.
A seemingly trivial yet key point is that even people with domain experience sometimes find these model-building computations, and the tools for running them, overwhelming and quite open to interpretation. I won’t touch on the UX and Service Design impact here, as I have already written about that in another article.
Before we venture into the explanation: if you are keen to learn more about machine learning, there are several free learning resources available online from the large hyperscalers such as Google, Microsoft and Amazon. Put simply, machine learning is the reverse of a conventional logical program. While a logical, algorithm-based program already has a defined equation that gives the answer, machine learning builds and refines the equation as it learns more from the input data.
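As a minimal illustration of that distinction (my own toy example, not code from the project), the first function below hard-codes a known relationship, while the second learns an approximation of the same relationship purely from example data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Conventional program: the relationship is already known and hard-coded.
def estimated_flow(pressure_drop, permeability):
    return 0.5 * pressure_drop + 2.0 * permeability  # fixed, predefined equation

# Machine learning: the same relationship is learned from example data instead.
rng = np.random.default_rng(0)
X = rng.uniform(low=[10.0, 1.0], high=[100.0, 50.0], size=(200, 2))
y = 0.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 1.0, size=200)  # noisy observations

model = LinearRegression().fit(X, y)        # the "equation" is fitted, not written
print(model.coef_)                          # should land close to [0.5, 2.0]
print(model.predict([[60.0, 20.0]]))        # prediction for an unseen case
```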
Statistically building a model considering many scenarios
Every model is made up of a collection of properties. In the case of an underwater reservoir model, the properties include facies, porosity, saturation, permeability, temperature, pressure, fluid types and a hundred others that are far too complex and geology-specific to cover here. As one of the most learned and admired geologists in this space explained to me in the simplest way, a property is like one of the different ingredients in a dish.
Next comes defining the different parameters and determining whether they are uncertain or not. Uncertainties are assigned a distribution along with a range. Once defined, a property is associated with a parameter and its underlying uncertainty, if applicable. The parameterisation is done primarily because geologists and reservoir engineers cannot be fully sure that the rock or fluid beneath the surface behaves exactly as experience alone would suggest. Hence, the clever thing to do is to come up with a range of probabilistic scenarios for every property.
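A hedged sketch of what such a parameterisation might look like in code (the parameter names, distributions and ranges are illustrative, not taken from the actual software): each uncertain property gets a distribution and a range, while certain properties keep a single fixed value.

```python
from scipy import stats

# Illustrative parameterisation: uncertain properties get a distribution and a
# plausible range; properties treated as certain keep one deterministic value.
parameters = {
    "porosity":         stats.uniform(loc=0.10, scale=0.15),       # 0.10 - 0.25 (fraction)
    "permeability":     stats.lognorm(s=0.8, scale=150.0),         # millidarcies, skewed
    "water_saturation": stats.triang(c=0.5, loc=0.2, scale=0.3),   # 0.2 - 0.5
    "temperature_C":    85.0,                                      # treated as certain
}

# Draw one probabilistic scenario by sampling every uncertain parameter.
scenario = {
    name: (dist.rvs(random_state=1) if hasattr(dist, "rvs") else dist)
    for name, dist in parameters.items()
}
print(scenario)
```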
Design of Experiments: Structure Amidst Complexity
However, there is a risk of generalising everything; hence, to give weight to some scenarios over others, statistical design methods such as Latin Hypercube, Box-Behnken, Plackett-Burman, space-filling and Tornado designs are used to understand the sensitivity and influence of each parameter on a property. Put simply, the idea is to identify which parameters have the strongest influence, which have only a weak or infrequent influence, and which have none at all. The statistical models work across different ranges and distributions and come back with answers on the influence and strength of each parameter. These Design of Experiments (DOE) methods provide the necessary scaffolding for learning through experimentation. By systematically exploring the variable space, DOE ensures that the scenarios generated are both comprehensive and efficient, maximising learning while minimising computational burden.
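As a sketch of what one of these designs looks like in practice (using SciPy's quasi-Monte Carlo module rather than the proprietary tooling described in this article, with made-up parameter ranges), a Latin Hypercube spreads a small number of experiments evenly across the parameter space.

```python
from scipy.stats import qmc

# Three illustrative uncertain parameters and their ranges (lower, upper).
lower = [0.10,  50.0, 0.20]   # porosity, permeability (mD), water saturation
upper = [0.25, 500.0, 0.50]

# Latin Hypercube design: 20 runs that cover the space evenly, instead of the
# much larger grid a naive full-factorial design would require.
sampler = qmc.LatinHypercube(d=3, seed=42)
unit_samples = sampler.random(n=20)                 # points in the unit cube
design = qmc.scale(unit_samples, lower, upper)      # rescaled to the real ranges

for run_id, (phi, k, sw) in enumerate(design):
    print(f"run {run_id}: porosity={phi:.3f}, permeability={k:.0f} mD, Sw={sw:.2f}")
```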
These initial experimentations help the geologists to build different scenarios that mathematically articulate their view of how the reservoir geology behaves. There is a similar exercise around modelling the properties (such as pressure, volume, temperature) related to active production reservoirs and the facilities through which the hydrocarbons are piped and sent for further processing. This methodological rigour is what ultimately transforms domain knowledge, physical laws, and empirical data into actionable intelligence for reservoir management.
Upscaling the model
Before moving further, a key activity in reservoir modelling is upscaling—ensuring that the fine-grained properties calculated at small scales remain representative when considered across the entire reservoir or the full depth of a well. This step is essential to avoid overfitting the model to a small localised dataset while underrepresenting the wider variability and uncertainty inherent in geological systems.
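A minimal sketch of the idea, assuming a fine grid of cell properties (the averaging rules here are simple textbook choices, not the project's actual method): porosity averages arithmetically, while permeability along the flow direction is often averaged harmonically.

```python
import numpy as np

# Fine-scale properties for a column of 12 grid cells (illustrative values).
fine_porosity     = np.array([0.18, 0.21, 0.17, 0.20, 0.16, 0.19,
                              0.22, 0.15, 0.18, 0.20, 0.17, 0.19])
fine_permeability = np.array([120., 300.,  80., 220.,  60., 150.,
                              400.,  40., 110., 260.,  90., 180.])   # mD

def upscale(fine_phi, fine_k, factor=4):
    """Collapse every `factor` fine cells into one coarse cell."""
    phi_blocks = fine_phi.reshape(-1, factor)
    k_blocks   = fine_k.reshape(-1, factor)
    coarse_phi = phi_blocks.mean(axis=1)                    # arithmetic average
    coarse_k   = factor / (1.0 / k_blocks).sum(axis=1)      # harmonic average (series flow)
    return coarse_phi, coarse_k

coarse_phi, coarse_k = upscale(fine_porosity, fine_permeability)
print(coarse_phi, coarse_k)
```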
History Matching: Learning from the Past
Imagine you’re working with a weather app that claims to predict rainfall, but you also have actual records of rain over the past year. History matching is the process of adjusting your prediction model so it aligns closely with what's already happened—making it more trustworthy for the future. In oil and gas, this means tweaking the reservoir model so its simulated production closely fits real production data from wells. This step is crucial because it “grounds” the model in proven reality. The closer the match, the more confidence decision-makers can have in the model’s forecasts and investment plans.
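A hedged sketch of the mechanics, using a deliberately simple exponential-decline "simulator" and SciPy's general-purpose optimiser in place of a real reservoir simulator: the uncertain parameters are adjusted until the simulated production lines up with the observed history.

```python
import numpy as np
from scipy.optimize import minimize

# Observed monthly production over two years (illustrative numbers, barrels/day).
months   = np.arange(24)
observed = 1000.0 * np.exp(-0.05 * months) + np.random.default_rng(7).normal(0, 15, 24)

def simulate(params, t):
    """Stand-in for the reservoir simulator: simple exponential decline."""
    initial_rate, decline = params
    return initial_rate * np.exp(-decline * t)

def mismatch(params):
    """Objective: squared error between simulated and historical production."""
    return np.sum((simulate(params, months) - observed) ** 2)

# Start from a rough guess and let the optimiser "history match" the parameters.
result = minimize(mismatch, x0=[800.0, 0.1], method="Nelder-Mead")
print("matched parameters:", result.x)   # should land near (1000, 0.05)
```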
Quadratic Proxies: Fast, Smart Shortcuts
Complex subsurface simulations can take hours or even days to run, especially when you want to try out thousands of “what if” scenarios. To speed things up, engineers often use quadratic proxies—simplified mathematical stand-ins for the real models. These proxies use equations based on historical patterns (often quadratic, or “squared” relationships) to estimate outcomes quickly. It’s like replacing a time-consuming gourmet tasting with a quick taste test using just the main flavour notes—fast, efficient, and surprisingly effective for screening scenarios before diving into detailed analysis.
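A sketch of a quadratic proxy under simple assumptions (the "expensive simulation" below is a placeholder function, not a real simulator): a handful of designed runs are used to fit a second-order polynomial surface, which can then be evaluated thousands of times almost for free.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def expensive_simulation(porosity, permeability):
    """Placeholder for a run that would normally take hours."""
    return 5000 * porosity + 8 * permeability - 3000 * porosity**2 + 0.5 * porosity * permeability

# A small designed set of training runs (e.g. from the DOE step above).
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(0.10, 0.25, 30), rng.uniform(50, 500, 30)])
y = expensive_simulation(X[:, 0], X[:, 1])

# Quadratic proxy: second-order polynomial response surface fitted to the runs.
proxy = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# Now thousands of "what if" scenarios can be screened in milliseconds.
candidates = np.column_stack([rng.uniform(0.10, 0.25, 5000), rng.uniform(50, 500, 5000)])
estimates = proxy.predict(candidates)
print(estimates[:5])
```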
Probabilistic vs Deterministic Approaches in Reservoir Modelling
One of the recurring challenges in hydrocarbon modelling centres on the philosophical and practical distinction between probabilistic and deterministic modelling:
Deterministic Models rely on fixed values for each property or parameter, often reflecting the “best guess” or most likely scenario. While these simplify computation and interpretation, they can obscure the inherent uncertainty of subsurface predictions, especially in environments like deep-sea reservoirs, where direct measurements are limited and expensive. If you are planning a family holiday, the deterministic approach is like picking one date, one route and one hotel, and expecting everything to go as planned.
Probabilistic Models, on the other hand, explicitly incorporate uncertainty by assigning distributions, rather than single values, to each parameter. By running simulations across thousands (or millions) of scenarios, these models generate an ensemble of possible outcomes. This not only helps quantify the range of uncertainty but also provides vital information for risk assessment and economic decision-making. In the holiday example, this method allows you to check different travel dates, routes, and hotels, weighing the odds of bad weather or traffic, and planning for a range of possibilities.
In practice, both approaches are used in tandem. A deterministic realisation might serve as a reference case, but it is the ensemble of probabilistic runs—made computationally feasible by modern high-performance computing—that delivers a richer picture of possible futures, informing strategies such as field development plans or investment decisions. Companies can see not just what’s likely, but also what could go wrong—and how to prepare for it. This helps in making big decisions, like where to invest millions of dollars and how to manage risks responsibly.
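To make the deterministic versus probabilistic contrast concrete, here is a minimal numerical sketch (the volumetric formula and figures are simplified and purely illustrative): the deterministic case runs one best guess, while the probabilistic case samples the uncertain inputs and reports a range of outcomes such as P10/P50/P90.

```python
import numpy as np

def recoverable_volume(area_km2, thickness_m, porosity, recovery_factor):
    """Toy volumetric estimate in cubic metres (simplified, illustrative only)."""
    return area_km2 * 1e6 * thickness_m * porosity * recovery_factor

# Deterministic: one "best guess" per input, one answer out.
best_guess = recoverable_volume(12.0, 30.0, 0.18, 0.35)

# Probabilistic: sample each uncertain input and build an ensemble of outcomes.
rng = np.random.default_rng(11)
n = 10_000
volumes = recoverable_volume(
    rng.normal(12.0, 1.5, n),              # area
    rng.normal(30.0, 5.0, n),              # thickness
    rng.uniform(0.12, 0.24, n),            # porosity
    rng.triangular(0.25, 0.35, 0.45, n),   # recovery factor
)

p10, p50, p90 = np.percentile(volumes, [10, 50, 90])
print(f"deterministic: {best_guess:.3e} m3 | P10 {p10:.3e}  P50 {p50:.3e}  P90 {p90:.3e}")
```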
RMSE (Root Mean Square Error): A Scorecard for Models
How do you know if your predictions are any good? Enter Root Mean Square Error (RMSE), a statistical tool that acts like a scorecard. RMSE measures the difference between what actually happened and what your model predicted, combining all errors into a single number. Lower RMSE means your model’s guesses are closer to reality, so it’s an easy way for both scientists and executives to compare models, track improvements, and justify investments in better modelling techniques.
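RMSE is simple enough to show in a couple of lines; the arrays here are illustrative production figures, not real data.

```python
import numpy as np

observed  = np.array([980.0, 940.0, 905.0, 870.0, 835.0])   # actual production
predicted = np.array([995.0, 930.0, 915.0, 860.0, 845.0])   # model forecast

# RMSE: square the errors, average them, take the square root.
rmse = np.sqrt(np.mean((observed - predicted) ** 2))
print(f"RMSE = {rmse:.1f}")   # one number summarising how far off the model is
```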
The next phase of model simulation is the most interesting because this is where we truly experience machine learning at play.
Training Data from a Model: Preparing the Learning Ground
In machine learning, training data is the information a model uses to learn patterns and relationships, much like a student learns from examples before taking a test. In reservoir engineering, this training data comes not only from real measurements (like pressure, production rates, or rock types) but also from synthetic scenarios generated by the model itself. By exposing the AI to a wide range of possible conditions—produced by running the reservoir model with different inputs—engineers ensure the machine can recognise trends, spot outliers, and make robust predictions even in unfamiliar situations.
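A hedged sketch of that idea, with a placeholder function standing in for the physics-based reservoir model: the model is run across many sampled input combinations, and those input/output pairs become the training set for an ML regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def reservoir_model(porosity, permeability, pressure):
    """Placeholder for the full reservoir model generating synthetic outcomes."""
    return 2000 * porosity + 4 * permeability + 0.8 * pressure

# Generate synthetic training scenarios by running the model over sampled inputs.
rng = np.random.default_rng(5)
X = np.column_stack([
    rng.uniform(0.10, 0.25, 2000),    # porosity
    rng.uniform(50, 500, 2000),       # permeability (mD)
    rng.uniform(2000, 6000, 2000),    # pressure (psi)
])
y = reservoir_model(X[:, 0], X[:, 1], X[:, 2])

# Train on most scenarios, hold some back to check the learner generalises.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
learner = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("held-out R^2:", learner.score(X_test, y_test))
```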
Experimentation and Fine-Tuning vs. Black Box Approach
There are two main philosophies when building predictive models:
Experimentation and Fine-Tuning: This approach is transparent and interactive. Engineers and data scientists systematically adjust model settings, review results, and learn from every outcome. They know what each adjustment means and why it affects the results—a bit like a chef tasting and tweaking a recipe until it’s perfect.
Black Box Approach: Here, you feed the data in and get predictions out, but you may not understand what happens inside—like a magic box that refuses to explain itself. While often fast and effective (used in some deep learning and AI systems), black boxes lack transparency, making it tough for executives to trust critical decisions—especially when millions are at stake, or regulatory approval is needed.
Most leaders in industry prefer a blend: using AI’s speed and data-processing power, but also ensuring they understand how and why models make their recommendations. This helps build confidence and supports accountability.
Visualising the correlation of parameters in hyperspace
In these models, there can be hundreds or thousands of factors influencing outcomes, far more than the familiar X, Y and Z axes of a normal chart. This “hyperspace” can’t be visualised directly, but advanced machine learning tools step in to help. Companies have built their own proprietary tools in Python for this purpose; the closest public example is Google’s work on visualising high-dimensional space: https://experiments.withgoogle.com/visualizing-high-dimensional-space
These tools make hidden relationships visible, highlighting which sets of parameters move together and which ones have the biggest impact on results. For executives and teams, this means you can “see” what matters most at a glance—streamlining decision-making and focusing resources on what truly drives success.
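As a rough sketch of how such a projection works (using scikit-learn's widely available PCA rather than any proprietary tool, with random placeholder data), hundreds of parameters per scenario can be squeezed down to two axes for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA

# 500 simulated scenarios, each described by 300 parameters (illustrative data).
rng = np.random.default_rng(13)
scenarios = rng.normal(size=(500, 300))

# Project the 300-dimensional "hyperspace" down to 2 dimensions for plotting.
pca = PCA(n_components=2)
projection = pca.fit_transform(scenarios)

print(projection.shape)                  # (500, 2) - ready for a scatter plot
print(pca.explained_variance_ratio_)     # how much of the variation the 2 axes retain
```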
Conclusion - beyond reservoir engineering
As you can see, the power of domain understanding, combined with an experimentation approach that uses both manual statistics and computation-driven algorithms, can be applied in almost any other scenario to understand and predict outcomes or behaviour. The key is to structure the scaffolding well and use the process to present rich yet understandable data.
