Editorial disclosure
This article reflects the independent analysis and professional opinion of the author, informed by direct experience building data systems for upstream oil and gas. Code examples are tested and functional. No vendor reviewed or influenced this content prior to publication.
Petroleum engineering generates some of the most interesting data in any industry. A single well can produce continuous logs of gamma ray, resistivity, porosity, and density spanning thousands of feet of subsurface rock. A producing field generates daily streams of oil, gas, and water rates across hundreds of wells, alongside pressure readings, choke settings, artificial lift parameters, and injection volumes. Drilling operations generate sub-second telemetry from the rig floor -- weight on bit, RPM, torque, flow rate, ROP -- in real time.
And yet, for a field that produces this much data, petroleum engineering has been remarkably slow to adopt modern data tools. Many engineers still do their analysis in Excel spreadsheets, proprietary vendor software, or legacy Fortran programs. The data lives in formats that most software engineers have never seen -- LAS files for well logs, WITSML streams for drilling data, and regulatory CSV downloads with idiosyncratic column layouts.
Python changes this. Not because Python is inherently better than any other tool, but because it offers something that no proprietary petroleum engineering software can match: a single environment where you can read well logs, clean production data, fit decline curves, build machine learning models, generate publication-quality plots, and deploy results through APIs -- all with code that is reproducible, version-controlled, and free.
This article is a practical guide to getting started. If you are a petroleum engineering student, a data scientist entering the energy industry, or an engineer who has been meaning to learn Python but never had a domain-relevant starting point, this is written for you.
Why Python for Petroleum Engineering
Python has become the default language for data analysis across most technical fields, and petroleum engineering is catching up. The reasons are straightforward:
The scientific computing stack is mature. NumPy for numerical arrays, pandas for tabular data, SciPy for optimization and curve fitting, matplotlib for visualization. These libraries are battle-tested across decades of use in physics, finance, and machine learning. They work.
Domain-specific libraries exist. The Python ecosystem includes libraries built specifically for petroleum engineering data: lasio for LAS files, welly for well data manipulation, dlisio for DLIS files, and pyResToolbox for reservoir engineering calculations. These are not toy projects -- they are maintained, documented, and used in production.
Machine learning is native. If you want to move beyond basic analysis into predictive modeling -- decline curve forecasting, well log imputation, drilling dysfunction detection, production anomaly classification -- Python is where scikit-learn, XGBoost, TensorFlow, and PyTorch live. There is no equivalent ecosystem in MATLAB, R, or any proprietary platform.
Reproducibility matters. A Python script that reads data, runs analysis, and produces output can be shared, reviewed, version-controlled with Git, and re-run by anyone. An Excel spreadsheet with embedded formulas and manual chart adjustments cannot.
It is free. No license fees, no seat restrictions, no vendor lock-in. For a small operator running lean, this matters more than most vendors want to admit.
Setting Up Your Environment
Start with a clean Python installation. If you do not already have Python installed, download it from python.org or use Anaconda/Miniconda if you prefer managed environments.
Install the core libraries you will need:
```bash
pip install lasio pandas numpy scipy matplotlib
```
For additional petroleum engineering work, you will eventually want:
```bash
pip install welly dlisio striplog pyResToolbox
```
Create a working directory for your projects and open it in your preferred editor (VS Code is the most common choice for Python data work). If you are new to Python, Jupyter notebooks provide an interactive environment that lets you run code in cells and see results immediately:
```bash
pip install jupyterlab
jupyter lab
```
Throughout this article, the code examples assume you are running Python 3.9 or later and have the core libraries installed. Blocks within each section build on one another, so run them in order.
Reading LAS Files with lasio
LAS (Log ASCII Standard) is the workhorse format for well log data. Developed by the Canadian Well Logging Society, LAS files store curve data -- gamma ray, resistivity, porosity, density, neutron, sonic -- alongside header metadata that identifies the well, the logging run, and the measurement parameters.
If you have never seen a LAS file, it looks something like this at the top:
```
~VERSION INFORMATION
 VERS.   2.0          : CWLS LOG ASCII STANDARD - VERSION 2.0
 WRAP.   NO           : ONE LINE PER DEPTH STEP
~WELL INFORMATION
 WELL.   SMITH 1-15   : WELL NAME
 STRT.FT 5000.0000    : START DEPTH
 STOP.FT 10000.0000   : STOP DEPTH
 STEP.FT 0.5000       : STEP
```
The lasio library reads these files cleanly:
```python
import lasio
import matplotlib.pyplot as plt
import numpy as np

# Load a LAS file
las = lasio.read("well_logs.las")

# View the header information
print(las.well)    # Well information block
print(las.curves)  # Available curves and their units

# Access specific header fields
well_name = las.well["WELL"].value
start_depth = las.well["STRT"].value
stop_depth = las.well["STOP"].value
print(f"Well: {well_name}, Depth range: {start_depth} - {stop_depth}")
```
Accessing Curve Data
LAS curve data is stored as numpy arrays. You can also convert the entire file to a pandas DataFrame for easier manipulation:
```python
import pandas as pd

# Access a specific curve as a numpy array
depth = las["DEPT"]
gamma_ray = las["GR"]
resistivity = las["ILD"]  # deep induction log
porosity = las["NPHI"]    # neutron porosity

# Convert to a pandas DataFrame (often more convenient)
df = las.df()
print(df.head())
print(df.describe())

# The DataFrame index is depth by default
print(f"Depth range: {df.index.min()} to {df.index.max()}")
print(f"Available curves: {list(df.columns)}")
```
Handling Common Issues
LAS files from the real world are messy. Here are the issues you will encounter and how to handle them:
```python
# Issue 1: Null values (usually -999.25 in LAS files)
# lasio converts these to NaN automatically, but verify:
print(f"Null values per curve:\n{df.isnull().sum()}")

# Issue 2: Different curve mnemonics across files
# Gamma ray might be "GR", "SGR", "CGR", or "GAMMA"
# Build a mapping dictionary for your dataset:
GR_MNEMONICS = ["GR", "SGR", "CGR", "GAMMA", "GRGC"]

def find_curve(df, mnemonics):
    """Find the first matching curve mnemonic in a DataFrame."""
    for mnemonic in mnemonics:
        if mnemonic in df.columns:
            return mnemonic
    return None

gr_col = find_curve(df, GR_MNEMONICS)
if gr_col:
    print(f"Found gamma ray curve: {gr_col}")

# Issue 3: Depth step inconsistencies
# Check for irregular depth sampling
depth_diff = np.diff(df.index)
if not np.allclose(depth_diff, depth_diff[0], atol=0.01):
    print("Warning: irregular depth sampling detected")
    # Resample to a regular 0.5 ft grid. Interpolate on the union of the
    # old and new indices, then keep only the new grid -- a plain reindex
    # to unseen depths would produce all-NaN rows.
    new_index = np.arange(df.index.min(), df.index.max() + 0.5, 0.5)
    df = (df.reindex(df.index.union(new_index))
            .interpolate(method="index")
            .reindex(new_index))
```
Plotting Well Logs
A standard well log display shows multiple curves in side-by-side tracks. Here is how to create one:
```python
def plot_well_logs(df, depth_col=None, top=None, bottom=None):
    """Create a standard 4-track well log plot."""
    if depth_col:
        depth = df[depth_col]
    else:
        depth = df.index  # LAS DataFrames use depth as index

    fig, axes = plt.subplots(1, 4, figsize=(12, 10), sharey=True)
    fig.suptitle("Well Log Display", fontsize=14, fontweight="bold")

    # Track 1: Gamma Ray
    ax1 = axes[0]
    ax1.plot(df["GR"], depth, color="green", linewidth=0.5)
    ax1.set_xlabel("GR (API)")
    ax1.set_xlim(0, 150)
    ax1.set_title("Gamma Ray")
    ax1.fill_betweenx(depth, 0, df["GR"].clip(upper=150),
                      where=(df["GR"] < 75), color="yellow", alpha=0.3)
    ax1.fill_betweenx(depth, 0, df["GR"].clip(upper=150),
                      where=(df["GR"] >= 75), color="gray", alpha=0.3)

    # Track 2: Resistivity (logarithmic scale)
    ax2 = axes[1]
    if "ILD" in df.columns:
        ax2.semilogx(df["ILD"], depth, color="red", linewidth=0.5,
                     label="Deep")
    if "ILM" in df.columns:
        ax2.semilogx(df["ILM"], depth, color="blue", linewidth=0.5,
                     label="Medium")
    ax2.set_xlabel("Resistivity (ohm-m)")
    ax2.set_xlim(0.1, 1000)
    ax2.set_title("Resistivity")
    ax2.legend(fontsize=8)

    # Track 3: Porosity (reversed scale)
    ax3 = axes[2]
    if "NPHI" in df.columns:
        ax3.plot(df["NPHI"], depth, color="blue", linewidth=0.5,
                 label="Neutron")
    if "DPHI" in df.columns:
        ax3.plot(df["DPHI"], depth, color="red", linewidth=0.5,
                 label="Density")
    ax3.set_xlabel("Porosity (v/v)")
    ax3.set_xlim(0.45, -0.05)  # reversed scale
    ax3.set_title("Porosity")
    ax3.legend(fontsize=8)

    # Track 4: Bulk Density
    ax4 = axes[3]
    if "RHOB" in df.columns:
        ax4.plot(df["RHOB"], depth, color="red", linewidth=0.5)
    ax4.set_xlabel("RHOB (g/cc)")
    ax4.set_xlim(1.95, 2.95)
    ax4.set_title("Bulk Density")

    # Format depth axis ("is not None" so a zero depth is not treated as unset)
    axes[0].set_ylabel("Depth (ft)")
    axes[0].invert_yaxis()
    if top is not None and bottom is not None:
        axes[0].set_ylim(bottom, top)

    for ax in axes:
        ax.grid(True, alpha=0.3)
        ax.tick_params(axis="x", labelsize=8)

    plt.tight_layout()
    plt.savefig("well_log_plot.png", dpi=150, bbox_inches="tight")
    plt.show()

# Usage:
# plot_well_logs(df, top=7000, bottom=8500)
```
This produces the kind of multi-track log display that every petroleum engineer recognizes. The gamma ray track with sand/shale fill, logarithmic resistivity, neutron-density crossover for porosity -- these are standard visual conventions in the industry. Having them generated from code means you can produce consistent plots across hundreds of wells without touching a GUI.
Working with Production Data
Production data is the other essential dataset in petroleum engineering. Unlike well logs (which are recorded once during drilling or completion), production data accumulates over the entire producing life of a well -- monthly or daily rates of oil, gas, and water.
Most production data comes from state regulatory agencies. In the United States, the most commonly used sources are:
- Texas Railroad Commission (RRC) -- Monthly production by lease and well
- New Mexico OCD -- Well-level production data
- North Dakota NDIC -- Individual well production and completion records
- Colorado COGCC -- Production data with detailed completion information
These agencies publish data in CSV or fixed-width formats, each with its own column naming conventions, date formats, and quirks.
Loading and Cleaning Texas RRC Data
Texas RRC production data is the most widely used public dataset in the industry. Here is how to load and clean it:
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load production data (CSV export from Texas RRC or IHS/Enverus)
df = pd.read_csv("production_data.csv")

# Common column names in Texas RRC exports:
# LEASE_NO, DISTRICT, OPERATOR_NAME, LEASE_NAME,
# MONTH, YEAR, OIL_BBL, GAS_MCF, WATER_BBL

# Create a proper datetime column
df["date"] = pd.to_datetime(
    df["YEAR"].astype(str) + "-" + df["MONTH"].astype(str) + "-01"
)

# Sort by date
df = df.sort_values("date").reset_index(drop=True)

# Handle common data quality issues
# 1. Convert monthly volumes to average daily rates
df["oil_bpd"] = df["OIL_BBL"] / df["date"].dt.days_in_month
df["gas_mcfd"] = df["GAS_MCF"] / df["date"].dt.days_in_month
df["water_bpd"] = df["WATER_BBL"] / df["date"].dt.days_in_month

# 2. Remove rows with no production (well not yet online or abandoned)
df = df[df[["oil_bpd", "gas_mcfd", "water_bpd"]].sum(axis=1) > 0]

# 3. Calculate cumulative production
df["cum_oil"] = df["OIL_BBL"].cumsum()
df["cum_gas"] = df["GAS_MCF"].cumsum()

# 4. Calculate GOR (scf/bbl) and water cut
df["gor"] = df["gas_mcfd"] / df["oil_bpd"].replace(0, np.nan) * 1000
df["water_cut"] = df["water_bpd"] / (df["oil_bpd"] + df["water_bpd"])

print(f"Production data: {len(df)} months")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Cumulative oil: {df['cum_oil'].iloc[-1]:,.0f} BBL")
print(f"Cumulative gas: {df['cum_gas'].iloc[-1]:,.0f} MCF")
```
Plotting Production Decline Curves
The production decline curve is arguably the most important chart in petroleum engineering. It shows how a well's rate changes over time and forms the basis for reserves estimation, economic analysis, and field development planning.
```python
def plot_production(df, well_name="Well"):
    """Plot oil and gas rates on top, water cut and GOR below."""
    fig, (ax1, ax3) = plt.subplots(2, 1, figsize=(12, 8), sharex=True)

    # Top panel: rates
    color_oil = "#2E8B57"
    color_gas = "#CD5C5C"
    color_water = "#4169E1"

    ax1.plot(df["date"], df["oil_bpd"], color=color_oil,
             linewidth=1, label="Oil (bpd)")
    ax1.set_ylabel("Oil Rate (bpd)", color=color_oil)
    ax1.tick_params(axis="y", labelcolor=color_oil)

    ax2 = ax1.twinx()
    ax2.plot(df["date"], df["gas_mcfd"], color=color_gas,
             linewidth=1, alpha=0.7, label="Gas (mcfd)")
    ax2.set_ylabel("Gas Rate (mcfd)", color=color_gas)
    ax2.tick_params(axis="y", labelcolor=color_gas)

    ax1.set_title(f"{well_name} - Production History", fontweight="bold")
    ax1.legend(loc="upper left")
    ax2.legend(loc="upper right")
    ax1.grid(True, alpha=0.3)

    # Bottom panel: water cut and GOR
    ax3.plot(df["date"], df["water_cut"] * 100, color=color_water,
             linewidth=1, label="Water Cut (%)")
    ax3.set_ylabel("Water Cut (%)", color=color_water)
    ax3.set_xlabel("Date")
    ax3.set_ylim(0, 100)

    ax4 = ax3.twinx()
    ax4.plot(df["date"], df["gor"], color="orange",
             linewidth=1, alpha=0.7, label="GOR (scf/bbl)")
    ax4.set_ylabel("GOR (scf/bbl)", color="orange")
    ax3.legend(loc="upper left")
    ax4.legend(loc="upper right")
    ax3.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig("production_plot.png", dpi=150, bbox_inches="tight")
    plt.show()

# Usage:
# plot_production(df, well_name="Smith 1-15H")
```
Decline Curve Analysis in Python
Decline curve analysis (DCA) is the foundation of reserves estimation in unconventional reservoirs. The Arps decline equations, published by J.J. Arps in 1945, remain the industry standard nearly 80 years later. The reason is simple: they work well enough for most practical purposes, and every petroleum engineer understands them.
The three Arps decline models are:
- Exponential decline (b = 0): Constant percentage decline per unit time. Common in conventional reservoirs under boundary-dominated flow.
- Hyperbolic decline (0 < b < 1): Declining percentage decline rate. Common in unconventional wells during transitional flow.
- Harmonic decline (b = 1): A special case of hyperbolic decline. Sometimes observed in water-drive reservoirs.
The Math
The general Arps hyperbolic decline equation is:
q(t) = qi / (1 + b * Di * t)^(1/b)
Where:
- `q(t)` = production rate at time t
- `qi` = initial production rate
- `Di` = initial decline rate (1/time)
- `b` = decline exponent
- `t` = time from start of decline
For exponential decline (b = 0), this simplifies to:
q(t) = qi * exp(-Di * t)
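One quick way to build confidence in these equations is to check numerically that the hyperbolic form collapses to the exponential form as b approaches zero (the parameter values below are arbitrary illustrations):

```python
import numpy as np

t = np.linspace(0, 60, 61)  # five years in monthly steps
qi, di = 1000.0, 0.05       # arbitrary illustrative parameters

q_exp = qi * np.exp(-di * t)                    # exponential form
q_hyp = qi / (1 + 1e-6 * di * t) ** (1 / 1e-6)  # hyperbolic with tiny b
print(np.max(np.abs(q_exp - q_hyp)))  # effectively zero
```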
Implementation with scipy
```python
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# Define the Arps decline functions
def exponential_decline(t, qi, di):
    """Exponential decline: q = qi * exp(-di * t)"""
    return qi * np.exp(-di * t)

def hyperbolic_decline(t, qi, di, b):
    """Hyperbolic decline: q = qi / (1 + b * di * t)^(1/b)"""
    return qi / (1 + b * di * t) ** (1 / b)

def harmonic_decline(t, qi, di):
    """Harmonic decline (b=1): q = qi / (1 + di * t)"""
    return qi / (1 + di * t)
```
Fitting Decline Curves to Real Data
```python
def fit_decline_curve(time, rate, model="hyperbolic"):
    """
    Fit an Arps decline curve to production data.

    Parameters:
        time: array of time values (months from first production)
        rate: array of production rates (bpd or mcfd)
        model: 'exponential', 'hyperbolic', or 'harmonic'

    Returns:
        popt: fitted parameters
        pcov: covariance matrix
    """
    # Remove zero/NaN values
    mask = (rate > 0) & np.isfinite(rate) & np.isfinite(time)
    t_clean = time[mask]
    q_clean = rate[mask]

    if len(t_clean) < 3:
        raise ValueError("Not enough valid data points for curve fitting")

    qi_guess = q_clean.max()
    di_guess = 0.05  # 5% per month initial guess

    if model == "exponential":
        popt, pcov = curve_fit(
            exponential_decline, t_clean, q_clean,
            p0=[qi_guess, di_guess],
            bounds=([0, 0], [qi_guess * 2, 1.0]),
            maxfev=10000
        )
        print(f"Exponential fit: qi={popt[0]:.1f}, Di={popt[1]:.4f}/month")
    elif model == "hyperbolic":
        popt, pcov = curve_fit(
            hyperbolic_decline, t_clean, q_clean,
            p0=[qi_guess, di_guess, 0.5],
            bounds=([0, 0, 0.01], [qi_guess * 2, 1.0, 2.0]),
            maxfev=10000
        )
        print(f"Hyperbolic fit: qi={popt[0]:.1f}, Di={popt[1]:.4f}/month, "
              f"b={popt[2]:.3f}")
    elif model == "harmonic":
        popt, pcov = curve_fit(
            harmonic_decline, t_clean, q_clean,
            p0=[qi_guess, di_guess],
            bounds=([0, 0], [qi_guess * 2, 1.0]),
            maxfev=10000
        )
        print(f"Harmonic fit: qi={popt[0]:.1f}, Di={popt[1]:.4f}/month")
    else:
        raise ValueError(f"Unknown model: {model}")

    return popt, pcov
```
```python
def estimate_eur(popt, model, economic_limit=5.0, max_months=600):
    """
    Estimate EUR (Estimated Ultimate Recovery) by integrating
    the decline curve to an economic limit.

    Parameters:
        popt: fitted parameters from curve_fit
        model: 'exponential', 'hyperbolic', or 'harmonic'
        economic_limit: minimum economic rate (bpd)
        max_months: maximum forecast period

    Returns:
        eur: estimated ultimate recovery (bbls)
        months_to_limit: time to reach economic limit
    """
    t_forecast = np.arange(0, max_months, 1)

    if model == "exponential":
        q_forecast = exponential_decline(t_forecast, *popt)
    elif model == "hyperbolic":
        q_forecast = hyperbolic_decline(t_forecast, *popt)
    elif model == "harmonic":
        q_forecast = harmonic_decline(t_forecast, *popt)

    # Find when production drops below economic limit
    above_limit = q_forecast >= economic_limit
    if above_limit.any():
        months_to_limit = np.where(above_limit)[0][-1] + 1
    else:
        months_to_limit = 0

    # Integrate rate over time (daily rate x ~30.44 days/month)
    q_economic = q_forecast[:months_to_limit]
    eur = np.sum(q_economic) * 30.44  # cumulative bbls

    print(f"EUR: {eur:,.0f} BBL")
    print(f"Economic life: {months_to_limit / 12:.1f} years")
    return eur, months_to_limit
```
Putting It All Together
```python
def run_dca(df, rate_col="oil_bpd", well_name="Well"):
    """Complete DCA workflow: fit, forecast, plot, and estimate EUR."""
    # Prepare time array (months from first production)
    df = df.copy()
    df["months"] = np.arange(len(df))
    time = df["months"].values.astype(float)
    rate = df[rate_col].values.astype(float)

    # Fit all three models
    results = {}
    for model in ["exponential", "hyperbolic", "harmonic"]:
        try:
            popt, pcov = fit_decline_curve(time, rate, model=model)
            eur, life = estimate_eur(popt, model)
            results[model] = {"popt": popt, "eur": eur, "life": life}
        except Exception as e:
            print(f"{model} fit failed: {e}")

    # Plot results
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.scatter(df["date"], rate, color="black", s=10,
               label="Actual", zorder=5)

    t_forecast = np.arange(0, 360, 1)  # 30-year forecast
    colors = {"exponential": "blue", "hyperbolic": "red",
              "harmonic": "green"}

    for model, data in results.items():
        popt = data["popt"]
        if model == "exponential":
            q_pred = exponential_decline(t_forecast, *popt)
        elif model == "hyperbolic":
            q_pred = hyperbolic_decline(t_forecast, *popt)
        elif model == "harmonic":
            q_pred = harmonic_decline(t_forecast, *popt)

        # Create forecast dates
        forecast_dates = pd.date_range(
            start=df["date"].iloc[0], periods=len(t_forecast), freq="MS"
        )
        label = f"{model.title()} (EUR: {data['eur']:,.0f} BBL)"
        ax.plot(forecast_dates, q_pred, color=colors[model],
                linewidth=1.5, label=label, alpha=0.8)

    ax.axhline(y=5, color="gray", linestyle="--", alpha=0.5,
               label="Economic limit (5 bpd)")
    ax.set_xlabel("Date")
    ax.set_ylabel(f"{rate_col} (bpd)")
    ax.set_title(f"{well_name} - Decline Curve Analysis", fontweight="bold")
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(bottom=0)

    plt.tight_layout()
    plt.savefig("dca_plot.png", dpi=150, bbox_inches="tight")
    plt.show()
    return results

# Usage:
# results = run_dca(df, rate_col="oil_bpd", well_name="Smith 1-15H")
```
A note on b-factors: in unconventional reservoirs (tight oil, shale gas), fitted b-values often exceed 1.0, which implies physically impossible infinite cumulative production. The standard industry practice is to switch from hyperbolic to exponential decline at a minimum decline rate (typically 5-8% per year). This is called a "modified hyperbolic" or "b-factor switch" and is the approach used in most commercial DCA software. Implementing this in Python is straightforward -- you just need an if-statement at the switchover point.
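The switchover described above can be sketched in a few lines. This is an illustrative implementation, not the exact formulation of any commercial package: it forecasts hyperbolically until the instantaneous decline rate D(t) = Di / (1 + b * Di * t) falls to a terminal rate Dmin, then continues exponentially at Dmin from the switchover rate.

```python
import numpy as np

def modified_hyperbolic(t, qi, di, b, d_min):
    """Hyperbolic decline that switches to exponential once the
    instantaneous decline rate D(t) = di / (1 + b*di*t) falls to d_min.
    All rates are per month; d_min is the terminal decline rate."""
    t = np.asarray(t, dtype=float)
    # Time at which the instantaneous decline reaches d_min
    t_switch = (di / d_min - 1.0) / (b * di)
    q_switch = qi / (1.0 + b * di * t_switch) ** (1.0 / b)
    return np.where(
        t < t_switch,
        qi / (1.0 + b * di * t) ** (1.0 / b),        # hyperbolic segment
        q_switch * np.exp(-d_min * (t - t_switch)),  # exponential tail
    )

# Example: b = 1.2 with a ~6%/year (0.005/month) terminal decline
t = np.arange(0, 600.0)
q = modified_hyperbolic(t, qi=1000.0, di=0.10, b=1.2, d_min=0.005)
```

The exponential tail keeps cumulative production finite even when the fitted b exceeds 1.0, which is the whole point of the switch.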
Introduction to WITSML Data
WITSML (Wellsite Information Transfer Standard Markup Language) is the XML-based standard for real-time drilling data transfer. If LAS files are the standard for static well log data, WITSML is the standard for dynamic, real-time data streaming from the rig to the office.
WITSML data includes:
- Real-time drilling parameters -- depth, ROP, WOB, RPM, torque, flow rate, standpipe pressure
- Survey data -- measured depth, inclination, azimuth, calculated TVD/northing/easting
- Mud logging data -- gas readings, lithology descriptions, drill cuttings
- MWD/LWD data -- downhole measurements transmitted in real time
WITSML uses a client-server architecture. Data producers (EDR systems, MWD tools) publish data to a WITSML server, and data consumers (monitoring dashboards, analytics platforms) subscribe to that data. The major WITSML server implementations include Pason's DataHub, Petrolink's WITSML Server, and various service company platforms.
Accessing WITSML in Python
Direct WITSML access requires connecting to a WITSML server via SOAP API. The komle library provides Python bindings:
```python
# pip install komle
# Note: WITSML access requires server credentials from your data provider
from komle.bindings.v1411.read import witsml
from komle import utils as komle_utils

# Most real-world WITSML access looks like this:
# 1. Connect to the WITSML server
# 2. Query for available wells and wellbores
# 3. Subscribe to real-time data channels
# 4. Process incoming data as it arrives

# For offline analysis, WITSML data is often exported to CSV or
# stored in a time-series database (InfluxDB, TimescaleDB)
# and accessed via SQL or REST API
```
In practice, most engineers do not interact with WITSML directly. They use platforms like Corva, Pason, or SLB's DrillOps that consume WITSML data and expose it through web dashboards or REST APIs. If you are building a drilling analytics application from scratch, you will need WITSML access. If you are analyzing historical drilling data, you will more likely work with CSV exports or database queries.
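For that common offline case, the work reduces to ordinary pandas time-series analysis. The sketch below uses synthetic one-second telemetry with hypothetical column names (`rop_ft_hr`, `wob_klbs`, `rpm`) standing in for a vendor CSV export; real exports vary by vendor:

```python
import numpy as np
import pandas as pd

# Synthetic 1 Hz drilling telemetry standing in for a real EDR export.
# With a real file you would instead do:
#   df = pd.read_csv("export.csv", parse_dates=["timestamp"],
#                    index_col="timestamp")
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=3600, freq="s")  # one hour at 1 Hz
df = pd.DataFrame({
    "rop_ft_hr": np.clip(rng.normal(120, 30, len(idx)), 0, None),
    "wob_klbs": rng.normal(25, 3, len(idx)),
    "rpm": rng.normal(60, 5, len(idx)),
}, index=idx)

# Resample second-by-second data to 1-minute averages to tame volume
per_min = df.resample("1min").mean()

# Flag slow-drilling minutes, e.g. ROP below the 10th percentile
slow = per_min[per_min["rop_ft_hr"] < per_min["rop_ft_hr"].quantile(0.10)]
print(f"{len(slow)} slow minutes out of {len(per_min)}")
```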
For a deeper look at how WITSML fits into the drilling data management ecosystem, see our article on drilling data management and cloud architectures.
Connecting to MCP for AI-Powered Analysis
The Model Context Protocol (MCP) is an open standard for connecting AI assistants to external data sources and tools. For petroleum engineering, MCP servers allow AI models to directly access well data, run calculations, and generate analysis -- without requiring the engineer to manually extract, clean, and format data.
petro-mcp is an open-source MCP server built specifically for petroleum engineering data. It provides tools for:
- Querying well production data from public APIs
- Running decline curve analysis
- Calculating reservoir engineering parameters
- Accessing completion and drilling data
Setting Up petro-mcp
```bash
# Clone and install
git clone https://github.com/petropt/petro-mcp.git
cd petro-mcp
pip install -e .
```
Once installed, petro-mcp can be connected to AI assistants like Claude Desktop, enabling natural-language queries against petroleum engineering data:
- "Show me the top 10 producing wells in Reeves County by IP30"
- "Run decline curve analysis on this well and estimate EUR"
- "Compare completion designs for Wolfcamp A wells in the Delaware Basin"
This is the direction petroleum engineering workflows are heading: AI agents that can access data, run calculations, and generate reports autonomously, with engineers reviewing and directing the work rather than performing every manual step. For more on how AI and agentic workflows are transforming upstream operations, see our coverage of digital platforms and AI in upstream oil and gas.
Key Python Libraries for Petroleum Engineering
Here is a reference table of the most useful Python libraries for PE data work:
Core Data Libraries
| Library | Purpose | Install |
|---|---|---|
| pandas | Tabular data manipulation, time series | pip install pandas |
| numpy | Numerical arrays, linear algebra | pip install numpy |
| scipy | Optimization, curve fitting, signal processing | pip install scipy |
| matplotlib | Publication-quality plots | pip install matplotlib |
| plotly | Interactive plots for dashboards | pip install plotly |
Petroleum Engineering Libraries
| Library | Purpose | Install |
|---|---|---|
| lasio | Read/write LAS well log files | pip install lasio |
| welly | Well data management, log analysis | pip install welly |
| dlisio | Read DLIS (Digital Log Interchange Standard) files | pip install dlisio |
| striplog | Manage lithological and stratigraphic data | pip install striplog |
| pyResToolbox | Reservoir engineering calculations (PVT, material balance, flow equations) | pip install pyResToolbox |
| komle | WITSML data bindings | pip install komle |
| petro-mcp | MCP server for PE data and AI workflows | github.com/petropt/petro-mcp |
Machine Learning Libraries (for advanced work)
| Library | Purpose | Install |
|---|---|---|
| scikit-learn | Classical ML (regression, classification, clustering) | pip install scikit-learn |
| xgboost | Gradient boosting (top performer for tabular data) | pip install xgboost |
| tensorflow / pytorch | Deep learning (sequence models, image recognition) | pip install tensorflow |
A note on pyResToolbox: this library is particularly useful for reservoir engineering calculations that would otherwise require implementing equations from textbooks. It includes PVT correlations (Standing, Vasquez-Beggs, Lee-Gonzalez), material balance calculations, inflow performance relationships, and flow regime identification. If you are doing reservoir engineering work in Python, install it early.
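For a sense of what these correlations involve, here is a hand-rolled implementation of Standing's (1947) bubble-point correlation. This is written from the published correlation, not taken from pyResToolbox -- consult that library's documentation for its actual function names and signatures:

```python
def standing_pb(rs_scf_stb, gas_sg, temp_f, api):
    """Standing (1947) bubble-point pressure correlation, in psia.

    rs_scf_stb: solution GOR, scf/STB
    gas_sg:     gas specific gravity (air = 1.0)
    temp_f:     reservoir temperature, degF
    api:        stock-tank oil gravity, degAPI
    """
    a = 0.00091 * temp_f - 0.0125 * api
    return 18.2 * ((rs_scf_stb / gas_sg) ** 0.83 * 10 ** a - 1.4)

# Example: a typical black oil
pb = standing_pb(rs_scf_stb=350, gas_sg=0.75, temp_f=200, api=30)
print(f"Bubble point: {pb:.0f} psia")
```

Libraries like pyResToolbox package dozens of these so you do not have to transcribe them from textbooks, which is exactly why they are worth installing early.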
Where to Find Public Data for Practice
One of the best things about petroleum engineering data is that a significant amount of it is publicly available. State regulatory agencies require operators to report production, completion, and drilling data, and most of this data is accessible online.
Production Data
- Texas Railroad Commission (RRC) -- rrc.texas.gov -- Monthly production by lease and well for every well in Texas. The single largest public production dataset in the world.
- Colorado COGCC -- cogcc.state.co.us -- Well-level production data for the DJ Basin and Piceance Basin. Detailed completion data.
- North Dakota NDIC -- dmr.nd.gov -- Individual well production for the Bakken and Three Forks. Includes IP rates and completion details.
- New Mexico OCD -- ocd.nm.gov -- Permian Basin (New Mexico side) production data.
Well Log Data
- Kansas Geological Survey -- kgs.ku.edu -- Thousands of digitized well logs in LAS format. Excellent for practice.
- NLOG (Netherlands) -- nlog.nl -- Well logs from Dutch offshore and onshore wells.
Complete Field Datasets
- Volve Field Dataset (Equinor) -- data.equinor.com -- A complete dataset from a North Sea oil field: well logs, production data, seismic, geological models, drilling data. Released by Equinor for research and education. This is the single best public dataset for learning petroleum data science.
- FracFocus -- fracfocus.org -- Chemical disclosure data for hydraulic fracturing jobs across the US. Includes fluid volumes, proppant amounts, and chemical compositions.
Start with the Volve dataset if you want a comprehensive learning experience. Start with Kansas well logs if you just want to practice reading LAS files in Python. Start with Texas RRC production data if you want to build decline curve analysis skills.
Next Steps and Resources
If you have followed along with the code examples in this article, you now have the building blocks for a petroleum engineering data workflow in Python: reading well logs, cleaning production data, fitting decline curves, and plotting results. Here is where to go next:
Build a multi-well analysis pipeline. The real value of Python over Excel emerges when you scale from one well to hundreds. Write a script that reads all LAS files in a directory, extracts key curves, computes summary statistics (average porosity per zone, net pay thickness, hydrocarbon pore-feet), and exports results to a single DataFrame. This is the kind of task that takes days in manual workflows and minutes in Python.
Learn pandas deeply. The pandas library is the single most important tool in your Python data toolkit. Invest time in learning groupby operations, pivot tables, merge/join operations, and time series resampling. The official pandas documentation is excellent, and the book "Python for Data Analysis" by Wes McKinney (the creator of pandas) is the definitive reference.
Explore machine learning for petroleum engineering. Once you are comfortable with data manipulation and visualization, the natural next step is predictive modeling. Start with scikit-learn for classical ML: predicting EUR from completion parameters, classifying well performance tiers, clustering wells by production behavior. The petroleum engineering domain offers unusually rich datasets for ML practice.
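A sketch of that first exercise is below, with synthetic data standing in for a real completion database -- the feature names and the coefficients generating the target are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic completion dataset: lateral length, proppant loading, and
# fluid loading drive EUR with noise (relationships are made up)
rng = np.random.default_rng(42)
n = 500
lateral_ft = rng.uniform(5000, 12000, n)
proppant_lb_ft = rng.uniform(1000, 3000, n)
fluid_bbl_ft = rng.uniform(20, 60, n)
eur_mbbl = (0.04 * lateral_ft + 0.05 * proppant_lb_ft
            + 2.0 * fluid_bbl_ft + rng.normal(0, 30, n))

X = np.column_stack([lateral_ft, proppant_lb_ft, fluid_bbl_ft])
X_train, X_test, y_train, y_test = train_test_split(
    X, eur_mbbl, test_size=0.25, random_state=0)

# Fit a random forest and score it on held-out wells
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"Test R^2: {r2_score(y_test, model.predict(X_test)):.2f}")
```

Swapping the synthetic arrays for columns pulled from a real completion database is the only change needed to make this a genuine study.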
Connect your workflows to AI agents. The petro-mcp server enables AI assistants to interact directly with petroleum engineering data and calculations. Instead of writing every analysis script from scratch, you can describe what you want in natural language and have an AI agent generate the code, run the analysis, and present the results. This is not replacing the engineer -- it is augmenting the engineer's productivity by an order of magnitude.
Read the other articles in this series. Our coverage of the upstream software landscape provides context for where Python fits in the broader technology ecosystem:
- Software Landscape for Drilling Operations -- Real-time drilling analytics and automation platforms
- Production Operations Software -- Surveillance, optimization, and AI in production
- Drilling Data Management and WITSML -- How drilling data flows from rig to cloud
- Reservoir Management Software -- Simulation, history matching, and uncertainty quantification
- Completions and Frac Software Platforms -- Frac design, real-time monitoring, and optimization
- Digital Platforms and AI in Upstream O&G -- The big picture of digital transformation
Python is not going to replace domain expertise. Understanding reservoir behavior, completion mechanics, drilling dynamics, and production physics is still what makes a petroleum engineer valuable. But Python gives you a better way to work with the data that informs those decisions -- faster, more reproducibly, and at a scale that manual workflows cannot match.
The industry is moving toward data-driven workflows whether individual engineers adopt them or not. The ones who learn these tools now will have a significant advantage.
Dr. Mehrdad Shirangi is the founder of Groundwork Analytics and holds a PhD from Stanford University in Energy Systems Optimization. He has been building AI solutions for the energy industry since 2018. Connect on X/Twitter and LinkedIn, or reach out at info@petropt.com.
Related Articles
- SCADA Data Quality for AI: The Audit Checklist -- The data quality issues you will encounter when working with real production data in Python.
- Decline Curve AI: Physics-Informed vs. Pure ML -- Why physics constraints matter when building DCA models in Python.
- 5 Open-Source Projects Every PE Student Should Contribute To -- Projects where you can apply your Python skills and build a portfolio.
Have questions about this topic? Get in touch.