Editorial disclosure: This article reflects the independent analysis and professional opinion of the author, informed by published research, vendor documentation, industry surveys, job postings, and practitioner experience. No vendor reviewed or influenced this content prior to publication. Product capabilities described are based on publicly available information and may not reflect the latest release.
This is Part 4 of the Complete Data Pipeline for Oil & Gas series. It covers Layer 4: the processing and orchestration layer that transforms raw wellsite data into clean, queryable, analytics-ready datasets. If you have not read the master article, start there for the full pipeline overview.
Raw data is not useful data.
A production engineer pulls up a Power BI dashboard at 7:00 AM. She wants to know which wells on the Smith pad had anomalous casing pressures overnight. The answer requires stitching together SCADA readings from eLynx (one-minute intervals, with gaps from a cellular outage between 2:00 and 3:00 AM), well configuration data from WellView, allocation factors from last month's test, and historical pressure ranges computed from the past 90 days.
That answer does not exist in any single system. It exists only after a processing layer has ingested, cleaned, harmonized, gap-filled, joined, and aggregated data from four different sources into a single queryable table. This is what Layer 4 does. And this is where most operators' data pipelines actually break.
The good news is that the tools for building this layer have gotten dramatically better in the past five years. The bad news is that most of the oil and gas industry has not adopted them yet. Permian Resources runs Databricks, Dagster, and dbt -- a stack that would be considered modern in any industry. Most of their mid-size peers are still on SQL Server stored procedures that nobody dares touch because the person who wrote them left three years ago.
This article covers the processing and orchestration landscape: what each tool does, where it fits, who in O&G is using it, and how to right-size the stack for your operation.
ETL vs. ELT: A Shift That Matters
Before diving into specific tools, it helps to understand the architectural shift that reshaped data processing across every industry, including oil and gas.
Traditional ETL
Extract, Transform, Load was the standard pattern for decades. Data was extracted from source systems (SCADA, historians, production databases), transformed in a separate processing engine (SQL Server Integration Services, Informatica, or custom scripts), and then loaded into a data warehouse in its final, clean form. Only the finished product was stored.
This pattern made sense when storage was expensive and compute was scarce. You transformed the data before loading it because you could not afford to store the raw version, and you did not have the processing power to transform it after the fact.
In oil and gas, traditional ETL often looked like this: a nightly batch job that pulled the previous day's production volumes from SCADA, ran them through a series of SQL stored procedures to allocate, validate, and aggregate, then loaded the results into a reporting database that fed Spotfire dashboards. If the transformation logic had a bug, you had to go back to the source system, re-extract, and reprocess. If you needed the data at a different granularity, you had to modify the stored procedures and re-run the entire pipeline.
Modern ELT
Extract, Load, Transform inverts the order. Raw data is extracted from source systems and loaded directly into a cloud data lake or lakehouse -- untouched, complete, in its original form. Transformation happens afterward, inside the lake or warehouse, using the cheap, scalable compute that cloud platforms provide.
The advantages for oil and gas are significant:
Raw data preservation. When an ESP failure investigation needs to look at one-second SCADA data from six months ago, that data exists in the lake in its original form. With traditional ETL, it might have been aggregated to hourly averages during the transformation step and the detail is gone forever.
Transformation flexibility. When the completions team needs the same production data aggregated by completion vintage rather than by pad, they can write a new transformation against the raw data rather than requesting a new ETL pipeline from IT. Multiple teams can transform the same raw data for different purposes.
Debugging transparency. When a dashboard number looks wrong, you can trace it back through the transformation layer to the raw data and identify exactly where the issue was introduced. With traditional ETL, the raw data may no longer exist.
Cheaper iteration. Changing a dbt model and re-running it against the existing raw data costs pennies in cloud compute. Rebuilding an ETL pipeline from scratch because the business requirements changed costs weeks of engineering time.
The ELT pattern has become dominant in the modern data stack. In oil and gas specifically, ELT aligns well with the reality that operators often do not know what questions they will need to answer when the data is first collected. The well that is operating normally today might develop rod pump problems in six months, and the investigation team will want access to every raw data point from the period before the failure began.
Processing Engines: Where the Compute Happens
Apache Spark / Databricks
Apache Spark is the dominant engine for large-scale data processing. It distributes computation across clusters of machines, enabling operators to process terabytes of data -- years of one-second SCADA readings across thousands of wells, seismic surveys, or well log libraries -- that would overwhelm a single database server.
Databricks is the commercial platform built around Spark. It adds a managed Spark environment (so you do not have to run your own clusters), Delta Lake (an open-source storage layer that brings ACID transactions and schema enforcement to data lakes), MLflow (for tracking machine learning experiments), and a collaborative notebook environment where data engineers and data scientists can work on the same datasets.
In upstream oil and gas, Databricks adoption is concentrated among progressive operators and supermajors:
- Permian Resources runs a Databricks lakehouse as the center of their data platform. This is confirmed through job postings and internal tooling references. They use Delta Lake for structured storage and Databricks SQL for analytical queries.
- Devon Energy has Databricks deployed alongside their existing SQL Server and SAP HANA infrastructure.
- Shell runs Azure Databricks for ML model development, particularly around safety analytics and production optimization.
- ExxonMobil, while primarily a Snowflake shop for their central data hub, uses Spark workloads for large-scale processing tasks.
When Databricks makes sense in O&G: When you are processing large volumes of time-series data (thousands of wells, years of one-second SCADA), when you need to run ML training workloads alongside data transformation, when your data team writes Python and SQL, and when you want a unified platform for data engineering, analytics, and data science.
When Databricks is overkill: When you have 200 wells and your entire production dataset fits in a PostgreSQL database. When your data engineering team is one person. When your transformations are straightforward SQL aggregations that a standard data warehouse handles fine. A Databricks cluster sitting idle still costs money. For small operators, the platform overhead -- both financial and operational -- is not justified by the processing requirements.
Databricks pricing reality: List pricing for Databricks on Azure starts at roughly $0.40/DBU for jobs compute, but actual costs depend heavily on usage patterns. A mid-size operator running daily production aggregation pipelines might spend $2,000-$5,000/month. A supermajor running real-time streaming and ML training at scale could easily spend $50,000+/month. The lakehouse model consolidates what used to be separate costs (data lake storage + data warehouse + ML platform), so the comparison is not straightforward.
Snowflake
Snowflake takes a different architectural approach: a fully managed cloud data warehouse with separation of storage and compute. Rather than running Spark clusters, you write SQL queries and Snowflake automatically scales compute resources up and down.
ExxonMobil has chosen Snowflake as their central data hub -- a signal worth paying attention to given ExxonMobil's scale and influence. Peloton offers well lifecycle data through a Snowflake data marketplace, making it straightforward for operators to join their internal data with public well data, production records, and completions information.
Snowflake's strength in O&G is in the analytical query layer rather than heavy data processing. It excels at cross-well comparisons, historical trend analysis, and serving data to BI tools. It is less suited for the kind of heavy, iterative processing workloads (ML training, complex time-series manipulation) where Spark shines.
Custom Python (pandas, Polars)
It would be dishonest not to mention what most petroleum engineering teams actually use: custom Python scripts with pandas. Every reservoir engineer who has learned Python in the past decade has written pandas code to clean production data, merge well headers, and generate decline curves. This approach works fine for ad-hoc analysis on small to moderate datasets.
Polars is the newer alternative to pandas, written in Rust and designed for performance. On the kind of tabular data manipulation common in O&G -- filtering production records, grouping by well, calculating rolling averages -- Polars can be 10-50x faster than pandas. For teams that have outgrown pandas but do not need Spark, Polars is worth evaluating.
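The kind of tabular manipulation described above is only a few lines in pandas (Polars exposes a near-identical API). A minimal sketch, with illustrative column names and values -- zero readings from a known outage are treated as missing before computing a per-well rolling average:

```python
import numpy as np
import pandas as pd

# Illustrative daily production records for two wells
df = pd.DataFrame({
    "well_api": ["42-461-00001"] * 5 + ["42-461-00002"] * 5,
    "production_date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03",
         "2024-01-04", "2024-01-05"] * 2
    ),
    "oil_bbls": [100, 98, 0, 95, 93, 210, 205, 207, 0, 198],
})

# Zero readings during a comms outage are missing data, not production
df["oil_bbls"] = df["oil_bbls"].replace(0, np.nan)

# 3-day rolling average, computed independently per well
df["oil_3d_avg"] = (
    df.sort_values("production_date")
      .groupby("well_api")["oil_bbls"]
      .transform(lambda s: s.rolling(3, min_periods=1).mean())
)
```

The same logic in Polars would use `group_by` with `rolling_mean`; the point is that this category of work fits comfortably in a DataFrame library without Spark.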
The limitation of custom Python scripts is not performance. It is reliability, reproducibility, and maintainability. A Jupyter notebook that one engineer wrote to clean production data is not a production pipeline. It does not run on a schedule, does not alert when it fails, does not version its outputs, and does not document its assumptions. Turning ad-hoc Python scripts into reliable, monitored pipelines is precisely what orchestration tools solve.
Transformation: dbt and the SQL-First Approach
dbt (data build tool) has reshaped how data teams build transformation pipelines, and it is beginning to make inroads in oil and gas.
What dbt Does
dbt lets you write data transformations as SQL SELECT statements, then handles everything else: dependency ordering (run the well header cleanup before the production aggregation that depends on it), incremental processing (only transform new rows, not the entire history), testing (assert that API numbers are unique, that production volumes are non-negative, that every well maps to a known lease), documentation (auto-generated data dictionaries and lineage graphs), and version control (transformations are code, stored in Git, reviewed in pull requests).
A dbt model for daily production allocation might look like this:
-- models/production/daily_allocated_production.sql
WITH raw_production AS (
    SELECT * FROM {{ ref('stg_scada_production') }}
),

allocation_factors AS (
    -- Most recent test factors per well, on or before today
    SELECT * FROM {{ ref('dim_well_test_factors') }} f
    WHERE test_date = (
        SELECT MAX(test_date)
        FROM {{ ref('dim_well_test_factors') }}
        WHERE well_api = f.well_api
          AND test_date <= CURRENT_DATE
    )
),

allocated AS (
    SELECT
        r.well_api,
        r.production_date,
        r.commingled_oil_bbls * a.oil_factor AS allocated_oil_bbls,
        r.commingled_gas_mcf * a.gas_factor AS allocated_gas_mcf,
        r.commingled_water_bbls * a.water_factor AS allocated_water_bbls
    FROM raw_production r
    JOIN allocation_factors a
        ON r.well_api = a.well_api
)

SELECT * FROM allocated
This is just SQL. Any petroleum engineer who can write a SQL query can read it, modify it, and verify it. The logic is transparent, version-controlled, and testable. Compare that to a 2,000-line stored procedure in SQL Server that has been modified by four different people over six years and has no documentation.
dbt in Oil and Gas
Permian Resources is the confirmed reference customer for dbt in upstream O&G. Their data engineering team uses dbt to transform raw SCADA and production data in their Databricks lakehouse into the curated datasets that feed Spotfire and Power BI dashboards. This confirms that dbt works at the scale and complexity of a 300,000+ BOE/day operator.
The 2025 dbt Fusion Engine release (Rust-based parser) significantly improved performance for large dbt projects -- relevant for operators with hundreds of transformation models across drilling, production, reservoir, and financial domains.
O&G-Specific Transformation Challenges dbt Solves Well
Well naming harmonization. The most painful data engineering problem in oil and gas. The same well appears as "SMITH 1-23H" in SCADA, "Smith Unit #1-23H" in WellView, and "42-461-12345" in the state database. A dbt staging model can implement the mapping logic once, test it, and every downstream model uses the clean, harmonized well identifier.
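The core of that mapping logic is string normalization, whichever layer it lives in. A minimal Python sketch -- the specific rules here are illustrative, not an exhaustive standard, and a real implementation would also join the normalized key to the 14-digit API number:

```python
import re

def normalize_well_name(raw: str) -> str:
    """Collapse common naming variants so 'SMITH 1-23H' and
    'Smith Unit #1-23H' map to the same key. Rules are illustrative."""
    name = raw.upper()
    name = re.sub(r"[#.]", "", name)          # drop punctuation
    name = re.sub(r"\bUNIT\b", "", name)      # drop filler words
    name = re.sub(r"\s+", " ", name).strip()  # collapse whitespace
    return name

# The SCADA and WellView variants now resolve to the same key
assert normalize_well_name("SMITH 1-23H") == normalize_well_name("Smith Unit #1-23H")
```

In a dbt staging model the same rules become nested `upper`/`regexp_replace` calls, written once and tested once, so every downstream model inherits the harmonized identifier.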
Production allocation. Multi-well pads with commingled production require allocation based on the most recent well test. The allocation logic changes when new tests are run. In dbt, this is a model that references the test history, selects the appropriate factors, and applies them. When allocation methods change, you update the model, re-run, and all downstream aggregations update automatically.
Unit conversions. Field units (psi, BBL, MCF) vs. SI units (kPa, m3) vs. mixed systems that some regulatory databases use. A dbt macro can handle unit conversion consistently across every model, rather than having each engineer implement their own conversion in their own spreadsheet.
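A dbt macro would express this in Jinja/SQL; the underlying arithmetic is just a small table of standard factors, sketched here in Python (function names are illustrative):

```python
# Standard field-to-SI conversion factors
PSI_TO_KPA = 6.894757    # 1 psi  = 6.894757 kPa
BBL_TO_M3 = 0.1589873    # 1 bbl  = 0.1589873 m^3
MCF_TO_M3 = 28.31685     # 1 Mcf (1,000 ft^3) = 28.31685 m^3

def psi_to_kpa(psi: float) -> float:
    return psi * PSI_TO_KPA

def bbl_to_m3(bbl: float) -> float:
    return bbl * BBL_TO_M3

def mcf_to_m3(mcf: float) -> float:
    return mcf * MCF_TO_M3
```

Centralizing these constants, whether in a Python module or a dbt macro, is what prevents the three-spreadsheets-three-factors problem.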
Time-series resampling. SCADA data arrives at irregular intervals with gaps from communication outages. dbt models can implement consistent resampling and gap-filling logic that produces the clean, regular-interval time series that dashboards and ML models need.
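The resampling logic is the same whether it lives in a dbt model or in Python. A pandas sketch with illustrative values -- irregular readings are snapped to a regular one-minute grid, and only short gaps are forward-filled so a long outage stays visibly missing:

```python
import pandas as pd

# Irregular SCADA readings with a gap (e.g., a cellular outage)
readings = pd.DataFrame(
    {"casing_psi": [410.0, 412.0, 409.0]},
    index=pd.to_datetime([
        "2024-01-01 00:00:30",
        "2024-01-01 00:01:10",
        "2024-01-01 00:04:50",   # ~3.5-minute gap before this reading
    ]),
)

# Resample to a 1-minute grid, then forward-fill at most 2 intervals
# so stale values are not carried across long outages
regular = readings.resample("1min").mean().ffill(limit=2)
```

The equivalent in SQL is a date spine joined to the readings, but the policy decisions (grid interval, fill limit) are identical and should be defined once.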
dbt Tests for Data Quality
dbt's built-in testing framework is underappreciated in O&G. You can define tests directly in your project:
# models/production/schema.yml
version: 2

models:
  - name: daily_allocated_production
    tests:
      - dbt_utils.recency:
          datepart: day
          field: production_date
          interval: 2
    columns:
      - name: well_api
        tests:
          - not_null
          - relationships:
              to: ref('dim_wells')
              field: api_number
      - name: allocated_oil_bbls
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 5000
      - name: production_date
        tests:
          - not_null
These tests run automatically after every transformation. If production volumes go negative (sensor error), if a well API does not match the master well list (naming mismatch), or if the most recent data is more than two days old (pipeline failure), the test fails and the team is alerted. This is the kind of automated data quality monitoring that most operators do manually -- if they do it at all.
Orchestration: Scheduling, Monitoring, and Dependency Management
An orchestrator answers three questions: what runs, when does it run, and what happens when something fails. In a production data pipeline, the orchestrator ensures that SCADA data ingestion completes before the transformation starts, that the transformation completes before the dashboard refreshes, and that the data team gets a Slack alert at 6:00 AM if any of those steps failed overnight.
Dagster
Dagster is the orchestrator that Permian Resources chose, and its design philosophy explains why.
Asset-based orchestration. Dagster is built around the concept of "software-defined assets" -- each pipeline step produces a named data asset. In O&G, this maps naturally to the domain: a raw_scada_readings asset, a cleaned_production_daily asset, a well_header_master asset, a decline_curve_parameters asset. The dependency graph is defined by the relationships between these assets, not by explicit pipeline DAGs.
This is a meaningful difference from traditional workflow-based orchestrators. In Dagster, you declare that allocated_production depends on raw_scada_production and well_test_factors. Dagster figures out the execution order. If you add a new downstream asset that depends on allocated_production, it is automatically included in the graph. You do not need to modify an existing pipeline definition.
Python-native. Dagster assets are defined in Python, which makes them accessible to the Python-fluent data scientists and engineers who are increasingly common in E&P. A Dagster asset for loading well headers might look like:
import pandas as pd
from dagster import asset

@asset(
    description="Master well header table from WellView + state database",
    group_name="wells",
)
def well_header_master(
    wellview_headers: pd.DataFrame,
    state_well_registry: pd.DataFrame,
) -> pd.DataFrame:
    """Merge WellView headers with state database,
    resolve naming conflicts, add computed fields."""
    merged = wellview_headers.merge(
        state_well_registry,
        left_on="api_number",
        right_on="api_14",
        how="left",
    )
    merged["display_name"] = merged.apply(harmonize_well_name, axis=1)
    return merged
Built-in observability. Dagster provides a web UI (Dagit) that shows the status of every asset, when it was last materialized, how long it took, and what downstream assets depend on it. For an operations team that needs to know why the morning report is showing stale data, this visibility is invaluable.
dbt integration. Dagster has first-class support for dbt, which is directly relevant to the Permian Resources stack. dbt models appear as assets in the Dagster graph, with proper dependency tracking between dbt SQL models and Dagster Python assets. This means you can orchestrate a pipeline where a Python script ingests raw SCADA data, a dbt model transforms it, another Python script runs an ML anomaly detection model on the transformed data, and a final step sends alerts -- all managed in a single, visible graph.
Apache Airflow
Airflow is the most widely adopted workflow orchestrator overall, with the largest community, the most third-party integrations, and the most extensive documentation. If you search for "data orchestration" in O&G job postings, Airflow appears more frequently than any alternative.
DAG-based orchestration. Airflow organizes work as Directed Acyclic Graphs (DAGs) of tasks. Each task is an operator (PythonOperator, BashOperator, SparkSubmitOperator, etc.), and the DAG defines the execution order and dependencies. This is a more traditional workflow model than Dagster's asset-based approach.
Strengths in O&G contexts:
- Mature ecosystem with operators for every major cloud service, database, and processing engine. Need to trigger an Azure Data Factory pipeline, then run a Spark job on Databricks, then refresh a Power BI dataset? Airflow has pre-built operators for all of those.
- Large community means abundant tutorials, Stack Overflow answers, and consultant availability.
- Managed services (Amazon MWAA, Google Cloud Composer, Astronomer) reduce operational overhead.
- Airflow 3.0, released in 2025, introduces a service-oriented architecture that addresses many of the scaling and reliability complaints of earlier versions.
Limitations:
- DAGs are defined in Python, but the actual task logic is often separated from the orchestration logic, making it harder to reason about data flow.
- Testing DAGs locally is notoriously painful. The "write, push, trigger, check logs, fix, repeat" cycle slows development.
- The scheduler can become a bottleneck at scale without careful tuning.
- DAGs are workflow-centric rather than data-centric, which means Airflow tracks whether a task ran but does not inherently track what data the task produced or whether that data is fresh.
For O&G teams that already have Airflow experience or need to integrate with a wide variety of systems, Airflow remains a solid choice. For greenfield projects, the newer alternatives offer a better developer experience.
Prefect
Prefect positions itself as "Airflow without the pain." It is Python-native, with a simpler programming model and easier local development. Flows are defined as decorated Python functions:
import pandas as pd
from prefect import flow, task

@task
def ingest_scada(well_api: str) -> pd.DataFrame:
    return fetch_from_elynx(well_api)

@task
def transform_production(raw: pd.DataFrame) -> pd.DataFrame:
    return apply_allocation(raw)

@flow
def daily_production_pipeline(well_apis: list[str]):
    for api in well_apis:
        raw = ingest_scada(api)
        transform_production(raw)
Prefect's hybrid execution model (define flows locally, run them anywhere -- cloud, Kubernetes, local machine) makes it attractive for smaller teams that want the benefits of orchestration without the infrastructure overhead of running Airflow or Dagster servers.
For O&G specifically: Prefect is a good fit for teams that need something more structured than cron jobs but are not ready for the full weight of Dagster or Airflow. A reservoir engineering team that runs weekly decline curve updates, monthly reserve recalculations, and quarterly type curve analyses can define these as Prefect flows with proper scheduling, retry logic, and failure notifications.
Azure Data Factory / AWS Glue
These are the cloud-native, managed ETL/orchestration services from Microsoft and Amazon respectively. Their adoption in O&G follows cloud provider choice: Azure-heavy operators (which is most of them -- Azure holds 57% of the O&G cloud market) default to Data Factory. AWS shops use Glue.
Azure Data Factory (ADF):
- Visual, drag-and-drop pipeline designer that appeals to teams without deep Python expertise.
- Native connectors to Azure services (Blob Storage, Synapse, Databricks), SQL Server, SAP, and hundreds of other sources.
- Built-in monitoring, alerting, and retry logic.
- For operators already invested in the Microsoft ecosystem (Azure, Power BI, Office 365), ADF is the path of least resistance.
AWS Glue:
- Serverless Spark-based ETL -- you write PySpark code and AWS manages the compute.
- Automatic schema discovery and data cataloging.
- Tight integration with S3, Redshift, and other AWS services.
- Cost model based on actual compute usage, not reserved capacity.
The trade-off: Cloud-native services are easier to set up and maintain than self-managed orchestrators. But they create vendor lock-in. A pipeline built in Azure Data Factory does not run on AWS. A Dagster or Airflow pipeline can run on either cloud, or on-premise, with relatively minor changes. For operators who might change cloud providers (or who are managing multi-cloud environments after an acquisition), the portability of open-source tools is worth the additional setup complexity.
The Permian Resources Reference Architecture
Permian Resources (ticker: PR) operates more than 300,000 BOE/day across the Delaware and Midland basins. Their data architecture, confirmed through job postings and technology team references, represents the most modern data stack among mid-size upstream operators:
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ Field Data │ │ Databricks │ │ Dagster │
│ SCADA, EDR, │───▶│ Lakehouse │───▶│ Orchestr. │
│ well files │ │ (Delta Lake) │ │ │
└────────────────┘ └────────────────┘ └───────┬────────┘
│
▼
┌────────────────┐ ┌────────────────┐
│ Spotfire + │◀───│ dbt │
│ Power BI │ │ Transforms │
└────────────────┘ └────────────────┘
Why This Stack Works
Databricks Lakehouse serves as the unified storage and compute layer. Raw data from field systems lands in Delta Lake tables, maintaining full fidelity. Processed data lives in separate Delta tables that dbt manages. The lakehouse avoids the traditional split between a data lake (for raw/unstructured data) and a data warehouse (for structured/analytical queries) -- Delta Lake provides both in one platform.
Dagster orchestrates the pipeline. Its asset-based model aligns with the natural structure of O&G data: wells are assets, production records are assets, allocation tables are assets. When new SCADA data arrives, Dagster triggers the downstream transformations in the correct order. When a transformation fails (bad data from a sensor, schema change in a source system), Dagster provides observability into what failed and why.
dbt handles the SQL transformation logic. Well name harmonization, production allocation, unit conversions, time-series aggregation, and data quality testing are all defined as dbt models and tests. The logic is version-controlled, documented, and testable.
Spotfire + Power BI serve the analytics layer, with Spotfire for technical/engineering analysis and Power BI for enterprise distribution and executive reporting. Both query the curated tables that dbt produces in the Databricks lakehouse.
What Other Operators Can Learn
The Permian Resources stack is not exotic technology. Every component is available, documented, and supported. The reason most mid-size operators are not running this stack is not technology -- it is organizational. Building this requires:
1. A data engineering team that understands modern data tools, not just SQL Server and SSIS.
2. Executive sponsorship to invest in data infrastructure rather than just buying another dashboard tool.
3. Willingness to migrate from legacy processes (stored procedures, manual CSV imports, Access databases) that "work" even if they are fragile.
The return is significant. Automated, tested, monitored pipelines mean that the morning production report generates itself -- correctly -- without a production engineer spending 90 minutes copying data between systems. That engineer can spend those 90 minutes actually analyzing the data.
Data Quality Frameworks
Processing data without validating it is just moving garbage faster. Data quality is the step most operators skip, and it is the step that matters most for downstream trust.
Great Expectations
Great Expectations is an open-source Python framework for defining, running, and managing data quality tests (called "expectations"). Rather than writing ad-hoc validation scripts, you define expectations declaratively:
validator.expect_column_values_to_not_be_null("well_api")
validator.expect_column_values_to_be_between(
    "oil_bbls_per_day", min_value=0, max_value=10000
)
validator.expect_column_values_to_be_in_set(
    "state_code", ["TX", "NM", "OK", "ND", "CO", "WY"]
)
validator.expect_column_pair_values_a_to_be_greater_than_b(
    "tubing_pressure_psi", "casing_pressure_psi"
)
Great Expectations integrates with Spark, pandas, and SQL databases. It generates data documentation ("data docs") automatically and can produce validation reports that serve as audit trails.
O&G-specific value: The last expectation above -- tubing pressure should be greater than casing pressure for a flowing well -- is a physics-based validation that catches sensor errors, channel swaps, and data integration bugs that pure statistical tests miss. Domain-specific expectations like this are where data quality frameworks become genuinely valuable in oil and gas.
Soda
Soda provides a YAML-based data quality testing framework that is lighter-weight than Great Expectations and integrates with dbt. You define checks in plain YAML:
checks for daily_production:
  - row_count > 0
  - missing_count(well_api) = 0
  - invalid_count(allocated_oil_bbls) = 0:
      valid min: 0
  - freshness(production_date) < 2d
  - duplicate_count(well_api, production_date) = 0
Soda's advantage is simplicity and accessibility. Data quality checks defined in YAML are readable by anyone, including the production engineer who needs to understand why a dashboard number looks wrong.
dbt Tests (Revisited)
As discussed in the transformation section, dbt's built-in testing framework provides basic data quality validation integrated directly into the transformation pipeline. For many O&G teams, dbt tests alone provide sufficient data quality coverage without needing a separate framework.
The practical recommendation: start with dbt tests. They are free, integrated, and easy to write. Add Great Expectations or Soda when you need more sophisticated validation -- statistical profiling, cross-table comparisons, time-series anomaly detection -- that exceeds what dbt tests can express.
Testing Data Pipelines End-to-End
Data quality testing is necessary but not sufficient. You also need to test the pipeline itself -- not just whether the data is correct, but whether the pipeline runs successfully, handles edge cases, and recovers from failures.
Unit testing transformations. A dbt model or Python function that implements production allocation should have unit tests with known inputs and expected outputs. When the allocation logic changes (a new well comes online, a well is shut in), the tests verify that the updated logic handles both the new case and the existing cases correctly.
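Such a test can be as small as asserting that allocated volumes reproduce a hand-computed case and sum back to the commingled total. A sketch with an illustrative allocation function (the real logic would live in the dbt model or pipeline code):

```python
def allocate(commingled_bbls: float, factors: dict[str, float]) -> dict[str, float]:
    """Split a commingled pad volume across wells by test factors."""
    assert abs(sum(factors.values()) - 1.0) < 1e-9, "factors must sum to 1"
    return {well: commingled_bbls * f for well, f in factors.items()}

def test_allocate_two_well_pad():
    result = allocate(1000.0, {"SMITH 1-23H": 0.6, "SMITH 2-23H": 0.4})
    assert result["SMITH 1-23H"] == 600.0
    assert result["SMITH 2-23H"] == 400.0
    # Allocated volumes must sum back to the commingled total
    assert abs(sum(result.values()) - 1000.0) < 1e-9

test_allocate_two_well_pad()
```

When a well is shut in, its factor goes to zero and the remaining factors are renormalized; that case gets its own test with its own hand-computed expectation.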
Integration testing. Does the Dagster pipeline successfully ingest data from eLynx, transform it through dbt, and produce the expected output tables? Run the pipeline against a test dataset with known characteristics and verify the results.
Regression testing. After every change to the pipeline, run it against a snapshot of production data and compare the outputs to the previous version. If allocated volumes change by more than 1%, that change needs explanation.
Freshness monitoring. The most common data pipeline failure mode is silent: the pipeline stops running, but nobody notices because the dashboards still show the last successful result. Freshness checks ("this table should be updated every 24 hours") catch this failure mode.
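The check itself is trivial; the value is in running it on a schedule and wiring it to an alert. A minimal sketch, where the 24-hour threshold and the alerting hook are placeholders for your own policy:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated: datetime,
             max_age: timedelta = timedelta(hours=24)) -> bool:
    """True if the table has not been updated within max_age --
    the condition that should page the data team, because the
    dashboard will keep silently showing the last successful load."""
    return datetime.now(timezone.utc) - last_updated > max_age

# A table last loaded 30 hours ago fails a 24-hour freshness check
assert is_stale(datetime.now(timezone.utc) - timedelta(hours=30))
assert not is_stale(datetime.now(timezone.utc) - timedelta(hours=1))
```

dbt source freshness checks and Soda's `freshness(...)` check implement the same idea declaratively; the point is that some component must own it.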
Right-Sizing by Company Segment
Not every operator needs Databricks. The right stack depends on the scale of operations, the size of the data team, and the complexity of the data processing requirements.
Small Operators (100-500 wells, <$2M IT budget)
Do not use Databricks, Dagster, or dbt. The overhead of setting up and maintaining these tools exceeds the benefit at this scale.
Recommended approach:
- Processing: Python scripts (pandas or Polars) for ad-hoc analysis. SQL stored procedures or views in SQL Server/PostgreSQL for recurring transformations.
- Orchestration: Cron jobs or Windows Task Scheduler for simple scheduling. Prefect if you need retry logic and notifications.
- Data quality: Manual validation, basic SQL assertions, dbt tests if you adopt dbt.
- Storage: A single PostgreSQL or SQL Server database can hold all production data for 500 wells with years of history.
- Cost: Minimal. PostgreSQL is free. Python is free. A small Azure SQL Database costs $50-$200/month.
The goal at this scale is to get off spreadsheets and into a proper database with automated daily data loads. That alone is transformative. The morning production report goes from a 90-minute manual exercise to a 5-minute dashboard review.
Mid-Size Operators (500-5,000 wells, $5-20M IT budget)
This is the sweet spot for modern data stack adoption. The data volumes justify more sophisticated tools, and the team size can support them.
Two paths:
Path A: Cloud-native managed services. Azure Data Factory for orchestration, Azure SQL or Synapse for storage, dbt for transformations, Power BI for dashboards. Minimal infrastructure to manage. Good fit for teams with strong SQL skills but limited Python or DevOps experience.
Path B: Modern open-source stack (Permian Resources model). Databricks for processing and storage, Dagster for orchestration, dbt for transformations, Spotfire + Power BI for analytics. More powerful, more flexible, but requires a data engineering team that knows Python, Git, and cloud infrastructure.
Most mid-size operators today are running neither of these. They are running a legacy stack: SQL Server on-prem, SSIS or manual CSV imports, stored procedures for transformation, Spotfire for dashboards. This works, but it is brittle, undocumented, and prevents AI adoption. The migration path from legacy to modern should be incremental: start with dbt on top of the existing SQL Server database (dbt supports SQL Server), then add orchestration, then migrate storage to the cloud.
Large Independents and Supermajors (5,000+ wells, $20M+ IT budget)
At this scale, operators generally have dedicated data engineering teams and can justify enterprise-scale platforms.
Typical stack: Databricks or Snowflake as the primary data platform. Airflow or Dagster for orchestration (Airflow more common due to longer history, Dagster gaining share). dbt for SQL transformations. Spark for large-scale processing. Great Expectations or custom data quality frameworks. Multiple BI tools (Spotfire, Power BI, Grafana, custom web applications).
The challenge at this scale is not technology selection -- it is integration. Large operators have decades of legacy systems, multiple acquired companies with different tech stacks, and organizational silos that create parallel data pipelines. The data engineering challenge is less about "how do we process data" and more about "how do we consolidate 17 different data pipelines into a coherent architecture without breaking the production reports that 200 people depend on."
Common Mistakes
Having built data pipelines for energy companies since 2018, I have seen the same mistakes repeatedly.
Buying a BI tool before building the pipeline. An operator buys Power BI licenses, hires a dashboard developer, and then discovers that the data the dashboards need does not exist in any queryable form. Dashboards are only as good as the data pipeline feeding them. Build Layer 4 before Layer 6.
Over-engineering for the wrong scale. A 300-well operator deploying Databricks because they read that Permian Resources uses it. The tools are not the goal; the pipeline is the goal. Start with the simplest tools that solve your problem and add complexity only when the simple tools break.
Under-investing in data quality. The transformation pipeline runs every night, but nobody checks whether the data it produces is correct. Six months later, the completions team discovers that oil volumes have been double-counted for wells on commingled pads since the pipeline was deployed. Data quality tests cost almost nothing to implement and prevent errors that cost real money to find and fix.
No version control. Transformation logic lives in SQL Server stored procedures that are modified in place. Nobody can tell what changed, when, or why. When the morning report shows unexpected numbers, debugging requires comparing stored procedure text across backup files. All transformation logic -- dbt models, Python scripts, SQL views -- should be in Git. Period.
Ignoring the human element. The best data pipeline in the world is useless if the production engineers do not trust its output. Building trust requires transparency (show the logic, show the tests, show the lineage), validation against the old process (run both in parallel for a month), and responsiveness when issues arise (fix data quality bugs in hours, not weeks).
Where This Is Going
The data processing and orchestration layer in oil and gas is evolving rapidly, driven by three forces:
AI demands better data. Every AI initiative -- production forecasting, predictive maintenance, drilling optimization -- requires clean, consistent, well-documented data. The operators investing in modern data stacks are the ones who will be able to deploy AI models that work in production, not just in pilots. The 70% of operators stuck in pilot-phase digital transformation are stuck because of data infrastructure, not because of AI technology.
Real-time is becoming table stakes. Batch processing (nightly ETL) is giving way to streaming and micro-batch processing. When an ESP starts showing anomalous vibration patterns, the operator that detects it in minutes (streaming pipeline) protects production. The operator that detects it the next morning (nightly batch) may already have a failure. Databricks Structured Streaming, Kafka-based pipelines, and real-time Dagster sensors all enable this shift.
The talent pool is shifting. New petroleum engineering hires know Python and SQL. They expect modern tools. They will not accept "export to CSV and open in Excel" as a workflow. The operators that provide modern data tooling will attract better data talent -- which is already the most constrained resource in E&P technology teams.
Practical Recommendations
If you are a small operator: Get your production data into a PostgreSQL or SQL Server database with automated daily loads. Use dbt or SQL views for basic transformations. Set up Power BI dashboards. This costs less than $500/month and eliminates the spreadsheet-based morning report.
If you are a mid-size operator on a legacy stack: Start with dbt on top of your existing SQL Server. This requires zero infrastructure changes -- dbt connects to SQL Server and creates views and tables. Once the transformation logic is version-controlled and tested, evaluate whether to migrate storage to the cloud.
If you are building a greenfield data platform: Follow the Permian Resources model. Databricks for the lakehouse, Dagster for orchestration, dbt for transformations, Spotfire and/or Power BI for analytics. This is a proven architecture at upstream O&G scale.
If you are evaluating orchestrators: Choose Dagster for new projects with Python-skilled teams. Choose Airflow if you need the widest ecosystem of integrations or if your team already has Airflow experience. Choose Prefect for small teams that want simplicity. Choose Azure Data Factory if you are committed to the Microsoft ecosystem and want minimal operational overhead.
Regardless of your stack: Implement data quality testing. Start with dbt tests or Soda. Define freshness requirements for every table. Alert on failures. The cost of not testing data quality is paid in bad decisions made on bad data -- a cost that never shows up on an IT invoice but is very real to the production team.
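A freshness requirement reduces to one comparison: is the newest timestamp in the table older than the table's allowed age? Tools like dbt source freshness and Soda express this declaratively; a minimal hand-rolled sketch (thresholds illustrative) looks like this:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_stale(latest: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True if the newest row is older than the freshness requirement.

    `latest` is the max timestamp in the table; `max_age` is the table's
    freshness SLA (e.g. 2 hours for SCADA rollups, 26 hours for daily loads).
    """
    now = now or datetime.now(timezone.utc)
    return now - latest > max_age
```

Run per table after each pipeline execution and alert on True; a table that silently stops updating is one of the quietest and most expensive failure modes.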
Need help building a modern data processing stack for your operations? Get in touch.