Software & Data

Data Lake Architecture for Oil & Gas: S3, Azure Data Lake, and the Cloud Migration Reality

Dr. Mehrdad Shirangi | 25 min read

Editorial disclosure: This article reflects the independent analysis and professional opinion of the author, informed by published research, vendor documentation, industry surveys, and practitioner experience. No vendor reviewed or influenced this content prior to publication. Market share figures are cited from their original sources and may reflect specific survey populations.

Every upstream oil and gas company has a data problem. Not a data scarcity problem -- the average operator generates terabytes of sensor telemetry, well logs, production records, seismic volumes, completion reports, and maintenance histories every year. The problem is that this data lives in dozens of disconnected systems, stored in incompatible formats, governed by nobody, and accessible to very few of the people who need it.

The data lake was supposed to fix this. Dump everything into one big storage layer -- Azure Data Lake, Amazon S3, Google Cloud Storage -- and then figure out what to do with it. In practice, that approach produced what the industry affectionately calls a "data swamp": a vast repository of raw files that nobody can find, nobody trusts, and nobody uses for anything more sophisticated than emergency archaeology when a regulator asks a question.

The good news is that data architecture for upstream oil and gas has matured considerably since the early "dump it in a lake" era. The bad news is that most operators have not caught up. According to McKinsey's 2024 analysis, 70% of oil and gas companies remain stuck in the pilot phase of digital transformation, and data fragmentation -- not technology -- is the primary barrier.

This article maps the current state of data lake architecture in upstream oil and gas: which cloud platforms operators are actually using, how the modern lakehouse replaces both the traditional data lake and data warehouse, what file formats matter, how to organize data for analytics and AI, and where standards like OSDU fit. Critically, it provides different recommendations based on company size, because a supermajor's architecture has almost nothing in common with what a 200-well operator needs.


The Cloud Platform Landscape: Azure Dominates, and It Is Not Close

Before discussing architecture, the foundational question: which cloud?

A 2024 Kimberlite Research survey of geoscience and engineering professionals across the oil and gas industry found that Microsoft Azure holds 57% of the cloud market share among operators. AWS follows at approximately 17%, and Google Cloud at 13%. The remaining share is split among smaller providers and on-premises deployments.

These numbers are not accidental. Microsoft established an early foothold in energy through deep partnerships with major operators and an industry-specific cloud strategy. Several factors reinforced Azure's position:

Enterprise integration. Most oil and gas companies already ran Microsoft 365, Active Directory, and SharePoint. Azure integrated seamlessly with this existing infrastructure, reducing adoption friction. When Power BI became the default executive reporting tool, Azure's position strengthened further -- data already in Azure flows into Power BI with minimal configuration.

Strategic partnerships. Equinor built its entire Omnia data platform on Azure and Azure Data Manager for Energy (the OSDU implementation). Shell selected Azure as its primary cloud and runs Azure Databricks for machine learning workloads. Chevron deployed Azure IoT Operations across multiple sites. BP, while using Palantir Foundry as its analytics layer, runs significant Azure infrastructure underneath.

Energy-specific services. Microsoft invested in Azure Data Manager for Energy (its OSDU-compatible subsurface data management service), Azure IoT Hub for operational technology integration, and Azure Digital Twins. These energy-specific offerings gave Azure a positioning advantage that AWS and Google Cloud had to chase.

AWS: The Infrastructure Alternative

AWS holds roughly 17% of the oil and gas cloud market, with stronger adoption in North America and among companies that started their cloud journeys through general IT modernization rather than energy-specific initiatives. AWS's strengths in upstream include:

  • S3 as the storage standard. Amazon S3 effectively defined the object storage paradigm. Even operators on Azure often benchmark against S3's pricing, durability, and tiering model.
  • Baker Hughes partnership. Baker Hughes built its Leucipa production automation platform on AWS and deploys its Cordant industrial AI software on AWS infrastructure.
  • OSDU on AWS. Amazon offers Energy Data Insights (EDI), its managed OSDU implementation, providing an alternative to Azure's Data Manager for Energy.
  • SageMaker. AWS SageMaker is a capable ML platform, and Baker Hughes leverages it for Leucipa's AI models.

AWS tends to be chosen by operators whose IT organizations had existing AWS experience from non-energy workloads, or by companies working closely with Baker Hughes or Halliburton (whose DecisionSpace 365 Essentials is AWS-hosted).

Google Cloud: The Data Analytics Play

Google Cloud holds approximately 13% of the oil and gas market and is growing, though from a smaller base. Google's differentiation is primarily analytical:

  • BigQuery. Google's serverless data warehouse is arguably the most powerful analytical query engine available, and its pricing model (pay per query rather than per cluster) appeals to operators with variable analytical workloads.
  • OSDU on Google Cloud. Google is an OSDU implementation partner, and Aramco's CNTXT joint venture (with Cognite) operates as a Google Cloud reseller in MENA.
  • AI and ML. Google's Vertex AI platform and TensorFlow ecosystem are technically strong, though less commonly adopted in upstream than Azure ML or SageMaker.

Google Cloud's upstream penetration remains limited primarily because the enterprise relationship is weaker. Oil and gas companies do not run Google Workspace the way they run Microsoft 365, so Google lacks the "land and expand" path that Microsoft exploits. Google's growth in energy is more likely to come through specific analytical use cases and OSDU adoption than through broad infrastructure displacement.

The Multi-Cloud Reality

Supermajors and large independents typically operate in a multi-cloud environment, though with a clear primary. ExxonMobil uses Azure for IoT but Snowflake (which runs on all three clouds) as its central data hub. Shell is primarily Azure but uses Google Cloud for some workloads. Most mid-size and small operators, however, are single-cloud -- and that cloud is almost always Azure.

The practical implication: if you are building data infrastructure for an upstream oil and gas company, design for Azure first. The talent pool, the vendor ecosystem, the integration patterns, and the existing enterprise infrastructure all point in the same direction. This does not mean Azure is technically superior for every workload. It means the switching costs and ecosystem effects make it the default.


Data Lake vs. Data Warehouse vs. Lakehouse: What Actually Matters

The terminology has shifted rapidly, and the distinctions matter for how you architect an upstream data platform.

The Traditional Data Lake

A data lake is a storage repository that holds raw data in its native format -- structured, semi-structured, and unstructured -- at any scale. In upstream oil and gas, a data lake might contain:

  • Time-series sensor data from SCADA and historians (structured, high-volume)
  • Well logs in LAS format (semi-structured, domain-specific)
  • Seismic data in SEG-Y format (binary, massive -- terabytes to petabytes)
  • Completion reports as PDFs and Word documents (unstructured)
  • Production databases exported as CSV or Parquet files (structured)
  • Maintenance records from CMMS systems (structured, inconsistent)
  • Geospatial data as shapefiles and GeoJSON (semi-structured)

The data lake's value proposition was simple: store everything cheaply, in its original format, and figure out the analytics later. Azure Data Lake Storage Gen2, Amazon S3, and Google Cloud Storage all serve this function at costs of $0.01-0.02 per GB per month for infrequently accessed data.

The problem, as the industry learned, is that raw storage without structure produces chaos. When a production engineer needs to find the most recent well test for a specific well, a data lake containing millions of files with inconsistent naming conventions is worse than a filing cabinet. At least the filing cabinet has labels.

The Traditional Data Warehouse

Data warehouses -- Snowflake, Azure Synapse Analytics, Amazon Redshift, Google BigQuery -- solve the structure problem. They store data in highly organized, schema-enforced tables optimized for analytical queries. The data is clean, consistent, and queryable via standard SQL.

ExxonMobil's adoption of Snowflake as its central data hub reflects this approach: take the chaos of raw operational data, transform it into clean analytical tables, and provide a single source of truth that the entire organization can query.

The data warehouse's limitation is inflexibility. Loading data into a warehouse requires defining the schema upfront, transforming the data to fit, and maintaining the schema as requirements change. For upstream oil and gas, where data formats range from 1970s-era LAS files to modern JSON APIs, and where the same well might have different identifiers in different systems, the transformation burden is substantial. Warehouses also struggle with unstructured data (documents, images, seismic binaries) and extremely high-frequency time-series data.

The Lakehouse: Where the Industry Is Heading

The lakehouse architecture -- pioneered by Databricks with Delta Lake and now offered by most major platforms -- combines the storage flexibility of a data lake with the query performance and data quality features of a data warehouse.

In a lakehouse, data is stored in open file formats (typically Parquet, sometimes Delta or Iceberg) on cloud object storage (ADLS Gen2, S3, GCS). A metadata layer provides:

  • ACID transactions. Changes to data are atomic and consistent, preventing the corrupted-read problems that plague raw data lakes.
  • Schema enforcement and evolution. Tables have defined schemas, but those schemas can evolve as requirements change without rewriting all existing data.
  • Time travel. Previous versions of data are retained, allowing queries against historical states. When a production allocation error is discovered, you can see what the data looked like before the error.
  • Unified batch and streaming. The same tables can receive both batch loads (daily production imports) and streaming updates (real-time sensor data).

Databricks Delta Lake is the most widely adopted lakehouse platform in upstream oil and gas. Permian Resources runs a confirmed Databricks lakehouse with Dagster for orchestration and dbt for transformations -- one of the most progressive data stacks among mid-size operators. Devon Energy uses Databricks. Shell runs Azure Databricks for machine learning workloads. The pattern is clear: operators that have moved beyond SQL Server and Excel are moving to lakehouses, not to traditional data warehouses.

Snowflake, while architecturally a data warehouse, has been adding lakehouse-like features (Iceberg Tables, external tables on S3/ADLS) that blur the distinction. Microsoft Fabric, launched in 2023, bundles lakehouse capabilities with the Power BI and Azure ecosystem and is likely to gain significant share in oil and gas simply because it reduces the number of services operators must manage.


File Formats: The Unsexy Foundation That Determines Everything

The file formats you choose for storing data in a lake or lakehouse determine query performance, storage costs, interoperability, and the range of tools that can access the data. In upstream oil and gas, several formats coexist:

Parquet: The Analytical Standard

Apache Parquet is the columnar storage format that underpins most modern analytical workloads. When production data, well header information, completion parameters, or any structured dataset lands in a lakehouse, it should be stored as Parquet.

Why Parquet matters for upstream:

  • Columnar storage. A query that needs only the oil production rate column from a table containing 50 columns reads only that column, reducing I/O by 90%+ compared to row-based formats like CSV.
  • Compression. Parquet files are typically 5-10x smaller than equivalent CSV files due to column-level compression algorithms (Snappy, ZSTD, Gzip).
  • Schema embedded. The file contains its own schema, eliminating the "what are these columns?" problem that plagues CSV files.
  • Universal support. Every modern analytics tool -- Databricks, Snowflake, BigQuery, Spark, Pandas, Power BI, DuckDB -- reads Parquet natively.

For production time-series data, a well with hourly readings generates roughly 8,760 rows per year. A 1,000-well operator accumulates 8.76 million rows per year per parameter. In CSV format, this might be 500 MB. In Parquet with Snappy compression, it is 50-80 MB -- and analytical queries run 10-100x faster.

LAS (Log ASCII Standard): The Subsurface Workhorse

LAS files have stored well log data since the 1980s. The format is simple -- a header section with well metadata followed by columns of depth-indexed measurements (gamma ray, resistivity, density, neutron porosity). LAS 2.0 is the dominant version; LAS 3.0 added support for multiple sections and metadata but has seen slower adoption.

The challenge with LAS files in a modern data lake is that they are not analytically friendly. They require parsing into structured formats before they can be queried alongside production data, completion data, or other well information. A common pattern is to ingest raw LAS files into the lake's "bronze" layer (discussed below), parse them into Parquet tables in the "silver" layer, and join them with other well data in the "gold" layer.
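The bronze-to-silver parsing step can be illustrated with a deliberately minimal LAS 2.0 reader: just enough to lift curve mnemonics from the ~Curve section and depth-indexed values from the ~ASCII section. Real LAS files have wrapped-line modes, null values, and other edge cases this sketch ignores; production pipelines would use a dedicated library such as lasio:

```python
def parse_las(text):
    """Extract curve names and data rows from a simple LAS 2.0 string."""
    curves, rows, section = [], [], None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("~"):
            section = line[1].upper()  # 'C' = Curve, 'A' = ASCII data, ...
            continue
        if section == "C":
            # "GR   .GAPI : Gamma Ray" -> mnemonic is everything before the first dot
            curves.append(line.split(".", 1)[0].strip())
        elif section == "A":
            rows.append([float(v) for v in line.split()])
    return curves, rows

sample = """~Curve
DEPT .FT   : Depth
GR   .GAPI : Gamma Ray
~ASCII
1000.0  85.2
1000.5  88.7
"""
curves, rows = parse_las(sample)
print(curves)   # ['DEPT', 'GR']
print(rows[0])  # [1000.0, 85.2]
```

Once in this shape, the rows are trivially written out as a Parquet table keyed on the canonical well ID, ready to join against production and completions data.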

More than 100,000 new LAS files are generated annually across the U.S. upstream industry. Any data lake architecture that ignores LAS is ignoring the subsurface.
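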

SEG-Y: The Seismic Elephant

SEG-Y is the standard format for seismic data, and it presents a unique storage challenge: a single 3D seismic survey can be 1-50 TB. Seismic data is the largest single data type in upstream oil and gas, and it has historically been stored on tape or local NAS because cloud egress costs for moving terabytes were prohibitive.

This is changing. Azure and AWS now offer storage tiers (Azure Archive, S3 Glacier) where seismic data can be stored for $0.002/GB/month. The economics now favor cloud storage for archival seismic, with retrieval times of hours rather than minutes being acceptable for data that is accessed infrequently. Active interpretation workflows still typically use local or high-performance cloud storage.
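A back-of-envelope check makes the archive economics concrete, using the $0.002/GB/month figure cited above and a hypothetical 10 TB survey:

```python
# Archive-tier cost for a 10 TB SEG-Y survey at the cited rate.
tb = 10
gb = tb * 1024                 # 10,240 GB
monthly = gb * 0.002           # $/month at the archive tier
annual = monthly * 12
print(f"${monthly:.2f}/month, ${annual:.2f}/year")  # $20.48/month, $245.76/year
```

Roughly $250 a year to keep 10 TB of seismic durable and retrievable is why the tape library argument has largely collapsed for archival data.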

OSDU provides a standardized approach to indexing and cataloging seismic data in the cloud, even when the raw SEG-Y files remain in bulk storage.

CSV: The Format That Will Not Die

CSV files remain the most common data exchange format in upstream oil and gas. Production reports from state agencies, well test data exported from SCADA, completions summaries from service companies -- they all arrive as CSV. The format has no schema, no type enforcement, no compression, and no columnar optimization. It is terrible for analytics and perfect for interchange because every tool in existence can read it.

In a well-designed data lake, CSV files are transient: they land in the raw/bronze layer and are immediately converted to Parquet for analytical use. The raw CSVs are retained for lineage and audit purposes but are never queried directly.

Delta Lake, Apache Iceberg, and Apache Hudi: The Table Formats

These are not file formats but table formats -- metadata layers that sit on top of Parquet files and provide transactional capabilities:

  • Delta Lake (Databricks): The most adopted in oil and gas due to Databricks' penetration. Provides ACID transactions, time travel, schema evolution.
  • Apache Iceberg (Netflix/Apache): Growing rapidly; adopted by Snowflake (Iceberg Tables), AWS (Athena), and increasingly Databricks (UniForm). Vendor-neutral, strong partitioning.
  • Apache Hudi (Uber/Apache): Less common in upstream; strengths in incremental processing of streaming data.

The convergence trend is real: Databricks UniForm allows Delta tables to be read as Iceberg, and Snowflake can now query both. For operators, this means the table format choice is becoming less of a lock-in decision than it was two years ago. If forced to choose today for an upstream deployment, Delta Lake remains the pragmatic default due to Databricks' ecosystem in oil and gas.


Medallion Architecture: Bronze, Silver, Gold

The medallion architecture (sometimes called multi-hop architecture) has become the standard pattern for organizing data in a lakehouse. It provides a clear progression from raw data to analytics-ready data, and it maps well to upstream oil and gas data flows.

Bronze Layer: Raw Ingestion

The bronze layer contains data exactly as it arrived from source systems, with minimal transformation. For an upstream operator, this includes:

  • Raw SCADA telemetry (from eLynx, zdSCADA, CygNet, or AVEVA PI exports)
  • Well log files (LAS, DLIS) as received from logging companies
  • Production data exports from field systems (CSV, XML)
  • Drilling data (WITSML feeds from Pason EDR or Corva)
  • Completion reports (PDF, Word, CSV)
  • Land and lease files (database exports, GIS data)
  • Third-party data (Enverus feeds, state regulatory filings, FracFocus)

The bronze layer serves two purposes: it preserves the raw data for audit and lineage (you can always trace back to the original source), and it decouples ingestion from transformation. Source systems can dump data into bronze without worrying about downstream schema requirements.

Storage format: Raw source format plus metadata (ingestion timestamp, source system, file hash). Large operators may convert to Parquet at ingestion; smaller operators often retain source formats.

Typical volume for a 1,000-well operator: 1-5 TB per year, growing 20-30% annually as sensor density increases.
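The bronze-layer metadata described above (ingestion timestamp, source system, file hash) can be captured in a small sidecar record. The field names here are illustrative assumptions, not a standard schema:

```python
import hashlib
from datetime import datetime, timezone

def bronze_metadata(payload: bytes, source_system: str, filename: str) -> dict:
    """Build the audit/lineage record stored alongside a raw bronze file."""
    return {
        "filename": filename,
        "source_system": source_system,
        "ingested_at_utc": datetime.now(timezone.utc).isoformat(),
        # SHA-256 of the raw bytes proves the landed file is byte-identical
        # to what the source system sent.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
    }

raw = b"WELL,DATE,OIL_BBL\n42-123-45678,2025-01-01,412\n"
meta = bronze_metadata(raw, "scada_export", "prod_20250101.csv")
print(meta["size_bytes"])  # 46
```

The payload itself is stored untouched; only the sidecar record is added. That separation is what lets you trace any downstream number back to the exact bytes that arrived.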

Silver Layer: Cleaned, Conformed, Validated

The silver layer contains data that has been cleaned, validated, deduplicated, and conformed to standard schemas. This is where the hard work of data engineering happens:

  • Well identification normalization. The same well might be identified by API number in SCADA, UWI in the drilling database, a lease name in the production system, and a completions ID in the frac database. The silver layer maps all identifiers to a single canonical well ID.
  • Unit conversion. Sensor data arrives in mixed units (psi and kPa, bbl and m3, MCF and e3m3). The silver layer standardizes to a single unit system.
  • Time zone normalization. Field data from the Permian Basin arrives in Central time, but the Denver office runs on Mountain time, and the cloud infrastructure runs on UTC. The silver layer normalizes to UTC with timezone metadata.
  • Quality flagging. Sensor readings outside physical ranges (negative pressures, flow rates exceeding pipe capacity) are flagged, not deleted. The raw value is preserved; the flag indicates it should be excluded from analytical queries unless explicitly included.
  • Deduplication. The same well test result might arrive via SCADA export, a manual spreadsheet, and a service company report. The silver layer identifies and resolves duplicates.
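Two of the silver-layer rules above, quality flagging and deduplication, can be sketched directly. The thresholds and field names are illustrative assumptions, not industry-standard values; the important property is that bad readings are flagged rather than deleted:

```python
def flag_reading(reading: dict) -> dict:
    """Attach quality flags; never modify or drop the raw values."""
    flags = []
    if reading.get("tubing_psi", 0) < 0:
        flags.append("negative_pressure")
    if reading.get("oil_rate_bopd", 0) > 50_000:  # beyond plausible capacity (illustrative)
        flags.append("rate_out_of_range")
    return {**reading, "quality_flags": flags}

def dedupe(readings: list[dict]) -> list[dict]:
    """Keep one row per (well, timestamp); later sources overwrite earlier ones."""
    seen = {}
    for r in readings:
        seen[(r["well_id"], r["ts_utc"])] = r
    return list(seen.values())

rows = [
    {"well_id": "A1", "ts_utc": "2025-01-01T00:00", "tubing_psi": -5.0, "oil_rate_bopd": 400.0},
    {"well_id": "A1", "ts_utc": "2025-01-01T00:00", "tubing_psi": 1850.0, "oil_rate_bopd": 400.0},
]
clean = [flag_reading(r) for r in dedupe(rows)]
print(len(clean), clean[0]["quality_flags"])  # 1 []
```

In practice these rules live in dbt tests or Spark jobs rather than loose functions, but the logic is the same: flag, resolve, preserve.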

Transformation tooling. dbt (data build tool) has emerged as the standard for SQL-based transformations in the silver layer. Permian Resources uses dbt on Databricks for exactly this purpose. For Python-based transformations (parsing LAS files, processing non-tabular data), Apache Spark on Databricks or custom Python scripts managed by orchestrators like Dagster or Airflow are common.

Storage format: Parquet with Delta Lake or Iceberg table format. Partitioned by well, date, or data type depending on access patterns.

Gold Layer: Business-Ready, Analytics-Optimized

The gold layer contains purpose-built analytical datasets designed for specific business consumers:

  • Production dashboard tables. Pre-aggregated daily/weekly/monthly production by well, lease, field, and area. Optimized for Power BI or Spotfire consumption.
  • Decline curve analysis datasets. Normalized production time series with calculated decline parameters, ready for type curve analysis.
  • Completions benchmarking tables. Well completions parameters joined with 6-month and 12-month cumulative production, enabling completion design comparison.
  • Regulatory reporting tables. Production volumes, injection volumes, and calculated allocation in the format required by state regulatory agencies.
  • ESP/rod pump surveillance tables. Real-time and historical artificial lift parameters with calculated health indicators.
  • Financial datasets. Revenue, LOE, CAPEX, and net income by well, rolled up to lease, field, and company level.

The gold layer is where data becomes information. Each gold table is designed for a specific analytical consumer -- a dashboard, a report, an ML model, or a specific engineering workflow. Gold tables have defined owners, documented schemas, and refresh schedules.

Storage format: Parquet/Delta, heavily partitioned and optimized for the specific query patterns of the consuming application.
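The pre-aggregation that produces a gold dashboard table is conceptually simple: roll silver-layer hourly rows up to daily production per well. A stdlib sketch, with illustrative field names (real implementations would do this in dbt or Spark):

```python
from collections import defaultdict

def daily_rollup(hourly_rows: list[dict]) -> list[dict]:
    """Aggregate hourly oil volumes to one row per well per day."""
    totals = defaultdict(float)
    for r in hourly_rows:
        day = r["ts_utc"][:10]  # '2025-01-01T03:00' -> '2025-01-01'
        totals[(r["well_id"], day)] += r["oil_bbl"]
    return [
        {"well_id": w, "date": d, "oil_bbl": round(v, 2)}
        for (w, d), v in sorted(totals.items())
    ]

hourly = [
    {"well_id": "A1", "ts_utc": "2025-01-01T00:00", "oil_bbl": 17.2},
    {"well_id": "A1", "ts_utc": "2025-01-01T01:00", "oil_bbl": 16.9},
    {"well_id": "A1", "ts_utc": "2025-01-02T00:00", "oil_bbl": 16.5},
]
daily = daily_rollup(hourly)
print(daily[0])  # {'well_id': 'A1', 'date': '2025-01-01', 'oil_bbl': 34.1}
```

Because the rollup runs once on refresh rather than on every dashboard view, Power BI queries hit a small pre-aggregated table instead of millions of hourly rows.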

The Medallion Pattern in Practice

The most honest assessment of medallion architecture adoption in upstream oil and gas: supermajors and the most progressive mid-size operators (Permian Resources, Devon Energy) implement some version of this pattern. The majority of mid-size operators have, at best, an accidental bronze layer (a file share full of CSVs and exports) and no structured silver or gold layers. Small operators have no data lake at all.

This represents an opportunity. The pattern is well understood, the tooling (Databricks, dbt, Dagster) is mature, and the transformation from chaotic file shares to structured lakehouses can be done incrementally -- one data domain at a time.


OSDU: The Industry Standard That Is Not Quite Standard Yet

The Open Subsurface Data Universe (OSDU) deserves special discussion because it represents the industry's most ambitious attempt to standardize data management across upstream oil and gas.

What OSDU Is

OSDU is an open-source, cloud-native data platform developed under The Open Group, with contributions from major operators (Chevron, ExxonMobil, BP, Shell, TotalEnergies, Equinor) and service companies (SLB, Halliburton). It provides:

  • A common data model for subsurface and well data (built on Energistics standards: WITSML, PRODML, RESQML)
  • Cloud-native APIs for ingesting, querying, and managing E&P data
  • Implementation on major clouds: Azure Data Manager for Energy (Microsoft), Energy Data Insights (AWS), Google Cloud OSDU
  • Vendor-neutral data access so applications from different vendors can read the same data

The vision is compelling: a single data platform that replaces the dozens of incompatible databases currently used across exploration, drilling, completions, production, and reservoir engineering. Any application -- from SLB's Petrel to Halliburton's DecisionSpace 365 to a custom Python script -- could read from and write to the same OSDU-compatible data store.

Where OSDU Stands in 2026

The honest assessment: OSDU is real but slow. According to participants at the 2025 OSDU Forum Houston Summit, a Chevron leader described 2026 as "the year we turn the corner," with the Data Platform Standard v1 expected to launch.

That statement contains the reality check: it is 2026 and the version 1.0 standard is still not officially released. The supermajors -- Chevron, ExxonMobil, Shell, Equinor, TotalEnergies -- have adopted OSDU in various stages. Equinor uses Azure Data Manager for Energy as part of its Omnia platform. TotalEnergies published a case study on OSDU adoption in Suriname. But adoption beyond the supermajor tier has been minimal.

Why OSDU adoption has lagged:

  • Complexity. OSDU is enterprise-grade infrastructure designed for organizations with dedicated data platform teams. A mid-size operator with two data engineers cannot implement and maintain it.
  • Cost. Azure Data Manager for Energy is not cheap. The infrastructure costs, plus the professional services required for implementation, put OSDU out of reach for most mid-size operators.
  • Scope. OSDU primarily addresses subsurface data (well logs, seismic, reservoir models). It does not comprehensively cover production operations, SCADA data, financial data, or maintenance data -- which is where most operators' data pain actually lives.
  • Chicken-and-egg. Applications must be OSDU-compatible to benefit from the platform, but vendors have limited incentive to build OSDU compatibility until a critical mass of operators adopt the platform.

OSDU Recommendations by Company Size

Supermajors: Already adopting. Continue.

Large independents (Devon, EOG, Diamondback, ConocoPhillips): Monitor closely. Evaluate Azure Data Manager for Energy as a subsurface data management layer, but do not restructure your entire data architecture around OSDU until v1.0 is stable and vendor support is broad.

Mid-size operators: OSDU is not for you today. Your priority is getting production data, SCADA data, and well data into a functional lakehouse. OSDU may become relevant in 3-5 years if adoption reaches critical mass.

Small operators: Ignore OSDU entirely. Your data architecture challenge is getting off Excel, not implementing an industry-standard subsurface data platform.


The On-Prem vs. Cloud Debate: Settled in Theory, Messy in Practice

In theory, the cloud migration debate in oil and gas is settled: cloud wins. In practice, it is far more nuanced.

What Has Moved to the Cloud

  • Analytics and BI. Power BI, Spotfire Cloud, and Grafana Cloud are accessed through browsers. The analytical layer is effectively cloud-native.
  • Data storage for analytics. Lakehouse platforms (Databricks, Snowflake) run in the cloud. Historical production data, well headers, and completions data increasingly live in cloud data lakes.
  • ML/AI workloads. Training machine learning models requires elastic compute that cloud provides. Azure ML, SageMaker, and Databricks MLflow run in the cloud by design.
  • SaaS applications. Corva, Enverus, Novi Labs, GreaseBook -- modern E&P software is SaaS-delivered from the cloud.

What Has Not Moved (and May Not)

  • SCADA and process control. Operational Technology (OT) systems that control physical equipment -- SCADA servers, PLC programming, safety instrumented systems -- remain overwhelmingly on-premises. The latency, reliability, and safety requirements of process control do not tolerate cloud round-trips. Even "cloud SCADA" products like zdSCADA and eLynx are really cloud monitoring overlays on top of field-level OT infrastructure.
  • Process historians. AVEVA PI System, the dominant operational historian used by 85% of top oil and gas companies, typically runs on-premises at production facilities. AVEVA offers PI Cloud (now CONNECT) for cloud-based access to historian data, but the primary data collection and storage often remains local.
  • Subsurface interpretation workstations. Petrel, Kingdom, and other subsurface interpretation tools still commonly run on high-performance local workstations with dedicated GPUs, though SLB's Delfi platform is pushing Petrel toward the cloud.
  • Legacy engineering databases. PHDWin, OFM, and other petroleum engineering tools often run against local SQL Server or Access databases. Migration is slow because the tools themselves were not designed for cloud backends.

The Hybrid Architecture That Actually Works

The most practical architecture for most upstream operators is hybrid: OT systems remain on-premises at field locations, a bridge layer copies data to the cloud (typically on a 1-15 minute delay), and all analytics, reporting, and AI run in the cloud.

The bridge layer is the critical piece. Common approaches include:

  • AVEVA PI-to-cloud connectors. PI CONNECT (formerly PI Cloud) streams historian data to Azure. Since 85% of operators have PI, this is the most common bridge.
  • MQTT/Kafka brokers. Field devices publish to local MQTT brokers (HiveMQ, EMQX, Mosquitto), which forward to cloud event ingestion (Azure Event Hubs, AWS IoT Core). Apache Kafka provides persistent buffering that handles connectivity interruptions -- critical for remote wellsites with unreliable cellular connections.
  • OPC-UA gateways. Kepware (PTC), Softing, and HMS Networks provide gateways that translate legacy Modbus/HART protocols to OPC-UA, which can then be bridged to cloud MQTT or REST endpoints.
  • Vendor-managed bridges. eLynx, zdSCADA, and Ignition provide built-in cloud data forwarding from their SCADA platforms.

The key design principle: the cloud should never be in the control path. Data flows from field to cloud for analytics. Control commands flow from local systems to field equipment. If the internet goes down, production continues uninterrupted.
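The store-and-forward behavior the bridge layer needs can be sketched without any broker at all: readings queue locally while the uplink is down and flush in order on reconnect. Here `send_to_cloud` is a stand-in for a real MQTT or Event Hubs publish call, and the whole class is an illustration of the buffering pattern, not any vendor's implementation:

```python
from collections import deque

class BufferedUplink:
    """Queue readings locally; deliver in order once the uplink is back."""

    def __init__(self, send_to_cloud):
        self.send = send_to_cloud
        self.buffer = deque()

    def publish(self, reading: dict) -> None:
        self.buffer.append(reading)
        self.flush()

    def flush(self) -> None:
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return  # uplink down: keep buffering, retry on next publish
            self.buffer.popleft()

# Simulate an outage followed by a reconnect.
sent, up = [], {"online": False}
def fake_send(r):
    if not up["online"]:
        raise ConnectionError
    sent.append(r)

link = BufferedUplink(fake_send)
link.publish({"tubing_psi": 1850})   # uplink down: buffered locally
up["online"] = True
link.publish({"tubing_psi": 1849})   # reconnect: both delivered, in order
print(len(sent))  # 2
```

Note what the sketch deliberately omits: there is no control path. Data only ever flows outward, which is exactly the design principle above.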


Data Governance: The Unsexy Topic That Determines Success or Failure

Data governance -- who owns the data, who can access it, how quality is maintained, how lineage is tracked -- is the least exciting and most consequential aspect of data architecture. In upstream oil and gas, governance failures are pervasive.

The Typical Governance Problem

Consider a mid-size Permian Basin operator with 800 wells. The production data exists in at least four places:

  1. SCADA (eLynx or zdSCADA) -- real-time wellhead measurements
  2. Allocation system (OFM, ProdView, or custom) -- allocated production volumes
  3. State regulatory submissions (Texas RRC) -- monthly reported volumes
  4. Financial system (P2 or Quorum) -- revenue-basis volumes

These four systems show different production numbers for the same well on the same day. The SCADA reading is raw and unallocated. The allocation system applies test ratios and back-allocations. The regulatory submission uses a specific reporting methodology. The financial system uses sales volumes, which differ from production volumes by pipeline losses and NGL recovery.

Without governance, nobody knows which number to trust for any given purpose. The production engineer uses one number. The reservoir engineer uses another. The finance team uses a third. The SEC reserves report uses a fourth. Arguments ensue.

Governance Framework for Upstream Data

A practical data governance framework for upstream oil and gas does not require a governance committee, a data catalog vendor, or a chief data officer (though larger operators may benefit from all three). It requires:

Data ownership by domain. Production data is owned by the production engineering team. Drilling data is owned by the drilling engineering team. Each domain owner is responsible for defining the authoritative source, the refresh cadence, and the quality rules for their domain.

A single well master. Every data system must map to a single canonical well identifier. This identifier links the SCADA tag, the API number, the WellView well name, the lease name, and the regulatory permit number. Building this well master is often the single most valuable data engineering task an operator can undertake.

Data quality rules, enforced in code. dbt tests, Great Expectations, or custom validation scripts that run automatically on every data load. For production data: no negative volumes, no flow rates exceeding physical capacity, no wells producing before completion date, no allocation totals exceeding facility throughput.

Access control by role. Not everyone needs access to everything. Reservoir engineers do not need to see land data. Financial data should be restricted to authorized personnel. Cloud platforms (Azure RBAC, AWS IAM, Snowflake roles) provide the mechanisms; the governance framework defines the policies.

Lineage and auditability. For any number in a dashboard or report, you should be able to trace it back to the source system, the transformation that produced it, and the raw data it was derived from. Delta Lake's time travel feature and dbt's data lineage graphs provide this capability.


Security and Compliance: Not Optional

Oil and gas data carries specific security and compliance requirements that generic cloud architectures may not address:

Regulatory Compliance

  • SEC reserves reporting. Reserve estimates are material nonpublic information. The data and models used to generate reserves estimates must have appropriate access controls and audit trails.
  • State regulatory reporting. Production volumes reported to state agencies (Texas RRC, Oklahoma OCC, North Dakota NDIC) must be traceable from source data through allocation calculations to final submission.
  • SOX compliance. Publicly traded operators must maintain internal controls over financial data, including production revenue data that flows into financial statements.
  • EPA emissions reporting. Emissions data, increasingly collected via continuous monitoring (Qube Technologies, LongPath), must meet EPA Subpart W and OOOOb requirements.

Data Security

  • Well and acreage data. Detailed well performance data and undeveloped acreage positions are competitively sensitive. Leakage could affect M&A valuations and competitive drilling decisions.
  • OT/IT convergence. As SCADA data flows to the cloud for analytics, the attack surface for operational technology expands. OT network segmentation, even in cloud-connected environments, is essential.
  • Third-party data sharing. Operators share data with service companies, partners, and regulators. Data lake architectures must support granular sharing without exposing the entire dataset.

Security Architecture Patterns

  • Private endpoints. Azure Private Link and AWS PrivateLink ensure data traffic between cloud services stays within the cloud provider's network, never traversing the public internet.
  • Encryption at rest and in transit. All three major clouds encrypt data at rest by default. Customer-managed encryption keys (CMEK) provide additional control for operators with strict security requirements.
  • Network segmentation. SCADA and OT systems should be on isolated network segments with controlled, one-directional data flows to the cloud analytics environment. The ISA/IEC 62443 standard provides a framework for industrial network security.
  • Identity management. Single sign-on (SSO) via Azure AD or Okta, with multi-factor authentication (MFA) and role-based access control (RBAC). Federated identity between cloud and on-premises Active Directory.

Architecture Recommendations by Company Size

The single biggest mistake in data architecture discussions for oil and gas is applying supermajor patterns to mid-size operators, or mid-size patterns to small operators. What follows are honest, practical recommendations for each segment.

Supermajors (ExxonMobil, Shell, Chevron, BP, Equinor, TotalEnergies)

What they have: Dedicated data platform teams (50-500+ people), $100M-$1B+ annual digital/IT budgets, multi-cloud environments, custom data platforms.

What they should focus on: These companies are already past the basics. Their challenges are platform consolidation (reducing the number of overlapping tools), OSDU adoption (standardizing data across business units), and scaling AI/ML from pilot to production deployment. Lakehouse architecture is a given; the question is how to federate data across global operations.

Reference stack: Equinor's Omnia (Azure, Cognite CDF, Azure Data Manager for Energy, Radix PaaS, EurekaML) represents the current state of the art for supermajor data architecture.

Large Independents (Devon, ConocoPhillips, EOG, Diamondback, Pioneer successor)

Budget: $20-100M annual IT spend. Data engineering teams of 5-30 people.

Recommended architecture:

  • Cloud: Azure (primary), single-cloud
  • Lakehouse: Databricks on Azure or Microsoft Fabric
  • Storage: Azure Data Lake Storage Gen2, Parquet/Delta format
  • Orchestration: Apache Airflow or Dagster
  • Transformation: dbt for SQL transformations, Spark for heavy processing
  • BI: Power BI (enterprise distribution) + Spotfire (engineering analysis)
  • Historian bridge: AVEVA PI CONNECT to Azure
  • Data warehouse: Snowflake or Databricks SQL for structured analytical queries
  • OSDU: Evaluate but do not commit until v1.0 is stable

Devon Energy confirmed stack: Databricks, SQL Server, SAP HANA, Oracle Essbase -- a mix of modern and legacy that is typical for this segment.

Mid-Size Operators (Permian Resources, Matador, Crescent, Callon, Civitas)

Budget: $5-20M annual IT spend. 0-5 dedicated data engineers.

Current reality for most: SQL Server databases, Excel spreadsheets, Spotfire or Power BI dashboards connected to manual data exports, AVEVA PI historian on-premises at the central office, minimal cloud adoption beyond SaaS applications.

Recommended architecture:

  • Cloud: Azure (single-cloud)
  • Storage: Azure Data Lake Storage Gen2 (start with a single storage account, organize by domain: production, drilling, completions, land)
  • Lakehouse: Databricks (if budget allows) or Microsoft Fabric (if deeply invested in Microsoft ecosystem)
  • Transformation: dbt Core (open source) running on Databricks or Azure
  • Orchestration: Dagster or Prefect (simpler than Airflow for small teams)
  • BI: Power BI (keep it simple; add Spotfire only if engineers demand it)
  • Historian bridge: AVEVA PI CONNECT or direct MQTT from SCADA
  • Medallion architecture: Start with one domain (production data: bronze = SCADA exports, silver = cleaned/allocated production, gold = Power BI-ready tables)
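The bronze → silver → gold flow in the last bullet can be sketched end to end in a few lines. This toy version uses the standard library in place of Spark and Delta, and the column names and sample rows are assumptions; the shape of the pipeline (raw export in, quality rules applied, BI-ready aggregate out) is the point.

```python
# Hedged sketch of the medallion flow described above: bronze = raw SCADA
# export, silver = typed and cleaned rows, gold = Power BI-ready aggregate.
# Uses csv/stdlib in place of Spark/Delta; columns and rows are assumptions.
import csv
import io

BRONZE = """well,date,oil_bbl
SMITH 1H,2024-01-01,480
SMITH 1H,2024-01-02,-5
JONES 2H,2024-01-01,310
"""

def to_silver(bronze_csv: str) -> list[dict]:
    """Silver: parse the raw export, type the fields, drop rows failing quality rules."""
    rows = []
    for r in csv.DictReader(io.StringIO(bronze_csv)):
        oil = float(r["oil_bbl"])
        if oil >= 0:                       # quality rule: no negative volumes
            rows.append({"well": r["well"], "date": r["date"], "oil_bbl": oil})
    return rows

def to_gold(silver: list[dict]) -> dict[str, float]:
    """Gold: cumulative oil per well, ready for a BI tool to consume."""
    cum: dict[str, float] = {}
    for r in silver:
        cum[r["well"]] = cum.get(r["well"], 0.0) + r["oil_bbl"]
    return cum

gold = to_gold(to_silver(BRONZE))
# gold == {"SMITH 1H": 480.0, "JONES 2H": 310.0}
```

In the real stack each function becomes a dbt model or Spark job and each layer a Delta table, but the discipline is identical: raw data is never edited in place, and every downstream number is reproducible from bronze.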

Permian Resources confirmed stack: Databricks, Dagster, dbt, Spotfire + Power BI. This is the benchmark for what a progressive mid-size operator looks like.

Realistic timeline: 3-6 months for the first domain (production data) through the full medallion pipeline. 12-18 months for drilling, completions, and reservoir data domains.

Small Operators (100-500 wells)

Budget: $100K-$2M annual IT spend. Zero dedicated data engineers.

Current reality: Excel everywhere. Production data hand-entered from GreaseBook or downloaded from eLynx. Maybe a SQL Server database that one person built five years ago and nobody else understands. Regulatory reporting done in spreadsheets.

Recommended architecture:

  • Do not build a data lake. It is overkill. You need organized data, not a platform.
  • Cloud SCADA: eLynx ($10/asset/month) or zdSCADA for automated field data collection
  • Production management: GreaseBook (cheapest), OGsys, or Quorum ODA for production tracking
  • BI: Power BI ($10/user/month) connected directly to your SCADA and production databases
  • Engineering: PHDWin for economics, Enverus for public data, OFM for decline analysis
  • Storage: SharePoint/OneDrive for documents (you already have it with Microsoft 365); Azure Blob Storage if you need scalable file storage
  • Data integration: Power BI dataflows or a simple Python script to combine production, SCADA, and economics data into a single Power BI dataset
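The "simple Python script" in the last bullet really can be this simple: join the production, SCADA, and economics extracts on a shared well identifier and write one CSV for Power BI to load. The file names, column names, and join key below are assumptions for illustration.

```python
# Sketch of the "simple Python script" idea: merge three per-well CSV
# extracts on a shared well ID into one Power BI-ready file. File names,
# columns, and the api14 join key are illustrative assumptions.
import csv

def load(path: str, key: str) -> dict[str, dict]:
    """Read a CSV into a dict keyed by the well identifier column."""
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def combine(prod_path: str, scada_path: str, econ_path: str,
            out_path: str, key: str = "api14") -> None:
    """Left-join SCADA and economics onto production rows, write one CSV."""
    prod = load(prod_path, key)
    scada = load(scada_path, key)
    econ = load(econ_path, key)
    writer = None
    with open(out_path, "w", newline="") as f:
        for api, row in prod.items():
            merged = {**row, **scada.get(api, {}), **econ.get(api, {})}
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(merged.keys()))
                writer.writeheader()
            writer.writerow(merged)
```

A script like this, scheduled nightly, delivers most of what a small operator would get from a formal data platform at a tiny fraction of the cost.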

The honest advice: A small operator does not need Databricks, does not need a medallion architecture, and does not need OSDU. The operator needs data that is collected automatically (not hand-gauged), stored in a database (not a spreadsheet), and visualized in a dashboard (not printed and stapled). Get those three things right before thinking about anything more sophisticated.


What Comes Next: The Convergence of Architecture and AI

Data architecture is not an end in itself. The entire point of organizing data into a lakehouse, implementing governance, and building medallion layers is to enable the analytics and AI workloads that create operational value: production anomaly detection, decline curve forecasting, completions optimization, predictive maintenance, automated reporting.

The operators who will capture the most value from AI in the next three to five years are those who invest in data architecture now. Not because they need a perfect data platform before starting AI -- they do not -- but because every AI project that succeeds in pilot and fails in production fails for the same reason: the data was not reliable, accessible, or governed well enough to sustain a production deployment.

The digital platforms and AI landscape we mapped previously shows the application layer. This article has mapped the data foundation that the application layer requires. The two are inseparable -- and the foundation has to come first.

For a broader view of how data architecture fits within the full upstream software ecosystem, see our master guide to oil and gas software platforms.


Need help designing a data lake architecture for your operations? Get in touch.
