Editorial disclosure
This article reflects the independent analysis and professional opinion of the author, informed by published research, vendor documentation, and hands-on experience with production data systems across multiple basins. No vendor reviewed or influenced this content prior to publication.
Before you build a data lake, deploy a dashboard, or train a machine learning model, you need to understand where your data comes from. That sounds obvious. In practice, it is the step most operators skip.
A mid-size Permian operator with 2,000 wells might have data flowing from wellhead pressure transducers, rod pump controllers, tank level sensors, flow meters, SCADA RTUs, drilling EDRs, frac monitoring systems, gas lift controllers, ESP panels, emissions sensors, and test separators. Each of these sources generates data in a different format, at a different frequency, through a different communication protocol, and stores it in a different system. Some of this data is streaming in real time. Some of it exists only on paper in a field office. Some of it was captured once, five years ago, and has not been looked at since.
This is the first layer in any data architecture: the data sources themselves. If you do not understand what generates your data, how it is captured, and where it goes, everything built on top of that foundation -- your dashboards, your analytics, your AI models -- is built on assumptions you have not verified.
This article catalogs the major data sources in upstream oil and gas operations, the vendors and systems that capture them, the communication protocols that move the data, and the common problems that degrade data quality before it ever reaches your database. It is written for production engineers, data engineers, and IT leaders at operators of all sizes, from 200-well privates to 5,000-well public companies.
SCADA and RTU Systems: The Backbone of Production Data
Supervisory Control and Data Acquisition (SCADA) is the foundational data system for production operations. Every piece of real-time production data that an operator sees on a dashboard or in a morning report originates from a SCADA system. Understanding the SCADA landscape is a prerequisite to understanding anything else about production data.
How SCADA Works in Upstream
A typical wellsite SCADA deployment consists of three components:
1. Field sensors -- pressure transducers, temperature probes, flow meters, level sensors, and artificial lift controllers physically installed on the well or facility.
2. Remote Terminal Unit (RTU) -- a ruggedized computer at the wellsite that collects analog and digital signals from the sensors, converts them to engineering units, and stores them locally.
3. Communication link -- cellular, radio, or satellite connection that transmits data from the RTU to the central SCADA host, where it is stored in a historian or time-series database.
The RTU is the critical node. It determines how frequently data is sampled, how it is timestamped, how much is buffered during communication outages, and what quality codes are assigned. Two operators with identical wellhead sensors but different RTUs will produce materially different datasets.
Major SCADA Vendors in Upstream O&G
Weatherford CygNet. CygNet is arguably the most widely deployed SCADA platform among mid-size and large independent operators in North America. It provides a full stack: RTU communication, data historian, alarming, and visualization. CygNet's strength is its deep integration with production workflows -- it is not just collecting data; it is the system that production engineers and field technicians interact with daily. Operators like Devon Energy and many Permian-focused independents run CygNet as their primary SCADA platform. Pricing is enterprise-level, typically six figures annually for a mid-size deployment, which puts it out of reach for smaller operators.
eLynx Technologies. eLynx occupies the opposite end of the market from CygNet. It is a cloud-hosted SCADA platform priced at approximately $10 per asset per month, making it accessible to operators of virtually any size. eLynx provides RTU hardware, cellular communication, a cloud historian, mobile monitoring, and basic alarming. For small operators with 100-500 wells, eLynx is often the first real SCADA system they deploy, replacing manual gauging and daily field visits. The trade-off is that eLynx is less configurable than enterprise SCADA platforms and has more limited integration capabilities.
Inductive Automation Ignition. Ignition is an open-architecture SCADA platform that has gained significant traction in upstream oil and gas over the past five years. Unlike traditional SCADA vendors that sell closed, proprietary systems, Ignition provides a modular platform with unlimited licensing -- you pay for the server, not per tag or per client. This pricing model is attractive for operators with large well counts. Ignition connects to virtually any RTU or PLC via OPC-UA, Modbus, or MQTT, and its Gateway architecture supports edge-to-cloud deployments. Progressive mid-size operators are increasingly adopting Ignition as a replacement for aging CygNet or legacy SCADA installations.
Emerson (DeltaV, ROC RTUs, OpenEnterprise). Emerson is a major presence in upstream SCADA, particularly through its ROC series of RTUs, which are installed at tens of thousands of wellsites across North America. Emerson's OpenEnterprise SCADA provides the host-level data collection and visualization layer, while the DeltaV DCS (Distributed Control System) is deployed at larger facilities like gas processing plants and central production facilities. Emerson also offers Plantweb Optics for asset monitoring. The Emerson ecosystem is deep but can be complex for operators who want simple wellsite-to-cloud data flow.
ABB (SCADAvantage). ABB's SCADAvantage platform is strong in pipeline SCADA and midstream applications. ABB deployed SCADAvantage on IndianOil's 20,000-kilometer pipeline network in 2025. In upstream production, ABB has a smaller footprint than Emerson or Weatherford, but its industrial automation heritage makes it relevant for operators with significant facility infrastructure.
Honeywell (Experion PKS). Honeywell's Experion Process Knowledge System is an integrated control and safety system that is dominant in downstream refining and petrochemical operations. In upstream, Experion appears primarily at large production facilities, gas plants, and offshore platforms where process safety is paramount. It is less common at individual wellsites.
Quorum zdSCADA. zdSCADA, acquired by Quorum Software in March 2025, represents the newer generation of cloud-native SCADA platforms. Like eLynx, zdSCADA is designed for the cloud from the ground up, with a focus on ease of deployment and modern data access. Quorum's broader upstream software suite (production accounting, revenue management) gives zdSCADA an integration advantage for operators already in the Quorum ecosystem.
What SCADA Systems Measure
The standard tag list for a production well monitored by SCADA includes:
- Tubing pressure (flowing wellhead pressure)
- Casing pressure
- Line pressure (downstream of the choke)
- Wellhead temperature
- Flow rate (if a wellsite meter is installed)
- Tank levels (oil, water, or produced water tanks)
- Artificial lift parameters (varies by lift type)
- Equipment status (pump on/off, valve open/closed, alarm states)
- Battery voltage (RTU power status)
- Communication status (signal strength, packet loss)
A well-instrumented pad with multiple wells, a separator, and storage tanks might have 50-200 SCADA tags. Multiply that by well count, and a 2,000-well operator is managing 100,000-400,000 individual data tags. This is why tag management and metadata quality are such persistent challenges.
SCADA Cost Ranges by Operator Size
| Operator Size | Typical SCADA Platform | Annual Cost Range |
|---|---|---|
| Small (100-500 wells) | eLynx, zdSCADA | $12K-$60K/year |
| Mid-size (500-2,000 wells) | CygNet, Ignition, eLynx | $100K-$500K/year |
| Large independent (2,000-10,000 wells) | CygNet, Emerson, Ignition | $500K-$2M+/year |
| Supermajor | Emerson, Honeywell, ABB + custom | $5M-$50M+/year |
These ranges include hardware, software licensing, communication costs, and basic maintenance. They do not include the engineering time to configure, maintain, and troubleshoot the system, which is often the largest cost component.
Drilling Data: EDR Systems and Real-Time Rig Data
Drilling generates an enormous volume of high-frequency data in a compressed timeframe. A single horizontal well drilled in the Permian over 15-20 days can produce 50-200 gigabytes of raw data from surface sensors, downhole tools, mud logging equipment, and daily reports.
Electronic Drilling Recorders (EDR)
The EDR is the primary data acquisition system at the rig floor. It captures the core surface drilling parameters:
- Hookload (weight hanging from the derrick)
- Block position (height of the traveling block)
- Rotary RPM and torque
- Standpipe pressure (drilling fluid pressure)
- Pump strokes and flow rate
- Rate of penetration (ROP)
- Weight on bit (WOB)
- Mud weight in and out
EDR data is typically sampled at 1-5 second intervals, though some parameters (particularly vibration and acoustic data from downhole tools) can be captured at much higher frequencies.
Pason Systems dominates the North American land drilling market, with EDR installations on roughly 60% of active rigs. Pason captures data at configurable intervals and transmits it to Pason's DataHub, a cloud-based platform that makes data available for remote monitoring and third-party applications. DataHub is significant because it serves as a de facto data aggregation layer for much of the industry. Many analytics platforms -- Corva being the most prominent -- receive their primary data feed from Pason rather than connecting directly to rig-floor equipment.
NOV (Totco EDR and NOVOS). NOV's Totco brand competes with Pason in the EDR space. The integration between Totco EDR and NOV's NOVOS rig operating system provides a tighter coupling between data acquisition and rig automation. NOVOS is a drilling automation operating system that can execute pre-programmed drilling sequences, making it as much a control system as a data system. Totco's North American market share is smaller than Pason's, but NOV's installed base on offshore and international rigs is significant.
Corva. Corva sits downstream of the EDR, ingesting data primarily from Pason and normalizing it into a cloud-native analytics platform. For operators who use Corva across multiple rigs, the platform effectively becomes a drilling data warehouse -- a single, consistent repository that spans wells, rigs, and time periods. Corva's Copilot provides AI-powered predictive insights on top of this normalized data. The combination of data normalization and analytics in a single platform makes Corva the closest thing to a modern data stack in the drilling data space.
Real-Time vs. Stored Drilling Data
An important distinction that many operators do not appreciate: the data you see on your real-time drilling monitor is not the same as the data stored in your drilling data warehouse.
Real-time data is streamed at high frequency (1-5 seconds) and is typically consumed by monitoring applications, alarms, and real-time analytics. Stored data may be downsampled to 10-second, 30-second, or even 1-minute intervals to manage storage costs. Some parameters that are available in real time are not stored at all, particularly high-frequency vibration and acoustic data that can generate gigabytes per hour.
This matters for AI and machine learning. A model trained on 30-second averaged data will miss short-duration events -- pressure spikes, stick-slip episodes, connection anomalies -- that are clearly visible in 1-second data. If you intend to build predictive models, you need to understand what resolution of data is actually being stored, not just what is displayed on the monitor.
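A quick way to see the effect is the synthetic sketch below (all values invented): a 5-second pressure spike that is unmistakable in 1-second data nearly vanishes after 30-second averaging.

```python
import numpy as np
import pandas as pd

# Synthetic 1 Hz standpipe pressure: steady near 3,000 psi with a
# 5-second spike to 4,200 psi -- the kind of short event a pack-off
# or washout produces. All values are invented for illustration.
idx = pd.date_range("2025-01-01 06:00", periods=600, freq="1s")
psi = pd.Series(3000 + np.random.normal(0, 10, 600), index=idx)
psi.iloc[300:305] = 4200  # short-duration pressure spike

# What the real-time monitor shows vs. what a 30-second historian keeps
print("1-second max:     ", round(psi.max()))                         # ~4,200 psi
print("30-second avg max:", round(psi.resample("30s").mean().max()))  # ~3,200 psi
```

The averaged series retains only a ~200 psi bump -- easily lost in normal pressure variation, and invisible to a model trained on the stored data.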
WITSML: The Drilling Data Exchange Standard
WITSML (Wellsite Information Transfer Standard Markup Language) is the Energistics data exchange standard for drilling data. It defines XML-based data objects for wellbore trajectories, logs, mud data, cement jobs, fluid reports, BHA records, and more. WITSML provides a standardized way for different vendors' systems to exchange drilling data.
WITSML 1.3.1 remains the most widely deployed version despite being technically superseded. It uses SOAP-based web services for data exchange. WITSML 2.1 moved to RESTful APIs and ETP (Energistics Transfer Protocol) for real-time streaming, but adoption has been slow because upgrading from 1.x to 2.x requires significant changes on both server and client sides.
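For a feel of the format, here is a minimal sketch that extracts curve metadata from a simplified WITSML 1.3.1 log fragment. The fragment and identifiers are invented, and real documents are retrieved through the SOAP-based store interface rather than read from strings.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment of a WITSML 1.3.1 <log> object.
# Real documents arrive via the WMLS_GetFromStore SOAP call and carry
# far more metadata.
doc = """
<logs xmlns="http://www.witsml.org/schemas/131" version="1.3.1.1">
  <log uidWell="W-001" uidWellbore="WB-001" uid="L-001">
    <nameWell>Example 34-12</nameWell>
    <indexType>measured depth</indexType>
    <logCurveInfo uid="lci-1"><mnemonic>DEPT</mnemonic><unit>ft</unit></logCurveInfo>
    <logCurveInfo uid="lci-2"><mnemonic>ROP</mnemonic><unit>ft/h</unit></logCurveInfo>
  </log>
</logs>
"""

ns = {"w": "http://www.witsml.org/schemas/131"}
root = ET.fromstring(doc)
for lci in root.findall(".//w:logCurveInfo", ns):
    mnemonic = lci.findtext("w:mnemonic", namespaces=ns)
    unit = lci.findtext("w:unit", namespaces=ns)
    print(mnemonic, unit)
```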
The practical reality is that WITSML solves the format problem but not the quality problem. A WITSML-compliant data feed can still contain sensor errors, missing values, duplicate timestamps, and physically impossible measurements. The standard defines what the data should look like, not whether the data is correct. For a deeper dive into WITSML and drilling data infrastructure, see our companion article: Drilling Data Management: WITSML to Cloud.
Completion and Frac Data: Real-Time Monitoring During Stimulation
Hydraulic fracturing generates some of the highest-value, shortest-duration data in the entire well lifecycle. A typical Permian completion involves 40-60 stages pumped over 5-10 days. Each stage produces a detailed record of treating pressures, pump rates, proppant concentrations, fluid volumes, and wellbore response. This data is used for real-time job monitoring, post-job analysis, and increasingly, for machine learning models that optimize stage design.
What Frac Monitoring Systems Capture
- Treating pressure (surface and, if available, bottomhole)
- Pump rate (barrels per minute)
- Proppant concentration (pounds per gallon)
- Slurry rate and clean rate
- Fluid volumes (acid, slickwater, gel, by stage)
- Proppant totals (mesh size, volume, by stage)
- ISIP (Instantaneous Shut-In Pressure)
- Breakdown pressure
- Offset well pressure monitoring (frac hit detection)
- Microseismic events (if sensors deployed)
This data is sampled at 1-second intervals during pumping, producing dense time-series records for each stage.
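As an illustration of what post-job analysis looks like, the sketch below rolls hypothetical 1-second records into per-stage summaries. The column names are assumptions -- every service company labels these channels differently, which is precisely the normalization problem discussed later.

```python
import pandas as pd

# Hypothetical 1-second treatment records, one row per sample.
df = pd.read_csv("frac_job.csv", parse_dates=["timestamp"])

# Proppant mass per 1-second sample:
#   lb/gal * (bbl/min * 42 gal/bbl) / 60 s/min = lb per second
df["proppant_lb"] = df["proppant_conc_ppg"] * df["slurry_rate_bpm"] * 42.0 / 60.0

per_stage = df.groupby("stage").agg(
    max_treating_psi=("treating_pressure_psi", "max"),
    avg_rate_bpm=("slurry_rate_bpm", "mean"),
    total_proppant_lb=("proppant_lb", "sum"),
    minutes_pumped=("timestamp", lambda t: (t.max() - t.min()).total_seconds() / 60.0),
)
print(per_stage)
```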
Major Frac Data Platforms
NexTier NexHub and EOS Platform. NexTier operates a 24/7 Digital Center (NexHub) that provides real-time frac monitoring for all active jobs. The EOS platform enables real-time workflows and data visualization, with Corva serving as the primary data visualization layer. NexTier's partnership with Silixa provides IntelliStim diagnostics, combining fiber-optic sensing with frac treatment data for subsurface characterization.
Liberty Energy (WellWatch, FracTrends, Sentinel). Liberty has built an unusually comprehensive set of proprietary data products. WellWatch provides real-time frac hit monitoring -- detecting when fractures from the well being stimulated communicate with offset producing wells. FracTrends is a proprietary database of 60,000+ wells with completion and production data, used to benchmark designs and predict outcomes. Sentinel manages proppant logistics in real time. All three are built in-house, which gives Liberty a data advantage that most frac companies lack.
ProPetro (AccuFrac). AccuFrac consists of two components: Job Center for remote e-frac control and data acquisition, and Power Center for real-time monitoring of electric frac fleet load and demand. ProPetro's FORCE electric fleet platform (165 MW generation capacity across four contracted fleets) generates additional data on power consumption, emissions, and equipment performance that conventional diesel fleets do not produce.
SLB and Halliburton both provide integrated completions data management within their respective platforms (Kinetix and DecisionSpace 365), though these are typically used when the operator is also using SLB or Halliburton as the completion service provider.
The Frac Data Challenge
The fundamental challenge with completion data is that it is generated by service companies, not by the operator. The frac company owns the data acquisition systems, operates them, and delivers the data to the operator after the job. The format, resolution, and completeness of that delivery varies by service company, by crew, and sometimes by well.
Some operators receive detailed electronic records with 1-second data for every stage. Others receive PDF reports with summary statistics. Some get both but cannot easily link the electronic data to the PDF because the stage numbering or well identifiers do not match.
For operators who want to build completion optimization models -- predicting which stage designs will produce the best results in a given formation -- this inconsistency is a serious obstacle. You need consistent, machine-readable completion data across hundreds of wells from multiple service companies, and that requires contractual data delivery requirements, not just technical integration.
Production Data: Wellhead to Allocation
Production data encompasses everything measured at and downstream of the wellhead during the producing life of the well. Unlike drilling data (short, intense, high-frequency) or completion data (captured by service companies), production data is generated by the operator's own systems over the multi-year life of the well.
Wellhead Sensors
The basic suite of wellhead sensors on a producing well includes:
- Pressure transducers -- tubing pressure, casing pressure, line pressure. Typically 4-20 mA analog signals read by the RTU. Range and accuracy depend on the transducer specification; a common configuration is 0-3,000 psi with +/- 0.25% accuracy.
- Temperature probes -- RTD or thermocouple type. Wellhead temperature can indicate changes in flow regime, water breakthrough, or equipment problems.
- Flow meters -- multiphase flow meters (SLB Vx, ABB, Emerson) at the wellhead are still relatively rare due to cost ($50K-$200K per meter). More commonly, individual wells flow to a test separator for periodic well tests, and daily production is allocated from facility-level meters.
Artificial Lift Monitoring
Artificial lift systems generate the richest set of production data because they include both well performance and equipment health parameters.
Rod Pump Wells. Rod pump controllers capture stroke count, polished rod load (the dynacard), motor amps, pump fillage, and calculated fluid production rate. The dynacard -- a plot of polished rod load versus position -- is the primary diagnostic tool for rod pump optimization. Systems like Theta XSPOC, Lufkin (Baker Hughes), and Weatherford controllers transmit dynacard data to the SCADA system at intervals ranging from every stroke to once per hour, depending on configuration. XSPOC in particular has become nearly ubiquitous among Permian operators for rod pump optimization, providing automated pump-off control and diagnostic capabilities.
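One concrete use of the dynacard: the area enclosed by the load-versus-position loop is the work done per stroke, which yields polished rod horsepower. A minimal sketch, assuming load in lbf and position in feet sampled in order around one full stroke:

```python
import numpy as np

def polished_rod_hp(position_ft, load_lbf, spm):
    """Polished rod horsepower from a surface dynacard.

    The enclosed area of the load-vs-position loop is the work done per
    stroke (ft-lbf); multiplying by strokes per minute and dividing by
    33,000 ft-lbf/min per hp gives PRHP. Inputs are one full stroke,
    ordered around the loop.
    """
    x, y = np.asarray(position_ft), np.asarray(load_lbf)
    # Shoelace formula for the area enclosed by the card
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    return area * spm / 33_000.0
```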
ESP (Electric Submersible Pump) Wells. ESP monitoring includes intake pressure, discharge pressure, motor temperature, vibration, motor current and voltage, and pump frequency (if variable speed drive equipped). ESP data is critical because ESPs fail catastrophically -- when an ESP dies, it typically means pulling the completion, a workover that costs $200K-$500K. Early detection of degradation (rising temperature, increasing vibration, declining intake pressure) can extend run life by months. Baker Hughes, Borets, and SLB (REDA) are the major ESP manufacturers, each with their own monitoring and control systems.
Gas Lift Wells. Gas lift monitoring captures injection gas rate, injection pressure, casing head pressure, and valve status. Automated gas lift optimization systems adjust injection rates to maximize oil production per unit of injection gas. The data requirements are less intense than rod pump or ESP monitoring, but consistent measurement of injection rates and wellhead pressures is essential for optimization models.
Test Separators and Allocation
Most multi-well pads do not have individual well meters for oil, gas, and water. Instead, wells flow to a common separator, and production is allocated to individual wells based on periodic well tests. A well test might involve routing one well at a time through a test separator for 12-24 hours, measuring individual oil, gas, and water rates, and then using those rates to allocate the commingled production measured at the facility meter.
This creates a data quality issue that many operators underappreciate: the "daily production" number for an individual well in their production database is not a measurement. It is an estimate based on a well test that might be weeks or months old, applied as an allocation factor to facility-level metered volumes. When formation depletion, artificial lift changes, or wellbore problems alter a well's relative contribution between tests, the allocated production diverges from reality.
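A minimal sketch of the allocation arithmetic, with invented numbers for a three-well pad:

```python
import pandas as pd

# Latest well-test rates for a hypothetical 3-well pad.
tests = pd.DataFrame({
    "well": ["A-1", "A-2", "A-3"],
    "test_oil_bopd": [180.0, 95.0, 240.0],
})

facility_oil_bbl = 490.0  # today's metered pad-level oil volume

# Each well's share of the commingled volume is its test rate divided
# by the sum of test rates -- an estimate, not a measurement, and it
# drifts from reality as wells decline between tests.
tests["alloc_factor"] = tests["test_oil_bopd"] / tests["test_oil_bopd"].sum()
tests["allocated_oil_bbl"] = facility_oil_bbl * tests["alloc_factor"]
print(tests)
```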
For AI models that predict well performance or detect anomalies, this distinction between measured and allocated production is critical. A model trained on allocated production data is learning allocation methodology as much as it is learning reservoir behavior.
Reservoir Data: Subsurface Characterization
Reservoir data differs fundamentally from operational data. It is typically collected during discrete events (logging, testing, coring) rather than continuously, and it exists in specialized formats that predate modern data standards.
Well Logs (LAS Files)
LAS (Log ASCII Standard) is the industry standard format for well log data. A typical modern horizontal well has wireline or LWD (Logging While Drilling) measurements including gamma ray, resistivity, density, neutron porosity, sonic, and potentially formation imaging. LAS files contain header metadata (well name, location, depth reference, units) and columnar data at regular depth intervals (typically 0.5-foot or 6-inch spacing).
The LAS format is both a strength and a limitation. It is simple, widely supported, and has been stable for decades. But the metadata is often incomplete or inconsistent -- well identifiers may not match the names in the production database, depth references may differ between curves, and there is no standard for what curve mnemonics mean across different logging vendors. A curve labeled "GR" in one LAS file might not be directly comparable to a "GR" curve in another because the tools, calibration, and environmental corrections are different.
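In Python, the open-source lasio library is a common way to read LAS files; the file path below is hypothetical.

```python
import lasio  # pip install lasio

las = lasio.read("example_well.las")  # hypothetical file path

print(las.well.WELL.value)  # well name from the header section
print(las.keys())           # curve mnemonics, e.g. ['DEPT', 'GR', 'RESD', ...]

df = las.df()               # pandas DataFrame indexed by depth
gr = df["GR"]               # caution: "GR" from one vendor may not be
                            # directly comparable to another vendor's "GR"
```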
Core Data
Core data includes laboratory measurements on physical rock samples: porosity, permeability, grain density, saturation, and special core analysis (relative permeability, capillary pressure). Core data is expensive to acquire ($500K-$1M+ per cored interval including lab analysis) and exists primarily in PDF lab reports that are not machine-readable. Digitizing legacy core data for analytics use cases is a persistent challenge.
Pressure Tests
Buildup tests, drawdown tests, and mini-DSTs provide reservoir pressure and flow capacity data. The raw data is a time-pressure record that is analyzed using derivative analysis to determine permeability, skin, and boundary effects. Like core data, pressure test results often exist in interpretation reports rather than structured databases.
PVT Data
Pressure-Volume-Temperature data characterizes the behavior of reservoir fluids under different conditions. PVT data is typically acquired from fluid samples analyzed in a laboratory and reported in a PVT report. The data includes bubble point pressure, solution gas-oil ratio, oil formation volume factor, viscosity, and compositional analysis. PVT data is essential for reservoir simulation but is often maintained in standalone spreadsheets or within specific reservoir engineering software.
Seismic Data
Seismic data is the elephant in the room for data management. A 3D seismic survey stored in SEG-Y format can range from hundreds of gigabytes to multiple terabytes. Post-stack processed volumes, pre-stack gathers, and attribute volumes multiply the storage requirements. Seismic data is typically managed by geoscience-specific platforms (Petrel, Kingdom, OpenWorks) and is rarely integrated into the same data infrastructure as operational production data. The scale mismatch alone makes integration challenging -- you cannot put a multi-terabyte seismic volume in the same data lake architecture designed for 1-minute SCADA readings without significant design consideration.
HSE and Environmental Data: Emissions Monitoring
Environmental monitoring has moved from periodic compliance checks to continuous real-time measurement, driven by EPA regulatory changes and ESG reporting requirements. This shift has created an entirely new category of oilfield data source.
Continuous Emissions Monitoring Vendors
Qube Technologies. Qube provides continuous methane monitoring systems that were EPA-approved in March 2025 as an alternative to periodic Leak Detection and Repair (LDAR) inspections. Qube sensors are installed at wellpads and facilities, providing 24/7 methane concentration measurements that can identify leaks as they occur rather than waiting for the next scheduled inspection.
LongPath Technologies. LongPath uses long-range laser methane detection to provide basin-wide continuous monitoring. A single LongPath installation can monitor multiple wellpads across several square miles, making it a cost-effective option for operators with dense, geographically concentrated assets.
MultiSensor AI. MultiSensor combines optical gas imaging (OGI) with laser spectrometry for continuous visualization and quantification of gas leaks. The AI component automatically classifies and quantifies detected emissions.
Bridger Photonics. Bridger provides aerial LiDAR-based methane detection via aircraft flyovers. This is not continuous monitoring, but it provides basin-wide survey capability that can cover thousands of wellpads in a single flight campaign.
Silixa. Silixa's distributed fiber-optic sensing (DAS/DTS) technology is used for both downhole and surface monitoring. Distributed Acoustic Sensing can detect leaks along pipelines and at facilities by sensing the acoustic signature of gas escaping. The same fiber installed for completion diagnostics can potentially serve a dual purpose for long-term leak detection.
What Emissions Data Looks Like
Continuous emissions monitoring generates time-series data similar to SCADA -- methane concentration (ppm), wind speed and direction (for source attribution), temperature, and humidity, typically at 1-minute to 15-minute intervals. The analytical challenge is converting raw concentration measurements into emission rates (kg/hr or tons/year), which requires atmospheric dispersion modeling that accounts for wind, terrain, and sensor placement.
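As a starting point -- and explicitly not a substitute for dispersion modeling -- a naive baseline-exceedance flag on the concentration series can surface candidate events to correlate with operations data. Column names here are assumptions.

```python
import pandas as pd

# 1-minute methane concentration from a pad sensor. This flags
# sustained exceedances; it does NOT convert concentration (ppm) to an
# emission rate, which requires atmospheric dispersion modeling.
ch4 = pd.read_csv("pad_ch4.csv", parse_dates=["timestamp"],
                  index_col="timestamp")["ch4_ppm"]

baseline = ch4.rolling("24h").median()               # slow background level
threshold = baseline + 3 * ch4.rolling("24h").std()
events = ch4[ch4 > threshold].index                  # candidate event timestamps
```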
Most emissions monitoring data currently lives in vendor-specific platforms that are not integrated with the operator's SCADA or production data systems. This is a missed opportunity: correlating emissions events with production operations data (compressor startups, tank loadings, flaring events) can identify root causes and enable predictive emissions management.
Communication Protocols: Moving Data from Field to Office
The protocol layer between field devices and data systems is often invisible to everyone except the SCADA engineer, but protocol choices have real consequences for data quality, latency, and integration complexity. Here is what each major protocol does and when to use it.
Modbus (TCP and RTU)
Modbus is the legacy workhorse of industrial communication, dating back to 1979. It is still the most common protocol at the wellsite level, used to connect sensors, PLCs, and RTUs. Modbus is simple, reliable, and virtually universal -- every RTU and PLC on the market supports it.
When to use it: At the wellsite, for device-to-RTU communication. Do not try to extend Modbus beyond the local site. It has no built-in security, no metadata, and no discovery mechanism. Bridge it to a modern protocol (OPC-UA or MQTT) at the RTU or edge gateway.
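A minimal sketch of polling a Modbus TCP device with the pymodbus library; the IP address, register addresses, and scaling are all device-specific assumptions, and keyword names vary across pymodbus versions.

```python
from pymodbus.client import ModbusTcpClient  # pymodbus 3.x

# Hypothetical RTU address and register map -- real register
# assignments live in the device vendor's Modbus map document.
client = ModbusTcpClient("10.0.0.15", port=502)
client.connect()

# Read two holding registers carrying a pressure value.
# Keyword names (slave/unit/device_id) differ across pymodbus versions.
rr = client.read_holding_registers(address=0, count=2, slave=1)
if not rr.isError():
    print(rr.registers)  # raw register values; scaling is device-specific

client.close()
```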
OPC-UA (Unified Architecture)
OPC-UA is the current industry standard for secure, structured industrial data exchange. It supports data modeling (so a "pressure" reading carries context about what it measures, in what units, from what device), security (authentication and encryption), and both client-server and publish-subscribe communication patterns. OPC-UA PubSub (Part 14 of the specification) enables MQTT bridging, combining OPC-UA's data model with MQTT's lightweight transport.
When to use it: For facility-to-cloud communication, for integrating SCADA with enterprise systems, and anywhere you need structured, self-describing data. OPC-UA gateways from Kepware (PTC), Softing, HMS Networks, and Matrikon (Honeywell) can bridge Modbus devices into the OPC-UA ecosystem.
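A minimal sketch of reading one value from an OPC-UA server with the asyncua library; the endpoint URL and node id are hypothetical and would come from browsing the gateway's address space.

```python
import asyncio
from asyncua import Client  # pip install asyncua

async def main():
    # Endpoint and node id are assumptions -- browse the server's
    # address space (e.g. with UaExpert) to find real node ids.
    async with Client(url="opc.tcp://gateway.example.com:4840") as client:
        node = client.get_node("ns=2;s=Pad01.Well03.TubingPressure")
        value = await node.read_value()
        print("Tubing pressure:", value)

asyncio.run(main())
```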
MQTT 5.0
MQTT is a lightweight publish-subscribe messaging protocol designed for bandwidth-constrained environments -- exactly the conditions at remote wellsites. MQTT 5.0 (the current standard) adds features including shared subscriptions, topic aliases, and message expiry. The protocol's small overhead makes it ideal for cellular and satellite communication links where every byte counts.
When to use it: For wellsite-to-cloud data transmission, especially on cellular or satellite links. MQTT is the preferred transport protocol for modern IoT deployments. Cloud MQTT brokers (HiveMQ, EMQX, AWS IoT Core, Azure IoT Hub) provide managed infrastructure for receiving and routing MQTT messages.
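A minimal sketch of publishing a wellsite reading over MQTT 5.0 with paho-mqtt; the broker address and topic scheme are assumptions.

```python
import json
import paho.mqtt.client as mqtt  # pip install "paho-mqtt>=2.0"

# Broker and topic namespace are hypothetical -- adapt to your own.
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2, protocol=mqtt.MQTTv5)
client.connect("broker.example.com", 1883)

payload = json.dumps({"ts": "2025-06-01T06:00:00Z", "thp_psi": 412.7})
client.publish("field/pad01/well03/thp", payload, qos=1)
client.disconnect()
```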
SparkplugB
SparkplugB is a specification built on top of MQTT that adds structured data definitions, state management, and birth/death certificates (messages that indicate when a device comes online or goes offline). SparkplugB essentially gives MQTT the data modeling capabilities of OPC-UA while keeping MQTT's lightweight transport.
When to use it: When you want MQTT's efficiency but need more structured data than raw MQTT provides. SparkplugB is gaining adoption in upstream oil and gas, particularly with vendors like SignalFire that use it for low-power wellsite telemetry. It is not yet as widely adopted as OPC-UA, but its trajectory suggests it will become a standard part of the protocol stack.
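The Sparkplug B topic namespace itself is simple; payloads are protobuf-encoded per the specification (libraries such as Eclipse Tahu handle encoding). The identifiers below are hypothetical:

```python
# Sparkplug B topic namespace:
#   spBv1.0/<group_id>/<message_type>/<edge_node_id>[/<device_id>]
# Message types include NBIRTH/NDEATH (node), DBIRTH/DDEATH (device),
# and NDATA/DDATA (data). Identifiers below are invented examples.
group_id, edge_node, device = "PermianOps", "Pad01Gateway", "Well03RTU"
topic = f"spBv1.0/{group_id}/DDATA/{edge_node}/{device}"
```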
WITSML 2.1 and PRODML 2.2
WITSML and PRODML are Energistics domain-specific standards for drilling and production data exchange, respectively. They define data objects specific to oil and gas (wellbores, logs, production volumes) rather than generic industrial measurements. Both are being modernized with ETP 1.2 (Energistics Transfer Protocol), which replaces the older SOAP-based transport with WebSocket streaming for near-real-time data exchange.
When to use them: For rig-to-office drilling data exchange (WITSML) and for production data exchange between operators and regulators or partners (PRODML). These are domain standards, not transport protocols -- they run on top of HTTP, WebSocket, or other transport layers. They are essential for interoperability with service companies and industry data platforms, but they do not replace OPC-UA or MQTT at the device level.
Protocol Selection Summary
| Layer | Protocol | Typical Use |
|---|---|---|
| Wellsite device-to-RTU | Modbus TCP/RTU | Pressure transducers, flow meters, PLCs |
| RTU-to-edge gateway | OPC-UA, Modbus | Local site data aggregation |
| Edge-to-cloud transport | MQTT 5.0, SparkplugB | Wellsite telemetry over cellular/satellite |
| Cloud ingestion | OPC-UA PubSub, MQTT, Kafka | Streaming into cloud data platform |
| Drilling data exchange | WITSML 2.1 + ETP 1.2 | EDR to operator monitoring systems |
| Production data exchange | PRODML 2.2 + ETP 1.2 | Operator to regulatory, partner systems |
| Enterprise integration | REST/JSON APIs | SaaS platform integration (Corva, Enverus) |
Data Sources by Operator Size
The data source landscape varies dramatically by operator size. Understanding where your organization falls on this spectrum is essential for making realistic decisions about data infrastructure investment.
The 5,000-Well Large Independent
A large independent like Devon Energy or Diamondback has the scale and budget to deploy comprehensive data capture:
- SCADA: Enterprise platform (CygNet or Ignition) with RTUs on every producing well. 200,000-500,000+ active SCADA tags.
- Historian: AVEVA PI System storing years of high-resolution time-series data on-premise, with increasing migration to PI Cloud.
- Drilling: Pason EDR on all operated rigs, with Corva or proprietary analytics. Historical drilling data warehouse spanning hundreds of wells.
- Completions: Contractual requirements for electronic data delivery from frac companies. Standardized data formats.
- Production: Wellhead sensors on all wells. Test separators at major facilities. Multiphase flow meters on high-value wells.
- Emissions: Deploying continuous monitoring (Qube, LongPath) across the asset base.
- Annual data infrastructure budget: $20-100M.
The 2,000-Well Mid-Size Operator
A mid-size operator like Permian Resources or Matador has meaningful data infrastructure but faces budget constraints:
- SCADA: CygNet, zdSCADA, or Ignition on most wells. Some legacy RTUs from acquisitions that do not integrate cleanly.
- Historian: PI System or the SCADA platform's built-in historian. Data retention policies may limit high-resolution historical data to 1-2 years.
- Drilling: Pason EDR and potentially Corva. Historical data may be fragmented across acquisitions.
- Completions: Inconsistent data delivery from frac companies. Some wells have detailed electronic data; others have only PDFs.
- Emissions: Beginning to deploy continuous monitoring. Many sites still on periodic LDAR inspections.
- Annual data infrastructure budget: $5-20M.
The 200-Well Small Operator
A small operator, often PE-backed or privately held, has minimal data infrastructure:
- SCADA: eLynx (~$10/asset/month) or zdSCADA on some wells. Remaining wells on manual gauging with daily field visits.
- Drilling: Pason EDR data available through DataHub but may not be systematically archived by the operator.
- Completions: PDF completion reports. No electronic database.
- Production: Basic wellhead sensors on SCADA-equipped wells. Production volumes from pumper field reports and GreaseBook entries.
- Emissions: Periodic LDAR inspections only. No continuous monitoring.
- Annual data infrastructure budget: $100K-$2M.
The gap between the 5,000-well operator and the 200-well operator is not just budget -- it is generational. The large operator is debating whether to migrate to a cloud data lakehouse. The small operator is debating whether to put SCADA on the next 50 wells. These are fundamentally different conversations, and solutions designed for one will not work for the other.
Common Data Source Problems
Every operator, regardless of size, faces a set of recurring data source problems that degrade data quality before the data ever reaches a database or dashboard. These problems originate at the physical layer -- sensors, RTUs, communication links -- and compound as data flows through the architecture.
Sensor Drift
Pressure transducers drift over time, particularly in harsh environments. A transducer installed in 2020 that has never been recalibrated might read 50-100 psi high or low by 2026. Flow meters accumulate paraffin, scale, or sand, degrading accuracy. Temperature probes corrode. The insidious nature of drift is that the data looks plausible -- it changes gradually and stays within a reasonable range -- so it is rarely flagged by automated quality checks.
Frozen Values
A sensor fails but the RTU continues reporting the last known value. Or the RTU loses communication with the sensor and substitutes a default value. Or a configuration error causes the same value to be written to multiple tags. Frozen values appear in virtually every SCADA dataset and are one of the most common causes of AI model failure. A model trained on data containing frozen values learns that "no change" is a normal operating state, which produces false confidence and missed anomalies. For a systematic approach to detecting these issues, see our SCADA Data Quality for AI Checklist.
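Frozen values are straightforward to flag once you look for them. A minimal pandas sketch, with the run-length threshold as a per-tag tuning assumption:

```python
import pandas as pd

def flag_frozen(series: pd.Series, min_run: int = 60) -> pd.Series:
    """Flag samples belonging to a run of >= min_run identical values.

    For 1-minute data, min_run=60 flags any tag stuck for an hour.
    Tune per tag: a shut-in well legitimately reads flat.
    """
    run_id = (series != series.shift()).cumsum()      # label consecutive runs
    run_len = series.groupby(run_id).transform("size")
    return run_len >= min_run
```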
Data Gaps
Communication outages are a fact of life at remote wellsites. Cellular coverage in the Permian is better than it was ten years ago, but it is still imperfect. When communication drops, the RTU buffers data locally. If the outage lasts longer than the buffer (which depends on RTU model and configuration -- some hold hours, some hold days), data is lost permanently. Even when buffered data is transmitted after connectivity is restored, timestamp issues, duplicate records, and out-of-order data are common.
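Gap detection is equally mechanical; a sketch, assuming a known expected poll interval:

```python
import pandas as pd

def find_gaps(ts: pd.DatetimeIndex, expected: str = "1min") -> pd.Series:
    """Return gap durations, indexed by the first timestamp after each gap.

    `expected` is the nominal poll interval -- an assumption set per tag.
    """
    deltas = ts.to_series().diff()
    return deltas[deltas > pd.Timedelta(expected)]
```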
Inconsistent Tag Naming
This problem is universal and gets worse with every acquisition. One asset uses "THP" for tubing head pressure. The acquired asset calls it "WHP." A third asset labels it "FlowingPress_psi." Some tags include units in the name; others do not. Some use well API numbers; others use lease names. Without a consistent tag naming convention and a tag dictionary that maps names to physical measurements, cross-well and cross-asset analysis is painful or impossible.
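A tag dictionary does not need to be sophisticated to be useful. A starter sketch with illustrative regex rules -- a real dictionary is built by auditing actual tag lists, and unmapped names should go to a review queue rather than be guessed:

```python
import re

# Regex patterns mapping raw tag names from different acquisitions to
# one canonical measurement. Patterns below are illustrative only.
TAG_RULES = [
    (re.compile(r"(thp|tubing.?(head)?.?press|flowingpress|whp)", re.I),
     "tubing_pressure_psi"),
    (re.compile(r"(chp|casing.?(head)?.?press)", re.I),
     "casing_pressure_psi"),
]

def canonical_name(raw_tag: str):
    for pattern, canonical in TAG_RULES:
        if pattern.search(raw_tag):
            return canonical
    return None  # unmapped -- route to review, do not guess

print(canonical_name("PAD01_W3_THP"))      # tubing_pressure_psi
print(canonical_name("FlowingPress_psi"))  # tubing_pressure_psi
```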
Legacy Systems and Acquisition Debt
Every acquisition brings a new SCADA system, a new historian, new tag naming conventions, and new data quality pathologies. Integrating acquired data systems is expensive, time-consuming, and often deprioritized in favor of production-focused integration tasks. The result is that many mid-size operators run two, three, or even four parallel SCADA systems from successive acquisitions, each with its own quirks.
Report-by-Exception vs. Periodic Polling
Some SCADA systems use "report by exception" -- the RTU only sends a new value when the measurement changes by more than a configured deadband. This is efficient for communication bandwidth but creates ambiguity in the data: is a missing timestamp a "no data" event or a "no change" event? For AI and analytics, this distinction matters. A model needs to know whether the absence of a data point means the value was stable or the measurement was not taken.
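One way to handle this is to expand RBE data onto a regular grid while preserving the distinction, using a communication heartbeat to separate "no change" from "no data". A sketch with hypothetical inputs:

```python
import pandas as pd

def expand_rbe(rbe: pd.Series, comm_ok: pd.Series, freq: str = "1min") -> pd.DataFrame:
    """Expand report-by-exception data onto a regular time grid.

    rbe: values at the irregular timestamps the RTU actually reported.
    comm_ok: boolean heartbeat indicating when the link was up. Both
    inputs are hypothetical; without a heartbeat, "no change" and
    "no data" are indistinguishable.
    """
    grid = rbe.resample(freq).last()
    up = comm_ok.resample(freq).last().ffill().fillna(False).astype(bool)
    out = pd.DataFrame({
        "value": grid.ffill(),     # carry last report forward
        "measured": grid.notna(),  # True only where a report arrived
        "comm_ok": up,
    })
    out.loc[~out["comm_ok"], "value"] = pd.NA  # outage: unknown, not stable
    return out
```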
Bringing It All Together: The Data Source Inventory
Before investing in data lakes, dashboards, or AI, every operator should have a complete inventory of their data sources. This inventory should answer five questions for each source:
1. What is being measured? (Physical quantity, unit, range, accuracy)
2. What system captures it? (Vendor, model, software version)
3. How frequently? (Sample interval, report-by-exception deadband)
4. How does it get to the office? (Communication protocol, transport, latency)
5. Where does it land? (Historian, database, file system, or nowhere)
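Even a spreadsheet works for this, but the five questions map naturally onto a structured record. A minimal sketch with hypothetical example values:

```python
from dataclasses import dataclass

@dataclass
class DataSourceRecord:
    """One row of the data source inventory -- the five questions as fields."""
    measurement: str     # 1. what: quantity, unit, range, accuracy
    capture_system: str  # 2. what system: vendor, model, software version
    frequency: str       # 3. how often: sample interval or RBE deadband
    transport: str       # 4. how it moves: protocol, link, typical latency
    destination: str     # 5. where it lands: historian, database, or nowhere

# Example values are invented for illustration.
inventory = [
    DataSourceRecord(
        measurement="Tubing pressure, psi, 0-3,000, +/-0.25%",
        capture_system="Emerson ROC800, firmware 3.x",
        frequency="60 s poll",
        transport="Modbus RTU to gateway, MQTT over cellular",
        destination="Cloud historian, 2-year retention",
    ),
]
```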
Most operators cannot answer all five questions for all their data sources. The ones who can are the ones who successfully deploy analytics and AI. The ones who cannot are the ones whose AI pilots fail and who conclude, incorrectly, that the technology does not work.
The technology works. The data source layer is where most failures originate. Understanding it is not optional -- it is the foundation for everything that follows in the data architecture stack.