SCADA Data Quality for AI: The Audit Checklist Every Operator Needs Before Starting an ML Project

Dr. Mehrdad Shirangi | Published by Groundwork Analytics LLC

Editorial disclosure

This article reflects the independent analysis and professional opinion of the author, informed by published research, vendor documentation, and hands-on experience with production data systems across multiple basins. No vendor reviewed or influenced this content prior to publication.

Here is a pattern that plays out at mid-size operators roughly once a quarter: Someone at the executive level reads about AI-driven production optimization. A vendor gives a compelling demo showing how their model detected rod pump failures three days early on a Permian pad. The company signs a pilot agreement. An engineering team spends two months pulling SCADA data, formatting it, and handing it to the vendor. The vendor comes back and says: "We cannot build a reliable model with this data."

The project stalls. The pilot quietly dies. The company concludes that AI is "not ready" for their operations.

The AI was fine. The data was not.

This is not a hypothetical scenario. It is the single most common failure mode for AI and machine learning projects in upstream oil and gas. A 2024 survey by the World Economic Forum and Accenture found that 74% of oil and gas executives cited data quality and availability as the top barrier to scaling AI. Not model accuracy. Not compute cost. Not talent. Data quality.

The frustrating part is that most SCADA data quality problems are identifiable, categorizable, and fixable -- if you look before you spend six months on an AI pilot. This article provides a systematic approach to auditing SCADA data quality before starting any machine learning project, along with a practical checklist that production engineering teams can run against their own data.


Why SCADA Data Quality Kills AI Projects

Machine learning models are, at their core, pattern recognition engines. They learn statistical relationships from historical data and use those relationships to make predictions on new data. This means the quality of the training data is not one factor among many -- it is the foundation. A model trained on bad data will learn bad patterns and produce bad predictions with high confidence.

SCADA data has specific quality issues that are different from, say, web analytics or financial data. Understanding why these issues arise requires understanding how SCADA data is generated in the field.

The Physical Reality of Oilfield SCADA

A typical Permian Basin well pad has RTUs (Remote Terminal Units) collecting data from pressure transducers, flow meters, temperature sensors, and artificial lift controllers. This data is transmitted via cellular, radio, or satellite to a SCADA host, where it is stored in a time-series database and made available through a historian or dashboard application.

At every step in this chain, data quality can degrade:

At the sensor. Pressure transducers drift over time. Flow meters accumulate paraffin or scale. Temperature probes fail. Liquid level sensors in tanks stick. The field environment -- West Texas heat, sand, H2S, vibration -- is actively hostile to precision instruments.

At the RTU. RTUs have limited memory buffers. When communication drops, they store data locally until the link is restored. If the outage lasts longer than the buffer, data is lost permanently. Some RTUs timestamp data at transmission, not collection, creating timing errors. Older RTUs may sample at inconsistent intervals.

At the communication layer. Cellular coverage in the Permian is better than it was ten years ago, but it is not perfect. Radio networks have interference. Satellite links have latency and cost constraints that limit polling frequency. When communication drops, you get gaps. When it recovers, you sometimes get duplicate records or out-of-order data.

At the SCADA host. Different SCADA vendors store data differently. Some store only on-change values (report by exception), meaning the absence of a data point may indicate "no change" or may indicate "no data." Some apply compression algorithms that discard data points within a deadband, which is efficient for storage but destructive for ML training data. Some overwrite raw data with calculated values without preserving the original.

At the historian/database layer. Data may be aggregated from 1-second scans to 1-minute, 15-minute, or hourly averages before being stored long-term. The aggregation method matters: a 15-minute average pressure reading hides short-duration pressure spikes that might be exactly what an anomaly detection model needs to see.
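
The effect of aggregation is easy to demonstrate. The sketch below (Python with pandas, on synthetic data) builds a 1-second pressure series containing a 10-second spike and shows how a 15-minute average all but erases it:

```python
import pandas as pd

# Synthetic 1-second casing pressure with a 10-second, 300 psi spike
idx = pd.date_range("2024-01-01", periods=900, freq="s")
p = pd.Series(600.0, index=idx)
p.iloc[400:410] += 300.0

raw_max = p.max()                 # 900.0 psi -- the spike is fully visible
avg = p.resample("15min").mean()  # the kind of aggregate many historians store
avg_max = avg.max()               # ~603.3 psi -- the spike nearly vanishes
print(raw_max, round(avg_max, 1))
```

A model trained on the 15-minute series never sees the event at all, which is exactly the concern for anomaly detection use cases.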

Each of these failure modes produces a different type of data quality problem, and each requires a different fix.


The Six Categories of SCADA Data Quality Problems

Before presenting the audit checklist, it is worth understanding the major categories of data quality issues that affect ML model performance. These are listed roughly in order of prevalence based on what we see across client engagements.

1. Missing Data and Gaps

The most common problem. SCADA data has gaps -- periods where no data was recorded for a given tag. Gaps can range from minutes (a communication dropout) to weeks (a failed RTU) to permanently missing channels (a sensor that was never installed or never configured for SCADA polling).

Why it matters for ML: Most ML algorithms cannot handle missing values natively. You must either impute (fill in) the missing values or discard the affected time periods. If gaps are frequent and randomly distributed, imputation may work. If gaps are correlated with operating conditions (e.g., communication drops during storms, which are also when you care most about well performance), imputation introduces bias.
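
As a concrete illustration of why the choice of imputation method matters, here is a minimal pandas sketch comparing forward fill against linear interpolation across a short gap; the tag values are synthetic:

```python
import pandas as pd
import numpy as np

# Synthetic 1-minute tubing pressure with a 3-sample gap during a steady ramp
idx = pd.date_range("2024-01-01", periods=10, freq="min")
p = pd.Series([500, 502, 504, np.nan, np.nan, np.nan, 512, 514, 516, 518],
              index=idx)

ffill = p.ffill()          # forward fill: repeats 504 across the gap
linear = p.interpolate()   # linear interpolation: follows the ramp

print(ffill.iloc[3:6].tolist())   # [504.0, 504.0, 504.0]
print(linear.iloc[3:6].tolist())  # [506.0, 508.0, 510.0]
```

Forward fill fabricates a frozen plateau during a ramp; interpolation fabricates a smooth transition that may hide a real transient. Neither is "correct" -- the right choice depends on what the tag was doing during the gap, which is exactly why gaps correlated with operating conditions are dangerous.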

2. Frozen and Stuck Values

A sensor or RTU reports the same value for an extended period. This can indicate a failed sensor, a stuck transducer, a frozen RTU, or a "last known value" being reported after a communication loss. Frozen values are insidious because they look like valid data -- the value is within a plausible range, the timestamp is current, and nothing flags it as an error.

Why it matters for ML: A model trained on frozen values will learn that "no change" is a normal operating state when it may actually indicate equipment failure. This is the opposite of what you want from an anomaly detection model.

3. Outliers and Spike Values

Sudden, extreme values that are physically implausible. A casing pressure reading of 15,000 psi on a well rated to 5,000 psi. A flow rate of -200 barrels per day. A temperature reading of 500 degrees Fahrenheit at a wellhead. These typically result from sensor malfunctions, electrical noise, analog-to-digital conversion errors, or data transmission corruption.

Why it matters for ML: Outliers disproportionately influence model training. A single 15,000 psi pressure reading in a dataset where normal values range from 200 to 800 psi will distort the learned distribution and produce models that are either insensitive to real anomalies or that hallucinate false ones.

4. Sensor Drift and Calibration Errors

Gradual, systematic deviation of a sensor reading from the true value. A pressure transducer that reads 50 psi high after three years without calibration. A flow meter that under-reports by 15% because of buildup on the orifice plate. Unlike outliers, drift is hard to detect from the data alone because the values remain plausible and change slowly.

Why it matters for ML: Drift introduces a non-stationary bias into the training data. A model trained on data from a drifted sensor will make systematically biased predictions. If the sensor is recalibrated (creating a step change in the data), the model may interpret the correction as an anomaly.

5. Unit and Scale Mismatches

Different wells, different fields, or different time periods may report the same physical quantity in different units. Pressure in psi versus kPa versus bar. Flow rate in barrels per day versus cubic meters per day. Temperature in Fahrenheit versus Celsius. These mismatches often arise from acquisitions (merging SCADA systems from two companies), equipment changes (replacing a metric-calibrated sensor with an imperial one), or configuration errors.

Why it matters for ML: If you train a model on data where some wells report pressure in psi and others in kPa, the model will learn that "800" and "5,500" are both normal operating pressures, which is technically true but meaningless for pattern recognition across wells.

6. Metadata Deficiency

SCADA data without adequate metadata is data without context. Missing or incorrect well identifiers. Tags with no description of what they measure. No record of sensor types, calibration dates, measurement ranges, or installation dates. No mapping between SCADA tags and physical equipment (which tag corresponds to which separator, which well on a multi-well pad).

Why it matters for ML: You cannot build a cross-well model if you do not know which tags belong to which wells. You cannot detect sensor drift if you do not know when the sensor was last calibrated. You cannot apply physics-informed constraints if you do not know the sensor range and units. Metadata is not supplementary information -- it is structural.


The SCADA Data Quality Audit Checklist

The following checklist is organized by category. It is designed to be run against a representative sample of SCADA data before committing to an AI/ML project. For each item, we indicate the severity level: Critical (will likely cause model failure), High (will significantly degrade model performance), or Medium (will affect model quality but may be workable).

A. Data Completeness and Gaps

A1 (Critical). For each SCADA tag, calculate the percentage of expected data points that are present over the past 12 months. Flag any tag below 90% completeness.
A2 (Critical). Identify the longest continuous gap for each tag. Flag any gap exceeding 24 hours.
A3 (High). Count the number of gaps exceeding 1 hour per tag per month. Trend over time to identify worsening communication reliability.
A4 (High). Determine whether gaps are random or correlated across tags. Correlated gaps (all tags on a well drop simultaneously) suggest RTU/communication failure. Uncorrelated gaps suggest individual sensor issues.
A5 (Medium). Check whether gap patterns correlate with weather events, time of day, or geographic location (cell tower coverage).
A6 (High). Verify that the data collection interval is consistent. Flag tags that switch between 1-minute, 5-minute, and 15-minute intervals without explanation.
A7 (Critical). Determine whether missing data points represent "no data collected" or "no change from previous value" (report by exception). Document which tags use report-by-exception and at what deadband.
A8 (High). Check for entire wells that have no SCADA data for extended periods (shut-in wells that were never removed from the active well list, or producing wells with failed RTUs that nobody noticed).
A9 (Critical). Verify that timestamps are in a consistent timezone across all data sources. Check for daylight saving time artifacts (duplicate hours in spring, missing hours in fall).
A10 (High). Check for duplicate data points (same tag, same timestamp, different values). These typically result from RTU buffer replays after communication restoration.
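
Several of these checks are straightforward to script. The sketch below (Python/pandas, synthetic 1-minute data) computes the completeness percentage from A1 and the longest gap from A2 for a single tag; the `completeness_report` helper is illustrative, not part of any vendor API:

```python
import pandas as pd
import numpy as np

def completeness_report(s: pd.Series, freq: str = "min"):
    """Percent of expected samples present (A1) and longest gap (A2)."""
    full = s.reindex(pd.date_range(s.index.min(), s.index.max(), freq=freq))
    pct = 100.0 * full.notna().sum() / len(full)
    # Longest run of consecutive missing samples
    missing = full.isna().astype(int)
    runs = missing.groupby((missing != missing.shift()).cumsum()).sum()
    longest = int(runs.max())
    step = full.index[1] - full.index[0]  # sample interval
    return pct, longest * step

# Synthetic tag: 100 one-minute samples with a 15-minute communication dropout
idx = pd.date_range("2024-01-01", periods=100, freq="min")
s = pd.Series(np.arange(100.0), index=idx).drop(idx[40:55])

pct, gap = completeness_report(s)
print(round(pct, 1), gap)  # 85.0 0 days 00:15:00
```

Run per tag, this produces the numbers that feed the A1/A2 flags directly.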

B. Frozen and Stuck Values

B1 (Critical). For each analog tag, identify periods where the value does not change for more than 4 hours while the well is producing. Flag as potentially frozen.
B2 (Critical). Calculate the standard deviation of each tag over rolling 24-hour windows. Flag any window where the standard deviation is exactly zero (excluding legitimately constant setpoints).
B3 (High). Cross-reference frozen pressure readings against production data. A well reporting constant casing pressure while flowing is physically implausible unless it is on a steady-state gas lift system at design conditions.
B4 (Critical). Check for "last known value" behavior: does the SCADA system continue to report the last received value after communication loss? If so, document which tags exhibit this behavior and how the system indicates communication status.
B5 (Medium). Identify tags that report values only at exact round numbers (e.g., pressure always at 100, 200, 300 psi). This may indicate a low-resolution sensor, a misconfigured analog-to-digital converter, or a tag that is reading a digital status rather than an analog signal.
B6 (High). For flow meters, check for extended periods of exactly zero flow on wells that are not shut in. Zero flow on a producing well likely indicates a failed or bypassed meter, not an actual zero-production event.
B7 (Medium). Compare the variance of each tag against the expected physical variance. A wellhead temperature sensor on a flowing well should show diurnal variation. If it does not, the sensor may be failed or the tag may be reading from the wrong source.
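
Check B1 reduces to run-length analysis: find runs of identical consecutive values and flag any run longer than the threshold. A minimal pandas sketch, using a synthetic hourly casing-pressure tag and an illustrative `frozen_spans` helper:

```python
import pandas as pd

def frozen_spans(s: pd.Series, min_samples: int) -> pd.Series:
    """Boolean mask for samples inside a run of >= min_samples identical
    consecutive values (check B1)."""
    run_id = (s != s.shift()).cumsum()        # new id each time the value changes
    run_len = s.groupby(run_id).transform("size")
    return run_len >= min_samples

# Synthetic hourly casing pressure with a 5-hour frozen stretch at 315 psi
idx = pd.date_range("2024-01-01", periods=12, freq="h")
casing_psi = pd.Series([310, 312, 315, 315, 315, 315, 315, 316, 318, 320, 319, 321],
                       index=idx)

flags = frozen_spans(casing_psi, min_samples=4)  # ~4 h unchanged on hourly data
print(int(flags.sum()))  # 5
```

In production use you would additionally gate this on well status (producing vs. shut in) and exclude legitimately constant setpoint tags, per B2.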

C. Outliers and Anomalous Values

C1 (Critical). For each tag, identify values that exceed the physical range of the installed sensor. A 0-1000 psi transducer reporting 1,500 psi is not an anomaly -- it is a data error.
C2 (Critical). Flag negative values on tags where negative values are physically impossible (flow rates, absolute pressures, temperatures in Fahrenheit or Kelvin).
C3 (High). Identify single-sample spikes: values that deviate by more than 5 standard deviations from the local mean and return to normal within the next 1-3 samples. These are almost always noise, not real events.
C4 (High). Check for values that are exactly at the sensor range limits (0 or full scale). These often indicate a sensor that has saturated or failed, reporting its floor or ceiling value.
C5 (High). Identify sudden step changes in a tag value that persist. These may indicate a sensor replacement, a recalibration, a physical change in the well (e.g., rod pump speed change), or a tag reassignment. Document the cause.
C6 (Medium). Check for impossible rate-of-change values. Casing pressure increasing by 500 psi in one second is not physically possible unless there is a catastrophic failure, and even then the sensor response time would not capture it that fast.
C7 (High). Verify that calculated or derived tags (e.g., production rates derived from tank levels) produce values consistent with their input tags. A derived flow rate that diverges from a direct flow meter reading indicates a calculation error or a misconfigured derivation.
C8 (Medium). Run distribution plots for each tag across all wells. Identify wells whose distributions are dramatically different from the population. Investigate whether the difference is physical (different completion, different formation) or a data error.
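
Check C3 can be automated with a robust rolling statistic. The sketch below uses a rolling median and a MAD-based sigma estimate (one reasonable choice, not the only one) so that the spike itself does not inflate the detection threshold the way a plain rolling standard deviation would:

```python
import pandas as pd
import numpy as np

def single_sample_spikes(s: pd.Series, window: int = 11,
                         n_sigma: float = 5.0) -> pd.Series:
    """Flag values far from a rolling median (check C3), using a MAD-based
    sigma that is robust to the spike itself."""
    med = s.rolling(window, center=True, min_periods=1).median()
    resid = s - med
    mad = resid.abs().rolling(window, center=True, min_periods=1).median()
    sigma = 1.4826 * mad  # MAD -> std equivalent for normal noise
    return resid.abs() > n_sigma * sigma.replace(0, np.nan)

# Synthetic 1-minute tubing pressure with one transmission-corruption spike
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=200, freq="min")
tubing_psi = pd.Series(600 + rng.normal(0, 2, 200), index=idx)
tubing_psi.iloc[100] = 15000.0

flags = single_sample_spikes(tubing_psi)
print(bool(flags.iloc[100]))  # True
```

The `replace(0, np.nan)` guard prevents dead-flat (frozen) stretches from producing divide-by-zero false positives; frozen data is a category B problem and should be caught by those checks instead.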

D. Sensor Drift and Calibration

D1 (Critical). For pressure tags, compare SCADA readings against the most recent wellhead gauge readings from field reports. Flag discrepancies greater than 5% of the reading.
D2 (High). Obtain calibration records for all pressure transducers and flow meters. Flag any sensor that has not been calibrated in the past 12 months.
D3 (High). Plot each pressure tag over 12+ months and look for gradual upward or downward trends that do not correlate with operational changes. A trend of +2 psi/month on a casing pressure sensor is probably drift.
D4 (High). Identify step changes in sensor readings that coincide with documented calibration events. Quantify the magnitude of the correction. If a recalibration shifts a reading by 100 psi, all data collected before that recalibration is biased by approximately that amount.
D5 (High). For flow meters (orifice, turbine, Coriolis), check the date of the last orifice plate change, last proving, or last factor update. Orifice plates in wet gas service can erode significantly over 6-12 months.
D6 (High). Compare SCADA flow rates against monthly production allocation volumes. Persistent discrepancies suggest either meter drift or allocation errors.
D7 (Medium). Check for pairs of redundant sensors (e.g., tubing pressure measured by both the RTU and the artificial lift controller). Compare the readings over time. Divergence indicates drift in one or both sensors.
D8 (Medium). Review the installation history of each sensor. A pressure transducer installed in 2015 that has never been replaced or recalibrated is a high-risk tag for drift.
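
Check D3 is, at its simplest, a trend fit. A minimal sketch that estimates drift in psi per month from a synthetic daily pressure series (a real implementation would first exclude documented operational changes and recalibration step changes):

```python
import pandas as pd
import numpy as np

def drift_rate_psi_per_month(s: pd.Series) -> float:
    """Least-squares slope of a pressure tag, expressed per 30 days (check D3)."""
    days = ((s.index - s.index[0]) / pd.Timedelta(days=1)).to_numpy()
    slope_per_day = np.polyfit(days, s.to_numpy(), 1)[0]
    return slope_per_day * 30.0

# Synthetic daily casing pressure: flat 450 psi plus +2 psi/month of sensor drift
idx = pd.date_range("2023-01-01", periods=365, freq="D")
casing = pd.Series(450.0 + (2.0 / 30.0) * np.arange(365), index=idx)

print(round(drift_rate_psi_per_month(casing), 2))  # 2.0
```

The hard part is not the fit; it is distinguishing drift from a genuine slow reservoir or operational trend, which is why D3 is paired with the calibration-record checks D2 and D4.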

E. Unit and Scale Consistency

E1 (Critical). Create a master tag dictionary that maps every SCADA tag to its physical measurement, engineering unit, and expected range. Verify that all wells use the same units for the same measurements.
E2 (Critical). For data from acquired properties, compare the unit conventions used by the acquired company's SCADA system against the parent company's conventions. Do not assume they match.
E3 (Critical). Check for mixed units within a single tag history. A tag that switches from psi to kPa at a specific date (typically during a SCADA migration or RTU replacement) will contain a step change that is not physical.
E4 (High). Verify that analog scaling is correct. RTUs convert raw analog signals (4-20 mA) to engineering units using configured scale factors. A misconfigured scale factor will produce values that are within a plausible range but systematically wrong.
E5 (Critical). Check for tags where the engineering unit label does not match the actual unit of the data. A tag labeled "psig" that actually contains "psia" will be off by ~14.7 psi. A tag labeled "bbl/d" that contains "MCF/d" will be wrong by orders of magnitude.
E6 (Medium). For temperature tags, verify whether readings are in Fahrenheit or Celsius. Both are used in the industry, and the difference is not always obvious from the values alone (a reading of 120 could be either).
E7 (Critical). Verify that timestamps across all data sources use the same time reference (UTC vs. local time, and which local timezone). SCADA data from a West Texas field might be in CST, CDT, UTC, or a mix depending on RTU configuration.
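
Check E3 can be screened heuristically by comparing a tag's local level before and after each point against known conversion factors. This sketch looks for a jump of roughly 6.895x (psi to kPa) in a synthetic line-pressure tag; the window length and tolerance are illustrative choices, not standards:

```python
import pandas as pd
import numpy as np

def unit_switch_candidates(s: pd.Series, factor: float = 6.89476,
                           tol: float = 0.15) -> pd.Series:
    """Flag points where the local level jumps by roughly a known unit
    conversion factor, e.g. psi -> kPa (check E3). Heuristic screen only."""
    before = s.rolling(24).median()                 # trailing 24-sample level
    after = s[::-1].rolling(24).median()[::-1]      # leading 24-sample level
    ratio = after / before
    return (ratio > factor * (1 - tol)) & (ratio < factor * (1 + tol))

# Synthetic hourly line pressure: psi for 100 h, then kPa after an RTU swap
idx = pd.date_range("2024-01-01", periods=200, freq="h")
vals = np.concatenate([np.full(100, 500.0), np.full(100, 500.0 * 6.89476)])
line_pressure = pd.Series(vals, index=idx)

flags = unit_switch_candidates(line_pressure)
print(bool(flags.any()))  # True
```

A hit from this screen is a candidate, not a verdict: a psi-to-kPa switch and a genuine ~7x physical step look identical in the data, so each flag needs to be checked against RTU configuration and migration records.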

F. Metadata and Tag Quality

F1 (Critical). Verify that every SCADA tag has a documented description that identifies what it measures, on which piece of equipment, at which well or facility.
F2 (Critical). Verify that tag-to-well mappings are current. When wells are recompleted, plugged, or new wells are drilled on a pad, tag assignments sometimes do not get updated.
F3 (High). Check for orphan tags: tags that are still actively collecting data but are not assigned to any well or facility in the production database.
F4 (High). Check for phantom tags: tags that are assigned to wells or equipment in the database but are not actually connected to any physical sensor. These will report either zero, null, or a frozen value.
F5 (High). Verify that tag naming conventions are consistent across the asset. Inconsistent naming (e.g., "WHP" vs. "THP" vs. "WellheadPress" vs. "CsgPress" for the same measurement) makes cross-well analysis difficult.
F6 (High). Document the data type of each tag: analog (continuous), digital (on/off), counter (cumulative), or calculated (derived from other tags). ML models need to handle each type differently.
F7 (Critical). Verify that well status information (producing, shut-in, workover, waiting on completion) is available as a SCADA tag or linked metadata. Without well status, a model cannot distinguish between "no production because the well is shut in" and "no production because something is wrong."
F8 (High). Check whether artificial lift parameters are available in SCADA. For rod pump wells: stroke count, load, motor amps. For ESPs: intake pressure, motor temperature, vibration, current. For gas lift: injection rate, injection pressure, valve status.
F9 (High). Verify that facility-level tags (separator pressure, line pressure, tank levels) can be linked to the individual wells that flow through that facility. Without this linkage, well-level production estimates from facility-level measurements are impossible.
F10 (Medium). Document the SCADA system vendor, version, historian type, and data export capabilities. Some historians only export aggregated data (hourly averages), which may not be sufficient for your ML use case.

G. Data Infrastructure and Access

G1 (Critical). Verify that raw, unaggregated SCADA data is accessible -- not just the compressed or averaged values stored in the long-term historian. Many ML applications need minute-level or sub-minute data.
G2 (High). Determine the data retention policy. If the historian only retains high-resolution data for 90 days and then aggregates to hourly, you have a 90-day window to extract training data.
G3 (High). Test the data export process. Can you extract 12 months of 1-minute data for 500 wells in a usable format (CSV, Parquet, database query) within a reasonable time? Some historians are not built for bulk extraction.
G4 (High). Check whether there is a programmatic API for data access. Manual data exports via GUI do not scale. An ML pipeline needs automated, repeatable data access.
G5 (High). Verify that the data export includes all necessary context: tag names, timestamps, values, quality codes, and any flags or annotations applied by the SCADA system.
G6 (Medium). Check whether the SCADA system provides data quality codes (e.g., OPC quality codes). These codes can indicate whether a value is good, bad, uncertain, or substituted. If available, they are extremely valuable for automated data cleaning.
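
If your historian exposes OPC DA quality codes (check G6), filtering on them is simple. The OPC DA quality word encodes overall status in bits 7-6 (00 = Bad, 01 = Uncertain, 11 = Good); the sketch below decodes those bits and filters a synthetic export on them:

```python
import pandas as pd

def opc_status(quality: int) -> str:
    """Decode the overall status bits (7-6) of an OPC DA quality word."""
    return {0: "Bad", 1: "Uncertain", 3: "Good"}.get((quality >> 6) & 0b11,
                                                     "Unknown")

# Synthetic historian export: value plus raw OPC quality word
df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=4, freq="min"),
    "value": [512.0, 512.0, 0.0, 514.0],
    "quality": [192, 192, 0, 64],  # 192 = Good, 0 = Bad, 64 = Uncertain
})
df["status"] = df["quality"].map(opc_status)
clean = df[df["status"] == "Good"]
print(clean["value"].tolist())  # [512.0, 512.0]
```

Note how the Bad-quality row carries a plausible-looking value of 0.0: without the quality code, it would silently enter the training set as a real zero.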

Scoring Your Data Quality

After running the checklist, categorize your findings:

Green (ML-ready): Fewer than 5 items flagged, none at Critical severity. You can proceed with an ML project with standard data preprocessing.

Yellow (fixable): 5-15 items flagged, with no more than 3 at Critical severity. You need a data quality remediation phase before starting ML model development. Budget 4-8 weeks and assign a dedicated data engineer.

Red (not ready): More than 15 items flagged, or more than 3 Critical items. Your data infrastructure needs significant work before any ML project will succeed. This is not a failure -- it is a realistic assessment that saves you from a more expensive failure later.

Most mid-size operators with 500-5,000 wells will score Yellow on their first audit. That is normal. The question is not whether you have data quality issues -- you do -- but whether you know what they are and have a plan to address them.
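
The scoring bands above can be encoded directly. The function below is one reasonable resolution of the edge cases (for example, fewer than five flags but with a Critical item falls to Yellow or Red rather than Green); the item IDs and severity map mirror the checklist:

```python
def audit_score(flagged: list[str], severities: dict[str, str]) -> str:
    """Map audit findings to the Green/Yellow/Red bands defined above."""
    n_total = len(flagged)
    n_critical = sum(1 for item in flagged
                     if severities.get(item) == "Critical")
    if n_total < 5 and n_critical == 0:
        return "Green"
    if n_total <= 15 and n_critical <= 3:
        return "Yellow"
    return "Red"

# Severity map excerpt, matching the checklist items above
severities = {"A1": "Critical", "A2": "Critical", "A9": "Critical",
              "B1": "Critical", "C3": "High", "D2": "High", "F5": "High"}

print(audit_score(["C3", "D2"], severities))                    # Green
print(audit_score(["A1", "A2", "C3", "D2", "F5"], severities))  # Yellow
print(audit_score(["A1", "A2", "A9", "B1"], severities))        # Red
```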


The Cost of Skipping the Audit

The math on this is straightforward. A typical AI pilot at a mid-size operator involves:

  • Vendor selection and contracting: 2-3 months
  • Data preparation and formatting: 2-4 months
  • Model development and validation: 3-6 months
  • Pilot deployment and evaluation: 3-6 months

Total timeline: 10-19 months. Total internal cost (engineering time, IT support, project management): $200K-$500K. Total vendor cost: $150K-$400K.

If the project fails at month 6 because the data is not adequate, you have spent $200K-$400K and a year of organizational attention to learn something you could have learned in two weeks with a structured audit.

A data quality audit takes a single data engineer 2-4 weeks, costs perhaps $15K-$30K in labor, and gives you one of three outcomes:

  1. Proceed with confidence. Your data is ready. Start the ML project.
  2. Proceed with remediation. Your data has known issues with known fixes. Budget the remediation into the project plan.
  3. Do not proceed yet. Your data infrastructure needs work first. Invest in SCADA upgrades, sensor maintenance, and data governance before attempting ML.

All three outcomes are valuable. The only bad outcome is not knowing.


Automating the Audit with Open-Source Tools

Much of this checklist can be automated. Gap detection, frozen value identification, outlier flagging, and distribution analysis are all standard time-series analysis tasks. The barrier for most operators is not the analytics -- it is getting the data out of the SCADA historian and into a format where you can run analysis against it.

This is one of the use cases we are addressing with petro-mcp, our open-source MCP (Model Context Protocol) server for petroleum engineering data. MCP provides a standardized interface that allows AI assistants and analytical tools to access production data programmatically, without writing custom integrations for every SCADA vendor and data historian.

For example, rather than manually exporting CSV files from your historian, writing Python scripts to parse them, and running quality checks in Jupyter notebooks, an MCP-enabled workflow lets you query your production data directly and run audit checks against it through a conversational interface. Ask "show me all wells where casing pressure has been frozen for more than 6 hours in the past 30 days" and get an answer in seconds rather than hours.

If you are interested in how MCP servers work and what they mean for oilfield data access, we wrote a detailed technical overview: MCP Servers for Oilfield Data.

The petro-mcp project is open source under the MIT license: github.com/petropt/petro-mcp. We are actively developing adapters for common production data formats and SCADA export formats.


What to Do After the Audit

The audit is diagnostic. It tells you where the problems are. Fixing them requires a combination of field operations, IT infrastructure, and data governance:

Quick Wins (1-4 weeks)

  • Fix tag mappings. Update incorrect tag-to-well assignments. Remove orphan and phantom tags. Standardize tag naming conventions.
  • Apply unit corrections. Identify and correct all unit mismatches. Apply conversion factors to historical data where units changed mid-stream.
  • Configure data quality flags. If your SCADA system supports quality codes, enable them. If not, build a simple rules-based layer that flags frozen values, out-of-range values, and gap events.
  • Document your data. Create or update a tag dictionary with descriptions, units, ranges, and calibration dates for every active tag.

Medium-Term Fixes (1-3 months)

  • Sensor maintenance campaign. Calibrate all pressure transducers and flow meters. Replace sensors beyond their service life. Fix stuck level indicators and failed temperature probes.
  • Communication reliability. Address the root causes of data gaps: cellular coverage issues, radio interference, RTU firmware bugs, undersized memory buffers. Track gap frequency as a KPI.
  • Data pipeline standardization. Establish a consistent data ingestion pipeline that normalizes units, applies quality checks, and stores both raw and cleaned data. Do not overwrite raw data.
  • SCADA configuration audit. Review polling rates, deadbands, report-by-exception settings, and compression settings. Ensure they are appropriate for ML use cases, not just for operator dashboards.

Long-Term Infrastructure (3-12 months)

  • Data lake or warehouse. Centralize SCADA data, production data, well data, and facility data in a queryable data store. This does not require a massive cloud migration -- it can start with a simple time-series database and grow.
  • Automated data quality monitoring. Build dashboards that track the key quality metrics from this checklist on an ongoing basis. Data quality is not a one-time project -- it degrades continuously and requires continuous monitoring.
  • Data governance program. Assign ownership of data quality. Someone needs to be accountable for SCADA data quality, and it should not be the person who is also trying to build ML models.

A Note on Vendor Data Quality Claims

Nearly every AI/ML vendor in the production optimization space will tell you that their platform "handles data quality automatically" through built-in cleaning, imputation, and preprocessing. Some of these capabilities are real. Many are marketing.

Be skeptical. Ask specific questions:

  • How does the platform handle frozen values? Does it detect them, or just pass them through?
  • What imputation method is used for missing data? Linear interpolation? Forward fill? Model-based? Each has different failure modes.
  • How does it detect sensor drift versus real process changes?
  • What happens when the data quality is so poor that the model cannot produce reliable predictions? Does the system tell you, or does it silently output unreliable results?
  • Can you see the data before and after cleaning? Can you override the automated cleaning when it gets things wrong?

A platform that claims to "just handle" data quality without giving you visibility into what it is doing is a platform you should be cautious about deploying.


Conclusion

Data quality is not a preliminary step that you get through on the way to the interesting ML work. For most operators, it is the hardest and most valuable part of the entire AI project. A company that fixes its SCADA data quality has improved its operations regardless of whether it ever deploys a machine learning model. Better data means better surveillance, better anomaly detection, better production reporting, and better engineering decisions.

The checklist in this article is a starting point. Every operator's data has its own particular pathologies shaped by its SCADA vendor, its field operations practices, its acquisition history, and its IT infrastructure. But the categories of problems are universal. If you audit your data against these categories before starting an ML project, you will either confirm that you are ready to proceed or identify exactly what needs to be fixed first.

Either way, you avoid the most expensive failure mode in oilfield AI: building a model on data you do not understand.


Dr. Mehrdad Shirangi is the founder of Groundwork Analytics and holds a PhD in Energy Systems Optimization from Stanford University. He has been building AI solutions for the energy industry since 2018. Connect on X/Twitter and LinkedIn, or reach out at info@petropt.com.



Have questions about this topic? Get in touch.