
LAS Files for Machine Learning: How to Clean, Normalize, and Feature-Engineer Well Log Data for ML Models

Dr. Mehrdad Shirangi | 30 min read | Published by Groundwork Analytics LLC

Editorial disclosure

This article reflects the independent analysis and professional opinion of the author, informed by direct experience building ML pipelines for upstream oil and gas. Code examples are tested and functional. No vendor reviewed or influenced this content prior to publication.

Machine learning on well log data sounds straightforward in theory. You have continuous measurements of rock properties -- gamma ray, resistivity, porosity, density -- recorded across thousands of feet of subsurface formation. The data is numerical. It is dense. It should be ideal for supervised learning.

In practice, anyone who has tried to build an ML model from raw LAS files knows the reality is different. The gamma ray curve from one well uses the mnemonic GR, another uses GR_EDTC, a third uses SGR. Null values show up as -999.25, or -9999, or NaN, or occasionally as 0 in a column where 0 is also a valid measurement. Depth intervals do not match between wells. Header metadata is incomplete or contradictory. And that is before you encounter the wrapped-line formatting of LAS 1.2 files, or the binary complexity of DLIS.

The gap between raw LAS files and a clean, model-ready dataset is where most well log ML projects stall. Not in the modeling. In the data preparation.

This article provides a complete, runnable Python pipeline for taking raw LAS well log data from messy field files to structured, feature-engineered datasets suitable for machine learning. It covers the format itself, the common data quality issues you will encounter, a step-by-step cleaning and normalization workflow, feature engineering techniques specific to petrophysics, and the ML applications where these features deliver real value.

If you are new to petroleum engineering data in Python, start with our companion article: Petroleum Engineering Data in Python: A Practical Guide. This article assumes you are comfortable with pandas, NumPy, and basic Python.


What LAS Files Are

LAS (Log ASCII Standard) is the petroleum industry's standard format for well log data. Developed by the Canadian Well Logging Society (CWLS) in the late 1980s, LAS files store continuous depth-indexed measurements recorded by logging tools run inside a wellbore.

A typical LAS file contains:

  • Header sections describing the well (name, location, field, operator), the curves (mnemonic, unit, description), and acquisition parameters (depth range, step, null value)
  • Data section with tab or space-delimited numerical values, one row per depth sample

LAS Versions

LAS 1.2 -- The original format from 1989. Still encountered in legacy datasets. Uses a rigid fixed-width column layout. Lines are limited to 256 characters, which means files with many curves use a "wrapped" format where a single depth sample spans multiple lines. This wrapping is the single most common source of parsing errors when reading legacy data.

LAS 2.0 -- Released in 1995 and by far the most common format in active use. Relaxes some formatting constraints from 1.2, supports longer lines, and adds a parameter section. Most modern logging companies deliver LAS 2.0 files. The lasio Python library handles 2.0 files reliably.

LAS 3.0 -- A more ambitious format that supports multiple data sections, log and core data in the same file, and structured metadata. Adoption has been slow. Most operators and service companies still use 2.0 because it is simpler and universally supported. If you encounter a 3.0 file, lasio has partial support, but you may need to handle some sections manually.

DLIS (Digital Log Interchange Standard) -- Not a LAS format, but worth mentioning because you will encounter it. DLIS is a binary format used for high-resolution logging data, particularly image logs and array tools. It is significantly more complex than LAS. The dlisio library (from Equinor) handles DLIS reading in Python, but the data model is fundamentally different -- DLIS files contain frames, channels, and objects rather than simple depth-value tables. For most ML pipelines, the strategy is to extract the curves you need from DLIS into a pandas DataFrame and then proceed with the same workflow described below.

# Quick check: what format are you dealing with?
import lasio

# LAS files
las = lasio.read("well_a.las")
print(las.version)        # LAS version info
print(las.well)           # Well header
print(las.curves.keys())  # Available curves

# DLIS files (different library)
# pip install dlisio
from dlisio import dlis  # dlisio exposes DLIS reading via its dlis submodule

with dlis.load("well_b.dlis") as (f, *_):
    for frame in f.frames:
        print(frame.channels)

Common Data Quality Issues in LAS Files

Before writing a single line of ML code, you need to understand the data quality problems you are going to encounter. These are not edge cases. They are the norm.

Inconsistent Curve Names

This is the single biggest headache in multi-well ML projects. The same physical measurement gets recorded under dozens of different mnemonics depending on the logging vendor, the tool string, the vintage of the acquisition, and the preferences of the log analyst who processed the data.

Gamma ray alone can appear as: GR, GR_EDTC, SGR, HSGR, CGR, GRD, ECGR, GR_ARC, GRGC. Bulk density: RHOB, RHOZ, DEN, ZDEN, HDEN, DENS. Deep resistivity: ILD, RT90, RDEP, AT90, RD, RLLD, AHT90, M2R9.

If you are combining data from 50 or 500 wells, you will encounter this problem at scale. There is no avoiding it.

Null Values

The LAS standard specifies a null value in the header (usually -999.25). In practice, you will find:

  • -999.25 (standard)
  • -9999 and -9999.0
  • -999 and -999.0
  • 0 (dangerous -- sometimes a valid measurement)
  • 9999 (positive sentinel)
  • Blank/empty fields
  • NaN (if pre-processed)

The lasio library handles the header-declared null value automatically, converting it to NaN. But it cannot catch the non-standard sentinels. You need to handle those explicitly.

Depth Mismatches

Wells logged at different times, by different vendors, or with different tool combinations will have different depth intervals, depth steps, and depth references. One well might be logged from 1000 to 8500 feet at 0.5-foot intervals. Another from 500 to 9000 feet at 0.1524 meters (half-foot in metric). A third from 2000 to 7000 feet at 6-inch intervals with gaps where the tool was tripped.

For multi-well ML, every sample must be on a consistent depth grid. This means resampling and interpolation.

Header Errors

Well names misspelled. API numbers truncated. Locations in the wrong coordinate system. Depth reference (KB, GL, DF) misidentified or missing. These errors rarely affect the curve data directly, but they cause havoc when you are trying to merge LAS data with external databases (production data, completion records, formation tops).

Wrapped Lines (LAS 1.2)

In wrapped LAS 1.2 files, a single depth sample's data wraps across multiple lines. The lasio library handles this correctly in most cases, but corrupted or hand-edited files can break the parser. If lasio.read() raises an error on a legacy file, wrapped-line formatting is the first thing to check.

Non-Standard Mnemonics and Units

Some companies use proprietary curve mnemonics that follow no standard. Units may be mixed within a single file (resistivity in ohm-m in one curve, ohm-ft in another). The unit field in the header might say OHMM or OHM.M or ohmm or be blank entirely.
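One pragmatic defense is to normalize unit strings before comparing them. The alias table below is illustrative, not exhaustive -- extend it with the spellings you actually encounter in your own files:

```python
# Illustrative alias table: collapse unit-string variants to one canonical form.
UNIT_ALIASES = {
    "OHMM": "OHM.M", "OHM-M": "OHM.M", "OHM.M": "OHM.M",
    "G/CC": "G/CM3", "G/C3": "G/CM3", "G/CM3": "G/CM3",
    "USEC/FT": "US/FT", "US/F": "US/FT", "US/FT": "US/FT",
}

def normalize_unit(unit: str) -> str:
    """Return a canonical unit string; unknown units pass through cleaned."""
    cleaned = unit.strip().upper().replace(" ", "")
    return UNIT_ALIASES.get(cleaned, cleaned)
```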


Step-by-Step Python Pipeline

Here is the complete pipeline for going from raw LAS files to model-ready features. Install the dependencies first:

pip install lasio pandas numpy scikit-learn matplotlib

Step 1: Loading LAS Files with lasio

import lasio
import pandas as pd
import numpy as np
from pathlib import Path

def load_las_to_dataframe(filepath: str) -> pd.DataFrame:
    """Load a LAS file and return a pandas DataFrame with depth as index."""
    las = lasio.read(filepath)
    df = las.df()  # DataFrame with depth as index

    # Store well metadata as DataFrame attributes.
    # lasio raises an error for missing header items, so guard the lookups.
    try:
        df.attrs["well_name"] = str(las.well.WELL.value)
    except (KeyError, AttributeError):
        df.attrs["well_name"] = "UNKNOWN"
    try:
        df.attrs["null_value"] = las.well.NULL.value
    except (KeyError, AttributeError):
        df.attrs["null_value"] = -999.25
    df.attrs["filepath"] = filepath

    return df


def load_multiple_wells(directory: str, pattern: str = "*.las") -> dict:
    """Load all LAS files from a directory into a dict of DataFrames."""
    wells = {}
    las_dir = Path(directory)

    for filepath in sorted(las_dir.glob(pattern)):
        try:
            df = load_las_to_dataframe(str(filepath))
            well_name = df.attrs.get("well_name", filepath.stem)
            wells[well_name] = df
            print(f"Loaded {well_name}: {df.shape[0]} samples, "
                  f"{df.shape[1]} curves ({list(df.columns[:5])}...)")
        except Exception as e:
            print(f"Failed to load {filepath.name}: {e}")

    return wells

Step 2: Curve Name Standardization

This is the most labor-intensive step, but it is essential. You need a mapping from vendor-specific mnemonics to standard names.

# Curve name mapping: vendor mnemonics -> standard names
CURVE_NAME_MAP = {
    # Gamma Ray
    "GR": "GR", "GR_EDTC": "GR", "SGR": "GR", "HSGR": "GR",
    "CGR": "GR", "GRD": "GR", "ECGR": "GR", "GR_ARC": "GR",
    "GRGC": "GR",

    # Bulk Density
    "RHOB": "RHOB", "RHOZ": "RHOB", "DEN": "RHOB", "ZDEN": "RHOB",
    "HDEN": "RHOB", "DENS": "RHOB", "HRHOB": "RHOB",

    # Neutron Porosity
    "NPHI": "NPHI", "TNPH": "NPHI", "NEU": "NPHI", "NPOR": "NPHI",
    "HNPHI": "NPHI", "CN": "NPHI", "CNC": "NPHI", "CNCF": "NPHI",

    # Deep Resistivity
    "ILD": "RT", "RT90": "RT", "RDEP": "RT", "AT90": "RT",
    "RD": "RT", "RLLD": "RT", "AHT90": "RT", "M2R9": "RT",
    "LLD": "RT", "ILM": "RT",

    # Sonic (Compressional)
    "DT": "DT", "DTC": "DT", "DTCO": "DT", "AC": "DT",
}

def standardize_curve_names(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns using the standard mnemonic mapping."""
    rename_map = {}
    unmapped = []

    for col in df.columns:
        col_upper = col.upper().strip()
        if col_upper in CURVE_NAME_MAP:
            standard_name = CURVE_NAME_MAP[col_upper]
            if standard_name in rename_map.values():
                print(f"  Warning: duplicate mapping to {standard_name} "
                      f"(from {col}). Skipping.")
                continue
            rename_map[col] = standard_name
        else:
            unmapped.append(col)

    if unmapped:
        print(f"  Unmapped curves: {unmapped}")

    df = df.rename(columns=rename_map)
    return df

Build your CURVE_NAME_MAP iteratively. Start with the mnemonics above, then expand it every time you encounter a new well that uses a name you have not seen. After processing a few hundred wells from a basin, your mapping will cover 95% of what you encounter.

Step 3: Null Value Handling

def clean_null_values(
    df: pd.DataFrame,
    sentinel_values: tuple = (-999.25, -999, -9999, 9999, 999.25),
    physical_bounds: dict = None,
) -> pd.DataFrame:
    """Replace sentinel null values and out-of-bounds values with NaN."""
    df = df.copy()

    # Replace all known sentinel values in one pass
    df = df.replace(list(sentinel_values), np.nan)

    # Apply physical bounds
    if physical_bounds is None:
        physical_bounds = {
            "GR":   (0, 300),       # gAPI
            "RHOB": (1.0, 3.2),     # g/cm3
            "NPHI": (-0.15, 0.60),  # v/v
            "RT":   (0.01, 50000),  # ohm.m
            "DT":   (30, 200),      # us/ft
            "PE":   (0.5, 10),      # barns/electron
            "CALI": (4, 30),        # inches
        }

    for curve, (low, high) in physical_bounds.items():
        if curve in df.columns:
            mask = (df[curve] < low) | (df[curve] > high)
            n_removed = mask.sum()
            if n_removed > 0:
                print(f"  {curve}: {n_removed} out-of-bounds values removed")
                df.loc[mask, curve] = np.nan

    return df

A word on imputation strategy: interpolation is the right default for small gaps (a few samples) in continuous log curves. Forward fill works better when you suspect the gap is in a consistent lithology and the curve should be flat. Zone-based median is useful when you have formation top picks and can impute within a known geological interval.

Do not impute large gaps. If a curve is missing over a 500-foot interval, the data is simply not there. Imputing it introduces fabricated information that your ML model will treat as real. Better to drop those samples or treat that well as incomplete for that curve.
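A sketch of gap-aware imputation along those lines, assuming a max_gap threshold measured in samples: short interior runs are interpolated, longer runs stay NaN.

```python
import numpy as np
import pandas as pd

def impute_small_gaps(df: pd.DataFrame, max_gap: int = 5) -> pd.DataFrame:
    """Interpolate NaN runs of up to max_gap samples; leave longer runs as NaN."""
    df = df.copy()
    for col in df.columns:
        # Fill interior gaps only, up to max_gap samples per gap
        filled = df[col].interpolate(method="linear", limit=max_gap,
                                     limit_area="inside")
        # pandas still fills the first max_gap samples of longer runs,
        # so find runs longer than max_gap and restore them to NaN
        is_na = df[col].isna()
        run_id = (is_na != is_na.shift()).cumsum()
        run_len = is_na.groupby(run_id).transform("size")
        filled[is_na & (run_len > max_gap)] = np.nan
        df[col] = filled
    return df
```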

Step 4: Depth Resampling and Alignment

def resample_to_common_depth(
    df: pd.DataFrame,
    step: float = 0.5,
    method: str = "index",
) -> pd.DataFrame:
    """Resample a well log DataFrame to a uniform depth grid."""
    depth_min = np.ceil(df.index.min() / step) * step
    depth_max = np.floor(df.index.max() / step) * step
    new_depth = np.arange(depth_min, depth_max + step, step)

    # "index" interpolates against the depth values themselves;
    # "linear" would ignore the index and assume equal sample spacing
    df_resampled = df.reindex(df.index.union(new_depth))
    df_resampled = df_resampled.interpolate(method=method)
    df_resampled = df_resampled.loc[new_depth]

    df_resampled.index = np.round(df_resampled.index, 2)
    df_resampled.index.name = "DEPTH"

    return df_resampled

Step 5: Outlier Detection and Removal

Use the modified Z-score method for well log data. Standard Z-scores are sensitive to extreme outliers, which is exactly what you are trying to detect. The modified Z-score uses the median and MAD (median absolute deviation) instead of the mean and standard deviation, making it robust against the very outliers it is meant to find.
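A straightforward implementation of that step follows; the 0.6745 constant rescales the MAD so the score is comparable to a standard Z-score, and 3.5 is the conventional threshold.

```python
import numpy as np
import pandas as pd

def remove_outliers_modified_zscore(
    df: pd.DataFrame, threshold: float = 3.5, curves: list = None
) -> pd.DataFrame:
    """Replace values with |modified Z-score| > threshold with NaN, per curve."""
    df = df.copy()
    if curves is None:
        curves = [c for c in df.columns if df[c].dtype.kind == "f"]
    for col in curves:
        x = df[col]
        median = x.median()
        mad = (x - median).abs().median()
        if not mad or np.isnan(mad):  # constant or empty curve -- nothing to flag
            continue
        modified_z = 0.6745 * (x - median) / mad
        outlier_mask = modified_z.abs() > threshold
        if outlier_mask.any():
            print(f"  {col}: {int(outlier_mask.sum())} outliers removed")
        df.loc[outlier_mask, col] = np.nan
    return df
```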

Step 6: Quality Flagging

Quality flags serve two purposes. First, they let you filter training data to only include high-quality samples. Training on washout zones where the density and neutron logs are unreliable will degrade your model. Second, they become features themselves -- a model predicting lithology can benefit from knowing that the density measurement at a given depth is suspect.
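A minimal version of the add_quality_flags step used later in the pipeline might look like this -- the bit size, washout margin, and key-curve list are assumptions you should adapt to your field:

```python
import pandas as pd

def add_quality_flags(
    df: pd.DataFrame,
    bit_size: float = 8.5,          # assumed bit size, inches
    washout_margin: float = 2.0,    # tolerated caliper enlargement, inches
    key_curves: tuple = ("GR", "RHOB", "NPHI", "RT"),
) -> pd.DataFrame:
    """Add binary quality flags and an aggregate QUALITY_SCORE in [0, 1]."""
    df = df.copy()
    flag_cols = []

    # Washout: caliper reads well beyond bit size -> pad-contact tools suspect
    if "CALI" in df.columns:
        df["FLAG_WASHOUT"] = (df["CALI"] > bit_size + washout_margin).astype(int)
        flag_cols.append("FLAG_WASHOUT")

    # Missing data in any key curve at this depth
    present = [c for c in key_curves if c in df.columns]
    if present:
        df["FLAG_MISSING"] = df[present].isna().any(axis=1).astype(int)
        flag_cols.append("FLAG_MISSING")

    # Aggregate score: 1.0 means no flags raised at this depth
    df["QUALITY_SCORE"] = 1.0 - df[flag_cols].mean(axis=1) if flag_cols else 1.0
    return df
```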


Feature Engineering for ML

Raw log curves are useful features on their own, but engineered features consistently improve model performance for petrophysical applications. The features below are derived from standard petrophysical interpretation principles.

Log Ratios and Crossover Indicators

def engineer_petrophysical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create petrophysics-derived features from standard log curves."""
    df = df.copy()

    # Neutron-Density separation (gas indicator)
    if "NPHI" in df.columns and "RHOB" in df.columns:
        df["ND_SEPARATION"] = df["NPHI"] - (2.65 - df["RHOB"]) / (2.65 - 1.0)

    # GR normalized (0-1 scale, useful for Vshale estimation).
    # For multi-well data, compute these quantiles per well, not pooled,
    # to remove tool calibration differences between wells.
    if "GR" in df.columns:
        gr_min = df["GR"].quantile(0.01)
        gr_max = df["GR"].quantile(0.99)
        df["GR_NORM"] = ((df["GR"] - gr_min) / (gr_max - gr_min)).clip(0, 1)

    # Vshale (Larionov tertiary rocks)
    if "GR_NORM" in df.columns:
        df["VSHALE"] = 0.083 * (2 ** (3.7 * df["GR_NORM"]) - 1)

    # Porosity from density log (assuming quartz matrix, freshwater)
    if "RHOB" in df.columns:
        rho_matrix = 2.65
        rho_fluid = 1.0
        df["PHID"] = ((rho_matrix - df["RHOB"]) /
                       (rho_matrix - rho_fluid)).clip(0, 0.5)

    return df

Moving Averages and Gradients

Geological formations have vertical structure. A single depth sample captures the log value at one point, but the trend of that value over a window tells you something about the geological context -- are you in a fining-upward sequence, a blocky sand, or a thinly bedded interval?

The window sizes matter. A 5-sample window (~2.5 feet at 0.5-foot spacing) captures thin-bed effects. An 11-sample window captures bed-scale trends. A 21-sample window captures sequence-scale trends. Using multiple window sizes gives your model multi-resolution information, similar to how a geologist reads a log at different scales.
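One possible implementation of the engineer_window_features step referenced in the full pipeline below, using those three window sizes (the curve list and windows are adjustable assumptions):

```python
import numpy as np
import pandas as pd

def engineer_window_features(
    df: pd.DataFrame,
    curves: tuple = ("GR", "RHOB", "NPHI", "RT"),
    windows: tuple = (5, 11, 21),
) -> pd.DataFrame:
    """Add centered rolling statistics and a depth gradient per curve."""
    df = df.copy()
    for curve in curves:
        if curve not in df.columns:
            continue
        for w in windows:
            roll = df[curve].rolling(window=w, center=True,
                                     min_periods=max(1, w // 2))
            df[f"{curve}_MEAN_{w}"] = roll.mean()   # local trend
            df[f"{curve}_STD_{w}"] = roll.std()     # local heterogeneity
        # Sample-to-sample gradient; divide by the depth step for per-foot units
        df[f"{curve}_GRAD"] = np.gradient(df[curve].to_numpy())
    return df
```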


Putting It All Together

def build_ml_dataset(
    las_directory: str,
    required_curves: list = ["GR", "RHOB", "NPHI", "RT"],
    depth_step: float = 0.5,
    min_quality: float = 0.8,
) -> pd.DataFrame:
    """Full pipeline: LAS files -> model-ready DataFrame."""
    wells = load_multiple_wells(las_directory)

    for name in wells:
        wells[name] = standardize_curve_names(wells[name])
        wells[name] = clean_null_values(wells[name])

    combined = align_wells(wells, step=depth_step,
                           required_curves=required_curves)

    combined = add_quality_flags(combined)
    combined = engineer_petrophysical_features(combined)
    combined = engineer_window_features(combined)

    if "QUALITY_SCORE" in combined.columns:
        combined = combined[combined["QUALITY_SCORE"] >= min_quality]

    return combined

# Usage:
# df = build_ml_dataset("./las_files/")
# df.to_parquet("well_log_features.parquet")
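build_ml_dataset calls align_wells, which was not shown above. A minimal sketch consistent with the Step 4 resampling approach -- resample each well to the common step, drop wells missing required curves, and stack everything under a (WELL, DEPTH) MultiIndex:

```python
import numpy as np
import pandas as pd

def align_wells(wells: dict, step: float = 0.5,
                required_curves: list = None) -> pd.DataFrame:
    """Stack per-well DataFrames into one frame with a (WELL, DEPTH) index."""
    frames = {}
    for name, df in wells.items():
        if required_curves and not set(required_curves).issubset(df.columns):
            print(f"  Skipping {name}: missing required curves")
            continue
        # Resample to a uniform depth grid (same approach as Step 4)
        lo = np.ceil(df.index.min() / step) * step
        hi = np.floor(df.index.max() / step) * step
        grid = np.arange(lo, hi + step, step)
        rs = (df.reindex(df.index.union(grid))
                .interpolate(method="index")
                .loc[grid])
        rs.index = np.round(rs.index, 2)
        rs.index.name = "DEPTH"
        frames[name] = rs
    return pd.concat(frames, names=["WELL", "DEPTH"])
```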

Common ML Applications for Well Log Data

Lithology Prediction

Predicting rock type (sandstone, shale, limestone, dolomite) from log curves is the most widely deployed ML application in petrophysics.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import classification_report

def train_lithology_model(df: pd.DataFrame, label_col: str = "LITHOLOGY"):
    """Train a lithology classifier with well-grouped cross-validation."""

    feature_cols = [c for c in df.columns
                    if c not in [label_col, "WELL", "FORMATION", "DEPTH_BIN"]
                    and not c.startswith("FLAG_")
                    and df[c].dtype in ["float64", "float32", "int64"]]

    X = df[feature_cols].fillna(0)  # simple constant fill; acceptable for tree models
    y = df[label_col]
    # Assumes the (WELL, DEPTH) MultiIndex produced by align_wells
    groups = df.index.get_level_values("WELL")

    # GroupKFold ensures no well appears in both train and test
    gkf = GroupKFold(n_splits=5)

    for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
        model = GradientBoostingClassifier(
            n_estimators=200, max_depth=6, learning_rate=0.1,
            subsample=0.8, random_state=42,
        )
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        y_pred = model.predict(X.iloc[test_idx])
        print(f"\nFold {fold + 1}:")
        print(classification_report(y.iloc[test_idx], y_pred))

    return model, feature_cols  # model from the final fold; refit on all wells for deployment

The critical detail here is GroupKFold. Well log samples are not independent -- adjacent depth samples in the same well are highly correlated. If you use standard k-fold cross-validation, you will get inflated accuracy because the model sees training data from the same well that appears in the test set. Grouping by well ensures that entire wells are held out for testing.

Porosity and Permeability Estimation

Porosity prediction from logs is essentially a regression problem. Core-measured porosity provides the training labels. Permeability is harder -- the relationship between log response and permeability is weaker and more nonlinear, because permeability depends on pore throat geometry that logs do not directly measure.
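Under those caveats, the regression setup mirrors the lithology workflow. The sketch below is illustrative, with X, y, and groups standing in for your log features, core porosity labels, and well identifiers; the synthetic data at the bottom only demonstrates the call pattern:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

def evaluate_porosity_model(X, y, groups, n_splits: int = 5):
    """Well-grouped cross-validated R^2 for a porosity regressor."""
    model = GradientBoostingRegressor(n_estimators=200, max_depth=4,
                                      random_state=42)
    scores = cross_val_score(model, X, y, groups=groups,
                             cv=GroupKFold(n_splits=n_splits), scoring="r2")
    return scores.mean(), scores.std()

# Synthetic stand-in -- replace with real log features and core labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 0.2 * X[:, 0] - 0.1 * X[:, 1] + 0.01 * rng.normal(size=500)
groups = np.repeat(np.arange(5), 100)   # five pretend wells
mean_r2, std_r2 = evaluate_porosity_model(X, y, groups)
```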

Sweet Spot Identification

In unconventional plays, "sweet spot" identification means predicting which intervals will produce the best wells based on log signatures. This is typically framed as a ranking problem: given log data across a lateral landing zone, rank intervals by expected productivity.

Facies Classification

The FORCE 2020 lithofacies competition demonstrated that gradient boosting and neural network models can achieve 70-80% accuracy on held-out wells for a 12-class facies problem -- comparable to expert manual interpretation, and far faster.


How petro-mcp Connects LAS Data to AI Agents

Everything in this article -- loading, cleaning, feature engineering -- is the kind of repetitive, rule-heavy work that AI agents can handle well. But only if the agent has access to the data.

petro-mcp is an open-source MCP (Model Context Protocol) server that gives AI agents structured access to petroleum engineering data, including LAS files, production data, and well headers. For a detailed explanation of what MCP is and how it works, see our article on MCP Servers for Oilfield Data.

For more on how AI agents are transforming upstream operations beyond well log analysis, see our article on Agentic AI in Upstream Oil & Gas.


Public Datasets for Practice

Volve Dataset (Equinor)

The Volve field dataset is the most comprehensive public oilfield dataset available. Released by Equinor in 2018, it contains the complete lifecycle data from a North Sea field.

  • URL: equinor.com/energy/volve-data-sharing
  • What you get: 24 wells with complete log suites, production data, completion data, and geological interpretations
  • Best for: End-to-end ML projects that span from well logs to production prediction

FORCE 2020 Machine Learning Competition

The FORCE 2020 competition dataset provides well logs from 118 wells in the Norwegian North Sea, labeled with lithofacies classes. This is the gold standard dataset for facies classification benchmarking.

3W Dataset (Petrobras)

The 3W dataset contains labeled multivariate time series of undesirable events recorded in real offshore wells. It is production sensor data rather than LAS logs, but it is a strong practice dataset for anomaly detection pipelines.

Kansas Geological Survey (KGS)

  • URL: kgs.ku.edu/Magellan/Logs/
  • Best for: Practicing the data cleaning and curve name standardization pipeline at scale

Practical Tips

Start with the FORCE 2020 dataset. It is clean enough to get results quickly but messy enough to require real data preparation. It comes with labels, which means you can train and evaluate a model in an afternoon.

Save intermediate results as Parquet. After cleaning and feature engineering, save the result as a Parquet file. Parquet preserves data types, compresses well, and loads an order of magnitude faster than CSV.

Version your curve name map. Keep CURVE_NAME_MAP in a separate JSON or YAML file, version it, and update it as you encounter new mnemonics. A team working on hundreds of wells will build a mapping with 200+ entries. That mapping is institutional knowledge.
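Externalizing the mapping takes only a few lines (the file name here is illustrative):

```python
import json
from pathlib import Path

def save_curve_map(curve_map: dict, path: str) -> None:
    """Write the curve-name mapping as sorted, diff-friendly JSON."""
    Path(path).write_text(json.dumps(curve_map, indent=2, sort_keys=True))

def load_curve_map(path: str) -> dict:
    """Load the versioned curve-name mapping."""
    return json.loads(Path(path).read_text())
```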

Never train and test on the same well. This bears repeating. Use GroupKFold or leave-one-well-out cross-validation. If your accuracy drops from 95% to 70% when you switch from random cross-validation to grouped cross-validation, your model was memorizing well-specific patterns, not learning petrophysics.

Be cautious with deep learning. For tabular well log data with fewer than 100,000 samples, gradient boosting (XGBoost, LightGBM) typically outperforms neural networks. Deep learning shows advantages when you have image logs, very large datasets, or sequence modeling needs.


Next Steps

The pipeline in this article takes you from raw LAS files to model-ready features. What you do next depends on your application:

  • For lithology prediction, start with the FORCE 2020 dataset and a gradient boosting classifier.
  • For porosity estimation, you need core data for labels. The Volve dataset includes some core measurements.
  • For production-linked models, the challenge is linking log features to outcomes. Our article on SCADA Data Quality covers the production data side.
  • For AI agent workflows, connect your LAS data pipeline to petro-mcp and let an agent handle the repetitive cleaning and feature engineering steps.

If you are hiring data scientists or petroleum engineers who can bridge both domains, jobs.petropt.ai posts energy industry roles that specifically require data science and ML skills.

