Why does a static speed threshold miss real GPS jumps in urban environments?

Multipath reflections in dense urban canyons can shift a coordinate by 50–200 m in a single sample while the reported speed field stays within a plausible range, because the OBD-II speed sensor is unaffected. A static threshold on the reported speed column alone never catches these position jumps; you must compute speed from the coordinate geometry itself via Haversine and compare the two values.
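
To make the size of the effect concrete, here is a minimal worked sketch using only the standard library. The two fixes and the reported speed are hypothetical values chosen for illustration, and `haversine_m` and `implied_speed_kmh` are illustrative helper names, not part of any library.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS 84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def implied_speed_kmh(prev_fix, curr_fix, dt_sec):
    """Speed implied by the coordinate geometry alone, in km/h."""
    return haversine_m(*prev_fix, *curr_fix) / dt_sec * 3.6


# Hypothetical fixes one second apart: 0.001 degrees of latitude is about 111 m
prev_fix = (51.5000, -0.1000)
curr_fix = (51.5010, -0.1000)
reported_speed_kmh = 32.0  # hypothetical OBD-II reading, plausible on its own

geometry_speed_kmh = implied_speed_kmh(prev_fix, curr_fix, dt_sec=1.0)
discrepancy_kmh = abs(geometry_speed_kmh - reported_speed_kmh)
print(f"geometry: {geometry_speed_kmh:.0f} km/h, discrepancy: {discrepancy_kmh:.0f} km/h")
```

A threshold on the reported 32 km/h would pass this sample, while the geometry-derived speed of roughly 400 km/h exposes the jump.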

What contamination rate should I use for IsolationForest on telematics data?

Start at 0.02 (2 %) for mixed urban/highway fleets. Reduce to 0.005 for highway-only heavy trucks whose motion is more uniform. Increase toward 0.05 if the hardware is consumer-grade smartphones with frequent GPS dropouts. Calibrate against a hand-labelled held-out week of data before deploying to production.
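
One way to run that calibration is to sweep a few candidate rates and score each against the labelled week. The sketch below is illustrative only: `sweep_contamination` is a hypothetical helper, the candidate list is an assumption, and the synthetic arrays stand in for real feature matrices and hand labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score


def sweep_contamination(X_train, X_holdout, y_holdout, candidates=(0.005, 0.01, 0.02, 0.05)):
    """Return (contamination, precision, recall) on the held-out set for each candidate."""
    rows = []
    for c in candidates:
        model = IsolationForest(contamination=c, n_estimators=100, random_state=42)
        model.fit(X_train)
        # predict() returns -1 for anomalies and 1 for inliers
        y_pred = model.predict(X_holdout) == -1
        rows.append((
            c,
            precision_score(y_holdout, y_pred, zero_division=0),
            recall_score(y_holdout, y_pred, zero_division=0),
        ))
    return rows


# Synthetic stand-in data: two features, with faults shifted on the second one
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 0.0], scale=[10.0, 1.0], size=(2000, 2))
normal = rng.normal(loc=[50.0, 0.0], scale=[10.0, 1.0], size=(980, 2))
faulty = rng.normal(loc=[50.0, 15.0], scale=[10.0, 2.0], size=(20, 2))
X_holdout = np.vstack([normal, faulty])
y_holdout = np.r_[np.zeros(980, dtype=bool), np.ones(20, dtype=bool)]

results = sweep_contamination(X_train, X_holdout, y_holdout)
for c, p, r in results:
    print(f"contamination={c:.3f} precision={p:.2f} recall={r:.2f}")
```

Pick the smallest rate whose recall is acceptable for your fleet; raising it further mostly trades precision for little extra recall.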

How do I prevent interpolation artefacts at chunk boundaries when processing in 15-minute windows?

Overlap adjacent chunks by at least max_interp_gap × sample_interval seconds, which is 0.5 seconds at 10 Hz with the default max_interp_gap of 5 (or 5 seconds at 1 Hz). Process the overlapping region, then discard the overlap rows after merging and keep only the canonical timestamp range for each chunk. This prevents edge NaNs from propagating into the next window's forward-fill.

Automating Outlier Detection in High-Frequency Telematics Data

This page extends Outlier Removal in Raw Telematics Streams with a concrete, production-ready automation strategy for 1–10 Hz GPS feeds. At 10 Hz, a 30-vehicle fleet generates roughly 1 million rows per hour (30 vehicles × 36,000 samples each), so manual review is impossible, and static column-level thresholds silently pass multipath position jumps while discarding legitimate hard-braking events. The hybrid pipeline below combines geometry-derived velocity validation, kinematic continuity constraints, and unsupervised anomaly scoring to isolate physically impossible state transitions from aggressive but valid driving. It integrates directly into the GPS Data Preprocessing & Cleaning Fundamentals ingestion layer without requiring per-vehicle rule tuning.

Compatibility and Configuration Requirements

| Dependency | Minimum version | Notes |
|---|---|---|
| Python | 3.10 | `match` statement not used; 3.9 works if `typing.Union` replaces `X \| Y` |
| pandas | 2.0 | `DataFrame.rolling` `min_periods` behaviour changed in 2.0 |
| numpy | 1.24 | `np.clip` broadcasting fix for coordinate arrays |
| scikit-learn | 1.3 | `IsolationForest` `max_samples="auto"` default changed |
| pyproj (optional) | 3.6 | Only needed if input coordinates are not WGS 84; see CRS normalisation |

Input DataFrame requirements:

timestamp: ISO-8601 string or datetime64[ns], monotonically increasing per vehicle ID
lat, lon: WGS 84 decimal degrees (EPSG:4326)
speed_kmh: OBD-II or CAN-bus reported speed, km/h
heading_deg: compass bearing 0–360°
gps_accuracy_m: estimated horizontal position accuracy in metres (typically HDOP multiplied by the receiver's nominal range error; HDOP itself is dimensionless)

If timestamps are not already synchronised across mixed OBD-II and mobile device streams, align them first (see Timestamp Synchronisation for Multi-Device GPS Logs) before passing data into this pipeline.
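
The input requirements above can be checked up front. The following is a minimal sketch, not part of the pipeline itself: `validate_input` and `REQUIRED_COLUMNS` are hypothetical names, and the range checks are assumptions derived from the column descriptions.

```python
import pandas as pd

REQUIRED_COLUMNS = ("timestamp", "lat", "lon", "speed_kmh", "heading_deg", "gps_accuracy_m")


def validate_input(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the frame looks usable."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    ts = pd.to_datetime(df["timestamp"])
    if not ts.is_monotonic_increasing:
        problems.append("timestamp is not monotonically increasing")
    # between() is False for NaN, so missing coordinates are reported too
    if not (df["lat"].between(-90, 90).all() and df["lon"].between(-180, 180).all()):
        problems.append("lat/lon outside WGS 84 decimal-degree range")
    if not df["heading_deg"].between(0, 360).all():
        problems.append("heading_deg outside 0-360")
    if (df["speed_kmh"] < 0).any():
        problems.append("negative speed_kmh")
    return problems
```

Call it once per vehicle before the pipeline and reject or repair any frame for which the returned list is not empty.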

Pipeline Architecture

The flow has three stages: deterministic physics checks run first, their derived features feed ML-based contextual scoring, and gap-bounded interpolation then produces the cleaned stream for downstream consumers such as stop detection and map matching.

Production-Ready Implementation

The function below is self-contained and copy-paste ready. It handles the full three-stage pipeline: Haversine-derived speed validation, kinematic flag accumulation, IsolationForest contextual scoring, and gap-bounded linear interpolation with an audit trail.

import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def _haversine_speed_kmh(
    lat1: pd.Series,
    lon1: pd.Series,
    lat2: pd.Series,
    lon2: pd.Series,
    dt_sec: pd.Series,
) -> pd.Series:
    """
    Vectorised Haversine speed in km/h between consecutive coordinate pairs.

    Parameters
    ----------
    lat1, lon1 : previous position (decimal degrees, WGS 84)
    lat2, lon2 : current position (decimal degrees, WGS 84)
    dt_sec     : elapsed seconds between the two samples (clipped to >= 0.1)

    Returns a Series aligned to the input index; first element is 0.0.
    """
    R = 6371.0  # Earth mean radius, km
    φ1 = np.radians(lat1)
    φ2 = np.radians(lat2)
    Δφ = np.radians(lat2 - lat1)
    Δλ = np.radians(lon2 - lon1)

    a = np.sin(Δφ / 2) ** 2 + np.cos(φ1) * np.cos(φ2) * np.sin(Δλ / 2) ** 2
    dist_km = 2 * R * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    speed = dist_km / (dt_sec / 3600.0)
    return speed.fillna(0.0)


def detect_telematics_outliers(
    df: pd.DataFrame,
    contamination: float = 0.02,
    max_acc_ms2: float = 12.0,
    max_speed_discrepancy_kmh: float = 30.0,
    max_hdop_m: float = 50.0,
    max_interp_gap: int = 5,
) -> pd.DataFrame:
    """
    Automates outlier detection in high-frequency telematics data.

    Parameters
    ----------
    df : DataFrame with columns timestamp, lat, lon, speed_kmh,
         heading_deg, gps_accuracy_m.  One vehicle per call.
    contamination : expected outlier fraction for IsolationForest (0.005–0.05).
         Lower for highway-only heavy trucks; higher for urban smartphones.
    max_acc_ms2 : longitudinal acceleration limit in m/s².
         12.0 covers aggressive passenger vehicles; use 4.0 for rigid trucks.
    max_speed_discrepancy_kmh : threshold for Haversine vs OBD-II speed delta.
         30 km/h catches multipath jumps without flagging hard-braking events.
    max_hdop_m : GPS accuracy threshold in metres.
         Samples above this are kinematically suspect even if speed looks fine.
    max_interp_gap : maximum consecutive outlier samples to interpolate.
         Gaps longer than this trigger a segment break rather than interpolation.

    Returns
    -------
    DataFrame with original columns plus:
        is_outlier        bool  – combined deterministic + ML flag
        anomaly_score     float – IsolationForest decision function (negative = anomaly)
        cleaned_speed_kmh float – interpolated speed with outliers replaced
    """
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df = df.sort_values("timestamp").reset_index(drop=True)

    # --- Stage 1: Physics-based temporal-spatial sanity ---

    # Time deltas; clip to 0.1 s to avoid division-by-zero on duplicate timestamps
    df["_dt_sec"] = df["timestamp"].diff().dt.total_seconds().fillna(1.0).clip(lower=0.1)

    # Geometry-derived speed; compare to OBD-II reported speed
    df["_hav_speed_kmh"] = _haversine_speed_kmh(
        df["lat"].shift(1).fillna(df["lat"]),
        df["lon"].shift(1).fillna(df["lon"]),
        df["lat"],
        df["lon"],
        df["_dt_sec"],
    )
    speed_delta = (df["speed_kmh"] - df["_hav_speed_kmh"]).abs()
    df["_speed_flag"] = speed_delta > max_speed_discrepancy_kmh

    # Longitudinal acceleration from reported speed (m/s²)
    df["_acc_ms2"] = (df["speed_kmh"] / 3.6).diff() / df["_dt_sec"]

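    # Heading continuity: shortest arc between consecutive bearings (always 0–180°)
    raw_delta = df["heading_deg"].diff().abs()
    df["_heading_delta"] = np.minimum(raw_delta, 360.0 - raw_delta)

    # The shortest arc never exceeds 180°, so a "> 180" test can never fire.
    # Assumption: treat a per-sample swing above 150° as implausible; tune per fleet.
    max_heading_delta_deg = 150.0

    # Kinematic flag: impossible acceleration, impossible heading jump, or poor fix
    df["_kinematic_flag"] = (
        (df["_acc_ms2"].abs() > max_acc_ms2)
        | (df["_heading_delta"] > max_heading_delta_deg)
        | (df["gps_accuracy_m"] > max_hdop_m)
    )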

    # --- Stage 2: Contextual anomaly scoring (IsolationForest) ---

    feature_cols = [
        "speed_kmh",
        "heading_deg",
        "gps_accuracy_m",
        "_acc_ms2",
        "_hav_speed_kmh",
    ]
    X = df[feature_cols].ffill().bfill()

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    iso = IsolationForest(
        contamination=contamination,
        n_estimators=100,   # 100 trees balances accuracy and fit latency
        max_samples="auto", # defaults to min(256, n_samples) — fast on long series
        random_state=42,
        n_jobs=-1,
    )
    iso.fit(X_scaled)
    df["anomaly_score"] = iso.decision_function(X_scaled)
    # decision_function is offset so that scores below 0 match the contamination
    # fraction; the stricter -0.1 cut-off keeps only the more isolated points
    df["_ml_flag"] = df["anomaly_score"] < -0.1

    # Combine all flags
    df["is_outlier"] = df["_speed_flag"] | df["_kinematic_flag"] | df["_ml_flag"]

    # --- Stage 3: Gap-bounded interpolation with audit trail ---

    df["cleaned_speed_kmh"] = df["speed_kmh"].copy().astype(float)
    df.loc[df["is_outlier"], "cleaned_speed_kmh"] = np.nan

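    # Linear interpolation between valid neighbours; leading/trailing NaNs stay NaN
    interpolated = df["cleaned_speed_kmh"].interpolate(method="linear", limit_area="inside")

    # interpolate(limit=...) would still fill the start of a long run, and a trailing
    # ffill/bfill would fill the rest, so blank outlier runs longer than
    # max_interp_gap explicitly and leave them as NaN for a segment break
    run_id = (df["is_outlier"] != df["is_outlier"].shift()).cumsum()
    run_len = df["is_outlier"].groupby(run_id).transform("size")
    long_run = df["is_outlier"] & (run_len > max_interp_gap)
    df["cleaned_speed_kmh"] = interpolated.mask(long_run)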

    # Drop internal working columns; keep audit columns
    internal = [c for c in df.columns if c.startswith("_")]
    df = df.drop(columns=internal)

    return df

Execution and Tuning Guidelines

Running the function requires a per-vehicle DataFrame — segment by vehicle_id before calling:

results = (
    raw_df
    .groupby("vehicle_id", group_keys=False)
    .apply(detect_telematics_outliers, contamination=0.02)
    .reset_index(drop=True)
)

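For larger fleets, replace the groupby.apply with a concurrent.futures.ProcessPoolExecutor that maps the function over per-vehicle groups, or split by vehicle in a polars lazy pipeline (each group must then be converted to a pandas DataFrame, because the function uses the pandas API). The function keeps no state between calls, so it is safe to parallelise across vehicle IDs.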
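
A minimal sketch of the executor variant follows. `run_per_vehicle` is a hypothetical helper name and the injectable `executor_cls` argument is an assumption added so the same code can run on threads while debugging; `func` stands for any per-vehicle function such as detect_telematics_outliers.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import pandas as pd


def run_per_vehicle(raw_df, func, executor_cls=ProcessPoolExecutor, max_workers=None, **kwargs):
    """Apply func to each vehicle's rows in parallel and concatenate the results."""
    # One DataFrame per vehicle, in order of first appearance
    groups = [group for _, group in raw_df.groupby("vehicle_id", sort=False)]
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(func, group, **kwargs) for group in groups]
        # Collect in submission order so the output order is deterministic
        parts = [future.result() for future in futures]
    return pd.concat(parts, ignore_index=True)
```

With the default process pool, `func` must be defined at module level so it can be pickled, and on platforms that spawn workers the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` avoids both constraints when stepping through a problem locally. Because the pipeline's IsolationForest already uses n_jobs=-1, capping `max_workers` may be needed to avoid oversubscribing cores.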

Key parameter knobs and their effects:

| Parameter | Default | Effect of raising | Effect of lowering |
|---|---|---|---|
| `contamination` | 0.02 | Flags more points as anomalies; risks false positives on valid hard manoeuvres | Misses subtle sensor degradation that passes physics checks |
| `max_acc_ms2` | 12.0 | Permits steeper acceleration/braking; needed for motorcycles or sports cars | Flags legitimate emergency braking on heavy trucks |
| `max_speed_discrepancy_kmh` | 30.0 | Tolerates larger GPS jumps before flagging; useful in tunnels with brief re-acquisition lag | Flags minor Kalman filter lag in the GPS chipset as outliers |
| `max_hdop_m` | 50.0 | Accepts fixes with poor horizontal accuracy; needed in dense urban canyons | Rejects borderline fixes that are still spatially useful |
| `max_interp_gap` | 5 | Interpolates across longer dropout windows; can introduce smooth artefacts | Forces segment breaks earlier; safer for downstream stop detection algorithms |

Window sizing at different sampling rates: At 10 Hz, five consecutive outlier samples represent only 0.5 seconds — a plausible GPS dropout. At 1 Hz, five samples span five seconds, which typically indicates hardware failure rather than a transient dropout. Reduce max_interp_gap to 2–3 for 1 Hz streams and increase to 10–15 for 10 Hz.
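
The rate-dependent choice can be derived from the data rather than hard-coded. In this sketch, `interp_gap_for_stream` is a hypothetical helper, and the 1.5-second time budget and the floor of 2 samples are assumptions picked so that the result falls inside the ranges suggested above.

```python
import pandas as pd


def interp_gap_for_stream(timestamps, max_gap_seconds=1.5, min_samples=2):
    """Convert a time budget into a sample count using the median sampling interval."""
    ts = pd.to_datetime(pd.Series(timestamps))
    sample_interval = float(ts.diff().dt.total_seconds().median())
    return max(min_samples, int(round(max_gap_seconds / sample_interval)))


ten_hz = pd.date_range("2024-01-01", periods=50, freq=pd.Timedelta(milliseconds=100))
one_hz = pd.date_range("2024-01-01", periods=50, freq=pd.Timedelta(seconds=1))
print(interp_gap_for_stream(ten_hz), interp_gap_for_stream(one_hz))
```

Using the median interval keeps the estimate stable when a stream contains occasional dropouts or duplicate timestamps.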

Chunked processing for long sessions: Process each vehicle’s daily log in 15-minute temporal chunks overlapped by max_interp_gap × sample_interval seconds. After merging chunks, discard the overlap rows so that each chunk contributes only its canonical timestamp range. This bounds peak memory to roughly 9,000 rows per 15-minute chunk at 10 Hz, well within pandas’ in-process limits on edge hardware.
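
The overlap-then-trim pattern can be sketched as follows. The helper names `iter_overlapped_chunks` and `process_in_chunks` are hypothetical, the sketch assumes `timestamp` is already a datetime64 column for a single vehicle, and `func` stands for any per-chunk function that returns a frame with a `timestamp` column.

```python
import pandas as pd


def iter_overlapped_chunks(df, chunk="15min", overlap_sec=0.5):
    """Yield (core_start, core_end, piece); piece includes the overlap on both sides."""
    ts = df["timestamp"]
    overlap = pd.Timedelta(seconds=overlap_sec)
    step = pd.Timedelta(chunk)
    start = ts.min().floor(chunk)
    end = ts.max()
    while start <= end:
        core_end = start + step
        mask = (ts >= start - overlap) & (ts < core_end + overlap)
        yield start, core_end, df.loc[mask]
        start = core_end


def process_in_chunks(df, func, chunk="15min", overlap_sec=0.5):
    """Run func on each overlapped chunk, then keep only the canonical core range."""
    parts = []
    for core_start, core_end, piece in iter_overlapped_chunks(df, chunk, overlap_sec):
        if piece.empty:
            continue
        out = func(piece)
        keep = (out["timestamp"] >= core_start) & (out["timestamp"] < core_end)
        parts.append(out.loc[keep])
    return pd.concat(parts, ignore_index=True)
```

The default `overlap_sec=0.5` corresponds to max_interp_gap = 5 at 10 Hz. Because each row belongs to exactly one core range, the merged output contains no duplicated timestamps.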

Baseline drift monitoring: Refit the IsolationForest monthly or after major firmware updates. Seasonal changes in operating conditions can shift the GPS noise floor enough to move the observed outlier fraction by around ±0.5 %. Track the daily outlier rate per vehicle; a sudden spike on a single unit often points to antenna damage or CAN-bus faults, and catching it early prevents weeks of corrupted route data.
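
A minimal sketch of that per-vehicle tracking is shown below. It assumes a results frame with `vehicle_id`, a datetime `timestamp`, and the boolean `is_outlier` column; the function names, the 7-day trailing window, and the 3× spike factor are illustrative assumptions, not recommended values.

```python
import pandas as pd


def daily_outlier_rate(results: pd.DataFrame) -> pd.DataFrame:
    """Share of flagged rows per vehicle per calendar day."""
    day = results["timestamp"].dt.floor("D").rename("day")
    return (
        results.groupby(["vehicle_id", day])["is_outlier"]
        .mean()
        .rename("outlier_rate")
        .reset_index()
    )


def flag_spikes(rates: pd.DataFrame, window: int = 7, factor: float = 3.0) -> pd.DataFrame:
    """Mark days whose rate exceeds factor times the vehicle's trailing median."""
    rates = rates.sort_values(["vehicle_id", "day"]).copy()
    # shift(1) keeps the current day out of its own baseline
    baseline = rates.groupby("vehicle_id")["outlier_rate"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).median()
    )
    rates["spike"] = rates["outlier_rate"] > factor * baseline
    return rates
```

Days without at least three prior observations have no baseline and are never marked, which avoids false alarms on newly onboarded vehicles.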

Common Pitfalls

Comparing reported speed to Haversine speed without clipping dt_sec. If two consecutive timestamps are identical (duplicate packet replay from a cellular buffer), dt_sec is zero and the Haversine speed becomes infinite, flagging the row as an outlier. Always clip(lower=0.1) the time delta before the division, not after.
Fitting IsolationForest on concatenated multi-vehicle data without segmenting first. A highway coach and a last-mile delivery van have non-overlapping kinematic distributions. Fitting a single model produces a useless contamination estimate skewed by the majority vehicle class. Always segment by vehicle_id (and optionally by road-type context) before fitting.
Silently dropping flagged rows instead of interpolating and preserving flags. Downstream algorithms, particularly DBSCAN stop clustering, require a gapless temporal sequence to compute dwell durations correctly. A dropped row creates a false time gap that inflates dwell time for the stop immediately following the gap. Interpolate the flagged values instead of deleting rows, and always retain is_outlier and anomaly_score so the downstream system can apply its own confidence weighting.
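
The first pitfall is easy to reproduce. The sketch below uses four hypothetical samples, two of which share a timestamp, and made-up per-sample distances; it shows the unclipped division producing an infinite speed and the clipped version staying finite.

```python
import numpy as np
import pandas as pd

# Hypothetical 10 Hz samples with one replayed packet (duplicate timestamp)
ts = pd.Series(pd.to_datetime([
    "2024-01-01 00:00:00.0",
    "2024-01-01 00:00:00.1",
    "2024-01-01 00:00:00.1",
    "2024-01-01 00:00:00.2",
]))
dist_km = pd.Series([0.0, 0.002, 0.002, 0.002])  # made-up distances, 2 m per step

dt_raw = ts.diff().dt.total_seconds()

# Without clipping: the duplicate row divides by zero and becomes inf
speed_unclipped = dist_km / (dt_raw / 3600.0)

# Clip BEFORE the division, as the pipeline does
dt_clipped = dt_raw.fillna(1.0).clip(lower=0.1)
speed_clipped = dist_km / (dt_clipped / 3600.0)

print(speed_unclipped.tolist())
print(speed_clipped.tolist())
```

Clipping after the division would leave the infinite value in place, which is why the order matters.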

Related:

Implementing a Rolling Median Filter for GPS Drift Removal — complementary smoothing approach before outlier scoring
Kalman Filtering for GPS Noise Reduction — probabilistic noise model as an alternative to IsolationForest scoring
Timestamp Synchronisation for Multi-Device GPS Logs — prerequisite alignment step for mixed OBD-II and mobile streams
DBSCAN for Fleet Stop Clustering — downstream consumer of the cleaned position stream
Stop Detection & Dwell Time Analytics — how cleaned telematics feeds accurate dwell calculations