Why does the rolling median outperform a moving average for GPS outlier removal?

The median has a 50% breakdown point: just under half the values in the window can be arbitrarily large outliers without dragging the estimate arbitrarily far from the true value. A moving average has a breakdown point of zero, so a single spike shifts the result in proportion to its magnitude.
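
A quick numerical check makes the difference concrete. The sketch below uses made-up values for a single five-sample window (arbitrary units, not real telemetry) in which the last reading is a spike:

```python
import numpy as np

# Hypothetical window of five readings; the last one is a spike.
window_values = np.array([10.0, 10.1, 9.9, 10.0, 500.0])

median_estimate = np.median(window_values)  # stays at a typical reading
mean_estimate = np.mean(window_values)      # dragged toward the spike

print(f"median={median_estimate:.1f}, mean={mean_estimate:.1f}")
```

The median returns one of the ordinary readings, while the mean lands far outside the range of the four clean samples.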

When should I switch center=True to center=False?

Use center=True in batch preprocessing (the full trace is available) to eliminate phase lag. Switch to center=False in real-time streaming because future samples are unavailable; accept a one-way delay of (window-1)/2 samples.
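
The lag is easy to observe on a synthetic ramp. This is a minimal sketch with an assumed window of 5 and a position that increases by one unit per sample; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

window = 5
ramp = pd.Series(np.arange(10, dtype=float))  # position grows by 1 per sample

centered = ramp.rolling(window, min_periods=1, center=True).median()
trailing = ramp.rolling(window, min_periods=1, center=False).median()

# Once the window is full, the trailing median reports the value from
# (window - 1) / 2 samples ago, while the centered median matches the input.
lag_samples = int(ramp.iloc[6] - trailing.iloc[6])
print(lag_samples, centered.iloc[6] == ramp.iloc[6])
```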

What max_drift_meters threshold is appropriate for highway driving?

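At 100 km/h with a 1 Hz device, a window of 5, and center=True, a legitimate sharp curve rarely moves the median more than 15–20 m from the raw point. Setting max_drift_meters to 50–100 m therefore leaves ample room for legitimate corrections. Keep in mind that the clamp keeps the original reading whenever the median differs from it by more than the threshold, so a multipath spike of 150 m or more passes through unchanged at these settings and needs either a higher threshold or a separate outlier gate.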

Implementing a Rolling Median Filter for GPS Drift Removal

A rolling median filter is a deterministic, low-latency approach to cleaning noisy telemetry that complements the broader Kalman Filtering for GPS Noise Reduction topic. Where the Kalman filter requires covariance matrices, an initial state, and careful noise tuning, the rolling median carries no state between windows: it slides a fixed window over your coordinate stream, computes the median latitude and longitude independently, and replaces each raw reading with the median value when the displacement falls within a configurable threshold. This makes it the preferred first-pass cleaner for batch pipelines and resource-constrained edge devices before heavier probabilistic estimators are applied.

The technique directly addresses GPS jumps caused by multipath reflections off building facades, urban-canyon signal bounce, and brief satellite lock-loss events — failure modes that are equally relevant to outlier removal in raw telematics streams but require a spatially aware, window-based solution rather than a per-point threshold.

Compatibility & Configuration Requirements

Requirement	Minimum version / value
Python	3.9+
pandas	1.5+ (2.2+ recommended — `"1s"` resample alias; uppercase `"S"` is deprecated)
numpy	1.22+
Coordinate format	WGS84 decimal degrees (EPSG:4326)
Input sort order	Chronological by UTC timestamp before filtering
Timestamp column dtype	`datetime64[ns]` or `datetime64[ns, UTC]`

Note on pandas 2.2+: The resample alias "S" (uppercase) is deprecated. Always use lowercase "1s" when resampling to one-second intervals.

Production-Ready Implementation

Fleet telemetry rarely arrives perfectly aligned. Pings contain irregular intervals, dropped packets, and occasional NaN coordinates. The function below handles chronological sorting, applies the rolling median independently to latitude and longitude, vectorizes the Haversine displacement check, and reverts to original values when the median shift exceeds max_drift_meters.

import numpy as np
import pandas as pd


def gps_rolling_median_filter(
    df: pd.DataFrame,
    lat_col: str = "latitude",
    lon_col: str = "longitude",
    ts_col: str = "timestamp",
    window: int = 5,
    max_drift_meters: float = 150.0,
) -> pd.DataFrame:
    """
    Apply a rolling median filter to GPS coordinates to suppress drift.

    Parameters
    ----------
    df : pd.DataFrame
        Input telemetry frame; must contain lat, lon, and timestamp columns.
    lat_col, lon_col : str
        Column names for latitude and longitude (WGS84 decimal degrees).
    ts_col : str
        Column name for UTC timestamps (datetime64).
    window : int
        Sliding window size in samples.  3–9 covers most 1 Hz use cases.
        Larger windows smooth more aggressively but clip legitimate tight turns.
    max_drift_meters : float
        Safety clamp in metres.  If the median coordinate differs from the
        original by more than this value, the original is kept unchanged.
        50–200 m is typical; lower values protect high-speed highway traces.

    Returns
    -------
    pd.DataFrame
        Copy of df with lat/lon columns updated where the filter was accepted.
    """
    if df.empty:
        return df.copy()

    # Ensure chronological order; work on a copy to avoid mutating caller data
    df = df.sort_values(ts_col).copy()

    # Compute rolling median per axis.
    # center=True aligns the median to the window midpoint, eliminating phase
    # lag.  For real-time streaming switch to center=False; latency is then
    # (window - 1) / 2 samples.
    # min_periods=1 prevents NaN at leading/trailing rows.
    lat_median = (
        df[lat_col]
        .rolling(window, min_periods=1, center=True)
        .median()
    )
    lon_median = (
        df[lon_col]
        .rolling(window, min_periods=1, center=True)
        .median()
    )

    # Vectorized Haversine: displacement (metres) between original and median.
    # Operating directly on degree values with the Haversine formula avoids
    # the distortion introduced by WGS84 → planar projection for short distances.
    R = 6_371_000.0  # mean Earth radius in metres
    dlat = np.radians(lat_median - df[lat_col])
    dlon = np.radians(lon_median - df[lon_col])
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(np.radians(df[lat_col]))
        * np.cos(np.radians(lat_median))
        * np.sin(dlon / 2) ** 2
    )
    displacement_m = R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    # Accept the smoothed value only where displacement is within threshold.
    # NaN displacements evaluate to False, preserving any missing originals.
    mask = displacement_m <= max_drift_meters
    df.loc[mask, lat_col] = lat_median[mask]
    df.loc[mask, lon_col] = lon_median[mask]

    return df
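
The displacement check inside the function can be sanity-tested in isolation. The sketch below pulls the same Haversine formula into a standalone helper (`haversine_m` is a hypothetical name, not part of the function above) and checks it against a known rule of thumb: one thousandth of a degree of latitude is roughly 111 m.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # same mean Earth radius as the filter uses


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return EARTH_RADIUS_M * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# 0.001 degrees of latitude should come out at roughly 111 m.
step_m = haversine_m(52.0, 13.0, 52.001, 13.0)
print(round(float(step_m), 1))
```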

Execution & Tuning Guidelines

Running the filter

import pandas as pd
from your_module import gps_rolling_median_filter

df = pd.read_parquet("fleet_pings.parquet")

# Optional: resample to a uniform 1-second grid before filtering
# (use lowercase "1s"; uppercase "S" is deprecated since pandas 2.2).
# .mean() needs numeric columns, so drop or separately aggregate IDs and labels.
df = (
    df.set_index("timestamp")
    .resample("1s")
    .mean()
    .reset_index()
)

clean_df = gps_rolling_median_filter(
    df,
    lat_col="latitude",
    lon_col="longitude",
    ts_col="timestamp",
    window=5,
    max_drift_meters=100.0,
)

Parameter knobs and their effects

Parameter	Recommended range	Effect of increasing	Effect of decreasing
`window`	3–9 samples	Smoother output, larger phase lag, clips tight turns	Reacts faster to direction changes, passes more noise
`center`	`True` (batch) / `False` (stream)	n/a	Switching to `False` adds `(window-1)/2` sample lag but enables real-time use
`min_periods`	1	Values above 1 leave `NaN` at trace boundaries	n/a
`max_drift_meters`	50–200 m	Accepts larger median shifts, so larger spikes are corrected but sharp legitimate maneuvers may be flattened	Rejects more shifts, so larger spikes are left uncorrected in the output

Matching window to sampling rate: At 1 Hz, window=5 covers about five seconds of vehicle travel, roughly 140 m at 100 km/h. At 5 Hz, window=7 covers only 1.4 s (about 39 m at 100 km/h), which stays responsive to turns but smooths over a much shorter stretch of road; matching the five-second span would require window=25. Oversizing the window at high sampling rates blurs turn geometry and can cause downstream outlier-removal stages to flag legitimate maneuvers as anomalies.
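
The relationship between sampling rate, time span, and window size can be sketched as a small helper. `window_for_span` is a hypothetical function, and rounding up to an odd size (so a centered window is symmetric) with a floor of 3 is an assumption of this sketch rather than a requirement of pandas:

```python
import math


def window_for_span(rate_hz: float, span_s: float) -> int:
    """Smallest odd window (at least 3) covering span_s seconds at rate_hz."""
    samples = max(3, math.ceil(rate_hz * span_s))
    return samples if samples % 2 == 1 else samples + 1


for rate_hz in (1, 5, 10):
    print(rate_hz, "Hz ->", window_for_span(rate_hz, span_s=5.0), "samples")
```

Treat the result as a starting point and confirm it against turn geometry in a few representative traces.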

max_drift_meters at highway speeds: A vehicle traveling at 120 km/h covers about 33 m per second, so a five-sample window at 1 Hz spans roughly 130 m of road. With center=True the median of a smooth trajectory stays close to the raw point, but with center=False the trailing median sits about two samples (roughly 67 m) behind it, and sweeping curves or noisy stretches add to that offset. A threshold below the typical offset causes the filter to revert nearly all smoothed values during high-speed driving, effectively disabling it. Start at 150 m and lower gradually while inspecting a sample of cleaned traces against the raw input.
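
To see how the clamp interacts with spike size, here is a deliberately simplified one-dimensional sketch. It works on along-track distance in metres instead of latitude and longitude, and `clamp_median` is a hypothetical helper that mirrors only the acceptance rule of the full function:

```python
import numpy as np
import pandas as pd


def clamp_median(x: pd.Series, window: int, max_drift: float) -> pd.Series:
    """Replace x with its centered rolling median where they differ by <= max_drift."""
    med = x.rolling(window, min_periods=1, center=True).median()
    accept = (med - x).abs() <= max_drift
    return x.where(~accept, med)


# Along-track position at 33 m per sample (about 120 km/h at 1 Hz) with one spike.
x = pd.Series(np.arange(10) * 33.0)
x.iloc[5] += 200.0

tight = clamp_median(x, window=5, max_drift=100.0)  # spike exceeds the clamp
loose = clamp_median(x, window=5, max_drift=250.0)  # spike falls inside the clamp

print(tight.iloc[5], loose.iloc[5])
```

With the tighter threshold the spiked sample is kept as-is, because the median differs from it by more than the clamp allows; with the looser one it is replaced by a neighboring value.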

Irregular timestamps

pandas rolling operations with an integer window count rows, not elapsed time. If your device samples at variable rates (0.5 Hz to 2 Hz), resample to a fixed grid using df.set_index(ts_col).resample("1s").mean() before filtering. This gives every window the same time span and prevents wide-interval pings from masking multipath spikes.
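
A small synthetic example shows what the resampling step does to irregular pings. The timestamps and coordinates below are invented for illustration, and only the numeric coordinate columns are passed to `.mean()`:

```python
import pandas as pd

pings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00.0",
        "2024-01-01 00:00:00.5",
        "2024-01-01 00:00:01.0",
        "2024-01-01 00:00:03.0",
        "2024-01-01 00:00:05.0",
    ], utc=True),
    "latitude": [52.0000, 52.0002, 52.0004, 52.0012, 52.0020],
    "longitude": [13.0, 13.0, 13.0, 13.0, 13.0],
})

uniform = (
    pings.sort_values("timestamp")
    .set_index("timestamp")[["latitude", "longitude"]]
    .resample("1s")
    .mean()          # averages pings that share a one-second bin
    .reset_index()
)
print(uniform)
```

Seconds with several pings are averaged into one row, and seconds with no ping appear as NaN rows, which the filter then leaves untouched.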

NaN propagation and gap filling

The Haversine calculation propagates NaN when either the original or the median coordinate is missing. The mask displacement_m <= max_drift_meters evaluates to False for NaN, so original NaN coordinates are left untouched. When strict gap-filling is required, apply df[[lat_col, lon_col]].interpolate(method="linear", limit=3) to the coordinate columns before median filtering. Limit interpolation to three samples (three seconds at 1 Hz) to avoid fabricating coordinates across prolonged satellite outages; longer gaps should be treated as trip boundaries by any downstream stop-detection pipeline.
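
One caveat is worth checking before relying on limit alone: it caps the number of consecutive values filled, but it still fills the first few samples of a longer gap. If long outages should stay entirely empty, a helper along the following lines can be used. `fill_short_gaps` is a hypothetical name and the run-length approach is one possible sketch, not the only way to do it:

```python
import numpy as np
import pandas as pd


def fill_short_gaps(s: pd.Series, max_gap: int = 3) -> pd.Series:
    """Linearly interpolate only NaN runs of at most max_gap samples."""
    is_nan = s.isna()
    # Label each run of consecutive NaN / non-NaN values, then measure its length.
    run_id = (is_nan != is_nan.shift()).cumsum()
    run_len = is_nan.groupby(run_id).transform("size")
    interpolated = s.interpolate(method="linear", limit_area="inside")
    short_gap = is_nan & (run_len <= max_gap)
    # Take interpolated values inside short gaps; keep the original elsewhere.
    return s.where(~short_gap, interpolated)


lat = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, np.nan, np.nan, 10.0])

naive = lat.interpolate(method="linear", limit=3)  # partially fills the long gap
strict = fill_short_gaps(lat, max_gap=3)           # leaves the long gap empty
print(pd.DataFrame({"raw": lat, "naive": naive, "strict": strict}))
```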

Memory and throughput

For datasets exceeding 10 M rows, pandas rolling operations remain efficient because they run in compiled Cython code. If you are processing raw 1-D arrays without timestamps, scipy.signal.medfilt() offers a lighter footprint, but it always zero-pads the array edges; if you need boundary options such as mode="nearest" or mode="constant", use scipy.ndimage.median_filter() instead.
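
The edge behavior of the two SciPy routines differs in a way that matters for coordinates, which are far from zero. The sketch below assumes SciPy is installed and uses a made-up five-sample latitude array:

```python
import numpy as np
from scipy import ndimage, signal

lat = np.array([52.000, 52.001, 52.002, 52.003, 52.004])

# signal.medfilt pads the array with zeros beyond both ends.
zero_padded = signal.medfilt(lat, kernel_size=5)

# ndimage.median_filter lets you choose the padding; "nearest" repeats edge values.
edge_repeated = ndimage.median_filter(lat, size=5, mode="nearest")

print(zero_padded)
print(edge_repeated)
```

On this rising sequence the zero-padded result pulls the final samples back toward earlier values, while the edge-repeated result keeps the last sample in place.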

Common Pitfalls

Resampling before sorting. Relying on resample() to impose chronological order hides upstream ordering problems, and any rolling step that skips resampling will then run on out-of-order rows. Always sort_values(ts_col) first, then set_index for resampling.
Using center=True in a streaming consumer. Centered windows require (window-1)/2 future samples that a real-time consumer does not yet hold. The consumer must either stall until those samples arrive or, with min_periods=1, emit values from a truncated window that change once later samples come in. Switch to center=False and accept the explicit, fixed latency.
Setting max_drift_meters without accounting for sampling rate. The distance between a raw point and its window median scales with how far the vehicle travels per sample. A threshold calibrated for a 1 Hz device will accept nearly everything at 10 Hz (where per-step displacement is ten times smaller) and reject almost all corrections at 0.1 Hz (where it is ten times larger). Validate the threshold against a histogram of displacement_m values from a representative trace before deploying to a new device class.
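
The validation step in the last pitfall can be sketched as follows. `median_displacement_m` is a hypothetical helper that recomputes the same raw-to-median Haversine distance the filter uses, and the trace is synthetic (a straight northbound run with one injected spike), so the numbers only illustrate the workflow:

```python
import numpy as np
import pandas as pd


def median_displacement_m(lat: pd.Series, lon: pd.Series, window: int = 5) -> pd.Series:
    """Haversine distance (m) between each raw point and its centered rolling median."""
    lat_med = lat.rolling(window, min_periods=1, center=True).median()
    lon_med = lon.rolling(window, min_periods=1, center=True).median()
    dlat = np.radians(lat_med - lat)
    dlon = np.radians(lon_med - lon)
    a = (
        np.sin(dlat / 2) ** 2
        + np.cos(np.radians(lat)) * np.cos(np.radians(lat_med)) * np.sin(dlon / 2) ** 2
    )
    return 6_371_000.0 * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Synthetic trace: about 33 m of northward travel per sample.
lat = pd.Series(52.0 + 0.0003 * np.arange(20))
lon = pd.Series(np.full(20, 13.0))
lat.iloc[10] += 0.002  # inject roughly 220 m of artificial drift

disp = median_displacement_m(lat, lon)
print(disp.quantile([0.5, 0.95, 0.99]).round(1))
```

On real data, compare the bulk of the distribution with the tail: a threshold that sits above the bulk but below the spike cluster is a reasonable first candidate for that device class.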

When to Graduate to State-Space Models

The rolling median is a non-parametric, breakdown-point-50% smoother. It excels at suppressing impulse noise and multipath spikes but does not model velocity, acceleration, or heading continuity. When your pipeline requires predictive tracking during prolonged signal loss of more than three seconds, sensor fusion of IMU and GNSS data, or probabilistic uncertainty bounds for downstream routing algorithms, transition to a recursive Bayesian estimator. The median filter makes an excellent preprocessing step to sanitize raw inputs before feeding them into Kalman Filtering for GPS Noise Reduction, reducing filter divergence and covariance tuning complexity for the state-space stage.

Related

Kalman Filtering for GPS Noise Reduction — recursive Bayesian state-space filtering for real-time GPS smoothing with dynamic noise covariance
Outlier Removal in Raw Telematics Streams — per-point statistical and kinematic gates that complement window-based smoothing
Timestamp Synchronization for Multi-Device GPS Logs — aligning clocks from OBD-II, mobile, and GNSS receivers before any spatial filter is applied
Automating Outlier Detection in High-Frequency Telematics Data — pipeline automation for flagging and discarding anomalous pings at scale
Stop Detection & Dwell Time Analytics — downstream algorithms that depend on clean, drift-free coordinate streams