Why does a low-speed threshold alone produce too many false stops?

GPS horizontal accuracy degrades to 15–50 m in urban canyons, so a stationary reading can have apparent velocity from coordinate noise alone. Creeping stop-and-go traffic also briefly satisfies a speed cutoff without representing an operational halt.

What HDOP threshold should I use to penalise signal quality?

HDOP values above 5.0 indicate poor geometry; above 10.0 indicate unreliable fixes. A linear penalty scaling from 1.0 at HDOP 1.0 to 0.0 at HDOP 10.0 provides a smooth signal-quality factor without hard cutoffs.

How often should feature weights be recalibrated?

Recalibrate monthly or after any significant fleet composition change, firmware update, or routing-pattern shift. Monitor Kolmogorov-Smirnov test statistics on feature distributions as an automated drift indicator.

Can confidence scoring run in a streaming pipeline?

Yes. The scoring layer is intentionally stateless; trip segmentation and candidate window boundaries are resolved upstream. Candidate windows are emitted as micro-batches and scored in isolation, making the engine compatible with Flink or Spark Structured Streaming.

Confidence Scoring for Stop Detection in Fleet Telematics

Raw telematics streams are inherently noisy. A single GPS ping dropping below 5 km/h does not guarantee a meaningful operational halt. Confidence scoring for stop detection bridges the gap between raw coordinate streams and actionable fleet intelligence by assigning a probabilistic weight to each candidate stop event. This approach replaces brittle binary thresholds with a continuous reliability metric, enabling downstream systems to filter false positives, prioritise high-certainty events, and adapt to mixed vehicle behaviours.

Within the broader Stop Detection & Dwell Time Analytics framework, confidence scoring acts as the quality gate before events enter spatial aggregation, dwell calculation, or POI matching pipelines. Fleet managers and mobility engineers rely on these scores to automate dispatch rules, validate driver logs, and reconcile telematics data with ERP systems. By treating stop identification as a classification problem rather than a hard rule, organisations can dramatically reduce manual review overhead and improve SLA compliance across heterogeneous fleets.

Prerequisites

Before implementing a scoring pipeline you should have:

Python 3.10 or later; pandas 2.x, numpy 1.26+, scikit-learn 1.4+ (for optional logistic calibration).
A GPS-preprocessed telemetry DataFrame with at minimum: timestamp (UTC, timezone-aware), lat, lon, speed_kmh, hdop, ignition_status.
Candidate windows identified upstream — each row should carry a candidate_id grouping key linking pings to a specific stop candidate.
Basic familiarity with timestamp synchronisation across devices to avoid incorrect dwell boundary alignment.
Understanding of why raw streams contain outlier coordinates — those same artefacts inflate spatial dispersion features if not removed upstream.

Why Binary Thresholds Fail

Traditional stop detection relies on static velocity cutoffs: if speed_kmh < 3 for duration > 60s, flag as a stop. While computationally cheap, this method fails under real-world conditions. GPS receivers typically report horizontal accuracy between 2.5 and 5 m under open-sky conditions, but urban canyons, dense foliage, and multipath interference routinely degrade precision to 15–50 m, which introduces apparent motion in stationary coordinates.

Binary thresholds also struggle with three named failure modes:

Idling vs. parking: A delivery van waiting at a loading dock maintains engine RPM while stationary, whereas a parked truck has ignition off. A speed-only rule cannot distinguish them.
Creeping congestion: Stop-and-go traffic generates micro-halts that briefly satisfy a speed cutoff without representing an operational event, inflating stop counts by 20–40% on urban routes.
Sensor dropout: Telematics devices experience brief signal loss during tunnel transit or heavy RF interference, producing artificial zero-velocity readings that pass a naive threshold.

A probabilistic scoring engine mitigates these issues by evaluating multiple orthogonal signals simultaneously, producing a continuous confidence value that downstream logic can threshold dynamically — and recalibrate as fleet composition evolves.

Step-by-Step Implementation Workflow

1. Signal Preprocessing and Temporal Alignment

Before scoring begins, raw telemetry must be cleaned and aligned. The Kalman filtering and GPS noise reduction techniques applied upstream should already have removed the worst coordinate jumps, but the scoring stage still needs to confirm:

Duplicate timestamps per device are dropped (keep last).
Out-of-order pings are re-sorted by timestamp.
Gaps under 30 s are linearly interpolated; gaps over 2 minutes trigger a trip boundary reset and must not be bridged.
Pings with instantaneous velocity exceeding 250 km/h (calculated from consecutive coordinate deltas) are treated as position jumps and removed.

Use vectorised operations for temporal alignment. Pandas resample() and asfreq() efficiently handle irregular sampling intervals. For high-frequency ingestion (one ping per second or faster), pre-aggregate to 5-second windows to reduce compute overhead while preserving spatial variance signals.

import numpy as np
import pandas as pd

def preprocess_telemetry(df: pd.DataFrame, max_gap_s: float = 120.0) -> pd.DataFrame:
    """
    Clean and align a single-vehicle telemetry DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        Columns: timestamp (UTC, tz-aware), lat, lon, speed_kmh, hdop, ignition_status
    max_gap_s : float
        Gap threshold in seconds above which a trip boundary is inserted.

    Returns
    -------
    pd.DataFrame with duplicate rows removed, gaps flagged, and index reset.
    """
    df = (
        df.sort_values("timestamp", kind="stable")  # stable sort keeps arrival order for ties
          .drop_duplicates(subset=["timestamp"], keep="last")
          .reset_index(drop=True)
    )

    # Remove implausible coordinate jumps (>250 km/h instantaneous).
    # Equirectangular approximation: longitude degrees shrink with cos(latitude).
    dt = df["timestamp"].diff().dt.total_seconds()
    dlat_m = df["lat"].diff() * 111_000
    dlon_m = df["lon"].diff() * 111_000 * np.cos(np.radians(df["lat"]))
    dist_m = np.sqrt(dlat_m**2 + dlon_m**2)  # rough metres
    implausible = (dist_m / dt.clip(lower=0.1)) > (250 / 3.6)   # m/s cutoff
    df = df[~implausible].reset_index(drop=True)

    # Flag trip boundaries at long gaps, measured on the cleaned series
    dt = df["timestamp"].diff().dt.total_seconds()
    df["trip_boundary"] = (dt > max_gap_s) | dt.isna()

    return df

Expected output: a DataFrame with the same schema plus a boolean trip_boundary column; implausible rows removed.
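
The 5-second pre-aggregation mentioned above can be sketched with resample(). This is a minimal illustration, not part of the original pipeline: the helper name downsample_pings and the per-column aggregation choices are assumptions. Position columns keep the last ping in each window rather than the mean, so that per-ping scatter is not averaged away before spatial dispersion is computed.

```python
import pandas as pd

def downsample_pings(df: pd.DataFrame, seconds: int = 5) -> pd.DataFrame:
    """Aggregate one vehicle's pings into fixed windows (hypothetical helper)."""
    agg = (
        df.set_index("timestamp")
          .resample(pd.Timedelta(seconds=seconds))
          .agg({
              "lat": "last",              # keep a real fix, not an averaged one
              "lon": "last",
              "speed_kmh": "mean",
              "hdop": "mean",
              "ignition_status": "last",
          })
    )
    # Windows that contain no pings come back as NaN rows; drop them
    # rather than inventing positions.
    return agg.dropna(subset=["lat", "lon"]).reset_index()
```

Run this per device before candidate identification. If thresholds were tuned on 1 Hz data, it is worth re-checking them after downsampling.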

2. Candidate Stop Identification

The initial pass uses a relaxed velocity-duration heuristic to generate candidate windows. A common baseline is speed_kmh <= 3.0 sustained for at least 45 seconds. This window becomes the bounding temporal slice for feature extraction.

Candidate identification must not finalise the stop. Instead it defines [t_start, t_end] intervals that will be scored. Overlapping candidates must be merged using a greedy interval-union algorithm to prevent double-counting during prolonged stationary periods.

def identify_candidates(
    df: pd.DataFrame,
    speed_threshold_kmh: float = 3.0,
    min_duration_s: float = 45.0,
) -> pd.DataFrame:
    """
    Label stationary candidate windows in a preprocessed telemetry DataFrame.

    Returns a copy with a 'candidate_id' column (nullable Int64, <NA> for
    non-candidate pings). Runs never extend across a flagged trip boundary.
    """
    df = df.copy()  # do not mutate the caller's frame
    stationary = df["speed_kmh"] <= speed_threshold_kmh

    # Build run-length groups: a new run starts when the stationary flag flips
    new_run = stationary != stationary.shift()
    if "trip_boundary" in df.columns:
        # A long gap also starts a new run, so candidates are not bridged across it
        new_run = new_run | df["trip_boundary"].astype(bool)
    run_id = new_run.cumsum()
    df["_run"] = np.where(stationary, run_id, np.nan)

    # Compute duration of each run
    run_durations = (
        df.groupby("_run")["timestamp"]
        .agg(lambda s: (s.max() - s.min()).total_seconds())
    )
    valid_runs = run_durations[run_durations >= min_duration_s].index

    # Assign candidate_id only to qualifying runs; Int64 keeps the keys integer
    candidate_map = {r: i for i, r in enumerate(valid_runs, start=1)}
    df["candidate_id"] = df["_run"].map(candidate_map).astype("Int64")

    return df.drop(columns=["_run"])

Expected output: a copy of the DataFrame extended with candidate_id; missing (<NA>) where the ping does not belong to a qualifying window.

3. Multi-Factor Feature Extraction

Each candidate window is transformed into a feature vector. The most predictive indicators are:

Spatial dispersion (sigma_m): Root sum of squares of the latitude and longitude standard deviations within the window, approximated in metres. Low dispersion (< 5 m) indicates true parking; high dispersion (> 20 m) suggests GPS drift or slow creeping.
Velocity consistency: Variance of speed_kmh within the window. High variance correlates with congestion oscillation rather than operational stops.
Signal quality: Mean HDOP across the window. HDOP above 5.0 indicates poor satellite geometry; the feature is linearly penalised up to HDOP 10.0.
Ignition correlation: Mean of a binary ignition_status flag. A stationary vehicle with ignition on and engine activity detected from CAN bus data scores higher for intentional stops.
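Heading stability: Circular variance of heading_deg. Erratic heading jumps while stationary indicate either multipath noise or very slow manoeuvring. This feature needs a heading_deg column, which is not part of the minimum schema listed in the prerequisites, and it is not used in the four-feature model that follows.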

def extract_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute per-candidate feature columns (vectorised, no row-wise apply).

    Input df must have: candidate_id, lat, lon, speed_kmh, hdop, ignition_status.
    Returns df with additional columns: sigma_m, speed_var, mean_hdop, ignition_mean.
    """
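    df = df.copy()  # work on a copy so the caller's frame is not mutated
    g = df.groupby("candidate_id")

    # Convert degree scatter to metres; longitude shrinks with cos(latitude)
    lon_scale = np.cos(np.radians(g["lat"].transform("mean"))).fillna(1.0)
    lat_std_m = g["lat"].transform("std").fillna(0) * 111_000
    lon_std_m = g["lon"].transform("std").fillna(0) * 111_000 * lon_scale
    df["sigma_m"] = np.sqrt(lat_std_m**2 + lon_std_m**2)

    df["speed_var"]     = g["speed_kmh"].transform("var").fillna(0)
    df["mean_hdop"]     = g["hdop"].transform("mean").fillna(1.0)
    df["ignition_mean"] = g["ignition_status"].transform("mean").fillna(0)

    return df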

Use .transform() rather than .apply() to preserve DataFrame alignment and avoid index misalignment bugs when grouping.
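
Heading stability is listed as a feature but not computed in the function above. A minimal sketch of circular variance follows, assuming a heading_deg column in degrees is available. The function name and output column name are illustrative. Circular variance is 1 minus the mean resultant length of the unit heading vectors, so 0 means perfectly stable heading and 1 means headings that cancel out.

```python
import numpy as np
import pandas as pd

def heading_circular_variance(df: pd.DataFrame, heading_col: str = "heading_deg") -> pd.Series:
    """Per-candidate circular variance of heading, in [0, 1] (hypothetical helper)."""
    theta = np.radians(df[heading_col].astype(float))
    parts = pd.DataFrame({
        "candidate_id": df["candidate_id"],
        "cos": np.cos(theta),
        "sin": np.sin(theta),
    })
    # Mean unit vector per candidate window
    means = parts.groupby("candidate_id")[["cos", "sin"]].mean()
    resultant = np.sqrt(means["cos"] ** 2 + means["sin"] ** 2)
    # Clip tiny negative values caused by floating-point rounding
    return (1.0 - resultant).clip(lower=0.0).rename("heading_circ_var")
```

An ordinary variance would treat 350° and 10° as far apart. The circular form treats them as 20° apart, which is what matters for a stationary vehicle whose reported heading wobbles around north.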

4. Probability and Math Model

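The four features computed in step 3 (spatial dispersion, velocity consistency, signal quality, and ignition) are normalised to [0, 1] and combined with a weighted linear sum. The normalisation bounds are chosen from empirical distributions, and each normalised value is clipped to [0, 1] so that extreme outliers cannot push a score out of range. Clipping does not remove NaN, so missing values must be filled before this step.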

Let f = [f_spatial, f_speed, f_signal, f_ignition] and weights w = [0.35, 0.25, 0.20, 0.20] (must sum to 1.0).

Spatial score — low dispersion is high confidence:

f_spatial = 1 − clip(sigma_m / 20.0, 0, 1)

Speed consistency score:

f_speed = 1 − clip(speed_var / 4.0, 0, 1)

Signal quality score:

f_signal = 1 − clip((mean_hdop − 1.0) / 9.0, 0, 1)

Ignition score — already in [0, 1]:

f_ignition = ignition_mean

Aggregate:

confidence = 100 × Σ(w_i × f_i)
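
As a worked example of the formulas above, the sketch below scores one hypothetical window. The input values are invented for illustration; the weights and divisors are the ones stated in this section.

```python
import numpy as np

WEIGHTS = np.array([0.35, 0.25, 0.20, 0.20])  # spatial, speed, signal, ignition

def linear_confidence(sigma_m, speed_var, mean_hdop, ignition_mean):
    """Weighted linear confidence on a 0-100 scale for a single window."""
    f = np.array([
        1.0 - np.clip(sigma_m / 20.0, 0, 1),            # f_spatial
        1.0 - np.clip(speed_var / 4.0, 0, 1),           # f_speed
        1.0 - np.clip((mean_hdop - 1.0) / 9.0, 0, 1),   # f_signal
        ignition_mean,                                  # f_ignition
    ])
    return float(100.0 * (WEIGHTS @ f))

# Illustrative window: 4 m scatter, speed variance 1.0, HDOP 2.8, ignition on throughout
score = linear_confidence(sigma_m=4.0, speed_var=1.0, mean_hdop=2.8, ignition_mean=1.0)
print(round(score, 2))  # 82.75
```

The feature scores here are 0.80, 0.75, 0.80, and 1.00, giving 100 × (0.28 + 0.1875 + 0.16 + 0.20) = 82.75.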

When labelled historical data is available, replace the weighted linear model with a calibrated logistic regression:

P(stop) = 1 / (1 + exp(−(β₀ + Σ(β_i × f_i))))

Apply Platt scaling or isotonic regression post-training (via sklearn.calibration.CalibratedClassifierCV with method="sigmoid" or method="isotonic") to ensure P(stop) behaves as a well-calibrated probability rather than a raw decision score. When evaluating the sigmoid by hand, use scipy.special.expit, which stays numerically stable when the linear term is large in magnitude.
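
A minimal sketch of the calibrated logistic alternative, assuming scikit-learn is installed. The training data below is synthetic and exists only to make the example runnable; in practice X would hold the four normalised features for manually validated candidates and y their confirmed labels.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a validated sample.
# Columns: f_spatial, f_speed, f_signal, f_ignition (all in [0, 1]).
X = rng.uniform(0.0, 1.0, size=(400, 4))
# Illustrative labelling rule only; real labels come from manual validation.
y = (X @ np.array([0.35, 0.25, 0.20, 0.20]) + rng.normal(0, 0.1, 400) > 0.5).astype(int)

# method="sigmoid" is Platt scaling; method="isotonic" is the non-parametric option.
model = CalibratedClassifierCV(LogisticRegression(), method="sigmoid", cv=5)
model.fit(X, y)

p_stop = model.predict_proba(X)[:, 1]  # calibrated P(stop) per candidate
```

Multiplying p_stop by 100 puts it on the same scale as the linear score, so the routing bands can be reused while the two models are compared.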

5. Vectorised Scoring Pipeline

The production scorer combines feature extraction, normalisation, and aggregation into a single stateless function; preprocessing and candidate identification run upstream:

def compute_stop_confidence(df: pd.DataFrame) -> pd.DataFrame:
    """
    Vectorised confidence scoring for candidate stop windows.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain candidate_id (int, NaN for non-candidates), lat, lon,
        speed_kmh, hdop, ignition_status (float 0/1).

    Returns
    -------
    pd.DataFrame with 'confidence_score' column added (0–100, NaN for
    pings outside candidate windows).
    """
    # Work only on candidate rows to avoid polluting non-candidate pings
    mask = df["candidate_id"].notna()
    c = df.loc[mask].copy()

    # --- Feature computation (vectorised) ---
    g = c.groupby("candidate_id")

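    # Degree scatter -> metres; longitude is scaled by cos(latitude) of the window
    lon_scale = np.cos(np.radians(g["lat"].transform("mean")))
    lat_std_m = g["lat"].transform("std").fillna(0) * 111_000
    lon_std_m = g["lon"].transform("std").fillna(0) * 111_000 * lon_scale
    sigma_m = np.sqrt(lat_std_m**2 + lon_std_m**2)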

    speed_var     = g["speed_kmh"].transform("var").fillna(0)
    mean_hdop     = g["hdop"].transform("mean").fillna(1.0)
    ignition_mean = g["ignition_status"].transform("mean").fillna(0)

    # --- Normalise to [0, 1] ---
    f_spatial   = 1.0 - np.clip(sigma_m / 20.0, 0, 1)
    f_speed     = 1.0 - np.clip(speed_var / 4.0, 0, 1)
    f_signal    = 1.0 - np.clip((mean_hdop - 1.0) / 9.0, 0, 1)
    f_ignition  = ignition_mean  # already [0, 1]

    # --- Weighted linear combination ---
    w = {"spatial": 0.35, "speed": 0.25, "signal": 0.20, "ignition": 0.20}

    confidence = (
        w["spatial"]   * f_spatial  +
        w["speed"]     * f_speed    +
        w["signal"]    * f_signal   +
        w["ignition"]  * f_ignition
    )

    c["confidence_score"] = np.clip(confidence * 100, 0, 100)
    df = df.join(c[["confidence_score"]], how="left")

    return df

Key reliability points:

np.clip bounds every normalised feature to [0, 1], so extreme outliers cannot push a score out of range. np.clip passes NaN through unchanged, which is why each feature is filled with fillna before normalisation.
The join on index preserves alignment across the full DataFrame, including non-candidate pings.
Assign candidate_id upstream using the run-length labelling from step 2 before calling this function.

6. Threshold Calibration and Operational Routing

Confidence bands map directly to business logic. Reasonable starting thresholds for a mixed last-mile fleet:

Band	Score range	Action
High confidence	≥ 85	Auto-log stop; trigger downstream analytics; notify dispatch
Medium confidence	60–84	Queue for manual review or require secondary confirmation (geofence match, driver check-in)
Low confidence	< 60	Discard or archive for model retraining

Thresholds should be calibrated per vehicle class. A refrigerated truck idling at a loading dock requires different tolerance from a last-mile cargo bike making a brief drop-off. Dwell time calculation is only meaningful for high-confidence stops; applying it to medium-confidence events without review inflates reported service durations.
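
The band table can be applied with pd.cut, as in this minimal sketch. The function name and band labels are illustrative, and the medium band is treated as the half-open interval [60, 85) so that fractional scores such as 84.6 are not left unassigned.

```python
import numpy as np
import pandas as pd

def assign_band(scores: pd.Series, medium: float = 60.0, high: float = 85.0) -> pd.Series:
    """Map 0-100 confidence scores to routing bands (hypothetical helper)."""
    return pd.cut(
        scores,
        bins=[-np.inf, medium, high, np.inf],
        labels=["low", "medium", "high"],
        right=False,  # intervals are [lower, upper), so 85.0 lands in "high"
    )
```

Per-vehicle-class calibration then amounts to calling assign_band with a different (medium, high) pair for each class. Pings outside candidate windows have a NaN score and stay unassigned.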

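Dynamic threshold tuning uses rolling percentile analysis on historical confidence distributions. Because auto-logged stops score at least 85 by construction, track the percentile on a population that is not filtered by that cutoff, such as stops confirmed by manual validation. When its 10th percentile drops below 80 (a common early-warning signal), trigger recalibration before misclassification rates climb.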
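
A rolling-percentile check of this kind might look like the sketch below. The 7-day window and the function name are assumptions; the 10th percentile and the floor of 80 come from the text. The input is a series of confidence scores indexed by event time, for whichever population is being monitored.

```python
import pandas as pd

def needs_recalibration(
    scores: pd.Series,
    window: str = "7D",
    q: float = 0.10,
    floor: float = 80.0,
) -> bool:
    """True when the latest rolling q-quantile of scores is below the floor.

    scores must be indexed by a DatetimeIndex (hypothetical helper).
    """
    rolling_q = scores.sort_index().rolling(window).quantile(q)
    return bool(rolling_q.iloc[-1] < floor)
```

Running this on a schedule and alerting on a True result gives an early warning without waiting for the monthly calibration cycle.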

After scoring, high-confidence stops proceed directly to DBSCAN-based spatial clustering to group proximate events into canonical service locations. Once clustered, POI matching enriches each location with a business type, enabling facility-level dwell benchmarks.

Calibration and Continuous Improvement

Static weights degrade as fleet composition, device firmware, or routing patterns evolve. Implement a monthly calibration routine that:

Samples 5% of scored events for manual validation (stratified by confidence band).
Computes precision-recall curves per band to detect systematic mislabelling.
Adjusts feature weights using gradient descent or Bayesian optimisation on the validated sample.
Deploys updated coefficients via a configuration service so the scorer picks them up without redeployment.

Monitor feature drift using Kolmogorov-Smirnov tests on per-feature distributions. A sudden shift in sigma_m or mean_hdop often indicates hardware degradation (antenna faults, firmware regression) or a new vehicle model entering the fleet — both require pipeline recalibration before scores can be trusted again.
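
A two-sample Kolmogorov-Smirnov check on one feature can be sketched with scipy.stats.ks_2samp. The function name and the alpha of 0.01 are assumptions; choose the significance level to suit your alerting tolerance.

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference, current, alpha: float = 0.01) -> bool:
    """Compare a reference sample of a feature with a recent one (hypothetical helper).

    Returns True when the KS test rejects "same distribution" at level alpha.
    """
    result = ks_2samp(np.asarray(reference, dtype=float), np.asarray(current, dtype=float))
    return bool(result.pvalue < alpha)
```

Call it once per feature, for example with last quarter's sigma_m values as reference and this week's as current. With large samples the test flags very small shifts, so it helps to watch the KS statistic itself as well as the p-value.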

Operational Troubleshooting

Score distribution flat near 50 for all candidates

Cause: Feature normalisation bounds are calibrated to a different fleet; all features compress to the mid-range.
Symptom: Histogram of confidence_score shows a narrow spike around 45–55 with very few events above 80 or below 30.
Fix: Recompute normalisation percentiles (1st and 99th) on your specific fleet’s feature distributions and replace the hard-coded 20.0, 4.0, and 9.0 divisors with fleet-derived values.
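
One way to derive fleet-specific bounds is sketched below. The helper names are illustrative. Note that this version rescales between the 1st and 99th percentiles, which differs slightly from the fixed divisors above (those implicitly use 0, or 1.0 for HDOP, as the lower bound).

```python
import numpy as np

def percentile_bounds(values, lo_pct: float = 1.0, hi_pct: float = 99.0):
    """Return (lo, hi) percentile bounds of a feature, ignoring NaN."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    lo, hi = np.percentile(v, [lo_pct, hi_pct])
    return float(lo), float(hi)

def normalise_inverse(values, lo: float, hi: float):
    """Map values to [0, 1], where lo maps to 1.0 (high confidence) and hi to 0.0."""
    span = max(hi - lo, 1e-9)  # guard against a degenerate distribution
    return 1.0 - np.clip((np.asarray(values, dtype=float) - lo) / span, 0, 1)
```

Compute the bounds once per calibration cycle from the fleet's own sigma_m, speed_var, and mean_hdop values, store them in configuration, and pass them to the scorer in place of the hard-coded constants.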

Candidate windows overlap, inflating event counts

Cause: The greedy interval-union merge in step 2 was skipped or its merge-gap parameter is too small.
Symptom: Two adjacent candidate_id values span the same timestamp range; dwell counts are doubled for long parking events.
Fix: After identification, run a merge pass: sort candidates by t_start; merge any pair where t_start_next < t_end_current + merge_gap_s (typically 30 s).
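
The merge pass described in the fix can be sketched as a greedy interval union. Times are taken as epoch seconds here for simplicity, and the function name is illustrative.

```python
def merge_candidates(intervals, merge_gap_s: float = 30.0):
    """Greedy interval union over (t_start, t_end) pairs in seconds.

    Two windows are merged when the next one starts less than
    merge_gap_s after the current one ends.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1] + merge_gap_s:
            # Extend the current window; max() handles fully nested windows
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

print(merge_candidates([(200, 260), (0, 60), (70, 120)]))
# [(0, 120), (200, 260)]
```

Run it per vehicle. Because the merge gap is shorter than the 2-minute trip-boundary gap, merging with the default value does not bridge separate trips.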

Ignition signal absent for part of the fleet

Cause: Older OBD-II dongles do not expose ignition state; ignition_status is NaN for those vehicles.
Symptom: ignition_mean is NaN for every window of those vehicles, and the scorer's fillna(0) turns that into f_ignition = 0. With the default ignition weight of 0.20, their scores cannot exceed 80, so no stop reaches the high-confidence band.
Fix: Fill NaN ignition_status with 0.5 (neutral prior) and reduce the ignition weight to 0.10, redistributing the 0.10 difference to f_spatial. Document the imputation in pipeline metadata.
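
A minimal sketch of that imputation, assuming it is applied to one vehicle's frame at a time. The helper name and the ignition_imputed audit column are hypothetical; the 0.5 fill value and the weight shift come from the fix above.

```python
import pandas as pd

BASE_WEIGHTS = {"spatial": 0.35, "speed": 0.25, "signal": 0.20, "ignition": 0.20}

def impute_ignition(df: pd.DataFrame):
    """Fill missing ignition_status and return (frame, weights) for one vehicle."""
    out = df.copy()
    missing = out["ignition_status"].isna()
    out["ignition_imputed"] = missing  # hypothetical audit column for pipeline metadata
    out["ignition_status"] = out["ignition_status"].fillna(0.5)  # neutral prior

    weights = dict(BASE_WEIGHTS)
    if missing.all():
        # No ignition signal at all: halve its weight and give the rest to spatial
        weights["ignition"] = 0.10
        weights["spatial"] = BASE_WEIGHTS["spatial"] + 0.10
    return out, weights
```

The returned weights still sum to 1.0, so the 0 to 100 scale and the routing bands are unchanged.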

High false positive rate in urban cores

Cause: Multipath GPS errors in dense urban environments inflate sigma_m and degrade mean_hdop, but the vehicle is genuinely stopped.
Symptom: Stops at known urban delivery points (confirmed by driver logs) score below 60 despite being real.
Fix: Apply a geofence pre-confirmation layer: if the candidate window centroid falls within a known-good delivery geofence, apply a +15 point bonus before routing. This is especially effective after POI matching provides the geofence boundaries.
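
The bonus can be sketched as follows, assuming for simplicity that each geofence is a circle given by a centre and a radius in metres. Real geofences are often polygons, in which case the distance check would be replaced by a point-in-polygon test. The function names are illustrative, and capping the result at 100 is an added assumption.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def apply_geofence_bonus(score, centroid, fence_centre, radius_m, bonus=15.0):
    """Add the bonus when the window centroid lies inside the circular geofence."""
    inside = haversine_m(*centroid, *fence_centre) <= radius_m
    return min(score + bonus, 100.0) if inside else score
```

Apply the bonus after scoring and before band assignment, and record that it was applied so that later calibration can separate raw scores from adjusted ones.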

Logistic model underconfident on short stops

Cause: Training data overrepresents long confirmed stops; short stops (45–90 s) are underrepresented, leading the logistic model to assign low probability to all short-duration candidates.
Symptom: Precision-recall analysis shows recall drops sharply for candidate windows under 90 s.
Fix: Oversample short stops during training or apply class-weight balancing. Evaluate duration as an explicit feature rather than relying on it being implicit in spatial dispersion.

Pipeline memory grows unbounded in streaming mode

Cause: The stage that buffers candidate windows ahead of the scorer keeps accumulating open candidate state because trip_boundary resets are not forwarded to it.
Symptom: Memory usage climbs steadily over multi-hour runs; occasional OOM kills on worker nodes.
Fix: Emit an explicit TRIP_BOUNDARY sentinel downstream of preprocessing. When this sentinel arrives, the buffering stage must flush any open candidates to output and clear its candidate window buffer, so the scorer itself stays stateless.
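
A minimal sketch of the flush-on-sentinel behaviour, independent of any particular streaming framework. The class name, the event shape (a dict with a candidate_id key), and the TRIP_BOUNDARY object are all hypothetical.

```python
TRIP_BOUNDARY = object()  # hypothetical sentinel emitted by preprocessing

class CandidateBuffer:
    """Holds pings for open candidate windows of one vehicle until a trip boundary."""

    def __init__(self):
        self._open = {}  # candidate_id -> list of ping events

    def push(self, event):
        """Buffer a ping, or flush everything when the sentinel arrives.

        Returns a list of completed windows (empty unless a flush happened).
        """
        if event is TRIP_BOUNDARY:
            flushed = list(self._open.values())
            self._open.clear()  # release memory held by finished windows
            return flushed
        self._open.setdefault(event["candidate_id"], []).append(event)
        return []

    def __len__(self):
        return sum(len(pings) for pings in self._open.values())
```

Keep one buffer per vehicle key. In a long-running job it is also worth adding a timeout-based flush, so that a lost sentinel cannot leave a buffer growing indefinitely.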

Deployment Checklist

Upstream preprocessing removes outliers and flags trip boundaries before candidate identification
All candidate_id values are integers; no string or float keys that could silently fail groupby
Normalisation divisors validated against fleet-specific percentile distributions
ignition_status NaN imputation strategy documented in pipeline metadata
Confidence score column stored with float32 precision to limit memory footprint
Threshold bands reviewed and approved per vehicle class (passenger, van, refrigerated truck, two-wheeler)
Monthly calibration cron job configured with alerting on KS-test drift thresholds
Streaming trip-boundary sentinel wired to candidate buffer flush
Geofence pre-confirmation layer enabled for known high-multipath urban delivery zones
Latency SLA measured end-to-end: scoring must complete within 200 ms per candidate batch in production

Parent: Stop Detection & Dwell Time Analytics

Related:

DBSCAN for Fleet Stop Clustering — spatial grouping of high-confidence stops into canonical service locations
Time-Window Based Dwell Calculation — deriving accurate service durations from scored stop boundaries
Location Typing and POI Matching for Stops — enriching stops with facility type for geofence-assisted confidence correction
Outlier Removal in Raw Telematics Streams — upstream coordinate cleaning that directly reduces false dispersion scores
Kalman Filtering for GPS Noise Reduction — smoothing upstream of the scoring engine to stabilise spatial dispersion features