What eps value should I use for urban delivery stops?

Start at 1.57e-05 radians (~100 m). In dense urban canyons where GPS drift can exceed 30 m, stay closer to 1.57e-05. In suburban areas with clear-sky accuracy under 10 m, you can tighten to 7.85e-06 (~50 m).

Why does DBSCAN merge a morning depot visit with an evening return?

DBSCAN is purely spatial: it has no concept of time. Two visits to the same depot will spatially overlap and merge into one cluster. To separate them, enforce a temporal gap threshold after clustering (e.g., start a new stop event whenever consecutive pings in a cluster are more than 15 minutes apart).

Can I use KDTree instead of BallTree for haversine distances?

No. KDTree supports only Minkowski-family metrics such as Euclidean, Manhattan, and Chebyshev; haversine is not among them. BallTree does support haversine. scikit-learn DBSCAN with metric='haversine' and the default algorithm='auto' selects BallTree; specifying algorithm='kd_tree' raises a ValueError.

Tuning DBSCAN eps and min_samples for Delivery Truck Stops

Selecting eps and min_samples for DBSCAN-based fleet stop detection is where most implementations diverge between production-quality and prototype-quality results. The parameters encode two domain facts simultaneously: the spatial resolution of your GPS hardware and the operational definition of a “stop” for your fleet type. For delivery trucks, a reasonable starting point is eps between 7.85e-06 and 2.36e-05 radians (roughly 50–150 metres; obtained by dividing metres by Earth’s mean radius of 6,371,000 m) combined with min_samples between 3 and 6 pings. These ranges anchor to real-world commercial telematics accuracy and typical 5–30 second sampling intervals — not to theoretical optima. This page explains how to arrive at values that hold up against ground truth, edge cases specific to last-mile delivery, and high-ping-rate fleet data.

Outcome grid for DBSCAN parameter combinations. Green cells indicate parameter pairs that reliably separate delivery stops in typical telematics data. Rows represent eps bands in radians; columns represent min_samples values.

Compatibility and Configuration Requirements

Requirement	Minimum version / value	Notes
Python	3.10	f-strings and type hints used in examples
scikit-learn	1.3	`DBSCAN` and `NearestNeighbors` haversine support stable since 1.0; 1.3 adds `BallTree` performance fixes
numpy	1.24	`np.radians` vectorised over DataFrame columns
pandas	2.0	`.copy()` behaviour on filtered slices; avoids chained-assignment warnings
Coordinate format	`[lat, lon]` in radians	`metric='haversine'` in scikit-learn expects radians; lon-first order causes silent distance errors
`algorithm` parameter	`'ball_tree'` or `'auto'`	`'kd_tree'` is incompatible with haversine; `'auto'` selects BallTree correctly
Speed column	km/h float	Pre-filter threshold is typically 3–7 km/h; adjust for low-speed urban couriers on cargo bikes


Production-Ready Tuning Script

The class below encapsulates the full workflow: velocity filtering, radian conversion, K-distance plotting for eps selection, DBSCAN fitting, and post-hoc dwell-time computation. Each parameter choice is explained inline.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from typing import Optional


class DeliveryStopTuner:
    """
    End-to-end DBSCAN stop-detection tuner for delivery truck telematics.

    Parameters
    ----------
    eps_rad : float
        Neighbourhood radius in radians.  Convert metres with: metres / 6_371_000.
        Typical range: 7.85e-06 (~50 m) to 2.36e-05 (~150 m).
        Start at 1.57e-05 (~100 m) and adjust from the K-distance elbow.
    min_samples : int
        Minimum pings to declare a core point.  At 1 ping/10 s, value=4 ≈ 40 s
        stationary, suitable for curbside courier drops.  Raise to 6–8 for
        mandatory break detection and enforce the required dwell (>5 min)
        through min_dwell_seconds.
    speed_threshold_kmh : float
        Pings above this speed are excluded before clustering.  5.0 km/h is a
        safe default; reduce to 3.0 for dense urban environments where trucks
        creep below 5 km/h in traffic without stopping.
    min_dwell_seconds : int
        Post-clustering filter.  Clusters whose timestamp span is shorter than
        this value are discarded as GPS jitter artefacts.  Set to 120 (2 min)
        for delivery confirmations; 900 (15 min) for mandated driver breaks.
    temporal_gap_minutes : float
        If two consecutive pings within a spatial cluster are separated by more
        than this interval, the cluster is split into separate stop events.
        Prevents a morning depot visit merging with an evening return.
    """

    EARTH_RADIUS_M = 6_371_000

    def __init__(
        self,
        eps_rad: float = 1.57e-05,
        min_samples: int = 4,
        speed_threshold_kmh: float = 5.0,
        min_dwell_seconds: int = 120,
        temporal_gap_minutes: float = 15.0,
    ):
        self.eps_rad = eps_rad
        self.min_samples = min_samples
        self.speed_threshold_kmh = speed_threshold_kmh
        self.min_dwell_seconds = min_dwell_seconds
        self.temporal_gap_minutes = temporal_gap_minutes

    @staticmethod
    def metres_to_radians(metres: float) -> float:
        """Helper: convert a distance in metres to radians on the Earth sphere."""
        return metres / DeliveryStopTuner.EARTH_RADIUS_M

    def _prefilter(
        self, df: pd.DataFrame, lat: str, lon: str, speed: str
    ) -> tuple[np.ndarray, pd.DataFrame]:
        """Return (radian coordinate array, filtered DataFrame) after velocity filtering."""
        mask = df[speed] <= self.speed_threshold_kmh
        if mask.sum() == 0:
            raise ValueError(
                f"No pings at or below {self.speed_threshold_kmh} km/h. "
                "Check speed column units (expected km/h)."
            )
        # Column order matters: haversine expects [lat, lon], in radians
        coords_deg = df.loc[mask, [lat, lon]].values
        return np.radians(coords_deg), df.loc[mask].copy()

    def k_distance_plot_data(
        self,
        df: pd.DataFrame,
        lat_col: str = "lat",
        lon_col: str = "lon",
        speed_col: str = "speed_kmh",
    ) -> np.ndarray:
        """
        Compute sorted k-th nearest-neighbour distances for eps selection.

        Returns an array of distances (radians) sorted descending.  Plot this
        array on the y-axis against point index on the x-axis.  The 'elbow' —
        the steepest inflection point — is a reliable eps candidate.

        k is set to self.min_samples so the K-distance graph is consistent with
        the density definition used by DBSCAN.
        """
        coords_rad, _ = self._prefilter(df, lat_col, lon_col, speed_col)
        nbrs = NearestNeighbors(
            n_neighbors=self.min_samples,
            algorithm="ball_tree",   # BallTree required for haversine
            metric="haversine",
        )
        nbrs.fit(coords_rad)
        distances, _ = nbrs.kneighbors(coords_rad)
        # Column -1 is the k-th neighbour (0-indexed from the point itself)
        k_distances = np.sort(distances[:, -1])[::-1]
        return k_distances

    def fit(
        self,
        df: pd.DataFrame,
        lat_col: str = "lat",
        lon_col: str = "lon",
        speed_col: str = "speed_kmh",
        timestamp_col: str = "timestamp_utc",
        vehicle_id_col: Optional[str] = "vehicle_id",
    ) -> pd.DataFrame:
        """
        Run DBSCAN on velocity-filtered pings and return a stop-events DataFrame.

        Output columns
        --------------
        vehicle_id      : from input (if provided)
        stop_event_id   : unique id after temporal-gap splitting
        centroid_lat    : mean latitude of cluster pings
        centroid_lon    : mean longitude of cluster pings
        arrival_time    : min(timestamp) within stop event
        departure_time  : max(timestamp) within stop event
        dwell_seconds   : (departure_time - arrival_time).total_seconds()
        ping_count      : number of GPS pings in the stop event
        """
        coords_rad, df_slow = self._prefilter(df, lat_col, lon_col, speed_col)

        db = DBSCAN(
            eps=self.eps_rad,
            min_samples=self.min_samples,
            metric="haversine",
            algorithm="ball_tree",  # explicit; 'auto' also works
            n_jobs=-1,              # parallelise distance computation
        )
        df_slow = df_slow.copy()
        df_slow["_cluster"] = db.fit_predict(coords_rad)

        # Discard noise points (label == -1)
        df_clustered = df_slow[df_slow["_cluster"] != -1].copy()

        # Sort within each cluster by time for temporal-gap splitting
        sort_cols = (
            [vehicle_id_col, "_cluster", timestamp_col]
            if vehicle_id_col and vehicle_id_col in df_clustered.columns
            else ["_cluster", timestamp_col]
        )
        df_clustered = df_clustered.sort_values(sort_cols).reset_index(drop=True)

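        # Temporal gap splitting: start a new event id whenever the vehicle or
        # spatial cluster changes, or when two consecutive pings in the same
        # cluster are separated by > temporal_gap_minutes
        gap_threshold = pd.Timedelta(minutes=self.temporal_gap_minutes)
        ts = pd.to_datetime(df_clustered[timestamp_col])
        gap_flag = ts.diff().fillna(pd.Timedelta(0)) > gap_threshold
        cluster_change = df_clustered["_cluster"].ne(df_clustered["_cluster"].shift())
        new_event = gap_flag | cluster_change
        if vehicle_id_col and vehicle_id_col in df_clustered.columns:
            # Without this, two vehicles sharing a cluster could share an event id
            vehicle_ids = df_clustered[vehicle_id_col]
            new_event |= vehicle_ids.ne(vehicle_ids.shift())
        df_clustered["stop_event_id"] = new_event.cumsum()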

        # Aggregate to one row per stop event
        agg = {
            lat_col: "mean",
            lon_col: "mean",
            timestamp_col: ["min", "max", "count"],
        }
        if vehicle_id_col and vehicle_id_col in df_clustered.columns:
            grp_cols = [vehicle_id_col, "stop_event_id"]
        else:
            grp_cols = ["stop_event_id"]

        summary = df_clustered.groupby(grp_cols).agg(agg)
        summary.columns = [
            "centroid_lat", "centroid_lon",
            "arrival_time", "departure_time", "ping_count",
        ]
        summary = summary.reset_index()

        # Dwell duration filter: remove sub-threshold artefacts
        summary["arrival_time"] = pd.to_datetime(summary["arrival_time"])
        summary["departure_time"] = pd.to_datetime(summary["departure_time"])
        summary["dwell_seconds"] = (
            summary["departure_time"] - summary["arrival_time"]
        ).dt.total_seconds()
        summary = summary[summary["dwell_seconds"] >= self.min_dwell_seconds]

        return summary.reset_index(drop=True)


# ---------------------------------------------------------------------------
# Usage example
# ---------------------------------------------------------------------------
# tuner = DeliveryStopTuner(
#     eps_rad=DeliveryStopTuner.metres_to_radians(100),  # 100 m
#     min_samples=4,
#     speed_threshold_kmh=5.0,
#     min_dwell_seconds=120,
#     temporal_gap_minutes=15.0,
# )
#
# # Step 1 — inspect K-distance plot to validate / adjust eps_rad
# k_dist = tuner.k_distance_plot_data(raw_df)
# # plot k_dist to identify elbow; update tuner.eps_rad accordingly
#
# # Step 2 — fit and extract stop events
# stops = tuner.fit(raw_df, timestamp_col="timestamp_utc", vehicle_id_col="vehicle_id")
# print(stops.head())

Execution and Tuning Guidelines

1. Start with the K-distance plot

Call tuner.k_distance_plot_data(df) to get the sorted k-nearest-neighbour distance array and plot it as a line graph. The steepest bend (elbow) marks the natural density transition between clustered stops and sparse travel points. If no clear elbow exists, the speed pre-filter may not be working as intended: verify that the speed column (speed_kmh by default) contains km/h values, not m/s, because speed_threshold_kmh is compared against it directly.
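If you want a programmatic starting candidate before inspecting the plot, one option is a simple geometric heuristic. The sketch below is an assumption of this page, not part of scikit-learn or of the tuner class: `find_elbow` is a hypothetical helper that normalises the descending curve and returns the point farthest from the straight line joining its endpoints. The sample values are illustrative only.

```python
def find_elbow(k_distances):
    """Return (index, value) of the point farthest from the chord joining
    the first and last points of a descending k-distance curve.

    Hypothetical helper: a geometric heuristic, not a library routine.
    """
    n = len(k_distances)
    if n < 3:
        raise ValueError("need at least 3 points to locate an elbow")
    top, bottom = max(k_distances), min(k_distances)
    if top == bottom:
        raise ValueError("curve is flat; no elbow to find")
    best_idx, best_gap = 0, -1.0
    for i, d in enumerate(k_distances):
        x = i / (n - 1)                      # normalised position along the curve
        y = (d - bottom) / (top - bottom)    # normalised distance in [0, 1]
        gap = abs(x + y - 1.0)               # proportional to distance from chord y = 1 - x
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx, k_distances[best_idx]


# Illustrative values (radians): a few sparse points, then a dense plateau.
curve = [9.0e-05, 8.0e-05, 7.0e-05, 1.6e-05, 1.5e-05, 1.4e-05, 1.3e-05, 1.2e-05]
idx, eps_candidate = find_elbow(curve)
print(idx, eps_candidate)
```

Treat the returned value as a candidate to confirm visually; noisy curves can have several bends, and this heuristic reports only one.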

2. Convert metres to radians explicitly

Use DeliveryStopTuner.metres_to_radians(metres) rather than hard-coding radian values. The conversion metres / 6_371_000 is straightforward but easy to transpose: at 50 m the radian value is 7.85e-06, not 7.85e-05. A one-order-of-magnitude error silently inflates eps to a ~500 m radius, merging several city blocks into one stop.
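A quick round trip makes transposed exponents easy to spot. The snippet below is a standalone sketch of the same arithmetic as the tuner's helper; `radians_to_metres` is a hypothetical inverse added here for checking hard-coded values.

```python
EARTH_RADIUS_M = 6_371_000  # mean Earth radius used throughout this page


def metres_to_radians(metres: float) -> float:
    """Convert a ground distance in metres to radians on the Earth sphere."""
    return metres / EARTH_RADIUS_M


def radians_to_metres(radians: float) -> float:
    """Hypothetical inverse helper: convert an eps in radians back to metres."""
    return radians * EARTH_RADIUS_M


for metres in (50, 100, 150):
    print(f"{metres:>4} m -> {metres_to_radians(metres):.3e} rad")

# Convert hard-coded candidates back to metres before using them as eps.
for eps in (7.85e-06, 7.85e-05):
    print(f"{eps:.2e} rad -> {radians_to_metres(eps):.0f} m")
```

Printing the metre equivalent next to every radian value in configuration or logs is a cheap way to keep the two in step.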

3. Interpret `eps` and `min_samples` together

Raise eps (e.g., from 100 m to 130 m) when GPS drift causes a single physical stop to split across two clusters. Typical symptom: a loading dock appears as two adjacent events 20–40 m apart.
Lower eps (e.g., from 100 m to 70 m) when a busy intersection or traffic light produces false positives. Confirm by checking the ping_count of flagged clusters — signal-stop false positives rarely exceed 4–5 pings at 10-second sampling.
Raise min_samples when short traffic halts (< 30 s) contaminate results. Each additional sample adds one sampling interval of required dwell.
Lower min_samples (minimum 2) for fleets with sparse 30-second or 60-second sampling intervals, where even a two-minute stop produces only 2–4 pings within the eps radius.
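The interaction between sampling interval and min_samples is simple arithmetic, sketched below. Both function names are hypothetical, and the estimates are deliberately rough: they ignore dropped pings and pings that drift outside the eps radius.

```python
def approx_pings_during_stop(stop_seconds: float, interval_seconds: float) -> int:
    """Rough number of pings logged while parked (ignores dropouts and drift)."""
    return int(stop_seconds // interval_seconds)


def implied_stationary_seconds(min_samples: int, interval_seconds: float) -> float:
    """Approximate time a vehicle must sit still to supply min_samples pings."""
    return min_samples * interval_seconds


# How many pings does a two-minute stop yield at common sampling intervals?
for interval in (10, 30, 60):
    pings = approx_pings_during_stop(120, interval)
    print(f"{interval:>2} s sampling: ~{pings} pings in a 2-minute stop")

# How long must a truck be stationary for min_samples=4 at 10 s sampling?
print(implied_stationary_seconds(4, 10))
```

If the estimated ping count for your shortest genuine stop is below min_samples, that stop cannot form a core point and will be labelled noise.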

4. Validate against ground truth

Cross-reference the output stops DataFrame against at least one of: driver delivery confirmation logs, geofenced depot entry/exit records, or customer POI coordinates from a location typing and POI matching pipeline. Compute precision and recall across labelled stops. Aim for recall > 0.90 before tightening precision — missed stops are operationally more costly than spurious ones for most last-mile use cases.
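A minimal sketch of that comparison follows, assuming ground truth is available as a list of labelled stop coordinates in degrees. The function names, the greedy one-to-one matching rule, and the 75 m tolerance are all assumptions made for illustration; a production check would usually also compare timestamps.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def precision_recall(detected, labelled, tolerance_m=75.0):
    """Greedy one-to-one matching of detected centroids to labelled stops.

    detected, labelled: lists of (lat, lon) tuples in degrees.
    tolerance_m is an assumed matching radius; tune it alongside eps.
    """
    unmatched = list(labelled)
    true_pos = 0
    for lat, lon in detected:
        hit = next(
            (g for g in unmatched if haversine_m(lat, lon, g[0], g[1]) <= tolerance_m),
            None,
        )
        if hit is not None:
            unmatched.remove(hit)  # each labelled stop can be matched only once
            true_pos += 1
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(labelled) if labelled else 0.0
    return precision, recall


# Illustrative coordinates: two detections near labelled stops, one spurious.
labelled = [(51.5000, -0.1000), (51.5100, -0.1200), (51.5200, -0.0900)]
detected = [(51.5001, -0.1001), (51.5101, -0.1199), (51.5300, -0.0500)]
precision, recall = precision_recall(detected, labelled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With the stops DataFrame from the tuner, the detected list could be built from the centroid_lat and centroid_lon columns.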

5. Apply GPS noise reduction upstream

Raw telematics pings frequently jump 10–20 m even when the vehicle is parked. Applying Kalman filtering for GPS noise reduction or a rolling median filter for GPS drift removal before clustering allows you to use a tighter eps without fragmenting genuine stops. Smoothed inputs also reduce the minimum required min_samples because consecutive pings are more spatially coherent.
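As a rough illustration of the rolling-median option, the sketch below filters latitude and longitude independently with a centred window. It assumes pings are already time-ordered and roughly evenly spaced; `rolling_median` is a hypothetical helper and the coordinates are made up.

```python
from statistics import median


def rolling_median(values, window=5):
    """Centred rolling median; the window shrinks at the edges of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Parked vehicle with one jump of roughly 20 m in latitude (illustrative values).
lats = [51.50000, 51.50001, 51.50018, 51.50000, 51.49999, 51.50001]
lons = [-0.10000, -0.10001, -0.10000, -0.09999, -0.10000, -0.10001]

smooth_lats = rolling_median(lats, window=3)
smooth_lons = rolling_median(lons, window=3)
print(smooth_lats)
```

A median suppresses isolated jumps without smearing them into neighbouring pings the way a moving average would, but a window that is too wide will also blur genuine arrivals and departures.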

Common Pitfalls

Depot merges: morning arrival and evening return collapse into one stop event

DBSCAN operates on spatial proximity alone. If a truck visits the same depot at 06:00 and returns at 20:00, the pings from both visits overlap spatially and form a single cluster. The temporal_gap_minutes parameter in the tuner above splits these by detecting gaps larger than the threshold inside an otherwise-spatial cluster. Set temporal_gap_minutes to a value shorter than your minimum planned route duration (typically 15–60 minutes for last-mile delivery).
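The splitting rule can be seen in isolation with plain timestamps. This is a simplified sketch of the idea for a single spatial cluster, not the tuner's pandas implementation; `split_by_gap` and the sample date are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_gap(timestamps, gap_minutes=15.0):
    """Assign an event index to each timestamp of one spatial cluster.

    A new event starts whenever the gap to the previous ping exceeds
    gap_minutes. Timestamps are sorted first; returns (timestamp, event) pairs.
    """
    gap = timedelta(minutes=gap_minutes)
    ordered = sorted(timestamps)
    events, current = [], 0
    for i, ts in enumerate(ordered):
        if i > 0 and ts - ordered[i - 1] > gap:
            current += 1  # absence longer than the threshold: new stop event
        events.append((ts, current))
    return events


# Same depot, visited at 06:00 and again at 20:00 (illustrative date).
day = datetime(2024, 1, 15)
morning = [day.replace(hour=6, minute=0, second=s) for s in (0, 10, 20, 30)]
evening = [day.replace(hour=20, minute=0, second=s) for s in (0, 10, 20, 30)]

events = split_by_gap(morning + evening, gap_minutes=15.0)
print(sorted({event for _, event in events}))
```

Pings 10 seconds apart stay in the same event, while the 14-hour absence between the two visits starts a second one.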

Silent distance corruption from wrong coordinate order or missing radian conversion

scikit-learn’s haversine metric expects input in [latitude, longitude] order and in radians, not degrees. Passing degrees makes distances between nearby points roughly 57× too large (since 1 radian ≈ 57.3 degrees), so eps=1.57e-05 behaves like a radius of under 2 m instead of ~100 m and almost every ping is labelled noise. Passing [longitude, latitude] applies the cos(latitude) scaling to the wrong axis, distorting distances by an amount that varies with position. Neither mistake raises an exception; the output is silently wrong. The _prefilter method in the tuner above enforces radian conversion via np.radians, but always verify with a sanity check: the haversine distance between two points 100 m apart should return approximately 1.57e-05 radians.
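The sanity check can be run without any clustering at all. The sketch below uses a pure-Python haversine as a stand-in for the library metric (same formula, same [lat, lon] radian convention) and compares correct radian input with degrees passed in by mistake. The coordinates are illustrative and `haversine_rad` is a hypothetical helper.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_rad(p, q):
    """Haversine distance in radians; p and q are (lat, lon) pairs in radians."""
    dlat = q[0] - p[0]
    dlon = q[1] - p[1]
    a = math.sin(dlat / 2) ** 2 + math.cos(p[0]) * math.cos(q[0]) * math.sin(dlon / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


# Two illustrative points 100 m apart along a meridian, in degrees.
lat_deg, lon_deg = 51.5, -0.1
offset_deg = math.degrees(100 / EARTH_RADIUS_M)
a_deg = (lat_deg, lon_deg)
b_deg = (lat_deg + offset_deg, lon_deg)

# Correct usage: convert to radians first.
a_rad = tuple(math.radians(v) for v in a_deg)
b_rad = tuple(math.radians(v) for v in b_deg)
d_ok = haversine_rad(a_rad, b_rad)

# Mistake: degrees passed straight in, as if they were radians.
d_bad = haversine_rad(a_deg, b_deg)

print(f"radians in : {d_ok:.3e} rad (~{d_ok * EARTH_RADIUS_M:.0f} m)")
print(f"degrees in : {d_bad:.3e} rad, ratio {d_bad / d_ok:.1f}")
```

Running the same two points through your own pipeline and comparing against the first line is a quick way to confirm units and column order before tuning eps.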

BallTree / KDTree incompatibility: kd_tree is rejected, and brute force scales poorly

Specifying algorithm='kd_tree' with metric='haversine' raises a ValueError, because KDTree does not implement haversine. Specifying algorithm='brute' is accepted but computes distances in O(n²), which for datasets exceeding 100,000 filtered pings can increase fitting time from seconds to minutes. Always specify algorithm='ball_tree' or algorithm='auto' (which selects BallTree for haversine). For datasets exceeding 10 million filtered pings, precompute a sparse radius-neighbours graph in chunks from a pre-built BallTree and pass it to DBSCAN(metric='precomputed'); a dense distance matrix at that scale will not fit in memory.


DBSCAN for Fleet Stop Clustering — the parent topic covering algorithm prerequisites, production workflow, and CRS handling for density-based stop extraction
Location Typing and POI Matching for Stops — enrich clustered stop centroids with venue type, address, and commercial category
Time-Window-Based Dwell Calculation — compute accurate dwell durations from stop event timestamps, including timezone-shift handling
Kalman Filtering for GPS Noise Reduction — reduce positional noise upstream to improve cluster coherence and allow tighter eps values
Outlier Removal in Raw Telematics Streams — strip high-speed transient spikes that would otherwise survive the velocity pre-filter and corrupt stop centroids

Related