@derhuerst derhuerst self-assigned this Apr 11, 2023
@socket-security

socket-security bot commented Apr 11, 2023

All alerts resolved.

This PR previously contained dependency changes with security issues that have been resolved, removed, or ignored.

@matthiasfeist

Oh, nice to see this PR :) I was just about to open an issue suggesting that DuckDB could be a much quicker way to do data analytics on GTFS datasets.

@derhuerst
Member Author

I'm making progress! With the current state, importing the 2025-05-09 VBB GTFS works.

@derhuerst
Member Author

derhuerst commented May 12, 2025

I stumbled upon this weird behaviour (bug?) in DuckDB v1.2.2's query plan output.

I redefined `arrivals_departures` as follows:
CREATE OR REPLACE VIEW "main.arrivals_departures" AS
SELECT
	(
		to_base64(encode(trip_id))
		|| ':' || to_base64(encode(
			extract(ISOYEAR FROM "date")
			|| '-' || lpad(extract(MONTH FROM "date")::text, 2, '0')
			|| '-' || lpad(extract(DAY FROM "date")::text, 2, '0')
		))
		|| ':' || to_base64(encode(stop_sequence::text))
		-- frequencies_row
		|| ':' || to_base64(encode('-1'))
		-- frequencies_it
		|| ':' || to_base64(encode('-1'))
	) as arrival_departure_id,

	-- todo: expose local arrival/departure "wall clock time"?

	-1 AS frequencies_row,
	-1 AS frequencies_it,

	stop_times_based.*
	EXCLUDE (
		arrival_time,
		departure_time
	)
FROM (
	SELECT
		agency.agency_id,
		trips.route_id,
		route_short_name,
		route_long_name,
		route_type,
		s.trip_id,
		trips.direction_id,
		trips.trip_headsign,
		trips.wheelchair_accessible,
		trips.bikes_allowed,
		service_days.service_id,
		trips.shape_id,
		"date",
		stop_sequence,
		stop_sequence_consec,
		stop_headsign,
		pickup_type,
		drop_off_type,
		shape_dist_traveled,
		timepoint,
		agency.agency_timezone as tz,
		arrival_time,
		(
			make_timestamptz(
				date_part('year', "date")::int,
				date_part('month', "date")::int,
				date_part('day', "date")::int,
				12, 0, 0,
				agency.agency_timezone
			)
			- INTERVAL '12 hours'
			+ arrival_time
		) t_arrival,
		departure_time,
		(
			make_timestamptz(
				date_part('year', "date")::int,
				date_part('month', "date")::int,
				date_part('day', "date")::int,
				12, 0, 0,
				agency.agency_timezone
			)
			- INTERVAL '12 hours'
			+ departure_time
		) t_departure,
		trip_start_time,
		s.stop_id, stops.stop_name,
		stations.stop_id station_id, stations.stop_name station_name,
		-- todo: PR #47
		coalesce(
			nullif(stops.wheelchair_boarding, 'no_info_or_inherit'),
			nullif(stations.wheelchair_boarding, 'no_info_or_inherit'),
			'no_info_or_inherit'
		) AS wheelchair_boarding
	FROM (
		"main.stop_times" s
		JOIN "main.stops" stops ON s.stop_id = stops.stop_id
		LEFT JOIN "main.stops" stations ON stops.parent_station = stations.stop_id
		JOIN "main.trips" trips ON s.trip_id = trips.trip_id
		JOIN "main.routes" routes ON trips.route_id = routes.route_id
		LEFT JOIN "main.agency" agency ON (
			-- The GTFS spec allows routes.agency_id to be NULL if there is exactly one agency in the feed.
			-- Note: We implicitly rely on other parts of the code base to validate that agency has just one row!
			-- It seems that GTFS has allowed this at least since 2016:
			-- https://github.com/google/transit/blame/217e9bf/gtfs/spec/en/reference.md#L544-L554
			routes.agency_id IS NULL -- match first (and only) agency
			OR routes.agency_id = agency.agency_id -- match by ID
		)
		JOIN "main.service_days" service_days ON trips.service_id = service_days.service_id
	)
	-- todo: this slows down slightly
	-- ORDER BY route_id, s.trip_id, "date", stop_sequence
) stop_times_based;

Look at the time (0.36s) reported for the sequential scan over `main.stop_times` near the bottom of the query plan's tree, while the top says "Total Time: 0.0924s". A single operator taking longer than the whole query looks contradictory. Does it use multiple cores for scanning?

query plan
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM "main.arrivals_departures" WHERE t_departure >= '2025-05-09 18:00:00+02:00' AND t_departure < '2025-05-09 18:20:00+02:00' AND (date = '2025-05-08' OR date = '2025-05-09') AND station_id = 'de:11000:900100001';
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.0924s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│    arrival_departure_id   │
│      frequencies_row      │
│       frequencies_it      │
│         agency_id         │
│          route_id         │
│      route_short_name     │
│      route_long_name      │
│         route_type        │
│          trip_id          │
│        direction_id       │
│       trip_headsign       │
│   wheelchair_accessible   │
│       bikes_allowed       │
│         service_id        │
│          shape_id         │
│            ...            │
│    stop_sequence_consec   │
│       stop_headsign       │
│        pickup_type        │
│       drop_off_type       │
│    shape_dist_traveled    │
│         timepoint         │
│             tz            │
│         t_arrival         │
│        t_departure        │
│      trip_start_time      │
│          stop_id          │
│         stop_name         │
│         station_id        │
│        station_name       │
│    wheelchair_boarding    │
│                           │
│          48 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│         agency_id         │
│          route_id         │
│      route_short_name     │
│      route_long_name      │
│         route_type        │
│          trip_id          │
│        direction_id       │
│       trip_headsign       │
│   wheelchair_accessible   │
│       bikes_allowed       │
│         service_id        │
│          shape_id         │
│            date           │
│       stop_sequence       │
│    stop_sequence_consec   │
│       stop_headsign       │
│        pickup_type        │
│       drop_off_type       │
│    shape_dist_traveled    │
│         timepoint         │
│             tz            │
│         t_arrival         │
│        t_departure        │
│      trip_start_time      │
│          stop_id          │
│         stop_name         │
│         station_id        │
│        station_name       │
│    wheelchair_boarding    │
│                           │
│          48 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│         agency_id         │
│          route_id         │
│      route_short_name     │
│      route_long_name      │
│         route_type        │
│          trip_id          │
│        direction_id       │
│       trip_headsign       │
│   wheelchair_accessible   │
│       bikes_allowed       │
│         service_id        │
│          shape_id         │
│            date           │
│       stop_sequence       │
│    stop_sequence_consec   │
│            ...            │
│    shape_dist_traveled    │
│         timepoint         │
│             tz            │
│ CAST(CAST("year"(date) AS │
│     INTEGER) AS BIGINT)   │
│ CAST(CAST("month"(date) AS│
│     INTEGER) AS BIGINT)   │
│  CAST(CAST("day"(date) AS │
│     INTEGER) AS BIGINT)   │
│        arrival_time       │
│       departure_time      │
│      trip_start_time      │
│          stop_id          │
│         stop_name         │
│         station_id        │
│        station_name       │
│    wheelchair_boarding    │
│    wheelchair_boarding    │
│                           │
│          48 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│  (((make_timestamptz(CAST │
│   (CAST("year"(date) AS   │
│  INTEGER) AS BIGINT), CAST│
│   (CAST("month"(date) AS  │
│  INTEGER) AS BIGINT), CAST│
│    (CAST("day"(date) AS   │
│  INTEGER) AS BIGINT), 12, │
│  0, 0.0, agency_timezone) │
│ - '12:00:00'::INTERVAL) + │
│  departure_time) BETWEEN  │
│ '2025-05-09 16:00:00+00': │
│ :TIMESTAMP WITH TIME ZONE │
│  AND '2025-05-09 16:20:00 │
│ +00'::TIMESTAMP WITH TIME │
│            ZONE)          │
│                           │
│          48 Rows          │
│          (0.01s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: INNER     │
│                           │
│        Conditions:        ├──────────────┐
│  service_id = service_id  │              │
│                           │              │
│         2686 Rows         │              │
│          (0.00s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│           FILTER          ││     BLOCKWISE_NL_JOIN     │
│    ────────────────────   ││    ────────────────────   │
│ (date = '2025-05-09 00:00 ││      Join Type: RIGHT     │
│      :00'::TIMESTAMP)     ││                           │
│                           ││         Condition:        ├──────────────┐
│                           ││  (agency_id = agency_id)  │              │
│                           ││                           │              │
│          995 Rows         ││         13256 Rows        │              │
│          (0.00s)          ││          (0.00s)          │              │
└─────────────┬─────────────┘└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        ││         HASH_JOIN         │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│           Table:          ││     Table: main.agency    ││      Join Type: INNER     │
│     main.service_days     ││   Type: Sequential Scan   ││                           │
│                           ││                           ││        Conditions:        │
│   Type: Sequential Scan   ││        Projections:       ││    route_id = route_id    │
│                           ││         agency_id         ││                           ├──────────────┐
│        Projections:       ││      agency_timezone      ││                           │              │
│         service_id        ││                           ││                           │              │
│            date           ││                           ││                           │              │
│                           ││                           ││                           │              │
│        193254 Rows        ││          34 Rows          ││         13256 Rows        │              │
│          (0.00s)          ││          (0.00s)          ││          (0.00s)          │              │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘              │
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │         TABLE_SCAN        ││         HASH_JOIN         │
                                                          │    ────────────────────   ││    ────────────────────   │
                                                          │     Table: main.routes    ││      Join Type: INNER     │
                                                          │   Type: Sequential Scan   ││                           │
                                                          │                           ││        Conditions:        │
                                                          │        Projections:       ││     trip_id = trip_id     │
                                                          │          route_id         ││                           │
                                                          │         agency_id         ││                           ├──────────────┐
                                                          │      route_short_name     ││                           │              │
                                                          │      route_long_name      ││                           │              │
                                                          │         route_type        ││                           │              │
                                                          │                           ││                           │              │
                                                          │          884 Rows         ││         13256 Rows        │              │
                                                          │          (0.00s)          ││          (0.01s)          │              │
                                                          └───────────────────────────┘└─────────────┬─────────────┘              │
                                                                                       ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                       │         TABLE_SCAN        ││         HASH_JOIN         │
                                                                                       │    ────────────────────   ││    ────────────────────   │
                                                                                       │     Table: main.trips     ││      Join Type: INNER     │
                                                                                       │   Type: Sequential Scan   ││                           │
                                                                                       │                           ││        Conditions:        │
                                                                                       │        Projections:       ││     stop_id = stop_id     │
                                                                                       │          trip_id          ││                           │
                                                                                       │          route_id         ││                           │
                                                                                       │         service_id        ││                           ├──────────────┐
                                                                                       │        direction_id       ││                           │              │
                                                                                       │       trip_headsign       ││                           │              │
                                                                                       │   wheelchair_accessible   ││                           │              │
                                                                                       │       bikes_allowed       ││                           │              │
                                                                                       │          shape_id         ││                           │              │
                                                                                       │                           ││                           │              │
                                                                                       │        266359 Rows        ││         13256 Rows        │              │
                                                                                       │          (0.01s)          ││          (0.06s)          │              │
                                                                                       └───────────────────────────┘└─────────────┬─────────────┘              │
                                                                                                                    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                                                    │         TABLE_SCAN        ││         HASH_JOIN         │
                                                                                                                    │    ────────────────────   ││    ────────────────────   │
                                                                                                                    │           Table:          ││      Join Type: INNER     │
                                                                                                                    │      main.stop_times      ││                           │
                                                                                                                    │                           ││        Conditions:        │
                                                                                                                    │   Type: Sequential Scan   ││parent_station = station_id│
                                                                                                                    │                           ││                           │
                                                                                                                    │        Projections:       ││                           │
                                                                                                                    │          stop_id          ││                           │
                                                                                                                    │          trip_id          ││                           │
                                                                                                                    │       stop_sequence       ││                           │
                                                                                                                    │    stop_sequence_consec   ││                           ├──────────────┐
                                                                                                                    │       stop_headsign       ││                           │              │
                                                                                                                    │        pickup_type        ││                           │              │
                                                                                                                    │       drop_off_type       ││                           │              │
                                                                                                                    │    shape_dist_traveled    ││                           │              │
                                                                                                                    │         timepoint         ││                           │              │
                                                                                                                    │        arrival_time       ││                           │              │
                                                                                                                    │       departure_time      ││                           │              │
                                                                                                                    │      trip_start_time      ││                           │              │
                                                                                                                    │                           ││                           │              │
                                                                                                                    │        2899055 Rows       ││          177 Rows         │              │
                                                                                                                    │          (0.36s)          ││          (0.00s)          │              │
                                                                                                                    └───────────────────────────┘└─────────────┬─────────────┘              │
                                                                                                                                                 ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                                                                                 │         TABLE_SCAN        ││         TABLE_SCAN        │
                                                                                                                                                 │    ────────────────────   ││    ────────────────────   │
                                                                                                                                                 │     Table: main.stops     ││     Table: main.stops     │
                                                                                                                                                 │   Type: Sequential Scan   ││      Type: Index Scan     │
                                                                                                                                                 │                           ││                           │
                                                                                                                                                 │        Projections:       ││        Projections:       │
                                                                                                                                                 │          stop_id          ││          stop_id          │
                                                                                                                                                 │       parent_station      ││         stop_name         │
                                                                                                                                                 │         stop_name         ││    wheelchair_boarding    │
                                                                                                                                                 │    wheelchair_boarding    ││                           │
                                                                                                                                                 │                           ││          Filters:         │
                                                                                                                                                 │          Filters:         ││     stop_id='de:11000     │
                                                                                                                                                 │  parent_station='de:11000 ││        :900100001'        │
                                                                                                                                                 │        :900100001'        ││                           │
                                                                                                                                                 │                           ││                           │
                                                                                                                                                 │          177 Rows         ││           1 Rows          │
                                                                                                                                                 │          (0.00s)          ││          (0.00s)          │
                                                                                                                                                 └───────────────────────────┘└───────────────────────────┘
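
One way to check the multi-core hypothesis, as a sketch under assumptions: as far as I understand DuckDB's profiling documentation, the per-operator timings are cumulative across threads, so with parallel execution they can add up to more than the total wall-clock time. If that is what happens here, rerunning the same query with a single thread should bring the scan time and the total time much closer together. `SET threads` and `RESET threads` are regular DuckDB configuration statements; whether the numbers actually converge on this dataset is untested.

```sql
-- Sketch: rerun the profiled query single-threaded and compare the timings.
-- Assumption: per-operator times are summed over all worker threads.
SELECT current_setting('threads') AS threads_before;

SET threads = 1;

EXPLAIN ANALYZE
SELECT *
FROM "main.arrivals_departures"
WHERE t_departure >= '2025-05-09 18:00:00+02:00'
	AND t_departure < '2025-05-09 18:20:00+02:00'
	AND (date = '2025-05-08' OR date = '2025-05-09')
	AND station_id = 'de:11000:900100001';

-- Restore the default thread count afterwards.
RESET threads;
```

If the single-threaded run reports a total time at or above the `main.stop_times` scan time, the 0.36s figure is most likely summed thread time rather than a bug in the plan output.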

edit: maybe duckdb/duckdb#17607 is related, but most likely not

@derhuerst derhuerst force-pushed the duckdb branch 2 times, most recently from 4d2805f to dfe8b8d Compare August 13, 2025 15:49