
Commit 233bce5

DuckDB [todo]
1 parent 71cefd8


47 files changed: +2252, -2614 lines

.eslintrc.json

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
     "node_modules"
   ],
   "rules": {
-    "no-unused-vars": "off",
+    "no-unused-vars": "warn",
     "no-irregular-whitespace": "off"
   }
 }

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -10,8 +10,11 @@ pnpm-debug.log
 /shrinkwrap.yaml
 
 /test/amtrak-gtfs-2021-10-06
+/test/*.duckdb
 
 /*.gtfs
 /*.gtfs.zip
 /*.gtfs.tar.gz
 /*.gtfs.tar.zst
+
+/*.duckdb

Dockerfile

Lines changed: 0 additions & 5 deletions
@@ -9,11 +9,6 @@ LABEL org.opencontainers.image.licenses="(Apache-2.0 AND Prosperity-3.0.0)"
 
 WORKDIR /app
 
-# Both moreutils (providing sponge) and postgresql-client (providing psql) are not required but come in handy for users.
-RUN apk add --no-cache \
-    postgresql-client \
-    moreutils
-
 ADD package.json /app
 RUN npm install --production && npm cache clean --force

cli.js

Lines changed: 18 additions & 76 deletions
@@ -44,9 +44,6 @@ const {
     'lower-case-lang-codes': {
         type: 'boolean',
     },
-    'stops-location-index': {
-        type: 'boolean',
-    },
     'stats-by-route-date': {
         type: 'string',
     },
@@ -59,21 +56,6 @@ const {
     'schema': {
         type: 'string',
     },
-    'postgraphile': {
-        type: 'boolean',
-    },
-    'postgraphile-password': {
-        type: 'string',
-    },
-    'postgrest': {
-        type: 'boolean',
-    },
-    'postgrest-password': {
-        type: 'string',
-    },
-    'postgrest-query-cost-limit': {
-        type: 'string',
-    },
     'import-metadata': {
         type: 'boolean',
     }
@@ -84,7 +66,7 @@ const {
 if (flags.help) {
     process.stdout.write(`
 Usage:
-    gtfs-to-sql [options] [--] <gtfs-file> ...
+    import-gtfs-into-duckdb [options] [--] <path-to-duckdb> <gtfs-file> ...
 Options:
     --silent                      -s  Don't show files being converted.
     --require-dependencies        -d  Require files that the specified GTFS files depend
@@ -102,8 +84,6 @@ Options:
     --routes-without-agency-id    Don't require routes.txt items to have an agency_id.
     --stops-without-level-id      Don't require stops.txt items to have a level_id.
                                   Default if levels.txt has not been provided.
-    --stops-location-index        Create a spatial index on stops.stop_loc for efficient
-                                  queries by geolocation.
     --lower-case-lang-codes       Accept Language Codes (e.g. in feed_info.feed_lang)
                                   with a different casing than the official BCP-47
                                   language tags (as specified by the GTFS spec),
@@ -124,34 +104,18 @@ Options:
                                   currently running trips over time, by hour.
                                   Like --stats-by-route-date, this flag accepts
                                   none, view & materialized-view.
-    --schema                      The schema to use for the database. Default: public
-                                  Even when importing into a schema other than \`public\`,
-                                  a function \`public.gtfs_via_postgres_import_version()\`
-                                  gets created, to ensure that multiple imports into the
-                                  same database are all made using the same version. See
-                                  also multiple-datasets.md in the docs.
-    --postgraphile                Tweak generated SQL for PostGraphile usage.
-                                  https://www.graphile.org/postgraphile/
-    --postgraphile-password       Password for the PostGraphile PostgreSQL user.
-                                  Default: $POSTGRAPHILE_PGPASSWORD, fallback random.
-    --postgrest                   Tweak generated SQL for PostgREST usage.
-                                  Please combine it with --schema.
-                                  https://postgrest.org/
-    --postgrest-password          Password for the PostgREST PostgreSQL user \`web_anon\`.
-                                  Default: $POSTGREST_PGPASSWORD, fallback random.
-    --postgrest-query-cost-limit  Define a cost limit [1] for queries executed by PostgREST
-                                  on behalf of a user. It is only enforced if
-                                  pg_plan_filter [2] is installed in the database!
-                                  Must be a positive float. Default: none
-                                  [1] https://www.postgresql.org/docs/14/using-explain.html
-                                  [2] https://github.com/pgexperts/pg_plan_filter
+    --schema                      The schema to use for the database. Default: main
+                                  May not contain \`.\`.
     --import-metadata             Create functions returning import metadata:
                                   - gtfs_data_imported_at (timestamp with time zone)
                                   - gtfs_via_postgres_version (text)
                                   - gtfs_via_postgres_options (jsonb)
+Notes:
+    If you just want to check if the GTFS data can be imported but don't care about the
+    resulting DuckDB database file, you can import into an in-memory database by specifying
+    \`:memory:\` as the <path-to-duckdb>.
 Examples:
-    gtfs-to-sql some-gtfs/*.txt | sponge | psql -b # import into PostgreSQL
-    gtfs-to-sql -u -- some-gtfs/*.txt | gzip >gtfs.sql.gz # generate a gzipped SQL dump
+    import-gtfs-into-duckdb some-gtfs.duckdb some-gtfs/*.txt
 
 [1] https://developers.google.com/transit/gtfs/reference/extended-route-types
 [2] https://groups.google.com/g/gtfs-changes/c/keT5rTPS7Y0/m/71uMz2l6ke0J
@@ -165,11 +129,11 @@ if (flags.version) {
 }
 
 const {basename, extname} = require('path')
-const {pipeline} = require('stream')
 const convertGtfsToSql = require('./index')
-const DataError = require('./lib/data-error')
 
-const files = args.map((file) => {
+const [pathToDb] = args
+
+const files = args.slice(1).map((file) => {
     const name = basename(file, extname(file))
     return {name, file}
 })
@@ -185,9 +149,7 @@ const opt = {
     statsByRouteIdAndDate: flags['stats-by-route-date'] || 'none',
     statsByAgencyIdAndRouteIdAndStopAndHour: flags['stats-by-agency-route-stop-hour'] || 'none',
     statsActiveTripsByHour: flags['stats-active-trips-by-hour'] || 'none',
-    schema: flags['schema'] || 'public',
-    postgraphile: !!flags.postgraphile,
-    postgrest: !!flags.postgrest,
+    schema: flags['schema'] || 'main',
     importMetadata: !!flags['import-metadata'],
 }
 if ('stops-without-level-id' in flags) {
@@ -196,31 +158,11 @@ if ('stops-without-level-id' in flags) {
 if ('lower-case-lang-codes' in flags) {
     opt.lowerCaseLanguageCodes = flags['lower-case-lang-codes']
 }
-if ('postgraphile-password' in flags) {
-    opt.postgraphilePassword = flags['postgraphile-password']
-}
-if ('postgrest-password' in flags) {
-    opt.postgrestPassword = flags['postgrest-password']
-}
-if ('postgrest-query-cost-limit' in flags) {
-    const limit = parseFloat(flags['postgrest-query-cost-limit'])
-    if (!Number.isFinite(limit) || limit < 0) {
-        console.error('Invalid --postgrest-query-cost-limit value.')
-        process.exit(1)
-    }
-    opt.lowerCaseLanguageCodes = limit
-}
 
-pipeline(
-    convertGtfsToSql(files, opt),
-    process.stdout,
-    (err) => {
-        if (!err) return;
-        if (err instanceof DataError) {
-            console.error(String(err))
-        } else if (err.code !== 'EPIPE') {
-            console.error(err)
-        }
-        process.exit(1)
+convertGtfsToSql(pathToDb, files, opt)
+.catch((err) => {
+    if (err.code !== 'EPIPE') { // todo: check still necessary? we don't pipe anymore
+        console.error(err)
     }
-)
+    process.exit(1)
+})
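
A quick way to exercise the new entry point is the sketch below. It is only an illustration: the `:memory:` shortcut and the flag spellings are taken from the help text above, and `some-gtfs/` is a placeholder for any extracted GTFS feed.

# Sketch: check that a feed imports cleanly without keeping a database file.
# `:memory:` as <path-to-duckdb> is described in the help text above;
# some-gtfs/ is a placeholder for an extracted GTFS feed.
./cli.js --require-dependencies -- \
    :memory: \
    some-gtfs/*.txt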

docs/import-metadata.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ SELECT gtfs_via_postgres_version()
 -- 4.5.3
 
 SELECT gtfs_via_postgres_options()
--- {"schema": "public", "silent": false, "importStart": 1681417454781, "postgraphile": false, "importMetadata": true, … }
+-- {"schema": "public", "silent": false, "importStart": 1681417454781, "importMetadata": true, … }
 SELECT (gtfs_via_postgres_options())['tripsWithoutShapeId']
 -- true
 ```
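
Assuming these metadata functions keep working unchanged after the port, they could be read back from a finished import via the `duckdb` CLI, mirroring the `duckdb -csv -c` invocation style that example.sh (below) switches to; `example.duckdb` is a placeholder path.

# Sketch: read the import metadata back out of a DuckDB database file.
# Assumes the import was run with --import-metadata; example.duckdb is a placeholder.
duckdb -csv -c "SELECT gtfs_via_postgres_version(), gtfs_via_postgres_options()" example.duckdb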

docs/multiple-datasets.md

Lines changed: 5 additions & 4 deletions
@@ -8,14 +8,14 @@ As an example, let's import two datasets ([Paris](https://en.wikipedia.org/wiki/
 wget -U 'gtfs-via-postgres demo' -O paris.gtfs.zip 'https://eu.ftp.opendatasoft.com/stif/GTFS/IDFM-gtfs.zip'
 unzip -d paris.gtfs paris.gtfs.zip
 gtfs-to-sql --require-dependencies \
-    --schema paris -- paris.gtfs/*.txt \
-    | sponge | psql -b
+    --schema paris multiple-datasets.duckdb -- \
+    paris.gtfs/*.txt
 
 wget -U 'gtfs-via-postgres demo' -O berlin.gtfs.zip 'https://www.vbb.de/vbbgtfs'
 unzip -d berlin.gtfs berlin.gtfs.zip
 gtfs-to-sql --require-dependencies \
-    --schema berlin -- berlin.gtfs/*.txt \
-    | sponge | psql -b
+    --schema berlin multiple-datasets.duckdb -- \
+    berlin.gtfs/*.txt
 ```
 
 We can now do queries across both datasets, for example finding the geographically furthest 2 stops:
@@ -28,6 +28,7 @@ SELECT
 FROM
     paris.stops paris,
     berlin.stops berlin
+-- todo: does this operator work in DuckDB?
 ORDER BY paris.stop_loc <-> berlin.stop_loc DESC
 LIMIT 100
 ```
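
Regarding the todo in the hunk above: `<->` is a PostgreSQL distance operator and does not exist in DuckDB. One possible replacement is sketched below, under the assumptions that the import was run into multiple-datasets.duckdb as shown and that stop_loc ends up as a GEOMETRY column usable with DuckDB's spatial extension; the selected columns are illustrative.

# Sketch only: ST_Distance from DuckDB's spatial extension instead of `<->`.
# Assumes stop_loc is imported as a GEOMETRY column; ST_Distance then returns
# planar distance in coordinate units (degrees), an approximation that still
# works for ordering.
duckdb -c "$(cat <<- EOM
    INSTALL spatial;
    LOAD spatial;
    SELECT
        paris.stop_id, berlin.stop_id,
        ST_Distance(paris.stop_loc, berlin.stop_loc) AS distance
    FROM
        paris.stops paris,
        berlin.stops berlin
    ORDER BY distance DESC
    LIMIT 100;
EOM)" multiple-datasets.duckdb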

example.sh

Lines changed: 15 additions & 12 deletions
@@ -1,36 +1,39 @@
 #!/bin/sh
 
 set -e
+set -u
 set -o pipefail
 
-2>&1 echo "importing into PostgreSQL:"
+rm -f example.duckdb
+
+2>&1 echo "importing into example.duckdb:"
 ./cli.js --ignore-unsupported --require-dependencies --trips-without-shape-id --silent \
-    node_modules/sample-gtfs-feed/gtfs/*.txt \
-    | sponge | psql -b
+    example.duckdb \
+    node_modules/sample-gtfs-feed/gtfs/*.txt
 
 2>&1 echo "\nfetching a connection during DST switch:"
-psql -c "$(cat <<- EOM
+duckdb -csv -c "$(cat <<- EOM
     SELECT
         trip_id, route_id,
         from_stop_id, t_departure,
-        stop_sequence,
+        from_stop_sequence,
         to_stop_id, t_arrival
-    FROM connections
+    FROM "main.connections"
     WHERE trip_id = 'during-dst-1'
-    AND t_departure > '2019-03-31T01:55+01' AND t_departure < '2019-03-31T03:00+02'
+    AND t_departure > '2019-03-31T01:55:00+01:00' AND t_departure < '2019-03-31T03:00:00+02:00'
     -- AND route_id = 'D'
    -- AND from_stop_id = 'airport'
-EOM)"
+EOM)" example.duckdb
 
 2>&1 echo "\nfetching the departure at the same time:"
-psql -c "$(cat <<- EOM
+duckdb -csv -c "$(cat <<- EOM
     SELECT
         trip_id, route_id,
         stop_id, t_departure,
         stop_sequence
-    FROM arrivals_departures
+    FROM "main.arrivals_departures"
     WHERE trip_id = 'during-dst-1'
-    AND t_departure > '2019-03-31T01:55+01' AND t_departure < '2019-03-31T03:00+02'
+    AND t_departure > '2019-03-31T01:55:00+01:00' AND t_departure < '2019-03-31T03:00:00+02:00'
     -- AND route_id = 'D'
     -- AND stop_id = 'airport'
-EOM)"
+EOM)" example.duckdb
