Description
As discussed, I created a repo which shows how one could implement a pipeline which uses the DuckDB R-API to create a tibble from the raw json files.
The repo is called OpenAlex_json. It contains some functions (two modified ones from openalexR) plus two more that read the data from the raw JSON files into a tibble, or convert it into a parquet dataset partitioned by publication_year. I have included some basic timing info comparing the two approaches: for 17,000 records, the DuckDB route is about 20 seconds faster than oa_fetch(output = "tibble").
See here for the report.
The main advantage, apart from much lower memory needs and faster timing, is that it simply uses the structure returned from OpenAlex, so all upstream changes are reflected immediately without any further maintenance. Convenience functions could be added to create backward-compatible output, specific formats, etc., using SQL in DuckDB or possibly even dplyr pipelines.
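To make the idea concrete, here is a minimal sketch of the DuckDB approach (the file paths and selected columns are illustrative assumptions, not the actual code in the OpenAlex_json repo):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# Read the raw OpenAlex JSON files directly. DuckDB infers the schema
# from the files themselves, so upstream schema changes need no maintenance.
# 'works/*.json' is a placeholder path.
works <- dbGetQuery(con, "
  SELECT id, title, publication_year
  FROM read_json_auto('works/*.json')
")

# Alternatively, write a parquet dataset partitioned by publication_year:
dbExecute(con, "
  COPY (SELECT * FROM read_json_auto('works/*.json'))
  TO 'works_parquet' (FORMAT PARQUET, PARTITION_BY (publication_year))
")

dbDisconnect(con, shutdown = TRUE)
```

The result of the query arrives as an ordinary data frame, so converting it to a tibble (or post-processing it with dplyr) is straightforward; the parquet dataset can likewise be opened with arrow::open_dataset() for lazy, partition-aware queries.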