Using duckDB via json to create tibble from OpenAlex return values #275

@rkrug

As discussed, I created a repo that shows how one could implement a pipeline which uses the DuckDB R API to create a tibble from the raw JSON files.

The repo is called OpenAlex_json. It contains some functions (two modified ones from openalexR) and two more functions that either read the data in the raw JSON files as a tibble or convert it into a parquet dataset partitioned by publication_year. I have included some basic timing info comparing the two approaches: for 17,000 records, the one via DuckDB is about 20 seconds faster than oa_fetch(output = "tibble").
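To make the two routes concrete, here is a minimal sketch of how they might look with DBI and duckdb. It is not the code from the repo: the toy records, the file name page_1.json, and the directory names are assumptions for illustration, and it assumes each raw file holds newline-delimited work records (real OpenAlex responses may need unnesting first).

```r
library(DBI)
library(duckdb)
library(tibble)

# Toy stand-in for raw OpenAlex output (hypothetical records, illustration only)
base_dir <- normalizePath(tempdir(), winslash = "/")
json_dir <- file.path(base_dir, "oa_json")
dir.create(json_dir, showWarnings = FALSE)
writeLines(
  c(
    '{"id": "W1", "title": "First work", "publication_year": 2020}',
    '{"id": "W2", "title": "Second work", "publication_year": 2021}',
    '{"id": "W3", "title": "Third work", "publication_year": 2021}'
  ),
  file.path(json_dir, "page_1.json")
)
json_glob <- file.path(json_dir, "*.json")

con <- dbConnect(duckdb::duckdb())  # in-memory DuckDB database

# Route 1: read the JSON files as a tibble; the schema is inferred from the files
works <- as_tibble(dbGetQuery(
  con,
  sprintf("SELECT * FROM read_json_auto('%s')", json_glob)
))

# Route 2: write a parquet dataset partitioned by publication_year
parquet_dir <- file.path(base_dir, "oa_parquet")
unlink(parquet_dir, recursive = TRUE)  # start from a clean target directory
dbExecute(con, sprintf(
  "COPY (SELECT * FROM read_json_auto('%s')) TO '%s' (FORMAT PARQUET, PARTITION_BY (publication_year))",
  json_glob, parquet_dir
))

dbDisconnect(con, shutdown = TRUE)
```

In this sketch the column set comes entirely from what is in the files, so a field added upstream would show up without any change to the code.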

See here for the report.

The main advantage, apart from much lower memory needs and faster timing, is that it simply uses the structure returned by OpenAlex, so all changes are reflected immediately without any further maintenance. Convenience functions could be added to create backward-compatible output, specific formats, etc., which can be done using SQL in DuckDB or possibly even dplyr pipelines.
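As a rough illustration of such a convenience function, the sketch below uses plain SQL to return a flat subset of columns under chosen names. The helper name oa_works_flat, the output column names, and the toy records are all hypothetical; neither the repo nor openalexR is known to provide this function.

```r
library(DBI)
library(duckdb)

# Hypothetical helper: select and rename a few columns with SQL in DuckDB.
# The column names on both sides are assumptions for illustration.
oa_works_flat <- function(con, json_path) {
  query <- sprintf(
    "SELECT id, title AS display_title, publication_year AS year
     FROM read_json_auto('%s')
     ORDER BY year",
    json_path
  )
  tibble::as_tibble(dbGetQuery(con, query))
}

# Toy input file (hypothetical records, newline-delimited JSON)
json_file <- tempfile(fileext = ".json")
writeLines(
  c(
    '{"id": "W1", "title": "First work", "publication_year": 2021}',
    '{"id": "W2", "title": "Second work", "publication_year": 2020}'
  ),
  json_file
)

con <- dbConnect(duckdb::duckdb())
flat <- oa_works_flat(con, normalizePath(json_file, winslash = "/"))
dbDisconnect(con, shutdown = TRUE)
```

Keeping the reshaping in a thin layer like this would leave the raw JSON as the single source of truth, with each output format being just another query on top of it.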
