Description
As discussed, I created a repo which shows how one could implement a pipeline which uses the DuckDB R-API to create a tibble from the raw json files.
The repo is called OpenAlex_json. It contains some functions (two modified ones from openalexR) plus two more that read the data from the raw JSON files into a tibble, or convert it into a parquet dataset partitioned by publication_year. I have included some basic timing info comparing the two approaches: for 17,000 records, the DuckDB route is about 20 seconds faster than oa_fetch(output = "tibble").
See here for the report.
The main advantage, apart from much lower memory needs and faster timing, is that it simply uses the structure returned from OpenAlex, so all upstream changes are reflected immediately without any further maintenance. Convenience functions could be added to create backward-compatible output, specific formats, etc., using SQL in DuckDB or possibly even dplyr pipelines.
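To make the idea concrete, here is a minimal sketch of the DuckDB approach (the file paths and selected columns are illustrative assumptions, not the actual code in the OpenAlex_json repo):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# Read the raw OpenAlex JSON files directly. DuckDB infers the schema
# from the files themselves, so upstream schema changes need no maintenance.
# 'works/*.json' is a placeholder path.
works <- dbGetQuery(con, "
  SELECT id, title, publication_year
  FROM read_json_auto('works/*.json')
")

# Alternatively, write a parquet dataset partitioned by publication_year:
dbExecute(con, "
  COPY (SELECT * FROM read_json_auto('works/*.json'))
  TO 'works_parquet' (FORMAT PARQUET, PARTITION_BY (publication_year))
")

dbDisconnect(con, shutdown = TRUE)
```

The result of the query arrives as an ordinary data frame, so converting it to a tibble (or post-processing it with dplyr) is straightforward; the parquet dataset can likewise be opened with arrow::open_dataset() for lazy, partition-aware queries.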