Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it easier to run TPCH queries with datafusion-cli #14608

Open
alamb opened this issue Feb 11, 2025 · 2 comments
Open

Make it easier to run TPCH queries with datafusion-cli #14608

alamb opened this issue Feb 11, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 11, 2025

Is your feature request related to a problem or challenge?

TPC-H is an important and well studied benchmark. It is used for testing many database optimizations and is well known and widely studied. TPCH This is especially important for academic research projects and classes. For example see the CMU optimizer class (see #14373)

Generating this dataset today is quite a pain as it requires compiling and running a very old c program dbgen.c (see XXX)

DuckDB has a TPCH extension that makes it very easy to create the dataset and queries: https://duckdb.org/docs/extensions/tpch.html

I also think this is another reason why DuckDB is so popular with Academic research as it lowers the barrier to getting this dataset

Today, to generate this dataset, the DataFusion bench.sh program

bench.sh runs tpdh dbgen from this container: https://github.com/scalytics/TPCH-Docker/pkgs/container/tpch-docker

And the eventual code is here: https://github.com/scalytics/TPCH-Docker/tree/main/data/tpch/2.18.0_rc2/dbgen

This setup is non ideal for several reasons:

  1. It requires docker, and takes quite a while to run
  2. It makes CSV files which are almost never what is used in practice (people use parquet, etc)
  3. It isn't part of datafusion-cli

Describe the solution you'd like

I would like it to be very easy to create data and run TPCH queries in datafusion-cli

Describe alternatives you've considered

Idea 1: Instructions + Precalculate the Data

The idea here would be to precalculate the data and find somewhere to host it (I am sure the ASF has places to host files, but we would need to research what the limits are, etc).

Here is an example repo: https://github.com/aleaugustoplus/tpch-data (maybe we can do the same or even use that one)

Then we would provide instructions / a script on how to download and use the files with datafusion-cli

Ideally we would provide the data in parquet format

Idea 2: Integrate the dbgen function into datafusion-cli (like DuckDB)

The idea here would be to implement some/all of the syntax from DuckDB: https://duckdb.org/docs/extensions/tpch.html

That would mean a command like this

CALL dbgen(sf = 1);

That would create the 8 TPCH tables, along with command to show the queries and answers

The trick here is that dbgen is some ancient c program and there is no Rust version I could find. It is critical that that data is exactly the same.

Transliterating dbgen.c from C to Rust might be a fun project (and maybe someone could figure out how to make it parallel while they are at it)

Additional context

No response

@lmwnshn
Copy link

lmwnshn commented Feb 11, 2025

If you prefer Java to C, CMU-DB's BenchBase project does implement support for generating and loading TPC-H data in parallel: https://github.com/cmu-db/benchbase/tree/main/src/main/java/com/oltpbenchmark/benchmarks/tpch

Another alternative that I explored is using DuckDB to generate the data, exporting that as Parquet, and then ingesting it into DataFusion (schema may require fixing):

./duckdb data/tpch.db -c "INSTALL tpch; LOAD tpch; CALL dbgen(sf = 1); EXPORT DATABASE './data/' (FORMAT PARQUET);"

But personally I think native integration makes for the best user experience.

@alamb
Copy link
Contributor Author

alamb commented Feb 11, 2025

Thanks @lmwnshn -- the Java implementation might be easier to transliterate to Rust...

Also BTW I am pretty sure other rust data projects would be interested in a Rust implementation of the TPCH data generator, so I bet we could get some others to help

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants