[EPIC] ClickBench Improvements (Vanity Benchmark) #14586

Open
alamb opened this issue Feb 10, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Feb 10, 2025

Is your feature request related to a problem or challenge?

The ClickBench benchmark measures the performance of filtering and aggregation.

ClickBench is somewhat of a vanity benchmark: in my opinion, all the engines within a factor of 2 of each other likely have similar user experiences (and the exact speed will depend on real user queries, etc.).

That being said, being the engine at the top of the benchmark is certainly good for publicity, and DataFusion has used it as such (see our blog post, "Apache DataFusion is now the fastest single node engine for querying Apache Parquet files").

So this ticket tracks improving the ClickBench performance even more.

Recently, as @Dandandan has pointed out on #14246 (comment), DuckDB slipped past us in the most recent results

Image

Describe the solution you'd like

Get DataFusion back on top

Describe alternatives you've considered

While we could clearly implement ClickBench-specific optimizations, I don't think that is really a valuable exercise for users. I would very much like to focus our efforts on actually useful optimizations.

Some ideas of real improvements:

What I would like is for people to profile queries and try to find ways to improve them.
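As a starting point for finding which queries deserve a closer look with a profiler, here is a minimal timing sketch. It assumes a hypothetical `run_query` callable that executes a SQL string against whatever engine is under test and fully materializes the result; the helper names `time_query` and `slowest_first` are illustrative only and are not part of DataFusion or ClickBench.

```python
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[str], object], sql: str, runs: int = 3) -> List[float]:
    """Run `sql` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(sql)  # hypothetical: executes the query and materializes the result
        timings.append(time.perf_counter() - start)
    return timings


def slowest_first(results: Dict[str, List[float]]) -> List[str]:
    """Order query names by their best (minimum) time, slowest first."""
    return sorted(results, key=lambda name: min(results[name]), reverse=True)
```

Feeding the per-query timings into `slowest_first` gives a rough list of which queries to profile first. Wall-clock timing only says where time is spent per query, not why, so a real profiler is still needed after that.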

Additional context

See related discussions on

@alamb
Contributor Author

alamb commented Feb 10, 2025

I took a brief look at some results

Image

Q24 and Q26

I think this is Q24:

SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime"), "SearchPhrase" LIMIT 10;

Both have `ORDER BY to_timestamp_seconds("EventTime")` as part of the query.

@alamb
Contributor Author

alamb commented Feb 10, 2025

Here is Q24:

SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Here is Q26:

SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY to_timestamp_seconds("EventTime"), "SearchPhrase" LIMIT 10;

Both have `ORDER BY to_timestamp_seconds("EventTime")` as part of the query.

@Rachelint
Contributor

Rachelint commented Feb 10, 2025

A low-hanging fruit is #13617; I plan to finish it this week.

And maybe it is time to push #11943 forward...

I am trying a PoC that supports the block approach by modifying only the GroupValues code (in #11943 we also need to modify the GroupAccumulator code).

It would be really painful if we needed to implement the block approach for all existing and newly added GroupAccumulators...

@alamb
Contributor Author

alamb commented Feb 10, 2025

> I am trying a PoC that supports the block approach by modifying only the GroupValues code (in #11943 we also need to modify the GroupAccumulator code).

If the performance gains are worth it, I can potentially help organize a larger refactoring effort too (to incrementally port over the code). We are in much better shape test-wise now. If you have a good approach, I'll find time to help coordinate.

@Rachelint
Contributor

Rachelint commented Feb 10, 2025

> > I am trying a PoC that supports the block approach by modifying only the GroupValues code (in #11943 we also need to modify the GroupAccumulator code).
>
> If the performance gains are worth it, I can potentially help organize a larger refactoring effort too (to incrementally port over the code). We are in much better shape test-wise now. If you have a good approach, I'll find time to help coordinate.

I will find a query to measure the performance of the old implementation in #11943 and of the new implementation.
I guess they will have similar performance, but I am not sure yet.

And if the approach of supporting this only through GroupValues works, it may be easy to introduce another optimization, #12526.

@Rachelint
Contributor

Rachelint commented Feb 10, 2025

On the optimizer side, I am not sure whether single_distinct_to_groupby really improves performance in the current version (it is an old rule introduced long ago). Maybe we can check it again?
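For readers unfamiliar with the rule, the sketch below illustrates the kind of rewrite single_distinct_to_groupby performs, as I understand it: a single `DISTINCT` aggregate is replaced by an inner `GROUP BY` on the distinct column plus an outer non-distinct aggregate. SQLite is used here only as a convenient stand-in to show that the two query shapes return the same rows; the table `t`, its data, and the aliases are made up for illustration, and this says nothing about which form is faster in DataFusion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 10, 1), (1, 10, 2), (1, 20, 3), (2, 30, 4), (2, 30, 5)],
)

# Original shape: one DISTINCT aggregate next to a regular aggregate.
before = """
    SELECT a, COUNT(DISTINCT b), SUM(c)
    FROM t GROUP BY a ORDER BY a
"""

# Rewritten shape (assumed to mirror the rule): group by (a, b) first so each
# distinct b appears once per a, then aggregate the pre-grouped rows by a.
after = """
    SELECT a, COUNT(alias1), SUM(alias2)
    FROM (SELECT a, b AS alias1, SUM(c) AS alias2 FROM t GROUP BY a, b)
    GROUP BY a ORDER BY a
"""

rows_before = conn.execute(before).fetchall()
rows_after = conn.execute(after).fetchall()
print(rows_before == rows_after)
```

Whether the rewritten shape is actually a win depends on how the engine executes each form, which is exactly the question worth re-measuring.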
