Improve the hash join performance by replacing the RawTable with a simple Vec for JoinHashMap #6913

Closed
wants to merge 3 commits into from

yahoNanJing
Contributor

Which issue does this PR close?

Closes #6910.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the core (Core DataFusion crate) label on Jul 11, 2023
let bucket_index = self.bucket_mask & (hash_value as usize);
// We are sure the `bucket_index` is legal
unsafe {
let index_ref = self.buckets.get_unchecked_mut(bucket_index);

I think we can limit the unsafe to:

Suggested change
let index_ref = self.buckets.get_unchecked_mut(bucket_index);
let index_ref = unsafe { self.buckets.get_unchecked_mut(bucket_index) };
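
For illustration, here is a minimal sketch of the narrowed `unsafe` scope. It assumes the bucket vector's length is a power of two and that `bucket_mask` equals that length minus one. Only `buckets`, `bucket_mask`, and `hash_value` appear in the diff; the `Buckets` type, its constructor, and `bucket_mut` are hypothetical names used for the example.

```rust
/// Hypothetical stand-in for the map in this PR; not the actual JoinHashMap.
struct Buckets {
    buckets: Vec<u64>,
    bucket_mask: usize,
}

impl Buckets {
    /// Rounds the capacity up to a power of two so that `len - 1` is a valid mask.
    fn with_capacity(capacity: usize) -> Self {
        let len = capacity.max(1).next_power_of_two();
        Self {
            buckets: vec![0; len],
            bucket_mask: len - 1,
        }
    }

    /// Returns the bucket slot for a hash value.
    fn bucket_mut(&mut self, hash_value: u64) -> &mut u64 {
        // The masking itself is safe code and stays outside the unsafe block.
        let bucket_index = self.bucket_mask & (hash_value as usize);
        // SAFETY: `bucket_mask == buckets.len() - 1` and the length is a power
        // of two, so `bucket_index < buckets.len()` always holds.
        unsafe { self.buckets.get_unchecked_mut(bucket_index) }
    }
}
```

The expression inside the `unsafe` block has no trailing semicolon, so the block evaluates to the mutable reference rather than to `()`.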

}

impl fmt::Debug for JoinHashMap {

Is this needed?

@Dandandan
Contributor

Thanks @yahoNanJing. These are my results on this PR (./bench.sh run tpch_mem):

Comparing main and issue-6910
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ issue-6910 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 188.49ms │   187.15ms │    no change │
│ QQuery 2     │  59.09ms │    68.42ms │ 1.16x slower │
│ QQuery 3     │  50.46ms │    68.99ms │ 1.37x slower │
│ QQuery 4     │  37.68ms │    51.33ms │ 1.36x slower │
│ QQuery 5     │  93.91ms │   127.29ms │ 1.36x slower │
│ QQuery 6     │  10.32ms │    10.48ms │    no change │
│ QQuery 7     │ 195.52ms │   281.34ms │ 1.44x slower │
│ QQuery 8     │  71.23ms │    93.99ms │ 1.32x slower │
│ QQuery 9     │ 139.60ms │   181.29ms │ 1.30x slower │
│ QQuery 10    │  97.05ms │   105.19ms │ 1.08x slower │
│ QQuery 11    │  48.91ms │    49.58ms │    no change │
│ QQuery 12    │  67.75ms │    79.33ms │ 1.17x slower │
│ QQuery 13    │ 178.68ms │   188.84ms │ 1.06x slower │
│ QQuery 14    │  11.94ms │    13.19ms │ 1.10x slower │
│ QQuery 15    │  22.51ms │    23.93ms │ 1.06x slower │
│ QQuery 16    │  50.32ms │    50.59ms │    no change │
│ QQuery 17    │ 664.73ms │   657.27ms │    no change │
│ QQuery 18    │ 516.12ms │   590.70ms │ 1.14x slower │
│ QQuery 19    │  57.57ms │    58.17ms │    no change │
│ QQuery 20    │ 203.75ms │   196.01ms │    no change │
│ QQuery 21    │ 259.05ms │   356.25ms │ 1.38x slower │
│ QQuery 22    │  28.06ms │    34.84ms │ 1.24x slower │
└──────────────┴──────────┴────────────┴──────────────┘
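
As a reading aid, the "Change" column can be reproduced from the two timing columns. The sketch below assumes the label is the ratio of the branch time to the main time, with differences within roughly 5% reported as "no change". That threshold is inferred from the rows above, not taken from the benchmark tool's source, and the wording of the "faster" label is likewise an assumption since no such row appears in the table.

```rust
/// Assumed noise threshold (5%), inferred from the table rows.
const NOISE_THRESHOLD: f64 = 0.05;

/// Builds a label like the ones in the "Change" column from two timings in ms.
fn change_label(baseline_ms: f64, candidate_ms: f64) -> String {
    let ratio = candidate_ms / baseline_ms;
    if (ratio - 1.0).abs() <= NOISE_THRESHOLD {
        "no change".to_string()
    } else if ratio > 1.0 {
        format!("{:.2}x slower", ratio)
    } else {
        // Hypothetical wording: the tables in this thread contain no "faster" rows.
        format!("{:.2}x faster", 1.0 / ratio)
    }
}
```

For example, `change_label(59.09, 68.42)` corresponds to the Query 2 row and `change_label(188.49, 187.15)` to the Query 1 row.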

@Dandandan
Contributor

Dandandan commented Jul 11, 2023

FYI, I have a branch where I got a bit better results with a similar approach (but still mixed): https://github.com/apache/arrow-datafusion/tree/bucketing, you can see the diff here f03f7f4

@yahoNanJing
Contributor Author

yahoNanJing commented Jul 11, 2023

Hi @Dandandan, could you rerun the benchmark on your PC with the latest code of this PR? Btw, on my PC I'm testing on TPCH 1G with a single partition.

And could you share your CPU cache sizes? Mine are:
[Screenshot 2023-07-11 at 18 34 56]

@Dandandan
Contributor

Dandandan commented Jul 11, 2023

Sure @yahoNanJing

  • Apple M1 Pro: [image]
  • You can run the benchmark in benchmarks/bench.sh. You can generate the data using ./bench.sh data and then run ./bench.sh run tpch_mem. Comparison against main can be done by running the benchmark on the main branch once as well and then running ./bench.sh compare main my-branch

@Dandandan
Contributor

Dandandan commented Jul 11, 2023

Last commit d639244:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ issue-6910-2 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 188.49ms │     186.08ms │    no change │
│ QQuery 2     │  59.09ms │      67.80ms │ 1.15x slower │
│ QQuery 3     │  50.46ms │      68.36ms │ 1.35x slower │
│ QQuery 4     │  37.68ms │      42.58ms │ 1.13x slower │
│ QQuery 5     │  93.91ms │     134.90ms │ 1.44x slower │
│ QQuery 6     │  10.32ms │      10.63ms │    no change │
│ QQuery 7     │ 195.52ms │     263.67ms │ 1.35x slower │
│ QQuery 8     │  71.23ms │      94.81ms │ 1.33x slower │
│ QQuery 9     │ 139.60ms │     180.05ms │ 1.29x slower │
│ QQuery 10    │  97.05ms │     103.32ms │ 1.06x slower │
│ QQuery 11    │  48.91ms │      50.65ms │    no change │
│ QQuery 12    │  67.75ms │      73.31ms │ 1.08x slower │
│ QQuery 13    │ 178.68ms │     191.45ms │ 1.07x slower │
│ QQuery 14    │  11.94ms │      13.21ms │ 1.11x slower │
│ QQuery 15    │  22.51ms │      24.32ms │ 1.08x slower │
│ QQuery 16    │  50.32ms │      50.80ms │    no change │
│ QQuery 17    │ 664.73ms │     669.85ms │    no change │
│ QQuery 18    │ 516.12ms │     587.09ms │ 1.14x slower │
│ QQuery 19    │  57.57ms │      57.99ms │    no change │
│ QQuery 20    │ 203.75ms │     197.04ms │    no change │
│ QQuery 21    │ 259.05ms │     331.78ms │ 1.28x slower │
│ QQuery 22    │  28.06ms │      34.91ms │ 1.24x slower │
└──────────────┴──────────┴──────────────┴──────────────┘

@Dandandan
Contributor

FYI @sunchao who I believe is working on something similar.

@yahoNanJing
Contributor Author

you can run the benchmark in benchmarks/bench.sh. You can generate the data using ./bench.sh data and then run ./bench.sh run tpch_mem. Comparison against main can be done by running the benchmark on the main branch once as well and then running ./bench.sh compare main my-branch

Thanks for providing this useful benchmark tool. I'll try it later.

@yahoNanJing
Contributor Author

When using the bench tool on my PC with only one partition, the result is as follows:
[Screenshot 2023-07-12 at 04 25 29]

How RawTable compares with the Vec-plus-mask approach may depend on the size of the data being processed.

@Dandandan
Contributor

Dandandan commented Jul 11, 2023

My explanation so far:

The building phase is faster this way (simpler data structure) at the expense of the probing phase (due to more collisions and thus more candidate rows to check).
Some queries have more data on the probe side than others, which explains why some queries show improvements while others run slower.

I think we might first have to see if we can further improve the collision checking (currently eq + filter + take) in order for this approach to become faster in general.
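
To make the trade-off above concrete, here is a toy sketch of a masked-bucket map with chained rows. It is not the code in this PR: `VecJoinMap`, `toy_hash`, and `probe` are hypothetical names, the identity hash is chosen only to make collisions predictable, and the plain key comparison stands in for the eq + filter + take collision check.

```rust
/// Toy map: `buckets[b]` holds the last build row in bucket `b` plus one
/// (0 means empty); `next[row]` holds the previous row in the same bucket plus one.
struct VecJoinMap {
    buckets: Vec<u64>,
    next: Vec<u64>,
    bucket_mask: usize,
}

/// Hypothetical hash: identity, so bucket assignments are easy to follow.
fn toy_hash(key: i64) -> u64 {
    key as u64
}

impl VecJoinMap {
    /// Build phase: one mask and two writes per row, with no key comparison.
    fn build(build_keys: &[i64]) -> Self {
        let len = build_keys.len().max(1).next_power_of_two();
        let bucket_mask = len - 1;
        let mut buckets = vec![0u64; len];
        let mut next = vec![0u64; build_keys.len()];
        for (row, &key) in build_keys.iter().enumerate() {
            let b = bucket_mask & (toy_hash(key) as usize);
            next[row] = buckets[b];
            buckets[b] = row as u64 + 1;
        }
        Self { buckets, next, bucket_mask }
    }

    /// Walks the chain for a hash and returns every row that shares its bucket.
    fn candidates(&self, hash: u64) -> Vec<usize> {
        let mut rows = Vec::new();
        let mut cur = self.buckets[self.bucket_mask & (hash as usize)];
        while cur != 0 {
            let row = (cur - 1) as usize;
            rows.push(row);
            cur = self.next[row];
        }
        rows
    }
}

/// Probe phase: every candidate must be checked against the actual key.
/// Returns (candidate rows, rows whose key really matches).
fn probe(map: &VecJoinMap, build_keys: &[i64], probe_key: i64) -> (Vec<usize>, Vec<usize>) {
    let candidates = map.candidates(toy_hash(probe_key));
    let matches: Vec<usize> = candidates
        .iter()
        .copied()
        .filter(|&row| build_keys[row] == probe_key)
        .collect();
    (candidates, matches)
}
```

With build keys `[1, 9, 17, 2]` and four buckets, keys 1, 9, and 17 all land in the same bucket, so probing for 9 visits three candidate rows to find one match. That extra work on the probe side is the cost described above.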

@yahoNanJing
Contributor Author

Btw, I propose setting the number of partitions to 1 for benchmarking, because:

  • It helps focus on the code logic rather than on task scheduling or the thread model
  • It makes the comparison with other single-threaded engines fairer

@yahoNanJing
Contributor Author

The building phase is a bit faster at the expense

Intuitively, I think the Vec is much simpler than the RawTable and the mask operation should be very efficient. And I don't know why the result seems not so good 😭

@Dandandan
Contributor

The building phase is a bit faster at the expense

Intuitively, I think the Vec is much simpler than the RawTable and the mask operation should be very efficient. And I don't know why the result seems not so good 😭

Sorry, hit enter too soon, I updated my comment with my full thoughts on this.

@yahoNanJing
Contributor Author

I think we might first have to see if we can further improve the collision checking (currently eq + filter + take) in order for this approach to become faster in general.

You are right. I will continue to investigate it.

yahoNanJing marked this pull request as draft on July 12, 2023 02:33
@yahoNanJing
Contributor Author

The main blocking issue for q17 is the join order and the choice of which input becomes the build side.

I think we can close this issue.
