Improve the hash join performance by replacing the RawTable to a simple Vec for JoinHashMap #6913

yahoNanJing · 2023-07-11T09:48:52Z

Which issue does this PR close?

Closes #6910.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

…le Vec for JoinHashMap

Dandandan · 2023-07-11T09:58:26Z

datafusion/core/src/physical_plan/joins/hash_join_utils.rs

+        let bucket_index = self.bucket_mask & (hash_value as usize);
+        // We are sure the `bucket_index` is legal
+        unsafe {
+            let index_ref = self.buckets.get_unchecked_mut(bucket_index);


I think we can limit the unsafe to:

Suggested change

-            let index_ref = self.buckets.get_unchecked_mut(bucket_index);
+            let index_ref = unsafe { self.buckets.get_unchecked_mut(bucket_index) };
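A minimal sketch of the narrower `unsafe` scope, assuming the bucket count is a power of two so that masking always yields an in-bounds index. The `Buckets` type and its `new` and `slot_mut` methods are hypothetical names for illustration, not the PR's actual code. Note that the `unsafe` block ends without a trailing semicolon, so the reference is the value of the block:

```rust
/// Hypothetical stand-in for the bucket array inside `JoinHashMap`.
struct Buckets {
    buckets: Vec<u64>,
    bucket_mask: usize,
}

impl Buckets {
    /// Rounds `capacity` up to a power of two so that `hash & mask < len` always holds.
    fn new(capacity: usize) -> Self {
        let len = capacity.max(1).next_power_of_two();
        Self {
            buckets: vec![0; len],
            bucket_mask: len - 1,
        }
    }

    /// Returns the slot for `hash_value`; only the unchecked access itself is `unsafe`.
    fn slot_mut(&mut self, hash_value: u64) -> &mut u64 {
        let bucket_index = self.bucket_mask & (hash_value as usize);
        // SAFETY: `bucket_mask == buckets.len() - 1` and the length is a power of two,
        // so `bucket_index` is always smaller than `buckets.len()`.
        unsafe { self.buckets.get_unchecked_mut(bucket_index) }
    }
}
```

Keeping the block this small makes the safety argument local: the only invariant a reader has to check is the relationship between `bucket_mask` and `buckets.len()`.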

Dandandan · 2023-07-11T10:02:30Z

datafusion/core/src/physical_plan/joins/hash_join_utils.rs

 }

+impl fmt::Debug for JoinHashMap {


Is this needed?

Dandandan · 2023-07-11T10:04:35Z

Thanks @yahoNanJing. These are my results on this PR (./bench.sh run tpch_mem):

Comparing main and issue-6910
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ issue-6910 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 188.49ms │   187.15ms │    no change │
│ QQuery 2     │  59.09ms │    68.42ms │ 1.16x slower │
│ QQuery 3     │  50.46ms │    68.99ms │ 1.37x slower │
│ QQuery 4     │  37.68ms │    51.33ms │ 1.36x slower │
│ QQuery 5     │  93.91ms │   127.29ms │ 1.36x slower │
│ QQuery 6     │  10.32ms │    10.48ms │    no change │
│ QQuery 7     │ 195.52ms │   281.34ms │ 1.44x slower │
│ QQuery 8     │  71.23ms │    93.99ms │ 1.32x slower │
│ QQuery 9     │ 139.60ms │   181.29ms │ 1.30x slower │
│ QQuery 10    │  97.05ms │   105.19ms │ 1.08x slower │
│ QQuery 11    │  48.91ms │    49.58ms │    no change │
│ QQuery 12    │  67.75ms │    79.33ms │ 1.17x slower │
│ QQuery 13    │ 178.68ms │   188.84ms │ 1.06x slower │
│ QQuery 14    │  11.94ms │    13.19ms │ 1.10x slower │
│ QQuery 15    │  22.51ms │    23.93ms │ 1.06x slower │
│ QQuery 16    │  50.32ms │    50.59ms │    no change │
│ QQuery 17    │ 664.73ms │   657.27ms │    no change │
│ QQuery 18    │ 516.12ms │   590.70ms │ 1.14x slower │
│ QQuery 19    │  57.57ms │    58.17ms │    no change │
│ QQuery 20    │ 203.75ms │   196.01ms │    no change │
│ QQuery 21    │ 259.05ms │   356.25ms │ 1.38x slower │
│ QQuery 22    │  28.06ms │    34.84ms │ 1.24x slower │
└──────────────┴──────────┴────────────┴──────────────┘

Dandandan · 2023-07-11T10:06:28Z

FYI, I have a branch where I got slightly better (but still mixed) results with a similar approach: https://github.com/apache/arrow-datafusion/tree/bucketing. You can see the diff in f03f7f4.

yahoNanJing · 2023-07-11T10:35:18Z

Hi @Dandandan, could you rerun the benchmark on your machine with the latest code of this PR? By the way, on my PC I'm testing on TPCH 1G with a single partition.

Could you also share your CPU cache sizes? Mine is

Dandandan · 2023-07-11T10:46:59Z

Sure @yahoNanJing

Apple M1 Pro:

You can run the benchmark with benchmarks/bench.sh. Generate the data using ./bench.sh data, then run ./bench.sh run tpch_mem. To compare against main, run the benchmark once on the main branch as well, then run ./bench.sh compare main my-branch.

Dandandan · 2023-07-11T11:02:38Z

Last commit d639244:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃     main ┃ issue-6910-2 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 188.49ms │     186.08ms │    no change │
│ QQuery 2     │  59.09ms │      67.80ms │ 1.15x slower │
│ QQuery 3     │  50.46ms │      68.36ms │ 1.35x slower │
│ QQuery 4     │  37.68ms │      42.58ms │ 1.13x slower │
│ QQuery 5     │  93.91ms │     134.90ms │ 1.44x slower │
│ QQuery 6     │  10.32ms │      10.63ms │    no change │
│ QQuery 7     │ 195.52ms │     263.67ms │ 1.35x slower │
│ QQuery 8     │  71.23ms │      94.81ms │ 1.33x slower │
│ QQuery 9     │ 139.60ms │     180.05ms │ 1.29x slower │
│ QQuery 10    │  97.05ms │     103.32ms │ 1.06x slower │
│ QQuery 11    │  48.91ms │      50.65ms │    no change │
│ QQuery 12    │  67.75ms │      73.31ms │ 1.08x slower │
│ QQuery 13    │ 178.68ms │     191.45ms │ 1.07x slower │
│ QQuery 14    │  11.94ms │      13.21ms │ 1.11x slower │
│ QQuery 15    │  22.51ms │      24.32ms │ 1.08x slower │
│ QQuery 16    │  50.32ms │      50.80ms │    no change │
│ QQuery 17    │ 664.73ms │     669.85ms │    no change │
│ QQuery 18    │ 516.12ms │     587.09ms │ 1.14x slower │
│ QQuery 19    │  57.57ms │      57.99ms │    no change │
│ QQuery 20    │ 203.75ms │     197.04ms │    no change │
│ QQuery 21    │ 259.05ms │     331.78ms │ 1.28x slower │
│ QQuery 22    │  28.06ms │      34.91ms │ 1.24x slower │
└──────────────┴──────────┴──────────────┴──────────────┘

Dandandan · 2023-07-11T11:26:40Z

FYI @sunchao who I believe is working on something similar.

yahoNanJing · 2023-07-11T11:27:40Z

> You can run the benchmark with benchmarks/bench.sh. Generate the data using ./bench.sh data, then run ./bench.sh run tpch_mem. To compare against main, run the benchmark once on the main branch as well, then run ./bench.sh compare main my-branch.

Thanks for providing this useful benchmark tool. I'll try it later.

yahoNanJing · 2023-07-11T20:28:50Z

When using the bench tool on my PC with only one partition, the result is as follows:

How RawTable compares with the Vec-plus-mask approach may depend on the size of the data being processed.

Dandandan · 2023-07-11T20:33:00Z

My explanation so far:

The building phase is faster this way (simple data structure) at the expense of the probing phase (due to more collisions, and thus more rows to check for potential collisions).
Some queries have more data on the probe side than others, which explains why some queries show improvements while others run slower.

I think we might first have to see if we can further improve the collision checking (currently eq + filter + take) in order for this approach to become faster in general.
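To make the trade-off concrete, here is a minimal sketch of a Vec-backed map with a bucket mask. It is not the PR's implementation: `VecJoinMap`, `build`, `candidates`, and `probe` are hypothetical names, keys are plain `i64` values, and a simple equality comparison stands in for the `eq + filter + take` step. Building is one pass of array writes, but masking maps distinct hashes into the same bucket, so probing has to verify every candidate row against the actual key:

```rust
/// Hypothetical Vec-backed join map: one slot per bucket plus a per-row chain.
/// Stored row indices are offset by one so that 0 can mean "empty".
struct VecJoinMap {
    bucket_mask: usize,
    buckets: Vec<u64>, // bucket -> (most recently inserted row + 1), 0 if empty
    next: Vec<u64>,    // row -> (previous row in the same bucket + 1), 0 at chain end
}

impl VecJoinMap {
    /// Build phase: a single pass with two array writes per row.
    fn build(hashes: &[u64]) -> Self {
        let len = hashes.len().max(1).next_power_of_two();
        let mut map = Self {
            bucket_mask: len - 1,
            buckets: vec![0; len],
            next: vec![0; hashes.len()],
        };
        for (row, &hash) in hashes.iter().enumerate() {
            let bucket = map.bucket_mask & (hash as usize);
            map.next[row] = map.buckets[bucket];
            map.buckets[bucket] = row as u64 + 1;
        }
        map
    }

    /// Returns every build row chained in the bucket for `hash`.
    /// These are candidates only: rows with other hashes can share the bucket.
    fn candidates(&self, hash: u64) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cursor = self.buckets[self.bucket_mask & (hash as usize)];
        while cursor != 0 {
            let row = (cursor - 1) as usize;
            out.push(row);
            cursor = self.next[row];
        }
        out
    }
}

/// Probe phase: keep only the candidates whose key really equals the probe key.
fn probe(map: &VecJoinMap, build_keys: &[i64], probe_key: i64, probe_hash: u64) -> Vec<usize> {
    map.candidates(probe_hash)
        .into_iter()
        .filter(|&row| build_keys[row] == probe_key)
        .collect()
}
```

For example, with four build rows whose hashes are 1, 5, 1, and 9, the table has four buckets and all four rows land in bucket 1, so a probe for one key has to compare against all four rows even though only two of them match.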

yahoNanJing · 2023-07-11T20:33:23Z

By the way, I propose setting the number of partitions to 1 for benchmarking, for two reasons:

It lets us focus on the code logic rather than on task scheduling or the thread model.
It makes comparisons with other single-threaded engines fairer.

yahoNanJing · 2023-07-11T20:37:59Z

> The building phase is a bit faster at the expense

Intuitively, I think the Vec is much simpler than the RawTable and the mask operation should be very efficient, so I don't know why the results are not better 😭

Dandandan · 2023-07-11T20:39:02Z

> > The building phase is a bit faster at the expense
>
> Intuitively, I think the Vec is much simpler than the RawTable and the mask operation should be very efficient, so I don't know why the results are not better 😭

Sorry, I hit enter too soon. I updated my comment with my full thoughts on this.

yahoNanJing · 2023-07-11T20:39:41Z

> I think we might first have to see if we can further improve the collision checking (currently eq + filter + take) in order for this approach to become faster in general.

You are right. I will continue to investigate it.

yahoNanJing · 2023-07-12T11:04:50Z

The main blocking issue for q17 is the join order and the decision about which side becomes the build side.

I think we can close this issue.

Improve the hash join performance by replacing the RawTable to a simp…

a34269e

…le Vec for JoinHashMap

github-actions bot added the core Core DataFusion crate label Jul 11, 2023


kyotoYaho added 2 commits July 11, 2023 18:11

Use concatenated record batch to construct the hashmap

002b37f

Fix clippy

d639244

yahoNanJing marked this pull request as draft July 12, 2023 02:33

yahoNanJing closed this Jul 12, 2023
