Implement intermediate result blocked approach to aggregation memory management #15591
Conversation
Hi @Rachelint I think I have an alternative proposal that seems relatively easy to implement.

Really thanks. The design in this PR indeed still introduces quite a few code changes... I tried not to modify anything at first, but I found that way would introduce too much extra cost...
force-pushed from cc37eba to f690940
force-pushed from 95c6a36 to a4c6f42
force-pushed from 2100a5b to 0ee951c
I have finished development (and testing) of all the needed common structs!
force-pushed from c51d409 to 2863809
It is very close; I just need to add more tests!
force-pushed from 31d660d to 2b8dd1e
I suspect it is due to the row-level random access. And I am trying to run the benchmark in more environments rather than only my dev machine (x86 + CentOS + 6 cores).
@jayzhan211 @alamb I guess I nearly understand why. Here is my benchmark result on a production machine (x86_64 + 16 cores + 3699.178 MHz CPU + 64 GB RAM) with different settings:
force-pushed from e08a109 to 8807026
@@ -93,20 +94,27 @@ where
        opt_filter: Option<&BooleanArray>,
        total_num_groups: usize,
    ) -> Result<()> {
this seems rather small? won't larger values have lower overhead?
> this seems rather small? won't larger values have lower overhead?

Do you mean the block size?
Yes, I mean block size. I would expect something like batch size (4-8k), or maybe even bigger, to have lower overhead? Did you run some experiments?
> Yes, I mean block size. I would expect something like batch size (4-8k), or maybe even bigger, to have lower overhead? Did you run some experiments?
Yes, I tried it. Now I set `block_size = batch_size` by default.

I also tried a smaller `batch_size` like 1024, and this PR still shows an improvement compared to main. That is because this PR also eliminates the call to `Array::slice`, which is non-trivial due to the computation of the null count. For details see apache/arrow-rs#6155.
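To make the `Array::slice` point concrete, here is a minimal sketch assuming arrow-rs as a dependency (`emit_by_slicing` and `emit_blocks` are illustrative helpers, not functions from this PR): with `block_size == batch_size`, every block is already exactly one output batch, so no slicing is needed at all.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array};

/// Emission on `main`: one big array is cut into batch-sized pieces.
/// `Array::slice` is not free; among other things, the null count of the
/// slice must be recomputed (see apache/arrow-rs#6155).
fn emit_by_slicing(big: &ArrayRef, batch_size: usize, out: &mut Vec<ArrayRef>) {
    let mut offset = 0;
    while offset < big.len() {
        let len = batch_size.min(big.len() - offset);
        out.push(big.slice(offset, len));
        offset += len;
    }
}

/// Blocked emission: each block is emitted whole, so no slice is taken.
fn emit_blocks(blocks: Vec<Int64Array>, out: &mut Vec<ArrayRef>) {
    for block in blocks {
        out.push(Arc::new(block) as ArrayRef);
    }
}

fn main() {
    let big: ArrayRef = Arc::new(Int64Array::from_iter_values(0..10));
    let mut sliced = Vec::new();
    emit_by_slicing(&big, 4, &mut sliced);
    assert_eq!(sliced.len(), 3); // 4 + 4 + 2 rows

    let blocks = vec![
        Int64Array::from_iter_values(0..4),
        Int64Array::from_iter_values(4..8),
    ];
    let mut emitted = Vec::new();
    emit_blocks(blocks, &mut emitted);
    assert_eq!(emitted.len(), 2);
}
```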
But it makes sense to make `block_size` a dedicated config. I think it can bring an improvement in the case where `batch_size` is set to a too-small value while the group-by cardinality is actually large.
Ah I see that batch size is used by default :).

Yeah, I think it makes sense to test it a bit further; maybe a slightly larger value (e.g. 2x, 4x batch size) will be beneficial when the cardinality is above the batch size.

Also, at some point it might make sense to think of it as a size in memory instead of a number of elements (e.g. a block of `u8` values might hold 16x more values than `u128`).
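A small sketch of that size-based idea (`block_len_for` is a hypothetical helper, not part of this PR): derive the per-type block length from a target block size in bytes.

```rust
use std::mem::size_of;

/// Hypothetical helper: pick a per-type block length from a target block
/// size in bytes, instead of using a fixed number of elements.
fn block_len_for<T>(target_block_bytes: usize) -> usize {
    (target_block_bytes / size_of::<T>()).max(1)
}

fn main() {
    let target = 64 * 1024; // e.g. 64 KiB per block
    // A u8 block holds 16x more values than a u128 block of the same size.
    assert_eq!(block_len_for::<u8>(target), 65_536);
    assert_eq!(block_len_for::<u128>(target), 4_096);
}
```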
I found the hash computation for primitives is actually non-trivial... And in the high-cardinality case, the duplicated hash computation caused by rehashing is really expensive!

I am experimenting with also saving the hash in the hashtable, like the multi-column group by does. I have already found an improvement in the newly added query in extended.sql.
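A minimal sketch of that hash-caching technique, assuming hashbrown's `HashTable` API (DataFusion already depends on hashbrown); this is not the PR's exact code. The key point is that the rehash closure returns the cached hash, so group keys are never hashed a second time when the table grows.

```rust
use hashbrown::HashTable;
use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

fn main() {
    let random_state = RandomState::new();
    // Each entry stores (cached_hash, group_index).
    let mut table: HashTable<(u64, usize)> = HashTable::new();
    let mut group_values: Vec<u64> = Vec::new();

    for key in [1_u64, 2, 1, 3, 2, 1] {
        let hash = random_state.hash_one(key);
        let group_idx = match table.find(hash, |&(_, idx)| group_values[idx] == key) {
            Some(&(_, idx)) => idx,
            None => {
                let idx = group_values.len();
                group_values.push(key);
                // On grow/rehash, the closure returns the cached hash;
                // the key itself is never hashed again.
                table.insert_unique(hash, (hash, idx), |&(h, _)| h);
                idx
            }
        };
        println!("key {key} -> group {group_idx}");
    }
}
```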
> Ah I see that batch size is used by default :). Yeah, I think it makes sense to test it a bit further; maybe a slightly larger value (e.g. 2x, 4x batch size) will be beneficial when the cardinality is above the batch size.
> Also, at some point it might make sense to think of it as a size in memory instead of a number of elements (e.g. a block of `u8` values might hold 16x more values than `u128`).
It is a really good idea to split blocks by size rather than by number of rows. I will experiment with it after trying some existing ideas.
force-pushed from b7c2f29 to 62157a9
/// The values for each group index
- values: Vec<T::Native>,
+ values: Vec<Vec<T::Native>>,
Could we use `values: Vec<Box<[T::Native]>>` instead of `values: Vec<Vec<T::Native>>` here?
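Presumably the motivation (an assumption; the review does not spell it out) is that a boxed slice is one word smaller than a `Vec` and cannot over-allocate once a block is sealed. A small sketch:

```rust
use std::mem::size_of;

fn main() {
    // Vec<T> carries (ptr, capacity, len): three words.
    // Box<[T]> carries just (ptr, len): two words, one per block saved.
    assert_eq!(size_of::<Vec<u64>>(), 3 * size_of::<usize>());
    assert_eq!(size_of::<Box<[u64]>>(), 2 * size_of::<usize>());

    // A full block converts without copying when len == capacity;
    // otherwise `into_boxed_slice` reallocates to shrink-to-fit.
    let block: Vec<u64> = (0..8).collect();
    let sealed: Box<[u64]> = block.into_boxed_slice();
    assert_eq!(sealed.len(), 8);
}
```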
FWIW the machine I am running this benchmark on has 16 not very good cores:
OK, it may really be related to the machine.
🤖: Benchmark completed.
@alamb I think the faster queries here are mainly due to the hash-reuse optimization. But I also have some other ideas for this PR.
Marking as draft while @Rachelint works on the next set of things.
force-pushed from c42454e to d41f99e
The current benchmark results on my 16-core production machine:
BTW, I also simplified the code, although it does not help performance.
force-pushed from dc94961 to 9d0b73b
force-pushed from 6ce6a7f to 7f529b9
Which issue does this PR close?

Rationale for this change

As mentioned in #7065, we use a single `Vec` to manage the aggregation intermediate results in both `GroupAccumulator` and `GroupValues`. It is simple, but not efficient enough for high-cardinality aggregation, because when the `Vec` is not large enough, we need to allocate a new `Vec` and copy all data from the old one.

So this PR introduces a blocked approach to manage the aggregation intermediate results. In this approach we never resize the `Vec`; instead we split the data into blocks, and when the capacity is not enough, we just allocate a new block (see the sketches after this description). Details can be found in #7065.

What changes are included in this PR?

- Support the blocked approach in `PrimitiveGroupsAccumulator` and `GroupValuesPrimitive` as examples

Are these changes tested?

Tested by existing tests, plus new unit tests and new fuzz tests.

Are there any user-facing changes?

Two functions are added to the `GroupValues` and `GroupAccumulator` traits. But as you can see, they have default implementations, and users can choose to actually support the blocked approach when they want better performance for their UDAFs.
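To illustrate the core idea, here is a minimal sketch using only std (`BlockedValues` is an illustrative name, not this PR's actual type): instead of one growable `Vec` that reallocates and copies on growth, keep a list of fixed-size blocks and allocate a fresh block when the last one fills up.

```rust
/// A sketch of blocked storage: growth only ever allocates a new block;
/// existing data is never copied.
struct BlockedValues<T> {
    blocks: Vec<Vec<T>>,
    block_size: usize,
}

impl<T> BlockedValues<T> {
    fn new(block_size: usize) -> Self {
        Self { blocks: Vec::new(), block_size }
    }

    fn push(&mut self, value: T) {
        let last_full = self
            .blocks
            .last()
            .map_or(true, |b| b.len() == self.block_size);
        if last_full {
            // Allocate a new block; old blocks stay where they are.
            self.blocks.push(Vec::with_capacity(self.block_size));
        }
        self.blocks.last_mut().unwrap().push(value);
    }

    /// Random access maps a flat group index to (block_id, offset).
    fn get(&self, group_idx: usize) -> &T {
        &self.blocks[group_idx / self.block_size][group_idx % self.block_size]
    }
}

fn main() {
    let mut values = BlockedValues::new(4);
    for v in 0..10_u64 {
        values.push(v);
    }
    assert_eq!(*values.get(7), 7);
    assert_eq!(values.blocks.len(), 3); // 4 + 4 + 2
}
```

And a sketch of what the two added trait methods with default implementations could look like (the names and signatures here are illustrative guesses, not the PR's exact API):

```rust
/// Hypothetical shape of the opt-in API: implementations that override
/// nothing keep the existing single-Vec behavior.
trait BlockedGroups {
    /// Whether this implementation supports the blocked approach.
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    /// Switch between flat mode (None) and blocked mode (Some(block_size)).
    fn alter_block_size(&mut self, _block_size: Option<usize>) -> Result<(), String> {
        Err("the blocked approach is not supported".to_string())
    }
}

struct MyUdafAccumulator;

// Gets both defaults for free and keeps working exactly as before.
impl BlockedGroups for MyUdafAccumulator {}

fn main() {
    let mut acc = MyUdafAccumulator;
    assert!(!acc.supports_blocked_groups());
    assert!(acc.alter_block_size(Some(8192)).is_err());
}
```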