refactor: Generate GroupByHash output in multiple RecordBatches #9818


Closed
wants to merge 1 commit

Conversation

guojidan
Contributor

Which issue does this PR close?

Closes #9562 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@guojidan
Contributor Author

I don't know if this is the right approach; I could use some recommendations 😄

@alamb
Contributor

alamb commented Mar 27, 2024

I don't know if this is the right approach; I could use some recommendations 😄

Thanks @guojidan -- I will try to check it out over the next day or two. Unfortunately, I am backed up on reviews at the moment.

Comment on lines 61 to 62
/// When producing output, the remaining rows to output are stored
/// here and are sliced off as needed in batch_size chunks
Contributor

Might wanna update this comment

let remaining = batch.slice(size, num_remaining);
let output = batch.slice(0, size);
// output the first element of `batches`
let remaining = batches[1..].to_vec();
Contributor

If we push the batches in reverse order (treat `batches` as a stack) and pop to get each batch, I think we could avoid calling to_vec() multiple times (see next comment).

Comment on lines +810 to +823
for offset in (0..len).step_by(length) {
if offset + length > len {
length = len - offset;
}
let slice_columns = batch
.columns()
.iter()
.map(|array| {
let sliced_array = array.slice(offset, length);
sliced_array.to_owned()
})
.collect();
batches.push(RecordBatch::try_new(batch.schema().clone(), slice_columns)?);
}
Contributor

I feel this can be simplified by first calculating how many slices you'll need (batch.num_rows() / self.batch_size, or something like that), which would allow you to preallocate the Vec in advance. Then you could push the RecordBatches in reverse order, which allows popping in constant time (see previous comment).
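The preallocate-then-reverse-push idea could be sketched as follows. This is a toy illustration using a plain `Vec<i32>` as a stand-in for a RecordBatch (the function name `split_into_stack` and the stand-in type are assumptions, not DataFusion's actual API):

```rust
// Sketch: split a "batch" (here just a slice of i32) into batch_size chunks,
// pushing them in reverse so a consumer can pop() each chunk in O(1)
// without re-copying the remainder each time.
fn split_into_stack(batch: &[i32], batch_size: usize) -> Vec<Vec<i32>> {
    // Preallocate: ceil(len / batch_size) chunks are needed.
    let n = (batch.len() + batch_size - 1) / batch_size;
    let mut stack = Vec::with_capacity(n);
    // Iterate chunks in reverse order so the first chunk ends up on top.
    for chunk in batch.chunks(batch_size).rev() {
        stack.push(chunk.to_vec());
    }
    stack
}

fn main() {
    let batch: Vec<i32> = (0..10).collect();
    let mut stack = split_into_stack(&batch, 4);
    // pop() yields the chunks in original order: [0..4], [4..8], [8..10]
    assert_eq!(stack.pop(), Some(vec![0, 1, 2, 3]));
    assert_eq!(stack.pop(), Some(vec![4, 5, 6, 7]));
    assert_eq!(stack.pop(), Some(vec![8, 9]));
    assert_eq!(stack.pop(), None);
}
```

In the real code the stack would hold RecordBatches, and popping replaces the `batches[1..].to_vec()` pattern, which copies the remaining Vec of handles on every output batch.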

@@ -787,7 +789,8 @@ impl GroupedHashAggregateStream {
let timer = elapsed_compute.timer();
self.exec_state = if self.spill_state.spills.is_empty() {
let batch = self.emit(EmitTo::All, false)?;
ExecutionState::ProducingOutput(batch)
let batches = self.split_batch(batch)?;
Contributor

I think the key point of the request is to avoid the call to emit(EmitTo::All), or perhaps to change that call to return a Vec.

Taking a large single record batch and slicing it up doesn't change how the underlying memory is allocated / laid out (aka the same large contiguous batch is used)
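The memory-layout point can be illustrated with a toy Arc-backed view. The `ArraySlice` type below is hypothetical (not Arrow's actual API), but it mirrors how Arrow array slices work: a slice is just an (offset, len) view over the same shared buffer, so slicing a large batch keeps the whole allocation alive:

```rust
use std::sync::Arc;

// Hypothetical stand-in for an Arrow array: slicing shares the buffer.
struct ArraySlice {
    buffer: Arc<Vec<i32>>, // shared backing allocation
    offset: usize,
    len: usize,
}

impl ArraySlice {
    fn slice(&self, offset: usize, len: usize) -> ArraySlice {
        assert!(offset + len <= self.len);
        ArraySlice {
            buffer: Arc::clone(&self.buffer), // no data is copied
            offset: self.offset + offset,
            len,
        }
    }

    fn values(&self) -> &[i32] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

fn main() {
    let big = ArraySlice { buffer: Arc::new((0..8).collect()), offset: 0, len: 8 };
    let first = big.slice(0, 4);
    let rest = big.slice(4, 4);
    // All three views share one allocation: even after `big` is dropped,
    // the full buffer stays alive while `first` or `rest` exists.
    assert_eq!(Arc::strong_count(&big.buffer), 3);
    assert_eq!(first.values(), &[0, 1, 2, 3]);
    assert_eq!(rest.values(), &[4, 5, 6, 7]);
}
```

This is why slicing after emit(EmitTo::All) doesn't address the memory concern: the output batches would all still reference one large contiguous allocation.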

@alamb alamb marked this pull request as draft April 15, 2024 11:31
@alamb
Contributor

alamb commented Apr 15, 2024

Thanks for this PR @guojidan -- I am marking it as a draft for now, as I am trying to keep the "PR needs review" queue manageable. I am sorry, but I don't think I will have time to help with this PR for a while. Hopefully others (maybe @yjshen or @gruuya) will have some time.


Thank you for your contribution. Unfortunately, this pull request is stale because it has been open for 60 days with no activity. Please remove the stale label or comment, or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 15, 2024
@github-actions github-actions bot closed this Jun 23, 2024
Labels
Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Generate GroupByHash output in multiple RecordBatches rather than one large one
3 participants