Skip to content

Consider removing skip from RowSelector #7450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #7456
Dandandan opened this issue Apr 28, 2025 · 9 comments
Open
Tracked by #7456

Consider removing skip from RowSelector #7450

Dandandan opened this issue Apr 28, 2025 · 9 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Apr 28, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently RowSelector contains a pub skip: bool which means that the rows in the row selector needs to be skipped or not.
However, this field is not very useful as an optimal representation of a RowSelector will always be alternating selected and skipped rows.

I think we should consider dropping the skip field in order to simplify the api (and speed up / reduce memory overhead (from 16 to 8 bytes per element) as well).

We can represent a RowSelector as array of alternating select / skip / select rows.

e.g. :
[0, 10, 5, 10, 5] => select 0, skip 10, select 5, skip 10, select 5

Describe the solution you'd like

Drop the skip field, update implementation to take care of the new representation (select / skip based on alternation rather than via the field, (even/odd)).

Describe alternatives you've considered

Other representation, e.g. Vec<Range<..>>

Additional context

@Dandandan Dandandan added the enhancement Any new improvement worthy of a entry in the changelog label Apr 28, 2025
@alamb
Copy link
Contributor

alamb commented Apr 29, 2025

If we are going to change how RowSelector is implemented I think we should also include a veriant that allows for a Bitmap representation too (to handle the case where the selections are not very large). I promise to write this up as a ticket tomorrow

Related ticket:

@alamb
Copy link
Contributor

alamb commented Apr 29, 2025

I filed an epic that tries to explain the overall goal

Here is a related ticket where @tustvold describes some other ideas to improve RowSelections:

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 29, 2025

Thanks for connecting the issues!
The BooleanBuffer approach is more efficient for dense / non-selective filters, so would be great to have this as well.

Motivating this issue to remove skip is it will half the size of RowSelector from 16 bytes (for 64-bit CPU) to 8 bytes (so it might help in cases where it already is relatively dense / contains many selectors.

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

Like always, @XiangpengHao was ahead of us, and he also mentioned the size of the RowSelector in #7363 (comment)

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

@XiangpengHao
Copy link
Contributor

We can represent a RowSelector as array of alternating select / skip / select rows.

e.g. : [0, 10, 5, 10, 5] => select 0, skip 10, select 5, skip 10, select 5

But this makes concatenating/slicing selectors tricker, and can not accept row selectors that have two consecutive select/skip.

I'll prefer using a sentinel value to represent skip/select.

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

Another potential option might be an enum like this:

enum RowSelector {
  Skip(usize),
  Scan(usize),
  Bitmap(BooleanBuffer),
}
``

@Dandandan
Copy link
Contributor Author

Dandandan commented May 4, 2025

Concatenating/slicing selectors tricker

It seems just adding "0" on odd elements to denote "select 0" gives the same meaning?
Currently the output of most calls seems to be alternating skip/select anyway? So the main change seems to be that we always start with one of the two (e.g select) and go on from there.

Can not accept row selectors that have two consecutive select/skip.

Skip (5), Skip(5) is identical to Skip(10), so what would be the benefit from having consecutive select/skip?

Another potential option might be an enum like this:

enum RowSelector {
   Skip(usize),
   Scan(usize),
   Bitmap(BooleanBuffer),
}

Downside is that it is also 16 bytes per element instead of 8.

@Dandandan
Copy link
Contributor Author

Dandandan commented May 4, 2025

The nice thing is selectors: Vec<RowSelector> is private so it's possible to change it to Vec<usize> and keep the API backwards compatible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants