Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Make list.rs non generic & simplify the code #1118

Closed

Conversation

SemyonSinchenko
Copy link
Member

@SemyonSinchenko SemyonSinchenko commented Nov 24, 2024

Which issue does this PR close?

Closes #1114

Start of the discussion: comment in the PR1073

Rationale for this change

Spark supports only i32 indixed arrays but Comet attempts to support both ListArray (i32 indexed) and LargeListArray (i64 indexed). As a result the code in list.rs contains additional complexity that is not reachable and is not tested anyhow.

What changes are included in this PR?

Refactoring of the list.rs, ListExtract, GetArrayStructField and ArrayInsert now accept only ListArray instead of GenericListArray

How are these changes tested?

All existing tests.

@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review November 24, 2024 17:02
@SemyonSinchenko
Copy link
Member Author

@andygrove As you mentioned in #1073, it would be interesting to remove all logic related to LargeList and check if there is any regression. I did just that and all tests passed.

@SemyonSinchenko
Copy link
Member Author

There is a valid argument against it:

The difference I think is that a LargeList can store more than Integer.MAX_VALUE entries in all rows in a single batch, so if you have multiple Spark rows all with the max num of rows supported, it wouldn't fit into an Arrow List array. That would probably need to be supported elsewhere, but it may be worth keeping the LargeList handling around in case that scenario is supported? And other DataFusion expressions might return a LargeList even if it doesn't come directly from Spark? Does the native Parquet reader ever use a LargeList?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spark support only i32 indexed arrays while comet is trying to support both i32 and i64
1 participant