
GH-46777: [C++] Use SimplifyIsIn only when the value_set of the expression is lower than a threshold #46859


Merged: 5 commits merged into apache:main on Jul 7, 2025

Conversation

raulcd
Member

@raulcd raulcd commented Jun 19, 2025

Rationale for this change

Using SimplifyIsIn when the value set is large has a substantial performance penalty.

What changes are included in this PR?

Skip the simplification when the value_set of the expression is larger than a threshold (50).
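To illustrate the idea of gating the simplification on the size of the value set, here is a minimal, self-contained sketch. It does not use Arrow's real types or the PR's code: `MaybeSimplifyIsIn`, `RangeGuarantee`, and `kIsInSimplificationThreshold` are hypothetical names, the guarantee is modeled as a plain integer range, and the exact comparison against the threshold (`>=` here, following the PR title) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <optional>
#include <vector>

// Hypothetical cutoff mirroring the threshold (50) from the PR description.
constexpr std::size_t kIsInSimplificationThreshold = 50;

// Toy guarantee: the column value x is known to satisfy lo <= x <= hi.
struct RangeGuarantee {
  int lo;
  int hi;
};

// Returns the value set restricted to the guaranteed range, or std::nullopt
// when the set is too large and the caller should keep the original
// expression untouched. The ">=" comparison is an assumption.
std::optional<std::vector<int>> MaybeSimplifyIsIn(
    const std::vector<int>& value_set, const RangeGuarantee& guarantee) {
  if (value_set.size() >= kIsInSimplificationThreshold) {
    return std::nullopt;  // Large set: skip the simplification entirely.
  }
  std::vector<int> kept;
  std::copy_if(value_set.begin(), value_set.end(), std::back_inserter(kept),
               [&](int v) { return v >= guarantee.lo && v <= guarantee.hi; });
  return kept;
}
```

With this sketch, a four-element set is filtered against the guarantee, while a 100-element set is returned as "not simplified", so the cost of filtering is only paid for small sets.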

Are these changes tested?

I've tested locally that the reproducer goes back to pre-change levels.

$ python read.py 
=== PYARROW VERSION 20 ===
Retrieved 10,000,000 rows in 3.08 seconds.

I have added a test for large value sets that validates the expression is not modified.

Are there any user-facing changes?

No


⚠️ GitHub issue #46777 has been automatically assigned in GitHub to PR creator.

@raulcd raulcd changed the title from "GH-46777: [C++] Use SimplifyIsIn only when the value_set of the expression is lower than 50" to "GH-46777: [C++] Use SimplifyIsIn only when the value_set of the expression is lower than a threshold" on Jun 19, 2025
Member Author

@raulcd raulcd left a comment


@zanmato1984 @pitrou how would you proceed about testing this? Should I just try to add a new benchmark that exercises this scenario? The current Python reproducer has validated that the solution fixes the originally reported problem. There are currently tests with fewer values, but I haven't found tests with a large value_set.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 19, 2025
Member

pitrou commented Jun 19, 2025

Do we have any benchmarks for expression simplification already? Otherwise, we shouldn't bother adding any.

Contributor

@zanmato1984 zanmato1984 left a comment


> how would you proceed about testing this? Should I just try to add a new benchmark that exercises this scenario?

I think it's good to have both tests and benchmarks. I assume the test wouldn't be very heavy (a 50-element value set), would it?

Member

pitrou commented Jul 2, 2025

It would be nice to have this in 21.0. Do you want to update this PR @raulcd ?

Member Author

raulcd commented Jul 2, 2025

Sure, I am working on it at the moment, will try to push soon

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 2, 2025
@@ -80,6 +80,16 @@ Expression add(Expression l, Expression r) {
  return call("add", {std::move(l), std::move(r)});
}

std::string make_range_json(int start, int end) {
Member Author


I don't know our test utilities well, but I couldn't find anything like this. Do we have a utility that does this?

Member


We don't.
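For readers following the diff, the hunk only shows the signature of `make_range_json`, not its body. A helper of that shape could look roughly like the sketch below. The body, the `_sketch` suffix, the inclusive upper bound, and the `", "` separator are all assumptions for illustration, not the code merged in this PR.

```cpp
#include <string>

// Builds a JSON array literal such as "[1, 2, 3]" for the integers from
// start to end. Assumption: both bounds are inclusive; the PR's actual
// helper may treat the range differently.
std::string make_range_json_sketch(int start, int end) {
  std::string json = "[";
  for (int i = start; i <= end; ++i) {
    if (i != start) {
      json += ", ";  // Separate elements after the first one.
    }
    json += std::to_string(i);
  }
  json += "]";
  return json;
}
```

A string like this is convenient for building a value set well above the threshold in a test without writing out every element by hand.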

@raulcd raulcd marked this pull request as ready for review July 3, 2025 13:08
@raulcd raulcd requested review from pitrou and zanmato1984 July 3, 2025 13:08
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 3, 2025
Member

@pitrou pitrou left a comment


+1 except for two nits

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 3, 2025
Member Author

@raulcd raulcd left a comment


👍

@raulcd raulcd merged commit 0b34e6b into apache:main Jul 7, 2025
38 of 39 checks passed
@raulcd raulcd removed the awaiting change review Awaiting change review label Jul 7, 2025
@raulcd raulcd deleted the GH-46777 branch July 7, 2025 07:22
@github-actions github-actions bot added the awaiting changes Awaiting changes label Jul 7, 2025

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0b34e6b.

There were 9 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 53 possible false positives for unstable benchmarks that are known to sometimes produce them.
