Add LogicalSystemLimit automatically for data-intensive operations #3749

Open
wants to merge 13 commits into
base: main

LantaoJin
Member

@LantaoJin LantaoJin commented Jun 9, 2025

Description

Starting with v3.0.0, PPL introduces commands that may increase data volume. To prevent out-of-memory problems, the system automatically adds a LogicalSystemLimit operator for such commands.

plugins.query.system_limit: The maximum number of rows that a subsearch can feed into a data-intensive operation (e.g. join, lookup). The default value is 50000. The valid range is 0 to 2147483647 (Int.MaxValue).
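
As a minimal sketch of how a client might prepare an update for this setting: the snippet below only validates the documented range and builds a settings payload. The helper name `build_system_limit_update` is hypothetical, and the assumption that the setting is updated with the same `transient`/`persistent` payload shape as other `plugins.query.*` settings is not confirmed by this PR.

```python
import json

SYSTEM_LIMIT_KEY = "plugins.query.system_limit"
DEFAULT_SYSTEM_LIMIT = 50000        # default stated in the description
MAX_SYSTEM_LIMIT = 2147483647       # upper bound stated in the description


def build_system_limit_update(limit: int, persistent: bool = False) -> dict:
    """Build a settings payload for the system limit (hypothetical helper).

    Assumption: the setting accepts the usual transient/persistent payload
    shape used by other plugins.query.* settings.
    """
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise TypeError("limit must be an integer")
    if not 0 <= limit <= MAX_SYSTEM_LIMIT:
        raise ValueError(f"limit must be between 0 and {MAX_SYSTEM_LIMIT}")
    scope = "persistent" if persistent else "transient"
    return {scope: {SYSTEM_LIMIT_KEY: limit}}


if __name__ == "__main__":
    # Request body a client could send to the settings endpoint.
    print(json.dumps(build_system_limit_update(100000), indent=2))
```

Rejecting out-of-range values on the client side is a design choice for this sketch; the server is expected to enforce the same range.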

Update

All PPL join/lookup/expand commands (the data-bloating ones) are now affected by this PR. In the future, we can add command arguments to control the limit for a specific command.

For Join, when join type is

  • SEMI, ANTI: no effect
  • RIGHT: add a LogicalSystemLimit operator to left side (main-search)
  • Others: add a LogicalSystemLimit operator to right side (sub-search)

For Lookup

  • add a LogicalSystemLimit operator to right side (sub-search)

For expand

  • add a LogicalSystemLimit operator to right side (sub-search)
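
The placement rules above can be summarized in a short sketch. This is an illustration in Python, not the planner code from this PR; the function name `limited_side` and the string return values are hypothetical, and treating any command other than join/lookup/expand as unaffected is an assumption.

```python
from typing import Optional


def limited_side(command: str, join_type: Optional[str] = None) -> Optional[str]:
    """Return which input gets a LogicalSystemLimit: "left", "right", or None.

    Hypothetical helper that mirrors the rules listed in the description.
    """
    command = command.lower()
    if command in ("lookup", "expand"):
        return "right"                  # sub-search side
    if command == "join":
        if join_type is None:
            raise ValueError("join requires a join type")
        kind = join_type.upper()
        if kind in ("SEMI", "ANTI"):
            return None                 # no limit is added
        if kind == "RIGHT":
            return "left"               # main-search side
        return "right"                  # all other join types
    return None                         # assumption: other commands are unaffected


if __name__ == "__main__":
    for cmd, kind in [("join", "inner"), ("join", "right"), ("join", "semi"),
                      ("lookup", None), ("expand", None)]:
        print(cmd, kind, "->", limited_side(cmd, kind))
```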

The results of an impacted search (for example, the lookup table of the lookup command or the right side of an inner join)
cannot exceed the limit (50000 rows by default). If the lookup table or the right side of an inner join contains more rows
than the system limit, only the number of rows specified by the configuration is searched.
You can set the configuration to the maximum integer value (2147483647) if you are certain resources are not a concern.
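
To make the consequence concrete, here is a small simulation of an inner join whose right side is capped. It is a sketch in plain Python rather than the engine's behavior: the function `inner_join_with_system_limit` is hypothetical, and it assumes the first N rows in iteration order are the ones kept, which the real operator does not guarantee.

```python
from itertools import islice


def inner_join_with_system_limit(left, right, key, system_limit=50000):
    """Inner-join two row lists after capping the right side (illustrative only)."""
    # Keep at most `system_limit` rows from the right side; the rest are never seen.
    limited_right = list(islice(right, system_limit))

    # Index the surviving right rows by join key.
    index = {}
    for row in limited_right:
        index.setdefault(row[key], []).append(row)

    # Emit one merged row per matching (left, right) pair.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


if __name__ == "__main__":
    left = [{"id": 1, "a": "x"}, {"id": 3, "a": "y"}]
    right = [{"id": 1, "b": "p"}, {"id": 2, "b": "q"}, {"id": 3, "b": "r"}]

    # With a limit of 2, the right row with id 3 is cut off, so only id 1 matches.
    print(inner_join_with_system_limit(left, right, "id", system_limit=2))
    # With the default limit, both left rows find a match.
    print(inner_join_with_system_limit(left, right, "id"))
```

The first call returns one row and the second returns two, which shows that a limit smaller than the right side can drop matches that an unlimited join would return.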

Related Issues

Resolves #3731

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@LantaoJin LantaoJin requested a review from penghuo June 10, 2025 09:04
@qianheng-aws
Collaborator

  1. Will this change lead to incorrect results?

  2. To avoid data bloat, why not add a limit operator for each child of the join operator? It should have a similar effect.

penghuo
penghuo previously approved these changes Jun 10, 2025
@penghuo penghuo dismissed their stale review June 10, 2025 15:33

new comments

@LantaoJin LantaoJin marked this pull request as draft June 11, 2025 08:56
Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: Lantao Jin <[email protected]>
@LantaoJin LantaoJin changed the title Pushdown system limit automatically for data-intensive operations Add LogicalSystemLimit automatically for data-intensive operations Jun 11, 2025
Signed-off-by: Lantao Jin <[email protected]>
@LantaoJin
Member Author

@penghuo @qianheng-aws @dai-chen I have updated the description, along with a code refactor and the docs. Please take another look.

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

Labels
backport 2.19-dev calcite calcite migration releated stalled
Development

Successfully merging this pull request may close these issues.

[ENHANCEMENT] Set operator limitation for data-intensive operators
3 participants