Skip to content

Filter script pushdown with RelJson serialization in Calcite #3859

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

songkant-aws
Copy link
Contributor

@songkant-aws songkant-aws commented Jul 9, 2025

Description

The work on top of preivous Calcite filter script pushdown: #3812.
The issue of previous PR is it sends unsafe arbitrary Java code string to OpenSearch DataNode to compile it, leading to security concern where malicious user could use this exposed script field to run any code to attack cluster.

In this PR, it leverages Calcite existing RexNode serialization mechanism to mitigate security issue. We're now transporting logical expression representation, aka RexNode that will not allow user to customize arbitrary programming logic. The registered function calls are limited to PPLBuiltin UDFs and Calcite standard UDFs. The pushdown now undergoes this way:

|-----------------Coordinator Node -----------------|
optimized RexNode -> encoded RexNode Json String

---transport call-->

|--------------------------------------------------------Data Node -----------------------------------------------------|
decoded RexNode Json String -> deserialize as RexNode -> translate to Linq4j expressions -> code compile expressions and run

Related Issues

Resolves #3379

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@LantaoJin
Copy link
Member

LantaoJin commented Jul 9, 2025

@songkant-aws I remember you mentioned "Calcite has the RexNode serialization related implementation", can you point it out here?

RexNode rexNode, RelDataType relDataType, Map<String, ExprType> fieldTypes) {
try {
// Serialize RexNode and RelDataType by JSON
JsonBuilder jsonBuilder = new JsonBuilder();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calcite related RexNode and RelDataType serialization

Comment on lines 92 to 98
RelJson relJson = RelJson.create();
Map<String, Object> rowTypeMap = mapper.readValue((String) objectMap.get(ROW_TYPE), TYPE_REF);
RelDataType rowType = relJson.toType(cluster.getTypeFactory(), rowTypeMap);
OpenSearchRelInputTranslator inputTranslator = new OpenSearchRelInputTranslator(rowType);
relJson = relJson.withInputTranslator(inputTranslator).withOperatorTable(pplSqlOperatorTable);
Map<String, Object> exprMap = mapper.readValue((String) objectMap.get(EXPR), TYPE_REF);
RexNode rexNode = relJson.toRex(cluster, exprMap);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calcite RexNode and RelDataType deserialization related logic

}

@Override
public RexNode translateInput(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customized handler to handle RexNode's RelInput

@songkant-aws
Copy link
Contributor Author

@LantaoJin See comments inline for Calcite RexNode serialization usage by JSON.

public static final String EXPR = "expr";
public static final String FIELD_TYPES = "fieldTypes";
public static final String ROW_TYPE = "rowType";
private static final ObjectMapper mapper = new ObjectMapper();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It brings me that we can leverage jackson-dataformat-msgpack which is a Jackson extension library to serialize the json to binary. (could be faster and memory saving)

private static final ObjectMapper mapper = new ObjectMapper(new MessagePackFactory());

Ref https://github.com/msgpack/msgpack-java/blob/main/msgpack-jackson/README.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the Calcite RelJson code. Unfortunately, it doesn't rely on ObjectMapper. It implements its own JsonBuilder to write RexNode to json string. So we cannot leverage this msgpack extension.

@LantaoJin LantaoJin added enhancement New feature or request pushdown pushdown related issues labels Jul 9, 2025
final RexBuilder rexBuilder = cluster.getRexBuilder();

// Check if it is a local ref.
if (map.containsKey("type")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this handling logic or throw exception if we won't really have local reference.

It's inappropriate to put input's RowType in local ref

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove it. I don't see we have such LocalRef case.

@songkant-aws songkant-aws changed the title [DRAFT] Create RelJsonSerializer to serialize RexNode instead of unsafe code string Filter script pushdown with RelJson serialization in Calcite Jul 10, 2025
Comment on lines +177 to +179
+ " | STATS count(AcceptedAnswerId) as count"
+ " | EVAL dateStr = makedate(2025, count)"
+ " | WHERE simple_query_string(['dateStr'], 'taste')";
Copy link
Contributor Author

@songkant-aws songkant-aws Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to more complex ppl query because now most of EVAL filters can be pushed down as script expression. Adding agg before filters will prevent them pushdown.

@songkant-aws songkant-aws marked this pull request as ready for review July 10, 2025 08:27
Signed-off-by: Songkan Tang <[email protected]>
@@ -426,6 +427,28 @@ public void testMultiFieldsRelevanceQueryFunctionExplain() throws IOException {
+ " default_operator='or', analyzer=english)"));
}

@Ignore("The serialized string is unstable because of function properties")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@songkant-aws can you add above test case in explain IT?

}

/** Expression script language name. */
public static final String CALCITE_LANG_NAME = "opensearch_calcite";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for more readable, change to:

public static final String EXPRESSION_LANG_NAME = "opensearch_calcite_expression";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@LantaoJin
Copy link
Member

@songkant-aws can you fix the conflicts, I think the enhancement has been blocked for too long. @penghuo @qianheng-aws any other concerns? can we merge it first?

penghuo
penghuo previously approved these changes Jul 18, 2025
LantaoJin
LantaoJin previously approved these changes Jul 19, 2025
qianheng-aws
qianheng-aws previously approved these changes Jul 21, 2025
@qianheng-aws
Copy link
Collaborator

LGTM. Please resolve the conflict @songkant-aws

Signed-off-by: Songkan Tang <[email protected]>
private Expression tryAnalyzeOperand(RexNode node) {
try {
Expression expr = node.accept(this);
if (expr instanceof NamedFieldExpression) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] In which case will we go into this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't have such case. So it's noop when we encounter it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems will throw exception if invoking CompoundQueryExpression ::and later

for (QueryExpression expression : expressions) {

Copy link
Collaborator

@qianheng-aws qianheng-aws Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking we do will meet this after merging this PR: #3849, since it support using a singleton field as a condition in filer. #3849 (review)

But it needs more change to support push down such NamedFieldExpression, better in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should never happen. We could just throw exception here.

Copy link
Collaborator

@qianheng-aws qianheng-aws Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK to me leaving this code here. Current behavior of throwing exception in CompoundQueryExpression ::and makes more sense to me comparing to ignoring it. Ignoring NamedFieldExpression seems will induce to correctness issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LantaoJin Do you have any other concern on this?

Copy link
Member

@LantaoJin LantaoJin Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LantaoJin Do you have any other concern on this?

Looks good to me as long as the case where field1=10 or field2 can apply partial pushdown field1=10.

@LantaoJin LantaoJin merged commit 0a07684 into opensearch-project:main Jul 22, 2025
23 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.19-dev failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-3859-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0a07684a3dae7be069b628ae666efaf667295f48
# Push it to GitHub
git push --set-upstream origin backport/backport-3859-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev

Then, create a pull request where the base branch is 2.19-dev and the compare/head branch is backport/backport-3859-to-2.19-dev.

@LantaoJin
Copy link
Member

@songkant-aws please manually backport it to 2.19-dev

songkant-aws added a commit to songkant-aws/sql that referenced this pull request Jul 22, 2025
…rch-project#3859)

* Support filter script push down for calcite

Signed-off-by: Heng Qian <[email protected]>

* Fix UT

Signed-off-by: Heng Qian <[email protected]>

* Change CompoundedScriptEngine name

Signed-off-by: Heng Qian <[email protected]>

* Remove duplicated code after merging main

Signed-off-by: Heng Qian <[email protected]>

* Create RelJsonSerializer to serialize RexNode instead of unsafe code string

Signed-off-by: Songkan Tang <[email protected]>

* Fix some IT

Signed-off-by: Songkan Tang <[email protected]>

* Add validation logic before script serialization

Signed-off-by: Songkan Tang <[email protected]>

* Resolve compilation issue after merge

Signed-off-by: Songkan Tang <[email protected]>

* Comply with relevance query pushdown

Signed-off-by: Songkan Tang <[email protected]>

* Add cost computing for script pushdown to balance performance

Signed-off-by: Songkan Tang <[email protected]>

* Correct test case after merge

Signed-off-by: Songkan Tang <[email protected]>

* Add V2Expression vs RexNode serde benchmark

Signed-off-by: Songkan Tang <[email protected]>

* Fix spotless check

Signed-off-by: Songkan Tang <[email protected]>

* Address comments

Signed-off-by: Songkan Tang <[email protected]>

* Fix IT after change

Signed-off-by: Songkan Tang <[email protected]>

* Encode langType into script JSON blob

Signed-off-by: Songkan Tang <[email protected]>

* Fix issue after merge

Signed-off-by: Songkan Tang <[email protected]>

* Consolidate with partial pushdown logic after merge

Signed-off-by: Songkan Tang <[email protected]>

* Fix UT

Signed-off-by: Songkan Tang <[email protected]>

---------

Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Co-authored-by: Heng Qian <[email protected]>
(cherry picked from commit 0a07684)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.19-dev backport-failed enhancement New feature or request pushdown pushdown related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Calcite Engine Framework: Pushdown scripts
4 participants