-
Notifications
You must be signed in to change notification settings - Fork 162
Filter script pushdown with RelJson serialization in Calcite #3859
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter script pushdown with RelJson serialization in Calcite #3859
Conversation
Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: Heng Qian <[email protected]>
…string Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
@songkant-aws I remember you mentioned "Calcite has the RexNode serialization related implementation", can you point it out here? |
RexNode rexNode, RelDataType relDataType, Map<String, ExprType> fieldTypes) { | ||
try { | ||
// Serialize RexNode and RelDataType by JSON | ||
JsonBuilder jsonBuilder = new JsonBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calcite related RexNode and RelDataType serialization
RelJson relJson = RelJson.create(); | ||
Map<String, Object> rowTypeMap = mapper.readValue((String) objectMap.get(ROW_TYPE), TYPE_REF); | ||
RelDataType rowType = relJson.toType(cluster.getTypeFactory(), rowTypeMap); | ||
OpenSearchRelInputTranslator inputTranslator = new OpenSearchRelInputTranslator(rowType); | ||
relJson = relJson.withInputTranslator(inputTranslator).withOperatorTable(pplSqlOperatorTable); | ||
Map<String, Object> exprMap = mapper.readValue((String) objectMap.get(EXPR), TYPE_REF); | ||
RexNode rexNode = relJson.toRex(cluster, exprMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calcite RexNode and RelDataType deserialization related logic
} | ||
|
||
@Override | ||
public RexNode translateInput( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Customized handler to handle RexNode's RelInput
@LantaoJin See comments inline for Calcite RexNode serialization usage by JSON. |
public static final String EXPR = "expr"; | ||
public static final String FIELD_TYPES = "fieldTypes"; | ||
public static final String ROW_TYPE = "rowType"; | ||
private static final ObjectMapper mapper = new ObjectMapper(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It brings me that we can leverage jackson-dataformat-msgpack which is a Jackson extension library to serialize the json to binary. (could be faster and memory saving)
private static final ObjectMapper mapper = new ObjectMapper(new MessagePackFactory());
Ref https://github.com/msgpack/msgpack-java/blob/main/msgpack-jackson/README.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the Calcite RelJson code. Unfortunately, it doesn't rely on ObjectMapper. It implements its own JsonBuilder to write RexNode to json string. So we cannot leverage this msgpack extension.
final RexBuilder rexBuilder = cluster.getRexBuilder(); | ||
|
||
// Check if it is a local ref. | ||
if (map.containsKey("type")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this handling logic or throw exception if we won't really have local reference.
It's inappropriate to put input's RowType in local ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can remove it. I don't see we have such LocalRef case.
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
+ " | STATS count(AcceptedAnswerId) as count" | ||
+ " | EVAL dateStr = makedate(2025, count)" | ||
+ " | WHERE simple_query_string(['dateStr'], 'taste')"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to more complex ppl query because now most of EVAL filters can be pushed down as script expression. Adding agg before filters will prevent them pushdown.
Signed-off-by: Songkan Tang <[email protected]>
@@ -426,6 +427,28 @@ public void testMultiFieldsRelevanceQueryFunctionExplain() throws IOException { | |||
+ " default_operator='or', analyzer=english)")); | |||
} | |||
|
|||
@Ignore("The serialized string is unstable because of function properties") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@songkant-aws can you add above test case in explain IT?
} | ||
|
||
/** Expression script language name. */ | ||
public static final String CALCITE_LANG_NAME = "opensearch_calcite"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for more readable, change to:
public static final String EXPRESSION_LANG_NAME = "opensearch_calcite_expression";
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
@songkant-aws can you fix the conflicts, I think the enhancement has been blocked for too long. @penghuo @qianheng-aws any other concerns? can we merge it first? |
LGTM. Please resolve the conflict @songkant-aws |
Signed-off-by: Songkan Tang <[email protected]>
1dd6752
Signed-off-by: Songkan Tang <[email protected]>
private Expression tryAnalyzeOperand(RexNode node) { | ||
try { | ||
Expression expr = node.accept(this); | ||
if (expr instanceof NamedFieldExpression) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[question] In which case will we go into this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't have such case. So it's noop when we encounter it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems will throw exception if invoking CompoundQueryExpression ::and
later
sql/opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java
Line 927 in 37536b8
for (QueryExpression expression : expressions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we do will meet this after merging this PR: #3849, since it support using a singleton field as a condition in filer. #3849 (review)
But it needs more change to support push down such NamedFieldExpression
, better in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should never happen. We could just throw exception here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK to me leaving this code here. Current behavior of throwing exception in CompoundQueryExpression ::and
makes more sense to me comparing to ignoring it. Ignoring NamedFieldExpression seems will induce to correctness issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LantaoJin Do you have any other concern on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LantaoJin Do you have any other concern on this?
Looks good to me as long as the case where field1=10 or field2
can apply partial pushdown field1=10
.
The backport to
To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/sql/backport-2.19-dev 2.19-dev
# Navigate to the new working tree
pushd ../.worktrees/sql/backport-2.19-dev
# Create a new branch
git switch --create backport/backport-3859-to-2.19-dev
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 0a07684a3dae7be069b628ae666efaf667295f48
# Push it to GitHub
git push --set-upstream origin backport/backport-3859-to-2.19-dev
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/sql/backport-2.19-dev Then, create a pull request where the |
@songkant-aws please manually backport it to 2.19-dev |
…rch-project#3859) * Support filter script push down for calcite Signed-off-by: Heng Qian <[email protected]> * Fix UT Signed-off-by: Heng Qian <[email protected]> * Change CompoundedScriptEngine name Signed-off-by: Heng Qian <[email protected]> * Remove duplicated code after merging main Signed-off-by: Heng Qian <[email protected]> * Create RelJsonSerializer to serialize RexNode instead of unsafe code string Signed-off-by: Songkan Tang <[email protected]> * Fix some IT Signed-off-by: Songkan Tang <[email protected]> * Add validation logic before script serialization Signed-off-by: Songkan Tang <[email protected]> * Resolve compilation issue after merge Signed-off-by: Songkan Tang <[email protected]> * Comply with relevance query pushdown Signed-off-by: Songkan Tang <[email protected]> * Add cost computing for script pushdown to balance performance Signed-off-by: Songkan Tang <[email protected]> * Correct test case after merge Signed-off-by: Songkan Tang <[email protected]> * Add V2Expression vs RexNode serde benchmark Signed-off-by: Songkan Tang <[email protected]> * Fix spotless check Signed-off-by: Songkan Tang <[email protected]> * Address comments Signed-off-by: Songkan Tang <[email protected]> * Fix IT after change Signed-off-by: Songkan Tang <[email protected]> * Encode langType into script JSON blob Signed-off-by: Songkan Tang <[email protected]> * Fix issue after merge Signed-off-by: Songkan Tang <[email protected]> * Consolidate with partial pushdown logic after merge Signed-off-by: Songkan Tang <[email protected]> * Fix UT Signed-off-by: Songkan Tang <[email protected]> --------- Signed-off-by: Heng Qian <[email protected]> Signed-off-by: Songkan Tang <[email protected]> Co-authored-by: Heng Qian <[email protected]> (cherry picked from commit 0a07684)
Description
The work on top of preivous Calcite filter script pushdown: #3812.
The issue of previous PR is it sends unsafe arbitrary Java code string to OpenSearch DataNode to compile it, leading to security concern where malicious user could use this exposed script field to run any code to attack cluster.
In this PR, it leverages Calcite existing RexNode serialization mechanism to mitigate security issue. We're now transporting logical expression representation, aka RexNode that will not allow user to customize arbitrary programming logic. The registered function calls are limited to PPLBuiltin UDFs and Calcite standard UDFs. The pushdown now undergoes this way:
|-----------------Coordinator Node -----------------|
optimized RexNode -> encoded RexNode Json String
---transport call-->
|--------------------------------------------------------Data Node -----------------------------------------------------|
decoded RexNode Json String -> deserialize as RexNode -> translate to Linq4j expressions -> code compile expressions and run
Related Issues
Resolves #3379
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.