-
Notifications
You must be signed in to change notification settings - Fork 162
Support filter script push down for calcite #3812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Heng Qian <[email protected]>
Signed-off-by: Heng Qian <[email protected]>
...ch/src/main/java/org/opensearch/sql/opensearch/storage/script/PPLCompoundedScriptEngine.java
Outdated
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/request/PredicateAnalyzer.java
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/CalciteScriptEngine.java
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/CalciteScriptEngine.java
Show resolved
Hide resolved
opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/CalciteScriptEngine.java
Show resolved
Hide resolved
This is a side project that runs the plugin as a standalone CLI and leverages DSL queries to interact with the OpenSearch cluster. |
Yes, we can check the "calcite" related package information in script code to determine the engine type, rather than encode additional information. |
Add option incompatible with rest client is also for security consideration. Otherwise, users can use Rest client to submit java code script without limitation. |
Signed-off-by: Heng Qian <[email protected]>
public static Function1<DataContext, Object[]> compile(String code, Object reason) { | ||
try { | ||
ClassBodyEvaluator cbe = new ClassBodyEvaluator(); | ||
cbe.setClassName("Reducer"); | ||
cbe.setExtendedClass(Utilities.class); | ||
cbe.setImplementedInterfaces(new Class[] {Function1.class, Serializable.class}); | ||
cbe.setParentClassLoader(RexExecutable.class.getClassLoader()); | ||
cbe.cook(new Scanner((String) null, new StringReader(code))); | ||
Class c = cbe.getClazz(); | ||
Constructor<Function1<DataContext, Object[]>> constructor = c.getConstructor(); | ||
return (Function1) constructor.newInstance(); | ||
} catch (IOException | ||
| InstantiationException | ||
| IllegalAccessException | ||
| InvocationTargetException | ||
| NoSuchMethodException | ||
| CompileException var5) { | ||
Exception e = var5; | ||
throw new RuntimeException("While compiling " + reason, e); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Since we already get the code string, we can new RexExecutable and call getFunction() to avoid this copied method?
Signed-off-by: Heng Qian <[email protected]>
Current implementation depend on Script.java in Core deny any options, if it is not true in future, user can run ANY java code.
Are there any workarounds we can consider? Instead of using raw Java code, can we leverage LINQ4j Enumerable or V2 Expression to limit user input to only allowlisted expressions and reduce the risk? |
// Compile code when creating to detect exception as early as possible | ||
JavaTypeFactoryImpl typeFactory = | ||
new JavaTypeFactoryImpl(rexBuilder.getTypeFactory().getTypeSystem()); | ||
RexToLixTranslator.InputGetter getter = | ||
new ScriptInputGetter(typeFactory, rowType, fieldTypes); | ||
this.code = CalciteScriptEngine.translate(rexBuilder, List.of(rexNode), getter, rowType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking is it possible we delay this compilation to Java until compile()
in ScriptEngine? In this way, we serialize and deserialize RexCall
instead, similar as V2 expression pushdown? Just some thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RexNode is unserializable. We've though of similar method as mentioned in the option2 here: #3379 (comment).
@songkant-aws is working on option2 but seems it's not feasible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leveraging Kryo serialization for classes without zero-argument constructor requires explicitly registering specific class serializer. That means we may need to spend more efforts on creating separate customized Kryo serializers for Calcite Expression classes/Enumerable classes.
If we go this way, I'm not confident yet to ensure all kinds of expressions can be correctly serialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RexNode is unserializable.
You mean it's not by JDK or Kryo right? Does other serializer work, or even Json's?
Do we know how many classes related to RexCall
we need to serialize? Just thinking if that's not many and if we can ignore fields not needed by code-gen, can we have customize serializer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qianheng-aws I checked the option 2 you posted. I feel it's different?
What I was thinking is:
- Coordinator: RexNode -> serialized code
- Worker: Code -> RexNode -> Linq4j expression -> Interpret or compile to Java
In this way, the script engine can be bounded to allowlisted RexNode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@songkant-aws Not sure if you've explored the idea above. If so, could you share your PoC branch if any? I can do more test from my side. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dai-chen For the sake of convenience, I pulled Heng's code in my local and add my Kryo serialization prototype code on top of it. And then pushed the code to my personal branch: https://github.com/songkant-aws/sql/tree/kryo-serialization
Basically, the logic is similar. I just feel adding additional serialization logic for third party classes requires too much effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know how many classes related to RexCall we need to serialize?
The common RexNode we need may not be too many. Like RexCall, RexLiteral, RexInputRef, RexLambda, RexLambdaRef... I think the biggest blocker is the inner fields of them. Like RexCall, it has plenty of SqlOperators implemented by Calcite or us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@songkant-aws @qianheng-aws Got it. Will have a look and also see if other options on my side. Thanks!
throw new UnsupportedScriptException( | ||
"DSL will evaluate both branches of OR with isNUll, prevent push-down to avoid NPE"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it an existed issue in v2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think V2 won't have this issue. Firstly, it seems doesn't support function isEmpty
and isBlank
. And secondly, it won't have such issue if it includes the null handling logic in one expression.
We have such issue because we break isEmpty
into 2 expressions(isNull or isEmpty, all from Calcite) and Calcite's isEmpty
doesn't have null handling in its implementation.
throw new UnsupportedScriptException( | ||
"ScriptQueryExpression requires a valid current time from hook, but it is not set"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocker: can we add a UT for this branch?
return switch (exprType) { | ||
case INTEGER -> EnumUtils.convert(docValueExpr, Long.class); | ||
case FLOAT -> EnumUtils.convert(docValueExpr, Double.class); | ||
default -> docValueExpr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the exprType
be a SHORT
or BINARY
? for doc value. Seems all numeric types will impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It imitates the same logic as v2. As said in the description, it only needs specific handling for Integer and Long.
sql/opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/core/ExpressionScript.java
Lines 139 to 143 in 870b82c
/** | |
* DocValue only support long and double so cast to integer and float if needed. The doc value | |
* must be Long and Double for expr type Long/Integer and Double/Float respectively. Otherwise | |
* there must be bugs in our engine that causes the mismatch. | |
*/ |
} | ||
|
||
/** | ||
* This function is copied from Calcite RexExecutorImpl It's used to compile RexNode expression to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems you copied the decompiled code instead source code. Can you check it?:
java.util.Iterator var5
, Expressions.declare(16
, Expressions.methodDecl(1
, etc.
* Calcite script executor that executes the generated code on each document and determine if the | ||
* document is supposed to be filtered out or not. | ||
*/ | ||
@EqualsAndHashCode(callSuper = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use true
here?
Mark draft currently for security consideration. |
Description
Implement filter script push down for calcite by registering a new script language for java code string. This new script language can only be invoked by Node client since it requires a option of
engine_type
which is not allowed in Rest client.Related Issues
Resolves #3379
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.