Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Oct 30, 2025

Speeds up queries like

FROM foo
| STATS SUM(LENGTH(field))

by fusing the LENGTH into the loading of the field if it has doc values. Running a fairly simple test:
https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091 I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster.

More importantly, this makes the mechanism for fusing functions into field loading generic. All you have to do is implement BlockLoaderExpression on your expression and return non-null from tryFuse.

Speeds up queries like
```
FROM foo
| STATS SUM(LENGTH(field))
```
by fusing the `LENGTH` into the loading of the `field` if it has doc
values. Running a fairly simple test:
https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091
I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster.

More importantly, this makes the mechanism for fusing functions into
field loading generic. All you have to do is implement
`BlockLoaderExpression` on your expression and return non-null from
`tryFuse`.
@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

* "fusing" the expression into the load. Or null if the fusion isn't possible.
*/
@Nullable
Fuse tryFuse(SearchStats stats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to find another name - we already have Fuse as a command. ExpressionFieldLoader?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is FusedExpression ok? Or still too indicative?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming... 😅

I come from staring at FUSE enough that it carries a lot of weight.

For me, this feature involves BlockLoaders. And Expressions that are applied to them. I understand that fuse means getting together those two, but it's not something I would think of immediately without more context.

I'd prefer to be overly explicit here, and call this BlockLoaderExpression or something similar that helps me bridge those two concepts together. But, naming...

BlockLoaderExpression.Fuse fuse
) {
// Only replace if exactly one side is a literal and the other a field attribute
if ((similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It's much better to let the Expression deal with the details and make this generic 👍

*/
public boolean pushable() {
return true;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bothers me. I needed this because without it we'd try to push this:

FROM foo
| WHERE LENGTH(kwd) < 10

to the index. Now, we might be able to do that with a specialized lucene query. But we don't have one of those. Without those change instead what happens is:

  1. LENGTH(kwd) becomes $$kwd$length$hash$.
  2. We identify $$kwd$length$hash$ < 10 as pushable.

This tells us we can't push it. But it's kind of picky. If SearchStats took EsField it could check this easy enough. That might be a good solution to this.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Oct 31, 2025
Adds special purpose `BlockLoader` implementations for the `MV_MIN` and
`MV_MAX` functions for `keyword` fields with doc values. These are a
noop for single valued keywords but should be *much* faster for
multivalued keywords.

These aren't plugged in yet. We can plug them in and performance test
them in elastic#137382. And they give us two more functions we can use to
demonstrate elastic#137382.
}

public void testLengthInWhereAndEval() {
assumeFalse("fix me", true);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QL friends: This one looks fun!

@nik9000 nik9000 marked this pull request as ready for review October 31, 2025 19:52
@nik9000 nik9000 requested a review from carlosdelest October 31, 2025 19:52
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 31, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000
Copy link
Member Author

nik9000 commented Oct 31, 2025

This is ready for folks to look at! I left two open questions around planning as self-comments.

@nik9000
Copy link
Member Author

nik9000 commented Oct 31, 2025

I think I'd also like to write a really basic integration test that uses proves the fusion is happening.

@nik9000
Copy link
Member Author

nik9000 commented Oct 31, 2025

A little test: https://gist.github.com/nik9000/3a8eb7065d20c6f36d0c71fad80d1d2c

FROM test
| STATS SUM(LENGTH(big))

With few unique values the perf goes from ~85ms to ~13ms.

If there's enough values to trigger #137217 (comment) then it's slower. So, as always, @dnhatn was right. I'll make the sort, iter, uniq bit he suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants