Add histogram facet capabilities. #14204

Open

jpountz wants to merge 3 commits into main

Conversation

@jpountz jpountz commented Feb 6, 2025

This is inspired by a paper from Tencent in which the authors describe how they speed up so-called "histogram queries" by sorting the index by timestamp and translating the range of values that corresponds to each histogram bucket into a range of doc IDs. This way, at collection time, they no longer need to look up values and can compute the histogram purely by looking at collected doc IDs.

YU, Muzhi, LIN, Zhaoxiang, SUN, Jinan, et al. TencentCLS: the cloud log service with high query performances. Proceedings of the VLDB Endowment, 2022, vol. 15, no. 12, pp. 3472-3482.

Instead of binary-searching the doc ID space to translate histogram buckets into ranges of doc IDs, the new collector manager uses recently introduced support for sparse indexing. When playing with the geonames dataset, computing a histogram of the elevation field runs ~2-3x faster with this optimization than with the naive implementation.
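
To make the trick concrete, here is a minimal, self-contained sketch of the idea. The SkipBlock record below is a hypothetical stand-in for a block-level summary (per-block min value, max value, and doc count), not the actual Lucene doc values skipper API; buckets are keyed by Math.floorDiv(value, interval) as in this PR.

// Illustrative sketch only: SkipBlock is a hypothetical stand-in for a block-level
// summary of doc values (min/max value and doc count), not Lucene's real skipper API.
record SkipBlock(long minValue, long maxValue, int docCount) {}

final class HistogramSketch {
  // Buckets are defined as floorDiv(value, interval).
  static void count(Iterable<SkipBlock> blocks, long interval, java.util.Map<Long, Long> counts) {
    for (SkipBlock block : blocks) {
      long minBucket = Math.floorDiv(block.minValue(), interval);
      long maxBucket = Math.floorDiv(block.maxValue(), interval);
      if (minBucket == maxBucket) {
        // The whole block falls into a single bucket: count its docs in bulk,
        // without looking up a single per-document value.
        counts.merge(minBucket, (long) block.docCount(), Long::sum);
      } else {
        // Otherwise fall back to reading values doc by doc for this block (not shown).
      }
    }
  }
}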

    long leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
    long leafMaxQuotient = Math.floorDiv(skipper.maxValue(), interval);
    if (leafMaxQuotient - leafMinQuotient <= 1024) {
      // Only use the optimized implementation if there is a small number of unique quotients,
👍 If there are many quotients, it is very unlikely that the skipper would help, as there is probably no skipper block that belongs to just one quotient.

@iverase iverase left a comment

Nice usage of the doc values skipper!

I only wonder if we should add a limit on the number of entries we can add to the hash table. It is easy to provide a small interval that would generate millions of entries.

@jpountz jpountz commented Feb 6, 2025

I agree that providing a small interval is a bad usage pattern. I don't know how to validate this though, since we can't know the range of values of the docs that match the query up-front. Even if the sparse index exposes the min and max values across whole segments, it may be that the query only matches a small subset of these values (e.g. if it filters on the same field that is used to compute the histogram).

So I guess that the only option is to fail at runtime. I can do that. What looks like a reasonable cap on the number of returned intervals? 1024?

@iverase iverase commented Feb 6, 2025

So I guess that the only option is to fail at runtime. I can do that. What looks like a reasonable cap on the number of returned intervals? 1024?

1024 sounds like a good default, and maybe the limit could be made configurable via a constructor parameter?
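
For illustration, such a cap could look roughly like the sketch below; the class, field, and parameter names here are hypothetical and only meant to show the shape of a constructor-configurable limit that fails at runtime.

// Hypothetical sketch of a constructor-configurable bucket cap; names are illustrative,
// not taken from the PR.
final class HistogramCollectorManagerSketch {
  private static final int DEFAULT_MAX_BUCKETS = 1024;

  private final long interval;
  private final int maxBuckets;

  HistogramCollectorManagerSketch(long interval) {
    this(interval, DEFAULT_MAX_BUCKETS);
  }

  HistogramCollectorManagerSketch(long interval, int maxBuckets) {
    this.interval = interval;
    this.maxBuckets = maxBuckets;
  }

  // Would be called whenever a new bucket key is about to be added to the hash table.
  void checkBucketCount(int currentSize) {
    if (currentSize >= maxBuckets) {
      throw new IllegalStateException(
          "Too many histogram buckets: consider using a larger interval (max: " + maxBuckets + ")");
    }
  }
}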

@gsmiller gsmiller left a comment

This is really cool! +1 to the conversation on limiting the number of buckets in the map to something sane (and also +1 to making it configurable via the ctor).

    this.interval = interval;
    this.collectorCounts = collectorCounts;

    leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);

minor: Since you're already computing these min/max values in the calling code, is it worth passing them along instead of recomputing them here?

  public void finish() throws IOException {
    // Put counts that we computed in the int[] back into the hash map.
    for (int i = 0; i < counts.length; ++i) {
      collectorCounts.addTo(leafMinQuotient + i, counts[i]);

OK, lack of understanding on my part, but could you help explain why you're accumulating into an array and then transferring into the map instead of accumulating directly into the map?
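
For context, the pattern being discussed looks roughly like the sketch below (field names follow the snippets above, but the surrounding class shape and the plain java.util.Map are assumptions; the PR appears to use a primitive hash map with addTo). The presumed motivation is that the hot per-document loop only increments a dense int[], while the hash map is only touched once per bucket when the leaf finishes.

import java.util.Map;

// Simplified sketch of the accumulate-then-transfer pattern under discussion;
// the class shape and Map type are assumptions, not copied from the PR.
final class LeafBucketCounts {
  private final long leafMinQuotient;
  private final int[] counts;                    // dense per-leaf counts, one slot per quotient
  private final Map<Long, Long> collectorCounts; // shared across leaves

  LeafBucketCounts(long leafMinQuotient, int numQuotients, Map<Long, Long> collectorCounts) {
    this.leafMinQuotient = leafMinQuotient;
    this.counts = new int[numQuotients];
    this.collectorCounts = collectorCounts;
  }

  void collect(long quotient) {
    // Hot loop: a plain array increment, no hash-map operation per document.
    counts[(int) (quotient - leafMinQuotient)]++;
  }

  void finish() {
    // Once per leaf: fold the dense counts back into the shared map.
    for (int i = 0; i < counts.length; ++i) {
      if (counts[i] != 0) {
        collectorCounts.merge(leafMinQuotient + i, (long) counts[i], Long::sum);
      }
    }
  }
}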

import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

final class HistogramCollector implements Collector {

This feels like a bit of an odd fit for the facet module given that these are generally implementations of the Facets interface that compute aggregations over a set of documents that's already been collected. Did you look at the newer sandbox faceting module at all for this? I wonder if this would hook into that module better since it's meant to compute aggregations while collecting (which is exactly what this is doing). The downside is burying something like this in sandbox... so maybe it's not great?
