Add histogram facet capabilities. #14204
base: main
Conversation
This is inspired by a paper from Tencent in which the authors describe how they speed up so-called "histogram queries" by sorting the index by timestamp and translating the ranges of values that correspond to each histogram bucket into ranges of doc IDs. This way, at collection time, they no longer need to look up values and can compute the histogram purely by looking at collected doc IDs.

YU, Muzhi, LIN, Zhaoxiang, SUN, Jinan, et al. TencentCLS: the cloud log service with high query performances. Proceedings of the VLDB Endowment, 2022, vol. 15, no 12, p. 3472-3482.

Instead of binary-searching the doc ID space to translate histogram buckets into ranges of doc IDs, the new collector manager uses the recently introduced support for sparse indexing. When playing with the geonames dataset, computing a histogram of the elevation field runs ~2-3x faster with this optimization than with the naive implementation.
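To make the paper's doc-ID-translation idea concrete, here is a small self-contained sketch (illustration only, with made-up numbers; it is not the code in this PR, which relies on the doc values skipper instead of binary search): when the index is sorted on the histogram field, each bucket covers a contiguous range of doc IDs, so collected docs can be bucketed without reading their values.

// Self-contained sketch of the paper's doc-ID-range idea (not this PR's code).
class DocIdHistogramSketch {
  public static void main(String[] args) {
    long interval = 100;
    // Field values in doc ID order; non-decreasing because the index sort is on this field.
    long[] values = {3, 42, 57, 120, 130, 199, 205, 260, 270, 305};
    long minBucket = Math.floorDiv(values[0], interval);
    long maxBucket = Math.floorDiv(values[values.length - 1], interval);
    int numBuckets = Math.toIntExact(maxBucket - minBucket + 1);
    // Precompute, for each bucket, the first doc ID that belongs to the next bucket.
    int[] bucketEndDoc = new int[numBuckets];
    for (int doc = 0, b = 0; b < numBuckets; b++) {
      while (doc < values.length && Math.floorDiv(values[doc], interval) == minBucket + b) {
        doc++;
      }
      bucketEndDoc[b] = doc;
    }
    // At collection time, the doc ID alone identifies the bucket: no value lookup needed.
    int[] collectedDocs = {1, 2, 4, 6, 9}; // doc IDs that matched the query
    long[] counts = new long[numBuckets];
    int b = 0;
    for (int doc : collectedDocs) { // doc IDs arrive in increasing order during collection
      while (doc >= bucketEndDoc[b]) {
        b++;
      }
      counts[b]++;
    }
    for (int i = 0; i < numBuckets; i++) {
      long lo = (minBucket + i) * interval;
      System.out.println("[" + lo + ", " + (lo + interval) + "): " + counts[i]);
    }
  }
}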
long leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
long leafMaxQuotient = Math.floorDiv(skipper.maxValue(), interval);
if (leafMaxQuotient - leafMinQuotient <= 1024) {
  // Only use the optimized implementation if there is a small number of unique quotients,
👍 If there are many quotients, it is very unlikely that the skipper would help, as there is probably no skipper block that belongs to just one quotient.
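For illustration (assumed numbers, not taken from the PR): a skipper block can only be counted in bulk when its min and max values fall into the same bucket, which becomes less and less likely as the interval shrinks relative to the block's value range.

// Sketch of the condition under which a whole skipper block lands in a single bucket.
class SkipperBlockSketch {
  static boolean singleBucket(long blockMinValue, long blockMaxValue, long interval) {
    return Math.floorDiv(blockMinValue, interval) == Math.floorDiv(blockMaxValue, interval);
  }

  public static void main(String[] args) {
    // A block whose values span [1_000, 1_900]:
    System.out.println(singleBucket(1_000, 1_900, 10_000)); // true: the block can be counted in bulk
    System.out.println(singleBucket(1_000, 1_900, 100));    // false: every doc's value must be read
  }
}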
Nice usage of the doc values skipper!
I only wonder if we should add a limit on the number of entries we can add to the hash table. It is easy to provide a small interval that would generate millions of entries.
I agree that providing a small interval is a bad usage pattern. I don't know how to validate this though, since we can't know the range of values of the docs that match the query up-front. Even if the sparse index exposes the min and max values across whole segments, it may be that the query only matches a small subset of these values (e.g. if it filters on the same field that is used to compute the histogram). So I guess that the only option is to fail at runtime. I can do that. What would be a reasonable cap on the number of returned intervals? 1024?
1024 sounds like a good default, and maybe the limit could be configurable via a constructor parameter?
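A minimal sketch of what such a configurable cap could look like (the names and the default here are assumptions for illustration, not necessarily what the PR ends up with): the limit is passed through the constructor and enforced at runtime once the number of buckets grows past it.

// Hypothetical sketch: a constructor-configurable cap on buckets, enforced while collecting.
final class BucketLimitSketch {
  private final int maxBuckets;

  BucketLimitSketch(int maxBuckets) {
    this.maxBuckets = maxBuckets; // e.g. 1024 by default
  }

  // Called as buckets are added; fails at runtime once the query produces too many buckets.
  void checkBucketCount(int currentBucketCount) {
    if (currentBucketCount > maxBuckets) {
      throw new IllegalStateException(
          "Collected more than " + maxBuckets + " histogram buckets; use a larger interval"
              + " or raise the limit via the constructor");
    }
  }
}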
This is really cool! +1 to the conversation on limiting the number of buckets in the map to something sane (and also +1 to making it configurable via the ctor).
this.interval = interval;
this.collectorCounts = collectorCounts;

leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
minor: Since you're already computing these min/max values in the calling code, is it worth passing them along instead of recomputing?
public void finish() throws IOException {
  // Put counts that we computed in the int[] back into the hash map.
  for (int i = 0; i < counts.length; ++i) {
    collectorCounts.addTo(leafMinQuotient + i, counts[i]);
OK, lack of understanding on my part, but could you help explain why you're accumulating into an array and then transferring into the map instead of accumulating directly into the map?
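For context, a self-contained sketch of the pattern the diff suggests (illustration only; in the real code the min/max quotients come from the doc values skipper rather than from scanning values): the per-doc hot loop only increments a dense int[] indexed by quotient minus leafMinQuotient, and the shared hash map is touched just once per leaf, in finish(), which would avoid a hash lookup per collected doc.

import java.util.HashMap;
import java.util.Map;

// Sketch of per-leaf dense counting folded into a shared map at the end of the leaf.
class PerLeafCountsSketch {
  static void countLeaf(long[] leafValues, long interval, Map<Long, Long> collectorCounts) {
    // In the real code these bounds come from the doc values skipper; here they are scanned.
    long leafMinQuotient = Long.MAX_VALUE;
    long leafMaxQuotient = Long.MIN_VALUE;
    for (long v : leafValues) {
      long q = Math.floorDiv(v, interval);
      leafMinQuotient = Math.min(leafMinQuotient, q);
      leafMaxQuotient = Math.max(leafMaxQuotient, q);
    }
    int[] counts = new int[Math.toIntExact(leafMaxQuotient - leafMinQuotient + 1)];
    for (long v : leafValues) {
      // Hot loop: a plain array increment, no hashing per doc.
      counts[(int) (Math.floorDiv(v, interval) - leafMinQuotient)]++;
    }
    // Equivalent of finish(): fold the dense counts back into the shared map, once per leaf.
    for (int i = 0; i < counts.length; ++i) {
      if (counts[i] != 0) {
        collectorCounts.merge(leafMinQuotient + i, (long) counts[i], Long::sum);
      }
    }
  }

  public static void main(String[] args) {
    Map<Long, Long> collectorCounts = new HashMap<>();
    countLeaf(new long[] {3, 42, 57, 120, 130, 199}, 100, collectorCounts);
    System.out.println(collectorCounts); // {0=3, 1=3}
  }
}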
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

final class HistogramCollector implements Collector {
This feels like a bit of an odd fit for the facet module given that these are generally implementations of the Facets interface that compute aggregations over a set of documents that's already been collected. Did you look at the newer sandbox faceting module at all for this? I wonder if this would hook into that module better since it's meant to compute aggregations while collecting (which is exactly what this is doing). The downside is burying something like this in sandbox... so maybe it's not great?