Add histogram facet capabilities. #14204

Merged: 13 commits merged into apache:main on Feb 21, 2025

Conversation

@jpountz jpountz commented Feb 6, 2025

This is inspired by a paper from Tencent in which the authors describe how they speed up so-called "histogram queries" by sorting the index by timestamp and translating the range of values that corresponds to each histogram bucket into a range of doc IDs. This way, at collection time, they no longer need to look up values and can compute the histogram purely by looking at collected doc IDs.

YU, Muzhi, LIN, Zhaoxiang, SUN, Jinan, et al. TencentCLS: the cloud log service with high query performances. Proceedings of the VLDB Endowment, 2022, vol. 15, no 12, p. 3472-3482.

Instead of binary-searching the doc ID space to translate histogram buckets into ranges of doc IDs, the new collector manager uses recently introduced support for sparse indexing. When playing with the geonames dataset, computing a histogram of the elevation field runs ~2-3x faster with this optimization than with the naive implementation.
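
For illustration, a minimal usage sketch of the new collector manager (the package, constructor signature, and setup here are assumptions based on the benchmark snippet quoted further down in this thread, not the final API):

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.internal.hppc.LongIntHashMap;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.Directory;
// The HistogramCollectorManager import is left out on purpose: it starts out in the facet
// module in this PR and moves to the sandbox facet module during the review below.

class ElevationHistogramExample {
  static LongIntHashMap countElevationBuckets(Directory dir) throws IOException {
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Count matching docs per bucket of width 100 over the "elevation" doc values field.
      // A returned key b stands for the value range [b * 100, (b + 1) * 100).
      return searcher.search(
          new MatchAllDocsQuery(), new HistogramCollectorManager("elevation", 100));
    }
  }
}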

long leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
long leafMaxQuotient = Math.floorDiv(skipper.maxValue(), interval);
if (leafMaxQuotient - leafMinQuotient <= 1024) {
  // Only use the optimized implementation if there is a small number of unique quotients,
Contributor

👍 If there are many quotients, it is very unlikely that the skipper would help, since there is probably no skipper block that belongs to just one quotient.

@iverase iverase left a comment

Nice usage of the doc values skipper!

I only wonder if we should add a limit on the number of entries we can add to the hash table. It is easy to provide a small interval that would generate millions of entries.

jpountz commented Feb 6, 2025

I agree that providing a small interval is a bad usage pattern. I don't know how to validate this though, since we can't know the range of values of the docs that match the query up-front. Even if the sparse index exposes the min and max values across whole segments, it may be that the query only matches a small subset of these values (e.g. if it filters on the same field that is used to compute the histogram).

So I guess that the only option is to fail at runtime. I can do that. What looks like a reasonable cap on the number of returned intervals? 1024?

iverase commented Feb 6, 2025

So I guess that the only option is to fail at runtime. I can do that. What looks like a reasonable cap on the number of returned intervals? 1024?

1024 sounds like a good default, and maybe the limit could be made configurable via a constructor parameter?
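
A rough sketch of the kind of runtime guard being discussed here (the constant, method name, and exception choice are assumptions, not the merged code):

// Hypothetical cap on the number of distinct buckets collected for one search.
private static final int DEFAULT_MAX_BUCKETS = 1024;

private static void checkMaxBuckets(int numBuckets, int maxBuckets) {
  if (numBuckets > maxBuckets) {
    throw new IllegalStateException(
        "Collected "
            + numBuckets
            + " buckets, which is more than the configured limit of "
            + maxBuckets
            + ". Consider using a larger bucket width?");
  }
}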

@gsmiller gsmiller left a comment

This is really cool! +1 to the conversation on limiting the number of buckets in the map to something sane (and also +1 to making it configurable via the ctor).

this.interval = interval;
this.collectorCounts = collectorCounts;

leafMinQuotient = Math.floorDiv(skipper.minValue(), interval);
Contributor

minor: Since you're already computing these min/max values from the calling code, is it worth passing them along instead of recomputing?

Contributor Author

I'm a bit on the fence, these are very cheap to compute, and passing them along forces the ctor to take 2 more arguments?

Contributor

No strong preference on this. Out of curiosity though, what's the concern with additional ctor arguments? Potential for bugs if calling code transposes args, or something else? There's only the one call-point since this inner class is private, but maybe there's something else of concern here? (Like I said, no strong preference, just trying to learn/understand your thinking on this)

Contributor Author

I guess that the main annoyance I have with taking these values from the constructor is that I would then want to validate that they indeed match the min/max values exposed by the skipper (just for the sake of being defensive and not letting bugs sneak in), but then we'd be back to computing these min/max buckets twice, plus the code would be a bit harder to read due to more ctor arguments and validation logic.

Contributor

That's fair, thanks :)

public void finish() throws IOException {
  // Put counts that we computed in the int[] back into the hash map.
  for (int i = 0; i < counts.length; ++i) {
    collectorCounts.addTo(leafMinQuotient + i, counts[i]);
Contributor

OK, lack of understanding on my part, but could you help explain why you're accumulating into an array and then transferring into the map instead of accumulating directly into the map?

Contributor Author

It is trying to save the overhead of the hash table, which needs to deal with hashing and collisions. I'll add a comment.
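
Roughly the pattern being described (a simplified sketch; variable names follow the snippets quoted above):

// Per hit: bump a plain int[] slot indexed by (quotient - leafMinQuotient). This avoids
// paying for hashing and collision resolution on every collected document.
counts[(int) (quotient - leafMinQuotient)]++;

// Once per leaf, in finish(): fold the dense per-leaf array back into the shared hash map.
for (int i = 0; i < counts.length; ++i) {
  collectorCounts.addTo(leafMinQuotient + i, counts[i]);
}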

Contributor

Oh I see. OK thanks.

import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

final class HistogramCollector implements Collector {
Contributor

This feels like a bit of an odd fit for the facet module given that these are generally implementations of the Facets interface that compute aggregations over a set of documents that's already been collected. Did you look at the newer sandbox faceting module at all for this? I wonder if this would hook into that module better since it's meant to compute aggregations while collecting (which is exactly what this is doing). The downside is burying something like this in sandbox... so maybe it's not great?

@jpountz jpountz Feb 7, 2025

I had a very quick look at the sandbox faceting module, but I didn't like that it introduces many abstractions (cutter, recorder, label, etc.) when I wanted something simple. I'm happy to move it somewhere else though.

Contributor

To clarify - were the abstractions difficult to get started with or do you think they would have complicated the implementation?

Tagging @epotyom and @Shradha26 since I know they would like to improve that code.

Contributor Author

Sorry, I spoke too quickly and didn't use the right words. When I said that I don't like it, I meant that the API is designed around the idea of hierarchical facets, which didn't match the mental model I had for this collector, which I wanted to keep as simple as possible and, in particular, not hierarchical.

Contributor

the API is designed around the idea of hierarchical facets

I don't think hierarchy is a central piece of the API, it is more like hierarchical facets are also supported if needed. For many facet implementations it is not needed, such as ranges, distinct long values, etc.

I briefly looked at the code and I think it fits into the sandbox module API use case. I believe all you need is to implement FacetCutter (#createLeafCutter) and LeafFacetCutter (#advanceExact and #nextOrd) interfaces, which return bucket indexes as facet ordinals.

What you would get for free is counting, or computing other aggregations based on numeric fields for the doc (FacetRecorder implementations), and sorting/top-N buckets by count or aggregation (OrdinalIterators). You could also implement the OrdToLabel interface to do the math explained in HistogramCollectorManager's javadoc and return String representations of bucket ranges.

That being said, if you don't think that this additional functionality is useful, then there is probably no reason to use the new API, which is more flexible at a cost of creating more abstractions.
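
To make that concrete, here is a sketch of what a histogram leaf cutter could look like (the interface and method names are taken from the comment above and the benchmark diff below, but the exact signatures, packages, and the single-valued simplification are assumptions):

import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.sandbox.facet.iterators.OrdinalIterator;
// LeafFacetCutter import assumed; only the #advanceExact / #nextOrd contract mentioned
// above is relied on here.

class HistogramLeafCutterSketch implements LeafFacetCutter {
  private final NumericDocValues values;
  private final long bucketWidth;
  private int ord;
  private boolean consumed;

  HistogramLeafCutterSketch(NumericDocValues values, long bucketWidth) {
    this.values = values;
    this.bucketWidth = bucketWidth;
  }

  @Override
  public boolean advanceExact(int docId) throws IOException {
    if (values.advanceExact(docId) == false) {
      return false; // no value for this doc
    }
    // Map the value to its bucket; the int cast is exactly where the int-vs-long ordinal
    // discussion below comes from.
    ord = (int) Math.floorDiv(values.longValue(), bucketWidth);
    consumed = false;
    return true;
  }

  @Override
  public int nextOrd() {
    if (consumed) {
      return OrdinalIterator.NO_MORE_ORDS;
    }
    consumed = true;
    return ord;
  }
}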

Contributor Author

It makes some follow-ups that I had in mind a bit harder, but we could look into that in follow-up PRs. Otherwise this works for me.

I see that you return the bucket directly as an ordinal, which I'm not sure would work given that ordinals seem to be expected to be positive and dense. So that wouldn't work if the field has negative values, or very high values (e.g. a date field)?

Contributor Author

Ok it seems that I was confused, I thought that the densification of ordinals was the responsibility of the cutter, but it seems to happen on the recorder, so producing non-dense ordinals should be fine (?) Then it would be nice to make ordinals longs rather than ints since we can't guarantee that Math.floorDiv(value, bucketWidth) would always return a value in the int range?

Separately I played with the quick/dirty benchmark I had created, which seems to have got a bit more than 2x slower. I guess this is the cost of the additional abstractions. I'm including it for reference, it's certainly crude.

diff --git a/src/extra/perf/IndexGeoNames.java b/src/extra/perf/IndexGeoNames.java
index aa98b20..6db0837 100644
--- a/src/extra/perf/IndexGeoNames.java
+++ b/src/extra/perf/IndexGeoNames.java
@@ -44,6 +44,7 @@ import org.apache.lucene.document.IntField;
 import org.apache.lucene.document.IntPoint;
 import org.apache.lucene.document.KeywordField;
 import org.apache.lucene.document.LongField;
+import org.apache.lucene.document.NumericDocValuesField;
 import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
 //import org.apache.lucene.index.IndexReader;
@@ -52,6 +53,8 @@ import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.index.IndexWriterConfig.OpenMode;
 import org.apache.lucene.index.IndexableField;
 import org.apache.lucene.index.NoMergePolicy;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.SortField;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.PrintStreamInfoStream;
@@ -75,7 +78,10 @@ public class IndexGeoNames {

     Directory dir = FSDirectory.open(indexPath);
     //IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
-    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
+    SortField sortField = new SortField("elevation", SortField.Type.LONG);
+    sortField.setMissingValue(Long.MIN_VALUE);
+    Sort sort = new Sort(sortField);
+    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer()).setIndexSort(sort);
     iwc.setOpenMode(OpenMode.CREATE);
     //iwc.setRAMBufferSizeMB(350);
     iwc.setInfoStream(new PrintStreamInfoStream(System.out));
@@ -144,8 +150,8 @@ public class IndexGeoNames {
               baseDoc.add(admin3Field);
               StringField admin4Field = new StringField("admin4", "", store);
               baseDoc.add(admin4Field);
-              LongField population = new LongField("population", 0, store);
-              LongField elevation = new LongField("elevation", 0, store);
+              NumericDocValuesField population = NumericDocValuesField.indexedField("population", 0);
+              NumericDocValuesField elevation = NumericDocValuesField.indexedField("elevation", 0);
               IntField dem = new IntField("dem", 0, store);
               KeywordField tzField = new KeywordField("timezone", "", store);
               baseDoc.add(tzField);
@@ -282,11 +288,11 @@ public class IndexGeoNames {

                   if (values[14].isEmpty() == false) {
                     long v = Long.parseLong(values[14]);
-                    doc.add(new LongField("population", v, store));
+                    doc.add(NumericDocValuesField.indexedField("population", v));
                   }
                   if (values[15].isEmpty() == false) {
                     long v = Long.parseLong(values[15]);
-                    doc.add(new LongField("elevation", v, store));
+                    doc.add(NumericDocValuesField.indexedField("elevation", v));
                   }
                   if (values[16].isEmpty() == false) {
                     doc.add(new IntField("dem", Integer.parseInt(values[16]), store));
diff --git a/src/extra/perf/SearchGeoNames.java b/src/extra/perf/SearchGeoNames.java
index 743c422..87e98f4 100644
--- a/src/extra/perf/SearchGeoNames.java
+++ b/src/extra/perf/SearchGeoNames.java
@@ -20,8 +20,6 @@ package perf;
 import java.io.IOException;
 import java.nio.file.Path;
 import java.nio.file.Paths;
-import java.text.ParsePosition;
-import java.text.SimpleDateFormat;
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Locale;
@@ -32,7 +30,14 @@ import org.apache.lucene.document.IntPoint;
 import org.apache.lucene.document.LongPoint;
 import org.apache.lucene.index.DirectoryReader;
 import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.internal.hppc.LongIntHashMap;
+import org.apache.lucene.sandbox.facet.FacetFieldCollectorManager;
+import org.apache.lucene.sandbox.facet.iterators.OrdinalIterator;
+import org.apache.lucene.sandbox.facet.plain.histograms.HistogramFacetCutter;
+import org.apache.lucene.sandbox.facet.recorders.CountFacetRecorder;
+import org.apache.lucene.search.FieldExistsQuery;
 import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchAllDocsQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.Directory;
@@ -53,7 +58,7 @@ public class SearchGeoNames {
     IndexSearcher s = new IndexSearcher(r);
     s.setQueryCache(null); // don't bench the cache

-    SimpleDateFormat dateParser = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
+    /*SimpleDateFormat dateParser = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
     System.out.println("t=" + dateParser.parse("2014-12-01", new ParsePosition(0)).getTime());

     searchOneField(s, getQueries(s, "geoNameID", 0, 10000000));
@@ -61,7 +66,32 @@ public class SearchGeoNames {
     searchOneField(s, getQueries(s, "longitude", -180.0, 180.0));

     // 1993-12-01 to 2014-12-01:
-    searchOneField(s, getQueries(s, "modified", 754722000000L, 1417410000000L));
+    searchOneField(s, getQueries(s, "modified", 754722000000L, 1417410000000L));*/
+
+    System.out.println(r.maxDoc() + " total docs");
+    System.out.println(s.count(new FieldExistsQuery("elevation")) + " docs with elevation");
+    for (int i = 0; i < 10000; ++i) {
+      long start = System.nanoTime();
+      // histogram_facet branch
+      //LongIntHashMap counts = s.search(new MatchAllDocsQuery(), new HistogramCollectorManager("elevation", 100));
+      // histogram cutter branch
+      CountFacetRecorder recorder = new CountFacetRecorder();
+      HistogramFacetCutter cutter = new HistogramFacetCutter("elevation", 100);
+      FacetFieldCollectorManager<CountFacetRecorder> collectorManager =
+          new FacetFieldCollectorManager<>(cutter, recorder);
+      s.search(new MatchAllDocsQuery(), collectorManager);
+      long end = System.nanoTime();
+      // histogram_facet branch
+      // System.out.println((end - start) + " " + counts);
+      // histogram cutter branch
+      LongIntHashMap counts = new LongIntHashMap();
+      OrdinalIterator ords = recorder.recordedOrds();
+      for (int ord = ords.nextOrd(); ord != OrdinalIterator.NO_MORE_ORDS; ord = ords.nextOrd()) {
+        counts.put(ord, recorder.getCount(ord));
+      }
+      System.out.println((end - start) + " " + counts);
+      recorder.recordedOrds();
+    }

     r.close();
     dir.close();

@epotyom epotyom Feb 17, 2025

Then it would be nice to make ordinals longs rather than int

It's an interesting idea. The issue is that many facet implementations don't need long ordinals, e.g. Taxonomy or SSDV facets counting use ints, even though at least in some cases they index values as longs. It looks like using long for them might be wasteful. Overall, using long for something that is essentially a group id in a grouping mechanism seems excessive.

At the same time long facet ordinals can simplify and improve performance for some FacetCutter implementations, in particular LongValueFacetCutter . Also, as you've mentioned, it is FacetRecorder responsibility to keep counts in a dense data structure, so it might be fine to move to long.

In any case it requires a separate effort, and I think we should run luceneutil #325 and Amazon internal perf tests before making the decision. I can create an issue for it.

ordinals seem to be expected to be positive

Yes, it is also a limitation in the current API. IIRC the only thing that relies on it is OrdinalIterator#NO_MORE_ORDS = -1. We can probably reserve some other value for it, e.g. Long.MAX_VALUE or MIN_VALUE; it should work for most cases, including histograms, since bucketWidth has to be greater than 2. It's a bit fragile for LongValueFacetCutter - we'd have to throw a runtime error if the value in the index is NO_MORE_ORDS. Although the implementation is already fragile as it uses LongIntHashMap, whose size is limited by the max array size. So I suppose it's not too terrible to not allow the NO_MORE_ORDS value when counting.

Separately I played with the quick/dirty benchmark I had created, which seems to have got a bit more than 2x slower.

I'd like to look into it - maybe there is something we can improve. Are you running runGeoBench.cmd to get results?

Contributor Author

So I suppose it's not too terrible to not allow NO_MORE_ORDS value when counting.

I'm a bit confused as to what your suggestion is then. For instance, if the value is -5 and the interval is 10, the bucket would be computed as Math.floorDiv(-5, 10), which returns -1.

Based on your points, it seems like the least fragile approach would be to make the facet cutter responsible for densifying bucket ordinals?

So in the end, it would make sense to have two histogram implementations, one as a cutter, that densifies ordinals using a hash table that is similar to your branch, and another one as a raw collector manager for users who are interested in computing counts per histogram bucket but nothing else like this PR?

Are you running runGeoBench.cmd to get results?

I'm running the geonames benchmark, which consists of running IndexGeoNames#main and then SearchGeoNames#main.

@epotyom epotyom Feb 17, 2025

I'm a bit confused as to what your suggestion is then.

I think that changing NO_MORE_ORDS constant value to Long.MAX_VALUE or Long.MIN_VALUE solves this problem (after changing facet ord type from int to long). With min bucket width > 1, we will never have bucket ID that is equal to Long.MAX_VALUE or Long.MIN_VALUE.

So in the end, it would make sense to have two histogram implementations, one as a cutter, that densifies ordinals using a hash table that is similar to your branch, and another one as a raw collector manager for users who are interested in computing counts per histogram bucket but nothing else like this PR?

It seems reasonable. And we can look into changing facet ord type to long as a separate issue.

I'm running the geonames benchmark, that consists or running the IndexGeoNames#main then SearchGeoNames#main.

Thanks!

@stefanvodita stefanvodita left a comment

This PR caught my eye because I had been working on histogramming functionality as well with dynamic range facets. That implementation works post-collection and takes the number of buckets as input, then determines their size. If I understand correctly, here the input is the size of the interval that a bucket should cover, but we don't know ahead of time the number of buckets that will be produced.
Great to see more Lucene support for this sort of thing!

// We must not double-count values that divide to the same quotient since this returns doc
// counts as opposed to value counts.
if (quotient != prevQuotient) {
  counts.addTo(quotient, 1);
Contributor

I was a little confused by the usage of quotient up to this point. To me quotient suggests a counter, how much of something there is (more likely to be the value in a map than a key). I think the concept here is more akin to quantile, what we refer to as bucket elsewhere. Am I misunderstanding how this is meant to work?

Contributor Author

Thanks for the feedback, I tried to reuse terms from mathematical division but having it mixed up with "bucket" certainly doesn't help. You got it right, though I'm not sure if "quantile" would be a good fit. I can try to use "bucket" more consistently, wdyt?

Contributor

I understand what you meant by quotient now. I like bucket. It's easy to understand and assumes the least knowledge from a reader.


/**
* Compute a histogram of the distribution of the values of the given {@code field} according to
* the given {@code interval}. This configures a maximum number of buckets equal to the default of
Contributor

Can we be more explicit about what interval represents? I might call it bucketWidth.
Maybe the comment can say "Compute a histogram of the distribution of values of the given field, with buckets from 0 to interval, from interval to 2*interval, and so on, up to the configured maximum number of buckets."
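
For example (illustrative values only): with a bucket width of 100, bucket keys are computed as Math.floorDiv(value, 100), so:

long bucket = Math.floorDiv(250, 100); // 2  -> covers values in [200, 300)
long below = Math.floorDiv(-5, 100);   // -1 -> covers values in [-100, 0)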

jpountz commented Feb 11, 2025

I moved the code to the sandbox facet framework and applied suggestions, I hope I didn't miss any.

@stefanvodita stefanvodita left a comment

Thank you for making those changes @jpountz! I see it still needs a tidy run, but other than that, looks good to me.

@gsmiller

This looks good to me @jpountz. I think it makes sense to put this in sandbox, but I'd personally also be fine with leaving it where you initially had it. (I think this also highlights that it would be nice to get the sandbox faceting module out of sandbox and reconciled with the traditional faceting module sooner rather than later.)


leafMinBucket = Math.floorDiv(skipper.minValue(), bucketWidth);
long leafMaxBucket = Math.floorDiv(skipper.maxValue(), bucketWidth);
counts = new int[Math.toIntExact(leafMaxBucket - leafMinBucket + 1)];
Contributor

Maybe fall back to HistogramNaiveLeafCollector instead of throwing an ArithmeticException when the array size overflows an int, to avoid this hidden limitation?

Contributor

Ah never mind, you already have it.

@jpountz jpountz merged commit 2d422af into apache:main Feb 21, 2025
6 checks passed
@jpountz jpountz deleted the histogram_facet branch February 21, 2025 11:22
@jpountz jpountz added this to the 10.2.0 milestone Feb 21, 2025
jpountz added a commit that referenced this pull request Feb 21, 2025