Deduplicate min and max term in single-term FieldReader (#13618)
I noticed that single-term readers are an edge case, but not that uncommon, in
Elasticsearch heap dumps. It seems quite common to have a constant value for
some field across a complete segment (e.g. a version value that is repeated
endlessly in logs).
It seems simple enough to deduplicate the two terms here to save a couple of MB of heap.
original-brownbear authored Jul 31, 2024
1 parent ca098e6 commit 47650a4
Showing 2 changed files with 5 additions and 2 deletions.
@@ -200,6 +200,11 @@ public Lucene90BlockTreeTermsReader(PostingsReaderBase postingsReader, SegmentRe
     final int docCount = metaIn.readVInt();
     BytesRef minTerm = readBytesRef(metaIn);
     BytesRef maxTerm = readBytesRef(metaIn);
+    if (numTerms == 1) {
+      assert maxTerm.equals(minTerm);
+      // save heap for edge case of a single term only so min == max
+      maxTerm = minTerm;
+    }
     if (docCount < 0
         || docCount > state.segmentInfo.maxDoc()) { // #docs with field must be <= #docs
       throw new CorruptIndexException(
@@ -598,8 +598,6 @@ private void append(
   private final ByteBuffersDataOutput scratchBytes = ByteBuffersDataOutput.newResettableInstance();
   private final IntsRefBuilder scratchIntsRef = new IntsRefBuilder();
 
-  static final BytesRef EMPTY_BYTES_REF = new BytesRef();
-
   private static class StatsWriter {
 
     private final DataOutput out;
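The reader-side change boils down to aliasing two references when their bytes are equal. A minimal standalone sketch of the idea, using plain byte arrays and hypothetical names rather than Lucene's actual `BytesRef`/`FieldReader` classes:

```java
import java.util.Arrays;

public class SingleTermDedup {
    // Returns the max term, aliased to the min term when the field holds
    // exactly one term (in that case both terms are byte-for-byte equal,
    // so keeping two copies on the heap is pure waste).
    static byte[] dedupMaxTerm(long numTerms, byte[] minTerm, byte[] maxTerm) {
        if (numTerms == 1 && Arrays.equals(minTerm, maxTerm)) {
            return minTerm; // drop the duplicate copy
        }
        return maxTerm;
    }

    public static void main(String[] args) {
        // e.g. a constant version field repeated across a whole segment
        byte[] min = "8.13.0".getBytes();
        byte[] max = "8.13.0".getBytes(); // decoded separately: a distinct copy

        System.out.println(dedupMaxTerm(1, min, max) == min); // true: aliased
        System.out.println(dedupMaxTerm(2, min, max) == max); // true: left as-is
    }
}
```

The actual commit applies this at read time in the terms-dictionary metadata path, so every single-term field in every open segment retains one term object instead of two.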
