-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks #137331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ES93BloomFilterStoredFieldsFormat for efficient field existence checks #137331
Conversation
…hecks Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.
tlrx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I left many comments but most of them are nits.
I think this is ready for review, we'll improve the format step by step.
server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
Outdated
Show resolved
Hide resolved
| "", | ||
| TestUtil.getDefaultCodec().storedFieldsFormat(), | ||
| ByteSizeValue.ofKb(2), | ||
| IdFieldMapper.NAME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could have a better coverage if we were randomly using some field names that are generated in BaseStoredFieldsFormatTestCase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that we'll broke some of the integration tests that rely on these fields actually being stored.
|
|
||
| private void finishBloomFilterStoredFormat() throws IOException { | ||
| BloomFilterMetadata bloomFilterMetadata = null; | ||
| if (bloomFilterFieldInfo != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is null, then we haven't seen an _id stored field. Does it still make sense to write an empty file then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, for merges maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's for merges. Once we get merges working, bloomFilterFieldInfo should be always not null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks!
| */ | ||
| public class ES93BloomFilterStoredFieldsFormat extends StoredFieldsFormat { | ||
| public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat"; | ||
| public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sfbf will be Special Francisco's Bloom Filter in my mind forever once this is merged.
| var success = false; | ||
| try { | ||
| bloomFilterFieldReader = BloomFilterFieldReader.open(directory, si, fn, context, segmentSuffix); | ||
| success = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to return the delegateReader even if bloomFilterFieldReader is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily, but I would expect that the bloom filter field reader will be always != null
Skip skipping the bloom filter field for all data types Compute the bloom filter size in a simpler way
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
|
Hi @fcofdez, I've created a changelog YAML for you. |
tlrx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| + " or less (rounded to nearest power of two)" | ||
| ); | ||
| } | ||
| this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE); | |
| this.bloomFilterSizeInBits = Math.toIntExact(Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit isn't addressed yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 1e4ea4c
| if (info.getName().equals(bloomFilterFieldName)) { | ||
| addToBloomFilter(info, value); | ||
| } else { | ||
| delegateWriter.writeField(info, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends on how this bloom stored field format will be registered. If it's registered only for the _id field through (an adjusted version of) PerFieldFormatSupplier, what I think we'll do, then I prefer to have all field behave the same as we know other types of fields won't be called.
| } | ||
|
|
||
| private void addToBloomFilter(FieldInfo info, BytesRef value) { | ||
| bloomFilterFieldInfo = info; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think there was an assertion on info.name being equal to the bloomFilterFieldName which I liked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 4293207
|
|
||
| private void finishBloomFilterStoredFormat() throws IOException { | ||
| BloomFilterMetadata bloomFilterMetadata = null; | ||
| if (bloomFilterFieldInfo != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks!
…z/elasticsearch into bloom-filter-stored-fields-format
kkrik-es
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martijnvg knows this part better for sure.
server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/codec/bloomfilter/BloomFilterHashFunctions.java
Show resolved
Hide resolved
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
| + " or less (rounded to nearest power of two)" | ||
| ); | ||
| } | ||
| this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit isn't addressed yet
| public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat"; | ||
| public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf"; | ||
| public static final String STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION = "sfbfm"; | ||
| private static final int VERSION_START = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think typically codec versions start at zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 605c93d
|
|
||
| @Override | ||
| public int merge(MergeState mergeState) throws IOException { | ||
| // Skip merging the bloom filter for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity what is the plan to support merging? Iirc ES87BloomFilterPostingsFormat supports this by rebuilding the bloom filter from the merged terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would just OR the bitsets, but that depends a bit on the sizing strategy. Otherwise we could just rebuild the bloom filter as you mentioned.
| this.bloomFilterData = bloomFilterData; | ||
| } | ||
|
|
||
| public boolean mayContainTerm(String field, BytesRef term) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: instead of makeing the BloomFilterFieldReader public, maybe keep it package protected / private like other classes in here and add an public interface for this specific method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 6550601
|
Thanks all for the reviews! |
Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.
Missing parts