Skip to content

Conversation

@fcofdez
Copy link
Contributor

@fcofdez fcofdez commented Oct 29, 2025

Introduces a new stored fields format that builds a Bloom filter for a specific field to enable fast existence checks without storing the field itself. This delegates storage of all other fields to another StoredFieldsFormat while maintaining the ability to quickly determine if a document might contain the target field.

Missing parts

  • Merge support for the bloom filter
  • Dynamic bloom filter sizing
  • More extensive testing

…hecks

Introduces a new stored fields format that builds a Bloom filter for a
specific field to enable fast existence checks without storing the field
itself. This delegates storage of all other fields to another
StoredFieldsFormat while maintaining the ability to quickly determine if
a document might contain the target field.
@fcofdez fcofdez requested a review from tlrx October 29, 2025 11:55
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I left many comments but most of them are nits.

I think this is ready for review, we'll improve the format step by step.

"",
TestUtil.getDefaultCodec().storedFieldsFormat(),
ByteSizeValue.ofKb(2),
IdFieldMapper.NAME
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could have a better coverage if we were randomly using some field names that are generated in BaseStoredFieldsFormatTestCase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that we'll broke some of the integration tests that rely on these fields actually being stored.


private void finishBloomFilterStoredFormat() throws IOException {
BloomFilterMetadata bloomFilterMetadata = null;
if (bloomFilterFieldInfo != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is null, then we haven't seen an _id stored field. Does it still make sense to write an empty file then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, for merges maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's for merges. Once we get merges working, bloomFilterFieldInfo should be always not null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

*/
public class ES93BloomFilterStoredFieldsFormat extends StoredFieldsFormat {
public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat";
public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sfbf will be Special Francisco's Bloom Filter in my mind forever once this is merged.

var success = false;
try {
bloomFilterFieldReader = BloomFilterFieldReader.open(directory, si, fn, context, segmentSuffix);
success = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return the delegateReader even if bloomFilterFieldReader is null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily, but I would expect that the bloom filter field reader will be always != null

Skip skipping the bloom filter field for all data types
Compute the bloom filter size in a simpler way
@fcofdez fcofdez requested a review from tlrx October 30, 2025 14:08
@fcofdez fcofdez marked this pull request as ready for review October 30, 2025 14:09
@fcofdez fcofdez removed the WIP label Oct 30, 2025
@fcofdez fcofdez requested review from kkrik-es and martijnvg October 30, 2025 14:09
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

+ " or less (rounded to nearest power of two)"
);
}
this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE);
this.bloomFilterSizeInBits = Math.toIntExact(Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit isn't addressed yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 1e4ea4c

if (info.getName().equals(bloomFilterFieldName)) {
addToBloomFilter(info, value);
} else {
delegateWriter.writeField(info, value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on how this bloom stored field format will be registered. If it's registered only for the _id field through (an adjusted version of) PerFieldFormatSupplier, what I think we'll do, then I prefer to have all field behave the same as we know other types of fields won't be called.

}

private void addToBloomFilter(FieldInfo info, BytesRef value) {
bloomFilterFieldInfo = info;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think there was an assertion on info.name being equal to the bloomFilterFieldName which I liked

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4293207


private void finishBloomFilterStoredFormat() throws IOException {
BloomFilterMetadata bloomFilterMetadata = null;
if (bloomFilterFieldInfo != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg knows this part better for sure.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

+ " or less (rounded to nearest power of two)"
);
}
this.bloomFilterSizeInBits = (int) Math.multiplyExact(closestPowerOfTwoBloomFilterSizeInBytes, Byte.SIZE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit isn't addressed yet

public static final String STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME = "ES93BloomFilterStoredFieldsFormat";
public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION = "sfbf";
public static final String STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION = "sfbfm";
private static final int VERSION_START = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think typically codec versions start at zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 605c93d


@Override
public int merge(MergeState mergeState) throws IOException {
// Skip merging the bloom filter for now
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity what is the plan to support merging? Iirc ES87BloomFilterPostingsFormat supports this by rebuilding the bloom filter from the merged terms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would just OR the bitsets, but that depends a bit on the sizing strategy. Otherwise we could just rebuild the bloom filter as you mentioned.

this.bloomFilterData = bloomFilterData;
}

public boolean mayContainTerm(String field, BytesRef term) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: instead of makeing the BloomFilterFieldReader public, maybe keep it package protected / private like other classes in here and add an public interface for this specific method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6550601

@fcofdez fcofdez requested a review from burqen November 3, 2025 08:28
@fcofdez fcofdez merged commit 2f6ea7a into elastic:main Nov 3, 2025
34 checks passed
@fcofdez
Copy link
Contributor Author

fcofdez commented Nov 3, 2025

Thanks all for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants