Skip to content

Support DataInput as source for StoredField #14213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Feb 19, 2025

Conversation

Tim-Brooks
Copy link
Contributor

@Tim-Brooks Tim-Brooks commented Feb 6, 2025

Allowing indexing stored-only StoredField directly from DataInput.
Introduces StoredFieldDataInput record to associate a length with
DataInput.

@Tim-Brooks
Copy link
Contributor Author

Tim-Brooks commented Feb 6, 2025

Summary

I am opening this proposed change to support writing a stored field from a byte source which does not require a contiguous array allocation. The reason I am proposing this is because there are times when we would like to store large stored fields and the requirement to provide a fully contiguous byte array can cause issues on smaller heaps. Particularly when the original data is already on heap in a non-contiguous source.

I took a stab at this using a DataInput as a source for the indexing. I went with this initial approach as it aligns with the fact that StoredFieldsWriter already supports DataInput (seemingly for merges).

I wrapped the DataInput in a record style class StoredFieldDataInput to associated length with it.

If this approach has support I will continue to refine the PR. In particular, I was uncertain whether Lucene would want DataInput to be fully supported in Field similar to stringValue, readerValue, doubleValue, etc, etc (with getters and setters). Or stick with what I did where it is only really supported in StoredField as a storedValue. Also I would be interested in what additional test classes I should modify with this type of ensure coverage.

Finally, would we want to modify StoredFieldsWriter#writeField(FieldInfo info, DataInput value, int length) to use this StoredFieldDataInput abstraction instead of also having the length int (if others support the introduction of this abstraction)?

Alternatives

DataInput is only one potential approach. I took it because there was already some work around DataInput with stored fields.

A ByteRef[] or ByteBuffer[] would also work for our use. DataInput has the downside of requiring a local intermediate buffer in ByteBuffersDataOutput to copy into direct bytes. BytesRef[] would work but then not allow direct memory as a source (doesn't matter to our use case but worth noting). ByteBuffer[] supports everything (direct, no intermediate buffer) but is theoretically a bit less flexible than DataInput which is a very flexible abstraction.

Any of these approaches are fine for my use case and I would be happy to work on whichever has the most support and consensus.

@jpountz
Copy link
Contributor

jpountz commented Feb 10, 2025

I went with this initial approach as it aligns with the fact that StoredFieldsWriter already supports DataInput (seemingly for merges).

Indeed, the change that introduced this capability had a similar motivation to yours: #12581.

I wrapped the DataInput in a record style class StoredFieldDataInput to associated length with it.

I'm curious if we should make it an actual record?

I was uncertain whether Lucene would want DataInput to be fully supported in Field similar to stringValue, readerValue, doubleValue, etc,

I like what you did better. I don't want to have to deal with the case when a value is provided through a DataInput and needs to be analyzed / indexed.

Finally, would we want to modify StoredFieldsWriter#writeField(FieldInfo info, DataInput value, int length) to use this StoredFieldDataInput abstraction

It's slightly awkward to have an argument called "value" that doesn't fully encapsulate all the information that is necessary to know what the actual value is, so I like your proposal to modify it.

DataInput is only one potential approach. I took it because there was already some work around DataInput with stored fields.

This looks like the best approach to me, I like the symmetry with the read API in StoredFieldVisitor.

In general the change makes sense to me, I'm curious to get @iverase 's opinion since he's worked on the same change on the read side (StoredFieldsVisitor). It is also consistent with the fact that analyzed fields support passing a java.io.Reader instead of a fully materialized String.

@Tim-Brooks
Copy link
Contributor Author

Tim-Brooks commented Feb 10, 2025

I'm curious if we should make it an actual record?

Haha probably. I actually did not check Lucene's language level when producing the PR.

I'll continue to refine this a bit more based on the premise that DataInput is the preferred abstraction while I wait for more feedback.

@iverase
Copy link
Contributor

iverase commented Feb 12, 2025

+1 I like the idea of encapsulating the DataInput and length inside StoredFieldDataInput, and yes, it gives symmetry between the read and the write side.

Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@iverase
Copy link
Contributor

iverase commented Feb 13, 2025

@Tim-Brooks Could you add an entry in CHANGES.txt? It should be under the 10.2 version, thanks!

@Tim-Brooks
Copy link
Contributor Author

@Tim-Brooks Could you add an entry in CHANGES.txt? It should be under the 10.2 version, thanks!

I made this change and added some more tests. Let me know if any additional tests make sense.

@iverase
Copy link
Contributor

iverase commented Feb 18, 2025

Thanks @Tim-Brooks, tests look good.

I will be merging this soon.

@iverase iverase merged commit f12e44b into apache:main Feb 19, 2025
6 checks passed
iverase pushed a commit that referenced this pull request Feb 19, 2025
 Allowing indexing stored-only StoredField directly from DataInput.
@iverase
Copy link
Contributor

iverase commented Feb 19, 2025

Thank you @Tim-Brooks !

@iverase iverase added this to the 10.2.0 milestone Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants