Support DataInput as source for StoredField #14213

Open
wants to merge 7 commits into base: main

Tim-Brooks
Contributor

Allows a StoredField to be created from a DataInput.

@Tim-Brooks
Contributor Author

Tim-Brooks commented Feb 6, 2025

Summary

This change proposes support for writing a stored field from a byte source that does not require a contiguous array allocation. We sometimes need to store large stored fields, and the requirement to provide a fully contiguous byte array can cause issues on smaller heaps, particularly when the original data is already on heap in a non-contiguous source.

I took a stab at this using a DataInput as the source for indexing. I went with this initial approach because StoredFieldsWriter already supports DataInput (seemingly for merges).

I wrapped the DataInput in a record-style class, StoredFieldDataInput, to associate a length with it.
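To make the idea of pairing a sequential byte source with its length concrete, here is a minimal sketch that uses only the JDK. It is not the code in this PR: `SizedInput`, `ofChunks`, and `drain` are hypothetical names, and `java.io.InputStream` stands in for Lucene's DataInput. The point is only that a consumer can copy exactly `length` bytes from a chunked source without the chunks ever being joined into one array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LengthTaggedSource {

    /** Hypothetical stand-in for the wrapper: a sequential source plus its byte count. */
    public static final class SizedInput {
        private final InputStream in;
        private final int length;

        public SizedInput(InputStream in, int length) {
            if (length < 0) {
                throw new IllegalArgumentException("length must be >= 0");
            }
            this.in = in;
            this.length = length;
        }

        public InputStream in() { return in; }
        public int length() { return length; }
    }

    /** Chains the chunks into one stream; no combined array is allocated. */
    public static SizedInput ofChunks(List<byte[]> chunks) {
        int total = 0;
        List<InputStream> streams = new ArrayList<>();
        for (byte[] chunk : chunks) {
            total = Math.addExact(total, chunk.length);
            streams.add(new ByteArrayInputStream(chunk));
        }
        return new SizedInput(
            new SequenceInputStream(Collections.enumeration(streams)), total);
    }

    /** Copies exactly source.length() bytes to out through a small fixed buffer. */
    public static void drain(SizedInput source, OutputStream out) {
        byte[] buffer = new byte[8]; // deliberately tiny to show chunked copying
        int remaining = source.length();
        try {
            while (remaining > 0) {
                int n = source.in().read(buffer, 0, Math.min(buffer.length, remaining));
                if (n < 0) {
                    throw new IllegalStateException("source ended before declared length");
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Carrying the length alongside the source lets the consumer know how many bytes to expect up front, which a plain sequential reader cannot report on its own.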

If this approach has support, I will continue to refine the PR. In particular, I was uncertain whether Lucene would want DataInput to be fully supported in Field, similar to stringValue, readerValue, doubleValue, etc. (with getters and setters), or whether to stick with what I did, where it is only really supported in StoredField as a storedValue. I would also be interested in which additional test classes I should modify to ensure coverage of this type.

Finally, would we want to modify StoredFieldsWriter#writeField(FieldInfo info, DataInput value, int length) to use this StoredFieldDataInput abstraction instead of also having the length int (if others support the introduction of this abstraction)?

Alternatives

DataInput is only one potential approach. I took it because there was already some work around DataInput with stored fields.

A BytesRef[] or ByteBuffer[] would also work for our use. DataInput has the downside of requiring a local intermediate buffer in ByteBuffersDataOutput to copy into direct bytes. BytesRef[] would work but would not allow direct memory as a source (this doesn't matter to our use case, but it is worth noting). ByteBuffer[] supports everything (direct memory, no intermediate buffer) but is theoretically a bit less flexible than DataInput, which is a very flexible abstraction.
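The intermediate-buffer trade-off can be illustrated with plain JDK types. This is a sketch under assumptions, not Lucene code: `CopyPaths` and its methods are hypothetical, `InputStream` stands in for DataInput, and a direct `ByteBuffer` stands in for the destination. A sequential source has to be read into a heap scratch array before the bytes can be put into a direct buffer, while an array of buffers can be bulk-copied as is.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

final class CopyPaths {

    /** Sequential source to buffer: every byte passes through a heap scratch array. */
    public static void copyFromStream(InputStream in, int length, ByteBuffer dest) {
        byte[] scratch = new byte[Math.min(length, 4096)]; // the intermediate buffer
        int remaining = length;
        try {
            while (remaining > 0) {
                int n = in.read(scratch, 0, Math.min(scratch.length, remaining));
                if (n < 0) {
                    throw new IllegalStateException("source ended before declared length");
                }
                dest.put(scratch, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Buffer array to buffer: bulk put per part, no scratch array involved. */
    public static void copyFromBuffers(ByteBuffer[] parts, ByteBuffer dest) {
        for (ByteBuffer part : parts) {
            // duplicate() shares the content but leaves the caller's position untouched
            dest.put(part.duplicate());
        }
    }
}
```

Both paths produce the same bytes in the destination; the difference is only whether a scratch array sits in the middle, and `copyFromBuffers` accepts heap or direct parts alike.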

Any of these approaches are fine for my use case and I would be happy to work on whichever has the most support and consensus.
