Remove vectors from _source transparently #130382
…ield loader in mapping
…it is set to true
Pinging @elastic/es-search-relevance (Team:Search Relevance)
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Hi @jimczi, I've created a changelog YAML for you.
While this touches a ton of files, I am honestly surprised by how little required changes there are 😅
We don't need anything special for reindex?
...rc/yamlRestTest/resources/rest-api-spec/test/search.vectors/240_source_synthetic_vectors.yml
@jimczi I think you need to add the flag here: Then you can add it to all the test runners (there are many of them; I would just add it to all the ones that have the IVF_FORMAT feature).
…o match the xcontent parsing
Yep, I left that for a follow-up. Just updated the description.
@jimczi do you want to handle rank_vectors and sparse_vector in a separate PR?
Otherwise this looks good.
server/src/main/java/org/elasticsearch/index/mapper/MappingLookup.java
I opened #130540 for the failures in https://buildkite.com/elastic/elasticsearch-pull-request/builds/78733. They're unrelated but legit ones.
Yep that's the plan
server/src/main/java/org/elasticsearch/index/IndexSettings.java
Single area changelog
I left a few comments, but from a mapping side in general LGTM.
server/src/main/java/org/elasticsearch/index/mapper/SourceFieldMapper.java
This change adds the support for synthetic vectors (added in #130382) in the rank_vectors field type.
This change adds the support for synthetic vectors (added in #130382) in the sparse_vector field type.
This change modifies reindex behavior to always include vector fields, even if the target index omits embeddings from _source. This prepares for scenarios where embeddings may be automatically excluded (#130382).
Summary
This PR introduces a new hybrid mode for the _source field that stores the original source without dense vector fields. The goal is to reduce storage overhead and improve performance, especially as vector sizes grow. The setting also affects whether vectors are returned in search and get APIs, which matters even for synthetic source, since reconstructing vectors from doc values can be expensive.
Background
Today, Elasticsearch supports two modes for _source:
- Stored: the original JSON document is stored as-is.
- Synthetic: _source is reconstructed from doc values at read time.
However, dense vector fields have become problematic:
- Storing them in _source is wasteful.
- The _source representation is often overly precise (double precision), which isn't needed for search/indexing.
While switching to full synthetic is an option, retrieving the full original _source (minus vectors) is often faster and more practical than pulling individual fields from individual storage when the number of metadata fields is high.
What This PR Adds
We’re introducing a hybrid source mode:
- It stores the original _source, minus any dense_vector fields.
Key Behavior
When enabled, dense_vector fields are excluded from _source at index time.
The setting also controls whether vectors are returned in search and get APIs. You can override this behavior at query time using the exclude_vectors option.
The setting is:
Motivation
This hybrid option is designed for use cases where users:
- … _source.
By making this setting the default for newly created indices in a follow-up, we can help users avoid surprises from the hidden cost of storing and returning high-dimensional vectors.
Benchmark Results
Benchmarking this PR against main using the openai rally track shows substantial improvements at the cost of a loss of precision when retrieving the original vectors:
Miscellaneous
Reindexing is not covered in this PR. Since it's one of the main use cases for returning vectors, the plan is for reindex to force the inclusion of vectors by default. This will be addressed in a follow-up, as this PR is already quite large.
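As a rough illustration of what asking for vectors back might look like from a client (for example, a tool that needs the embeddings in order to copy documents), the sketch below builds a search request body that uses the exclude_vectors option mentioned under Key Behavior. This is a minimal sketch under assumptions: placing exclude_vectors under _source in the request body, and the helper name build_search_body, are illustrative and not confirmed by this PR description. No request is sent; the code only builds and prints the JSON body.

```python
import json


def build_search_body(query, include_vectors):
    """Build a search request body as a plain dict.

    ASSUMPTION: the query-time override lives under "_source" as
    "exclude_vectors". The PR description names the option but does
    not show where it goes in the request.
    """
    return {
        "query": query,
        # Hypothetical placement of the override described in the PR.
        "_source": {"exclude_vectors": not include_vectors},
    }


# Ask for vectors to be returned even if the index excludes them by default.
body = build_search_body({"match_all": {}}, include_vectors=True)
print(json.dumps(body, indent=2))
```

Keeping the override explicit in the request, rather than relying on the index default, means a client that needs embeddings does not silently lose them if the default changes in a follow-up.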