-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[CLN] schema: build default with config ef & default_knn_index, remove #document population in defaults #5775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
4bb47ac to
53142f3
Compare
This comment has been minimized.
This comment has been minimized.
b3975e9 to
83d8fb6
Compare
This comment has been minimized.
This comment has been minimized.
b428f20 to
8f2ee14
Compare
This comment has been minimized.
This comment has been minimized.
9ab2efd to
2f1e7bc
Compare
This comment has been minimized.
This comment has been minimized.
7ad31cf to
dcb63e8
Compare
dcb63e8 to
3cd61f8
Compare
3cd61f8 to
1c0b019
Compare
8ad206c to
abba8f9
Compare
1c0b019 to
463f83b
Compare
061ee06 to
fe04c8f
Compare
| // for both defaults and #embedding key | ||
| let mut new_schema = Schema::new_default(default_knn_index); | ||
|
|
||
| if collection_config.embedding_function.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using and_then is more idiomatic rust here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also this should be method instead of inlining here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not sure what this means
rust/types/src/collection_schema.rs
Outdated
| space: Some(hnsw_config.space.clone()), | ||
| embedding_function: collection_config.embedding_function.clone(), | ||
| source_key: Some(DOCUMENT_KEY.to_string()), // Default source key | ||
| source_key: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: will need to migrate existing collections for uniformity
rust/types/src/collection_schema.rs
Outdated
| ) -> Result<Schema, SchemaError> { | ||
| // Start with a default schema structure | ||
| let mut schema = Schema::new_default(KnnIndex::Spann); // Default to HNSW, will be overridden | ||
| let mut schema = Schema::new_default(default_knn_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does it matter what default you pass here since it is overridden below by config anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't but the code feels cleaner like this instead of creating a spann always
rust/types/src/collection_schema.rs
Outdated
| if let Some(embedding_types) = schema.keys.get_mut(EMBEDDING_KEY) { | ||
| if let Some(float_list) = &mut embedding_types.float_list { | ||
| if let Some(vector_index) = &mut float_list.vector_index { | ||
| let mut vector_config = vector_config.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need to clone vector_config here
| ) | ||
| .map_err(CollectionsWithSegmentsProviderError::InvalidSchema)?; | ||
| collection_and_segments_sysdb.collection.schema = Some(reconciled_schema); | ||
| if collection_and_segments_sysdb.collection.schema.is_none() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have we tested that schema is None and not {} for older collections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| .map_err(CollectionsWithSegmentsProviderError::InvalidSchema)?; | ||
| collection_and_segments_sysdb.collection.schema = Some(reconciled_schema); | ||
| if collection_and_segments_sysdb.collection.schema.is_none() { | ||
| collection_and_segments_sysdb.collection.schema = Some( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are always passing schema down to the reader then should we update the reader in distributed_hnsw.rs to use schema instead of collection config (with fallback to legacy metadata). Similar to local_hnsw.rs. Makes things uniform and easier to understand. (I understand that the current code is also correct, this is just a code design nit)
| ) | ||
| .map_err(SpannSegmentWriterError::InvalidSchema)?; | ||
| let schema = if let Some(schema) = collection.schema.as_ref() { | ||
| schema.clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why clone here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because i cant do collection.schema directly, i have to borrow the schema with as_ref
the other alternative would be to borrow collection.schema directly. this works, just want to confirm this is safe?
let schema = if let Some(schema) = &collection.schema {
schema
} else {
&Schema::convert_collection_config_to_schema(&collection.config, KnnIndex::Spann)
.map_err(SpannSegmentWriterError::InvalidSchema)?
};
rust/types/src/collection.rs
Outdated
| /// The read path needs to tolerate collections that only have a configuration persisted. | ||
| /// This helper hydrates `schema` from the stored configuration when needed, or regenerates | ||
| /// the configuration from the existing schema to keep both representations consistent. | ||
| pub fn reconcile_schema_for_read(&mut self, knn_index: KnnIndex) -> Result<(), SchemaError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why read path needs to take a default_knn config. That seems like unnecessary because the knn index type has already been decided and persisted
fe04c8f to
db3b74e
Compare
This comment has been minimized.
This comment has been minimized.
db3b74e to
f8cfb53
Compare
| Int *IntValueType `json:"int,omitempty"` | ||
| Float *FloatValueType `json:"float,omitempty"` | ||
| Boolean *BoolValueType `json:"boolean,omitempty"` | ||
| Boolean *BoolValueType `json:"bool,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The JSON tag has been changed from "boolean" to "bool" in the Go struct and corresponding test JSON strings. However, there's a critical compatibility issue to consider:
-
Breaking Change Impact: This JSON field name change will break compatibility with existing stored data or API clients that expect the
"boolean"field name. -
Migration Strategy: If this change is intentional to match Rust's naming convention, you'll need to ensure:
- Database migration to rename existing
booleanfields tobool - API versioning to handle both field names during transition
- Client SDK updates across all languages
- Database migration to rename existing
If this is purely for Go-Rust JSON compatibility (as mentioned in the PR description), consider:
// Option 1: Support both during transition
Boolean *BoolValueType `json:"bool,omitempty" legacy:"boolean,omitempty"`
// Option 2: Custom marshaling to handle both namesCould you clarify the migration plan for this breaking change?
Context for Agents
[**CriticalError**]
The JSON tag has been changed from `"boolean"` to `"bool"` in the Go struct and corresponding test JSON strings. However, there's a critical compatibility issue to consider:
1. **Breaking Change Impact**: This JSON field name change will break compatibility with existing stored data or API clients that expect the `"boolean"` field name.
2. **Migration Strategy**: If this change is intentional to match Rust's naming convention, you'll need to ensure:
- Database migration to rename existing `boolean` fields to `bool`
- API versioning to handle both field names during transition
- Client SDK updates across all languages
If this is purely for Go-Rust JSON compatibility (as mentioned in the PR description), consider:
```go
// Option 1: Support both during transition
Boolean *BoolValueType `json:"bool,omitempty" legacy:"boolean,omitempty"`
// Option 2: Custom marshaling to handle both names
```
Could you clarify the migration plan for this breaking change?
File: go/pkg/sysdb/coordinator/model/collection_configuration.go
Line: 252f8cfb53 to
2b76659
Compare
| collection_and_segments_sysdb.collection.schema = Some(reconciled_schema); | ||
| if collection_and_segments_sysdb.collection.schema.is_none() { | ||
| collection_and_segments_sysdb.collection.schema = Some( | ||
| Schema::convert_collection_config_to_schema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Critical schema handling gap: When collection.schema is None in the reader path, this code converts config to schema but doesn't persist the result. The conversion is done in-memory only, meaning:
- Inconsistent state: The reader gets the converted schema, but the database still has
schema = None - Performance impact: Schema conversion happens on every read operation instead of once
- Potential race conditions: Multiple readers could be converting the same config simultaneously
Consider adding a mechanism to persist the converted schema back to the database after conversion, or implement a write-through cache to avoid repeated conversions.
// After conversion, consider persisting:
if was_none_before_conversion {
// Persist the converted schema to avoid future conversions
self.persist_schema_if_needed(collection_id, &converted_schema).await?;
}Context for Agents
[**BestPractice**]
Critical schema handling gap: When `collection.schema` is `None` in the reader path, this code converts config to schema but doesn't persist the result. The conversion is done in-memory only, meaning:
1. **Inconsistent state**: The reader gets the converted schema, but the database still has `schema = None`
2. **Performance impact**: Schema conversion happens on every read operation instead of once
3. **Potential race conditions**: Multiple readers could be converting the same config simultaneously
Consider adding a mechanism to persist the converted schema back to the database after conversion, or implement a write-through cache to avoid repeated conversions.
```rust
// After conversion, consider persisting:
if was_none_before_conversion {
// Persist the converted schema to avoid future conversions
self.persist_schema_if_needed(collection_id, &converted_schema).await?;
}
```
File: rust/frontend/src/get_collection_with_segments_provider.rs
Line: 1902b76659 to
e0d238f
Compare
5fafa38 to
7c682b1
Compare
| .collection | ||
| .reconcile_schema_with_config(KnnIndex::Hnsw)?; | ||
| if !schema_previously_persisted { | ||
| collection_and_segments.collection.schema = Some( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Minor change - for local we init the schema here if None but for distributed we init in the writer? Would be good to make the behavior uniform across the two
rust/types/src/collection.rs
Outdated
|
|
||
| impl Collection { | ||
| /// Reconcile the collection schema and configuration, ensuring both are consistent. | ||
| pub fn reconcile_schema_with_config(&mut self, knn_index: KnnIndex) -> Result<(), SchemaError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method used anywhere now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this method. Only the in-memory frontend uses it but that should also be changed?
7c682b1 to
8671ea1
Compare
…ent population in defaults
8671ea1 to
c129192
Compare

Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
added unit tests for all 8 default cases (config hnsw or spann default, schema hnsw or spann default, default_knn_index), test thta #document does not populate for defaults, and embedding functions do.
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?