Push filtering DB scans into SQL on the auto-index path #97
Open
voxmenthe wants to merge 2 commits into lightonai:main from
Conversation
`colgrep` calls `try_index` on every search. Two helpers on that path materialized every metadata row of the entire index just to inspect a single column:

- `cleanup_orphaned_entries` loaded all rows via `filtering::get` only to collect the distinct `file` paths.
- `reconcile_document_counts` loaded all rows only to find `_subset_` IDs >= the vector index document count.

On indexes with ~100k code units this is tens of megabytes of JSON deserialization on every query, which dominates search latency on otherwise quick lookups.

Add `filtering::get_distinct_strings(index_path, column)` and use it for the file enumeration. Use the existing `filtering::where_condition` to push the orphan-id filter into SQL (both call sites are sketched below). Both pulls now read at most a few KB instead of the whole METADATA table.

Includes tests for the new helper covering the happy path, missing DB, unknown column, and invalid-identifier rejection.
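For concreteness, a minimal sketch of the call-site change, assuming the call shapes quoted in this PR (`filtering::get`, `get_distinct_strings`, `where_condition`); the row representation and return/error types below are assumptions, not the PR's actual code:

```rust
// Illustrative only: the call shapes come from the PR text, everything else is assumed.

// cleanup_orphaned_entries
// OLD: load every METADATA row, then deduplicate the `file` column in Rust.
let rows = filtering::get(index_path, None, &[], None)?; // full-table scan
let files: std::collections::HashSet<&str> = rows
    .iter()
    .filter_map(|row| row["file"].as_str()) // assumes JSON-like rows
    .collect();

// NEW: a single SELECT DISTINCT that returns only the `file` column.
let files = filtering::get_distinct_strings(index_path, "file");

// reconcile_document_counts
// OLD: the same full-table load, filtered on `_subset_` in Rust.
// NEW: push the comparison into SQL via the existing helper.
let orphans = filtering::where_condition(index_path, "_subset_ >= ?", &[json!(vector_count)])?;
```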
Collaborator
Hi @voxmenthe, this is a cool PR. To merge it I'd simply like to see a benchmark: how much time and memory does it save?
Reproducible benchmark for the two pushdowns. It builds a synthetic METADATA table and times the OLD full-table load + Rust-side filter against the NEW SQL pushdown for both `cleanup_orphaned_entries` and `reconcile_document_counts`. Run it with:

    cargo run --release --example bench_metadata_filter -p next-plaid -- --quick
    cargo run --release --example bench_metadata_filter -p next-plaid -- \
        --sizes 10000,50000,95000,200000 --iters 5
Sample results on a 2024 MacBook (release build):
| rows | distinct files speedup | orphan ids speedup | bytes (OLD -> NEW)       |
|------|------------------------|--------------------|--------------------------|
| 10k  | 11x                    | 52x                | 10 MB -> 34 KB / 6 B     |
| 50k  | 10x                    | 318x               | 51 MB -> 174 KB / 31 B   |
| 95k  | 10x                    | 564x               | 97 MB -> 333 KB / 61 B   |
| 200k | 9.8x                   | 1177x              | 206 MB -> 714 KB / 141 B |
The OLD path materializes the full METADATA table on every search (it ran
in `cleanup_orphaned_entries`, plus again in `reconcile_document_counts`
when desynced). The NEW path reads at most a few hundred KB.
Author
Added to the PR, thanks!
Summary
`colgrep` runs `try_index` on every search. Two helpers on that path were materializing every metadata row of the entire index just to inspect a single column:

- `cleanup_orphaned_entries` (`colgrep/src/index/mod.rs`) called `filtering::get(index_path, None, &[], None)` solely to collect the distinct `file` paths from the result. That executes `SELECT * FROM METADATA ORDER BY _subset_` and JSON-deserializes every column, including `code`, for every row.
- `reconcile_document_counts` (`colgrep/src/index/mod.rs`) did the same full-table dump, then iterated in Rust to find rows whose `_subset_` exceeded the vector index document count.

On a ~95k-unit index this is tens of MB of deserialization on every query, which dominates latency for otherwise-quick searches and is a frequent cause of "colgrep hangs after the desync warning" reports.
What changed
- New helper `filtering::get_distinct_strings(index_path, column)` in `next-plaid/src/filtering.rs`. It runs a single `SELECT DISTINCT "<col>" FROM METADATA WHERE "<col>" IS NOT NULL`, validates the column name as a safe identifier, and returns the result as a `Vec<String>`. It returns an empty vector when the DB or column is absent, matching the previous lenient behavior (a sketch of such a helper follows this list).
- `cleanup_orphaned_entries` now uses `get_distinct_strings(index_path, "file")` instead of loading every row.
- `reconcile_document_counts` now uses the existing `filtering::where_condition(index_path, "_subset_ >= ?", &[json!(vector_count)])` to fetch orphan IDs directly. The reconciliation semantics are unchanged.

Both reads now transfer a few KB instead of the whole METADATA table.
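A minimal, self-contained sketch of what such a helper can look like, assuming the metadata store is a SQLite file under the index path and that rusqlite is used directly; the file name, error handling, and exact signature in `next-plaid/src/filtering.rs` may differ:

```rust
use std::path::Path;

use rusqlite::Connection;

/// Sketch of a SELECT DISTINCT helper. The SQLite file name, error strategy,
/// and exact signature in next-plaid are assumptions here.
fn get_distinct_strings(index_path: &Path, column: &str) -> Vec<String> {
    // Column names cannot be bound as SQL parameters, so only allow plain
    // identifiers before interpolating the name into the statement.
    let is_safe = !column.is_empty()
        && column.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if !is_safe {
        return Vec::new();
    }

    let db_path = index_path.join("metadata.db"); // assumed file name
    if !db_path.exists() {
        return Vec::new(); // missing DB degrades to "no values"
    }
    let Ok(conn) = Connection::open(&db_path) else {
        return Vec::new();
    };

    let sql = format!(
        r#"SELECT DISTINCT "{column}" FROM METADATA WHERE "{column}" IS NOT NULL"#
    );
    let Ok(mut stmt) = conn.prepare(&sql) else {
        return Vec::new(); // unknown column or table also degrades to empty
    };
    stmt.query_map([], |row| row.get::<_, String>(0))
        .map(|rows| rows.filter_map(Result::ok).collect())
        .unwrap_or_default()
}
```

The up-front identifier check is what makes the string interpolation safe: SQLite cannot bind column names as parameters, so the name has to go into the statement text.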
Performance
A reproducible benchmark is added at `next-plaid/examples/bench_metadata_filter.rs`. It builds a synthetic METADATA table at varying sizes and times the OLD path (full-table `filtering::get` plus Rust-side filter, exactly what the two functions were doing) against the NEW SQL pushdown. The numbers in the comment above are best-of-5 on a 2024 MacBook (release build), with 600 B of synthetic `code` per row and 8 units per file. The bench asserts result-set equivalence between OLD and NEW before timing, so it doubles as a regression check; reproduce with the commands listed above. A self-contained sketch of the comparison's shape follows.
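The example file itself is not reproduced on this page; the sketch below is an assumption rather than the PR's code, and only illustrates the shape of such a comparison (synthetic table, full-scan baseline vs. SELECT DISTINCT pushdown) using rusqlite and an in-memory database:

```rust
use std::collections::BTreeSet;
use std::time::Instant;

use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    // Synthetic METADATA table: `rows` units spread over rows / 8 files,
    // with ~600 bytes of fake `code` per row, mirroring the numbers above.
    let rows = 95_000;
    let conn = Connection::open_in_memory()?;
    conn.execute_batch("CREATE TABLE METADATA (_subset_ INTEGER, file TEXT, code TEXT);")?;
    let mut insert = conn.prepare("INSERT INTO METADATA VALUES (?1, ?2, ?3)")?;
    for i in 0..rows {
        insert.execute(rusqlite::params![i, format!("src/file_{}.rs", i / 8), "x".repeat(600)])?;
    }
    drop(insert);

    // OLD shape: SELECT * over the whole table, deduplicate `file` in Rust.
    let t = Instant::now();
    let mut stmt = conn.prepare("SELECT * FROM METADATA ORDER BY _subset_")?;
    let old: BTreeSet<String> = stmt
        .query_map([], |row| row.get::<_, String>(1))?
        .collect::<Result<_, _>>()?;
    let old_time = t.elapsed();

    // NEW shape: let SQLite deduplicate and return only the one column.
    let t = Instant::now();
    let mut stmt = conn.prepare("SELECT DISTINCT file FROM METADATA WHERE file IS NOT NULL")?;
    let new: BTreeSet<String> = stmt
        .query_map([], |row| row.get::<_, String>(0))?
        .collect::<Result<_, _>>()?;
    let new_time = t.elapsed();

    assert_eq!(old, new); // result-set equivalence, as the real bench checks before timing
    println!("distinct files  OLD: {old_time:?}  NEW: {new_time:?}");
    Ok(())
}
```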
`cleanup_orphaned_entries` runs on every search, so its 300 ms+ savings (at 95k rows) hit every query. `reconcile_document_counts` runs only when the index/DB counts disagree, but that is exactly the state where users observe minute-long stalls.

Tests
New tests in `next-plaid/src/filtering.rs` (a sketch of the happy-path case follows below):

- `test_get_distinct_strings_returns_unique_values`
- `test_get_distinct_strings_missing_db_returns_empty`
- `test_get_distinct_strings_unknown_column_returns_empty`
- `test_get_distinct_strings_rejects_invalid_column_name`

Results:

- `cargo test -p next-plaid --lib filtering::`: 44 passed (4 new, 40 existing)
- `cargo test -p colgrep --lib`: 511 passed
- `cargo check -p next-plaid -p colgrep`: clean

The benchmark example is in a separate commit so it can be dropped cleanly if you'd rather not add example binaries to the crate.
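For reference, a hypothetical shape for the happy-path test, reusing the rusqlite-based helper sketched earlier and a throwaway directory; the PR's actual test code is not reproduced here:

```rust
// Hypothetical shape of the happy-path test; the PR's real tests are not shown on this page.
#[test]
fn get_distinct_strings_returns_unique_values() {
    // Fresh throwaway index directory with a small METADATA table.
    let dir = std::env::temp_dir().join("get_distinct_strings_happy_path");
    let _ = std::fs::remove_dir_all(&dir);
    std::fs::create_dir_all(&dir).unwrap();
    let conn = rusqlite::Connection::open(dir.join("metadata.db")).unwrap();
    conn.execute_batch(
        "CREATE TABLE METADATA (_subset_ INTEGER, file TEXT);
         INSERT INTO METADATA VALUES (0, 'a.rs'), (1, 'a.rs'), (2, 'b.rs'), (3, NULL);",
    )
    .unwrap();

    // Duplicates collapse, NULLs are skipped.
    let mut files = get_distinct_strings(&dir, "file");
    files.sort();
    assert_eq!(files, vec!["a.rs", "b.rs"]);
}
```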
Test plan
`cargo check` passes for both crates