
Push filtering DB scans into SQL on the auto-index path #97

Open

voxmenthe wants to merge 2 commits into lightonai:main from voxmenthe:perf/skip-full-table-loads-on-search

Conversation


voxmenthe commented May 6, 2026

Summary

colgrep runs try_index on every search. Two helpers on that path were materializing every metadata row of the entire index just to inspect a single column:

  • cleanup_orphaned_entries (colgrep/src/index/mod.rs) called filtering::get(index_path, None, &[], None) solely to collect the distinct file paths from the result. That executes SELECT * FROM METADATA ORDER BY _subset_ and JSON-deserializes every column — including code — for every row.
  • reconcile_document_counts (colgrep/src/index/mod.rs) did the same full-table dump, then iterated in Rust to find rows whose _subset_ exceeded the vector index document count.

On a ~95 k-unit index this is tens of MB of deserialization on every query, which dominates latency for otherwise-quick searches and is a frequent cause of "colgrep hangs after the desync warning" reports.
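For concreteness, the old pattern amounted to roughly the following (a reconstruction from the description above, not the actual colgrep source; the row type returned by filtering::get is assumed here to be JSON objects):

```rust
// Reconstructed sketch of the OLD pattern (not the actual colgrep source).
// Assumes filtering::get returned one serde_json::Value per METADATA row,
// i.e. `SELECT *` deserialized column-by-column, including the large
// `code` column that neither caller needed.
use std::collections::HashSet;

use serde_json::Value;

fn distinct_files_old(rows: &[Value]) -> HashSet<String> {
    // cleanup_orphaned_entries: full-table load just to read one column.
    rows.iter()
        .filter_map(|row| row.get("file")?.as_str().map(str::to_owned))
        .collect()
}

fn orphan_ids_old(rows: &[Value], vector_count: u64) -> Vec<u64> {
    // reconcile_document_counts: full-table load, then a Rust-side filter
    // that SQL could have evaluated directly.
    rows.iter()
        .filter_map(|row| row.get("_subset_")?.as_u64().filter(|&id| id >= vector_count))
        .collect()
}

fn main() {
    let rows = vec![
        serde_json::json!({"file": "src/a.rs", "_subset_": 0, "code": "..."}),
        serde_json::json!({"file": "src/a.rs", "_subset_": 1, "code": "..."}),
        serde_json::json!({"file": "src/b.rs", "_subset_": 2, "code": "..."}),
    ];
    println!("{:?}", distinct_files_old(&rows));
    println!("{:?}", orphan_ids_old(&rows, 2));
}
```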

What changed

  • New helper filtering::get_distinct_strings(index_path, column) in next-plaid/src/filtering.rs. Runs a single SELECT DISTINCT "<col>" FROM METADATA WHERE "<col>" IS NOT NULL, validates the column name as a safe identifier, and returns it as Vec<String>. Returns an empty vector when the DB or column is absent (matching the previous lenient behavior).
  • cleanup_orphaned_entries now uses get_distinct_strings(index_path, "file") instead of loading every row.
  • reconcile_document_counts now uses the existing filtering::where_condition(index_path, "_subset_ >= ?", &[json!(vector_count)]) to fetch orphan IDs directly. The reconciliation semantics are unchanged.

Both reads now transfer a few KB instead of the whole METADATA table.
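A minimal sketch of what such a helper can look like, using rusqlite (an assumption; the actual code in next-plaid/src/filtering.rs may use a different SQLite layer, DB file name, and error type, and it is not visible from this description whether an invalid identifier surfaces as an error or an empty result):

```rust
// Hedged sketch of the new helper, not the real implementation.
// The "metadata.db" file name is an assumption.
use std::path::Path;

use anyhow::{bail, Result};
use rusqlite::Connection;

/// Returns the distinct non-NULL values of `column` from METADATA,
/// or an empty Vec when the DB or column is absent (lenient, matching
/// the previous behavior).
pub fn get_distinct_strings(index_path: &Path, column: &str) -> Result<Vec<String>> {
    // Validate the column as a safe identifier so it can be spliced into
    // the quoted SQL below without injection risk.
    let safe = !column.is_empty()
        && column.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if !safe {
        bail!("invalid column name: {column:?}");
    }
    // Lenient on a missing DB file.
    let db = index_path.join("metadata.db");
    if !db.exists() {
        return Ok(Vec::new());
    }
    let conn = Connection::open(&db)?;
    let sql =
        format!(r#"SELECT DISTINCT "{column}" FROM METADATA WHERE "{column}" IS NOT NULL"#);
    // An unknown column makes prepare() fail; treat that as "no values".
    let Ok(mut stmt) = conn.prepare(&sql) else {
        return Ok(Vec::new());
    };
    let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
    Ok(rows.filter_map(|r| r.ok()).collect())
}
```

The call-site change is then a one-liner: cleanup_orphaned_entries asks for get_distinct_strings(index_path, "file") instead of deserializing every row.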

Performance

Reproducible benchmark added at next-plaid/examples/bench_metadata_filter.rs. It builds a synthetic METADATA table at varying sizes and times the OLD path (full-table filtering::get plus Rust-side filter, exactly what the two functions were doing) against the NEW SQL pushdown. Best-of-5 timings on a 2024 MacBook (release build), with 600 B of synthetic code per row and 8 units per file:

  n_rows | op                       | OLD time | NEW time | speedup | bytes (OLD → NEW)
  10 k   | distinct files (cleanup) | 35.7 ms  | 3.2 ms   | 11.2×   | 10.21 MB → 33.93 KB
  10 k   | orphan ids (reconcile)   | 34.7 ms  | 664.3 µs | 52.2×   | 10.21 MB → 6 B
  50 k   | distinct files (cleanup) | 181.7 ms | 17.0 ms  | 10.7×   | 51.29 MB → 174.00 KB
  50 k   | orphan ids (reconcile)   | 172.7 ms | 542.4 µs | 318.3×  | 51.29 MB → 31 B
  95 k   | distinct files (cleanup) | 341.8 ms | 32.0 ms  | 10.7×   | 97.53 MB → 333.43 KB
  95 k   | orphan ids (reconcile)   | 330.5 ms | 586.0 µs | 564.0×  | 97.53 MB → 61 B
  200 k  | distinct files (cleanup) | 718.9 ms | 73.2 ms  | 9.8×    | 205.94 MB → 713.94 KB
  200 k  | orphan ids (reconcile)   | 696.6 ms | 592.0 µs | 1176.8× | 205.94 MB → 141 B

The bench asserts result-set equivalence between OLD and NEW before timing, so it doubles as a regression check. Reproduce with:

  cargo run --release --example bench_metadata_filter -p next-plaid -- --quick
  cargo run --release --example bench_metadata_filter -p next-plaid -- \
      --sizes 10000,50000,95000,200000 --iters 5

cleanup_orphaned_entries runs on every search, so its 300 ms+ savings (at 95 k rows) hit every query. reconcile_document_counts runs only when the index/DB counts disagree, but that is exactly the state where users observe minute-long stalls.
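The harness shape is the usual best-of-N loop with an up-front equivalence check; a simplified, self-contained sketch follows (reconstructed here, not the bench source; the real example additionally builds the synthetic METADATA table and parses the --sizes/--iters flags):

```rust
// Simplified shape of a best-of-N micro-benchmark with an equivalence
// assertion, as described for bench_metadata_filter.rs.
use std::time::{Duration, Instant};

/// Times `f` `iters` times and returns the best (minimum) run, the usual
/// noise-resistant choice for micro-benchmarks, plus the last output.
fn best_of<T>(iters: u32, mut f: impl FnMut() -> T) -> (Duration, T) {
    let mut best: Option<(Duration, T)> = None;
    for _ in 0..iters {
        let start = Instant::now();
        let out = f();
        let elapsed = start.elapsed();
        if best.as_ref().map_or(true, |(d, _)| elapsed < *d) {
            best = Some((elapsed, out));
        }
    }
    best.expect("iters must be > 0")
}

fn main() {
    // Stand-ins for the OLD full-table path and the NEW SQL pushdown;
    // in the real bench these run the actual filtering code paths.
    let old = || (0u64..95_000).filter(|&id| id >= 94_990).collect::<Vec<_>>();
    let new = || (94_990u64..95_000).collect::<Vec<_>>();

    let (t_old, r_old) = best_of(5, old);
    let (t_new, r_new) = best_of(5, new);
    // Equivalence check before reporting, so the bench doubles as a
    // regression test for the pushdown.
    assert_eq!(r_old, r_new, "OLD and NEW must return identical results");
    println!("old {t_old:?} vs new {t_new:?}");
}
```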

Tests

  • New unit tests in next-plaid/src/filtering.rs:
    • test_get_distinct_strings_returns_unique_values
    • test_get_distinct_strings_missing_db_returns_empty
    • test_get_distinct_strings_unknown_column_returns_empty
    • test_get_distinct_strings_rejects_invalid_column_name
  • cargo test -p next-plaid --lib filtering:: — 44 passed (4 new, 40 existing)
  • cargo test -p colgrep --lib — 511 passed
  • cargo check -p next-plaid -p colgrep — clean

The benchmark example is in a separate commit so it can be dropped cleanly if you'd rather not add example binaries to the crate.

Test plan

  • New helper covered by unit tests (happy path, missing DB, unknown column, invalid identifier)
  • Existing filtering test suite still passes
  • Full colgrep lib test suite still passes
  • cargo check passes for both crates
  • Bench asserts OLD/NEW result equivalence before timing

`colgrep` calls `try_index` on every search. Two helpers on that path
materialized every metadata row of the entire index just to inspect a
single column:

- `cleanup_orphaned_entries` loaded all rows via `filtering::get` only to
  collect the distinct `file` paths.
- `reconcile_document_counts` loaded all rows only to find `_subset_`
  IDs >= the vector index document count.

On indexes with ~100k code units this is tens of megabytes of JSON
deserialization on every query, which dominates search latency on
otherwise quick lookups.

Add `filtering::get_distinct_strings(index_path, column)` and use it
for the file enumeration. Use the existing `filtering::where_condition`
to push the orphan-id filter into SQL. Both pulls now read at most a
few KB instead of the whole METADATA table.

Includes tests for the new helper covering the happy path, missing DB,
unknown column, and invalid-identifier rejection.
raphaelsty (Collaborator) commented May 6, 2026

Hi @voxmenthe, this is a cool PR. I'd simply like to see a benchmark before merging, please: how much time / memory does it save?

Reproducible benchmark for the two pushdowns. Builds a synthetic
METADATA table and times the OLD full-table load + Rust-side filter
against the NEW SQL pushdown for both `cleanup_orphaned_entries` and
`reconcile_document_counts`.

  cargo run --release --example bench_metadata_filter -p next-plaid -- --quick
  cargo run --release --example bench_metadata_filter -p next-plaid -- \
      --sizes 10000,50000,95000,200000 --iters 5

Sample results on a 2024 MacBook (release build):

  rows  | distinct files speedup | orphan ids speedup | bytes (OLD -> NEW)
  10k   | 11x                    | 52x                | 10 MB  -> 34 KB / 6 B
  50k   | 10x                    | 318x               | 51 MB  -> 174 KB / 31 B
  95k   | 10x                    | 564x               | 97 MB  -> 333 KB / 61 B
  200k  | 9.8x                   | 1177x              | 206 MB -> 714 KB / 141 B

The OLD path materializes the full METADATA table on every search (it ran
in `cleanup_orphaned_entries`, plus again in `reconcile_document_counts`
when desynced). The NEW path reads at most a few hundred KB.
voxmenthe (Author) commented:

> Hi @voxmenthe, this is a cool PR. I'd simply like to see a benchmark before merging, please: how much time / memory does it save?

Added to the PR, thanks!
