
Push filtering DB scans into SQL on the auto-index path #97

Open

voxmenthe wants to merge 2 commits into lightonai:main from voxmenthe:perf/skip-full-table-loads-on-search

Conversation


voxmenthe commented May 6, 2026

Summary

colgrep runs try_index on every search. Two helpers on that path were materializing every metadata row of the entire index just to inspect a single column:

  • cleanup_orphaned_entries (colgrep/src/index/mod.rs) called filtering::get(index_path, None, &[], None) solely to collect the distinct file paths from the result. That executes SELECT * FROM METADATA ORDER BY _subset_ and JSON-deserializes every column — including code — for every row.
  • reconcile_document_counts (colgrep/src/index/mod.rs) did the same full-table dump, then iterated in Rust to find rows whose _subset_ exceeded the vector index document count.

On a ~95 k-unit index this is tens of MB of deserialization on every query, which dominates latency for otherwise-quick searches and is a frequent cause of "colgrep hangs after the desync warning" reports.
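For concreteness, the old pattern amounted to roughly the following (a reconstruction from the description above, not the actual colgrep source; the row type returned by filtering::get is assumed here to be JSON objects):

```rust
// Reconstructed sketch of the OLD pattern (not the actual colgrep source).
// Assumes filtering::get returned one serde_json::Value per METADATA row,
// i.e. `SELECT *` deserialized column-by-column, including the large
// `code` column that neither caller needed.
use std::collections::HashSet;

use serde_json::Value;

fn distinct_files_old(rows: &[Value]) -> HashSet<String> {
    // cleanup_orphaned_entries: full-table load just to read one column.
    rows.iter()
        .filter_map(|row| row.get("file")?.as_str().map(str::to_owned))
        .collect()
}

fn orphan_ids_old(rows: &[Value], vector_count: u64) -> Vec<u64> {
    // reconcile_document_counts: full-table load, then a Rust-side filter
    // that SQL could have evaluated directly.
    rows.iter()
        .filter_map(|row| row.get("_subset_")?.as_u64().filter(|&id| id >= vector_count))
        .collect()
}

fn main() {
    let rows = vec![
        serde_json::json!({"file": "src/a.rs", "_subset_": 0, "code": "..."}),
        serde_json::json!({"file": "src/a.rs", "_subset_": 1, "code": "..."}),
        serde_json::json!({"file": "src/b.rs", "_subset_": 2, "code": "..."}),
    ];
    println!("{:?}", distinct_files_old(&rows));
    println!("{:?}", orphan_ids_old(&rows, 2));
}
```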

What changed

  • New helper filtering::get_distinct_strings(index_path, column) in next-plaid/src/filtering.rs. Runs a single SELECT DISTINCT "<col>" FROM METADATA WHERE "<col>" IS NOT NULL, validates the column name as a safe identifier, and returns it as Vec<String>. Returns an empty vector when the DB or column is absent (matching the previous lenient behavior).
  • cleanup_orphaned_entries now uses get_distinct_strings(index_path, "file") instead of loading every row.
  • reconcile_document_counts now uses the existing filtering::where_condition(index_path, "_subset_ >= ?", &[json!(vector_count)]) to fetch orphan IDs directly. The reconciliation semantics are unchanged.

Both reads now transfer a few KB instead of the whole METADATA table.
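A minimal sketch of what such a helper can look like, using rusqlite (an assumption; the actual code in next-plaid/src/filtering.rs may use a different SQLite layer, DB file name, and error type, and it is not visible from this description whether an invalid identifier surfaces as an error or an empty result):

```rust
// Hedged sketch of the new helper, not the real implementation.
// The "metadata.db" file name is an assumption.
use std::path::Path;

use anyhow::{bail, Result};
use rusqlite::Connection;

/// Returns the distinct non-NULL values of `column` from METADATA,
/// or an empty Vec when the DB or column is absent (lenient, matching
/// the previous behavior).
pub fn get_distinct_strings(index_path: &Path, column: &str) -> Result<Vec<String>> {
    // Validate the column as a safe identifier so it can be spliced into
    // the quoted SQL below without injection risk.
    let safe = !column.is_empty()
        && column.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if !safe {
        bail!("invalid column name: {column:?}");
    }
    // Lenient on a missing DB file.
    let db = index_path.join("metadata.db");
    if !db.exists() {
        return Ok(Vec::new());
    }
    let conn = Connection::open(&db)?;
    let sql =
        format!(r#"SELECT DISTINCT "{column}" FROM METADATA WHERE "{column}" IS NOT NULL"#);
    // An unknown column makes prepare() fail; treat that as "no values".
    let Ok(mut stmt) = conn.prepare(&sql) else {
        return Ok(Vec::new());
    };
    let rows = stmt.query_map([], |row| row.get::<_, String>(0))?;
    Ok(rows.filter_map(|r| r.ok()).collect())
}
```

The call-site change is then a one-liner: cleanup_orphaned_entries asks for get_distinct_strings(index_path, "file") instead of deserializing every row.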

Performance

Reproducible benchmark added at next-plaid/examples/bench_metadata_filter.rs. It builds a synthetic METADATA table at varying sizes and times the OLD path (full-table filtering::get plus Rust-side filter, exactly what the two functions were doing) against the NEW SQL pushdown. Best-of-5 timings on a 2024 MacBook (release build), with 600 B of synthetic code per row and 8 units per file:

  n_rows | op                       | OLD time | NEW time | speedup | bytes (OLD → NEW)
  10 k   | distinct files (cleanup) | 35.7 ms  | 3.2 ms   | 11.2×   | 10.21 MB → 33.93 KB
  10 k   | orphan ids (reconcile)   | 34.7 ms  | 664.3 µs | 52.2×   | 10.21 MB → 6 B
  50 k   | distinct files (cleanup) | 181.7 ms | 17.0 ms  | 10.7×   | 51.29 MB → 174.00 KB
  50 k   | orphan ids (reconcile)   | 172.7 ms | 542.4 µs | 318.3×  | 51.29 MB → 31 B
  95 k   | distinct files (cleanup) | 341.8 ms | 32.0 ms  | 10.7×   | 97.53 MB → 333.43 KB
  95 k   | orphan ids (reconcile)   | 330.5 ms | 586.0 µs | 564.0×  | 97.53 MB → 61 B
  200 k  | distinct files (cleanup) | 718.9 ms | 73.2 ms  | 9.8×    | 205.94 MB → 713.94 KB
  200 k  | orphan ids (reconcile)   | 696.6 ms | 592.0 µs | 1176.8× | 205.94 MB → 141 B

The bench asserts result-set equivalence between OLD and NEW before timing, so it doubles as a regression check. Reproduce with:

  cargo run --release --example bench_metadata_filter -p next-plaid -- --quick
  cargo run --release --example bench_metadata_filter -p next-plaid -- \
      --sizes 10000,50000,95000,200000 --iters 5

cleanup_orphaned_entries runs on every search, so its 300 ms+ savings (at 95 k rows) hit every query. reconcile_document_counts runs only when the index/DB counts disagree, but that is exactly the state where users observe minute-long stalls.
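The harness shape is the usual best-of-N loop with an up-front equivalence check; a simplified, self-contained sketch follows (reconstructed here, not the bench source; the real example additionally builds the synthetic METADATA table and parses the --sizes/--iters flags):

```rust
// Simplified shape of a best-of-N micro-benchmark with an equivalence
// assertion, as described for bench_metadata_filter.rs.
use std::time::{Duration, Instant};

/// Times `f` `iters` times and returns the best (minimum) run, the usual
/// noise-resistant choice for micro-benchmarks, plus the last output.
fn best_of<T>(iters: u32, mut f: impl FnMut() -> T) -> (Duration, T) {
    let mut best: Option<(Duration, T)> = None;
    for _ in 0..iters {
        let start = Instant::now();
        let out = f();
        let elapsed = start.elapsed();
        if best.as_ref().map_or(true, |(d, _)| elapsed < *d) {
            best = Some((elapsed, out));
        }
    }
    best.expect("iters must be > 0")
}

fn main() {
    // Stand-ins for the OLD full-table path and the NEW SQL pushdown;
    // in the real bench these run the actual filtering code paths.
    let old = || (0u64..95_000).filter(|&id| id >= 94_990).collect::<Vec<_>>();
    let new = || (94_990u64..95_000).collect::<Vec<_>>();

    let (t_old, r_old) = best_of(5, old);
    let (t_new, r_new) = best_of(5, new);
    // Equivalence check before reporting, so the bench doubles as a
    // regression test for the pushdown.
    assert_eq!(r_old, r_new, "OLD and NEW must return identical results");
    println!("old {t_old:?} vs new {t_new:?}");
}
```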

Tests

  • New unit tests in next-plaid/src/filtering.rs:
    • test_get_distinct_strings_returns_unique_values
    • test_get_distinct_strings_missing_db_returns_empty
    • test_get_distinct_strings_unknown_column_returns_empty
    • test_get_distinct_strings_rejects_invalid_column_name
  • cargo test -p next-plaid --lib filtering:: — 44 passed (4 new, 40 existing)
  • cargo test -p colgrep --lib — 511 passed
  • cargo check -p next-plaid -p colgrep — clean

The benchmark example is in a separate commit so it can be dropped cleanly if you'd rather not add example binaries to the crate.

Test plan

  • New helper covered by unit tests (happy path, missing DB, unknown column, invalid identifier)
  • Existing filtering test suite still passes
  • Full colgrep lib test suite still passes
  • cargo check passes for both crates
  • Bench asserts OLD/NEW result equivalence before timing

`colgrep` calls `try_index` on every search. Two helpers on that path
materialized every metadata row of the entire index just to inspect a
single column:

- `cleanup_orphaned_entries` loaded all rows via `filtering::get` only to
  collect the distinct `file` paths.
- `reconcile_document_counts` loaded all rows only to find `_subset_`
  IDs >= the vector index document count.

On indexes with ~100k code units this is tens of megabytes of JSON
deserialization on every query, which dominates search latency on
otherwise quick lookups.

Add `filtering::get_distinct_strings(index_path, column)` and use it
for the file enumeration. Use the existing `filtering::where_condition`
to push the orphan-id filter into SQL. Both pulls now read at most a
few KB instead of the whole METADATA table.

Includes tests for the new helper covering the happy path, missing DB,
unknown column, and invalid-identifier rejection.
raphaelsty (Collaborator) commented May 6, 2026

Hi @voxmenthe, this is a cool PR. I'd simply like to see a benchmark before merging, please: how much time / memory does it save?

Reproducible benchmark for the two pushdowns. Builds a synthetic
METADATA table and times the OLD full-table load + Rust-side filter
against the NEW SQL pushdown for both `cleanup_orphaned_entries` and
`reconcile_document_counts`.

  cargo run --release --example bench_metadata_filter -p next-plaid -- --quick
  cargo run --release --example bench_metadata_filter -p next-plaid -- \
      --sizes 10000,50000,95000,200000 --iters 5

Sample results on a 2024 MacBook (release build):

  rows  | distinct files speedup | orphan ids speedup | bytes (OLD -> NEW)
  10k   | 11x                    | 52x                | 10 MB  -> 34 KB / 6 B
  50k   | 10x                    | 318x               | 51 MB  -> 174 KB / 31 B
  95k   | 10x                    | 564x               | 97 MB  -> 333 KB / 61 B
  200k  | 9.8x                   | 1177x              | 206 MB -> 714 KB / 141 B

The OLD path materializes the full METADATA table on every search (it ran
in `cleanup_orphaned_entries`, plus again in `reconcile_document_counts`
when desynced). The NEW path reads at most a few hundred KB.
voxmenthe (Author) commented:

> Hi @voxmenthe, this is a cool PR. I'd simply like to see a benchmark before merging, please: how much time / memory does it save?

Added to the PR, thanks!
