fix(index): batch incremental deletes (#116) and fix stuck-dirty / partial-seed bugs (#115) by raphaelsty · Pull Request #117 · lightonai/next-plaid

raphaelsty · 2026-06-01T16:18:03Z

Summary

Fixes the three bugs behind the two most recently reported issues; all three live in the incremental-update path in colgrep/src/index/mod.rs. Each fix ships with a model-free regression test, and I verified that each test goes red→green by temporarily reverting the corresponding fix.

Closes #116. Closes #115.

#116 — incremental update hangs for minutes after a large diff

delete_file_from_index() was called once per changed/deleted file, and each call is a full-index rewrite:

next_plaid::delete_from_index rewrites every chunk and rebuilds the IVF from all remaining codes
filtering::delete rebuilds the entire metadata table (temp table + re-insert with renumbered _subset_)

With N changed files that is O(N × index_size) — the reporter's ~276 deleted .rbi files hung for >9 min on a ~1.2 GB index.

Fix: every per-file delete loop now calls a single delete_files_from_index(index_path, files) that collects all doc IDs up front and deletes once — a single O(index_size) rewrite regardless of how many files changed. The IDs for all files are read before any deletion, because both primitives renumber surviving documents; interleaving reads and deletes would invalidate the not-yet-deleted IDs.
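To illustrate the read-then-delete ordering, here is a minimal, self-contained sketch. It is not the code from this PR: `ToyIndex`, its fields, and its methods are hypothetical stand-ins, and the only thing modelled is that a document's ID is its position, so every delete renumbers the survivors.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the on-disk index. A document's ID is its
/// position in `docs`, so any delete renumbers the surviving documents.
struct ToyIndex {
    docs: Vec<(String, String)>, // (source file, chunk text)
    rewrites: usize,             // number of full rewrites performed so far
}

impl ToyIndex {
    /// Build an index with `docs_per_file` documents for each file.
    fn new(files: &[&str], docs_per_file: usize) -> Self {
        let docs = files
            .iter()
            .flat_map(|f| (0..docs_per_file).map(move |i| (f.to_string(), format!("chunk {}", i))))
            .collect();
        ToyIndex { docs, rewrites: 0 }
    }

    /// Current IDs of all documents that came from `file`.
    fn doc_ids_for(&self, file: &str) -> Vec<usize> {
        self.docs
            .iter()
            .enumerate()
            .filter(|(_, (f, _))| f.as_str() == file)
            .map(|(id, _)| id)
            .collect()
    }

    /// One full rewrite: drop the given IDs and renumber everything else.
    fn delete_ids(&mut self, ids: &[usize]) {
        let doomed: HashSet<usize> = ids.iter().copied().collect();
        let old = std::mem::take(&mut self.docs);
        self.docs = old
            .into_iter()
            .enumerate()
            .filter(|(id, _)| !doomed.contains(id))
            .map(|(_, doc)| doc)
            .collect();
        self.rewrites += 1;
    }

    /// Batched delete: read every ID first, then rewrite exactly once.
    fn delete_files(&mut self, files: &[&str]) {
        let ids: Vec<usize> = files.iter().flat_map(|f| self.doc_ids_for(f)).collect();
        if !ids.is_empty() {
            self.delete_ids(&ids);
        }
    }
}
```

In this toy model, IDs collected for a later file would go stale as soon as an earlier `delete_ids` call ran, which is why `delete_files` gathers all of them before the single rewrite.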

Regression test test_delete_files_from_index_is_a_single_rewrite builds a model-free index (4 files × 3 docs), deletes 3 files, and asserts the rewrite primitive is invoked exactly once (was once-per-doc) and that exactly the right documents survive.

#115 (bug 1) — an index can stay permanently dirty

When a previous run left the index dirty, incremental_update runs a repair — but the "nothing to do" early return never cleared the flag. The index stayed dirty: true forever, paying for a repair on every future run.

Fix: on that early-return path, when the prior state was dirty, persist dirty = false (the repair already reconciled the store).
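The control flow can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual `incremental_update`: `IndexState`, `Outcome`, the `persisted_writes` counter, and the function signature are all hypothetical, and the repair step is reduced to a comment.

```rust
/// Hypothetical, simplified view of the persisted index state.
#[derive(Debug, Default)]
struct IndexState {
    dirty: bool,
    persisted_writes: usize, // counts how often the state was "written to disk"
}

#[derive(Debug, PartialEq)]
enum Outcome {
    NothingToDo,
    Updated,
}

/// Sketch of the early-return path described above.
fn incremental_update_sketch(state: &mut IndexState, changed_files: &[&str]) -> Outcome {
    let was_dirty = state.dirty;
    if was_dirty {
        // Stand-in for the repair pass that reconciles the store.
    }

    if changed_files.is_empty() {
        // Before the fix this branch returned without touching the flag,
        // so a dirty index stayed dirty and was repaired on every run.
        if was_dirty {
            state.dirty = false;
            state.persisted_writes += 1; // stand-in for persisting the state
        }
        return Outcome::NothingToDo;
    }

    // Stand-in for the normal delete + re-encode path, which ends clean.
    state.dirty = false;
    state.persisted_writes += 1;
    Outcome::Updated
}
```

A clean index with no changes performs no write at all in this sketch; only the previously dirty case pays for one extra persist.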

Verified end-to-end against the released binary: with no file changes, colgrep 1.5.0 leaves dirty: true across repeated runs; the patched binary clears it to false. Regression test: test_incremental_update_clears_dirty_with_no_changes.

#115 (bug 2) — an interrupted resumable build can be used as a seed source

A resumable build writes metadata.json and checkpoints state with dirty = false after each committed batch, so an interrupted build (a .building marker is present) passed every seed-usability check — non-empty, non-dirty, current version, metadata + filtering DB present — while holding only a fraction of its documents. Worktree seeding would then copy that partial index and treat it as complete.

Fix: reject any seed candidate whose index dir has a .building marker. The usability checks were extracted into a small testable seed_source_state() helper. Regression test: test_seed_source_rejects_in_progress_build.
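A minimal sketch of such a check is shown below. Only the `.building` marker and `metadata.json` file names come from the description above; the `SeedSourceState` enum, its variants, and the function signature are assumptions made for illustration, and the real helper also covers the other usability checks (non-empty, non-dirty, current version, filtering DB present), which are omitted here.

```rust
use std::path::Path;

/// Hypothetical result type; the real helper's return type may differ.
#[derive(Debug, PartialEq)]
enum SeedSourceState {
    Usable,
    BuildInProgress,
    MissingMetadata,
}

/// Sketch of a seed-usability check that rejects interrupted builds.
fn seed_source_state_sketch(index_dir: &Path) -> SeedSourceState {
    // An interrupted resumable build leaves its marker behind. Check it
    // first, because such a directory passes the remaining checks.
    if index_dir.join(".building").exists() {
        return SeedSourceState::BuildInProgress;
    }
    if !index_dir.join("metadata.json").is_file() {
        return SeedSourceState::MissingMetadata;
    }
    SeedSourceState::Usable
}
```

Returning a small enum rather than a bool keeps the rejection reason visible to callers and makes each case easy to assert on in a model-free test.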

Testing

cargo test -p colgrep --lib → 545 passed (4 new)
cargo clippy --all-targets -- -D warnings clean; cargo fmt clean
Each new test confirmed to fail before its fix and pass after (red→green)
End-to-end repro of bug 1 with the released vs. patched binary

Notes

The deferral of changed-file deletion until just before re-encode (for concurrent-reader / interrupt safety) is preserved — deletes are still issued at the same point, just batched.


This was referenced Jun 1, 2026

colgrep index update takes a long time on certain repo states #116

Closed

Potential bugs in dirty state clearing? #115

Closed

raphaelsty merged commit 5db737c into main Jun 1, 2026
20 checks passed

raphaelsty merged 1 commit into main from fix/incremental-update-delete-perf-and-dirty-state
