fix(index): batch incremental deletes (#116) and fix stuck-dirty / partial-seed bugs (#115)#117
Merged
raphaelsty merged 1 commit intoJun 1, 2026
Conversation
…dirty Fixes three bugs reported in #116 and #115, all in the incremental-update path. Each fix ships with a model-free regression test. #116 — slow incremental update after large diffs delete_file_from_index() was called once per changed/deleted file, and each call is a full-index rewrite (delete_from_index rewrites every chunk + rebuilds the IVF; filtering::delete rewrites the whole metadata table). With N changed files this is O(N × index_size) — ~276 deleted files hung for >9 min on a ~1.2 GB index. Replace every per-file delete loop with a single delete_files_from_index() that collects all doc IDs up front (IDs must be read before any deletion, since both primitives renumber survivors) and deletes once, collapsing the work to a single O(index_size) rewrite. #115 (bug 1) — index could stay permanently dirty When a previous run left the index dirty, incremental_update repairs it, but the "nothing to do" early return never cleared the flag. The index stayed dirty forever, paying for a repair on every future run. Clear and persist dirty=false on that path when the prior state was dirty. #115 (bug 2) — interrupted resumable build used as a seed source A resumable build writes metadata.json and checkpoints state with dirty=false after each batch, so an interrupted build (.building marker present) passed every seed-usability check while holding only a fraction of its documents. Worktree seeding would copy that partial index and treat it as complete. Reject any candidate with a .building marker (logic extracted into seed_source_state()). Tests: red->green verified for all three by temporarily reverting each fix.
This was referenced Jun 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the two most recently reported issues, all three bugs living in the incremental-update path in
colgrep/src/index/mod.rs. Each fix ships with a model-free regression test, and I verified each test goes red→green by temporarily reverting the corresponding fix.Closes #116. Closes #115.
#116 — incremental update hangs for minutes after a large diff
delete_file_from_index()was called once per changed/deleted file, and each call is a full-index rewrite:next_plaid::delete_from_indexrewrites every chunk and rebuilds the IVF from all remaining codesfiltering::deleterebuilds the entire metadata table (temp table + re-insert with renumbered_subset_)With N changed files that is O(N × index_size) — the reporter's ~276 deleted
.rbifiles hung for >9 min on a ~1.2 GB index.Fix: every per-file delete loop now calls a single
delete_files_from_index(index_path, files)that collects all doc IDs up front and deletes once — a single O(index_size) rewrite regardless of how many files changed. The IDs for all files are read before any deletion, because both primitives renumber surviving documents; interleaving reads and deletes would invalidate the not-yet-deleted IDs.Regression test
test_delete_files_from_index_is_a_single_rewritebuilds a model-free index (4 files × 3 docs), deletes 3 files, and asserts the rewrite primitive is invoked exactly once (was once-per-doc) and that exactly the right documents survive.#115 (bug 1) — an index can stay permanently dirty
When a previous run left the index dirty,
incremental_updateruns a repair — but the "nothing to do" early return never cleared the flag. The index stayeddirty: trueforever, paying for a repair on every future run.Fix: on that early-return path, when the prior state was dirty, persist
dirty = false(the repair already reconciled the store).Verified end-to-end against the released binary: with no file changes,
colgrep 1.5.0leavesdirty: trueacross repeated runs; the patched binary clears it tofalse. Regression test:test_incremental_update_clears_dirty_with_no_changes.#115 (bug 2) — an interrupted resumable build can be used as a seed source
A resumable build writes
metadata.jsonand checkpoints state withdirty = falseafter each committed batch, so an interrupted build (a.buildingmarker is present) passed every seed-usability check — non-empty, non-dirty, current version, metadata + filtering DB present — while holding only a fraction of its documents. Worktree seeding would then copy that partial index and treat it as complete.Fix: reject any seed candidate whose index dir has a
.buildingmarker. The usability checks were extracted into a small testableseed_source_state()helper. Regression test:test_seed_source_rejects_in_progress_build.Testing
cargo test -p colgrep --lib→ 545 passed (4 new)cargo clippy --all-targets -- -D warningsclean;cargo fmtcleanNotes
The deferral of changed-file deletion until just before re-encode (for concurrent-reader / interrupt safety) is preserved — deletes are still issued at the same point, just batched.