Skip to content

fix(index): batch incremental deletes (#116) and fix stuck-dirty / partial-seed bugs (#115)#117

Merged
raphaelsty merged 1 commit into
mainfrom
fix/incremental-update-delete-perf-and-dirty-state
Jun 1, 2026
Merged

fix(index): batch incremental deletes (#116) and fix stuck-dirty / partial-seed bugs (#115)#117
raphaelsty merged 1 commit into
mainfrom
fix/incremental-update-delete-perf-and-dirty-state

Conversation

@raphaelsty
Copy link
Copy Markdown
Collaborator

Summary

Fixes the two most recently reported issues, all three bugs living in the incremental-update path in colgrep/src/index/mod.rs. Each fix ships with a model-free regression test, and I verified each test goes red→green by temporarily reverting the corresponding fix.

Closes #116. Closes #115.

#116 — incremental update hangs for minutes after a large diff

delete_file_from_index() was called once per changed/deleted file, and each call is a full-index rewrite:

  • next_plaid::delete_from_index rewrites every chunk and rebuilds the IVF from all remaining codes
  • filtering::delete rebuilds the entire metadata table (temp table + re-insert with renumbered _subset_)

With N changed files that is O(N × index_size) — the reporter's ~276 deleted .rbi files hung for >9 min on a ~1.2 GB index.

Fix: every per-file delete loop now calls a single delete_files_from_index(index_path, files) that collects all doc IDs up front and deletes once — a single O(index_size) rewrite regardless of how many files changed. The IDs for all files are read before any deletion, because both primitives renumber surviving documents; interleaving reads and deletes would invalidate the not-yet-deleted IDs.

Regression test test_delete_files_from_index_is_a_single_rewrite builds a model-free index (4 files × 3 docs), deletes 3 files, and asserts the rewrite primitive is invoked exactly once (was once-per-doc) and that exactly the right documents survive.

#115 (bug 1) — an index can stay permanently dirty

When a previous run left the index dirty, incremental_update runs a repair — but the "nothing to do" early return never cleared the flag. The index stayed dirty: true forever, paying for a repair on every future run.

Fix: on that early-return path, when the prior state was dirty, persist dirty = false (the repair already reconciled the store).

Verified end-to-end against the released binary: with no file changes, colgrep 1.5.0 leaves dirty: true across repeated runs; the patched binary clears it to false. Regression test: test_incremental_update_clears_dirty_with_no_changes.

#115 (bug 2) — an interrupted resumable build can be used as a seed source

A resumable build writes metadata.json and checkpoints state with dirty = false after each committed batch, so an interrupted build (a .building marker is present) passed every seed-usability check — non-empty, non-dirty, current version, metadata + filtering DB present — while holding only a fraction of its documents. Worktree seeding would then copy that partial index and treat it as complete.

Fix: reject any seed candidate whose index dir has a .building marker. The usability checks were extracted into a small testable seed_source_state() helper. Regression test: test_seed_source_rejects_in_progress_build.

Testing

  • cargo test -p colgrep --lib → 545 passed (4 new)
  • cargo clippy --all-targets -- -D warnings clean; cargo fmt clean
  • Each new test confirmed to fail before its fix and pass after (red→green)
  • End-to-end repro of bug 1 with the released vs. patched binary

Notes

The deferral of changed-file deletion until just before re-encode (for concurrent-reader / interrupt safety) is preserved — deletes are still issued at the same point, just batched.

…dirty

Fixes three bugs reported in #116 and #115, all in the incremental-update
path. Each fix ships with a model-free regression test.

#116 — slow incremental update after large diffs
  delete_file_from_index() was called once per changed/deleted file, and each
  call is a full-index rewrite (delete_from_index rewrites every chunk + rebuilds
  the IVF; filtering::delete rewrites the whole metadata table). With N changed
  files this is O(N × index_size) — ~276 deleted files hung for >9 min on a
  ~1.2 GB index. Replace every per-file delete loop with a single
  delete_files_from_index() that collects all doc IDs up front (IDs must be read
  before any deletion, since both primitives renumber survivors) and deletes once,
  collapsing the work to a single O(index_size) rewrite.

#115 (bug 1) — index could stay permanently dirty
  When a previous run left the index dirty, incremental_update repairs it, but the
  "nothing to do" early return never cleared the flag. The index stayed dirty
  forever, paying for a repair on every future run. Clear and persist dirty=false
  on that path when the prior state was dirty.

#115 (bug 2) — interrupted resumable build used as a seed source
  A resumable build writes metadata.json and checkpoints state with dirty=false
  after each batch, so an interrupted build (.building marker present) passed every
  seed-usability check while holding only a fraction of its documents. Worktree
  seeding would copy that partial index and treat it as complete. Reject any
  candidate with a .building marker (logic extracted into seed_source_state()).

Tests: red->green verified for all three by temporarily reverting each fix.
@raphaelsty raphaelsty merged commit 5db737c into main Jun 1, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

colgrep index update takes a long time on certain repo states Potential bugs in dirty state clearing?

1 participant