
Resumable, checkpointed initial index build (fix "rebuilds every time") #110

Merged: raphaelsty merged 2 commits into main from feat/resumable-index-build on May 28, 2026

@raphaelsty raphaelsty commented May 28, 2026

Problem

ColGrep was reported to "rebuild the index every time," even when no files had changed. Root cause: a first-time full build that exceeds the caller's timeout (an agent's ~2-min command limit, Ctrl-C, etc.) is killed before it finishes. full_rebuild encodes into a throwaway index.tmp and only persists state.json on success, so an interrupted build saves nothing and the next search restarts from scratch, forever. Reproduced on a real transformers/src/transformers/models tree (2160 files): still building at 200s, nothing persisted after a kill.

Fix

The chunk pipeline already commits index data incrementally. Work was lost on interruption only because the build targeted index.tmp and state was persisted only at the end. So:

  • build_resumable: for the first build (and resumes), encode directly into the real index dir in file-coherent batches of ~4096 units, saving state.json after each committed batch. An interrupted run keeps the batches it finished.
  • A .building marker routes the next run back into build_resumable, which skips already-committed files and continues. Cleared on completion.
  • Idempotent resume: each batch deletes a file's prior docs before re-embedding, and repair_index_db_sync trims any partial-chunk desync — so interruptions never accumulate duplicate/orphan documents.
  • Zero-unit files (empty / import-only) are recorded immediately (parity with full_rebuild).
  • Refactored index()/try_index() to share run_indexing().
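The batch-and-checkpoint flow described above can be sketched roughly as follows. This is a minimal in-memory model, not the actual implementation: `IndexDir`, `file_coherent_batches`, and the `max_batches` parameter (which simulates a kill) are hypothetical names, and the signature of `build_resumable` is assumed for illustration. Files are represented as `(path, unit_count)` pairs.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the on-disk index directory (hypothetical layout).
#[derive(Default, Debug)]
struct IndexDir {
    /// Stand-in for the `.building` marker file.
    building: bool,
    /// Stand-in for state.json: file path -> number of committed units.
    state: BTreeMap<String, usize>,
    /// Stand-in for the document store: one entry per embedded unit.
    docs: Vec<(String, usize)>,
}

/// Group files into batches of at most `max_units` units,
/// never splitting a single file across two batches.
fn file_coherent_batches(
    files: &[(String, usize)],
    max_units: usize,
) -> Vec<Vec<(String, usize)>> {
    let mut batches = Vec::new();
    let mut current: Vec<(String, usize)> = Vec::new();
    let mut units = 0usize;
    for (path, n) in files {
        if !current.is_empty() && units + *n > max_units {
            batches.push(std::mem::take(&mut current));
            units = 0;
        }
        current.push((path.clone(), *n));
        units += *n;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

/// Returns true once the build is complete. `max_batches` simulates an
/// interruption: the run stops before starting batch number `max_batches`.
fn build_resumable(
    index: &mut IndexDir,
    files: &[(String, usize)],
    max_units: usize,
    max_batches: usize,
) -> bool {
    index.building = true; // marker routes the next run back here
    // Skip files that an earlier run already committed.
    let todo: Vec<(String, usize)> = files
        .iter()
        .filter(|(path, _)| !index.state.contains_key(path))
        .cloned()
        .collect();
    for (i, batch) in file_coherent_batches(&todo, max_units).into_iter().enumerate() {
        if i == max_batches {
            return false; // simulated kill: marker stays, committed state stays
        }
        for (path, n) in &batch {
            // Delete the file's prior docs first so a re-run is idempotent.
            index.docs.retain(|(p, _)| p != path);
            for unit in 0..*n {
                index.docs.push((path.clone(), unit));
            }
        }
        // Persist state only after the whole batch is committed.
        for (path, n) in batch {
            index.state.insert(path, n);
        }
    }
    index.building = false; // cleared on completion
    true
}
```

In this model, an interrupted run leaves `building` set and `state` holding only fully committed files, so the next call picks up with the remaining files and any partially written docs for an uncommitted file are removed before it is re-embedded.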

Version bumps still rebuild from scratch ✅

The version/force check runs before the resume branch and clears any .building marker, so a CLI-version change (or clear/forced rebuild) always discards the old index and does the atomic full_rebuild — never a mixed-version index. Verified by repro.
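The ordering matters here, so a small sketch may help. The code below is an illustrative model of the routing decision only: `BuildPath`, `IndexStatus`, and `choose_build_path` are hypothetical names, and the version strings in the usage are made up. It assumes the stored CLI version is known whenever an index (complete or partial) exists.

```rust
/// Which build path a run should take (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
enum BuildPath {
    /// Atomic tmp-swap rebuild from scratch.
    FullRebuild,
    /// First build, or continuation of an interrupted first build.
    Resume,
    /// Normal incremental update of a complete index.
    Incremental,
}

/// Assumed summary of what is on disk before a run starts.
struct IndexStatus {
    /// CLI version recorded with the index; None when no index exists yet.
    stored_version: Option<String>,
    /// Whether the `.building` marker is present.
    building_marker: bool,
    /// Whether the user asked for a clear/forced rebuild.
    force: bool,
}

fn choose_build_path(status: &mut IndexStatus, cli_version: &str) -> BuildPath {
    // 1. The version/force check comes first and wins over everything else.
    let version_changed =
        matches!(status.stored_version.as_deref(), Some(v) if v != cli_version);
    if status.force || version_changed {
        // A stale marker must not route a later run into a resume
        // of an index built by a different version.
        status.building_marker = false;
        return BuildPath::FullRebuild;
    }
    // 2. Only then consider the resume branch.
    if status.stored_version.is_none() || status.building_marker {
        return BuildPath::Resume;
    }
    // 3. Otherwise the index is complete and current.
    BuildPath::Incremental
}
```

Because the version check runs before the marker is consulted, a half-built index from an older CLI version is never resumed in this model; it is discarded and rebuilt.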

Stress testing (release binary)

ctrlc only handles SIGINT, so a real agent timeout (SIGTERM/SIGKILL) is a hard kill mid-write — that's the scenario tested.

| Scenario | Result |
| --- | --- |
| Graceful interrupt (SIGINT), 13s window, 4000 files | ✅ Completes in 2 rounds; final index identical to clean build (4000 files / 16000 docs, same top hits) |
| Hard kill (SIGKILL) ×20, 5s window | ✅ No corruption, no duplicate/orphan docs (count oscillates around committed value, never grows), state never regresses. ⚠️ Stalls at 1 batch (window ≈ one batch + overhead) |
| Hard kill (SIGKILL), 12s window, 4000 files | ✅ Monotonic 2048→3072→4000, completes in 3 rounds, content-equivalent (4000 files / 16000 docs, no dups) |
| Version bump | ✅ Full rebuild from scratch |
| Zero-unit files (empty / import-only) | ✅ Recorded correctly |

No regressions: full colgrep suite (541 lib + 55 bin) + cargo fmt/clippy + make ci-quick green.

Known limitations (inherent trade-offs)

  1. Short timeouts. Each resume repeats fixed overhead (rescan + re-parse remaining files + model reload, ~4–5s). If the kill window is barely larger than (overhead + one batch), progress stalls, but it stays monotonic and never loses committed work or corrupts the index. Realistic windows (~2 min) complete in 1–3 rounds. A follow-up could stream parsing (parse+encode+commit per file-chunk) to make time-to-first-commit independent of repo size.
  2. Ranking on near-ties. A resumed build has complete recall but, like any incrementally-built index (PLAID is approximate; centroids seed from the first batch), may rank near-identical docs slightly differently than a pristine single-shot build. Consistent with normal incremental operation.

Forced/corrupted/version-bump rebuilds intentionally keep the atomic full_rebuild (tmp-swap) path, which keeps a working index searchable while it rebuilds.

Fixes the "rebuilds the whole index every time" report: a first-time full
build that exceeds the caller's timeout (e.g. an agent's 2-min command
limit) was killed before it finished, and since full_rebuild encodes into
a throwaway index.tmp and only persists state.json on success, it saved
nothing — so the next search restarted from scratch, forever.

The chunk pipeline already commits index data incrementally; the only
things lost on interruption were (1) building into index.tmp and (2)
end-only state persistence.

- New build_resumable: for the first build (and its resumes), encode
  directly into the real index dir in file-coherent batches of ~4096
  units, saving state.json after each committed batch. An interrupted run
  keeps the batches it finished.
- A .building marker routes the next run back into build_resumable so it
  resumes (skipping already-committed files) instead of restarting. The
  marker is cleared on completion.
- Each batch deletes a file's prior docs before re-embedding, so resuming
  after a mid-batch interruption is idempotent (no duplicate documents);
  repair_index_db_sync trims any partial-chunk desync on resume.
- Refactored index()/try_index() to share run_indexing(). Version bumps
  and forced/corrupted rebuilds still take the atomic full_rebuild path
  (and clear any stale .building marker), so a CLI-version change always
  rebuilds from scratch.

Stress tested: interrupting a 4000-file build mid-way resumes with
monotonic progress and a final index identical to a clean build (same
file count, same num_documents — no duplicates — identical top hits).
@raphaelsty raphaelsty force-pushed the feat/resumable-index-build branch from 000d991 to 58782ac Compare May 28, 2026 19:40
Files that parse but yield no code units (empty or import-only files like
__init__.py) were given a file_info entry but never inserted into state by
build_resumable, since the batch loop only iterates files that have units.
full_rebuild records them, so this was a parity gap: such files stayed in
the resume "todo" set and were re-parsed on every resume round, and were
only recorded after the build completed (via a later incremental_update).

Record them as done (and persist) right after parsing — they need no
embedding, so marking them complete at any point is safe and survives an
interruption.
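A minimal sketch of that step, under stated assumptions: parsed files are modeled as `(path, unit_count)` pairs, state is a plain map standing in for state.json, and `record_zero_unit_files` is a hypothetical helper name rather than a function from the codebase.

```rust
use std::collections::BTreeMap;

/// Record files that parsed to zero code units as done right away and
/// return only the files that still need embedding.
/// `state` stands in for the persisted state.json (path -> unit count).
fn record_zero_unit_files(
    parsed: &[(String, usize)],
    state: &mut BTreeMap<String, usize>,
) -> Vec<(String, usize)> {
    let mut needs_embedding = Vec::new();
    for (path, units) in parsed {
        if *units == 0 {
            // Nothing to embed, so marking the file complete is safe at any
            // point and keeps it out of the resume "todo" set.
            state.insert(path.clone(), 0);
        } else {
            needs_embedding.push((path.clone(), *units));
        }
    }
    needs_embedding
}
```

In the real flow the state would also be written to disk at this point, so a kill immediately afterwards would not cause these files to be parsed again on the next round.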
@raphaelsty raphaelsty merged commit 073fa3b into main May 28, 2026
20 checks passed