Resumable, checkpointed initial index build (fix "rebuilds every time") #110
Merged
Fixes the "rebuilds the whole index every time" report: a first-time full build that exceeds the caller's timeout (e.g. an agent's 2-min command limit) was killed before it finished, and since full_rebuild encodes into a throwaway index.tmp and only persists state.json on success, it saved nothing, so the next search restarted from scratch, forever.

The chunk pipeline already commits index data incrementally; the only things lost on interruption were (1) building into index.tmp and (2) end-only state persistence.

- New build_resumable: for the first build (and its resumes), encode directly into the real index dir in file-coherent batches of ~4096 units, saving state.json after each committed batch. An interrupted run keeps the batches it finished.
- A .building marker routes the next run back into build_resumable so it resumes (skipping already-committed files) instead of restarting. The marker is cleared on completion.
- Each batch deletes a file's prior docs before re-embedding, so resuming after a mid-batch interruption is idempotent (no duplicate documents); repair_index_db_sync trims any partial-chunk desync on resume.
- Refactored index()/try_index() to share run_indexing().

Version bumps and forced/corrupted rebuilds still take the atomic full_rebuild path (and clear any stale .building marker), so a CLI-version change always rebuilds from scratch.

Stress tested: interrupting a 4000-file build mid-way resumes with monotonic progress and a final index identical to a clean build (same file count, same num_documents, so no duplicates, and identical top hits).
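For readers skimming the commit message, the batching and checkpointing idea can be sketched roughly as below. This is an in-memory illustration, not the code in this PR: the Index struct, the file_coherent_batches helper, the budget parameter (a simulated kill after N batches), and the signature of build_resumable here are all assumptions made for the example; the real implementation works against the on-disk index and state.json.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the on-disk index plus state.json.
#[derive(Default, Debug, PartialEq)]
struct Index {
    docs: BTreeMap<String, usize>, // file -> number of committed documents
    done: BTreeSet<String>,        // files recorded in the persisted state
    building: bool,                // stands in for the .building marker
}

/// Group files into batches that never split a file. A batch is closed when
/// adding the next file would push it past `max_units`.
fn file_coherent_batches(
    files: &[(String, usize)],
    max_units: usize,
) -> Vec<Vec<(String, usize)>> {
    let mut batches = Vec::new();
    let mut current: Vec<(String, usize)> = Vec::new();
    let mut units = 0;
    for (path, n) in files {
        if !current.is_empty() && units + n > max_units {
            batches.push(std::mem::take(&mut current));
            units = 0;
        }
        current.push((path.clone(), *n));
        units += n;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

/// Run (or resume) a build. `budget` is the number of batches allowed before
/// a simulated kill. Returns true if the build completed.
fn build_resumable(
    index: &mut Index,
    files: &[(String, usize)],
    max_units: usize,
    budget: usize,
) -> bool {
    index.building = true;
    // Resume: skip files that an earlier run already checkpointed.
    let todo: Vec<(String, usize)> = files
        .iter()
        .filter(|(p, _)| !index.done.contains(p))
        .cloned()
        .collect();
    for (i, batch) in file_coherent_batches(&todo, max_units).into_iter().enumerate() {
        if i == budget {
            return false; // killed: earlier batches stay committed
        }
        for (path, n) in &batch {
            index.docs.remove(path); // drop prior docs so a redo cannot duplicate
            index.docs.insert(path.clone(), *n);
        }
        // Checkpoint: record the batch's files only after all are committed.
        index.done.extend(batch.into_iter().map(|(p, _)| p));
    }
    index.building = false; // completion clears the marker
    true
}
```

Calling build_resumable once with a small budget and then again with a large one should leave the same Index as a single uninterrupted call, which is the property the stress test above checks on the real index.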
000d991 to 58782ac
Files that parse but yield no code units (empty or import-only files like __init__.py) were given a file_info entry but never inserted into state by build_resumable, since the batch loop only iterates files that have units. full_rebuild records them, so this was a parity gap: such files stayed in the resume "todo" set and were re-parsed on every resume round, and were only recorded after the build completed (via a later incremental_update). Record them as done (and persist) right after parsing — they need no embedding, so marking them complete at any point is safe and survives an interruption.
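A minimal sketch of that parity fix, under stated assumptions: record_unitless and its (path, unit_count) input are hypothetical names for illustration, and the done set stands in for the persisted state. It only shows the idea that unit-less files are marked complete at parse time while everything else is queued for embedding.

```rust
use std::collections::BTreeSet;

/// Split freshly parsed files into "needs embedding" and "already complete".
/// Files with zero code units (e.g. an empty __init__.py) need no embedding,
/// so they are recorded as done immediately; in the real build the state
/// would also be persisted at this point.
fn record_unitless(
    parsed: &[(String, usize)],
    done: &mut BTreeSet<String>,
) -> Vec<(String, usize)> {
    let mut to_embed = Vec::new();
    for (path, units) in parsed {
        if *units == 0 {
            done.insert(path.clone());
        } else if !done.contains(path) {
            to_embed.push((path.clone(), *units));
        }
    }
    to_embed
}
```

With this in place, a later resume round that filters on the done set no longer sees the unit-less files in its todo list.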
Problem
ColGrep was reported to "rebuild the index every time," unrelated to file changes. Root cause: a first-time full build that exceeds the caller's timeout (an agent's ~2-min command limit, Ctrl-C, etc.) is killed before it finishes.
full_rebuild encodes into a throwaway index.tmp and only persists state.json on success, so an interrupted build saves nothing → the next search restarts from scratch, forever. Reproduced on a real transformers/src/transformers/models tree (2160 files): still building at 200s, nothing persisted after a kill.
Fix
The chunk pipeline already commits index data incrementally; work was lost only because the build went into index.tmp and state was persisted only at the end. So:
- build_resumable: for the first build (and resumes), encode directly into the real index dir in file-coherent batches of ~4096 units, saving state.json after each committed batch. An interrupted run keeps the batches it finished.
- A .building marker routes the next run back into build_resumable, which skips already-committed files and continues. Cleared on completion.
- Each batch deletes a file's prior docs before re-embedding, and repair_index_db_sync trims any partial-chunk desync, so interruptions never accumulate duplicate/orphan documents.
- Files that yield no code units are recorded as done right after parsing (parity with full_rebuild).
- Refactored index()/try_index() to share run_indexing().
Version bumps still rebuild from scratch ✅
The version/force check runs before the resume branch and clears any .building marker, so a CLI-version change (or clear/forced rebuild) always discards the old index and does the atomic full_rebuild, never a mixed-version index. Verified by repro.
Stress testing (release binary)
ctrlc only handles SIGINT, so a real agent timeout (SIGTERM/SIGKILL) is a hard kill mid-write; that's the scenario tested.
No regressions: full colgrep suite (541 lib + 55 bin) + cargo fmt/clippy + make ci-quick green.
Known limitations (inherent trade-offs)
Forced/corrupted/version-bump rebuilds intentionally keep the atomic full_rebuild (tmp-swap) path, which keeps a working index searchable while it rebuilds.
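To make the routing order concrete, here is a rough sketch of the decision described in this PR: the version/force/corruption check comes first, and only then the resume branch. The Plan enum, the Status struct, and choose_plan are hypothetical names for illustration, and treating "no stored version" as "no index yet" is an assumption of the sketch rather than something the PR states.

```rust
#[derive(Debug, PartialEq)]
enum Plan {
    FullRebuild, // atomic tmp-swap path; would also clear a stale .building marker
    Resumable,   // first build, or resume of an interrupted first build
    Incremental, // healthy index: only process changed files
}

/// Hypothetical summary of what is found on disk before indexing starts.
struct Status {
    stored_version: Option<String>, // None is assumed to mean "no index yet"
    building_marker: bool,          // a .building marker is present
    corrupted: bool,
}

fn choose_plan(s: &Status, current_version: &str, force: bool) -> Plan {
    // Checked first, so a version bump never resumes into a mixed-version index.
    let version_changed = s
        .stored_version
        .as_deref()
        .map_or(false, |v| v != current_version);
    if force || s.corrupted || version_changed {
        return Plan::FullRebuild;
    }
    // Only reached when the version matches or no index exists yet.
    if s.stored_version.is_none() || s.building_marker {
        return Plan::Resumable;
    }
    Plan::Incremental
}
```

The ordering is the design point: because the rebuild conditions are tested before the marker, an interrupted build left behind by an older CLI version is discarded rather than resumed.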