Resumable, checkpointed initial index build (fix "rebuilds every time") by raphaelsty · Pull Request #110 · lightonai/next-plaid

raphaelsty · 2026-05-28T19:37:19Z

Problem

ColGrep was reported to "rebuild the index every time," even when no files had changed. Root cause: a first-time full build that exceeds the caller's timeout (an agent's ~2-min command limit, Ctrl-C, etc.) is killed before it finishes. full_rebuild encodes into a throwaway index.tmp and persists state.json only on success, so an interrupted build saves nothing and the next search restarts from scratch, forever. Reproduced on a real transformers/src/transformers/models tree (2160 files): still building at 200s, nothing persisted after a kill.

Fix

The chunk pipeline already commits index data incrementally. Work was lost on interruption for only two reasons: the build wrote into index.tmp, and state.json was persisted only at the end. So:

- build_resumable: for the first build (and resumes), encode directly into the real index dir in file-coherent batches of ~4096 units, saving state.json after each committed batch. An interrupted run keeps the batches it finished.
- A .building marker routes the next run back into build_resumable, which skips already-committed files and continues. Cleared on completion.
- Idempotent resume: each batch deletes a file's prior docs before re-embedding, and repair_index_db_sync trims any partial-chunk desync, so interruptions never accumulate duplicate/orphan documents.
- Zero-unit files (empty / import-only) are recorded immediately (parity with full_rebuild).
- Refactored index()/try_index() to share run_indexing().
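
The checkpointing loop described in the list above can be sketched as follows. This is a minimal in-memory model, not the PR's code: the `Store` struct stands in for the index dir, state.json and the .building marker, and `plan_batches`, `max_batches` and the signature of `build_resumable` are assumptions made for illustration. A "kill" is simulated by stopping after `max_batches` committed batches.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the on-disk index dir, state.json and `.building`.
#[derive(Default, Debug)]
struct Store {
    docs: Vec<(String, usize)>, // committed (file, unit) documents
    done: BTreeSet<String>,     // files recorded in the persisted state
    building: bool,             // models the `.building` marker
}

/// Group files into batches of roughly `batch_units` units without splitting a file.
fn plan_batches<'a>(todo: &[(&'a str, usize)], batch_units: usize) -> Vec<Vec<(&'a str, usize)>> {
    let mut batches = Vec::new();
    let mut cur = Vec::new();
    let mut units = 0;
    for &(file, n) in todo {
        // Close the current batch before it would exceed the target size.
        if !cur.is_empty() && units + n > batch_units {
            batches.push(std::mem::take(&mut cur));
            units = 0;
        }
        cur.push((file, n));
        units += n;
    }
    if !cur.is_empty() {
        batches.push(cur);
    }
    batches
}

/// Returns true when the build completed, false when it was "killed" early.
fn build_resumable(store: &mut Store, files: &[(&str, usize)], batch_units: usize, max_batches: usize) -> bool {
    store.building = true;
    // Resume: skip files whose batch was already committed.
    let todo: Vec<(&str, usize)> = files
        .iter()
        .copied()
        .filter(|&(f, _)| !store.done.contains(f))
        .collect();
    for (i, batch) in plan_batches(&todo, batch_units).into_iter().enumerate() {
        if i == max_batches {
            return false; // simulated kill: marker and committed batches remain
        }
        for &(file, n) in &batch {
            store.docs.extend((0..n).map(|unit| (file.to_string(), unit)));
        }
        // Checkpoint: state is updated only after the whole batch is committed.
        store.done.extend(batch.iter().map(|&(file, _)| file.to_string()));
    }
    store.building = false; // marker cleared on completion
    true
}
```

Calling `build_resumable` repeatedly with a small `max_batches` models a series of timeouts: each call picks up after the last committed batch. Partial writes inside a batch are not modeled here.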

Version bumps still rebuild from scratch ✅

The version/force check runs before the resume branch and clears any .building marker, so a CLI-version change (or clear/forced rebuild) always discards the old index and does the atomic full_rebuild — never a mixed-version index. Verified by repro.
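
The ordering of these checks is the important part, and a small decision function makes it concrete. This is a sketch under assumptions: `BuildMode`, `IndexStatus` and `choose_mode` are hypothetical names, and the real code may track more conditions (for example a corrupted index) than shown here.

```rust
#[derive(Debug, PartialEq)]
enum BuildMode {
    FullRebuild,       // atomic tmp-swap rebuild from scratch
    BuildResumable,    // first build, or resume of an interrupted one
    IncrementalUpdate, // normal operation on a complete index
}

/// Hypothetical summary of what the indexer finds on disk.
struct IndexStatus {
    stored_version: Option<String>, // None models "no state.json yet"
    building_marker: bool,          // models the `.building` marker
    force: bool,                    // clear / forced rebuild requested
}

fn choose_mode(current_version: &str, s: &mut IndexStatus) -> BuildMode {
    let version_changed = matches!(s.stored_version.as_deref(), Some(v) if v != current_version);
    // The version/force check comes first, so a stale marker can never
    // route a version-bumped index into the resume path.
    if s.force || version_changed {
        s.building_marker = false;
        return BuildMode::FullRebuild;
    }
    if s.building_marker || s.stored_version.is_none() {
        return BuildMode::BuildResumable;
    }
    BuildMode::IncrementalUpdate
}
```

If the resume branch were checked first, an interrupted build followed by a CLI upgrade would continue into the old index and mix documents from two versions.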

Stress testing (release binary)

ctrlc only handles SIGINT, so a real agent timeout (SIGTERM/SIGKILL) is a hard kill mid-write — that's the scenario tested.

| Scenario | Result |
| --- | --- |
| Graceful interrupt (SIGINT), 13s window, 4000 files | ✅ Completes in 2 rounds; final index identical to clean build (4000 files / 16000 docs, same top hits) |
| Hard kill (SIGKILL) ×20, 5s window | ✅ No corruption, no duplicate/orphan docs (count oscillates around committed value, never grows), state never regresses; ⚠️ stalls at 1 batch (window ≈ one batch + overhead) |
| Hard kill (SIGKILL), 12s window, 4000 files | ✅ Monotonic 2048→3072→4000, completes in 3 rounds, content-equivalent (4000 files / 16000 docs, no dups) |
| Version bump | ✅ Full rebuild from scratch |
| Zero-unit files (empty / import-only) | ✅ Recorded correctly |

No regressions: full colgrep suite (541 lib + 55 bin) + cargo fmt/clippy + make ci-quick green.

Known limitations (inherent trade-offs)

Short timeouts. Each resume repeats fixed overhead (rescan + re-parse remaining files + model reload, ~4–5s). If the kill window is barely larger than (overhead + one batch), progress stalls — but stays monotonic and never loses committed work or corrupts. Realistic windows (~2 min) complete in 1–3 rounds. A follow-up could stream parsing (parse+encode+commit per file-chunk) to make time-to-first-commit independent of repo size.
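
The stall condition can be made concrete with a back-of-the-envelope model. The sketch below assumes a constant per-round overhead and a constant per-batch time, which is a simplification (the real overhead shrinks as fewer files remain to be re-parsed); `rounds_to_complete` and its parameters are hypothetical, and the numbers in the usage note are illustrative rather than measured.

```rust
/// Number of kill/resume rounds needed to commit `total_batches`, or None
/// if no batch can ever finish inside one window (progress stalls).
/// All durations are in whole seconds.
fn rounds_to_complete(window_s: u64, overhead_s: u64, batch_s: u64, total_batches: u64) -> Option<u64> {
    if total_batches == 0 {
        return Some(0);
    }
    // Time left for encoding after the fixed overhead (rescan, re-parse, model reload).
    let usable = window_s.saturating_sub(overhead_s);
    // Only batches that finish before the kill are committed; a zero batch time yields None.
    let per_round = usable.checked_div(batch_s)?;
    if per_round == 0 {
        return None;
    }
    // Ceiling division.
    Some((total_batches + per_round - 1) / per_round)
}
```

With an illustrative 120s window, 5s overhead and 30s batches, three batches fit per round, so six batches need two rounds. With a 5s window and 5s overhead no batch ever commits and the function returns `None`.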
Ranking on near-ties. A resumed build has complete recall but, like any incrementally-built index (PLAID is approximate; centroids seed from the first batch), may rank near-identical docs slightly differently than a pristine single-shot build. Consistent with normal incremental operation.

Forced/corrupted/version-bump rebuilds intentionally keep the atomic full_rebuild (tmp-swap) path, which keeps a working index searchable while it rebuilds.

Fixes the "rebuilds the whole index every time" report: a first-time full build that exceeds the caller's timeout (e.g. an agent's 2-min command limit) was killed before it finished, and since full_rebuild encodes into a throwaway index.tmp and only persists state.json on success, it saved nothing, so the next search restarted from scratch, forever.

The chunk pipeline already commits index data incrementally; the only things lost on interruption were (1) building into index.tmp and (2) end-only state persistence.

- New build_resumable: for the first build (and its resumes), encode directly into the real index dir in file-coherent batches of ~4096 units, saving state.json after each committed batch. An interrupted run keeps the batches it finished.
- A .building marker routes the next run back into build_resumable so it resumes (skipping already-committed files) instead of restarting. The marker is cleared on completion.
- Each batch deletes a file's prior docs before re-embedding, so resuming after a mid-batch interruption is idempotent (no duplicate documents); repair_index_db_sync trims any partial-chunk desync on resume.
- Refactored index()/try_index() to share run_indexing().

Version bumps and forced/corrupted rebuilds still take the atomic full_rebuild path (and clear any stale .building marker), so a CLI-version change always rebuilds from scratch.

Stress tested: interrupting a 4000-file build mid-way resumes with monotonic progress and a final index identical to a clean build (same file count, same num_documents with no duplicates, identical top hits).
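
The delete-before-re-embed step mentioned in this commit message is what makes a resume idempotent. A minimal sketch of the idea, using an in-memory `Vec` in place of the index; `Doc` and `commit_file` are hypothetical names, and the repair_index_db_sync step is not modeled:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Doc {
    file: String,
    unit: usize,
}

/// Commit all units of `file`, first dropping anything a previous
/// (possibly interrupted) attempt left behind for that file.
fn commit_file(index: &mut Vec<Doc>, file: &str, n_units: usize) {
    // Delete the file's prior docs so a retry cannot create duplicates.
    index.retain(|d| d.file != file);
    // Re-add every unit of the file.
    index.extend((0..n_units).map(|unit| Doc { file: file.to_string(), unit }));
}
```

If a kill leaves two of a file's four documents in the index, calling `commit_file` again for that file ends with exactly four documents, and calling it a further time changes nothing.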

Files that parse but yield no code units (empty or import-only files like __init__.py) were given a file_info entry but never inserted into state by build_resumable, since the batch loop only iterates files that have units. full_rebuild records them, so this was a parity gap: such files stayed in the resume "todo" set and were re-parsed on every resume round, and were only recorded after the build completed (via a later incremental_update). Record them as done (and persist) right after parsing — they need no embedding, so marking them complete at any point is safe and survives an interruption.
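
The parity fix described above amounts to recording zero-unit files at parse time instead of inside the batch loop. A sketch under assumptions: `record_zero_unit_files` and `todo` are hypothetical helpers, and a `BTreeSet` stands in for the persisted state.

```rust
use std::collections::BTreeSet;

/// Mark files with no code units as done right after parsing and return
/// the files that still need embedding.
fn record_zero_unit_files<'a>(
    done: &mut BTreeSet<String>,
    parsed: &[(&'a str, usize)],
) -> Vec<(&'a str, usize)> {
    let mut needs_embedding = Vec::new();
    for &(file, units) in parsed {
        if units == 0 {
            // Nothing to embed, so recording the file at any point is safe.
            done.insert(file.to_string());
        } else {
            needs_embedding.push((file, units));
        }
    }
    needs_embedding
}

/// Files a resume round would still have to parse.
fn todo<'a>(done: &BTreeSet<String>, all_files: &[&'a str]) -> Vec<&'a str> {
    all_files.iter().copied().filter(|f| !done.contains(*f)).collect()
}
```

After this step an empty `__init__.py` no longer appears in the resume "todo" set, so it is not re-parsed on later rounds.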

raphaelsty force-pushed the feat/resumable-index-build branch from 000d991 to 58782ac Compare May 28, 2026 19:40

raphaelsty merged commit 073fa3b into main May 28, 2026
20 checks passed

raphaelsty merged 2 commits into main from feat/resumable-index-build
