Make index/centroids updates crash-safe and expose update progress in health by DeeJayPee · Pull Request #104 · lightonai/next-plaid

DeeJayPee · 2026-05-22T08:06:31Z

This PR fixes a production failure mode where next-plaid-api could crash with SIGBUS during a query while a bulk update was rewriting index files.

The root issue was that critical files such as centroids.npy were rewritten in place with File::create(...), which truncates the file immediately. If another request was reading the old mmap-backed file at the same time, Linux could raise SIGBUS; if the process died before the rewrite completed, the index was left with a zero-byte centroids.npy, causing restart failures like:

Index load failed: NPY file too small
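The truncation step described above can be reproduced in isolation. The following is a minimal sketch, not code from this PR: `len_after_create` is a hypothetical helper that only shows what `File::create` does to an existing file before any replacement bytes are written.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Hypothetical helper: returns the file's length right after `File::create`
/// reopens it, before any replacement bytes have been written.
fn len_after_create(path: &Path) -> io::Result<u64> {
    // Stand-in for an existing index file with real content.
    fs::write(path, b"old centroids payload")?;

    // `File::create` truncates an existing file as part of opening it,
    // so the old bytes are already gone here.
    let file = File::create(path)?;
    let len = file.metadata()?.len();

    // A crash at this point would leave a zero-byte file on disk.
    Ok(len)
}
```

Any reader that mapped the old contents before the `File::create` call is now looking at a file shorter than its mapping, which is the window in which the SIGBUS and the zero-byte restart failure can occur.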

What Changed

Writes critical index files through atomic temp-file writes followed by rename.
Covers centroids.npy, ivf_lengths.npy, metadata.json, cluster_threshold.npy, IVF files, chunk files, and index creation/update paths.
Keeps the previous loaded index available to readers until the updated index has been written and successfully reloaded.
Improves NPY load errors to include the file path and observed byte size.
Adds in-memory update progress tracking.
Extends /health with an updates field showing active and recent update status.
Tracks stages such as queued, encoding, batching, centroid_expansion, kmeans, index_write, metadata_write, reload, complete, and failed.
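For readers unfamiliar with the temp-file-then-rename pattern named in the first item, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not the PR's implementation: the function name `atomic_write`, the temp-file naming scheme, and the error handling are all hypothetical.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: write `bytes` to `path` so readers see either the
/// old file or the complete new one, never a truncated file.
fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let parent = path
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or_else(|| Path::new("."));
    let name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;

    // The temp file lives in the same directory so the rename stays on one filesystem.
    // The ".tmp.<pid>" suffix is an assumed naming scheme.
    let mut tmp_name = name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp = parent.join(tmp_name);

    let result = (|| -> io::Result<()> {
        // create_new fails if the temp path already exists instead of reusing it.
        let mut file = OpenOptions::new().write(true).create_new(true).open(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush file contents before the rename
        fs::rename(&tmp, path)?;
        if cfg!(unix) {
            // Sync the directory so the rename itself is durable.
            File::open(parent)?.sync_all()?;
        }
        Ok(())
    })();

    if result.is_err() {
        // Best-effort cleanup of the temp file on any failure.
        let _ = fs::remove_file(&tmp);
    }
    result
}
```

Because the target path is only ever touched by `rename`, a failure before that call leaves the previous file in place.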

Example /health Output

{
  "status": "healthy",
  "version": "1.3.1",
  "loaded_indices": 1,
  "index_dir": "/var/lib/next-plaid/indices",
  "memory_usage_bytes": 2312192000,
  "indices": [],
  "updates": [
    {
      "index": "my_test_index",
      "status": "running",
      "stage": "centroid_expansion",
      "queued_documents": 135500,
      "processed_documents": 500,
      "started_at": "2026-05-22T00:01:27Z",
      "updated_at": "2026-05-22T00:05:21Z",
      "elapsed_ms": 235491,
      "message": "finding embeddings outside existing centroids"
    }
  ],
  "model": null
}

Validation

cargo check -p next-plaid-api --locked
cargo test -p next-plaid atomic_write_failure_preserves_original_file --locked
cargo test -p next-plaid-api health --locked
cargo test -p next-plaid-api update_health_status_tracks_active_and_completed_updates --locked

Write critical index files through atomic temp-file renames so interrupted updates do not leave truncated centroids or metadata behind. Expose queued, running, completed, and failed update progress directly in the health response, and keep the previous loaded index available until reload succeeds.

raphaelsty

Thanks for the MR @DeeJayPee! Really clean read on the SIGBUS root cause, and the atomic_write_file mechanism is implemented carefully (same parent directory for the temp file, create_new to dodge the TOCTOU race, sync_all on both the file and the parent dir, cleanup on failure). The /health progress will save real time the next time someone files an indexer ticket. Left a few inline thoughts, nothing blocking apart from the Windows path.

raphaelsty · 2026-05-22T22:29:34Z

-            state_clone.unload_index(&name_inner);
-
            // Check sync before updating: if filtering DB exists, counts must match
            let index_path = std::path::Path::new(&path_str);


Worth keeping the upfront unload on Windows. On POSIX the atomic rename works because the old mmap holds the unlinked inode, but on Windows fs::rename cannot replace a file while it is currently mapped (returns ERROR_SHARING_VIOLATION). A cfg gate keeps the new behaviour on Linux and macOS while preserving the previous Windows workaround:

#[cfg(target_os = "windows")] state_clone.unload_index(&name_inner);

Same pattern would help in process_batch if Windows is still a target.

raphaelsty · 2026-05-22T22:29:34Z

+    UPDATE_PROGRESS.with(|slot| {
+        *slot.borrow_mut() = Some(Box::new(callback));
+    });
+    let result = operation();


If operation() panics, the thread_local stays populated. Since tokio reuses spawn_blocking workers, the next update on this thread inherits a stale callback pointing into a freed closure. A small RAII guard makes it panic safe:

struct ProgressGuard;

impl Drop for ProgressGuard {
    fn drop(&mut self) {
        UPDATE_PROGRESS.with(|slot| *slot.borrow_mut() = None);
    }
}

Then let _g = ProgressGuard; before let result = operation();.
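Put together, the guard suggestion might look like the sketch below. The callback signature (`Fn(&str)`), the `with_progress` wrapper, and the `progress_installed` helper are assumptions made for illustration; only `UPDATE_PROGRESS` and `ProgressGuard` come from the snippets above.

```rust
use std::cell::RefCell;

// Assumed callback shape; the real alias in the PR may differ.
type ProgressCallback = Box<dyn Fn(&str)>;

thread_local! {
    static UPDATE_PROGRESS: RefCell<Option<ProgressCallback>> = const { RefCell::new(None) };
}

/// Clears the thread-local slot when dropped, including during unwinding.
struct ProgressGuard;

impl Drop for ProgressGuard {
    fn drop(&mut self) {
        UPDATE_PROGRESS.with(|slot| *slot.borrow_mut() = None);
    }
}

/// Hypothetical wrapper: install `callback`, run `operation`, always clear the slot.
fn with_progress<T>(callback: impl Fn(&str) + 'static, operation: impl FnOnce() -> T) -> T {
    UPDATE_PROGRESS.with(|slot| *slot.borrow_mut() = Some(Box::new(callback)));
    let _guard = ProgressGuard; // dropped on normal return and on panic
    operation()
}

/// Hypothetical helper: true while a callback is installed on the current thread.
fn progress_installed() -> bool {
    UPDATE_PROGRESS.with(|slot| slot.borrow().is_some())
}
```

The guard is created after the callback is installed and before `operation()` runs, so the slot is cleared whether `operation()` returns or unwinds.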

raphaelsty · 2026-05-22T22:29:34Z

+    /// Get active and recent update progress for the health endpoint.
+    pub fn get_update_health_statuses(&self) -> Vec<UpdateHealthStatus> {
+        let now = SystemTime::now();
+        let mut progress = self.update_progress.write();


Heads up: this takes a write lock on every /health call to prune. With monitoring plus load balancer polling, that contends with update workers writing progress. Two cheap options: try_write and skip pruning when contended, or fold pruning into the record_update_* paths so reads only need a read lock.
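The second option (prune on the record path, read-only health) could be sketched as follows. This is an assumption-laden illustration rather than the project's code: `ProgressStore`, `UpdateRecord`, `is_live`, and the retention window are hypothetical, and it uses `std::sync::RwLock` regardless of which lock type the project actually uses.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, SystemTime};

// Hypothetical record; the PR's UpdateHealthStatus carries more fields.
#[derive(Clone, Debug)]
struct UpdateRecord {
    stage: String,
    updated_at: SystemTime,
    terminal: bool,
}

struct ProgressStore {
    updates: RwLock<HashMap<String, UpdateRecord>>,
    retention: Duration, // assumed retention window for finished updates
}

/// Running updates are always live; finished ones expire after `retention`.
fn is_live(r: &UpdateRecord, now: SystemTime, retention: Duration) -> bool {
    !r.terminal || now.duration_since(r.updated_at).map_or(true, |age| age <= retention)
}

impl ProgressStore {
    fn new(retention: Duration) -> Self {
        Self { updates: RwLock::new(HashMap::new()), retention }
    }

    /// Write path: prune expired entries while the write lock is already held.
    fn record(&self, index: &str, stage: &str, terminal: bool, now: SystemTime) {
        let retention = self.retention;
        let mut map = self.updates.write().unwrap();
        map.retain(|_, r| is_live(r, now, retention));
        map.insert(
            index.to_string(),
            UpdateRecord { stage: stage.to_string(), updated_at: now, terminal },
        );
    }

    /// Read path for /health: shared lock only; expired entries are skipped, not removed.
    fn statuses(&self, now: SystemTime) -> Vec<(String, UpdateRecord)> {
        let map = self.updates.read().unwrap();
        map.iter()
            .filter(|(_, r)| is_live(r, now, self.retention))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect()
    }
}
```

Filtering on the read side keeps expired entries out of the response even if no update has run recently enough to prune them.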

DeeJayPee · 2026-05-25T16:32:43Z

Good catches, agreed on all three. I don't use Windows... I’ll keep the POSIX behavior to preserve mmap readers during atomic rename, restore the upfront unload behind #[cfg(target_os = "windows")] in both update paths, make the thread-local progress callback panic-safe with an RAII guard, and move pruning out of /health reads so health only takes a read lock.

- update.rs: factor the thread-local progress callback into a `ProgressCallback` type alias (fixes clippy::type_complexity, which -D warnings turns into a build error in `make ci-quick`) and use a const thread-local initializer.
- state.rs: freeze `elapsed_ms` at updated_at - started_at for terminal (complete/failed) updates so a finished job stops running up the clock on every /health poll; add a regression test.
- state.rs: move the `#[cfg(test)] mod tests` block to the end of the file.

Co-authored-by: GUIOT Jean-Philippe <239749+DeeJayPee@users.noreply.github.com>
Co-authored-by: Raphael Sourty <24591024+raphaelsty@users.noreply.github.com>
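The `elapsed_ms` rule from the state.rs change can be expressed as a small pure function. This is a sketch of the rule as described in the commit message, with a hypothetical signature; the actual code in state.rs may be structured differently.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper: elapsed milliseconds for an update entry.
fn elapsed_ms(
    started_at: SystemTime,
    updated_at: SystemTime,
    now: SystemTime,
    terminal: bool,
) -> u128 {
    // Terminal (complete/failed) updates measure up to their last update,
    // so the value stays fixed; running updates measure up to `now`.
    let end = if terminal { updated_at } else { now };
    // Clamp to zero if the clock moved backwards.
    end.duration_since(started_at)
        .unwrap_or(Duration::ZERO)
        .as_millis()
}
```

With this rule, repeated /health polls return the same `elapsed_ms` for a finished job, while a running job keeps advancing.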

raphaelsty · 2026-05-28T22:47:37Z

Hi @DeeJayPee thank you for this MR, merging it :)

raphaelsty reviewed May 22, 2026

GUIOT Jean-Philippe added 3 commits May 25, 2026 12:52

Fix Windows file replacement issues, improve update progress cleanup, and ensure progress callbacks are properly cleared on panic

435ac71

Merge branch 'codex/crash-safe-update-health-progress'

551fec8

Merge branch 'main' of https://github.com/lightonai/next-plaid

25cf373

raphaelsty force-pushed the main branch from 8c7e54c to 1068073 Compare May 28, 2026 22:32

raphaelsty merged commit 971c637 into lightonai:main May 28, 2026
20 checks passed

raphaelsty mentioned this pull request May 28, 2026

Reject embedding dims where dim*nbits isn't a multiple of 8 (instead of panicking) #111

Closed

raphaelsty merged 5 commits into lightonai:main from DeeJayPee:main
