Skip to content

Make index/centroids updates crash-safe and expose update progress in health#104

Merged
raphaelsty merged 5 commits into
lightonai:mainfrom
DeeJayPee:main
May 28, 2026
Merged

Make index/centroids updates crash-safe and expose update progress in health#104
raphaelsty merged 5 commits into
lightonai:mainfrom
DeeJayPee:main

Conversation

@DeeJayPee
Copy link
Copy Markdown
Contributor

This PR fixes a production failure mode where next-plaid-api could crash with SIGBUS during a query while a bulk update was rewriting index files.

The root issue was that critical files such as centroids.npy were rewritten in place with File::create(...), which truncates the file immediately. If another request was reading the old mmap-backed file at the same time, Linux could raise SIGBUS; if the process died before the rewrite completed, the index was left with a zero-byte centroids.npy, causing restart failures like:

Index load failed: NPY file too small

What Changed

  • Writes critical index files through atomic temp-file writes followed by rename.
  • Covers centroids.npy, ivf_lengths.npy, metadata.json, cluster_threshold.npy, IVF files, chunk files, and index creation/update paths.
  • Keeps the previous loaded index available to readers until the updated index has been written and successfully reloaded.
  • Improves NPY load errors to include the file path and observed byte size.
  • Adds in-memory update progress tracking.
  • Extends /health with an updates field showing active and recent update status.
  • Tracks stages such as queued, encoding, batching, centroid_expansion, kmeans, index_write, metadata_write, reload, complete, and failed.

Example /health Output

{
  "status": "healthy",
  "version": "1.3.1",
  "loaded_indices": 1,
  "index_dir": "/var/lib/next-plaid/indices",
  "memory_usage_bytes": 2312192000,
  "indices": [],
  "updates": [
    {
      "index": "my_test_index",
      "status": "running",
      "stage": "centroid_expansion",
      "queued_documents": 135500,
      "processed_documents": 500,
      "started_at": "2026-05-22T00:01:27Z",
      "updated_at": "2026-05-22T00:05:21Z",
      "elapsed_ms": 235491,
      "message": "finding embeddings outside existing centroids"
    }
  ],
  "model": null
}

Validation

  • cargo check -p next-plaid-api --locked
  • cargo test -p next-plaid atomic_write_failure_preserves_original_file --locked
  • cargo test -p next-plaid-api health --locked
  • cargo test -p next-plaid-api update_health_status_tracks_active_and_completed_updates --locked

Write critical index files through atomic temp-file renames so interrupted updates do not leave truncated centroids or metadata behind.

Expose queued, running, completed, and failed update progress directly in the health response, and keep the previous loaded index available until reload succeeds.
Copy link
Copy Markdown
Collaborator

@raphaelsty raphaelsty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the MR @DeeJayPee! Really clean read on the SIGBUS root cause, and the atomic_write_file mechanism is implemented carefully (same parent directory for the temp file, create_new to dodge the TOCTOU race, sync_all on both the file and the parent dir, cleanup on failure). The /health progress will save real time the next time someone files an indexer ticket. Left a few inline thoughts, nothing blocking apart from the Windows path.

state_clone.unload_index(&name_inner);

// Check sync before updating: if filtering DB exists, counts must match
let index_path = std::path::Path::new(&path_str);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth keeping the upfront unload on Windows. On POSIX the atomic rename works because the old mmap holds the unlinked inode, but on Windows fs::rename cannot replace a file while it is currently mapped (returns ERROR_SHARING_VIOLATION). A cfg gate keeps the new behaviour on Linux and macOS while preserving the previous Windows workaround:

#[cfg(target_os = "windows")]
state_clone.unload_index(&name_inner);

Same pattern would help in process_batch if Windows is still a target.

Comment thread next-plaid/src/update.rs Outdated
UPDATE_PROGRESS.with(|slot| {
*slot.borrow_mut() = Some(Box::new(callback));
});
let result = operation();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If operation() panics, the thread_local stays populated. Since tokio reuses spawn_blocking workers, the next update on this thread inherits a stale callback pointing into a freed closure. A small RAII guard makes it panic safe:

struct ProgressGuard;
impl Drop for ProgressGuard {
    fn drop(&mut self) {
        UPDATE_PROGRESS.with(|slot| *slot.borrow_mut() = None);
    }
}

Then let _g = ProgressGuard; before let result = operation();.

Comment thread next-plaid-api/src/state.rs Outdated
/// Get active and recent update progress for the health endpoint.
pub fn get_update_health_statuses(&self) -> Vec<UpdateHealthStatus> {
let now = SystemTime::now();
let mut progress = self.update_progress.write();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up: this takes a write lock on every /health call to prune. With monitoring plus load balancer polling, that contends with update workers writing progress. Two cheap options: try_write and skip pruning when contended, or fold pruning into the record_update_* paths so reads only need a read lock.

@DeeJayPee
Copy link
Copy Markdown
Contributor Author

Good catches, agreed on all three. I dont use Windows... I’ll keep the POSIX behavior to preserve mmap readers during atomic rename, restore the upfront unload behind #[cfg(target_os = "windows")] in both update paths, make the thread-local progress callback panic-safe with an RAII guard, and move pruning out of /health reads so health only takes a read lock.

- update.rs: factor the thread-local progress callback into a
  `ProgressCallback` type alias (fixes clippy::type_complexity, which
  -D warnings turns into a build error in `make ci-quick`) and use a
  const thread-local initializer.
- state.rs: freeze `elapsed_ms` at updated_at-started_at for terminal
  (complete/failed) updates so a finished job stops running up the clock
  on every /health poll; add a regression test.
- state.rs: move the `#[cfg(test)] mod tests` block to the end of the file.

Co-authored-by: GUIOT Jean-Philippe <239749+DeeJayPee@users.noreply.github.com>
Co-authored-by: Raphael Sourty <24591024+raphaelsty@users.noreply.github.com>
@raphaelsty
Copy link
Copy Markdown
Collaborator

Hi @DeeJayPee thank you for this MR, merging it :)

@raphaelsty raphaelsty merged commit 971c637 into lightonai:main May 28, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants