Make index/centroids updates crash-safe and expose update progress in health#104
Conversation
Write critical index files through atomic temp-file renames so interrupted updates do not leave truncated centroids or metadata behind. Expose queued, running, completed, and failed update progress directly in the health response, and keep the previous loaded index available until reload succeeds.
raphaelsty
left a comment
There was a problem hiding this comment.
Thanks for the MR @DeeJayPee! Really clean read on the SIGBUS root cause, and the atomic_write_file mechanism is implemented carefully (same parent directory for the temp file, create_new to dodge the TOCTOU race, sync_all on both the file and the parent dir, cleanup on failure). The /health progress will save real time the next time someone files an indexer ticket. Left a few inline thoughts, nothing blocking apart from the Windows path.
| state_clone.unload_index(&name_inner); | ||
|
|
||
| // Check sync before updating: if filtering DB exists, counts must match | ||
| let index_path = std::path::Path::new(&path_str); |
There was a problem hiding this comment.
Worth keeping the upfront unload on Windows. On POSIX the atomic rename works because the old mmap holds the unlinked inode, but on Windows fs::rename cannot replace a file while it is currently mapped (returns ERROR_SHARING_VIOLATION). A cfg gate keeps the new behaviour on Linux and macOS while preserving the previous Windows workaround:
#[cfg(target_os = "windows")]
state_clone.unload_index(&name_inner);Same pattern would help in process_batch if Windows is still a target.
| UPDATE_PROGRESS.with(|slot| { | ||
| *slot.borrow_mut() = Some(Box::new(callback)); | ||
| }); | ||
| let result = operation(); |
There was a problem hiding this comment.
If operation() panics, the thread_local stays populated. Since tokio reuses spawn_blocking workers, the next update on this thread inherits a stale callback pointing into a freed closure. A small RAII guard makes it panic safe:
struct ProgressGuard;
impl Drop for ProgressGuard {
fn drop(&mut self) {
UPDATE_PROGRESS.with(|slot| *slot.borrow_mut() = None);
}
}Then let _g = ProgressGuard; before let result = operation();.
| /// Get active and recent update progress for the health endpoint. | ||
| pub fn get_update_health_statuses(&self) -> Vec<UpdateHealthStatus> { | ||
| let now = SystemTime::now(); | ||
| let mut progress = self.update_progress.write(); |
There was a problem hiding this comment.
Heads up: this takes a write lock on every /health call to prune. With monitoring plus load balancer polling, that contends with update workers writing progress. Two cheap options: try_write and skip pruning when contended, or fold pruning into the record_update_* paths so reads only need a read lock.
… and ensure progress callbacks are properly cleared on panic
|
Good catches, agreed on all three. I dont use Windows... I’ll keep the POSIX behavior to preserve mmap readers during atomic rename, restore the upfront unload behind #[cfg(target_os = "windows")] in both update paths, make the thread-local progress callback panic-safe with an RAII guard, and move pruning out of /health reads so health only takes a read lock. |
- update.rs: factor the thread-local progress callback into a `ProgressCallback` type alias (fixes clippy::type_complexity, which -D warnings turns into a build error in `make ci-quick`) and use a const thread-local initializer. - state.rs: freeze `elapsed_ms` at updated_at-started_at for terminal (complete/failed) updates so a finished job stops running up the clock on every /health poll; add a regression test. - state.rs: move the `#[cfg(test)] mod tests` block to the end of the file. Co-authored-by: GUIOT Jean-Philippe <239749+DeeJayPee@users.noreply.github.com> Co-authored-by: Raphael Sourty <24591024+raphaelsty@users.noreply.github.com>
|
Hi @DeeJayPee thank you for this MR, merging it :) |
This PR fixes a production failure mode where
next-plaid-apicould crash withSIGBUSduring a query while a bulk update was rewriting index files.The root issue was that critical files such as
centroids.npywere rewritten in place withFile::create(...), which truncates the file immediately. If another request was reading the old mmap-backed file at the same time, Linux could raiseSIGBUS; if the process died before the rewrite completed, the index was left with a zero-bytecentroids.npy, causing restart failures like:What Changed
centroids.npy,ivf_lengths.npy,metadata.json,cluster_threshold.npy, IVF files, chunk files, and index creation/update paths./healthwith anupdatesfield showing active and recent update status.queued,encoding,batching,centroid_expansion,kmeans,index_write,metadata_write,reload,complete, andfailed.Example
/healthOutput{ "status": "healthy", "version": "1.3.1", "loaded_indices": 1, "index_dir": "/var/lib/next-plaid/indices", "memory_usage_bytes": 2312192000, "indices": [], "updates": [ { "index": "my_test_index", "status": "running", "stage": "centroid_expansion", "queued_documents": 135500, "processed_documents": 500, "started_at": "2026-05-22T00:01:27Z", "updated_at": "2026-05-22T00:05:21Z", "elapsed_ms": 235491, "message": "finding embeddings outside existing centroids" } ], "model": null }Validation
cargo check -p next-plaid-api --lockedcargo test -p next-plaid atomic_write_failure_preserves_original_file --lockedcargo test -p next-plaid-api health --lockedcargo test -p next-plaid-api update_health_status_tracks_active_and_completed_updates --locked