fix: prevent mv lock timeout causing missing L0/L1 files by r266-tech · Pull Request #1064 · volcengine/OpenViking

r266-tech · 2026-03-28T20:38:28Z

Problem

The SemanticProcessor fails to move generated .abstract.md (L0) and .overview.md (L1) files from the temp directory to the target resource directory. Server logs show "Failed to acquire mv lock" errors, resulting in missing layer files even though the VLM successfully generated them.

Root cause: two compounding issues:

- TransactionConfig.lock_timeout and LockManager both default to 0.0, so any lock contention causes immediate failure, with no waiting and no retry.
- SemanticProcessor._sync_topdown_recursive() has no retry logic around viking_fs.mv() calls. When concurrent operations compete for subtree locks, the first contention kills the move.
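The effect of a zero timeout can be reproduced with the standard library alone. The sketch below is not OpenViking code: `try_acquire`, `hold_briefly`, and `contend` are hypothetical names, and a plain `threading.Lock` stands in for the project's subtree locks. It only illustrates why a 0.0 timeout fails under contention that a short wait would have survived.

```python
import threading
import time


def try_acquire(lock: threading.Lock, timeout: float) -> bool:
    """Return True if the lock was acquired within `timeout` seconds."""
    if timeout <= 0:
        # Fail-fast: one non-blocking attempt, no waiting at all.
        return lock.acquire(blocking=False)
    return lock.acquire(timeout=timeout)


def hold_briefly(lock: threading.Lock, seconds: float, ready: threading.Event) -> None:
    """Simulate a concurrent operation that owns the lock for a short time."""
    with lock:
        ready.set()
        time.sleep(seconds)


def contend(timeout: float) -> bool:
    """Try to take a lock that another thread holds for 0.2 seconds."""
    lock = threading.Lock()
    ready = threading.Event()
    holder = threading.Thread(target=hold_briefly, args=(lock, 0.2, ready))
    holder.start()
    ready.wait()  # the holder thread now owns the lock
    acquired = try_acquire(lock, timeout)
    if acquired:
        lock.release()
    holder.join()
    return acquired
```

Under these assumptions, `contend(0.0)` fails even though the lock is released 0.2 seconds later, while `contend(5.0)` waits briefly and succeeds.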

Fix

1. Change default `lock_timeout` from `0.0` to `5.0` (seconds)

Updated in both TransactionConfig and LockManager/init_lock_manager. Five seconds is enough for transient contention to clear without causing indefinite blocking on real deadlocks. Users who want fail-fast behavior can still set lock_timeout=0.
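As a rough illustration of the new default and the fail-fast override, here is a minimal sketch. `TransactionConfigSketch` is a hypothetical stand-in written for this example; the real TransactionConfig may have other fields and may be constructed differently.

```python
from dataclasses import dataclass


@dataclass
class TransactionConfigSketch:
    """Hypothetical stand-in, not the project's TransactionConfig."""

    # Seconds to wait for a lock; 0 means fail immediately on contention.
    lock_timeout: float = 5.0


# New default: wait up to five seconds for contention to clear.
default_cfg = TransactionConfigSketch()

# Explicit opt-in to the previous fail-fast behavior.
fail_fast_cfg = TransactionConfigSketch(lock_timeout=0)
```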

2. Add retry logic with backoff for mv operations

New _mv_with_retry() helper in semantic_processor.py:

- Retries up to 3 times with increasing delay (0.3s → 0.6s → 0.9s)
- Only retries on lock-related errors (checks if "lock" is in the error message)
- Logs a warning on each retry for debugging
- Raises the original exception if retries are exhausted

Applied to all 4 viking_fs.mv() call sites in _sync_topdown_recursive().
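The retry behavior described above could look roughly like the following. This is a sketch under stated assumptions, not the code from this PR: the function name `mv_with_retry`, its parameters, the synchronous call style, and the case-insensitive "lock" check are all assumptions made for illustration. The real helper is a method in semantic_processor.py and may differ, for example by awaiting an async `viking_fs.mv()`. The sketch also reads "retries up to 3 times" as one initial attempt plus three retries.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger(__name__)


def mv_with_retry(
    mv: Callable[[str, str], None],
    src: str,
    dst: str,
    max_retries: int = 3,
    base_delay: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call mv(src, dst), retrying lock-related failures with linear backoff."""
    for attempt in range(max_retries + 1):
        try:
            mv(src, dst)
            return
        except Exception as exc:
            # Assumption: lock errors are recognized by their message text.
            is_lock_error = "lock" in str(exc).lower()
            if not is_lock_error or attempt == max_retries:
                raise  # non-lock error, or retries exhausted: re-raise as-is
            delay = base_delay * (attempt + 1)  # 0.3s, then 0.6s, then 0.9s
            logger.warning(
                "mv %s -> %s hit a lock error (retry %d/%d in %.1fs): %s",
                src, dst, attempt + 1, max_retries, delay, exc,
            )
            sleep(delay)
```

The `sleep` parameter is injectable only so the backoff can be tested without real waiting. Matching on the error message is fragile; a dedicated exception type for lock failures, if the codebase has one, would be a sturdier condition.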

Changes

| File | Change |
| --- | --- |
| `openviking_cli/utils/config/transaction_config.py` | Default `lock_timeout`: `0.0` → `5.0` |
| `openviking/storage/transaction/lock_manager.py` | Default `lock_timeout`: `0.0` → `5.0` (2 locations) |
| `openviking/storage/queuefs/semantic_processor.py` | New `_mv_with_retry()` helper, 4 call sites updated |

3 files changed, +37/-13

Two-part fix for SemanticProcessor failing to move generated layer files (.abstract.md, .overview.md) from temp to target directory.

1. Change default lock_timeout from 0.0 to 5.0 (seconds)

- LockManager and TransactionConfig both defaulted to 0.0, meaning any lock contention caused immediate failure
- Updated to 5.0s: enough for transient contention, not so long that a real deadlock blocks indefinitely
- Users can still set lock_timeout=0 for fail-fast behavior

2. Add retry logic with backoff for mv operations in SemanticProcessor._sync_topdown_recursive()

- New _mv_with_retry() helper: retries up to 3 times with increasing delay (0.3s, 0.6s, 0.9s) on lock errors
- Applied to all 4 viking_fs.mv() call sites
- Logs warning on each retry attempt for debugging

Closes volcengine#1047

CLAassistant · 2026-03-28T20:38:35Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

github-actions · 2026-03-28T20:39:11Z

Failed to generate code suggestions for PR

github-project-automation bot moved this to Backlog in OpenViking project Mar 28, 2026

github-project-automation bot added this to OpenViking project Mar 28, 2026

r266-tech wants to merge 1 commit into volcengine:main from r266-tech:fix/lock-timeout-mv-retry
