fix(openclaw-plugin): add defensive re-spawn for OpenViking subproces…#1053

Merged
qin-ctx merged 1 commit into volcengine:main from huangxun375-stack:main
Mar 28, 2026

Contributor

huangxun375-stack commented Mar 28, 2026

…s after Gateway restart

Description

Fix OpenViking subprocess not recovering after Gateway force-restart, causing all memory-dependent requests to time out indefinitely. Always log subprocess exit events (including code=0) for better diagnostics. Add diagnostic logging to ov_archive_expand tool invocations.
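
The exit-logging part of this fix can be pictured with a minimal sketch. It is not the plugin's actual code: the `Logger` shape, the `makeExitHandler` name, the message format, and the choice of severity per exit code are all assumptions made for illustration. The only point taken from this PR is that an exit with code=0 is logged too.

```typescript
// Hypothetical logger shape; the plugin's real logger is not shown in this PR.
type Logger = {
  info: (msg: string) => void;
  warn: (msg: string) => void;
};

// Builds an exit handler that logs every exit.
// A guard such as `if (code !== 0)` around the logging would hide clean exits,
// so a subprocess that stops with code=0 would vanish without a trace.
function makeExitHandler(log: Logger) {
  return (code: number | null, signal: string | null): void => {
    const msg = `subprocess exited (code=${code}, signal=${signal})`;
    if (code === 0) {
      // Clean exit: still recorded (severity choice is an assumption).
      log.info(msg);
    } else {
      // Non-zero code or killed by a signal.
      log.warn(msg);
    }
  };
}
```

The `(code, signal)` parameters follow the shape of the Node.js `ChildProcess` `'exit'` event, so a handler like this could be attached with `child.on("exit", handler)`.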

Related Issue

Reproduction: input long text → restart Gateway → send a query → the request times out with no response.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Always log subprocess exit: Removed the code !== 0 guard in the exit handler so that exits with code=0 are also logged, preventing silent process disappearance.
  • Defensive re-spawn: When isSpawner=false in local mode, check whether a valid process actually exists (via cache + health check); if not, trigger a fresh spawn using the same env/config as the primary spawn path.
  • ov_archive_expand diagnostics: Added structured logging for tool invocations (archiveId, sessionId), successful expansions (message count, char count), and failures (error details).
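
The defensive re-spawn decision can be sketched roughly as follows. This is a simplified illustration, not the merged code: `LocalProcessDeps`, `ensureLocalProcess`, and every member name are hypothetical, and the real plugin's cache, health check, and spawn call are injected here as plain functions so the logic can be read in isolation.

```typescript
// All names below are hypothetical stand-ins for the plugin's internals.
interface LocalProcessDeps {
  getCachedProcess: () => { pid: number } | undefined; // process cache lookup
  healthCheck: () => Promise<boolean>; // true if the local server answers
  spawn: () => Promise<{ pid: number }>; // assumed to reuse the primary path's env/config
  log: (msg: string) => void;
}

// Intended for the non-spawner path (isSpawner=false) in local mode.
async function ensureLocalProcess(
  deps: LocalProcessDeps,
): Promise<"reused" | "respawned"> {
  const cached = deps.getCachedProcess();
  let healthy = false;
  if (cached !== undefined) {
    try {
      healthy = await deps.healthCheck();
    } catch (err) {
      // Report the failure instead of swallowing it silently.
      deps.log(`health check failed: ${String(err)}`);
    }
  }
  if (healthy) {
    return "reused";
  }
  // No cached process, or the cached one does not respond: recover.
  deps.log("no valid process found, triggering a fresh spawn");
  await deps.spawn();
  return "respawned";
}
```

Passing the dependencies in as functions is only a convenience for the sketch; it makes the three cases (healthy, missing, unresponsive) easy to exercise without starting a real subprocess.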

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

Root cause: The plugin is registered multiple times per Gateway startup (gateway subsystem + per-session). Only the first start() call to find a pending entry in localClientPendingPromises becomes the spawner. After a force-restart, race conditions in module loading can cause all start() calls to miss the pending entry, leaving isSpawner=false for every call. The original else branch silently swallowed health-check failures, so the system entered a permanently broken state. The defensive re-spawn acts as a self-healing fallback that detects "no valid process" and recovers automatically.
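
The failure mode described above can be illustrated with a toy model of the spawner election. This is an assumed reconstruction for explanation only: the real structure of localClientPendingPromises is not shown in this PR, and `PendingEntry`, `claimSpawner`, and the "claim by deleting the entry" rule are hypothetical.

```typescript
// Hypothetical entry type; the real map's value type is not shown in this PR.
type PendingEntry = { createdAt: number };

// Assumed election rule: the first caller that finds a pending entry
// claims it and becomes the spawner; later callers find nothing.
function claimSpawner(
  pending: Map<string, PendingEntry>,
  key: string,
): boolean {
  if (!pending.has(key)) {
    return false; // isSpawner=false
  }
  pending.delete(key);
  return true; // isSpawner=true
}

// Normal startup: one pending entry, several start() calls.
const normal = new Map<string, PendingEntry>([["local", { createdAt: 0 }]]);
const normalRoles = [
  claimSpawner(normal, "local"),
  claimSpawner(normal, "local"),
  claimSpawner(normal, "local"),
];

// After a force-restart, assume every start() call sees a map without the entry.
const afterRestart = new Map<string, PendingEntry>();
const restartRoles = [
  claimSpawner(afterRestart, "local"),
  claimSpawner(afterRestart, "local"),
];
```

In this model `normalRoles` contains exactly one `true`, while `restartRoles` contains none, which is the state where no call spawns the subprocess and a fallback on the non-spawner path is needed.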

CLAassistant commented Mar 28, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions
Failed to generate code suggestions for PR
