fix(gateway): fail fast when gateway detects systemd conflict on startup #829

Open
octo-patch wants to merge 1 commit into ValueCell-ai:main from octo-patch:fix/issue-695-systemd-conflict-fail-fast

@octo-patch
Contributor

Fixes #695

Problem

When the openclaw gateway binary detects that it is already managed by systemd, it emits "already running under systemd; waiting 5000ms before retrying startup" to stderr and retries internally. ClawX classified the resulting process exit as a transient startup error, so the startup sequence was retried up to 3 times in the startup orchestrator and then up to 10 more times via the reconnect manager, producing a long, frustrating retry storm with no recovery path.

Solution

  • startup-recovery.ts: Added SYSTEMD_CONFLICT_PATTERNS, isSystemdConflictSignal, and hasSystemdConflictSignal. getGatewayStartupRecoveryAction now returns 'fail' immediately when any startup stderr line matches the systemd conflict pattern, short-circuiting the retry loop.
  • startup-stderr.ts: Classified "already running under systemd" messages as debug level so that the brief window before the process exits does not flood the log with repeated warnings.
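
For reviewers who want the shape of the change without opening the diff, the recovery logic described above might look roughly like the sketch below. Only the three exported names and the 'fail'/'retry' return values come from this description. The `RecoveryContext` shape, its field names, the single regex in the pattern list, and the attempt-counting rule are assumptions made for illustration, and the real function may return other actions as well.

```typescript
// Illustrative sketch only: identifiers match the PR description, but the
// bodies, types, and RecoveryContext fields are assumptions.
type RecoveryAction = "retry" | "fail";

// Assumed pattern list. The tests mention case-insensitive matching, hence /i.
const SYSTEMD_CONFLICT_PATTERNS: RegExp[] = [/already running under systemd/i];

// True when a single stderr line matches a known systemd conflict pattern.
function isSystemdConflictSignal(line: string): boolean {
  return SYSTEMD_CONFLICT_PATTERNS.some((pattern) => pattern.test(line));
}

// True when any line collected during startup carries the conflict signal.
function hasSystemdConflictSignal(stderrLines: string[]): boolean {
  return stderrLines.some((line) => isSystemdConflictSignal(line));
}

// Hypothetical input shape; the real signature is not shown in this PR text.
interface RecoveryContext {
  stderrLines: string[];
  isTransientError: boolean;
  attempt: number;
  maxAttempts: number;
}

function getGatewayStartupRecoveryAction(ctx: RecoveryContext): RecoveryAction {
  // A systemd conflict cannot be resolved by retrying, so it is checked first
  // and wins even when the error would otherwise count as transient.
  if (hasSystemdConflictSignal(ctx.stderrLines)) {
    return "fail";
  }
  // Assumed rule: transient errors are retried until attempts run out.
  if (ctx.isTransientError && ctx.attempt < ctx.maxAttempts) {
    return "retry";
  }
  return "fail";
}
```

Checking the conflict signal before the transient-error branch is what short-circuits the retry loop: the ordering, not the pattern itself, carries the fix.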

Testing

Added 6 new unit tests covering:

  • isSystemdConflictSignal positive/negative cases (including case-insensitive matching)
  • hasSystemdConflictSignal with mixed stderr lines
  • getGatewayStartupRecoveryAction returns 'fail' on systemd conflict even when the error would otherwise qualify as transient
  • Regression: getGatewayStartupRecoveryAction still returns 'retry' for transient errors without the systemd signal

All 11 tests in gateway-startup-recovery.test.ts pass.

When the openclaw gateway binary is already managed by systemd it emits
"already running under systemd; waiting 5000ms before retrying startup"
to stderr and loops internally.  ClawX previously classified the
resulting process exit as a transient error and retried the startup
sequence (up to 3 attempts in the orchestrator, then up to 10 via the
reconnect manager), producing a long visible retry storm with no
recovery path.

Detect this signal in startup-recovery and return 'fail' immediately so
the manager surfaces a clear error state on the first attempt.  Also
downgrade the log level for the message to 'debug' in startup-stderr so
the short window before the process exits does not flood the log.
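
The log-level downgrade could be sketched as follows. The function name `classifyGatewayStderrLine`, the `StderrLogLevel` type, and the "warn" default are hypothetical; this description only states that the systemd message is classified as debug.

```typescript
// Illustrative sketch only: the function name, type name, and the "warn"
// default are assumptions, not the actual startup-stderr.ts API.
type StderrLogLevel = "debug" | "warn";

function classifyGatewayStderrLine(line: string): StderrLogLevel {
  // The conflict message repeats while the gateway waits and retries
  // internally, so it is logged at debug level instead of as a warning.
  if (/already running under systemd/i.test(line)) {
    return "debug";
  }
  // Assumed default for every other stderr line.
  return "warn";
}
```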

Fixes ValueCell-ai#695

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway already running under systemd