
fix(monitor): break errcode -14 death loop, unblock REST outbound #161

Open
DraixAgent wants to merge 2 commits into Tencent:main from DraixAgent:feature/fix-155-errcode-14-death-loop

@DraixAgent

Fixes #155

Problem

When iLink returns errcode: -14 (session expired) from getUpdates, the plugin entered a 60-minute unrecoverable death loop:

  1. monitor.ts called pauseSession(accountId) → 60-minute global lock.
  2. session-guard.ts's assertSessionActive blocked all outbound traffic during the pause — including sendMessage / sendTyping / sendMedia REST calls that are independent of the long-poll session.
  3. After 60 minutes the loop retried getUpdates without first calling notifyStart to rebuild the server session, so it immediately got -14 again → another 60-minute pause → loop.

The only escape was the OpenClaw core channelHealthMonitor (5-min check, 30-min stale-socket threshold), so inbound messages were lost for up to 30 minutes between recoveries, and any cron/message-tool outbound send during a pause silently failed.

Root cause

The pause window in session-guard.ts was meant only to back off the long-poll loop, but two problems undermined that intent:

  • Scope creep: the same pause map was queried from outbound REST send paths, so a long-poll cooldown ended up gating REST traffic that has its own independent session on the server.
  • No active recovery: the monitor never attempted notifyStart between waking up from the pause and re-issuing getUpdates, so the server session was still stale on retry.

Fix

Two commits, one per root cause:

1. fix(monitor): call notifyStart on errcode -14 before pausing session

The -14 branch in monitor.ts now:

  1. Sets a short backoff window (initial 5s, not the 60-min wall) on the pause map so concurrent inbound paths know we're recovering.
  2. Calls notifyStart (with a 10s timeout) to rebuild the server-side session.
  3. On success: clearSessionPause, log recovery, and retry getUpdates after 5s.
  4. On failure: exponential backoff (×2 each round, capped at 5 min) and retry — never the original 60-minute wall.
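
The four steps above can be sketched as a single function that handles one -14 response. This is a minimal illustration, not the code in monitor.ts: the RecoveryDeps interface, the RecoveryOutcome shape, the constant names, and the choice to double the window before the first failed retry are all assumptions made for the example. Only the numbers (5 s, 10 s, ×2, 5 min) come from the description.

```typescript
// Hypothetical dependency surface; the real plugin wires these differently.
interface RecoveryDeps {
  pauseSession(accountId: string, durationMs: number): void;
  clearSessionPause(accountId: string): void;
  // Assumed to reject on failure or when the timeout elapses.
  notifyStart(accountId: string, timeoutMs: number): Promise<void>;
}

interface RecoveryOutcome {
  recovered: boolean;
  retryInMs: number;     // how long to wait before the next getUpdates
  nextBackoffMs: number; // backoff window to pass in on the next -14
}

const INITIAL_BACKOFF_MS = 5000;          // initial 5 s window
const RETRY_AFTER_SUCCESS_MS = 5000;      // retry getUpdates after 5 s
const NOTIFY_START_TIMEOUT_MS = 10000;    // 10 s notifyStart timeout
const MAX_BACKOFF_MS = 5 * 60 * 1000;     // cap at 5 min

async function handleSessionExpired(
  accountId: string,
  backoffMs: number,
  deps: RecoveryDeps,
): Promise<RecoveryOutcome> {
  // Step 1: short pause so concurrent inbound paths can see we are recovering.
  deps.pauseSession(accountId, backoffMs);
  try {
    // Step 2: rebuild the server-side session.
    await deps.notifyStart(accountId, NOTIFY_START_TIMEOUT_MS);
  } catch {
    // Step 4: double the window (capped) and keep the pause for that long.
    const next = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
    deps.pauseSession(accountId, next);
    return { recovered: false, retryInMs: next, nextBackoffMs: next };
  }
  // Step 3: recovery worked, so drop the pause and retry shortly.
  deps.clearSessionPause(accountId);
  return {
    recovered: true,
    retryInMs: RETRY_AFTER_SUCCESS_MS,
    nextBackoffMs: INITIAL_BACKOFF_MS,
  };
}
```

Passing the dependencies in keeps the sketch testable without timers; the caller decides how to actually wait for retryInMs.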

A successful poll on the happy path also resets the backoff counter, so a transient -14 doesn't penalize future recoveries.
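
A minimal sketch of the backoff counter and its reset, assuming a per-account tracker object. The RecoveryBackoff class and its method names are hypothetical and do not appear in the PR; only the 5 s starting point, the doubling, and the 5-minute cap are taken from the description.

```typescript
const BACKOFF_START_MS = 5000;          // initial 5 s
const BACKOFF_CAP_MS = 5 * 60 * 1000;   // capped at 5 min

/** Illustrative per-account backoff tracker (not the plugin's actual type). */
class RecoveryBackoff {
  private failures = 0;

  /** Current window: 5 s doubled once per recorded failure, never above the cap. */
  currentDelayMs(): number {
    return Math.min(BACKOFF_START_MS * 2 ** this.failures, BACKOFF_CAP_MS);
  }

  /** Call when notifyStart fails. */
  recordFailure(): void {
    this.failures += 1;
  }

  /** Call after any successful poll so a past -14 does not inflate later windows. */
  reset(): void {
    this.failures = 0;
  }
}
```

With these constants the window grows 5 s, 10 s, 20 s, 40 s, 80 s, 160 s and then stays at 300 s until reset() is called.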

session-guard.ts gets two small affordances to make this work cleanly without leaking state:

  • pauseSession(accountId, durationMs?) — optional custom duration for backoff.
  • clearSessionPause(accountId) — explicit clear on successful recovery.
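
One way the two affordances could sit on top of a simple pause map is sketched below. The function names pauseSession, clearSessionPause, isSessionPaused, and getRemainingPauseMs are the ones mentioned in this PR; the createSessionGuard factory, the injectable clock, and the Map-based storage are assumptions added so the example is self-contained and testable.

```typescript
const DEFAULT_PAUSE_MS = 60 * 60 * 1000; // the original 60-minute window

// Hypothetical factory; the clock is injectable so tests need no real timers.
function createSessionGuard(now: () => number = Date.now) {
  // accountId -> timestamp (ms) at which the pause ends
  const pausedUntil = new Map<string, number>();

  function getRemainingPauseMs(accountId: string): number {
    const until = pausedUntil.get(accountId);
    if (until === undefined) return 0;
    const remaining = until - now();
    if (remaining <= 0) {
      pausedUntil.delete(accountId); // expired entries are dropped lazily
      return 0;
    }
    return remaining;
  }

  return {
    // Optional durationMs lets the monitor request a short backoff window.
    pauseSession(accountId: string, durationMs: number = DEFAULT_PAUSE_MS): void {
      pausedUntil.set(accountId, now() + durationMs);
    },
    // Explicit clear after a successful recovery; a no-op if nothing is set.
    clearSessionPause(accountId: string): void {
      pausedUntil.delete(accountId);
    },
    isSessionPaused(accountId: string): boolean {
      return getRemainingPauseMs(accountId) > 0;
    },
    getRemainingPauseMs,
  };
}
```

Keeping the state behind these functions is what lets the monitor express the new lifecycle without reaching into the map directly.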

2. fix(session-guard): allow REST outbound during long-poll pause

The assertSessionActive calls in channel.ts's outbound paths (sendWeixinOutbound, sendMedia) are removed. The session-pause state is now strictly scoped to the long-poll loop. If a REST call hits an expired session on the server, the API layer surfaces the server's own error — no client-side gate needed, and crucially no client-side wall blocking unrelated traffic.

assertSessionActive is kept exported (and tested) so any future inbound-side caller that legitimately wants the gate can still use it; the docstring on session-guard.ts now spells out the rule.
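
The resulting outbound contract can be illustrated roughly as follows. Everything here is a stand-in: the RestPost type, the sendTextOutbound function, the "hypothetical/send" path, and the assumption that errcode 0 means success are invented for the example and are not the plugin's or iLink's actual API.

```typescript
// Hypothetical REST transport: resolves to the server's response envelope.
type RestPost = (
  path: string,
  body: unknown,
) => Promise<{ errcode: number; errmsg?: string }>;

async function sendTextOutbound(
  post: RestPost,
  to: string,
  text: string,
): Promise<void> {
  // Deliberately no assertSessionActive(...) here: the long-poll pause
  // state is neither read nor written on the outbound path.
  const res = await post("hypothetical/send", { to, text });
  if (res.errcode !== 0) {
    // Surface the server's own error instead of a client-side gate.
    throw new Error("send failed: errcode " + res.errcode + " " + (res.errmsg ?? ""));
  }
}
```

If the server-side session really is expired, the caller sees the server's error code directly, which is the behaviour the second commit relies on.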

Tests

Six new tests in src/monitor/monitor.test.ts covering the recovery contract end-to-end:

  • -14 triggers notifyStart before any long pause.
  • A successful notifyStart clears the pause and getUpdates resumes within seconds.
  • notifyStart failure falls back to a <60-min pause (regression guard).
  • On persistent -14, a second notifyStart attempt happens well inside the 60-min window (no death loop).
  • Normal traffic: notifyStart is never called, pause map never touched (regression guard).
  • -14 after a prior successful recovery → recovers again (no half-open state).

Three new tests in src/messaging/outbound-bypass.test.ts pinning the new contract for outbound REST:

  • sendMessageWeixin succeeds while the session is paused.
  • sendMessageWeixin does not consult or mutate the pause map.
  • Underlying REST errors propagate without touching pause state.

Five new tests in src/api/session-guard.test.ts for the API additions:

  • pauseSession honors a custom durationMs.
  • clearSessionPause removes an active pause.
  • clearSessionPause is a no-op when no pause is set.
  • clearSessionPause is per-account.
  • (Existing assertSessionActive / isSessionPaused / re-pause coverage is unchanged.)

All existing tests still pass. One pre-existing unrelated failure (src/auth/pairing.test.ts > "uses withFileLock for concurrency safety") exists on main and was not touched.

How to verify locally

npm install
npx vitest run src/monitor/monitor.test.ts \
              src/api/session-guard.test.ts \
              src/messaging/outbound-bypass.test.ts
npm run typecheck

You can also reproduce the original bug against the pre-fix code by mocking getUpdates to return { errcode: -14 } once and observing that notifyStart is never called and that getRemainingPauseMs reports a pause of roughly 60 * 60 * 1000 ms (the full hour).

Files touched

  • src/monitor/monitor.ts: -14 recovery branch + backoff.
  • src/api/session-guard.ts: pauseSession duration arg, new clearSessionPause, updated docs.
  • src/channel.ts: remove assertSessionActive from outbound paths, document why.
  • src/monitor/monitor.test.ts: new, 6 tests.
  • src/messaging/outbound-bypass.test.ts: new, 3 tests.
  • src/api/session-guard.test.ts: extended, 5 new tests.

Non-obvious decisions

  • Why not delete pauseSession entirely? It's still useful as the inbound-side bookkeeping signal during the brief notifyStart attempt, and as the carrier for the exponential-backoff window when recovery itself fails. Removing it would either require a separate state machine or push the wait into ad-hoc sleeps in the monitor.
  • Why clearSessionPause instead of just letting the pause expire naturally? A successful notifyStart completes within milliseconds; without an explicit clear, the next getUpdates would either wait out the (now-irrelevant) backoff window or, if we shortened the window aggressively, race against the retry.
  • Why is removing assertSessionActive from outbound safe even if the lock is shared? The pause map was never a mutex; it was a one-bit "should the long-poll back off?" flag. Outbound REST traffic was reading that flag and treating it as authoritative for an unrelated transport, which was the bug. Removing the reads is the actual fix — there is no shared resource being protected.

…encent#155)

When iLink returns errcode -14 (session expired) from getUpdates, the
monitor used to pause the account for a full hour and then retry
getUpdates without rebuilding the server-side session, immediately
hitting -14 again — a 60-minute death loop with up to 30 minutes of
inbound traffic lost on each iteration.

This change makes the -14 branch:

  1. set a short backoff window (not the 60-minute wall) on the
     long-poll pause map,
  2. call `notifyStart` to rebuild the server session,
  3. clear the pause and retry getUpdates within seconds on success,
  4. fall back to exponential backoff (capped at 5 min) when
     notifyStart itself fails, so the loop keeps trying instead of
     dying for an hour.

Also extends `pauseSession` with an optional `durationMs` and adds
`clearSessionPause` so the monitor can express the new lifecycle
without touching internal state.

Refs Tencent#155
…nt#155)

The session-pause state was originally consulted from outbound send
paths (`sendMessage`, `sendMedia`, `sendTyping`), which meant a
60-minute long-poll cooldown also blocked REST traffic that is
independent of the long-poll session. Combined with the -14 death
loop fixed in the previous commit, this caused outbound messages
(cron/message-tool sends) to silently fail for up to 30 minutes.

The pause is intentionally scoped to the long-poll loop only:

  * remove the `assertSessionActive` calls from the channel.ts
    outbound paths,
  * document the new contract on `session-guard.ts` (and at the
    removal sites in channel.ts),
  * keep `assertSessionActive` exported and tested for any future
    inbound-side caller that legitimately wants the gate.

If a REST call hits an expired session, the server still returns its
own error which the API layer surfaces — no client-side gate needed.

Refs Tencent#155

Development

Successfully merging this pull request may close these issues.

[Bug] errcode -14 session expiry enters 60-min death loop — outbound messages blocked, no auto-recovery
