fix(monitor): break errcode -14 death loop, unblock REST outbound by DraixAgent · Pull Request #161 · Tencent/openclaw-weixin

DraixAgent · 2026-05-16T00:37:26Z

Fixes #155

Problem

When iLink returns errcode: -14 (session expired) from getUpdates, the plugin entered a 60-minute unrecoverable death loop:

monitor.ts called pauseSession(accountId) → 60-minute global lock.
session-guard.ts's assertSessionActive blocked all outbound traffic during the pause — including sendMessage / sendTyping / sendMedia REST calls that are independent of the long-poll session.
After 60 minutes the loop retried getUpdates without first calling notifyStart to rebuild the server session, so it immediately got -14 again → another 60-minute pause → loop.

The only escape was the OpenClaw core channelHealthMonitor (5-min check, 30-min stale-socket threshold), so inbound messages were lost for up to 30 minutes between recoveries, and any cron/message-tool outbound send during a pause silently failed.

Root cause

The pause window in session-guard.ts was meant only to back off the long-poll loop, but two problems made it far more damaging than intended:

Scope creep: the same pause map was queried from outbound REST send paths, so a long-poll cooldown ended up gating REST traffic that has its own independent session on the server.
No active recovery: the monitor never attempted notifyStart between waking up from the pause and re-issuing getUpdates, so the server session was still stale on retry.

Fix

Two commits, each addressing one of the two responsibilities:

1. `fix(monitor): call notifyStart on errcode -14 before pausing session`

The -14 branch in monitor.ts now:

Sets a short backoff window (initial 5s, not the 60-min wall) on the pause map so concurrent inbound paths know we're recovering.
Calls notifyStart (with a 10s timeout) to rebuild the server-side session.
On success: clearSessionPause, log recovery, and retry getUpdates after 5s.
On failure: exponential backoff (×2 each round, capped at 5 min) and retry — never the original 60-minute wall.

A successful poll on the happy path also resets the backoff counter, so a transient -14 doesn't penalize future recoveries.
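The backoff behavior described above can be sketched as a small pure state transition. This is a minimal illustration, not the code in monitor.ts: the names `MonitorEvent`, `BackoffState`, `Decision`, `backoffMs`, and `decide` are hypothetical, and it assumes the first failed recovery waits 5 s and each further failure doubles the wait up to the 5-minute cap.

```typescript
// Hypothetical constants mirroring the numbers quoted in this description.
const INITIAL_BACKOFF_MS = 5_000;        // first retry window: 5 s
const MAX_BACKOFF_MS = 5 * 60 * 1_000;   // cap: 5 min

// Events the monitor loop reacts to (illustrative names, not from monitor.ts).
type MonitorEvent = "notifyStartOk" | "notifyStartFailed" | "pollOk";

interface BackoffState {
  consecutiveFailures: number; // failed notifyStart attempts since the last success
}

interface Decision {
  state: BackoffState;
  waitMs: number;      // how long to wait before the next attempt
  clearPause: boolean; // whether the caller should call clearSessionPause
}

// Window for the Nth consecutive failure (1-based): 5 s, 10 s, 20 s, ... capped at 5 min.
function backoffMs(consecutiveFailures: number): number {
  const exponent = Math.max(0, consecutiveFailures - 1);
  return Math.min(INITIAL_BACKOFF_MS * 2 ** exponent, MAX_BACKOFF_MS);
}

function decide(state: BackoffState, event: MonitorEvent): Decision {
  if (event === "notifyStartFailed") {
    // Recovery failed: grow the window, keep the pause in place.
    const consecutiveFailures = state.consecutiveFailures + 1;
    return {
      state: { consecutiveFailures },
      waitMs: backoffMs(consecutiveFailures),
      clearPause: false,
    };
  }
  if (event === "notifyStartOk") {
    // Recovery succeeded: clear the pause and retry getUpdates after 5 s.
    return { state: { consecutiveFailures: 0 }, waitMs: INITIAL_BACKOFF_MS, clearPause: true };
  }
  // Happy-path poll: reset the counter and leave the pause map untouched.
  return { state: { consecutiveFailures: 0 }, waitMs: 0, clearPause: false };
}
```

Under these assumptions, repeated failures wait 5 s, 10 s, 20 s, 40 s, 80 s, 160 s, and then 300 s for every later round, so the loop never approaches the old 60-minute wall.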

session-guard.ts gets two small affordances to make this work cleanly without leaking state:

pauseSession(accountId, durationMs?) — optional custom duration for backoff.
clearSessionPause(accountId) — explicit clear on successful recovery.
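One way to picture these two additions is a per-account map of expiry timestamps. The sketch below is an assumption about the shape, not the contents of session-guard.ts: the `SessionPauseMap` class, the injected `now` clock, and the 60-minute default are all illustrative, and the real module may expose plain functions instead.

```typescript
// Assumed default: the original 60-minute window when no duration is passed.
const DEFAULT_PAUSE_MS = 60 * 60 * 1_000;

class SessionPauseMap {
  // accountId -> epoch milliseconds at which the pause ends
  private readonly pausedUntil = new Map<string, number>();
  private readonly now: () => number;

  // The clock is injectable so tests can advance time without real waits.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Optional durationMs lets the monitor set short backoff windows.
  pauseSession(accountId: string, durationMs: number = DEFAULT_PAUSE_MS): void {
    this.pausedUntil.set(accountId, this.now() + durationMs);
  }

  // Explicit clear on successful recovery; a no-op if nothing is paused.
  clearSessionPause(accountId: string): void {
    this.pausedUntil.delete(accountId);
  }

  getRemainingPauseMs(accountId: string): number {
    const until = this.pausedUntil.get(accountId);
    if (until === undefined) return 0;
    const remaining = until - this.now();
    if (remaining <= 0) {
      this.pausedUntil.delete(accountId); // expired entries are dropped lazily
      return 0;
    }
    return remaining;
  }

  isSessionPaused(accountId: string): boolean {
    return this.getRemainingPauseMs(accountId) > 0;
  }
}
```

Keeping the state per account means clearing one account's pause cannot affect another, which is the property the per-account test in this PR pins down.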

2. `fix(session-guard): allow REST outbound during long-poll pause`

The assertSessionActive calls in channel.ts's outbound paths (sendWeixinOutbound, sendMedia) are removed. The session-pause state is now strictly scoped to the long-poll loop. If a REST call hits an expired session on the server, the API layer surfaces the server's own error — no client-side gate needed, and crucially no client-side wall blocking unrelated traffic.

assertSessionActive is kept exported (and tested) so any future inbound-side caller that legitimately wants the gate can still use it; the docstring on session-guard.ts now spells out the rule.
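The new scoping rule can be illustrated with a small sketch in which only the polling loop reads the pause flag and the outbound path relies on the server's reply alone. Everything here is hypothetical: `ApiResponse`, `RestSend`, `longPollPaused`, `shouldPoll`, and `sendOutbound` are illustrative names, errcode 0 is assumed to mean success, and the transport is synchronous only to keep the example short (the real calls are presumably asynchronous).

```typescript
// Shape of a server reply; errcode 0 is assumed to mean success.
interface ApiResponse {
  errcode: number;
  errmsg?: string;
}

// Hypothetical transport signature, synchronous here only for brevity.
type RestSend = (accountId: string, text: string) => ApiResponse;

// Long-poll cooldown flag. Only the polling loop is meant to read it.
const longPollPaused = new Set<string>();

// Inbound side: the polling loop backs off while the account is paused.
function shouldPoll(accountId: string): boolean {
  return !longPollPaused.has(accountId);
}

// Outbound side: no pause lookup; the server's own errcode decides the outcome.
function sendOutbound(send: RestSend, accountId: string, text: string): void {
  const res = send(accountId, text);
  if (res.errcode !== 0) {
    // Surface the server error as-is; pause state is neither read nor changed.
    throw new Error(`send failed: errcode ${res.errcode} ${res.errmsg ?? ""}`.trim());
  }
}
```

With this split, a paused account still attempts its sends, and a genuinely expired session shows up as the server's own error rather than as a silent client-side block.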

Tests

Six new tests in src/monitor/monitor.test.ts covering the recovery contract end-to-end:

-14 triggers notifyStart before any long pause.
A successful notifyStart clears the pause and getUpdates resumes within seconds.
notifyStart failure falls back to a <60-min pause (regression guard).
On persistent -14, a second notifyStart attempt happens well inside the 60-min window (no death loop).
Normal traffic: notifyStart is never called, pause map never touched (regression guard).
-14 after a prior successful recovery → recovers again (no half-open state).

Three new tests in src/messaging/outbound-bypass.test.ts pinning the new contract for outbound REST:

sendMessageWeixin succeeds while the session is paused.
sendMessageWeixin does not consult or mutate the pause map.
Underlying REST errors propagate without touching pause state.

Five new tests in src/api/session-guard.test.ts for the API additions:

pauseSession honors a custom durationMs.
clearSessionPause removes an active pause.
clearSessionPause is a no-op when no pause is set.
clearSessionPause is per-account.
(Existing assertSessionActive / isSessionPaused / re-pause coverage is unchanged.)

All existing tests still pass. One pre-existing unrelated failure (src/auth/pairing.test.ts > "uses withFileLock for concurrency safety") exists on main and was not touched.

How to verify locally

npm install
npx vitest run src/monitor/monitor.test.ts \
              src/api/session-guard.test.ts \
              src/messaging/outbound-bypass.test.ts
npm run typecheck

You can also reproduce the original bug against the pre-fix code by mocking getUpdates to return { errcode: -14 } once and observing that notifyStart is never called and that getRemainingPauseMs starts at roughly 60 * 60 * 1000 ms, counting down over the full hour with no recovery attempt in between.

Files touched

src/monitor/monitor.ts — -14 recovery branch + backoff.
src/api/session-guard.ts — pauseSession duration arg, new clearSessionPause, updated docs.
src/channel.ts — remove assertSessionActive from outbound paths, document why.
src/monitor/monitor.test.ts — new, 6 tests.
src/messaging/outbound-bypass.test.ts — new, 3 tests.
src/api/session-guard.test.ts — extended, 5 new tests.

Non-obvious decisions

Why not delete pauseSession entirely? It's still useful as the inbound-side bookkeeping signal during the brief notifyStart attempt, and as the carrier for the exponential-backoff window when recovery itself fails. Removing it would either require a separate state machine or push the wait into ad-hoc sleeps in the monitor.
Why clearSessionPause instead of just letting the pause expire naturally? A successful notifyStart completes within milliseconds; without an explicit clear, the next getUpdates would either wait out the (now-irrelevant) backoff window or, if we shortened the window aggressively, race against the retry.
Why is removing assertSessionActive from outbound safe even if the lock is shared? The pause map was never a mutex; it was a one-bit "should the long-poll back off?" flag. Outbound REST traffic was reading that flag and treating it as authoritative for an unrelated transport, which was the bug. Removing the reads is the actual fix — there is no shared resource being protected.

…encent#155)

When iLink returns errcode -14 (session expired) from getUpdates, the monitor used to pause the account for a full hour and then retry getUpdates without rebuilding the server-side session, immediately hitting -14 again: a 60-minute death loop with up to 30 minutes of inbound traffic lost on each iteration.

This change makes the -14 branch:

1. set a short backoff window (not the 60-minute wall) on the long-poll pause map,
2. call `notifyStart` to rebuild the server session,
3. clear the pause and retry getUpdates within seconds on success,
4. fall back to exponential backoff (capped at 5 min) when notifyStart itself fails, so the loop keeps trying instead of dying for an hour.

Also extends `pauseSession` with an optional `durationMs` and adds `clearSessionPause` so the monitor can express the new lifecycle without touching internal state.

Refs Tencent#155

…nt#155)

The session-pause state was originally consulted from outbound send paths (`sendMessage`, `sendMedia`, `sendTyping`), which meant a 60-minute long-poll cooldown also blocked REST traffic that is independent of the long-poll session. Combined with the -14 death loop fixed in the previous commit, this caused outbound messages (cron/message-tool sends) to silently fail for up to 30 minutes.

The pause is intentionally scoped to the long-poll loop only:

* remove the `assertSessionActive` calls from the channel.ts outbound paths,
* document the new contract on `session-guard.ts` (and at the removal sites in channel.ts),
* keep `assertSessionActive` exported and tested for any future inbound-side caller that legitimately wants the gate.

If a REST call hits an expired session, the server still returns its own error, which the API layer surfaces; no client-side gate is needed.

Refs Tencent#155

DraixAgent added 2 commits May 15, 2026 21:36

DraixAgent wants to merge 2 commits into Tencent:main from DraixAgent:feature/fix-155-errcode-14-death-loop
