fix(monitor): break errcode -14 death loop, unblock REST outbound#161
Open
DraixAgent wants to merge 2 commits into
Open
fix(monitor): break errcode -14 death loop, unblock REST outbound#161DraixAgent wants to merge 2 commits into
DraixAgent wants to merge 2 commits into
Conversation
…encent#155) When iLink returns errcode -14 (session expired) from getUpdates, the monitor used to pause the account for a full hour and then retry getUpdates without rebuilding the server-side session, immediately hitting -14 again — a 60-minute death loop with up to 30 minutes of inbound traffic lost on each iteration. This change makes the -14 branch: 1. set a short backoff window (not the 60-minute wall) on the long-poll pause map, 2. call `notifyStart` to rebuild the server session, 3. clear the pause and retry getUpdates within seconds on success, 4. fall back to exponential backoff (capped at 5 min) when notifyStart itself fails, so the loop keeps trying instead of dying for an hour. Also extends `pauseSession` with an optional `durationMs` and adds `clearSessionPause` so the monitor can express the new lifecycle without touching internal state. Refs Tencent#155
…nt#155) The session-pause state was originally consulted from outbound send paths (`sendMessage`, `sendMedia`, `sendTyping`), which meant a 60-minute long-poll cooldown also blocked REST traffic that is independent of the long-poll session. Combined with the -14 death loop fixed in the previous commit, this caused outbound messages (cron/message-tool sends) to silently fail for up to 30 minutes. The pause is intentionally scoped to the long-poll loop only: * remove the `assertSessionActive` calls from the channel.ts outbound paths, * document the new contract on `session-guard.ts` (and at the removal sites in channel.ts), * keep `assertSessionActive` exported and tested for any future inbound-side caller that legitimately wants the gate. If a REST call hits an expired session, the server still returns its own error which the API layer surfaces — no client-side gate needed. Refs Tencent#155
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #155
Problem
When iLink returns
errcode: -14(session expired) fromgetUpdates, the plugin entered a 60-minute unrecoverable death loop:monitor.tscalledpauseSession(accountId)→ 60-minute global lock.session-guard.ts'sassertSessionActiveblocked all outbound traffic during the pause — includingsendMessage/sendTyping/sendMediaREST calls that are independent of the long-poll session.getUpdateswithout first callingnotifyStartto rebuild the server session, so it immediately got-14again → another 60-minute pause → loop.The only escape was the OpenClaw core
channelHealthMonitor(5-min check, 30-min stale-socket threshold), so inbound messages were lost for up to 30 minutes between recoveries, and any cron/message-tool outbound send during a pause silently failed.Root cause
The pause window in
session-guard.tswas meant to back off the long-poll loop, but two things conflated it with the wrong scope:notifyStartbetween waking up from the pause and re-issuinggetUpdates, so the server session was still stale on retry.Fix
Two commits, each addressing one of the two responsibilities:
1.
fix(monitor): call notifyStart on errcode -14 before pausing sessionThe
-14branch inmonitor.tsnow:notifyStart(with a 10s timeout) to rebuild the server-side session.clearSessionPause, log recovery, and retrygetUpdatesafter 5s.A successful poll on the happy path also resets the backoff counter, so a transient
-14doesn't penalize future recoveries.session-guard.tsgets two small affordances to make this work cleanly without leaking state:pauseSession(accountId, durationMs?)— optional custom duration for backoff.clearSessionPause(accountId)— explicit clear on successful recovery.2.
fix(session-guard): allow REST outbound during long-poll pauseThe
assertSessionActivecalls inchannel.ts's outbound paths (sendWeixinOutbound,sendMedia) are removed. The session-pause state is now strictly scoped to the long-poll loop. If a REST call hits an expired session on the server, the API layer surfaces the server's own error — no client-side gate needed, and crucially no client-side wall blocking unrelated traffic.assertSessionActiveis kept exported (and tested) so any future inbound-side caller that legitimately wants the gate can still use it; the docstring onsession-guard.tsnow spells out the rule.Tests
Six new tests in
src/monitor/monitor.test.tscovering the recovery contract end-to-end:-14triggersnotifyStartbefore any long pause.notifyStartclears the pause andgetUpdatesresumes within seconds.notifyStartfailure falls back to a <60-min pause (regression guard).-14, a secondnotifyStartattempt happens well inside the 60-min window (no death loop).notifyStartis never called, pause map never touched (regression guard).-14after a prior successful recovery → recovers again (no half-open state).Three new tests in
src/messaging/outbound-bypass.test.tspinning the new contract for outbound REST:sendMessageWeixinsucceeds while the session is paused.sendMessageWeixindoes not consult or mutate the pause map.Five new tests in
src/api/session-guard.test.tsfor the API additions:pauseSessionhonors a customdurationMs.clearSessionPauseremoves an active pause.clearSessionPauseis a no-op when no pause is set.clearSessionPauseis per-account.assertSessionActive/isSessionPaused/ re-pause coverage is unchanged.)All existing tests still pass. One pre-existing unrelated failure (
src/auth/pairing.test.ts > "uses withFileLock for concurrency safety") exists onmainand was not touched.How to verify locally
npm install npx vitest run src/monitor/monitor.test.ts \ src/api/session-guard.test.ts \ src/messaging/outbound-bypass.test.ts npm run typecheckYou can also reproduce the original bug against the pre-fix code by mocking
getUpdatesto return{ errcode: -14 }once and observing thatnotifyStartis never called andgetRemainingPauseMsreports~60 * 60 * 1000for the full hour.Files touched
src/monitor/monitor.ts—-14recovery branch + backoff.src/api/session-guard.ts—pauseSessionduration arg, newclearSessionPause, updated docs.src/channel.ts— removeassertSessionActivefrom outbound paths, document why.src/monitor/monitor.test.ts— new, 6 tests.src/messaging/outbound-bypass.test.ts— new, 3 tests.src/api/session-guard.test.ts— extended, 5 new tests.Non-obvious decisions
pauseSessionentirely? It's still useful as the inbound-side bookkeeping signal during the briefnotifyStartattempt, and as the carrier for the exponential-backoff window when recovery itself fails. Removing it would either require a separate state machine or push the wait into ad-hoc sleeps in the monitor.clearSessionPauseinstead of just letting the pause expire naturally? A successfulnotifyStartcompletes in ~ms; without an explicit clear, the nextgetUpdateswould either wait out the (now-irrelevant) backoff window, or — if we shortened the window aggressively — race against the retry.assertSessionActivefrom outbound safe even if the lock is shared? The pause map was never a mutex; it was a one-bit "should the long-poll back off?" flag. Outbound REST traffic was reading that flag and treating it as authoritative for an unrelated transport, which was the bug. Removing the reads is the actual fix — there is no shared resource being protected.