Skip to content

fix(relay): prevent agent session corruption on timeout#405

Merged
chenhg5 merged 2 commits intochenhg5:mainfrom
pinximei:fix/relay-timeout-session-survival
Apr 2, 2026
Merged

fix(relay): prevent agent session corruption on timeout#405
chenhg5 merged 2 commits intochenhg5:mainfrom
pinximei:fix/relay-timeout-session-survival

Conversation

@pinximei
Copy link
Copy Markdown
Contributor

@pinximei pinximei commented Apr 1, 2026

Summary

  • fix(relay): After a relay timeout, the target agent became permanently unresponsive. Three root causes fixed:
    1. StartSession received the relay timeout context - when the deadline fired, exec.CommandContext killed the Claude process mid-turn, corrupting session state. Now uses the engine context.
    2. defer agentSession.Close() killed the process immediately on return. Now drains in a background goroutine (drainRelaySession) so the agent finishes its turn and saves state cleanly.
    3. No fallback when resuming a corrupted session - added retry logic that clears the stale session ID and starts fresh.
  • fix(feishu): Respond to @ALL messages in group chats. Feishu @ALL sends @_all with 0 mentions, which was ignored when group_reply_all=false. Added respond_to_at_everyone_and_here option.

Test plan

  • TestHandleRelay_ReturnsPartialOnTimeout - partial response returned, session closed cleanly in background
  • TestHandleRelay_TimeoutWithoutTextReturnsContextError - error returned, no goroutine leak
  • TestHandleRelay_ResumeFailureFallsBackToFreshSession - stale session ID cleared, fresh session created
  • All existing relay tests pass

pinxi_mei and others added 2 commits April 1, 2026 17:25
Feishu @ALL sends {"text":"@_all"} with 0 mentions, which was being
ignored when group_reply_all=false. Add respond_to_at_everyone_and_here
option to allow bots to respond to @ALL messages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three issues caused agents to become unresponsive after a relay timeout:

1. StartSession received the relay timeout context -- when the deadline
   fired, exec.CommandContext killed the Claude process mid-turn,
   corrupting the session state. Now uses the engine context instead.

2. defer agentSession.Close() killed the process immediately on return.
   Now the session drains in a background goroutine (drainRelaySession)
   so the agent can finish its turn and save state cleanly.

3. No fallback when resuming a corrupted session. Added retry logic that
   clears the stale session ID and starts fresh, matching the existing
   interactive message flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Owner

@chenhg5 chenhg5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Well-designed fix for relay timeout handling.

Review summary:

  • ✅ Correct: Uses engine context instead of relay timeout context to prevent killing agent process
  • ✅ drainRelaySession goroutine for clean session state preservation
  • ✅ Fallback to fresh session on resume failure
  • ✅ All tests pass
  • ✅ CI passes

Good fix for session corruption issue.

@chenhg5 chenhg5 merged commit a0bc3f9 into chenhg5:main Apr 2, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants