fix(daemon): add backpressure control and command serialization to prevent IPC EAGAIN #529

Merged
ctate merged 1 commit into vercel-labs:main from shohu:fix/daemon-ipc-reliability
Feb 23, 2026
Conversation

@shohu shohu commented Feb 23, 2026

Summary

Fixes the daemon-side root causes of EAGAIN (os error 35 on macOS / error 11 on Linux) that crash the Rust CLI during IPC reads.

This complements #329 (CLI-side retry logic) by preventing the conditions that trigger EAGAIN in the first place.

Changes

src/browser.ts — Configurable Playwright timeout via environment variable

  • Add getDefaultTimeout() helper that reads AGENT_BROWSER_DEFAULT_TIMEOUT env var
  • Replace all hardcoded setDefaultTimeout(60000) calls (5 locations) with setDefaultTimeout(getDefaultTimeout())
  • CDP and recording contexts (10s timeout) are not affected
  • Default remains 60s when env var is unset — fully backward compatible
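
As an illustration of the bullets above, a helper of this kind could look roughly like the following. This is a minimal sketch, not the code from src/browser.ts: the name getDefaultTimeout and the AGENT_BROWSER_DEFAULT_TIMEOUT variable come from this PR, while the FALLBACK_TIMEOUT_MS constant, the injectable env parameter, and the handling of invalid or non-positive values are assumptions made for the example.

```typescript
// Hypothetical sketch of an env-driven timeout helper (assumed behavior,
// not the actual implementation in src/browser.ts).
const FALLBACK_TIMEOUT_MS = 60_000; // matches the previous hardcoded 60s default

function getDefaultTimeout(
  env: Record<string, string | undefined> = process.env
): number {
  const raw = env.AGENT_BROWSER_DEFAULT_TIMEOUT;
  // Unset or empty: keep the old behavior.
  if (raw === undefined || raw.trim() === "") return FALLBACK_TIMEOUT_MS;

  const parsed = Number(raw);
  // Assumption: non-numeric, zero, or negative values fall back to the default
  // rather than being passed through to Playwright.
  if (!Number.isFinite(parsed) || parsed <= 0) return FALLBACK_TIMEOUT_MS;

  return Math.floor(parsed);
}
```

A call site would then read `context.setDefaultTimeout(getDefaultTimeout())` in place of `context.setDefaultTimeout(60000)`.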

src/daemon.ts — Backpressure-aware writes + command serialization

  • Add safeWrite() helper that waits for drain event when socket.write() returns false, with proper cleanup on close/error
  • Serialize command execution per socket via a queue to prevent concurrent socket.write() calls that cause kernel buffer contention
  • Add .catch() guards for unhandled rejection safety
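
The backpressure-aware write described above might be sketched as follows. This is an assumed shape, not the code merged in src/daemon.ts: only the name safeWrite and the drain/close/error behavior come from the PR description. The DrainableSocket interface is a hypothetical stand-in that a Node net.Socket satisfies structurally, and the error message is illustrative.

```typescript
// Hypothetical minimal surface of a socket; net.Socket satisfies it.
interface DrainableSocket {
  write(data: string): boolean;
  once(event: "drain" | "close" | "error", listener: (...args: any[]) => void): unknown;
  off(event: "drain" | "close" | "error", listener: (...args: any[]) => void): unknown;
}

// Resolve once the data is accepted without backpressure, or once the
// socket signals 'drain'. Reject if the socket closes or errors first.
function safeWrite(socket: DrainableSocket, data: string): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    // write() returns false when the internal buffer is over its high-water mark.
    if (socket.write(data)) {
      resolve();
      return;
    }

    // Remove all three listeners so none of them leaks after the first fires.
    const cleanup = () => {
      socket.off("drain", onDrain);
      socket.off("close", onClose);
      socket.off("error", onError);
    };
    const onDrain = () => {
      cleanup();
      resolve();
    };
    const onClose = () => {
      cleanup();
      reject(new Error("socket closed before drain"));
    };
    const onError = (err: Error) => {
      cleanup();
      reject(err);
    };

    socket.once("drain", onDrain);
    socket.once("close", onClose);
    socket.once("error", onError);
  });
}
```

Callers that do not await the result would attach a `.catch()` so that a rejected write does not surface as an unhandled rejection.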

Root Cause Analysis

Two daemon-side issues combine to cause EAGAIN:

  1. Playwright timeout > CLI IPC timeout: setDefaultTimeout(60000) lets a Playwright operation block for up to 60s, but the Rust CLI's IPC read times out earlier. The daemon has not responded by then, so the CLI's read_line() fails with EAGAIN.

  2. Uncontrolled socket.write() concurrency: Multiple async command handlers can call socket.write() in parallel. When payloads are large (e.g., snapshot responses), the kernel buffer fills up and subsequent writes/reads fail with EAGAIN.
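
To make the second point concrete, per-socket serialization can be done with a promise chain, so that each command (including its response write) finishes before the next one starts. The sketch below is one possible way to do this, not the queue actually merged in src/daemon.ts; createSerialQueue, enqueueFor, and the WeakMap keyed by socket are hypothetical names chosen for the example.

```typescript
type Task<T> = () => Promise<T>;

// Returns an enqueue function that runs tasks strictly one after another.
function createSerialQueue() {
  // Tail of the chain; it never rejects, so one failed task cannot stall the queue.
  let tail: Promise<unknown> = Promise.resolve();

  return function enqueue<T>(task: Task<T>): Promise<T> {
    const run = tail.then(task);
    // Swallow the error on the chain only; the caller still sees it via `run`.
    tail = run.catch(() => undefined);
    return run;
  };
}

// One queue per socket, released automatically when the socket is garbage collected.
const queues = new WeakMap<object, ReturnType<typeof createSerialQueue>>();

function enqueueFor<T>(socket: object, task: Task<T>): Promise<T> {
  let queue = queues.get(socket);
  if (!queue) {
    queue = createSerialQueue();
    queues.set(socket, queue);
  }
  return queue(task);
}
```

With this shape, a data handler would call something like `enqueueFor(socket, () => handleCommand(line)).catch(logError)`, where handleCommand and logError are placeholders. Commands arriving on different sockets still run concurrently, since only tasks sharing a socket share a chain.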

Testing

  • All 363 existing tests pass (9 skipped)
  • Prettier formatting verified
  • TypeScript compilation clean
  • Manually tested with a heavy React Native/Expo app (1000+ DOM nodes): 10 consecutive snapshot commands complete without os error 35

Usage

# Set a shorter timeout (e.g., 30s) to prevent daemon hangs
export AGENT_BROWSER_DEFAULT_TIMEOUT=30000

Refs #322

…event IPC EAGAIN

- Add AGENT_BROWSER_DEFAULT_TIMEOUT env var to override Playwright's
  default 60s timeout (CDP/recording 10s timeouts unaffected)
- Add backpressure-aware safeWrite() that waits for drain when socket
  buffer is full, preventing data loss under load
- Serialize command execution per socket via queue to prevent concurrent
  writes that cause buffer contention

These daemon-side fixes complement vercel-labs#329 (CLI-side EAGAIN retry) by
addressing the root causes: Playwright operations that outlast the
CLI's IPC timeout, and concurrent socket.write() calls that overflow
the kernel buffer.

Tested with heavy React app (1000+ DOM nodes) — 10 consecutive
snapshot commands complete without os error 35/11.

Refs vercel-labs#322
vercel bot commented Feb 23, 2026

@shohu is attempting to deploy a commit to the Vercel Labs Team on Vercel.

A member of the Team first needs to authorize it.

ctate commented Feb 23, 2026

This is so good - thanks @shohu

@ctate ctate merged commit 16c4ef2 into vercel-labs:main Feb 23, 2026
3 of 4 checks passed
2 participants