Fix swebenchmultimodal timeout for large repo images #579

Closed
simonrosenberg wants to merge 4 commits into main from fix/swebenchmultimodal-api-timeout

simonrosenberg (Collaborator) commented Mar 26, 2026

Summary

  • Increases api_timeout from default 60s to 300s for swebenchmultimodal remote workspaces
  • Configurable via REMOTE_API_TIMEOUT env var
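
A minimal sketch of the env var passthrough described above, assuming the value is read once when the remote workspace is configured. The helper `resolve_api_timeout` and the constant `DEFAULT_API_TIMEOUT` are hypothetical names used for illustration; only `REMOTE_API_TIMEOUT`, `api_timeout`, and the 300s default come from this PR.

```python
import os

# Hypothetical constant; 300s is the new default stated in this PR.
DEFAULT_API_TIMEOUT = 300.0


def resolve_api_timeout(env=None):
    """Return the API timeout in seconds, honoring REMOTE_API_TIMEOUT if set."""
    env = os.environ if env is None else env
    raw = env.get("REMOTE_API_TIMEOUT")
    if raw is None or raw.strip() == "":
        # Unset or empty: fall back to the default.
        return DEFAULT_API_TIMEOUT
    value = float(raw)  # raises ValueError on malformed input
    if value <= 0:
        raise ValueError("REMOTE_API_TIMEOUT must be positive")
    return value
```

Passing a plain dict as `env` makes the override easy to check without touching the real process environment.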

Problem

The agent-server performs lazy initialization on the first send_message() call: inside _ensure_agent_ready() it spawns the ACP subprocess, performs the protocol handshake, and creates a session. For large prebuilt images (wp-calypso, p5.js, marked), this initialization consistently exceeds the default 60s api_timeout, causing httpx.ReadTimeout on the client side.
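
The failure mode can be reproduced in miniature. The sketch below is a toy model, not the agent-server code: `LazyAgentServer`, `call_with_timeout`, and `demo` are hypothetical, the sleep stands in for the subprocess spawn and handshake, and `asyncio.wait_for` stands in for the client-side HTTP read timeout. Durations are scaled down from seconds to fractions of a second.

```python
import asyncio


class LazyAgentServer:
    """Toy stand-in for a server that initializes on the first message."""

    def __init__(self, init_seconds):
        self.init_seconds = init_seconds
        self.ready = False

    async def _ensure_agent_ready(self):
        if not self.ready:
            # Stands in for subprocess spawn, handshake, and session creation.
            await asyncio.sleep(self.init_seconds)
            self.ready = True

    async def send_message(self, text):
        # The first call pays the full initialization cost.
        await self._ensure_agent_ready()
        return f"ack: {text}"


async def call_with_timeout(server, text, api_timeout):
    """Return the reply, or None if the client-side timeout fires first."""
    try:
        return await asyncio.wait_for(server.send_message(text), api_timeout)
    except asyncio.TimeoutError:
        return None


def demo(init_seconds, api_timeout):
    """Run one first-message call against a fresh toy server."""
    server = LazyAgentServer(init_seconds)
    return asyncio.run(call_with_timeout(server, "hello", api_timeout))
```

With these definitions, `demo(0.2, 0.05)` times out and returns `None`, while `demo(0.2, 1.0)` returns the acknowledgement, which is the same relationship as a slow init against a 60s versus a 300s api_timeout.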

Evidence from eval runs on 2026-03-26 (eval-23598243168-claude-son, eval-23600690440-claude-4-6):

| Repo | Timeout failure rate | Notes |
|---|---|---|
| Automattic/wp-calypso | 100% (37/37) | Init always >> 60s |
| processing/p5.js | 100% (16/16) | Init always >> 60s |
| markedjs/marked | 100% (11/11) | Init always >> 60s |
| diegomura/react-pdf | ~56% (5/9) | Init ~60s (borderline) |
| chartjs/Chart.js | ~29% (10/34) | Init usually < 60s |
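
For reference, the percentages follow directly from the failed/total counts in the table. A short check, where the dictionary layout and the `failure_rate` helper are illustrative and only the counts come from the eval runs cited above:

```python
# (failed, total) counts copied from the table above.
counts = {
    "Automattic/wp-calypso": (37, 37),
    "processing/p5.js": (16, 16),
    "markedjs/marked": (11, 11),
    "diegomura/react-pdf": (5, 9),
    "chartjs/Chart.js": (10, 34),
}


def failure_rate(failed, total):
    """Timeout failure rate as a percentage."""
    return 100.0 * failed / total


# Rounded to one decimal place for comparison with the table.
rates = {repo: round(failure_rate(f, t), 1) for repo, (f, t) in counts.items()}
```

This gives 55.6 for react-pdf and 29.4 for Chart.js, matching the rounded ~56% and ~29% figures.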

Key observations:

  • Zero recovery across retries (each retry creates a fresh pod, which hits the same timeout)
  • resource_factor escalation (1→2→4→8) does not help because the failure is a matter of elapsed time, not of resources
  • The /health endpoint passes (it is a no-op that returns "OK") even though the server can't handle requests yet
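
The last observation is the gap between liveness and readiness. The sketch below illustrates that gap with a toy model; `ToyServer`, `is_ready`, and `wait_until` are hypothetical, and this PR does not state that the agent-server exposes any readiness check.

```python
import time


class ToyServer:
    """Toy model: alive immediately, ready only after init_seconds."""

    def __init__(self, init_seconds):
        self._started = time.monotonic()
        self._init_seconds = init_seconds

    def health(self):
        # Liveness only: mirrors the no-op check described above.
        return "OK"

    def is_ready(self):
        # Hypothetical readiness check: true once initialization has elapsed.
        return time.monotonic() - self._started >= self._init_seconds


def wait_until(predicate, timeout, interval=0.01):
    """Poll predicate until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()
```

A caller that gates on `health()` alone proceeds immediately, whereas `wait_until(server.is_ready, timeout)` only returns `True` once the toy server has finished initializing.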

SDK issue for the proper fix (move init to startup): OpenHands/software-agent-sdk#2588

Test plan

  • Run swebenchmultimodal with ACP agent on large-image instances (wp-calypso, p5.js, marked) and verify they no longer time out during send_message
  • Verify REMOTE_API_TIMEOUT env var override works

🤖 Generated with Claude Code

Debug Agent and others added 3 commits March 26, 2026 16:33
SDK v1.15.0 changed the agent-server Dockerfile builder stage from
--managed-python (self-contained venv) to --python-preference only-system
(venv symlinks to system Python). This broke eval image builds because
Dockerfile.agent-layer copies /agent-server from the builder into an
eval-base image that doesn't have Python 3.13 at /usr/local/bin/,
causing "no such file or directory" on every runtime pod.

Fix: copy the Python 3.13 binary, stdlib, and shared library from the
builder stage into the final image so the venv symlinks resolve.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ldconfig fails in eval-base images that run as non-root user.
Use LD_LIBRARY_PATH=/usr/local/lib instead to make libpython3.13
discoverable without needing root privileges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The default api_timeout of 60s is too short for the ACP agent lazy
initialization that happens on the first send_message() call. For large
repo images (wp-calypso, p5.js, marked), the ACP subprocess startup
consistently exceeds 60s, causing 100% failure rates even at
resource_factor=8.

Raising to 300s (configurable via REMOTE_API_TIMEOUT env var) gives the
agent-server enough time to complete lazy init for large images.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Pragmatic timeout fix for a real, well-documented problem

Analysis:

  • Code is clean: simple env var passthrough following the existing startup_timeout pattern
  • No complexity, no breaking changes, no security issues
  • Problem evidence is solid (100% failure rates on large images)
  • Acknowledged as a workaround (SDK #1095 for proper fix)

Issue:

  • Test plan checkboxes are unchecked - no proof the fix actually works

Recommendation:
Before merging, run eval on at least one problematic repo (wp-calypso, p5.js, or marked) with REMOTE_API_TIMEOUT=300 and add evidence to the PR description showing it succeeds. For a 2-line timeout change the code is obviously correct, but proving it solves the real-world problem is not optional.

Verdict: ✅ Worth merging after verification

Key Insight: Obvious solution to a timeout problem - just prove it works in production.

…-images' into fix/swebenchmultimodal-api-timeout
