
swebenchmultimodal: diegomura/react-pdf-1178 consistently fails runtime init across models and runs #645

@simonrosenberg

Description


Summary

The diegomura__react-pdf-1178 instance in swebenchmultimodal-dev consistently fails
runtime init across multiple models and runs, even after the runtime-api / sdk fix for
#523 (OpenHands/software-agent-sdk#2656). This is deterministic, not transient: every
attempt in every run fails with the same 502→503 runtime init pattern.

Two sibling instances (react-pdf-1280, react-pdf-1552) also fail frequently but
intermittently — they sometimes succeed on retry.

Evidence

Run 1: 2026-04-06, opus4.6 + sonnet4.5 (evaluation run 24040981135)

Model                       Job                          react-pdf-1178  react-pdf-1280  react-pdf-1552
claude-4.6-opus             eval-24040981135-claude-4-6  ❌ failed 4x    ✅ (on retry)   ✅ (on retry)
claude-sonnet-4-5-20250929  eval-24040981135-claude-son  ❌ failed 4x    ❌ failed 4x    ❌ failed 4x

Run 2: 2026-04-04, gemini-3.1 (jobs eval-23983344719-gemini-3-1, eval-23968391657-gemini-3-1)

react-pdf-1178 failed in both jobs (Datadog logs confirm it failed after 4 attempts).

So: 3 distinct models across at least 3 eval runs over 3 days, and react-pdf-1178 failed
every time.

Error pattern

From kube_job:eval-24040981135-claude-4-6 (Opus), instance diegomura__react-pdf-1178:

2026-04-06 19:43:17 ERROR  HTTP request failed (502 Bad Gateway): ...
  url: https://vqjiqojpacbxmmaw.eval-runtime.all-hands.dev/api/acp/conversations/9e52265f-...
2026-04-06 19:43:19 ERROR  HTTP request failed (503 Service Unavailable): ...
... (20+ seconds of 1/s polling with 503s) ...
2026-04-06 19:43:37 WARNING [worker] runtime init failure instance=diegomura__react-pdf-1178
    attempt=1 retry=1 runtime_id=vqjiqojpacbxmmaw
    error=Conversation run failed ...: Remote conversation ended with error
... (3 more attempts with different runtime_ids, same pattern) ...
2026-04-06 21:59:16 ERROR  [worker] Instance diegomura__react-pdf-1178 failed after 4 attempts.

Each attempt spins up a fresh runtime pod (new runtime_id), hits 502/503 on
/api/acp/conversations/..., polls for 20+ seconds, and gives up. After 4 such attempts the
worker marks the instance failed.
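
To check the "new runtime_id per attempt" pattern across a whole job log rather than by eye, the WARNING lines can be tallied per instance. This is a minimal sketch that assumes the worker log keeps the exact field order shown in the excerpt above (instance, attempt, retry, runtime_id); the function name `summarize_init_failures` is hypothetical and not part of any existing tooling.

```python
import re
from collections import defaultdict

# Matches the "[worker] runtime init failure" line quoted in the excerpt.
# \s+ tolerates the line wrap between "instance=..." and "attempt=...".
INIT_FAILURE = re.compile(
    r"runtime init failure instance=(?P<instance>\S+)\s+"
    r"attempt=(?P<attempt>\d+)\s+retry=(?P<retry>\d+)\s+"
    r"runtime_id=(?P<runtime_id>\S+)"
)


def summarize_init_failures(log_text: str) -> dict[str, list[str]]:
    """Map each instance to the runtime_ids of its failed init attempts."""
    failures = defaultdict(list)
    for match in INIT_FAILURE.finditer(log_text):
        failures[match["instance"]].append(match["runtime_id"])
    return dict(failures)


# The one failure line that appears in full in the excerpt.
sample = """\
2026-04-06 19:43:37 WARNING [worker] runtime init failure instance=diegomura__react-pdf-1178
    attempt=1 retry=1 runtime_id=vqjiqojpacbxmmaw
"""
print(summarize_init_failures(sample))
# → {'diegomura__react-pdf-1178': ['vqjiqojpacbxmmaw']}
```

An instance whose list holds 4 distinct runtime_ids matches the deterministic pattern described here; a shorter list would point at the intermittent behaviour seen for 1280 and 1552.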

Suspected root cause

Likely the same family as #352 (still OPEN): the react-pdf-1178 image is slow to boot
(or has a startup issue), the agent-server startup probe on port 60000 fails, the pod is
force-stopped, and the client sees 502/503 on
/api/acp/conversations/....

Unlike #523 (which was about large wp-calypso/p5.js images and was fixed by sdk#2656),
this one is narrower — a single react-pdf instance that consistently fails. Possibilities:

  1. Broken image: react-pdf-1178 Dockerfile produces an image that never becomes
    healthy (e.g. node version regression — see root_cause_acp_node.md in memory; Node
    v12–v14 crashes claude-agent-acp).
  2. Unreasonable startup time: image starts but takes longer than the startup probe
    window.
  3. ACP-specific init bug: the instance repo requires setup that ACP agents trip on.
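
One way to separate possibility 1 from possibility 2 is to start the image locally and time how long the agent-server takes to answer. The sketch below is untested against the real image and rests on assumptions: that /server_info on port 60000 returns a 2xx status once the server is healthy, that the container port is published at localhost:60000, and that 300 s is a generous upper bound. `http_ok` and `seconds_until_healthy` are hypothetical helper names, not SDK functions.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def http_ok(url: str) -> bool:
    """Return True if GET url answers with a 2xx status (assumed health signal)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, 4xx/5xx, or a malformed response.
        return False


def seconds_until_healthy(
    url: str,
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> Optional[float]:
    """Poll url until probe succeeds; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while True:
        if probe(url):
            return time.monotonic() - start
        if time.monotonic() - start + interval_s > timeout_s:
            return None
        time.sleep(interval_s)


# Usage, with the container already running and port 60000 published (assumed):
#   elapsed = seconds_until_healthy("http://localhost:60000/server_info")
```

A `None` result would be consistent with an image that never becomes healthy within the chosen timeout; a finite value larger than the startup probe window (not stated in this issue) would point at slow startup instead.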

Repro

/trigger-benchmarks swebenchmultimodal agent_type=acp-claude model_ids=claude-4.6-opus \
  eval_limit=500 reason="repro react-pdf-1178 failure"

Expected: react-pdf-1178 fails after 4 runtime init attempts.

Proposed next steps

  1. Inspect the SWE-bench-M image for diegomura/react-pdf-1178 — manually pull it,
    start agent-server in it, verify /server_info on port 60000 comes up.
  2. Check if Node.js is old in that image (see known ACP/Node issue); force Node 22
    install if missing.
  3. If the image is fundamentally broken, consider excluding the 3 affected instances
    (1178, 1280, 1552) or filing upstream with SWE-bench-M.
  4. Add a fast-fail path in the runtime init polling: if 5xx persists for >5s on the
    first attempt, abort that attempt instead of polling for 20s (saves ~60s per failed
    instance — currently ~90s of wall-clock wasted on retries).
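
The fast-fail idea in step 4 could look roughly like the sketch below. It is not the worker's actual polling code: `poll_until_ready`, `RuntimeInitAborted`, and the `get_status` callback are hypothetical, and the 5 s / 20 s / 1 s defaults are taken from the numbers in this issue. The clock and sleep functions are injectable so the logic can be exercised without a real runtime.

```python
import time
from typing import Callable, Optional


class RuntimeInitAborted(Exception):
    """Raised when the runtime keeps answering 5xx past the fast-fail budget."""


def poll_until_ready(
    get_status: Callable[[], int],
    fast_fail_s: float = 5.0,
    max_wait_s: float = 20.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until a 2xx status and return the elapsed seconds.

    Raises RuntimeInitAborted if an unbroken run of 5xx responses lasts
    longer than fast_fail_s, and TimeoutError after max_wait_s otherwise.
    """
    start = clock()
    streak_start: Optional[float] = None  # when the current 5xx run began
    while True:
        status = get_status()
        now = clock()
        if 200 <= status < 300:
            return now - start
        if 500 <= status < 600:
            if streak_start is None:
                streak_start = now
            if now - streak_start > fast_fail_s:
                raise RuntimeInitAborted(
                    f"5xx for {now - streak_start:.1f}s (last status {status})"
                )
        else:
            streak_start = None  # a non-5xx, non-2xx reply resets the streak
        if now - start >= max_wait_s:
            raise TimeoutError(f"not ready after {now - start:.1f}s")
        sleep(interval_s)
```

With these defaults a runtime that only ever returns 502/503 is abandoned shortly after the 5 s budget instead of after the full 20 s window, which is where the per-attempt saving would come from.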
