fix(runtime): launch CLI agents on Windows #29

Merged: alchemistklk merged 2 commits into nexu-io:main from hoangmanhtuan165:fix/windows-agent-spawn on Jun 17, 2026

@hoangmanhtuan165

Problem

On Windows the studio composer couldn't send any message — the agent picker showed every CLI agent (claude, codex, …) as unavailable, and selecting "Anthropic API (direct)" needed an API key. Forcing a CLI agent through failed with exit code -4058 (ENOENT) / empty reply.

Two Windows-only root causes:

  1. Detection used `which`, which only exists on POSIX shells. The studio server runs under `node` directly, so `execFile('which', …)` always failed → every CLI agent reported `available: false`, even when installed.
  2. Spawn passed the bare bin name. On Windows `claude` is really `claude.cmd`; Node's `spawn()` can't launch a batch file without a shell → ENOENT (-4058).

Fix

  • `detect.ts`: use `where` on win32, `which` elsewhere; take the first match line.
  • `spawn.ts`: resolve the real bin path via `where` and run through a shell on win32 so `.cmd` shims launch; keep the cheap bare-name PATH lookup on POSIX. The CLI spawn branch is now async (because path resolution is async) but still returns a synchronous `SpawnHandle` via an `AbortController`.
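
A minimal sketch of the detection logic described in the first bullet, assuming hypothetical helper names (`lookupCommandFor`, `firstMatchLine`) that are not taken from `detect.ts`. It shows only the two decisions involved: choosing `where` or `which` from a Node-style platform string, and keeping the first line of the lookup output, which on Windows is CRLF-terminated and may list several matches.

```typescript
// Hypothetical helpers; names and shapes are illustrative, not the PR's code.

/** Pick the PATH lookup tool for a Node-style platform string. */
function lookupCommandFor(platform: string): "where" | "which" {
  return platform === "win32" ? "where" : "which";
}

/** Return the first non-empty line of lookup output, or null if there is none. */
function firstMatchLine(stdout: string): string | null {
  // Split on both LF and CRLF so Windows output does not keep a trailing "\r".
  for (const line of stdout.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.length > 0) return trimmed;
  }
  return null;
}
```

In a real detector the chosen command would presumably be passed to `execFile` from `node:child_process` along with the agent's bin name, with a non-zero exit or empty output treated as `available: false`.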

No behaviour change on macOS/Linux (same `which` + bare-bin spawn path).
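
The spawn change can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `spawn.ts`: `LaunchPlan`, `DeferredHandle`, `planLaunch`, and `launchDeferred` are hypothetical names, and the actual `SpawnHandle` shape is not shown in this PR. The first function keeps the bare bin name on POSIX and switches to a resolved path plus a shell on win32; the second returns a handle immediately while path resolution runs asynchronously, using an `AbortController` so that an early cancel is still honoured.

```typescript
// Illustrative sketch; every name here is hypothetical, not the PR's API.

interface LaunchPlan {
  command: string; // what to hand to the process launcher
  shell: boolean;  // whether the launcher should go through a shell
}

/** POSIX keeps the bare bin name; win32 uses the resolved path via a shell. */
function planLaunch(platform: string, bin: string, resolvedPath: string | null): LaunchPlan {
  if (platform !== "win32") return { command: bin, shell: false };
  const target = resolvedPath ?? bin;
  // Quote the path so a directory such as "Program Files" survives the shell.
  return { command: `"${target}"`, shell: true };
}

interface DeferredHandle {
  abort(): void;                // usable immediately, before the child exists
  exit: Promise<number | null>; // null means "aborted before launch"
}

/** Return a handle synchronously while path resolution runs in the background. */
function launchDeferred(
  resolvePlan: () => Promise<LaunchPlan>,
  run: (plan: LaunchPlan, signal: AbortSignal) => Promise<number>,
): DeferredHandle {
  const controller = new AbortController();
  const exit = (async (): Promise<number | null> => {
    const plan = await resolvePlan();
    if (controller.signal.aborted) return null; // cancelled during resolution
    return run(plan, controller.signal);
  })();
  return { abort: () => controller.abort(), exit };
}
```

Here `run` stands in for whatever wraps `spawn` from `node:child_process`; passing `shell: plan.shell` and the signal through its options would be one way to wire it up. Note that arguments routed through a shell are not escaped by Node, so untrusted input needs care on the win32 branch.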

Verification

Verified end-to-end on Windows 11 against a real claude login (no API key):

POST /api/agents/claude/test
→ {"ok":true,"exit_code":0,"ms":15256,"bytes":6,"stdout_head":"hello\n"}

Before the fix the same call returned exit_code:-4058 (empty) / exit_code:1. Agent picker now correctly shows claude and codex as available.

Note: a separate "cannot be launched inside another Claude Code session" error appears only when the studio server is started from within a Claude Code session (inherited CLAUDECODE=1); not a code issue — launching studio from a normal terminal is unaffected. Left out of this PR.

Agent detection used `which`, which only exists on POSIX shells; the
studio server runs under `node` directly so every CLI agent (claude,
codex, ...) was reported unavailable on Windows. Spawning also passed
the bare bin name, so `claude` (really `claude.cmd`) failed to launch
with ENOENT (-4058).

- detect.ts: use `where` on win32, `which` elsewhere; take first match
- spawn.ts: resolve the real bin path via `where` and run through a
  shell on win32 so `.cmd` shims launch; keep the cheap PATH path on
  POSIX. CLI spawn is now async (path resolution is async) but still
  returns a synchronous SpawnHandle via an AbortController.

Verified end-to-end on Windows: claude agent test returns exit 0 with
real output.
lefarcen requested a review from Siri-Ray on June 7, 2026 14:03
lefarcen added the size/M (100-299 LOC), risk/medium, and type/bugfix labels on Jun 7, 2026
@lefarcen

lefarcen commented Jun 7, 2026

Contributor

Heads-up: PR #3 and PR #10 are also open against the same Windows CLI runtime path. All three touch packages/runtime/src/detect.ts / packages/runtime/src/spawn.ts around win32 detection and launch behavior.

You and the other authors may want to compare approaches so effort doesn't get duplicated while maintainers decide what lands.

…I-key agent

The studio composer returned empty replies mid-flow on Windows (e.g. at
the confirm/generate phase) with "No ANTHROPIC_API_KEY in env", even
though a logged-in `claude` CLI was present.

Two causes:

- detectAll() probes ~13 agents with Promise.all, so a dozen `where`
  lookups spawn at once. Under that contention the 2s timeout was too
  tight and intermittently marked installed CLI agents (claude/codex)
  unavailable. Bump the which/where timeout to 8s.

- When no agent is pinned, the resolver fell back to `anthropic-api`
  (HTTP, needs a key). Combined with the flaky probe above and a
  non-persisted project.agentId, a later turn could resolve to
  anthropic-api and fail with "No ANTHROPIC_API_KEY". Now prefer a
  ready-to-run CLI agent, only use anthropic-api when it's actually
  configured, and persist the resolved agent on the project so every
  turn in a session uses the same one.
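
The resolver preference in the second point can be sketched as below. The types and function names (`AgentStatus`, `Project`, `pickAgent`, `resolveForProject`) are assumptions made for illustration; only `anthropic-api` and `project.agentId` come from the description above. The point of the sketch is the ordering (ready CLI agent first, `anthropic-api` only when configured) and the fact that the first successful resolution is pinned on the project.

```typescript
// Hypothetical shapes for illustration; the real resolver's types may differ.

interface AgentStatus {
  id: string;
  kind: "cli" | "http";
  ready: boolean; // CLI: found on PATH; HTTP: credentials configured
}

interface Project {
  agentId?: string;
}

/** Prefer a ready CLI agent; fall back to anthropic-api only when configured. */
function pickAgent(agents: AgentStatus[]): string | undefined {
  const cli = agents.find((a) => a.kind === "cli" && a.ready);
  if (cli) return cli.id;
  const api = agents.find((a) => a.id === "anthropic-api" && a.ready);
  return api ? api.id : undefined;
}

/** Resolve once, then pin the result so later turns reuse the same agent. */
function resolveForProject(project: Project, agents: AgentStatus[]): string | undefined {
  if (project.agentId !== undefined) return project.agentId;
  const resolved = pickAgent(agents);
  if (resolved !== undefined) project.agentId = resolved;
  return resolved;
}
```

With the agent pinned, a later turn whose probe happens to fail still resolves to the agent chosen on the first turn instead of drifting to `anthropic-api`.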

Verified end-to-end on Windows: detection returns claude=available
stably (3/3), and a full opener→confirm→generate run produces HTML
(preview_ready) via the claude CLI with no API key.
@hoangmanhtuan165

Author

Added a second commit (f829d56) for two more Windows-only issues found while testing the full studio flow:

  1. Flaky agent detection: `detectAll()` probes ~13 agents via `Promise.all`, so a dozen `where` lookups spawn at once. Under that contention the 2s timeout intermittently marked installed CLI agents (claude/codex) unavailable. Bumped the `which`/`where` timeout to 8s.
  2. Bad fallback: when no agent is pinned, the resolver fell back to `anthropic-api` (needs a key). With the flaky probe above plus a non-persisted `project.agentId`, a later turn (e.g. confirm/generate) could resolve to `anthropic-api` and fail with "No ANTHROPIC_API_KEY" → empty reply. The resolver now prefers a ready-to-run CLI agent, only uses `anthropic-api` when it is actually configured, and persists the resolved agent on the project.
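
To make the contention point concrete, here is a rough sketch of concurrent probing with a per-probe timeout. `withTimeout` and `detectAllSketch` are hypothetical names, and racing each probe against a timer is an assumption for illustration; the actual change may instead rely on a timeout option of the child-process call. Only the 8s value comes from this PR.

```typescript
// Illustrative only: function names and structure are not taken from the PR.

const PROBE_TIMEOUT_MS = 8000; // the PR raises this from 2000

/** Resolve to `onTimeout` if `work` has not settled within `ms`. */
async function withTimeout<T>(work: Promise<T>, ms: number, onTimeout: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(onTimeout), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after a fast probe
  }
}

/** Probe every agent concurrently; a timed-out probe counts as unavailable. */
async function detectAllSketch(
  names: string[],
  probe: (name: string) => Promise<boolean>,
  timeoutMs: number = PROBE_TIMEOUT_MS,
): Promise<Map<string, boolean>> {
  const results = await Promise.all(
    names.map((name) => withTimeout(probe(name), timeoutMs, false)),
  );
  const availability = new Map<string, boolean>();
  names.forEach((name, i) => availability.set(name, results[i] === true));
  return availability;
}
```

Because all probes start at once, each one competes with the others for process startup time, which is why a budget that is comfortable for a single lookup can be too tight for a dozen. Capping how many lookups run at once would be another way to reduce contention; the PR takes the simpler route of a longer timeout.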

Verified end-to-end on Windows: detection returns claude=available stably (3/3), and a full opener→confirm→generate run produces HTML (preview_ready) via the claude CLI with no API key.

@lefarcen

lefarcen commented Jun 9, 2026

Contributor

One update since my earlier note: PR #31 has gone through multiple review rounds and was APPROVED on the same packages/runtime/src/detect.ts / packages/runtime/src/spawn.ts path. It covers PATHEXT-aware agent detection and .cmd/.bat shim unwrapping without shell: true, with a new regression test suite. It's now at the front of the merge queue.

That overlaps with the core ENOENT fix in this PR. The pieces that look unique here and aren't in #31:

  • packages/cli/src/studio-server.ts — studio-flow changes
  • Flaky detection timeout bump (8 s) for where/which under contention
  • Agent fallback preference (prefer CLI → only use anthropic-api when actually configured) + project-level agent persistence

Worth watching what exactly lands from #31 first. If the Windows ENOENT fix ships via #31, narrowing this PR to those three unique pieces would make the path to merge faster and cleaner. Happy to keep this open and revisit once #31 resolves — just wanted to make sure you had the latest context.

alchemistklk merged commit c67070b into nexu-io:main on Jun 17, 2026