Skip to content

Comments

Better stop behavior, support override entrypoint#99

Merged
sjmiller609 merged 11 commits intomainfrom
stop-when-app-fails
Feb 13, 2026
Merged

Better stop behavior, support override entrypoint#99
sjmiller609 merged 11 commits intomainfrom
stop-when-app-fails

Conversation

@sjmiller609
Copy link
Collaborator

@sjmiller609 sjmiller609 commented Feb 13, 2026

Summary

  • Graceful shutdown: adds Shutdown gRPC RPC so hypeman stop sends SIGTERM to the app (via guest-agent -> init -> entrypoint), with a configurable timeout and hard-kill fallback. Replaces syscall.Exit with reboot(POWER_OFF) to eliminate the kernel panic on app exit.
  • Exit info propagation: init writes a HYPEMAN-EXIT sentinel to serial console before shutdown. Host lazily parses it when the VM stops and persists exit_code + exit_message to instance metadata and the API response. Includes signal descriptions, OOM detection via /dev/kmsg, and shell convention codes (126/127).
  • Entrypoint/Cmd override on create and start, matching docker run --entrypoint / docker run <image> <command>.
  • All init log phases prefixed with hypeman-init: to distinguish from guest-agent and app output.

Test plan

  • Unit tests for describeExitCode, formatExitSentinel, isOOMLine, parseExitSentinelLine (including carriage return handling)
  • TestAppExitPropagation: alpine with /nonexistent-command -> exit 127 "command not found" in metadata
  • TestOOMExitPropagation: 128MB VM with memory-filling loop -> exit 137 "killed by signal 9 (killed) - OOM" in metadata
  • TestBasicEndToEnd: graceful stop -> verify ExitCode populated -> restart -> verify ExitCode cleared -> stop -> delete
  • Graceful stop assertions added to TestQEMUBasicEndToEnd

Note

Medium Risk
Touches core instance lifecycle (start/stop/init) and introduces new gRPC/API contract changes, which could affect VM shutdown behavior and metadata persistence if edge cases/regressions exist.

Overview
Implements graceful VM stop by adding a Shutdown gRPC RPC (guest-agent signals PID 1) and updating StopInstance to attempt a timed graceful shutdown before falling back to hard-killing the hypervisor.

Adds exit info propagation: guest init now writes a machine-parseable HYPEMAN-EXIT sentinel (including OOM detection) and powers off via reboot(POWER_OFF); the host lazily parses/persists exit_code/exit_message from serial logs and returns them via the API.

Adds Docker-like entrypoint/cmd overrides on CreateInstance and StartInstance (with restart clearing stale exit info), updates OpenAPI/generated client to accept a JSON body for start, and adjusts/extends integration+unit tests plus increases test timeouts.

Written by Cursor Bugbot for commit 06cd0e8. This will update automatically on new commits. Configure here.

Add graceful shutdown, exit info propagation, and richer init logging

Replace syscall.Exit with reboot(POWER_OFF) to eliminate kernel panic
when the entrypoint exits. Init now traps signals and forwards them to
the entrypoint child, enabling graceful stop via a new Shutdown gRPC RPC.

Exit info (code + human-readable message) is written as a machine-parseable
sentinel to the serial console, lazily parsed by the host when it discovers
the VM has stopped, persisted to instance metadata, and exposed on the API.

Special cases: signal-killed processes report 128+signal with signal name,
SIGKILL checks /dev/kmsg for OOM annotation, 126/127 describe shell errors.

Also adds entrypoint/cmd override on instance creation (like docker run),
and prefixes all init log phases with "hypeman-init:" for clarity.
@sjmiller609 sjmiller609 changed the title Better stop behavior Better stop behavior, support override entrypoint Feb 13, 2026
@github-actions
Copy link

github-actions bot commented Feb 13, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Better stop behavior
⚠️ hypeman-openapi studio · code

There was a regression in your SDK.
generate ⚠️

⚠️ hypeman-typescript studio · code

There was a regression in your SDK.
generate ❗build ✅lint ✅test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/9f372846f08ec591342f525498449861990821c5/dist.tar.gz
⚠️ hypeman-go studio · code

There was a regression in your SDK.
generate ⚠️lint ✅test ✅

go get github.com/stainless-sdks/hypeman-go@12cfb4e4ddfc198c65b09efeb73c8db0fd609f5f

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-02-13 19:59:23 UTC

Address code review feedback on shutdown/exit-info PR

- Fix checkOOMKill hanging: open /dev/kmsg with O_NONBLOCK and add 1s
  timeout to prevent init from blocking if the kernel ring buffer stalls.
- Scan serial log from end: read last 8KB instead of scanning from the
  beginning, since the sentinel is always near the end of the file.
- Strip whitespace from serial log lines before parsing sentinel to
  handle carriage returns added by the TTY.
- Clear ExitCode/ExitMessage on StartInstance so stale exit info from a
  previous run doesn't persist after restart.
- Add Entrypoint/Cmd override to StartInstance (matching CreateInstance),
  with optional request body on POST /instances/{id}/start.
- Integration test for stop -> start -> verify exit info cleared.
- describeExitCode returns "exit code N" instead of generic "error" for
  codes 1-125.
- Start body required -- Changed required: false to required: true in OpenAPI spec. Clients send {} for no overrides.
- Metadata race eliminated -- parseExitSentinel is now a pure reader returning (code, msg, ok). toInstance populates exit info in-memory only (no writes). Persistence happens in two places, both under the instance write lock:
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Copy link
Contributor

@hiroTamada hiroTamada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- clean design for graceful shutdown, exit info propagation, and entrypoint/cmd overrides. A couple nits on typos in manager.go.

@sjmiller609 sjmiller609 merged commit 564c4f7 into main Feb 13, 2026
4 checks passed
@sjmiller609 sjmiller609 deleted the stop-when-app-fails branch February 13, 2026 19:57
rgarcia added a commit that referenced this pull request Feb 14, 2026
StartInstance now takes a StartInstanceRequest parameter (from PR #99).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
rgarcia added a commit that referenced this pull request Feb 14, 2026
After PR #99, init does reboot(POWER_OFF) when the entrypoint exits.
Alpine's default entrypoint (/bin/sh) exits immediately with no stdin,
killing the VM before exec tests can run. Add Cmd: sleep infinity to
keep the VM alive, matching the pattern in volumes_test.go.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
rgarcia added a commit that referenced this pull request Feb 14, 2026
After PR #99, init does reboot(POWER_OFF) when the entrypoint exits.
Alpine's /bin/sh exits immediately with no stdin, killing the VM before
exec can run. nginx:alpine has a long-running daemon entrypoint that
keeps the VM alive, matching the pattern in exec_test.go.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
rgarcia added a commit that referenced this pull request Feb 15, 2026
* refactor: cross-platform foundation for macOS support

Split platform-specific code into _linux.go and _darwin.go files across
resources, network, devices, ingress, vmm, and vm_metrics packages.
Add hypervisor abstraction with registration pattern (RegisterSocketName,
RegisterVsockDialerFactory, RegisterClientFactory) to decouple instance
management from specific hypervisor implementations. Add "vz" to the
OpenAPI hypervisor type enum, erofs disk format support, and insecure
registry option for builds.

No behavioral changes on Linux. macOS can now compile but has no VM
functionality yet.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: add macOS VM support via Apple Virtualization.framework

Add vz hypervisor implementation that runs VMs on macOS using Apple's
Virtualization.framework via a codesigned subprocess (vz-shim). Includes
vsock-based guest communication, shared directory mounts for disk access,
and macOS-native networking via vmnet.

Key components:
- cmd/vz-shim: subprocess that creates and manages vz VMs
- lib/hypervisor/vz: starter, client, and vsock dialer for vz
- Makefile targets: build-darwin, test-darwin, dev-darwin, sign-darwin
- CI: macOS runner for test-darwin
- scripts/install.sh: macOS support (launchd, Homebrew, codesign)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: embed vz.entitlements and fix macOS runtime issues

- Embed vz.entitlements as a Go resource and write it to a temp file at
  runtime for codesigning, replacing the broken entitlementsPath() that
  looked for the file next to the executable
- Add vz-shim copy step in .air.darwin.toml so the go:embed directive
  can find the binary during dev builds
- Add --entitlements flag to codesign in install.sh download path so
  binaries receive the virtualization entitlement
- Prepend /opt/homebrew/opt/e2fsprogs/sbin to launchd plist PATH so
  mkfs.ext4 from keg-only e2fsprogs is found at runtime

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: remove stale disk platform files from rebase

disk_darwin.go and disk_linux.go were unified into disk.go in PR #89
but snuck back in during the rebase as new files with no conflicts.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: vsock proxy data loss, zombie reaping, and remove vz-shim from install

- Read from bufio.Reader instead of raw conn in vsock proxy to prevent
  silent data loss when the buffered reader consumed beyond the newline
- Replace cmd.Process.Release() with go cmd.Wait() to properly reap
  vz-shim child processes instead of leaving zombies
- Update hypervisor README to reflect vz subprocess model (not in-process)
- Remove vz-shim from install/uninstall scripts (it's embedded in
  hypeman-api and extracted at runtime)
- Add CLI smoke tests (hypeman ps, hypeman images) to e2e install test

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: e2e test config sourcing and missing CLI handling

- Extract JWT_SECRET/PORT with grep instead of sourcing the config file,
  which breaks on macOS where paths contain spaces
- Skip CLI smoke tests gracefully when CLI binary is not installed
  (e.g., no darwin/arm64 release available)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: remove unused registry-push flag from gen-jwt

Builder images are now auto-built on startup, so manual push workflow
and the -registry-push flag are no longer needed. The underlying
repo_access JWT infrastructure remains for other registry auth flows.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: remove unnecessary .gitkeep for vz-shim embed dir

The vz-shim embed is darwin-only (build tag), so the directory isn't
needed on Linux. On macOS the Makefile creates it before compiling.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* feat: add VM lifecycle smoke test to e2e install test

Tests pull, run, exec, stop, and rm using the CLI against a real
alpine VM to verify the full stack works end-to-end after install.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: create vz-shim embed directory in Makefile before copying

The .gitkeep was removed so the directory no longer exists in the repo.
The Makefile needs to mkdir -p before copying the built binary.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix macOS CLI install missing $SUDO prefix

The macOS CLI install on line 653 used bare 'install' while all other
binary installs to $INSTALL_DIR used '$SUDO install'. When /usr/local/bin
isn't writable and $SUDO is set to 'sudo', this caused a permission error
that aborted the script (due to set -e) after the service was already
running, leaving a partial installation.

Applied via @cursor push command

* fix: CLI install on macOS and make CLI a hard requirement in e2e

CLI releases use goreleaser naming ("macos" not "darwin", .zip not
.tar.gz). Fix artifact lookup and extraction to handle both formats.

Make CLI presence a hard fail in e2e test — if the install script
can't install the CLI, that's a real failure.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: remove nonexistent 'hypeman images' from e2e test

The CLI doesn't have an 'images' subcommand. The VM lifecycle tests
(pull, run, exec, stop, rm) cover real functionality.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: retry hypeman run after async pull in e2e test

Image pulls are async — 'hypeman pull' returns immediately with
status:pending. Retry 'hypeman run' in a loop until the image
is available.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: clean up macOS development docs and Makefile comment

- Remove "Alternative Commands" section (make dev covers it)
- Remove known limitations that are implementation details or wrong:
  disk format is handled automatically, snapshots aren't supported,
  network ingress is internal, vz-shim is a subprocess not in-process
- Keep disk format and snapshots as brief notes
- Makefile: 'run' target comment says "for agents" not "for testing"

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: README treats Linux and macOS as equal platforms

- Requirements: remove "Production"/"Experimental" labels
- Quick Start: "Linux and macOS supported"
- CLI section: reword for local-first usage, remove "remote" framing
- Remove entire "macOS Support" section (platform details belong in
  DEVELOPMENT.md, not the user-facing README)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Address PR review: fix shutdown semantics, reduce timeout, add vz integration tests

Changes based on PR review feedback:
- Reduce vz HTTP client timeout from 30s to 10s (local Unix socket)
- Add comment on 2GB memory safety default in vz-shim
- Fix graceful shutdown to only send ACPI power button without immediate
  force-kill fallback, aligning with CH/QEMU semantics
- Add macOS vz integration tests (TestVZBasicLifecycle, TestVZExecAndShutdown)

Test infrastructure improvements:
- Use short /tmp/ paths for vz test temp dirs to avoid macOS 104-byte
  Unix socket path limit (t.TempDir() paths are too long)
- Capture vz-shim stderr and log file contents in error messages for
  better diagnostics when shim fails to start

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix(vz): fix guest-agent exec format error on instance restart

After a force-kill (vm.Stop), the overlay filesystem could have a
corrupted guest-agent binary. The lazy copy optimization skipped
re-copying the binary if it already existed, causing exec format
error on restart. Always copy from initrd to ensure correctness.

Also adds restart coverage to TestVZBasicLifecycle (stop → start →
exec → verify) with diagnostic log dumping on failure.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Address hiroTamada review nits

- Warn on codesign failure instead of silently swallowing (install.sh)
- Fix vz control interface description: HTTP, not gRPC (README.md)
- Remove dead if/else that set same path on both branches (e2e-install-test.sh)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* fix: respect NetworkEnabled=false in vz shim by not creating NIC when networks is empty

When NetworkEnabled=false, the instance's Networks slice is intentionally
empty. The vz shim was incorrectly treating an empty networks slice as
'add default NAT NIC', which gave the guest network access even when
the caller explicitly disabled networking.

Now, when networks is empty, configureNetwork returns immediately without
attaching any NIC, matching the behavior of QEMU and Cloud Hypervisor.

Applied via @cursor push command

* Fix StartInstance call to match new signature

StartInstance now takes a StartInstanceRequest parameter (from PR #99).

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix vz tests: keep VM alive with sleep infinity

After PR #99, init does reboot(POWER_OFF) when the entrypoint exits.
Alpine's default entrypoint (/bin/sh) exits immediately with no stdin,
killing the VM before exec tests can run. Add Cmd: sleep infinity to
keep the VM alive, matching the pattern in volumes_test.go.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix e2e test: use nginx:alpine instead of alpine:latest

After PR #99, init does reboot(POWER_OFF) when the entrypoint exits.
Alpine's /bin/sh exits immediately with no stdin, killing the VM before
exec can run. nginx:alpine has a long-running daemon entrypoint that
keeps the VM alive, matching the pattern in exec_test.go.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

* Fix e2e clean slate: remove data dir from previous failed runs

Phase 1 was calling uninstall.sh without KEEP_DATA=false, so the data
directory (including stale VMs from previous failed runs) persisted.
This caused name_conflict errors when the test tried to create
e2e-test-vm again.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Cursor Agent <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants