Skip to content

Add ctl.sh daemon lifecycle script#1672

Closed
Michaelyklam wants to merge 1 commit into
nesquena:masterfrom
Michaelyklam:fix/issue-591-ctl-sh
Closed

Add ctl.sh daemon lifecycle script#1672
Michaelyklam wants to merge 1 commit into
nesquena:masterfrom
Michaelyklam:fix/issue-591-ctl-sh

Conversation

@Michaelyklam
Copy link
Copy Markdown
Contributor

@Michaelyklam Michaelyklam commented May 5, 2026

Thinking Path

  • Self-hosters need a predictable lifecycle interface instead of memorizing fuser/pkill plus start.sh incantations.
  • A repo-root shell script is the smallest useful v1 because it preserves the current no-packaging/no-new-dependency setup.
  • PID ownership is the risky part: stop must remove stale PID files without killing unrelated processes.
  • status should combine local PID state with the WebUI /health endpoint so operators can distinguish “process exists” from “server is actually healthy”.
  • The tests use temp homes and fake long-lived launchers to verify daemon behavior without touching real ~/.hermes state or killing real WebUI processes.

What Changed

  • Added ctl.sh with:
    • start daemon wrapper around bootstrap.py --no-browser --foreground
    • stop with PID ownership checks, stale PID cleanup, SIGTERM wait, and SIGKILL fallback
    • restart as stop + start
    • status showing running/stopped, PID, uptime, bound host/port, log path, and bounded /health result
    • logs --lines N plus follow/no-follow modes
  • Uses ~/.hermes/webui.pid, ~/.hermes/webui.log, and ~/.hermes/webui.ctl.env by default, with override env vars for tests/advanced use.
  • Preserves repo .env conventions while letting inline overrides like HERMES_WEBUI_HOST=0.0.0.0 ./ctl.sh start win; the wrapper also passes explicit host/port args to bootstrap so bootstrap’s own .env load cannot undo those overrides.
  • Documented the new lifecycle commands in the README quick-start section.
  • Added tests/test_ctl_script.py covering PID/log paths, foreground/no-browser launch args, .env + inline override precedence, stale PID safety, and logs line-count behavior.

Closes #591

Why It Matters

  • VM/homelab/self-hosted users get a standard start/stop/restart/status/logs surface without needing a systemd unit or package entry point.
  • PID ownership checks avoid the dangerous failure mode of killing an unrelated process because a stale PID file happened to point at it.
  • The status output gives enough operational detail to debug the common “PID exists but health is unreachable” class of issues.

Verification

  • bash -n ctl.sh — passed
  • /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/test_ctl_script.py -q — 4 passed
  • git diff --check HEAD — passed
  • env -u HERMES_CONFIG_PATH /home/michael/.hermes/hermes-agent/venv/bin/python -m pytest tests/ -q — attempted twice locally. The first run failed after repo-cleaning tests removed the then-uncommitted new test file; after committing the test file, the rerun still failed in this shared Kanban host with broad test-server ConnectionRefusedError failures while several other full-suite WebUI workers were running concurrently. No failures were isolated to the ctl.sh targeted test after the commit.
  • GitHub Actions on PR Add ctl.sh daemon lifecycle script #1672 — passed on Python 3.11, 3.12, and 3.13 for commit 7385daf.

No browser media included because this is a CLI/script/docs change and does not change the WebUI interface.

Risks / Follow-ups

  • This is intentionally a repository script MVP, not a packaged hermes-webui binary or system service installer.
  • logs follows by default; callers in automation should pass --no-follow when they want a bounded command.
  • The health probe depends on curl being available; without it, status reports the health line as unknown rather than failing the command.

Model Used

  • OpenAI Codex gpt-5.5
  • Tool use: Hermes Kanban tools, git worktree, GitHub CLI, pytest, bash syntax checks, shell/process inspection

@Michaelyklam Michaelyklam force-pushed the fix/issue-591-ctl-sh branch from 88e2ebd to 7385daf Compare May 5, 2026 00:11
@nesquena
Copy link
Copy Markdown
Owner

nesquena commented May 5, 2026

@Michaelyklam You and your agent are going to town on PRs, thanks again! I’d love to have you join as a regular contributor, sent you an email!

most productive PR day from a single author this project has ever had!

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Initial review — ctl.sh lifecycle script

Pulled the branch into an isolated worktree against origin/master (v0.51.0) and ran pytest tests/test_ctl_script.py. 3 of 4 tests pass; one fails on this developer's checkout because of a test-isolation issue worth fixing before merge.

The failing test

tests/test_ctl_script.py::test_start_writes_pid_under_hermes_home_runs_foreground_no_browser_and_logs FAILED

The assertion that fails is:

assert str(hermes_home / "webui") in fake_output

…where fake_output instead contains the STATE_DIR from the repo's own .env rather than from the isolated tmp_path home. On my checkout, ~/hermes-webui-public/.env contains:

HERMES_WEBUI_STATE_DIR=/home/hermes/.hermes/webui-public
HERMES_WEBUI_PORT=8787
HERMES_WEBUI_HOST=127.0.0.1

…and fake-python.log ends up with state=/home/hermes/.hermes/webui-public instead of the expected tmp_path/.hermes/webui.

Root cause

run_ctl() in the test fixture explicitly removes HERMES_WEBUI_STATE_DIR (and friends) from the inherited env (tests/test_ctl_script.py:22-31) and the test itself does not pass HERMES_WEBUI_STATE_DIR in the per-test env={...} map.

In ctl.sh::start_cmd(), the very first thing that runs (after ensure_home) is:

_load_repo_dotenv_preserving_env
export HERMES_WEBUI_STATE_DIR="${HERMES_WEBUI_STATE_DIR:-${DEFAULT_STATE_DIR}}"

The "preserving" loop in _load_repo_dotenv_preserving_env (lines 33-46) only restores keys for which [[ -v key ]] is true on entry. Since the test fixture deliberately unsets HERMES_WEBUI_STATE_DIR, that key is NOT in the preserved list — so when set -a; source "${env_file}" runs against /home/hermes/hermes-webui-public/.env, the dotenv's HERMES_WEBUI_STATE_DIR value wins and overrides what _DEFAULT_STATE_DIR would have produced from HERMES_HOME=tmp_path/.hermes.

This means:

  • The test's correctness depends on whether the developer's ~/hermes-webui-public/.env happens to have HERMES_WEBUI_STATE_DIR set (it does on mine; CI may not).
  • More broadly, ctl start run from any checkout with a populated .env will silently disregard HERMES_HOME overrides for any var that's also in .env. That's not necessarily wrong for a self-hoster (the .env is the operator's source of truth), but it is the reason your isolation test is flaky on real-world checkouts.

Suggested fix

Smallest change that makes the test deterministic and matches the test's stated intent:

# tests/test_ctl_script.py — inside test_start_writes_pid_under_hermes_home_...
result = run_ctl(
    tmp_path,
    "start",
    env={
        "HERMES_WEBUI_PYTHON": str(fake_python),
        "FAKE_PYTHON_LOG": str(fake_log),
        "HERMES_WEBUI_HOST": "0.0.0.0",
        "HERMES_WEBUI_PORT": "18991",
        "HERMES_WEBUI_STATE_DIR": str(hermes_home / "webui"),  # <-- pin explicitly
    },
)

Pinning HERMES_WEBUI_STATE_DIR per-test mirrors what you already do for HOST/PORT and avoids relying on _load_repo_dotenv_preserving_env skipping unset keys.

Optional, more defensive alternative: have run_ctl() chdir into a synthetic repo root that contains a stub bootstrap.py and no .env, so the dotenv loader has nothing to inject. That would also give better isolation for the _is_owned_webui_pid state_repo check, which currently passes only because REPO_ROOT resolves through BASH_SOURCE.

What's solid

The rest of ctl.sh reads well:

  • PID-ownership guard (_is_owned_webui_pid) correctly cross-checks STATE_FILE's REPO_ROOT against the script's actual location AND validates the running process command line includes one of bootstrap.py/server.py/start.sh from that repo. That's the right safety bar for stop to not kill unrelated processes.
  • stop_cmd SIGTERM → wait 5s → SIGKILL is reasonable.
  • status separates "PID exists" from /health reachability, which matches what the PR description promised.
  • README diff is minimal.

Once the test-isolation fix lands, this looks ready for a final pass. Nice work.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Closed by the v0.51.1 release in PR #1681 (merged at e23ba59). Massive thanks @Michaelyklam — this is now 19 merged PRs across the v0.50.292–v0.51.1 release window, an extraordinary contribution rate. Each PR was per-claim-vs-diff verified against your description and every security-relevant code path checked under independent review (Opus advisor, 6/6 questions clean). Your closes #N references are all accurate, your Thinking Path / What Changed / Why It Matters body template is consistently helpful, and your test coverage is solid behavioral scope (not source-string scaffolding) on every PR.

Live on production: https://github.com/nesquena/hermes-webui/releases/tag/v0.51.1

🚀

nesquena-hermes pushed a commit to Michaelyklam/hermes-webui that referenced this pull request May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: ctl.sh — start/stop/restart/status daemon management script

3 participants