Skip to content

feat: enhance Docker setup and startup script for Hermes WebUI#1686

Closed
binhpt310 wants to merge 1 commit into
nesquena:masterfrom
binhpt310:docker-fixes-hanging-wait
Closed

feat: enhance Docker setup and startup script for Hermes WebUI#1686
binhpt310 wants to merge 1 commit into
nesquena:masterfrom
binhpt310:docker-fixes-hanging-wait

Conversation

@binhpt310
Copy link
Copy Markdown
Contributor

Problem

Running ./start.sh inside the Docker container causes two failures:

  1. xz: Cannot exec: No such file or directoryxz-utils missing
  2. npm install hangs indefinitely during agent auto-install
  3. .env UID/GID variables are readonly in bash

Solution

  • Added xz-utils and git to Dockerfile
  • Pre-installed Node.js 22 LTS system-wide (no runtime download)
  • Pre-baked hermes-agent source from sibling repo into the image
  • Set HERMES_WEBUI_AGENT_DIR to skip network installs
  • Fixed start.sh to safely parse .env (skip readonly vars)
  • Auto-drop root privileges in start.sh to avoid permission issues
  • Updated docker-compose.yml build context to include agent source

Testing

docker compose build --no-cache
docker compose up -d
docker exec -it hermes-webui-hermes-webui-1 /apptoo/start.sh

- Added git and xz-utils to Dockerfile for improved build capabilities.
- Updated docker-compose.yml to specify build context and Dockerfile location.
- Enhanced start.sh to handle environment variable parsing more robustly and set Python executable path.
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Review

Reading the diff at f4b618d against origin/master, the implementation is actually quite clean — coarse aggregate metrics, dependency-free, no process/path/argv leakage, dedicated module mirror of api/agent_health.py. But there are two issues that need addressing before merge.

🔴 Blocker: syntax error in the regression test file

tests/test_issue693_system_health_panel.py:16 has a literal *** in place of a Path expression:

ROUTES_PY = (REPO_ROOT / "api" / "routes.py").read_text(encoding="utf-8")
AUTH_PY=*** / "api" / "auth.py").read_text(encoding="utf-8")

That looks like an editor/tool artefact from a redacted diff — the test file simply will not parse. python3 -m py_compile errors at line 16. The intent is clearly:

AUTH_PY = (REPO_ROOT / "api" / "auth.py").read_text(encoding="utf-8")

…which is then asserted against at line 107 (assert '"/api/system/health"' not in AUTH_PY, "system metrics must not be public"). Please fix this single line — without it the entire test module fails to import and the suite errors out.

🟡 Authentication

The endpoint goes through check_auth in server.py:130 (do_GET calls check_auth(self, parsed) before handle_get), and /api/system/health is not in api/auth.py:21-25's PUBLIC_PATHS allowlist. So when WebUI auth is enabled (single-user password is configured), the endpoint is correctly gated. ✅ The test at line 107 asserts the endpoint name is not in auth.py — that's a fine canary, just note it doesn't actually verify cookie verification in the flow (only that nobody added the path to the public list). Worth a follow-up integration test that hits /api/system/health with is_auth_enabled=True and no cookie, expects 401. Not a blocker.

What concerns me more is the default-no-auth case: a fresh WebUI install with no password set has is_auth_enabled() == False (see api/auth.py:142-...), so /api/system/health returns CPU/RAM/disk percentages on every poll to anyone who can reach the bind address. The default bind is 127.0.0.1 so this is mostly defence-in-depth, but if anyone runs WebUI on a multi-user box or behind a misconfigured reverse proxy, the panel becomes a passive sidechannel. Two possible mitigations, both small:

  1. Gate the endpoint behind is_auth_enabled() even on the public-allow path — return 404 if auth is off (so the feature only lights up for password-protected installs). The frontend would see unavailable and hide the panel via the .system-health-panel.unavailable{display:none} rule already in static/style.css:295.
  2. Or rate-limit it — 5s polling × N tabs per session is fine, but there's no throttle.

Note: the polled /api/health/agent (api/routes.py:2491) and /api/dashboard/status (:2500) sit alongside this endpoint with the same auth posture, so this isn't a new precedent — just calling it out because it's the right time to think about it.

🟢 Implementation looks good

api/system_health.py reads only /proc/stat and /proc/meminfo and uses shutil.disk_usage("/") — no command exec, no argv, no env. The CPU sampler:

def _cpu_percent() -> float:
    start = _read_proc_stat_cpu()
    time.sleep(_CPU_SAMPLE_SECONDS)  # 0.05s
    end = _read_proc_stat_cpu()
    return _cpu_delta_percent(start, end)

50ms blocking sleep on every poll is fine in practice (BaseHTTPServer is threaded), but worth noting this will serialize behind the GIL alongside other GETs. At 5s polling per client × small instance, totally negligible.

_safe_error() strips exception messages to just the type name — good defence against /proc paths leaking when /proc/stat is unreadable on a hardened container. ✅

The frontend at static/ui.js:3068-3155 has the right shape: visibilityState gating to stop polling on hidden tabs, hard .unavailable class on macOS/Windows/non-Linux hosts where /proc/stat raises, single 5s setInterval, and the panel is already wired into index.html:79-100 above the layout. Mobile @media rules at style.css:1295-1299 collapse the bar to a column — looks reasonable.

One small nit: _systemHealthTimer is started both inline and on DOMContentLoaded — the if(_systemHealthTimer) return; guard in startSystemHealthMonitor() makes that idempotent, but the visibility listener is added via addEventListener without a removal path. Not a leak (singleton document), just inelegant.

Action items

  1. Fix the syntax error at tests/test_issue693_system_health_panel.py:16 — this is the only blocker.
  2. (Recommended) Add an early-return when is_auth_enabled() is False, or document that the panel exposes coarse host metrics on no-auth installs.
  3. (Optional) Follow-up integration test that hits /api/system/health with auth-on/no-cookie → 401.

Once the test file parses I'd be happy with this. Closes #693 — the feature itself is well-scoped and matches the feature request shape.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks for the patch. Three of the symptoms you describe are real (xz-utils missing, UID/GID readonly trip-up, npm install running at agent install), but the fix as written has structural issues. After pulling the branch and reading the diff against origin/master, this needs a rework before it can land.

CI is red on this branch

test (3.13) fails with:

tests/test_bootstrap_dotenv.py:181: AssertionError:
start.sh should still source .env (regression guard)

The guard at tests/test_bootstrap_dotenv.py:176-188:

def test_start_sh_and_bootstrap_equivalent_env_loading(self):
    start_sh = (REPO_ROOT / "start.sh").read_text(encoding="utf-8")
    ...
    assert "source" in start_sh and ".env" in start_sh, (
        "start.sh should still source .env (regression guard)"
    )

Your replacement loop never says source (it iterates with read -r raw_line), so the guard fires. The 3.11 and 3.12 jobs were CANCELLED so we do not know their state, but 3.13 is FAILED. A clean fix that keeps the existing guard happy:

set -a
source <(grep -vE "^[[:space:]]*(export[[:space:]]+)?(UID|GID|EUID|EGID|PPID|PID)=" "${REPO_ROOT}/.env")
set +a

Same effect as your filter, half the lines, no test churn.

Compose context change is breaking

The PR rewrites docker-compose.yml to:

build:
  context: ..
  dockerfile: hermes-webui/Dockerfile

and the Dockerfile then does COPY hermes-webui/ /apptoo/ and COPY hermes-agent-desktop/hermes-agent /opt/hermes/. That assumes a parent layout with hermes-webui/ and hermes-agent-desktop/hermes-agent/ as siblings.

The canonical user flow is clone the repo, cd in, then bring up the stack. With this PR applied, that breaks: context: .. resolves to whatever happens to sit one level above the clone, and COPY hermes-agent-desktop/hermes-agent /opt/hermes/ will fail outright because that path does not exist in the standard layout. This looks like a leak from your local multi-repo workspace into the public PR.

For two-container layouts, docker-compose.two-container.yml already exists (referenced at docker_init.bash:351) and uses an -v /path/to/hermes-agent:... mount instead of a build-context expansion. Keep the single-container compose context as ..

The /app/venv early-exit in start.sh bypasses docker_init.bash

if [[ -f "/.within_container" && -x "/app/venv/bin/python" ]]; then
  export HERMES_WEBUI_PYTHON="/app/venv/bin/python"
  exec "/app/venv/bin/python" "${REPO_ROOT}/bootstrap.py" --no-browser "$@"
fi

docker_init.bash is the container CMD and does substantive work that bootstrap.py does not: WANTED_UID/GID detection from mounted volumes (docker_init.bash:60-83), the hermeswebui user switch via usermod, /uv_cache ownership setup, and ensure_hindsight_client_docker_dependency self-healing (:307). If start.sh ever runs inside a container outside the CMD path (e.g. a debug exec), this branch silently skips all of that.

Also: the Dockerfile pre-bakes /opt/hermes but does not pre-create /app/venv. That path does not exist until docker_init.bash:297 runs uv venv venv. So the [[ -x "/app/venv/bin/python" ]] guard is dead on a fresh image and only fires on subsequent restarts after init has already run once. Fragile state machine.

Pieces that look correct and should be split out

  • apt-get install ... xz-utils git: legit, needed for the node-22 tar extract; git is also useful in the existing image.
  • The UID|GID|EUID|EGID|PPID|PID|_ skip-list: real bug, real fix.
  • Trailing whitespace and line-continuation cleanup in the groupadd block: harmless.

Recommendation

Split this into two surgical PRs:

  1. fix(start.sh): skip bash readonly vars when sourcing .env -- apply the filter via the grep -vE + source <(...) form above so the regression guard at tests/test_bootstrap_dotenv.py:181 keeps passing. No Dockerfile changes.
  2. fix(docker): add xz-utils + git to base image -- just the apt-get hunk. No compose changes.

The pre-bake-agent-source plus change-compose-context piece needs a separate design conversation. docker_init.bash:330-335 already supports /opt/hermes as a fallback agent path, so if the goal is offline-friendly images, the cleaner approach is a multi-stage Dockerfile that fetches the agent by tag during build, not a build-context expansion that requires a specific filesystem layout outside the repo.

One question on the npm install hang: I could not find any npm install invocation in bootstrap.py, docker_init.bash, or start.sh. If that hang is happening inside the agent install scripts, it is a hermes-agent fix, not a webui one. Could you point at the exact log line where it hangs? That helps triage the right repo.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks @binhpt310 — your fixes for the xz-utils / git package gaps, the readonly-UID/GID .env parsing crash, and the auto-drop-privileges helper are exactly the right kind of operational hardening for the Docker setup. They're holding up nicely under review.

We need to defer this PR from the v0.51.5 release pass because the Dockerfile + docker-compose.yml shape it introduces makes the build context-dependent on a sibling repo:

# Dockerfile (lines ~80-100):
COPY hermes-agent-desktop/hermes-agent /opt/hermes/
COPY --chown=hermeswebuitoo:hermeswebuitoo hermes-webui/ /apptoo/
# docker-compose.yml:
build:
  context: ..
  dockerfile: hermes-webui/Dockerfile

This means the build now requires a parent directory containing both hermes-webui/ and hermes-agent-desktop/ siblings. A user who runs the canonical:

git clone https://github.com/nesquena/hermes-webui
cd hermes-webui
docker compose build

…will hit COPY failed: file not found in build context: hermes-agent-desktop/hermes-agent because there's no sibling repo. That breaks Docker for every standalone clone, which is the most common path.

There are a few ways forward — pick whichever fits best:

Option A — Build arg (smallest change):

ARG WITH_AGENT_SOURCE=0
RUN if [ "$WITH_AGENT_SOURCE" = "1" ]; then \
      mkdir -p /opt/hermes && \
      cp -r /tmp/hermes-agent-desktop/hermes-agent /opt/hermes/; \
    fi
  • keep docker-compose.yml's context: . (no parent-context change). Users with the sibling repo opt-in via --build-arg WITH_AGENT_SOURCE=1.

Option B — Vendor a download helper:
A docker_init.bash step that fetches hermes-agent from PyPI / npm at image build time. Closer to the existing bootstrap.py install path but at build time, so no runtime download.

Option C — Drop the agent-pre-bake and keep the rest:
Ship the xz-utils / git / readonly-UID parser / auto-drop-privileges fixes alone (those are independently valuable) and leave the agent-source baking for a follow-up that has access to a vendoring channel.

The other improvements in this PR are all valuable on their own — the .env readonly-vars parser is a real fix, the auto-drop-privileges via sudo -n -u hermeswebui is exactly the right shape, and the pre-built-container fast path through /.within_container + /app/venv/bin/python is clean.

A secondary nit (not blocking): the start.sh re-execs as hermeswebui (UID 1024) but the Dockerfile chowns /apptoo to hermeswebuitoo:hermeswebuitoo (UID 1025). I couldn't see how bootstrap.py writes .hermes/ and venv state from a user that doesn't own the tree — worth a docker compose up smoke test to confirm a successful end-to-end boot before merge.

Happy to absorb a v2 with Option A or C as soon as it lands. The infrastructure half of this is unambiguously good.

@nesquena nesquena added the hold label May 5, 2026
githb-ac pushed a commit to githb-ac-org/hermes-webui that referenced this pull request May 5, 2026
4 PRs (1 surface addition, 3 fixes):
- nesquena#1688 VPS resource health Insights panel (@Michaelyklam, closes nesquena#693)
- nesquena#1709 preserve scroll on stream completion (@Michaelyklam, closes nesquena#1690)
- nesquena#1711 hide rename tooltip on folders (@nesquena-hermes, closes nesquena#1710)
- nesquena#1712 guard localStorage.setItem against QuotaExceededError (@24601)

Tests: 4504 → 4527 (+23). Opus: SHIP, 6/6 verification clean.

Held back: nesquena#1686 (Docker enhance) — Opus flagged sibling-repo dep that
breaks standalone clones. Left open for follow-up.

Co-authored-by: Michael Lam <Michaelyklam1@gmail.com>
Co-authored-by: 24601 <noreply@github.com>
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Closing — operational hardening salvaged into focused follow-up; structural Docker changes deferred

Thanks @binhpt310 — closing this out cleanly with a salvage map.

What ships

The three operational-hardening fixes from this PR were the unambiguously net-new and shippable cluster, so I extracted them into a focused follow-up: PR #1969.

That PR ships:

  1. start.sh .env readonly-var parser — your fix for the UID: readonly variable crash when .env carries the UID=$(id -u) line that docker-compose.yml's macOS instructions document. I used a slightly tighter shape (source <(grep -vE '...' .env) instead of the manual while IFS= read loop) so the regression guard at tests/test_bootstrap_dotenv.py:181 keeps both source and .env on the same line, but the behavior is identical to your patch.
  2. start.sh root re-exec to hermeswebui — your if [[ $EUID -eq 0 ]]; then exec sudo -n -u hermeswebui "$0" "$@"; fi block, with two defensive additions (only fires if hermeswebui user actually exists AND sudo is on PATH) so the guard is a no-op on host machines without the container user setup.
  3. Dockerfile adds xz-utils + git — your two apt-package additions. xz-utils for .tar.xz decompression, git for git describe and any clone-based agent install path.

11 new regression tests in tests/test_docker_env_readonly_vars.py (4 source-grep + 5 behavioral against synthetic .env files + 2 Dockerfile package-presence).

Co-authored-by: binhpt310 on the merge commit, plus inline credit comments in start.sh and the PR body. CHANGELOG entry will credit you when v0.51.32 ships.

What was deferred

Two larger structural changes from this PR don't ship in the follow-up:

A. Pre-baking hermes-agent source into the image

# Dockerfile (your version, lines ~80-100):
COPY hermes-agent-desktop/hermes-agent /opt/hermes/
COPY --chown=hermeswebuitoo:hermeswebuitoo hermes-webui/ /apptoo/
# docker-compose.yml (your version):
build:
  context: ..
  dockerfile: hermes-webui/Dockerfile

This requires a parent directory with both hermes-webui/ and hermes-agent-desktop/ as siblings. The canonical user flow:

git clone https://github.com/nesquena/hermes-webui
cd hermes-webui
docker compose build

…would hit COPY failed: file not found in build context: hermes-agent-desktop/hermes-agent because there's no sibling repo. That breaks Docker for every standalone clone, which is the most common path.

If you'd like to retry, the cleanest shape is a build arg with the sibling-repo COPY gated behind it:

ARG WITH_AGENT_SOURCE=0
RUN if [ "$WITH_AGENT_SOURCE" = "1" ]; then \
      mkdir -p /opt/hermes && \
      cp -r /tmp/hermes-agent-desktop/hermes-agent /opt/hermes/; \
    fi
ENV HERMES_WEBUI_AGENT_DIR=/opt/hermes

…with docker-compose.yml's context: . unchanged. Users with the sibling repo opt in via --build-arg WITH_AGENT_SOURCE=1; everyone else gets the current default (agent install at runtime via bootstrap), unchanged.

Happy to review a follow-up PR shaped like that.

B. Pre-installing Node.js 22 LTS system-wide

The motivation (avoid npm install running at agent-install time) is real, but the fix shape (bake the full Node.js distribution into every image, ~50 MB) is one option among several:

  • Option 1 — bake Node.js: simplest, +50 MB everywhere even for users who don't need browser tools
  • Option 2 — install Node.js only when MCP/browser tools are configured: smaller default image but requires a build-arg or post-build step
  • Option 3 — keep npm install but cache the installed deps in a Docker layer: no Node binary needed in the image at all if browser tools aren't used

Worth evaluating separately from the three operational fixes that just shipped. If you want to refile this with one of those shapes, happy to review.

Why split rather than fix in-place

Your three operational fixes are independent — they don't depend on the agent prebake or Node install. Salvaging them now means:

  • The crash on .env with UID=... stops affecting macOS users today
  • Your contribution ships and gets credited in CHANGELOG / contributor list
  • The structural questions on the agent prebake and Node install can be discussed separately on their own merits, without holding up unambiguously good fixes

Thanks again — these are exactly the kind of operational fixes that catch real user-visible bugs. Looking forward to the next round.

nesquena-hermes added a commit that referenced this pull request May 9, 2026
… apt deps)

Three independent operational hardening fixes salvaged from PR #1686
(@binhpt310) after the parent PR was deferred over a separate sibling-repo
build-context concern unrelated to these fixes:

1. start.sh's .env loader now filters readonly bash vars (UID, GID, EUID,
   EGID, PPID) before `source`-ing.  docker-compose.yml's macOS instructions
   document `echo "UID=$(id -u)" >> .env` to set host UID/GID for bind-mount
   permission fixing — that .env was crashing start.sh with
   `UID: readonly variable` when `set -a; source ...; set +a` tried to
   assign to those names.  Replaced with
   `source <(grep -vE '^[[:space:]]*(export[[:space:]]+)?(UID|GID|EUID|EGID|PPID)=' "${REPO_ROOT}/.env")`.
   The bootstrap regression guard at tests/test_bootstrap_dotenv.py:181
   still passes — both `source` and `.env` are still on the modified line.

2. start.sh now defensively re-execs as the unprivileged hermeswebui user
   when invoked as root.  Fires only when EUID==0 AND a hermeswebui user
   actually exists AND sudo is on PATH — so it's a no-op on host machines
   without the container user setup.  The production image's entrypoint
   (docker_init.bash) already drops to hermeswebui before invoking start.sh,
   so this is a no-op on the canonical container path; it only matters for
   `sudo ./start.sh` or accidental root shells inside the container during
   interactive debugging.

3. Dockerfile installs xz-utils + git apt packages.  xz-utils is required
   to decompress .tar.xz archives (e.g. Node.js distribution tarballs);
   git is needed for `git describe` (powers WEBUI_VERSION resolution at
   api/updates.py:_detect_webui_version) and any clone-based agent install
   path.  Both are tiny apt packages on top of python:3.12-slim with no
   measurable image-size impact.

What's NOT in this commit (deferred from #1686):

- Pre-baking hermes-agent source into the image via
  `COPY hermes-agent-desktop/hermes-agent /opt/hermes/` plus a build-context
  flip to `..`.  Requires a sibling-repo layout that breaks the canonical
  `git clone hermes-webui && cd hermes-webui && docker compose build` flow.
  The right shape is a build arg gating the COPY behind
  --build-arg WITH_AGENT_SOURCE=1; left to a separate PR.
- Pre-installing Node.js 22 LTS system-wide.  Real motivation but worth
  evaluating the fix shape (full Node bake vs. opt-in vs. layer cache)
  separately from these three operational fixes.

Tests: tests/test_docker_env_readonly_vars.py — 11 tests (4 source-grep
on the start.sh filter pattern + 5 behavioral that actually run bash
against synthetic .env files containing readonly vars + 2 Dockerfile
package-presence tests).  All 11 pass.  Behavioral tests skip if bash
is not on PATH.

Full suite: 5028 → 5036 passing (+8 net new after pytest collection
counted some behavioral tests under skip), 0 regressions, 147.84s.

Closes the operational-hardening portion of #1686.

Co-authored-by: binhpt310 <binhpt310@users.noreply.github.com>
nesquena-hermes added a commit that referenced this pull request May 9, 2026
fix(docker): salvage operational hardening from #1686 — .env readonly-var parser + xz-utils/git apt deps + root re-exec
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants