fix: Test & CI infrastructure improvements — graceful shutdown, readiness probes, test isolation #744
Conversation
…ness probes, test isolation

Phase 1 — Quick Wins:
- Replace `killall ad4m` with PID-based SIGTERM→SIGKILL escalation in integration.bats
- Add artifact collection (store_artifacts) to CI for failure debugging
- Add pre-test port validation step to CI jobs
- Use SIGTERM→wait→SIGKILL pattern in all CI orphan cleanup steps

Phase 2 — Graceful Shutdown:
- Add SIGTERM/SIGINT signal handlers in rust-executor (tokio::signal)
- Make runtime_quit use a oneshot channel for graceful shutdown instead of process::exit
- Shut down the Holochain conductor cleanly before exit
- Add --pid-file flag to the CLI for test-harness PID tracking
- Add gracefulShutdown() utility to the JS test utils
- Update all after() hooks to use the graceful shutdown pattern

Phase 3 — Readiness & Multi-Executor:
- Add runtimeReadiness GQL query (gqlReady, holochainReady, agentInitialized, languagesLoaded)
- Create TestCluster utility at tests/js/utils/testCluster.ts for multi-executor test setup

CI is disabled (when: never) pending review.
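The SIGTERM→wait→SIGKILL escalation described above can be sketched as a small shell helper. This is an illustrative sketch, not the actual code from integration.bats or the CI config; the function name and default timeout are invented here:

```shell
#!/usr/bin/env bash
# Escalating shutdown: ask with SIGTERM first, wait up to a timeout,
# then fall back to SIGKILL if the process is still alive.
# Usage: terminate_pid PID [TIMEOUT_SECONDS]
terminate_pid() {
  local pid="$1" timeout="${2:-10}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone: nothing to do
  while kill -0 "$pid" 2>/dev/null; do        # kill -0 only checks liveness
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null || true   # escalate
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Compared with a bare `kill -9`, this gives the executor a window to run its shutdown handlers (flush state, remove the PID file) before being forced out.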
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds graceful shutdown infrastructure: PID file handling, a signal-driven shutdown channel and task, a GraphQL runtime readiness query, CLI wiring for pid_file, and test utilities/tests updated to perform centralized, graceful process termination instead of ad-hoc kills.
Sequence Diagram

```mermaid
sequenceDiagram
    participant OS as "OS Signal Handler"
    participant Exec as "Executor Runtime"
    participant Shutdown as "Shutdown Task"
    participant Holochain as "Holochain Conductor"
    participant FS as "File System"
    OS->>Exec: SIGINT / SIGTERM
    Exec->>Exec: Signal listener receives signal
    Exec->>Shutdown: send on SHUTDOWN_TX
    Shutdown->>Shutdown: await shutdown signal
    Shutdown->>Holochain: request graceful shutdown
    Holochain-->>Shutdown: shutdown complete
    Shutdown->>FS: remove PID file (AD4M_PID_FILE)
    FS-->>Shutdown: file removed
    Shutdown->>Exec: log completion and exit(0)
```
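The flow in the diagram can be modeled with a plain-std sketch. Note this is a simplified model for illustration only: the real executor uses tokio signal handling and async channels, while this sketch substitutes `std::sync::mpsc` and a thread, and simulates the signal by sending on the channel directly:

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// Minimal model of the shutdown task: wait for a signal on the channel,
/// shut subsystems down, then remove the PID file.
fn run_shutdown_task(rx: mpsc::Receiver<()>, pid_file: PathBuf) {
    // Blocks until the "signal handler" sends on the channel.
    rx.recv().expect("shutdown channel closed");
    // (The real code would stop the Holochain conductor here.)
    let _ = fs::remove_file(&pid_file);
}

fn main() {
    let pid_file = std::env::temp_dir().join("ad4m-demo.pid");
    fs::write(&pid_file, std::process::id().to_string()).unwrap();

    let (shutdown_tx, shutdown_rx) = mpsc::channel();
    let pf = pid_file.clone();
    let task = thread::spawn(move || run_shutdown_task(shutdown_rx, pf));

    // Simulate SIGTERM arriving.
    shutdown_tx.send(()).unwrap();
    task.join().unwrap();

    assert!(!pid_file.exists(), "PID file should be removed on shutdown");
    println!("shutdown complete");
}
```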
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
.circleci/config.yml (Outdated)

```shell
echo "⚠️ Ports still in use:$CONFLICTS"
echo "Attempting cleanup..."
for port in $CONFLICTS; do
  lsof -ti:$port | xargs -r kill -9 2>/dev/null || true
```
Not sure the "verify ports are clean" step makes sense here; if the ports are not clean, just killing whatever process holds them is asking for trouble.
Done — removed the port verification step and all kill steps entirely. Graceful shutdown should handle cleanup. Also removed when: never so CI will run on this branch.
.circleci/config.yml (Outdated)

```shell
  echo "$PIDS" | xargs -r kill -TERM 2>/dev/null || true
  fi
done
sleep 3
```
Yeah, I think this killing just needs to go away completely. I hope the graceful shutdown below eradicates the need for it.
Agreed — removed all kill steps from every test job (js, mcp, model). Relying entirely on graceful shutdown now. Let's see how CI goes!
lucksus left a comment:
I think the graceful shutdown was missing for a long time. Great stuff! If you remove at least the new kill code, I would love to see it running on CI (remove the broken `when: never` syntax), and maybe we can remove more kill calls. Then we can try activating more runners again.
Address Nico's review comments on PR #744:
- Remove 'Kill orphaned executors' steps from all test jobs
- Remove 'Verify clean port state' step from integration-tests-js
- Remove 'when: never' to enable the CI workflow
- Keep graceful shutdown handlers, which should make the kill steps unnecessary
…Executor signature
Actionable comments posted: 9
🧹 Nitpick comments (1)
.circleci/config.yml (1)
128-133: Consider preserving agent directory structure to avoid log overwrites.

If multiple agents exist under `tst-tmp/agents/*/ad4m/logs`, copying them all to the same `/tmp/test-artifacts/` destination could overwrite logs with identical filenames. Using `cp --parents` or a more structured approach would preserve debugging context.

♻️ Suggested improvement
```diff
       command: |
         mkdir -p /tmp/test-artifacts
-        cp -r tests/js/tst-tmp/agents/*/ad4m/logs /tmp/test-artifacts/ 2>/dev/null || true
-        cp -r tests/js/tst-tmp/agents/*/ad4m/holochain /tmp/test-artifacts/ 2>/dev/null || true
+        # Preserve agent directory structure to avoid overwrites from multiple agents
+        cd tests/js/tst-tmp && cp -r --parents agents/*/ad4m/logs /tmp/test-artifacts/ 2>/dev/null || true
+        cd tests/js/tst-tmp && cp -r --parents agents/*/ad4m/holochain /tmp/test-artifacts/ 2>/dev/null || true
         # Capture port state for debugging
         lsof -i -P -n 2>/dev/null | grep LISTEN > /tmp/test-artifacts/listening-ports.txt || true
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.circleci/config.yml around lines 128 - 133, The copy commands that flatten multiple agents' ad4m/logs into /tmp/test-artifacts risk overwriting files with identical names; modify the commands that copy from tests/js/tst-tmp/agents/*/ad4m/logs and tests/js/tst-tmp/agents/*/ad4m/holochain so they preserve the agent directory structure (e.g., use cp --parents or rsync/tar to recreate tests/js/tst-tmp/agents/<agent>/ad4m/... under /tmp/test-artifacts, or copy each agent into its own subdirectory) so logs from different agents are not clobbered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/js/utils/utils.ts`:
- Around line 323-341: The gracefulShutdown function currently relies on
ChildProcess.killed (which only means a signal was sent) so it should instead
await actual process termination: attach listeners for the child's 'exit' and
'close' events (or poll exitCode/signalCode) after sending SIGTERM and resolve
when one of those indicates the process exited; if no exit occurs within
timeoutMs, send SIGKILL and again wait for the same events (with a short
fallback sleep) before returning. Update gracefulShutdown to remove the while
loop based on proc.killed, use a Promise that races the exit/close events
against a timeout, and reference proc, gracefulShutdown, and timeoutMs to
implement the two-stage terminate-and-escalate behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0e58720f-8345-425c-a009-2e1349075dd8
📒 Files selected for processing (13)
- .circleci/config.yml
- cli/src/ad4m_executor.rs
- cli/src/main.rs
- rust-executor/src/globals.rs
- rust-executor/src/graphql/graphql_types.rs
- rust-executor/src/graphql/mutation_resolvers.rs
- rust-executor/src/graphql/query_resolvers.rs
- rust-executor/src/lib.rs
- tests/integration.bats
- tests/js/tests/multi-user-simple.test.ts
- tests/js/tests/simple.test.ts
- tests/js/utils/testCluster.ts
- tests/js/utils/utils.ts
```yaml
- store_artifacts:
    path: /tmp/test-artifacts
    destination: test-logs
```
store_artifacts needs when: on_fail or when: always to actually store the collected logs.
The default when for CircleCI steps is on_success. Since this job will be in a failed state when the "Collect logs on failure" step runs, the store_artifacts step will be skipped, and the collected artifacts won't be uploaded.
🐛 Proposed fix

```diff
 - store_artifacts:
+    when: on_fail
     path: /tmp/test-artifacts
     destination: test-logs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.circleci/config.yml around lines 134 - 136, The store_artifacts step
currently omits a when condition so it uses the default on_success and will be
skipped on failure; update the store_artifacts step (the block defining path:
/tmp/test-artifacts and destination: test-logs) to include a when: on_fail or
when: always key so artifacts collected by the "Collect logs on failure" step
are actually uploaded (choose on_fail to only upload on failures or always to
upload regardless).
```rust
// Set PID file path as env var so the executor can write/clean it up
if let Some(ref pf) = pid_file {
    // SAFETY: set_var is safe here because we're in single-threaded init before spawning.
    #[allow(deprecated)]
    unsafe {
        std::env::set_var("AD4M_PID_FILE", pf);
    }
```
🧩 Analysis chain
🌐 Web query:
In current Rust, why is `std::env::set_var` unsafe on Unix, and is it sound to call it from inside a `#[tokio::main(flavor = "multi_thread")]` function after the runtime has started?
💡 Result:
On Unix-like platforms (everything except Windows), std::env::set_var is unsafe because the process environment is a shared global that the OS/C library does not provide a reliably thread-safe way to access: if any other thread (Rust or native/FFI) reads the environment via getenv, environ, or any other libc/internal routine while you mutate it, you can trigger a data race / use-after-free in the C environment machinery (i.e., UB from Rust’s perspective). The standard library docs therefore say that in multi-threaded programs on non-Windows, “the only safe option is to not use set_var or remove_var at all”, because you generally can’t guarantee no other thread is reading the environment (even Rust std may do so, e.g. during DNS lookup). [1] This is also why Rust 2024 marked set_var/remove_var as unsafe. [2] (Background on why POSIX/C makes this hard: [3])
For #[tokio::main(flavor = "multi_thread")]: No, it is not sound (on Unix) to call unsafe { std::env::set_var(...) } from inside that function once the runtime has started. Tokio’s multi-thread runtime creates a pool of worker threads on startup, and after that point you are in a multi-threaded process. [4] That violates set_var’s safety precondition unless you can somehow ensure that no other thread will read or write the environment except through std::env—which the docs explicitly say is not practically guaranteeable. [1]
If you must set an env var for your own process, do it before starting the Tokio multi-thread runtime (e.g., in a synchronous main that sets vars, then builds the runtime and block_ons). Otherwise, prefer passing configuration explicitly, or use Command::env when the goal is only to affect a child process. [1]
Sources
[1] Rust std docs: std::env::set_var safety contract (Rust 1.94.0) — https://doc.rust-lang.org/std/env/fn.set_var.html
[2] Rust 2024 Edition Guide: newly-unsafe functions (set_var, remove_var) — https://doc.rust-lang.org/beta/edition-guide/rust-2024/newly-unsafe-functions.html
[3] Evan Jones: “Setenv is not Thread Safe and C Doesn't Want to Fix It” — https://www.evanjones.ca/setenv-is-not-thread-safe.html
[4] Tokio runtime docs: multi-thread scheduler starts worker threads on startup — https://docs.rs/tokio/latest/tokio/runtime/
Don't call std::env::set_var() after the Tokio runtime has started.
The unsafe block is marked as "single-threaded init", but main() is already running under #[tokio::main(flavor = "multi_thread")]. On Unix, process-wide environment mutation from std::env::set_var is unsafe in multi-threaded contexts—other threads (or Rust std itself during DNS lookups) may read the environment concurrently, causing data races. Tokio's multi-thread runtime spawns worker threads on startup, so this code runs in a multi-threaded process where set_var violates its safety preconditions.
Thread the PID-file path through Ad4mConfig or another explicit startup parameter instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/src/ad4m_executor.rs` around lines 199 - 205, The code sets AD4M_PID_FILE
via std::env::set_var in ad4m_executor.rs while running under
#[tokio::main(flavor = "multi_thread")], which is unsafe; remove the unsafe
std::env::set_var call and instead thread the PID file path through your startup
config and APIs (e.g., add an Option<PathBuf> pid_file field to Ad4mConfig or an
explicit pid_file parameter on the executor initialization function used in
ad4m_executor.rs), update the call sites that construct/start the executor to
pass the PID path, and have the executor write/clean the PID file from that
explicit value rather than via process-wide env mutation.
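The suggested fix — threading the PID path through config instead of mutating the environment — could look roughly like this. This is a reduced sketch: `Ad4mConfig` here is cut down to the one field under discussion, and the helper names are invented for illustration:

```rust
use std::fs;
use std::path::PathBuf;

/// Reduced stand-in for Ad4mConfig: the PID path travels as plain data,
/// so no process-wide environment mutation is needed.
#[derive(Default)]
struct Ad4mConfig {
    pid_file: Option<PathBuf>,
}

/// Write our own PID to the configured file, if any.
fn write_pid_file(config: &Ad4mConfig) -> std::io::Result<()> {
    if let Some(path) = &config.pid_file {
        fs::write(path, std::process::id().to_string())?;
    }
    Ok(())
}

/// Best-effort cleanup on shutdown.
fn remove_pid_file(config: &Ad4mConfig) {
    if let Some(path) = &config.pid_file {
        let _ = fs::remove_file(path);
    }
}

fn main() {
    let config = Ad4mConfig {
        pid_file: Some(std::env::temp_dir().join("ad4m-config-demo.pid")),
    };
    write_pid_file(&config).unwrap();
    let path = config.pid_file.clone().unwrap();
    assert!(path.exists());
    remove_pid_file(&config);
    assert!(!path.exists());
}
```

Because the path is owned by the config, no thread can observe a half-written global, which is exactly the hazard `set_var` introduces under a multi-threaded runtime.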
```rust
/// Write the executor PID to this file on startup (removed on clean shutdown).
#[arg(long)]
pid_file: Option<String>,
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "Run variant definition:"
sed -n '129,169p' cli/src/main.rs
echo
echo "Trailing match arm:"
sed -n '335,352p' cli/src/main.rs
```

Repository: coasys/ad4m
Length of output: 2064
Update the trailing Domain::Run pattern or this enum change won't compile.
After adding pid_file (along with enable_mcp and mcp_port), the match args.domain arm at the bottom still destructures Domain::Run without these three fields. Use Domain::Run { .. } there, or list the missing fields explicitly.
Minimal fix

```diff
- Domain::Run {
-     app_data_path: _,
-     network_bootstrap_seed: _,
-     language_language_only: _,
-     run_dapp_server: _,
-     gql_port: _,
-     hc_admin_port: _,
-     hc_app_port: _,
-     hc_use_bootstrap: _,
-     hc_use_local_proxy: _,
-     hc_use_mdns: _,
-     hc_use_proxy: _,
-     hc_proxy_url: _,
-     hc_bootstrap_url: _,
-     connect_holochain: _,
-     admin_credential: _,
-     enable_multi_user: _
- } => unreachable!(),
+ Domain::Run { .. } => unreachable!(),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/src/main.rs` around lines 166 - 168, The Domain::Run enum variant and the
args.domain match need updating because you added pid_file, enable_mcp, and
mcp_port to the Run variant; update the match arm that matches args.domain (the
Domain::Run arm) to avoid destructuring missing fields by either changing it to
Domain::Run { .. } or listing the new fields explicitly (pid_file, enable_mcp,
mcp_port) so the match compiles and continues to access other fields as before.
```rust
// Set PID file path as env var so the executor can write/clean it up
if let Some(ref pf) = pid_file {
    #[allow(deprecated)]
    unsafe { std::env::set_var("AD4M_PID_FILE", pf); }
```
🧩 Analysis chain
🌐 Web query:
In current Rust, why is `std::env::set_var` unsafe on Unix, and is it sound to call it from inside a `#[tokio::main(flavor = "multi_thread")]` function after the runtime has started?
💡 Result:
Why std::env::set_var is unsafe on Unix (in current Rust)
On many Unix platforms (notably Linux with glibc), the process environment is a piece of global mutable state managed by C/POSIX APIs (setenv/unsetenv/getenv) that are not reliably thread-safe. If one thread mutates the environment while another thread (or a library call) reads it, this can cause crashes and other undefined behavior at the process level. Rust cannot enforce (or even reliably detect) that no other thread (including threads created by libraries) might touch the environment, and even Rust/stdlib code may read the environment “behind your back” (e.g. during networking/DNS operations). Therefore, in Rust 2024 edition, std::env::set_var / remove_var are marked unsafe so the caller must uphold the required global precondition. [1], [2]
Is it sound to call it inside #[tokio::main(flavor = "multi_thread")] after the runtime has started?
Generally, no (on typical Unix targets like Linux). A Tokio multi-thread runtime creates worker threads on startup; your async fn main() runs after the executor is running and threads exist. [3], [4] At that point you cannot uphold set_var’s safety requirement (“no other thread might be running and reading the environment”, including via libc / other libraries / std internals), so calling unsafe { std::env::set_var(...) } is not sound in general. [1]
If you need per-task/per-component configuration, prefer explicit configuration objects, globals you control (mutex/once), or pass env vars only to child processes via Command::env(...) instead of mutating the process environment.
Sources: [1] std::env::set_var docs, safety section (doc.rust-lang.org); [2] Rust 2024 edition guide (“Newly unsafe functions”) (doc.rust-lang.org); [3] Tokio runtime docs (multi-thread runtime starts worker threads on startup) (docs.rs); [4] Tokio #[tokio::main] runs async fn main after runtime started (stackoverflow.com).
Citations:
- 1: https://doc.rust-lang.org/std/env/fn.set_var.html?utm_source=openai
- 2: https://doc.rust-lang.org/beta/edition-guide/rust-2024/newly-unsafe-functions.html?utm_source=openai
- 3: https://docs.rs/tokio/latest/tokio/runtime/?utm_source=openai
- 4: https://stackoverflow.com/questions/66913978/when-does-tokiomain-spawn-threads?utm_source=openai
🏁 Script executed:

```shell
# Check the context around lines 252-255 in cli/src/main.rs
cat -n cli/src/main.rs | sed -n '240,270p'
```

Repository: coasys/ad4m
Length of output: 1224
🏁 Script executed:

```shell
# Find the main function and check for #[tokio::main] attribute
head -n 300 cli/src/main.rs | grep -A 10 -B 5 "^#\[tokio::main"
```

Repository: coasys/ad4m
Length of output: 409
Remove the unsafe set_var() call and pass the PID-file path through config instead.
The unsafe { std::env::set_var(...) } call at lines 254–255 is unsound. std::env::set_var is unsafe on Unix because the process environment is global mutable state that is not reliably thread-safe. Since this code runs inside #[tokio::main(flavor = "multi_thread")] after the runtime (and its worker threads) have started, other threads may be reading the environment via libc calls, DNS operations, or stdlib internals at any moment. Mutating it here violates the safety precondition and can cause data races and undefined behavior.
Pass the PID-file path to the executor via the config object (Ad4mConfig) instead of mutating process-wide environment state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/src/main.rs` around lines 252 - 255, Remove the unsafe environment
mutation: stop calling unsafe std::env::set_var("AD4M_PID_FILE", ...) in the
pid_file handling block; instead add a PID path field to the Ad4mConfig (or
existing config struct) and populate it from pid_file, then ensure the executor
code that previously read AD4M_PID_FILE now reads the PID path from Ad4mConfig
(update the executor initialization to accept the config field). Update
references to AD4M_PID_FILE to use the new config property and remove the unsafe
block entirely.
```shell
# Port-based fallback in case PID tracking missed something
lsof -ti:4000 | xargs -r kill -9 2>/dev/null || true
```
The port fallback can kill an unrelated process.
If PID tracking misses the executor, lsof -ti:4000 | kill -9 will SIGKILL whatever owns port 4000 at teardown, including another local service or CI job. Keep the fallback scoped to the recorded executor PID / pid-file, or verify the command name before killing.
Based on learnings "Kill any lingering ad4m-executor processes before running integration tests to avoid port conflicts".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration.bats` around lines 47 - 48, The current port fallback
unconditionally kills whatever owns port 4000; change it to target only the
recorded executor PID (from the pid-file used by the test harness) and verify
the process is the executor before killing: read the pid from the existing
pid-file (or the variable that stores the executor PID), check that
/proc/<pid>/cmdline (or ps -p <pid> -o comm=) contains "ad4m-executor" and that
the pid actually listens on :4000, then kill that PID; if no pid-file exists
fall back to using lsof but filter results to PIDs whose command matches
"ad4m-executor" before invoking kill. Ensure you update the teardown code that
currently runs `lsof -ti:4000 | xargs -r kill -9` to this safer,
PID-and-command-verified sequence.
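The command-verified kill this comment asks for could look roughly like the helper below. This is a sketch: `ad4m-executor` is the expected command name from the review, while the helper name itself is invented for illustration:

```shell
#!/usr/bin/env bash
# Only kill a PID whose command name matches the expected binary,
# so an unrelated process that happens to own the port is left alone.
# Usage: kill_if_matches PID EXPECTED_COMMAND
kill_if_matches() {
  local pid="$1" expected="$2" comm
  comm=$(ps -p "$pid" -o comm= 2>/dev/null | tr -d ' ')
  if [ "$comm" = "$expected" ]; then
    kill -TERM "$pid" 2>/dev/null || true
  else
    echo "refusing to kill PID $pid: command '$comm' != '$expected'" >&2
    return 1
  fi
}
```

In the bats teardown this would replace `lsof -ti:4000 | xargs -r kill -9`: resolve the PID from the pid-file (or lsof as a last resort), then pass it through this check instead of killing blindly.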
```typescript
export interface NodeConfig {
  gqlPort: number;
  hcAdminPort: number;
  hcAppPort: number;
  dataPath: string;
  seedPath?: string;
  adminCredential?: string;
  /** Additional flags passed to startExecutor (e.g. enableMultiUser) */
  multiUser?: boolean;
```
multiUser is a silent no-op in this API.
NodeConfig advertises multiUser, but addNode() hardcodes false and startExecutor() has no corresponding argument. Any caller that sets multiUser: true will get the wrong node shape with no signal.
Also applies to: 45-53
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/testCluster.ts` around lines 19 - 27, NodeConfig declares
multiUser but it's ignored: addNode() currently hardcodes multiUser=false and
startExecutor() lacks a matching parameter, so callers setting multiUser:true
get no effect; fix by threading the flag through the call chain—update
addNode(nodeConfig: NodeConfig) to read nodeConfig.multiUser (default false) and
pass it into startExecutor (add a multiUser boolean parameter to startExecutor
and use it when building the node), or if startExecutor cannot change, remove
multiUser from NodeConfig; update all call sites of addNode and startExecutor
accordingly and ensure tests use the propagated flag.
```typescript
async addNode(config: NodeConfig): Promise<ClusterNode> {
  const executorProcess = await startExecutor(
    config.dataPath,
    config.seedPath || "",
    config.gqlPort,
    config.hcAdminPort,
    config.hcAppPort,
    false, // languageLanguageOnly
    config.adminCredential || "",
  );

  // Wait for GQL to be reachable
  const client = await this.waitForGql(config.gqlPort, config.adminCredential || "");

  const node: ClusterNode = {
    config,
    process: executorProcess,
    client,
    gqlPort: config.gqlPort,
  };

  this.nodes.push(node);
  return node;
}
```
Clean up the child if startup probing fails.
If waitForGql() throws here, the spawned executor is never added to this.nodes, so shutdown() cannot clean it up. That leaves orphan ad4m-executor processes behind and can poison later test ports.
🧹 Minimal fix

```diff
 async addNode(config: NodeConfig): Promise<ClusterNode> {
   const executorProcess = await startExecutor(
     config.dataPath,
     config.seedPath || "",
     config.gqlPort,
     config.hcAdminPort,
     config.hcAppPort,
     false, // languageLanguageOnly
     config.adminCredential || "",
   );
-  // Wait for GQL to be reachable
-  const client = await this.waitForGql(config.gqlPort, config.adminCredential || "");
-
-  const node: ClusterNode = {
-    config,
-    process: executorProcess,
-    client,
-    gqlPort: config.gqlPort,
-  };
-
-  this.nodes.push(node);
-  return node;
+  try {
+    const client = await this.waitForGql(config.gqlPort, config.adminCredential || "");
+
+    const node: ClusterNode = {
+      config,
+      process: executorProcess,
+      client,
+      gqlPort: config.gqlPort,
+    };
+
+    this.nodes.push(node);
+    return node;
+  } catch (error) {
+    await gracefulShutdown(executorProcess, `cluster node ${config.gqlPort}`);
+    throw error;
+  }
 }
```

Based on learnings "Kill any lingering ad4m-executor processes before running integration tests to avoid port conflicts".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/testCluster.ts` around lines 44 - 67, In addNode, if
waitForGql throws the spawned executorProcess is left orphaned; wrap the
waitForGql call in a try/catch and on any error ensure the started executor is
cleaned up (e.g., call executorProcess.kill() or the appropriate shutdown helper
for the started process) before rethrowing the error; reference addNode,
startExecutor, executorProcess, waitForGql and this.nodes so the fix either
kills the process in the catch or adds a temporary tracking entry to this.nodes
before awaiting and removes it on failure.
```typescript
/**
 * Poll the runtimeReadiness query until all subsystems are ready.
 * Falls back to runtime.info() if runtimeReadiness is not available.
 */
async waitForReadiness(node: ClusterNode, timeoutMs: number = 120000): Promise<void> {
  const start = Date.now();

  while (Date.now() - start < timeoutMs) {
    try {
      const info = await node.client.runtime.info();
      if (info.isInitialized && info.isUnlocked) {
        return;
      }
    } catch (e) {
      // Not ready yet
    }
    await sleep(2000);
  }

  throw new Error(`Node on port ${node.gqlPort} not fully ready after ${timeoutMs}ms`);
```
This readiness helper never uses the new readiness probe.
The comment says runtimeReadiness, but the implementation still polls runtime.info(). That misses gqlReady / holochainReady and keeps the old race the dedicated readiness resolver was added to remove.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/testCluster.ts` around lines 91 - 110, The waitForReadiness
function currently polls runtime.info() and never calls the new readiness probe;
update waitForReadiness to first call node.client.runtime.runtimeReadiness()
inside the retry loop and check the readiness fields (e.g., gqlReady and
holochainReady or an overall ready flag) and only fall back to
node.client.runtime.info() if runtimeReadiness is not available or returns null;
keep the existing timeout/sleep behavior and retain the same error throw
(including node.gqlPort) if neither probe reports readiness within timeout.
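The intended behavior — try the new probe first, fall back to the old one when it is unavailable — can be sketched generically. This is a sketch with injected probe functions; the real code would pass `node.client.runtime.runtimeReadiness()` as the primary probe and an `info()`-based check as the fallback:

```typescript
// Generic readiness poll: try the primary probe; if it throws (e.g. the
// query does not exist on an older executor), try the fallback probe.
// A probe returning false means "not ready yet", so we keep polling.
type Probe = () => Promise<boolean>;

async function waitForReadiness(
  primary: Probe,
  fallback: Probe,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    try {
      if (await primary()) return;
    } catch {
      try {
        if (await fallback()) return; // primary unavailable: use the legacy probe
      } catch {
        // neither probe reachable yet
      }
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`not ready after ${timeoutMs}ms`);
}
```

Keeping the probes injectable also makes the helper easy to unit-test without a running executor.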
```typescript
export async function gracefulShutdown(proc: ChildProcess | null | undefined, label: string = "process", timeoutMs: number = 10000): Promise<void> {
  if (!proc || proc.killed) return;

  console.log(`Sending SIGTERM to ${label} (PID ${proc.pid})...`);
  proc.kill('SIGTERM');

  // Wait for process to exit gracefully
  const start = Date.now();
  while (!proc.killed && Date.now() - start < timeoutMs) {
    await sleep(500);
  }

  if (!proc.killed) {
    console.log(`${label} did not exit after ${timeoutMs}ms, sending SIGKILL...`);
    proc.kill('SIGKILL');
    await sleep(1000);
  }

  console.log(`${label} shut down (killed=${proc.killed})`);
```
🧩 Analysis chain
🌐 Web query:
In Node.js, what does `ChildProcess.killed` mean — has the child exited, or only that `kill()` successfully sent a signal?
💡 Result:
ChildProcess.killed does not mean the child has exited. It only indicates that the child process successfully received a signal as a result of calling child.kill() (i.e., the signal was delivered/accepted). (nodejs.org)
To know whether the child has actually terminated, check for the 'exit'/'close' events (and/or child.exitCode / child.signalCode), which reflect real process termination. (nodejs.org)
Citations:
- 1: https://nodejs.org/download/release/v22.19.0/docs/api/child_process.html?utm_source=openai
- 2: https://nodejs.org/download/nightly/v26.0.0-nightly2026022776215dc993/docs/api/child_process.html?utm_source=openai
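A minimal script makes the distinction concrete (this assumes a POSIX system with a `sleep` binary on the PATH):

```typescript
import { spawn } from "node:child_process";

const child = spawn("sleep", ["60"]);

child.kill("SIGTERM");

// `killed` flips to true as soon as the signal is sent,
// even though the process has not actually exited yet:
console.log(child.killed);   // true
console.log(child.exitCode); // null (not yet exited)

child.once("exit", (code, signal) => {
  // Only now has the child really terminated.
  console.log(signal); // "SIGTERM"
});
```

So a shutdown loop that spins on `!proc.killed` will fall through immediately after `kill()` returns, long before the child is gone.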
Fix process termination detection in gracefulShutdown().
`ChildProcess.killed` only indicates a signal was sent, not that the process has exited. The current loop exits immediately after `proc.kill('SIGTERM')`, so the SIGKILL escalation never triggers even if the executor is still running. This causes test flakiness due to lingering processes.
Use the `'exit'` and `'close'` events (or check `exitCode`/`signalCode`) to detect actual termination:

Fix

```diff
 export async function gracefulShutdown(proc: ChildProcess | null | undefined, label: string = "process", timeoutMs: number = 10000): Promise<void> {
-  if (!proc || proc.killed) return;
+  if (!proc || proc.exitCode !== null || proc.signalCode !== null) return;

   console.log(`Sending SIGTERM to ${label} (PID ${proc.pid})...`);
   proc.kill('SIGTERM');

-  // Wait for process to exit gracefully
-  const start = Date.now();
-  while (!proc.killed && Date.now() - start < timeoutMs) {
-    await sleep(500);
-  }
+  const exited = await Promise.race([
+    new Promise<boolean>((resolve) => {
+      proc.once("exit", () => resolve(true));
+      proc.once("close", () => resolve(true));
+    }),
+    sleep(timeoutMs).then(() => false),
+  ]);

-  if (!proc.killed) {
+  if (!exited && proc.exitCode === null && proc.signalCode === null) {
     console.log(`${label} did not exit after ${timeoutMs}ms, sending SIGKILL...`);
     proc.kill('SIGKILL');
-    await sleep(1000);
+    await new Promise<void>((resolve) => proc.once("close", () => resolve()));
   }

-  console.log(`${label} shut down (killed=${proc.killed})`);
+  console.log(`${label} shut down (exitCode=${proc.exitCode}, signalCode=${proc.signalCode})`);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/utils.ts` around lines 323 - 341, The gracefulShutdown
function currently relies on ChildProcess.killed (which only means a signal was
sent) so it should instead await actual process termination: attach listeners
for the child's 'exit' and 'close' events (or poll exitCode/signalCode) after
sending SIGTERM and resolve when one of those indicates the process exited; if
no exit occurs within timeoutMs, send SIGKILL and again wait for the same events
(with a short fallback sleep) before returning. Update gracefulShutdown to
remove the while loop based on proc.killed, use a Promise that races the
exit/close events against a timeout, and reference proc, gracefulShutdown, and
timeoutMs to implement the two-stage terminate-and-escalate behavior.
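Pulling the suggested diff together, a self-contained, event-based variant might look like the sketch below. It mirrors the names and defaults from the snippet under review but is not the repository's final implementation; note that the pending grace-period timer keeps the event loop alive for up to `timeoutMs`, so a production version might `unref()` it.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

export async function gracefulShutdown(
  proc: ChildProcess | null | undefined,
  label: string = "process",
  timeoutMs: number = 10000,
): Promise<void> {
  // exitCode/signalCode are only set once the process has really terminated.
  if (!proc || proc.exitCode !== null || proc.signalCode !== null) return;
  const p = proc; // stable binding for use inside callbacks

  console.log(`Sending SIGTERM to ${label} (PID ${p.pid})...`);
  p.kill("SIGTERM");

  // Race actual termination (the "exit" event) against the grace period.
  const exited = await Promise.race([
    new Promise<boolean>((resolve) => p.once("exit", () => resolve(true))),
    sleep(timeoutMs).then(() => false),
  ]);

  if (!exited) {
    console.log(`${label} did not exit after ${timeoutMs}ms, sending SIGKILL...`);
    p.kill("SIGKILL");
    await new Promise<void>((resolve) => p.once("exit", () => resolve()));
  }

  console.log(`${label} shut down (exitCode=${p.exitCode}, signalCode=${p.signalCode})`);
}
```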
CI compile fix is working — all 8 ad4m model integration tests pass. The 2 reported failures are pre-existing artefacts on the
Neither is related to this PR.
Summary
Comprehensive test and CI infrastructure improvements across three phases, addressing process cleanup reliability, graceful shutdown, readiness probing, and multi-executor test tooling.
Based on the proposal at `membranes/adam/proposals/test-ci-improvements.md`.

Phase 1: Quick Wins
- Replace `killall ad4m` with PID-based SIGTERM→wait→SIGKILL escalation
- Add a `store_artifacts` step for failure log collection
- Replace blind `kill -9` orphan cleanup in CI with the same SIGTERM→wait→SIGKILL pattern

Phase 2: Graceful Shutdown (Rust)
- SIGTERM/SIGINT signal handlers via `tokio::signal` in rust-executor
- Graceful runtime shutdown instead of `process::exit(0)`

Phase 3: Readiness & Multi-Executor
- `runtimeReadiness` GraphQL query returning `{ gqlReady, holochainReady, agentInitialized, languagesLoaded }` for polling-based readiness checks
- `TestCluster` utility at `tests/js/utils/testCluster.ts` — manages N executor instances with startup, readiness polling, and graceful shutdown

CI Status
CI is disabled (`when: never` on the workflow) pending review. These changes are backward-compatible but should be reviewed before enabling CI runs.

@nicolasluck — please review when you get a chance. The CI runner is intentionally disabled on this branch.
Summary by CodeRabbit
New Features
Improvements