fix: Test & CI infrastructure improvements — graceful shutdown, readiness probes, test isolation #744
Conversation
…ness probes, test isolation

Phase 1 — Quick Wins:
- Replace `killall ad4m` with PID-based SIGTERM→SIGKILL escalation in integration.bats
- Add artifact collection (store_artifacts) to CI for failure debugging
- Add pre-test port validation step to CI jobs
- Use SIGTERM→wait→SIGKILL pattern in all CI orphan cleanup steps

Phase 2 — Graceful Shutdown:
- Add SIGTERM/SIGINT signal handlers in rust-executor (tokio::signal)
- Make runtime_quit use a oneshot channel for graceful shutdown instead of process::exit
- Shut down the Holochain conductor cleanly before exit
- Add --pid-file flag to the CLI for test-harness PID tracking
- Add gracefulShutdown() utility to the JS test utils
- Update all after() hooks to use the graceful shutdown pattern

Phase 3 — Readiness & Multi-Executor:
- Add runtimeReadiness GQL query (gqlReady, holochainReady, agentInitialized, languagesLoaded)
- Create TestCluster utility at tests/js/utils/testCluster.ts for multi-executor test setup

CI is disabled (when: never) pending review.
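The SIGTERM→wait→SIGKILL escalation described above can be sketched as a small shell helper. This is an illustrative sketch, not the actual code from integration.bats or the CI config; the function name and default timeout are invented here:

```shell
#!/usr/bin/env bash
# Escalating shutdown: ask with SIGTERM first, wait up to a timeout,
# then fall back to SIGKILL if the process is still alive.
# Usage: terminate_pid PID [TIMEOUT_SECONDS]
terminate_pid() {
  local pid="$1" timeout="${2:-10}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone: nothing to do
  while kill -0 "$pid" 2>/dev/null; do        # kill -0 only checks liveness
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null || true   # escalate
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Compared with a bare `kill -9`, this gives the executor a window to run its shutdown handlers (flush state, remove the PID file) before being forced out.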
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough

Adds graceful shutdown infrastructure: PID file handling, a signal-driven shutdown channel and task, a GraphQL runtime readiness query, CLI wiring for pid_file, and test utilities/tests updated to perform centralized, graceful process termination instead of ad-hoc kills.
Sequence Diagram

```mermaid
sequenceDiagram
    participant OS as "OS Signal Handler"
    participant Exec as "Executor Runtime"
    participant Shutdown as "Shutdown Task"
    participant Holochain as "Holochain Conductor"
    participant FS as "File System"
    OS->>Exec: SIGINT / SIGTERM
    Exec->>Exec: Signal listener receives signal
    Exec->>Shutdown: send on SHUTDOWN_TX
    Shutdown->>Shutdown: await shutdown signal
    Shutdown->>Holochain: request graceful shutdown
    Holochain-->>Shutdown: shutdown complete
    Shutdown->>FS: remove PID file (AD4M_PID_FILE)
    FS-->>Shutdown: file removed
    Shutdown->>Exec: log completion and exit(0)
```
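The flow in the diagram can be modeled with a plain-std sketch. Note this is a simplified model for illustration only: the real executor uses tokio signal handling and async channels, while this sketch substitutes `std::sync::mpsc` and a thread, and simulates the signal by sending on the channel directly:

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// Minimal model of the shutdown task: wait for a signal on the channel,
/// shut subsystems down, then remove the PID file.
fn run_shutdown_task(rx: mpsc::Receiver<()>, pid_file: PathBuf) {
    // Blocks until the "signal handler" sends on the channel.
    rx.recv().expect("shutdown channel closed");
    // (The real code would stop the Holochain conductor here.)
    let _ = fs::remove_file(&pid_file);
}

fn main() {
    let pid_file = std::env::temp_dir().join("ad4m-demo.pid");
    fs::write(&pid_file, std::process::id().to_string()).unwrap();

    let (shutdown_tx, shutdown_rx) = mpsc::channel();
    let pf = pid_file.clone();
    let task = thread::spawn(move || run_shutdown_task(shutdown_rx, pf));

    // Simulate SIGTERM arriving.
    shutdown_tx.send(()).unwrap();
    task.join().unwrap();

    assert!(!pid_file.exists(), "PID file should be removed on shutdown");
    println!("shutdown complete");
}
```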
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
.circleci/config.yml (Outdated)

```shell
echo "⚠️ Ports still in use:$CONFLICTS"
echo "Attempting cleanup..."
for port in $CONFLICTS; do
  lsof -ti:$port | xargs -r kill -9 2>/dev/null || true
```
Not sure the "verify ports are clean" step makes sense here; if the ports are not clean, just killing whatever process holds them is asking for trouble.
Done — removed the port verification step and all kill steps entirely. Graceful shutdown should handle cleanup. Also removed when: never so CI will run on this branch.
.circleci/config.yml (Outdated)

```shell
  echo "$PIDS" | xargs -r kill -TERM 2>/dev/null || true
  fi
done
sleep 3
```
Yeah, I think this killing just needs to go away completely. I hope the graceful shutdown below eradicates the need for it.
Agreed — removed all kill steps from every test job (js, mcp, model). Relying entirely on graceful shutdown now. Let's see how CI goes!
lucksus left a comment:
I think the graceful shutdown was missing for a long time. Great stuff! If you remove at least the new kill code, I would love to see it running on CI (remove the broken `when: never` syntax), and maybe we can remove more kill calls. Then we can try activating more runners again.
Address Nico's review comments on PR #744:
- Remove 'Kill orphaned executors' steps from all test jobs
- Remove 'Verify clean port state' step from integration-tests-js
- Remove 'when: never' to enable the CI workflow
- Keep graceful shutdown handlers, which should make the kill steps unnecessary
…Executor signature
Actionable comments posted: 9
🧹 Nitpick comments (1)
.circleci/config.yml (1)
128-133: Consider preserving agent directory structure to avoid log overwrites.

If multiple agents exist under `tst-tmp/agents/*/ad4m/logs`, copying them all to the same `/tmp/test-artifacts/` destination could overwrite logs with identical filenames. Using `cp --parents` or a more structured approach would preserve debugging context.

♻️ Suggested improvement
```diff
       command: |
         mkdir -p /tmp/test-artifacts
-        cp -r tests/js/tst-tmp/agents/*/ad4m/logs /tmp/test-artifacts/ 2>/dev/null || true
-        cp -r tests/js/tst-tmp/agents/*/ad4m/holochain /tmp/test-artifacts/ 2>/dev/null || true
+        # Preserve agent directory structure to avoid overwrites from multiple agents
+        cd tests/js/tst-tmp && cp -r --parents agents/*/ad4m/logs /tmp/test-artifacts/ 2>/dev/null || true
+        cd tests/js/tst-tmp && cp -r --parents agents/*/ad4m/holochain /tmp/test-artifacts/ 2>/dev/null || true
         # Capture port state for debugging
         lsof -i -P -n 2>/dev/null | grep LISTEN > /tmp/test-artifacts/listening-ports.txt || true
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.circleci/config.yml around lines 128 - 133, The copy commands that flatten multiple agents' ad4m/logs into /tmp/test-artifacts risk overwriting files with identical names; modify the commands that copy from tests/js/tst-tmp/agents/*/ad4m/logs and tests/js/tst-tmp/agents/*/ad4m/holochain so they preserve the agent directory structure (e.g., use cp --parents or rsync/tar to recreate tests/js/tst-tmp/agents/<agent>/ad4m/... under /tmp/test-artifacts, or copy each agent into its own subdirectory) so logs from different agents are not clobbered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/js/utils/utils.ts`:
- Around line 323-341: The gracefulShutdown function currently relies on
ChildProcess.killed (which only means a signal was sent) so it should instead
await actual process termination: attach listeners for the child's 'exit' and
'close' events (or poll exitCode/signalCode) after sending SIGTERM and resolve
when one of those indicates the process exited; if no exit occurs within
timeoutMs, send SIGKILL and again wait for the same events (with a short
fallback sleep) before returning. Update gracefulShutdown to remove the while
loop based on proc.killed, use a Promise that races the exit/close events
against a timeout, and reference proc, gracefulShutdown, and timeoutMs to
implement the two-stage terminate-and-escalate behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0e58720f-8345-425c-a009-2e1349075dd8
📒 Files selected for processing (13)
- .circleci/config.yml
- cli/src/ad4m_executor.rs
- cli/src/main.rs
- rust-executor/src/globals.rs
- rust-executor/src/graphql/graphql_types.rs
- rust-executor/src/graphql/mutation_resolvers.rs
- rust-executor/src/graphql/query_resolvers.rs
- rust-executor/src/lib.rs
- tests/integration.bats
- tests/js/tests/multi-user-simple.test.ts
- tests/js/tests/simple.test.ts
- tests/js/utils/testCluster.ts
- tests/js/utils/utils.ts
```yaml
- store_artifacts:
    path: /tmp/test-artifacts
    destination: test-logs
```
store_artifacts needs when: on_fail or when: always to actually store the collected logs.
The default when for CircleCI steps is on_success. Since this job will be in a failed state when the "Collect logs on failure" step runs, the store_artifacts step will be skipped, and the collected artifacts won't be uploaded.
🐛 Proposed fix

```diff
 - store_artifacts:
+    when: on_fail
     path: /tmp/test-artifacts
     destination: test-logs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.circleci/config.yml around lines 134 - 136, The store_artifacts step
currently omits a when condition so it uses the default on_success and will be
skipped on failure; update the store_artifacts step (the block defining path:
/tmp/test-artifacts and destination: test-logs) to include a when: on_fail or
when: always key so artifacts collected by the "Collect logs on failure" step
are actually uploaded (choose on_fail to only upload on failures or always to
upload regardless).
```rust
// Set PID file path as env var so the executor can write/clean it up
if let Some(ref pf) = pid_file {
    // SAFETY: set_var is safe here because we're in single-threaded init before spawning.
    #[allow(deprecated)]
    unsafe {
        std::env::set_var("AD4M_PID_FILE", pf);
    }
```
🧩 Analysis chain
🌐 Web query:
In current Rust, why is `std::env::set_var` unsafe on Unix, and is it sound to call it from inside a `#[tokio::main(flavor = "multi_thread")]` function after the runtime has started?
💡 Result:
On Unix-like platforms (everything except Windows), std::env::set_var is unsafe because the process environment is a shared global that the OS/C library does not provide a reliably thread-safe way to access: if any other thread (Rust or native/FFI) reads the environment via getenv, environ, or any other libc/internal routine while you mutate it, you can trigger a data race / use-after-free in the C environment machinery (i.e., UB from Rust’s perspective). The standard library docs therefore say that in multi-threaded programs on non-Windows, “the only safe option is to not use set_var or remove_var at all”, because you generally can’t guarantee no other thread is reading the environment (even Rust std may do so, e.g. during DNS lookup). [1] This is also why Rust 2024 marked set_var/remove_var as unsafe. [2] (Background on why POSIX/C makes this hard: [3])
For #[tokio::main(flavor = "multi_thread")]: No, it is not sound (on Unix) to call unsafe { std::env::set_var(...) } from inside that function once the runtime has started. Tokio’s multi-thread runtime creates a pool of worker threads on startup, and after that point you are in a multi-threaded process. [4] That violates set_var’s safety precondition unless you can somehow ensure that no other thread will read or write the environment except through std::env—which the docs explicitly say is not practically guaranteeable. [1]
If you must set an env var for your own process, do it before starting the Tokio multi-thread runtime (e.g., in a synchronous main that sets vars, then builds the runtime and block_ons). Otherwise, prefer passing configuration explicitly, or use Command::env when the goal is only to affect a child process. [1]
Sources
[1] Rust std docs: std::env::set_var safety contract (Rust 1.94.0) — https://doc.rust-lang.org/std/env/fn.set_var.html
[2] Rust 2024 Edition Guide: newly-unsafe functions (set_var, remove_var) — https://doc.rust-lang.org/beta/edition-guide/rust-2024/newly-unsafe-functions.html
[3] Evan Jones: “Setenv is not Thread Safe and C Doesn't Want to Fix It” — https://www.evanjones.ca/setenv-is-not-thread-safe.html
[4] Tokio runtime docs: multi-thread scheduler starts worker threads on startup — https://docs.rs/tokio/latest/tokio/runtime/
Don't call std::env::set_var() after the Tokio runtime has started.
The unsafe block is marked as "single-threaded init", but main() is already running under #[tokio::main(flavor = "multi_thread")]. On Unix, process-wide environment mutation from std::env::set_var is unsafe in multi-threaded contexts—other threads (or Rust std itself during DNS lookups) may read the environment concurrently, causing data races. Tokio's multi-thread runtime spawns worker threads on startup, so this code runs in a multi-threaded process where set_var violates its safety preconditions.
Thread the PID-file path through Ad4mConfig or another explicit startup parameter instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/src/ad4m_executor.rs` around lines 199 - 205, The code sets AD4M_PID_FILE
via std::env::set_var in ad4m_executor.rs while running under
#[tokio::main(flavor = "multi_thread")], which is unsafe; remove the unsafe
std::env::set_var call and instead thread the PID file path through your startup
config and APIs (e.g., add an Option<PathBuf> pid_file field to Ad4mConfig or an
explicit pid_file parameter on the executor initialization function used in
ad4m_executor.rs), update the call sites that construct/start the executor to
pass the PID path, and have the executor write/clean the PID file from that
explicit value rather than via process-wide env mutation.
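The suggested fix — threading the PID path through config instead of mutating the environment — could look roughly like this. This is a reduced sketch: `Ad4mConfig` here is cut down to the one field under discussion, and the helper names are invented for illustration:

```rust
use std::fs;
use std::path::PathBuf;

/// Reduced stand-in for Ad4mConfig: the PID path travels as plain data,
/// so no process-wide environment mutation is needed.
#[derive(Default)]
struct Ad4mConfig {
    pid_file: Option<PathBuf>,
}

/// Write our own PID to the configured file, if any.
fn write_pid_file(config: &Ad4mConfig) -> std::io::Result<()> {
    if let Some(path) = &config.pid_file {
        fs::write(path, std::process::id().to_string())?;
    }
    Ok(())
}

/// Best-effort cleanup on shutdown.
fn remove_pid_file(config: &Ad4mConfig) {
    if let Some(path) = &config.pid_file {
        let _ = fs::remove_file(path);
    }
}

fn main() {
    let config = Ad4mConfig {
        pid_file: Some(std::env::temp_dir().join("ad4m-config-demo.pid")),
    };
    write_pid_file(&config).unwrap();
    let path = config.pid_file.clone().unwrap();
    assert!(path.exists());
    remove_pid_file(&config);
    assert!(!path.exists());
}
```

Because the path is owned by the config, no thread can observe a half-written global, which is exactly the hazard `set_var` introduces under a multi-threaded runtime.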
```rust
/// Write the executor PID to this file on startup (removed on clean shutdown).
#[arg(long)]
pid_file: Option<String>,
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "Run variant definition:"
sed -n '129,169p' cli/src/main.rs
echo
echo "Trailing match arm:"
sed -n '335,352p' cli/src/main.rs
```

Repository: coasys/ad4m
Length of output: 2064
Update the trailing Domain::Run pattern or this enum change won't compile.
After adding pid_file (along with enable_mcp and mcp_port), the match args.domain arm at the bottom still destructures Domain::Run without these three fields. Use Domain::Run { .. } there, or list the missing fields explicitly.
Minimal fix

```diff
- Domain::Run {
-     app_data_path: _,
-     network_bootstrap_seed: _,
-     language_language_only: _,
-     run_dapp_server: _,
-     gql_port: _,
-     hc_admin_port: _,
-     hc_app_port: _,
-     hc_use_bootstrap: _,
-     hc_use_local_proxy: _,
-     hc_use_mdns: _,
-     hc_use_proxy: _,
-     hc_proxy_url: _,
-     hc_bootstrap_url: _,
-     connect_holochain: _,
-     admin_credential: _,
-     enable_multi_user: _
- } => unreachable!(),
+ Domain::Run { .. } => unreachable!(),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/src/main.rs` around lines 166 - 168, The Domain::Run enum variant and the
args.domain match need updating because you added pid_file, enable_mcp, and
mcp_port to the Run variant; update the match arm that matches args.domain (the
Domain::Run arm) to avoid destructuring missing fields by either changing it to
Domain::Run { .. } or listing the new fields explicitly (pid_file, enable_mcp,
mcp_port) so the match compiles and continues to access other fields as before.
```rust
// Set PID file path as env var so the executor can write/clean it up
if let Some(ref pf) = pid_file {
    #[allow(deprecated)]
    unsafe { std::env::set_var("AD4M_PID_FILE", pf); }
```
🧩 Analysis chain
🌐 Web query:
In current Rust, why is `std::env::set_var` unsafe on Unix, and is it sound to call it from inside a `#[tokio::main(flavor = "multi_thread")]` function after the runtime has started?
💡 Result:
Why std::env::set_var is unsafe on Unix (in current Rust)
On many Unix platforms (notably Linux with glibc), the process environment is a piece of global mutable state managed by C/POSIX APIs (setenv/unsetenv/getenv) that are not reliably thread-safe. If one thread mutates the environment while another thread (or a library call) reads it, this can cause crashes and other undefined behavior at the process level. Rust cannot enforce (or even reliably detect) that no other thread (including threads created by libraries) might touch the environment, and even Rust/stdlib code may read the environment “behind your back” (e.g. during networking/DNS operations). Therefore, in Rust 2024 edition, std::env::set_var / remove_var are marked unsafe so the caller must uphold the required global precondition. [1], [2]
Is it sound to call it inside #[tokio::main(flavor = "multi_thread")] after the runtime has started?
Generally, no (on typical Unix targets like Linux). A Tokio multi-thread runtime creates worker threads on startup; your async fn main() runs after the executor is running and threads exist. [3], [4] At that point you cannot uphold set_var’s safety requirement (“no other thread might be running and reading the environment”, including via libc / other libraries / std internals), so calling unsafe { std::env::set_var(...) } is not sound in general. [1]
If you need per-task/per-component configuration, prefer explicit configuration objects, globals you control (mutex/once), or pass env vars only to child processes via Command::env(...) instead of mutating the process environment.
Sources: [1] std::env::set_var docs, safety section (doc.rust-lang.org); [2] Rust 2024 edition guide (“Newly unsafe functions”) (doc.rust-lang.org); [3] Tokio runtime docs (multi-thread runtime starts worker threads on startup) (docs.rs); [4] Tokio #[tokio::main] runs async fn main after runtime started (stackoverflow.com).
Citations:
- 1: https://doc.rust-lang.org/std/env/fn.set_var.html?utm_source=openai
- 2: https://doc.rust-lang.org/beta/edition-guide/rust-2024/newly-unsafe-functions.html?utm_source=openai
- 3: https://docs.rs/tokio/latest/tokio/runtime/?utm_source=openai
- 4: https://stackoverflow.com/questions/66913978/when-does-tokiomain-spawn-threads?utm_source=openai
🏁 Script executed:

```shell
# Check the context around lines 252-255 in cli/src/main.rs
cat -n cli/src/main.rs | sed -n '240,270p'
```

Repository: coasys/ad4m
Length of output: 1224
🏁 Script executed:

```shell
# Find the main function and check for #[tokio::main] attribute
head -n 300 cli/src/main.rs | grep -A 10 -B 5 "^#\[tokio::main"
```

Repository: coasys/ad4m
Length of output: 409
Remove the unsafe set_var() call and pass the PID-file path through config instead.
The unsafe { std::env::set_var(...) } call at lines 254–255 is unsound. std::env::set_var is unsafe on Unix because the process environment is global mutable state that is not reliably thread-safe. Since this code runs inside #[tokio::main(flavor = "multi_thread")] after the runtime (and its worker threads) have started, other threads may be reading the environment via libc calls, DNS operations, or stdlib internals at any moment. Mutating it here violates the safety precondition and can cause data races and undefined behavior.
Pass the PID-file path to the executor via the config object (Ad4mConfig) instead of mutating process-wide environment state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cli/src/main.rs` around lines 252 - 255, Remove the unsafe environment
mutation: stop calling unsafe std::env::set_var("AD4M_PID_FILE", ...) in the
pid_file handling block; instead add a PID path field to the Ad4mConfig (or
existing config struct) and populate it from pid_file, then ensure the executor
code that previously read AD4M_PID_FILE now reads the PID path from Ad4mConfig
(update the executor initialization to accept the config field). Update
references to AD4M_PID_FILE to use the new config property and remove the unsafe
block entirely.
```shell
# Port-based fallback in case PID tracking missed something
lsof -ti:4000 | xargs -r kill -9 2>/dev/null || true
```
The port fallback can kill an unrelated process.
If PID tracking misses the executor, lsof -ti:4000 | kill -9 will SIGKILL whatever owns port 4000 at teardown, including another local service or CI job. Keep the fallback scoped to the recorded executor PID / pid-file, or verify the command name before killing.
Based on learnings "Kill any lingering ad4m-executor processes before running integration tests to avoid port conflicts".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration.bats` around lines 47 - 48, The current port fallback
unconditionally kills whatever owns port 4000; change it to target only the
recorded executor PID (from the pid-file used by the test harness) and verify
the process is the executor before killing: read the pid from the existing
pid-file (or the variable that stores the executor PID), check that
/proc/<pid>/cmdline (or ps -p <pid> -o comm=) contains "ad4m-executor" and that
the pid actually listens on :4000, then kill that PID; if no pid-file exists
fall back to using lsof but filter results to PIDs whose command matches
"ad4m-executor" before invoking kill. Ensure you update the teardown code that
currently runs `lsof -ti:4000 | xargs -r kill -9` to this safer,
PID-and-command-verified sequence.
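The command-verified kill this comment asks for could look roughly like the helper below. This is a sketch: `ad4m-executor` is the expected command name from the review, while the helper name itself is invented for illustration:

```shell
#!/usr/bin/env bash
# Only kill a PID whose command name matches the expected binary,
# so an unrelated process that happens to own the port is left alone.
# Usage: kill_if_matches PID EXPECTED_COMMAND
kill_if_matches() {
  local pid="$1" expected="$2" comm
  comm=$(ps -p "$pid" -o comm= 2>/dev/null | tr -d ' ')
  if [ "$comm" = "$expected" ]; then
    kill -TERM "$pid" 2>/dev/null || true
  else
    echo "refusing to kill PID $pid: command '$comm' != '$expected'" >&2
    return 1
  fi
}
```

In the bats teardown this would replace `lsof -ti:4000 | xargs -r kill -9`: resolve the PID from the pid-file (or lsof as a last resort), then pass it through this check instead of killing blindly.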
```typescript
export interface NodeConfig {
  gqlPort: number;
  hcAdminPort: number;
  hcAppPort: number;
  dataPath: string;
  seedPath?: string;
  adminCredential?: string;
  /** Additional flags passed to startExecutor (e.g. enableMultiUser) */
  multiUser?: boolean;
```
multiUser is a silent no-op in this API.
NodeConfig advertises multiUser, but addNode() hardcodes false and startExecutor() has no corresponding argument. Any caller that sets multiUser: true will get the wrong node shape with no signal.
Also applies to: 45-53
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/testCluster.ts` around lines 19 - 27, NodeConfig declares
multiUser but it's ignored: addNode() currently hardcodes multiUser=false and
startExecutor() lacks a matching parameter, so callers setting multiUser:true
get no effect; fix by threading the flag through the call chain—update
addNode(nodeConfig: NodeConfig) to read nodeConfig.multiUser (default false) and
pass it into startExecutor (add a multiUser boolean parameter to startExecutor
and use it when building the node), or if startExecutor cannot change, remove
multiUser from NodeConfig; update all call sites of addNode and startExecutor
accordingly and ensure tests use the propagated flag.
```typescript
async addNode(config: NodeConfig): Promise<ClusterNode> {
  const executorProcess = await startExecutor(
    config.dataPath,
    config.seedPath || "",
    config.gqlPort,
    config.hcAdminPort,
    config.hcAppPort,
    false, // languageLanguageOnly
    config.adminCredential || "",
  );

  // Wait for GQL to be reachable
  const client = await this.waitForGql(config.gqlPort, config.adminCredential || "");

  const node: ClusterNode = {
    config,
    process: executorProcess,
    client,
    gqlPort: config.gqlPort,
  };

  this.nodes.push(node);
  return node;
}
```
Clean up the child if startup probing fails.
If waitForGql() throws here, the spawned executor is never added to this.nodes, so shutdown() cannot clean it up. That leaves orphan ad4m-executor processes behind and can poison later test ports.
🧹 Minimal fix

```diff
 async addNode(config: NodeConfig): Promise<ClusterNode> {
   const executorProcess = await startExecutor(
     config.dataPath,
     config.seedPath || "",
     config.gqlPort,
     config.hcAdminPort,
     config.hcAppPort,
     false, // languageLanguageOnly
     config.adminCredential || "",
   );
-  // Wait for GQL to be reachable
-  const client = await this.waitForGql(config.gqlPort, config.adminCredential || "");
-
-  const node: ClusterNode = {
-    config,
-    process: executorProcess,
-    client,
-    gqlPort: config.gqlPort,
-  };
-
-  this.nodes.push(node);
-  return node;
+  try {
+    const client = await this.waitForGql(config.gqlPort, config.adminCredential || "");
+
+    const node: ClusterNode = {
+      config,
+      process: executorProcess,
+      client,
+      gqlPort: config.gqlPort,
+    };
+
+    this.nodes.push(node);
+    return node;
+  } catch (error) {
+    await gracefulShutdown(executorProcess, `cluster node ${config.gqlPort}`);
+    throw error;
+  }
 }
```

Based on learnings "Kill any lingering ad4m-executor processes before running integration tests to avoid port conflicts".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/testCluster.ts` around lines 44 - 67, In addNode, if
waitForGql throws the spawned executorProcess is left orphaned; wrap the
waitForGql call in a try/catch and on any error ensure the started executor is
cleaned up (e.g., call executorProcess.kill() or the appropriate shutdown helper
for the started process) before rethrowing the error; reference addNode,
startExecutor, executorProcess, waitForGql and this.nodes so the fix either
kills the process in the catch or adds a temporary tracking entry to this.nodes
before awaiting and removes it on failure.
```typescript
/**
 * Poll the runtimeReadiness query until all subsystems are ready.
 * Falls back to runtime.info() if runtimeReadiness is not available.
 */
async waitForReadiness(node: ClusterNode, timeoutMs: number = 120000): Promise<void> {
  const start = Date.now();

  while (Date.now() - start < timeoutMs) {
    try {
      const info = await node.client.runtime.info();
      if (info.isInitialized && info.isUnlocked) {
        return;
      }
    } catch (e) {
      // Not ready yet
    }
    await sleep(2000);
  }

  throw new Error(`Node on port ${node.gqlPort} not fully ready after ${timeoutMs}ms`);
```
This readiness helper never uses the new readiness probe.
The comment says runtimeReadiness, but the implementation still polls runtime.info(). That misses gqlReady / holochainReady and keeps the old race the dedicated readiness resolver was added to remove.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/testCluster.ts` around lines 91 - 110, The waitForReadiness
function currently polls runtime.info() and never calls the new readiness probe;
update waitForReadiness to first call node.client.runtime.runtimeReadiness()
inside the retry loop and check the readiness fields (e.g., gqlReady and
holochainReady or an overall ready flag) and only fall back to
node.client.runtime.info() if runtimeReadiness is not available or returns null;
keep the existing timeout/sleep behavior and retain the same error throw
(including node.gqlPort) if neither probe reports readiness within timeout.
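The intended behavior — try the new probe first, fall back to the old one when it is unavailable — can be sketched generically. This is a sketch with injected probe functions; the real code would pass `node.client.runtime.runtimeReadiness()` as the primary probe and an `info()`-based check as the fallback:

```typescript
// Generic readiness poll: try the primary probe; if it throws (e.g. the
// query does not exist on an older executor), try the fallback probe.
// A probe returning false means "not ready yet", so we keep polling.
type Probe = () => Promise<boolean>;

async function waitForReadiness(
  primary: Probe,
  fallback: Probe,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<void> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    try {
      if (await primary()) return;
    } catch {
      try {
        if (await fallback()) return; // primary unavailable: use the legacy probe
      } catch {
        // neither probe reachable yet
      }
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`not ready after ${timeoutMs}ms`);
}
```

Keeping the probes injectable also makes the helper easy to unit-test without a running executor.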
```typescript
export async function gracefulShutdown(proc: ChildProcess | null | undefined, label: string = "process", timeoutMs: number = 10000): Promise<void> {
  if (!proc || proc.killed) return;

  console.log(`Sending SIGTERM to ${label} (PID ${proc.pid})...`);
  proc.kill('SIGTERM');

  // Wait for process to exit gracefully
  const start = Date.now();
  while (!proc.killed && Date.now() - start < timeoutMs) {
    await sleep(500);
  }

  if (!proc.killed) {
    console.log(`${label} did not exit after ${timeoutMs}ms, sending SIGKILL...`);
    proc.kill('SIGKILL');
    await sleep(1000);
  }

  console.log(`${label} shut down (killed=${proc.killed})`);
```
🧩 Analysis chain
🌐 Web query:
In Node.js, what does `ChildProcess.killed` mean — has the child exited, or only that `kill()` successfully sent a signal?
💡 Result:
ChildProcess.killed does not mean the child has exited. It only indicates that the child process successfully received a signal as a result of calling child.kill() (i.e., the signal was delivered/accepted). (nodejs.org)
To know whether the child has actually terminated, check for the 'exit'/'close' events (and/or child.exitCode / child.signalCode), which reflect real process termination. (nodejs.org)
Citations:
- 1: https://nodejs.org/download/release/v22.19.0/docs/api/child_process.html?utm_source=openai
- 2: https://nodejs.org/download/nightly/v26.0.0-nightly2026022776215dc993/docs/api/child_process.html?utm_source=openai
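A minimal script makes the distinction concrete (this assumes a POSIX system with a `sleep` binary on the PATH):

```typescript
import { spawn } from "node:child_process";

const child = spawn("sleep", ["60"]);

child.kill("SIGTERM");

// `killed` flips to true as soon as the signal is sent,
// even though the process has not actually exited yet:
console.log(child.killed);   // true
console.log(child.exitCode); // null (not yet exited)

child.once("exit", (code, signal) => {
  // Only now has the child really terminated.
  console.log(signal); // "SIGTERM"
});
```

So a shutdown loop that spins on `!proc.killed` will fall through immediately after `kill()` returns, long before the child is gone.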
Fix process termination detection in gracefulShutdown().
`ChildProcess.killed` only indicates a signal was sent, not that the process has exited. The current loop exits immediately after `proc.kill('SIGTERM')`, so the SIGKILL escalation never triggers even if the executor is still running. This causes test flakiness due to lingering processes.
Use the `'exit'` and `'close'` events (or check `exitCode`/`signalCode`) to detect actual termination:

Fix

```diff
 export async function gracefulShutdown(proc: ChildProcess | null | undefined, label: string = "process", timeoutMs: number = 10000): Promise<void> {
-  if (!proc || proc.killed) return;
+  if (!proc || proc.exitCode !== null || proc.signalCode !== null) return;

   console.log(`Sending SIGTERM to ${label} (PID ${proc.pid})...`);
   proc.kill('SIGTERM');

-  // Wait for process to exit gracefully
-  const start = Date.now();
-  while (!proc.killed && Date.now() - start < timeoutMs) {
-    await sleep(500);
-  }
+  const exited = await Promise.race([
+    new Promise<boolean>((resolve) => {
+      proc.once("exit", () => resolve(true));
+      proc.once("close", () => resolve(true));
+    }),
+    sleep(timeoutMs).then(() => false),
+  ]);

-  if (!proc.killed) {
+  if (!exited && proc.exitCode === null && proc.signalCode === null) {
     console.log(`${label} did not exit after ${timeoutMs}ms, sending SIGKILL...`);
     proc.kill('SIGKILL');
-    await sleep(1000);
+    await new Promise<void>((resolve) => proc.once("close", () => resolve()));
   }

-  console.log(`${label} shut down (killed=${proc.killed})`);
+  console.log(`${label} shut down (exitCode=${proc.exitCode}, signalCode=${proc.signalCode})`);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/js/utils/utils.ts` around lines 323 - 341, The gracefulShutdown
function currently relies on ChildProcess.killed (which only means a signal was
sent) so it should instead await actual process termination: attach listeners
for the child's 'exit' and 'close' events (or poll exitCode/signalCode) after
sending SIGTERM and resolve when one of those indicates the process exited; if
no exit occurs within timeoutMs, send SIGKILL and again wait for the same events
(with a short fallback sleep) before returning. Update gracefulShutdown to
remove the while loop based on proc.killed, use a Promise that races the
exit/close events against a timeout, and reference proc, gracefulShutdown, and
timeoutMs to implement the two-stage terminate-and-escalate behavior.
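Pulling the suggested diff together, a self-contained, event-based variant might look like the sketch below. It mirrors the names and defaults from the snippet under review but is not the repository's final implementation; note that the pending grace-period timer keeps the event loop alive for up to `timeoutMs`, so a production version might `unref()` it.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

export async function gracefulShutdown(
  proc: ChildProcess | null | undefined,
  label: string = "process",
  timeoutMs: number = 10000,
): Promise<void> {
  // exitCode/signalCode are only set once the process has really terminated.
  if (!proc || proc.exitCode !== null || proc.signalCode !== null) return;
  const p = proc; // stable binding for use inside callbacks

  console.log(`Sending SIGTERM to ${label} (PID ${p.pid})...`);
  p.kill("SIGTERM");

  // Race actual termination (the "exit" event) against the grace period.
  const exited = await Promise.race([
    new Promise<boolean>((resolve) => p.once("exit", () => resolve(true))),
    sleep(timeoutMs).then(() => false),
  ]);

  if (!exited) {
    console.log(`${label} did not exit after ${timeoutMs}ms, sending SIGKILL...`);
    p.kill("SIGKILL");
    await new Promise<void>((resolve) => p.once("exit", () => resolve()));
  }

  console.log(`${label} shut down (exitCode=${p.exitCode}, signalCode=${p.signalCode})`);
}
```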
CI compile fix is working — all 8 ad4m model integration tests pass. The 2 reported failures are pre-existing artefacts on the
Neither is related to this PR.
Summary
Comprehensive test and CI infrastructure improvements across three phases, addressing process cleanup reliability, graceful shutdown, readiness probing, and multi-executor test tooling.
Based on the proposal at `membranes/adam/proposals/test-ci-improvements.md`.

Phase 1: Quick Wins
- Replace `killall ad4m` with PID-based SIGTERM→wait→SIGKILL escalation
- Add a `store_artifacts` step for failure log collection
- Replace blind `kill -9` orphan cleanup in CI with the same SIGTERM→wait→SIGKILL pattern

Phase 2: Graceful Shutdown (Rust)
- SIGTERM/SIGINT signal handlers via `tokio::signal` in rust-executor
- Graceful runtime shutdown instead of `process::exit(0)`

Phase 3: Readiness & Multi-Executor
- `runtimeReadiness` GraphQL query returning `{ gqlReady, holochainReady, agentInitialized, languagesLoaded }` for polling-based readiness checks
- `TestCluster` utility at `tests/js/utils/testCluster.ts` — manages N executor instances with startup, readiness polling, and graceful shutdown

CI Status
CI is disabled (`when: never` on the workflow) pending review. These changes are backward-compatible but should be reviewed before enabling CI runs.

@nicolasluck — please review when you get a chance. The CI runner is intentionally disabled on this branch.
Summary by CodeRabbit
New Features
Improvements