fix: detect and fail fast on multi-threaded test execution#36
fix: detect and fail fast on multi-threaded test execution#36
Conversation
Add Integration Constraints section to limitations.md: - One first::test() per #[test] - Async tests not supported - Not thread-safe - No nested workspaces Add limitations summary to lib.rs crate docs. Closes #18
- Rewrite README for conciseness and professional formatting - Switch license from MIT to Apache 2.0 - Update project logo - Rewrite all architecture docs (~60% reduction) - Rewrite limitations.md and proof/reference_wal.md - Remove empty invariants.md
- Generate unique base directory per orchestrator run using UUID + PID - Pass base directory to child processes via FIRST_WORK_DIR - Derive leaf run_N directory in child process from base + target - Update extract_test_name to use thread name for better isolation - Add reproduction test case (tests/repro_issue_29.rs) - Closes #29
FIRST uses SIGKILL to simulate power loss, which terminates the entire process. When multiple tests share the same process (the default in cargo test), one test hitting a crash point kills all other tests. This change implements the acceptance criteria from issue #30: - Detect RUST_TEST_THREADS environment variable (set by cargo test) - Detect --test-threads command-line argument - Fail fast with a clear error message explaining the issue and solution - Allow bypassing with FIRST_SKIP_THREAD_CHECK=1 for advanced use cases Fixes #30
There was a problem hiding this comment.
5 issues found across 16 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/orchestrator.rs">
<violation number="1" location="src/orchestrator.rs:49">
P2: The new UUID-scoped `base_dir` is never cleaned up, so successful runs leak a unique temp directory each execution.</violation>
</file>
<file name="tests/repro_issue_29.rs">
<violation number="1" location="tests/repro_issue_29.rs:24">
P2: This test can fail nondeterministically because it asserts persisted file contents after an intentional SIGKILL without fsyncing the write first.</violation>
</file>
<file name="src/test.rs">
<violation number="1" location="src/test.rs:120">
P3: Duplicate target/work-dir derivation was introduced in both Execution and Verify paths; extract it to a shared helper to avoid drift between phases.</violation>
</file>
<file name="docs/proof/reference_wal.md">
<violation number="1" location="docs/proof/reference_wal.md:36">
P2: The documented `commit` diff uses `?` in a function signature that doesn't return `Result`, making the snippet invalid and inconsistent with the actual code.</violation>
</file>
<file name="README.md">
<violation number="1" location="README.md:2">
P2: README references a non-existent image file (`assets/image.jpg`), causing a broken banner image.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| let pid = std::process::id(); | ||
| let base_dir = std::env::temp_dir() | ||
| .join("first") | ||
| .join(format!("first-{}-{}", pid, run_uuid)); |
There was a problem hiding this comment.
P2: The new UUID-scoped base_dir is never cleaned up, so successful runs leak a unique temp directory each execution.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/orchestrator.rs, line 49:
<comment>The new UUID-scoped `base_dir` is never cleaned up, so successful runs leak a unique temp directory each execution.</comment>
<file context>
@@ -42,8 +40,17 @@ where
+ let pid = std::process::id();
+ let base_dir = std::env::temp_dir()
+ .join("first")
+ .join(format!("first-{}-{}", pid, run_uuid));
+
loop {
</file context>
| // Sleep to increase chance of overlap | ||
| thread::sleep(Duration::from_millis(100)); | ||
|
|
||
| first::crash_point("point_a"); |
There was a problem hiding this comment.
P2: This test can fail nondeterministically because it asserts persisted file contents after an intentional SIGKILL without fsyncing the write first.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/repro_issue_29.rs, line 24:
<comment>This test can fail nondeterministically because it asserts persisted file contents after an intentional SIGKILL without fsyncing the write first.</comment>
<file context>
@@ -0,0 +1,65 @@
+ // Sleep to increase chance of overlap
+ thread::sleep(Duration::from_millis(100));
+
+ first::crash_point("point_a");
+ })
+ .verify(|env, _| {
</file context>
| crash_point("after_commit_write"); | ||
| self.file.sync_all().expect("failed to fsync after COMMIT"); | ||
| pub fn commit(&mut self, txid: TxId) { | ||
| + self.file.sync_all()?; // ← ensure records durable FIRST |
There was a problem hiding this comment.
P2: The documented commit diff uses ? in a function signature that doesn't return Result, making the snippet invalid and inconsistent with the actual code.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/proof/reference_wal.md, line 36:
<comment>The documented `commit` diff uses `?` in a function signature that doesn't return `Result`, making the snippet invalid and inconsistent with the actual code.</comment>
<file context>
@@ -1,148 +1,54 @@
- crash_point("after_commit_write");
- self.file.sync_all().expect("failed to fsync after COMMIT");
+ pub fn commit(&mut self, txid: TxId) {
++ self.file.sync_all()?; // ← ensure records durable FIRST
+ writeln!(self.file, "COMMIT {}", txid)?;
+ self.file.sync_all()?;
</file context>
| @@ -1,501 +1,157 @@ | |||
| <p align="center"> | |||
| <img src="assets/logo.svg" alt="FIRST logo" width="240"> | |||
| <img src="assets/image.jpg" alt="FIRST logo" width="520"> | |||
There was a problem hiding this comment.
P2: README references a non-existent image file (assets/image.jpg), causing a broken banner image.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At README.md, line 2:
<comment>README references a non-existent image file (`assets/image.jpg`), causing a broken banner image.</comment>
<file context>
@@ -1,501 +1,157 @@
<p align="center">
- <img src="assets/logo.svg" alt="FIRST logo" width="240">
+ <img src="assets/image.jpg" alt="FIRST logo" width="520">
</p>
</file context>
| <img src="assets/image.jpg" alt="FIRST logo" width="520"> | |
| <img src="assets/logo.svg" alt="FIRST logo" width="520"> |
| Phase::Execution => { | ||
| if let Some(run_fn) = self.run_fn { | ||
| // Derive leaf directory from base + target | ||
| let target = std::env::var("FIRST_CRASH_TARGET") |
There was a problem hiding this comment.
P3: Duplicate target/work-dir derivation was introduced in both Execution and Verify paths; extract it to a shared helper to avoid drift between phases.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/test.rs, line 120:
<comment>Duplicate target/work-dir derivation was introduced in both Execution and Verify paths; extract it to a shared helper to avoid drift between phases.</comment>
<file context>
@@ -116,12 +116,26 @@ where
Phase::Execution => {
if let Some(run_fn) = self.run_fn {
+ // Derive leaf directory from base + target
+ let target = std::env::var("FIRST_CRASH_TARGET")
+ .ok()
+ .and_then(|s| s.parse::<usize>().ok())
</file context>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e47322a48
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/rt.rs
Outdated
| eprintln!( | ||
| "ERROR: FIRST detected multi-threaded test execution (RUST_TEST_THREADS={})", | ||
| thread_count | ||
| ); | ||
| eprintln!(); | ||
| eprintln!("FIRST uses SIGKILL to simulate power loss, which terminates the ENTIRE process."); | ||
| eprintln!("When multiple tests share the same process, one crash kills all tests."); | ||
| eprintln!(); | ||
| eprintln!("SOLUTION: Run tests with --test-threads=1:"); | ||
| eprintln!(" cargo test -- --test-threads=1"); | ||
| eprintln!(); | ||
| eprintln!("Or set FIRST_SKIP_THREAD_CHECK=1 at your own risk."); | ||
| std::process::exit(1); |
There was a problem hiding this comment.
Preserve thread-check diagnostics instead of dropping stderr
detect_multi_threaded_test() emits a detailed remediation message and then exits, but this runs in the EXECUTION child where spawn_child() pipes stderr through parse_crash_metadata() and discards non-crash lines, so users only see a generic execution failure and lose the new guidance. In runs that hit this branch (e.g., RUST_TEST_THREADS/--test-threads configurations), the fail-fast behavior works but the actionable error text is effectively hidden.
Useful? React with 👍 / 👎.
When a child process in EXECUTION phase fails (e.g., thread check fail-fast), the stderr output was being captured but discarded unless it contained crash metadata. This made the detailed thread-check error message invisible to users. Now we capture and print the child's stderr output when the child fails, so users can see the full error message. Also fixes clippy warning about collapsible if statements.
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/orchestrator.rs">
<violation number="1" location="src/orchestrator.rs:294">
P2: Avoid unbounded buffering of child stderr; cap captured output to prevent excessive memory use on verbose failures.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| Ok(0) => break, // EOF | ||
| Ok(_) => { | ||
| let trimmed = line.trim(); | ||
| full_output.push_str(trimmed); |
There was a problem hiding this comment.
P2: Avoid unbounded buffering of child stderr; cap captured output to prevent excessive memory use on verbose failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/orchestrator.rs, line 294:
<comment>Avoid unbounded buffering of child stderr; cap captured output to prevent excessive memory use on verbose failures.</comment>
<file context>
@@ -272,18 +280,31 @@ fn spawn_child_with_crash_info(
+ Ok(0) => break, // EOF
+ Ok(_) => {
+ let trimmed = line.trim();
+ full_output.push_str(trimmed);
+ full_output.push('\n');
+ // Look for JSON crash metadata
</file context>
Summary
This PR fixes issue #30:
crash_point() SIGKILL terminates entire test harness.FIRST uses
SIGKILLto simulate power loss, which terminates the entire process. When multiple tests share the same process (the default incargo test), one test hitting a crash point kills all other tests.Changes
RUST_TEST_THREADSenvironment variable (set by cargo test)--test-threadscommand-line argumentFIRST_SKIP_THREAD_CHECK=1for advanced use casesTesting
RUST_TEST_THREADS=4→ fails with clear error--test-threads=4→ fails with clear error--test-threads=1FIRST_SKIP_THREAD_CHECK=1allows bypassing the checkFixes #30
Summary by cubic
Detects multi-threaded test runs and fails fast with a clear message to prevent a SIGKILL from killing the entire test harness. Also isolates per-run work directories to avoid collisions and now propagates child stderr so errors are visible.
Bug Fixes
Migration
Written for commit 3f7ae36. Summary will update on new commits.