Skip to content

fix: detect and fail fast on multi-threaded test execution#36

Merged
siphonite merged 8 commits intomainfrom
fix/issue-30-sigkill-test-harness
Mar 3, 2026
Merged

fix: detect and fail fast on multi-threaded test execution#36
siphonite merged 8 commits intomainfrom
fix/issue-30-sigkill-test-harness

Conversation

@siphonite
Copy link
Owner

@siphonite siphonite commented Mar 3, 2026

Summary

This PR fixes issue #30: crash_point() SIGKILL terminates entire test harness.

FIRST uses SIGKILL to simulate power loss, which terminates the entire process. When multiple tests share the same process (the default in cargo test), one test hitting a crash point kills all other tests.

Changes

  • Added detection for RUST_TEST_THREADS environment variable (set by cargo test)
  • Added detection for --test-threads command-line argument
  • Fail fast with a clear error message explaining the issue and solution
  • Allow bypassing with FIRST_SKIP_THREAD_CHECK=1 for advanced use cases

Testing

  • Verified detection works with RUST_TEST_THREADS=4 → fails with clear error
  • Verified detection works with --test-threads=4 → fails with clear error
  • Verified tests pass with --test-threads=1
  • Verified FIRST_SKIP_THREAD_CHECK=1 allows bypassing the check

Fixes #30


Summary by cubic

Detects multi-threaded test runs and fails fast with a clear message to prevent a SIGKILL from killing the entire test harness. Also isolates per-run work directories to avoid collisions and now propagates child stderr so errors are visible.

  • Bug Fixes

    • Detects RUST_TEST_THREADS and --test-threads in Execution; exits if >1 with guidance; override via FIRST_SKIP_THREAD_CHECK=1.
    • Generates a unique base dir per orchestrator run (PID + UUID), passes FIRST_WORK_DIR to children, and derives run_N paths to prevent collisions; adds a repro test.
    • Propagates stderr from failed child processes so thread-check and other errors are shown.
    • Fixes clippy “collapsible if” warnings.
  • Migration

    • Run tests single-threaded: cargo test -- --test-threads=1 (or set RUST_TEST_THREADS=1).
    • Only set FIRST_SKIP_THREAD_CHECK=1 if each test runs in its own process.
    • Work artifacts now live under $TMPDIR/first/first--/run_N.

Written for commit 3f7ae36. Summary will update on new commits.

Add Integration Constraints section to limitations.md:
- One first::test() per #[test]
- Async tests not supported
- Not thread-safe
- No nested workspaces

Add limitations summary to lib.rs crate docs.

Closes #18
- Rewrite README for conciseness and professional formatting
- Switch license from MIT to Apache 2.0
- Update project logo
- Rewrite all architecture docs (~60% reduction)
- Rewrite limitations.md and proof/reference_wal.md
- Remove empty invariants.md
- Generate unique base directory per orchestrator run using UUID + PID
- Pass base directory to child processes via FIRST_WORK_DIR
- Derive leaf run_N directory in child process from base + target
- Update extract_test_name to use thread name for better isolation
- Add reproduction test case (tests/repro_issue_29.rs)
- Closes #29
FIRST uses SIGKILL to simulate power loss, which terminates the entire
process. When multiple tests share the same process (the default in cargo
test), one test hitting a crash point kills all other tests.

This change implements the acceptance criteria from issue #30:
- Detect RUST_TEST_THREADS environment variable (set by cargo test)
- Detect --test-threads command-line argument
- Fail fast with a clear error message explaining the issue and solution
- Allow bypassing with FIRST_SKIP_THREAD_CHECK=1 for advanced use cases

Fixes #30
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 16 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/orchestrator.rs">

<violation number="1" location="src/orchestrator.rs:49">
P2: The new UUID-scoped `base_dir` is never cleaned up, so successful runs leak a unique temp directory each execution.</violation>
</file>

<file name="tests/repro_issue_29.rs">

<violation number="1" location="tests/repro_issue_29.rs:24">
P2: This test can fail nondeterministically because it asserts persisted file contents after an intentional SIGKILL without fsyncing the write first.</violation>
</file>

<file name="src/test.rs">

<violation number="1" location="src/test.rs:120">
P3: Duplicate target/work-dir derivation was introduced in both Execution and Verify paths; extract it to a shared helper to avoid drift between phases.</violation>
</file>

<file name="docs/proof/reference_wal.md">

<violation number="1" location="docs/proof/reference_wal.md:36">
P2: The documented `commit` diff uses `?` in a function signature that doesn't return `Result`, making the snippet invalid and inconsistent with the actual code.</violation>
</file>

<file name="README.md">

<violation number="1" location="README.md:2">
P2: README references a non-existent image file (`assets/image.jpg`), causing a broken banner image.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

let pid = std::process::id();
let base_dir = std::env::temp_dir()
.join("first")
.join(format!("first-{}-{}", pid, run_uuid));
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The new UUID-scoped base_dir is never cleaned up, so successful runs leak a unique temp directory each execution.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/orchestrator.rs, line 49:

<comment>The new UUID-scoped `base_dir` is never cleaned up, so successful runs leak a unique temp directory each execution.</comment>

<file context>
@@ -42,8 +40,17 @@ where
+    let pid = std::process::id();
+    let base_dir = std::env::temp_dir()
+        .join("first")
+        .join(format!("first-{}-{}", pid, run_uuid));
+
     loop {
</file context>
Fix with Cubic

// Sleep to increase chance of overlap
thread::sleep(Duration::from_millis(100));

first::crash_point("point_a");
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: This test can fail nondeterministically because it asserts persisted file contents after an intentional SIGKILL without fsyncing the write first.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/repro_issue_29.rs, line 24:

<comment>This test can fail nondeterministically because it asserts persisted file contents after an intentional SIGKILL without fsyncing the write first.</comment>

<file context>
@@ -0,0 +1,65 @@
+            // Sleep to increase chance of overlap
+            thread::sleep(Duration::from_millis(100));
+
+            first::crash_point("point_a");
+        })
+        .verify(|env, _| {
</file context>
Fix with Cubic

crash_point("after_commit_write");
self.file.sync_all().expect("failed to fsync after COMMIT");
pub fn commit(&mut self, txid: TxId) {
+ self.file.sync_all()?; // ← ensure records durable FIRST
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The documented commit diff uses ? in a function signature that doesn't return Result, making the snippet invalid and inconsistent with the actual code.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/proof/reference_wal.md, line 36:

<comment>The documented `commit` diff uses `?` in a function signature that doesn't return `Result`, making the snippet invalid and inconsistent with the actual code.</comment>

<file context>
@@ -1,148 +1,54 @@
-         crash_point("after_commit_write");
-         self.file.sync_all().expect("failed to fsync after COMMIT");
+  pub fn commit(&mut self, txid: TxId) {
++     self.file.sync_all()?;  // ← ensure records durable FIRST
+      writeln!(self.file, "COMMIT {}", txid)?;
+      self.file.sync_all()?;
</file context>
Fix with Cubic

@@ -1,501 +1,157 @@
<p align="center">
<img src="assets/logo.svg" alt="FIRST logo" width="240">
<img src="assets/image.jpg" alt="FIRST logo" width="520">
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: README references a non-existent image file (assets/image.jpg), causing a broken banner image.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At README.md, line 2:

<comment>README references a non-existent image file (`assets/image.jpg`), causing a broken banner image.</comment>

<file context>
@@ -1,501 +1,157 @@
 <p align="center">
-  <img src="assets/logo.svg" alt="FIRST logo" width="240">
+  <img src="assets/image.jpg" alt="FIRST logo" width="520">
 </p>
 
</file context>
Suggested change
<img src="assets/image.jpg" alt="FIRST logo" width="520">
<img src="assets/logo.svg" alt="FIRST logo" width="520">
Fix with Cubic

Phase::Execution => {
if let Some(run_fn) = self.run_fn {
// Derive leaf directory from base + target
let target = std::env::var("FIRST_CRASH_TARGET")
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Duplicate target/work-dir derivation was introduced in both Execution and Verify paths; extract it to a shared helper to avoid drift between phases.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/test.rs, line 120:

<comment>Duplicate target/work-dir derivation was introduced in both Execution and Verify paths; extract it to a shared helper to avoid drift between phases.</comment>

<file context>
@@ -116,12 +116,26 @@ where
             Phase::Execution => {
                 if let Some(run_fn) = self.run_fn {
+                    // Derive leaf directory from base + target
+                    let target = std::env::var("FIRST_CRASH_TARGET")
+                        .ok()
+                        .and_then(|s| s.parse::<usize>().ok())
</file context>
Fix with Cubic

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2e47322a48

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/rt.rs Outdated
Comment on lines +93 to +105
eprintln!(
"ERROR: FIRST detected multi-threaded test execution (RUST_TEST_THREADS={})",
thread_count
);
eprintln!();
eprintln!("FIRST uses SIGKILL to simulate power loss, which terminates the ENTIRE process.");
eprintln!("When multiple tests share the same process, one crash kills all tests.");
eprintln!();
eprintln!("SOLUTION: Run tests with --test-threads=1:");
eprintln!(" cargo test -- --test-threads=1");
eprintln!();
eprintln!("Or set FIRST_SKIP_THREAD_CHECK=1 at your own risk.");
std::process::exit(1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve thread-check diagnostics instead of dropping stderr

detect_multi_threaded_test() emits a detailed remediation message and then exits, but this runs in the EXECUTION child where spawn_child() pipes stderr through parse_crash_metadata() and discards non-crash lines, so users only see a generic execution failure and lose the new guidance. In runs that hit this branch (e.g., RUST_TEST_THREADS/--test-threads configurations), the fail-fast behavior works but the actionable error text is effectively hidden.

Useful? React with 👍 / 👎.

When a child process in EXECUTION phase fails (e.g., thread check
fail-fast), the stderr output was being captured but discarded unless
it contained crash metadata. This made the detailed thread-check error
message invisible to users.

Now we capture and print the child's stderr output when the child fails,
so users can see the full error message.

Also fixes clippy warning about collapsible if statements.
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/orchestrator.rs">

<violation number="1" location="src/orchestrator.rs:294">
P2: Avoid unbounded buffering of child stderr; cap captured output to prevent excessive memory use on verbose failures.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Ok(0) => break, // EOF
Ok(_) => {
let trimmed = line.trim();
full_output.push_str(trimmed);
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Avoid unbounded buffering of child stderr; cap captured output to prevent excessive memory use on verbose failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/orchestrator.rs, line 294:

<comment>Avoid unbounded buffering of child stderr; cap captured output to prevent excessive memory use on verbose failures.</comment>

<file context>
@@ -272,18 +280,31 @@ fn spawn_child_with_crash_info(
+            Ok(0) => break, // EOF
+            Ok(_) => {
+                let trimmed = line.trim();
+                full_output.push_str(trimmed);
+                full_output.push('\n');
+                // Look for JSON crash metadata
</file context>
Fix with Cubic

@siphonite siphonite merged commit 8067142 into main Mar 3, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CRITICAL: crash_point() SIGKILL terminates entire test harness

1 participant