fix: tolerate transient NotFound from concurrent build scripts #3

Open
willpartcl wants to merge 1 commit into oooutlk:main from partcleda:fix/tolerate-transient-not-found
`wait_for_other_builds` walks the build directory looking for stable state. Cargo runs build scripts in parallel; sibling crates create and delete tempfiles inside the same dir while we walk. WalkDir surfaces those as `Err(NotFound)` and `entry.unwrap()` panics.

The fix: treat a WalkDir error as evidence that the directory is still in flux, so set `waiting = true` and continue. The function's invariant is "wait until the directory stops changing", and a NotFound mid-walk is exactly the case where it hasn't yet.
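
To illustrate the idea without depending on the crate's internals, here is a minimal std-only sketch. It swaps `walkdir` for a recursive `std::fs::read_dir` walk, and the names `walk_hit_error` and `wait_for_clean_walk` are hypothetical, not functions from `inwelling`. It models only the error handling: any I/O error seen while walking counts as "still changing" and triggers another pass instead of a panic.

```rust
use std::fs;
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: walks `dir` recursively and returns `true` if any
/// I/O error was seen (for example, an entry deleted mid-walk).
fn walk_hit_error(dir: &Path) -> bool {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        // The directory itself vanished or is unreadable: treat as in flux.
        Err(_) => return true,
    };
    let mut hit_error = false;
    for entry in entries {
        let entry = match entry {
            Ok(entry) => entry,
            Err(_) => {
                // Previously this would have been an `unwrap()` panic.
                hit_error = true;
                continue;
            }
        };
        match entry.file_type() {
            Ok(kind) if kind.is_dir() => hit_error |= walk_hit_error(&entry.path()),
            Ok(_) => {}
            Err(_) => hit_error = true,
        }
    }
    hit_error
}

/// Hypothetical wait loop: retries until one full walk completes without
/// errors, or gives up after `max_attempts` passes.
fn wait_for_clean_walk(dir: &Path, max_attempts: u32, delay: Duration) -> bool {
    for _ in 0..max_attempts {
        if !walk_hit_error(dir) {
            return true;
        }
        thread::sleep(delay);
    }
    false
}
```

The attempt cap is an assumption added for the sketch so that it cannot loop forever; the function in this PR may use different stability criteria and no cap.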

`locate_manifest_paths` has the same TOCTOU pattern: `.exists()` followed by `.read_to_string().expect()` panics if the file disappears between check and read. Replaced with `if let Ok(contents) = read_to_string(...)` so the disappearance becomes a graceful skip.
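
The check-then-read race can be sketched in isolation as follows. This is an illustration under assumptions, not the code in this PR: `read_present_files` is a hypothetical name, and the real function's inputs and return type may differ. The point is that a single `read_to_string` call replaces the `.exists()` check, so a file that disappears is skipped rather than causing a panic.

```rust
use std::fs;
use std::path::PathBuf;

/// Hypothetical helper: reads every candidate path that can be read and
/// silently skips the rest (missing, deleted mid-build, or unreadable).
fn read_present_files(candidates: &[PathBuf]) -> Vec<(PathBuf, String)> {
    let mut found = Vec::new();
    for path in candidates {
        // No separate `.exists()` check: attempting the read is the check,
        // so there is no window between "check" and "use".
        if let Ok(contents) = fs::read_to_string(path) {
            found.push((path.clone(), contents));
        }
    }
    found
}
```

One trade-off of this shape is that `if let Ok(..)` also swallows errors other than NotFound, such as permission failures; matching on `err.kind()` would be a stricter variant.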

Observed

In a multi-crate workspace where two crates use `inwelling` (one via `to()`, one via `collect_downstream()`), heavy parallel builds reproduce this:

```
thread 'main' panicked at inwelling-0.5.5/src/lib.rs:207:31:
called `Result::unwrap()` on an `Err` value:
Error { depth: 2, inner: Io { path: Some(".../partser-.../rmetanFrez5"),
err: Os { code: 2, kind: NotFound, ... } } }
```

The panic is non-deterministic: it depends on filesystem timing and on which sibling build script finishes first.

wait_for_other_builds() walks the entire build dir looking for stable
state. Cargo runs build scripts in parallel; sibling crates create and
delete tempfiles inside the same dir while we walk. WalkDir surfaces
those as Err(NotFound) at iteration time, and entry.unwrap() panics.

The fix: treat WalkDir errors as evidence the directory is still in
flux, so loop again instead of crashing. The function's invariant is
'wait until directory is stable' — a NotFound mid-walk is precisely
the case where it isn't yet.

locate_manifest_paths() has the same TOCTOU pattern between .exists()
and .read_to_string().expect(). Replaced with match-on-read so a
disappeared file becomes a graceful skip instead of a panic.

Observed via partcl CI:
  thread 'main' panicked at inwelling-0.5.5/src/lib.rs:207:31:
  called `Result::unwrap()` on an `Err` value:
  Error { depth: 2, inner: Io { path: Some(".../partser-.../rmetanFrez5"),
                                 err: Os { code: 2, kind: NotFound, ... }}}

Affects all consumers; race is non-deterministic and depends on
filesystem timing.