Cleaned backlog from the full source audit.
Structure:
- Quick wins first: smaller, high-signal fixes that improve correctness, UX, and trust quickly.
- Architectural issues second: deeper system changes that need broader design work.
- ✅ Fix trace metadata drift so recorded traces always embed the real final `tracePath`.
Why:
Recorded traces can be written successfully while the summary inside the trace omits or misstates the final output path.
Impact:
This creates misleading artifact metadata and can confuse downstream reporting and tooling.
Evidence:
src/runtime/engine.rs, src/runtime/tracefile.rs
Done when:
- ✅ `run --record` writes a trace whose embedded summary path matches the actual file written.
- ✅ `test --record` does the same in both overwrite and append modes.
- ✅ Regression tests cover append-mode collision handling.
- ✅ Make `fozzy test` fail clearly when the caller explicitly supplies a nonexistent scenario path.
Why:
Today a mixed invocation can pass if one input exists and another explicit file path is wrong.
Impact:
This silently narrows the executed test set and can produce a false-green result in CI or release gating.
Evidence:
Discovery and empty-match handling live in src/runtime/engine.rs, with file matching in src/platform/fsutil.rs.
Done when:
- ✅ Explicit missing paths cause a hard failure with a clear error.
- ✅ Glob patterns still preserve normal glob semantics.
- ✅ Mixed literal-path and glob invocations are covered by tests.
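The literal-versus-glob contract above can be sketched roughly as follows. This is a minimal illustration, not the code in src/runtime/engine.rs or src/platform/fsutil.rs: `DiscoveryError`, `looks_like_glob`, and `split_inputs` are hypothetical names, and the existence check is passed in as a closure so the sketch stays independent of the real filesystem helpers.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DiscoveryError {
    MissingExplicitPath(PathBuf),
}

/// Assumption: an input counts as a glob if it contains a glob metacharacter.
fn looks_like_glob(input: &str) -> bool {
    input.chars().any(|c| matches!(c, '*' | '?' | '['))
}

/// Splits inputs into literal paths and glob patterns.
/// A literal path that does not exist is a hard error; glob patterns are
/// returned untouched so the usual glob expansion can still match nothing.
pub fn split_inputs(
    inputs: &[&str],
    exists: impl Fn(&Path) -> bool,
) -> Result<(Vec<PathBuf>, Vec<String>), DiscoveryError> {
    let mut literals = Vec::new();
    let mut globs = Vec::new();
    for &input in inputs {
        if looks_like_glob(input) {
            globs.push(input.to_string());
        } else {
            let path = PathBuf::from(input);
            if !exists(&path) {
                return Err(DiscoveryError::MissingExplicitPath(path));
            }
            literals.push(path);
        }
    }
    Ok((literals, globs))
}
```

With this shape, a mixed invocation fails as soon as one explicit path is wrong, instead of quietly running only the inputs that happen to exist.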
- ✅ Make `fozzy init` honor `--config <path>` instead of always writing `fozzy.toml`.
Why:
The CLI exposes a custom config path but initialization still writes the default filename.
Impact:
This breaks user expectations and makes scripted bootstrapping unreliable.
Evidence:
src/runtime/engine.rs, src/main.rs
Done when:
- ✅ Default init still creates `fozzy.toml`.
- ✅ `--config custom.toml init` writes `custom.toml`.
- ✅ Force/non-force behavior is tested for custom config paths.
- ✅ Stop default `fozzy test` runs from silently skipping distributed scenarios while still reporting overall success.
Why:
The default test discovery can find distributed scenarios, but the runner skips them and can still return `status=pass`.
Impact:
This is easy to misread as “all discovered tests passed” when some were never executed.
Evidence:
src/runtime/engine.rs
Done when:
- ✅ Mixed regular/distributed discovery no longer produces a misleading false-green result.
- ✅ The final contract is explicit: fail, opt-in skip, or separate discovery domains.
- ✅ Docs and init scaffolding match the final behavior.
- ✅ Make distributed-scenario validation consistent across `fozzy validate` and `fozzy explore`.
Why:
A distributed scenario can be rejected by validation and still execute successfully through explore.
Impact:
“Valid” and “runnable” do not currently mean the same thing, which weakens trust in both commands.
Evidence:
Validator in src/model/scenario.rs, validation call site in src/main.rs, explore loading in src/modes/explore.rs
Done when:
- ✅ A scenario rejected by `validate` is also rejected by `explore`, unless an explicit permissive mode exists.
- ✅ Missing topology declarations are not silently synthesized during execution.
- ✅ Regression tests cover malformed distributed scenarios.
- ✅ Make scenario validation recurse into nested `assert_throws` and `assert_rejects` blocks.
Why:
Top-level validation does not currently validate the nested step programs inside these wrappers.
Impact:
Malformed nested DSL can slip through preflight and then be treated as an “expected” failure, producing a false-green scenario result.
Evidence:
Top-level validation in src/model/scenario.rs, nested execution in src/runtime/engine.rs
Done when:
- ✅ Nested invalid steps fail validation before runtime.
- ✅ The same validation rules apply at top level and in nested blocks.
- ✅ Tests cover nested invalid durations and invalid field combinations.
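As an illustration of the recursive validation described above, here is a small sketch over a deliberately simplified step model. The `Step` enum, the duration format (`10ms`, `2s`), and the error strings are assumptions made for this example; the real DSL in src/model/scenario.rs has far more variants and rules.

```rust
/// Hypothetical, heavily simplified step model for this sketch.
#[derive(Debug)]
pub enum Step {
    Sleep { duration: String },
    AssertThrows { steps: Vec<Step> },
    AssertRejects { steps: Vec<Step> },
}

/// Assumed duration format: ASCII digits followed by "ms" or "s".
fn valid_duration(text: &str) -> bool {
    let digits = text.trim_end_matches(|c: char| c.is_ascii_alphabetic());
    let unit = &text[digits.len()..];
    !digits.is_empty()
        && digits.chars().all(|c| c.is_ascii_digit())
        && matches!(unit, "ms" | "s")
}

/// Applies the same rules at every nesting level and records where each
/// problem was found, so nested mistakes surface before runtime.
pub fn validate(steps: &[Step], path: &str, errors: &mut Vec<String>) {
    for (index, step) in steps.iter().enumerate() {
        let here = format!("{path}[{index}]");
        match step {
            Step::Sleep { duration } => {
                if !valid_duration(duration) {
                    errors.push(format!("{here}: invalid duration {duration:?}"));
                }
            }
            // Wrapper steps carry a nested program: recurse instead of skipping it.
            Step::AssertThrows { steps } | Step::AssertRejects { steps } => {
                validate(steps, &format!("{here}.steps"), errors);
            }
        }
    }
}
```

Because the wrapper arms call `validate` on their own step lists, a malformed duration inside `assert_throws` is reported as a validation error rather than being counted as the "expected" failure at runtime.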
- ✅ Stop silently discarding source/scenario read failures in topology mapping.
Why:
`fozzy map` currently skips unreadable source files and drops invalid or unreadable scenarios without surfacing that the report is incomplete.
Impact:
The command can overstate coverage confidence and under-report risk.
Evidence:
src/cmd/map_cmd.rs
Done when:
- ✅ Unreadable source files are reported explicitly.
- ✅ Invalid or unreadable scenario files are reported explicitly.
- ✅ JSON output includes structured degraded-confidence metadata.
- ✅ Harden the TypeScript SDK `stream()` path so spawn failures become normal SDK errors.
Why:
`stream()` does not currently install an `error` handler on the child process.
Impact:
Missing binaries or spawn failures can crash the consumer’s Node process instead of surfacing a catchable SDK error.
Evidence:
sdk-ts/src/index.ts, reference behavior in sdk-ts/src/index.ts
Done when:
- ✅ Missing binary errors are catchable.
- ✅ Permission-denied spawn errors are catchable.
- ✅ Normal streaming behavior still works.
- ✅ Consolidate duplicated run/test summary finalization and artifact-writing logic.
Why:
Similar mechanics are implemented in multiple places with slightly different behavior.
Impact:
This is how metadata drift and subtle CLI inconsistency happen.
Evidence:
src/runtime/engine.rs
Done when:
- ✅ Run/test/replay/shrink flows use the same summary and artifact conventions where applicable.
- ✅ Collision-policy handling is consistent.
- ✅ Fresh recorded trace paths normalize consistently across `trace verify`, `replay`, and `ci`.
- ✅ Direct trace selectors normalize consistently across artifacts/report/memory/profile follow-up commands.
- ✅ Regression coverage exists for the shared logic.
- ✅ Clean up checked-in runtime artifacts and profiling outputs at repo root.
Why:
The repo currently contains many trace/profile outputs alongside source files.
Impact:
This adds noise to audits and reviews and can interfere with tooling signal quality.
Done when:
- ✅ Incidental runtime outputs are ignored or moved out of the repo root.
- ✅ Intentional fixtures remain documented and clearly separated.
- ✅ Clarify or remove legacy config-loading pathways that no longer match CLI behavior.
Why:
The CLI now exits on config parse/read errors, but a fallback helper still exists that warns and silently defaults.
Impact:
This can confuse future contributors about the intended config contract.
Evidence:
src/platform/config.rs, src/main.rs
Done when:
- ✅ The intended config-loading contract is explicit.
- ✅ Library and CLI behavior are documented or unified.
- ✅ Make host-backed runtime operations respect Fozzy timeouts while the host call is actually in flight.
Why:
Host HTTP and host process steps block inside the step itself, while timeout checks happen only before and after step execution.
Impact:
A hung host process or slow endpoint can stall a run indefinitely despite `--timeout`.
Evidence:
src/runtime/engine.rs
Done when:
- ✅ Hung host proc calls time out promptly.
- ✅ Hung host HTTP calls time out promptly.
- ✅ Timeout behavior is recorded and replayed coherently.
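One generic way to bound a blocking call while it is in flight is to run it on a worker thread and wait on a channel with a deadline. The sketch below shows that idea using only the standard library; `call_with_timeout` and `HostCallError` are hypothetical names, and this is not necessarily how the engine enforces `--timeout`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HostCallError {
    TimedOut,
    WorkerFailed,
}

/// Runs `call` on a worker thread and stops waiting once `timeout` elapses.
/// Note: the worker itself is not cancelled here. A real host process would
/// also need to be killed, and a real HTTP request aborted, on timeout.
pub fn call_with_timeout<T, F>(timeout: Duration, call: F) -> Result<T, HostCallError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout, so ignore send errors.
        let _ = sender.send(call());
    });
    match receiver.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(HostCallError::TimedOut),
        // The sender was dropped without sending, for example because `call` panicked.
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(HostCallError::WorkerFailed),
    }
}
```

The design point is that the deadline is checked while the call is blocked, not only before and after the step. Recording the timeout as a step outcome would be what lets replay reproduce it coherently.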
- ✅ Enforce host stdout/stderr and HTTP body limits during streaming, not after full buffering.
Why:
Current size checks happen after the whole payload has already been loaded into memory.
Impact:
The limits look protective but do not actually prevent memory spikes.
Evidence:
src/runtime/engine.rs
Done when:
- ✅ Oversized host proc output is cut off safely during read.
- ✅ Oversized host HTTP bodies are aborted during read.
- ✅ The failure mode remains debuggable.
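A minimal sketch of enforcing a size limit during the read, rather than after buffering, might look like this. It relies on the standard `Read::take` adapter; `read_capped` and `CappedRead` are hypothetical names, and the real limits and error reporting live in the engine.

```rust
use std::io::{self, Read};

/// Hypothetical result type: either the full payload or a truncated prefix.
#[derive(Debug, PartialEq)]
pub enum CappedRead {
    Complete(Vec<u8>),
    LimitExceeded { partial: Vec<u8> },
}

/// Reads at most `limit` bytes into memory, whatever the source provides.
pub fn read_capped<R: Read>(reader: R, limit: usize) -> io::Result<CappedRead> {
    let mut buffer = Vec::new();
    // Allow one byte past the limit: it only signals that more data existed.
    let cap = (limit as u64).saturating_add(1);
    reader.take(cap).read_to_end(&mut buffer)?;
    if buffer.len() > limit {
        // Keep the prefix so the failure stays debuggable.
        buffer.truncate(limit);
        Ok(CappedRead::LimitExceeded { partial: buffer })
    } else {
        Ok(CappedRead::Complete(buffer))
    }
}
```

Memory use is bounded by `limit + 1` bytes, and returning the retained prefix in `LimitExceeded` gives the caller something concrete to put in the error report.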
- ✅ Rework process-global scenario caches so they do not serve stale content forever and do not grow without bound.
Why:
Parsed scenarios and compiled fuzz targets are cached globally for the lifetime of the process.
Impact:
Long-lived embeddings can observe stale scenarios and unbounded cache growth.
Evidence:
src/model/scenario.rs, src/modes/fuzz.rs
Done when:
- ✅ Cache lifecycle is explicit.
- ✅ Scenario edits can be observed correctly, or the cache semantics are deliberately bounded and documented.
- ✅ Repeated unique temp paths do not cause unbounded growth.
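One possible shape for an explicit cache lifecycle is a capacity-bounded cache whose entries are revalidated against a cheap file fingerprint. The sketch below is only an illustration of that idea: `BoundedCache`, `Fingerprint`, and the first-in-first-out eviction policy are assumptions, not the design used in src/model/scenario.rs or src/modes/fuzz.rs.

```rust
use std::collections::{HashMap, VecDeque};

/// Cheap change detector for a file. In this sketch the caller supplies it,
/// for example from the file's modification time and length.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Fingerprint {
    pub modified_ns: u128,
    pub len: u64,
}

pub struct BoundedCache<V> {
    capacity: usize,
    entries: HashMap<String, (Fingerprint, V)>,
    order: VecDeque<String>, // keys in insertion order, oldest first
}

impl<V: Clone> BoundedCache<V> {
    pub fn new(capacity: usize) -> Self {
        Self { capacity: capacity.max(1), entries: HashMap::new(), order: VecDeque::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Returns the cached value only if the fingerprint still matches;
    /// otherwise reloads, stores the new value, and evicts the oldest key
    /// when the cache would exceed its capacity.
    pub fn get_or_load(&mut self, key: &str, current: Fingerprint, load: impl FnOnce() -> V) -> V {
        if let Some((seen, value)) = self.entries.get(key) {
            if *seen == current {
                return value.clone();
            }
        }
        let value = load();
        let is_new_key = self
            .entries
            .insert(key.to_string(), (current, value.clone()))
            .is_none();
        if is_new_key {
            self.order.push_back(key.to_string());
            if self.order.len() > self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.entries.remove(&oldest);
                }
            }
        }
        value
    }
}
```

An edited scenario changes its fingerprint and is reloaded on the next lookup, while a stream of unique temp paths can never hold more than `capacity` entries.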
- ✅ Break up oversized control-center modules.
Why:
Several core files are extremely large and mix unrelated responsibilities.
Impact:
This increases review cost, onboarding cost, and the likelihood that fixes land in one path but not its sibling path.
Primary hotspots:
- src/runtime/engine.rs
- src/main.rs
- src/cmd/profile_cmd.rs
Done when:
- ✅ Host backend logic is separated cleanly.
- ✅ Trace/summary finalization helpers are separated cleanly.
- ✅ CLI dispatch has clearer module boundaries.
- ✅ Profile subcommands have clearer module boundaries.
- ✅ Profile render/export helpers are separated cleanly.
- ✅ Profile diff/explain/metric analysis helpers are separated cleanly.
- ✅ Profile trace/timeline/profile builders are separated cleanly.
- ✅ Profile loading/doctor/support helpers are separated cleanly.
- ✅ Profile shared types and schema shapes are separated cleanly.
- ✅ Profile tests are separated cleanly from production command wiring.
- ✅ Full/gate CLI workflows are separated cleanly.
- ✅ CLI bootstrap/strict/error helpers are separated cleanly.
- ✅ Runtime init/scaffolding helpers are separated cleanly.
- ✅ Runtime doctor/preflight helpers are separated cleanly.
- ✅ Runtime test aggregation and trace-writing helpers are separated cleanly.
- ✅ Runtime run/replay/shrink orchestration helpers are separated cleanly.
- ✅ Runtime engine is now focused on execution core responsibilities.
- ✅ Behavior stays stable under regression tests.
- ✅ Trace metadata consistency
- ✅ Explicit missing-path failures in `fozzy test`
- ✅ `init --config` path handling
- ✅ False-green distributed-scenario handling in `fozzy test`
- ✅ Distributed validation parity
- ✅ Recursive nested-step validation
- ✅ Topology mapper degraded-read reporting
- ✅ SDK `stream()` error handling
- ✅ Shared summary/artifact finalization cleanup
- ✅ Repo artifact cleanup
- ✅ Config-loading contract cleanup
- ✅ Scenario cache lifecycle redesign
- ✅ Host timeout enforcement
- ✅ Streaming resource limits for host I/O
- ✅ Large-module refactors
- ✅ New behavior is covered by focused regression tests.
- ✅ Runtime-impacting fixes are validated with Fozzy-first flows:
  - `fozzy doctor --deep --scenario <scenario> --runs 5 --seed <seed> --json`
  - `fozzy test --det --strict <scenarios...> --json`
  - `fozzy run ... --det --record <trace.fozzy> --json`
  - `fozzy trace verify <trace.fozzy> --strict --json`
  - `fozzy replay <trace.fozzy> --json`
  - `fozzy ci <trace.fozzy> --json`