
feat(fuzzing): add seed corpus preparation#555

Merged
danielcuthbert merged 3 commits into gadievron:main from Hinotoi-agent:feat/fuzz-seed-corpus
May 24, 2026


@Hinotoi-agent

Contributor

Summary

  • add a deterministic prepare_seed_corpus helper for fuzzing fixtures/examples
  • wire raptor_fuzzing.py --prepare-corpus with output, max-size, and lockfile controls
  • skip likely secret-bearing filenames and write a manifest describing copied/skipped inputs
  • add focused regression coverage for deterministic output, secret skips, size limits, lockfile handling, and idempotent reruns
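The behaviour listed above can be illustrated with a minimal sketch. It is not the PR's code: the function name `prepare_seeds_sketch`, the `SECRET_HINTS` patterns, the default size limit, the content-hash file naming, and the `manifest.json` layout are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical hints; the PR's actual skip list is not shown in this thread.
SECRET_HINTS = (".env", "secret", "credential", "id_rsa", ".pem", "token")


def looks_secret(name: str) -> bool:
    """Return True when a filename contains one of the secret-like hints."""
    lowered = name.lower()
    return any(hint in lowered for hint in SECRET_HINTS)


def prepare_seeds_sketch(source_dir: Path, out_dir: Path, max_size: int = 64 * 1024) -> dict:
    """Copy small, non-secret files into out_dir in a stable order and write a manifest."""
    source = source_dir.resolve()
    out = out_dir.resolve()
    out.mkdir(parents=True, exist_ok=True)
    copied, skipped = [], []
    # Sorting by relative POSIX path makes the result independent of walk order.
    # Files already inside the output directory are never treated as candidates.
    candidates = sorted(
        (p for p in source.rglob("*") if p.is_file() and out not in p.parents),
        key=lambda p: p.relative_to(source).as_posix(),
    )
    for path in candidates:
        rel = path.relative_to(source).as_posix()
        if looks_secret(path.name):
            skipped.append({"path": rel, "reason": "secret-like name"})
            continue
        if path.stat().st_size > max_size:
            skipped.append({"path": rel, "reason": "too large"})
            continue
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        # Content-derived names mean a rerun rewrites the same files with the same bytes.
        target = out / f"{digest[:16]}{path.suffix}"
        target.write_bytes(data)
        copied.append({"path": rel, "seed": target.name, "sha256": digest})
    manifest = {"copied": copied, "skipped": skipped}
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return manifest
```

Running the sketch twice against the same tree returns the same manifest, which is the property the idempotent-rerun tests are meant to pin down.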

Validation

  • uv run python -m pytest packages/fuzzing/tests/test_seed_corpus.py -q
  • uv run python -m pytest packages/fuzzing/tests -q
  • uv run ruff check --ignore F541 packages/fuzzing/seed_corpus.py packages/fuzzing/tests/test_seed_corpus.py packages/fuzzing/__init__.py raptor_fuzzing.py
  • uv run python -m compileall -q packages/fuzzing raptor_fuzzing.py
  • git diff --check
  • smoke-tested raptor_fuzzing.py --prepare-corpus ... --seed-out ... --no-sandbox twice against a temporary project to verify deterministic, non-recursive output and .env skipping

@danielcuthbert danielcuthbert self-assigned this May 20, 2026
@danielcuthbert danielcuthbert added the enhancement New feature or request label May 20, 2026
@danielcuthbert

Collaborator

Thanks, our robot helper @Hinotoi-agent. I'm looking now.

@danielcuthbert

danielcuthbert commented May 20, 2026

Collaborator

I have run this locally against the current main state and done a security pass over the changed code. You rock, our robot friend, for adding this. Thank you!!

The good bit: this is a small, focused change and the test coverage is decent. I did not see anything that looks deliberately malicious: no new network calls, no hidden subprocess execution, no credential access, and no import-time side effects. The branch fast-forwards cleanly onto current main, compileall passes, and packages/fuzzing/tests passes for me with 102 passed, 1 skipped.

I would hold off merging for one fix, though. It's a small one, but it needs sorting:

The new seed corpus helper resets generated output directories before preparing the corpus:

  • packages/fuzzing/seed_corpus.py removes generated kind directories such as json/, text/, xml/, yaml/, binary/, etc.
  • If an operator passes --seed-out as the project/source directory, or any broad directory that already contains folders with those names, RAPTOR will delete real user data before it scans.

I reproduced this safely in /private/tmp: an existing json/operator-data.json was deleted when source_dir == out_dir.

This does not look malicious to me, but it is a dangerous footgun. I think we should add a guard before merge:

  • refuse out_dir == source_dir
  • refuse out_dir being an ancestor of source_dir
  • probably refuse obviously dangerous output paths such as /, $HOME, and the repo root
  • add a regression test proving existing project directories are not deleted
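A guard along these lines could look like the sketch below. The names `check_output_dir` and `UnsafeOutputDir` are hypothetical, and treating any directory that contains a `.git` entry as a repo root is an assumption, not something taken from the PR. The key point is that validation runs on resolved paths and before anything is created or removed.

```python
from pathlib import Path


class UnsafeOutputDir(ValueError):
    """Raised when the requested output directory could clobber user data."""


def check_output_dir(source_dir: Path, out_dir: Path) -> Path:
    """Validate out_dir against source_dir and return the resolved output path."""
    source = source_dir.resolve()
    out = out_dir.resolve()
    if out == source:
        raise UnsafeOutputDir(f"output directory equals source directory: {out}")
    if out in source.parents:
        raise UnsafeOutputDir(f"output directory is an ancestor of the source: {out}")
    if out == Path(out.anchor):
        raise UnsafeOutputDir(f"output directory is the filesystem root: {out}")
    if out == Path.home().resolve():
        raise UnsafeOutputDir(f"output directory is the home directory: {out}")
    # Assumption: a directory holding a .git entry is treated as a repository root.
    if (out / ".git").exists():
        raise UnsafeOutputDir(f"output directory looks like a repository root: {out}")
    return out
```

A caller would run this check first and only reset the generated kind directories under the path it returns, so a rejected layout leaves files such as json/operator-data.json untouched.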

Once that is fixed, I am comfortable with the direction of the PR. It adds useful deterministic seed preparation, but the output path safety needs tightening first.

@Hinotoi-agent

Contributor Author

Thanks for the careful review and the local repro. I tightened the seed corpus output guard in the PR branch.

What changed:

  • prepare_seed_corpus() now validates the resolved output directory before creating or resetting anything.
  • It refuses out_dir == source_dir.
  • It refuses out_dir being an ancestor of the source directory.
  • It refuses broad/dangerous targets such as filesystem root, the operator home directory, and repository roots.
  • Added regressions proving existing json/operator-data.json content is not deleted when the unsafe source/output layouts are attempted.
  • Kept the existing supported case where the output directory is a dedicated child under the source tree; that path is already excluded from candidate walking and remains useful for project-local generated artifacts.
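The last point, keeping a dedicated output child out of the candidate walk, can be sketched as follows. This is an illustration only: `iter_candidates` is a hypothetical name, and the PR's own walking code is not shown in this thread.

```python
import os
from pathlib import Path


def iter_candidates(source_dir: Path, out_dir: Path):
    """Yield files under source_dir in sorted order, never descending into out_dir."""
    source = source_dir.resolve()
    out = out_dir.resolve()
    for root, dirnames, filenames in os.walk(source):
        root_path = Path(root)
        # Prune in place: a top-down os.walk skips directories removed from dirnames.
        dirnames[:] = sorted(d for d in dirnames if root_path / d != out)
        for name in sorted(filenames):
            yield root_path / name
```

Pruning the directory list, rather than filtering files afterwards, means previously generated seeds are never read back in as inputs, which keeps reruns from growing the corpus recursively.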

Validation run locally:

  • uv run python -m pytest packages/fuzzing/tests/test_seed_corpus.py -q → 9 passed
  • uv run python -m pytest packages/fuzzing/tests -q → 106 passed, 1 skipped
  • uv run python -m compileall -q packages/fuzzing raptor_fuzzing.py
  • uv run ruff check packages/fuzzing/seed_corpus.py packages/fuzzing/tests/test_seed_corpus.py
  • git diff --check
  • added-line scans for obvious secret literals and dangerous execution sinks on the touched files

I also noticed ruff check raptor_fuzzing.py still reports pre-existing F541 warnings outside this patch surface, so I kept this follow-up scoped to the seed-corpus helper and its tests.

@danielcuthbert

Collaborator

Thanks for turning this round.

I re-tested the updated PR against current main and I’m happy with the fix. The output directory guard now blocks the risky cases we discussed before anything gets created or deleted: source-as-output, ancestor dirs, repo roots, home, and filesystem root. I also reproduced the original failure case again and it now fails safely with the existing file left untouched.

Validation on my side:

  • clean merge into current main
  • packages/fuzzing/tests/test_seed_corpus.py passes
  • full packages/fuzzing/tests passes on the merge result
  • compileall passes
  • git diff --check passes
  • no signs of malicious behaviour in the changed code

So from me this is good to merge. Sensible fix, decent tests, and no drama. I, for one, am enjoying working with our new robot overlords when they take feedback this well.

@danielcuthbert danielcuthbert merged commit 8abb53c into gadievron:main May 24, 2026
27 checks passed
