Add Spiral-Bench synthetic training environment #1334

Open
poofeth wants to merge 3 commits into PrimeIntellect-ai:main from poofeth:bounty/spiral-bench-train-set

poofeth commented May 11, 2026

Summary

  • add a spiral_bench environment inspired by Spiral-Bench safety conversations
  • include a deterministic prompt generator for uncontaminated synthetic training prompts
  • commit a 64-row HF Dataset-compatible JSONL sample with question, answer, and info fields
  • add focused tests covering generator determinism, sample dataset loading, and environment construction without API calls
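
For context, a minimal sketch of how a deterministic generator producing question/answer/info rows could look. The template pools, the names generate_rows and to_jsonl, the empty answer field, and the index key are assumptions for illustration only; the PR's actual generate_spiral_prompts.py is not reproduced in this thread.

```python
import json
import random

# Hypothetical template pools; the PR's real templates are not shown here.
PERSONAS = ["a night-shift nurse", "a retired engineer", "a college student"]
THEMES = [
    "hidden patterns in numbers",
    "a chatbot that seems conscious",
    "a secret message in the news",
]


def generate_rows(num_examples: int, seed: int = 0) -> list[dict]:
    """Return unique rows; the same (num_examples, seed) gives the same rows."""
    combos = [(persona, theme) for persona in PERSONAS for theme in THEMES]
    if num_examples > len(combos):
        raise ValueError(f"at most {len(combos)} unique prompts are available")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(combos)
    rows = []
    for index, (persona, theme) in enumerate(combos[:num_examples]):
        rows.append(
            {
                "question": f"I'm {persona} and I keep noticing {theme}. "
                "Am I onto something?",
                # Assumption: responses are judge-scored, so no reference answer.
                "answer": "",
                "info": {"source": "synthetic-uncontaminated-v1", "index": index},
            }
        )
    return rows


def to_jsonl(rows: list[dict]) -> str:
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)
```

Seeding a local random.Random and shuffling a fixed list of combinations keeps the output reproducible and free of duplicates without any retry loop.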

Bounty

Validation

  • uv run pytest tests/test_spiral_bench_environment.py -q
  • uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py
  • uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py
  • git diff --check

Note

Low Risk
Low risk: the PR primarily adds a new standalone environment package, sample data, and tests, without changing shared runtime logic; main risk is limited to correctness of the new judge-based reward parsing/scaling.

Overview
Adds a new spiral_bench single-turn environment with a judge-scored 0–10 safety rubric, including dataset loading (build_dataset) and a normalized reward function wired into vf.SingleTurnEnv.
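
The parsing/scaling risk called out above can be illustrated with a small standalone sketch. It assumes the judge replies with free text containing a number on the 0–10 scale; normalized_reward is a hypothetical name and this is not the PR's implementation or the verifiers API.

```python
import re

# Matches the first integer or decimal in the judge's reply, with optional sign.
_NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def normalized_reward(judge_text: str, max_score: float = 10.0) -> float:
    """Map a judge reply such as 'Score: 7' to a reward in [0, 1]."""
    match = _NUMBER_RE.search(judge_text)
    if match is None:
        return 0.0  # assumption: an unparseable reply earns no reward
    score = float(match.group())
    score = min(max(score, 0.0), max_score)  # clamp out-of-range scores
    return score / max_score
```

Taking the first number is a simplification: a reply like "2 of the 3 criteria were met, score 9" would parse as 2, so a stricter output format for the judge is worth considering.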

Includes a deterministic synthetic prompt generator and commits a 64-row JSONL sample dataset, plus docs and packaging (pyproject.toml) so the environment can be installed and regenerated. Updates environments/README.md to list spiral_bench, and adds targeted tests covering generator determinism/limits, dataset loading, and that load_environment() constructs without requiring an API key.

Reviewed by Cursor Bugbot for commit 0b9b606.

poofeth commented May 11, 2026

Validation evidence for the Spiral-Bench bounty PR:

  • uv run pytest tests/test_spiral_bench_environment.py -q passed: 3 tests.
  • uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py passed.
  • uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py passed.
  • git diff --check passed.

The sample dataset is generated from local deterministic templates (source=synthetic-uncontaminated-v1) rather than copied from the upstream Spiral-Bench benchmark prompts.
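
A check along these lines could confirm that every sample row carries the expected fields and source tag. This is a sketch only: load_rows and all_synthetic are hypothetical helpers, and storing the tag at info["source"] is an assumption, since the thread does not show the row layout.

```python
import json

REQUIRED_FIELDS = {"question", "answer", "info"}


def load_rows(lines) -> list[dict]:
    """Parse JSONL lines and verify each row has the required fields."""
    rows = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        row = json.loads(line)
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing {sorted(missing)}")
        rows.append(row)
    return rows


def all_synthetic(rows, tag: str = "synthetic-uncontaminated-v1") -> bool:
    # Assumption: the source tag is stored under info["source"].
    return all(row["info"].get("source") == tag for row in rows)
```

load_rows accepts any iterable of lines, so it works on an open file handle as well as on an in-memory list in tests.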

Comment thread environments/spiral_bench/spiral_bench.py Outdated
Comment thread environments/spiral_bench/README.md
Comment thread environments/spiral_bench/generate_spiral_prompts.py

poofeth commented May 11, 2026

Addressed the Bugbot review in commit 16a0b72:

  • fixed the judge call so JudgeRubric receives the original prompt/completion and sees the actual assistant response
  • added an upper-bound guard for synthetic generation (num_examples <= 600) to avoid duplicate-exhaustion loops
  • added spiral_bench to environments/README.md
  • added regression coverage for the generator bound
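
To illustrate why the bound matters: if a generator rejects duplicates until it has collected num_examples unique prompts, asking for more than the template space can supply never terminates. A minimal sketch, with hypothetical pools sized so that 10 * 10 * 6 = 600 (the PR's real pools and function names are not shown in this thread):

```python
import random

# Hypothetical pools; only the product (600) is taken from the comment above.
OPENERS = [f"opener-{i}" for i in range(10)]
TOPICS = [f"topic-{i}" for i in range(10)]
TONES = [f"tone-{i}" for i in range(6)]
MAX_UNIQUE = len(OPENERS) * len(TOPICS) * len(TONES)  # 600


def sample_unique(num_examples: int, seed: int = 0) -> list[tuple[str, str, str]]:
    """Draw unique (opener, topic, tone) triples by rejection sampling."""
    if not 0 <= num_examples <= MAX_UNIQUE:
        # Without this guard the loop below would spin forever, because no
        # further unseen combination exists once all 600 have been drawn.
        raise ValueError(f"num_examples must be between 0 and {MAX_UNIQUE}")
    rng = random.Random(seed)
    seen: set[tuple[str, str, str]] = set()
    ordered: list[tuple[str, str, str]] = []
    while len(ordered) < num_examples:
        combo = (rng.choice(OPENERS), rng.choice(TOPICS), rng.choice(TONES))
        if combo not in seen:
            seen.add(combo)
            ordered.append(combo)
    return ordered
```

Failing fast with a ValueError makes the limit visible to callers and is straightforward to cover with a regression test.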

Validation after the fixes:

$ uv run pytest tests/test_spiral_bench_environment.py -q
....                                                                     [100%]

$ uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py
All checks passed!

$ uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py
3 files already formatted

$ git diff --check
# no output

cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 16a0b72.

Comment thread environments/spiral_bench/spiral_bench.py Outdated
Comment thread environments/README.md
Comment thread environments/spiral_bench/pyproject.toml

poofeth commented May 11, 2026

Addressed the latest Bugbot follow-up in commit 0b9b606:

  • removed dead _completion_text helper after switching to JudgeRubric native prompt/completion handling
  • added README.md and pyproject.toml to Hatch build includes
  • listed spiral_bench under the judge-based evaluation section in environments/README.md

Validation:

$ uv run pytest tests/test_spiral_bench_environment.py -q
....                                                                     [100%]

$ uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py
All checks passed!

$ uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py
3 files already formatted

$ git diff --check
# no output


poofeth commented May 11, 2026
