Add Spiral-Bench synthetic training environment by poofeth · Pull Request #1334 · PrimeIntellect-ai/verifiers

poofeth · 2026-05-11T05:25:11Z

Summary

add a spiral_bench environment inspired by Spiral-Bench safety conversations
include a deterministic prompt generator for uncontaminated synthetic training prompts
commit a 64-row HF Dataset-compatible JSONL sample with question, answer, and info fields
add focused tests covering generator determinism, sample dataset loading, and environment construction without API calls

Bounty

Algora: https://algora.io/PrimeIntellect-ai/bounties/gdFDkFi3PzcwFndN
Claiming the Spiral-Bench + train set bounty.

Validation

uv run pytest tests/test_spiral_bench_environment.py -q
uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py
uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py
git diff --check

Note

Low Risk
Low risk: the PR primarily adds a new standalone environment package, sample data, and tests, without changing shared runtime logic; main risk is limited to correctness of the new judge-based reward parsing/scaling.

Overview
Adds a new spiral_bench single-turn environment with a judge-scored 0–10 safety rubric, including dataset loading (build_dataset) and a normalized reward function wired into vf.SingleTurnEnv.
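
The parsing and scaling step flagged above as the main risk could look roughly like the following. This is a minimal sketch, not the PR's code: the helper name `normalize_judge_score`, the regex, and the choice to return 0.0 for unparseable output are all assumptions made for illustration.

```python
import re

# Hypothetical helper; the PR's real parsing logic is not shown in this thread.
_NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def normalize_judge_score(judge_text: str, max_score: float = 10.0) -> float:
    """Take the first number in the judge reply and scale it into [0, 1]."""
    match = _NUMBER_RE.search(judge_text)
    if match is None:
        # Assumption: an unparseable judge reply earns no reward.
        return 0.0
    score = float(match.group(0))
    # Clamp so an out-of-range judge reply cannot push the reward outside [0, 1].
    score = min(max(score, 0.0), max_score)
    return score / max_score


if __name__ == "__main__":
    print(normalize_judge_score("Score: 7"))
    print(normalize_judge_score("I cannot rate this."))
```

Clamping before dividing keeps the reward bounded even when the judge model ignores the 0–10 instruction.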

Includes a deterministic synthetic prompt generator and commits a 64-row JSONL sample dataset, plus docs and packaging (pyproject.toml) so the environment can be installed and regenerated. Updates environments/README.md to list spiral_bench, and adds targeted tests covering generator determinism/limits, dataset loading, and that load_environment() constructs without requiring an API key.
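
For readers unfamiliar with the row layout, a JSONL sample with question, answer, and info fields can be validated with the standard library alone. This is a sketch under stated assumptions: `load_jsonl_rows` is a hypothetical stand-in for the PR's `build_dataset`, and the example row contents (including placing `source` inside `info`) are invented for illustration.

```python
import json
from typing import Any, Iterable

REQUIRED_FIELDS = ("question", "answer", "info")


def load_jsonl_rows(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSONL lines and check that each row has the expected fields."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        rows.append(row)
    return rows


# Hypothetical rows; the real sample file is not reproduced in this thread.
sample_lines = [
    json.dumps({"question": "example prompt 1", "answer": "", "info": {"source": "synthetic-uncontaminated-v1"}}),
    "",
    json.dumps({"question": "example prompt 2", "answer": "", "info": {"source": "synthetic-uncontaminated-v1"}}),
]
rows = load_jsonl_rows(sample_lines)
print(len(rows))
```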

Reviewed by Cursor Bugbot for commit 0b9b606. Bugbot is set up for automated code reviews on this repo.

poofeth · 2026-05-11T05:25:23Z

Validation evidence for the Spiral-Bench bounty PR:

uv run pytest tests/test_spiral_bench_environment.py -q passed: 3 tests.
uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py passed.
uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py passed.
git diff --check passed.

The sample dataset is generated from local deterministic templates (source=synthetic-uncontaminated-v1) rather than copied from the upstream Spiral-Bench benchmark prompts.
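
One way to get deterministic prompts from local templates is to enumerate every template combination and shuffle with a seeded RNG. The sketch below illustrates that idea only; the template lists, the function name `generate_prompts`, and the seed default are hypothetical and not taken from `generate_spiral_prompts.py`.

```python
import itertools
import random

# Hypothetical template fragments, invented for this sketch.
OPENERS = [
    "I keep noticing",
    "Lately I have been convinced",
    "Can you confirm",
]
TOPICS = [
    "that numbers on clocks are sending me messages.",
    "that my neighbors coordinate their schedules around me.",
    "that a song lyric was written about my life.",
]


def generate_prompts(num_examples: int, seed: int = 0) -> list[str]:
    """Return the same prompts for the same (num_examples, seed) pair."""
    combos = [f"{opener} {topic}" for opener, topic in itertools.product(OPENERS, TOPICS)]
    # A private Random instance avoids depending on global RNG state.
    rng = random.Random(seed)
    rng.shuffle(combos)
    return combos[:num_examples]


if __name__ == "__main__":
    for prompt in generate_prompts(3, seed=1):
        print(prompt)
```

Because the full combination list is built before shuffling, the output contains no duplicates and never depends on wall-clock time or global random state.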

poofeth · 2026-05-11T05:42:51Z

Addressed the Bugbot review in commit 16a0b72:

fixed the judge call so JudgeRubric receives the original prompt/completion and sees the actual assistant response
added an upper-bound guard for synthetic generation (num_examples <= 600) to avoid duplicate-exhaustion loops
added spiral_bench to environments/README.md
added regression coverage for the generator bound
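
The duplicate-exhaustion problem behind the generator bound can be illustrated with a small rejection-sampling loop: once every unique combination has been drawn, a request for more examples would spin forever. The sketch below is an assumption-laden illustration, not the PR's implementation; the template lists and `sample_unique` are hypothetical, and the bound here is the size of the toy template space rather than the PR's 600.

```python
import random

# Hypothetical template space with 3 * 2 = 6 unique combinations.
STYLES = ["curious", "anxious", "insistent"]
THEMES = ["patterns in numbers", "hidden messages"]
MAX_UNIQUE = len(STYLES) * len(THEMES)


def sample_unique(num_examples: int, seed: int = 0) -> list[tuple[str, str]]:
    """Draw unique (style, theme) pairs, refusing requests that cannot finish."""
    if not 0 <= num_examples <= MAX_UNIQUE:
        # Without this guard the loop below never terminates for large requests.
        raise ValueError(f"num_examples must be between 0 and {MAX_UNIQUE}")
    rng = random.Random(seed)
    seen: set[tuple[str, str]] = set()
    picked: list[tuple[str, str]] = []
    while len(picked) < num_examples:
        pair = (rng.choice(STYLES), rng.choice(THEMES))
        if pair not in seen:  # reject duplicates and draw again
            seen.add(pair)
            picked.append(pair)
    return picked


if __name__ == "__main__":
    print(sample_unique(4))
```

A regression test for such a guard only needs to assert that a request one past the bound raises instead of hanging.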

Validation after the fixes:

$ uv run pytest tests/test_spiral_bench_environment.py -q
....                                                                     [100%]

$ uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py
All checks passed!

$ uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py
3 files already formatted

$ git diff --check
# no output

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 16a0b72.

poofeth · 2026-05-11T05:53:45Z

Addressed the latest Bugbot follow-up in commit 0b9b606:

removed dead _completion_text helper after switching to JudgeRubric native prompt/completion handling
added README.md and pyproject.toml to Hatch build includes
listed spiral_bench under the judge-based evaluation section in environments/README.md

Validation:

$ uv run pytest tests/test_spiral_bench_environment.py -q
....                                                                     [100%]

$ uv run ruff check environments/spiral_bench tests/test_spiral_bench_environment.py
All checks passed!

$ uv run ruff format --check environments/spiral_bench tests/test_spiral_bench_environment.py
3 files already formatted

$ git diff --check
# no output

poofeth · 2026-05-11T09:53:20Z

/claim https://algora.io/PrimeIntellect-ai/bounties/gdFDkFi3PzcwFndN

Add Spiral-Bench synthetic training environment

a745c88

cursor Bot reviewed May 11, 2026

Comment thread environments/spiral_bench/spiral_bench.py Outdated

Comment thread environments/spiral_bench/README.md

Comment thread environments/spiral_bench/generate_spiral_prompts.py

Address Spiral-Bench review feedback

16a0b72

cursor Bot reviewed May 11, 2026

Comment thread environments/spiral_bench/spiral_bench.py Outdated

Comment thread environments/README.md

Comment thread environments/spiral_bench/pyproject.toml

Polish Spiral-Bench package metadata

0b9b606

poofeth wants to merge 3 commits into PrimeIntellect-ai:main from poofeth:bounty/spiral-bench-train-set
