Add GSM-Infinite training environment #1336

Open
poofeth wants to merge 6 commits into PrimeIntellect-ai:main from poofeth:bounty/gsm-infinite-env

poofeth commented May 11, 2026

Summary

  • add a GSM-Infinite train/eval environment backed by the public InfiniAILab Hugging Face datasets
  • support medium/hard/symbolic dataset variants, context-length suffixes, operation splits, and optional train/eval example limits
  • normalize GSM-Infinite rows into Verifiers question/answer/info examples and score the value that follows the final Answer: marker with an exact-match reward

Bounty: https://algora.io/PrimeIntellect-ai/bounties/m4592Ap5jB1qm9yd

Submitted as the custom Algora bounty claim for GSM-Infinite.

Validation

  • uv run pytest tests/test_gsm_infinite_environment.py -q
  • uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
  • uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
  • git diff --check
  • live smoke-loaded one row from InfiniAILab/gsm_infinite_medium_0 / ops_2 and verified symbolic dataset naming resolves to InfiniAILab/gsm_infinite_symbolic_example_0

Note

Medium Risk
Introduces a new HF-backed environment and answer parsing/reward logic; main risk is dataset naming/split parameters and answer extraction edge cases affecting scoring correctness.

Overview
Adds a new installable gsm-infinite SingleTurn environment backed by InfiniAILab’s Hugging Face GSM-Infinite datasets, with configurable subset/context-length/operation split and optional train/eval example limits.

Implements row normalization into question/answer/info plus an exact-match reward based on the final Answer: ... in the model output, and registers the environment via a new pyproject.toml entry point; includes docs and focused tests for dataset naming, mapping, limiting, and reward parsing.

Reviewed by Cursor Bugbot for commit 76ab25b. Bugbot is set up for automated code reviews on this repo.

Author

poofeth commented May 11, 2026

Validation evidence for the bounty review:

$ uv run pytest tests/test_gsm_infinite_environment.py -q
.....                                                                    [100%]

$ uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
All checks passed!

$ uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
2 files already formatted

$ git diff --check
# no output

$ uv run python - <<'PY'
# loaded one example through load_gsm_infinite_dataset(num_examples=1)
# result: 1 ['question', 'answer', 'info']; answer: 2
# symbolic name check: InfiniAILab/gsm_infinite_symbolic_example_0
PY

This PR targets https://algora.io/PrimeIntellect-ai/bounties/m4592Ap5jB1qm9yd.

Author

poofeth commented May 11, 2026

Proactive follow-up in commit a9d549a:

  • added the gsm-infinite environment entry point
  • included README.md in the package build
  • listed gsm_infinite in environments/README.md

Validation after the update:

$ uv run pytest tests/test_gsm_infinite_environment.py -q
.....                                                                    [100%]

$ uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
All checks passed!

$ uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
2 files already formatted

$ git diff --check
# no output

Comment thread environments/gsm_infinite/gsm_infinite.py Outdated
Comment thread environments/gsm_infinite/gsm_infinite.py
Author

poofeth commented May 11, 2026

Addressed the Bugbot answer-extraction feedback in commit 203cfb1:

  • collapsed the unreachable duplicate Answer: regex path into one case-insensitive prefix parser
  • stopped re-extracting expected answers during reward normalization, which preserves multi-value list answers
  • added regression coverage for multi-value answers and prefixed numeric/symbolic extraction
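
For readers following along, a minimal sketch of the "normalize but do not re-extract" idea described above. The function names and the comma-separated form of a multi-value answer are assumptions for illustration; the PR's actual code is not shown in this thread.

```python
def normalize_answer(value: str) -> str:
    """Trim, lowercase, and collapse internal whitespace; nothing is re-parsed."""
    return " ".join(value.strip().lower().split())


def exact_answer_reward(predicted: str, expected: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0.

    `expected` is compared as stored. Running an extraction pass over it
    again could truncate a multi-value answer such as "3, 5" (assumed format).
    """
    return 1.0 if normalize_answer(predicted) == normalize_answer(expected) else 0.0
```

Under these assumptions, a prediction of "3,  5" matches an expected "3, 5", while a prediction of "3" alone does not.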

Validation:

$ uv run pytest tests/test_gsm_infinite_environment.py -q
.......                                                                  [100%]

$ uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
All checks passed!

$ uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
2 files already formatted

$ git diff --check
# no output

Comment thread environments/gsm_infinite/gsm_infinite.py Outdated
Comment thread environments/gsm_infinite/gsm_infinite.py Outdated
Author

poofeth commented May 11, 2026

Addressed the latest Bugbot follow-up in commit 4f732d8:

  • extract_answer now uses the final Answer: marker instead of the first marker, matching the prompt instruction to end with the final result
  • eval_subset, eval_context_length, and eval_operations now only fall back on None, preserving explicit falsy values
  • added regression coverage for intermediate/final answer markers and eval_operations=0
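
A rough sketch of the two behaviors listed above, for illustration only. The regex, the helper names, and the choice to keep only the first line after the marker are assumptions, not the code in this PR.

```python
import re
from typing import Optional

# Hypothetical marker pattern; the PR's actual regex is not shown in this thread.
ANSWER_MARKER = re.compile(r"answer\s*:", re.IGNORECASE)


def extract_final_answer(text: str) -> str:
    """Return the text after the last 'Answer:' marker, or the stripped text if none."""
    matches = list(ANSWER_MARKER.finditer(text))
    if not matches:
        return text.strip()
    tail = text[matches[-1].end():].strip()
    # Keep only the first line after the marker so trailing prose is ignored.
    return tail.splitlines()[0].strip() if tail else ""


def resolve_eval_operations(train_operations: int, eval_operations: Optional[int]) -> int:
    """Fall back to the train value only when the eval value is None.

    A truthiness check (`eval_operations or train_operations`) would
    silently discard an explicit 0.
    """
    return train_operations if eval_operations is None else eval_operations
```

With this sketch, a completion containing an intermediate "Answer: 4" followed by a closing "Answer: 7" yields "7", and an explicit eval_operations=0 is kept rather than replaced by the train value.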

Validation:

$ uv run pytest tests/test_gsm_infinite_environment.py -q
........                                                                 [100%]

$ uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
All checks passed!

$ uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
2 files already formatted

$ git diff --check
# no output

Comment thread environments/gsm_infinite/gsm_infinite.py Outdated
Author

poofeth commented May 11, 2026

Addressed the latest Bugbot symbolic dataset finding in commit e1e9a26:

  • changed symbolic dataset resolution from InfiniAILab/gsm_infinite_symbolic_example_{context_length} to the full dataset names InfiniAILab/gsm_infinite_symbolic_{context_length}
  • added regression coverage for the symbolic subset at both context_length=0 and context_length=8k
  • verified the Hugging Face dataset API currently resolves InfiniAILab/gsm_infinite_symbolic_0, InfiniAILab/gsm_infinite_symbolic_8k, and InfiniAILab/gsm_infinite_symbolic_16k
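
The naming rule after this fix can be sketched as below. The helper names are hypothetical, and the sketch assumes the hard subset follows the same pattern as the medium and symbolic names quoted in this thread, which was not verified here.

```python
from typing import Union

SUBSETS = ("medium", "hard", "symbolic")


def dataset_name(subset: str, context_length: Union[int, str]) -> str:
    """Build the Hugging Face repo id, e.g. InfiniAILab/gsm_infinite_symbolic_8k."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    # No "_example" infix for symbolic: that was the name corrected in e1e9a26.
    return f"InfiniAILab/gsm_infinite_{subset}_{context_length}"


def split_name(operations: int) -> str:
    """Build the operation split name, e.g. ops_2."""
    return f"ops_{operations}"
```

Passing context_length as either 0 or "8k" works because the value is only interpolated into the name.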

Validation after the fix:

$ uv run pytest tests/test_gsm_infinite_environment.py -q
........                                                                 [100%]

$ uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
All checks passed!

$ uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
2 files already formatted

$ git diff --check
# no output

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e1e9a26.

Comment thread environments/gsm_infinite/gsm_infinite.py Outdated
Author

poofeth commented May 11, 2026

Addressed the latest Bugbot answer_list=None fallback finding in commit 76ab25b:

  • row_to_example now falls back to solution when the answer_list key exists but its value is None or otherwise empty
  • added regression coverage so a row with answer_list = None extracts the expected answer from solution instead of producing "None"
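
A hedged sketch of the fallback described above. The field names answer_list and solution come from this thread; the helper name, the ", " join for multi-value answers, and the assumption that solution ends with an "Answer: ..." line are illustrative only.

```python
from typing import Any, Mapping


def expected_answer(row: Mapping[str, Any]) -> str:
    """Prefer a non-empty answer_list; otherwise parse the answer from solution."""
    answer_list = row.get("answer_list")
    # Guard on the value, not just the key: a present key holding None or []
    # must not be stringified into "None" or "".
    if isinstance(answer_list, (list, tuple)) and answer_list:
        return ", ".join(str(value) for value in answer_list)
    solution = str(row.get("solution") or "")
    # rpartition returns the whole string as the tail when the marker is absent.
    return solution.rpartition("Answer:")[2].strip()
```

The key point is that the check looks at the value of answer_list rather than whether the key is present.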

Validation:

$ uv run pytest tests/test_gsm_infinite_environment.py -q
.........                                                                [100%]

$ uv run ruff check environments/gsm_infinite tests/test_gsm_infinite_environment.py
All checks passed!

$ uv run ruff format --check environments/gsm_infinite tests/test_gsm_infinite_environment.py
2 files already formatted

$ git diff --check
# no output
