A comprehensive framework for evaluating mathematical problem-solving capabilities of language models and generating high-quality training data for reinforcement learning (RL).
This suite provides tools for:
- Solution Generation & Verification: Generate and validate mathematical solutions with step-by-step reasoning
- Tournament Evaluation: Compare solutions through tournament-style competitions
- Progress Tracking: Monitor and analyze benchmark performance
- Dataset Processing: Tools for filtering and preparing mathematical problem datasets
- RL Training Data Generation: Create training examples for improving model reasoning and problem-solving
Scripts:
- benchmark.py: Main benchmarking script for evaluating model performance
- tournament_benchmark.py: Tournament-style evaluation of solutions
- process_dataset.py: Processes datasets to ensure high-quality examples with valid answers
- filtering.py: Filters dataset entries based on various criteria
- merge_json.py: Merges multiple JSON files into a single dataset
- shuffle_dataset.py: Shuffles and reassigns IDs to dataset examples
- adversarial_generator.py: Generates pairs of correct and incorrect solutions using multiple agents
- alternating_generator.py: Alternates between solver and adversarial agents to create training examples
Solution generation relies on a set of specialized agents:
- Analysis Agent: Provides problem analysis and approach strategies
- Step Agent: Generates individual solution steps
- Completion Agent: Completes partial solutions
- Judge Agent: Evaluates and compares solution quality
- Loki Agent: Generates deliberately incorrect but convincing solutions (a sketch of the resulting training pair follows this list)
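To make the adversarial data format concrete, here is a minimal sketch of what one generated training pair might look like. The `AdversarialPair` class and its field names are illustrative assumptions, not the suite's actual schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class AdversarialPair:
    """One RL training example: a correct solution and a convincing-but-wrong one.

    Field names are illustrative; the suite's actual schema may differ.
    """
    problem: str
    correct_solution: str       # produced by the solver/completion agents
    incorrect_solution: str     # produced by the Loki agent
    verified_answer: str        # numerically verified reference answer

pair = AdversarialPair(
    problem="Compute 2 + 2.",
    correct_solution="Step 1: 2 + 2 = 4. Answer: 4",
    incorrect_solution="Step 1: 2 + 2 = 22 by concatenation. Answer: 22",
    verified_answer="4",
)
print(json.dumps(asdict(pair), indent=2))
```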
Core features:
- Numeric Verification: Validates mathematical answers with configurable tolerance
- Step Analysis: Validates solution structure and step coherence
- Progress Tracking: Real-time statistics and performance monitoring
- Tournament Management: Organizes solution competitions with judging (see the sketch below)
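A minimal sketch of how a knockout tournament over candidate solutions could work. The `judge` callable (standing in for the auxiliary judging model) and the overall loop are assumptions about the approach, not the suite's actual implementation.

```python
import random

def tournament_round(solutions, judge):
    """Pair up candidate solutions and keep the judged winner of each pair.

    `solutions` is a list of solution strings; `judge(a, b)` is assumed to
    return the better of the two (e.g. by querying the judging model).
    """
    random.shuffle(solutions)
    winners = []
    for i in range(0, len(solutions) - 1, 2):
        winners.append(judge(solutions[i], solutions[i + 1]))
    if len(solutions) % 2 == 1:
        winners.append(solutions[-1])  # odd solution out gets a bye
    return winners

def run_tournament(solutions, judge):
    """Repeat rounds until a single champion solution remains."""
    while len(solutions) > 1:
        solutions = tournament_round(solutions, judge)
    return solutions[0]

# Toy usage with a trivial judge that simply prefers the longer write-up:
best = run_tournament(
    ["short proof", "a much longer, more detailed proof"],
    judge=lambda a, b: max(a, b, key=len),
)
```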
Requirements:
- Python 3.8+
- Key packages: asyncio, sympy, latex2sympy2, aiohttp, tqdm, datasets
- OpenRouter API key (for cloud model access)
The suite supports both local and cloud-based models through a flexible configuration system:
```python
config = BenchmarkConfig(
    main="LOCAL",          # Main solving model
    auxiliary="LOCAL_2",   # Auxiliary/judging model
    main_port=8000,        # Local model ports
    auxiliary_port=6000,
    max_concurrent=256,    # Concurrent processing
    best_of=40,            # Solutions per problem
    completions=35,        # Completion attempts
    tolerance=1e-6,        # Answer comparison tolerance
)
```
Supported model backends:
- Local deployments (ports 8000/6000)
- OpenRouter API models (requires API key)
- Multiple model types (Claude, GPT, Gemini, etc.)
Set your OpenRouter API key if you plan to use cloud models:
```bash
export OPENROUTER_API_KEY=your_key_here
```
Standard benchmark:
```bash
python benchmark.py --main LOCAL --auxiliary LOCAL_2 --max-concurrent 256
```
Tournament evaluation:
```bash
python tournament_benchmark.py --main LOCAL --auxiliary LOCAL_2 --best-of 40
```
Process and filter dataset:
```bash
python auxilary/process_dataset.py --dataset input_dataset --output-dir processed
```
Filter entries:
```bash
python auxilary/filtering.py input.json output.json --types light dark --success-rate-above 0.8
```
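For reference, a hedged sketch of what the filtering pass might do; the `type` and `success_rate` field names are assumptions about the JSON schema rather than the actual fields filtering.py reads.

```python
import json

def filter_entries(in_path, out_path, types=None, success_rate_above=None):
    """Keep entries whose type is in `types` and whose success rate exceeds the floor.

    The "type" and "success_rate" field names are assumptions about the schema.
    """
    with open(in_path) as f:
        entries = json.load(f)

    kept = []
    for entry in entries:
        if types is not None and entry.get("type") not in types:
            continue
        if success_rate_above is not None and entry.get("success_rate", 0.0) <= success_rate_above:
            continue
        kept.append(entry)

    with open(out_path, "w") as f:
        json.dump(kept, f, indent=2)

filter_entries("input.json", "output.json", types={"light", "dark"}, success_rate_above=0.8)
```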
Merge multiple files:
```bash
python auxilary/merge_json.py results_folder --output merged.json
```
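A minimal sketch of the merge step, assuming each input file contains a JSON list of examples.

```python
import json
from pathlib import Path

def merge_json(folder, output):
    """Concatenate every JSON list found in `folder` into one output file."""
    merged = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as f:
            merged.extend(json.load(f))
    with open(output, "w") as f:
        json.dump(merged, f, indent=2)

merge_json("results_folder", "merged.json")
```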
Shuffle dataset:
```bash
python auxilary/shuffle_dataset.py input.json output.json --seed 42
```
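The shuffle step amounts to a seeded permutation plus ID reassignment; a minimal sketch, with the `id` field name as an assumption:

```python
import json
import random

def shuffle_dataset(in_path, out_path, seed=42):
    """Shuffle examples deterministically and renumber their IDs."""
    with open(in_path) as f:
        examples = json.load(f)
    random.Random(seed).shuffle(examples)
    for new_id, example in enumerate(examples):
        example["id"] = new_id          # "id" field name is an assumption
    with open(out_path, "w") as f:
        json.dump(examples, f, indent=2)

shuffle_dataset("input.json", "output.json", seed=42)
```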
Answer verification:
- Step-by-step structure verification
- LaTeX mathematical notation support
- Numeric answer comparison with a configurable tolerance (see the sketch after this list)
- Multiple-choice problem detection
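A hedged sketch of tolerance-based numeric comparison built on sympy and latex2sympy2 (both listed under requirements); the suite's actual parsing and fallback logic may differ.

```python
from latex2sympy2 import latex2sympy  # parses a LaTeX string into a sympy expression

def answers_match(predicted_latex: str, reference_latex: str, tolerance: float = 1e-6) -> bool:
    """Return True when two LaTeX answers agree numerically within `tolerance`."""
    try:
        predicted = float(latex2sympy(predicted_latex).evalf())
        reference = float(latex2sympy(reference_latex).evalf())
    except Exception:
        # Fall back to a plain string comparison when parsing or evaluation fails.
        return predicted_latex.strip() == reference_latex.strip()
    return abs(predicted - reference) <= tolerance

print(answers_match(r"\frac{1}{2}", "0.5"))  # True
```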
Progress tracking (a minimal tracker sketch follows this list):
- Real-time statistics
- Success rate monitoring
- Judge accuracy tracking
- Tournament performance analysis
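As a rough illustration of the statistics involved, here is a deliberately simplified stand-in for the suite's tracker:

```python
from dataclasses import dataclass

@dataclass
class ProgressTracker:
    """Running benchmark statistics; a simplified stand-in, not the suite's tracker."""
    solved: int = 0
    attempted: int = 0
    judge_correct: int = 0
    judge_total: int = 0

    def record_problem(self, success: bool) -> None:
        self.attempted += 1
        self.solved += int(success)

    def record_judgement(self, correct: bool) -> None:
        self.judge_total += 1
        self.judge_correct += int(correct)

    @property
    def success_rate(self) -> float:
        return self.solved / self.attempted if self.attempted else 0.0

    @property
    def judge_accuracy(self) -> float:
        return self.judge_correct / self.judge_total if self.judge_total else 0.0
```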
Dataset handling:
- HuggingFace dataset integration (see the loading sketch after this list)
- Local dataset caching
- Filtered problem selection
- Progress persistence
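A minimal sketch of loading and caching a HuggingFace dataset with the `datasets` library; GSM8K is used purely as a stand-in, since the dataset and column names the suite actually uses are not shown here.

```python
from datasets import load_dataset

# Example only: GSM8K stands in here; the suite's actual dataset and columns may differ.
dataset = load_dataset("gsm8k", "main", split="train", cache_dir="./dataset_cache")

# Keep only problems that come with a non-empty reference answer.
filtered = dataset.filter(lambda ex: bool(ex["answer"].strip()))
print(f"{len(filtered)} problems cached locally with valid answers")
```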
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a Pull Request
MIT License - See LICENSE file for details