This repository implements a scalable approach to code verification using outcome reward models (ORMs) and efficient pruning strategies. The system enables high-throughput code verification by trading a small amount of accuracy for speed via a novel filtering approach. Key features include:
- Training and evaluating code verification models
- Multiple scoring methods (binary logit, classification, reward modeling)
- Comprehensive evaluation across multiple benchmark datasets
- Efficient pruning strategies for scalable verification
- Support for various transformer architectures
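As a rough sketch of how the three scoring methods can each turn model outputs into a single verifier score (the function names below are illustrative, not the actual API of `src/scoring.py`):

```python
import math

def binary_logit_score(pass_logit: float) -> float:
    """Binary logit: sigmoid of a single 'pass' logit -> probability the solution is correct."""
    return 1.0 / (1.0 + math.exp(-pass_logit))

def classification_score(pass_logit: float, fail_logit: float) -> float:
    """Classification: two-way softmax over pass/fail class logits."""
    m = max(pass_logit, fail_logit)
    e_pass = math.exp(pass_logit - m)
    e_fail = math.exp(fail_logit - m)
    return e_pass / (e_pass + e_fail)

def reward_model_score(scalar_reward: float) -> float:
    """Reward modeling: the model emits an unnormalized scalar; rank solutions by it directly."""
    return scalar_reward

# However the score is produced, verification typically picks the
# highest-scoring candidate (best-of-n selection):
candidates = {"sol_a": 2.0, "sol_b": -1.0, "sol_c": 0.5}
best = max(candidates, key=candidates.get)
```

All three reduce to "score each candidate, then rank"; they differ only in which model head produces the score.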
```
.
├── configs/             # Configuration files for experiments and evaluation
│   ├── evaluation/      # Evaluation configs for running base models
│   ├── experiments/     # Full experiment configs
│   ├── model/           # Configs for different architectures
│   ├── preprocessing/   # Prompting configs
│   ├── scoring/         # Configs for different scoring methods
│   ├── suite/           # Suite configurations for evaluation
│   └── trainer/         # Training configs
├── scripts/
│   ├── data/            # Data processing and generation
│   └── exec_trials/     # Execution trial implementations
├── src/
│   ├── evaluation/      # Evaluation suite and benchmarks
│   ├── modeling.py      # Model architectures
│   ├── preprocessing.py # Data preparation
│   ├── scoring.py       # Solution scoring
│   └── training/        # Training pipeline
└── figs/                # Project figures and diagrams
```
For detailed information about specific components, see the corresponding sections below.

To get started, clone the repository:

```bash
git clone https://github.com/SprocketLab/orm-code-verifier.git
cd orm-code-verifier
```

The dependencies for training and evaluation can be installed with:

```bash
pip install -r requirements.txt
```

Additional commands to run:
```bash
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git scratch/bigcode --depth=1
cd scratch/bigcode
pip install -e .
cd ..
pip install flash-attn --no-build-isolation
```

To build the training data, run:

```bash
python scripts/make_train_data.py \
    --num_proc=4 \
    --black_format \
    --require_pf
```

This will format the training data and save it to disk so it can be loaded faster. Then you can run:
```bash
bash scripts/experiment.sh rm_qsol qwen25-coder-1_5b {DEVICE} {SEED} \
    --precision=bf16 \
    --num_workers=4 \
    --real_batch_size=64 \
    --overwrite \
    --batch_size=2 \
    --val_batch_tokens=12000 \
    --gradient_checkpointing=True \
    --eval_batch_tokens=200000
```

Notes:
- `rm_qsol` is the experiment to run; see the other experiment configs for different setups.
- `qsol` is just the formatting setup for the sequences, located in the preprocessing config directory.
- We use the seeds of 1, 1999, and 2024 for our experiments in the paper.
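Since the paper uses three fixed seeds, the runs can be scripted; a sketch (the leading `echo` just prints each command so it can be inspected before launching, and the device and extra flags are assumptions):

```shell
# Sketch: print one training command per paper seed (1, 1999, 2024).
# Drop the `echo` to actually launch the runs; DEVICE=0 is an assumption.
DEVICE=0
for SEED in 1 1999 2024; do
    echo bash scripts/experiment.sh rm_qsol qwen25-coder-1_5b "$DEVICE" "$SEED" --precision=bf16
done
```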
The system supports three types of execution trials for comprehensive evaluation:
- Execution Timing: Measure performance and resource usage
- Syntax Validation: Check code correctness
- Linting Checks: Ensure code quality
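To make the trial types concrete, here is a minimal illustration of syntax validation and execution timing, assuming candidate solutions are Python source strings (a sketch, not the `scripts/exec_trials` implementation):

```python
import ast
import subprocess
import sys
import time
from typing import Optional

def syntax_trial(code: str) -> bool:
    """Syntax validation: does the candidate solution parse as Python?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def timing_trial(code: str, timeout: float = 5.0) -> Optional[float]:
    """Execution timing: wall-clock seconds for one run in a subprocess,
    or None if the run fails or exceeds the timeout."""
    start = time.perf_counter()
    try:
        subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout, check=True, capture_output=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return time.perf_counter() - start
```

Running untrusted candidate code this way is unsafe without sandboxing; see the security notes in the Execution Trials Documentation.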
To run the strongest verifier:
```bash
bash scripts/exec_trials/trial.sh code_contests qc-inst-7b t1.0_n128 32 outputs/ftp32_code_contests 5
```

Key configuration parameters:
- Temperature and sample size (e.g., t1.0_n128 = temperature 1.0, 128 samples)
- Number of parallel workers
- Test execution timeouts
- Maximum tests per problem
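The sampling tag can also be decoded programmatically; a small hypothetical parser (not part of the repo's API) for tags of the form `t{temperature}_n{samples}`:

```python
import re
from typing import Tuple

def parse_generation_tag(tag: str) -> Tuple[float, int]:
    """Parse a tag like 't1.0_n128' into (temperature, num_samples)."""
    m = re.fullmatch(r"t(?P<temp>\d+(?:\.\d+)?)_n(?P<n>\d+)", tag)
    if m is None:
        raise ValueError(f"unrecognized generation tag: {tag!r}")
    return float(m.group("temp")), int(m.group("n"))
```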
For detailed configuration options and security considerations, see the Execution Trials Documentation.
The system provides multiple evaluation configurations, each serving different verification purposes:
- Base (zero_shot): Basic verification without additional checks
- Syntax (zero_shot_syntax): Focuses on syntactic correctness
- Lint (zero_shot_lint): Enforces code style and quality
- N Test:
To run evaluation with a specific configuration:
```bash
accelerate launch \
    --gpu_ids 0 \
    --mixed_precision=bf16 \
    --config_file=configs/accelerate.yaml \
    evaluate_model.py \
    --precision=bf16 \
    --device=0 \
    --group={WANDB_GROUP_NAME} \
    --overwrite \
    --max_tokens_per_batch=6000 \
    --seed={SEED} \
    --num_workers=16 \
    qc-inst-7b \
    t1.0_n128 \
    checkpoint \
    {CHECKPOINT_PATH} \
    zero_shot_3s10t
```