
Teaching Language Models to Critique via Reinforcement Learning

Zhihui Xie*¹, Jie Chen*², Liyu Chen², Weichao Mao², Jingjing Xu², Lingpeng Kong¹

¹The University of Hong Kong, ²ByteDance Seed

📃 Paper • 🔭 Project Page • 🤗 Model

We propose CTRL, a framework that trains LLMs to critique without human supervision, enabling them to supervise stronger models and to achieve test-time scaling through iterative critique-revision.


🎯 Key Results


  • Test-time Scaling: Qwen2.5-Coder-32B-Ins with the CTRL critic achieves a 106.1% relative improvement in Pass@1 on CodeContests through multi-turn critique-revision, while maintaining low error rates across iterations

  • Model-Agnostic: The CTRL critic improves performance across different task models (a 23.5% improvement with GPT-4o) and benchmarks (CodeContests, LiveCodeBench, MBPP+)

  • Critics-as-RMs: The CTRL critic reaches 64.3% accuracy on JudgeBench as a generative reward model, competitive with stronger models such as Claude-3.5-Sonnet

See our project page for detailed analysis and more results.

📦 Installation

You can install the dependencies with the following command:

pip install -r requirements.txt

For evaluating code correctness, we use SandboxFusion to deploy the code sandbox:

docker run -it -p 8080:8080 vemlp-cn-beijing.cr.volces.com/preset-images/code-sandbox:server-20241204

Then point the client at your deployment (for the Docker command above, that is http://localhost:8080):

export SANDBOX_FUSION_ENDPOINT="your_sandbox_endpoint"
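
Before launching training or evaluation, it can help to smoke-test the sandbox. The sketch below assumes SandboxFusion exposes a `POST /run_code` route accepting a JSON body with `code` and `language` fields; check the SandboxFusion documentation for your image if your deployment differs.

```python
# Minimal sandbox smoke test (assumes SandboxFusion's POST /run_code route).
import os

import requests

endpoint = os.environ["SANDBOX_FUSION_ENDPOINT"]
resp = requests.post(
    f"{endpoint}/run_code",
    json={"code": "print(1 + 1)", "language": "python"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the run result's stdout should contain "2"
```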

πŸ“ Data Preparation

We use the TACO dataset for training. Preprocess the data using:

python scripts/data/preprocess_taco.py
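
The sketch below illustrates the kind of filtering such a preprocessing step typically performs; the authoritative logic lives in `scripts/data/preprocess_taco.py`. The field names (`question`, `solutions`, `input_output`) follow the BAAI/TACO release on Hugging Face, and the output path `taco_train.jsonl` is only illustrative.

```python
# Illustrative TACO preprocessing sketch; see scripts/data/preprocess_taco.py
# for the real logic.
import json

from datasets import load_dataset

ds = load_dataset("BAAI/TACO", split="train", trust_remote_code=True)

with open("taco_train.jsonl", "w") as f:
    for row in ds:
        solutions = json.loads(row["solutions"] or "[]")  # JSON-encoded list
        tests = json.loads(row["input_output"] or "{}")   # unit tests
        if not solutions or not tests:
            continue  # keep only problems with reference solutions and tests
        f.write(json.dumps({
            "problem": row["question"],
            "solutions": solutions,
            "tests": tests,
        }) + "\n")
```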

🚀 Training

Our training process consists of two stages: (1) SFT on synthetic critique data guided by execution feedback, and (2) RL with verifiable rewards using GRPO.

Stage I: SFT

We start by generating synthetic data with the following command:

bash examples/gen_taco.sh
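
Conceptually, this step pairs candidate solutions with sandbox execution feedback and asks a model for critiques conditioned on the observed outcome. The sketch below is hypothetical: `run_in_sandbox` and `generate` stand in for a sandbox client and an LLM call, and the prompt wording is illustrative, not the script's actual template.

```python
# Hypothetical sketch of execution-guided critique synthesis; the real
# pipeline is driven by examples/gen_taco.sh.
def make_critique_example(problem, solution, tests, run_in_sandbox, generate):
    result = run_in_sandbox(solution, tests)  # pass/fail plus error details
    hint = ("The solution passed all tests."
            if result.passed
            else f"The solution failed: {result.feedback}")
    prompt = (f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
              "Write a critique of this solution and suggest improvements.")
    # Execution feedback guides generation but is kept out of the SFT prompt.
    critique = generate(prompt, hidden_context=hint)
    return {"prompt": prompt, "critique": critique}
```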

Then, fine-tune the model using the following command:

bash examples/train_sft.sh
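
For reference, one plausible layout of a single SFT record is sketched below; the actual schema is whatever the training code behind `examples/train_sft.sh` expects.

```python
# Hypothetical chat-style layout of one critique SFT record.
sft_example = {
    "messages": [
        {"role": "user", "content": "Problem: ...\n\nSolution: ..."},     # critique prompt
        {"role": "assistant", "content": "The solution mishandles ..."},  # target critique
    ],
}
```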

Stage II: RL

Train with GRPO using verifiable rewards from sandbox execution:

bash examples/train_rl.sh
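
The reward here is verifiable: a critique is scored by whether the revised program passes the sandbox tests. A minimal sketch of GRPO's group-relative advantage on such binary rewards is shown below (the actual reward shaping and policy update live in the training code).

```python
# Sketch of GRPO's group-relative advantage over binary sandbox rewards.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) pass/fail rewards in {0, 1}."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])  # one prompt, four rollouts
print(grpo_advantages(rewards))  # positive for passing rollouts, negative otherwise
```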

📋 Evaluation

We evaluate the model with the following command (e.g., for CodeContests):

bash examples/eval_codecontests.sh
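
At evaluation time, the critic drives an iterative critique-revision loop. The sketch below is a hypothetical rendering of that loop: `generate`, `critic`, and the "no issues" stop signal are stand-ins, not the evaluation script's actual interface.

```python
# Hypothetical multi-turn critique-revision loop for test-time scaling.
def critique_revise(problem, generate, critic, max_turns=3):
    solution = generate(problem)
    for _ in range(max_turns):
        critique = critic(problem, solution)
        if "no issues" in critique.lower():  # illustrative stop condition
            break
        solution = generate(problem, critique=critique)  # revise with feedback
    return solution
```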

📚 Citation

If you find this project useful, please consider citing:

@article{xie2025teaching,
  title={Teaching Language Models to Critique via Reinforcement Learning},
  author={Xie, Zhihui and Chen, Jie and Chen, Liyu and Mao, Weichao and Xu, Jingjing and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2502.03492},
  year={2025}
}

💐 Acknowledgement

This project builds upon several amazing open-source projects, including SandboxFusion for sandboxed code execution.

πŸ“ License

This project is licensed under the Apache 2.0 license.
