
Teaching Language Models to Critique via Reinforcement Learning

Zhihui Xie*¹, Jie Chen*², Liyu Chen², Weichao Mao², Jingjing Xu², Lingpeng Kong¹

¹The University of Hong Kong, ²ByteDance Seed

📃 Paper • 🔭 Project Page • 🤗 Model

We propose CTRL, a framework that trains LLMs to critique without human supervision, enabling them to supervise stronger models and to achieve test-time scaling through iterative critique-revision.


🎯 Key Results


  • Test-time Scaling: Qwen2.5-Coder-32B-Ins with the CTRL critic achieves a 106.1% relative improvement in Pass@1 on CodeContests through multi-turn critique-revision, while maintaining low error rates across iterations

  • Model-Agnostic: The CTRL critic improves performance across different task models (a 23.5% improvement with GPT-4o) and benchmarks (CodeContests, LiveCodeBench, MBPP+)

  • Critics-as-RMs: The CTRL critic reaches 64.3% accuracy on JudgeBench as a generative reward model, competitive with stronger models such as Claude-3.5-Sonnet

See our project page for detailed analysis and more results.

📦 Installation

You can install the dependencies with the following command:

pip install -r requirements.txt

For evaluating code correctness, we use SandboxFusion to deploy the code sandbox:

docker run -it -p 8080:8080 vemlp-cn-beijing.cr.volces.com/preset-images/code-sandbox:server-20241204

Then point the client at your deployment (for the Docker command above, that is http://localhost:8080):

export SANDBOX_FUSION_ENDPOINT="your_sandbox_endpoint"
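
Before launching training or evaluation, it can help to smoke-test the sandbox. The sketch below assumes SandboxFusion exposes a `POST /run_code` route accepting a JSON body with `code` and `language` fields; check the SandboxFusion documentation for your image if your deployment differs.

```python
# Minimal sandbox smoke test (assumes SandboxFusion's POST /run_code route).
import os

import requests

endpoint = os.environ["SANDBOX_FUSION_ENDPOINT"]
resp = requests.post(
    f"{endpoint}/run_code",
    json={"code": "print(1 + 1)", "language": "python"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the run result's stdout should contain "2"
```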

πŸ“ Data Preparation

We use the TACO dataset for training. Preprocess the data using:

python scripts/data/preprocess_taco.py
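
The sketch below illustrates the kind of filtering such a preprocessing step typically performs; the authoritative logic lives in `scripts/data/preprocess_taco.py`. The field names (`question`, `solutions`, `input_output`) follow the BAAI/TACO release on Hugging Face, and the output path `taco_train.jsonl` is only illustrative.

```python
# Illustrative TACO preprocessing sketch; see scripts/data/preprocess_taco.py
# for the real logic.
import json

from datasets import load_dataset

ds = load_dataset("BAAI/TACO", split="train", trust_remote_code=True)

with open("taco_train.jsonl", "w") as f:
    for row in ds:
        solutions = json.loads(row["solutions"] or "[]")  # JSON-encoded list
        tests = json.loads(row["input_output"] or "{}")   # unit tests
        if not solutions or not tests:
            continue  # keep only problems with reference solutions and tests
        f.write(json.dumps({
            "problem": row["question"],
            "solutions": solutions,
            "tests": tests,
        }) + "\n")
```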

🚀 Training

Our training process consists of two stages: (1) SFT on synthetic critique data guided by execution feedback, and (2) RL with verifiable rewards using GRPO.

Stage I: SFT

We start by generating synthetic data with the following command:

bash examples/gen_taco.sh
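
Conceptually, this step pairs candidate solutions with sandbox execution feedback and asks a model for critiques conditioned on the observed outcome. The sketch below is hypothetical: `run_in_sandbox` and `generate` stand in for a sandbox client and an LLM call, and the prompt wording is illustrative, not the script's actual template.

```python
# Hypothetical sketch of execution-guided critique synthesis; the real
# pipeline is driven by examples/gen_taco.sh.
def make_critique_example(problem, solution, tests, run_in_sandbox, generate):
    result = run_in_sandbox(solution, tests)  # pass/fail plus error details
    hint = ("The solution passed all tests."
            if result.passed
            else f"The solution failed: {result.feedback}")
    prompt = (f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
              "Write a critique of this solution and suggest improvements.")
    # Execution feedback guides generation but is kept out of the SFT prompt.
    critique = generate(prompt, hidden_context=hint)
    return {"prompt": prompt, "critique": critique}
```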

Then, fine-tune the model using the following command:

bash examples/train_sft.sh
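
For reference, one plausible layout of a single SFT record is sketched below; the actual schema is whatever the training code behind `examples/train_sft.sh` expects.

```python
# Hypothetical chat-style layout of one critique SFT record.
sft_example = {
    "messages": [
        {"role": "user", "content": "Problem: ...\n\nSolution: ..."},     # critique prompt
        {"role": "assistant", "content": "The solution mishandles ..."},  # target critique
    ],
}
```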

Stage II: RL

Train with GRPO using verifiable rewards from sandbox execution:

bash examples/train_rl.sh
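
The reward here is verifiable: a critique is scored by whether the revised program passes the sandbox tests. A minimal sketch of GRPO's group-relative advantage on such binary rewards is shown below (the actual reward shaping and policy update live in the training code).

```python
# Sketch of GRPO's group-relative advantage over binary sandbox rewards.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) pass/fail rewards in {0, 1}."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])  # one prompt, four rollouts
print(grpo_advantages(rewards))  # positive for passing rollouts, negative otherwise
```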

📋 Evaluation

We evaluate the model with the following command (e.g., for CodeContests):

bash examples/eval_codecontests.sh
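
At evaluation time, the critic drives an iterative critique-revision loop. The sketch below is a hypothetical rendering of that loop: `generate`, `critic`, and the "no issues" stop signal are stand-ins, not the evaluation script's actual interface.

```python
# Hypothetical multi-turn critique-revision loop for test-time scaling.
def critique_revise(problem, generate, critic, max_turns=3):
    solution = generate(problem)
    for _ in range(max_turns):
        critique = critic(problem, solution)
        if "no issues" in critique.lower():  # illustrative stop condition
            break
        solution = generate(problem, critique=critique)  # revise with feedback
    return solution
```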

📚 Citation

If you find this project useful, please consider citing:

@article{xie2025teaching,
  title={Teaching Language Models to Critique via Reinforcement Learning},
  author={Xie, Zhihui and Chen, Jie and Chen, Liyu and Mao, Weichao and Xu, Jingjing and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2502.03492},
  year={2025}
}

💐 Acknowledgement

This project builds upon several amazing open-source projects, including SandboxFusion for sandboxed code execution.

πŸ“ License

This project is licensed under the Apache 2.0 license.
