This repository includes implementations to reproduce the R1 pipeline for code generation:
- Result: 2K Code-R1 samples + Qwen2.5-7B-Instruct-1M beats Qwen2.5-Coder-7B-Instruct (even better w/ 12K samples).
- Finding: Quality of rewards matters. False positives in datasets and execution confuse the model.
- Implementation: A reliable, scalable, and sandboxed pipeline to minimize reward false positives in datasets and execution.
More results and findings to come...
```bash
# For training
pip install -e .
pip install vllm==0.7.3
pip install flash-attn --no-build-isolation
pip install wandb IPython matplotlib gpustat # utility
```
I tried multiple approaches to sandboxing, including calling code execution servers, running dockerized Python, calling paid services, etc. `firejail` is the approach I found that meets all three of the following:
- Reliability -- false positives arise when "autograders" hit concurrency issues (timeouts), violate OS limits (`OSError`), etc.
- Scalability -- e.g., dockerized Python runs are generally reliable but too slow (e.g., 20 samples/s on 192 cores).
- Security -- ... otherwise the school IT will email you and stop your server...
```bash
sudo add-apt-repository ppa:deki/firejail
sudo apt-get update
sudo apt-get install firejail firejail-profiles
```
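For reference, below is a minimal sketch (not the repository's actual reward code) of how a completion could be executed under `firejail` with a wall-clock timeout; the specific flags and the helper name `run_sandboxed` are illustrative assumptions.

```python
import subprocess

def run_sandboxed(code: str, stdin: str = "", timeout: float = 10.0) -> tuple[bool, str]:
    """Run untrusted Python code inside a firejail sandbox (illustrative sketch).

    Returns (passed, stdout). Non-zero exits, timeouts, and OS-level failures
    all count as passed=False, so they can never become reward false positives.
    """
    cmd = [
        "firejail", "--quiet",   # suppress firejail banners in the output
        "--net=none",            # no network access for untrusted code
        "--private",             # throwaway private home directory
        "python3", "-c", code,
    ]
    try:
        proc = subprocess.run(
            cmd, input=stdin, capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stdout
    except (subprocess.TimeoutExpired, OSError):
        # Anything ambiguous (timeout, sandbox/OS error) is scored as a failure.
        return False, ""
```

Mapping `passed` to a binary reward keeps the signal conservative: crashes, timeouts, and sandbox errors count as failures rather than risking reward false positives.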
The current version has 12K RL samples (prompt + tests) at 🤗 ganler/code-r1-12k:
- 2K LeetCode samples, whose tests are generally reliable
- 10K verified samples filtered from 26K TACO samples
In general, it is suggested to validate the data and sandbox on every dataset and environment before running code RL. Directly using noisy data and mismatched environments can lead to reward false positives, confusing the model. This noise can come from (i) wrong tests, (ii) unsolvable prompts (e.g., image tags), and (iii) execution environment mismatches.
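As a concrete illustration of such validation (not the repository's actual filtering script), the sketch below keeps only samples whose reference solution passes its own tests inside the sandbox; the field names `solution` and `tests` are hypothetical.

```python
def filter_verified(samples, run_sandboxed):
    """Keep only samples whose reference solution passes its own tests.

    `samples` is an iterable of dicts with hypothetical `solution`/`tests` fields;
    `run_sandboxed(code) -> (passed, stdout)` is a sandbox helper such as the
    sketch above.
    """
    verified = []
    for sample in samples:
        program = sample["solution"] + "\n\n" + sample["tests"]  # hypothetical schema
        passed, _ = run_sandboxed(program)
        if passed:  # anything that fails, times out, or errors is dropped
            verified.append(sample)
    return verified
```

The samples dropped here are exactly the ones that would otherwise inject reward noise during RL.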
To produce locally validated RL data:

```bash
python examples/data_preprocess/coder1.py
```
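To sanity-check the produced files before launching training, something like the following can be used (the parquet path is an assumption; point it at wherever the script writes its output):

```python
import pandas as pd

# The path below is an assumption; adjust it to the script's actual output location.
df = pd.read_parquet("data/coder1/train.parquet")
print(len(df), "samples")
print(df.columns.tolist())
print(df.iloc[0])  # inspect one record: prompt plus test/reward information
```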
To start GRPO training:

```bash
bash main_grpo.sh
```
> **Note**: The script was optimized for a single-node 8x H200 setup. You might need to customize the settings for your own workstation.
We trained two models based on Qwen2.5-7B-Instruct-1M via pure R1-Zero-style RL (no SFT or distillation):
- 🤗 CodeR1-Zero-Qwen2.5-7B-12k-832: using 12K RL samples trained in 832 steps (training logs).
- 🤗 CodeR1-Zero-Qwen2.5-7B-LC2k-1088: using 2K RL samples from LeetCode, trained in 1088 steps (training logs).
| Model | LCB (v5) | HumanEval+ | MBPP+ | Average |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct-1M | 24.0 | 80.5 | 66.7 | 57.1 |
| + Code-R1-Zero (2K, 1088 steps, GRPO) | 28.6 | 84.8 | 70.1 | 61.2 |
| + Code-R1-Zero (12K, 832 steps, GRPO) | 29.7 | 83.5 | 74.3 | 🌟62.5 |
- 2K LeetCode training samples already show promising results, without any additional SFT or distillation.
- Extending them to 12K samples (10K more verified samples from TACO) further improves performance.
Some existing models:
| Model | LCB (v5) | HumanEval+ | MBPP+ | Average |
|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | 31.1 | 82.3 | 69.6 | 61.0 |
| Eurus-2-7B-PRIME | 23.8 | 65.9 | 29.9 | 39.9 |
| Sky-T1-7B | 21.3 | 54.3 | 50.8 | 42.1 |
- Qwen2.5-Coder-7B-Instruct, despite being released months ago, remains the strongest baseline, though it is unclear where its improvement comes from.
- Eurus-2-7B-PRIME starts from Qwen2.5-Math-7B-Instruct and is RL-only. Its training data includes extensive (unfiltered) coding datasets, including APPS, CodeContests, TACO, and Codeforces. Code-R1-Zero outperforms it significantly despite using less data, likely because we use validated datasets and sandboxes.
- Sky-T1-7B uses a combination of RL and SFT/distillation steps. Its RL stage partially uses PRIME, but its training data does not appear to include coding datasets.
If you find this work helpful, please cite:

```bibtex
@misc{code-r1,
  title={Code-R1: Reproducing R1 for Code with Reliable Rewards},
  author={Liu, Jiawei and Zhang, Lingming},
  howpublished={\url{https://github.com/ganler/code-r1}},
  year={2025}
}
```
Apache-2.0. See LICENSE.code-r1 for more details.