Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, such as Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD), have shown the benefits of explicitly teaching LLMs how to critique. Motivated by these findings, we propose Critique Reinforcement Learning (CRL), in which the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label c of the generated critique aligns with the ground-truth judgment c*. Building on this formulation, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on a range of benchmarks, where they consistently outperform RL-only baselines. Notably, our Critique-Coder-8B reaches over 60% on LiveCodeBench (v5), outperforming other reasoning models such as DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that applying CRL to coding datasets strengthens general reasoning and critique abilities that transfer across a broad range of tasks. Hence, we believe that CRL serves as an effective complement to standard RL for LLM reasoning.
We introduce Critique Reinforcement Learning (CRL), a training framework that incorporates critique learning within the RL paradigm. This approach enhances the model’s critique and reasoning abilities, addressing the lack of critique and reflection incentives in standard RL. Building on this foundation, we introduce Critique-Coder, a model trained on a mixture of CRL and RL to leverage the strengths of both: CRL fosters critical thinking and reasoning, while RL focuses on optimizing problem-solving.
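To make the mixture concrete, the following is a minimal Python sketch of how such a data blend could be built, assuming simple list-based datasets; build_hybrid_dataset and its arguments are hypothetical names for illustration, not the actual training pipeline.

import random

def build_hybrid_dataset(rl_examples, crl_examples, crl_ratio=0.2, seed=0):
    """Replace roughly `crl_ratio` of the RL training prompts with CRL critique prompts."""
    rng = random.Random(seed)
    n_crl = min(int(len(rl_examples) * crl_ratio), len(crl_examples))
    kept_rl = rng.sample(rl_examples, len(rl_examples) - n_crl)  # drop n_crl RL prompts
    mixed = kept_rl + rng.sample(crl_examples, n_crl)            # add n_crl CRL prompts
    rng.shuffle(mixed)
    return mixed

For example, with 1,000 RL prompts and crl_ratio=0.2, the resulting dataset would contain 800 RL prompts and 200 CRL critique prompts.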
Comparison between CRL and standard RL. Standard RL generates a solution for the input question and evaluates it by executing test cases, whereas CRL generates a critique of the paired (question, solution) and compares the resulting judgment with the ground truth to determine its correctness. Experiments show that RL+CRL improves not only accuracy but also code quality.
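For illustration, the two reward signals can be sketched as follows. This is a minimal Python sketch rather than the actual training code: the test-case runner, the "Conclusion: correct/incorrect" critique format, and all function names are assumptions made for the example.

import re
import subprocess
import tempfile

def run_test_cases(code, tests, timeout=10):
    """Execute a stdin/stdout Python program against (input, expected_output) pairs."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(["python", path], input=stdin_text,
                                    capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected.strip():
            return False
    return True

def extract_judgment(critique):
    """Parse the final verdict from a critique ending in 'Conclusion: correct/incorrect'."""
    matches = re.findall(r"conclusion:\s*(correct|incorrect)", critique.lower())
    return None if not matches else matches[-1] == "correct"

def rl_reward(generated_code, tests):
    # Standard RL: binary reward from executing the generated solution against test cases.
    return 1.0 if run_test_cases(generated_code, tests) else 0.0

def crl_reward(generated_critique, gt_judgment):
    # CRL: binary reward when the critique's final judgment c matches the ground truth c*.
    return 1.0 if extract_judgment(generated_critique) == gt_judgment else 0.0

In both cases the reward is binary, so the only learning signal in CRL is whether the critique reaches the correct verdict, not how the critique itself is phrased.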
We conducted experiments on two models, Qwen3-4B and Qwen3-8B, in thinking mode. Compared with the base models, Critique-Coder delivers consistent and notable improvements across benchmarks of varying difficulty. On Qwen3-4B, for example, the LiveCodeBench score rises from 54.2 to 59.0, a gain of +4.8 that surpasses the larger Qwen3-8B baseline by +1.5 points. Under identical datasets and training configurations, replacing part of the RL data with CRL data consistently yields superior results across all benchmarks: on Qwen3-4B, Critique-Coder exceeds Qwen3-4B-RL by +2.4 points on LiveCodeBench and improves the overall benchmark average by +1.5 points.
To examine whether the critique and reasoning abilities learned by Critique-Coder extend beyond coding tasks, we further evaluate the model on logic reasoning subtasks from the BIG-Bench Extra Hard (BBEH) benchmark. As shown in the table, Critique-Coder achieves consistent improvements over both the baseline Qwen3-4B and its RL-trained variant across all four reasoning subtasks.
@article{ruan2025critiquecoder,
title={Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning},
author={Ruan, Chi and Jiang, Dongfu and Wang, Yubo and Chen, Wenhu},
journal={arXiv preprint arXiv:2509.22824},
year={2025}
}