Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
University of Waterloo, Vector Institute, Netmind.AI, Shanghai AI Lab, Independent
y726wang@uwaterloo.ca, wenhuchen@uwaterloo.ca

Abstract

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable: even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to, or even surpass, those of RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.

teaser_figure

Figure 1: One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy (%) on six mathematical reasoning benchmarks for Qwen and Llama models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy (%) on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard supervised fine-tuning and matches or exceeds reinforcement learning with much lower compute.

Overview

Overview of the 1-shot CFT dataset construction and the key difference between SFT and CFT training. Top: Candidate solutions to a single math problem are generated, critiqued, and filtered to form the training set. Bottom: Comparison of training paradigms: (left) SFT supervises the model to generate the reference solution; (right) CFT trains the model to critique a candidate solution, encouraging deeper reasoning and error analysis.

overview
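As a concrete illustration of the pipeline above, here is a minimal Python sketch of the 1-shot CFT data construction. It is a hedged approximation, not the released code: the model IDs, the `chat` helper, and the simple final-answer filter are assumptions (the paper mixes solutions from multiple generators and filters critiques before fine-tuning).

```python
# Minimal sketch of 1-shot CFT data construction (illustrative, not the paper's code).
# Assumes an OpenAI-compatible API; model IDs below are placeholders.
from openai import OpenAI

client = OpenAI()

PROBLEM = "..."            # the single seed math problem
REFERENCE_ANSWER = "..."   # its ground-truth final answer

GENERATORS = ["generator-model-a", "generator-model-b"]  # stand-ins for diverse solver LLMs
TEACHER = "teacher-model"                                # stand-in for the critique LLM

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# 1) Collect diverse candidate solutions to the one seed problem.
candidates = [chat(m, PROBLEM) for m in GENERATORS for _ in range(10)]

# 2) Have the teacher critique each candidate step by step.
critiques = [
    chat(
        TEACHER,
        f"Problem:\n{PROBLEM}\n\nCandidate solution:\n{cand}\n\n"
        "Critique this solution step by step and state whether it is correct.",
    )
    for cand in candidates
]

# 3) Keep critiques whose verdict agrees with a crude final-answer check (simplified filter).
def verdict_matches(critique: str, candidate: str) -> bool:
    says_correct = "incorrect" not in critique.lower()
    is_correct = REFERENCE_ANSWER in candidate
    return says_correct == is_correct

cft_dataset = [
    {"input": f"{PROBLEM}\n\nCandidate solution:\n{cand}", "target": crit}
    for cand, crit in zip(candidates, critiques)
    if verdict_matches(crit, cand)
]
```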

Comparison between SFT and CFT

Comparison between Supervised Fine-Tuning (SFT) and Critique Fine-Tuning (CFT). SFT trains the model to generate solutions directly, while CFT trains it to critique candidate solutions for correctness.

sft_cft_compare
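The difference is easiest to see in how a single training example is built: only the prompt/target split changes. Below is a hedged sketch using standard Hugging Face causal-LM conventions (label -100 masks prompt tokens from the loss); the prompt templates and toy strings are assumptions, not the paper's exact code.

```python
# Sketch: the same problem yields different supervision under SFT vs. CFT.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")

problem = "Compute 3 + 4 * 2."
reference_solution = "By order of operations, 4 * 2 = 8, so 3 + 8 = 11."
candidate_solution = "3 + 4 = 7, then 7 * 2 = 14."
critique = "Incorrect: addition was applied before multiplication. 4 * 2 = 8 comes first, so the answer is 11."

def build_example(prompt: str, target: str) -> dict:
    # Cross-entropy is applied only to target tokens; prompt tokens get label -100.
    prompt_ids = tok(prompt, add_special_tokens=False).input_ids
    target_ids = tok(target, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,
    }

# SFT: supervise the model to reproduce the reference solution.
sft_example = build_example(f"Problem: {problem}\nSolution:", reference_solution)

# CFT: supervise the model to critique a candidate solution.
cft_example = build_example(
    f"Problem: {problem}\nCandidate solution: {candidate_solution}\nCritique:",
    critique,
)
```

Under this framing, the CFT target forces the model to verify each step of someone else's reasoning rather than reproduce a known solution, which is what the overview above credits for the deeper reasoning and error analysis.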

Performance Comparison on Mathematical Reasoning

Performance (%) on mathematical benchmarks. The base results are measured using the same prompt and evaluation settings as SFT and CFT. The base (sober) results are taken from the Sober Look study, which uses a more comprehensive evaluation. The RL (1 ex) results are from 1-shot RLVR. The delta rows show the performance difference between CFT (1 ex) and the base.

math_main_table

Training Efficiency Comparison

Model accuracy on MATH-500 versus training cost for Qwen2.5-Math-7B trained with 1-shot RLVR and 1-shot CFT. One-shot CFT reaches comparable accuracy with roughly 20x less compute (about 5 GPU hours, versus hundreds for RLVR).

rlvr_efficiency

Performance Comparison on Logic Reasoning

Performance of Qwen2.5-Math-7B on three BIG-Bench Extra Hard (BBEH) logic reasoning subtasks. For each subtask, SFT and CFT are performed using a single example from that subtask and evaluated on all three subtasks; the in-domain (diagonal) results are highlighted. The last two rows, also highlighted, merge the three problems into a single three-example training set (merged CFT/SFT) to assess generalization across all three subtasks. Best results in each column are in bold.

logic_main_table

Ablation: Effectiveness of Seed Examples

We compare one-shot CFT performance on datasets built from different seed problems. While all seeds are effective, dsr-cft-p0 (from seed problem π1) achieves the highest average accuracy.

math_ablation

Ablation: Diversity of Candidate Solutions

To analyze the effect of candidate-solution diversity, we compare three settings on the seed problem π1. We use a single strong generator (Phi-4-Reasoning-Plus) and a single weaker generator (Qwen2.5-Math-7B-Instruct) to each produce 100 candidate solutions, generate critiques, and perform CFT. Our main method, by contrast, mixes 100 candidate solutions from 10 different generators before collecting critiques and fine-tuning. The results show that greater diversity in candidate solutions leads to richer error types and reasoning patterns, enabling more effective critique fine-tuning.

diversity_ablation
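For concreteness, here is a small sketch of how the three candidate pools differ. Only the generator names and the pool size of 100 come from the caption; the round-robin split and the stubbed sampler are assumptions.

```python
import random

# Placeholder for a temperature > 0 sample from the named model; in practice this
# would be an API or vLLM call. Stubbed so the sketch runs stand-alone.
def sample_solution(model: str, problem: str) -> str:
    return f"[{model} solution #{random.randrange(1000)}]"

TEN_GENERATORS = [f"generator-{i}" for i in range(10)]  # stand-in for 10 distinct models

def build_pool(models: list[str], problem: str, total: int = 100) -> list[str]:
    # Round-robin so each generator contributes total / len(models) candidates.
    per_model = total // len(models)
    return [sample_solution(m, problem) for m in models for _ in range(per_model)]

seed_problem = "..."  # the seed problem pi_1 from the ablation

strong_only = build_pool(["Phi-4-Reasoning-Plus"], seed_problem)    # 100 from one strong model
weak_only = build_pool(["Qwen2.5-Math-7B-Instruct"], seed_problem)  # 100 from one weaker model
mixed = build_pool(TEN_GENERATORS, seed_problem)                    # 10 each from 10 models (main method)
```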

Case Study

Below are the model's responses to a problem at different stages of one-shot CFT training. After 20 training steps, the model already answers correctly. At 50 steps, although the model shows some overfitting to the critique data, it still answers the question correctly.

case_study

Reference

Please kindly cite our paper if you use our code or results:
@article{wang2025unleashing,
  title={Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem},
  author={Wang, Yubo and Nie, Ping and Zou, Kai and Wu, Lijun and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03295},
  year={2025}
}