Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem

Yubo Wang, Ping Nie, Kai Zou, Lijun Wu, Wenhu Chen
University of Waterloo, Vector Institute, Netmind.AI, Shanghai AI Lab, Independent
y726wang@uwaterloo.ca, wenhuchen@uwaterloo.ca

Abstract

We have witnessed that strong LLMs like Qwen-Math, MiMo, and Phi-4 possess immense reasoning potential inherited from the pre-training stage. With reinforcement learning (RL), these models can improve dramatically on reasoning tasks. Recent studies have shown that even RL on a single problem can unleash these models' reasoning capabilities. However, RL is not only expensive but also unstable: even one-shot RL requires hundreds of GPU hours. This raises a critical question: is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques. We fine-tune Qwen and Llama family models, ranging from 1.5B to 14B parameters, on the CFT data and observe significant performance gains across diverse reasoning tasks. For example, with just 5 GPU hours of training, Qwen-Math-7B-CFT shows an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks. These results are comparable to, or even surpass, those of RL with 20x less compute. Ablation studies reveal the robustness of one-shot CFT across different prompt problems. These results highlight one-shot CFT as a simple, general, and compute-efficient approach to unleashing the reasoning capabilities of modern LLMs.

teaser_figure

Figure 1: One-shot CFT consistently improves mathematical and logical reasoning. Left: Average accuracy (%) on six mathematical reasoning benchmarks for Qwen and Llama models, comparing base, SFT, RLVR, and CFT with only one training example. Right: In-domain accuracy (%) on three logic reasoning benchmarks (BBEH subtasks) for Qwen2.5-Math-7B. Across both domains, CFT with a single problem significantly outperforms standard supervised fine-tuning and matches or exceeds reinforcement learning with much lower compute.

Overview

Overview of the 1-shot CFT dataset construction and the key difference between SFT and CFT training. Top: Candidate solutions to a single math problem are generated, critiqued, and filtered to form the training set. Bottom: Comparison of training paradigms: (left) SFT supervises the model to generate the reference solution; (right) CFT trains the model to critique a candidate solution, encouraging deeper reasoning and error analysis.

overview
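As a concrete illustration of the pipeline above, here is a minimal Python sketch of the 1-shot CFT data construction. It is a hedged approximation, not the released code: the model IDs, the `chat` helper, and the simple final-answer filter are assumptions (the paper mixes solutions from multiple generators and filters critiques before fine-tuning).

```python
# Minimal sketch of 1-shot CFT data construction (illustrative, not the paper's code).
# Assumes an OpenAI-compatible API; model IDs below are placeholders.
from openai import OpenAI

client = OpenAI()

PROBLEM = "..."            # the single seed math problem
REFERENCE_ANSWER = "..."   # its ground-truth final answer

GENERATORS = ["generator-model-a", "generator-model-b"]  # stand-ins for diverse solver LLMs
TEACHER = "teacher-model"                                # stand-in for the critique LLM

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# 1) Collect diverse candidate solutions to the one seed problem.
candidates = [chat(m, PROBLEM) for m in GENERATORS for _ in range(10)]

# 2) Have the teacher critique each candidate step by step.
critiques = [
    chat(
        TEACHER,
        f"Problem:\n{PROBLEM}\n\nCandidate solution:\n{cand}\n\n"
        "Critique this solution step by step and state whether it is correct.",
    )
    for cand in candidates
]

# 3) Keep critiques whose verdict agrees with a crude final-answer check (simplified filter).
def verdict_matches(critique: str, candidate: str) -> bool:
    says_correct = "incorrect" not in critique.lower()
    is_correct = REFERENCE_ANSWER in candidate
    return says_correct == is_correct

cft_dataset = [
    {"input": f"{PROBLEM}\n\nCandidate solution:\n{cand}", "target": crit}
    for cand, crit in zip(candidates, critiques)
    if verdict_matches(crit, cand)
]
```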

Comparison between SFT and CFT

Comparison between Supervised Fine-Tuning (SFT) and Critique Fine-Tuning (CFT). SFT trains the model to generate solutions directly, while CFT trains it to critique candidate solutions for correctness.

sft_cft_compare
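The difference is easiest to see in how a single training example is built: only the prompt/target split changes. Below is a hedged sketch using standard Hugging Face causal-LM conventions (label -100 masks prompt tokens from the loss); the prompt templates and toy strings are assumptions, not the paper's exact code.

```python
# Sketch: the same problem yields different supervision under SFT vs. CFT.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")

problem = "Compute 3 + 4 * 2."
reference_solution = "By order of operations, 4 * 2 = 8, so 3 + 8 = 11."
candidate_solution = "3 + 4 = 7, then 7 * 2 = 14."
critique = "Incorrect: addition was applied before multiplication. 4 * 2 = 8 comes first, so the answer is 11."

def build_example(prompt: str, target: str) -> dict:
    # Cross-entropy is applied only to target tokens; prompt tokens get label -100.
    prompt_ids = tok(prompt, add_special_tokens=False).input_ids
    target_ids = tok(target, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,
    }

# SFT: supervise the model to reproduce the reference solution.
sft_example = build_example(f"Problem: {problem}\nSolution:", reference_solution)

# CFT: supervise the model to critique a candidate solution.
cft_example = build_example(
    f"Problem: {problem}\nCandidate solution: {candidate_solution}\nCritique:",
    critique,
)
```

Under this framing, the CFT target forces the model to verify each step of someone else's reasoning rather than reproduce a known solution, which is what the overview above credits for the deeper reasoning and error analysis.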

Performance Comparison on Mathematical Reasoning

Performance (%) on mathematical benchmarks. The base results are measured using the same prompt and evaluation settings as SFT and CFT. The base (sober) results are taken from the Sober Look study, which uses a more comprehensive evaluation. The RL (1 ex) results are from 1-shot RLVR. The delta rows show the performance difference between CFT (1 ex) and the base.

math_main_table

Training Efficiency Comparison

Model accuracy on MATH-500 versus training cost for Qwen2.5-Math-7B trained with 1-shot RLVR and 1-shot CFT. One-shot CFT reaches comparable accuracy with roughly 20x less compute (about 5 GPU hours, versus hundreds for RLVR).

rlvr_efficiency

Performance Comparison on Logic Reasoning

Performance of Qwen2.5-Math-7B on three BIG-Bench Extra Hard (BBEH) logic reasoning subtasks. For each subtask, SFT and CFT are performed using a single example from that subtask and evaluated on all three subtasks; the in-domain (diagonal) results are highlighted. The last two rows, also highlighted, merge the three problems into a single three-example training set (merged CFT/SFT) to assess generalization across all three subtasks. Best results in each column are in bold.

logic_main_table

Ablation: Effectiveness of Seed Examples

We compare one-shot CFT performance on datasets built from different seed problems. While all seeds are effective, dsr-cft-p0 (from seed problem π1) achieves the highest average accuracy.

math_ablation

Ablation: Diversity of Candidate Solutions

To analyze the effect of candidate-solution diversity, we compare three settings on the seed problem π1. We use a single strong generator (Phi-4-Reasoning-Plus) and a single weaker generator (Qwen2.5-Math-7B-Instruct) to each produce 100 candidate solutions, generate critiques, and perform CFT. Our main method, by contrast, mixes 100 candidate solutions from 10 different generators before collecting critiques and fine-tuning. The results show that greater diversity in candidate solutions leads to richer error types and reasoning patterns, enabling more effective critique fine-tuning.

diversity_ablation
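For concreteness, here is a small sketch of how the three candidate pools differ. Only the generator names and the pool size of 100 come from the caption; the round-robin split and the stubbed sampler are assumptions.

```python
import random

# Placeholder for a temperature > 0 sample from the named model; in practice this
# would be an API or vLLM call. Stubbed so the sketch runs stand-alone.
def sample_solution(model: str, problem: str) -> str:
    return f"[{model} solution #{random.randrange(1000)}]"

TEN_GENERATORS = [f"generator-{i}" for i in range(10)]  # stand-in for 10 distinct models

def build_pool(models: list[str], problem: str, total: int = 100) -> list[str]:
    # Round-robin so each generator contributes total / len(models) candidates.
    per_model = total // len(models)
    return [sample_solution(m, problem) for m in models for _ in range(per_model)]

seed_problem = "..."  # the seed problem pi_1 from the ablation

strong_only = build_pool(["Phi-4-Reasoning-Plus"], seed_problem)    # 100 from one strong model
weak_only = build_pool(["Qwen2.5-Math-7B-Instruct"], seed_problem)  # 100 from one weaker model
mixed = build_pool(TEN_GENERATORS, seed_problem)                    # 10 each from 10 models (main method)
```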

Case Study

Below are the model's responses to a problem at different stages of one-shot CFT training. After 20 training steps, the model already answers correctly. At 50 steps, although the model shows some overfitting to the critique data, it still answers the question correctly.

case_study

Reference

Please kindly cite our paper if you use our code or results:
@article{wang2025unleashing,
  title={Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem},
  author={Wang, Yubo and Nie, Ping and Zou, Kai and Wu, Lijun and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03295},
  year={2025}
}