🔥 [2025-04] Our VL-Rethinker paper and models are out 🚀. We are actively working on releasing the training code and training queries!
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art.
We identified a critical limitation of GRPO, especially for 72B training: the vanishing advantages problem. To mitigate this issue, we introduce a novel technique called Selective Sample Replay (SSR). While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts during RL training, explicitly enforcing a self-reflection reasoning step.
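To make the Forced Rethinking step concrete, here is a minimal sketch of the two-stage rollout it describes. The trigger text and the `generate_fn` interface are illustrative assumptions for exposition, not the exact implementation in our training code.

```python
# A minimal sketch of Forced Rethinking. The trigger text and `generate_fn`
# interface are illustrative assumptions, not the exact training implementation.

RETHINK_TRIGGER = "\nWait, let me re-examine my reasoning."  # hypothetical trigger text

def forced_rethinking_rollout(generate_fn, prompt, max_new_tokens=1024):
    """Produce an RL rollout that ends with an explicit self-reflection segment.

    `generate_fn(text, max_new_tokens)` is any text-completion callable,
    e.g., a thin wrapper around a vision-language model's generation API.
    """
    # Stage 1: sample an initial response, as in a standard rollout.
    initial = generate_fn(prompt, max_new_tokens)
    # Stage 2: append the rethinking trigger and let the model continue,
    # so the trajectory used for the RL update contains a self-verification step.
    rethink = generate_fn(prompt + initial + RETHINK_TRIGGER, max_new_tokens)
    return initial + RETHINK_TRIGGER + rethink
```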
By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. Our empirical results show the effectiveness of our approaches.
Our analysis shows that standard GRPO with binary correctness rewards suffers from the vanishing advantages problem, where the proportion of samples with zero advantage increases as training progresses. This problem is especially severe for 72B training. As the figure suggests, the percentage of examples exhibiting non-zero advantages steadily declines from approximately 40% to below 20% within 256 gradient steps. These observations motivate our approach of Selective Sample Replay.
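For illustration, the sketch below shows how binary correctness rewards yield zero advantages under group normalization (when all rollouts for a query are correct, or all are wrong), and how a Selective Sample Replay buffer keeps only the informative, non-zero-advantage samples for replay. The buffer size, the |advantage|-weighted resampling, and all names are assumptions made for exposition, not the exact training recipe.

```python
# A minimal sketch of group-normalized (GRPO-style) advantages and a Selective
# Sample Replay buffer. Capacity, weighting, and names are illustrative assumptions.
import random
import statistics

def group_advantages(rewards, eps=1e-6):
    """Binary correctness rewards for G rollouts of one query -> advantages.
    If all rollouts are correct (or all wrong), every advantage is ~0 and the
    query contributes no learning signal: the 'vanishing advantages' case."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

class SelectiveSampleReplay:
    """Keep rollouts with non-zero advantage and replay them to refill batches
    that would otherwise be dominated by zero-advantage samples."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.buffer = []  # list of (sample, advantage)

    def add(self, samples, advantages, tol=1e-6):
        for s, a in zip(samples, advantages):
            if abs(a) > tol:                        # discard uninformative samples
                self.buffer.append((s, a))
        self.buffer = self.buffer[-self.capacity:]  # simple FIFO truncation

    def replay(self, k):
        """Draw k stored rollouts (with replacement), weighted by |advantage|
        so more informative samples are replayed more often."""
        if not self.buffer:
            return []
        weights = [abs(a) for _, a in self.buffer]
        return random.choices(self.buffer, weights=weights, k=k)
```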
We train VL-Rethinker with SSR and Forced Rethinking to foster the ability to think deliberately. Intriguingly, as illustrated in the figure, VL-Rethinker even identifies flaws in the given problem when checking its initial reasoning through rethinking, showcasing a form of emergent metacognitive ability.
VL-Rethinker advances the state-of-the-art scores on MathVista, MathVerse, and MathVision. It also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. We are continuing to push the limits of VL-Rethinker.
Explore more details of our approach, our analysis of the learned rethinking behaviors, and insights into VLM training in our paper!
@article{vl-rethinker,
  title={VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning},
  author={Wang, Haozhe and Qu, Chao and Huang, Zuming and Chu, Wei and Lin, Fangzhen and Chen, Wenhu},
  journal={arXiv preprint arXiv:2504.08837},
  year={2025}
}