VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen
HKUST, INF.AI, University of Waterloo
Corresponding to: jasper.whz@outlook.com, wenhuchen@uwaterloo.ca

🔔 News

🔥 [2025-04] Our paper VL-Rethinker and models are out 🚀. We are actively working on releasing the training code and training queries!

Introduction

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models; for instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art.

VL-Rethinker

Overview

We identified a critical limitation of GRPO, especially for 72B training: the vanishing advantages problem. To mitigate this issue, we introduce a novel technique called Selective Sample Replay (SSR). While this approach yields strong performance, the resulting RL-trained models still exhibit limited self-reflection and self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts during RL training, explicitly enforcing a self-reflection reasoning step.
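As a rough illustration, the snippet below shows how a rethinking trigger can be appended to an initial rollout before a second decoding pass. The `generate` callable and the trigger wording are placeholders for exposition; this is a sketch of the idea, not our released training code.

```python
# A minimal sketch of Forced Rethinking, assuming a generic `generate(prompt)`
# callable that returns the model's text continuation. The trigger wording is
# illustrative; the exact trigger text used in training may differ.

RETHINK_TRIGGER = "\nWait, let me re-examine my reasoning before giving the final answer."

def forced_rethinking_rollout(generate, prompt):
    """Produce a rollout that ends with an explicit self-reflection step."""
    initial = generate(prompt)                                # first-pass rollout
    rethink = generate(prompt + initial + RETHINK_TRIGGER)    # forced rethinking pass
    # The full trajectory (initial answer + trigger + rethinking) is what the
    # binary correctness reward scores during RL training.
    return initial + RETHINK_TRIGGER + rethink

# Example with a stub generator (a real setup would call the VLM's sampler):
print(forced_rethinking_rollout(lambda p: " ...model continuation... ", "Q: 2 + 2 = ?"))
```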

(Figure: an algebraic reasoning example)

By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. Our empirical results demonstrate the effectiveness of our approach.

The Vanishing Advantages Problem

Our analysis shows that standard GRPO with binary correctness rewards suffers from the vanishing advantages problem: the proportion of samples with zero advantage increases as training progresses. The problem is especially severe for 72B training. As the figure suggests, the percentage of examples exhibiting non-zero advantages steadily declines from approximately 40% to below 20% within 256 gradient steps. These observations motivate our approach of Selective Sample Replay.
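To make the failure mode concrete, here is a minimal Python sketch of group-relative advantage computation with binary rewards, together with a toy replay buffer in the spirit of SSR. The `grpo_advantages` helper, the buffer capacity, and the sampling rule are illustrative assumptions rather than our exact training implementation.

```python
# A toy illustration of vanishing advantages under binary rewards, plus a
# replay buffer in the spirit of Selective Sample Replay (SSR). Buffer capacity
# and the sampling rule are illustrative assumptions.
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# With binary correctness rewards, a group whose rollouts are all correct
# (or all wrong) has identical rewards, so every advantage is exactly zero
# and that query contributes no policy gradient.
print(grpo_advantages([1, 1, 1, 1]))  # -> [0. 0. 0. 0.]  (no learning signal)
print(grpo_advantages([1, 0, 1, 0]))  # -> non-zero, informative signal

class SelectiveSampleReplay:
    """Keep queries whose rollout groups produced non-zero advantages and
    replay them so later batches remain dominated by informative gradients."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.buffer = []

    def add(self, sample, advantages):
        # Store only samples that still carry a learning signal.
        if np.any(np.abs(np.asarray(advantages)) > 0):
            self.buffer.append(sample)
            self.buffer = self.buffer[-self.capacity:]

    def sample(self, k):
        # Draw replacements for zero-advantage queries in the current batch.
        if not self.buffer:
            return []
        idx = np.random.choice(len(self.buffer), size=min(k, len(self.buffer)), replace=False)
        return [self.buffer[i] for i in idx]
```

When every rollout in a group receives the same reward, the normalized advantages are all zero and the query yields no gradient; SSR counters this by refilling batches from previously informative samples.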

(Figure: the fraction of samples with non-zero advantages declines over the course of GRPO training)

Qualitative Results of VL-Rethinker

We train VL-Rethinker with SSR and Forced Rethinking to equip it with the ability to think deliberately. Intriguingly, as illustrated in the figure, VL-Rethinker can even identify flaws in the given problem while checking its initial reasoning through rethinking, showcasing a form of emergent metacognitive ability.

(Figure: a qualitative example of VL-Rethinker's self-reflection)

Performance of VL-Rethinker

VL-Rethinker advances the state-of-the-art scores on MathVista, MathVerse, and MathVision. It also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. We are actively working to push the limits of VL-Rethinker further.

(Figure: benchmark results of VL-Rethinker)

Explore More

Explore more details of our approach, our analysis of the learned rethinking behaviors, and insights into VLM training in our paper!

Reference

If you find our work useful, please cite us:

    @article{vl-rethinker,
      title={VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning},
      author={Wang, Haozhe and Qu, Chao and Huang, Zuming and Chu, Wei and Lin, Fangzhen and Chen, Wenhu},
      journal={arXiv preprint arXiv:2504.08837},
      year={2025}
    }