VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen
HKUST, INF.AI, University of Waterloo
Corresponding to: jasper.whz@outlook.com, wenhuchen@uwaterloo.ca

🔔 News

🔥 [2025-04] Our paper VL-Rethinker and models are out 🚀. We are actively working on releasing the training code and training queries!

Introduction

Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models; for instance, GPT-o1's performance on benchmarks like MathVista, MathVerse, and MathVision is similar to that of fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art.

VL-Rethinker

Overview

We identified a critical limitation of GRPO, especially for 72B training: the vanishing advantages problem. To mitigate this issue, we introduce a novel technique called Selective Sample Replay (SSR). While this approach yields strong performance, the resulting RL-trained models still exhibit limited self-reflection and self-verification. To further encourage slow thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts during RL training, explicitly enforcing a self-reflection reasoning step.
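As a rough illustration, the snippet below shows how a rethinking trigger can be appended to an initial rollout before a second decoding pass. The `generate` callable and the trigger wording are placeholders for exposition; this is a sketch of the idea, not our released training code.

```python
# A minimal sketch of Forced Rethinking, assuming a generic `generate(prompt)`
# callable that returns the model's text continuation. The trigger wording is
# illustrative; the exact trigger text used in training may differ.

RETHINK_TRIGGER = "\nWait, let me re-examine my reasoning before giving the final answer."

def forced_rethinking_rollout(generate, prompt):
    """Produce a rollout that ends with an explicit self-reflection step."""
    initial = generate(prompt)                                # first-pass rollout
    rethink = generate(prompt + initial + RETHINK_TRIGGER)    # forced rethinking pass
    # The full trajectory (initial answer + trigger + rethinking) is what the
    # binary correctness reward scores during RL training.
    return initial + RETHINK_TRIGGER + rethink

# Example with a stub generator (a real setup would call the VLM's sampler):
print(forced_rethinking_rollout(lambda p: " ...model continuation... ", "Q: 2 + 2 = ?"))
```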

(Figure: an algebraic reasoning example)

By combining these two techniques, our model, VL-Rethinker, advances the state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. Our empirical results demonstrate the effectiveness of our approach.

The Vanishing Advantages Problem

Our analysis shows that standard GRPO with binary correctness rewards suffers from the vanishing advantages problem: the proportion of samples with zero advantage increases as training progresses. The problem is especially severe for 72B training. As the figure suggests, the percentage of examples exhibiting non-zero advantages steadily declines from approximately 40% to below 20% within 256 gradient steps. These observations motivate our approach of Selective Sample Replay.
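To make the failure mode concrete, here is a minimal Python sketch of group-relative advantage computation with binary rewards, together with a toy replay buffer in the spirit of SSR. The `grpo_advantages` helper, the buffer capacity, and the sampling rule are illustrative assumptions rather than our exact training implementation.

```python
# A toy illustration of vanishing advantages under binary rewards, plus a
# replay buffer in the spirit of Selective Sample Replay (SSR). Buffer capacity
# and the sampling rule are illustrative assumptions.
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus group mean, scaled by group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# With binary correctness rewards, a group whose rollouts are all correct
# (or all wrong) has identical rewards, so every advantage is exactly zero
# and that query contributes no policy gradient.
print(grpo_advantages([1, 1, 1, 1]))  # -> [0. 0. 0. 0.]  (no learning signal)
print(grpo_advantages([1, 0, 1, 0]))  # -> non-zero, informative signal

class SelectiveSampleReplay:
    """Keep queries whose rollout groups produced non-zero advantages and
    replay them so later batches remain dominated by informative gradients."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.buffer = []

    def add(self, sample, advantages):
        # Store only samples that still carry a learning signal.
        if np.any(np.abs(np.asarray(advantages)) > 0):
            self.buffer.append(sample)
            self.buffer = self.buffer[-self.capacity:]

    def sample(self, k):
        # Draw replacements for zero-advantage queries in the current batch.
        if not self.buffer:
            return []
        idx = np.random.choice(len(self.buffer), size=min(k, len(self.buffer)), replace=False)
        return [self.buffer[i] for i in idx]
```

When every rollout in a group receives the same reward, the normalized advantages are all zero and the query yields no gradient; SSR counters this by refilling batches from previously informative samples.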

(Figure: the fraction of samples with non-zero advantages declines over the course of GRPO training)

Qualitative Results of VL-Rethinker

We train VL-Rethinker with SSR and Forced Rethinking to equip it with the ability to think deliberately. Intriguingly, as illustrated in the figure, VL-Rethinker can even identify flaws in the given problem while checking its initial reasoning through rethinking, showcasing a form of emergent metacognitive ability.

(Figure: a qualitative example of VL-Rethinker's self-reflection)

Performance of VL-Rethinker

VL-Rethinker advances the state-of-the-art scores on MathVista, MathVerse, and MathVision. It also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1. We are actively working to push the limits of VL-Rethinker further.

(Figure: benchmark results of VL-Rethinker)

Explore More

Explore more details of our approach, our analysis of the learned rethinking behaviors, and insights into VLM training in our paper!

Reference

If you find our work useful, please cite us:

    @article{vl-rethinker,
      title={VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning},
      author={Wang, Haozhe and Qu, Chao and Huang, Zuming and Chu, Wei and Lin, Fangzhen and Chen, Wenhu},
      journal={arXiv preprint arXiv:2504.08837},
      year={2025}
    }