We made some fun demos! Please check out the Interactive Online Demo and the video demo below.
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. This reasoning paradigm is also employed by state-of-the-art VLMs, which first extract relevant cues from multimodal inputs and then perform reasoning purely in textual form. Despite their success, this prevailing textual reasoning paradigm faces an inherent limitation: the lack of direct interaction with visual inputs -- such as drawing lines/marks, highlighting regions, or zooming in -- hinders the model’s ability to capture fine-grained visual details, including tiny objects, subtle spatial relationships, small embedded text, and nuanced actions in videos. These limitations lead us to pose a question:
Can VLMs perform reasoning steps more directly within the visual modality itself, leveraging computational visual manipulations as actions to guide reasoning?
We introduce the concept of pixel-space reasoning, a novel paradigm where reasoning is not exclusively confined to verbalized format but actively incorporates operations applied directly to the visual inputs. These visual operations serve as integral steps within its reasoning chain, empowering the model to inspect, interrogate, and infer from visual evidence with enhanced fidelity.
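To make the idea concrete, the sketch below shows what two such visual operations might look like when exposed to the model as callable tools. This is a minimal illustration: the function names, signatures, and tool-call format are assumptions for exposition, not the released interface.

```python
from typing import List

from PIL import Image


def zoom_in(image: Image.Image, bbox: List[float]) -> Image.Image:
    """Crop the region given by bbox = (x1, y1, x2, y2) so the model can
    re-inspect fine-grained details such as small text or tiny objects."""
    x1, y1, x2, y2 = bbox
    return image.crop((int(x1), int(y1), int(x2), int(y2)))


def select_frames(frames: List[Image.Image], indices: List[int]) -> List[Image.Image]:
    """Return a subset of video frames for closer inspection of a nuanced action."""
    return [frames[i] for i in indices if 0 <= i < len(frames)]
```

During reasoning, the model emits such an operation as an action, receives the resulting crop or frame subset as a new visual observation, and continues its reasoning chain conditioned on it.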
However, we identified a critical learning-trap problem when cultivating this novel pixel-space reasoning capability. Existing VLMs typically suffer from a disparity in their capabilities: proficient textual reasoning versus nascent (zero-shot) pixel-space reasoning. This inherent imbalance creates a "learning trap" that impedes the development of pixel-space reasoning, stemming from two synergistic issues. First, the model's initially limited mastery over visual operations frequently leads to failures or incorrect outputs, resulting in a higher incidence of negative feedback compared to text-mediated reasoning. Second, insufficient intrinsic motivation allows the model to pursue the extrinsic correctness reward by simply ignoring the outcomes of visual operations or defaulting to its stronger textual reasoning. This interplay fosters a detrimental cycle in which initial failures discourage further attempts, leading to the premature abandonment of exploring and mastering visual operations.
We propose a two-phase training approach to cultivate pixel-space reasoning capabilities in VLMs. The first phase involves instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning.
Instruction tuning is the first phase of our training approach, where the model is familiarized with the novel visual operations through synthesized reasoning trajectories. As shown in the data pipeline, we first collect high-resolution, information-rich images and videos as seed data. We then localize the reference visual cues (bounding boxes or frame indices) either from the original annotations of the source data or by prompting GPT to produce bounding boxes for the generated questions.

After that, we employ a template-based synthesis approach. The template structures a pixel-space reasoning trajectory as a sequence: an initial analysis of the entire visual input, a triggered visual operation that extracts fine-grained details, a subsequent analysis of these detailed visual cues, and finally the answer. To synthesize a trajectory according to this template, we utilize the reference visual cue associated with each vision-language query. We first prompt GPT-4o to generate a textual description summarizing the entire visual input. Then, leveraging the reference visual cue, we prompt GPT-4o for a more detailed textual analysis focusing specifically on that cue. By composing these textual thinking segments and incorporating the visual operation targeting the reference visual cue, we obtain a single-pass reasoning trajectory that effectively interleaves textual reasoning with the required visual operations.

To enhance the model's ability to react to unexpected visual outcomes, we additionally design error cases by hand: an inaccurate visual operation is inserted before the correct one, forming self-correction trajectories.
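As a rough illustration of this template, the sketch below composes a single trajectory for an image query. Here `ask_gpt4o` stands in for a GPT-4o call, and the message and tool-call schema is a simplified assumption rather than the exact format used in the data pipeline.

```python
def synthesize_trajectory(image, question, answer, ref_bbox, ask_gpt4o):
    """Compose one single-pass pixel-space reasoning trace from a QA pair
    and its reference visual cue (here: a bounding box)."""
    # 1) Textual overview of the entire visual input.
    overview = ask_gpt4o(image, "Briefly summarize this image.")
    # 2) Visual operation targeting the reference cue.
    operation = {"name": "zoom_in", "arguments": {"bbox": ref_bbox}}
    # 3) Detailed analysis focused on the referenced region.
    detail = ask_gpt4o(
        image,
        f"Focus on the region {ref_bbox} and describe the details relevant to: {question}",
    )
    # 4) Interleave the textual thinking segments with the visual operation.
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": overview, "tool_call": operation},
        {"role": "tool", "content": "<cropped region returned to the model>"},
        {"role": "assistant", "content": f"{detail} Final answer: {answer}"},
    ]
```

Self-correction trajectories follow the same pattern, except that a deliberately inaccurate operation (e.g., a bounding box that misses the cue) and its unhelpful observation are inserted before the correct operation.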
To overcome the learning trap where the model defaults to its stronger textual reasoning and neglects underdeveloped pixel-space reasoning, Pixel Reasoner introduces a curiosity-driven reward scheme. This scheme encourages active practice of visual operations by addressing two constraints: (1) the Rate of Pixel-space Reasoning (RaPR) should exceed a threshold H, and (2) the number of visual operations in a response should not exceed a bound N. These constraints are incorporated into training via Lagrangian relaxation, resulting in a modified reward function: \[ r'(x, y) = r(x, y) + \alpha \cdot r_{\text{curiosity}}(x, y) + \beta \cdot r_{\text{penalty}}(y) \] where \( r_{\text{curiosity}} \) encourages exploration and \( r_{\text{penalty}} \) discourages overuse of visual operations. This reward formulation dynamically incentivizes the model to persist in learning pixel-space reasoning despite initial failure, leading to more robust visual understanding.
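The snippet below is a minimal sketch of how this modified reward could be computed per rollout. The specific shapes of \( r_{\text{curiosity}} \) and \( r_{\text{penalty}} \), and the default values of H, N, \( \alpha \), and \( \beta \), are illustrative assumptions, not the exact formulation.

```python
def modified_reward(correct: bool,
                    uses_visual_ops: bool,
                    num_visual_ops: int,
                    rapr: float,          # rate of pixel-space reasoning over this query's rollouts
                    H: float = 0.3,       # target RaPR threshold (illustrative)
                    N: int = 2,           # allowed visual operations per response (illustrative)
                    alpha: float = 0.5,
                    beta: float = 0.5) -> float:
    """Compute r'(x, y) = r(x, y) + alpha * r_curiosity + beta * r_penalty
    for a single response."""
    r = 1.0 if correct else 0.0  # extrinsic correctness reward r(x, y)
    # Curiosity bonus: reward attempts at pixel-space reasoning while RaPR is below H.
    r_curiosity = max(0.0, H - rapr) if uses_visual_ops else 0.0
    # Efficiency penalty: discourage responses with more than N visual operations.
    r_penalty = float(min(0, N - num_visual_ops))
    return r + alpha * r_curiosity + beta * r_penalty
```

Because the curiosity bonus fades once the rate of pixel-space reasoning exceeds the threshold, the model is rewarded for practicing visual operations early in training without being pushed to overuse them later.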
@article{pixel-reasoner,
  title={Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning},
  author={Alex Su and Haozhe Wang and Weiming Ren and Fangzhen Lin and Wenhu Chen},
  journal={arXiv preprint arXiv:2505.15966},
  year={2025}
}