Abstract

Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e., “Perceive Everything as Pixels” (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge existing models' performance. Our findings show that (1) PEAP outperforms the token-based baseline on multimodal datasets, benefiting from unified input for better disambiguation; (2) reasoning and coding capabilities decline significantly across all models when processing pixel-based input, underscoring the need to enhance foundation models' perceptual abilities; (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation; (4) the attention pattern of PEAP is highly aligned with that of text token input; and (5) PEAP can be accelerated significantly by exploiting spatial sparsity. We conclude that existing frontier models are competent in pixel perception; however, there is still headroom for improvement.

Method Overview: Perceive Everything as Pixels (PEAP)

We propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs. This approach, called PEAP, is designed to better align with human perception and reduce the need for excessive pre-processing.

The diagram below (corresponding to Figure 1 in the paper) illustrates the overall PEAP framework.

**Figure 1:** PEAP framework: we investigate the possibility of perceiving everything as pixels. This framework aligns better with human perception, reducing the need for excessive pre-processing. Evaluated on our benchmark *PixelWorld*, PEAP boosts performance on multimodal tasks (e.g., websites, slides, documents) but struggles with complex, text-centric tasks (e.g., reasoning and coding). Larger models achieve better transferability between pixel- and token-based performance than smaller ones. We also observe that text and images exhibit similar attention patterns, and PEAP-Fast reduces inference overhead through patch pruning.

Dataset

We construct a comprehensive evaluation suite to systematically analyze and compare VLMs’ performance across text-only, structured, and multimodal inputs. Table 1 below (adapted from the original paper) summarizes the key datasets with their sizes, tasks, and modality transfer methods. These include established benchmarks such as GLUE, SuperGLUE, and MMLU-Pro for text-only tasks, TableBench for structured data, and MathVerse and SlidesVQA for inherently multimodal scenarios.

To create pixel-based input for text-only and structured data, we employ an image synthesis pipeline, which renders plain text or tabular data into images of varying size and font style. Figure 2 (also from the original paper) illustrates an example of how an input prompt is transformed into a visually rendered image for model inference. This approach ensures that layout information is preserved, reducing potential OCR errors and aligning more closely with our “Perceive Everything as Pixels” (PEAP) paradigm.

**Table 1:** Dataset Composition Overview

**Figure 2:** Example of Synthesized Input

Evaluation Results

We comprehensively evaluated PEAP on text-only, structured, and multimodal tasks. The key findings include:

- PEAP outperforms token-based baselines on multimodal tasks by providing better disambiguation.
- Performance degrades significantly on complex reasoning and coding tasks when models process pixel-based inputs.
- Larger models maintain strong performance on non-reasoning tasks under PEAP, whereas smaller models such as Phi-3.5-V degrade significantly.

Attention Visualization

To gain insight into how the model processes pixel inputs, we visualized the attention maps of Qwen2VL-7B’s final layer for both token-based and pixel-based inference. The results, shown below (corresponding to Figure 3 in the paper), reveal that the attention distribution is largely consistent across modalities.

**Figure 3:** Attention Heatmap Comparison

Efficiency Improvements with PEAP-Fast

Pixel-based input usually incurs higher computational overhead due to redundant blank patches. To mitigate this, we introduce PEAP-Fast, a sparsification algorithm that prunes non-informative regions, thereby reducing inference latency significantly (Table 3) while preserving model accuracy (Table 2).
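The idea of skipping blank regions can be illustrated with a small pure-Python sketch. It is not the PEAP-Fast implementation: it assumes a grayscale image stored as a list of rows, treats a patch as blank when its pixel values span no more than `tolerance`, and the names `prune_blank_patches`, `patch_size`, and `tolerance` are hypothetical.

```python
def prune_blank_patches(image, patch_size=4, tolerance=0):
    """Return (patch_row, patch_col, patch) for every non-blank patch.

    `image` is a list of equal-length rows of grayscale values (0-255).
    A patch is considered blank when max(value) - min(value) <= tolerance.
    """
    height, width = len(image), len(image[0])
    kept = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            values = [value for row in patch for value in row]
            if max(values) - min(values) > tolerance:
                # Record the grid position so the location of each kept
                # patch is still known after the blank ones are dropped.
                kept.append((top // patch_size, left // patch_size, patch))
    return kept


# An 8x8 white image with a single dark pixel: only one of four patches survives.
image = [[255] * 8 for _ in range(8)]
image[5][6] = 0
kept = prune_blank_patches(image, patch_size=4)
print([(row, col) for row, col, _ in kept])  # [(1, 1)]
```

In this toy case three of the four patches are dropped, so a model would only need to process a quarter of the patches. Recording the grid position of each kept patch is a design choice of this sketch, made so that spatial layout is not lost when patches are removed.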

Reference

Please cite our paper if you use our code or results:

@article{lyu2024pixelworld,
    title={PixelWorld: Towards Perceiving Everything as Pixels},
    author={Lyu, Zhiheng and Ma, Xueguang and Chen, Wenhu},
    year={2025},
    eprint={2501.19339},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={http://arxiv.org/abs/2501.19339},
}