VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
TIGER-Lab@University of Waterloo
Correspondence to: yiming.jia@mail.utoronto.ca, wenhuchen@uwaterloo.ca

Introduction

Vision-Language Models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks appears limited by the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 meticulously selected seed images, we employ Google Image Search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest text-only QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid yields 10-20% absolute gains across benchmarks, and (2) training from MAmmoTH-VL yields a 5% absolute gain. Our best model, MAmmoTH-VL2, shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.

🂡 VisualWebInstruct

Overview

VisualWebInstruct is a novel approach for creating large-scale, high-quality multimodal reasoning datasets without expensive human annotation. By leveraging Google Image Search, we build a comprehensive dataset spanning multiple disciplines that enhances vision-language models' reasoning capabilities.

Figure: The VisualWebInstruct data generation pipeline (seed images → Google Image Search → web extraction → filtering → high-quality multimodal dataset).

Our approach addresses a critical bottleneck in visual reasoning models by enabling the creation of diverse instruction data spanning mathematics, physics, chemistry, finance, and more.
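
The pipeline stages above can be made concrete with a short sketch. This is a minimal illustration, not the released implementation: the helper functions reverse_image_search and extract_qa_pairs are hypothetical stand-ins for the Google Image Search lookup and for the LLM-based extraction, filtering, and answer-synthesis steps described in the paper; only the control flow is faithful.

    # Hypothetical sketch of the VisualWebInstruct collection flow.
    # The helpers below are stand-ins, not a real public API; they only
    # name the stages: image search -> HTML collection -> QA extraction.
    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def reverse_image_search(image_path: str) -> list[str]:
        """Stage 1 (assumed): URLs of pages containing visually similar images."""
        raise NotImplementedError("stand-in for a Google Image Search lookup")

    def fetch_html(url: str) -> str | None:
        """Stage 2: download the raw HTML of a candidate page."""
        try:
            resp = requests.get(url, timeout=10)
            return resp.text if resp.ok else None
        except requests.RequestException:
            return None

    def extract_qa_pairs(html: str) -> list[dict]:
        """Stage 3 (assumed): in the real pipeline an LLM extracts, filters,
        and synthesizes QA pairs from the page text; here we only expose the text."""
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        return [{"context": text[:2000]}]

    def build_dataset(seed_images: list[str]) -> list[dict]:
        """Glue the stages together (~30K seed images, ~700K URLs in the paper)."""
        dataset = []
        for image_path in seed_images:
            for url in reverse_image_search(image_path):
                html = fetch_html(url)
                if html:
                    dataset.extend(extract_qa_pairs(html))
        return dataset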

Key Results

Performance Gains

  • MMMU-Pro (Standard): 40.7%, state-of-the-art within the 10B parameter class
  • MathVista: 68.1%, leading performance on mathematical visual reasoning
  • DynaMath: 55.7%, superior reasoning on dynamic math problems
  • Overall improvement: +5-20% absolute gain across benchmarks

Dataset Statistics

  • Total QA pairs: 906K
  • Visual QA pairs: 347K
  • Unique images: 164K

Subject Distribution

Subject | Percentage | QA Pairs
Mathematics | 62.5% | 566K
Physics | 14.5% | 132K
Finance | 7.25% | 66K
Chemistry | 4.8% | 43K
Engineering | 4.35% | 39K
Others* | 6.6% | 60K

*Others includes Computer Science (2.25%), Biology (1.4%), General Knowledge (2.45%), and Humanities (0.5%)

Dataset Viewer

The full dataset can be browsed interactively on Hugging Face.
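
A minimal way to inspect the data programmatically is through the datasets library. The repository id and the column layout assumed below are taken from the project release and may differ in practice (a configuration name may also be required); check the dataset card before relying on them.

    # Minimal sketch: load VisualWebInstruct with Hugging Face `datasets`.
    # The repo id and any required configuration name are assumptions;
    # verify both against the dataset card.
    from datasets import load_dataset

    ds = load_dataset("TIGER-Lab/VisualWebInstruct", split="train")
    print(ds)     # row count and column names
    print(ds[0])  # one QA pair (question, answer, and an optional image)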

Comprehensive Model Comparison

Model | Size | MMMU val | MMMU-Pro std | MMMU-Pro vision | MathVista | MMVet | MathVerse | Dyna-Math | Avg
Closed-source Models
GPT-4o | - | 69.1 | 54.0 | 49.7 | 63.8 | 76.2 | 50.2 | 63.7 | 61.0
Gemini-1.5-Pro | - | 59.1 | 49.4 | 65.8 | 63.9 | 64.0 | 41.2 | 64.8 | 58.3
Claude-3.5-Sonnet | - | 68.3 | 55.0 | 48.0 | 67.7 | 75.4 | 44.2 | 60.5 | 59.9
Open-source General Vision-Language Models
Molmo | 8B | 45.3 | 28.3 | 18.9 | 51.6 | 58.0 | 18.9 | 41.6 | 37.5
Llava-OV | 7B | 48.8 | 29.5 | 18.7 | 63.2 | 58.6 | 26.2 | 40.3 | 40.8
Llama-3.2-Inst | 11B | 50.7 | 33.0 | 23.7 | 51.5 | 59.3 | 31.6 | 40.5 | 41.5
Qwen2-VL | 7B | 52.1 | 37.0 | 26.9 | 58.2 | 62.0 | 28.2 | 42.1 | 43.8
MAmmoTH-VL | 7B | 50.8 | 33.2 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.2
InternVL2.5 | 7B | 55.8 | 38.2 | 30.4 | 64.4 | 62.8 | 39.5 | 49.8 | 48.7
Phi-4-mini | 5.6B | 55.1 | 39.7 | 31.2 | 62.4 | 60.5 | 37.6 | 51.4 | 48.6
DeepSeek-VL2 | 27B | 51.1 | 31.4 | 24.3 | 62.8 | - | - | - | -
Specialized Reasoning Vision-Language Models
Llava-CoT-L | 11B | 50.1 | 31.6 | 20.4 | 54.8 | 60.3 | 30.2 | 44.8 | 41.7
Llava-CoT-M | 7B | 51.4 | 33.0 | 23.7 | 63.8 | 58.6 | 39.4 | 48.3 | 45.5
LlamaV-o1 | 11B | 49.1 | 31.5 | 22.4 | 54.4 | 63.6 | - | - | -
Mulberry | 7B | 55.0 | 36.8 | 23.6 | 63.1 | 60.9 | 31.0 | 45.1 | 45.0
Insight-V | 8B | 50.2 | 30.7 | 20.5 | 59.9 | 60.8 | 28.7 | 47.8 | 42.6
MM-Eureka | 8B | 49.2 | - | - | 67.1 | 60.7 | 40.4 | - | -
MAmmoTH-VL2 | 7B | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4
∆ over SoTA | - | -1.1 | +1.0 | -4.9 | +2.1 | +0.9 | +3.1 | +4.3 | +1.7

Evaluation results comparing MAmmoTH-VL2 with closed-source and open-source models across seven multimodal benchmarks.
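
The Avg column is consistent with an unweighted mean of the seven benchmark scores (for example, InternVL2.5: 340.9 / 7 ≈ 48.7). The short check below, with numbers copied from the MAmmoTH-VL2 row, reproduces the reported 50.4.

    # Reproduce the Avg entry for MAmmoTH-VL2 as the unweighted mean of the
    # seven benchmark scores listed in the table above.
    scores = {
        "MMMU val": 54.7, "MMMU-Pro std": 40.7, "MMMU-Pro vision": 26.3,
        "MathVista": 68.1, "MMVet": 64.5, "MathVerse": 42.6, "Dyna-Math": 55.7,
    }
    avg = sum(scores.values()) / len(scores)
    print(f"{avg:.1f}")  # -> 50.4, matching the Avg column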

Key Observations

  • MAmmoTH-VL2 achieves state-of-the-art performance in 5 out of 7 benchmarks among all open-source models in the 7B-11B parameter range
  • On MMMU-Pro standard, our model shows 40.7% accuracy, outperforming all other open-source models
  • Particularly strong performance on mathematical reasoning benchmarks, with significant improvements on MathVerse (+3.1% over SoTA) and Dyna-Math (+4.3% over SoTA)
  • Overall average performance of 50.4% represents a +1.7% improvement over the previous state-of-the-art
  • Performance gap with closed-source models has been significantly narrowed, especially on specialized reasoning tasks

Research Impact

VisualWebInstruct demonstrates a novel and scalable approach to creating high-quality multimodal datasets without expensive human annotation. By leveraging web search, we mine over 750,000 unique sources and construct a comprehensive dataset spanning multiple academic disciplines with diverse visual content.

The dataset's diversity in both subjects (spanning Mathematics, Physics, Chemistry, Economics, Engineering and other fields) and image types (over 163,000 unique images) creates a rich learning environment that helps models generalize to a wide range of real-world visual reasoning problems, from elementary concepts to complex college-level problems requiring multi-step deliberation.

MAmmoTH-VL2, our 7B-parameter model fine-tuned on VisualWebInstruct, achieves state-of-the-art performance within its parameter class on multiple benchmarks. The training recipe is straightforward: it requires no changes to the training or inference methodology, offering a simple yet effective way to enhance multimodal reasoning without specialized techniques.
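
Because no custom inference-time machinery is needed, the model should run with a standard LLaVA-OneVision-style pipeline. The sketch below is an assumption-laden example: the repository id TIGER-Lab/MAmmoTH-VL2, the availability of a transformers-format checkpoint, and the prompt format are not confirmed by this page and should be checked against the model card.

    # Hedged sketch: single-image inference with a LLaVA-OneVision-style checkpoint.
    # The repo id and checkpoint format are assumptions; consult the model card.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model_id = "TIGER-Lab/MAmmoTH-VL2"  # assumed repo id
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("problem.png")  # e.g. a geometry or circuit diagram
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image step by step."},
        ],
    }]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=512)
    print(processor.decode(output[0], skip_special_tokens=True))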

Reference

Please cite our paper if you use our code or results:

    @article{visualwebinstruct,
        title={VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search},
        author={Jia, Yiming and Li, Jiachen and Yue, Xiang and Li, Bo and Nie, Ping and Zou, Kai and Chen, Wenhu},
        journal={arXiv preprint arXiv:2503.10582},
        year={2025}
    }