VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
TIGER-Lab@University of Waterloo
Correspondence to: yiming.jia@mail.utoronto.ca, wenhuchen@uwaterloo.ca

Introduction

Vision-Language Models have made significant progress on many perception-focused tasks; however, their progress on reasoning-focused tasks appears limited by the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse, high-quality dataset spanning multiple disciplines such as math, physics, finance, and chemistry. Starting with 30,000 meticulously selected seed images, we employ Google Image Search to identify websites containing similar images. We collect and process the HTML from over 700K unique URL sources. Through a pipeline of content extraction, filtering, and synthesis, we build a dataset of approximately 900K question-answer pairs, with 40% being visual QA pairs and the rest text-only QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance gains: (1) training from Llava-OV-mid yields 10-20% absolute gains across benchmarks, and (2) training from MAmmoTH-VL yields a 5% absolute gain. Our best model, MAmmoTH-VL2, shows state-of-the-art performance within the 10B parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath (55.7%). These results highlight the effectiveness of our dataset in enhancing VLMs' reasoning capabilities for complex multimodal tasks.

🂡 VisualWebInstruct

Overview

VisualWebInstruct is a novel approach for creating large-scale, high-quality multimodal reasoning datasets without expensive human annotation. By leveraging Google Image Search, we build a comprehensive dataset spanning multiple disciplines that enhances vision-language models' reasoning capabilities.

Figure: The VisualWebInstruct data generation pipeline (seed images → Google Image Search → web extraction → filtering → high-quality multimodal dataset).

Our approach addresses a critical bottleneck in visual reasoning models by enabling the creation of diverse instruction data spanning mathematics, physics, chemistry, finance, and more.
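
The pipeline stages above can be made concrete with a short sketch. This is a minimal illustration, not the released implementation: the helper functions reverse_image_search and extract_qa_pairs are hypothetical stand-ins for the Google Image Search lookup and for the LLM-based extraction, filtering, and answer-synthesis steps described in the paper; only the control flow is faithful.

    # Hypothetical sketch of the VisualWebInstruct collection flow.
    # The helpers below are stand-ins, not a real public API; they only
    # name the stages: image search -> HTML collection -> QA extraction.
    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def reverse_image_search(image_path: str) -> list[str]:
        """Stage 1 (assumed): URLs of pages containing visually similar images."""
        raise NotImplementedError("stand-in for a Google Image Search lookup")

    def fetch_html(url: str) -> str | None:
        """Stage 2: download the raw HTML of a candidate page."""
        try:
            resp = requests.get(url, timeout=10)
            return resp.text if resp.ok else None
        except requests.RequestException:
            return None

    def extract_qa_pairs(html: str) -> list[dict]:
        """Stage 3 (assumed): in the real pipeline an LLM extracts, filters,
        and synthesizes QA pairs from the page text; here we only expose the text."""
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        return [{"context": text[:2000]}]

    def build_dataset(seed_images: list[str]) -> list[dict]:
        """Glue the stages together (~30K seed images, ~700K URLs in the paper)."""
        dataset = []
        for image_path in seed_images:
            for url in reverse_image_search(image_path):
                html = fetch_html(url)
                if html:
                    dataset.extend(extract_qa_pairs(html))
        return dataset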

Key Results

Performance Gains

  • MMMU-Pro (Standard): 40.7%, state-of-the-art within the 10B parameter class
  • MathVista: 68.1%, leading performance on mathematical visual reasoning
  • DynaMath: 55.7%, superior reasoning on dynamic math problems
  • Overall improvement: +5-20% absolute gain across benchmarks

Dataset Statistics

  • Total QA pairs: 906K
  • Visual QA pairs: 347K
  • Unique images: 164K

Subject Distribution

Subject | Percentage | QA Pairs
Mathematics | 62.5% | 566K
Physics | 14.5% | 132K
Finance | 7.25% | 66K
Chemistry | 4.8% | 43K
Engineering | 4.35% | 39K
Others* | 6.6% | 60K

*Others includes Computer Science (2.25%), Biology (1.4%), General Knowledge (2.45%), and Humanities (0.5%)

Dataset Viewer

The full dataset can be browsed interactively on Hugging Face.
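
A minimal way to inspect the data programmatically is through the datasets library. The repository id and the column layout assumed below are taken from the project release and may differ in practice (a configuration name may also be required); check the dataset card before relying on them.

    # Minimal sketch: load VisualWebInstruct with Hugging Face `datasets`.
    # The repo id and any required configuration name are assumptions;
    # verify both against the dataset card.
    from datasets import load_dataset

    ds = load_dataset("TIGER-Lab/VisualWebInstruct", split="train")
    print(ds)     # row count and column names
    print(ds[0])  # one QA pair (question, answer, and an optional image)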

Comprehensive Model Comparison

Model | Size | MMMU val | MMMU-Pro std | MMMU-Pro vision | MathVista | MMVet | MathVerse | Dyna-Math | Avg
Closed-source Models
GPT-4o | - | 69.1 | 54.0 | 49.7 | 63.8 | 76.2 | 50.2 | 63.7 | 61.0
Gemini-1.5-Pro | - | 59.1 | 49.4 | 65.8 | 63.9 | 64.0 | 41.2 | 64.8 | 58.3
Claude-3.5-Sonnet | - | 68.3 | 55.0 | 48.0 | 67.7 | 75.4 | 44.2 | 60.5 | 59.9
Open-source General Vision-Language Models
Molmo | 8B | 45.3 | 28.3 | 18.9 | 51.6 | 58.0 | 18.9 | 41.6 | 37.5
Llava-OV | 7B | 48.8 | 29.5 | 18.7 | 63.2 | 58.6 | 26.2 | 40.3 | 40.8
Llama-3.2-Inst | 11B | 50.7 | 33.0 | 23.7 | 51.5 | 59.3 | 31.6 | 40.5 | 41.5
Qwen2-VL | 7B | 52.1 | 37.0 | 26.9 | 58.2 | 62.0 | 28.2 | 42.1 | 43.8
MAmmoTH-VL | 7B | 50.8 | 33.2 | 25.3 | 66.0 | 62.3 | 34.2 | 44.7 | 45.2
InternVL2.5 | 7B | 55.8 | 38.2 | 30.4 | 64.4 | 62.8 | 39.5 | 49.8 | 48.7
Phi-4-mini | 5.6B | 55.1 | 39.7 | 31.2 | 62.4 | 60.5 | 37.6 | 51.4 | 48.6
DeepSeek-VL2 | 27B | 51.1 | 31.4 | 24.3 | 62.8 | - | - | - | -
Specialized Reasoning Vision-Language Models
Llava-CoT-L | 11B | 50.1 | 31.6 | 20.4 | 54.8 | 60.3 | 30.2 | 44.8 | 41.7
Llava-CoT-M | 7B | 51.4 | 33.0 | 23.7 | 63.8 | 58.6 | 39.4 | 48.3 | 45.5
LlamaV-o1 | 11B | 49.1 | 31.5 | 22.4 | 54.4 | 63.6 | - | - | -
Mulberry | 7B | 55.0 | 36.8 | 23.6 | 63.1 | 60.9 | 31.0 | 45.1 | 45.0
Insight-V | 8B | 50.2 | 30.7 | 20.5 | 59.9 | 60.8 | 28.7 | 47.8 | 42.6
MM-Eureka | 8B | 49.2 | - | - | 67.1 | 60.7 | 40.4 | - | -
MAmmoTH-VL2 | 7B | 54.7 | 40.7 | 26.3 | 68.1 | 64.5 | 42.6 | 55.7 | 50.4
∆ over SoTA | - | -1.1 | +1.0 | -4.9 | +2.1 | +0.9 | +3.1 | +4.3 | +1.7

Evaluation results comparing MAmmoTH-VL2 with closed-source and open-source models across seven multimodal benchmarks.
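
The Avg column is consistent with an unweighted mean of the seven benchmark scores (for example, InternVL2.5: 340.9 / 7 ≈ 48.7). The short check below, with numbers copied from the MAmmoTH-VL2 row, reproduces the reported 50.4.

    # Reproduce the Avg entry for MAmmoTH-VL2 as the unweighted mean of the
    # seven benchmark scores listed in the table above.
    scores = {
        "MMMU val": 54.7, "MMMU-Pro std": 40.7, "MMMU-Pro vision": 26.3,
        "MathVista": 68.1, "MMVet": 64.5, "MathVerse": 42.6, "Dyna-Math": 55.7,
    }
    avg = sum(scores.values()) / len(scores)
    print(f"{avg:.1f}")  # -> 50.4, matching the Avg column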

Key Observations

  • MAmmoTH-VL2 achieves state-of-the-art performance in 5 out of 7 benchmarks among all open-source models in the 7B-11B parameter range
  • On MMMU-Pro standard, our model shows 40.7% accuracy, outperforming all other open-source models
  • Particularly strong performance on mathematical reasoning benchmarks, with significant improvements on MathVerse (+3.1% over SoTA) and Dyna-Math (+4.3% over SoTA)
  • Overall average performance of 50.4% represents a +1.7% improvement over the previous state-of-the-art
  • Performance gap with closed-source models has been significantly narrowed, especially on specialized reasoning tasks

Research Impact

VisualWebInstruct demonstrates a novel and scalable approach to creating high-quality multimodal datasets without expensive human annotation. By leveraging web search, we mine over 750,000 unique sources and construct a comprehensive dataset spanning multiple academic disciplines with diverse visual content.

The dataset's diversity in both subjects (spanning Mathematics, Physics, Chemistry, Economics, Engineering and other fields) and image types (over 163,000 unique images) creates a rich learning environment that helps models generalize to a wide range of real-world visual reasoning problems, from elementary concepts to complex college-level problems requiring multi-step deliberation.

MAmmoTH-VL2, our 7B-parameter model fine-tuned on VisualWebInstruct, achieves state-of-the-art performance within its parameter class on multiple benchmarks. The training recipe is straightforward: it requires no changes to the training or inference methodology, offering a simple yet effective way to enhance multimodal reasoning without specialized techniques.
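
Because no custom inference-time machinery is needed, the model should run with a standard LLaVA-OneVision-style pipeline. The sketch below is an assumption-laden example: the repository id TIGER-Lab/MAmmoTH-VL2, the availability of a transformers-format checkpoint, and the prompt format are not confirmed by this page and should be checked against the model card.

    # Hedged sketch: single-image inference with a LLaVA-OneVision-style checkpoint.
    # The repo id and checkpoint format are assumptions; consult the model card.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

    model_id = "TIGER-Lab/MAmmoTH-VL2"  # assumed repo id
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("problem.png")  # e.g. a geometry or circuit diagram
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Solve the problem in the image step by step."},
        ],
    }]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=512)
    print(processor.decode(output[0], skip_special_tokens=True))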

Reference

Please cite our paper if you use our code or results:

    @article{visualwebinstruct,
        title={VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search},
        author={Jia, Yiming and Li, Jiachen and Yue, Xiang and Li, Bo and Nie, Ping and Zou, Kai and Chen, Wenhu},
        journal={arXiv preprint arXiv:2503.10582},
        year={2025}
    }