MEGA-Bench

Scaling Multimodal Evaluation to over 500 Real-World Tasks


MEGA-Bench Team

*Core Contributors, Contributed equally in strategic advising
Corresponding to: jcchen.work@gmail.com, wenhuchen@uwaterloo.ca

MEGA-Bench contains 505 multimodal tasks with diverse data sources, input/output formats, and skill requirements. The taxonomy tree is derived from the application dimension, which guides and calibrates the annotation process. The benchmark is equipped with a suite of 45 evaluation metrics to handle various output formats beyond multiple-choice questions.

🔔News

[2024-10-14]: Paper released on arXiv. Data and evaluation code will be released soon.

Introduction

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multiple-choice questions (as in MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats, including numbers, phrases, code, \( \LaTeX \), coordinates, JSON, and free-form text. To accommodate these formats, we developed over 40 metrics for evaluating the tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, input/output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
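As a rough illustration of how format-specific metrics can be organized, the sketch below dispatches each prediction to a metric declared by the task. The field names and metric set here are illustrative assumptions, not the released MEGA-Bench schema.

    # Minimal sketch of per-task metric dispatch (field names and metric set
    # are illustrative assumptions, not the released MEGA-Bench schema).
    import json

    def exact_str_match(pred: str, ref: str) -> float:
        return float(pred.strip() == ref.strip())

    def number_match(pred: str, ref: str, tol: float = 1e-6) -> float:
        try:
            return float(abs(float(pred) - float(ref)) <= tol)
        except ValueError:
            return 0.0

    def json_field_match(pred: str, ref: str) -> float:
        # Score by the fraction of reference JSON fields reproduced exactly.
        try:
            p, r = json.loads(pred), json.loads(ref)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(p, dict) or not isinstance(r, dict) or not r:
            return 0.0
        return sum(p.get(k) == v for k, v in r.items()) / len(r)

    METRICS = {
        "exact_str_match": exact_str_match,
        "number_match": number_match,
        "json_field_match": json_field_match,
    }

    def score_sample(task: dict, pred: str, ref: str) -> float:
        # Each task declares which metric matches its answer format.
        return METRICS[task["metric"]](pred, ref)

    # Example: score_sample({"metric": "number_match"}, " 3.14 ", "3.14") -> 1.0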

MEGA-Bench Visualization

Interactive Task Viewer

Navigate through our application-based task taxonomy!
Click a leaf node to view a visualization page with detailed task information,
including the evaluation results of several flagship models.
(This will be enabled once the data is released.)

Per-dimension Statistics

Statistics are broken down by input format, output format, number of visual inputs, and skills.

Breakdown Model Analysis

Choose one of the five dimensions and a pre-defined model set. Click a model name to show or hide it on the radar map. We also provide a page that shows the full details of a single model and compares it with a reference model (click the "Detailed Model Report" button).
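Conceptually, each axis of the radar map is an average over the tasks that share a dimension tag. The sketch below illustrates this grouping; the field names ("score", "skills") are assumptions, not the actual report-generation code.

    # Sketch: aggregate task-level scores into a per-dimension breakdown
    # (field names such as "score" and "skills" are assumptions).
    from collections import defaultdict
    from statistics import mean

    def per_dimension_report(task_results, dimension):
        """task_results: [{"score": 0.75, "skills": ["OCR", "math"], ...}, ...]"""
        buckets = defaultdict(list)
        for task in task_results:
            tags = task[dimension]
            for tag in tags if isinstance(tags, list) else [tags]:
                buckets[tag].append(task["score"])
        return {tag: mean(scores) for tag, scores in buckets.items()}

    # A task tagged with several skills contributes its score to each of them.
    results = [{"score": 0.75, "skills": ["OCR"]},
               {"score": 0.25, "skills": ["OCR", "math"]}]
    print(per_dimension_report(results, "skills"))  # {'OCR': 0.5, 'math': 0.25}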

Detailed Results

Leaderboard

We evaluate a wide variety of vision-language models, covering both proprietary and open-source ones. Our evaluation is conducted in a zero-shot setting to assess each model's ability to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark. For all models, we use the default prompt provided by each model for multiple-choice or open-ended QA, if available. If a model does not provide prompts for the task types in MEGA-Bench, we conduct prompt engineering on the validation set and use the most effective prompt for the subsequent zero-shot experiments.
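For concreteness, the protocol can be summarized with the schematic below; the model interface, field names, and helper functions are placeholders rather than the released evaluation harness.

    # Schematic of the zero-shot protocol (the model interface and field names
    # are placeholders, not the released evaluation code).
    def build_prompt(task_instruction, question):
        # Zero-shot: only the task instruction and the query, no demonstrations.
        return f"{task_instruction}\n\n{question}"

    def evaluate_zero_shot(model, tasks, metric_fns):
        scores = []
        for task in tasks:
            metric = metric_fns[task["metric"]]  # format-specific scoring
            for sample in task["samples"]:
                prompt = build_prompt(task["instruction"], sample["question"])
                pred = model.generate(images=sample["images"], text=prompt)
                scores.append(metric(pred, sample["answer"]))
        return sum(scores) / len(scores)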


The leaderboard groups models into Proprietary and Open-Source; each entry reports the model's Core, Open-ended, and Overall scores.

Overall results of different models on the MEGA-Bench leaderboard. The best-performing model in each category is in bold, and the second best is underlined.
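For reference, one plausible way to combine the Core and Open-ended columns into the Overall score is a task-count-weighted average, sketched below. This is an assumption for illustration, not necessarily the exact formula used on the leaderboard, and the task counts in the example are made up.

    # Assumed aggregation of Core and Open-ended scores into an Overall score:
    # a task-count-weighted average (illustrative; see the paper for the exact rule).
    def overall_score(core_score, n_core_tasks, open_score, n_open_tasks):
        total = n_core_tasks + n_open_tasks
        return (core_score * n_core_tasks + open_score * n_open_tasks) / total

    # Example with made-up numbers: 440 Core tasks and 65 Open-ended tasks.
    print(overall_score(0.50, 440, 0.60, 65))  # ~0.513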

Error Analysis

To understand the limitations of state-of-the-art VLMs, we analyze the GPT-4o (0513) results by manually identifying the error types over a subset of 255 tasks from the Core set of MEGA-Bench. We use the results with Chain-of-Thought prompting (see the paper for details), since the reasoning process helps determine the error type. For GPT-4o, the lack of various reasoning capabilities (e.g., symbolic reasoning for planning/coding tasks, spatial or temporal reasoning for complex perception tasks) is the dominant failure mode on MEGA-Bench.

The task-wise error distribution of GPT-4o (0513) over a subset of 255 MEGA-Bench tasks.
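The plotted distribution is essentially a normalized tally of the manually assigned labels; a minimal sketch is below, with illustrative label names rather than the exact error taxonomy from the paper.

    # Sketch: normalize manually assigned error labels into a distribution
    # (the label strings are illustrative, not the exact taxonomy in the paper).
    from collections import Counter

    def error_distribution(labels):
        counts = Counter(labels)
        total = sum(counts.values())
        return {label: count / total for label, count in counts.items()}

    labels = ["symbolic reasoning", "spatial reasoning", "symbolic reasoning",
              "perception"]
    print(error_distribution(labels))
    # -> {'symbolic reasoning': 0.5, 'spatial reasoning': 0.25, 'perception': 0.25}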

Detailed Behavior Inspection

We provide examples to inspect the behavior of different models on MEGA-Bench, covering both correct and incorrect cases.

BibTeX


    @article{chen2024mega-bench,
      title={MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks},
      author={Chen, Jiacheng and Liang, Tianhao and Siu, Sherman and Wang, Zhengqing and Wang, Kai and Wang, Yubo and Ni, Yuansheng and Zhu, Wang and Jiang, Ziyan and Lyu, Bohan and Jiang, Dongfu and He, Xuan and Liu, Yuan and Hu, Hexiang and Yue, Xiang and Chen, Wenhu},
      journal={arXiv preprint arXiv:2410.10563},
      year={2024}
    }