Mantis: Interleaved Multi-Image Instruction Tuning

Balancing Multi-Image and Single-Image Abilities of Large Multimodal Models

University of Waterloo · Tsinghua University · Sea AI Lab

Abstract

πŸ€” The recent years have witnessed a great array of large multimodal models (LMMs) to effectively solve single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved.
😦 The existing multi-image LMMs (e.g. OpenFlamingo, Emu, Idefics, etc) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from web, which is neither efficient nor effective.

  1. Mantis-Instruct Data. We present the first fully text-image interleaved multimodal instruction-tuning dataset, containing 721K examples from 14 subsets and covering multi-image skills including co-reference, reasoning, comparison, and temporal understanding.
  2. Mantis Model. We introduce Mantis, a LLaMA-3-based LMM that takes interleaved text and images as input, trained on Mantis-Instruct with academic-level resources (i.e., 36 hours on 16xA100-40G).
  3. Performance. Mantis reaches state-of-the-art performance on five multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval), and also maintains strong single-image performance on par with CogVLM and Emu2.
  4. Open-source. Our curated Mantis-Instruct data, along with the training/evaluation code and model checkpoints, is fully released to the public.

Interleaved Multi-Image Instruction-Tuning Data

  • Mantis-Instruct has a total of 721K instances, consisting of 14 subsets that cover all the multi-image skills.
  • Among the 14 subsets, 10 are drawn from existing datasets: for example, NLVR2 and IconQA for the reasoning skill; DreamSim and Birds-to-Words for the comparison skill; NExT-QA and STAR for temporal understanding.
  • We additionally curate four new subsets: LLaVA-665k-multi and LRV-multi to cover the co-reference skill, and Contrast-Caption and Multi-VQA to broaden the reasoning skill, where Multi-VQA is generated by prompting GPT-4.
  • Please check out 🤗 Mantis-Instruct on Hugging Face Datasets for usage; a minimal loading sketch follows the table below.
| Subset | Multi-Image Skill | Sample Size |
|---|---|---|
| LLaVA-665k-multi | Coref | 313K |
| LRV-multi | Coref | 8K |
| NLVR2 | Reason | 86K |
| IconQA | Reason | 64K |
| Contrast-Caption | Reason | 36K |
| ImageCoDe | Reason | 17K |
| Multi-VQA | Reason | 5K |
| Co-Instruct | Compare | 151K |
| DreamSim | Compare | 16K |
| Spot-the-Diff | Compare | 8K |
| Birds-to-Words | Compare | 3K |
| VIST | Temporal | 7K |
| NExT-QA | Temporal | 4K |
| STAR | Temporal | 3K |
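Below is a minimal loading sketch using the Hugging Face `datasets` library. The repo id, the `multi_vqa` config name, and the field names are assumptions for illustration; please consult the dataset card for the exact schema.

```python
# Minimal loading sketch, assuming the dataset is hosted on the Hugging Face Hub
# with one config per subset. The repo id, config name, and field names below
# are assumptions for illustration; check the dataset card for exact details.
from datasets import load_dataset

# Load a single subset, e.g. the GPT-4-generated Multi-VQA subset (config name assumed).
multi_vqa = load_dataset("TIGER-Lab/Mantis-Instruct", "multi_vqa", split="train")

# Each example pairs a list of images with an interleaved multi-turn conversation
# that refers back to those images; inspect the fields to see the exact layout.
example = multi_vqa[0]
print(example.keys())
```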

Mantis Model

Mantis adopts LLaVA's architecture, which connects a pre-trained vision encoder to a large language model through a simple projection module: we use CLIP or SigLIP as the vision encoder and Meta-Llama-3-8B-Instruct as the language model. To support higher-resolution inputs, we also train a variant based on SigLIP and Fuyu-8B. Training follows a two-stage instruction-tuning procedure. Please check out our [Model Zoo].
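To make the design concrete, here is a minimal PyTorch sketch of the LLaVA-style connection described above: a small MLP projector maps vision-encoder patch features into the language model's embedding space so that visual tokens can be interleaved with text tokens. The dimensions, projector depth, and patch count are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of a LLaVA-style projector (dimensions assumed:
# 1152 for a SigLIP-like encoder, 4096 for the Llama-3-8B hidden size).
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_images, num_patches, vision_dim)
        # returns:        (num_images, num_patches, llm_dim)
        return self.mlp(patch_features)

# Toy usage: two images worth of patch features, projected into the LLM space
# and ready to be spliced into the text-embedding sequence at image positions.
projector = MultimodalProjector()
visual_tokens = projector(torch.randn(2, 576, 1152))
print(visual_tokens.shape)  # torch.Size([2, 576, 4096])
```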

Performance

Multi-Image VQA: Towards GPT-4-level multi-image understanding

| Benchmark | Multi-Image Skill | Held-in/Held-out |
|---|---|---|
| NLVR2 | Reason | Held-in |
| Q-Bench | Reason | Held-in |
| Mantis-Eval | Reason & Co-reference | Held-out |
| BLINK | Reason | Held-out |
| MVBench | Temporal | Held-out |

We select five multi-image benchmarks that cover the four crucial multi-image skills (co-reference, reasoning, comparison, and temporal understanding) to evaluate Mantis.

Evaluation on these five benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval) shows that Mantis achieves state-of-the-art performance, demonstrating that it effectively learns the four crucial multi-image skills (co-reference, reasoning, comparison, and temporal understanding) from the interleaved text-image instruction dataset Mantis-Instruct. Mantis surpasses the second-best model, Idefics2-8B (pre-trained on 140M interleaved image-text examples), by an average of 9 absolute points, and trails GPT-4 by only 2 points.

Single-Image VQA: Maintaining strong performance

We also evaluate Mantis-8B-CLIP and Mantis-8B-SigLIP on various single-image tasks, including TextVQA, VQA-v2, MMBench, and MMMU. The Mantis models reach average performance on par with CogVLM and Emu2-Chat.

Examples of Visual Instruction Following
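To give a flavor of the interleaved format, here is a small illustrative example of a multi-image instruction. The `<image>` placeholder convention and the record layout are assumptions for illustration, not the exact released format.

```python
# Illustrative interleaved instruction: <image> placeholders mark where each
# image appears in the text, so the question can co-reference specific images.
# The placeholder token and record layout are assumptions for illustration.
images = ["image_1.jpg", "image_2.jpg"]  # hypothetical image files

record = {
    "images": images,
    "conversation": [
        {
            "role": "user",
            "content": "<image> This is the first room. <image> This is the "
                       "second room. Which room appears brighter, and why?",
        },
    ],
}
print(record["conversation"][0]["content"])
```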

BibTeX

@article{Jiang2024MANTISIM,
  title={MANTIS: Interleaved Multi-Image Instruction Tuning},
  author={Dongfu Jiang and Xuan He and Huaye Zeng and Cong Wei and Max W.F. Ku and Qian Liu and Wenhu Chen},
  journal={arXiv preprint arXiv:2405.01483},
  year={2024},
}