Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To accommodate these diverse information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance on existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are key to UniIR's generalization ability. In addition, we construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
We propose the UniIR (Universal multimodal Information Retrieval) framework to learn a single retriever that can accomplish (potentially) any retrieval task. Unlike traditional IR systems, UniIR must follow instructions to take a heterogeneous query and retrieve from a heterogeneous candidate pool with millions of candidates spanning diverse modalities.
We experimented with two multimodal fusion mechanisms for UniIR models: score-level fusion and feature-level fusion. To study their effectiveness, we adapted pre-trained models such as CLIP and BLIP to both designs. In all UniIR models, the instruction is prepended to the text query as a prefix.
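As a rough illustration of the two designs, the Python sketch below contrasts score-level and feature-level fusion with CLIP-style encoders. The encoder stubs, the 512-dimensional embeddings, and the small MLP mixer are placeholders for illustration only, not the released UniIR implementation.

# Minimal sketch of the two fusion strategies (illustrative, not the released UniIR code).
# Assumes CLIP/BLIP-style encoders that map an image or a piece of text to a d-dim embedding.
import torch
import torch.nn.functional as F

d = 512  # embedding dimension (assumption)

def encode_text(text: str) -> torch.Tensor:
    # Placeholder for a CLIP/BLIP text encoder; returns a normalized d-dim vector.
    return F.normalize(torch.randn(d), dim=-1)

def encode_image(image) -> torch.Tensor:
    # Placeholder for a CLIP/BLIP image encoder.
    return F.normalize(torch.randn(d), dim=-1)

def score_level_fusion(instruction: str, text_query: str, image_query) -> torch.Tensor:
    # Encode each modality separately; the instruction is prepended to the text query.
    t = encode_text(instruction + " " + text_query)
    v = encode_image(image_query)
    # Fusing by summing the unimodal embeddings means the final dot-product score
    # is the sum of the per-modality similarity scores.
    return F.normalize(t + v, dim=-1)

def feature_level_fusion(instruction: str, text_query: str, image_query) -> torch.Tensor:
    # Mix the two modalities inside the model before producing one multimodal embedding.
    # A small MLP stands in here for the deeper fusion (e.g., cross-attention) used in practice.
    t = encode_text(instruction + " " + text_query)
    v = encode_image(image_query)
    mixer = torch.nn.Sequential(
        torch.nn.Linear(2 * d, d), torch.nn.GELU(), torch.nn.Linear(d, d)
    )
    return F.normalize(mixer(torch.cat([t, v], dim=-1)), dim=-1)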
To train and evaluate unified multimodal retrieval models, we build a large-scale retrieval benchmark named M-BEIR (Multimodal BEnchmark for Instructed Retrieval). M-BEIR comprises 8 multimodal retrieval tasks and 10 datasets spanning a variety of domains and image sources. Each task is accompanied by human-authored instructions, and the benchmark contains 1.5 million queries and a pool of 5.6 million retrieval candidates in total.
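To make the data layout concrete, here is a hypothetical sketch of what an M-BEIR-style query and candidate might contain; the field names are illustrative assumptions and do not reflect the benchmark's released schema.

# Hypothetical sketch of an M-BEIR-style instance (field names are assumptions).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MBEIRQuery:
    instruction: str                  # human-authored task instruction,
                                      # e.g. "Retrieve a news image matching this headline."
    query_text: Optional[str] = None  # text side of the query (caption, question, headline)
    query_image: Optional[str] = None # path to the query image, if the task uses one
    positive_candidates: List[str] = field(default_factory=list)  # ids in the shared pool

@dataclass
class MBEIRCandidate:
    cand_id: str
    text: Optional[str] = None        # candidate text (None for image-only candidates)
    image: Optional[str] = None       # candidate image path (None for text-only candidates)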
Our evaluation covers zero-shot SoTA models, multi-task fine-tuned baselines (fine-tuned jointly on all M-BEIR training data without instructions), and UniIR models. Retrieval is performed over the full M-BEIR pool of 5.6 million candidates, which aggregates the retrieval corpora of all tasks. We show that zero-shot models struggle to retrieve the queried information from such a heterogeneous pool, and that instruction tuning is a crucial component of UniIR.
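The sketch below shows one way such retrieval over a single pooled candidate index could be run, assuming exact inner-product search with FAISS over L2-normalized embeddings; the dimension, pool size, and random vectors are placeholders standing in for the 5.6 million M-BEIR candidates and the fused query embeddings.

# Sketch of retrieval against one pooled candidate index (assumption: exact
# inner-product search with FAISS; the real pool mixes text, image, and
# text+image candidates, all embedded into the same space).
import numpy as np
import faiss

d = 512                                                        # embedding dimension (assumption)
candidate_embs = np.random.rand(10_000, d).astype("float32")   # stand-in for 5.6M candidate embeddings
faiss.normalize_L2(candidate_embs)

index = faiss.IndexFlatIP(d)          # cosine similarity via normalized inner product
index.add(candidate_embs)

query_emb = np.random.rand(1, d).astype("float32")  # stand-in for the fused query embedding
faiss.normalize_L2(query_emb)
scores, cand_ids = index.search(query_emb, 10)      # top-10 candidates from the mixed pool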
To assess generalization, we excluded three datasets (WebQA, OVEN, CIRR) from the multi-task fine-tuning stage and fine-tuned UniIR models and multi-task baselines on the remaining M-BEIR datasets. At test time, we evaluated the zero-shot performance of all fine-tuned models, as well as SoTA pre-trained retrievers, on the three held-out datasets. UniIR models exhibit superior generalization to unseen tasks and datasets compared with the baselines.
@article{wei2023uniir,
  title={UniIR: Training and Benchmarking Universal Multimodal Information Retrievers},
  author={Wei, Cong and Chen, Yang and Chen, Haonan and Hu, Hexiang and Zhang, Ge and Fu, Jie and Ritter, Alan and Chen, Wenhu},
  journal={arXiv preprint arXiv:2311.17136},
  year={2023}
}