State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operation, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction and build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640×360) on a single GPU, while transformer-based models can only encode 256 frames. For long video inputs, VAMBA achieves at least a 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.6% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.
To reduce the computational and memory cost of video LMMs, previous efforts have primarily focused on reducing the number of vision tokens in the input sequence. In this study, we investigate a direction orthogonal to these approaches: instead of compressing the video tokens, we develop an alternative model architecture that improves the efficiency of processing video tokens during training and the pre-filling stage of inference. We propose Vamba, a hybrid Mamba-Transformer model for efficient hour-long video understanding. The key insight behind our method is that we can design efficient modules to approximate the causal self-attention operation over both text and video tokens in transformer-based LMMs. In particular, we propose to (1) use cross-attention to update the text tokens based on the video tokens, which is affordable because the text sequence is short, and (2) adopt Mamba-2 to process the massive number of video tokens with linear complexity.
The main computational overhead in transformer-based LMMs comes from the quadratic complexity of self-attention over the video tokens. To overcome this issue, we design a hybrid Mamba-Transformer architecture that processes text and video tokens differently. The key idea is to split the expensive self-attention operation over the entire video and text token sequence into two more efficient components. Since video tokens typically dominate the sequence while text tokens remain few, we retain the self-attention mechanism exclusively for the text tokens and eliminate it for the video tokens. Instead, we add cross-attention layers that use the text tokens as queries and the video tokens as keys and values. Meanwhile, we employ Mamba-2 blocks to efficiently process the video tokens, as sketched below.
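To make the split concrete, below is a minimal PyTorch sketch of one such hybrid block. This is not the authors' released code: it assumes the `mamba_ssm` package exposes a `Mamba2` module (mamba-ssm 2.x, CUDA-only kernels), and the names `HybridBlock`, `dim`, and `num_heads` are illustrative.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba2  # assumed available (mamba-ssm >= 2.x)


class HybridBlock(nn.Module):
    """Illustrative hybrid layer: text self-attention + text-to-video cross-attention + video Mamba-2."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.text_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_mamba = Mamba2(d_model=dim)  # linear-complexity mixer for video tokens
        self.norm_t1 = nn.LayerNorm(dim)
        self.norm_t2 = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, T_text, D) with small T_text; video: (B, T_video, D) with very large T_video.
        # 1) Causal self-attention over the short text sequence only (cheap: O(T_text^2)).
        t = self.norm_t1(text)
        causal = torch.triu(
            torch.ones(t.size(1), t.size(1), dtype=torch.bool, device=t.device), diagonal=1
        )
        text = text + self.text_self_attn(t, t, t, attn_mask=causal, need_weights=False)[0]
        # 2) Cross-attention: text queries attend to video keys/values (O(T_text * T_video)).
        t = self.norm_t2(text)
        text = text + self.text_cross_attn(t, video, video, need_weights=False)[0]
        # 3) Mamba-2 updates the video tokens with linear complexity in T_video.
        video = video + self.video_mamba(self.norm_v(video))
        return text, video
```

Stacking blocks of this form replaces full self-attention over the concatenated sequence, which scales with (T_text + T_video)^2, with text-only self-attention, text-to-video cross-attention, and a linear-time scan over the video tokens.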
To understand our model's runtime efficiency gains over the baseline transformer-based LMM (Qwen2-VL-7B), we conduct an efficiency analysis covering both training and inference. Our results show that our model requires over 50% less training memory when processing videos with more than 16 frames. This efficiency gain allows us to handle a larger number of video frames during training (512 vs. 128). Furthermore, our efficient design also accelerates training, achieving nearly a 2x speedup per training step when working with more than 64 frames. For inference, Vamba's memory usage grows more slowly as the number of frames increases, allowing it to handle four times as many frames as Qwen2-VL-7B on a single NVIDIA A800 80GB GPU (1024 vs. 256). Regarding computational cost, Vamba reduces FLOPs by 30% to 50% during inference, demonstrating significantly lower complexity than its transformer-based LMM counterparts.
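As a rough illustration of why the gap widens with sequence length (not the exact accounting used in the paper), the snippet below compares the leading term of full self-attention over the entire token sequence against text-only self-attention plus cross-attention plus a linear-time video mixer. The token counts, hidden size, and constants are placeholder assumptions.

```python
# Back-of-the-envelope FLOP estimates (leading token-mixing terms only);
# all dimensions and constants are illustrative assumptions, not the paper's measurements.
def full_attn_flops(n_video: int, n_text: int, dim: int) -> float:
    n = n_video + n_text
    return 4 * n * n * dim  # QK^T plus attention-weighted V over the full sequence


def hybrid_flops(n_video: int, n_text: int, dim: int, d_state: int = 128) -> float:
    text_self = 4 * n_text * n_text * dim        # self-attention over text tokens only
    text_cross = 4 * n_text * n_video * dim      # text queries over video keys/values
    video_linear = 10 * n_video * dim * d_state  # linear-time SSM-style scan (rough constant)
    return text_self + text_cross + video_linear


if __name__ == "__main__":
    dim, n_text = 1024, 256
    for frames in (64, 256, 1024):
        n_video = frames * 180  # assume ~180 video tokens per frame (placeholder)
        full = full_attn_flops(n_video, n_text, dim)
        hyb = hybrid_flops(n_video, n_text, dim)
        print(f"{frames:5d} frames: full attention {full:.2e} FLOPs vs hybrid {hyb:.2e} FLOPs")
```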
We conduct extensive evaluations across various video understanding tasks to demonstrate the effectiveness of Vamba.
Hour-long Video Understanding: Vamba consistently outperforms all efficient video LMMs across the three hour-long video benchmarks, highlighting its exceptional ability to understand and reason over hour-scale videos. Notably, our model surpasses the baseline Qwen2-VL-7B on LVBench, and its performance on HourVideo is also very close to that of Qwen2-VL-7B. These results underscore that Vamba is competitive with the best open-source transformer-based LMMs while being significantly more efficient during training and inference.
Medium-Length or Short Video Understanding: Vamba demonstrates superior performance across three medium-length video understanding benchmarks (with average video durations between 10 and 20 minutes), ranking first among efficient video LMMs on all metrics. On shorter videos, Vamba also achieves competitive performance, ranking first on NExT-QA and DREAM-1K and second on MVBench among efficient LMMs. Overall, our model delivers the best results on medium-length and long video benchmarks, demonstrating its strong ability to handle long-context video-language inputs.
@misc{ren2025vambaunderstandinghourlongvideos,
title={Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers},
author={Weiming Ren and Wentao Ma and Huan Yang and Cong Wei and Ge Zhang and Wenhu Chen},
year={2025},
eprint={2503.11579},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.11579},
}