Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge: their quality is multi-faceted, encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models produce single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for comprehensive video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. The model is trained on VideoFeedback2, a large-scale dataset of 27,168 human-annotated videos with both scores and reasoning traces across the three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance, with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc.), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling.
VideoScore2 is trained on the VideoFeedback2 dataset, which contains 27K human-annotated videos with both scores and rationales across three dimensions. We adopt a two-stage pipeline: first, supervised fine-tuning (SFT) on Qwen2.5-VL-7B-Instruct to establish format-following and scoring ability; then, reinforcement learning with Group Relative Policy Optimization (GRPO) to further align model outputs with human judgment and enhance analytical robustness. Compared to VideoScore (v1), VideoScore2 introduces interpretable scoring across three dimensions (Visual Quality, Text Alignment, Physical/Common-sense Consistency) together with CoT-style rationales, achieving stronger generalization on out-of-domain benchmarks while providing transparent, human-aligned video evaluation.
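To make the GRPO training signal concrete, below is a minimal sketch of a score-matching reward function: it parses the three per-dimension scores from a sampled chain-of-thought response and rewards closeness to the human labels. The output tag format, the 1-5 score range, and the function names here are illustrative assumptions, not the exact implementation from our pipeline.

```python
import re

DIMENSIONS = ["visual_quality", "text_alignment", "physical_consistency"]

def parse_scores(response: str) -> dict | None:
    """Extract per-dimension scores from a CoT response.

    Assumes (hypothetically) the model ends its rationale with lines like
    'visual_quality: 3', one per dimension, on a 1-5 scale.
    """
    scores = {}
    for dim in DIMENSIONS:
        match = re.search(rf"{dim}\s*:\s*([1-5])", response)
        if match is None:
            return None  # malformed output earns no reward
        scores[dim] = int(match.group(1))
    return scores

def grpo_reward(response: str, human_scores: dict) -> float:
    """Reward in [0, 1]: 1 for exact agreement on all dimensions,
    decaying linearly with the mean absolute error against human labels."""
    predicted = parse_scores(response)
    if predicted is None:
        return 0.0
    mae = sum(abs(predicted[d] - human_scores[d]) for d in DIMENSIONS) / len(DIMENSIONS)
    return max(0.0, 1.0 - mae / 4.0)  # 4 = max possible error on a 1-5 scale

if __name__ == "__main__":
    human = {"visual_quality": 4, "text_alignment": 3, "physical_consistency": 2}
    sample = "...rationale...\nvisual_quality: 4\ntext_alignment: 3\nphysical_consistency: 3"
    print(grpo_reward(sample, human))  # ~0.917
```

In GRPO, a group of responses is sampled per training video, and each response's reward is normalized against the group mean to form its advantage, which is what makes a simple scalar reward like this sufficient.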
We test VideoScore2 on the test set of VideoFeedback2, which contains 500 videos with human scores along the three dimensions. Below we compare against existing image/video scoring and reward models (VideoReward, VisionReward, Q-Insight, DeQA-Score) as well as MLLM-prompting methods (Claude-Sonnet-4, GPT-5, Gemini-2.5-Pro, etc.).
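For reference, here is a minimal sketch of how such pointwise results can be computed: exact-match accuracy per dimension between predicted and human scores, then averaged. This is our illustrative assumption, not the exact evaluation script.

```python
from statistics import mean

DIMENSIONS = ["visual_quality", "text_alignment", "physical_consistency"]

def dimension_accuracy(predictions, labels, dimensions=DIMENSIONS):
    """Exact-match accuracy per dimension, plus the overall average.

    predictions/labels: lists of {dimension: score} dicts, one per video.
    """
    per_dim = {
        dim: mean(p[dim] == l[dim] for p, l in zip(predictions, labels))
        for dim in dimensions
    }
    per_dim["average"] = mean(per_dim.values())
    return per_dim

# Example with a single video:
preds = [{"visual_quality": 4, "text_alignment": 3, "physical_consistency": 2}]
gold  = [{"visual_quality": 4, "text_alignment": 2, "physical_consistency": 2}]
print(dimension_accuracy(preds, gold))
# {'visual_quality': 1, 'text_alignment': 0, 'physical_consistency': 1, 'average': 0.666...}
```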
We further test on four out-of-domain (OOD) benchmarks: two pairwise-preference benchmarks (VideoGenReward-Bench and the preference version of T2VQA-DB) and two pointwise-scoring benchmarks (MJ-Bench-Video and VideoPhy2-test). The preference benchmarks include tie labels. As the table below shows, while VideoScore2 is not always the top model on every individual benchmark, it achieves the highest overall average.
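To illustrate how a pointwise scorer is applied to pairwise-preference benchmarks with ties, here is a sketch; the tie margin is an assumption for illustration only.

```python
def pairwise_preference(score_a: float, score_b: float, tie_margin: float = 0.0) -> str:
    """Convert two pointwise scores into a preference label.

    A tie is declared when the scores are within `tie_margin`;
    0.0 means ties only on exact equality (the real threshold,
    if any, is an assumption here).
    """
    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "A" if score_a > score_b else "B"

def preference_accuracy(pairs, tie_margin: float = 0.0) -> float:
    """pairs: iterable of (score_a, score_b, gold) with gold in {'A', 'B', 'tie'}."""
    results = [pairwise_preference(a, b, tie_margin) == gold for a, b, gold in pairs]
    return sum(results) / len(results)
```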
We evaluate VideoScore2 with best-of-n (BoN) sampling (n = 5), where the model selects the best video among the candidates. Six T2V models of moderate or poor quality are used; very strong models are avoided so that the BoN effect remains visible. For each of 500 prompts, every model generates five videos (500 × 5 per model). Comparison on VBench shows that BoN consistently outperforms random sampling, confirming that VideoScore2 effectively guides selection toward higher-quality videos.
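A minimal sketch of the BoN selection loop is below; the scoring call and candidate representation are placeholders, and aggregating by the mean of the three dimension scores is our assumption rather than the exact selection rule.

```python
def best_of_n(candidates, score_fn):
    """Pick the candidate with the highest aggregate VideoScore2 score.

    candidates: list of video paths (or tensors) generated for one prompt.
    score_fn: callable returning a {dimension: score} dict per video;
              a placeholder for the actual VideoScore2 inference call.
    """
    def aggregate(video):
        scores = score_fn(video)
        return sum(scores.values()) / len(scores)  # mean over 3 dimensions (assumed)
    return max(candidates, key=aggregate)

# Usage for one prompt with n = 5 sampled videos:
# best = best_of_n(["vid_0.mp4", "vid_1.mp4", "vid_2.mp4", "vid_3.mp4", "vid_4.mp4"],
#                  videoscore2_score)  # hypothetical inference wrapper
```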
@misc{he2025videoscore2thinkscoregenerative,
  title={VideoScore2: Think before You Score in Generative Video Evaluation},
  author={Xuan He and Dongfu Jiang and Ping Nie and Minghao Liu and Zhengxuan Jiang and Mingyi Su and Wentao Ma and Junru Lin and Chun Ye and Yi Lu and Keming Wu and Benjamin Schneider and Quy Duc Do and Zhuofeng Li and Yiming Jia and Yuxuan Zhang and Guo Cheng and Haozhe Wang and Wangchunshu Zhou and Qunshu Lin and Yuanxing Zhang and Ge Zhang and Wenhao Huang and Wenhu Chen},
  year={2025},
  eprint={2509.22799},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.22799},
}