VideoScore
Recent years have witnessed great advances in video generation. However, the development of automatic video metrics lags significantly behind. None of the existing metrics is able to provide reliable scores for generated videos. The main barrier is the lack of a large-scale human-annotated dataset.
VideoFeedback Dataset.
In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores over 37.6K synthesized videos from 11 existing video generative models.
VideoScore.
We train VideoScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans reaches 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that VideoScore achieves consistently much higher correlation with human judges than other metrics.
VideoFeedback Dataset
VideoFeedback contains a total of 37.6K text-to-video pairs from 11 popular video generative models, plus some real-world videos as data augmentation. The videos are annotated by raters on five evaluation dimensions: Visual Quality (VQ), Temporal Consistency (TC), Dynamic Degree (DD), Text-to-Video Alignment (TVA), and Factual Consistency (FC), each on a 1-4 scale. Below we show annotated examples from VideoFeedback. Please check out 🤗 VideoFeedback on Hugging Face Datasets for usage; a loading sketch follows the examples.
prompt: completely base your choice of which one to visit today on the dish that most entices your taste buds, 1080P, high quality, comic
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 3 | 3 | 1 | 3 | 3 |
prompt: an African American female video editor editing videos
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 1 | 1 | 3 | 3 | 1 |
prompt: Cinematic, A light rain is falling. Tea pickers are picking tea in a tea garden, 4K, anime style
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 3 | 2 | 3 | 3 | 1 |
prompt: crypto new year Christmas santa money dollars pack
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 1 | 2 | 3 | 3 | 1 |
prompt: Woman receiving a rose and blushing with a smile
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 2 | 2 | 3 | 3 | 2 |
prompt: panorama gold coast city in future as a dystopian prison
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 2 | 3 | 3 | 2 | 3 |
prompt: little bear looks surprised as the moon gets smaller
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 1 | 2 | 3 | 1 | 2 |
prompt: alexandra daddario, upperbody focus, slow motion, cinematic
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 2 | 2 | 3 | 3 | 1 |
prompt: cinematic portrait of two dogs running away from a medieval man
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 1 | 2 | 3 | 2 | 1 |
prompt: a skateboard on the bottom of a surfboard, front view
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 3 | 3 | 3 | 3 | 2 |
prompt: yellow van with trailer starts to back up
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 4 | 4 | 4 | 4 | 4 |
prompt: five gray wolf pups frolicking and chasing each other around a remote gravel road, surrounded by grass. The pups run and leap, chasing each other, and nipping at each other, playing
| VQ | TC | DD | TVA | FC |
|---|---|---|---|---|
| 4 | 2 | 4 | 2 | 4 |
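As referenced above, the annotations can be loaded with the Hugging Face `datasets` library. The sketch below is minimal and hedged: the dataset id `TIGER-Lab/VideoFeedback` and the `"annotated"` subset name are assumptions here, so please verify them against the dataset card.

```python
# Minimal sketch of loading VideoFeedback with Hugging Face `datasets`.
# The dataset id and the "annotated" subset name are assumptions; check the
# dataset card for the exact configuration and split names.
from datasets import load_dataset

feedback = load_dataset("TIGER-Lab/VideoFeedback", "annotated")
print(feedback)                            # available splits and their sizes
first_split = next(iter(feedback.values()))
print(first_split[0])                      # one record: text prompt plus the five 1-4 aspect scores
```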
VideoScore
VideoScore is fine-tuned on the 37K training split of VideoFeedback, with Mantis-8B-Idefics2 as the base model. We experiment with two scoring methods: generation scoring, where the model's answer follows a predefined template for video quality evaluation, and regression scoring, where the model outputs five logits as evaluation scores for the five dimensions. We also ablate the base model by fine-tuning Mantis-8B-Idefics2, Idefics2-8B, and VideoLLaVA-7B; Mantis-8B-Idefics2 turns out to perform best on video quality evaluation.
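To make the regression scoring method concrete, here is a minimal, purely illustrative sketch: a linear head over the backbone's last hidden state that produces five float scores, one per evaluation dimension. The hidden size, pooling choice, and module names are assumptions for illustration, not the actual VideoScore implementation.

```python
import torch
import torch.nn as nn

class RegressionScoringHead(nn.Module):
    """Illustrative regression-scoring head: map a pooled hidden state from the
    MLLM backbone to five scores (VQ, TC, DD, TVA, FC). The hidden size and
    pooling choice are assumptions, not the actual VideoScore implementation."""
    def __init__(self, hidden_size: int = 4096, num_aspects: int = 5):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_aspects)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size) from the backbone
        pooled = last_hidden_state[:, -1, :]   # take the final token's state
        return self.head(pooled)               # (batch, 5) float scores, one per dimension

# Toy usage with random features standing in for real backbone outputs
head = RegressionScoringHead()
fake_hidden = torch.randn(1, 128, 4096)
print(head(fake_hidden))
```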
VideoFeedback-test
We test VideoScore on the VideoFeedback-test set, which contains 760 videos with human scores on the five dimensions. We use the Spearman correlation between VideoScore and human annotations as the performance indicator. Below we show the results of feature-based metrics such as PIQE, CLIP-sim, and X-CLIP-Score, MLLM-prompting methods such as GPT-4o and Gemini-1.5-Pro, and our VideoScore.
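The correlation computation itself is standard; here is a minimal sketch with SciPy, using made-up placeholder scores rather than real VideoFeedback-test annotations.

```python
# Minimal sketch: Spearman correlation between model scores and human scores
# for one evaluation dimension, on made-up placeholder data.
from scipy.stats import spearmanr

human_scores = [3, 1, 4, 2, 3, 1]               # human ratings on the 1-4 scale
model_scores = [2.8, 1.2, 3.9, 2.1, 3.3, 1.5]   # e.g. VideoScore (reg) outputs

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3g})")
```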
EvalCrafter
We select three dimensions (Visual Quality, Temporal Consistency, and Text-to-Video Alignment) from EvalCrafter that match our evaluation aspects and collect 2500+ videos for testing. We again use the Spearman correlation between VideoScore and human annotations as the performance indicator.
GenAI-Bench & VBench
GenAI-Bench is a multimodal benchmark for MLLMs' capability in preference comparison on tasks such as text-to-video generation and image editing, while VBench is a comprehensive multi-aspect benchmark suite for video generative models. For GenAI-Bench we collect 2100+ videos for testing; for VBench we select a subset covering five of its aspects, such as technical quality and subject consistency, and subsample 100 unique prompts for four T2V models (2000 videos in total). For the MLLM-prompting baselines and VideoScore, we use the average of the five dimension scores to decide the preference, and we report pairwise accuracy as the performance indicator.
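Below is a minimal sketch of this pairwise protocol with made-up placeholder scores (not real GenAI-Bench data): average the five dimension scores per video, predict the higher-scoring video as preferred, and compare against the human choice.

```python
# Minimal sketch of the pairwise-preference protocol with placeholder data.
def average_score(dim_scores):
    """Average the five dimension scores (VQ, TC, DD, TVA, FC) of one video."""
    return sum(dim_scores) / len(dim_scores)

# Each item: (left video's 5 scores, right video's 5 scores, human preference).
pairs = [
    ([3, 3, 2, 3, 3], [1, 2, 3, 2, 1], "left"),
    ([2, 2, 3, 2, 2], [3, 3, 3, 4, 3], "right"),
    ([4, 4, 3, 4, 4], [2, 2, 2, 2, 2], "left"),
]

correct = 0
for left, right, human_choice in pairs:
    prediction = "left" if average_score(left) >= average_score(right) else "right"
    correct += prediction == human_choice

print(f"pairwise accuracy: {correct / len(pairs):.2f}")
```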
| Metric | Final Avg Score (sorted ↓) | VideoFeedback-test | EvalCrafter | GenAI-Bench | VBench |
|---|---|---|---|---|---|
| VideoScore (reg) | **69.6** | 75.7 | **51.1** | **78.5** | **73.0** |
| VideoScore (gen) | 55.6 | **77.1** | 27.6 | 59.0 | 58.7 |
| Gemini-1.5-Pro | <u>39.7</u> | 22.1 | 22.9 | 60.9 | 52.9 |
| Gemini-1.5-Flash | 39.4 | 20.8 | 17.3 | <u>67.1</u> | 52.3 |
| GPT-4o | 38.9 | <u>23.1</u> | 28.7 | 52.0 | 51.7 |
| CLIP-sim | 31.7 | 8.9 | <u>36.2</u> | 34.2 | 47.4 |
| DINO-sim | 30.3 | 7.5 | 32.1 | 38.5 | 43.3 |
| SSIM-sim | 29.5 | 13.4 | 26.9 | 34.1 | 43.5 |
| CLIP-Score | 28.6 | -7.2 | 21.7 | 45.0 | 54.9 |
| LLaVA-1.5-7B | 27.1 | 8.5 | 10.5 | 49.9 | 39.4 |
| LLaVA-1.6-7B | 23.3 | -3.1 | 13.2 | 44.5 | 38.7 |
| X-CLIP-Score | 23.2 | -1.9 | 13.3 | 41.4 | 40.1 |
| PIQE | 19.6 | -10.1 | -1.2 | 34.5 | <u>55.1</u> |
| BRISQUE | 19.0 | -20.3 | 3.9 | 38.5 | 53.7 |
| Idefics1 | 18.3 | 6.5 | 0.3 | 34.6 | 31.7 |
| MSE-dyn | 10.6 | -5.5 | -17.0 | 28.4 | 36.5 |
| SSIM-dyn | 9.2 | -12.9 | -26.4 | 31.4 | 44.5 |
For each benchmark, the best VideoScore variant is shown in bold and the best baseline is underlined.
VideoFeedback-test
All scores are on the discrete {1, 2, 3, 4} scale, except for VideoScore (reg), which outputs five float scores (logits) ranging from 0.50 to 4.50.
On the 1-4 scale: 1 = Bad, 2 = Average, 3 = Good, 4 = Perfect/Real.
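For easier side-by-side reading of the tables below, the float outputs of VideoScore (reg) can be rounded and clamped onto the 1-4 scale; this mapping is only a reading aid added here, not part of the method.

```python
# Purely illustrative reading aid: map VideoScore (reg) float outputs onto the
# discrete 1-4 scale used by human raters and the other methods.
def to_discrete(score: float) -> int:
    return max(1, min(4, int(score + 0.5)))  # round half up, clamp to [1, 4]

print([to_discrete(s) for s in [2.67, 0.81, 3.09, 2.50, 0.80]])  # -> [3, 1, 3, 3, 1]
```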
prompt: A robot that throws a stack of paper from a desk
| Method | VQ | TC | DD | TVA | FC | Method | VQ | TC | DD | TVA | FC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human score | 3 | 1 | 3 | 3 | 1 | ||||||
| VideoScore (reg) | 2.67 | 0.81 | 3.09 | 2.50 | 0.80 | VideoScore (gen) | 3 | 1 | 3 | 3 | 1 |
| GPT-4o | 3 | 4 | 2 | 3 | 4 | Gemini-1.5-Pro | 3 | 1 | 1 | 3 | 3 |
| Gemini-1.5-Flash | 3 | 1 | 1 | 3 | 3 | LLaVA-1.6-7B | 3 | 3 | 3 | 3 | 3 |
| LLaVA-1.5-7B | 3 | 3 | 3 | 3 | 2 | Idefics1 | 4 | 4 | 3 | 1 | 2 |
| PIQE | 1 | 1 | 1 | 1 | 1 | DINO-sim | 1 | 1 | 1 | 1 | 1 |
| SSIM-dyn | 3 | 3 | 3 | 3 | 3 | CLIP-Score | 2 | 2 | 2 | 2 | 2 |
prompt: Illustrate a bustling market scene, with fresh produce displayed on stalls, attracting villagers eager to purchase, cartoon style
| Method | VQ | TC | DD | TVA | FC | Method | VQ | TC | DD | TVA | FC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human score | 1 | 2 | 3 | 2 | 2 | ||||||
| VideoScore (reg) | 1.91 | 1.86 | 2.84 | 2.44 | 1.67 | VideoScore (gen) | 2 | 1 | 3 | 1 | 1 |
| GPT-4o | 3 | 3 | 3 | 4 | 4 | Gemini-1.5-Pro | 2 | 2 | 1 | 3 | 3 |
| Gemini-1.5-Flash | 3 | 1 | 1 | 2 | 3 | LLaVA-1.6-7B | 3 | 3 | 3 | 3 | 3 |
| LLaVA-1.5-7B | 3 | 3 | 3 | 2 | 2 | Idefics1 | 4 | 4 | 3 | 1 | 2 |
| PIQE | 2 | 2 | 2 | 2 | 2 | DINO-sim | 4 | 4 | 4 | 4 | 4 |
| SSIM-dyn | 2 | 2 | 2 | 2 | 2 | CLIP-Score | 3 | 3 | 3 | 3 | 3 |
prompt: Every day must be Sunday Amusement park inside the school
| Method | VQ | TC | DD | TVA | FC | Method | VQ | TC | DD | TVA | FC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human score | 1 | 1 | 3 | 2 | 1 | ||||||
| VideoScore (reg) | 1.04 | 1.42 | 2.95 | 1.97 | 1.09 | VideoScore (gen) | 1 | 1 | 3 | 2 | 1 |
| GPT-4o | 3 | 4 | 2 | 3 | 3 | Gemini-1.5-Pro | 2 | 1 | 2 | 2 | 1 |
| Gemini-1.5-Flash | 2 | 1 | 1 | 2 | 1 | LLaVA-1.6-7B | 3 | 3 | 3 | 2 | 2 |
| LLaVA-1.5-7B | 3 | 3 | 3 | 2 | 2 | Idefics1 | 4 | 4 | 3 | 1 | 2 |
| PIQE | 1 | 1 | 1 | 1 | 1 | DINO-sim | 3 | 3 | 3 | 3 | 3 |
| SSIM-dyn | 4 | 4 | 4 | 4 | 4 | CLIP-Score | 2 | 2 | 2 | 2 | 2 |
GenAI-Bench
In each item we have two videos generated from the same prompt and a human preference annotation. For VideoScore and the MLLM-prompting methods, we use the average score over all five dimensions to predict the preference, while for feature-based metrics we use their discretized output to predict the preference directly.
Left Video
prompt: a cute dog is playing a ball
Right Video
prompt: a cute dog is playing a ball
Left Video
prompt: An astronaut flying in space, oil painting
Right Video
prompt: An astronaut flying in space, oil painting
Citation
@article{he2024videoscore,
  title   = {VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation},
  author  = {He, Xuan and Jiang, Dongfu and Zhang, Ge and Ku, Max and Soni, Achint and Siu, Sherman and Chen, Haonan and Chandra, Abhranil and Jiang, Ziyan and Arulraj, Aaran and Wang, Kai and Do, Quy Duc and Ni, Yuansheng and Lyu, Bohan and Narsupalli, Yaswanth and Fan, Rongqi and Lyu, Zhiheng and Lin, Yuchen and Chen, Wenhu},
  journal = {ArXiv},
  volume  = {abs/2406.15252},
  year    = {2024},
  url     = {https://arxiv.org/abs/2406.15252}
}