AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

♠️†Max Ku*, ♠️†Cong Wei*, ♠️†Weiming Ren*, Harry Yang, ♠️†Wenhu Chen
♠️University of Waterloo, Vector Institute, Harmony.AI
*Equal contribution
{m3ku, c58wei, w2ren, wenhuchen}@uwaterloo.ca

🔔News

  • 2024 Apr 16: Local Gradio demo now supports edits up to 16 seconds (128 frames).
  • 2024 Mar 21: Paper available on arXiv. AnyV2V is the first work to leverage I2V models for video editing!

Introduction

In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extend image-based generative models in a zero-shot manner or require extensive fine-tuning, both of which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, and (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tool to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V also supports videos of any length. Our evaluation indicates that AnyV2V outperforms baseline methods by a significant margin in both automatic and human evaluations, maintaining visual consistency with the source video while achieving high-quality edits across all the editing tasks.


AnyV2V Framework

AnyV2V disentangles the video editing process into two stages: (1) first-frame image editing and (2) image-to-video reconstruction. The first stage benefits from the extensive range of existing image editing models, enabling (1) detailed and precise modifications and (2) the flexibility to handle any editing task.

[Figure: AnyV2V framework overview]
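The two-stage design can be summarized with the following Python pseudocode. This is a minimal sketch rather than the released implementation: `edit_first_frame`, `ddim_inversion`, `ddim_sample`, and `i2v_model` are hypothetical placeholders standing in for any off-the-shelf image editor and image-to-video model.

```python
# Minimal sketch of the AnyV2V two-stage pipeline (hypothetical helper names).
import torch

def anyv2v_edit(source_frames: torch.Tensor, edit_first_frame, i2v_model, num_steps: int = 50):
    """source_frames: (T, C, H, W) video tensor. Returns the edited video."""
    # Stage 1: edit the first frame with any off-the-shelf image editing model.
    edited_frame0 = edit_first_frame(source_frames[0])

    # Stage 2a: invert the source video to its initial noise (DDIM inversion).
    inverted_latent = ddim_inversion(i2v_model, source_frames, num_steps=num_steps)

    # Stage 2b: reconstruct the source video once, caching decoder features and
    # spatial/temporal attention at every denoising step.
    feature_cache = {}
    ddim_sample(i2v_model, inverted_latent, image_cond=source_frames[0],
                num_steps=num_steps, collect_into=feature_cache)

    # Stage 2c: sample again from the same inverted latent, now conditioned on the
    # edited first frame, injecting the cached features so the output inherits the
    # source video's motion and layout.
    edited_video = ddim_sample(i2v_model, inverted_latent, image_cond=edited_frame0,
                               num_steps=num_steps, inject_from=feature_cache)
    return edited_video
```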

Method

The AnyV2V framework takes a source video as input. In the first stage, we apply a black-box image editing method to the first frame according to the editing task. In the second stage, the source video is inverted to its initial noise and then denoised with DDIM sampling to reconstruct it. During this reconstruction, we extract spatial features, spatial attention, and temporal attention from the image-to-video model's decoder layers. To generate the edited video, we perform DDIM sampling from the same inverted latent, using the edited first frame as the conditioning signal. During this sampling, we inject the extracted features and attention into the corresponding layers of the model.

[Figure: AnyV2V method overview]
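To make the feature and attention injection concrete, here is an illustrative PyTorch sketch of caching decoder outputs during the source-reconstruction pass and re-injecting them during the editing pass via forward hooks. The class name, the choice of hooked layers, and the timestep bookkeeping are assumptions for illustration; the actual injection layers and schedule follow the paper's settings.

```python
import torch
import torch.nn as nn

class FeatureInjector:
    """Collects decoder features/attention during the source-reconstruction pass,
    then re-injects them during the editing pass (illustrative, not the released code)."""

    def __init__(self):
        self.cache = {}       # (layer_name, timestep) -> cached output
        self.inject = False   # False: collect pass, True: inject pass
        self.timestep = None  # set externally before each denoising step

    def _make_hook(self, name: str):
        def hook(module: nn.Module, inputs, output):
            key = (name, self.timestep)
            if not self.inject:
                self.cache[key] = output.detach()   # collect from the source pass
                return output
            return self.cache.get(key, output)      # replace during the editing pass
        return hook

    def attach(self, unet: nn.Module, layer_names):
        """Register forward hooks on the chosen decoder layers; returns the handles."""
        return [module.register_forward_hook(self._make_hook(name))
                for name, module in unet.named_modules() if name in layer_names]

# Usage sketch:
#   injector = FeatureInjector()
#   handles = injector.attach(i2v_model.unet, decoder_layer_names)
#   ... run DDIM reconstruction of the source video (fills injector.cache) ...
#   injector.inject = True
#   ... run DDIM sampling conditioned on the edited first frame ...
#   for h in handles: h.remove()
```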

Experiment Results

Prompt-Based Editing

AnyV2V is robust across a wide range of localized editing tasks while preserving the background. Its results align most closely with the text prompt among the compared methods and also maintain high motion consistency.

"Make it snowing"
"Turn the man into darth vader"
"Turn the couple into robots"
"Turn the sand into snow"
"Turn horse into zebra"

Subject-Driven Editing

Given a single subject image, AnyV2V equipped with AnyDoor replaces an object in the video with the target subject while maintaining the video motion and preserving the background.


Style Transfer

AnyV2V uses neural style transfer (NST) or InstantStyle to perform the first-frame style edit. Prior text-based methods struggle with unseen styles; in contrast, AnyV2V seamlessly transfers any reference style to the video, uniquely allowing artists to use their own creations as style references.


Identity Manipulation

By integrating InstantID, AnyV2V can swap a person's identity in a video using a single reference face image. To the best of our knowledge, our work is the first to provide such flexibility in the video editing domain.


Extra

Long Video Editing

AnyV2V can edit videos longer than the number of frames the underlying I2V model was trained on.

"Turn woman into robot"
"change her clothes to a hoodie"

Comparison

AnyV2V preserves the fidelity of the source video while applying the intended amount of edit.

[Comparison videos: Source vs. AnyV2V (Ours) vs. TokenFlow and FLATTEN; Source and Reference vs. AnyV2V (Ours) vs. VideoSwap]

Citation

Please kindly cite our paper if you use our code, data, models or results:

@article{ku2024anyv2v,
  title={AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks},
  author={Ku, Max and Wei, Cong and Ren, Weiming and Yang, Harry and Chen, Wenhu},
  journal={arXiv preprint arXiv:2403.14468},
  year={2024}
}