AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks

Max Ku*¹ ², Cong Wei*¹ ², Weiming Ren*¹ ², Harry Yang³, Wenhu Chen¹ ²
¹University of Waterloo, ²Vector Institute, ³Harmony.AI
*Equal contribution
{m3ku, c58wei, w2ren, wenhuchen}@uwaterloo.ca

Introduction

Video-to-video editing involves editing a source video together with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with both the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix, InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing methods, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection, maintaining appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V outperforms the previous best approach by 35% on prompt alignment and 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive thanks to its ability to seamlessly integrate fast-evolving image editing methods; such compatibility helps AnyV2V increase its versatility to cater to diverse user demands.


AnyV2V Framework

AnyV2V disentangles the video editing process into two stages: (1) first-frame image editing and (2) image-to-video reconstruction. The first stage benefits from the extensive range of existing image editing models, enabling (1) detailed and precise modifications and (2) flexibility across arbitrary editing tasks.
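
As an illustration of the first stage, the sketch below edits only the first frame of a source video with an off-the-shelf image editor. This is a minimal example rather than the official AnyV2V code: the InstructPix2Pix pipeline from diffusers stands in for any pluggable image editing model, and the file names, model id, prompt, and hyperparameters are illustrative assumptions (reading the mp4 also assumes imageio with ffmpeg support installed).

# Stage 1 sketch (illustrative, not the official implementation): edit the
# first frame of the source video with an off-the-shelf image editor.
# InstructPix2Pix via diffusers is one example of a pluggable editor; the
# model id, prompt, and hyperparameters below are placeholder choices.
import imageio.v3 as iio
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

frames = iio.imread("source_video.mp4")      # (T, H, W, C) uint8 frames
first_frame = Image.fromarray(frames[0])     # stage 1 only touches frame 0

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix"
).to("cuda")

edited_first_frame = pipe(
    "turn the man into Darth Vader",         # any editing instruction
    image=first_frame,
    num_inference_steps=50,
    image_guidance_scale=1.5,
).images[0]
edited_first_frame.save("edited_first_frame.png")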


Method

The AnyV2V framework takes a source video as input. In the first stage, we apply a black-box image editing method to the first frame according to the editing task. In the second stage, the source video is inverted to its initial noise, which is then denoised using DDIM sampling. During this sampling process, we extract spatial features, spatial attention, and temporal attention from the image-to-video model's decoder layers. To generate the edited video, we perform DDIM sampling again from the same inverted latent, using the edited first frame as the conditioning signal. During this sampling, we inject the extracted features and attention into the corresponding layers of the model.
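
The sketch below outlines the second stage under stated assumptions: i2v_model, ddim_inversion, and the caching/injection hooks are hypothetical placeholders for an image-to-video backbone such as I2VGen-XL, not a real library API. They only illustrate the order of operations: invert, reconstruct while caching, then re-sample with the edited first frame while injecting.

# Stage 2 sketch with hypothetical helpers (ddim_inversion, enable_feature_caching,
# enable_feature_injection, ddim_sample); these names do not come from a real
# library and only outline the order of operations.
import torch

@torch.no_grad()
def anyv2v_second_stage(i2v_model, source_frames, edited_first_frame,
                        num_steps=50, inject_until_step=25):
    # 1) DDIM inversion: map the source video to its initial noise latent.
    latent_T = ddim_inversion(i2v_model, source_frames, num_steps=num_steps)

    # 2) Reconstruction pass: denoise from the inverted latent, conditioned on
    #    the ORIGINAL first frame, while caching convolution features plus
    #    spatial and temporal attention from the decoder layers.
    cache = {}
    i2v_model.enable_feature_caching(cache)
    i2v_model.ddim_sample(latent_T, cond_image=source_frames[0],
                          num_steps=num_steps)

    # 3) Editing pass: reuse the SAME initial latent, condition on the edited
    #    first frame, and inject the cached features/attention into the
    #    corresponding decoder layers during the early denoising steps.
    i2v_model.enable_feature_injection(cache, until_step=inject_until_step)
    edited_video = i2v_model.ddim_sample(latent_T, cond_image=edited_first_frame,
                                         num_steps=num_steps)
    return edited_video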


Experiment Results

Prompt based Editing

AnyV2V is robust across a wide range of localized editing tasks while preserving the background. Among the compared methods, its results align best with the text prompt while maintaining high motion consistency.

"Make it snowing"
"Turn the man into darth vader"
"Turn the couple into robots"
"Turn the sand into snow"
"Turn horse into zebra"

Subject Driven Editing

Given a single subject image, AnyV2V equipped with AnyDoor replaces an object in the video with the target subject while maintaining the video's motion and preserving the background.

[Video gallery: four subject-driven editing examples, each using a single reference image]

Style Transfer

AnyV2V uses Neural Style Transfer (NST) to edit the style of the first frame. Prior text-based methods struggle with unseen styles; in contrast, AnyV2V seamlessly transfers any reference style to the video, uniquely allowing artists to use their own creations as references.

[Video gallery: style transfer examples, each using a single reference style image]

Identity Manipulation

By integrating InstantID, AnyV2V can swap a person's identity in a video using a single reference face image. To the best of our knowledge, our work is the first to provide such flexibility in the video editing domain.

[Video gallery: identity manipulation examples, each using a single reference face image]

Comparison

AnyV2V preserves video fidelity while applying the intended edit.

[Video gallery: side-by-side comparisons of the source video (and, for subject-driven editing, the reference image), AnyV2V (ours), and the baselines TokenFlow, FLATTEN, and VideoSwap]

Citation

Please kindly cite our paper if you use our code, data, models or results:

@article{ku2024anyv2v,
  title={AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks},
  author={Ku, Max and Wei, Cong and Ren, Weiming and Yang, Harry and Chen, Wenhu},
  journal={arXiv preprint arXiv:2403.14468},
  year={2024}
}