Video-to-video editing takes a source video together with additional control (such as text prompts, subjects, or styles) and generates a new video that aligns with both the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework that simplifies video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix or InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video model and perform DDIM inversion and intermediate feature injection to maintain appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V outperforms the previous best approach by 35% on prompt alignment and 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive thanks to its ability to seamlessly integrate fast-evolving image editing methods, a compatibility that increases its versatility in catering to diverse user demands.
AnyV2V disentangles the video editing process into two stages: (1) first-frame image editing and (2) image-to-video reconstruction. The first stage benefits from the extensive range of existing image editing models, enabling (1) detailed and precise modification and (2) flexibility across editing tasks.
The AnyV2V framework takes a source video as input. In the first stage, we apply a black-box image editing method to the first frame according to the editing task. In the second stage, the source video is inverted to initial noise, which is then denoised using DDIM sampling. During this sampling process, we extract spatial features, spatial attention, and temporal attention from the image-to-video model's decoder layers. To generate the edited video, we perform DDIM sampling from the inverted latent, using the edited first frame as the conditioning signal. During sampling, we inject the extracted features and attention into the corresponding layers of the model.
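The core mechanism of the second stage is that DDIM sampling is deterministic, so a latent obtained by inversion can be denoised back to (approximately) the source video, and injecting the source features during this pass anchors the edited video to the source's appearance and motion. The sketch below is a toy illustration of that inversion/reconstruction round trip, not the AnyV2V implementation: the noise predictor here is a fixed array (so reconstruction is exact), whereas in practice it is the image-to-video model's UNet, conditioned on the edited first frame, with features and attention injected at each step.

```python
import numpy as np

def ddim_step(x, t_from, t_to, alphas, eps_fn):
    """One deterministic DDIM transition between timesteps.

    alphas[t] is the cumulative signal level at step t
    (alphas[0] = 1 means a clean latent).
    """
    a_from, a_to = alphas[t_from], alphas[t_to]
    eps = eps_fn(x, t_from)  # noise prediction (a real model's UNet in practice)
    x0_pred = (x - np.sqrt(1 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1 - a_to) * eps

def ddim_invert(x0, alphas, eps_fn):
    """Run the sampler 'backwards': clean latent -> initial noise."""
    x = x0
    for t in range(len(alphas) - 1):
        x = ddim_step(x, t, t + 1, alphas, eps_fn)
    return x

def ddim_sample(xT, alphas, eps_fn):
    """Denoise from the inverted latent back to a clean latent.
    In AnyV2V, feature/attention injection would happen inside each step here.
    """
    x = xT
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, t, t - 1, alphas, eps_fn)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(4)          # stand-in for a source-video latent
    fixed_eps = rng.standard_normal(4)       # toy noise predictor, independent of x
    alphas = np.linspace(1.0, 0.01, 10)      # signal schedule: clean -> noisy
    noise = ddim_invert(source, alphas, lambda x, t: fixed_eps)
    recon = ddim_sample(noise, alphas, lambda x, t: fixed_eps)
    print(np.allclose(recon, source))        # round trip recovers the latent
```

Because the toy predictor ignores `x`, inversion is exact here; with a real UNet the standard DDIM inversion approximation (reusing the noise estimate across adjacent timesteps) makes the round trip only approximate, which is one reason the feature injection helps preserve the source.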
AnyV2V is robust across a wide range of localized editing tasks while preserving the background. The generated result aligns best with the text prompt and also maintains high motion consistency.
Given a single subject image, AnyDoor-equipped AnyV2V replaces an object in the video with the target subject while maintaining the video motion and preserving the background.
AnyV2V uses neural style transfer (NST) to edit the style of the first frame. Prior text-based methods struggle with unseen styles. In contrast, AnyV2V seamlessly transfers any reference style to the video, uniquely allowing artists to use their own creations as references.
By integrating InstantID, AnyV2V enables swapping a person's identity in a video with a single reference face image. To the best of our knowledge, our work is the first to provide such flexibility in the video editing domain.
AnyV2V preserves video fidelity while applying the intended degree of editing.
[Video comparison: Source | AnyV2V (Ours) | TokenFlow | FLATTEN]
[Video comparison: Source | Reference | AnyV2V (Ours) | VideoSwap]
@article{ku2024anyv2v,
title={AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks},
author={Ku, Max and Wei, Cong and Ren, Weiming and Yang, Harry and Chen, Wenhu},
journal={arXiv preprint arXiv:2403.14468},
year={2024}
}