VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Jinxi Xiang1, Ricong Huang2, Jun Zhang1, Guanbin Li2, Xiao Han1, Wei Yang1
1Tencent AI Lab, 2Sun Yat-sen University

Abstract

Creating stable, controllable videos remains challenging because it requires significant variation in temporal dynamics alongside cross-frame temporal consistency. To tackle this, we enhance the spatial-temporal capacity of diffusion models and introduce a versatile video generation model, VersVideo, which takes textual, visual, and stylistic conditions. Current video diffusion models typically extend image diffusion architectures by augmenting 2D operations (such as convolutions and attentions) with a 1D temporal operation. While efficient, this approach often limits spatial-temporal performance because it simplifies standard 3D operations. To overcome this, we incorporate multi-excitation paths for spatial-temporal convolutions, embedded alongside the 2D convolutions, together with multi-expert spatial-temporal attention blocks. These improvements boost the model's spatial-temporal performance without significantly increasing training and inference costs. We also address the information loss that occurs when a variational autoencoder converts pixel space into latent features and back into pixel frames: we integrate temporal modules into the decoder so that more temporal information is retained. Building on the new denoising UNet and decoder, we develop a unified ControlNet model suitable for various conditions, including image, Canny, HED, depth, and style. Examples of generated videos can be found on this page.
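To make the factorized design above concrete, below is a minimal PyTorch sketch of the general pattern: a (1, k, k) spatial convolution followed by a (k, 1, 1) temporal convolution, with squeeze-and-excitation-style paths that pool over different axes to modulate channels. The module names and the exact form of the excitation paths are simplified placeholders for illustration, not the actual VersVideo implementation.

```python
# Illustrative sketch only: a factorized spatio-temporal conv block with
# excitation paths. Class and parameter names are hypothetical placeholders.
import torch
import torch.nn as nn


class ExcitationPath(nn.Module):
    """SE-style channel gate computed from features pooled over given axes."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, pool_dims: tuple) -> torch.Tensor:
        # x: (B, C, T, H, W). Pool over pool_dims, keeping them as size-1
        # axes so the resulting gate broadcasts back over the input.
        pooled = x.mean(dim=pool_dims, keepdim=True)
        gate = self.fc(pooled.movedim(1, -1)).movedim(-1, 1)
        return x * gate


class FactorizedSTConv(nn.Module):
    """2D spatial conv + 1D temporal conv with two excitation paths."""

    def __init__(self, channels: int):
        super().__init__()
        # (1, k, k) and (k, 1, 1) kernels approximate a full 3D conv cheaply.
        self.spatial = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        # Two excitation paths pooling over different axes -- one assumption
        # about how "multi-excitation" could be realized.
        self.excite_global = ExcitationPath(channels)     # pool over T, H, W
        self.excite_per_frame = ExcitationPath(channels)  # pool over H, W only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.spatial(x)
        x = self.excite_global(x, pool_dims=(2, 3, 4))
        x = self.temporal(x)
        x = self.excite_per_frame(x, pool_dims=(3, 4))
        return x


video = torch.randn(1, 64, 16, 40, 72)  # (B, C, T, H, W)
print(FactorizedSTConv(64)(video).shape)
```

Factorizing the kernel this way keeps parameters and compute close to a 2D model while still letting information flow along the time axis, and the excitation paths add cross-axis context at negligible cost, consistent with the efficiency claim above.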

Text2Video Generation (576×320)

Art Nouveau painting of a female botanist surrounded by exotic plants in a greenhouse.

A small ship sailing in a stormy sea, with dramatic lighting and powerful waves.

Aerial photography of a winding river through autumn forests, with vibrant red and orange foliage.

The octopus marches slowly under the water.

Mountain river.

Close up craftsman worker sawing a steel pipe, Technician concept.

Forest in Autumn.

Fishing boat at the little wharf.

Ducks in a Pond.

A man is riding a motorcycle on the road, cyberpunk style.

A clownfish is swimming in coral reefs.

A turtle crawling on the bottom of the sea, beautiful.

The Effectiveness of TCM-VAE

Most existing video latent diffusion models reuse the Variational Autoencoder (VAE) from pretrained image models. However, this image VAE is a significant contributor to video flickering and inconsistency. To illustrate the issue, we present several video reconstruction examples at a resolution of 24×256×256 below, comparing the widely used image VAE from Stable Diffusion with our proposed TCM-VAE.

For the image VAE, we encode the video frames independently and then reconstruct them into a video using the vanilla decoder. For TCM-VAE, the encoding process remains unchanged, but we employ our temporally aware decoder. As the results demonstrate, flickering in the reconstructed videos is significantly reduced, although the outputs are not yet perfectly smooth. Another advantage of TCM-VAE is its compatibility: since the encoder remains frozen, TCM-VAE can be integrated into most existing video diffusion models as a plug-and-play component.
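The overall pattern can be sketched in a few lines. The snippet below is illustrative only, with `image_encoder` standing in for the frozen encoder of a pretrained image VAE and `temporal_decoder` for the temporally aware decoder; neither name matches the released code.

```python
# Hedged sketch of the TCM-VAE pipeline described above: frames are encoded
# independently by a frozen 2D encoder, and only the decoder sees the clip
# as a whole. Function and argument names are illustrative assumptions.
import torch

@torch.no_grad()
def encode_frames(image_encoder, video: torch.Tensor) -> torch.Tensor:
    """video: (B, T, 3, H, W) -> latents: (B, T, C, h, w), frame by frame."""
    b, t = video.shape[:2]
    frames = video.flatten(0, 1)          # fold time into the batch axis
    latents = image_encoder(frames)       # frozen 2D encoder, per frame
    return latents.unflatten(0, (b, t))

def decode_clip(temporal_decoder, latents: torch.Tensor) -> torch.Tensor:
    """Decode all frames jointly so the temporal modules in the decoder can
    share information across frames and suppress flicker."""
    return temporal_decoder(latents)      # -> (B, T, 3, H, W)
```

Because only the decoder changes, latents produced by any existing video diffusion model can be decoded this way, which is what makes TCM-VAE plug-and-play.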

Original Video

TCM-VAE (Ours)

Visual Conditions Generation (576×320)

The text prompts, from the top row to the bottom row, are: (1) Cyberpunk city. (2) Glittering and translucent fishes swimming in a small bowl with stones, like a glass fish. (3) Mountain river in spring. (4) A robot walking in the field. (5) Lotus flowers, best quality. (6) A swan. (7) A swan.

Original Video

Visual Condition

Generated Video


Visual Conditions Generation (256×256)

We use the test examples presented in Make-Your-Video.

The text prompts, from the top row to the bottom row, are: (1) A dam discharging water. (2) A tiger walks in the forest, photorealistic. (3) A camel walking on the snow field, Miyazaki Hayao anime style. The visual conditions used for generation are depth, Canny, and HED.

Original Video

Text2Video-zero

LVDMExt+Adapter

Make-Your-Video

VersVideo-L (Ours)