Creating stable, controllable videos remains a challenging task because it requires both significant variation in temporal dynamics and cross-frame temporal consistency. To tackle this, we enhance the spatial-temporal capability of the model and introduce VersVideo, a versatile video generation model that accepts textual, visual, and stylistic conditions. Current video diffusion models typically extend image diffusion architectures by augmenting 2D operations (such as convolutions and attention) with a 1D temporal operation. While efficient, this approach often limits spatial-temporal performance because it oversimplifies standard 3D operations. To overcome this, we incorporate multi-excitation paths for spatial-temporal convolutions, which are embedded with 2D convolutions, together with multi-expert spatial-temporal attention blocks. These improvements enhance the model's spatial-temporal performance without significantly increasing training and inference costs. We also address the information loss that occurs when a variational autoencoder maps pixel space into latent features and back into pixel frames: we integrate temporal modules into the decoder so that it retains more temporal information. Building on the new denoising UNet and decoder, we develop a unified ControlNet model that supports various conditions, including image, Canny, HED, depth, and style. Examples of generated videos can be found on this page.
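To make the factorized design concrete, the sketch below shows the common "2D spatial + 1D temporal" convolution pattern that the paragraph above describes as the usual starting point for video diffusion models. It is a minimal illustration in PyTorch: all module and variable names are assumptions of ours, not the VersVideo implementation, which augments this baseline with multi-excitation paths and multi-expert spatial-temporal attention.

```python
# Minimal sketch of the factorized "2D spatial + 1D temporal" convolution block.
# Names are illustrative only; this is not the VersVideo code.
import torch
import torch.nn as nn


class FactorizedSTConv(nn.Module):
    """A per-frame 2D convolution followed by a per-location 1D temporal convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape

        # Spatial 2D convolution applied to each frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)

        # Temporal 1D convolution applied at each spatial location.
        x = x.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x  # (batch, channels, frames, height, width)


# Example usage:
# block = FactorizedSTConv(channels=64)
# feats = torch.randn(1, 64, 16, 32, 32)   # (B, C, T, H, W)
# out = block(feats)                        # same shape
```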
Most existing video latent diffusion models reuse the Variational Autoencoder (VAE) from pretrained image models. However, the image VAE is a significant contributor to video flickering and inconsistency. To illustrate this issue, we present several video reconstruction examples at a resolution of 24x256x256 below, comparing the widely used image VAE from Stable Diffusion with our proposed TCM-VAE.
For the image VAE, we encode the video frames independently and reconstruct them back into a video with the vanilla decoder. For TCM-VAE, the encoding process remains unchanged, but decoding uses our temporally aware decoder. As the results show, the flickering in the reconstructed videos is significantly reduced, although they are not yet perfectly smooth. Another advantage of TCM-VAE is its compatibility: since the encoder remains frozen, it can be dropped into most existing video diffusion models as a plug-and-play replacement.
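The sketch below illustrates this plug-and-play idea under our own assumptions: the image-VAE encoder stays frozen and processes frames independently, while decoding interleaves the pretrained per-frame decoder stages with lightweight temporal mixing layers. The `TemporalMixer` design (a zero-initialized residual 1D convolution over the frame axis) is an illustrative choice; the actual TCM-VAE decoder may integrate its temporal modules differently.

```python
# Illustrative sketch only; not the TCM-VAE implementation.
import torch
import torch.nn as nn


class TemporalMixer(nn.Module):
    """Residual 1D convolution over the frame axis, zero-initialized so that
    pretrained per-frame decoding is preserved at the start of training."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, channels, frames, height, width)
        b, c, t, h, w = z.shape
        y = z.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        y = self.conv(y)
        y = y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return z + y


class TemporallyAwareDecoder(nn.Module):
    """Wraps pretrained 2D decoder stages and inserts a temporal mixer after
    each stage (stage_out_channels lists each stage's output channel count)."""

    def __init__(self, spatial_stages: nn.ModuleList, stage_out_channels):
        super().__init__()
        self.spatial_stages = spatial_stages
        self.temporal_mixers = nn.ModuleList(
            [TemporalMixer(c) for c in stage_out_channels]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, latent_channels, frames, height, width)
        for stage, mixer in zip(self.spatial_stages, self.temporal_mixers):
            b, c, t, h, w = z.shape
            frames = z.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
            frames = stage(frames)                      # per-frame spatial decoding
            c2, h2, w2 = frames.shape[1:]
            z = frames.reshape(b, t, c2, h2, w2).permute(0, 2, 1, 3, 4)
            z = mixer(z)                                # cross-frame temporal mixing
        return z
```

Because the encoder is untouched, latents produced by existing video diffusion models can be decoded by such a temporally aware decoder without retraining the generator.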
The text prompts, from the top row to the bottom row, are: (1) Cyberpunk city. (2) Glittering and translucent fishes swimming in a small bowl with stones, like a glass fish. (3) Mountain river in spring. (4) A robot walking in the field. (5) Lotus flowers, best quality. (6) A swan. (7) A swan.
We use the test examples presented in Make-Your-Video.
The text prompts, from the top row to the bottom row, are: (1) A dam discharging water. (2) A tiger walks in the forest, photorealistic. (3) A camel walking on the snow field, Miyazaki Hayao anime style. The visual conditions used for generation are: depth, Canny, and HED.