StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Side-By-Side 3D

View with cross-eye or parallel viewing techniques, or use a VR headset for an immersive 3D experience.

Anaglyph 3D

Use red-cyan 3D glasses to view the stereoscopic depth effect.
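The two viewing formats above can be composed from a left and a right frame in a few lines. The following is a minimal sketch, not the code used to produce the videos on this page: the function names and the plain-list frame representation are assumptions chosen to keep the example dependency-free. It uses the common red-cyan convention in which the red channel comes from the left view and the green and blue channels come from the right view.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0-255
Frame = List[List[Pixel]]      # a frame is a list of pixel rows


def side_by_side(left: Frame, right: Frame) -> Frame:
    """Place the left view next to the right view, row by row."""
    if len(left) != len(right):
        raise ValueError("views must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def red_cyan_anaglyph(left: Frame, right: Frame) -> Frame:
    """Take red from the left view and green/blue (cyan) from the right view."""
    same_height = len(left) == len(right)
    if not same_height or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("views must have the same size")
    return [
        [(l[0], r[1], r[2]) for l, r in zip(l_row, r_row)]
        for l_row, r_row in zip(left, right)
    ]
```

With `side_by_side(left, right)` the result suits parallel viewing; for cross-eye viewing the two arguments would be swapped so that the right view appears on the left.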

Abstract

Generating high-quality stereo videos requires consistent depth perception and temporal coherence across frames. Despite advances in image and video synthesis using diffusion models, producing high-quality stereo videos remains challenging because temporal and spatial coherence must be maintained between the left and right views. We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without requiring paired training data. Our key innovations include a noisy restart strategy to initialize stereo-aware latent representations and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. In addition, we propose the use of dissolved depth maps to streamline latent space operations by reducing high-frequency depth information. Our comprehensive evaluations, including quantitative metrics and user studies, demonstrate that StereoCrafter-Zero produces high-quality stereo videos with enhanced depth consistency and temporal smoothness. In terms of epipolar consistency, our method achieves an $11.7\%$ improvement in MEt3R score over the current state-of-the-art. Furthermore, user studies indicate strong perceptual gains over prior methods, with $8.0\%$ higher perceived frame quality and $10.9\%$ higher perceived temporal coherence. Our code will be made publicly available upon acceptance of this manuscript.

Our Pipeline

Pipeline Diagram

An overview of the StereoCrafter-Zero pipeline. Top: Our method contains two main components: (1) Noisy Restart for a robust initial latent estimation and (2) Iterative Refinement for latent refinement. These components act on target view latents (blue) to enforce temporal and inter-view consistency with the source view (orange). Bottom: Given an image and text prompt, our pipeline generates stereo videos with a strong stereoscopic effect.
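This page does not spell out how Noisy Restart is implemented. One plausible reading, offered here only as an assumption, is that an initial estimate of the target-view latent is pushed back to a noisier diffusion level and then denoised again. The sketch below shows just that re-noising step using the standard diffusion forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The function name `renoise_latent`, the flat-list latent, and the `alpha_bar` parameter are hypothetical and do not come from the authors' code.

```python
import math
import random
from typing import List, Optional


def renoise_latent(latent: List[float], alpha_bar: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Forward-diffuse a clean latent: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    alpha_bar is the cumulative signal level of the chosen restart step:
    1.0 keeps the latent unchanged, 0.0 replaces it with pure Gaussian noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    rng = rng or random.Random()
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in latent]
```

Under this reading, a lower `alpha_bar` would give the denoiser more freedom to repair the initial estimate, at the cost of drifting further from it.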

Visual Comparison

Interactive comparison with ProPainter, RoDynRF, ImmersePro, TrajectoryCrafter, StereoCrafter, and StereoDiffusion. Select tabs to compare each method.

Artifacts are easier to see when the videos are zoomed in. Use Ctrl+scroll to zoom in and out for a closer look.

🔻 Comparing View

No temporal consistency

Oversmoothed texture

Whitened overall coloring; less stereoscopic effect.

Significant background noise.


Diverse Capabilities

Multi-Resolution Support

576x1024

320x512

256x256

Generative frame interpolation

Given a starting frame and an ending frame as input, our method generates the stereo content between them.

Input starting frame Input ending frame


Depth Dissolving

We apply depth dissolving to smooth out high-frequency artifacts in the depth map. This technique progressively reduces fine details in the depth representation, removing sharp transitions and discontinuities, which in turn yields smoother and more stable stereo video generation. The parameter t controls the dissolving strength, with higher values producing more aggressive smoothing.
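The exact dissolving operator is not given on this page. As a rough illustration of the idea, the sketch below assumes that dissolving can be approximated by repeated 3x3 box blurring, with t taken as the number of blur passes. That choice, the function names, and the list-of-lists depth map are all assumptions made for the example; the authors' operator may differ.

```python
from typing import List

DepthMap = List[List[float]]


def box_blur_once(depth: DepthMap) -> DepthMap:
    """Replace each value with the mean of its 3x3 neighbourhood.

    At the borders the neighbourhood is truncated to the pixels that exist.
    """
    h, w = len(depth), len(depth[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(y - 1, 0), min(y + 2, h))
            xs = range(max(x - 1, 0), min(x + 2, w))
            vals = [depth[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def dissolve_depth(depth: DepthMap, t: int) -> DepthMap:
    """Apply t blur passes; t = 0 returns an unchanged copy of the input."""
    if t < 0:
        raise ValueError("t must be non-negative")
    out = [row[:] for row in depth]
    for _ in range(t):
        out = box_blur_once(out)
    return out
```

On a depth map with a hard step edge, each additional pass spreads the step over more pixels, which matches the described behaviour of higher t giving more aggressive smoothing.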

BibTeX

@article{shi2024stereocrafter,
  title={Stereocrafter-zero: Zero-shot stereo video generation with noisy restart},
  author={Shi, Jian and Wang, Qian and Li, Zhenyu and Idoughi, Ramzi and Wonka, Peter},
  journal={arXiv preprint arXiv:2411.14295},
  year={2024}
}