Recently, open-source video diffusion models (VDMs) have been scaled to over 10 billion (10B+) parameters. Fine-tuning these large-scale VDMs for the portrait video synthesis task yields significant improvements across multiple dimensions, such as visual quality and natural facial motion dynamics. Despite these advancements, how to achieve step distillation and reduce the substantial computational overhead of large-scale VDMs remains unexplored. To fill this gap, this paper proposes Weak-to-Strong Video Distillation (W2SVD) to mitigate both the insufficient training memory and the training collapse observed in vanilla DMD during training. Specifically, we first leverage LoRA to fine-tune the fake diffusion transformer (DiT) to address the out-of-memory issue. We then employ W2S distribution matching to alleviate the conundrum whereby the video synthesized by the few-step generator deviates from the real data distribution, leading to inaccuracies in the KL divergence approximation. In addition, we minimize the distance between the fake data distribution and the ground-truth distribution to further enhance the visual quality of the synthesized videos. As experimentally demonstrated on HunyuanVideo, W2SVD surpasses standard Euler sampling, LCM, and DMD in FID/FVD and VBench for 1- and 4-step portrait video synthesis. Moreover, our 4-step generator even outperforms 28-step standard sampling on VBench.
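To make the memory argument concrete, below is a minimal sketch of how LoRA adapters could be attached to the fake DiT so that only the low-rank matrices receive gradients and optimizer states. The `peft` calls are standard, but the rank and the target module names (`to_q`, `to_k`, `to_v`, `to_out.0`) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model

def attach_lora_to_fake_dit(fake_dit: torch.nn.Module) -> torch.nn.Module:
    """Wrap the fake DiT with LoRA adapters; only adapter weights train."""
    lora_cfg = LoraConfig(
        r=64,              # low-rank dimension (assumed value)
        lora_alpha=64,
        # Assumed attention-projection names; depends on the DiT implementation.
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
        lora_dropout=0.0,
    )
    fake_dit = get_peft_model(fake_dit, lora_cfg)
    # The frozen 10B+ base weights dominate the parameter count, so optimizer
    # states are only allocated for the small LoRA matrices, avoiding OOM.
    fake_dit.print_trainable_parameters()
    return fake_dit
```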
1. We apply DMD to large-scale HunyuanVideo and address the out-of-memory issue by leveraging LoRA to fine-tune the fake diffusion transformer (DiT).
2. We propose Weak-to-Strong Video Distillation (W2SVD), built on weak-to-strong (W2S) distribution matching, to mitigate the training collapse observed in vanilla DMD (see the sketch following this list).
3. We also introduce a KL divergence approximation between the fake data distribution and the ground-truth distribution, which effectively balances visual quality and motion dynamics.
4. We validate our proposed approach on benchmark datasets as well as our own collected widescreen video dataset, verifying the superiority of W2SVD in terms of FID, FVD, and VBench metrics.
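To illustrate contributions 2 and 3, the following is a minimal PyTorch sketch of one W2S distribution-matching step, assuming the standard DMD gradient construction (the difference between the real and fake models' denoised predictions, backpropagated through an MSE surrogate) and a diffusers-style noise scheduler. The mixing coefficient `alpha`, the linear mixing form, the simplified model signatures, and the weight on the ground-truth term are illustrative assumptions about how the description above could be realized, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def w2s_dmd_step(x_g, x_real, teacher_dit, fake_dit, scheduler, alpha=0.1):
    """One weak-to-strong distribution-matching step (illustrative sketch).

    x_g:    video latents from the few-step generator (requires grad)
    x_real: ground-truth video latents from the same batch
    """
    # W2S mixing: nudge the generated sample toward the real data manifold,
    # so the score difference is evaluated where vanilla DMD's KL
    # approximation is more reliable (mixing form is an assumption).
    x_mix = (1.0 - alpha) * x_g + alpha * x_real

    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x_g.shape[0],), device=x_g.device)
    noise = torch.randn_like(x_mix)
    x_t = scheduler.add_noise(x_mix, noise, t)

    with torch.no_grad():
        mu_real = teacher_dit(x_t, t)   # frozen large-scale VDM (real score)
        mu_fake = fake_dit(x_t, t)      # LoRA-tuned fake DiT (fake score)

    # DMD-style KL gradient: difference of denoised predictions, normalized,
    # injected into the generator via an MSE surrogate loss.
    diff = mu_fake - mu_real
    grad = diff / diff.abs().mean().clamp(min=1e-8)
    dm_loss = 0.5 * F.mse_loss(x_g, (x_g - grad).detach())

    # Simplified per-sample proxy for pulling the fake data distribution
    # toward the ground truth (weight 0.25 is an assumed hyperparameter).
    gt_loss = F.mse_loss(x_g, x_real)
    return dm_loss + 0.25 * gt_loss
```

The surrogate loss is the usual DMD trick: its gradient with respect to `x_g` equals the normalized score difference, so no backpropagation through either diffusion model is needed on the generator side.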
Visualization of W2SVD, vanilla DMD2, LCM, and standard Euler sampling. In both the 1-step and 4-step settings, W2SVD visibly outperforms the other methods in terms of visual quality and facial motion dynamics.