EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

CVPR 2025

1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, 2Shanghai Collaborative Innovation Center of Intelligent Visual Computing,
3Noah's Ark Lab, Huawei

†: Corresponding author.

Abstract

Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention throughout the denoising process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.

Pipeline

Pipeline of EDEN

We design an efficient transformer-based tokenizer to enhance the representation of intermediate frame latents. Specifically, we employ a pyramid feature fusion module to integrate multi-scale features, reducing information loss during compression. Additionally, we introduce temporal attention to incorporate start and end frame information, enabling effective temporal modeling for intermediate frames. This results in strong intermediate frame latent representations.
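The two tokenizer components described above can be illustrated with a minimal PyTorch sketch. This is an assumed simplification, not the paper's implementation: the module names (`PyramidFeatureFusion`, `TokenizerTemporalAttention`), the three-level pyramid, and all dimensions are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class PyramidFeatureFusion(nn.Module):
    """Fuse multi-scale features into one compressed latent grid (hypothetical sketch)."""
    def __init__(self, dim=64):
        super().__init__()
        # Three pyramid levels extracted at full, 1/2, and 1/4 resolution.
        self.levels = nn.ModuleList([
            nn.Conv2d(3, dim, kernel_size=3, stride=2 ** i, padding=1) for i in range(3)
        ])
        self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

    def forward(self, x):
        # Pool every level to the coarsest grid, then fuse, so fine-scale
        # detail still contributes to the compressed latent.
        h, w = x.shape[-2] // 4, x.shape[-1] // 4
        feats = [nn.functional.adaptive_avg_pool2d(lvl(x), (h, w)) for lvl in self.levels]
        return self.fuse(torch.cat(feats, dim=1))

class TokenizerTemporalAttention(nn.Module):
    """Intermediate-frame tokens attend to start/end-frame tokens (hypothetical sketch)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mid, boundary):
        # mid: (B, N, C) intermediate-frame tokens
        # boundary: (B, 2N, C) concatenated start- and end-frame tokens
        out, _ = self.attn(self.norm(mid), boundary, boundary)
        return mid + out  # residual update with boundary-frame context
```

Here the pyramid fusion keeps multi-scale information through compression, while the cross-attention lets the intermediate-frame latent condition on both boundary frames.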

For the diffusion model, we adopt a DiT-based architecture with a dual-stream conditioning mechanism to enhance denoising. Temporal attention allows the intermediate frame to effectively leverage rich pixel information from the start and end frames, while modulation of attention and MLP inputs/outputs enables adaptation to varying motion intensities. This dual-stream conditioning strategy ensures robust motion modeling across different levels of motion complexity.
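A single DiT-style block with the dual-stream conditioning described above can be sketched as follows. This is a hedged illustration under assumptions: the joint attention over noisy and boundary tokens, the AdaLN-style scale/shift/gate modulation, and the conditioning vector `c` (e.g. a timestep plus start-end difference embedding) are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DualStreamDiTBlock(nn.Module):
    """DiT-style block: temporal attention + modulation conditioning (hypothetical sketch)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Stream 2: condition vector -> scale/shift/gate for attention and MLP.
        self.modulation = nn.Linear(dim, 6 * dim)

    def forward(self, x, boundary, c):
        # x: (B, N, C) noisy intermediate-frame tokens
        # boundary: (B, 2N, C) start/end-frame tokens
        # c: (B, C) conditioning vector (e.g. timestep + frame-difference embedding)
        sa, ga, ha, sm, gm, hm = self.modulation(c).chunk(6, dim=-1)

        # Stream 1: temporal attention lets intermediate tokens pull pixel
        # information from the boundary frames.
        h = self.norm1(x) * (1 + ga.unsqueeze(1)) + sa.unsqueeze(1)
        kv = torch.cat([h, boundary], dim=1)
        attn_out, _ = self.attn(h, kv, kv)
        x = x + ha.unsqueeze(1) * attn_out

        # MLP path, also modulated by the condition to adapt to motion intensity.
        h = self.norm2(x) * (1 + gm.unsqueeze(1)) + sm.unsqueeze(1)
        return x + hm.unsqueeze(1) * self.mlp(h)
```

The gating terms let the condition vector scale each sub-layer's contribution, which is one common way a diffusion transformer adapts its updates to varying motion magnitude.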

Results

BibTeX

@inproceedings{zhang2025eden,
  title     = {EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation},
  author    = {Zhang, Zihao and Chen, Haoran and Zhao, Haoyu and Lu, Guansong and Fu, Yanwei and Xu, Hang and Wu, Zuxuan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}