LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

East China Normal University
Teaser videos: a text-to-video example ("The Batman and Spiderman...") and an image-to-video example ("A man in a gray suit stands in a room...").

Abstract

In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability.
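To make the pipeline described above more concrete, below is a minimal, hypothetical PyTorch sketch of how the three modules could fit together at inference time. It is an illustration only: the function bodies, tensor shapes, and weighting scheme are assumptions made for exposition, not the actual implementation inside the pre-trained DiT.

# Hypothetical sketch of the LMP inference flow; module internals are
# illustrative assumptions, not the paper's actual implementation.
import torch

def disentangle_foreground(ref_tokens: torch.Tensor, fg_mask: torch.Tensor):
    """Split reference-video tokens into moving-subject and background tokens
    using a precomputed foreground mask, so the reference background cannot
    interfere with the target generation."""
    fg = ref_tokens * fg_mask.unsqueeze(-1)
    bg = ref_tokens * (1.0 - fg_mask.unsqueeze(-1))
    return fg, bg

def reweighted_motion_transfer(tgt_q, ref_k, ref_v, motion_weight: float = 1.5):
    """Cross-attend target queries to reference keys/values and upweight the
    result so the target video follows the reference motion (assumed scheme)."""
    scale = ref_k.shape[-1] ** 0.5
    attn = torch.softmax(tgt_q @ ref_k.transpose(-1, -2) / scale, dim=-1)
    return motion_weight * (attn @ ref_v)

def suppress_reference_appearance(tgt_tokens, ref_fg_tokens, strength: float = 0.3):
    """Appearance separation: push target tokens away from the mean appearance
    of the reference subject (illustrative only)."""
    ref_mean = ref_fg_tokens.mean(dim=1, keepdim=True)
    return tgt_tokens - strength * ref_mean

if __name__ == "__main__":
    B, N, D = 1, 256, 64                        # dummy batch, token, channel sizes
    ref_tokens = torch.randn(B, N, D)           # tokens from the reference video
    tgt_tokens = torch.randn(B, N, D)           # tokens of the video being generated
    fg_mask = (torch.rand(B, N) > 0.5).float()  # stand-in for a real subject mask

    ref_fg, _ = disentangle_foreground(ref_tokens, fg_mask)
    motion = reweighted_motion_transfer(tgt_tokens, ref_fg, ref_fg)
    out = suppress_reference_appearance(tgt_tokens + motion, ref_fg)
    print(out.shape)  # torch.Size([1, 256, 64])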

Overview


Results

Text to Video Results

The videos visually demonstrate the effectiveness of our method in the text-to-video setting. They contain four sets of examples: in each set, the first row is the reference video, the second row shows the effect of changing the subject prompt while keeping the background prompt unchanged, and the third row shows the effect of modifying the background prompt while keeping the subject unchanged.

A red flamingo...

A monkey...

A white rally motorcycle...

An astronaut...

...on a mountain stream...

...in a bustling city park...

...in an urban setting...

...along a sandy beach...

Image to Video Results

Unlike previous motion-transfer or video-to-video methods, which generate videos whose structure is identical to the reference video, our method imposes minimal manual supervision on the motion. Instead, we leverage the DiT's strong generative capabilities to create videos whose overall structure differs from the reference video while preserving the same subject motion patterns.


Ref video

A rabbit in the forest...autumn...

A cyclist...snow....winter...

Ref video

An astronaut...on the moon...

A Bengal tiger...forest...

Comparison

To verify the effectiveness of our method, we conducted both quantitative and qualitative experiments. We compared our proposed LMP method with the baseline, CogVideoX. Since no other open-source motion-transfer methods based on the DiT architecture are currently available, we also compared against the previous zero-shot state-of-the-art method, DMT (Yatim et al., 2023), which is built on a U-Net architecture. Note that, due to limitations of the underlying U-Net-based model, ZeroScope (Sterling, 2023), DMT can only generate videos with 24 frames at a resolution of 576×320. To ensure a fair comparison, we therefore extracted the first 24 frames of the videos generated by our method and resized them to the same resolution.
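For reference, this clip standardization (keep the first 24 frames, resize to 576×320) can be reproduced with a short script along the following lines; the use of OpenCV, the file names, and the output frame rate are illustrative assumptions.

# Sketch: keep the first 24 frames of a video and resize them to 576x320
# so that clips match DMT's output format for the comparison.
import cv2

def standardize_clip(src_path: str, dst_path: str, n_frames: int = 24,
                     size: tuple = (576, 320), fps: float = 8.0) -> None:
    """Read a video, keep its first `n_frames` frames, resize each frame to
    `size` (width, height), and write the result to `dst_path`."""
    cap = cv2.VideoCapture(src_path)
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    kept = 0
    while kept < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, size))
        kept += 1
    cap.release()
    writer.release()

# Example usage (hypothetical file names):
# standardize_clip("ours_full.mp4", "ours_24f_576x320.mp4")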

Text to Video Results

Ref video

Baseline

DMT

Ours

A black and orange cat...

...on a sandy beach with waves...

Ref video

Baseline

DMT

Ours

A man walks down...

...a serene coastal area...

Image to Video Results


Ref video

Baseline

Ours

A panda with thick fur...

BibTeX

@article{chen2025lmp,
  title={LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer},
  author={Chen, Changgu and Yang, Xiaoyan and Shu, Junwei and Wang, Changbo and Li, Yang},
  journal={arXiv preprint arXiv:2505.14167},
  year={2025}
}