Video Motion Transfer with Diffusion Transformers

Supplementary material

Sections

CogVideoX-5B results
Zero-shot CogVideoX-5B results
Baseline comparison
Ablation results
CogVideoX-2B results
Dataset

CogVideoX-5B results (full videos from Figures 1 and 4, plus additional examples)

Reference video	Generation 1	Generation 2
	A lion walking through a busy market	Motorbiker driving around moonlit sand dunes
	Leopard running up a snowy hill in a forest	Hiker climbing upwards on a mountain peak
	Driving motorcycle through cityscape, first person perspective	Drone footage of a castle corridor interior with tall statues
	A duck with a tophat swimming in a river	A paper boat floating in a river
	Firefighter running towards the camera away from a burning building	Panda charging towards the camera in a bamboo forest, low angle shot
	Man longboarding past a forest, camera following behind	Man snowboarding down a passage in a snowy forest, camera following behind

Note: The input to CogVideoX is 24 frames, but the model generates 21-frame outputs, so we export the reference videos with a matching frame count to keep playback synchronized.
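The 24-versus-21 frame counts in the note above are consistent with a video model whose frame counts take the form 4k + 1 (24 rounds down to 4 × 5 + 1 = 21). The sketch below illustrates that arithmetic only. The temporal factor of 4, the function names, and the choice to keep the leading frames are assumptions made for illustration; this page does not state how the reference videos were actually trimmed.

```python
def aligned_frame_count(num_input_frames: int, temporal_factor: int = 4) -> int:
    """Largest count of the form temporal_factor * k + 1 that fits in the input.

    temporal_factor=4 is an assumption, not a value taken from this page.
    """
    if num_input_frames < 1:
        raise ValueError("need at least one frame")
    return ((num_input_frames - 1) // temporal_factor) * temporal_factor + 1


def trim_reference(frames: list, temporal_factor: int = 4) -> list:
    """Keep only the leading frames so the reference matches the output length.

    Keeping the leading frames is a hypothetical choice; resampling would
    be another way to reach the same frame count.
    """
    return frames[:aligned_frame_count(len(frames), temporal_factor)]


reference = list(range(24))          # stand-in for 24 decoded frames
print(len(trim_reference(reference)))
```

With 24 input frames this yields 21, matching the counts quoted in the note.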

Zero-shot CogVideoX-5B results

Reference video	Optimizing positional embedding	Injecting positional embedding
	Robot walking on a sidewalk in a cyberpunk-style city	Astronaut walking on moon (full prompt: An astronaut in a pristine white spacesuit, adorned with a reflective visor and mission patches, takes deliberate steps across the moon's desolate, cratered surface. The vast, dark expanse of space stretches infinitely above, dotted with distant stars. As the astronaut moves, the Earth hangs majestically in the sky, a vibrant blue and green sphere against the stark lunar landscape. Each step kicks up fine, gray lunar dust, creating a slow-motion effect in the low gravity.)
	A blue Sedan car turning into a driveway	A camel walking in a zoo
	Zoom into a gorilla wearing a lab coat on a field	Zoom into a lion standing on a cliff looking towards us

Baseline comparison

Reference video Dog running between poles in an agility course	DiTFlow	Injection [2]
MotionClone [5]	SMM [3]	MOFT [2]

Reference video Bear running in a garden	DiTFlow	Injection [2]
MotionClone [5]	SMM [3]	MOFT [2]

Reference video Parachuting over a city, aerial view from above	DiTFlow	Injection [2]
MotionClone [5]	SMM [3]	MOFT [2]

Ablation results

CogVideoX-5B ablations are run over all prompt categories on 14 unique videos: blackswan, car-turn, bmx-bumps, bmx-trees, dog, flamingo, mtb-race, chamaleon, dog-gooses, dogs-jump, gold-fish, hoverboard, motorbike, and paragliding. Samples are shown for the following reference video and prompt.

Reference video

Leopard running up a snowy hill in a forest

DiT Block used for guidance

Block 0

Block 10

Block 20

Block 30

Percentage of denoising steps that are guided

20%

40%

60%

80%

100%

Number of optimization steps

1 step

5 steps

10 steps

CogVideoX-2B results

Reference video

Generation 1

A gray mini cooper driving around a roundabout in a town

Generation 2

A lion walking through a bustling roundabout, surrounded by vibrant city life

Dataset

We provide the prompts used (three categories) for each of the 50 DAVIS videos chosen. We sample 24 frames from each video. The frame indices extracted from the original dataset are included in prompts.json in the format [start, end, step].
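As a rough illustration of how a [start, end, step] triple can be expanded into frame indices, here is a minimal sketch. The helper name `frame_indices` and the `end_inclusive` flag are hypothetical, and whether `end` is inclusive or exclusive in prompts.json is an assumption to verify against the file. The layout of prompts.json beyond the triple itself is not described here, so loading the file is left out.

```python
def frame_indices(spec, end_inclusive=False):
    """Expand a [start, end, step] triple into a list of frame indices.

    end_inclusive is a hypothetical switch: the page does not say whether
    `end` itself is part of the sampled range.
    """
    start, end, step = spec
    if step <= 0:
        raise ValueError("step must be positive")
    stop = end + 1 if end_inclusive else end
    return list(range(start, stop, step))


# Made-up triple chosen so that exactly 24 frames are selected.
indices = frame_indices([0, 48, 2])
print(len(indices), indices[0], indices[-1])
```

A quick sanity check is that the expanded list has 24 entries for every video, since 24 frames are sampled from each one; if the count is off by one, the other `end` convention is probably the right one.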

References

[1] William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, 2023.

[2] Zeqi Xiao, Yifan Zhou, Shuai Yang, and Xingang Pan. Video diffusion models are training-free motion interpreter and controller. In NeurIPS, 2024.

[3] Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. In CVPR, 2024.

[4] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent diffusion features for consistent video editing. In ICLR, 2024.

[5] Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, and Yi Jin. MotionClone: Training-free motion cloning for controllable video generation. In ICLR, 2025.