Video Motion Transfer with Diffusion Transformers

Supplementary material

CogVideoX-5B results (full videos from Figures 1 and 4, plus more)

Reference video
Generation 1
A lion walking through a busy market
Generation 2
Motorbiker driving around moonlit sand dunes
Leopard running up a snowy hill in a forest
Hiker climbing upwards on a mountain peak
Driving motorcycle through cityscape, first person perspective
Drone footage of a castle corridor interior with tall statues
A duck with a tophat swimming in a river
A paper boat floating in a river
Firefighter running towards the camera away from a burning building
Panda charging towards the camera in a bamboo forest, low angle shot
Man longboarding past a forest, camera following behind
Man snowboarding down a passage in a snowy forest, camera following behind
Note: The input to CogVideoX is actually 24 frames, but because the model generates 21-frame outputs, we render the reference videos at the same length so that they play in sync with the generations.

Zero-shot CogVideoX-5B results

Reference video
Optimizing positional embedding
Robot walking on a sidewalk in a cyberpunk-style city
Injecting positional embedding
Astronaut walking on moon
An astronaut in a pristine white spacesuit, adorned with a reflective visor and mission patches, takes deliberate steps across the moon's desolate, cratered surface. The vast, dark expanse of space stretches infinitely above, dotted with distant stars. As the astronaut moves, the Earth hangs majestically in the sky, a vibrant blue and green sphere against the stark lunar landscape. Each step kicks up fine, gray lunar dust, creating a slow-motion effect in the low gravity.
A blue Sedan car turning into a driveway
A camel walking in a zoo
Zoom into a gorilla wearing a lab coat on a field
Zoom into a lion standing on a cliff looking towards us

Baseline comparison

Reference video
DiTFlow
Injection [2]
Dog running between poles in an agility course
SMM [3]
MOFT [2]
Reference video
DiTFlow
Injection [2]
Bear running in a garden
SMM [3]
MOFT [2]
Reference video
DiTFlow
Injection [2]
Parachuting over a city, aerial view from above
SMM [3]
MOFT [2]

Ablation results

CogVideoX-5B ablations are run over all prompt categories on 14 unique videos: blackswan, car-turn, bmx-bumps, bmx-trees, dog, flamingo, mtb-race, chamaleon, dog-gooses, dogs-jump, gold-fish, hoverboard, motorbike, paragliding. Samples are shown for the following reference video and prompt.
Reference video
Leopard running up a snowy hill in a forest

DiT Block used for guidance

Block 0
Block 10
Block 20
Block 30

Percentage of denoising steps that are guided

0%
20%
40%
60%
80%
100%
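
To make the percentages above concrete, here is a minimal sketch of how a guided fraction could map onto individual denoising steps. It assumes that guidance is applied to the earliest steps and that the schedule has 50 steps; neither detail is stated on this page, and the function name `guided_step_mask` is hypothetical.

```python
def guided_step_mask(num_steps: int, guided_fraction: float) -> list[bool]:
    """Return one flag per denoising step; True means the step is guided."""
    if not 0.0 <= guided_fraction <= 1.0:
        raise ValueError("guided_fraction must lie in [0, 1]")
    num_guided = round(num_steps * guided_fraction)
    # Assumption: guidance covers the first `num_guided` steps of the schedule.
    return [step < num_guided for step in range(num_steps)]


# Hypothetical schedule of 50 denoising steps with 20% of them guided.
mask = guided_step_mask(50, 0.2)
print(sum(mask), "of", len(mask), "steps are guided")
```

With these assumed settings, the 0% setting corresponds to no guided steps and the 100% setting to guidance at every step.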

Number of optimization steps

1 step
5 steps
10 steps

CogVideoX-2B results

Reference video
Generation 1
A gray mini cooper driving around a roundabout in a town
Generation 2
A lion walking through a bustling roundabout, surrounded by vibrant city life

Dataset

We provide the prompts used (three categories) for each of the 50 DAVIS videos chosen. We sample 24 frames from each video.
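
As an illustration of the frame sampling mentioned above, the following sketch picks 24 evenly spaced frame indices from a clip. Uniform spacing is an assumption (the page does not say how the 24 frames are chosen), and both the helper name `sample_frame_indices` and the 50-frame clip length are hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 24) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    if total_frames < num_samples:
        raise ValueError("clip is shorter than the requested number of frames")
    # Assumption: uniform spacing that includes the first and last frame.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# Hypothetical clip length of 50 frames.
indices = sample_frame_indices(50)
print(indices)
```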

References

[1] William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, 2023.

[2] Zeqi Xiao, Yifan Zhou, Shuai Yang, and Xingang Pan. Video diffusion models are training-free motion interpreter and controller. arXiv preprint arXiv:2405.14864, 2024.

[3] Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, and Tali Dekel. Space-time diffusion features for zero-shot text-driven motion transfer. In CVPR, 2024.

[4] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. In ICLR, 2024.