HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

CVPR 2025
¹GigaAI ²CASIA ³PKU ⁴CUHK
[Teaser figure]

Abstract

Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we build MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, we introduce a novel LAMA loss. Together, these contribute to a 62.4% improvement in FID and gains in R-precision of 41.8%, 26.3%, and 18.3% at top-1, top-2, and top-3, respectively, advancing both Text-to-Pose control accuracy and generation quality. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method yield diverse, high-quality human-motion videos. Furthermore, our model facilitates downstream tasks such as pose sequence prediction and 2D-3D motion lifting.
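To make the decoupling concrete, the sketch below outlines the two-stage interface at the tensor level. The TextToPose and PoseToVideo modules are hypothetical stand-ins for MotionDiT and a Pose-to-Video baseline; shapes and layers are illustrative, not the paper's actual architecture.

# A minimal sketch of the decoupled pipeline, assuming hypothetical
# TextToPose and PoseToVideo modules; interfaces are illustrative only.
import torch
import torch.nn as nn

class TextToPose(nn.Module):
    """Stage 1: map a text embedding to a pose sequence (T, J, 2)."""
    def __init__(self, text_dim=512, num_frames=64, num_joints=17):
        super().__init__()
        self.num_frames, self.num_joints = num_frames, num_joints
        self.decoder = nn.Linear(text_dim, num_frames * num_joints * 2)

    def forward(self, text_emb):
        out = self.decoder(text_emb)
        return out.view(-1, self.num_frames, self.num_joints, 2)

class PoseToVideo(nn.Module):
    """Stage 2: render a pose sequence into video frames (T, 3, H, W)."""
    def __init__(self, num_joints=17, height=64, width=64):
        super().__init__()
        self.height, self.width = height, width
        self.renderer = nn.Linear(num_joints * 2, 3 * height * width)

    def forward(self, poses):
        b, t = poses.shape[:2]
        out = self.renderer(poses.flatten(2))
        return out.view(b, t, 3, self.height, self.width)

text_emb = torch.randn(1, 512)      # e.g. from a frozen text encoder
poses = TextToPose()(text_emb)      # (1, 64, 17, 2) pose sequence
video = PoseToVideo()(poses)        # (1, 64, 3, 64, 64) frames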

Method Overview

Training pipeline of the proposed Text-to-Pose generation. Pose data are encoded into a latent space via the Pose VAE; the latents are then processed by the proposed MotionDiT, where local feature aggregation and global attention capture information from the entire pose sequence. Finally, the LAMA loss is computed via the proposed CLoP, which enhances the training of MotionDiT.
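The block below is a minimal sketch of the MotionDiT structure described above, assuming a depthwise 1D temporal convolution for local feature aggregation and standard multi-head self-attention over the full sequence for global attention. Dimensions and layer choices are illustrative, not the paper's configuration.

# One MotionDiT-style block: local aggregation over neighboring frames,
# then global attention over the entire pose-latent sequence.
import torch
import torch.nn as nn

class MotionBlock(nn.Module):
    def __init__(self, dim=256, heads=8, kernel_size=3):
        super().__init__()
        # Local feature aggregation: depthwise temporal convolution.
        self.local = nn.Conv1d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        # Global attention: every latent attends to the whole sequence.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):               # x: (B, T, dim) pose latents
        x = x + self.local(x.transpose(1, 2)).transpose(1, 2)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

latents = torch.randn(2, 64, 256)    # Pose VAE latents, (B, T, dim)
out = MotionBlock()(latents)         # same shape, (2, 64, 256)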

The pipeline of Pose-to-Video.


Comparison of Text-to-Pose


A man is holding his head in his hands and continues to do so while looking down at them.
A woman is dancing, moving her arms and legs around.

Pose-to-Video Generation


A man is speaking to the camera and then looking off into the distance. Later, he is speaking to the camera again.
A man is performing a yoga pose on a mat, and he is seen moving his legs and arms in different positions.

Comparison of Text-to-Video


HumanDreamer

CogVideoX

Mochi-1

A man is sitting at a table, holding a book. He turns the pages and then holds it up to the camera.

A woman performs a belly dance routine in front of a black curtain, moving her hips and torso fluidly with raised arms.

Pose Sequence Prediction


The Text-to-Pose model can infer and generate missing parts of a sequence by conditioning on existing poses and textual movement descriptions, as sketched after the examples below.
A man is using an ax to chop wood in a forest, swinging the ax and continuing to chop.
A man is lifting a barbell over his head and then dropping it to the ground.
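One common way to realize this conditioning is inpainting-style masking during diffusion sampling, where known pose frames are reimposed on the sample at each denoising step so the model only fills the missing frames. The sketch below illustrates that idea with a hypothetical denoiser and noise schedule; the paper's exact conditioning procedure may differ.

# Inpainting-style pose completion with a hypothetical denoiser.
import torch

def complete_poses(denoiser, known, mask, steps=50):
    """known: (B, T, J, 2) pose frames; mask: (B, T, 1, 1), 1 = observed."""
    x = torch.randn_like(known)                  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((known.shape[0],), i, device=known.device)
        noise_level = (i + 1) / steps
        # Reimpose observed frames (noised to the current level) so the
        # denoiser only has to inpaint the masked-out frames.
        x = mask * (known + noise_level * torch.randn_like(known)) \
            + (1 - mask) * x
        x = denoiser(x, t)                       # one denoising step
    return mask * known + (1 - mask) * x

# Dummy denoiser for illustration; a real one would be MotionDiT.
dummy = lambda x, t: x * 0.98
known = torch.zeros(1, 64, 17, 2)
mask = torch.zeros(1, 64, 1, 1)
mask[:, :16] = 1                                 # first 16 frames observed
filled = complete_poses(dummy, known, mask)      # (1, 64, 17, 2)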

2D-3D Motion Lifting


Demonstration of lifting 2D motion sequences to 3D motion representations using MotionBERT.
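At the interface level, lifting takes per-frame 2D keypoints (with confidences) and predicts 3D joint positions. The sketch below uses a hypothetical stand-in network with MotionBERT-style input/output shapes; the actual MotionBERT model loading and normalization steps are omitted.

# Shape-level sketch of 2D-to-3D lifting; `lifter` is a stand-in network.
import torch
import torch.nn as nn

lifter = nn.Sequential(                 # hypothetical stand-in for MotionBERT
    nn.Linear(17 * 3, 512), nn.ReLU(), nn.Linear(512, 17 * 3)
)

kpts_2d = torch.rand(1, 64, 17, 3)      # (x, y, confidence) per joint
flat = kpts_2d.flatten(2)               # (1, 64, 51)
joints_3d = lifter(flat).view(1, 64, 17, 3)   # lifted 3D motion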

BibTeX

@article{wang2025humandreamer,
  title={HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation},
  author={Boyuan Wang and Xiaofeng Wang and Chaojun Ni and Guosheng Zhao and Zhiqin Yang and Zheng Zhu and Muyang Zhang and Yukun Zhou and Xinze Chen and Guan Huang and Lihong Liu and Xingang Wang},
  journal={arXiv preprint arXiv:2503.24026},
  year={2025}
}