Robust Human Motion Reconstruction via Diffusion

* The work was done during an internship at Meta.
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024
Oral Presentation

Conditioned on noisy and occluded input data, RoHM reconstructs complete, plausible motions in consistent global coordinates for both visible and occluded joints, and predicts whether the feet are in contact with the ground for improved physical plausibility.


We propose RoHM, an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. The former do not recover globally coherent motion and fail under occlusions; the latter are time-consuming, prone to local minima, and require manual tuning. To overcome these shortcomings, we exploit the iterative, denoising nature of diffusion models. RoHM is a novel diffusion-based motion model that, conditioned on noisy and occluded input data, reconstructs complete, plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models, one for global trajectory and one for local motion. To capture the correlations between the two, we then introduce a novel conditioning module, combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively, while being faster at test time.

Motion denoising, infilling and in-betweening

Here we present results on the AMASS dataset for motion denoising, infilling and in-betweening. Synthetic Gaussian noise is added to the input body, with the upper body visible (blue joints) and the lower body masked out (yellow joints) in all frames. The method also predicts accurate foot-ground contact labels (in contact or not).
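To make the input setup concrete, here is a minimal sketch of how such a corrupted input could be built: Gaussian noise on the visible joints and a binary visibility mask that hides the rest. The function name `corrupt_motion`, the noise level `noise_std`, the joint indexing, and the choice to zero-fill masked joints are all illustrative assumptions, not the authors' data pipeline.

```python
import random

def corrupt_motion(joints, visible_joints, noise_std=0.02, seed=0):
    """Build a noisy, partially occluded copy of a clean motion.

    joints: list of frames, each a list of (x, y, z) joint positions.
    visible_joints: set of joint indices that stay visible (assumed layout).
    Returns (noisy, mask), where mask[t][j] is 1 for visible, 0 for masked.
    """
    rng = random.Random(seed)
    noisy, mask = [], []
    for frame in joints:
        noisy_frame, mask_frame = [], []
        for j, position in enumerate(frame):
            if j in visible_joints:
                # Visible joint: perturb each coordinate with Gaussian noise.
                noisy_frame.append(
                    tuple(c + rng.gauss(0.0, noise_std) for c in position)
                )
                mask_frame.append(1)
            else:
                # Masked joint: drop the observation (zero-fill is an assumption).
                noisy_frame.append((0.0, 0.0, 0.0))
                mask_frame.append(0)
        noisy.append(noisy_frame)
        mask.append(mask_frame)
    return noisy, mask

# Toy example: 4 frames, 3 joints, with joint 2 standing in for the lower body.
clean = [[(0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)] for _ in range(4)]
noisy, mask = corrupt_motion(clean, visible_joints={0, 1})
```

The mask is kept alongside the noisy joints so that a downstream model can tell a masked joint apart from a joint that really sits at the origin.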

Besides motion infilling, our method also supports full-body motion in-betweening from noisy input, where a certain percentage of the input frames are masked out (yellow frames).
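As a rough illustration of the in-betweening setup, the sketch below builds a per-frame mask that hides a given fraction of the frames. The helper name `temporal_mask`, the `mask_ratio` parameter, and the choice to pick masked frames uniformly at random are assumptions made for this example; the page does not state how the masked frames are selected.

```python
import random

def temporal_mask(num_frames, mask_ratio, seed=0):
    """Return a per-frame mask: 1 for an observed frame, 0 for a masked one.

    mask_ratio is the fraction of frames to hide. Frames are chosen
    uniformly at random here, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    num_masked = round(num_frames * mask_ratio)
    masked = set(rng.sample(range(num_frames), num_masked))
    return [0 if t in masked else 1 for t in range(num_frames)]

# Hide 30% of a 10-frame clip.
frame_mask = temporal_mask(10, 0.3)
```

A contiguous block of masked frames would be an equally plausible reading of the setup; only the selection line would change.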

Motion Reconstruction from RGB(-D) videos

Our method also reconstructs realistic motions from real-life video input:

RGB-D video from the PROX dataset:

RGB video from the PROX dataset:

RGB video from the EgoBody dataset:



RoHM: Robust Human Motion Reconstruction via Diffusion
Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, Federica Bogo

  title={RoHM: Robust Human Motion Reconstruction via Diffusion},
  author={Zhang, Siwei and Bhatnagar, Bharat Lal and Xu, Yuanlu and Winkler, Alexander and Kadlecek, Petr and Tang, Siyu and Bogo, Federica},


For questions, please contact Siwei Zhang: