Motion Reconstruction via Human Anatomy Diffusion from Sparse Tracking

¹University of Chinese Academy of Sciences, Beijing, China; ²Peng Cheng Laboratory, Shenzhen, China; ³Tsinghua University, Beijing, China; ⁴Shenzhen University, Shenzhen, China

Sequential visualization of full-body pose estimation from sparse tracking inputs using the Human Anatomy Diffusion method. The red coordinate axes represent the head-mounted display (HMD) tracking, while the green and blue coordinate axes correspond to the left and right hand tracking, respectively. Human Anatomy Diffusion is a full-body motion capture technique applicable to VR and AR environments.

Abstract

In the analysis of people, generating precise full-body human motion from sparse tracking remains a significant challenge. Diffusion techniques are known to excel at generating high-quality two-dimensional (2D) visual content. When applied to human motion reconstruction, however, they can struggle to capture the inherent complexity of human motion, which is characterized by three-dimensional (3D) anatomical features and one-dimensional (1D) temporal dynamics. This structural mismatch between human motion and images can lead to accumulated errors at the joints, degrading the accuracy and smoothness of the generated motions. Building on this insight, we propose Human Anatomy Diffusion (HAD), a novel framework that integrates human anatomical features into the denoising process. HAD handles complex motions well, accurately captures body angles and balance, and shows improved alignment in motion prediction. It substantially advances motion reconstruction performance, improving smoothness by 81.29% over the previous state of the art and improving key accuracy metrics such as MPJPE, Root PE, and Lower PE by approximately 20% on AMASS. Our method is an important advance toward realistic and responsive virtual avatars in real-world applications.

Method

We propose Human Anatomy Diffusion (HAD), a novel framework that integrates human anatomical features into the denoising process for motion reconstruction from sparse tracking. HAD consists of two key components: the Human Anatomy Network (HAN) and the Human Anatomy Diffusion process.

Human Anatomy Diffusion

The Human Anatomy Network (HAN) is structured into four modules: Latent Space Mapping (LSM), Iterative Feature Enhancement (IFE), Temporal Feature Pyramid (TFP), and Hierarchical Motion Refinement (HMR).

Human Anatomy Network Pipeline
  • LSM maps the input features into a unified latent space for subsequent processing.
  • IFE takes the coarse features from LSM as input and enhances them for better representation.
  • TFP is applied to integrate multi-scale temporal motion features and output coarse predictions.
  • HMR adopts a hierarchical architecture inspired by the human body structure to refine the obtained predictions for more natural and accurate motion.
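The data flow through the four modules can be sketched as a simple chain. This is a minimal illustration, not the authors' implementation: `run_han`, `toy_tfp`, and `moving_average` are hypothetical names, the stages are plain callables rather than learned networks, and averaging the sequence at several window sizes is only one plausible reading of "multi-scale temporal features".

```python
from typing import Callable, List, Sequence

Sequence2D = List[List[float]]  # frames x channels
Stage = Callable[[Sequence2D], Sequence2D]


def run_han(tracking: Sequence2D, lsm: Stage, ife: Stage,
            tfp: Stage, hmr: Stage) -> Sequence2D:
    """Chain the four modules in the order the text lists them."""
    latent = lsm(tracking)   # map inputs into a shared latent space
    enhanced = ife(latent)   # enhance the coarse latent features
    coarse = tfp(enhanced)   # fuse temporal scales into coarse predictions
    return hmr(coarse)       # refine the coarse predictions


def moving_average(seq: Sequence2D, window: int) -> Sequence2D:
    """Causal moving average over time: one temporal scale."""
    out = []
    for t in range(len(seq)):
        chunk = seq[max(0, t - window + 1):t + 1]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out


def toy_tfp(seq: Sequence2D, windows: Sequence[int] = (1, 2, 4)) -> Sequence2D:
    """Hypothetical TFP stand-in: average the sequence over several scales."""
    scales = [moving_average(seq, w) for w in windows]
    return [
        [sum(vals) / len(scales) for vals in zip(*frames)]
        for frames in zip(*scales)
    ]


# Identity stand-ins for the learned modules (assumption, for illustration).
identity: Stage = lambda x: x
frames = [[0.0], [2.0], [4.0], [6.0]]
refined = run_han(frames, identity, identity, toy_tfp, identity)
print(refined)
```

Passing the stages as callables keeps the ordering explicit while leaving each module's internals open, since the page does not specify them.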

The Human Anatomy Diffusion process is divided into two phases: Parallel Motion Regression (PMR) and Alternate Motion Refinement (AMR).

  • PMR runs two specialized predictions in parallel, Smooth Prediction (SP) and Accurate Prediction (AP), so that a smooth motion estimate and an accurate motion estimate are produced simultaneously.
  • AMR further refines the result by applying SP and AP alternately.
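The control flow of the two phases can be sketched as follows. This is a hedged sketch under stated assumptions: the function names, the choice to start the alternation with SP, the number of steps, and the toy predictors are all hypothetical, since the page does not describe how the predictions are scheduled or what AMR takes as input.

```python
from typing import Callable, List, Tuple

Motion = List[float]
Predictor = Callable[[Motion], Motion]


def parallel_motion_regression(x: Motion, smooth_predict: Predictor,
                               accurate_predict: Predictor) -> Tuple[Motion, Motion]:
    """PMR stand-in: run both predictors on the same input."""
    return smooth_predict(x), accurate_predict(x)


def alternate_motion_refinement(x: Motion, smooth_predict: Predictor,
                                accurate_predict: Predictor,
                                num_steps: int) -> Tuple[Motion, List[str]]:
    """AMR stand-in: apply SP on even steps and AP on odd steps (assumed order)."""
    order = []
    for step in range(num_steps):
        if step % 2 == 0:
            x = smooth_predict(x)
            order.append("SP")
        else:
            x = accurate_predict(x)
            order.append("AP")
    return x, order


def toy_sp(m: Motion) -> Motion:
    """Toy smoother: 3-point moving average with clamped edges."""
    n = len(m)
    return [(m[max(i - 1, 0)] + m[i] + m[min(i + 1, n - 1)]) / 3 for i in range(n)]


def toy_ap(m: Motion) -> Motion:
    """Toy placeholder for the accurate predictor: returns a copy unchanged."""
    return list(m)


noisy = [0.0, 3.0, 0.0, 3.0]
sp_out, ap_out = parallel_motion_regression(noisy, toy_sp, toy_ap)
refined, order = alternate_motion_refinement(noisy, toy_sp, toy_ap, num_steps=4)
print(order)
```

In this sketch `order` records which predictor ran at each step, which makes the alternation easy to inspect or swap for a different schedule.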

Results

Our method significantly outperforms state-of-the-art approaches on the AMASS dataset, as shown in the quantitative results below (↓ means lower is better; - marks a value that is not reported):

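| Method        | MPJRE ↓ | MPJPE ↓ | MPJVE ↓ | Hand PE ↓ | Upper PE ↓ | Lower PE ↓ | Root PE ↓ | Jitter ↓ | Upper Jitter ↓ | Lower Jitter ↓ |
|---------------|---------|---------|---------|-----------|------------|------------|-----------|----------|----------------|----------------|
| LoBSTr        | 10.69   | 9.02    | 44.97   | -         | -          | -          | -         | -        | -              | -              |
| CoolMoves     | 5.20    | 7.83    | 100.54  | -         | -          | -          | -         | -        | -              | -              |
| VAE-HMD       | 4.11    | 6.83    | 37.99   | -         | -          | -          | -         | -        | -              | -              |
| AvatarPoser   | 3.08    | 4.18    | 27.70   | 2.12      | 1.81       | 7.59       | 3.34      | 14.49    | 7.36           | 24.81          |
| DAP           | 2.69    | 3.68    | 24.03   | -         | -          | -          | -         | -        | -              | -              |
| AGRoL-MLP     | 2.69    | 3.93    | 22.85   | 2.62      | 1.89       | 6.88       | 3.35      | 13.01    | 9.13           | 18.61          |
| AGRoL         | 2.66    | 3.71    | 18.59   | 1.31      | 1.55       | 6.84       | 3.36      | 7.26     | 5.88           | 9.27           |
| HAN-SP (Ours) | 2.41    | 3.31    | 16.59   | 1.75      | 1.50       | 5.91       | 2.87      | 4.69     | 3.93           | 5.78           |
| HAN-AP (Ours) | 2.40    | 3.18    | 16.42   | 1.15      | 1.37       | 5.79       | 2.90      | 7.35     | 5.66           | 9.79           |
| HAD (Ours)    | 2.29    | 3.03    | 15.45   | 1.15      | 1.30       | 5.52       | 2.71      | 4.61     | 3.86           | 5.70           |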

HAD improves both accuracy and smoothness metrics over previous methods. It reduces MPJRE by 14.29%, MPJPE by 18.87%, and MPJVE by 16.46%. It also improves the motion accuracy of the hands, upper body, and lower body (Hand PE, Upper PE, and Lower PE) by 8.4%, 15.48%, and 19.88%, respectively. The overall jitter is reduced by 81.29% compared to the previous state of the art.
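For readers unfamiliar with these metrics, the sketch below shows how position error, jitter, and a relative reduction are commonly computed. It is a minimal illustration under assumptions: MPJPE is taken as the mean Euclidean distance between predicted and ground-truth joints, jitter as the mean jerk magnitude from third-order finite differences, and the function names and data layout (frames x joints x 3) are hypothetical. The page does not state which baseline row its quoted percentages use, so the AGRoL row in the usage lines is an assumption and the result need not match the quoted figures digit for digit.

```python
import math
from typing import List

Pose = List[List[float]]   # joints x 3
Clip = List[Pose]          # frames x joints x 3


def mpjpe(pred: Clip, gt: Clip) -> float:
    """Mean per-joint position error: average Euclidean joint distance."""
    dists = [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]
    return sum(dists) / len(dists)


def jitter(pos: Clip, fps: float) -> float:
    """Mean jerk magnitude via third-order finite differences over time."""
    mags = []
    for t in range(3, len(pos)):
        for j in range(len(pos[t])):
            jerk = [
                (pos[t][j][k] - 3 * pos[t - 1][j][k]
                 + 3 * pos[t - 2][j][k] - pos[t - 3][j][k]) * fps ** 3
                for k in range(3)
            ]
            mags.append(math.hypot(*jerk))
    return sum(mags) / len(mags)


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Usage with table values; the AGRoL row as baseline is an assumption.
agrol_mpjpe, had_mpjpe = 3.71, 3.03
print(round(relative_reduction(agrol_mpjpe, had_mpjpe), 2))
```

A trajectory moving at constant velocity has zero jerk, so `jitter` returns 0 for it; higher values indicate less smooth motion.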

The qualitative results demonstrate HAD's ability to generate accurate and natural human motions from sparse tracking inputs, even for complex and dynamic actions.

Qualitative Results

BibTeX

@article{niu2024motionreconstruction,
  author  = {Niu, Zehai and Lu, Ke and Dong, Kun and Xue, Jian and Qin, Xiaoyu and Wang, Jinbao},
  title   = {Motion Reconstruction via Human Anatomy Diffusion from Sparse Tracking},
  journal = {ECCV Workshop},
  year    = {2024},
}