KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals

Anonymous

KineST turns sparse tracking signals into full-body pose estimates and outperforms state-of-the-art methods.

Stylized Daily Motions

Athletic and Dance Motions

KineST achieves high estimation accuracy on both stylized daily motions and complex athletic and dance movements.

Abstract

Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions.

However, Head-Mounted Displays (HMDs), the primary devices in AR/VR scenarios, provide only sparse input signals, making it challenging to reconstruct realistic and diverse full-body poses. Previous works attempt to solve this either at high computational cost or by modeling temporal and spatial dependencies separately to capture richer features, but they often struggle to balance accuracy, temporal continuity, and efficiency.

To address this problem, we propose KineST, a novel kinematics-guided state space model that effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The model rests on two core ideas. First, we reformulate the scanning strategy within the State Space Duality framework as a kinematics-guided bidirectional scan, which embeds kinematic priors to better capture intricate joint relations. Second, we design a mixed spatiotemporal modeling mechanism that tightly couples spatial and temporal contexts to balance accuracy and smoothness. Additionally, a geometric angular velocity loss imposes physically meaningful constraints on rotational variations, further improving motion stability.
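To make these two ingredients concrete, the sketch below shows (a) one plausible way to order joint tokens along kinematic chains for a bidirectional scan and (b) a geodesic angular-velocity penalty on consecutive-frame rotations. This is a minimal illustration under assumed conventions (a parent-indexed skeleton, joint rotations as 3x3 matrices); the names chain_scan_order, geodesic_angle, and angular_velocity_loss are our own, not identifiers from KineST's implementation.

```python
import torch


def chain_scan_order(parents: list[int]) -> list[int]:
    """Depth-first order over a kinematic tree (parents[i] = parent of joint i,
    -1 for the root), so every joint appears after its parent in the scan."""
    children = [[] for _ in parents]
    for joint, parent in enumerate(parents):
        if parent >= 0:
            children[parent].append(joint)
    order, stack = [], [parents.index(-1)]
    while stack:
        joint = stack.pop()
        order.append(joint)
        stack.extend(reversed(children[joint]))
    return order  # a bidirectional scan also traverses order[::-1]


def geodesic_angle(rot_a: torch.Tensor, rot_b: torch.Tensor) -> torch.Tensor:
    """Rotation angle (radians) of the relative rotation rot_a^T @ rot_b,
    for batched 3x3 rotation matrices."""
    rel = rot_a.transpose(-1, -2) @ rot_b
    trace = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.acos(cos)


def angular_velocity_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (batch, time, joints, 3, 3) joint rotations.
    Penalises the gap between predicted and ground-truth frame-to-frame
    angular displacement, a discrete proxy for angular velocity."""
    omega_pred = geodesic_angle(pred[:, :-1], pred[:, 1:])
    omega_gt = geodesic_angle(gt[:, :-1], gt[:, 1:])
    return (omega_pred - omega_gt).abs().mean()
```

Measuring the angle of the relative rotation between consecutive frames, rather than differencing raw rotation parameters, keeps the penalty geometrically meaningful on SO(3) and directly targets frame-to-frame jitter, matching the stated goal of stabilizing rotational variations.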

Extensive experiments demonstrate that KineST achieves superior accuracy and temporal consistency within a lightweight framework.

Video

Comparative Experiments