Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions.
However, Head-Mounted Displays (HMDs), the primary devices in AR/VR scenarios, provide only sparse input signals (e.g., head and hand poses), making it challenging to reconstruct realistic and diverse full-body poses. Prior works either address this at high computational cost or model temporal and spatial dependencies separately to capture richer features, but they often struggle to balance accuracy, temporal continuity, and efficiency.
To address this problem, we propose KineST, a novel kinematics-guided state space model that effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The model rests on two core ideas. First, we reformulate the scanning strategy within the State Space Duality framework as a kinematics-guided bidirectional scan, which embeds kinematic priors to better capture intricate joint relations. Second, we design a mixed spatiotemporal modeling mechanism that tightly couples spatial and temporal contexts to balance accuracy and smoothness. Additionally, a geometric angular velocity loss imposes physically meaningful constraints on rotational variations, further improving motion stability.
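For illustration, below is a minimal sketch of one plausible form of a geometric angular velocity loss, assuming per-joint rotation matrices and a geodesic distance on SO(3); the function name, tensor layout, and exact formulation are illustrative assumptions, not the paper's verbatim definition.

```python
import torch

def angular_velocity_loss(pred_rot: torch.Tensor, gt_rot: torch.Tensor) -> torch.Tensor:
    """Hypothetical geodesic angular-velocity loss (sketch, not the paper's definition).

    pred_rot, gt_rot: (B, T, J, 3, 3) per-frame, per-joint rotation matrices.
    Penalizes the geodesic angle between predicted and ground-truth
    frame-to-frame relative rotations, constraining how joints rotate over time.
    """
    # Relative rotation between consecutive frames: R_t^T @ R_{t+1}
    pred_vel = pred_rot[:, :-1].transpose(-1, -2) @ pred_rot[:, 1:]
    gt_vel = gt_rot[:, :-1].transpose(-1, -2) @ gt_rot[:, 1:]

    # Rotation taking the predicted angular step to the ground-truth step
    diff = pred_vel.transpose(-1, -2) @ gt_vel

    # Geodesic angle on SO(3): theta = arccos((tr(R) - 1) / 2)
    trace = diff.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.acos(cos).mean()
```

In training, such a term would presumably be weighted against standard rotation- and position-reconstruction losses, trading a small amount of per-frame accuracy for smoother, more physically plausible motion.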
Extensive experiments demonstrate that KineST achieves superior accuracy and temporal consistency within a lightweight framework.