ORIGINAL RESEARCH article

Front. Neurorobot.
Volume 19 - 2025 | doi: 10.3389/fnbot.2025.1482281
This article is part of the Research Topic Neural Network Models in Autonomous Robotics

Latent Space Improved Masked Reconstruction Model for Human Skeleton-based Action Recognition

Provisionally accepted
Enqing Chen 1*, Xueting Wang 1, Xin Guo 1*, Ying Zhu 2*, Dong Li 2*
  • 1 Zhengzhou University, Zhengzhou, China
  • 2 State Grid Henan Electric Power Company Information and Communication Branch, Zhengzhou, China

The final, formatted version of the article will be published soon.

    Human skeleton-based action recognition is an important task in computer vision. In recent years, the masked autoencoder (MAE) has been applied in many fields because of its strong self-supervised learning ability, and it has achieved good results in masked data reconstruction tasks. However, in visual classification tasks such as action recognition, the encoder in the autoencoder structure has limited ability to learn features, which results in poor classification performance. We propose to enhance the encoder's feature extraction ability for classification by introducing the latent space of the variational autoencoder (VAE), and then by replacing it with the latent space of the vector quantized variational autoencoder (VQVAE). The resulting models are called SkeletonMVAE and SkeletonMVQVAE, respectively. In SkeletonMVAE, we constrain the latent variables to represent features as distributions; in SkeletonMVQVAE, we discretize the latent variables. Both constraints help the encoder learn deeper data structures and more discriminative, more generalizable feature representations. Experimental results on the NTU-60 and NTU-120 datasets demonstrate that the proposed methods effectively improve the encoder's classification accuracy and its generalization ability when few labeled data are available. SkeletonMVAE exhibits stronger classification ability, while SkeletonMVQVAE generalizes better when labeled data are scarce.
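
The two latent-space constraints named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the textbook VAE bottleneck (reparameterized Gaussian sample with a KL term against a standard normal prior) and the textbook VQ-VAE bottleneck (nearest-neighbor lookup in a codebook). The function names, the 4-dimensional latent, and the three-entry codebook are hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_latent(mu, log_var, rng):
    """VAE-style bottleneck: sample z = mu + sigma * eps and return the KL term.

    The KL divergence is taken against a standard normal prior and summed
    over latent dimensions.
    """
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
    return z, kl


def quantize_latent(z, codebook):
    """VQ-VAE-style bottleneck: replace z with its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, list(codebook[index])


rng = random.Random(0)

# Hypothetical encoder outputs for one visible skeleton token (4-dim latent).
mu = [0.2, -0.1, 0.0, 0.5]
log_var = [0.0, 0.0, 0.0, 0.0]
z, kl = gaussian_latent(mu, log_var, rng)

# Hypothetical codebook with three entries; the encoder output is snapped
# to whichever entry is closest in squared Euclidean distance.
codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.2, -0.1, 0.0, 0.5]]
index, z_q = quantize_latent(mu, codebook)
```

In a trained model, the KL term or the codebook and commitment losses would be added to the masked reconstruction loss; how the article weights these terms is not stated in the abstract.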

    Keywords: human skeleton-based action recognition, variational autoencoder, vector quantized variational autoencoder, masked reconstruction model, self-supervised learning

    Received: 18 Aug 2024; Accepted: 27 Jan 2025.

    Copyright: © 2025 Chen, Wang, Guo, Zhu and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence:
    Enqing Chen, Zhengzhou University, Zhengzhou, China
    Xin Guo, Zhengzhou University, Zhengzhou, China
    Ying Zhu, State Grid Henan Electric Power Company Information and Communication Branch, Zhengzhou, China
    Dong Li, State Grid Henan Electric Power Company Information and Communication Branch, Zhengzhou, China

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.