Corrigendum: An initial prediction and fine-tuning model based on improving GCN for 3D human motion prediction
- 1Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen, China
- 2Guangdong Multimedia Information Service Engineering Technology Research Center, Shenzhen University, Shenzhen, China
- 3School of Science, Beijing University of Civil Engineering and Architecture, Beijing, China
Human motion prediction is one of the fundamental problems of computer vision. Much deep-learning-based work has shown impressive performance on it in recent years. However, long-term prediction and human skeletal deformation remain challenging. For accurate prediction, this paper proposes a GCN-based two-stage prediction method. In the first stage, we train a prediction model that extracts features with multiple cascaded spatial attention graph convolution layers (SAGCL) and generates an initial motion sequence of future actions from the observed poses. Since the initial poses generated in the first stage often deviate from natural human motion, for example a motion sequence in which the length of a bone changes, the task of the second stage is to fine-tune the predicted poses and make them closer to natural motion. We present a fine-tuning model consisting of multiple cascaded causally temporal-graph convolution layers (CT-GCL) and train it with the spatial coordinate error of joints and the bone length error as loss functions. We validate our model on the Human3.6m and CMU-MoCap datasets. Extensive experiments show that the two-stage prediction method outperforms state-of-the-art methods. The limitations of the proposed method are also discussed to guide future exploration.
1. Introduction
3D skeleton-based human motion prediction uses an action posture observed in the past to predict an action posture in the future. Motion prediction technology helps robots understand human behavior. This technology is of great value in areas such as intelligent security, autonomous driving (Ge et al., 2019; Djuric et al., 2020; Gao et al., 2020), object tracking, and human-robot collaboration (Liu and Wang, 2017; Oguz et al., 2017; Liu et al., 2019a, 2021; Li et al., 2020b; Liu and Liu, 2020; Ding et al., 2021; Mao et al., 2021).
Recurrent neural networks (RNNs) are usually adopted to solve sequence-to-sequence prediction tasks such as voice recognition and automatic translation (Wang et al., 2017; Tang R. et al., 2018; Iida et al., 2019; Yao et al., 2021). Due to the sequential nature of motion in the time dimension, many works use RNNs to realize human motion prediction (Fragkiadaki et al., 2015; Chiu et al., 2019; Guo and Choi, 2019; Corona et al., 2020). However, RNN-based networks are usually difficult to train and suffer from error accumulation in long-term prediction. A few works adopt convolutional networks (CNNs) to solve the human behavior prediction problem (Butepage et al., 2017; Li et al., 2018; Cui et al., 2021; Shu et al., 2021). They treat human motion sequences as images and use 2D convolution to generate the prediction sequence.
Nevertheless, human motion sequences are not traditional image data, and traditional convolutional neural networks are limited in processing them. In recent years, much work has used graph convolution networks (GCNs) to solve human motion prediction tasks and achieved excellent results (Aksan et al., 2019; Cui et al., 2020; Cui and Sun, 2021; Dang et al., 2021). A GCN is similar to a CNN except that it performs feature extraction on graphs. A GCN usually defines an adjacency matrix in advance, representing the interconnection relationships between the nodes in the graph, and generates new node information by aggregating the information of related nodes. In addition, by aggregating action information efficiently, recent work (Mao et al., 2019) confirms that the discrete cosine transform (DCT) has great advantages in motion prediction. Taking advantage of GCNs, many works have achieved good performance, but they also expose some shortcomings. For example, many GCN-based methods convert joint information to the frequency domain for prediction and then recover the time-domain information, so the generated joint positions are not smooth in the time domain. In addition, many methods change the bone lengths of the human body, causing skeletal deformation.
Inspired by the concept of two-stage prediction (Shan et al., 2022), we present a two-stage framework, consisting of a prediction stage and a fine-tuning stage, to solve the above problems and achieve precise prediction of human motion sequences. The prediction stage uses the DCT to encode the motion information and an attention mechanism to compute attention scores that strengthen the interaction of the nodes. We then use the IDCT (inverse discrete cosine transform) to decode the aggregated features back to 3D poses, producing the initial prediction of the first stage.
We observe that the initial prediction always deviates somewhat from the ground truth. To address this, we design a fine-tuning model that corrects the initial predictions of the first stage. Since the actors performing the different actions in the datasets are the same, each frame in an action sequence should contain the same body structure information, such as the length of each bone; we therefore add a bone length constraint term to the loss function of the fine-tuning model. Because motion sequences generated in the frequency domain are not coherent in the time domain, the conventional T-GCN, which uses a global adjacency matrix to aggregate sequence information, often makes the predicted actions deviate from reality. In response to this problem, we propose a CMM (causal mask matrix) that improves the T-GCN and fine-tunes the initial prediction, so that each generated future frame is related only to the information before it, which eliminates the effect of inaccurate future information when constructing the current frame.
We used MPJPE as the metric to evaluate our network on Human3.6m and CMU-MoCap and conducted ablation experiments to analyze our key modules. Extensive comparative experiments show that our method achieves more accurate predictions than existing approaches.
In summary, the main contributions of this paper can be concluded as follows:
• We propose a two-stage training method consisting of a prediction stage and a fine-tuning stage. The fine-tuning stage corrects the human motion sequences generated by the prediction stage.
• To further exploit the interactive information in the temporal structure of human motion, we present a CMM that improves the T-GCN in the fine-tuning stage, reconstructing the sequence in a causal temporal order.
• To improve the power of the GCN to extract the spatial interaction information of the human body, we introduce a SAB (spatial attention block) that aggregates node information along the spatial dimension. Moreover, we incorporate the constraint that the lengths of human bones are invariant, guiding the framework to generate more realistic human motion sequences.
2. Related work
2.1. RNN-based method
RNN-based methods are widely used for sequence-to-sequence tasks (Jain et al., 2016b; Martinez et al., 2017; Tang Y. et al., 2018; Liu et al., 2019b; Sang et al., 2020). Given the sequential characteristics of human motion, many works use RNNs as the basic structure of their networks. By embedding encoder and decoder networks before and after the recurrent layers, Fragkiadaki et al. (2015) propose an Encoder-Recurrent-Decoder (ERD) model for predicting human motion. Jain et al. (2016a) combine RNNs with the spatiotemporal structure of the human body, proposing the Structural-RNN. Liu et al. (2019b) develop a hierarchical recurrent network structure to simultaneously encode the local context of a single frame and the global context of a sequence.
However, an RNN combines the hidden state of the previous unit to output the prediction of the next unit, which causes errors to accumulate. These methods cannot avoid the error accumulation problem; error accumulation causes discontinuities between generated frames, resulting in unrealistic human motion sequences. Gui et al. (2018) propose a novel sequence-to-sequence model that adds a residual connection between the input and output of each RNN module, which alleviates the discontinuity problem of RNN models. Guo and Choi (2019) modify the seq2seq framework to encode temporal correlations at different time scales. Shu et al. (2021) design a new bone-joint attention mechanism to dynamically learn bone-joint attention feature maps, making the generated action sequences closer to reality. Although these methods effectively improve prediction accuracy, their performance in long-term prediction is still insufficient.
2.2. GCN-based method
Compared with traditional CNN-based methods, GCN-based methods have significant advantages on irregular data structures such as social networks (Tabassum et al., 2018; Li et al., 2020a) and human body posture and behavior (Fan et al., 2019; Chen et al., 2020). In recent years, graph neural networks have been widely used for 3D human motion prediction and have achieved outstanding results. Lebailly et al. (2020) use a GCN as the encoder and another GCN to decode the aggregated features. The works of Mao et al. (2019), Cui et al. (2020), and Dang et al. (2021) build their models entirely from GCNs. Li et al. (2020b) encode the human body at multiple scales and fuse the information, proposing the DMGNN. Ma et al. (2022) use a spatiotemporal GCN to obtain more accurate long-term predictions by predicting intermediate values of human motion. Mao et al. (2019) use the discrete cosine transform to encode human motion sequences and design a GCN-based model that automatically learns node relationships.
Although these GCN-based methods achieve good results, they still have not solved the following problems: methods that predict in the frequency domain often cannot capture the time dependence of the original information, and the future human motion sequences they generate do not follow bone constraints, i.e., the bone lengths of the generated skeleton change. To address the time-dependence issue of frequency-domain prediction methods, we propose a two-stage network architecture: the first stage makes an initial prediction in the frequency domain, and the second stage fine-tunes the initial result in the time domain. In the second stage, we use the CMM to make the temporal adjacency matrix causal. To make the prediction results follow bone constraints, we propose the SAB to enhance the ability to capture the spatial interaction relationships of the joints and add the length of every bone as a constraint to train the model. We introduce the details of our framework architecture in the following section.
3. Problem formulation
Suppose that X−Tp:0 = [X−Tp, …, X0] denotes the historical human motion sequence of length Tp + 1 and X1:Tf = [X1, …, XTf] denotes the future sequence of length Tf, where Xi ∈ ℝ^(N×D), with N joints and D = 3 feature dimensions, depicts the 3D human pose at time i. The task of 3D human motion prediction is to generate the future sequence X1:Tf given the historical one.
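Stated compactly (a restatement of the formulation above in LaTeX, with no new assumptions beyond the notation already defined):

```latex
% Pose at time i: X_i in R^{N x D}, with N joints and D = 3 coordinates.
% The predictor f maps the observed sequence to the future sequence.
\[
  \hat{X}_{1:T_f} = f\!\left(X_{-T_p:0}\right),
  \qquad
  X_{-T_p:0} = \left[X_{-T_p}, \dots, X_0\right],
  \qquad
  \hat{X}_{1:T_f} = \left[\hat{X}_1, \dots, \hat{X}_{T_f}\right].
\]
```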
To predict complex human motion more accurately, we use a two-stage prediction method based on GCNs. We use cascaded SAGCLs to produce the first-stage prediction. Then, the initial prediction is fine-tuned using the spatio-temporal constraints of the human body to obtain a second-stage prediction that is closer to real human movement.
4. Methodology
4.1. Prediction and fine-tuning framework
In order to predict future motion sequences precisely, we adopt a two-stage training method, as shown in Figure 1. Given the human motion of T frames observed in the past, X− = [X−T, X−T+1, …, X−1], we first apply the DCT along the time dimension to convert the temporal dynamics of the motion into the frequency domain.
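As an illustration of this step (a minimal sketch, not the authors' exact implementation; the padding and coefficient truncation used by DCT-based predictors are omitted), the DCT along the time dimension can be computed with an orthonormal DCT basis matrix, and the IDCT used later by the decoder is its transpose:

```python
import numpy as np

def dct_basis(t: int) -> np.ndarray:
    """Orthonormal DCT-II basis of size t x t (rows are basis vectors)."""
    k, n = np.meshgrid(np.arange(t), np.arange(t), indexing="ij")
    basis = np.sqrt(2.0 / t) * np.cos(np.pi * (2 * n + 1) * k / (2 * t))
    basis[0] /= np.sqrt(2.0)  # rescale the first row so that basis @ basis.T = I
    return basis

# x: observed motion, shape (T, N, 3) -- T frames, N joints, 3D coordinates.
T, N = 10, 22
x = np.random.randn(T, N, 3)

B = dct_basis(T)                                   # (T, T)
coeffs = np.einsum("ft,tnd->fnd", B, x)            # DCT coefficients along time
x_rec = np.einsum("tf,fnd->tnd", B.T, coeffs)      # IDCT recovers the sequence
assert np.allclose(x, x_rec, atol=1e-6)
```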
Figure 1. Overview of our prediction and fine-tuning human motion prediction framework, containing a prediction model and a fine-tuning model. SAGCL denotes the spatial attention graph convolution layer, and a SAGCB consists of cascaded SAGCLs. The black skeleton represents the ground truth, the red skeleton represents the prediction of the first stage, and the green skeleton represents the result of fine-tuning.
The encoder in the prediction model aggregates the frequency-domain node information with a SAGCL and a SAGCB (Spatial Attention Graph Convolution Block), and a decoder with the same structure then generates the predicted frequency-domain information. The prediction model then uses the IDCT to restore the predicted joint information to the time domain. As shown in Figure 1, the red skeleton in the middle denotes the initial prediction of the first stage. We use joint position errors as constraints to train the prediction model. The skeleton is marked in red to indicate the problems of temporal discontinuity and skeletal deformation.
We reduce the impact of these problems in the second stage. The CT-GCL predicts using only the past sequence, and we use both the joint position error and the bone length constraint to train the fine-tuning model. The second stage corrects the bone lengths and keeps the temporal dependence, which makes the prediction closer to natural human motion.
4.2. Prediction model (frequency domain)
Based on the S-GCN (spatial graph convolution layer) and T-GCN (temporal graph convolution layer), we build an encoder-decoder human motion prediction model. Both the encoder and the decoder contain a SAGCL and a SAGCB, and each SAGCB includes six SAGCLs. The structure of the SAGCL is shown in Figure 2. When motion information flows through a SAGCL, the SAB first extracts the interaction information between human joints. In the SAB, we use an average pooling layer to aggregate the interactive information of the human body nodes along the spatial dimension. The SAB computes a gating weight in the interval (0, 1) for each node with the Sigmoid function and finally aggregates the information between dependent nodes according to these dynamic joint weights. The weight matrix of the SAB is calculated by the following formula:
SAB(H) = Sigmoid(W · AvgPool(H) + b)

where H represents the hidden feature, W and b are the parameter matrix and bias vector of the FC layer, respectively, AvgPool denotes average pooling along the temporal dimension, and the Sigmoid function produces the joint-wise gating weights in the interval (0, 1).
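A minimal PyTorch sketch of this gating (module and variable names, tensor shapes, and the single-output FC layer are our assumptions about one plausible realization): average-pool the hidden feature over time, apply an FC layer and the Sigmoid, and rescale each joint's features by its weight.

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """Joint-wise gating: AvgPool over time -> FC -> Sigmoid -> rescale."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)       # W, b of the FC layer
        self.sigmoid = nn.Sigmoid()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: hidden feature of shape (batch, T, N, F)
        pooled = h.mean(dim=1)                 # average pooling along time -> (batch, N, F)
        gate = self.sigmoid(self.fc(pooled))   # joint-wise weights in (0, 1) -> (batch, N, 1)
        return h * gate.unsqueeze(1)           # rescale every frame by its joint weight

h = torch.randn(32, 10, 22, 64)
out = SpatialAttentionBlock(64)(h)             # same shape as h
```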
Figure 2. The structure of the SAGCL. SAB denotes the spatial attention block. S-GCN and T-GCN indicate the spatial and temporal GCNs, respectively. BN indicates the batch normalization operation.
Then the S-GCN aggregates interaction information along the spatial dimension. Let X ∈ ℝ^(L×M×F) be a pose sequence, where L is the length of the sequence, M is the number of joints of a pose, and F is the number of features of a joint. Defining a learnable adjacency matrix whose elements measure the relationships between pairs of joints of a pose, the S-GCN works as:
where l indexes the layer and σ represents the Leaky ReLU activation. The T-GCN aggregates interaction information along the temporal dimension. Defining a learnable adjacency matrix that measures the weights between pairs of frames of a trajectory, the T-GCN computes:
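The two graph convolutions can be sketched as follows (a simplified reading of the description above, with our own variable names and activation slope; the exact layer form may differ): the S-GCN mixes the joint dimension with a learnable M × M adjacency matrix, and the T-GCN mixes the time dimension with a learnable L × L adjacency matrix, each followed by a feature transform and Leaky ReLU.

```python
import torch
import torch.nn as nn

class SGCN(nn.Module):
    """Spatial GCN: aggregate over joints with a learnable M x M adjacency."""
    def __init__(self, num_joints: int, in_feat: int, out_feat: int):
        super().__init__()
        self.A = nn.Parameter(torch.eye(num_joints))  # learnable joint adjacency
        self.W = nn.Linear(in_feat, out_feat)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):                              # x: (batch, L, M, F)
        x = torch.einsum("mn,blnf->blmf", self.A, x)   # mix joints
        return self.act(self.W(x))

class TGCN(nn.Module):
    """Temporal GCN: aggregate over frames with a learnable L x L adjacency."""
    def __init__(self, seq_len: int, in_feat: int, out_feat: int):
        super().__init__()
        self.A = nn.Parameter(torch.eye(seq_len))      # learnable temporal adjacency
        self.W = nn.Linear(in_feat, out_feat)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):                              # x: (batch, L, M, F)
        x = torch.einsum("st,btnf->bsnf", self.A, x)   # mix frames
        return self.act(self.W(x))

x = torch.randn(32, 35, 22, 64)
y = TGCN(35, 64, 64)(SGCN(22, 64, 64)(x))              # (32, 35, 22, 64)
```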
4.3. Fine-tuning model (time domain)
Previous work based on graph convolution often attends to the global historical information through a temporal adjacency matrix when generating future action sequences. However, results predicted in the frequency domain are not always smooth in the time domain, and using unsmoothed global historical motion information tends to corrupt the future sequence generated by the network. Therefore, we build a fine-tuning model based on cascaded CT-GCLs to reconstruct the whole future sequence in the time domain. The input of the fine-tuning model is the complete output of the prediction model, and its output is a new sequence adjusted by the constraints. As shown in Figure 3, each CT-GCL is mainly composed of the S-GCN, the CMM, and the T-GCN, creating a new mapping from the temporally independent motion sequence to a temporally causal sequence.
Figure 3. The structure of the fine-tuning model. CMM denotes the causal mask matrix that improves the T-GCN. The hyperparameter m denotes the number of cascaded "CT-GCL" layers.
The CMM adjusts the node positions predicted in the first stage along the temporal dimension so that the node position at each moment is related only to previous times. As shown in Figure 4, the CMM is initialized as an upper triangular matrix and takes the Hadamard product with the adjacency matrix in the temporal dimension, so only information from the current moment and before is aggregated when the CT-GCL reconstructs the future motion sequence.
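A minimal sketch of how such a causal mask could be applied (the triangular orientation depends on how the temporal adjacency is indexed; the paper describes an upper-triangular initialization, while the sketch below uses a lower-triangular mask under the convention that rows index the output frame):

```python
import torch
import torch.nn as nn

class CausalTGCN(nn.Module):
    """T-GCN whose temporal adjacency is masked so each frame only sees the past."""
    def __init__(self, seq_len: int, in_feat: int, out_feat: int):
        super().__init__()
        self.A = nn.Parameter(torch.eye(seq_len))            # learnable temporal adjacency
        # Causal mask matrix (CMM): with rows = output frame and columns = input frame,
        # a lower-triangular mask keeps only current-and-earlier frames.
        self.register_buffer("cmm", torch.tril(torch.ones(seq_len, seq_len)))
        self.W = nn.Linear(in_feat, out_feat)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):                                     # x: (batch, L, N, F)
        A_causal = self.A * self.cmm                          # Hadamard product with the mask
        x = torch.einsum("st,btnf->bsnf", A_causal, x)        # frame s uses frames t <= s only
        return self.act(self.W(x))

out = CausalTGCN(25, 64, 64)(torch.randn(8, 25, 22, 64))      # (8, 25, 22, 64)
```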
4.4. Loss function
For training the prediction network, we adopt the L2 loss on the 3D joint positions. Suppose that the prediction sample is X̂ and the corresponding ground truth is X. For T training samples and K nodes, the loss function is:
where p_{t,k} denotes the ground truth position of the k-th joint in frame t and p̂_{t,k} denotes the predicted one. We adopt the L2 loss on the 3D joint positions together with the length of each bone of the human body to train the fine-tuning network. The loss function is:
where l_m denotes the length of the m-th bone of the human body and l̂_{t,m} denotes the length of the m-th bone in frame t; p_{t,k} and p̂_{t,k} are the same as in formula (4). By correcting the length of the bones in each frame of the predicted sequence, the fine-tuning network makes the reconstructed sequence closer to the ground truth.
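For concreteness, the following sketch shows one way the two objectives could be implemented in PyTorch (the toy bone list, variable names, the choice of norm, and the weighting factor lam between the two terms are our assumptions, not the authors' released code):

```python
import torch

def joint_position_loss(pred, gt):
    """Mean L2 error over frames and joints; pred, gt: (batch, T, K, 3)."""
    return torch.norm(pred - gt, dim=-1).mean()

def bone_length_loss(pred, gt, bones):
    """Mean bone-length error; bones is a list of (parent, child) joint indices."""
    p, c = map(list, zip(*bones))
    pred_len = torch.norm(pred[..., p, :] - pred[..., c, :], dim=-1)   # (batch, T, M)
    gt_len = torch.norm(gt[..., p, :] - gt[..., c, :], dim=-1)
    return (pred_len - gt_len).abs().mean()

def fine_tuning_loss(pred, gt, bones, lam=1.0):
    """Joint position term plus bone length term; lam is an assumed weighting factor."""
    return joint_position_loss(pred, gt) + lam * bone_length_loss(pred, gt, bones)

bones = [(0, 1), (1, 2), (2, 3)]                                       # toy skeleton for illustration
loss = fine_tuning_loss(torch.randn(4, 25, 22, 3), torch.randn(4, 25, 22, 3), bones)
```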
5. Experiments
We used the Human3.6m (Ionescu et al., 2013) and CMU-MoCap datasets to validate our framework. The joint data of both datasets are represented by exponential maps; in this work, we convert them to a 3D coordinate representation. Furthermore, we report quantitative results for both short-term and long-term human motion prediction of joint positions using the Mean Per Joint Position Error (MPJPE).
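Converting from the exponential-map (axis-angle) representation to 3D coordinates first turns each joint's axis-angle vector into a rotation matrix, after which dataset-specific forward kinematics (omitted here) produces the joint positions; a minimal sketch of the first step using Rodrigues' formula:

```python
import numpy as np

def expmap_to_rotmat(r: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: axis-angle vector r of shape (3,) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)                            # near-zero rotation
    k = r / theta                                   # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])                # cross-product (skew-symmetric) matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

R = expmap_to_rotmat(np.array([0.0, 0.0, np.pi / 2]))   # 90-degree rotation about z
```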
5.1. Datasets
5.1.1. CMU-MoCap
CMU-MoCap has five main categories of motion data. In line with previous work (Mao et al., 2019; Dang et al., 2021), we select eight actions to validate our framework: "basketball", "basketball signal", "directing traffic", "jumping", "running", "soccer", "walking", and "washing window". Each motion sequence contains 38 joints (including repeated joints), of which we preserve 25 informative joints. The division of the training and test sets also remains the same as in Mao et al. (2019) and Dang et al. (2021).
5.1.2. Human3.6m
Human3.6m contains 15 classes of motion performed by 7 actors. Each motion contains 32 joints, of which we preserve 22. To be consistent with Li et al. (2020b), we train the model on 6 subjects and test it on the 5th subject. To be consistent with previous work (Dang et al., 2021), we use S1, S6, S7, S8, and S9 for training and S5 and S11 for testing and validation, respectively.
5.2. Metrics
In this paper, we train and test on the 3D coordinate representation of the human pose and report the measurement results in 3D coordinates. Let the prediction sample be X̂ and the corresponding ground truth be X. We use the mean per-joint position error (MPJPE) as the evaluation metric for the 3D error:
where p_{t,n} represents the n-th ground truth joint position in the t-th frame and p̂_{t,n} denotes the predicted one.
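A minimal sketch of how MPJPE can be computed from predicted and ground truth 3D joint positions (tensor shapes and the frame rate noted in the comment are assumptions):

```python
import torch

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean per-joint position error; pred, gt: (batch, T, N, 3).
    Returned in the same units as the input coordinates (millimetres for Human3.6m)."""
    return torch.norm(pred - gt, dim=-1).mean()

pred, gt = torch.randn(8, 25, 22, 3), torch.randn(8, 25, 22, 3)
err_all = mpjpe(pred, gt)                                   # averaged over all future frames
err_400ms = torch.norm(pred[:, 9] - gt[:, 9], dim=-1).mean()  # 10th frame (~400 ms, assuming 25 fps)
```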
5.3. Model configuration
Various cross-validation methods, such as k-fold cross-validation and the jack-knife test, have generally been used to train models (Arif et al., 2021; Ge et al., 2021, 2022a,b; Sikander et al., 2022). We trained our proposed model using 10-fold cross-validation. Our network predicts the human pose of the 25 future frames by observing the joint positions of the past 10 frames. Each SAGCB in the prediction model contains 6 SAGCLs, and after testing we cascade 6 CT-GCLs in the fine-tuning model. We use Adam as the optimizer. The learning rate is initialized to 0.005 and decayed by a factor of 0.96 every epoch. Both the prediction model and the fine-tuning model are trained for 50 epochs with a batch size of 32. We implemented our network on a GeForce RTX 2080 Ti GPU using PyTorch.
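These optimization settings translate roughly into the following PyTorch setup (a sketch under the stated hyperparameters; the model, data loader, and loss function are passed in as placeholders rather than reproduced from the authors' code):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR

def train(model: torch.nn.Module, train_loader, loss_fn, epochs: int = 50):
    """Training loop matching the settings above: Adam, lr 0.005, 0.96 decay per epoch."""
    optimizer = Adam(model.parameters(), lr=0.005)
    scheduler = ExponentialLR(optimizer, gamma=0.96)   # multiply the lr by 0.96 every epoch
    for _ in range(epochs):
        for obs, target in train_loader:               # batches of size 32 in the paper
            loss = loss_fn(model(obs), target)         # stage-specific loss (see Section 4.4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```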
5.4. Comparison to state-of-the-art methods
We validated our model on Human3.6m and CMU-MoCap datasets and present detailed results. We compared our method with DMGNN (Li et al., 2020b), LTD (Mao et al., 2019), MSR (Dang et al., 2021), and ST-DGCN (Ma et al., 2022). DMGNN uses GCN to extract features of multiple scales of the human body and uses graph-based GRU for decoding. Applying DCT, LTD uses GCN for prediction in the frequency domain. MSR is improved on the basis of LTD, taking into account multi-scale factors of the human body.
5.4.1. Human3.6m
As seen in Table 1, we compare the methods mentioned above for short-term prediction (within 400 ms) on Human3.6m. The results show that our method outperforms previous methods in short-term prediction. For example, our method has a significant advantage on the actions "greeting", "posing", and "sit down". Our method greatly reduces the short-term prediction error for many actions such as "walkingdog", "greeting", and "discussion", performing more than 10% better than the best competing method at 80 ms. Table 2 shows the comparisons for long-term prediction (between 400 and 1,000 ms). In most cases, our results are better than those of the compared methods. For example, the performance of our method on the motions "discussion" and "posing" is about 3% better than the best competing method at 1,000 ms, and on "walkingtogether" it is more than 5% better, which shows that our method also performs well at the longest prediction horizon. In terms of average errors for short-term and long-term prediction, our method outperforms the compared methods by a large margin.
Table 1. Comparisons of short-term prediction on Human3.6m. Results at 80, 160, 320, 400 ms in the future are shown.
5.4.2. CMU-MoCap
Table 3 shows the comparison of average errors on CMU-MoCap. Our method significantly outperforms the compared methods in both short-term and long-term prediction. The error of our method is nearly 10% lower than that of ST-DGCN for 1,000 ms prediction.
Table 3. Comparisons of average prediction errors on CMU-MoCap in both short-term and long-term prediction.
5.5. Ablation study
To further analyze our model, we perform the following ablation studies on Human3.6m, with comparative experiments examining the impact of each module of our framework.
As shown in Table 4, we test the performance of the prediction model alone against the ground truth to evaluate the effect of the fine-tuning model. With only the prediction model, both long-term and short-term predictions degrade: the average prediction error rises from 53.71 to 55.08. This shows that the fine-tuning module adjusts the initial prediction through temporal dependence and the bone constraint, making the predicted motion sequence closer to the ground truth. We also test the impact of several key modules of the framework, namely the SAB, the CMM, and the bone length loss function. Ablating these modules raises the average error to 54.31, 54.13, and 54.53, respectively.
Our results show that the bone length loss function has the most significant impact on the model, which confirms that GCN-based methods tend to deform the predicted human body. Our method uses the SAB and bone constraints to strengthen the extraction of skeletal information by the GCN layers, making our results better than those of current methods. The CMM also plays a positive role in the fine-tuning module by preventing discontinuous future information from corrupting the aggregation of temporal information in the T-GCN.
As shown in Table 5, keeping the output of the prediction model fixed, we ablate the number of cascaded layers m from 4 to 7 in the fine-tuning model. The results show that the fine-tuning model performs best when m = 6.
Table 5. Comparison of results for different numbers m of cascaded "CT-GCL" layers, as mentioned in Figure 3.
Table 6 shows the comparison of different numbers of SAGCLs in a SAGCB. We regard the prediction model and the fine-tuning model as two independent modules and consider only the results of the prediction model in this experiment. The results show that five SAGCLs achieve the most accurate results at 80 ms; however, six SAGCLs have the advantage for long-term errors. Considering the average error, we use six SAGCLs to obtain the most accurate initial prediction.
6. Discussion
In the previous section, we compared our method with the state of the art. Using the SAB and the bone length constraint, our method has a strong ability to capture spatial interaction relationships, so it has a significant advantage on motions with large movements, such as "walkingdog" and "sit down" in Tables 1, 2. As shown in Table 3, by using the fine-tuning model to adjust the initial prediction in the time domain, our model captures temporal dependence well, so its performance exceeds the latest method by more than 10% at 1,000 ms. As shown in Table 4, the CMM also enhances this performance.
On the other hand, our model also has some shortcomings. It is based on a two-stage training method: we need to pre-train a prediction model and then train a fine-tuning model, which increases training time and complexity. Moreover, the fine-tuning model is largely limited by the prediction model, which means its capacity for correction is limited. We still do not perform well enough in long-term prediction; as shown in Table 2, there is considerable room for improvement.
7. Conclusion
We propose a two-stage forecasting framework that includes a prediction model and a fine-tuning model. In the prediction model, we first transform the observed pose data into the frequency domain using the DCT. Before the transformed pose data flow through the GCN, the interaction information between joints is enhanced by the spatial attention mechanism. Then we use the IDCT to restore the generated future poses to the time domain. In the second stage, we add the bone length error to the loss function to train the fine-tuning model better, which makes the corrected pose sequence closer to natural human motion. Furthermore, we use the CMM to improve the T-GCN in the fine-tuning model, making the regenerated motion sequences more coherent on the timeline. Extensive experiments show that the fine-tuning model plays a positive role in improving the results of the prediction model, and our work outperforms previous work on commonly used datasets.
Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: http://mocap.cs.cmu.edu/.
Author contributions
LZ proposed the two stage training method for human motion production and wrote the manuscript. ZH conducted the literature survey and method guidance. HW analyzed the experiment data and revised the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China under grant 61971290 and the Shenzhen Stability Support General Project (Category A) 20200826104014001.
Acknowledgments
We thank all participants who supported our study and the reviewers for constructive suggestions on the manuscript.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Aksan, E., Kaufmann, M., and Hilliges, O. (2019). “Structured prediction helps 3D human motion modelling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (Seoul: IEEE), 7144–7153.
Arif, M., Kabir, M., Ahmed, S., Khan, A., Ge, F., Khelifi, A., et al. (2021). Deepcppred: a deep learning framework for the discrimination of cell-penetrating peptides and their uptake efficiencies. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 2749–2759. doi: 10.1109/TCBB.2021.3102133
Butepage, J., Black, M. J., Kragic, D., and Kjellstrom, H. (2017). “Deep representation learning for human motion prediction and classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, HI: IEEE), 6158–6166.
Chen, G., Song, X., Zeng, H., and Jiang, S. (2020). Scene recognition with prototype-agnostic scene layout. IEEE Trans. Image Process. 29, 5877–5888. doi: 10.1109/TIP.2020.2986599
Chiu, H.-K., Adeli, E., Wang, B., Huang, D.-A., and Niebles, J. C. (2019). “Action-agnostic human pose forecasting,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (Waikoloa, HI: IEEE), 1423–1432.
Corona, E., Pumarola, A., Alenya, G., and Moreno-Noguer, F. (2020). “Context-aware human motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Seattle, WA: IEEE), 6992–7001.
Cui, Q., and Sun, H. (2021). “Towards accurate 3D human motion prediction from incomplete observations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4801–4810.
Cui, Q., Sun, H., Kong, Y., Zhang, X., and Li, Y. (2021). Efficient human motion prediction using temporal convolutional generative adversarial network. Inform. Sci. 545, 427–447. doi: 10.1016/j.ins.2020.08.123
Cui, Q., Sun, H., and Yang, F. (2020). “Learning dynamic relationships for 3D human motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Honolulu, HI: IEEE), 6519–6527.
Dang, L., Nie, Y., Long, C., Zhang, Q., and Li, G. (2021). “MSR-GCN: multi-scale residual graph convolution networks for human motion prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE), 11467–11476.
Ding, Z., Yang, C., Wang, Z., Yin, X., and Jiang, F. (2021). Online adaptive prediction of human motion intention based on sEMG. Sensors 21, 2882. doi: 10.3390/s21082882
Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou, F.-C., Lin, T.-H., et al. (2020). “Uncertainty-aware short-term motion prediction of traffic actors for autonomous driving,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (Snowmass, CO: IEEE), 2095–2104.
Fan, L., Wang, W., Huang, S., Tang, X., and Zhu, S.-C. (2019). “Understanding human gaze communication by spatio-temporal graph reasoning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (Seoul: IEEE), 5724–5733.
Fragkiadaki, K., Levine, S., Felsen, P., and Malik, J. (2015). “Recurrent network models for human dynamics,” in Proceedings of the IEEE International Conference on Computer Vision (Santiago: IEEE), 4346–4354.
Gao, Z., Guo, L., Guan, W., Liu, A.-A., Ren, T., and Chen, S. (2020). A pairwise attentive adversarial spatiotemporal network for cross-domain few-shot action recognition-r2. IEEE Trans. Image Process. 30, 767–782. doi: 10.1109/TIP.2020.3038372
Ge, F., Hu, J., Zhu, Y.-H., Arif, M., and Yu, D.-J. (2022a). TargetMM: accurate missense mutation prediction by utilizing local and global sequence information with classifier ensemble. Combin. Chem. High Throughput Screen. 25, 38–52. doi: 10.2174/1386207323666201204140438
Ge, F., Zhang, Y., Xu, J., Muhammad, A., Song, J., and Yu, D.-J. (2022b). Prediction of disease-associated nsSNPs by integrating multi-scale resnet models with deep feature fusion. Brief. Bioinform. 23, bbab530. doi: 10.1093/bib/bbab530
Ge, F., Zhu, Y.-H., Xu, J., Muhammad, A., Song, J., and Yu, D.-J. (2021). MutTMpredictor: robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins. Comput. Struct. Biotechnol. J. 19, 6400–6416. doi: 10.1016/j.csbj.2021.11.024
Ge, S., Zhao, S., Gao, X., and Li, J. (2019). “Fewer-shots and lower-resolutions: towards ultrafast face recognition in the wild,” in Proceedings of the 27th ACM International Conference on Multimedia, 229–237.
Gui, L.-Y., Wang, Y.-X., Liang, X., and Moura, J. M. (2018). “Adversarial geometry-aware human motion prediction,” in Proceedings of the European Conference on Computer Vision (ECCV) (Munich), 786–803.
Guo, X., and Choi, J. (2019). “Human motion prediction via learning local structure representations and temporal dependencies,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2580–2587.
Iida, S., Kimura, R., Cui, H., Hung, P.-H., Utsuro, T., and Nagata, M. (2019). “A multi-hop attention for rnn based neural machine translation,” in Proceedings of The 8th Workshop on Patent and Scientific Literature Translation, 24–31.
Ionescu, C., Papava, D., Olaru, V., and Sminchisescu, C. (2013). Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1325–1339. doi: 10.1109/TPAMI.2013.248
Jain, A., Zamir, A., Savarese, S., and Saxena, A. (2016a). “Deep learning on spatio-temporal graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Las Vegas, NV), 27–30.
Jain, A., Zamir, A. R., Savarese, S., and Saxena, A. (2016b). “Structural-RNN: deep learning on spatio-temporal graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Seattle, WA: IEEE), 5308–5317.
Lebailly, T., Kiciroglu, S., Salzmann, M., Fua, P., and Wang, W. (2020). “Motion prediction using temporal inception module,” in Proceedings of the Asian Conference on Computer Vision.
Li, C., Zhang, Z., Lee, W. S., and Lee, G. H. (2018). “Convolutional sequence to sequence model for human dynamics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5226–5234.
Li, M., Chen, S., Zhang, Y., and Tsang, I. (2020a). “Graph cross networks with vertex infomax pooling,” Advances in Neural Information Processing Systems, arXiv [Preprint]. arXiv: 2010.01804.
Li, M., Chen, S., Zhao, Y., Zhang, Y., Wang, Y., and Tian, Q. (2020b). “Dynamic multiscale graph neural networks for 3D skeleton based human motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 214–223.
Liu, H., and Wang, L. (2017). Human motion prediction for human-robot collaboration. J. Manufact. Syst. 44, 287–294. doi: 10.1016/j.jmsy.2017.04.009
Liu, Q., Liu, Z., Xiong, B., Xu, W., and Liu, Y. (2021). Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function. Adv. Eng. Inform. 49, 101360. doi: 10.1016/j.aei.2021.101360
Liu, R., and Liu, C. (2020). Human motion prediction using adaptable recurrent neural networks and inverse kinematics. IEEE Control Syst. Lett. 5, 1651–1656. doi: 10.1109/LCSYS.2020.3042609
Liu, Z., Liu, Q., Xu, W., Liu, Z., Zhou, Z., and Chen, J. (2019a). Deep learning-based human motion prediction considering context awareness for human-robot collaboration in manufacturing. Proc. CIRP 83, 272–278. doi: 10.1016/j.procir.2019.04.080
Liu, Z., Wu, S., Jin, S., Liu, Q., Lu, S., Zimmermann, R., et al. (2019b). “Towards natural and accurate future motion prediction of humans and animals,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Long Beach, CA: IEEE), 10004–10012.
Ma, T., Nie, Y., Long, C., Zhang, Q., and Li, G. (2022). “Progressively generating better initial guesses towards next stages for high-quality human motion prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6437–6446.
Mao, W., Liu, M., Salzmann, M., and Li, H. (2019). “Learning trajectory dependencies for human motion prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (Seoul: IEEE), 9489–9497.
Mao, W., Liu, M., Salzmann, M., and Li, H. (2021). Multi-level motion attention for human motion prediction. Int. J. Comput. Vis. 129, 2513–2535. doi: 10.1007/s11263-021-01483-7
Martinez, J., Black, M. J., and Romero, J. (2017). “On human motion prediction using recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Honolulu, HI: IEEE), 2891–2900.
Oguz, O. S., Gabler, V., Huber, G., Zhou, Z., and Wollherr, D. (2017). “Hybrid human motion prediction for action selection within human-robot collaboration,” in International Symposium on Experimental Robotics (Tokyo: Springer), 289–298.
Sang, H.-F., Chen, Z.-Z., and He, D.-K. (2020). Human motion prediction based on attention mechanism. Multimedia Tools Appl. 79, 5529–5544. doi: 10.1007/s11042-019-08269-7
Shan, W., Liu, Z., Zhang, X., Wang, S., Ma, S., and Gao, W. (2022). P-STMO: pre-trained spatial temporal many-to-one model for 3d human pose estimation. arXiv preprint arXiv:2203.07628. doi: 10.1007/978-3-031-20065-6_27
Shu, X., Zhang, L., Qi, G.-J., Liu, W., and Tang, J. (2021). Spatiotemporal co-attention recurrent neural networks for human-skeleton motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 44, 3300–3315. doi: 10.1109/TPAMI.2021.3050918
Sikander, R., Arif, M., Ghulam, A., Worachartcheewan, A., Thafar, M. A., and Habib, S. (2022). Identification of the ubiquitin–proteasome pathway domain by hyperparameter optimization based on a 2D convolutional neural network. Front. Genet. 13, 851688. doi: 10.3389/fgene.2022.851688
Tabassum, S., Pereira, F. S., Fernandes, S., and Gama, J. (2018). Social network analysis: an overview. Wiley Interdiscipl. Rev. Data Mining Knowl. Discovery 8, e1256. doi: 10.1002/widm.1256
Tang, R., Yang, G., Wei, H., Mao, Y., Ture, F., and Lin, J. (2018). Streaming voice query recognition using causal convolutional recurrent neural networks. arXiv preprint arXiv:1812.07754. doi: 10.48550/arXiv.1812.07754
Tang, Y., Ma, L., Liu, W., and Zheng, W. (2018). Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513. doi: 10.48550/arXiv.1805.02513
Wang, R., Panju, M., and Gohari, M. (2017). Classification-based RNN machine translation using GRUs. arXiv preprint arXiv:1703.07841. doi: 10.48550/arXiv.1703.07841
Keywords: motion prediction, GCN-based, two-stage prediction method, spatial attention, causally temporal
Citation: He Z, Zhang L and Wang H (2023) An initial prediction and fine-tuning model based on improving GCN for 3D human motion prediction. Front. Comput. Neurosci. 17:1145209. doi: 10.3389/fncom.2023.1145209
Received: 15 January 2023; Accepted: 02 March 2023;
Published: 05 April 2023.
Edited by:
Ye Duan, University of Missouri, United States
Reviewed by:
Guan Ning Lin, Shanghai Jiao Tong University, China
Muhammad Arif, University of Management and Technology, Pakistan
Tian Bai, Jilin University, China
Copyright © 2023 He, Zhang and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Zhiquan He, zhiquan@szu.edu.cn