Developing an Embedding, Koopman and Autoencoder Technologies-Based Multi-Omics Time Series Predictive Model (EKATP) for Systems Biology research

Liu, Suran; You, Yujie; Tong, Zhaoqi; Zhang, Le

doi:10.3389/fgene.2021.761629

ORIGINAL RESEARCH article

Front. Genet., 26 October 2021

Sec. Statistical Genetics and Methodology

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.761629

This article is part of the Research Topic Data Mining and Statistical Methods for Knowledge Discovery in Diseases Based on Multimodal Omics

Suran Liu¹^†

Yujie You¹^†

Zhaoqi Tong²

Le Zhang¹*

¹College of Computer Science, Sichuan University, Chengdu, China
²College of Software Engineering, Sichuan University, Chengdu, China

Predicting the state of multi-omics time series is important to systems biologists for detecting disease occurrence and monitoring health. However, prediction is difficult because multi-omics time series data are high-dimensional, nonlinear and noisy. For this reason, this study proposes an Embedding, Koopman and Autoencoder technologies-based multi-omics time series predictive model (EKATP) to predict the future state of a high-dimensional nonlinear multi-omics time series. We evaluate the EKATP on a genomics time series with chaotic behavior, a proteomics time series with oscillating behavior and a metabolomics time series with flow behavior. The computational experiments demonstrate that the proposed EKATP can substantially improve the accuracy, robustness and generalizability of future-state prediction for multi-omics time series data.

Introduction

Currently, the prediction of multi-omics time series states is one of the trending areas in systems biology research (Zhang et al., 2019a). In particular, the development of high-throughput technology (Soon et al., 2013) has produced large-scale multi-omics time series data (Liang and Kelemen 2017a), including genomics (Lockhart and Winzeler 2000), proteomics (Tyers and Mann, 2003), metabolomics (Weckwerth 2003) and more. Previous studies usually employed differential equation-based models (Eisenhammer et al., 1991; Zhang et al., 2016; Zhang and Zhang 2017; Liu G.-D. et al., 2020) to abstract and formalise multi-omics time series data (Bianconi et al., 2020). Solving these differential equations then made it possible to explore the time-varying connections and predict their future state (Ji et al., 2017). Predicting multi-omics time series states can not only reveal dynamic information about biological entities, such as genes, proteins and metabolites, but also help explore complicated biological interactions and the pathogenesis of diseases (Liang and Kelemen, 2017b).

However, a multi-omics time series usually has high dimensions (Perez-Riverol et al., 2017), complicated interaction relationships (Fischer 2008) and inevitable noise (Fischer 2008; Tsimring 2014). Thus, when we employ differential equations to model the multi-omics time series state, these equations are hard to solve because of their high dimensionality and nonlinear characteristics (Bianconi et al., 2020). For these reasons, predicting the future state of a multi-omics time series by solving such complicated nonlinear differential equations remains challenging.

Recently, future state prediction for multi-omics time series has been widely studied by computational biologists. In genomic studies, gene expression time series are usually used to develop gene regulatory networks (Davidson and Levin 2005; Zhang et al., 2018; Xiao et al., 2020; Zhang et al., 2020; Xiao et al., 2021; Zhang et al., 2021a). However, since the gene regulatory network is a complex high-dimensional nonlinear system (Zhang et al., 2012a), it often produces chaotic phenomena (Levnajić and Tadić 2010), which not only play an important role in maintaining stable gene expression patterns (Sevim and Rikvold 2008) but also are closely related to the occurrence of diseases (Suzuki et al., 2016). The Lorenz system (Lorenz 1963) is usually employed to describe such chaotic phenomena. However, predictions of the future state of a genomics time series with complicated nonlinear interactions are inaccurate because the Lorenz system does not handle such interactions well (Lai et al., 2018). Currently, delay embedding theory (Sauer et al., 1991; Holmes et al., 2012) is commonly used to transform the spatial information (complicated interactions) into temporal information (the future state of the time series (Chen et al., 2020)) for dimensional reduction (Gao et al., 2017; Li et al., 2017; Xia et al., 2017; Zhang et al., 2019b; Zhang et al., 2019c; Wu et al., 2020; You et al., 2020; Zhang et al., 2021b), whereas Koopman theory (Koopman, 1931) can transform the nonlinear system into a linear system to reduce the computing cost. Therefore, our first research question asks whether we can develop a time series predictive model that integrates the Lorenz system with delay embedding and Koopman theory to accurately predict the future state of a genomics time series with chaotic behavior.

In proteomics studies, proteomic time series data are usually used to infer protein–protein interactions (PPIs) (Wu et al., 2009). Currently, mass spectrometry technology (Mann et al., 2001) is employed to obtain proteomics time series data. However, since time-course experimental data obtained by mass spectrometry are unstable, proteomics time series data are prone to oscillating behavior (Iuchi et al., 2018). Previously, a nonlinear pendulum system (Hirsch 1974) was employed to describe the oscillating behavior, though it was subject to overfitting in a strongly noisy environment. Since the conjugate form of delay embedding (Sauer et al., 1991; Holmes et al., 2012) can ensure the reversibility of the time series predictive model (Chen et al., 2020) and reduce the impact of noise on prediction to a certain extent, our second research question asks whether we can develop a time series predictive model that integrates a nonlinear pendulum system with delay embedding to accurately predict the state of a proteomics time series with oscillating behavior.

In metabolomics studies, metabolic time series data that represent the flow behavior of biological fluids (serum, cerebrospinal fluid, etc.) are usually used to discover key metabolites in these fluids (Zhang et al., 2012b). A previous study (Noack et al., 2003) employed a nonlinear biological fluid system to describe metabolic time series data. However, because most nonlinear fluid flow systems have high dimensions (Lusch et al., 2018), it is difficult to select features from high-dimensional metabolic time series data, and the computation is time-consuming (Wang et al., 2021). Since neural networks (Wang et al., 2014) can decrease the computing cost (Song et al., 2017) by dimensional reduction of time series data (Hinton and Salakhutdinov, 2006), our third research question asks whether we can develop a time series predictive model that integrates a nonlinear fluid flow system with a neural network to accurately and quickly predict the future state of a metabolomics time series with flow behaviour.

To answer the above three research questions, this study develops an Embedding, Koopman and Autoencoder technologies-based multi-omics time series predictive model (EKATP) to predict the future state of the time series for the corresponding genomics, proteomics and metabolomics datasets. Compared with previous approaches (Lusch et al., 2018; Azencot et al., 2020), the contributions of the study are summarised as follows. First, we select key features from a high-dimensional nonlinear state by integrating a neural network with the delay embedding theory. Second, we replace the nonlinear system with a linear system by the Koopman theory to reduce the computing cost. Finally, we develop a neural network and delay embedding theory-based model for reversible mapping between a high-dimensional nonlinear system and a low-dimensional linear system, thereby improving the accuracy and robustness of prediction.

The rest of the manuscript is organised as follows. Related Works mainly describes the related work for Autoencoder, delay embedding theory and Koopman theory. Materials and Methods introduces the architecture of the EKATP and the related procedure. Experiments describes the computational experiments. Finally, we conclude the study and discuss the future work.

Related Works

Supplementary Presentation S1 details the related theory and existing research of the Autoencoder, delay embedding theory and Koopman theory.

Materials and Methods

Figure 1 describes the workflow of the EKATP.

FIGURE 1. EKATP workflow.

Problem and Definitions

Given a set of high-dimensional nonlinear multi-omics time series states $F = (F^{1}, F^{2}, \dots, F^{T})$, where $T$ represents the total number of steps, the time series state at step $t$ can be described as $F^{t} = {(f_{1}^{t}, f_{2}^{t}, \dots, f_{n}^{t})}^{'}$, where $n$ represents the dimension of the time series state and $'$ denotes the transpose of a vector. Our goal is to predict the future state of the multi-omics time series. Next, we detail how to develop the EKATP.

Autoencoding Observations

Since the EKATP is based on the Autoencoder framework, we employ Eq. 1 to define the objective function for the Autoencoder ($L_{id}$).

L_{id} = {‖ {\hat{F}}^{t} - F^{t} ‖}_{MSE} . (1)

Here, ${\hat{F}}^{t}$ is the high-dimensional time series state reconstructed by the encoder ($χ_{e}$) and decoder ($χ_{d}$) of the Autoencoder (Supplementary Presentation S1). ${‖ \cdot ‖}_{MSE}$ denotes the mean squared error (MSE), which is the expected value of the squared difference between the predicted value and the true value. This loss term enables us to construct an Autoencoder model that satisfies $χ_{d} \circ χ_{e} \approx id$, the identity.

Delay Embedding

According to the delay embedding theory (Supplementary Presentation S1), we employ $χ_{e}$ of the Autoencoder to approximate the delay embedding $Φ$, mapping the high-dimensional nonlinear input time series state $F^{t}$ to the low-dimensional time series state $Y^{t}$ by Eq. 2,

Y^{t} = {(y^{t}, y^{t + 1}, \dots, y^{t + L - 1})}^{'} = χ_{e} (F^{t}), (2)

where $L$ represents the dimension of the low-dimensional time series state. Similarly, the inverse mapping $χ_{d}$ of $χ_{e}$ is used to approximate the conjugate form of the delay embedding $Φ$, mapping the low-dimensional time series state back to the high-dimensional time series state by Eq. 3.

{\hat{F}}^{t} = {({\hat{f}}_{1}^{t}, {\hat{f}}_{2}^{t}, \dots {\hat{f}}_{n}^{t})}^{'} = χ_{d} (Y^{t}) . (3)
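The structure of the delay vector in Eq. 2 can be illustrated with a minimal sketch. In the EKATP the encoder $χ_{e}$ learns to produce $Y^{t}$ from $F^{t}$; here, as a simplifying assumption, a scalar observable $y$ is given directly, and the function name `delay_vectors` and the toy series are hypothetical.

```python
def delay_vectors(y, L):
    """Return every delay vector Y^t = (y^t, ..., y^{t+L-1}) of a scalar series."""
    if not 1 <= L <= len(y):
        raise ValueError("L must satisfy 1 <= L <= len(y)")
    return [list(y[t:t + L]) for t in range(len(y) - L + 1)]


# Toy scalar observable (hypothetical data, not taken from the paper).
series = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
Y = delay_vectors(series, L=3)

for t, vec in enumerate(Y):
    print(t, vec)
```

Consecutive delay vectors overlap: $Y^{t+1}$ drops the oldest entry of $Y^{t}$ and appends one new value, which is the shift structure that the matrices in Eqs 5, 7 exploit.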

Linearized Representation of the Koopman Operator

Based on the Koopman theory discussed in Supplementary Presentation S1, we construct the finite-dimensional linear matrix $C$ (and matrix $D$) to compute the forward (and backward) low-dimensional time series state. Equation 4 realizes the forward prediction from the low-dimensional time series state $Y^{t}$ to $Y^{t + 1}$.

Y^{t + 1} = C Y^{t} . (4)

Equation 4 can be expanded by Eq. 5.

\begin{bmatrix} y^{t + 1} \\ y^{t + 2} \\ \vdots \\ y^{t + L - 1} \\ y^{t + L} \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \\ a_{1} & a_{2} & a_{3} & \dots & a_{L} \end{bmatrix} \begin{bmatrix} y^{t} \\ y^{t + 1} \\ \vdots \\ y^{t + L - 2} \\ y^{t + L - 1} \end{bmatrix} . (5)

Here, the $a_{i}$ are the parameters to be trained, and $a_{1} \neq 0$, which keeps $C$ invertible. Equation 6 realizes the backward prediction from the low-dimensional time series state $Y^{t}$ to $Y^{t - 1}$.

Y^{t - 1} = D Y^{t} . (6)

Equation 6 can be expanded by Eq. 7.

\begin{bmatrix} y^{t - 1} \\ y^{t} \\ y^{t + 1} \\ \vdots \\ y^{t + L - 2} \end{bmatrix} = \begin{bmatrix} b_{1} & b_{2} & \dots & b_{L - 1} & b_{L} \\ 1 & 0 & \dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 0 \end{bmatrix} \begin{bmatrix} y^{t} \\ y^{t + 1} \\ \vdots \\ y^{t + L - 2} \\ y^{t + L - 1} \end{bmatrix} . (7)

Here, the $b_{i}$ are the parameters to be trained, and $b_{L} \neq 0$, which keeps $D$ invertible. Our goal is to optimise the parameters of the linear matrices $C$ and $D$ of Eqs 5, 7 by model training.
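As a minimal sketch of Eqs 5, 7, the code below builds the two matrices from their trainable rows and applies them to a delay vector. The helper names are hypothetical. For illustration only, the backward row is computed as the exact inverse of the forward recurrence, $b_{i} = -a_{i+1}/a_{1}$ for $i < L$ and $b_{L} = 1/a_{1}$; this is an assumption of the sketch, since in the EKATP $C$ and $D$ are both trained.

```python
def forward_matrix(a):
    """Matrix C of Eq. 5: shift rows followed by the trainable row a_1..a_L."""
    L = len(a)
    shift = [[1.0 if j == i + 1 else 0.0 for j in range(L)] for i in range(L - 1)]
    return shift + [[float(v) for v in a]]


def backward_matrix(b):
    """Matrix D of Eq. 7: the trainable row b_1..b_L followed by shift rows."""
    L = len(b)
    shift = [[1.0 if j == i else 0.0 for j in range(L)] for i in range(L - 1)]
    return [[float(v) for v in b]] + shift


def exact_backward_row(a):
    """Row b that makes D the exact inverse of C (illustrative assumption)."""
    if a[0] == 0:
        raise ValueError("a_1 must be non-zero for C to be invertible")
    return [-v / a[0] for v in a[1:]] + [1.0 / a[0]]


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy recurrence y^{t+2} = y^t + y^{t+1} (Fibonacci-like), so L = 2 and a = (1, 1).
a = [1.0, 1.0]
C = forward_matrix(a)
D = backward_matrix(exact_backward_row(a))

Y_t = [2.0, 3.0]            # (y^t, y^{t+1})
Y_next = matvec(C, Y_t)     # (y^{t+1}, y^{t+2})
Y_prev = matvec(D, Y_t)     # (y^{t-1}, y^t)
print(Y_next, Y_prev)
```

With this choice of $b$, applying $D$ after $C$ returns the original delay vector, which is the property that a consistency term between the forward and backward matrices rewards during training.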

Forward and Backward Prediction

We make the $k$-step forward prediction by Eq. 8 and the $k$-step backward prediction by Eq. 9 for the low-dimensional time series state $Y^{t}$. After that, $χ_{d}$ is used to map the predicted low-dimensional time series state back to the high-dimensional time series state by Eq. 10,

Y^{t + k} = C^{k} Y^{t} . (8)

Y^{t - k} = D^{k} Y^{t} . (9)

{\hat{F}}^{t \pm k} = χ_{d} (Y^{t \pm k}) . (10)

where $Y^{t + k}$ and $Y^{t - k}$ represent the low-dimensional state after $k$ steps of forward and backward prediction, respectively. ${\hat{F}}^{t \pm k}$ represents the predictive high-dimensional nonlinear state.
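The $k$-step predictions of Eqs 8, 9 amount to applying the same matrix repeatedly. The sketch below assumes a toy two-dimensional system with hand-picked matrices (hypothetical values, not trained parameters); in the EKATP each predicted low-dimensional state would then be passed through $χ_{d}$ as in Eq. 10, a step omitted here.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rollout(M, Y, k):
    """Return [M Y, M^2 Y, ..., M^k Y] by repeated application of M."""
    states = []
    current = list(Y)
    for _ in range(k):
        current = matvec(M, current)
        states.append(current)
    return states


# Hypothetical matrices for the recurrence y^{t+2} = y^t + y^{t+1}.
C = [[0.0, 1.0], [1.0, 1.0]]    # forward, Eq. 8
D = [[-1.0, 1.0], [1.0, 0.0]]   # backward, Eq. 9

Y_t = [2.0, 3.0]
forward = rollout(C, Y_t, k=3)   # Y^{t+1}, Y^{t+2}, Y^{t+3}
backward = rollout(D, Y_t, k=2)  # Y^{t-1}, Y^{t-2}
print(forward)
print(backward)
```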

Equations 11, 12 define the loss functions of forward prediction ($L_{fwd}$) and backward prediction ($L_{bwd}$), which minimize the difference between the predicted and true high-dimensional states at each step.

L_{fwd} = \frac{1}{k} \sum_{s = 1}^{k} {‖ {\hat{F}}^{t + s} - F^{t + s} ‖}_{MSE} . (11)

L_{bwd} = \frac{1}{k} \sum_{s = 1}^{k} {‖ {\hat{F}}^{t - s} - F^{t - s} ‖}_{MSE} . (12)

Equation 13 defines the loss function ($L_{idy}$), which minimizes the difference between the low-dimensional state predicted by the $C$ and $D$ matrices and the low-dimensional state obtained by mapping the true high-dimensional state through $χ_{e}$.

L_{idy} = \frac{1}{k} \sum_{s = 1}^{k} [{‖ C^{s} χ_{e} (F^{t}) - χ_{e} (F^{t + s}) ‖}_{MSE} + {‖ D^{s} χ_{e} (F^{t}) - χ_{e} (F^{t - s}) ‖}_{MSE}] . (13)

Additionally, we employ the loss function ($L_{con}$) of Eq. 14 to train the parameters $a_{i}$ and $b_{i}$ in the matrices $C$ and $D$, respectively. It penalises the reconstruction error after $s$ forward steps followed by $s$ backward steps, and vice versa.

L_{con} = \frac{1}{k} \sum_{s = 1}^{k} [{‖ χ_{d} (D^{s} C^{s} Y^{t}) - F^{t} ‖}_{MSE} + {‖ χ_{d} (C^{s} D^{s} Y^{t}) - F^{t} ‖}_{MSE}] . (14)

Parameter Estimation for the EKATP

Equation 15 optimizes the key parameters of the EKATP by minimizing the total loss $L$.

L = λ_{id} L_{id} + λ_{fwd} L_{fwd} + λ_{bwd} L_{bwd} + λ_{idy} L_{idy} + λ_{con} L_{con} . (15)

Here, $λ_{id}$, $λ_{fwd}$, $λ_{bwd}$, $λ_{idy}$ and $λ_{con}$ are user-defined hyperparameters.
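To make the five terms of Eq. 15 concrete, the following sketch evaluates them for one time step $t$. It is not the authors' implementation: the encoder and decoder are passed in as plain callables, the identity maps and unit weights used in the example are hypothetical, and no gradient-based training is shown.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def apply_power(M, v, s):
    """Return M^s v by applying M to v exactly s times."""
    for _ in range(s):
        v = matvec(M, v)
    return v


def ekatp_loss(F, t, k, enc, dec, C, D, lam):
    """Weighted sum of Eqs 1, 11, 12, 13 and 14 at step t (Eq. 15)."""
    if t - k < 0 or t + k >= len(F):
        raise ValueError("need k true states on both sides of t")
    Y = enc(F[t])
    terms = {"id": mse(dec(Y), F[t]), "fwd": 0.0, "bwd": 0.0, "idy": 0.0, "con": 0.0}
    for s in range(1, k + 1):
        Yf, Yb = apply_power(C, Y, s), apply_power(D, Y, s)
        terms["fwd"] += mse(dec(Yf), F[t + s]) / k
        terms["bwd"] += mse(dec(Yb), F[t - s]) / k
        terms["idy"] += (mse(Yf, enc(F[t + s])) + mse(Yb, enc(F[t - s]))) / k
        terms["con"] += (mse(dec(apply_power(D, Yf, s)), F[t])
                         + mse(dec(apply_power(C, Yb, s)), F[t])) / k
    total = sum(lam[name] * value for name, value in terms.items())
    return total, terms


# Toy data: states (y^t, y^{t+1}) of the recurrence y^{t+2} = y^t + y^{t+1}.
fib = [1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0]
F = [[fib[i], fib[i + 1]] for i in range(7)]
identity = lambda v: list(v)                       # stand-in encoder/decoder
lam = {"id": 1.0, "fwd": 1.0, "bwd": 1.0, "idy": 1.0, "con": 1.0}

C = [[0.0, 1.0], [1.0, 1.0]]
D = [[-1.0, 1.0], [1.0, 0.0]]
total, terms = ekatp_loss(F, t=3, k=2, enc=identity, dec=identity, C=C, D=D, lam=lam)

C_bad = [[0.0, 1.0], [1.0, 1.5]]                   # mis-tuned forward row
total_bad, _ = ekatp_loss(F, t=3, k=2, enc=identity, dec=identity, C=C_bad, D=D, lam=lam)
print(total, total_bad)
```

In this toy setting the exact matrices give a zero loss, while the mis-tuned forward row raises the forward, identity and consistency terms, which is the signal that training would use to adjust $a_{i}$ and $b_{i}$.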

Experiments

This section evaluates the predictive performance of the proposed EKATP on high-dimensional nonlinear multi-omics datasets by comparing it with recurrent neural networks (RNNs) (Jiang and Lai, 2019), long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), dynamic Autoencoder (DAE) (Lusch et al., 2018) and Koopman Autoencoder (KAE) (Azencot et al., 2020). The detailed experimental setup is listed in Supplementary Presentation S2. In addition, we detail the workflow chart and list the related pseudocode in Supplementary Figure S1; Supplementary Presentation S3.

Genomics

We employ the chaotic Lorenz system (Lorenz, 1963) to describe a gene expression time series with a low-dimensional manifold (Sauer et al., 1991) by Eq. 16,

\begin{cases} x_{t + 1} = x_{t} + h (η (y_{t} - x_{t})) \\ y_{t + 1} = y_{t} + h (x_{t} (ρ - z_{t}) - y_{t}) \\ z_{t + 1} = z_{t} + h (x_{t} y_{t} - β z_{t}) \end{cases}, (16)

where $η$ and $ρ$ represent the Prandtl and Rayleigh numbers, respectively, $β$ is related to geometry, and $t$ represents time. $h$ represents the level of complexity of the nonlinear system. When $h$ is greater, the nonlinear relationship between genes becomes more complicated.
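A minimal sketch of the update rule in Eq. 16 follows, written in the standard Lorenz (1963) form in which the first equation couples $x$ and $y$. The parameter values $η = 10$, $ρ = 28$, $β = 8/3$ and the initial state are assumptions (the classic chaotic setting), since the text does not state them; $h$ = 0.003 and 1,050 steps are taken from the experiments described in this section.

```python
def lorenz_step(state, h, eta=10.0, rho=28.0, beta=8.0 / 3.0):
    """One update of the discrete Lorenz system (parameter values are assumed)."""
    x, y, z = state
    return (
        x + h * eta * (y - x),
        y + h * (x * (rho - z) - y),
        z + h * (x * y - beta * z),
    )


def simulate_lorenz(initial, h, steps):
    """Return a list of `steps` states, starting from `initial`."""
    trajectory = [tuple(initial)]
    for _ in range(steps - 1):
        trajectory.append(lorenz_step(trajectory[-1], h))
    return trajectory


# Hypothetical initial state; h and the number of steps follow the text.
V = simulate_lorenz((1.0, 1.0, 1.0), h=0.003, steps=1050)
print(len(V), V[1])
```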

Since gene expression time series contains considerable noise, we employ white Gaussian noise (Li et al., 2017) to simulate the noise by Eq. 17,

\begin{cases} \tilde{x} = x + ε_{x} \\ \tilde{y} = y + ε_{y} \\ \tilde{z} = z + ε_{z} \end{cases}, (17)

where $\tilde{x}$, $\tilde{y}$ and $\tilde{z}$ represent data with noise. $ε_{x}$, $ε_{y}$ and $ε_{z}$ represent the white Gaussian noise added to $x$, $y$ and $z$, drawn from the normal distribution $N (0, σ^{2})$ with zero mean and standard deviation $σ$. The standard deviation $σ$ is referred to as the noise intensity.
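The noise model of Eq. 17 can be sketched as follows: independent draws from $N (0, σ^{2})$ are added to every coordinate of every state. The function name, the seed and the toy trajectory are hypothetical.

```python
import random


def add_white_noise(trajectory, sigma, seed=0):
    """Add independent N(0, sigma^2) noise to every coordinate of every state."""
    rng = random.Random(seed)
    return [tuple(value + rng.gauss(0.0, sigma) for value in state)
            for state in trajectory]


# Hypothetical clean trajectory of 1,000 three-dimensional states.
clean = [(float(i), 2.0 * i, -float(i)) for i in range(1000)]
noisy = add_white_noise(clean, sigma=0.01)

# Empirical standard deviation of the added noise.
residuals = [n - c for ns, cs in zip(noisy, clean) for n, c in zip(ns, cs)]
mean = sum(residuals) / len(residuals)
std = (sum((r - mean) ** 2 for r in residuals) / len(residuals)) ** 0.5
print(round(std, 4))
```

Setting `sigma=0.0` returns the clean data unchanged, which corresponds to the clean environment ($σ$ = 0.00) used in the experiments.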

Here, we describe how to obtain a high-dimensional gene expression time series with a low-dimensional manifold as follows. First, we generate the three-dimensional time series $V = (V^{1}, V^{2}, \dots, V^{T}) \in ℝ^{3}$ ( $T$ is the total step), which is listed in Supplementary Tables S1.1, S1.2, S1.3. Next, we develop a random orthogonal transformation (Anderson et al., 1987) matrix $P \in ℝ^{96 \times 3}$ . Finally, we map the state of a 3-dimensional time series onto the state of a 96-dimensional time series by Eq. 18 to simulate a high-dimensional gene expression time series $F = (F^{1}, F^{2}, \dots, F^{T}) \in ℝ^{96}$ with a 3-dimensional manifold, which is listed in Supplementary Tables S1.4, S1.5, S1.6.

F = P V . (18)
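Equation 18 and its inverse can be sketched as below. As an assumption, the matrix $P$ with orthonormal columns is generated here by Gram-Schmidt orthogonalisation of Gaussian random vectors, which is a swapped-in technique and not necessarily the procedure of Anderson et al. (1987). Because $P' P = I$, the low-dimensional state is recovered by $V = P' F$.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_orthonormal_columns(n, d, seed=0):
    """Return d orthonormal vectors of length n (the columns of P)."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in columns:                      # remove components along earlier columns
            proj = dot(u, v)
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        columns.append([a / norm for a in v])
    return columns


def lift(columns, v):
    """High-dimensional state F^t = P V^t (Eq. 18)."""
    n = len(columns[0])
    return [sum(col[i] * c for col, c in zip(columns, v)) for i in range(n)]


def project(columns, f):
    """Low-dimensional state V^t = P' F^t, valid because P' P = I."""
    return [dot(col, f) for col in columns]


P = random_orthonormal_columns(n=96, d=3)
V_t = [1.0, -2.0, 0.5]                 # hypothetical 3-dimensional state
F_t = lift(P, V_t)                     # 96-dimensional state
V_back = project(P, F_t)
print(len(F_t), [round(c, 6) for c in V_back])
```

The same inverse step is what allows 96-dimensional predictions to be mapped back to three dimensions for visualisation.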

To demonstrate the accuracy and robustness of the EKATP, we generate a small-scale system containing $T$ = 1,050 steps and choose the last 50 steps to visualize the predictive power of the EKATP.

Figure 2 shows the predictive error in the range of 50 steps under different initial conditions and environments. Detailed information is listed in Supplementary Tables S1.7, S1.8; Supplementary Presentation S4.

FIGURE 2. Comparison among the RNN, LSTM, DAE, KAE and EKATP. The abscissa represents the step, and the ordinate represents the predictive error. (A) The initial conditions are $h$ = 0.003 and $σ$ = 0.00. (B) The initial conditions are $h$ = 0.003 and $σ$ = 0.01. (C) The initial conditions are $h$ = 0.006 and $σ$ = 0.00. (D) The initial conditions are $h$ = 0.006 and $σ$ = 0.01.

Figures 2A,C demonstrate that the EKATP not only has a smaller predictive error than the existing methods in a clean environment ($σ$ = 0.00) but also keeps a stable predictive error when the complexity $h$ increases from 0.003 to 0.006. In particular, as the number of predictive steps increases, the predictive error of the EKATP increases more slowly than that of the existing methods.

Figures 2B,D show that the EKATP not only has a smaller predictive error than the existing methods in a noisy environment ($σ$ = 0.01) but also has a predictive error that fluctuates only slightly when $h$ increases from 0.003 to 0.006. Moreover, after 25 steps, the predictive error of the EKATP increases much more slowly than that of the existing methods.

Figure 2 indicates that the EKATP has greater predictive accuracy and robustness than the existing methods in clean and noisy environments.

To further demonstrate the generalizability of the EKATP, we generate a large-scale system containing $T$ = 15,000 steps under the condition of $h$ = 0.003 and $σ$ = 0.00. After that, we randomly choose three different time periods to train and test the model, the procedure of which is detailed in Supplementary Table S1.9.

Since the 3-dimensional time series state and the 96-dimensional time series state are diffeomorphic (Sauer et al., 1991), as the data preprocessing procedure indicates, the mapping between these two time series is reversible. Here, we map the 96-dimensional gene expression predictive results onto a 3-dimensional space by orthogonal inverse transformation (Anderson et al., 1987) to visualize the predictive result of the EKATP.

Figures 3A-C demonstrate that the predictive results of the EKATP are close to the true values for different periods of the time series. Figure 3 shows that the EKATP can accurately predict the gene expression time series at different periods, implying that it has strong generalizability, even in a very complicated nonlinear environment.

FIGURE 3. The 50-step predictive trajectories of the EKATP under the initial conditions $h$ = 0.003 and $σ$ = 0.00. Grey represents the full true data. (A) Prediction for the first period. Yellow and green represent true and predictive data, respectively. (B) Prediction for the second period. Purple and cyan represent true and predictive data, respectively. (C) Prediction for the third period. Blue and red represent true and predictive data, respectively.

Proteomics

We use a nonlinear pendulum model (Hirsch, 1974) with oscillatory behaviour to describe a proteomics time series with a low-dimensional manifold (Sauer et al., 1991) by Eq. 19,

\begin{cases} \frac{d^{2} θ}{d t^{2}} + \frac{g}{l} \sin θ = 0 \\ θ (t_{0}) = h \end{cases}, (19)

where $l$, $g$ and $t$ denote the pendulum length, the gravitational acceleration and time, respectively. $h$ denotes the initial value of $θ$, which represents the level of complexity of the nonlinear system. When $h$ is greater, the nonlinear relationship between proteins becomes more complicated.
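A minimal sketch of how a two-dimensional trajectory $(θ, \dot{θ})$ could be generated from Eq. 19 is given below. Several choices are assumptions not stated in the text: the initial angular velocity is zero, $g/l$ = 9.8, the time step is 0.01, and the integrator is the semi-implicit (symplectic) Euler scheme, chosen because it keeps the oscillation amplitude bounded over long runs.

```python
import math


def simulate_pendulum(h, steps, dt=0.01, g=9.8, length=1.0):
    """Integrate theta'' + (g/l) sin(theta) = 0 from theta(t0) = h, theta'(t0) = 0."""
    theta, omega = h, 0.0
    trajectory = [(theta, omega)]
    for _ in range(steps - 1):
        omega -= dt * (g / length) * math.sin(theta)   # update velocity first
        theta += dt * omega                            # then position, using new velocity
        trajectory.append((theta, omega))
    return trajectory


def energy(state, g=9.8, length=1.0):
    """Energy per unit mass and squared length; conserved by the exact dynamics."""
    theta, omega = state
    return 0.5 * omega ** 2 - (g / length) * math.cos(theta)


V = simulate_pendulum(h=0.8, steps=1600)
drift = max(abs(energy(s) - energy(V[0])) for s in V)
print(len(V), round(drift, 4))
```

A larger initial angle $h$ moves the motion further from the small-angle regime where $\sin θ \approx θ$, which is one way to read the statement that a greater $h$ means a more strongly nonlinear system.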

Since a considerable amount of noise exists in a protein time series, we employ white Gaussian noise (Li et al., 2017) to describe it by Eq. 20,

\begin{cases} \tilde{θ} = θ + ε_{θ} \\ \tilde{\dot{θ}} = \dot{θ} + ε_{\dot{θ}} \end{cases}, (20)

where $\tilde{θ}$ and $\tilde{\dot{θ}}$ represent data with noise. $ε_{θ}$ and $ε_{\dot{θ}}$ represent the white Gaussian noise terms for $θ$ and $\dot{θ}$, drawn from the normal distribution $N (0, σ^{2})$ with zero mean and standard deviation $σ$.

Here, we describe how to obtain a high-dimensional proteomics time series with a low-dimensional manifold. First, we generate the 2-dimensional time series $V = (V^{1}, V^{2}, \dots, V^{T}) \in ℝ^{2}$ , which is listed in Supplementary Tables S2.1, S2.2. Next, we develop a random orthogonal transformation (Anderson et al., 1987) matrix $P \in ℝ^{64 \times 2}$ . Finally, we map the state of a 2-dimensional time series onto the state of a 64-dimensional time series by Eq. 18 to simulate a high-dimensional proteomics time series $F = (F^{1}, F^{2}, \dots, F^{T}) \in ℝ^{64}$ with a 2-dimensional manifold, which is listed in Supplementary Tables S2.3, S2.4.

To demonstrate the accuracy and robustness of the EKATP, we generate a system containing $T$ = 1,600 steps and choose the last 1,000 steps to visualize the predictive performance of the EKATP.

Figure 4 shows that the EKATP can effectively predict a proteomic time series under clean and noisy environments within 1,000 steps, the details of which are listed in Supplementary Tables S2.5, S2.6; Supplementary Presentation S4.

FIGURE 4. Comparison with the RNN, LSTM, DAE, KAE and EKATP. The abscissa represents the step, and the ordinate represents the predictive error. (A) The initial conditions are $h$ = 0.8 and $σ$ = 0.00. (B) The initial conditions are $h$ = 2.4 and $σ$ = 0.00. (C) The initial conditions are $h$ = 0.8 and $σ$ = 0.03. (D) The initial conditions are $h$ = 2.4 and $σ$ = 0.03. (E) The initial conditions are $h$ = 0.8 and $σ$ = 0.08. (F) The initial conditions are $h$ = 2.4 and $σ$ = 0.08.

Figures 4A,B show that the EKATP not only has a smaller predictive error than the existing methods in a clean environment ($σ$ = 0.00) but also maintains a smaller predictive error when $h$ increases from 0.8 to 2.4. Moreover, the predictive error of the EKATP increases much more slowly than that of the existing methods as the number of predictive steps increases.

Figures 4C,D demonstrate that the EKATP has a smaller predictive error than the existing methods in a noisy environment ($σ$ = 0.03). When $h$ increases from 0.8 to 2.4, the predictive error of the EKATP remains stable. In particular, as the number of predictive steps increases, the predictive error of the EKATP increases much more slowly than that of the existing methods.

Figures 4E,F indicate that the EKATP not only has a smaller predictive error than the existing methods in a noisy environment ($σ$ = 0.08) but also keeps a stable predictive error when $h$ increases from 0.8 to 2.4. In particular, over long horizons (after 500 steps), the predictive error of the existing methods increases much faster than that of the EKATP.

Figures 4A,C,E show that the predictive error of the EKATP remains stable when the noise intensity $σ$ increases from 0 to 0.08 under complexity $h$ = 0.8. Figures 4B,D,F show that the predictive error of the EKATP remains stable when the noise intensity $σ$ increases from 0 to 0.08 under complexity $h$ = 2.4.

Figure 4 demonstrates that the EKATP outperforms the existing methods in predictive accuracy and robustness in clean and noisy environments.

Since Figure 4 shows that KAE has a better predictive effect than the other existing methods, we use it to compare the predictive performance with the EKATP by visualizing the predictive trajectory.

Since the 2-dimensional time series state and the 64-dimensional time series state are diffeomorphic (Sauer et al., 1991), as our data preprocessing procedure indicates, the mapping between these two time series is reversible. Here, we map the 64-dimensional protein time series predictive results onto a 2-dimensional space by orthogonal inverse transformation (Anderson et al., 1987) to visualize the predictive time series trajectory. Figure 5 shows the predictive trajectories of the KAE and EKATP within 1,000 steps under the initial conditions of $h$ = 2.4 and $σ$ = 0.03. The predictive protein time series trajectory of the EKATP (Figure 5B) is much closer to the true trajectory than that of the KAE (Figure 5A). Figure 5 further indicates that the predictive accuracy and robustness of the EKATP are better than those of the KAE.

FIGURE 5. The prediction trajectories within 1,000 steps under the initial conditions of $h$ = 2.4 and $σ$ = 0.03. The abscissa represents $\dot{θ}$ , the ordinate represents $θ$ , blue colours represent true data and red dots represent predictive data. (A) KAE. (B) EKATP.

To further demonstrate that the EKATP has strong generalizability, we randomly selected 20 different protein time series for model training and analysis. The details are listed in Table 1; Supplementary Table S2.7.

TABLE 1. Predictive error at 1,000 steps for both the KAE and EKATP.

We test the KAE and EKATP on 20 different proteomics time series datasets. Table 1 shows their predictive error at 1,000 steps under different initial noise and complexity ($h$) conditions. Under each condition, the minimum, maximum, average and variance of the predictive error of the EKATP are smaller than those of the KAE, and the differences are statistically significant (p-value < 0.05) (Gao et al., 2017; Li et al., 2017; Gao et al., 2021). Table 1 implies that the EKATP has statistically significant predictive power across different time series datasets.

Metabolomics

We employ a nonlinear biological fluid system (Noack et al., 2003) to simulate the flow behavior of biological fluids and thus describe a high-dimensional metabolic time series with a low-dimensional manifold (Sauer et al., 1991) by Eq. 21,

\begin{cases} \dot{x} = γ x - ω y + A x z \\ \dot{y} = ω x + γ y + A y z \\ \dot{z} = - λ (z - x^{2} - y^{2}) \end{cases}, (21)

where $γ$, $ω$ and $A$ determine the size of the fluid, and $λ$ determines the speed of the dynamics of $z$. The different initial values of $x$, $y$ and $z$ determine the different nonlinear complexities of the metabolomics time series. We use the initial conditions $ζ_{1}$ ($x$ = 0, $y$ = -0.01, $z$ = 0) and $ζ_{2}$ ($x$ = 0.01, $y$ = -0.1, $z$ = 0.5) to generate a high-dimensional metabolomics time series with low complexity $h_{1}$ and high complexity $h_{2}$, respectively.
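The system of Eq. 21 can be integrated numerically as sketched below. The parameter values ($γ$ = 0.1, $ω$ = 1, $A$ = -0.1, $λ$ = 10), the time step and the fourth-order Runge-Kutta integrator are all assumptions made for illustration, since the text does not specify them; the two initial conditions $ζ_{1}$ and $ζ_{2}$ are taken from the text.

```python
def flow_rhs(state, gamma=0.1, omega=1.0, A=-0.1, lam=10.0):
    """Right-hand side of Eq. 21 (parameter values are assumed)."""
    x, y, z = state
    return (
        gamma * x - omega * y + A * x * z,
        omega * x + gamma * y + A * y * z,
        -lam * (z - x * x - y * y),
    )


def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = flow_rhs(state)
    k2 = flow_rhs(shifted(state, k1, dt / 2))
    k3 = flow_rhs(shifted(state, k2, dt / 2))
    k4 = flow_rhs(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate_flow(initial, steps, dt=0.01):
    """Return a list of `steps` states, starting from `initial`."""
    trajectory = [tuple(initial)]
    for _ in range(steps - 1):
        trajectory.append(rk4_step(trajectory[-1], dt))
    return trajectory


zeta_1 = (0.0, -0.01, 0.0)    # low-complexity initial condition from the text
zeta_2 = (0.01, -0.1, 0.5)    # high-complexity initial condition from the text
V_low = simulate_flow(zeta_1, steps=900)
V_high = simulate_flow(zeta_2, steps=900)
print(V_low[-1], V_high[-1])
```

Under these assumed parameters, the radius $r = \sqrt{x^{2} + y^{2}}$ obeys $\dot{r} = r (γ + A z)$, so trajectories spiral out from near the origin toward a limit cycle with $r = 1$ and $z = 1$.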

Since the metabolomics time series contains considerable noise, we employ white Gaussian noise (Li et al., 2017) to describe it by Eq. 22,

\begin{cases} \tilde{x} = x + ε_{x} \\ \tilde{y} = y + ε_{y} \\ \tilde{z} = z + ε_{z} \end{cases}. (22)

Here, we show how to generate a high-dimensional metabolomics time series with a low-dimensional manifold. First, we build up the 3-dimensional time series $V = (V^{1}, V^{2}, \dots, V^{T}) \in ℝ^{3}$ , which is listed in Supplementary Tables S3.1; S3.2. Next, we develop a random orthogonal transformation (Anderson et al., 1987) matrix $P \in ℝ^{96 \times 3}$ . Finally, we map the state of the 3-dimensional time series onto the state of the 96-dimensional time series by Eq. 18 to simulate a high-dimensional metabolic time series $F = (F^{1}, F^{2}, \dots, F^{T}) \in ℝ^{96}$ with the 3-dimensional manifold, which is listed in Supplementary Tables S3.3, S3.4.

To demonstrate the accuracy and robustness of the EKATP, we generate a system containing $T$ = 900 steps and choose the last 100 steps to visualize the predictive result of the EKATP. Figure 6 shows the predictive results of the metabolic time series under different initial conditions and environments for the last 100 steps. Detailed information is listed in Supplementary Tables S3.5, S3.6; Supplementary Presentation S4.

FIGURE 6. Comparison of the RNN, LSTM, DAE, KAE and EKATP. The abscissa represents the time step, and the ordinate represents the predictive error. (A) The initial conditions are $h_{1}$ and $σ$ = 0.000. (B) The initial conditions are $h_{1}$ and $σ$ = 0.001. (C) The initial conditions are $h_{2}$ and $σ$ = 0.000. (D) The initial conditions are $h_{2}$ and $σ$ = 0.001.

Figures 6A,C demonstrate that the EKATP has a smaller predictive error than the existing methods in a clean environment ($σ$ = 0.000). When the complexity $h$ increases, the predictive error of the EKATP remains stable. As the number of predictive steps increases, the predictive error of the existing methods increases rapidly, while the predictive error of the EKATP remains small.

Figures 6B,D suggest that the EKATP not only has a smaller predictive error than the existing methods in a low noise intensity environment ($σ$ = 0.001) but also keeps a stable predictive error when $h$ increases. In particular, over long horizons, the predictive error of the EKATP increases much more slowly than that of the existing methods.

Figure 6 implies that the EKATP has better predictive accuracy and robustness than the existing methods in clean and weakly noisy environments.

Since a metabolomics time series usually has strong noise intensity (Mak et al., 2015), we use the EKATP to predict a high-dimensional metabolomics time series under strong noise intensities to demonstrate its robustness. Because the 3-dimensional time series state and the 96-dimensional time series state are diffeomorphic (Sauer et al., 1991), the mapping between these two time series is reversible. Thus, we map the 96-dimensional metabolic time series predictive results onto a 3-dimensional space by orthogonal inverse transformation (Anderson et al., 1987), and Figure 7 shows the resulting predictive time series trajectories of the EKATP under different intensities of noise. We select the last 100 steps to validate the predictive power of the EKATP as in the previous setup (Supplementary Table S3.6). The results demonstrate that although the true data become increasingly scattered as the noise intensity $σ$ increases, the predictive time series trajectory of the EKATP stays close to the true data to a certain extent (Figures 7A-F), which implies that the EKATP retains a satisfactory predictive performance as the noise intensity increases.

FIGURE 7. The predictive trajectories of the EKATP within 100 steps under different intensities of noise. Cyan dots represent true data with an interval of $t \in$ [0:900]. Red dots represent predictive data with an interval of $t \in$ [800:900]. (A) $σ$ = 0.001. (B) $σ$ = 0.005. (C) $σ$ = 0.010. (D) $σ$ = 0.050. (E) $σ$ = 0.100. (F) $σ$ = 0.500.
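The reversibility of the mapping between the 3-dimensional and 96-dimensional states can be illustrated with a minimal sketch. It assumes, purely for illustration, that the lifting is a 96 × 3 matrix with orthonormal columns (built here from a QR factorization), that the noise is additive Gaussian with standard deviation $σ$, and that the toy trajectory `x_low` stands in for the metabolomics state; none of these choices is taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-dimensional trajectory (toy data, not the paper's dataset).
t = np.linspace(0.0, 20.0, 900)
x_low = np.stack([np.sin(t), np.cos(t), np.sin(0.5 * t)], axis=1)  # (900, 3)

# Assumed lifting: a 96 x 3 matrix with orthonormal columns, so q.T @ q = I_3.
q, _ = np.linalg.qr(rng.standard_normal((96, 3)))

# Forward map to the 96-dimensional space.
x_high = x_low @ q.T  # (900, 96)

# Assumed noise model: additive Gaussian noise with standard deviation sigma.
sigma = 0.05
x_noisy = x_high + sigma * rng.standard_normal(x_high.shape)

# Inverse map: because the columns of q are orthonormal, q.T undoes the lifting.
x_back = x_high @ q          # exact recovery of the clean trajectory
x_back_noisy = x_noisy @ q   # noisy trajectory viewed in 3 dimensions
```

Under these assumptions the clean trajectory is recovered up to floating-point error, while noise added in the 96-dimensional space is projected onto the 3-dimensional coordinates used for plotting.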

Moreover, we use Eqs 23, 24 to calculate the Pearson correlation coefficient (PCC) (Abar et al., 2017) and the root mean squared error (RMSE) (Abar et al., 2017) between predictive and true data under different noise intensities.

Here, $V^{t}$ and ${\hat{V}}^{t}$ represent the true and predictive data at time $t$ . $μ$ and $\hat{μ}$ represent the average value for true and predictive data, respectively. $p$ represents the predictive step size.

$$\mathrm{PCC} = \frac{\sum_{t=1}^{p} ({\hat{V}}^{t} - \hat{μ})(V^{t} - μ)}{\sqrt{\sum_{t=1}^{p} ({\hat{V}}^{t} - \hat{μ})^{2}}\,\sqrt{\sum_{t=1}^{p} (V^{t} - μ)^{2}}}. \tag{23}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{p} \sum_{t=1}^{p} \lVert {\hat{V}}^{t} - V^{t} \rVert^{2}}. \tag{24}$$
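As a minimal sketch, Eqs 23, 24 could be computed as follows. The function names `pcc` and `rmse` and the toy data are hypothetical. For multi-dimensional states, the sketch assumes that the PCC is taken over all flattened components, which is one possible reading of Eq. 23, while the RMSE uses the Euclidean norm of the error at each step.

```python
import numpy as np

def pcc(v_true, v_pred):
    """Pearson correlation coefficient in the spirit of Eq. 23.

    Assumption: multi-dimensional states are flattened, so every component
    at every predictive step counts as one sample.
    """
    a = np.asarray(v_pred, dtype=float).ravel()
    b = np.asarray(v_true, dtype=float).ravel()
    a_c = a - a.mean()  # predictive data minus its mean
    b_c = b - b.mean()  # true data minus its mean
    denom = np.sqrt(np.sum(a_c ** 2)) * np.sqrt(np.sum(b_c ** 2))
    return float(np.sum(a_c * b_c) / denom)

def rmse(v_true, v_pred):
    """RMSE as in Eq. 24: squared Euclidean error per step, averaged over p steps."""
    diff = np.asarray(v_pred, dtype=float) - np.asarray(v_true, dtype=float)
    diff = diff.reshape(diff.shape[0], -1)  # (p, state_dim)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Toy data (illustrative only, not from the article).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 100)
v_true = np.stack([np.sin(t), np.cos(t), 0.1 * t], axis=1)     # p = 100 steps, 3 dims
v_pred = v_true + 0.01 * rng.standard_normal(v_true.shape)     # slightly perturbed prediction

pcc_value = pcc(v_true, v_pred)
rmse_value = rmse(v_true, v_pred)
```

A PCC close to 1 together with a small RMSE indicates that the predicted trajectory follows both the shape and the scale of the true data.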

Figure 8A shows that the PCC value of the EKATP gradually decreases when we increase the noise intensity $σ$, but the overall value remains relatively high. Figure 8B indicates that although the RMSE value of the EKATP gradually increases with the noise intensity $σ$, it is still relatively small. Thus, we conclude that the EKATP is resistant to noise interference and remains robust even under strong noise intensity.

FIGURE 8. PCC and RMSE values between predictive and true data under different noise intensities. The details are listed in Supplementary Table S3.7. (A) PCC value between predictive and true data, where the abscissa represents the noise intensity and the ordinate represents the PCC value. (B) RMSE value between predictive and true data, where the abscissa represents the noise intensity and the ordinate represents the RMSE value.

Conclusion and Future Work

To answer the three proposed questions, this study developed an EKATP to predict the future state of a high-dimensional nonlinear multi-omics time series. First, we select key features from high-dimensional nonlinear multi-omics time series data. After that, we map these key features to the low-dimensional linear space. Next, we obtain the future state of the multi-omics time series by learning the evolutionary relationship between the adjacent states of the time series in the low-dimensional linear space. Finally, we predict the future state of the high-dimensional nonlinear multi-omics time series by mapping the low-dimensional linear predictive state back to the high-dimensional nonlinear space. The experimental results demonstrate that the EKATP can greatly improve the accuracy, robustness and generalisability to predict the future state of a time series for genomics (Figures 2, 3), proteomics (Figures 4, 5; Table 1) and metabolomics (Figures 6–8) datasets.
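The four steps summarised above can be sketched on toy data. This is not the EKATP itself: the sketch swaps in a truncated SVD (a linear encoder and decoder) for the autoencoder and delay embedding, and a least-squares fit of a single matrix `k` for the learned Koopman operator. All variable names and the toy rotation data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy series: a 2-D rotation lifted to 20 dimensions (illustrative only).
steps, latent_dim, high_dim = 200, 2, 20
theta = 0.1
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
z = np.empty((steps, latent_dim))
z[0] = [1.0, 0.0]
for i in range(1, steps):
    z[i] = rot @ z[i - 1]
lift, _ = np.linalg.qr(rng.standard_normal((high_dim, latent_dim)))
x = z @ lift.T                       # (steps, high_dim)

train, test = x[:150], x[150:]

# Steps 1-2 (stand-in): a linear encoder from a truncated SVD,
# used here in place of the autoencoder.
_, _, vt = np.linalg.svd(train, full_matrices=False)
encoder = vt[:latent_dim].T          # (high_dim, latent_dim)
y = train @ encoder                  # low-dimensional states

# Step 3: fit a linear operator k so that y[t + 1] is approximately y[t] @ k.
k, *_ = np.linalg.lstsq(y[:-1], y[1:], rcond=None)

# Step 4: roll the linear dynamics forward and decode each predicted state.
state = y[-1]
preds = []
for _ in range(len(test)):
    state = state @ k
    preds.append(state @ encoder.T)
preds = np.asarray(preds)

max_error = float(np.max(np.abs(preds - test)))
```

Because the toy dynamics are exactly linear in a 2-dimensional subspace, the linear stand-in reproduces the held-out steps almost exactly; nonlinear multi-omics data is where a learned nonlinear encoder would be needed instead.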

However, the current study still has several shortcomings. For example, it remains unclear how the embedding dimension used to map the high-dimensional nonlinear space to the low-dimensional linear space affects predictive accuracy, and how high-performance computing could be used to increase the efficiency of the EKATP. Applying the EKATP to network biological datasets (Liu X. et al., 2020) is another direction that needs further study. We will therefore improve the EKATP from these perspectives in future work.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author Contributions

YY, LZ and SL conceived the project. SL, YY and ZT carried out experiments. SL and YY visualized experiment results. SL drafted the manuscript. YY and LZ revised the article. All the authors read and approved the final article.

Funding

This work was supported by the National Science and Technology Major Project (2018ZX10201002).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.761629/full#supplementary-material

References

Abar, T., El Asmi, A. S., and Asmi, S. E. (2017). “Machine Learning Based QoE Prediction in SDN Networks,” in Proceedings of the 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC), Valencia, Spain, June 2017. doi:10.1109/IWCMC.2017.7986488