LPI-SKF: Predicting lncRNA-Protein Interactions Using Similarity Kernel Fusions

Zhou, Yuan-Ke; Hu, Jie; Shen, Zi-Ang; Zhang, Wen-Ya; Du, Pu-Feng

doi:10.3389/fgene.2020.615144

METHODS article

Front. Genet., 09 December 2020

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.615144

This article is part of the Research TopicGraph-Based Methods Volume II: Innovative Graph-Based Methods and Theories for Biomedical Data AnalysisView all 8 articles

LPI-SKF: Predicting lncRNA-Protein Interactions Using Similarity Kernel Fusions

Yuan-Ke Zhou

Jie Hu

Zi-Ang Shen

Wen-Ya Zhang

Pu-Feng Du^*

College of Intelligence and Computing, Tianjin University, Tianjin, China

Long non-coding RNAs (lncRNAs) play an important role in serval biological activities, including transcription, splicing, translation, and some other cellular regulation processes. lncRNAs perform their biological functions by interacting with various proteins. The studies on lncRNA-protein interactions are of great value to the understanding of lncRNA functional mechanisms. In this paper, we proposed a novel model to predict potential lncRNA-protein interactions using the SKF (similarity kernel fusion) and LapRLS (Laplacian regularized least squares) algorithms. We named this method the LPI-SKF. Various similarities of both lncRNAs and proteins were integrated into the LPI-SKF. LPI-SKF can be applied in predicting potential interactions involving novel proteins or lncRNAs. We obtained an AUROC (area under receiver operating curve) of 0.909 in a 5-fold cross-validation, which outperforms other state-of-the-art methods. A total of 19 out of the top 20 ranked interaction predictions were verified by existing data, which implied that the LPI-SKF had great potential in discovering unknown lncRNA-protein interactions accurately. All data and codes of this work can be downloaded from a GitHub repository (https://github.com/zyk2118216069/LPI-SKF).

Introduction

The human genome is comprised of ~3.2 billion nucleotides, which harbors ~20,000–25,000 protein-coding genes (International Human Genome Sequencing Consortium, 2004). The remaining non-coding genes were once considered to be “junk DNA” in the 1970s due to their weak coding capacity. This included pseudogenes, and simple repeats. (Comings, 1972; Ohno and Smith, 1972). Nonetheless, non-coding sequences have received continuous attention since the 1970s. With the development of sequencing technologies, various ncRNAs (non-coding RNAs), like H19 and XIST (Brannan et al., 1990; Brockdorff et al., 1992; Kung et al., 2013), were discovered in biological regulation processes. lncRNA (long non-coding RNA) is an important type of ncRNAs with a length longer than 200 nt (Mercer et al., 2009; Ma et al., 2013). lncRNAs play important roles in various biological processes (Clark and Mattick, 2011), including transcription (Martianov et al., 2007), splicing (Rintala-Maki and Sutherland, 2009), translation (Beltran et al., 2008), imprinting (Bartolomei et al., 1991), apoptosis (Reeves et al., 2007), and many more. lncRNAs perform their molecular functions by interacting with proteins (Hentze et al., 2018). For example, MALAT1, a functional lncRNA, which is highly expressed in several tumors, can bind the tumor suppressor gene SFPQ (also known as PSF) to release proto-oncogene PTBP2 (also known as PTB) from the SFPQ/PTBP2 complex (Meissner et al., 2000; Tseng et al., 2009; Gutschner et al., 2013; Ji et al., 2014). Studying lncRNA-protein interactions is of great value in understanding the functional mechanism of lncRNAs. However, wet experiments to determining lncRNA-protein interactions are always costly and time-consuming. Therefore, it is crucial to develop efficient and accurate computational methods to predict potential lncRNA-protein interactions.

Recently, a number of computational methods have been developed to predict novel lncRNA-protein interactions. Generally, these methods fall into two categories, the supervised binary classification-based methods and semi-supervised learning-based methods. The most significant difference between these two categories is whether the non-interacting lncRNA-protein pairs are regarded as negative samples or unlabeled samples.

In the binary classification methods, the non-interacting lncRNA-protein pairs are regarded as negative instances. Muppirala et al. encoded RNA-protein pairs using sequence information and trained the model RPISeq, using SVM (support vector machine) and RF (random forest) classifiers (Muppirala et al., 2011). By encoding RNA-protein pairs in different ways, two more models were built by SVM or RF classifiers in the following years (Suresh et al., 2015; Xiao et al., 2017). Wang et al. applied a novel extended naive-Bayes classifier on sequence-based features to predict potential protein-RNA interactions (Wang et al., 2013). Ensemble learning was widely applied in combining various machine learning algorithms in predicting lncRNA-protein interactions (Deng et al., 2018; Hu et al., 2018; Wekesa et al., 2019). Despite all these efforts, selecting veracious negative instances is still the most challenging problem in training binary classification-based models. Moreover, the dataset in predicting lncRNA-protein interactions is always highly imbalanced in nature, which could influence the prediction performances in many ways.

In the semi-supervised learning methods, non-interacting lncRNA-protein pairs were considered as unlabeled instances. Lu et al. introduced the matrix multiplication method to score each potential protein-RNA pair (Lu et al., 2013). Li et al. (2015) utilized the RWR (random walk with restart) algorithm on the lncRNA-protein-protein heterogeneous network to predict lncRNA-protein interactions. Serval prediction models were established by the MF (matrix factorization) algorithm, which separates the adjacency matrix into two talent feature vectors (Liu et al., 2017; Ma et al., 2019; Zhang T. et al., 2020). Zhao et al. integrated the RWR and MF algorithm to construct a prediction model (Zhao et al., 2018). A label propagation algorithm is another common recommendation algorithm, two models were built based on label propagation algorithms (Zhang et al., 2018a; Zhu et al., 2019). Meanwhile, some other machine learning algorithms were also adapted in the prediction of lncRNA-protein interactions, including feature projection ensemble learning (Zhang et al., 2018b), KATZ scoring schemes (Zhang et al., 2019), the kernel ridge regression algorithm (Shen et al., 2019), and the depth-first search algorithm (Zhang H. et al., 2020).

Although existing computational models have achieved great performances, there are still some problems that should be solved. With the development of high-throughput sequencing technology, a large number of novel lncRNAs have been discovered. Unlike lncRNAs, that were deposited in the database long ago, little is known about the interacting proteins of these newly identified lncRNAs. Therefore, few existing models can infer potential interacting proteins for these lncRNAs (Zhang et al., 2018b; Zhang T. et al., 2020).

In this paper, we proposed a new model to predict lncRNA-protein interactions based on the similarity kernel fusion approach, namely LPI-SKF. Multiple similarities between lncRNAs and proteins were first calculated. These similarities were integrated to obtain a comprehensive similarity. Ultimately, the Laplacian regularized least squares framework was applied to build the predictive model. Five-fold cross-validation was used to estimate the performance of LPI-SKF in this work. The LPI-SKF achieved an AUROC (area under receiver operating characteristics curve) of 0.909 and an AUPR (area under precision-recall curve) of 0.685, which indicated that the LPI-SKF method could identify unknown lncRNA-protein interactions accurately. Moreover, LPI-SKF could also be used to identify interacting partners for novel lncRNA/proteins. A total of 19 out of our 20 top-ranked lncRNA-protein interaction predictions were confirmed by existing data.

Materials and Methods

In this work, we proposed an lncRNA-protein interaction prediction model, named LPI-SKF. This model can be summarized in four steps, which are shown in Figure 1. Firstly, we collected experimentally verified lncRNA-protein interactions in the NPInter V2.0 database and constructed the heterogeneous network. Secondly, based on the assumption that similar lncRNAs tend to interact with similar proteins and vice versa, we calculated three different pairwise similarities for lncRNAs, and three different pairwise similarities for proteins, respectively. Thirdly, to synthesize the similarity information in different aspects and to also reduce noise, the SKF approach was utilized to integrate the lncRNA similarities and protein similarities. Finally, considering the network structure information, we combined the Laplacian regularization and the least squares method to build our prediction model.

FIGURE 1

Figure 1. The flowchart of the entire work. Known lncRNA-protein interactions were downloaded from the NPInter V2.0 database to form a heterogeneous network. Three different similarities of both lncRNAs and proteins were calculated, subsequently. Afterward, the similarity kernel fusion (SKF) approach was utilized to integrate these similarities. Finally, the Laplacian regularized least squares (LapRLS) framework was used to build the prediction model.

Dataset Curations

NPInter is an integrated database of ncRNA interactions, which includes vast interactions between ncRNAs and biomolecules uncovered by various high-throughput sequencing approaches (Yuan et al., 2014). lncRNA-protein interactions collected in NPInter have been utilized as materials in numerous related studies. For a better comparison, we collected lncRNA-protein interactions from the NPInter V2.0 database according to the previous study (Zhang et al., 2018a). Ultimately, 4158 lncRNA-protein interactions including 990 lncRNAs and 27 proteins were obtained. Afterward, the sequences and expressions of lncRNAs and the sequences of proteins Were downloaded from the NONCODE database and the SUPERFAMILY database, separately (Fang et al., 2018; Pandurangan et al., 2019).

Similarities for lncRNAs and Proteins

This work is based on the assumption that similar lncRNAs tend to interact with similar proteins and vice versa. Hence, defining appropriate similarity is of great importance in predicting lncRNA-protein interactions. We employed three different pairwise similarities of lncRNAs, including the interaction similarity, the expression similarity, and the sequence similarity. We also applied three different similarities of proteins, including the interaction similarity, the statistical feature similarity, and the sequence similarity. With all these similarity definitions, we proposed to use the similarity kernel fusion strategy to establish a universal and comprehensive similarity kernel matrix to predict potential lncRNA-protein interactions.

The Interaction Profile Similarities

For the convenience of the reader, we first defined the adjacency matrix between lncRNAs and proteins. Let l_i (i = 1, 2, …, n) be the i-th lncRNA, and p_j (j = 1, 2, …, m) the j-th protein. The adjacency matrix A can be defined as follows:

\begin{array}{l} A = {a_{i, j}}_{n \times m} = {\begin{matrix} 1 & l_{i} interacts with p_{j}, \\ 0 & Otherwise \end{matrix} & (1) \end{array}

The interaction profile of the i-th lncRNA is the i-th row of matrix A, which can be noted as the A_i*, while the interaction profile of the j-th protein is the j-th column of matrix A, which can be noted as the A_j.

The interaction similarity between l_u and l_v can be defined as:

\begin{array}{l} s_{l, 0} (u, v) = exp (- γ_{l} {|| A_{u^{*}} - A_{v^{*}} ||}^{2}), & (2) \end{array}

where

\begin{array}{l} γ_{l} = n / \sum_{i = 1}^{n} {|| A_{i^{*}} ||}^{2}, & (3) \end{array}

and ||.|| is the 2-norm operator.

Similarly, the interaction similarity between p_u and p_v can be defined as:

\begin{array}{l} s_{p, 0} (u, v) = exp (- γ_{p} {|| A_{u} - A_{v} ||}^{2}), & (4) \end{array}

where.

\begin{array}{l} γ_{p} = m / \sum_{j = 1}^{m} {|| A_{j} ||}^{2} . & (5) \end{array}

lncRNA Expression Profile Similarity

The expression profiles of lncRNAs in 24 different tissues can be downloaded from the NONCODE database. The expression profile of the i-th lncRNA can be noted as e_i. The expression profile similarity is defined as follows:

\begin{array}{l} s_{l, 1} (u, v) = {\begin{matrix} \frac{1}{2} (1 + ρ_{u, v}) & u \neq v \\ 0 & u = v \end{matrix}, & (6) \end{array}

where ρ_u,v is the Pearson's correlation coefficient between e_u and e_v. It can be calculated as follows:

\begin{array}{l} ρ_{u, v} = \frac{cov (e_{u}, e_{v})}{σ (e_{u}) σ (e_{v})}, & (7) \end{array}

where cov() is the covariance, and σ is the standard deviation operator.

Protein Pairwise Sequence Alignment Similarity

Blast+ is a local alignment search tool, which was utilized to calculate the alignment score of proteins in this work (Camacho et al., 2009). We used blast+ to align p_u against p_v. The bit score in this alignment can be noted as b_u,v. The pairwise sequence alignment similarity can be defined as:

\begin{array}{l} s_{p, 1} (u, v) = {\begin{matrix} b_{u, v} / b_{u, u} & u \neq v \\ 0 & u = v \end{matrix} & (8) \end{array}

It worth noting that s_p,1 is not symmetric. Therefore, we have s_p,1(u, v) ≠ s_p,1(v, u).

Sequence Statistical Feature Similarity

RNA is composed of four types of ribonucleotide (A, G, C, U). According to the previous work, we calculated the percentage of these four nucleotides and 16 dinucleotides (AA, AG, AC, AU, …) to represent each lncRNA in a 20-D vector (Zhang et al., 2018a). We employed CTD (composition-transition-distribution) features (Li et al., 2006) in this work. Twenty different amino acids were divided into three groups, according to their hydrophobicity, normalized van der Waals volume, polarity, and polarizability. Each protein was represented as a 504-D vector. Linear neighborhood similarity (LNS), which is based on the hypothesis that each vector can be represented by their k-nearest neighbors, was adopted to compute the similarity between statistical features (Wang and Zhang, 2008; Deng et al., 2020) for lncRNA and proteins, respectively. The sequence statistical feature similarity between l_u and l_v can be noted as s_l,2(u, v), while the similarity between p_u and p_v can be noted as s_p,2(u, v).

Similarity Kernel Fusion

Three different lncRNA similarities (s_l_q q = 0, 1, 2) and three different protein similarities (s_p,q q = 0, 1, 2) were calculated in the above sections. Furthermore, the similarity kernel fusion (SKF) algorithm was utilized to integrate these similarities and obtain a more comprehensive similarity.

We take the similarities of lncRNA as an example. Firstly, we can normalize the three lncRNA similarities (s_l,q q = 0, 1, 2) as follows:

\begin{array}{l} θ_{l, q} (u, v) = \frac{s_{l, q} (u, v)}{\sum_{t = 1}^{n} s_{l, q} (t, v)}, & (9) \end{array}

where θ_l,q is the normalized similarity corresponding to s_l,q. The matrix composed by the normalized similarity is noted as:

\begin{array}{l} Θ_{l, q} = {θ_{l, q} (u, v)}_{n \times n} . & (10) \end{array}

Secondly, we created a neighbor-constrained normalization for each lncRNA similarity. Given l_u and s_l,q, we collected the k most similar lncRNA as a set N_l,q(u, k). The neighborhood constrained normalization of the s_l,q can be defined as follows:

\begin{array}{l} φ_{l, q} (u, v) = \frac{s_{l, q} (u, v) I_{l, q, k} (u, v)}{\sum_{t = 1}^{n} s_{l, q} (u, t) I_{l, q, k} (u, t)}, & (11) \end{array}

where

\begin{array}{l} I_{l, q, k} (u, v) = {\begin{matrix} 1 & l_{v} \in N_{l, q} (u, k) \\ 0 & l_{v} \notin N_{l, q} (u, k) \end{matrix} & (12) \end{array}

The matrix composed by the neighborhood constrained normalization is noted as:

\begin{array}{l} Φ_{l, q} = {φ_{l, q} (u, v)}_{n \times n} . & (13) \end{array}

The three similarity matrices were integrated using the following iterative process:

\begin{array}{l} Θ_{l, q} (λ + 1) = \frac{1}{2} α (Φ_{l, q} \sum_{r \neq q} Θ_{l, r} (λ) Φ_{l, q}^{T}) \\ + \frac{1}{2} (1 - α) \sum_{r \neq q} Θ_{l, r} (0), & (14) \end{array}

where α is a weight coefficient between 0 and 1, T is the transpose operator in matrix algebra, λ is the iterative round parameter, and

\begin{array}{l} Θ_{l, r} (0) = Θ_{l, r} . & (15) \end{array}

After z rounds of the iterative process, we obtained the final integration similarity matrix as

\begin{array}{l} Θ_{l} = \frac{1}{3} (Θ_{l, 0} (z) + Θ_{l, 1} (z) + Θ_{l, 2} (z)) & (16) \end{array}

Although more information is retained in the similarity fusion, more noise is apparent simultaneously. By considering the k most similar lncRNAs of each lncRNA, we defined an indicator function as follows:

\begin{array}{l} w_{l, k} (u, v) = {\begin{matrix} 1 & I_{l, 0, k} (u, v) = I_{l, 1, k} (u, v) = I_{l, 2, k} (u, v) = 1 \\ 0 & I_{l, 0, k} (u, v) = I_{l, 1, k} (u, v) = I_{l, 2, k} (u, v) = 0 \\ 0.5 & o t h e r w i s e \end{matrix} & (17) \end{array}

The final adjusted lncRNA similarity is defined as follows:

\begin{array}{l} S_{l, k} = {θ_{l} (u, v) w_{l, k} (u, v)}_{n \times n}, & (18) \end{array}

where θ_l(u, v) is the element in the u-th row and the v-th column of the matrix Θ_l.

By applying protein similarities, and using Eqs. (9)–(18), we obtained the adjusted protein similarity matrix S_p,k. The value of k in computing protein similarities is not necessarily the same as that of the lncRNAs.

Laplacian Regularized Least Squares

In this work, Laplacian regularized least squares (LapRLS) were utilized to construct the prediction model. Since we obtained the lncRNA similarity matrix and the protein similarity matrix, we could estimate the lncRNA-protein interactions from either the lncRNA similarity matrix or the protein similarity matrix. Without losing generality, we took the lncRNA similarity matrix as an example.

Let L_l be the Laplacian normalized similarity matrix, which can be defined as follows:

\begin{array}{l} L_{l} = D_{l}^{- 1 / 2} (D - S_{l, k}) D_{l}^{- 1 / 2}, & (19) \end{array}

where D is the diagonal matrix of the matrix S_l,k.

We then found the estimation of the adjacency matrix by minimizing the following objective function:

\begin{array}{l} min_{F_{l}} || A - F_{l} {||}_{F}^{2} + β_{l} || F_{l}^{T} L_{l} F_{l} {||}_{F}^{2}, & (20) \end{array}

where A is the adjacency matrix, F_l is the prediction matrix from lncRNA similarities, β_l is a weighting parameter, and ||.||_F is the F-norm operator.

We obtained the prediction matrix from lncRNA similarities by calculating the derivative of the objective function as follows:

\begin{array}{l} F_{l} = S_{l, k} {(S_{l, k} + β_{l} L_{l} S_{l, k})}^{- 1} A & (21) \end{array}

Similarly, we applied Eqs. (19)–(21) on protein similarities to obtain the prediction matrix from protein similarities, as follows:

\begin{array}{l} F_{p} = S_{p, k} {(S_{p, k} + β_{p} L_{p} S_{p, k})}^{- 1} A . & (22) \end{array}

Finally, we integrated the above two prediction matrixes to obtain our final prediction matrix, as follows:

\begin{array}{l} F = δ F_{l} + (1 - δ) F_{p}, & (23) \end{array}

where δ ε (0, 1) is a weighting coefficient.

Performance Estimation Protocol

The prediction performances of the LPI-SKF method was estimated using 5-fold cross-validations. We applied the AUROC and the AUPR as the main performance indicators. We also applied three performance statistics, including precision (pre), recall (rec), and the F1-score (f), which can be calculated as follows:

\begin{array}{l} p r e = \frac{T P}{T P + F P}, & (24) \end{array}

\begin{array}{l} r e c = \frac{T P}{T P + F N}, and & (25) \end{array}

\begin{array}{l} f = \frac{2 p r e \cdot r e c}{p r e + r e c}, & (26) \end{array}

where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.

For predicting potential lncRNA-protein interactions, all interactions in the adjacency matrix were divided randomly into five parts. Four parts were utilized as the training dataset, while the remaining part was used as the testing dataset. Through five rounds of cross-validation, we obtained the interacting score of every interaction.

As for predicting potential proteins for new lncRNAs, all lncRNAs were split into five groups. Four groups were treated as the training set and the remaining one as the testing set, which was the same as the prediction for new proteins.

Parameter Calibrations

The primary parts in LPI-SKF are SKF and LapRLS. There are three parameters in the SKF, which are the iteration times z, the number of neighbors k, and the weighting coefficient α. Since SKF was adopted to integrate the lncRNA similarities and the protein similarities separately, we calculated the AUC from lncRNA similarities and protein similarities, respectively to find the optimal α. Since the value range of α is between 0 and 1, we took α within a range of 0.1–0.9 with the step of 0.1 for calculation convenience. The prediction performances were estimated from lncRNAs and proteins separately. As in Table 1, the optimal α for lncRNAs was 0.9, while it was 0.8 for proteins.

TABLE 1

Table 1. AUC of lncRNA space and protein space with different weighting coefficient α.

Considering the number of lncRNAs and proteins in our work (990 lncRNAs and 27 proteins), the number of neighbors k for lncRNA was selected from {33, 99, 150, 300, 600, 900}, and the number of neighbors k for proteins from {3, 6, 9, 15, 20, 25}. To reduce calculating time and to test as much as possible, the iteration times z was taken from 5 to 30 with a step of 5. As in Figure 2, the optimal number of neighbors k for lncRNA was 99, and 3 for proteins. The optimal iteration times z was set to 5 for lncRNAs and proteins.

FIGURE 2

Figure 2. Performance of the combination of different iteration times T and neighbor number k in the SKF method. (A) The AUC of different parameters in the lncRNA subspace. (B) The AUC of different parameters in the protein subspace.

The weighting parameter β_l and β_p are the most important regularization terms in the LapRLS, which can influence the performance directly. In this work, we made β_l equal to β_p for convenience. To obtain the optimal performance, we searched β_l and β_p both from 2⁻¹⁰ to 2⁻¹ according to a previous work (Jiang et al., 2018). Since the amount of lncRNAs is much more than proteins, we made δ range from 0.1 to 0.9 with a step of 0.1. As in Figure 3, we chose β_l = β_p = 2⁻³, and δ = 0.8.

FIGURE 3

Figure 3. The AUC of the combination of different proportional coefficient β and weighting factor δ in the LapRLS framework.

Results

Comparison With Single Similarity

Different types of similarities between both lncRNAs and proteins have been utilized in this work. To demonstrate the benefit of similarity integration, we tested the prediction performance of every single similarity. The results are illustrated in Figure 4. Considering the different numbers of lncRNAs and proteins, performance using lncRNA similarities was better than protein similarities.

FIGURE 4

Figure 4. Performance of the single similarity. Six different similarities (three lncRNA similarities and three protein similarities) were utilized in this work, each of them corresponds to a prediction model.

Comparison With Other Fusion Methods

Similarity kernel fusion (SKF) was applied in our study to integrate different similarities, which could integrate similarity information in different aspects and reduce noise. In this part, we compared SKF with another two similarity fusion methods, similarity network fusion (SNF) (Wang et al., 2014) and average kernel fusion (AVG). The results are shown in Figure 5. The results indicated that SKF outperformed the other two methods.

FIGURE 5

Figure 5. Performance of different similarity fusion methods. (A) ROC curves. (B) PR curves.

Comparison With State-Of-the-Art Methods

Prediction for Uncovered Interactions

In our study, we compared LPI-SKF with two popular algorithms, RWR (random walk with restart) and CF (collaborative filtering), and three other methods, including LPIHN (Li et al., 2015), LPBNI (Ge et al., 2016), and LPI-IBNRA (Xie et al., 2019). We built six prediction models based on the same benchmarking dataset. Subsequently, the 5-fold cross-validation (5-fold CV) was applied for the comparison. The result is shown in Figure 6. Meanwhile, we selected the threshold value of six models based on the optimal F1-score. Furthermore, the recall, precision, and F1-score under the threshold value were computed to compare these models in other aspects. For a better comparison, the results of the six models are collected in Table 2. From the table, we can see that both the AUC and AUPR of LPI-SKF were higher than the other models.

FIGURE 6

Figure 6. Performance of different methods on the same benchmark dataset. (A) ROC curves (B) PR curves.

TABLE 2

Table 2. Comparison with state-of-the-art prediction methods.

Specifically, for the AUC, LPI-SKF received an AUC of 0.909, which increased by 10.05, 8.73, 6.69, 8.47, and 4.72%, respectively, compared with RWR's 0.826, CF's 0.836, LPBNI's 0.852, LPIHN's 0.866, and LPI-IBNRA's 0.864. As for another important index: AUPR, LPI-SKF obtained an AUPR of 0.685, which was higher than all other models, RWR's 0.581, CF's 0.542, LPBNI's 0.625, LPIHN's 0.548, and LPI-IBNRA's 0.684. Meanwhile, the best F1-score of LPI-SKF was also higher than the other models. All these evaluation indexes demonstrate that LPI-SKF outperformed the other state-of-the-art methods.

Prediction for Novel lncRNAs/Proteins

While our model can predict potential interacting lncRNAs/proteins for novel proteins/lncRNAs, we also made a comparison for the prediction of new lncRNAs/proteins. As few methods could predict interacting lncRNAs/proteins for novel proteins/lncRNAs, SFPEL-LPI (Zhang et al., 2018b) was selected for the comparison. Subsequently, we evaluated the performance of the two models in new lncRNAs and new proteins prediction, respectively. The result is shown in Figure 7. For a better comparison, the AUC, AUPR, recall, precision, and F1-score of the two models are shown in Table 3. LPI-SKF obtained an AUC of 0.844 and 0.835 in the prediction of new lncRNAs and proteins, respectively. Comparing with SFPEL, LPI-SKF achieved an AUC improvement of 0.016 and 0.229 in new lncRNAs and proteins prediction, separately.

FIGURE 7

Figure 7. Comparison of the prediction for new lncRNAs/proteins on the same benchmark dataset. (A) ROC curves. (B) PR curves.

TABLE 3

Table 3. Comparison of the prediction for new lncRNAs/proteins.

Case Studies

To evaluate the prediction effect of LPI-SKF more accurately, we tested the 20 top-ranked interactions in our model based on the NPInter V2.0 database. The result is shown in Table 4. Nineteen of these interactions have been verified in the NPInter V2.0 database, which demonstrates that LPI-SKF performed reputably in actual interaction prediction. Meanwhile, the amount of correctly predicted interactions of the 50 top-ranked interactions, the 100 top-ranked interactions, and the 500 top-ranked interactions are 47, 92, and 458, respectively.

TABLE 4

Table 4. 20 top-ranked predicted interactions in this work.

Conclusion

This paper proposed a novel model, named LPI-SKF (lncRNA-protein interactions prediction based on the similarity kernel fusion), to predict potential lncRNA-protein interactions. Serval similarities of both lncRNAs and proteins were integrated to obtain a comprehensive similarity matrix by the SKF method. Furthermore, the LapRLS framework was applied to build the prediction model. Finally, LPI-SKF obtained an AUC of 0.909 and an AUPR of 0.685 in the 5-fold CV framework, which demonstrated that LPI-SKF can infer uncovered lncRNA-protein interactions accurately.

To evaluate the performance of LPI-SKF, serval state-of-the-art methods were compared to LPI-SKF on the same benchmarking dataset. Finally, LPI-SKF received an AUC of 0.909 and an AUPR of 0.685 in the 5-fold cross-validation framework, both higher than the other models. More importantly, LPI-SKF could also predict potential interacting proteins/lncRNAs for novel lncRNAs/proteins precisely. For a better comparison, we also compared LPI-SKF with another model, SFPEL, on the same database and the same random seed. The result showed that LPI-SKF performed much better both in the prediction for new lncRNAs and new proteins.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found at: https://github.com/zyk2118216069/LPI-SKF.

Author Contributions

Y-KZ curated the dataset, designed, and implemented the algorithm, performed the experiments, and collected the results. JH, Z-AS, and W-YZ helped in collecting the data, and calibrating the parameters of the algorithm. P-FD directed the whole study, conceptualized the algorithm, analyzed the results, and wrote the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC 61872268); the National Key R&D Program of China (2018YFC0910405); and the Open Project Funding of CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences (CASNDST201705).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Bartolomei, M. S., Zemel, S., and Tilghman, S. M. (1991). Parental imprinting of the mouse H19 gene. Nature 351, 153–155. doi: 10.1038/351153a0

PubMed Abstract | CrossRef Full Text | Google Scholar

Beltran, M., Puig, I., Peña, C., García, J. M., Alvarez, A. B., Peña, R., et al. (2008). A natural antisense transcript regulates Zeb2/Sip1 gene expression during snail1-induced epithelial-mesenchymal transition. Genes Dev. 22, 756–769. doi: 10.1101/gad.455708

PubMed Abstract | CrossRef Full Text | Google Scholar

Brannan, C. I., Dees, E. C., Ingram, R. S., and Tilghman, S. M. (1990). The product of the H19 gene may function as an RNA. Mol. Cell. Biol. 10, 28–36. doi: 10.1128/MCB.10.1.28

PubMed Abstract | CrossRef Full Text | Google Scholar

Brockdorff, N., Ashworth, A., Kay, G. F., McCabe, V. M., Norris, D. P., Cooper, P. J., et al. (1992). The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell 71, 515–526. doi: 10.1016/0092-8674(92)90519-I

PubMed Abstract | CrossRef Full Text | Google Scholar

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., et al. (2009). BLAST+: architecture and applications. BMC Bioinform. 10:421. doi: 10.1186/1471-2105-10-421

PubMed Abstract | CrossRef Full Text | Google Scholar

Clark, M. B., and Mattick, J. S. (2011). Long noncoding RNAs in cell biology. Semin. Cell Dev. Biol. 22, 366–376. doi: 10.1016/j.semcdb.2011.01.001

PubMed Abstract | CrossRef Full Text | Google Scholar

Comings, D. E. (1972). “The structure and function of chromatin,” in Advances in Human Genetics, eds H. Harris, and K. Hirschhorn (Boston, MA: Springer), 237–431. doi: 10.1007/978-1-4757-4429-3_5

CrossRef Full Text | Google Scholar

Deng, L., Wang, J., Xiao, Y., Wang, Z., and Liu, H. (2018). Accurate prediction of protein-lncRNA interactions by diffusion and HeteSim features across heterogeneous network. BMC Bioinform. 19:370. doi: 10.1186/s12859-018-2390-0

PubMed Abstract | CrossRef Full Text | Google Scholar

Deng, Y., Xu, X., Qiu, Y., Xia, J., Zhang, W., and Liu, S. (2020). A multimodal deep learning framework for predicting drug-drug interaction events. Bioinformatics 36, 4316–4322. doi: 10.1093/bioinformatics/btaa501

PubMed Abstract | CrossRef Full Text | Google Scholar

Fang, S., Zhang, L., Guo, J., Niu, Y., Wu, Y., Li, H., et al. (2018). NONCODEV5: a comprehensive annotation database for long non-coding RNAs. Nucleic Acids Res. 46, D308–D314. doi: 10.1093/nar/gkx1107

PubMed Abstract | CrossRef Full Text | Google Scholar

Ge, M., Li, A., and Wang, M. (2016). A bipartite network-based method for prediction of long non-coding RNA–protein Interactions. Genomics Proteomics Bioinform. 14, 62–71. doi: 10.1016/j.gpb.2016.01.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Gutschner, T., Hämmerle, M., Eissmann, M., Hsu, J., Kim, Y., Hung, G., et al. (2013). The noncoding RNA MALAT1 is a critical regulator of the metastasis phenotype of lung cancer cells. Cancer Res. 73, 1180–1189. doi: 10.1158/0008-5472.CAN-12-2850

PubMed Abstract | CrossRef Full Text | Google Scholar

Hentze, M. W., Castello, A., Schwarzl, T., and Preiss, T. (2018). A brave new world of RNA-binding proteins. Nat. Rev. Mol. Cell Biol. 19, 327–341. doi: 10.1038/nrm.2017.130

PubMed Abstract | CrossRef Full Text | Google Scholar

Hu, H., Zhang, L., Ai, H., Zhang, H., Fan, Y., Zhao, Q., et al. (2018). HLPI-ensemble: prediction of human lncRNA-protein interactions based on ensemble strategy. RNA Biol. 15, 797–806. doi: 10.1080/15476286.2018.1457935

PubMed Abstract | CrossRef Full Text | Google Scholar

International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. doi: 10.1038/nature03001

CrossRef Full Text | Google Scholar

Ji, Q., Zhang, L., Liu, X., Zhou, L., Wang, W., Han, Z., et al. (2014). Long non-coding RNA MALAT1 promotes tumour growth and metastasis in colorectal cancer through binding to SFPQ and releasing oncogene PTBP2 from SFPQ/PTBP2 complex. Br. J. Cancer 111, 736–748. doi: 10.1038/bjc.2014.383

PubMed Abstract | CrossRef Full Text | Google Scholar

Jiang, L., Ding, Y., Tang, J., and Guo, F. (2018). MDA-SKF: similarity kernel fusion for accurately discovering miRNA-disease association. Front. Genet. 9:618. doi: 10.3389/fgene.2018.00618

PubMed Abstract | CrossRef Full Text | Google Scholar

Kung, J. T. Y., Colognori, D., and Lee, J. T. (2013). Long noncoding RNAs: past, present, and future. Genetics 193, 651–669. doi: 10.1534/genetics.112.146704

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, A., Ge, M., Zhang, Y., Peng, C., and Wang, M. (2015). Predicting long noncoding RNA and protein interactions using heterogeneous network model. Biomed. Res. Int. 2015:671950. doi: 10.1155/2015/671950

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, Z. R., Lin, H. H., Han, L. Y., Jiang, L., Chen, X., and Chen, Y. Z. (2006). PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucl. Acids Res. 34, W32–W37. doi: 10.1093/nar/gkl305

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, H., Ren, G., Hu, H., Zhang, L., Ai, H., Zhang, W., et al. (2017). LPI-NRLMF: lncRNA-protein interaction prediction by neighborhood regularized logistic matrix factorization. Oncotarget 8, 103975–103984. doi: 10.18632/oncotarget.21934

PubMed Abstract | CrossRef Full Text | Google Scholar

Lu, Q., Ren, S., Lu, M., Zhang, Y., Zhu, D., Zhang, X., et al. (2013). Computational prediction of associations between long non-coding RNAs and proteins. BMC Genomics 14:651. doi: 10.1186/1471-2164-14-651

PubMed Abstract | CrossRef Full Text | Google Scholar

Ma, L., Bajic, V. B., and Zhang, Z. (2013). On the classification of long non-coding RNAs. RNA Biol. 10, 924–933. doi: 10.4161/rna.24604

CrossRef Full Text | Google Scholar

Ma, Y., He, T., and Jiang, X. (2019). Projection-based neighborhood non-negative matrix factorization for lncRNA-protein interaction prediction. Front Genet. 10:1148. doi: 10.3389/fgene.2019.01148

PubMed Abstract | CrossRef Full Text | Google Scholar

Martianov, I., Ramadass, A., Serra Barros, A., Chow, N., and Akoulitchev, A. (2007). Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature 445, 666–670. doi: 10.1038/nature05519

CrossRef Full Text | Google Scholar

Meissner, M., Dechat, T., Gerner, C., Grimm, R., Foisner, R., and Sauermann, G. (2000). Differential nuclear localization and nuclear matrix association of the splicing factors PSF and PTB. J. Cell. Biochem. 76, 559–566. doi: 10.1002/(SICI)1097-4644(20000315)76:4<559::AID-JCB4>3.0.CO;2-U

PubMed Abstract | CrossRef Full Text | Google Scholar

Mercer, T. R., Dinger, M. E., and Mattick, J. S. (2009). Long non-coding RNAs: insights into functions. Nat. Rev. Genet. 10, 155–159. doi: 10.1038/nrg2521

PubMed Abstract | CrossRef Full Text | Google Scholar

Muppirala, U. K., Honavar, V. G., and Dobbs, D. (2011). Predicting RNA-protein interactions using only sequence information. BMC Bioinform. 12:489. doi: 10.1186/1471-2105-12-489

PubMed Abstract | CrossRef Full Text | Google Scholar

Ohno, S., and Smith, H. H. (1972). “So much “junk” DNA in our genome,” in Evolution of Genetic Systems, ed H. H. Smith (New York, NY: Gordon and Breach), 366.

PubMed Abstract | Google Scholar

Pandurangan, A. P., Stahlhacke, J., Oates, M. E., Smithers, B., and Gough, J. (2019). The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. Nucleic Acids Res. 47, D490–D494. doi: 10.1093/nar/gky1130

PubMed Abstract | CrossRef Full Text | Google Scholar

Reeves, M. B., Davies, A. A., McSharry, B. P., Wilkinson, G. W., and Sinclair, J. H. (2007). Complex I binding by a virally encoded RNA regulates mitochondria-induced cell death. Science 316, 1345–1348. doi: 10.1126/science.1142984

PubMed Abstract | CrossRef Full Text | Google Scholar

Rintala-Maki, N. D., and Sutherland, L. C. (2009). Identification and characterisation of a novel antisense non-coding RNA from the RBM5 gene locus. Gene 445, 7–16. doi: 10.1016/j.gene.2009.06.009

PubMed Abstract | CrossRef Full Text | Google Scholar

Shen, C., Ding, Y., Tang, J., and Guo, F. (2019). Multivariate information fusion with fast kernel learning to kernel ridge regression in predicting LncRNA-protein interactions. Front. Genet. 9:716. doi: 10.3389/fgene.2018.00716

PubMed Abstract | CrossRef Full Text | Google Scholar

Suresh, V., Liu, L., Adjeroh, D., and Zhou, X. (2015). RPI-Pred: predicting ncRNA-protein interaction using sequence and structural information. Nucleic Acids Res. 43, 1370–1379. doi: 10.1093/nar/gkv020

PubMed Abstract | CrossRef Full Text | Google Scholar

Tseng, J.-J., Hsieh, Y.-T., Hsu, S.-L., and Chou, M.-M. (2009). Metastasis associated lung adenocarcinoma transcript 1 is up-regulated in placenta previa increta/percreta and strongly associated with trophoblast-like cell invasion in vitro. Mol. Hum. Reprod. 15, 725–731. doi: 10.1093/molehr/gap071

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, B., Mezlini, A. M., Demir, F., Fiume, M., Tu, Z., Brudno, M., et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333–337. doi: 10.1038/nmeth.2810

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, F., and Zhang, C. (2008). Label propagation through linear neighborhoods. IEEE Trans. Knowl. Data Eng. 20, 55–67. doi: 10.1109/TKDE.2007.190672

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, Y., Chen, X., Liu, Z.-P., Huang, Q., Wang, Y., Xu, D., et al. (2013). De novo prediction of RNA-protein interactions from sequence information. Mol. Biosyst. 9, 133–142. doi: 10.1039/C2MB25292A

PubMed Abstract | CrossRef Full Text | Google Scholar

Wekesa, J. S., Luan, Y., Chen, M., and Meng, J. (2019). A hybrid prediction method for plant lncRNA-protein interaction. Cells 8:521. doi: 10.3390/cells8060521

PubMed Abstract | CrossRef Full Text | Google Scholar

Xiao, Y., Zhang, J., and Deng, L. (2017). Prediction of lncRNA-protein interactions using HeteSim scores based on heterogeneous networks. Sci. Rep. 7:3664. doi: 10.1038/s41598-017-03986-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Xie, G., Wu, C., Sun, Y., Fan, Z., and Liu, J. (2019). LPI-IBNRA: Long non-coding RNA-protein interaction prediction based on improved bipartite network recommender algorithm. Front. Genet. 10:343. doi: 10.3389/fgene.2019.00343

PubMed Abstract | CrossRef Full Text | Google Scholar

Yuan, J., Wu, W., Xie, C., Zhao, G., Zhao, Y., and Chen, R. (2014). NPInter v2.0: an updated database of ncRNA interactions. Nucleic Acids Res. 42, D104–108. doi: 10.1093/nar/gkt1057

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, H., Ming, Z., Fan, C., Zhao, Q., and Liu, H. (2020). A path-based computational model for long non-coding RNA-protein interaction prediction. Genomics 112, 1754–1760. doi: 10.1016/j.ygeno.2019.09.018

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, T., Wang, M., Xi, J., and Li, A. (2020). LPGNMF: predicting long non-coding rna and protein interaction using graph regularized nonnegative matrix factorization. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 189–197. doi: 10.1109/TCBB.2018.2861009

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, W., Qu, Q., Zhang, Y., and Wang, W. (2018a). The linear neighborhood propagation method for predicting long non-coding RNA–protein interactions. Neurocomputing 273, 526–534. doi: 10.1016/j.neucom.2017.07.065

CrossRef Full Text | Google Scholar

Zhang, W., Yue, X., Tang, G., Wu, W., Huang, F., and Zhang, X. (2018b). SFPEL-LPI: sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions. PLoS Comput. Biol. 14:e1006616. doi: 10.1371/journal.pcbi.1006616

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, Z., Zhang, J., Fan, C., Tang, Y., and Deng, L. (2019). KATZLGO: large-scale prediction of LncRNA functions by using the KATZ measure based on multiple networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 407–416. doi: 10.1109/TCBB.2017.2704587

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhao, Q., Zhang, Y., Hu, H., Ren, G., Zhang, W., and Liu, H. (2018). IRWNRLPI: integrating random walk and neighborhood regularized logistic matrix factorization for lncRNA-protein interaction prediction. Front. Genet. 9:239. doi: 10.3389/fgene.2018.00239

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhu, R., Li, G., Liu, J.-X., Dai, L.-Y., and Guo, Y. (2019). ACCBN: ant-colony-clustering-based bipartite network method for predicting long non-coding RNA-protein interactions. BMC Bioinform. 20:16. doi: 10.1186/s12859-018-2586-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: LncRNA-proteins interactions, LncRNA similarities, protein similarities, similarity kernel fusion, laplacian regularized least squares

Citation: Zhou Y-K, Hu J, Shen Z-A, Zhang W-Y and Du P-F (2020) LPI-SKF: Predicting lncRNA-Protein Interactions Using Similarity Kernel Fusions. Front. Genet. 11:615144. doi: 10.3389/fgene.2020.615144

Received: 08 October 2020; Accepted: 16 November 2020;
Published: 09 December 2020.

Edited by:

Wen Zhang, Huazhong Agricultural University, China

Reviewed by:

Shining Ma, Stanford University, United States
Xiujuan Lei, Shaanxi Normal University, China
Hao Zhang, Jilin University, China

Copyright © 2020 Zhou, Hu, Shen, Zhang and Du. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Pu-Feng Du, cGR1QHRqdS5lZHUuY24=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.