Protein Function Prediction Based on PPI Networks: Network Reconstruction vs Edge Enrichment

Zhou, Jiaogen; Xiong, Wei; Wang, Yang; Guan, Jihong

doi:10.3389/fgene.2021.758131

ORIGINAL RESEARCH article

Front. Genet., 14 December 2021

Sec. Statistical Genetics and Methodology

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.758131

This article is part of the Research Topic: System Biology Methods and Tools for Integrating Omics Data, Volume II

Jiaogen Zhou¹†

Wei Xiong²†

Yang Wang³

Jihong Guan³*

¹Jiangsu Provincial Engineering Research Center for Intelligent Monitoring and Ecological Management of Pond and Reservoir Water Environment, Huaiyin Normal University, Huai'an, China
²Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai, China
³Department of Computer Science and Technology, Tongji University, Shanghai, China

Over the past decades, massive amounts of protein-protein interaction (PPI) data have accumulated thanks to advances in high-throughput technologies, but data quality issues (noise and incompleteness) in PPI data still limit the accuracy of protein function prediction based on PPI networks. Two main strategies, network reconstruction and edge enrichment, have each been reported in numerous studies to boost prediction performance, yet comparative studies of the performance differences between them are still lacking. Motivated by this gap, this study first uses three kinds of protein similarity metrics (local, global and sequence) for network reconstruction and edge enrichment in PPI networks, and then evaluates the performance differences among reconstructed networks, enriched networks and the original networks on two real PPI datasets. The experimental results demonstrate that edge enrichment works better than both network reconstruction and the original networks. Moreover, for the edge enrichment of PPI networks, sequence similarity outperforms both local and global similarity. In summary, our study can help biologists select suitable pre-processing schemes and achieve better protein function prediction for PPI networks.

1 Introduction

Over the past decades, massive amounts of un-annotated protein sequence data have accumulated with the advancement of high-throughput biological technologies. Because determining protein function experimentally is costly and time-consuming, the proportion of annotated proteins remains relatively low (Sharan et al., 2007; Barrell et al., 2009). Increasing efforts have therefore been made to predict protein functions computationally.

As the best-known and earliest approach to protein function prediction, homology-based prediction gave rise to a series of methods based on protein sequence or structural similarity (Sleator and Walsh, 2010). At the same time, the emergence of publicly available protein resources, such as FATCAT (Ye and Godzik, 2004), PAST (Täubig et al., 2006) and PROCAT (Wallace et al., 1996), has further helped to improve the effectiveness of protein function prediction. However, sequence similarity scores between target and source protein sequences are often low (Ofran et al., 2005), which significantly limits the applicability of homology-based prediction methods.

With the increasing amount of measured protein-protein interaction (PPI) data, more and more protein function prediction methods based on PPI networks have been proposed, and they generally outperform the homology-based methods described above. In PPI networks, proteins and protein-protein interactions are represented by nodes and edges, respectively (Sharan et al., 2007; Chen et al., 2020; Wu et al., 2020; Waiho et al., 2021). To date, numerous algorithms have been used for protein function prediction based on PPI networks, such as edge-betweenness clustering (Dunn et al., 2005), graphlet-based edge clustering (Solava et al., 2012), clique percolation (Adamcsek et al., 2006), GRAAL (Kuchaiev et al., 2010), a hybrid-property based method (Hu et al., 2011), and IsoRank (Singh et al., 2008). Moreover, advanced machine learning and deep learning techniques have also been applied, including collective classification (Xiong et al., 2013; Wu et al., 2014), active learning (Xiong et al., 2014), DeepInteract (Sunil et al., 2017), ConvsPPIS (Zhu et al., 2020), PhosIDN (Yang et al., 2021) and WinBinVec (Abdollahi et al., 2021).

The above methods mainly use existing PPI data. However, current PPI data, mainly generated by high-throughput or TAP-MS techniques (Berggard et al., 2007), are often noisy and incomplete, which unavoidably degrades prediction performance. Two main strategies, network reconstruction and edge enrichment, have been proposed to boost prediction performance, and each has been realized in different ways. For example, Bogdanov and Singh (2010) presented a network reconstruction approach that extracts functional neighborhood features using random walk with restart. Chua et al. (2007) used weighting strategies to enrich PPI networks, and adopted a local prediction method to predict the functions of un-annotated proteins. Xiong et al. (2013) applied collective classification to PPI networks with enriched edges to predict protein functions.

Although both types of approaches achieve promising performance improvements, comparative studies of the performance differences between network reconstruction and edge enrichment are still lacking. We still do not know which one performs better or, more specifically, which one should be applied in which situation. Motivated by this question, we conduct a comprehensive comparison of the two network transformations, network reconstruction and edge enrichment, for boosting the performance of PPI network-based protein functional annotation. Concretely, we first use three different kinds of protein similarity metrics for network reconstruction and edge enrichment of PPI networks, and then evaluate the performance differences between the two kinds of transformed networks (reconstructed and enriched) and the original networks on two real PPI datasets. The experimental results demonstrate that edge enrichment works better than both network reconstruction and the original networks. Moreover, for the edge enrichment of PPI networks, sequence similarity outperforms both local and global similarity. The details are presented in the following sections.

2 Materials and Methods

2.1 Similarity Metrics

As pointed out above, the noise and incompleteness of PPI network data adversely affect the performance of protein functional annotation. Network reconstruction and edge enrichment are the major approaches to improving PPI data quality. In this work, we carry out a comparative study of these two approaches by reconstructing and enriching the original networks using various protein similarity metrics, including sequence similarity, local similarity and global similarity. In what follows, we describe and discuss these similarity measures in detail.

2.1.1 Protein Sequence Similarity

The BLAST method (Altschul et al., 1997) is used to measure the sequence similarity between any two proteins in this study. The similarity of a given protein V_x to the other proteins is defined as

S (V_{x}) = [S_{x, 1}, S_{x, 2}, \dots, S_{x, i}, \dots, S_{x, n}] (1)

where S_x,i is the similarity score between the pair of proteins V_x and V_i, and n is the number of proteins. Self-similarity is ignored, so S_x,i = 0 when x = i.

2.1.2 Local Similarity Indices

We consider three kinds of local similarity indices, including Common Neighbors (CN), Jaccard Index and Functional Similarity (FS).

Common Neighbors. Given nodes u and v, their neighboring sets are N_u and N_v, respectively. The CN is defined as the neighborhood overlap of the nodes (Newman, 2001). The more identical neighbors two nodes have, the higher the CN value is. The measure of CN is as follows:

S_{CN} (u, v) = |N_{u} \cap N_{v}| (2)

Jaccard Index. Given nodes u and v and their neighboring sets N_u and N_v, the Jaccard index measures the similarity between the two sets as the size of their intersection relative to the size of their union:

S_{Jaccard} (u, v) = \frac{|N_{u} \cap N_{v}|}{|N_{u} \cup N_{v}|} (3)

Functional Similarity (FS). The FS index was first used to measure the similarity of a pair of proteins in a PPI network by Chua et al. (2006), and it is defined as follows:

S_{FS} (u, v) = \frac{2 |N_{u} \cap N_{v}|}{|N_{u} - N_{v}| + 2 |N_{u} \cap N_{v}| + λ_{u, v}} \times \frac{2 |N_{u} \cap N_{v}|}{|N_{v} - N_{u}| + 2 |N_{u} \cap N_{v}| + λ_{v, u}} (4)

where $λ_{u, v} = \max (0, n_{avg} - (|N_{u} - N_{v}| + |N_{u} \cap N_{v}|))$ and n_avg is the average number of neighbors that each node has in the network. Since $|N_{u} - N_{v}| + |N_{u} \cap N_{v}| = |N_{u}|$, the factor λ_u,v is positive only when u has fewer than n_avg neighbors, so the similarity weight of a protein pair is penalized when either protein has too few neighbors. In a weighted PPI network, the edge weights denote the confidence of the interaction between a pair of proteins, so the FS index can be modified to take the confidence of each interaction into account. Writing r_u,w for the confidence of the interaction between u and w, the extended FS index for weighted PPI networks, named FS.R, is defined as follows:

\begin{aligned} S_{F S . R} (u, v) = \frac{2 \sum_{w \in (N_{u} \cap N_{v})} r_{u, w} r_{v, w}}{\sum_{w \in N_{u}} r_{u, w} + \sum_{w \in (N_{u} \cap N_{v})} r_{u, w} r_{v, w} + λ_{u, v}} \times \\ \frac{2 \sum_{w \in (N_{u} \cap N_{v})} r_{u, w} r_{v, w}}{\sum_{w \in N_{v}} r_{v, w} + \sum_{w \in (N_{u} \cap N_{v})} r_{u, w} r_{v, w} + λ_{v, u}} . \end{aligned} (5)
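The three unweighted local indices can be illustrated on a toy network. The following Python sketch is not the authors' implementation. It assumes an undirected, unweighted network stored as a dictionary of neighbor sets, takes n_avg to be the average degree, and reads the penalty term as λ_{u,v} = max(0, n_avg − (|N_u − N_v| + |N_u ∩ N_v|)). The function names and the toy graph are hypothetical.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node -> set of neighbors (undirected)


def common_neighbors(g: Graph, u: str, v: str) -> int:
    """Eq. 2: size of the neighborhood overlap."""
    return len(g[u] & g[v])


def jaccard(g: Graph, u: str, v: str) -> float:
    """Eq. 3: overlap divided by the size of the union."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0


def fs_index(g: Graph, u: str, v: str) -> float:
    """Eq. 4, with n_avg assumed to be the average degree."""
    n_avg = sum(len(nbrs) for nbrs in g.values()) / len(g)
    common = len(g[u] & g[v])

    def factor(a: str, b: str) -> float:
        only_a = len(g[a] - g[b])
        lam = max(0.0, n_avg - (only_a + common))  # penalty for few neighbors
        denom = only_a + 2 * common + lam
        return 2 * common / denom if denom else 0.0

    return factor(u, v) * factor(v, u)


# Hypothetical toy network: A and B share the neighbors C and D.
toy = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"B"},
}
print(common_neighbors(toy, "A", "B"), jaccard(toy, "A", "B"), fs_index(toy, "A", "B"))
```

On this toy network n_avg = 2, both penalty terms vanish, and the FS score of (A, B) is the product of the two fractions 4/4 and 4/5.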

2.1.3 Global Similarity Indices

Two global similarity indices are considered in this paper: the Katz index and random walk with restart.

Katz Index. This index, as described by Lü and Zhou (2011), sums over the entire set of paths between two nodes and damps each path exponentially by its length, so that shorter paths receive more weight. Formally,

S_{Katz} (u, v) = \sum_{L = 1}^{\infty} β^{L} \cdot |\mathrm{paths}_{u v}^{\langle L \rangle}| = β A_{u v} + β^{2} {(A^{2})}_{u v} + β^{3} {(A^{3})}_{u v} + \dots (6)

where $\mathrm{paths}_{u v}^{\langle L \rangle}$ is the set of paths of length L that connect nodes u and v, A is the adjacency matrix of the network, and the parameter β controls the path weights.
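As a rough illustration of Eq. 6, the series can be truncated after a fixed number of terms. The sketch below is an assumption-laden example rather than the procedure used in the study: it uses plain Python lists, a hypothetical helper `mat_mul`, and a hypothetical cut-off `max_len`. The infinite series converges only when β is smaller than the reciprocal of the largest eigenvalue of A, in which case it equals (I − βA)^{-1} − I.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_index(adj: Matrix, beta: float, max_len: int = 50) -> Matrix:
    """Truncated Katz series: sum over L = 1..max_len of beta^L * A^L."""
    n = len(adj)
    power = [row[:] for row in adj]          # holds A^L, starting with L = 1
    scores = [[0.0] * n for _ in range(n)]
    weight = beta                            # holds beta^L
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)
        weight *= beta
    return scores


# Hypothetical path graph 0 - 1 - 2.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
two_terms = katz_index(path_graph, beta=0.1, max_len=2)
full = katz_index(path_graph, beta=0.1)
```

With only two terms, the directly linked pair (0, 1) scores β and the pair (0, 2), joined by a single path of length 2, scores β², which shows how shorter paths dominate.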

Random Walk with Restart (RWR). Tong et al. (2008) used the RWR index to measure the relevance score between node j and node i in a network. Given the adjacency matrix W_n,n of a PPI network, a random walker starts from node i; at each step it moves to a randomly chosen neighbor of its current node with probability c, or returns to node i with probability 1 − c. In the long run, the walker is found at node j with a stable probability R_i,j, and this steady-state probability is defined as the RWR index. We have

\vec{R_{i}} = c {\tilde{W}}^{T} \vec{R_{i}} + (1 - c) \vec{e_{i}} (7)

where $\vec{e_{i}}$ is the starting vector, whose ith element is 1 and whose other elements are 0, and $\tilde{W}$ is the row-normalized weight matrix. For an unweighted network, ${\tilde{W}}_{i j} = 1 / m$ (where m is the number of neighbors of node i) if i and j are connected, and ${\tilde{W}}_{i j} = 0$ otherwise. For a weighted network,

\begin{cases} {\tilde{W}}_{i j} = W_{i j} / \sum_{l = 1}^{n} W_{i l}, & \text{if } i \text{ and } j \text{ are connected}, \\ {\tilde{W}}_{i j} = 0, & \text{otherwise}. \end{cases} (8)
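Eq. 7 is a fixed-point equation, so one simple way to approximate the RWR scores is to iterate it until the vector stops changing. The sketch below is a minimal example under that assumption; the function names, the tolerance, and the default c = 0.5 are hypothetical choices, not values reported in the study.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(w: Matrix) -> Matrix:
    """Eq. 8: divide every row by its sum (rows of isolated nodes stay zero)."""
    out = []
    for row in w:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def rwr(w: Matrix, start: int, c: float = 0.5,
        tol: float = 1e-12, max_iter: int = 1000) -> List[float]:
    """Iterate R <- c * W~^T R + (1 - c) * e_start until convergence (Eq. 7)."""
    wt = row_normalize(w)
    n = len(w)
    e = [1.0 if i == start else 0.0 for i in range(n)]
    r = e[:]
    for _ in range(max_iter):
        # (W~^T r)[j] = sum over i of W~[i][j] * r[i]
        new = [c * sum(wt[i][j] * r[i] for i in range(n)) + (1 - c) * e[j]
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, r)) < tol:
            return new
        r = new
    return r


# Hypothetical path graph 0 - 1 - 2, walker restarting at node 0.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
scores = rwr(path_graph, start=0, c=0.5)
```

For this small graph the fixed point can be solved by hand: the scores for nodes 0, 1 and 2 are 7/12, 1/3 and 1/12, which sum to 1 and decrease with distance from the start node.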

2.2 Network Reconstruction and Edge Enrichment

Network reconstruction is carried out as follows. First, the similarity scores between protein pairs in the original PPI network are calculated according to the above similarity indices. Next, interactions are selected on the basis of these scores to reconstruct the PPI network. As in Liben-Nowell and Kleinberg (2007), a score threshold can be chosen such that the number of protein pairs scoring above the threshold is as close as possible to the number of interactions in the original network; the new network is then formed from the protein pairs scoring above the threshold. However, this approach may leave some proteins out of the new network. Alternatively, for each node N_i in the original network, we first remove all its interactions and find the top k nodes most similar to N_i. Then, k edges from N_i to these top k neighbors are created, and their similarity scores are used as edge weights in the new network. Thus, we have

S {(N_{i})}_{k} = [S_{i, 1}, S_{i, 2}, \dots, S_{i, k}] . (9)
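Both reconstruction variants described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the similarity scores are assumed to be available as a symmetric nested dictionary `sim[u][v]`, pairs with a zero score are assumed not to produce edges, and an undirected edge is created whenever either endpoint selects the other. All names and the toy scores are hypothetical.

```python
from typing import Dict, FrozenSet

Sim = Dict[str, Dict[str, float]]      # sim[u][v] = similarity score
Edges = Dict[FrozenSet[str], float]    # undirected edge -> weight


def reconstruct_by_threshold(sim: Sim, n_edges: int) -> Edges:
    """Keep the n_edges best-scoring pairs (n_edges = size of the original network)."""
    pairs = {frozenset((u, v)): s
             for u in sim for v, s in sim[u].items() if u != v}
    ranked = sorted(pairs.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return dict(ranked[:n_edges])


def reconstruct_top_k(sim: Sim, k: int) -> Edges:
    """Link every node to its k most similar nodes (Eq. 9), weighted by similarity."""
    edges: Edges = {}
    for u, scores in sim.items():
        ranked = sorted(((s, v) for v, s in scores.items() if v != u and s > 0),
                        key=lambda t: (-t[0], t[1]))
        for s, v in ranked[:k]:
            edges[frozenset((u, v))] = s
    return edges


# Hypothetical similarity scores among four proteins.
toy_sim = {
    "A": {"B": 0.9, "C": 0.2, "D": 0.1},
    "B": {"A": 0.9, "C": 0.5, "D": 0.0},
    "C": {"A": 0.2, "B": 0.5, "D": 0.4},
    "D": {"A": 0.1, "B": 0.0, "C": 0.4},
}
by_threshold = reconstruct_by_threshold(toy_sim, n_edges=2)
by_top_k = reconstruct_top_k(toy_sim, k=1)
```

In the toy example the threshold variant keeps only A-B and B-C, so protein D disappears from the network, whereas the top-k variant with k = 1 also keeps C-D and therefore retains every protein.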

Edge enrichment follows the same steps as network reconstruction; the only difference is that all interactions in the original network are preserved. An enriched network therefore has two types of edges: explicit edges (old edges) and similarity-inferred edges (new edges). Two questions must be addressed here: how to combine edge weights with different semantics, and how many edges to add for each protein, that is, how to choose the parameter k (see Eq. 9). These questions are discussed in the following sections.
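A minimal sketch of edge enrichment, under the assumption that the two edge types are stored as separate layers so that their weights, which have different semantics, are never mixed directly. The function name `enrich`, the layer names, and the toy data are hypothetical; whether an inferred edge may duplicate an explicit one is not specified in the text and is allowed here.

```python
from typing import Dict

Adjacency = Dict[str, Dict[str, float]]  # node -> {neighbor: weight}


def enrich(original: Adjacency, sim: Adjacency, k: int) -> Dict[str, Adjacency]:
    """Keep all explicit edges and add, per node, its k most similar nodes."""
    inferred: Adjacency = {}
    for u, scores in sim.items():
        ranked = sorted(((s, v) for v, s in scores.items() if v != u),
                        key=lambda t: (-t[0], t[1]))
        inferred[u] = {v: s for s, v in ranked[:k]}
    return {"explicit": original, "inferred": inferred}


# Hypothetical data: one known interaction A-B, similarity scores for all pairs.
toy_original = {"A": {"B": 1.0}, "B": {"A": 1.0}, "C": {}}
toy_sim = {
    "A": {"B": 0.1, "C": 0.8},
    "B": {"A": 0.1, "C": 0.3},
    "C": {"A": 0.8, "B": 0.3},
}
enriched = enrich(toy_original, toy_sim, k=1)
```

Protein C has no explicit neighbor in this toy network, but after enrichment it gains A as a similarity-inferred neighbor, which is the kind of added information that enrichment is meant to provide.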

2.3 Protein Function Prediction Approaches

In this study, protein function prediction on two real PPI datasets is performed using two different approaches. The first is the majority method, a local neighbor-counting approach (Schwikowski et al., 2000). The second is a global protein function prediction approach commonly called collective classification (Xiong et al., 2013). Details of the latter are presented in the following subsections.
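The majority method can be sketched as plain neighbor counting: each annotated neighbor contributes one vote for each of its functions, and functions are ranked by their vote counts. The code below is an illustrative reading of that idea, with hypothetical names and toy data, and with ties broken alphabetically as an arbitrary assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def majority_predict(neighbors: Dict[str, Iterable[str]],
                     annotations: Dict[str, Set[str]],
                     protein: str, top: int = 3) -> List[str]:
    """Rank functions by how many annotated neighbors of `protein` carry them."""
    counts: Counter = Counter()
    for nb in neighbors[protein]:
        counts.update(annotations.get(nb, ()))  # unannotated neighbors add nothing
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [function for function, _ in ranked[:top]]


# Hypothetical query protein Q with three annotated neighbors.
toy_neighbors = {"Q": ["A", "B", "C"]}
toy_annotations = {"A": {"F1", "F2"}, "B": {"F1"}, "C": {"F1", "F3"}}
prediction = majority_predict(toy_neighbors, toy_annotations, "Q", top=2)
```

Here F1 receives three votes and F2 and F3 one vote each, so the top two predictions are F1 followed by F2.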

ALGORITHM 1. Gibbs sampling

2.4 Gibbs Sampling Based Collective Classification

Gibbs sampling (GS) comprises two main processes, bootstrapping and iterative classification (Sen et al., 2008). The pseudo-code is given in Algorithm 1.

2.4.1 Bootstrapping

In a PPI network, the closer two proteins are to each other, the more similar their functions tend to be. For an unannotated protein, the probability distribution over functions is estimated using a weighted voting method. In the original or reconstructed network, only one kind of annotated neighbor votes: an unannotated protein V_x has either its explicit neighbors N_x or its k similarity-inferred neighbors. The edge weights of these neighbor sets are:

\begin{aligned} N_{x}^{w} &= [w_{x 1}, w_{x 2}, \dots, w_{x i}, \dots, w_{x N_{x}}], \\ N_{x}^{s} &= [S_{x, 1}, S_{x, 2}, \dots, S_{x, i}, \dots, S_{x, k}] \end{aligned} (10)

The probability of V_x having the jth function F_j (V_xF_j) is calculated as follows:

\begin{aligned} P_{x}^{j} &= \frac{1}{Z_{x}^{w}} \sum_{i = 1}^{N_{x}} w_{x, i} f_{i, j}, \\ P_{x}^{j} &= \frac{1}{Z_{x}^{s}} \sum_{i = 1}^{k} S_{x, i} f_{i, j} \end{aligned} (11)

where $Z_{x}^{w}$ and $Z_{x}^{s}$ are the normalizers:

\begin{aligned} Z_{x}^{w} &= \sum_{j = 1}^{m} \sum_{i = 1}^{N_{x}} w_{x, i} f_{i, j}, \\ Z_{x}^{s} &= \sum_{j = 1}^{m} \sum_{i = 1}^{k} S_{x, i} f_{i, j} \end{aligned} (12)

In the enriched network, however, both old (explicit) and new (similarity-inferred) neighbors vote. The parameter λ ∈ (0, 1) is therefore used to combine the two types of neighbors. Given a query protein V_x, the V_xF_j probability is calculated as follows:

P_{x}^{j} = λ \frac{1}{Z_{x}^{w}} \sum_{i = 1}^{N_{x}} w_{x, i} f_{i, j} + (1 - λ) \frac{1}{Z_{x}^{s}} \sum_{i = 1}^{k} S_{x, i} f_{i, j} (13)

A higher $P_{x}^{j}$ value indicates that protein V_x is more likely to have the jth function F_j. The V_xF_j probability distribution is represented as:

\vec{a_{x}} = [P_{x}^{1}, P_{x}^{2}, \dots, P_{x}^{m}] (14)
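The weighted vote of Eqs. 11 to 13 can be sketched as follows. This is an illustration only: f_{i,j} is assumed to be an indicator that equals 1 when neighbor i is annotated with function j and 0 otherwise, which the text does not state explicitly, and the function names and toy data are hypothetical. The default λ = 0.7 matches the value used in the experiments.

```python
from typing import Dict, List, Set


def vote(weights: Dict[str, float], annotations: Dict[str, Set[str]],
         functions: List[str]) -> List[float]:
    """Eqs. 11-12: weighted vote per function, normalized over all functions."""
    raw = [sum(w for nb, w in weights.items() if f in annotations.get(nb, ()))
           for f in functions]
    z = sum(raw)  # normalizer Z_x
    return [r / z if z else 0.0 for r in raw]


def bootstrap(explicit: Dict[str, float], inferred: Dict[str, float],
              annotations: Dict[str, Set[str]], functions: List[str],
              lam: float = 0.7) -> List[float]:
    """Eq. 13: mix the explicit and similarity-inferred votes with weight lam."""
    p_w = vote(explicit, annotations, functions)
    p_s = vote(inferred, annotations, functions)
    return [lam * a + (1 - lam) * b for a, b in zip(p_w, p_s)]


# Hypothetical query protein with two explicit neighbors and one inferred neighbor.
functions = ["F1", "F2"]
annotations = {"A": {"F1"}, "B": {"F2"}, "C": {"F1"}}
explicit = {"A": 1.0, "B": 1.0}   # interaction weights w_{x,i}
inferred = {"C": 0.5}             # similarity scores S_{x,i}
a_x = bootstrap(explicit, inferred, annotations, functions, lam=0.7)
```

The explicit neighbors split their vote evenly between F1 and F2, while the inferred neighbor votes only for F1, so the combined distribution is 0.7 × 0.5 + 0.3 × 1 = 0.65 for F1 and 0.35 for F2.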

2.4.2 Iterative Classification

Iterative classification has two main steps, burn-in and sampling. In the burn-in period, the number of iterations is fixed, and $\vec{a_{x}}$ is updated in each iteration. In the sampling period, we again update $\vec{a_{x}}$ in each iteration, and also count how many times the jth function F_j is sampled for protein V_x. Since each protein may have one or more functions, we define the most likely function of protein V_x as follows:

b_{x}^{j} = \arg\max_{j \in [1, m]} P_{x}^{j} (15)

where $b_{x}^{j}$ represents the jth most likely function of protein V_x, that is, the jth-ranked result. We further use the vector $\vec{b_{x i}}$ to record all ranking results in the ith iteration:

\vec{b_{x i}} = [b_{x i}^{1}, b_{x i}^{2}, \dots, b_{x i}^{m}] . (16)

After the predetermined number s of iterations has been run, the matrix M_x with s rows and m columns is obtained:

M_{x} = {[\vec{b_{x 1}}, \vec{b_{x 2}}, \dots, \vec{b_{x s}}]}^{T} . (17)

Finally, we obtain the required m-dimensional vector $\vec{c_{x}}$ for query protein V_x:

\vec{c_{x}} = [c_{x}^{1}, c_{x}^{2}, \dots, c_{x}^{m}] . (18)

where $c_{x}^{i}$ is the first ranked prediction in the ith column of M_x.
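One possible reading of Eqs. 15 to 18 is sketched below: in every sampling iteration the functions are ranked by their current probabilities, the rankings are stacked into M_x, and the final ith-rank prediction is taken to be the function that appears most often in the ith column. This interpretation of "first ranked prediction in a column" is an assumption, as are the function names, the tie-breaking rule, and the toy probabilities.

```python
from collections import Counter
from typing import List


def rank_functions(probs: List[float]) -> List[int]:
    """One row of M_x: function indices ordered from most to least probable."""
    return sorted(range(len(probs)), key=lambda j: (-probs[j], j))


def aggregate(samples: List[List[int]]) -> List[int]:
    """Final ranking c_x: the most frequent function index in each column of M_x."""
    m = len(samples[0])
    result = []
    for col in range(m):
        counts = Counter(row[col] for row in samples)
        # Most frequent entry; ties go to the smaller function index.
        result.append(min(counts, key=lambda f: (-counts[f], f)))
    return result


# Hypothetical probability vectors a_x from three sampling iterations (m = 3).
iterations = [
    [0.50, 0.30, 0.20],
    [0.40, 0.45, 0.15],
    [0.60, 0.30, 0.10],
]
m_x = [rank_functions(p) for p in iterations]
c_x = aggregate(m_x)
```

Function 0 is ranked first in two of the three iterations, so it becomes the first-rank prediction. Note that column-wise voting of this kind can, in principle, return the same function at two different ranks, which a full implementation would need to handle.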

3 Results and Discussion

3.1 Data Preprocessing and Experimental Workflow

Two PPI datasets, A and B, are used in our study. Datasets A and B are downloaded from the BioGRID (Stark et al., 2011) and STRING (Szklarczyk et al., 2011) databases, respectively, and are annotated with Gene Ontology (GO) terms as in Ashburner et al. (2000). GO annotations consist of three basic namespaces: molecular function, biological process and cellular component. We construct one protein interaction network for each GO namespace using only physical interactions. Therefore, there are six PPI networks in total (three for S. cerevisiae and three for M. musculus) in Dataset A. For Dataset B, we construct two PPI networks (one for S. cerevisiae and one for M. musculus). More detailed information is listed in the supplementary material (Supplementary Table S1).

The function prediction performance on the reconstructed and enriched networks is first compared with that on the original networks using leave-one-out cross validation (LOOM). LOOM takes each protein in turn as the query protein and predicts its functions from the remaining proteins in the network. Because the bootstrapping step in Gibbs sampling based collective classification does not update the query protein, we use the majority method to predict protein functions in LOOM cross validation. Next, the proportion of annotated proteins is varied from 10% to 90%, and the average performance over 10 experiments is reported for each proportion. The majority method is not suitable in this setting because it is a local neighbor-counting approach and does not work well in sparsely labeled networks; Gibbs sampling based collective classification is therefore used to predict protein functions. The algorithms were run on a machine with an Intel dual-core processor (3 GHz) and 16 GB RAM under a Linux operating system, with Python 3.0 as the programming environment.

Finally, as in Bogdanov and Singh (2010), prediction accuracy on the PPI networks is assessed by the ratio of the number of true positive (TP) predictions to the number of false positive (FP) predictions produced in the cross validation, i.e., TP/FP. We define the overall ith-rank TP as the number of proteins whose ith-rank predicted function $c_{x}^{i}$ is one of the true functions of protein V_x, and the overall ith-rank FP as the number of proteins whose ith-rank predicted function $c_{x}^{i}$ is not one of the true functions of protein V_x.
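The leave-one-out protocol and the rank-wise TP/FP ratio can be combined in a short sketch. It is a simplified illustration with hypothetical names and toy data: the predictor is plain neighbor counting, the query protein's own annotation is never consulted when predicting it, and a rank with no false positives is reported as infinity, which is an arbitrary convention chosen here.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def majority_ranked(neighbors: Dict[str, List[str]],
                    annotations: Dict[str, Set[str]],
                    query: str, top: int) -> List[str]:
    """Rank functions by neighbor counts, ignoring the query protein itself."""
    counts: Counter = Counter()
    for nb in neighbors.get(query, ()):
        if nb != query:
            counts.update(annotations.get(nb, ()))
    return sorted(counts, key=lambda f: (-counts[f], f))[:top]


def leave_one_out_tp_fp(neighbors: Dict[str, List[str]],
                        annotations: Dict[str, Set[str]],
                        top: int) -> Tuple[List[int], List[int]]:
    """Count, for each rank, how many proteins get a correct or wrong prediction."""
    tp, fp = [0] * top, [0] * top
    for query, truth in annotations.items():
        for rank, function in enumerate(majority_ranked(neighbors, annotations, query, top)):
            if function in truth:
                tp[rank] += 1
            else:
                fp[rank] += 1
    return tp, fp


def tp_fp_ratio(tp: List[int], fp: List[int]) -> List[float]:
    """TP/FP per rank; infinity when a rank has no false positives."""
    return [t / f if f else float("inf") for t, f in zip(tp, fp)]


# Hypothetical triangle network of three annotated proteins.
toy_neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
toy_annotations = {"A": {"F1"}, "B": {"F1", "F2"}, "C": {"F2"}}
tp, fp = leave_one_out_tp_fp(toy_neighbors, toy_annotations, top=2)
ratios = tp_fp_ratio(tp, fp)
```

In the toy network only protein B receives a correct first-rank prediction, giving a first-rank ratio of 1/2, while all three second-rank predictions are correct.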

3.2 Similarity Index Selection and the Effect of the Parameters k and λ

In addition to sequence similarity, the PPI networks are reconstructed and enriched using three local similarity indices (CN, Jaccard and FS) and two global similarity indices (Katz and RWR). To choose the best ones for the following experiments, the performance differences between the five similarity indices are evaluated on Datasets A and B. The results on Dataset A are presented in Supplementary Table S3 and Table 1, and those on Dataset B in Table 2. Using FS as the local similarity index and RWR as the global similarity index generally achieves the best performance. Hence, FS and RWR are selected as the local and global similarity indices, respectively, in the following experiments.

TABLE 1. Comparison of performance differences between similarity indices (Dataset A: M.musculus).

TABLE 2. Comparison of performance differences between similarity indices (Dataset B).

The effect of two parameters on the performance of network reconstruction and edge enrichment is also examined. The first is the number of similarity-inferred edges, k. The prediction performance on Datasets A and B for varying values of k is listed in Supplementary Table S4, Table 3 and Table 4. For both datasets, BLAST achieves roughly its best performance at k = 5. For FS and RWR, with k ∈ {10, 30, 50, 100}, k = 30 or k = 50 generally works best, and the overall performance is relatively robust to k for both the reconstructed and the enriched networks. Hence, in the following experiments, k is set to 5, 30 and 30 for BLAST, FS and RWR, respectively.

TABLE 3. The influence of the parameter of k (M.musculus in Dataset A).

TABLE 4. The effect of the parameter k (Dataset B).

The second parameter, λ, controls the trade-off between explicit edges and similarity-inferred edges. Its effect on prediction performance is evaluated as it varies from 0.1 to 0.9. The results on Dataset A are listed in the supplementary material (Supplementary Table S5) and Table 5, and those on Dataset B in Table 6. Generally, λ has a relatively small impact on prediction accuracy unless it is too large or too small. In the following experiments, λ is set uniformly to 0.7.

TABLE 5. The influence of the parameter λ (M.musculus in Dataset A).

TABLE 6. The influence of the parameter λ (Dataset B).

3.3 Performance Evaluation on Dataset A

The performance of the reconstructed and enriched networks is first compared with that of the original networks by leave-one-out validation. The number of top-ranked predictions examined is chosen according to the average number of useful functions per protein in the PPI networks. Accordingly, only the top 2 predictions are examined for the S. cerevisiae PPI networks in Dataset A, and the top 3 or 4 predictions for M. musculus in Dataset A.

Edge enrichment yields more accurate predictions than network reconstruction and the original networks, owing to the combination of explicit and implicit (similarity-inferred) edges (Figure 1). The results indicate that adding similarity-inferred edges to PPI networks indeed improves prediction performance. BLAST-enriched networks always work best, while BLAST-reconstructed networks always work worst. This is because BLAST-inferred edges are based on protein sequence information, which is lacking in the original networks. Adding BLAST-inferred edges greatly increases the useful information in the network and consequently boosts prediction accuracy. In the reconstructed networks, however, the original PPI edges are discarded first, so BLAST-reconstructed networks contain only protein sequence information and thus perform worst. The experimental results also show that FS-reconstructed networks and RWR-reconstructed networks work better than the original networks in most cases. This is because the reconstructed networks filter out noisy or spurious interactions in the original PPI networks.

FIGURE 1. The performance evaluation by leave-one-out validation over the PPI networks (Dataset A: S.cerevisiae and M.musculus). Here, the sub figures in the horizontal and vertical directions represent the experimental results for the PPI networks of different data sets and function types, respectively. Horizontally, the top three subplots represent ones on S.cerevisiae, and the bottom for ones on M.musculus. (A) and (D) Molecular function, (B) and (E) Biological process, (C) and (F) Cellular component.

We further evaluate the prediction accuracy of these three kinds of networks by using Gibbs sampling in sparsely labeled PPI networks. Concretely, the proportion of annotated proteins in the PPI networks is varied from 0.1 to 0.9, and the functions of the remaining proteins are predicted. For each proportion of annotated proteins, the average prediction accuracy over 10 experiments is presented for the PPI networks of S. cerevisiae (Figure 2) and M. musculus (Figure 3). Edge enrichment yields more accurate predictions than network reconstruction and the original networks. The BLAST-enriched networks always work best, while the BLAST-reconstructed networks always perform worst. As expected, the experimental results also show that FS-reconstructed networks and RWR-reconstructed networks generally perform better than the original networks. As the annotated protein proportion increases, the prediction performance improves for most networks, especially for the first-rank function. However, the prediction performance of the original network slightly declines as its annotated protein proportion increases (Figure 3G, H).

FIGURE 2. The performance evaluation over the sparsely-labeled networks (Dataset A: S.cerevisiae). Here, the subfigures in the horizontal and vertical directions represent the experimental results for the PPI networks of different function types and rank predicted functions, respectively. Horizontally, the top two subplots represent ones of molecular function, the middle for ones of biological process, and the bottom for ones of cellular component. (A), (C) and (E) first rank predicted function, (B), (D) and (F) second rank predicted function.

FIGURE 3. The performance evaluation over the sparsely-labeled networks (Dataset A: M.musculus). Here, the subfigures in the horizontal and vertical directions represent the experimental results for the PPI networks of different function types and rank predicted functions, respectively. Horizontally, the top three subplots represent ones on the PPI networks of molecular function, the middle for ones of biological process, and the bottom for ones of cellular component. (A), (D) and (G) first rank predicted function, (B), (E) and (H) second rank predicted function, (C), (F) and (I) third rank predicted function.

3.4 Performance Evaluation on Dataset B

As above, the performance of the reconstructed and enriched networks is first compared with that of the original networks by leave-one-out validation. Here, the top 3 protein function predictions are considered for both the S. cerevisiae and the M. musculus PPI networks. As expected, edge enrichment yields more accurate predictions than network reconstruction and the original networks. Moreover, BLAST-enriched networks work best, while BLAST-reconstructed networks always work worst (Figure 4). The reasons are the same as for Dataset A.

FIGURE 4. The performance evaluation by leave-one-out validation over the PPI networks (Dataset B: S.cerevisiae and M.musculus). (A) S.cerevisiae, (B) M.musculus.

Next, we evaluate the prediction performance of these networks under sparsely labeled conditions with the collective classification method. As before, the average prediction performance over 10 experiments is reported, with the annotated-protein proportion varying from 0.1 to 0.9. Generally, the experimental results show a trend similar to that for Dataset A (Figure 5). However, FS-reconstructed networks and RWR-reconstructed networks do not outperform the original networks here, owing to the quality properties of the dataset itself: when the networks are reconstructed based on similarity, many informative interactions are deleted and the prediction performance is impaired.

FIGURE 5. The performance evaluation over the sparsely-labeled networks (Dataset B: S.cerevisiae and M.musculus). Here, the sub figures in the horizontal and vertical directions represent the experimental results for different data types and rank predicted functions, respectively. Horizontally, the top three subplots represent ones over the dataset of S.cerevisiae, and the bottom for ones of M.musculus. (A) and (D) first rank predicted function, (B) and (E) second rank predicted function, (C) and (F) third rank predicted function.

To validate this point, 10% and 50% of the interactions in the original network of Dataset B are randomly selected to construct two sparse networks, and leave-one-out validation is then performed on each of them. The selection process has two steps. First, a random weight is assigned to each edge of the original network, and a minimum spanning tree (MST) is constructed on the resulting weighted network. The random weights make the MST itself random, and the MST ensures the connectivity of the sparse network. Second, the MST is expanded by adding edges randomly selected from the original network (excluding those already in the MST) until the number of edges in the sparse network equals 10% or 50% of the number of edges in the original network. The sparse network preserves the basic topological properties of the original network.
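The two-step selection procedure can be sketched with Kruskal's algorithm on randomly weighted edges. The code below is an illustrative, standard-library-only version with hypothetical names; it assumes that the spanning tree is smaller than the target size, and simply returns the tree alone if it is not.

```python
import random
from typing import Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def sparsify(edges: Sequence[Edge], nodes: Sequence[Hashable],
             fraction: float, seed: int = 0) -> List[Edge]:
    """Random spanning tree plus random extra edges, about fraction * |E| in total."""
    rng = random.Random(seed)
    # Step 1: assign random weights and run Kruskal's algorithm (union-find).
    by_weight = sorted(edges, key=lambda e: rng.random())
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree, rest = [], []
    for u, v in by_weight:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            rest.append((u, v))            # would close a cycle: not in the MST
        else:
            parent[root_u] = root_v
            tree.append((u, v))
    # Step 2: add random non-tree edges until the target size is reached.
    target = round(fraction * len(edges))
    n_extra = max(0, min(len(rest), target - len(tree)))
    return tree + rng.sample(rest, n_extra)


# Hypothetical complete network on six nodes (15 edges), keeping 60% of them.
toy_nodes = list(range(6))
toy_edges = [(i, j) for i in range(6) for j in range(i + 1, 6)]
sparse = sparsify(toy_edges, toy_nodes, fraction=0.6, seed=1)
```

Because the spanning tree is built first, the resulting sparse network stays connected whenever the original network is connected, regardless of which extra edges are drawn.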

The experimental results confirm the above explanation. For example, in Figure 6, the FS-reconstructed networks and the RWR-reconstructed networks work better than the original networks when the networks are very sparse (e.g., 10%). However, as the networks become denser, the FS-reconstructed networks and the RWR-reconstructed networks become worse than the original networks.

FIGURE 6. The performance evaluation by leave-one-out validation over the PPI networks (Dataset B: S.cerevisiae and M.musculus). Here, the subfigures in the horizontal and vertical directions represent the experimental results for different data types and rank predicted functions, respectively. Horizontally, the top three subplots represent ones over the dataset of S.cerevisiae, and the bottom for ones of M.musculus. (A) and (D) first rank predicted function, (B) and (E) second rank predicted function, (C) and (F) third rank predicted function.

4 Conclusion

A systematic comparison of two network transformation approaches (network reconstruction and edge enrichment) is performed using three different kinds of protein similarity metrics (sequence, local and global similarity). In summary, edge enrichment performs better than network reconstruction and the original networks, while network reconstruction is more effective on relatively small and incomplete PPI networks. Edge enrichment of PPI networks based on sequence similarity outperforms enrichment based on either local or global similarity. As PPI networks become more and more complete, the benefit of both edge enrichment and network reconstruction is expected to diminish.

Future research will address: 1) how the removal of noisy edges and the addition of informative edges each affect prediction performance; and 2) the development of a combined approach that brings together the best properties of the similarity indices considered here, since these indices have different properties and performance.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: (1) Dataset A: BioGRID, https://downloads.thebiogrid.org/BioGRID. (2) Dataset B: STRING, https://string-db.org.

Author Contributions

JZ, JG, WX and JG, JH designed and performed the experiments. JZ, JG, WX and YW analyzed the data. The manuscript was written by JZ, JG and WX and approved by all authors.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 41877009, 61772367, 62172300 and U1936205, and by the Fundamental Research Funds for the Central Universities.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.758131/full#supplementary-material

References

Abdollahi, S., Lin, P.-C., and Chiang, J.-H. (2021). Winbinvec: Cancer-Associated Protein-Protein Interaction Extraction and Identification of 20 Various Cancer Types and Metastasis Using Different Deep Learning Models. IEEE J. Biomed. Health Inform. 25, 4052–4063. doi:10.1109/JBHI.2021.3093441

Adamcsek, B., Palla, G., Farkas, I. J., Derényi, I., and Vicsek, T. (2006). CFinder: Locating Cliques and Overlapping Modules in Biological Networks. Bioinformatics 22, 1021–1023. doi:10.1093/bioinformatics/btl039

Altschul, S., Madden, T., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Res. 25, 3389–3402. doi:10.1093/nar/25.17.3389

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., et al. (2000). Gene Ontology: Tool for the Unification of Biology. Nat. Genet. 25, 25–29. doi:10.1038/75556

Barrell, D., Dimmer, E., Huntley, R. P., Binns, D., O'Donovan, C., and Apweiler, R. (2009). The GOA Database in 2009--an Integrated Gene Ontology Annotation Resource. Nucleic Acids Res. 37, D396–D403. doi:10.1093/nar/gkn803

Berggård, T., Linse, S., and James, P. (2007). Methods for the Detection and Analysis of Protein-Protein Interactions. Proteomics 7, 2833–2842. doi:10.1002/pmic.200700131

Bogdanov, P., and Singh, A. K. (2010). Molecular Function Prediction Using Neighborhood Features. IEEE/ACM Trans. Comput. Biol. Bioinf. 7, 208–217. doi:10.1109/TCBB.2009.81

Chen, Y., Wang, W., Liu, J., Feng, J., and Gong, X. (2020). Protein Interface Complementarity and Gene Duplication Improve Link Prediction of Protein-Protein Interaction Network. Front. Genet. 11, 291. doi:10.3389/fgene.2020.00291

Chua, H. N., Sung, W.-K., and Wong, L. (2007). An Efficient Strategy for Extensive Integration of Diverse Biological Data for Protein Function Prediction. Bioinformatics 23, 3364–3373. doi:10.1093/bioinformatics/btm520

Chua, H. N., Sung, W.-K., and Wong, L. (2006). Exploiting Indirect Neighbours and Topological Weight to Predict Protein Function from Protein-Protein Interactions. Bioinformatics 22, 1623–1630. doi:10.1093/bioinformatics/btl145

Dunn, R., Dudbridge, F., and Sanderson, C. M. (2005). The Use of Edge-Betweenness Clustering to Investigate Biological Function in Protein Interaction Networks. BMC Bioinformatics 6, 39. doi:10.1186/1471-2105-6-39

Hu, L., Huang, T., Shi, X., Lu, W.-C., Cai, Y.-D., and Chou, K.-C. (2011). Predicting Functions of Proteins in Mouse Based on Weighted Protein-Protein Interaction Network and Protein Hybrid Properties. PLOS ONE 6, e14556. doi:10.1371/journal.pone.0014556

Kuchaiev, O., Milenković, T., Memišević, V., Hayes, W., and Pržulj, N. (2010). Topological Network Alignment Uncovers Biological Function and Phylogeny. J. R. Soc. Interf. 7, 1341–1354. doi:10.1098/rsif.2010.0063

Liben-Nowell, D., and Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. J. Am. Soc. Inf. Sci. 58, 1019–1031. doi:10.1002/asi.20591

Lü, L., and Zhou, T. (2011). Link Prediction in Complex Networks: A Survey. Physica A: Stat. Mech. its Appl. 390, 1150–1170. doi:10.1016/j.physa.2010.11.027

Newman, M. E. J. (2001). Clustering and Preferential Attachment in Growing Networks. Phys. Rev. E 64, 025102. doi:10.1103/PhysRevE.64.025102

Ofran, Y., Punta, M., Schneider, R., and Rost, B. (2005). Beyond Annotation Transfer by Homology: Novel Protein-Function Prediction Methods to Assist Drug Discovery. Drug Discov. Today 10, 1475–1482. doi:10.1016/S1359-6446(05)03621-4

Patel, S., Tripathi, R., Kumari, V., and Varadwaj, P. (2017). Deepinteract: Deep Neural Network Based Protein-Protein Interaction Prediction Tool. Curr. Bioinform. 12, 551–557. doi:10.2174/1574893611666160815150746

Schwikowski, B., Uetz, P., and Fields, S. (2000). A Network of Protein-Protein Interactions in Yeast. Nat. Biotechnol. 18, 1257–1261. doi:10.1038/82360

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. (2008). Collective Classification in Network Data. AI Mag. 29, 93–106. doi:10.1609/aimag.v29i3.2157

Sharan, R., Ulitsky, I., and Shamir, R. (2007). Network‐based Prediction of Protein Function. Mol. Syst. Biol. 3, 88. doi:10.1038/msb4100129

Singh, R., Xu, J., and Berger, B. (2008). Global Alignment of Multiple Protein Interaction Networks with Application to Functional Orthology Detection. Proc. Natl. Acad. Sci. 105, 12763–12768. doi:10.1073/pnas.0806627105

Sleator, R. D., and Walsh, P. (2010). An Overview of In Silico Protein Function Prediction. Arch. Microbiol. 192, 151–155. doi:10.1007/s00203-010-0549-9

Solava, R. W., Michaels, R. P., and Milenković, T. (2012). Graphlet-based Edge Clustering Reveals Pathogen-Interacting Proteins. Bioinformatics 28, i480–i486. doi:10.1093/bioinformatics/bts376

Stark, C., Breitkreutz, B.-J., Chatr-Aryamontri, A., Boucher, L., Oughtred, R., Livstone, M. S., et al. (2011). The BioGRID Interaction Database: 2011 Update. Nucleic Acids Res. 39, D698–D704. doi:10.1093/nar/gkq1116

Szklarczyk, D., Franceschini, A., Kuhn, M., Simonovic, M., Roth, A., Minguez, P., et al. (2011). The STRING Database in 2011: Functional Interaction Networks of Proteins, Globally Integrated and Scored. Nucleic Acids Res. 39, D561–D568. doi:10.1093/nar/gkq973

Täubig, H., Buchner, A., and Griebsch, J. (2006). PAST: Fast Structure-Based Searching in the PDB. Nucleic Acids Res. 34, W20–W23. doi:10.1093/nar/gkl273

Tong, H., Faloutsos, C., and Pan, J.-Y. (2008). Random Walk with Restart: Fast Solutions and Applications. Knowl. Inf. Syst. 14, 327–346. doi:10.1007/s10115-007-0094-2

Waiho, K., Afiqah‐Aleng, N., Iryani, M. T. M., and Fazhan, H. (2021). Protein-protein Interaction Network: an Emerging Tool for Understanding Fish Disease in Aquaculture. Rev. Aquacult. 13, 156–177. doi:10.1111/raq.12468

Wallace, A. C., Laskowski, R. A., and Thornton, J. M. (1996). Derivation of 3D Coordinate Templates for Searching Structural Databases: Application to Ser-His-Asp Catalytic Triads in the Serine Proteinases and Lipases. Protein Sci. 5, 1001–1013. doi:10.1002/pro.5560050603

Wu, Q., Ye, Y., Ng, M. K., Ho, S.-S., and Shi, R. (2014). Collective Prediction of Protein Functions from Protein-Protein Interaction Networks. BMC Bioinformatics 15, S9. doi:10.1186/1471-2105-15-S2-S9

Wu, Z., Liao, Q., and Liu, B. (2020). A Comprehensive Review and Evaluation of Computational Methods for Identifying Protein Complexes from Protein-Protein Interaction Networks. Brief. Bioinform. 21, 1531–1548. doi:10.1093/bib/bbz085

Xiong, W., Liu, H., Guan, J., and Zhou, S. (2013). Protein Function Prediction by Collective Classification with Explicit and Implicit Edges in Protein-Protein Interaction Networks. BMC Bioinformatics 14, S4. doi:10.1186/1471-2105-14-S12-S4

Xiong, W., Xie, L., Zhou, S., and Guan, J. (2014). Active Learning for Protein Function Prediction in Protein-Protein Interaction Networks. Neurocomputing 145, 44–52. doi:10.1016/j.neucom.2014.05.075

Yang, H., Wang, M., Liu, X., Zhao, X.-M., and Li, A. (2021). PhosIDN: an Integrated Deep Neural Network for Improving Protein Phosphorylation Site Prediction by Combining Sequence and Protein-Protein Interaction Information. Bioinformatics, btab551. doi:10.1093/bioinformatics/btab551

Ye, Y., and Godzik, A. (2004). FATCAT: a Web Server for Flexible Structure Comparison and Structure Similarity Searching. Nucleic Acids Res. 32, W582–W585. doi:10.1093/nar/gkh430

Zhu, H., Du, X., and Yao, Y. (2020). Convsppis: Identifying Protein-Protein Interaction Sites by an Ensemble Convolutional Neural Network with Feature Graph. Curr. Bioinform. 15, 368–378. doi:10.2174/1574893614666191105155713

Keywords: edge enrichment, network reconstruction, protein-protein interaction networks, protein function prediction, protein sequence annotation

Citation: Zhou J, Xiong W, Wang Y and Guan J (2021) Protein Function Prediction Based on PPI Networks: Network Reconstruction vs Edge Enrichment. Front. Genet. 12:758131. doi: 10.3389/fgene.2021.758131

Received: 13 August 2021; Accepted: 11 November 2021;
Published: 14 December 2021.

Edited by:

Liang Cheng, Harbin Medical University, China

Reviewed by:

Cheng Liang, Shandong Normal University, China
Yongjun Tang, Central South University, China

Copyright © 2021 Zhou, Xiong, Wang and Guan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jihong Guan, jhguan@tongji.edu.cn

^†These authors have contributed equally to this work
