WINNER: A network biology tool for biomolecular characterization and prioritization

Nguyen, Thanh; Yue, Zongliang; Slominski, Radomir; Welner, Robert; Zhang, Jianyi; Chen, Jake Y.

doi:10.3389/fdata.2022.1016606

ORIGINAL RESEARCH article

Front. Big Data, 04 November 2022

Sec. Medicine and Public Health

Volume 5 - 2022 | https://doi.org/10.3389/fdata.2022.1016606

1. Informatics Institute in School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, United States
2. Department of Biomedical Engineering, The University of Alabama at Birmingham, Birmingham, AL, United States
3. Comprehensive Arthritis, Musculoskeletal, Bone and Autoimmunity Center (CAMBAC), School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, United States

Abstract

Background and contribution:

In network biology, molecular functions can be characterized by network-based inference, or “guilt-by-association.” PageRank-like tools have been applied to biomolecular interaction networks to estimate the relative significance of every molecule in the network. However, widely accessible data sets for gene-to-gene associations or protein-protein interactions contain a great deal of inherent noise. Developing robust tests to expand, filter, and rank molecular entities in disease-specific networks remains an ad hoc data analysis process.

Results:

We describe a new biomolecular characterization and prioritization tool called Weighted In-Network Node Expansion and Ranking (WINNER). It takes any molecular interaction network data as input and generates an optionally expanded network with all the nodes ranked according to their relevance to one another in the network. To help users assess the robustness of results, WINNER provides two different types of statistics. The first type is a node-expansion p-value, which helps evaluate the statistical significance of adding “non-seed” molecules to the original biomolecular interaction network consisting of “seed” molecules and molecular interactions. The second type is a node-ranking p-value, which helps evaluate the relative statistical significance of the contribution of each node to the overall network architecture. We validated the robustness of WINNER in ranking top molecules by spiking noise into several network permutation experiments. We found that node degree–preservation randomization of the gene network produced normally distributed ranking scores, which outperform those made with other gene network randomization techniques. Furthermore, we validated that a more significant proportion of the WINNER-ranked genes was associated with disease biology than of the genes ranked by existing methods such as PageRank. We demonstrated the performance of WINNER with a few case studies, including Alzheimer's disease, breast cancer, myocardial infarction, and triple negative breast cancer (TNBC). In all these case studies, the expanded and top-ranked genes identified by WINNER reveal disease biology more significantly than those identified by other gene prioritizing software tools, including Ingenuity Pathway Analysis (IPA) and DIAMOnD.

Conclusion:

WINNER ranking strongly correlates with other ranking methods when the network covers sufficient node and edge information, indicating a high network quality. Users can apply WINNER to robustly evaluate a list of candidate genes, proteins, or metabolites produced from high-throughput biology experiments, as long as gene/protein/metabolic network information is available.

Introduction

Gene prioritization from large-scale omics projects is a central topic in disease biology (Huang H. et al., 2009). Manual searches of the literature and publicly annotated databases (Gene Ontology et al., 2013; Kanehisa et al., 2017; Tyner et al., 2017) for genes associated with a particular disease or biological process can be biased, because they are limited to existing knowledge. Sifting through hundreds or thousands of genes, or genetic variations associated with genes, from genomic studies can also be daunting (Moreau and Tranchevent, 2012); e.g., even a search for genes associated with cardiac arrhythmia (Rajab et al., 2010) within a 2-Mb region of chromosome 17 may return 77 candidate genes. For many biologists, the lack of a ranking of genes based on biological relevance to the disease context is an experience analogous to searching web content in the pre-Google days of the Internet. With the influx of data from large-scale sequencing projects (Schlotterer et al., 2014), bioinformatics users increasingly count on good gene prioritization to help them generate biological hypotheses (Chen et al., 2006a; Hale et al., 2012), find potential disease biomarkers (Saha et al., 2008; Zhang and Chen, 2010, 2013), and identify candidate drug targets (Chen et al., 2006b, 2013; Li et al., 2009; Muhammad et al., 2017). However, as datasets continue to become larger and more heterogeneous, statistical (Subramanian et al., 2005; Aerts et al., 2006; Cantor et al., 2010) and text-mining (Krallinger et al., 2008; Liu et al., 2015; ElShal et al., 2016) approaches to gene prioritization lack sufficient precision in the biological knowledge context. For example, surveys of PAGER (Yue et al., 2018) for genes associated with the response of breast cancer to doxorubicin treatment may retrieve more than 2,000 statistically significant genes with MSigDB (Liberzon et al., 2015), or 234 candidate genes with the online text-mining platform Beegle (ElShal et al., 2016).
The use of statistical p-values to prioritize retrieved genes can mislead biology users who assume that a gene's statistical significance in samples equates to its true biological significance relative to the other genes in the experiment (Kim and Bang, 2016).

To overcome the limitations of gene prioritization in practice, bioinformatics researchers have developed gene network models with which they perform knowledge-based gene prioritization and novel candidate gene identification (Chen et al., 2006a; Cowen et al., 2017). A molecular network consists of nodes (e.g., proteins) linked by edges that represent the pairwise interactions between nodes, forming a convenient computational model that is easy to interpret and has been widely used to discover (and rediscover) disease-specific genes and potential targets for treatment (Chen et al., 2009; Wu et al., 2009; Erten et al., 2011; Gottlieb et al., 2011; Guney and Oliva, 2012; Singh-Blom et al., 2013; Smedley et al., 2014; Peters et al., 2017; do Valle et al., 2018). Network-based methods also enable researchers to integrate data from a wide variety of sources, including analyses of gene-gene similarity (Alvarez-Ponce et al., 2013), proteomic interactions (Rolland et al., 2014), and regulatory pathways (Li and Campos, 2015); however, the results of prioritization strongly depend on the input gene list (Antanaviciute et al., 2015), and the list is often derived from existing databases that may lack important genes because of statistical errors or human errors during annotation. For example, acetylcholinesterase (ACHE), which is commonly associated with β-amyloid plaques and neurofibrillary tangles in the brains of patients with Alzheimer's Disease (AD; Talesa, 2001), is not among the annotated genes for AD in the KEGG database (Kanehisa et al., 2017).
Input lists may also be compromised by redundancy, which can be generated from at least two sources: (1) the inclusion of genes that were falsely identified during the statistical analysis of an experiment (Yu et al., 2017), and (2) the expansion of the list, in an attempt to increase comprehensiveness, to include the gene for a “hub” protein that interacts with dozens, or even hundreds, of other proteins [e.g., ubiquitin C binds to 4,658 other molecules (Chen et al., 2017)] and, consequently, is unlikely to be specific for the phenotype of interest. Furthermore, the statistical significance of a ranking is typically calculated via comparison to the rankings from a randomized version of the original network, but since the randomized network is often created by adding or deleting a small number of gene-gene interactions (i.e., increasing noise), or via total network permutation (Xie et al., 2015; Guala and Sonnhammer, 2017), much of the topology of the original network may be lost.

Related works

According to Bromberg (2013), molecular-interaction-based disease gene prioritization started in the early 2000s with pioneering techniques such as G2D (Perez-Iratxeta et al., 2002). In principle, statistical analysis of patients' genetic data yields hundreds of disease-associated genes. These genes often belong to an interaction network (Sun and Zhao, 2010), which is also called a “disease pathway.” If we assume that the disease phenotypes arise from a disturbance at any point of the pathway, then a disturbance of the “most influential” genes is the most likely cause of the disease. Given a good disease pathway, network ranking algorithms, especially the eigenvector-based [Random Walk (Smedley et al., 2014) and PageRank (Page et al., 1999)] and centrality-based [betweenness centrality (Newman, 2005)] ones, can be used to prioritize the genes. This idea can also be applied to analyze key regulators in non-disease-specific biological processes. However, the pathways are usually incomplete: some disease regulators have not yet been discovered, and some interactions among disease-associated genes have not yet been shown (Bromberg, 2013). Therefore, ranking techniques are required to extend the interaction network beyond the known disease-associated genes. Recent gene prioritization techniques have this ability. For example, DIAMOnD (Ghiassian et al., 2015) built a large network comprising genes related to 70 diseases, clustered the large network into multiple network modules, then assigned each network module to a disease; here, genes in the same module that are not yet related to the disease are added (extended) into the disease-specific network module for prioritization. Ingenuity Pathway Analysis (Kramer et al., 2014) extended the disease-specific pathway by statistically estimating the likelihood that a new gene interacts with the known disease-related genes.
In Node2Vec (Grover and Leskovec, 2016; Peng et al., 2019), a “global gene network,” which includes the known disease-specific genes, their direct interacting genes, and indirect interacting ones (optionally) was constructed; then, each gene is represented by a numerical vector having a fixed-length dimension to allow computing the cosine similarity between a known disease-specific gene and another gene; so, the extension can be made by choosing the genes having high cosine similarity to any of the disease-specific ones. Or, in GenePANDA (Yin et al., 2017), given a “global gene network” (similar to Node2Vec), for a specific gene, the average distance between itself and any other gene in the “global” network was subtracted by the average distance between itself and the known disease-specific genes; then, this difference was used to rank the genes.

Besides the network-based approach, gene prioritization could be performed using text mining and similarity profiling approaches (Yin et al., 2017). In the text mining approach, it is hypothesized that important genes are more likely to be mentioned in an article than non-important ones. Therefore, text mining tools, such as aBandApart (Van Vooren et al., 2007) and Gene Prospector (Yu et al., 2008), emphasize efficient queries in MEDLINE and other large literature collections to find important disease-specific genes. However, these approaches may not find important genes when the disease is not yet well-researched or when a new disease model (i.e., a new cell line or new organoid) is built to represent the disease. On the other hand, similarity profiling defines the similarity among the genes according to the disease-related information; then, if a novel gene shares a high similarity with genes that are known to be important, the novel gene will be ranked highly. For example, Endeavour (Aerts et al., 2006) and ToppGene (Chen et al., 2009) integrated multiple disease-omic databases with a machine-learning model; the model was trained to distinguish the known-important genes from non-important genes, and it produces a ranking score reflecting how important a novel gene is relative to the already known ones. Meanwhile, the disease-specific gene expression and correlation matrix can be clustered or given a latent representation, as in Pinta (Nitsch et al., 2011), Maxlink (Guala et al., 2014), and Genefriends (van Dam et al., 2012), where the well-known disease-specific genes are expected to concentrate in one or a few clusters/latent modules, and the novel genes in these clusters or modules would be ranked highly.

Here, we introduce a new ranking method, Weighted In-Network Node Expansion and Ranking (WINNER), that addresses many of the current limitations of network-based gene prioritization methods. As with PageRank (Winter et al., 2012) and many other gene prioritization techniques, the ranking engine of WINNER uses random-walk principles (Zhao et al., 2015). However, WINNER was designed to address the following three specific network biology tasks: (1) perform gene prioritization in a weighted biomolecular association network, (2) identify upstream regulators and targeted genes (i.e., “upstream” ranking), or (3) identify downstream effector molecules that are specific for a particular disease or phenotype (“downstream” ranking). WINNER can generate a ranking score for each input gene, derive optional genes that are “expanded” from the original seed gene lists, and provide two different statistics for users: (1) the gene expansion p-value (p_e) for adding a gene to the network, which addresses both incomprehensiveness and redundancy; and (2) the gene ranking p-value (p_r), which represents the significance of the ranking when compared to the randomized network. Furthermore, we found that compared to total network permutation (Xie et al., 2015; Guala and Sonnhammer, 2017), modularity-preserving randomization (Cowen et al., 2017) produces a randomized network that is topologically similar to the original network and yields a more normal distribution of ranks (Espinoza, 2012). We further demonstrated the benefit of WINNER in interpreting omics study results with the following case studies: (1) ranking genes that are genetically associated with Alzheimer's disease (AD); (2) ranking breast-cancer survival-related genes (Lanczky et al., 2016); (3) ranking differentially expressed genes involved in myocardial injury in pigs for their potential roles in myocardial regeneration (Eschenhagen et al., 2017).
In all these studies, we discuss how our prioritization score and the statistics associated with high-ranked genes enable biology users to derive new insights and hypotheses worth further experimental investigation.

Methods

For this work, we postulated (1) that the seeded (i.e., input) genes consist of (but are not limited to) differentially expressed genes identified in a wet-lab experiment, genes in a well-curated pathway, and phenotype-associated genes mined from the literature; and (2) that genes added to the expanded network (i.e., “expansion genes”) would have significantly more interactions with seeded genes (i.e., “seeded interactions”) than with non-seeded genes. WINNER begins with the set of seeded genes and a collection of gene-gene interactions, iteratively applies network ranking for gene prioritization, and expands the ranked list of genes one gene at a time (Supplementary Figure 1). Each gene-gene interaction has a confidence score (scaled between 0 and 1), which is commonly included in interactome databases (Chatr-Aryamontri et al., 2013; Szklarczyk et al., 2015); however, if a confidence score is not available, then the confidence score is set to 1 for all interactions. Network ranking is first applied to the seeded genes and the interactions among them (S₀ metric, Equation 1); then, genes adjacent to the seeded genes are filtered for significant interactions with the seeded genes (p_e) to identify candidates for the expanded network. The identified candidate is added to the ranked list, and network ranking is re-applied to initiate the next iteration of the cycle. A more detailed description of each step is provided below.

Ranking genes in the network by WINNER

Undirected networks

Given a gene-gene association network, the genes are ranked as in Supplementary Video 1. First, WINNER assigns an initial score (S₀) to the genes, according to Yue et al. (2017):

where i represents the gene index, w(i) is the sum of the confidence scores (normalized to between 0 and 1) for all gene-gene interactions associated with i, and I(i) is the number of gene-gene interactions associated with i. Here, larger confidence scores imply stronger associations. Second, WINNER iteratively updates the gene score by applying the Random Walk technique (Page et al., 1999):

where s is the random walk damping parameter [set to s = 0.85 as described (Page et al., 1999)], c(j, i) represents the confidence score of the interaction between gene i and gene j, and t is the index of iteration (starting at 1); S = 0 for genes that are outside the network but appear in the collection of gene-gene interactions. PageRank theory (Page et al., 1999) demonstrates that S_t converges (|S_t − S_{t−1}| → 0) if t is large enough, so the iterative cycle was continued until |S_t − S_{t−1}| < 0.001.
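The iterative update described above can be illustrated with a generic weighted random walk. The sketch below is not the WINNER update itself, because Equations 1 and 2 are not reproduced in this text: it assumes the standard weighted PageRank form, a uniform starting score instead of S₀, strictly positive confidence scores, and a maximum absolute change as the convergence measure. The function name and the edge-triple input format are hypothetical.

```python
from collections import defaultdict


def weighted_random_walk(edges, s=0.85, tol=1e-3, max_iter=200):
    """Generic weighted PageRank-style iteration on an undirected network.

    edges: iterable of (gene_a, gene_b, confidence) with confidence > 0.
    s: damping parameter (0.85, as in the text); tol: stopping threshold (0.001).
    Assumption: this is the textbook update, not the paper's Equation 2.
    """
    # Symmetric weighted adjacency map.
    adj = defaultdict(dict)
    for a, b, conf in edges:
        adj[a][b] = conf
        adj[b][a] = conf
    genes = sorted(adj)
    n = len(genes)
    # Summed confidence of all interactions of each gene.
    w = {g: sum(adj[g].values()) for g in genes}
    # Uniform start (assumption; the paper starts from S0 of Equation 1).
    score = {g: 1.0 / n for g in genes}
    for _ in range(max_iter):
        new = {}
        for i in genes:
            # Each neighbour j passes on its score in proportion to c(j, i) / w(j).
            inflow = sum(score[j] * c / w[j] for j, c in adj[i].items())
            new[i] = (1 - s) / n + s * inflow
        delta = max(abs(new[g] - score[g]) for g in genes)
        score = new
        if delta < tol:
            break
    return score


if __name__ == "__main__":
    toy = [("A", "B", 0.9), ("A", "C", 0.8), ("A", "D", 0.7)]
    for gene, value in sorted(weighted_random_walk(toy).items()):
        print(gene, round(value, 3))
```

In this toy star network the hub gene A receives the highest score, and the scores sum to 1 because every gene redistributes all of its score to its neighbours.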

Directed networks

Directed networks, such as networks of regulatory pathways, include more annotation than undirected networks. Thus, we adapted the definitions of terms in Equations 1, 2 so that WINNER could be used to (for example) infer upstream regulatory and downstream effector genes (Kramer et al., 2014). For “upstream” ranking, i is the regulatory gene and j is the gene regulated by i; thus, w(i) is the sum of the confidence scores for all gene-gene relationships that i regulates, I(i) is the number of gene-gene relationships regulated by i, and c(j, i) is the confidence score for the regulation of j by i. For “downstream” ranking, i is the regulated gene and j is the gene that regulates i; thus, w(i) is the sum of the confidence scores for all gene-gene relationships in which i is regulated, I(i) is the number of gene-gene relationships in which i is regulated, and c(j, i) is the confidence score for the regulation of i by j.

Statistical significance of gene ranking

To evaluate the statistical significance (p-value) of the gene ranking, we determined how likely it is that the converged value of S (by default, S₂₀₀) in Equations 1, 2 is higher than in random networks. Randomization was performed in Matlab with degree-preservation (Espinoza, 2012; Tiong and Yeang, 2019) to maintain the topological characteristics of the original gene-gene network; however, the technique only generates unweighted relationships, so weights were randomly assigned from the distribution of relationship weights in the original network. One thousand random networks were generated, and the ranking scores (S₂₀₀) of the genes in the random networks were normally distributed (as validated via the Chi-square goodness-of-fit test). Thus, the ranking p-value (p_r) for each gene i was calculated by using the normal distribution parameter estimates [μ(i), σ(i)] (Bowman and Azzalini, 1997):

which is equivalent to computing the two-tailed p-value for a normal distribution.
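A two-tailed normal p-value of this kind can be computed from the scores a gene receives in the random networks. The sketch below is a minimal illustration, assuming the mean and sample standard deviation of the random scores are used as the normal parameters; the function name and inputs are hypothetical and the numbers are made up.

```python
import math
from statistics import mean, stdev


def ranking_p_value(observed_score, random_scores):
    """Two-tailed p-value of observed_score under a normal fit to random_scores."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (assumption)
    z = (observed_score - mu) / sigma
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    background = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative random-network scores
    print(ranking_p_value(3.0, background))
    print(ranking_p_value(10.0, background))
```

A score equal to the random mean gives p = 1, and scores far from the mean in either direction give small p-values.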

Filtering candidates for expansion

We chose two hypergeometric tests that are common practice in annotation (Huang et al., 2009). First, we tested the likelihood of the candidate expansion gene having a seeded interaction relative to its total number of interactions. Second, we tested the likelihood of the candidate expansion gene having seeded interactions relative to the seeded interactions of its most similar seeded gene, with similarity determined by node degree. Thus, we calculated two p-values for each expansion gene j from the “overrepresented” point of view (Beissbarth and Speed, 2004; terms are defined in Supplementary Figure 2):

Test 1:

Test 2:

in which the double-line bracket operator represents the combination operator:

Genes for which both p_1e(j) < 0.05 and p_2e(j) < 0.05 were chosen as candidates for expansion. Thus, the expansion p-value (p_e) for each gene j is defined by the equation p_e(j) = max [p_1e(j), p_2e(j)].
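The filtering rule above can be sketched with a generic over-representation (upper-tail hypergeometric) test. The exact terms of Test 1 and Test 2 are defined in Supplementary Figure 2 and are not reproduced here, so the arguments below (population size N, successes K, draws n, observed successes k) are generic placeholders, and both function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n)."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / total


def expansion_decision(p1, p2, alpha=0.05):
    """Return (p_e, is_candidate): p_e = max(p1, p2); both tests must pass alpha."""
    return max(p1, p2), (p1 < alpha and p2 < alpha)


if __name__ == "__main__":
    p1 = hypergeom_upper_tail(k=3, N=10, K=3, n=4)
    print(p1, expansion_decision(p1, 0.01))
```

Taking the maximum of the two p-values means a gene is reported as significant only when it passes the stricter of the two tests.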

Selecting one candidate for expanded ranking

Since there will likely be more than one candidate expansion gene remaining after filtration, WINNER estimates which of the candidates should be added to the network by calculating an expansion score (e) from the confidence score of the interaction between the candidate gene and the ranked genes, and the ranking score (S) of the ranked genes:

where i is the candidate expansion gene, j represents all seeded genes that interact with the candidate expansion gene, and W(j) is the sum of the confidence scores for all interactions involving all seeded genes. Note that W(j) differs from w(j) in Equation 2, because w(j) is restricted to interactions among ranked genes.

Informatics databases and benchmarking metrics

Correlations among WINNER, PageRank (Winter et al., 2012), dual node-edge ranking (Wang et al., 2015), eigenvector centrality, betweenness centrality, node degree, and clustering coefficient (Newman, 2008) were evaluated by computing the linear correlation coefficients and p-values with Matlab (Neupane and Kiser, 2018).

For analyses of upstream and downstream genes (directed network), genes were distributed into layers via the breadth-first-search approach, and groups of genes that formed a self-contained cycle were treated as a single node. Results were visualized with boxplots. In each pathway, the gene rank numbers were converted into percentiles: the first rank (number 1) was converted to the 100th percentile, while the last rank was converted to the 0th percentile. The percentile format allowed boxplot aggregation across multiple pathways with different numbers of genes.
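The rank-to-percentile conversion can be illustrated as follows. The text fixes only the two endpoints (rank 1 maps to 100% and the last rank to 0%); the linear spacing of intermediate ranks and the function name are assumptions of this sketch.

```python
def rank_to_percentile(rank, n_genes):
    """Map rank 1..n_genes to a percentile, with rank 1 -> 100 and last rank -> 0."""
    if not 1 <= rank <= n_genes:
        raise ValueError("rank must lie between 1 and n_genes")
    if n_genes == 1:
        return 100.0  # a single gene is trivially top-ranked
    # Linear interpolation between the two endpoints (assumption).
    return 100.0 * (n_genes - rank) / (n_genes - 1)


if __name__ == "__main__":
    print([rank_to_percentile(r, 5) for r in range(1, 6)])
```

Because the result no longer depends on the pathway size, percentiles from pathways with different numbers of genes can be pooled in one boxplot.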

Experiments demonstrating the general topological and biological significance of the WINNER ranking were conducted with the small gene set associated with AD from KEGG release 50 (2009) (Kanehisa et al., 2010) and with undirected gene-gene interactions from HAPPI version 1.0 (Chen J. Y. et al., 2009). Rankings of upstream regulators and downstream effectors were conducted with all cancer disease pathways in KEGG release 85 (Kanehisa et al., 2017; Tessier et al., 2018) and gene-gene regulatory relationships from STRING v.10.5 (Szklarczyk et al., 2017).

The effectiveness of WINNER for identifying network-expansion genes was evaluated by using KEGG release 50 [stored in PAGER 1.0 (Yue et al., 2015)] as the input with interactions of all types (without directionality) from HAPPI v.2.0 whose confidence scores exceeded 0.75 (Chen et al., 2017), and then determining how closely the expanded network matched the updated KEGG release 85 (Kanehisa et al., 2017). For comparison, an analogous experiment was conducted with Ingenuity Pathway Analysis (IPA; Kramer et al., 2014), which (in theory) can be used for both upstream and downstream expansion, and HAPPI v.2.0. Precision, recall, and F1 scores were calculated via the following equations:

where E is the set of expansion genes determined by WINNER or IPA and U is the set of genes present in KEGG release 85 but not in KEGG release 50.
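With E and U defined as above, the three scores can be computed as in the sketch below. It assumes the standard set-based definitions of precision, recall, and F1 (the equations themselves are not reproduced in this text); the function name and the gene symbols in the example are illustrative only.

```python
def expansion_metrics(E, U):
    """Precision, recall, and F1 of expansion genes E against newly added genes U."""
    E, U = set(E), set(U)
    hits = len(E & U)  # expansion genes confirmed by the newer release
    precision = hits / len(E) if E else 0.0
    recall = hits / len(U) if U else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = {"GENE1", "GENE2", "GENE3", "GENE4"}  # hypothetical expansion genes
    added = {"GENE1", "GENE2", "GENE9"}               # hypothetical new annotations
    print(expansion_metrics(predicted, added))
```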

The biological relevance of our rankings was evaluated by (1) determining whether the top-ranked genes from WINNER ranking of the KEGG breast cancer pathway (Kanehisa et al., 2017; https://www.genome.jp/kegg-bin/show_pathway?hsa05224) were included among the genes correlated with survival in 3,951 breast cancer patients (Gyorffy et al., 2010); and (2) ranking the set of differentially expressed genes from a study of myocardial regeneration in neonatal pigs (Zhu et al., 2018) with WINNER and determining whether the top-ranked genes could contribute to cardiac repair and cardiomyocyte proliferation. For the analysis of breast-cancer survival genes, we calculated the ratio of the number of genes that were both significant (survival p-value < 0.05) in the breast cancer study (Gyorffy et al., 2010) and highly ranked by WINNER (i.e., scored above a defined threshold) to the number of highly-ranked genes.

Network randomization and testing for ranking normal distribution in random networks

In WINNER, given a network (also called the original network), we examined the following network randomization approaches to evaluate which network randomization approach was the most suitable for computing the ranking p-value for each gene:

Total rewiring (also called total network permutation; Waksman, 1968). To implement this approach, for each interaction (edge) in the original network, we randomly changed the two genes (node) connecting through this edge. Therefore, this approach preserves the number of interactions, yet it totally changes the network and gene topology.
Randomly drawing a new network such that each gene's degree is the same as it is in the original network (also called preserving degree; Rao et al., 1996). A gene's degree, put simply, is the number of other genes connected to that gene in the network.
Randomly drawing a new network with the same modularity as the original network (also called preserving modularity). We implemented this strategy according to the network modularity definition in Newman (2006). Modularity measures how likely the network can be partitioned into clusters of interacting genes.
Randomly adding 5% new interactions into the original network. These interactions were not reported in the gene-gene interaction databases.
Randomly removing 5% of the interactions from the original network.

For each network randomization approach, starting from the same original network, we repeated the randomization 10,000 times, yielding 10,000 different random networks. Then, applying WINNER (and other ranking algorithms) yielded 10,000 random ranking results for each gene. We tested whether these random rankings followed a normal distribution using the chi-square goodness-of-fit test (chi2gof)¹ in Matlab. In this test, a smaller chi-square (chi2) statistic indicates that the rankings are closer to normally distributed.
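Of the approaches listed above, degree-preserving randomization is commonly implemented with repeated double-edge swaps. The sketch below shows that general technique in Python; it is not the Matlab routine used in this work, and the function names, the swap count, and the toy network are assumptions for illustration. It needs at least two edges, and it handles topology only, so edge weights would still have to be reassigned separately.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


def degree_preserving_rewire(edges, n_swaps, seed=None):
    """Randomize an undirected simple network by double-edge swaps.

    Each accepted swap turns edges (a, b) and (c, d) into (a, d) and (c, b),
    which leaves the degree of every node unchanged.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # a shared node would create a self-loop or a no-op
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # reject swaps that would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


if __name__ == "__main__":
    ring = [(k, (k + 1) % 8) for k in range(8)]
    shuffled = degree_preserving_rewire(ring, n_swaps=20, seed=1)
    print(shuffled, degrees(shuffled) == degrees(ring))
```

Unlike total rewiring, this procedure keeps hub genes as hubs, which is why the randomized networks stay topologically closer to the original.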

Literature validation using co-citations from PubMed

Important disease-specific genes are often co-mentioned in a research article. Therefore, to demonstrate the significance of the genes related to a disease, we retrieved co-citations through the NCBI e-utils application programming interface (API; Sayers, 2008), which implements semantic searches of PubMed abstracts to report biomedical literature citations (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?). We used “pubmed” as the database input and the concatenated string of the candidate gene and the disease name as the term input. To identify the co-citation support for the WINNER scores, we separated the genes into two categories, with literature co-citation (k > 0) or without literature co-citation (k = 0), to find the differences between the WINNER scores. We applied the Kruskal-Wallis test to report p-values.

Biomedical case studies, data, and preprocessing

Cardiac regeneration dataset

For the cardiac regeneration case study, the bulk-RNA expression dataset was obtained from Zhang et al. (2020). Briefly, two groups of pig hearts were sent for sequencing when they reached postnatal days (P) 7, 14, and 28. In the first group, the pigs underwent myocardial infarction (a heart attack model) on postnatal day 1, and their hearts then fully recovered normal cardiac functionality with no scar. In the second group, the pigs did not undergo injury (sham control). For each group at each day (P7, P14, or P28), three pigs were sequenced. The bulk-RNA data were processed by applying trim-galore (Krueger, 2015) to trim the FASTQ reads, then STAR package v2.5.2 to map the reads to the pig genome (Dobin et al., 2013); the RNA transcripts were then counted using HTSeq version 0.6.1 (Anders et al., 2015). The gene expression was normalized, and fold-change was calculated using DESeq2 software (Love et al., 2014). Due to the small sample size (n = 3), the p-values for differentially expressed genes, compared between the two groups at P7, P14, and P28, were calculated using the approach in Bian et al. (2021). After comparing the two groups at these three postnatal time points, this process yielded 276 seed genes as input for WINNER. Then, these genes were queried in the HAPPI v2 database (Chen et al., 2017) to build their interaction network. These gene lists, their interactions, and the WINNER results are summarized in Supplementary Tables 1, 2.

Data processing of triple negative breast cancer (TNBC)

Triple negative breast cancer (TNBC) has been found in 15% of breast cancer cases and is characterized by tumor cells that lack expression of the following: human epidermal growth factor receptor 2 (HER2), estrogen receptor (ER), and progesterone receptor (PR; Liu et al., 2014; Ueda et al., 2019). Unfortunately, because of its nature, TNBC has a poorer prognosis than other types of breast cancer, and treatment options are limited (Xia et al., 2014; Eltohamy et al., 2018; Lu et al., 2020). While TNBC markers are already well-studied, finding the key disease regulators and promising targeted genes is still challenging (Nedeljkovic and Damjanovic, 2019). Therefore, we applied WINNER to explore novel answers to this question.

We took the triple negative breast cancer candidate genes from the University of Alabama at Birmingham Cancer data analysis Portal (UALCAN) database (Chandrashekar et al., 2022). In the comparison between the 116 triple negative breast cancer samples and 114 normal samples, UALCAN provided the top 250 up-regulated genes and 250 down-regulated genes selected by the t-test p-value. Next, we retrieved the protein-protein interactions (PPIs) at medium confidence (score ≥ 0.4) and extended the network by 100 genes using the STRING database. We ran WINNER and generated the gene ranking and p-values (Supplementary Tables 3, 4).

PubMed co-citation analysis of the WINNER ranked genes

We hypothesize that important disease-specific genes are often co-mentioned in a research article (Olsen et al., 2014); if so, WINNER high-ranking genes tend to be more co-cited in the literature than the low-ranking ones. Therefore, to demonstrate the significance of the genes related to a disease, we retrieved co-citations through the NCBI e-utils application programming interface (API; Sayers, 2008), which implements semantic searches of PubMed abstracts to report biomedical literature citations (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?). We used “pubmed” as the database input and the concatenated string of the candidate gene and the disease name as the term input. To identify the co-citation support for the WINNER scores, we separated the genes into two categories, the WINNER significant ranked genes (p-value ≤ 0.05) and the WINNER non-significant ranked genes (p-value > 0.05), to find the differences between the co-citations. We applied the Kruskal-Wallis test to report p-values for the differences in co-citations between significant and non-significant genes.
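A co-citation query of this kind can be built against the ESearch endpoint named above. The sketch below is a minimal illustration: joining the gene and disease with "AND" is an assumption (the text says only that the two strings were concatenated), `retmode=json` is used for convenient parsing, and both function names are hypothetical. No request is sent here; the payload in the example is a hand-written stand-in for a JSON response.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def cocitation_query_url(gene, disease):
    """Build an ESearch URL that asks PubMed for records mentioning gene and disease."""
    term = f"{gene} AND {disease}"  # joining rule is an assumption
    params = {"db": "pubmed", "term": term, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def cocitation_count(payload):
    """Extract the hit count k from a decoded ESearch JSON response."""
    return int(payload["esearchresult"]["count"])


if __name__ == "__main__":
    print(cocitation_query_url("TP53", "breast cancer"))
    # Stand-in payload with the same shape as an ESearch JSON reply.
    print(cocitation_count({"esearchresult": {"count": "42"}}))
```

The resulting count per gene is the quantity that is then compared between the two gene groups with the Kruskal-Wallis test.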

Pathway level assignment

We retrieved significantly enriched pathways from the PAGER 2.0 database (Yue et al., 2018) using WINNER highly ranked genes with p-values ≤ 0.05. We applied the following parameter set: the data sources were KEGG, WikiPathway, BioCarta, NCI-Nature Curated, Reactome, Protein Lounge, and Spike; the similarity was set to 0.05; and the FDR was set to 0.01. We constructed the regulatory (r-type) PAG-to-PAG network using the default r-type relationship score cutoff (=1). We performed a 5-step procedure in the pathway level assignment. Firstly, we calculated the shortest paths among the pairwise r-type PAG-PAG relationships. Secondly, we extracted the longest shortest path and assigned pathway levels from the upstream to the downstream pathway using 1 to n. Thirdly, we extended the level assignment to the remaining pathways using shortest distances; for example, if the current pathway is at level m and the shortest distance from an upstream pathway to the current pathway is 2, the upstream pathway is assigned level m − 2. Fourthly, we took the average of the levels assigned to each pathway. Fifthly, we repeated steps three and four until all the pathways had been assigned.

The correlation analysis of WINNER ranking and the enriched pathways using the exponential scale of top gene bins

First, we segregated the WINNER significant genes into 2^x bins. Second, we took the top 2^x gene bins (x ∈ [1, X]) and merged the genes to perform the enrichment analysis. Third, we subtracted the pathways enriched in the top 2^1, …, 2^(x−1) gene bins from the pathways enriched in the top 2^x gene bins to obtain the add-on pathways enriched in the top 2^x gene bins. Fourth, we mapped the levels from the r-type pathway-to-pathway relationships to the add-on enriched pathways in each top 2^x gene bin and plotted the curve of pathway levels vs. gene bins. Meanwhile, we performed a Pearson correlation analysis to report the correlation coefficient between the pathways' levels and the gene bins.
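A minimal Python sketch of the bin construction and the add-on step is given below. It assumes that "top 2^x gene bins" means the 2^x highest-ranked genes and that the enrichment results for each bin are already available as sets of pathway names; the function names are hypothetical, and the enrichment call itself (PAGER) is not reproduced.

```python
from math import sqrt


def top_gene_sets(ranked_genes, max_x):
    """Top 2^x genes for x = 1..max_x; ranked_genes is ordered best first."""
    return [ranked_genes[: 2 ** x] for x in range(1, max_x + 1)]


def add_on_pathways(enriched_per_bin):
    """Keep, for each bin, only pathways not enriched in any smaller bin."""
    seen, add_ons = set(), []
    for enriched in enriched_per_bin:
        add_ons.append(set(enriched) - seen)
        seen |= set(enriched)
    return add_ons


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

In use, `pearson` would be called on the bin index x and the pathway level of each add-on pathway, pairing every add-on pathway with the bin in which it first appears.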

Results

Characteristics of WINNER ranking

WINNER ranking of undirected networks

When genes in the KEGG [release 50, stored in the PAGER 1.0 database (Yue et al., 2015)] AD pathway (Supplementary Figure 3) were ranked via WINNER gene prioritization, our results were strongly correlated with those obtained via analyses of both eigenvector (Newman, 2008; p = 1.45 × 10⁻³⁹) and node-betweenness (Newman, 2008; p = 1.67 × 10⁻¹¹) centrality, but not with the clustering coefficient (Newman, 2008; p = 0.22). Similar patterns of correlation with eigenvector (Newman, 2008) and betweenness centrality (Newman, 2005) were obtained with two other state-of-the-art network-based ranking techniques, PageRank (Winter et al., 2012) and dual node-edge ranking (dual rank; Wang et al., 2015) (Figure 1), and all three ranking techniques were strongly correlated with node degree. Notably, the clustering coefficient, but no other metric or technique, failed to identify some of the most important markers for Alzheimer's, including Amyloid Beta Precursor Protein (A4 or APP; Jonsson et al., 2012), Caspase 8 (CASP8; Wei et al., 2002), Caspase 3 (CASP3; D'Amelio et al., 2011), and Presenilin 1 (PSN1; La Bella et al., 2004). Thus, WINNER was at least equivalent to other network topological metrics and well-established prioritization techniques for ranking genes in undirected biological networks.

Figure 1

The strong correlation between the WINNER and node-degree rankings prompted us to preserve the node degree and modularity during randomization. In the AD-associated gene network, the pairwise rank differences between the original network and the total-permutation random network were large (Figure 2A). When the random ranking differs too much from the original ranking, the random network topology is too different from the original network topology; thus, the random ranking may not be suitable for testing the statistical significance of the original ranking. In addition, compared with other randomization techniques (total network permutation, preserving modularity, or adding/removing 5% of edges), the distribution of rankings of AD-associated genes in the degree-preserved randomized network was significantly closer to normal (Figure 2B). Furthermore, the ranking distributions of two important AD-associated genes, A4 and Presenilin 1 (PSN1; Figures 2C,D), were clearly bell-shaped. Thus, rather than relying on the empirical p-value (Cornish et al., 2018) for gene rankings, we generated 1,000 node-degree–preserved randomized networks and calculated a ranking p-value (p_r) for all genes in all KEGG pathways. Notably, the rankings were much less likely to change in response to the addition of noise for genes with p_r < 0.05 than for genes with p_r ≥ 0.05, especially as the amount of noise increased (Figure 3). These observations suggest that when randomized networks are generated with node-degree preservation, fewer randomizations may be required to achieve adequate precision, and fewer noise simulations may be necessary to evaluate the robustness of the rankings.
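One common way to generate a node-degree–preserved randomized network is the double edge swap, sketched below in Python for an undirected, simple network. This is a generic technique shown for illustration; whether WINNER uses exactly this procedure, and how it additionally preserves modularity, is not stated here, and the function names and parameters are hypothetical.

```python
import random
from collections import Counter


def degrees(edges):
    """Node degree counts of an undirected edge list."""
    return Counter(node for edge in edges for node in edge)


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire edges a-b, c-d into a-d, c-b while keeping every node degree."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:  # bounded retries
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:          # pick one of the two possible rewirings
            c, d = d, c
        if len({a, b, c, d}) < 4:       # shared node would create a self-loop
            continue
        new_1, new_2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new_1 in edge_set or new_2 in edge_set:      # avoid duplicate edges
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = new_1, new_2
        done += 1
    return edge_list
```

Repeating this shuffle with different seeds would yield an ensemble of randomized networks, each of which can be re-ranked to build the null distribution of a gene's score.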

Figure 2

Figure 3

The accuracy of WINNER gene prioritization was evaluated by ranking genes in the KEGG breast cancer pathway (https://www.genome.jp/kegg-bin/show_pathway?hsa05224) and then determining whether the top-ranked genes correlated with the genes' effect on survival for patients with breast cancer, as estimated with an online Kaplan-Meier (Bland and Altman, 1998) tool that calculates the breast-cancer survival rates associated with more than 6,000 genes (Gyorffy et al., 2010). The KEGG breast cancer pathway contains 146 genes [annotated by UniProt Consortium (2018)], 62% of which significantly influenced patient survival, and a greater proportion of the most highly ranked genes were significantly associated with breast-cancer survival when prioritized with WINNER than with other gene prioritization techniques (PageRank and dual node-edge ranking; Figure 4). Furthermore, the precision of WINNER for retrieving survival-related genes (i.e., the proportion of retrieved genes that were significantly related to breast cancer survival) was even greater when restricted to genes with a ranking p-value of p_r < 0.05.

Figure 4

WINNER ranking of directed networks

WINNER ranking of directed networks was evaluated via WINNER upstream prioritization with all cancer disease pathways in KEGG release 85 (Kanehisa et al., 2017; KEGG, 2022) and the gene-gene regulatory relationships in STRING v.10.5 (Szklarczyk et al., 2017). Genes were distributed into layers using the breadth-first search approach (Wang et al., 2012), with genes coding for proteins that function further upstream in the pathways assigned to the lower-numbered layers. Thus, genes in the lowest-numbered layers tend to encode master regulatory molecules/receptors and first/second messengers, which are located where the signaling cascade originates (e.g., near the cell membrane; Koschmann et al., 2015), while genes with the highest layer numbers tend to encode downstream effector molecules that are closely associated with a specific disease phenotype, such as drug resistance in breast cancer (Johnston, 2006). Our results indicated that, with WINNER, genes in layers 1–3, the upstream layers of the pathways, were consistently ranked at higher percentiles than genes in the other, more downstream layers (Figure 5). This consistency was not observed when the genes were prioritized via equivalent (directed-network ranking) analyses with PageRank (Winter et al., 2012) and dual node-edge ranking (Wang et al., 2015). WINNER upstream overestimated the ranking of genes in layer 8, but this can likely be attributed to noise, because the layer contained only 12 ranked genes.
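The layering idea can be illustrated with a short breadth-first search in Python. The sketch assumes that layer 1 consists of the genes with no incoming regulatory edge and that each remaining gene takes the layer of its first discovery; the exact rules of the cited approach may differ, and the gene names in the usage note are placeholders.

```python
from collections import deque


def bfs_layers(edges):
    """Assign layer numbers (1 = most upstream) to nodes of a directed network.

    edges: iterable of (regulator, target) pairs.
    Nodes unreachable from any root (e.g. isolated cycles) are left unassigned.
    """
    nodes = {n for edge in edges for n in edge}
    targets = {target for _, target in edges}
    adj = {}
    for regulator, target in edges:
        adj.setdefault(regulator, []).append(target)

    roots = sorted(nodes - targets)      # assumption: no incoming edge = layer 1
    layer = {root: 1 for root in roots}
    queue = deque(roots)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in layer:           # first visit fixes the layer
                layer[v] = layer[u] + 1
                queue.append(v)
    return layer
```

For a toy cascade in which a receptor R activates kinases K1 and K2, both of which activate a transcription factor TF that regulates an effector E, the function places R in layer 1, the kinases in layer 2, TF in layer 3, and E in layer 4.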

Figure 5

WINNER network expansion and ranking upstream regulators

We demonstrated how WINNER could identify upstream regulators of two cancer pathways, Chronic Myeloid Leukemia (CML; https://www.genome.jp/kegg-bin/show_pathway?hsa05220) and hepatocellular carcinoma (https://www.genome.jp/pathway/hsa05225), that were missing from the existing pathways in KEGG but were present in the KEGG database itself. WINNER upstream prioritization distributed genes into five different layers for each pathway, and WINNER expansion added several highly ranked genes to both networks. Additions to the CML network (Figure 6) included JAK1/2/3 and proteins that participate in IL-2 (IL2, IL2RA, and IL2RB), IL-3 (IL-3, IL-3RA, and IL-3RB), and GM-CSF (CSF2) signaling, which is consistent with the JAK2/STAT5 pathway's status as one of the primary targets for treatment of CML (Valent, 2014), as well as evidence that STAT5 is phosphorylated by IL-2 (Kobayashi et al., 2014; Valent, 2014) and IL-3 (Jiang et al., 1999) signaling, and that GM-CSF is a crucial growth factor for myeloid cells; notably, several of these molecules are currently being investigated as therapeutic targets for CML treatment (Hercus et al., 2012; Broughton et al., 2014; Kobayashi et al., 2014). For the hepatocellular carcinoma pathway (Figure 7), WINNER expansion added KC1G2, a serine-threonine kinase that can activate TGF-β1/Smad signaling (Guo et al., 2008); TMED4, WLS, and PRCN, which mediate Wnt/β-catenin signaling (Guo et al., 2008; Martin-Orozco et al., 2019; Bland et al., 2021); and several genes for proteins in the FGF signaling pathway (FRS2, FRS3, KLB, and PLCG1; Gotoh, 2008; Gyanchandani et al., 2013; Wang et al., 2020), of which KLB is particularly important, because it functions as a co-receptor for the binding of FGF-19/21 to FGFR-1/4 (Yang et al., 2012). 
Thus, the genes added to the KEGG CML and hepatocellular carcinoma pathways by WINNER expansion have strong, well-established links to multiple binding partners that participate in the mechanisms associated with these diseases.

Figure 6

Figure 7

In addition, the correlation between the WINNER ranking and the rankings produced by other techniques, including Ingenuity Pathway Analysis (IPA; Kramer et al., 2014), DIAMOnD (Ghiassian et al., 2015), Random Walk (Smedley et al., 2014), Node2Vec (Grover and Leskovec, 2016; Peng et al., 2019), and GenePANDA (Yin et al., 2017), varies from −0.83 (negatively correlated) through −0.05 (insignificant correlation) to 0.74 (moderately positively correlated; Figure 6C). This result suggests that the major difference between the rankings of WINNER and the other techniques appears when the network expands beyond the seed genes. Thus, a network-expansion scenario provides a good benchmark for comparing WINNER with the other techniques.

Benchmarking WINNER ranking by retrieving newly updated genes in KEGG pathways

Gene prioritization algorithms are benchmarked by information retrieval experiments, such as those in Guala and Sonnhammer (2017) and Zhang et al. (2021), in which some important regulators are labeled “unknown” and the algorithms are executed to rank these “unknown-labeled” genes, with the expectation that these regulators are top-ranked. Thus, to benchmark WINNER, we set up the KEGG pathway retrieval experiment. Here, WINNER took a pathway from KEGG release 50 (2009 version; Kanehisa et al., 2010) as the seed genes and the gene-gene interactions (expanded network) in the HAPPI database (Chen et al., 2017) as the input; the WINNER expansion p-value (p_e) and WINNER score were calculated for the candidate genes to include in the KEGG release 50 pathway networks; then, the highly ranked non-seed (expanded) genes were compared with the same updated pathway network in KEGG release 85 (2017 version; Ogata et al., 1999; Kanehisa et al., 2017) as the ground truth. In this experiment, WINNER performance, quantified by precision, recall, and the F1 score, was compared with Ingenuity Pathway Analysis (IPA; Kramer et al., 2014), DIAMOnD (Ghiassian et al., 2015), Random Walk (Smedley et al., 2014), Node2Vec (Grover and Leskovec, 2016; Peng et al., 2019), and GenePANDA (Yin et al., 2017); these techniques were chosen according to Zhang et al. (2021). The same experiment was executed with each KEGG pathway, and the results were aggregated into error bars.

Our results indicated that the WINNER predictions had greater precision but lower recall (i.e., the proportion of newly incorporated genes that were retrieved by the prediction) than the predictions generated by the other methods compared (Figure 8). The WINNER predictions were also associated with a higher F1 score, which incorporates both precision and recall into a global measure of accuracy, when more than 60% of the extension candidates were examined. In addition, Figure 8 shows that recall is low (usually < 0.2) for all of the algorithms; therefore, precision should be prioritized when comparing the performance of these expansion algorithms.
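For reference, the three retrieval measures can be computed as in the following Python sketch, where `predicted` stands for the expanded genes proposed by a method and `truth` for the genes newly added to the pathway in the later KEGG release. The function and argument names are illustrative only.

```python
def retrieval_metrics(predicted, truth):
    """Return (precision, recall, F1) for a set of predicted genes.

    precision = retrieved-and-correct / retrieved
    recall    = retrieved-and-correct / all newly incorporated genes
    F1        = harmonic mean of precision and recall
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a method that proposes four genes, two of which are among six newly incorporated genes, has a precision of 0.5, a recall of about 0.33, and an F1 score of 0.4.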

Figure 8

WINNER ranking of differentially expressed genes in biological case-studies

WINNER ranking of genes involved in apoptosis and cell-cycle activity

The use of WINNER for prioritizing genes involved in cellular processes was evaluated with the KEGG apoptosis and cell-cycle pathways and node-degree–preserved network randomization. WINNER ranking p-values were highly significant for genes that participate in some of the most essential mechanisms of apoptosis, such as Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform (PIK3CA; p_r = 5.01 × 10⁻¹³); the Phosphatidylinositol 3-kinase regulatory subunit alpha (P85A; p_r = 1.34 × 10⁻¹²) and Cytokine receptor common subunit beta (IL3RB; p_r = 4.60 × 10⁻¹²); and genes for several proteins of the cytoskeleton (actin, p_r = 1.94 × 10⁻¹⁰⁴; Tubulin, p_r = 1.94 × 10⁻¹⁰⁴; B4DZT3, p_r = 8.71 × 10⁻⁸⁷; Lamin A/C, p_r = 8.17 × 10⁻⁸⁷; Lamin B1, p_r = 8.17 × 10⁻⁸⁷; actin-G, p_r = 5.15 × 10⁻⁶³), which is substantially reorganized to produce the characteristic shrunken morphology of apoptotic cells; notably, actin and actin-binding proteins also initiate and regulate apoptosis (Desouza et al., 2012). However, the KEGG apoptosis pathway also includes genes for a number of proteins that participate in IL-3– and NGF-signaling (IL-3, IL-3R, and NGF), which are nonessential (or even irrelevant) for apoptosis, and the ranking p-values calculated for these genes were not significant (p_r = 0.18). Similarly, genes in the KEGG cell-cycle pathway that encode proteins directly involved in DNA replication and cell division had highly significant ranking p-values (Cell Division Cycle 14B, p_r = 9.5 × 10⁻²⁹⁷ and 14A, p_r = 2.28 × 10⁻²²) whereas the ranking p-values for genes that participate in TGF-β signaling were nonsignificant (TGF-β, p_r = 0.29; SMAD2, p_r = 0.29; SMAD3, p_r = 0.29; SMAD4, p_r = 0.29), which is consistent with the role of TGF-β in cell-proliferation: it interacts with many components of the cell cycle pathway but generally inhibits proliferation in non-mesenchymal cells.
Collectively, these observations demonstrate that the WINNER ranking p-value can be a useful guide for distinguishing between genes that are essential or nonessential participants in a particular cellular process.

WINNER ranks important signaling pathway markers in mammalian pig heart regeneration

The hearts of adult mammals cannot regenerate myocardial tissues that are lost to injury; however, when myocardial infarction (MI) was induced in the hearts of one-day-old piglets, the animals recovered with no significant loss of cardiac function and little evidence of myocardial scarring (Zhu et al., 2018). Thus, to identify genes that may contribute to mammalian cardiac regeneration, we used WINNER to rank the list of differentially expressed genes from piglets that had or had not undergone surgically induced MI on postnatal day 1 for a previous report (Zhang et al., 2020; Figure 9, Supplementary Table 1). Here, we used the HAPPI version 2 database (Chen et al., 2017) to build the network connecting these genes. The two top-ranked genes (FN1 and JAK3) encoded fibronectin, which is required for cardiac regeneration in zebrafish (Wang et al., 2013), and Janus kinase 3 (JAK3), which has been shown to protect against ischemia-reperfusion injury (Kubin et al., 2011); notably, JAK3 also interacts with oncostatin-M, which is encoded by the tenth-highest WINNER-ranked gene (OSM) and is a primary factor in cardiomyocyte dedifferentiation and remodeling (Singh et al., 2016; Doll et al., 2017). Also among the top 10 were genes encoding subunits of the essential matrix proteins integrin alpha (ITGA8) and beta (ITGB4), which are differentially expressed in adult and fetal cardiac fibroblasts and involved in chamber specification of zebrafish hearts (Singh et al., 2016; Doll et al., 2017), while the 11th-ranked gene, THBS3, encodes another extracellular matrix protein, thrombospondin 3, which is a critical [and clinically relevant (Mustonen et al., 2013)] regulator of cell-cell and cell-matrix signaling that appears to impede integrin function and contribute to injury-induced cardiomyopathy in mice (Costa et al., 2014; Porrello and Olson, 2014; Puente et al., 2014).
Other genes ranked among the top 20 by WINNER included the nitrous-oxide–related genes NCF2 and NCF4, and the gene for vasopressin receptor 2 (AVPR2), which collectively modulate the cellular environment to promote cardiac regeneration (Costa et al., 2014; Porrello and Olson, 2014; Puente et al., 2014); ERBB3, which encodes a tyrosine kinase that appears to be crucial for embryonic development (Erickson et al., 1997); and genes for a dynamin protein (DNM1) and a Rho GTPase (RND2), which suggests that at least some of the mechanisms of mammalian myocardial regeneration are mediated by vesicle-based signaling.

Figure 9

WINNER ranking reflects the important genes supported by co-citations and reveals the upstream events in the r-type pathway-to-pathway network in a triple negative breast cancer (TNBC) study

We found 72 significant genes ranked by WINNER (p-value ≤ 0.05), with WINNER scores ranging from 7.4 to 92.5; the WINNER scores of the remaining nonsignificant genes ranged from 0 to 68.7. The co-citation analysis showed that the numbers of “triple negative breast cancer” co-citations differed significantly between the significantly ranked genes and the nonsignificantly ranked genes (Kruskal-Wallis test, p-value = 0.027; Figure 10). This result suggests that WINNER's high-ranking genes are more likely to lead to biological insights than WINNER's low-ranking genes.

Figure 10

To explore new insights among the high-ranking genes, we performed pathway analysis and built the pathway-to-pathway regulatory networks from these genes using the PAGER tool (Yue et al., 2018). The WINNER significantly ranked genes regulated many pathways and processes implicated in TNBC. We observed that pathways enriched in higher-ranked genes are more likely to be on the upstream side of the regulatory (r-type) enriched pathway-to-pathway network. In general, the add-on pathway levels were positively correlated with the ranked gene bins, with a Pearson correlation coefficient of 0.74 (Figure 11).

Figure 11

We found that the top ranked genes, TOP2A, CDK1, PLK1, and UBE2C, were enriched in the cell cycle related pathways, such as “Phosphorylation of Cyclin B1 in the CRS domain,” “Regulation of mitotic cell cycle,” “Mitotic Metaphase and Anaphase,” and “Free APC/C phosphorylated by Plk1.”

Topoisomerase II alpha (TOP2A) can be a useful gene for determining whether TNBC patients will respond well to anthracycline therapy, which is the mainstay treatment for TNBC (Brase et al., 2010; Di Leo et al., 2011; Eltohamy et al., 2018). Both Eltohamy et al. and Di Leo et al. found that patients with aberrant expression of TOP2A respond better to anthracycline treatment (Di Leo et al., 2011; Eltohamy et al., 2018).

Cyclin dependent kinase 1 (CDK1) plays a critical role in how the cell cycle is regulated, specifically during mitosis. Liu et al. used nanoparticles carrying siRNA to target CDK1, which successfully inhibited a TNBC cell line that had been injected into mice (Liu et al., 2014). Xia et al. found that a CDK1 inhibitor can inhibit the growth of TNBC cells by arresting them in the G2/M phase (Xia et al., 2014).

Polo like kinase-1 (PLK1) is one of the key regulators of the cell cycle. Targeting and knocking out PLK1 causes TNBC tumor cells to arrest in the G2/M phase of the cell cycle (Ueda et al., 2019; Zhao et al., 2021; Patel et al., 2022). Morry et al. found that a nanoparticle carrying siRNA targeting PLK1 can inhibit the growth of a TNBC tumor cell line (Morry et al., 2017). Patel et al. used the allosteric inhibitor RK-10 to target PLK1 in TNBC cell lines, which inhibited growth through the S phase and G2/M (Patel et al., 2022).

Overexpression of the ubiquitin-conjugating enzyme UBE2C can play a role in the pathogenesis of TNBC (Chou et al., 2014; Kim et al., 2019). Chou et al. found that UBE2C is highly expressed in cancer tissue cells and that tumor cells stopped proliferating when UBE2C was targeted with siRNA (Chou et al., 2014).

Discussion and conclusion

In this paper, we introduce WINNER, a new network-based ranking tool that addresses several of the limitations associated with other gene prioritization techniques. Our novel use of node-degree–preserved and modularity-preserved randomization produced randomized networks that retained some of the original network topology and yielded more normally distributed rankings, which increased the precision and robustness of our ranking p-value (p_r) calculations, while the expansion p-value (p_e) better accommodated the incompleteness and redundancy of the input gene list. However, WINNER rankings were not well-correlated with the clustering coefficient, which represents the presence of network cliques (Newman, 2008; i.e., semi-isolated groups of genes that collectively function like a single node); this suggests that WINNER ranking may be somewhat compromised in dense networks, such as those containing families of proteins, where the scale-free property (Timar et al., 2016) does not apply. Nevertheless, many biological networks are scale-free (Khanin and Wit, 2006), and since degree-preserved randomization tends to produce near-normal ranking distributions, the WINNER p_r value is likely more accurate than the empirical p-value, even for networks that are not perfectly scale-free.

WINNER network ranking belongs to the “eigenvector ranking” (Newman, 2008) class of algorithms. Therefore, it has the same “big-O” computational cost as PageRank [O(N³), where N is the number of network genes] if implemented using iterative matrix multiplication. However, this class of algorithms can be implemented in parallel, which significantly reduces the computational time in practice.
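To illustrate what an iterative eigenvector-style ranking looks like, the following Python sketch implements a plain PageRank power iteration over an adjacency list. It is the textbook PageRank update, not the WINNER scoring formula, and the damping factor of 0.85 and the fixed iteration count are conventional defaults assumed here.

```python
def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank.

    adj maps each node to the list of nodes it links to; every node must
    appear as a key. For an undirected network, list each edge both ways.
    """
    nodes = sorted(adj)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: (1.0 - damping) / n for u in nodes}   # teleportation term
        for u in nodes:
            neighbours = adj[u]
            if neighbours:
                share = damping * score[u] / len(neighbours)
                for v in neighbours:
                    nxt[v] += share
            else:                                       # dangling node: spread evenly
                for v in nodes:
                    nxt[v] += damping * score[u] / n
        score = nxt
    return score
```

Each node's update depends only on the previous iteration's scores, so the inner loop over nodes can be split across workers, which is the property that makes this class of algorithm straightforward to parallelize.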

The performance of gene network prioritization depends significantly on the disease (Zhang et al., 2021) or the biological case study. Therefore, we demonstrated WINNER's performance in various disease and biological study scenarios. The comprehensive KEGG pathway results reflect the case in which biological samples and expression data are lacking, so prioritization must be performed using only the network available from domain knowledge to generate hypotheses. The cardiac regeneration case study, which focuses on cardiomyocyte proliferation, is an example of a significant biological process, not a disease, that does not naturally happen in mature mammals (Porrello et al., 2011; Lam and Sadek, 2018; Ye et al., 2018; Zhu et al., 2018; Zhao et al., 2020; Nakada et al., 2021; Nguyen et al., 2022). In this case, the focus is on finding the regulatory mechanisms that create new cells and applying this knowledge in biomedical engineering research. The cancer and other disease case studies (leukemia, TNBC, and Vitamin D) are directly related to disease, and targeted therapies to kill cells are available or proposed. In this case, the focus is on finding markers, especially the “cell-killer” ones associated with disease outcomes, with less emphasis on the mechanisms that regulate growth. WINNER results are insightful in all of these cases, whereas whether other techniques yield insightful results across multiple kinds of studies is yet to be examined.

In conclusion, WINNER gene prioritization is generally more accurate and robust than other network-based prioritization techniques, such as PageRank and node-degree ranking, and can be effective for identifying genes that may be missing from established gene networks, for determining the relative position (i.e., upstream or downstream) of genes within a pathway, and for ranking a list of differentially expressed genes. The superior performance is linked to better retrieval precision when expanding the network beyond the seed genes. The case studies presented in this work represent a scenario in which new disease-specific gene-expression data have been generated and novel genes associated with the disease and phenotype are expected, so network expansion is required. In this expansion, WINNER emphasizes precision, where only a small set of highly relevant candidates is explored, over recall, where a more comprehensive set of candidate genes is explored but may include many irrelevant ones. Other methods tend to emphasize recall; they may therefore retrieve more candidates computationally but make it much more difficult for the user to choose the truly relevant ones. Also, having too many irrelevant genes in the network significantly affects the ranks of the well-known disease-specific genes. This scenario explains the advantage of WINNER over other methods. Future investigations are warranted to determine what additional biological insights can be obtained by using WINNER to rank genes that participate in other cellular processes, in metabolic regulatory pathways (Berkhout et al., 2013), and in co-expression networks (Radulescu et al., 2018).

Funding

The work was supported in part by internal University of Alabama at Birmingham research grants to JC, the National Institutes of Health grant award U54TR001005, on which JC serves as a co-investigator, and R01 award R01HL150078, on which RW serves as principal investigator and JC serves as co-investigator.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Statements

Data availability statement

The gene expression data used in this work are publicly available at the Gene Expression Omnibus database, accession number GSE144883, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE144883.

Author contributions

TN developed the algorithm, performed case studies, and wrote the manuscript. ZY and RS performed case studies and performed the literature validation of the results. ZY built the website. RW and JZ provided data and participated in the case studies. JC conceptualized the ideas, helped design the analytical experiments, and revised the final manuscript. All authors read, edited, and approved the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fdata.2022.1016606/full#supplementary-material

Supplementary Figure 1

Schematic diagrams of WINNER gene prioritization and network expansion. (a) Seeded genes (green) and candidate expansion genes (yellow) are assembled into a network as indicated by their pairwise interactions. (b) The expansion p-value (p_e) is calculated for each expansion-candidate gene; genes with p_e < 0.05 are further evaluated and added to the network, one gene at a time. (c) The expansion score (e) is calculated for the candidate expansion genes, and the highest-scored gene is added to the network; this process is repeated until all candidates are added or the expansion is halted (without adding all candidates). (d) After the expansion is complete, the statistical significance of the rankings is recalculated for the expanded network.

Supplementary Figure 2

WINNER filtering of candidate genes for network expansion. Red nodes represent seeded genes, open nodes represent candidate expansion genes, black lines represent interactions between two seeded genes, and gray lines represent interactions between one seeded gene and one expansion gene or between two expansion genes. Candidate genes for network expansion were filtered via two tests: (1) the likelihood of the candidate expansion gene (E.Gene) having a seeded interaction relative to its total number of interactions (bottom left table), and (2) the likelihood of the candidate expansion gene having seeded interactions relative to the seeded interactions of its most similar seeded gene (S.Gene), with similarity determined by node degree (bottom right table).

Supplementary Figure 3

WINNER ranking of the network of Alzheimer's disease pathways in KEGG release 50. The network graph was constructed with Cytoscape (Shannon et al., 2003) version 3.6.0 and the force-directed layout; the size of the node represents the WINNER score.

Supplementary Table 1

WINNER ranking for genes in cardiac regeneration dataset. The table includes gene symbol, the indication of whether a gene is a seeded (S) or expanded (E) gene, and WINNER score.

Supplementary Table 2

Gene-gene interaction network in the cardiac regeneration dataset.

Supplementary Table 3

WINNER ranking for genes in triple negative breast cancer (TNBC) dataset. The table includes gene symbol, the indication of whether a gene is a seeded (S) or expanded (E) gene, WINNER score, and p-value.

Supplementary Table 4

Gene-gene interaction network in triple negative breast cancer (TNBC) dataset.

Supplementary Video 1

The .cys (Cytoscape) file of the regulatory (r-type) pathway-to-pathway network in the triple negative breast cancer study.

Footnotes

1. chi2gof: Chi-square goodness-of-fit test (https://www.mathworks.com/help/stats/chi2gof.html).

References

1
Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., et al. (2006). Gene prioritization through genomic data fusion. Nat. Biotechnol. 24, 537–544. doi: 10.1038/nbt1203
2
Alvarez-Ponce, D., Lopez, P., Bapteste, E., McInerney, J. O. (2013). Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proc. Natl. Acad. Sci. U. S. A. 110, E1594–1603. doi: 10.1073/pnas.1211371110
3
Anders, S., Pyl, P. T., Huber, W. (2015). HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169. doi: 10.1093/bioinformatics/btu638
4
Antanaviciute, A., Daly, C., Crinnion, L. A., Markham, A. F., Watson, C. M., Bonthron, D. T., et al. (2015). GeneTIER: prioritization of candidate disease genes using tissue-specific gene expression profiles. Bioinformatics 31, 2728–2735. doi: 10.1093/bioinformatics/btv196
5
Beissbarth, T., Speed, T. P. (2004). GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20, 1464–1465. doi: 10.1093/bioinformatics/bth088
6
Berkhout, J., Teusink, B., Bruggeman, F. J. (2013). Gene network requirements for regulation of metabolic gene expression to a desired state. Sci. Rep. 3, 1417. doi: 10.1038/srep01417
7
Bian, W., Chen, W., Nguyen, T., Zhou, Y., Zhang, J. (2021). miR-199a overexpression enhances the potency of human induced-pluripotent stem-cell-derived cardiomyocytes for myocardial repair. Front. Pharmacol. 12, 673621. doi: 10.3389/fphar.2021.673621
8
Bland, J. M., Altman, D. G. (1998). Survival probabilities (the Kaplan-Meier method). Br. Med. J. 317, 1572. doi: 10.1136/bmj.317.7172.1572
9
Bland, T., Wang, J., Yin, L., Pu, T., Li, J., Gao, J., et al. (2021). WLS-Wnt signaling promotes neuroendocrine prostate cancer. iScience 24, 101970. doi: 10.1016/j.isci.2020.101970
10
Bowman, A. W., Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations, vol. 18. Oxford: Oxford University Press.
11
Brase, J. C., Schmidt, M., Fischbach, T., Sultmann, H., Bojar, H., Koelbl, H., et al. (2010). ERBB2 and TOP2A in breast cancer: a comprehensive analysis of gene amplification, RNA levels, and protein expression and their influence on prognosis and prediction. Clin. Cancer Res. 16, 2391–2401. doi: 10.1158/1078-0432.CCR-09-2471
12
Bromberg, Y. (2013). Chapter 15: disease gene prioritization. PLoS Comput. Biol. 9, e1002902. doi: 10.1371/journal.pcbi.1002902
13
Broughton, S. E., Hercus, T. R., Hardy, M. P., McClure, B. J., Nero, T. L., Dottore, M., et al. (2014). Dual mechanism of interleukin-3 receptor blockade by an anti-cancer antibody. Cell Rep. 8, 410–419. doi: 10.1016/j.celrep.2014.06.038
14
Cantor, R. M., Lange, K., Sinsheimer, J. S. (2010). Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am. J. Hum. Genet. 86, 6–22. doi: 10.1016/j.ajhg.2009.11.017
15
Chandrashekar, D. S., Karthikeyan, S. K., Korla, P. K., Patel, H., Shovon, A. R., Athar, M., et al. (2022). UALCAN: an update to the integrated cancer data analysis platform. Neoplasia 25, 18–27. doi: 10.1016/j.neo.2022.01.001
16
Chatr-Aryamontri, A., Breitkreutz, B. J., Heinicke, S., Boucher, L., Winter, A., Stark, C., et al. (2013). The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41, D816–823. doi: 10.1093/nar/gks1158
17
Chen, J., Bardes, E. E., Aronow, B. J., Jegga, A. G. (2009). ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 37, W305–311. doi: 10.1093/nar/gkp427
18
Chen, J. Y., Mamidipalli, S., Huan, T. (2009). HAPPI: an online database of comprehensive human annotated and predicted protein interactions. BMC Genomics 10, S16. doi: 10.1186/1471-2164-10-S1-S16
19
Chen, J. Y., Pandey, R., Nguyen, T. M. (2017). HAPPI-2: a comprehensive and high-quality map of human annotated and predicted protein interactions. BMC Genomics 18, 182. doi: 10.1186/s12864-017-3512-1
20
Chen, J. Y., Pinkerton, S. L., Shen, C., Wang, M. (2006b). “An integrated computational proteomics method to extract protein targets for fanconi anemia studies,” in 21st Annual ACM Symposium on Applied Computing. Dijon, 173–179. doi: 10.1145/1141277.1141316
21
Chen, J. Y., Piquette-Miller, M., Smith, B. P. (2013). Network medicine: finding the links to personalized therapy. Clin. Pharmacol. Therapeut. 94, 613–616. doi: 10.1038/clpt.2013.195
22
Chen, J. Y., Shen, C., Sivachenko, A. Y. (2006a). Mining Alzheimer disease relevant proteins from integrated protein interactome data. Pacific Symposium on Biocomputing 2006, 367–378. doi: 10.1142/9789812701626_0034
23
Chou, C. P., Huang, N. C., Jhuang, S. J., Pan, H. B., Peng, N. J., Cheng, J. T., et al. (2014). Ubiquitin-conjugating enzyme UBE2C is highly expressed in breast microcalcification lesions. PLoS ONE 9, e93934. doi: 10.1371/journal.pone.0093934
24
Cornish, A. J., David, A., Sternberg, M. J. E. (2018). PhenoRank: reducing study bias in gene prioritization through simulation. Bioinformatics 34, 2087–2095. doi: 10.1093/bioinformatics/bty028
25
Costa, A., Rossi, E., Scicchitano, B. M., Coletti, D., Moresi, V., Adamo, S., et al. (2014). Neurohypophyseal hormones: novel actors of striated muscle development and homeostasis. Eur. J. Transl. Myol. 24, 3790. doi: 10.4081/bam.2014.3.217
26
CowenL.IdekerT.RaphaelB. J.SharanR. (2017). Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551–562. 10.1038/nrg.2017.38
27
D'AmelioM.CavallucciV.MiddeiS.MarchettiC.PacioniS.FerriA.et al. (2011). Caspase-3 triggers early synaptic dysfunction in a mouse model of Alzheimer's disease. Nat. Neurosci. 14, 69–76. 10.1038/nn.2709
28
DesouzaM.GunningP. W.StehnJ. R. (2012). The actin cytoskeleton as a sensor and mediator of apoptosis. Bioarchitecture2, 75–87. 10.4161/bioa.20975
29
Di LeoA.DesmedtC.BartlettJ. M.PietteF.EjlertsenB.PritchardK. I.et al. (2011). HER2 and TOP2A as predictive markers for anthracycline-containing chemotherapy regimens as adjuvant treatment of breast cancer: a meta-analysis of individual patient data. Lancet Oncol. 12, 1134–1142. 10.1016/S1470-2045(11)70231-5
30
do ValleI. F.MenichettiG.SimonettiG.BrunoS.ZironiI.DursoD. F.et al. (2018). Network integration of multi-tumour omics data suggests novel targeting strategies. Nat. Commun. 9, 4514. 10.1038/s41467-018-06992-7
31
DobinA.DavisC. A.SchlesingerF.DrenkowJ.ZaleskiC.JhaS.et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics29, 15–21. 10.1093/bioinformatics/bts635
32
DollS.DressenM.GeyerP. E.ItzhakD. N.BraunC.DopplerS. A.et al. (2017). Region and cell-type resolved quantitative proteomic map of the human heart. Nat. Commun. 8, 1469. 10.1038/s41467-017-01747-2
33
ElShalS.TrancheventL. C.SifrimA.ArdeshirdavaniA.DavisJ.MoreauY. (2016). Beegle: from literature mining to disease-gene discovery. Nucleic Acids Res. 44, e18. 10.1093/nar/gkv905
34
EltohamyM. I.BadawyO. M.El kinaaiN.LoayI.NassarH. R.AllamR. M.et al. (2018). Topoisomerase II alpha gene alteration in triple negative breast cancer and its predictive role for anthracycline-based chemotherapy (Egyptian NCI Patients). Asian Pac. J. Cancer Prev. 19, 3581–3589. 10.31557/APJCP.2018.19.12.3581
35
EricksonS. L.O'SheaK. S.GhaboosiN.LoverroL.FrantzG.BauerM.et al. (1997). ErbB3 is required for normal cerebellar and cardiac development: a comparison with ErbB2-and heregulin-deficient mice. Development124, 4999–5011. 10.1242/dev.124.24.4999
36
ErtenS.BebekG.EwingR. M.KoyuturkM. (2011). DADA: degree-aware algorithms for network-based disease gene prioritization. BioData Min. 4, 19. 10.1186/1756-0381-4-19
37
EschenhagenT.BolliR.BraunT.FieldL. J.FleischmannB. K.FrisenJ.et al. (2017). Cardiomyocyte regeneration: a consensus statement. Circulation136, 680–686. 10.1161/CIRCULATIONAHA.117.029343
38
EspinozaM. (2012). On Network Randomization Methods: A Negative Control Study. Fairfield, CT: Fairfield University.
- Google Scholar
39
Gene OntologyC.BlakeJ. A.DolanM.DrabkinH.HillD. P.LiN.et al. (2013). Gene Ontology annotations and resources. Nucleic Acids Res. 41, D530–535. 10.1093/nar/gks1050
40
GhiassianS. D.MencheJ.BarabasiA. L. A. (2015). DIseAse MOdule Detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Comput. Biol. 11, e1004120. 10.1371/journal.pcbi.1004120
41
GotohN. (2008). Regulation of growth factor signaling by FRS2 family docking/scaffold adaptor proteins. Cancer Sci. 99, 1319–1325. 10.1111/j.1349-7006.2008.00840.x
42
GottliebA.MaggerO.BermanI.RuppinE.SharanR. (2011). PRINCIPLE: a tool for associating genes with diseases via network propagation. Bioinformatics27, 3325–3326. 10.1093/bioinformatics/btr584
43
GroverA.LeskovecJ. (2016). “node2vec: scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, CA), 855–864. 10.1145/2939672.2939754
44
GualaD.SjolundE.SonnhammerE. L. (2014). MaxLink: network-based prioritization of genes tightly linked to a disease seed set. Bioinformatics30, 2689–2690. 10.1093/bioinformatics/btu344
45
GualaD.SonnhammerE. L. L. (2017). A large-scale benchmark of gene prioritization methods. Sci. Rep. 7, 46598. 10.1038/srep46598
46
GuneyE.OlivaB. (2012). Exploiting protein-protein interaction networks for genome-wide disease-gene prioritization. PLoS ONE7, e43557. 10.1371/journal.pone.0043557
47
GuoX.WaddellD. S.WangW.WangZ.LiberatiN. T.YongS.et al. (2008). Ligand-dependent ubiquitination of Smad3 is regulated by casein kinase 1 gamma 2, an inhibitor of TGF-beta signaling. Oncogene27, 7235–7247. 10.1038/onc.2008.337
48
GyanchandaniR.Ortega AlvesM. V.MyersJ. N.KimS. (2013). A proangiogenic signature is revealed in FGF-mediated bevacizumab-resistant head and neck squamous cell carcinoma. Mol. Cancer Res. 11, 1585–1596. 10.1158/1541-7786.MCR-13-0358
49
GyorffyB.LanczkyA.EklundA. C.DenkertC.BudcziesJ.LiQ.et al. (2010). An online survival analysis tool to rapidly assess the effect of 22,277 genes on breast cancer prognosis using microarray data of 1,809 patients. Breast Cancer Res. Treat. 123, 725–731. 10.1007/s10549-009-0674-9
50
HaleP. J.Lopez-YunezA. M.ChenJ. Y. (2012). Genome-wide meta-analysis of genetic susceptible genes for Type 2 Diabetes. BMC Syst. Biol. 6(Suppl.3), S16. 10.1186/1752-0509-6-S3-S16
51
HercusT. R.BroughtonS. E.EkertP. G.RamshawH. S.PeruginiM.GrimbaldestonM.et al. (2012). The GM-CSF receptor family: mechanism of activation and implications for disease. Growth Fact. 30, 63–75. 10.3109/08977194.2011.649919
52
Huangde. W.ShermanB. T.LempickiR. A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13. 10.1093/nar/gkn923
53
HuangH.LiJ.ChenJ. Y. (2009). “Disease gene-fishing in molecular interaction networks: a case study in colorectal cancer,” in Conference proceedings : Annual International Conference of the IEEE Engineering in Medicine and Biology Society IEEE Engineering in Medicine and Biology Society Conference (Minneapolis, MN), 6416–6419.
- Pubmed Abstract
- Google Scholar
54
JiangX.LopezA.HolyoakeT.EavesA.EavesC. (1999). Autocrine production and action of IL-3 and granulocyte colony-stimulating factor in chronic myeloid leukemia. Proc. Natl. Acad. Sci. U. S. A. 96, 12804–12809. 10.1073/pnas.96.22.12804
55
JohnstonS. R. (2006). Targeting downstream effectors of epidermal growth factor receptor/HER2 in breast cancer with either farnesyltransferase inhibitors or mTOR antagonists. Int. J. Gynecol. Cancer16(Suppl.2), 543–548. 10.1111/j.1525-1438.2006.00692.x
56
JonssonT.AtwalJ. K.SteinbergS.SnaedalJ.JonssonP. V.BjornssonS.et al. (2012). A mutation in APP protects against Alzheimer's disease and age-related cognitive decline. Nature488, 96–99. 10.1038/nature11283
57
KanehisaM.FurumichiM.TanabeM.SatoY.MorishimaK. (2017). KEGG new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361. 10.1093/nar/gkw1092
58
KanehisaM.GotoS.FurumichiM.TanabeM.HirakawaM. (2010). KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38, D355–360. 10.1093/nar/gkp896
59
KEGG (2022). Chronic Myeloid Leukemia - Homo Sapiens (Human). Kyoto: Human Genome Center, Institute of Medical Science; University of Tokyo, Bioinformatics Center; Institute for Chemical Research, Kyoto University.
- Google Scholar
60
KhaninR.WitE. (2006). How scale-free are biological networks. J. Comput. Biol. 13, 810–818. 10.1089/cmb.2006.13.810
61
KimJ.BangH. (2016). Three common misuses of P-values. Dent. Hypotheses7, 73–80. 10.4103/2155-8213.190481
62
KimY. J.LeeG.HanJ.SongK.ChoiJ. S.ChoiY. L.et al. (2019). UBE2C overexpression aggravates patient outcome by promoting estrogen-dependent/independent cell proliferation in early hormone receptor-positive and HER2-negative breast cancer. Front. Oncol. 9, 1574. 10.3389/fonc.2019.01574
63
KobayashiC. I.TakuboK.KobayashiH.Nakamura-IshizuA.HondaH.KataokaK.et al. (2014). The IL-2/CD25 axis maintains distinct subsets of chronic myeloid leukemia-initiating cells. Blood123, 2540–2549. 10.1182/blood-2013-07-517847
64
KoschmannJ.BharA.StegmaierP.KelA. E.WingenderE. (2015). “Upstream analysis”: an integrated promoter-pathway analysis approach to causal interpretation of microarray data. Microarrays4, 270–286. 10.3390/microarrays4020270
65
KrallingerM.ValenciaA.HirschmanL. (2008). Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 9(Suppl.2), S8. 10.1186/gb-2008-9-s2-s8
66
KramerA.GreenJ.PollardJ.Jr.TugendreichS. (2014). Causal analysis approaches in Ingenuity Pathway Analysis. Bioinformatics30, 523–530. 10.1093/bioinformatics/btt703
67
KruegerF. (2015). Trim galore. A wrapper tool around Cutadapt and FastQC to Consistently Apply Quality and Adapter Trimming to FastQ Files, 516.
- Google Scholar
68
KubinT.PolingJ.KostinS.GajawadaP.HeinS.ReesW.et al. (2011). Oncostatin M is a major mediator of cardiomyocyte dedifferentiation and remodeling. Cell Stem Cell. 9, 420–432. 10.1016/j.stem.2011.08.013
69
La BellaV.LiguoriM.CittadellaR.SettipaniN.PiccoliT.MannaI.et al. (2004). A novel mutation (Thr116Ile) in the presenilin 1 gene in a patient with early-onset Alzheimer's disease. Eur. J. Neurol. 11, 521–524. 10.1111/j.1468-1331.2004.00828.x
70
LamN. T.SadekH. A. (2018). Neonatal heart regeneration: comprehensive literature review. Circulation138, 412–423. 10.1161/CIRCULATIONAHA.118.033648
71
LanczkyA.NagyA.BottaiG.MunkacsyG.SzaboA.SantarpiaL.et al. (2016). miRpower: a web-tool to validate survival-associated miRNAs utilizing expression data from 2178 breast cancer patients. Breast Cancer Res. Treat. 160, 439–446. 10.1007/s10549-016-4013-7
72
LiJ.ZhuX.ChenJ. Y. (2009). Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput. Biol. 5, e1000450. 10.1371/journal.pcbi.1000450
73
LiR.CamposJ. (2015). Iida J: a gene regulatory program in human breast cancer. Genetics201, 1341–1348. 10.1534/genetics.115.180125
74
LiberzonA.BirgerC.ThorvaldsdottirH.GhandiM.MesirovJ. P.TamayoP.et al. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425. 10.1016/j.cels.2015.12.004
75
LiuY.LiangY.WishartD. (2015). PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more. Nucleic Acids Res. 43, W535–542. 10.1093/nar/gkv383
76
LiuY.ZhuY. H.MaoC. Q.DouS.ShenS.TanZ. B.et al. (2014). Triple negative breast cancer therapy with CDK1 siRNA delivered by cationic lipid assisted PEG-PLA nanoparticles. J. Control Release. 192, 114–121. 10.1016/j.jconrel.2014.07.001
77
LoveM. I.HuberW.AndersS. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550. 10.1186/s13059-014-0550-8
78
LuY.YangG.XiaoY.ZhangT.SuF.ChangR.et al. (2020). Upregulated cyclins may be novel genes for triple-negative breast cancer based on bioinformatic analysis. Breast Cancer. 27, 903–911. 10.1007/s12282-020-01086-z
79
Martin-OrozcoE.Sanchez-FernandezA.Ortiz-ParraI.Ayala-San NicolasM. (2019). WNT: signaling in tumors: the way to evade drugs and immunity. Front. Immunol. 10, 2854. 10.3389/fimmu.2019.02854
80
MoreauY.TrancheventL. C. (2012). Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat. Rev. Genet. 13, 523–536. 10.1038/nrg3253
81
MorryJ.NgamcherdtrakulW.GuS.RedaM.CastroD. J.SangvanichT.et al. (2017). Targeted treatment of metastatic breast cancer by PLK1 siRNA delivered by an antioxidant nanoparticle platform. Mol. Cancer Ther. 16, 763–772. 10.1158/1535-7163.MCT-16-0644
82
MuhammadS. A.RazaW.NguyenT.BaiB.WuX.ChenJ.et al. (2017). Cellular signaling pathways in insulin resistance-systems biology analyses of microarray dataset reveals new drug target gene signatures of type 2 diabetes mellitus. Front. Physiol. 8, 13. 10.3389/fphys.2017.00013
83
MustonenE.RuskoahoH.RysaJ. (2013). Thrombospondins, potential drug targets for cardiovascular diseases. Basic Clin. Pharmacol. Toxicol. 112, 4–12. 10.1111/bcpt.12026
84
NakadaY.ZhouY.GongW.ZhangE.SkieE.NguyenT.et al. (2021). Single nucleus transcriptomics: apical resection in newborn pigs extends the time-window of cardiomyocyte proliferation and myocardial regeneration. Circulation121, 56995. 10.1161/CIRCULATIONAHA.121.056995
85
NedeljkovicM.DamjanovicA. (2019). Mechanisms of chemotherapy resistance in triple-negative breast cancer-how we can rise to the challenge. Cells8, 90957. 10.3390/cells8090957
86
NeupaneM.KiserJ. N. (2018). Bovine respiratory disease complex coordinated agricultural project research T, Neibergs HL: gene set enrichment analysis of SNP data in dairy and beef cattle with bovine respiratory disease. Anim. Genet. 49, 527–538. 10.1111/age.12718
87
NewmanM. E. (2005). A measure of betweenness centrality based on random walks. Social Netw. 27, 39–54. 10.1016/j.socnet.2004.11.009
88
NewmanM. E. (2006). Modularity and community structure in networks. Proc. Natl. Acad. Sci. U. S. A. 103, 8577–8582. 10.1073/pnas.0601602103
89
NewmanM. E. J. (2008). “Mathematics of networks,” in The New Palgrave Encyclopedia of Economics, 2 Edn, eds L. E. Blume, S. N. Durlauf. London: Palgrave Macmillan UK.
- Google Scholar
90
NguyenT.WeiY.NakadaY.ZhouY.ZhangJ. (2022). Cardiomyocyte cell-cycle regulation in neonatal large mammals: single nucleus RNA-sequencing data analysis via an artificial-intelligence-based pipeline. Front. Bioeng. Biotechnol. 10, 914450. 10.3389/fbioe.2022.914450
91
NitschD.TrancheventL. C.GoncalvesJ. P.VogtJ. K.MadeiraS. C.MoreauY.et al. (2011). PINTA: a web server for network-based gene prioritization from expression data. Nucleic Acids Res. 39, W334–338. 10.1093/nar/gkr289
92
OgataH.GotoS.SatoK.FujibuchiW.BonoH.KanehisaM.et al. (1999). KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27, 29–34. 10.1093/nar/27.1.29
93
OlsenC.FlemingK.PrendergastN.RubioR.Emmert-StreibF.BontempiG.et al. (2014). Inference and validation of predictive gene networks from biomedical literature and gene expression data. Genomics103, 329–336. 10.1016/j.ygeno.2014.03.004
94
PageL.BrinS.MotwaniR.WinogradT. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford, CA: Stanford InfoLab.
- Google Scholar
95
PatelJ. R.ThangaveluP.TerrellR. M.IsraelB.SarkarA. B.DavidsonA. M.et al. (2022). Novel allosteric inhibitor targets PLK1 in triple negative breast cancer cells. Biomolecules12, 40531. 10.3390/biom12040531
96
PengJ.GuanJ.ShangX. (2019). Predicting Parkinson's disease genes based on node2vec and autoencoder. Front. Genet. 10, 226. 10.3389/fgene.2019.00226
97
Perez-IratxetaC.BorkP.AndradeM. A. (2002). Association of genes to genetically inherited diseases using data mining. Nat. Genet. 31, 316–319. 10.1038/ng895
98
PetersL. A.PerrigoueJ.MorthaA.IugaA.SongW. M.NeimanE. M.et al. (2017). A functional genomics predictive network model identifies regulators of inflammatory bowel disease. Nat. Genet. 49, 1437–1449. 10.1038/ng.3947
99
PorrelloE. R.MahmoudA. I.SimpsonE.HillJ. A.RichardsonJ. A.OlsonE. N.et al. (2011). Transient regenerative potential of the neonatal mouse heart. Science331, 1078–1080. 10.1126/science.1200708
100
PorrelloE. R.OlsonE. N. (2014). A neonatal blueprint for cardiac regeneration. Stem Cell Res. 13, 556–570. 10.1016/j.scr.2014.06.003
101
PuenteB. N.KimuraW.MuralidharS. A.MoonJ.AmatrudaJ. F.PhelpsK. L.et al. (2014). The oxygen-rich postnatal environment induces cardiomyocyte cell-cycle arrest through DNA damage response. Cell157, 565–579. 10.1016/j.cell.2014.03.032
102
RadulescuE.JaffeA. E.StraubR. E.ChenQ.ShinJ. H.HydeT. M.et al. (2018). Identification and prioritization of gene sets associated with schizophrenia risk by co-expression network analysis in human brain. Mol. Psychiatry2018, 286559. 10.1101/286559
103
RajabA.StraubV.McCannL. J.SeelowD.VaronR.BarresiR.et al. (2010). Fatal cardiac arrhythmia and long-QT syndrome in a new form of congenital generalized lipodystrophy with muscle rippling (CGL4) due to PTRF-CAVIN mutations. PLoS Genet. 6, e1000874. 10.1371/journal.pgen.1000874
104
RaoA. R.JanaR.BandyopadhyayS. A. (1996). Markov chain Monte Carlo method for generating random (0, 1)-matrices with given marginals. Sankhyā1996, 225–242.
- Google Scholar
105
RollandT.TasanM.CharloteauxB.PevznerS. J.ZhongQ.SahniN.et al. (2014). A proteome-scale map of the human interactome network. Cell159, 1212–1226. 10.1016/j.cell.2014.10.050
106
SahaS.HarrisonS. H.ChenJ. Y. (2008). Dissecting the human plasma proteome and inflammatory response biomarkers. Proteomics2008, 507. 10.1002/pmic.200800507
107
SayersE. (2008). E-utilities Quick Start.Bethesda, MD: Entrez Programming Utilities Help.
- Google Scholar
108
SchlottererC.ToblerR.KoflerR.NolteV. (2014). Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nat. Rev. Genet. 15, 749–763. 10.1038/nrg3803
109
ShannonP.MarkielA.OzierO.BaligaN. S.WangJ. T.RamageD.et al. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504. 10.1101/gr.1239303
110
SinghA. R.SivadasA.SabharwalA.VellarikalS. K.JayarajanR.VermaA.et al. (2016). Chamber specific gene expression landscape of the zebrafish heart. PLoS ONE11, e0147823. 10.1371/journal.pone.0147823
111
Singh-BlomU. M.NatarajanN.TewariA.WoodsJ. O.DhillonI. S.MarcotteE. M.et al. (2013). Prediction and validation of gene-disease associations using methods inspired by social network analyses. PLoS ONE8, e58977. 10.1371/journal.pone.0058977
112
SmedleyD.KohlerS.CzeschikJ. C.AmbergerJ.BocchiniC.HamoshA.et al. (2014). Walking the interactome for candidate prioritization in exome sequencing studies of Mendelian diseases. Bioinformatics30, 3215–3222. 10.1093/bioinformatics/btu508
113
SubramanianA.TamayoP.MoothaV. K.MukherjeeS.EbertB. L.GilletteM. A.et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550. 10.1073/pnas.0506580102
114
SunJ.ZhaoZ. (2010). A comparative study of cancer proteins in the human protein-protein interaction network. BMC Genomics11(Suppl.3), S5. 10.1186/1471-2164-11-S3-S5
115
SzklarczykD.FranceschiniA.WyderS.ForslundK.HellerD.Huerta-CepasJ.et al. (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43, D447–452. 10.1093/nar/gku1003
116
SzklarczykD.MorrisJ. H.CookH.KuhnM.WyderS.SimonovicM.et al. (2017). The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368. 10.1093/nar/gkw937
117
TalesaV. N. (2001). Acetylcholinesterase in Alzheimer's disease. Mech. Ageing Dev. 122, 1961–1969. 10.1016/S0047-6374(01)00309-8
118
TessierL.CoteO.ClarkM. E.VielL.Diaz-MendezA.AndersS.et al. (2018). Gene set enrichment analysis of the bronchial epithelium implicates contribution of cell cycle and tissue repair processes in equine asthma. Sci. Rep. 8, 16408. 10.1038/s41598-018-34636-9
119
TimarG.DorogovtsevS. N.MendesJ. F. (2016). Scale-free networks with exponent one. Phys. Rev. E. 94, 022302. 10.1103/PhysRevE.94.022302
120
TiongK. L.YeangC. H. (2019). MGSEA - a multivariate Gene set enrichment analysis. BMC Bioinformatics20, 145. 10.1186/s12859-019-2716-6
121
TynerC.BarberG. P.CasperJ.ClawsonH.DiekhansM.EisenhartC.et al. (2017). The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 45, D626–D634. 10.1093/nar/gkw1134
122
UedaA.OikawaK.FujitaK.IshikawaA.SatoE.IshikawaT.et al. (2019). Therapeutic potential of PLK1 inhibition in triple-negative breast cancer. Lab. Invest. 99, 1275–1286. 10.1038/s41374-019-0247-4
123
UniProt ConsortiumT. (2018). UniProt: the universal protein knowledgebase. Nucleic Acids Res. 46, 2699. 10.1093/nar/gky092
124
ValentP. (2014). Targeting the JAK2-STAT5 pathway in CML. Blood124, 1386–1388. 10.1182/blood-2014-07-585943
125
van DamS.CordeiroR.CraigT.van DamJ.WoodS. H.de MagalhaesJ. P.et al. (2012). GeneFriends: an online co-expression analysis tool to identify novel gene targets for aging and complex diseases. BMC Genomics13, 535. 10.1186/1471-2164-13-535
126
Van VoorenS.ThienpontB.MentenB.SpelemanF.De MoorB.VermeeschJ.et al. (2007). Mapping biomedical concepts onto the human genome by mining literature on chromosomal aberrations. Nucleic Acids Res. 35, 2533–2543. 10.1093/nar/gkm054
127
WaksmanA. (1968). A permutation network. J. ACM15, 159–163. 10.1145/321439.321449
- CrossRef
- Google Scholar
128
WangC.LiY.LiH.ZhangY.YingZ.WangX.et al. (2020). Disruption of FGF signaling ameliorates inflammatory response in hepatic stellate cells. Front. Cell Dev. Biol. 8, 601. 10.3389/fcell.2020.00601
129
WangJ.KarraR.DicksonA. L.PossK. D. (2013). Fibronectin is deposited by injury-activated epicardial cells and is necessary for zebrafish heart regeneration. Dev. Biol. 382, 427–435. 10.1016/j.ydbio.2013.08.012
130
WangS. L.LiX. L.FangJ. (2012). Finding minimum gene subsets with heuristic breadth-first search algorithm for robust tumor classification. BMC Bioinformatics13, 178. 10.1186/1471-2105-13-178
131
WangZ.Duenas-OsorioL.PadgettJ. E. (2015). A new mutually reinforcing network node and link ranking algorithm. Sci. Rep. 5, 15141. 10.1038/srep15141
132
WeiW.NortonD. D.WangX.KusiakJ. W. (2002). Abeta 17-42 in Alzheimer's disease activates JNK and caspase-8 leading to neuronal apoptosis. Brain125, 2036–2043. 10.1093/brain/awf205
133
WinterC.KristiansenG.KerstingS.RoyJ.AustD.KnoselT.et al. (2012). Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Comput. Biol. 8, e1002511. 10.1371/journal.pcbi.1002511
134
WuX.ChenJ. Y.AlterovitzG.BensonR.RamoniM. (2009). Molecular interaction networks: topological and functional characterizations. Automat. Proteom. Genom.145, 6. 10.1002/9780470741191.ch6
- CrossRef
- Google Scholar
135
XiaQ.CaiY.PengR.WuG.ShiY.JiangW.et al. (2014). The CDK1 inhibitor RO3306 improves the response of BRCA-proficient breast cancer cells to PARP inhibition. Int. J. Oncol. 44, 735–744. 10.3892/ijo.2013.2240
136
XieB.AgamG.BalasubramanianS.XuJ.GilliamT. C.MaltsevN.et al. (2015). Disease gene prioritization using network and feature. J. Comput. Biol. 22, 313–323. 10.1089/cmb.2015.0001
137
YangC.JinC.LiX.WangF.McKeehanW. L.LuoY.et al. (2012). Differential specificity of endocrine FGF19 and FGF21 to FGFR1 and FGFR4 in complex with KLB. PLoS ONE7, e33870. 10.1371/journal.pone.0033870
138
YeL.D'AgostinoG.LooS. J.WangC. X.SuL. P.TanS. H.et al. (2018). Early regenerative capacity in the porcine heart. Circulation138, 2798–2808. 10.1161/CIRCULATIONAHA.117.031542
139
YinT.ChenS.WuX.TianW. (2017). GenePANDA-a novel network-based gene prioritizing tool for complex diseases. Sci. Rep. 7, 43258. 10.1038/srep43258
140
YuL.FernandezS.BrockG. (2017). Power analysis for RNA-Seq differential expression studies. BMC Bioinformatics18, 234. 10.1186/s12859-017-1648-2
141
YuW.WulfA.LiuT.KhouryM. J.GwinnM. (2008). Gene Prospector: an evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases. BMC Bioinformatics9, 528. 10.1186/1471-2105-9-528
142
YueZ.AroraI.ZhangE. Y.LauferV.BridgesS. L.ChenJ. Y.et al. (2017). Repositioning drugs by targeting network modules: a Parkinson's disease case study. BMC Bioinformatics18, 532. 10.1186/s12859-017-1889-0
143
YueZ.KshirsagarM. M.NguyenT.SuphavilaiC.NeylonM. T.ZhuL.et al. (2015). PAGER: constructing PAGs and new PAG-PAG relationships for network biology. Bioinformatics31, i250–257. 10.1093/bioinformatics/btv265
144
YueZ.ZhengQ.NeylonM. T.YooM.ShinJ.ZhaoZ.et al. (2018). 2.0: an update to the pathway, annotated-list and gene-signature electronic repository for Human Network Biology. Nucleic Acids Res. 46, D668–D676. 10.1093/nar/gkx1040
145
ZhangE.NguyenT.ZhaoM.DangS. D. H.ChenJ. Y.BianW.et al. (2020). Identifying the key regulators that promote cell-cycle activity in the hearts of early neonatal pigs after myocardial injury. PLoS ONE15, e0232963. 10.1371/journal.pone.0232963
146
ZhangF.ChenJ. Y. (2010). Discovery of pathway biomarkers from coupled proteomics and systems biology methods. BMC Genomics11(Suppl.2), S12. 10.1186/1471-2164-11-S2-S12
147
ZhangF.ChenJ. Y. (2013). Breast cancer subtyping from plasma proteins. BMC Medical Genom. 6(Suppl.1), S6. 10.1186/1755-8794-6-S1-S6
148
ZhangH.FergusonA.RobertsonG.JiangM.ZhangT.SudlowC.et al. (2021). Benchmarking network-based gene prioritization methods for cerebral small vessel disease. Brief Bioinform. 22, bbab006. 10.1093/bib/bbab006
149
ZhaoM.ZhangE.WeiY.ZhouY.WalcottG. P.ZhangJ.et al. (2020). Apical resection prolongs the cell cycle activity and promotes myocardial regeneration after left ventricular injury in neonatal pig. Circulation142, 913–916. 10.1161/CIRCULATIONAHA.119.044619
150
ZhaoS.GengY.CaoL.YangQ.PanT.ZhouD.et al. (2021). Deciphering the performance of polo-like kinase 1 in triple-negative breast cancer progression according to the centromere protein U-phosphorylation pathway. Am. J. Cancer Res. 11, 2142–2158.
- Pubmed Abstract
- Google Scholar
151
ZhaoZ. Q.HanG. S.YuZ. G.LiJ. (2015). Laplacian: normalization and random walk on heterogeneous networks for disease-gene prioritization. Comput. Biol. Chem. 57, 21–28. 10.1016/j.compbiolchem.2015.02.008
152
ZhuW.ZhangE.ZhaoM.ChongZ.FanC.TangY.et al. (2018). Regenerative potential of neonatal porcine hearts. Circulation138, 2809–2816. 10.1161/CIRCULATIONAHA.118.034886

Keywords: gene prioritization, network expansion, network statistical analysis, pathway analysis, network biology

Citation: Nguyen T, Yue Z, Slominski R, Welner R, Zhang J and Chen JY (2022) WINNER: A network biology tool for biomolecular characterization and prioritization. Front. Big Data 5:1016606. doi: 10.3389/fdata.2022.1016606

Received: 11 August 2022; Accepted: 14 October 2022; Published: 04 November 2022.

Volume: 5 - 2022

Edited by: Prashanti Manda, University of North Carolina at Greensboro, United States

Reviewed by: Emre Sefer, Özyegin University, Turkey; Zhi-Ping Liu, Shandong University, China

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jake Y. Chen, jakechen@uab.edu

This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Big Data

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
