- 1College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- 2National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China
- 3College of Management, Shenzhen University, Shenzhen, China
- 4Beijing Key Laboratory of Mental Disorders, National Clinical Research Center for Mental Disorders and National Center for Mental Disorders, Beijing Anding Hospital, Capital Medical University, Beijing, China
- 5Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing, China
- 6College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, China
- 7JCY Biotech Ltd., Pingshan Translational Medicine Center, Shenzhen Bay Laboratory, Shenzhen, China
In the contemporary field of life sciences, researchers have gradually recognized the critical role of microbes in maintaining human health. However, traditional biological experimental methods for validating the association between microbes and diseases are both time-consuming and costly. Therefore, developing effective computational methods to predict potential associations between microbes and diseases is an important and urgent task. In this study, we propose a novel computational framework, called GCATCMDA, for forecasting potential associations between microbes and diseases. Firstly, we construct Gaussian kernel similarity networks for microbes and diseases using known microbe-disease association data. Then, we design a feature encoder that combines graph convolutional network and graph attention mechanism to learn the node features of networks, and propose a feature dual-fusion module to effectively integrate node features from each layer's output. Next, we apply the feature encoder separately to the microbe similarity network, disease similarity network, and microbe-disease association network, and enhance the consistency of features for the same nodes across different association networks through contrastive learning. Finally, we pass the microbe and disease features into an inner product decoder to obtain the association scores between them. Experimental results demonstrate that the GCATCMDA model achieves superior predictive performance compared to previous methods. Furthermore, case studies confirm that GCATCMDA is an effective tool for predicting microbe-disease associations in real situations.
1 Introduction
Microbes are primarily composed of bacteria, fungi, archaea, and viruses, predominantly inhabit the gut within the human body (Sommer and Bäckhed, 2013; Blum, 2017). The gut microbiota is closely associated with human health, playing a crucial role in regulating host physiological processes, such as immunity and metabolism (Lynch and Pedersen, 2016; Tooley, 2020). In recent years, biological experiments have demonstrated that dysbiosis or imbalance in the human microbiota could cause human diseases (Marchesi et al., 2016), such as liver diseases (Henao-Mejia et al., 2013), diabetes (Paun et al., 2017), obesity (Tseng and Wu, 2019), and even cancer (Schwabe and Jobin, 2013). However, traditional biological experiments suffer from drawbacks such as long experimental cycles and expensive costs. Therefore, if we can utilize effective computational methods to predict potential sets of associations between microbes and diseases in advance, it would be possible to reduce unnecessary experimental trials and costs in traditional biological experiments, thereby accelerating the development of research in the field of microbe-disease associations.
Current computational methods for predicting microbe-disease associations can primarily be categorized into three categories, namely network-based methods, random walk-based methods, and deep learning-based methods. The network-based methods infer the potential association between microbes and diseases by utilizing the topological information within the network. For example, Chen et al. (2017) proposed a KATZHMDA model based on the KATZ measure, which scores potential disease related microbes by calculating all paths of different lengths between microbes and diseases. Bao et al. (2017) proposed the Network Consistency Projection for Human Microbe-Disease Association Prediction (NCPHMDA) model, evaluating the association scores between microbes and diseases by computing disease space projection scores and microbe space projection scores. Long and Luo (2019) designed a meta-graph-based method named WMGHMDA, which calculates the probability scores of microbe-disease pairs by utilizing a weighted meta-graph search algorithm on a heterogeneous network. Wang et al. (2023) proposed a SAELGMDA model by combining sparse autoencoder and Light Gradient boosting machine.
The success of random walk algorithms in graph data processing has prompted researchers to propose various microbe-disease association prediction algorithms based on this approach. For instance, Zou et al. (2017) developed a novel computational model of BiRWHMDA, which predicts potential microbe-disease associations by bi-random walks on a heterogeneous network. Luo and Long (2018) proposed a novel computational model of NTSHMDA, which integrates network topology similarity into the restarted random walk algorithm to distinguish the walking probabilities of disease-microbe node pairs. Yan et al. (2019) introduced a BRWMDA method, predicting potential microbe-disease associations by executing bi-random walks with different steps on microbe and disease networks.
With the significant achievements of deep learning algorithms in various research fields, researchers have gradually begun to explore the application of these algorithms in the task of predicting the associations between microbes and diseases. For example, Ma and Jiang (2020) developed an end-to-end graph convolutional neural network-based mining model NinimHMDA to predict different types of microbe-disease associations. Long et al. (2021) proposed a novel deep learning framework of GATMDA, which utilizes graph attention networks along with inductive matrix completion for predicting human microbe-disease associations. Hua et al. (2022) developed a multi-view graph augmentation convolutional network (MVGCNMDA) to predict potential disease-associated microbes. Jiang et al. (2022) proposed the KGNMDA method, using a knowledge graph neural network method for predicting microbe-disease associations. Peng et al. (2023) developed a computational method for predicting microbe-disease associations, named GPUDMDA, which integrates graph attention autoencoder, positive-unlabeled learning, and deep neural network.
In addition to the three mainstream methods mentioned, some computational approaches for microbe-disease prediction have been developed based on regularization and matrix factorization/completion techniques. For instance, Wang et al. (2017) proposed a semi-supervised computational model of Laplacian Regularized Least Squares for Human Microbe—Disease Association (LRLSHMDA) to predict microbe-disease associations. Shen et al. (2017) developed a computational method of CMFHMDA, which utilizes collaborative matrix factorization to reconstruct correlation matrices between diseases and microbes. Liu et al. (2023) proposed a novel method called MNNMDA to predict microbe-disease associations by applying a Matrix Nuclear Norm method.
Among the methods mentioned above, network-based and random walk-based methods may encounter constraints in learning features of nodes representing microbes and diseases with few known associations, due to the limited information propagation caused by the sparsity of the microbe-disease association network. Meanwhile, matrix factorization/completion methods can only capture linear associations, thus failing to accurately capture the nonlinear interactions between microbes and diseases. Recent studies have suggested that graph neural network algorithms in deep learning could offer a more effective approach for learning node features in microbe-disease association networks. Therefore, this study further attempts to design node feature learning algorithms based on graph neural networks, aiming to obtain more effective node features from the microbe-disease association network, thereby predicting more accurate candidate sets of microbe-disease associations.
In this work, we propose a deep learning framework named GCATCMDA, which explores the application of graph neural networks for the microbe-disease association prediction task. First Gaussian kernel similarity is calculated based on known microbe-disease association data to construct microbe similarity networks and disease similarity networks. We then combine graph convolutional networks and graph attention mechanisms to learn feature representations of microbes and diseases in different networks, and propose a feature dual-fusion module to effectively integrate node features generated by each graph attention layer. Next, we utilize contrastive learning to enhance the feature consistency of the same microbe (or disease) across different association networks. Finally, the obtained microbe and disease features are inputted into an inner product decoder to compute their corresponding association scores. The model can obtain better node features through GCAT aggregation. In addition, contrastive learning increases the distance between nodes, allowing the model to better distinguish nodes and make subsequent predictions better. Experimental results demonstrate that the GCATCMDA model achieves better predictive performance compared to previous methods, and case studies of obesity and IBD (inflammatory bowel disease) confirm the high accuracy of the microbe-disease association candidate set produced by our method.
2 Materials and methods
2.1 Datasets
The dataset in this study was sourced from the HMDAD database (http://www.cuilab.cn/hmdad), which collects known associations between microbes and diseases by searching past research literature (Ma et al., 2017). HMDAD adapted a systematic approach by only including associations that have been experimentally validated and published in reputable journals. This ensures a high level of reliability in the dataset. Past researchers commonly employ metagenomic sequencing techniques to analyze fluctuations in microbial community abundance within specific diseases, contrasting them with the microbial compositions of healthy individuals, thus exploring the associations between microbes and diseases. In the HMDAD dataset, a microbe-disease association pair may contain multiple entries from different research literature sources. Therefore, here, we regard the same microbe-disease association from different evidences as a pair, further removing the redundant information present in the HMDAD dataset. Finally, for this study, we employed a dataset consisting of 450 microbe-disease associations, encompassing 39 human diseases and 292 microbes.
2.2 Problem definition
For the convenience of clarity in describing the subsequent research methods, we provide a simple problem definition for the task of predicting associations between microbes and diseases here. We denote M = {m1, m2, …, mnm} and D = {d1, d2, …, dnd} as the sets representing nm microbes and nd diseases, respectively. The matrix represents the known associations between microbes and diseases, where Aij = 1 if microbe mi is associated with disease dj, otherwise Aij = 0. However, Aij = 0 does not mean that microbe mi has no relation with disease dj. It may be the reason that their association has not yet been discovered. Therefore, the task of predicting associations between microbes and diseases aims to find microbe mi for each disease dj where Aij = 0 in the known association matrix, but microbe mi is actually related to disease dj.
2.3 GCATCMDA
Figure 1 illustrates the workflow of GCATCMDA, a model based on graph neural networks and contrastive learning for predicting effective candidate sets of microbe-disease associations. First microbe-microbe and disease-disease Gaussian kernel similarity networks are constructed using known associations. The model then integrates graph neural networks and contrastive learning principles to extract meaningful feature representations of microbes and diseases from the association networks. Last the obtained microbe and disease features are fed into an inner product decoder to compute their corresponding association scores. A detailed description of the key components of this model is elucidated below.
2.3.1 Microbe and disease similarity network construction
Previous study (Chen et al., 2017) have hypothesized that functionally similar microbes (or diseases) tend to exhibit similar interaction or non-interaction patterns with similar diseases (or microbes). They utilize Gaussian kernel functions to measure the similarity between two microbes (or diseases) in the same space. Therefore, in this study, we consider constructing microbe and disease similarity networks based on Gaussian kernel similarity scores for microbes and diseases.
We have recorded the known associations between microbes and diseases using the association matrix . The calculation formulas for the Gaussian kernel similarity score between microbe mi and mj, and between disease di and dj, are as follows:
where KM(mi, mj) represents the Gaussian kernel similarity score between microbes mi and mj, and KD(di, dj) represents the Gaussian kernel similarity score between diseases di and dj. The term IP(mi) represents the i-th row of the association matrix A recording the associations between microbe mi and other diseases, IP(di) represents the i-th column of the association matrix A recording the associations between disease di and other microbes. The parameters λm and λd represent the normalized kernel bandwidths and are defined as follows:
where nm and nd represented the number of microbes and diseases. And and are the original bandwidths, and generally both set to 1.
We consider microbes (or diseases) to be strongly associated with each other when the Gaussian kernel similarity score between microbes (or diseases) exceeds a threshold of t. Therefore, the association matrices MA for microbes and DA for diseases can be expressed as follows:
2.3.2 GCAT
Inspired by the work of Sun et al. (2022) on predicting metabolite-disease associations, this study adopted the GCAT feature encoder. The encoder initially combines graph convolution algorithms and graph attention mechanisms to learn the nodal features of the network, followed by the design of a feature dual-fusion module to effectively integrate the node features outputted by each graph attention layer. Since the GCAT feature encoder learns embedding representations on different association networks in a similar process, we take microbe-disease association network as an example to introduce the process of learning node features, as illustrated in Figure 2.
We represent the microbe-disease association network using a symmetric adjacency matrix , where nm and nd denote the numbers of microbes and diseases, respectively. The initial features of nodes in the network are represented by the matrix H(0).
Considering the ability of graph convolutional networks in capturing the global graph structural information, and the ability of graph attention mechanisms to assign attention weights to different nodes based on the local graph structure, the GCAT feature encoder integrates these two algorithms to effectively learn the node features of the microbe-disease association network. Firstly, the GCAT feature encoder linearly projects the node features of H(0) onto a feature space of dimensional size F, denoted as H(0) = H(0)W, where is the weight matrix. Next, this module employs the graph convolutional networks (GCN) proposed by Kipf and Welling (2016) to learn node features in the network. GCN learns low-dimensional representations of nodes by aggregating neighbor node information through graph convolution operations while preserving the graph's structural information. The first-layer graph convolutional propagation formula for graph G can be expressed as:
Here, σ(.) denotes the activation function, represents the adjacency matrix with self-loops added, is the degree matrix of , W(0) denotes the trainable weight matrix of the first-layer graph convolution, and H(1) represents the feature matrix outputted by the first-layer graph convolution.
Subsequently, the GCAT feature encoder enhances the learned node feature representations from the graph convolutional layers by incorporating a graph attention mechanism to aggregate weighted sums of neighbor information. In this study, we adopt the graph attention network (GAT) proposed by Veličković et al. (2017), which introduces an attention mechanism to assign different attention weights to the features of different neighbor nodes, enabling to focus on important neighbor features during aggregation for the target node. Thus, following the computation of the first-layer graph convolution, the attention scores for node j with respect to its neighbor node i in graph G can be calculated as:
where || denotes the concatenation operation, represents the node features obtained by the graph G through the first-layer graph convolution, represents the weight matrix for the linear transformation of node features, Ni denotes the first-order neighboring nodes of node i. The attention mechanism f(·) is a single-layer feedforward neural network, parametrized by a weight vector , and applying the LeakyReLU nonlinearity. We further employs a multi-head attention mechanism to stabilize the process of learning node representations in attention networks. It aggregates the features obtained from all attention heads by taking their average. Thus, the updated feature for node i via graph attention mechanism can be expressed as follows:
Here, σ denotes the activation function, K represents the number of attention heads, Ni signifies the neighborhood of node i, represents the attention coefficient for node j with respect to node i in the k-th attention head, is the weight matrix for attention in the k-th head, and denotes the feature vector of node j after the first graph convolutional layer.
Finally, inspired by the work of Wang et al. (2019) on node feature fusion, this study further designs a feature dual-fusion module, which considers both concatenation and element-wise product operations to integrate the node features outputted by each graph attention layer. We posit that the concatenation operation helps preserve more node feature information, while the element-wise product operation emphasizes the correlation between node features. We demonstrated the effectiveness of this fusion module in ablation experiments. The node features outputted by each graph attention layer in the GCAT feature encoder can be represented as {Z(1), Z(2), ⋯ , Z(N)}. Then, the feature dual-fusion module can be represented by the following equation:
Here, || represents concatenation, and ⊙ represents element-wise (Hadamard) product, and denote the trainable weight matrices, Z represents the final node feature.
In summary, this study represents the final microbe and disease features obtained from the microbe-disease association network as and , respectively. Similarly, the microbe features obtained from the microbe similarity network are represented as , and the disease features obtained from the disease similarity network are represented as .
2.3.3 Contrastive learning
Inspired by the work of Jin et al. (2024) on miRNA-disease association prediction, this study introduces contrastive learning to enhance the consistency of features of the same nodes across different association networks and the distinctiveness of features between different pairs of nodes. This approach leverages the complementary information among various association networks to obtain more effective representations of microbe and disease features. This module employs the contrastive loss function proposed by Zhu et al. (2020) for graph nodes. It considers the node features of the same disease di obtained from different association networks as positive samples, while all other pairs of different nodes form negative sample pairs. Therefore, the contrastive learning loss function Lossd for disease node features across different association networks can be expressed as:
where θ(·) is the cosine similarity, τ is a temperature parameter, nd denotes the number of disease. Similarly, the contrastive learning loss function Lossm for microbe node features across different association networks can be formulated as follows:
where nm denotes the number of microbe. Therefore, the overall loss function of the GCATCMDA model in the contrastive learning module is formulated as follows:
2.3.4 Microbe—disease associations prediction
This study aggregates the node features of microbes and diseases obtained from different association networks through vector concatenation, resulting in the final microbial feature representation and disease feature representation . Subsequently, these aggregated feature representations are passed into an inner product decoder to compute the association scores between microbes and diseases. The calculation process is as follows:
Where sigmoid is the activation function defined as 1/(1+e−x), which maps output values to the interval (0, 1), represents the association prediction score between microbe mi and disease dj.
Finally, the training of the GCATCMDA model employs Binary Cross-Entropy as the loss function for microbe-disease association prediction. The formula for this function is as follows:
Where N denotes the total number of associations between microbes and diseases, N+ represents the confirmed associations between microbes and diseases, and N− represents the associations yet to be confirmed. The tuple (i, j) represents the association between microbe mi and disease dj. If (mi, dj) belongs to N+, then A(i, j) = 1; otherwise, A(i, j) = 0. denotes the predicted association score by the model for this association pair. Therefore, the overall loss function of the GCATCMDA model can be expressed as:
Where λ represents the weighting parameter for the contrastive learning loss. The detailed steps of GCAT to predict novel microbe—disease associations is described in Algorithm 1.
3 Results and discussion
In this section, we will provide an exposition of the experimental setup and subsequently delve into an analysis and discussion of the experimental results.
3.1 Experimental setup
The GCATCMDA model proposed in this study is a microbe-disease association prediction model based on graph neural networks and contrastive learning. It aims to predict potential associations between microbes and diseases from known microbial-disease association dataset. The hyperparameter settings required for this model are described as follows. Firstly, the Gaussian kernel similarity threshold t needs to be set for constructing microbe and disease similarity networks. Secondly, parameters need to be set for the GCAT feature encoder module, including the dimensionality F of node features, the number of network layers L for graph convolution, and the number of attention heads heads for the graph attention mechanism. Then, in the contrastive learning loss module, the temperature hyperparameter τ and the weight parameter λ relative to the total loss are adjusted. Finally, the GCATCMDA model is trained using the Adam (Kingma and Ba, 2014) optimizer, with parameters including the learning rate lr, weight decay wd, and the number of training iterations epochs.
This study determines the optimal parameter settings of the GCATCMDA model on the dataset by enumerating different parameter combinations. Subsequently, there is an analysis of key parameters t, F, L, and heads. After comparing experimental results, the optimal hyperparameter settings for the GCATCMDA model on the HMDAD dataset are determined as follows: t = 0.4, F = 128, L = 3, heads = 2, τ = 1, λ = 0.2, lr = 0.00001, wd = 0.001, and epochs = 100.
In order to verify the effectiveness of the proposed GCATCMDA model, we compares it with KATZHMDA (Chen et al., 2017), LRLSHMDA (Wang et al., 2017), NTSHMDA (Luo and Long, 2018), and KGNMDA (Jiang et al., 2022). These five methods are recognized for their outstanding performance in this task in past studies and provide research methods with open-source code. For negative samples required in model training, this study randomly selects an equal number of negative samples from all unknown microbe-disease association pairs. The number of negative samples matched the number of positive samples as to maintain a balanced dataset. In each cross-validation experiment, the Gaussian kernel similarity scores for microbes and diseases are recalculated based on the training set to ensure the effectiveness of evaluating model performance through the test set. In this experiment, we employ the same dataset and follow the hyperparameter settings used in the original papers or provide open source codes for other compared models. We adopted the same evaluation metrics as the previous study (Jiang et al., 2022), including the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) to assess the performance of the models. To evaluate the performance of these models in predicting potential associations between microbes and diseases, this study conducted 10 repetitions of five-fold cross-validation experiments and 10 repetitions of ten-fold cross-validation experiments by setting different random seeds, and then computed the average to ensure the accuracy of our results.
3.2 The classification performance of models
The comparative results of the two cross-validation experiments conducted on the HMDAD dataset for the five aforementioned models are presented in Table 1. The optimal performance is highlighted in bold, with standard deviations indicated in parentheses. To provide readers with a clearer visualization of the performance of the models, this study further plotted the ROC curve and PR curve, as shown in Figures 3, 4, respectively.
Figure 3. The AUROC curve and AUPR curve of five-fold CV on the HMDAD datasets between different methods.
Figure 4. The AUROC curve and AUPR curve of 10-fold CV on the HMDAD datasets between different methods.
From the experimental data presented above, it can be observed that the GCATCMDA model proposed in this study has achieved excellent predictive performance in the task of predicting associations between microbes and diseases, surpassing methods proposed in previous studies. For instance, in the five-fold cross-validation experiment, the model obtained an approximate 1.3% improvement in AUC compared to the best previous predictive performance. Similarly, in the ten-fold cross-validation experiment, the model obtained an approximate 1.0% enhancement in AUC compared to the best previous predictive performance. The improvement in predictive performance was slightly more pronounced in the fold-fold cross-validation compared to the 10-fold cross-validation. This can be attributed to the larger training sets used in the 10-fold validation, which reduce variability across folds and provide more comprehensive data for model training. However, the reduced variability can lead to subtler improvements in performance metrics, as the model benefits from a more stable but less varied dataset. In contrast, the five-fold validation, with its larger test sets, introduces more variability, making performance improvements more apparent.
Graph transformer models offer strong capabilities in capturing global node features through their self-attention mechanisms (Li et al., 2024a,b). This allows them to handle complex and non-local structures, which can be beneficial for highly heterogeneous datasets. However, these models come with significant computational complexity, scaling quadratically with the number of nodes, making them less practical for large datasets like microbe-disease networks.
While the GCATCMDA model combines GCN and GAT to effectively capture both local features and selective attention on relevant neighbors, Graph Transformer models are designed to capture these relationships on a broader scale. The full attention mechanism of Graph Transformers allows them to dynamically weigh the importance of distant nodes, offering more flexibility in feature extraction across large and complex networks. In contrast, our GCATCMDA model, which combines GCNs and GATs, is more computationally efficient and particularly suited to smaller, sparser datasets like the HMDAD database. While graph transformers excel in capturing global relationships, our approach balances local feature aggregation and attention, offering a more efficient solution. Future work could explore integrating graph transformers to leverage their global feature-capturing capabilities alongside our model's efficiency in handling localized data.
3.3 Parameter analysis
The GCATCMDA model proposed in this study possesses several crucial parameters, such as the Gaussian kernel similarity threshold t for constructing microbe and disease similarity networks, the dimensionality F of node features, the number of network layers L for graph convolution, and the number of attention heads heads for the graph attention mechanism. Therefore, this study conducted training with different parameter combinations on the HMDAD dataset and utilized the experimental results from 10 repetitions of five-fold cross-validation to analyze the impact of these parameters on the model's performance.
As shown in Figure 5, the model fails to achieve the best predictive performance when the Gaussian kernel similarity threshold t is either set too high or too low, the optimal predictive performance of the model is attained when t = 0.4. Moreover, as the dimensionality of node features increases, the predictive performance of the model gradually improves, with the best performance observed when F = 128. Additionally, the model exhibits its best predictive performance when the number of network layers for graph convolution L = 3. Furthermore, it is observed that the evaluation metrics AUC and AUPR attain their maximum values when the number of attention heads for the graph attention mechanism heads = 2.
3.4 Ablation studies
To further validate the impact of each module in the GCATCMDA model on the prediction performance of microbe-disease associations, this study conducted ablation experiments on the HMDAD dataset. The evaluation metrics included AUC, AUPR, Precision, Recall, and F1 score. These metrics aimed to comprehensively analyze the influence of different modules on the performance of the GCATCMDA model. The experimental results represent the average scores of 10 repetitions of five-fold cross-validation experiments. Initially, given that the GCATCMDA model simultaneously utilizes microbe similarity networks, disease similarity networks, and microbe-disease association networks to learn the feature representations of microbes and diseases, this study assessed the impact of node features from different association networks on the model's prediction performance. The experimental results are illustrated in Figure 6, where GCATCMDA_sim denotes learning the feature representations of microbes and diseases only from microbe and disease similarity networks, while GCATCMDA_asso denotes learning the feature representations only from the microbe-disease association network. It can be observed from Figure 6 that integrating feature representations of microbes and diseases from different association networks effectively enhances the model's predictive performance.
Next, given that the GCATCMDA model mainly consists of GCN, GAT, feature dual fusion module, and contrastive learning module, this study attempted to remove each module individually to investigate the impact of different modules on the model's prediction performance. The experimental results are presented in Table 2, where “GCATCMDA_GCN” denotes the removal of the graph convolutional network from the original model, “GCATCMDA_GAT” denotes the removal of the graph attention mechanism, “GCATCMDA_SUM” denotes replacing the feature dual fusion module of the original model with a simple summation operation, and “GCATCMDA_CL” denotes the removal of the contrastive learning module from the original model. From the results in Table 2, it can be observed that both “GCATCMDA_GCN” and “GCATCMDA_GAT” exhibit lower predictive performance compared to the original model, indicating that the integration of graph convolutional networks and graph attention mechanisms for node feature learning is effective in obtaining more informative node feature representations from the network. The predictive performance of “GCATCMDA_SUM” is also lower than that of the original model, suggesting that the designed feature dual fusion module effectively fuses node feature information outputted by the graph attention layers. Similarly, the predictive performance of “GCATCMDA_CL” is slightly lower than that of the original model, indicating that the addition of the contrastive learning module can improve the model's predictive performance to some extent.
Finally, to investigate the impact of different operations for fusing node features outputted by the graph attention layers on the GCATCMDA model prediction performance, this study sophisticatedly combined three common feature vector operations: concatenation, sum, and element-wise product. The combined fusion feature formulas are similar to the feature dual fusion formula described earlier. The experimental results are illustrated in Figure 7. GCATCMDA_C represents the use of concatenation only, GCATCMDA_S represents the use of sum only, and GCATCMDA_H represents the use of element-wise product only. GCATCMDA_CS represents the combination of concatenation and sum, GCATCMDA_CH represents the combination of concatenation and element-wise product, GCATCMDA_SH represents the combination of sum and element-wise product, and GCATCMDA_CSH represents the combination of concatenation, sum, and element-wise product. From the experimental results in Figure 7, it can be observed that selecting the combination operations of concatenation and element-wise product in the feature dual fusion module can most effectively fuse node features outputted by the graph attention layers.
3.5 Case studies
To further validate whether the GCATCMDA model can predict associations between microbes and diseases, this study initially trained the model using all known microbial-disease associations in the HMDAD dataset. Subsequently, obesity and inflammatory bowel disease (IBD), two common diseases, were selected as subjects for case analysis. The model predicted microbial associations with obesity and IBD by sorting the predicted association scores from high to low and retaining the top 20 unknown microbial associations with high scores for these two diseases. Finally, employing a literature search approach, this study validated whether these microbial associations with diseases existed by examining relevant publications in the biomedical literature database PubMed. This validation process aimed to assess the accuracy of the microbial-disease associations predicted by the GCATCMDA model.
From Table 3, it can be observed that among the top 20 associated microbes identified by the GCATCMDA model for obesity, 16 of them have been previously documented in the literature to be associated with obesity. For instance, Xu et al. (2022), by reviewing literature on gut microbiota and obesity, identified an association between Prevotella and obesity. Baradaran et al. (2021) experimentally demonstrated that individuals positive for Helicobacter pylori infection are more likely to suffer from obesity, with an increased risk of Helicobacter pylori infection among obese individuals. From Table 4, it can be observed that in IBD, among the top 20 associated microbes identified by the GCATCMDA model, 15 have been previously demonstrated to be associated with IBD in the literature. For example, Quaglio et al. (2022) demonstrated that the abundance of Bacteroidetes and Firmicutes in patients with IBD undergoes significant changes. Cardoneanu et al. (2021) experimental research showed a significant decrease in the abundance of Clostridium coccoides in patients with IBD compared to healthy individuals.
In summary, it can be observed from Tables 3, 4 that the GCATCMDA model achieves an accuracy of over 75% in predicting potential associated microbes for both obesity and inflammatory bowel disease. Therefore, this study concludes that the GCATCMDA model can provide effective and accurate candidate sets of microbes associated with diseases, thereby reducing the research costs and duration of traditional biological experiments.
4 Conclusion
This article primarily introduces the GCATCMDA model proposed in this study, aimed at predicting potential sets of microbe-disease associations based on known microbe-disease association data. Initially, the article outlines the construction of Gaussian kernel similarity networks for microbes and diseases using known association data and explains how the model combines graph neural networks with contrastive learning to obtain effective feature representations for microbes and diseases. Subsequently, experimental evaluations are conducted to compare the GCATCMDA model with existing methods, demonstrating its superiority in microbe-disease association prediction tasks. Additionally, parameter analysis experiments validate the rationality of parameter settings in the GCATCMDA model, while ablation experiments confirm the effectiveness of each module in the model. Finally, obesity and inflammatory bowel disease are selected as case studies to validate the high accuracy of the microbe-disease association candidate sets predicted by the GCATCMDA model.
The proposed model combines GCN and GAT to leverage the strengths of both approaches. GCN effectively captures local neighborhood information by performing convolution operations over graph structures, allowing the model to aggregate features across connected nodes. However, GCN applies equal weighting to all neighboring nodes, which may limit its ability to differentiate between more and less important neighbors. To address this limitation, GAT introduces an attention mechanism that assigns different importance to neighboring nodes by computing attention coefficients. This allows the model to focus more on the relevant nodes, improving its ability to capture complex interactions. By combining GCN's ability to aggregate global structural information with GAT's selective attention on important neighbors, the proposed model effectively captures both local and global patterns within the graph, leading to enhanced predictive performance.
While our study has demonstrated the effectiveness of the GCATCMDA model in predicting microbe-disease associations, there are several limitations that must be acknowledged. First, the model has only been evaluated using the HMDAD database, and its generalization ability requires further validation across other public datasets, such as HMDA and Disbiome. The limited volume of data in this study may also hinder the model's ability to capture complex patterns, suggesting the need for more extensive datasets to enhance its predictive performance. Additionally, our current approach does not differentiate between positive and negative association information, a distinction that will be addressed in future research to refine prediction accuracy. By overcoming these limitations, we anticipate further improvements in the model's robustness and its potential application across a broader range of microbial and disease studies.
In conclusion, this study asserts that the GCATCMDA model can advance the development of deep learning algorithms in the field of microbe-disease association prediction. Moreover, it effectively aids biologists in exploring potential associations between microbes and human diseases from a big data perspective, thereby reducing the costs of traditional biological experiments and accelerating research progress in the field of gut microbes and disease association studies.
Data availability statement
The HMDAD database is available at: http://www.cuilab.cn/hmdad. The source code is available upon reasonable request to the corresponding authors.
Author contributions
CJ: Investigation, Software, Visualization, Writing – original draft. JF: Formal analysis, Validation, Writing – original draft. BS: Formal analysis, Validation, Writing – original draft. QC: Data curation, Writing – original draft. JY: Funding acquisition, Writing – review & editing. GW: Resources, Writing – review & editing. XP: Methodology, Supervision, Writing – review & editing. XL: Conceptualization, Funding acquisition, Project administration, Supervision, Writing – original draft, Writing – review & editing.
Funding
The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported by the National Natural Science Foundation Project of China (82171526), Beijing Talents Project (2020A38), and Open Fund of National Engineering Laboratory for Big Data System Computing Technology (Grant No. SZU-BDSC-OF2024-19).
Acknowledgments
The authors thank Shenzhen University Big Data Research Center for computing hardware facilities.
Conflict of interest
XL was a co-founder of JCY Biotech Ltd.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Bao, W., Jiang, Z., and Huang, D.-S. (2017). Novel human microbe-disease association prediction using network consistency projection. BMC Bioinformatics 18, 173–181. doi: 10.1186/s12859-017-1968-2
Baradaran, A., Dehghanbanadaki, H., Naderpour, S., Pirkashani, L. M., Rajabi, A., Rashti, R., et al. (2021). The association between Helicobacter pylori and obesity: a systematic review and meta-analysis of case-control studies. Clin. Diabetes Endocrinol. 7, 1–11. doi: 10.1186/s40842-021-00131-w
Blum, H. E. (2017). The human microbiome. Adv. Med. Sci. 62, 414–420. doi: 10.1016/j.advms.2017.04.005
Cardoneanu, A., Mihai, C., Rezus, E., Burlui, A., Popa, I., Prelipcean, C. C., et al. (2021). Gut microbiota changes in inflammatory bowel diseases and ankylosing spondylitis. J. Gastrointestin. Liver Dis. 30, 46–54. doi: 10.15403/jgld-2823
Chen, X., Huang, Y.-A., You, Z.-H., Yan, G.-Y., and Wang, X.-S. (2017). A novel approach based on Katz measure to predict associations of human microbiota with non-infectious diseases. Bioinformatics 33, 733–739. doi: 10.1093/bioinformatics/btw715
Henao-Mejia, J., Elinav, E., Thaiss, C. A., Licona-Limon, P., and Flavell, R. A. (2013). Role of the intestinal microbiome in liver disease. J. Autoimmun. 46, 66–73. doi: 10.1016/j.jaut.2013.07.001
Hua, M., Yu, S., Liu, T., Yang, X., and Wang, H. (2022). Mvgcnmda: multi-view graph augmentation convolutional network for uncovering disease-related microbes. Interdiscip. Sci. Comput. Life sci. 14, 669–682. doi: 10.1007/s12539-022-00514-2
Jiang, C., Tang, M., Jin, S., Huang, W., and Liu, X. (2022). Kgnmda: a knowledge graph neural network method for predicting microbe-disease associations. IEEE/ACM Trans. Comput. Biol. Bioinform. 20, 1147–1155. doi: 10.1109/TCBB.2022.3184362
Jin, Z., Wang, M., Tang, C., Zheng, X., Zhang, W., Sha, X., et al. (2024). Predicting mirna-disease association via graph attention learning and multiplex adaptive modality fusion. Comput. Biol. Med. 169:107904. doi: 10.1016/j.compbiomed.2023.107904
Kingma, D. P., and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv [Preprint]. arXiv:1412.6980. doi: 10.48550/arXiv.1412.6980
Kipf, T. N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv [Preprint]. arXiv:1609.02907. doi: 10.48550/arXiv.1609.02907
Li, G., Bai, P., Chen, J., and Liang, C. (2024a). Identifying virulence factors using graph transformer autoencoder with esmfold-predicted structures. Comput. Biol. Med. 170:108062. doi: 10.1016/j.compbiomed.2024.108062
Li, G., Bai, P., Liang, C., and Luo, J. (2024b). Node-adaptive graph transformer with structural encoding for accurate and robust lncrna-disease association prediction. BMC Genomics 25:73. doi: 10.1186/s12864-024-09998-2
Liu, H., Bing, P., Zhang, M., Tian, G., Ma, J., Li, H., et al. (2023). Mnnmda: predicting human microbe-disease association via a method to minimize matrix nuclear norm. Comput. Struct. Biotechnol. J. 21, 1414–1423. doi: 10.1016/j.csbj.2022.12.053
Long, Y., and Luo, J. (2019). Wmghmda: a novel weighted meta-graph-based model for predicting human microbe-disease association on heterogeneous information network. BMC Bioinformatics 20, 1–18. doi: 10.1186/s12859-019-3066-0
Long, Y., Luo, J., Zhang, Y., and Xia, Y. (2021). Predicting human microbe-disease associations via graph attention networks with inductive matrix completion. Brief. Bioinform. 22:bbaa146. doi: 10.1093/bib/bbaa146
Luo, J., and Long, Y. (2018). Ntshmda: prediction of human microbe-disease association based on random walk by integrating network topological similarity. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1341–1351. doi: 10.1109/TCBB.2018.2883041
Lynch, S. V., and Pedersen, O. (2016). The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379. doi: 10.1056/NEJMra1600266
Ma, W., Zhang, L., Zeng, P., Huang, C., Li, J., Geng, B., et al. (2017). An analysis of human microbe-disease associations. Brief. Bioinform. 18, 85–97. doi: 10.1093/bib/bbw005
Ma, Y., and Jiang, H. (2020). Ninimhmda: neural integration of neighborhood information on a multiplex heterogeneous network for multiple types of human microbe-disease association. Bioinformatics 36, 5665–5671. doi: 10.1093/bioinformatics/btaa1080
Marchesi, J. R., Adams, D. H., Fava, F., Hermes, G. D., Hirschfield, G. M., Hold, G., et al. (2016). The gut microbiota and host health: a new clinical frontier. Gut 65, 330–339. doi: 10.1136/gutjnl-2015-309990
Paun, A., Yau, C., and Danska, J. S. (2017). The influence of the microbiome on type 1 diabetes. J. Immunol. 198, 590–595. doi: 10.4049/jimmunol.1601519
Peng, L., Huang, L., Tian, G., Wu, Y., Li, G., Li, Z., et al. (2023). Predicting potential microbe-disease associations with graph attention autoencoder, positive-unlabeled learning, and deep neural network. Front. Microbiol. 14:1244527. doi: 10.3389/fmicb.2023.1244527
Quaglio, A. E. V., Grillo, T. G., De Oliveira, E. C. S., Di Stasi, L. C., and Sassaki, L. Y. (2022). Gut microbiota, inflammatory bowel disease and colorectal cancer. World J. Gastroenterol. 28, 4053. doi: 10.3748/wjg.v28.i30.4053
Schwabe, R. F., and Jobin, C. (2013). The microbiome and cancer. Nat. Rev. Cancer 13, 800–812. doi: 10.1038/nrc3610
Shen, Z., Jiang, Z., and Bao, W. (2017). “CMFHMDA: collaborative matrix factorization for human microbe-disease association prediction,” in Intelligent Computing Theories and Application: 13th International Conference, ICIC 2017, Liverpool, UK, August 7-10, 2017, Proceedings, Part II 13 (Cham: Springer), 261–269. doi: 10.1007/978-3-319-63312-1_24
Sommer, F., and Bäckhed, F. (2013). The gut microbiota–masters of host development and physiology. Nat. Rev. Microbiol. 11, 227–238. doi: 10.1038/nrmicro2974
Sun, F., Sun, J., and Zhao, Q. (2022). A deep learning method for predicting metabolite-disease associations via graph neural network. Brief. Bioinform. 23:bbac266. doi: 10.1093/bib/bbac266
Tooley, K. L. (2020). Effects of the human gut microbiota on cognitive performance, brain structure and function: a narrative review. Nutrients 12:3009. doi: 10.3390/nu12103009
Tseng, C.-H., and Wu, C.-Y. (2019). The gut microbiome in obesity. J. Formosan Med. Assoc. 118, S3–S9. doi: 10.1016/j.jfma.2018.07.009
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph attention networks. arXiv [Preprint]. arXiv:1710.10903. doi: 10.48550/arXiv.1710.10903
Wang, F., Huang, Z.-A., Chen, X., Zhu, Z., Wen, Z., Zhao, J., et al. (2017). LRLSHMDA: Laplacian regularized least squares for human microbe-disease association prediction. Sci. Rep. 7:7601. doi: 10.1038/s41598-017-08127-2
Wang, F., Yang, H., Wu, Y., Peng, L., and Li, X. (2023). Saelgmda: identifying human microbe-disease associations based on sparse autoencoder and lightgbm. Front. Microbiol. 14:1207209. doi: 10.3389/fmicb.2023.1207209
Wang, X., He, X., Cao, Y., Liu, M., and Chua, T.-S. (2019). “KGAT: knowledge graph attention network for recommendation,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining (New York, NY: ACM), 950–958. doi: 10.1145/3292500.3330989
Xu, Z., Jiang, W., Huang, W., Lin, Y., Chan, F. K., Ng, S. C., et al. (2022). Gut microbiota in patients with obesity and metabolic disorders–a systematic review. Genes Nutr. 17:2. doi: 10.1186/s12263-021-00703-6
Yan, C., Duan, G., Wu, F.-X., Pan, Y., and Wang, J. (2019). Brwmda: predicting microbe-disease associations based on similarities and bi-random walk on disease and microbe networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1595–1604. doi: 10.1109/TCBB.2019.2907626
Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., Wang, L., et al. (2020). Deep graph contrastive representation learning. arXiv [Preprint]. arXiv:2006.04131. doi: 10.48550/arXiv.2006.04131
Keywords: microbe-disease associations, graph convolutional network, graph attention mechanism, contrastive learning, gut microbial metagenomics
Citation: Jiang C, Feng J, Shan B, Chen Q, Yang J, Wang G, Peng X and Li X (2024) Predicting microbe-disease associations via graph neural network and contrastive learning. Front. Microbiol. 15:1483983. doi: 10.3389/fmicb.2024.1483983
Received: 22 August 2024; Accepted: 14 October 2024;
Published: 13 December 2024.
Edited by:
Liang Wang, Guangdong Provincial People's Hospital, ChinaReviewed by:
Guanghui Li, East China Jiaotong University, ChinaLihong Peng, Hunan University of Technology, China
Copyright © 2024 Jiang, Feng, Shan, Chen, Yang, Wang, Peng and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Xiaogang Peng, cGVuZ3hnJiN4MDAwNDA7c3p1LmVkdS5jbg==; Xiaozheng Li, bGkueGlhb3poZW5nJiN4MDAwNDA7c3p1LmVkdS5jbg==
†These authors have contributed equally to this work