LABAMPsGCN: A framework for identifying lactic acid bacteria antimicrobial peptides based on graph convolutional neural network

Sun, Tong-Jie; Bu, He-Long; Yan, Xin; Sun, Zhi-Hong; Zha, Mu-Su; Dong, Gai-Fang

doi:10.3389/fgene.2022.1062576

ORIGINAL RESEARCH article

Front. Genet., 03 November 2022

Sec. Computational Genomics

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.1062576

This article is part of the Research Topic: Novel Machine Learning Algorithms for Conventional Omics Data and their Application

Tong-Jie Sun¹

He-Long Bu¹

Xin Yan¹

Zhi-Hong Sun²

Mu-Su Zha²*

Gai-Fang Dong¹*

¹College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
²College of Food Science and Engineering, Inner Mongolia Agricultural University, Hohhot, China

Lactic acid bacteria antimicrobial peptides (LABAMPs) are a class of active polypeptides produced during the metabolic process of lactic acid bacteria, which can inhibit or kill pathogenic bacteria or spoilage bacteria in food. LABAMPs have broad application in important practical fields closely related to human life, such as food production and efficient agricultural planting. However, screening for antimicrobial peptides by biological experiments is time-consuming and laborious. It is therefore urgent to develop a model to predict LABAMPs. In this work, we design a graph convolutional neural network framework for identifying LABAMPs. We build a heterogeneous graph based on amino acids, tripeptides and their relationships, and learn the weights of a graph convolutional network (GCN). Our GCN iteratively learns the word embeddings and sequence weights in the graph under the supervision of the input sequence labels. We applied 10-fold cross-validation to two training datasets and obtained accuracies of 0.9163 and 0.9379, respectively, which are higher than those of other machine learning and GNN algorithms. On an independent test dataset, the accuracies for the two datasets are 0.9130 and 0.9291, which are 1.08% and 1.57% higher than the best of the other online webservers.

1 Introduction

Lactic acid bacteria (LAB) are a group of bacteria that can use fermentable carbohydrates to produce large amounts of lactic acid (Gu et al., 2022; Hu et al., 2022). Organic acids, special enzymes, lactobacilli and other substances produced by lactic acid bacteria through fermentation have special physiological functions. A large body of research shows that lactic acid bacteria can promote animal growth, regulate the normal flora of the gastrointestinal tract and maintain microecological balance, thereby improving gastrointestinal function, improving food digestibility and biological titer, reducing serum cholesterol, controlling endotoxin, inhibiting the growth of intestinal putrefactive bacteria, and improving the immunity of the body (Teusink and Molenaar, 2017). Lactic acid bacteria have been widely used in the food industry and poultry husbandry, and also have important academic value in genetic engineering (Greub et al., 2016), biochemistry (Kadomatsu, 2022), genetics (Sung Won et al., 2020) and molecular biology (Saibil, 2022).

Antimicrobial peptides of lactic acid bacteria are active peptides or proteins produced by the metabolic process of lactic acid bacteria, which can inhibit or kill pathogenic bacteria or spoilage bacteria in food. In recent years, several new methods have been developed for the screening and development of new antimicrobial peptides, including the enzyme-linked immunosorbent assay (Huang X et al., 2022), biological analysis of K+ ion current (Lauger and Apell, 1988), the ATP-bioluminescence method (Crouch et al., 1993; Aiken et al., 2011), the Lux gene-bioluminescence method (Van Dyk et al., 1994), the berberine-based fluorescence analysis method (Liu et al., 1998; Song et al., 2018) and the micro-plate method (Kai et al., 2012). Although these wet-lab methods can identify antimicrobial peptides, they are time-consuming and expensive, so they cannot be widely adopted. To help wet-lab researchers identify novel antimicrobial peptides, a variety of computational methods for antimicrobial peptide identification have been proposed. Many algorithms combine machine learning or statistical analysis techniques such as discriminant analysis (DA) (Kouw and Loog, 2021; Beck and Sharon, 2022), fuzzy K-nearest neighbors (Zhai et al., 2020), hidden Markov models (Fuentes-Beals et al., 2022), logistic regression (Fagerland and Hosmer, 2012), random forests (RF) (Ziegler and Koenig, 2014), and support vector machines (SVM) (Azar and El-Said, 2014). Although these models have made great progress in antimicrobial peptide recognition, the following challenges remain. First, many related classification tasks based on machine learning suffer from a small number of samples. A model trained on a small sample is not robust and is prone to overfitting and poor generalization. Second, most existing feature extraction techniques are designed for specific datasets and do not generalize.

In short, most existing machine learning based classification work relies on manually determined features (Jiang et al., 2021), which depend heavily on biologists. Manually determined features also have shortcomings. On the one hand, the intrinsic nonlinear information about the function of some peptides cannot be captured by such features. On the other hand, when the research object changes, manually designed features adapt poorly. In addition, the curse of dimensionality caused by feature engineering brings new difficulties to researchers.

In the past 10 years, deep learning has developed extremely rapidly. In the field of text processing, applications of natural language processing to biological information prediction have been published repeatedly. In particular, graph neural networks perform very well in text classification (Xie et al., 2022; Zhou et al., 2022). Qu et al. (2017) proposed a deep learning method to identify DNA binding protein sequences. This method uses a two-stage convolutional network to detect the functional domains of protein sequences, and uses LSTM neural networks to identify context dependencies. On the independent test set, the accuracy of the model on the yeast dataset is 80%. Hamid and Friedberg (2018) proposed a method that uses word embedding and an RNN to distinguish bacteriocin from non-bacteriocin sequences. The recall of the model on the two training datasets is 89.8% and 92.1%, respectively. Veltri et al. (2018) proposed a deep neural network model that includes an embedding layer, a convolution layer and a recurrent layer. The accuracy of the model on the independent test set is 91.01%. Zeng et al. (2019) proposed identifying protein sequences using node2vec, a convolutional neural network and a sampling technique. In this framework, node2vec is used to capture the semantic features and topology of each protein in the protein interaction network, and the convolution layer is used to extract information from gene expression profiles. The AUC of the model on the training set is 82%. He et al. (2021) proposed a new meta-learning framework based on mutual information maximization. The core of the framework is ProtoNet, a classical meta-learning algorithm based on metric learning, which learns the vector representation of each prototype. The accuracy of this model on the training set of antifungal peptides was 91.3%.

These five deep learning models have improved the performance of AMP prediction to a certain extent, but most of them combine a convolutional neural network with an LSTM neural network without significant innovation. With the recent rise of graph neural networks, more and more research is being devoted to them. Our work therefore uses a graph convolutional neural network to predict LABAMPs.

In this work, we design a graph convolutional neural network framework to predict antimicrobial peptides of lactic acid bacteria. First, we construct a large heterogeneous graph from all the samples, whose nodes are sequences and peptide fragments (amino acids, dipeptides and tripeptides, which can be thought of as words in natural language processing). The nodes are then connected as follows. An edge between two peptide fragments is created when the two fragments appear together within a fixed range (window size) of a sequence. An edge between a peptide fragment and a sequence is created when the fragment is a substring of that sequence. Finally, the nodes on the graph are classified through the calculation and transmission of information between nodes. The experimental results show that our model has clear advantages over machine learning methods, deep learning models and other webservers.

2 Materials and methods

2.1 Collection of datasets

We collected LABAMP records from 25 databases (Gueguen et al., 2006; Mulvenna et al., 2006; Fjell et al., 2007; Henderson et al., 2007; Kawashima et al., 2008; Hammami et al., 2009; Hammami et al., 2010; Sundararajan et al., 2012; Gogoladze et al., 2014; Theolier et al., 2014; Pirtskhalava et al., 2021; Shi et al., 2022) according to the classification of lactic acid bacteria into 30 genera in Supplementary Table S1. After removing duplicate records, 1622 LABAMPs were obtained, with lengths ranging from 2 to 1619.

The positive raw dataset obtained above was processed as follows. First, we removed records that contain non-standard amino acid codes such as B, J, O, U, X, and Z. Second, to reduce sequence homology bias and redundancy, we used the CD-HIT program (Li and Godzik, 2006) to delete peptides with 70% and 90% similarity to each other. After removing redundancy, we obtained 460 and 636 peptide sequences, respectively.

Our negative raw datasets were obtained as follows:

1. On the UniProt website (Consortium, 2021), we obtain peptide sequences with lengths between 2 and 1619;

2. Remove sequences that contain or are annotated with any of the terms antimicrobial, antibiotic, fungicide, defensin, AMP, membrane, toxic, secretory, defensive, anticancer, antiviral, antifungal, effector, and exacted;

3. Remove the remaining protein sequences that include non-standard amino acid codes;

4. Remove peptide sequences with 70% and 90% similarity using the CD-HIT program;

5. Randomly select the same number of sequences as the number of positive samples.
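
The filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(sequence, annotation)` record format, the function names and the plain substring matching on annotations are hypothetical, and the CD-HIT redundancy step is left out because it is an external program.

```python
import random

# The 20 standard amino acid one-letter codes; B, J, O, U, X and Z are excluded.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# Annotation terms listed in step 2, lower-cased for matching.
EXCLUDED_TERMS = (
    "antimicrobial", "antibiotic", "fungicide", "defensin", "amp",
    "membrane", "toxic", "secretory", "defensive", "anticancer",
    "antiviral", "antifungal", "effector", "exacted",
)


def is_clean(seq, min_len=2, max_len=1619):
    """True if the length is in range and only standard residues occur."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


def build_negative_set(records, n_positive, seed=0):
    """Filter (sequence, annotation) pairs and sample a balanced negative set.

    Substring matching is a crude stand-in for a real annotation filter
    (for example, "amp" also matches unrelated words).
    """
    kept, seen = [], set()
    for seq, annotation in records:
        text = annotation.lower()
        if any(term in text for term in EXCLUDED_TERMS):
            continue  # step 2: drop AMP-related annotations
        if not is_clean(seq) or seq in seen:
            continue  # steps 1 and 3, plus exact-duplicate removal
        seen.add(seq)
        kept.append(seq)
    # Step 5: random sample with as many negatives as positives.
    return random.Random(seed).sample(kept, min(n_positive, len(kept)))
```

In a real pipeline the CD-HIT step would run between the filtering and the random sampling.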

All positive and negative samples are shown in Table 1, after CD-HIT processing at the 70% and 90% thresholds. We call the resulting datasets DS-70% and DS-90%, respectively. The statistics of the preprocessed datasets are summarized in Table 2. Since we classify nodes on a graph, there is one graph each for DS-70% and DS-90%. The number of sequences is the total number of positive and negative samples in DS-70% or DS-90%. The number of words is obtained by removing stop words and words whose frequencies are less than 5. The number of nodes is the sum of the number of sequences and the number of words. Because our task is binary classification, the number of categories is two.

TABLE 1. Raw data processed through CD-HIT program.

TABLE 2. Summary statistics results of datasets.

2.2 Model construction

The model construction is divided into three steps: first, establish the initial graph, then conduct the convolution operation on the graph, and finally complete the node classification through the classification function.

2.2.1 Establishment of initial graph

Before constructing the initial graph, we preprocess all positive and negative samples. First, every sample is segmented into words, where a word is an amino acid, a dipeptide or a tripeptide. Second, we count the word frequencies and filter out all words that occur fewer than 5 times. The remaining words are the required words.
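
As a minimal sketch of this preprocessing, the snippet below segments sequences into k-residue words and keeps the words that occur at least 5 times. The use of overlapping fragments and the function names are assumptions made for illustration; the text does not state how the segmentation is done.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-residue fragments of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tokenize(seq, ks=(1, 3)):
    """Words of a sequence for the chosen fragment lengths (1 = amino acid)."""
    words = []
    for k in ks:
        words.extend(kmers(seq, k))
    return words


def build_vocabulary(sequences, ks=(1, 3), min_count=5):
    """Keep only words whose total frequency is at least min_count."""
    counts = Counter()
    for seq in sequences:
        counts.update(tokenize(seq, ks))
    return {word for word, count in counts.items() if count >= min_count}
```

For example, `kmers("KLKK", 3)` gives `["KLK", "LKK"]`, and `ks=(1, 3)` corresponds to the combination of single amino acids and tripeptides.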

Suppose the initial input graph is expressed as $G = (V, E)$. The number of nodes $|V|$ is then equal to the sum of the number of sequences and the number of peptide segments, and the number of edges depends on the connections between peptide segments and the connections between peptide segments and sequences. As shown in Figure 1A, there are two kinds of edges. The first kind connects two peptide segments: if two peptide segments occur together within the specified range of the same sequence, their corresponding nodes are connected. The second kind connects a peptide segment and a sequence: if a peptide segment is a substring of a sequence, the corresponding nodes are connected.

FIGURE 1. Overview of LABAMPsGCN model architecture. (A) Graph Construction. Each sequence is processed by word segmentation, and then the required graph is obtained by word co-occurrence technology. (B) Graph convolutional neural network. It mainly carries out message transmission through word-sequence relations. (C) Classification. It uses the fully connected layer for classification.

In order to propagate information on the graph through the edges, we establish the adjacency matrix $A$ of the initial graph, that is, we assign a weight to each edge. The calculation method is shown in Eq. 1, where $|W|$ is the total number of sliding windows in all sequences (a positive integer), $|W(i)|$ is the number of sliding windows containing peptide segment $i$ in all sequences, and $|W(i, j)|$ is the number of sliding windows containing both peptide segment $i$ and peptide segment $j$ in all sequences. $n_{i, j}$ is the number of occurrences of peptide segment $i$ in sequence $j$, and $N$ is the total peptide number of all sequences. $|D|$ is the total number of sequences, and $|\{j : i \in j\}|$ is the number of sequences containing peptide segment $i$. One is added to the denominator because $|\{j : i \in j\}|$ is zero when the peptide segment does not occur in any known sequence.

$$A_{ij} = \begin{cases} \log \dfrac{|W| \cdot |W(i, j)|}{|W(i)| \cdot |W(j)|} & i, j \text{ are words} \\ \dfrac{n_{i, j}}{N} \cdot \log \dfrac{|D|}{|\{j : i \in j\}| + 1} & i \text{ is a word}, j \text{ is a sequence} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
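
The two non-trivial cases of Eq. 1 can be computed as sketched below. This is a minimal illustration, not the authors' code: the function names are hypothetical, windows slide over each sequence's word list one word at a time, a sequence shorter than the window forms a single window, and $N$ is taken to be the total number of word occurrences over all sequences. The formula is applied as written, so word-word weights may be negative.

```python
import math
from collections import Counter
from itertools import combinations


def sliding_windows(tokens, window_size):
    """Consecutive windows over a word list; short lists form one window."""
    if not tokens:
        return []
    if len(tokens) <= window_size:
        return [tokens]
    return [tokens[i:i + window_size]
            for i in range(len(tokens) - window_size + 1)]


def word_word_weights(token_lists, window_size):
    """log(|W| * |W(i,j)| / (|W(i)| * |W(j)|)) for co-occurring word pairs."""
    n_windows = 0          # |W|
    single = Counter()     # |W(i)|
    pair = Counter()       # |W(i, j)|, keyed by the sorted pair (i, j)
    for tokens in token_lists:
        for window in sliding_windows(tokens, window_size):
            n_windows += 1
            present = sorted(set(window))
            single.update(present)
            pair.update(combinations(present, 2))
    return {
        (i, j): math.log(n_windows * w_ij / (single[i] * single[j]))
        for (i, j), w_ij in pair.items()
    }


def word_sequence_weights(token_lists):
    """(n_ij / N) * log(|D| / (|{j : i in j}| + 1)) for word i in sequence j."""
    n_total = sum(len(tokens) for tokens in token_lists)  # N (assumed meaning)
    n_seqs = len(token_lists)                             # |D|
    seq_freq = Counter()                                  # |{j : i in j}|
    for tokens in token_lists:
        seq_freq.update(set(tokens))
    weights = {}
    for j, tokens in enumerate(token_lists):
        for i, n_ij in Counter(tokens).items():
            weights[(i, j)] = n_ij / n_total * math.log(n_seqs / (seq_freq[i] + 1))
    return weights
```

Each word pair is stored once; since $A$ is symmetric, the same value applies to the edge in both directions.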

2.2.2 Graph convolutional network module

Word embedding is a method for converting a word into a vector representation. There are many methods for word embedding, such as one-hot embedding, the Skip-Gram model (Carrasco and Sicilia, 2018), the CBOW model (Xiong et al., 2019) and GloVe word vectors (Gao and Huang, 2021). In this module, we first need to determine the node features of the initial graph. We use one-hot embedding to embed each word and send it to the model together with the sequences for training. Because the initial values of node features have little influence on the graph convolutional neural network, we set the feature matrix $X$ to the identity matrix $I$.

Since the diagonal elements of the adjacency matrix are 0, the information of the nodes themselves is easily lost during the calculation, so an identity matrix is added to the adjacency matrix. To avoid changing the feature distribution, the adjacency matrix with the added identity matrix is normalized to obtain the processed adjacency matrix $\mathrm{Norm}(A + I)$ (Gao and Huang, 2021).
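
The text does not spell out which normalization $\mathrm{Norm}$ denotes. A common choice in graph convolutional networks is the symmetric normalization $D^{-1/2}(A + I)D^{-1/2}$, where $D$ is the diagonal degree matrix of $A + I$. The sketch below assumes that choice; the function name is hypothetical.

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed form of Norm)."""
    A_hat = A + np.eye(A.shape[0])       # add self-loops
    deg = A_hat.sum(axis=1)              # row sums of A + I
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    positive = deg > 0                   # guard: weighted rows may sum to <= 0
    d_inv_sqrt[positive] = deg[positive] ** -0.5
    # Scale rows and columns by D^{-1/2}.
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

For a two-node graph with a single unit-weight edge, every entry of the result is 0.5.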

We design a graph convolutional neural network framework to learn the information between nodes on the graph and to propagate the related information under the supervision of labels, and finally to classify the nodes. The graph convolutional neural network framework for LABAMPs can be expressed as Eq. 2.

$$R = \mathrm{softmax}\left(\mathrm{Norm}(A + I) \cdots \mathrm{ReLU}\left(\mathrm{Norm}(A + I)\, X W_{0}\right) \cdots W_{n}\right) \tag{2}$$

Learning the network under the supervision of sequence labels requires a loss, and we use the cross-entropy loss function to calculate it (Aurelio et al., 2019). Eq. 2 is the general form of LABAMPsGCN. Figure 1B shows a two-layer LABAMPsGCN. In the following sections, we show that the two-layer LABAMPsGCN has the best performance.
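
For the two-layer case, Eq. 2 reduces to $R = \mathrm{softmax}(\mathrm{Norm}(A + I)\, \mathrm{ReLU}(\mathrm{Norm}(A + I)\, X W_0)\, W_1)$. The NumPy sketch below shows only this forward pass and a cross-entropy loss restricted to labelled sequence nodes. The toy sizes, random weights and placeholder adjacency matrix are illustrative assumptions, and no training loop is included.

```python
import numpy as np


def softmax(Z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


def gcn_forward(A_norm, X, W0, W1):
    """Two-layer GCN: softmax(A_norm @ ReLU(A_norm @ X @ W0) @ W1)."""
    H = np.maximum(A_norm @ X @ W0, 0.0)   # first layer with ReLU
    return softmax(A_norm @ H @ W1)        # second layer, class probabilities


def cross_entropy(R, labels, labelled_idx):
    """Mean negative log-probability of the true class over labelled nodes."""
    p = R[labelled_idx, labels[labelled_idx]]
    return float(-np.mean(np.log(p + 1e-12)))


# Toy usage: 5 nodes (3 sequences + 2 words), identity features, 2 classes.
rng = np.random.default_rng(0)
n_nodes, hidden, n_classes = 5, 4, 2
A_norm = np.full((n_nodes, n_nodes), 1.0 / n_nodes)  # placeholder for Norm(A + I)
X = np.eye(n_nodes)                                  # X = I, as in the text
W0 = rng.normal(size=(n_nodes, hidden))
W1 = rng.normal(size=(hidden, n_classes))
R = gcn_forward(A_norm, X, W0, W1)
labels = np.array([1, 0, 1, 0, 0])   # entries for word nodes are ignored
sequence_idx = np.array([0, 1, 2])   # only sequence nodes carry labels
loss = cross_entropy(R, labels, sequence_idx)
```

Restricting the loss to sequence nodes reflects that only sequences have labels; word nodes take part in message passing but not in the loss.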

2.2.3 Classification module

We use a fully connected layer to map the feature space into the sample label space, and then use the softmax classification function to calculate the probability of each node belonging to each category, as shown in Figure 1C.

2.3 Evaluation metrics

To assess the performance of LABAMPsGCN, we adopt the statistical metrics precision, recall, accuracy and $\mathrm{F1\_score}$. They are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$\mathrm{F1\_score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$

$TP$, $TN$, $FP$ and $FN$ are the four components of the confusion matrix, and stand for true positive, true negative, false positive and false negative, respectively. Precision is the proportion of correctly predicted positive samples among all samples predicted to be positive. Recall is the proportion of correctly predicted positive samples among all samples that are actually positive. Accuracy is the percentage of correct predictions among all samples. $\mathrm{F1\_score}$ is the harmonic mean of precision and recall.
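
Eqs 3 to 6 translate directly into code. The helper below is a small sketch with a hypothetical function name; it returns 0.0 for a metric whose denominator is zero, which is a convention chosen here rather than something stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

With made-up counts `tp=40, tn=45, fp=5, fn=10`, precision is 40/45, recall is 40/50 and accuracy is 85/100.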

2.4 Implementation details

The parameters of a model have an important impact on its performance. In LABAMPsGCN, we set the activation function, window size, first-layer convolution size, learning rate and loss rate to ReLU, 15, 200, 0.01, and 0.5, respectively. We used the Adam optimizer (Shao et al., 2021) to train the model for 150 epochs.

2.5 Development of the webserver

We constructed a webserver with a prediction model embedded at the back end of the website. When users submit candidate LABAMP sequences of interest, the predicted percentage is displayed by the website prediction tool (Sim et al., 2012). Because the weight matrix of the graph convolutional neural network changes with the adjacency matrix and feature matrix of the input data, we embedded an SVM model instead, whose accuracy is 3.77% lower than that of LABAMPsGCN.

3 Results

3.1 Effects of different feature extraction methods on graph convolutional neural networks

We combined single peptide (single amino acid), dipeptide and tripeptide features and obtained six feature combinations: dipeptide; dipeptide and single peptide; tripeptide; tripeptide and single peptide; tripeptide and dipeptide; and tripeptide, dipeptide and single peptide. Table 3 shows the model accuracy on DS-70% and DS-90%.

TABLE 3. The different accuracy of different features on LABAMPsGCN.

The combination of tripeptide and single peptide features is significantly better than the other combinations on both DS-70% and DS-90%. As the number of features increases, the accuracy (ACC) on the test data also increases slowly, and on both DS-70% and DS-90% it begins to decline markedly once the number of features exceeds 8020.

3.2 Parameter sensitivity

3.2.1 Window sizes

Figure 2A reports the accuracy for different sliding window sizes on DS-70% and DS-90%, based on tripeptide and single peptide features. The influence of the window size on prediction accuracy follows the general pattern: accuracy first rises and then falls, with 15 as the turning point. A likely explanation is that small windows cannot accommodate the sequence fragments that play key functional roles, while overly large windows treat irrelevant information as key information, which disturbs the prediction. Therefore, in this paper, the window size is set to 15.

3.2.2 Graph convolutional network layer

We designed GCNs with different numbers of layers to obtain the features of LABAMPs. Figure 2B shows the effect of the number of GCN layers on the performance of our model. We varied the number of GCN layers in {1, 2, 3, 4}. Figure 2B shows that the 2-layer GCN achieves the best performance. Too many GCN layers cause over-smoothing, which makes the learned model collapse. Although there are no direct sequence-sequence edges in the graph, a 2-layer GCN lets two sequence nodes exchange information through an intermediate word node, thus realizing sequence-to-sequence information interaction.

FIGURE 2. Parameter analysis of LABAMPsGCN. (A) Accuracy varied by windows size. (B) Accuracy varied by numbers of layers.

If there are too many layers, the features of a node aggregate the features of more and more neighbors, so that the nodes become similar to one another. This increases the similarity between classes, and classification performance naturally suffers.
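
The over-smoothing effect described above can be seen in a deliberately simplified setting. The sketch below repeatedly applies a row-normalized propagation matrix to one-hot node features on a small path graph and tracks how far apart the node representations remain. It uses row normalization, no learned weights and no nonlinearity, all of which are simplifying assumptions; it illustrates the mechanism and is not a reproduction of the experiment in Figure 2B.

```python
import numpy as np


def row_normalize(A):
    """Row-stochastic propagation matrix D^{-1} (A + I)."""
    A_hat = A + np.eye(A.shape[0])
    return A_hat / A_hat.sum(axis=1, keepdims=True)


def feature_spread(H):
    """Largest per-feature range (max minus min) across all nodes."""
    return float((H.max(axis=0) - H.min(axis=0)).max())


# Path graph with 6 nodes: 0 - 1 - 2 - 3 - 4 - 5.
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1.0

P = row_normalize(A)
H = np.eye(6)              # one-hot node features
spreads = []
for layer in range(8):     # each step mimics one propagation layer
    H = P @ H
    spreads.append(feature_spread(H))
```

Each propagation step replaces a node's features with an average over its neighborhood, so the spread can only shrink as layers are stacked, and nodes become harder to tell apart.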

3.3 Comparison with machine learning methods

In order to verify the performance of LABAMPsGCN, we compare it with machine learning models on the same features. All results in Table 4 were obtained using 10-fold cross-validation. The compared models are the Multinomial Bayesian classifier (MNB), Random forest (RF), Support vector machine (SVM), AdaBoost (Huang H et al., 2022) and XGBoost (Zhang et al., 2022). LABAMPsGCN shows good performance regardless of which features are used. This is because LABAMPsGCN can obtain the information of sequence nodes through word nodes.
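
For readers unfamiliar with the protocol, a 10-fold split can be generated as in the sketch below. The function name and the shuffling scheme are illustrative assumptions; the text does not describe how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A model is trained on each `train` list and evaluated on the matching `test` list, and the reported metric is typically the average over the ten folds.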

TABLE 4. Comparisons of LABAMPsGCN with machine learning and GNN models.

3.4 Comparison with existing AMP prediction tools

Table 5 compares our LABAMPsGCN model with three publicly available state-of-the-art machine learning methods for AMP recognition. Overall, our LABAMPsGCN model achieves the best values for Recall, Precision and Accuracy. The exception is Recall on DS-70%, where the AMPfun model (Chung et al., 2020) scores highest (3.42% higher than our model). On DS-90%, the metrics of our LABAMPsGCN model are significantly better than those of the other methods.

TABLE 5. Comparisons of LABAMPsGCN with three state-of-the-art webservers.

3.5 Ablation study

To judge whether all parts of our identifier are necessary, we compare against three variants of LABAMPsGCN: LABAMPsGCN-noFC, LABAMPsGCN-cheby and LABAMPsGCN-cheby-noFC. LABAMPsGCN-noFC does not add a fully connected layer after the GCN layer for classification, and instead uses the output of the GCN layer directly for classification. LABAMPsGCN-cheby adds Chebyshev polynomials (Christiansen et al., 2021), which use a polynomial expansion to approximate the graph convolution, that is, a polynomial approximation of parameterized frequency response functions. LABAMPsGCN-cheby-noFC adds Chebyshev polynomials and has no fully connected layer after the GCN layer output.
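
The LABAMPsGCN-cheby variant is described only at a high level here. As a rough illustration of what a Chebyshev polynomial approximation of graph convolution usually looks like, the sketch below builds the rescaled Laplacian $\tilde{L} = 2L/\lambda_{\max} - I$ and the basis $T_0 = I$, $T_1 = \tilde{L}$, $T_k = 2\tilde{L}T_{k-1} - T_{k-2}$, then forms $\sum_k T_k(\tilde{L}) X \Theta_k$. The function names are hypothetical, non-negative edge weights are assumed, and this is not claimed to be the authors' implementation.

```python
import numpy as np


def scaled_laplacian(A):
    """Rescaled normalized Laplacian 2 L / lambda_max - I (non-negative A)."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    positive = deg > 0
    d_inv_sqrt[positive] = deg[positive] ** -0.5
    L = np.eye(n) - A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()   # largest eigenvalue of L
    return 2.0 * L / lam_max - np.eye(n)


def chebyshev_basis(L_tilde, order):
    """[T_0, ..., T_order] via T_k = 2 L~ T_{k-1} - T_{k-2}."""
    basis = [np.eye(L_tilde.shape[0]), L_tilde]
    for _ in range(2, order + 1):
        basis.append(2.0 * L_tilde @ basis[-1] - basis[-2])
    return basis[:order + 1]


def chebyshev_conv(A, X, thetas):
    """Filter output sum_k T_k(L~) @ X @ thetas[k]."""
    basis = chebyshev_basis(scaled_laplacian(A), len(thetas) - 1)
    return sum(T @ X @ theta for T, theta in zip(basis, thetas))
```

A filter of order $K$ mixes information from nodes up to $K$ hops away in a single layer, which is consistent with the explanation in this section that each sequence vertex may fuse more, possibly irrelevant, information.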

Table 6 shows the evaluation metrics obtained by training LABAMPsGCN and its variants on DS-90%. All four models were trained on tripeptide and single peptide features. The ACC of LABAMPsGCN was significantly higher than that of LABAMPsGCN-noFC. This is because the fully connected layer integrates the feature representations and maps them to the space of the sample labels. The performance of LABAMPsGCN-cheby is slightly worse than that of LABAMPsGCN because the use of Chebyshev polynomials makes each sequence vertex fuse too much irrelevant information. Comparing LABAMPsGCN with LABAMPsGCN-cheby-noFC, the model with the fully connected layer again performs significantly better than the one without it.

TABLE 6. Performance evaluation of LABAMPsGCN and its three variants.

3.6 Visualization of words

LABAMPsGCN learned many word features related to the labels. To observe these words clearly, we visualized them qualitatively. Figure 3 shows the t-SNE visualization (Ruit et al., 2022) of the second-layer word features learned from DS-70% and DS-90%. We take the dimension with the maximum value in each word feature vector as the label of the word. As can be seen from Figure 3, words of the same color are clustered together, which means that a large number of words are closely related to certain specific classes. The red, green and orange colors in Figure 3 are used to determine whether the word embeddings can learn the main information of some sequences. Different colors in the figure represent different sequences. Figure 3A and Figure 3B show the results for DS-70% and DS-90%, respectively. In Table 7, we show the top representative words in each category, such as “ILE,” “TIW,” and “KLK”.
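
The labelling rule described above (the dimension holding the maximum value becomes the word's label) amounts to an argmax over each word vector. The sketch below shows that rule, together with one hypothetical way of ranking words within a class by the value in that dimension; the ranking criterion and the function names are assumptions, since the text does not define how the representative words are chosen.

```python
def assign_word_labels(word_features):
    """Map each word to the index of the largest entry of its feature vector."""
    return {word: max(range(len(vec)), key=lambda d: vec[d])
            for word, vec in word_features.items()}


def top_words_per_class(word_features, n_top=2):
    """Group words by assigned label and keep the n_top strongest per class."""
    ranked = {}
    for word, cls in assign_word_labels(word_features).items():
        ranked.setdefault(cls, []).append(word)
    for cls, words in ranked.items():
        # Assumed criterion: larger value in the class dimension ranks higher.
        words.sort(key=lambda w: word_features[w][cls], reverse=True)
        ranked[cls] = words[:n_top]
    return ranked
```

With made-up two-dimensional vectors such as `{"AAA": [0.1, 0.9], "GGG": [0.8, 0.2]}`, the word `"AAA"` is assigned label 1 and `"GGG"` label 0.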

FIGURE 3. The t-SNE visualization of the second-layer word features learned from DS-70% and DS-90%. (A) Second-layer word features learned from DS-70%. (B) Second-layer word features learned from DS-90%.

TABLE 7. Words with highest ACCs for two datasets of DS-70% and DS-90%. We used the word embedding at the last level of GCN to view the best performing words under each category.

4 Discussion

In this study, we constructed LABAMPsGCN, a novel graph-based identifier to predict LABAMPs accurately. In this identifier, we designed a graph convolutional neural network framework to learn sequence features automatically. By retrieving and reorganizing multiple AMP databases and the UniProt database, we constructed the positive and negative datasets. We organized the positive and negative samples into a large heterogeneous graph, transforming the sequence classification problem into a node classification problem. A graph convolutional neural network can aggregate the information of the surrounding nodes to predict the label of the central node.

LABAMPsGCN is superior to other predictors for two reasons. First, the graph structure can effectively represent the relationship between sequences and words: when constructing the graph, an edge is established between a word and a sequence when the word belongs to the sequence. Second, the label information of sequences can be propagated through the edges of the graph. Because the graph is a many-to-many structure, the label information of sequences can spread across the whole graph. In this way, the words corresponding to positive and negative labels can be easily distinguished. These words may be the key features that determine whether a sequence is a LABAMP.

For users’ convenience, we have established a publicly accessible web server (http://www.dong-group.cn/database/dlabamp/Prediction/amplab/result/) that can help to predict LABAMPs produced by various lactic acid bacteria. As a next step, we will investigate how to mine the key fragments with antimicrobial function from whole genome sequences by combining information such as multiple sequence alignment and domain prediction. We believe LABAMPsGCN will be a competent tool for screening lactic acid bacteria strains with antimicrobial activities.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author contributions

T-JS conducted experiments, analyzed data and wrote the manuscript. H-LB and XY collected data and made the webserver. Z-HS guided the collection of data and the construction of the webserver. G-FD and M-SZ supervised the experiment and managed the whole project.

Funding

This work has been supported by the High level Talent Fund Project of Inner Mongolia Agricultural University, China (No. NDYBH 2017-1, NDYB 2018-9), Inner Mongolia Natural Science Foundation Project (No.2021MS06023), the National Natural Science Foundation Project (No.31901666), Major Project of Inner Mongolia Natural Science Foundation (2020ZD12), and 2022 Basic Scientific Research Business Fee Project of Universities Directly under the Inner Mongolia Autonomous Region—Interdisciplinary Research Fund of Inner Mongolia Agricultural University (BR22-14-01).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2022.1062576/full#supplementary-material

References

Aiken, Z. A., Wilson, M., and Pratten, J. (2011). Evaluation of ATP bioluminescence assays for potential use in a hospital setting. Infect. Control Hosp. Epidemiol. 32, 507–509. doi:10.1086/659761

Aurelio, Y. S., de Almeida, G. M., de Castro, C. L., and Braga, A. P. (2019). Learning from imbalanced data sets with weighted cross-entropy function. Neural Process. Lett. 50, 1937–1949. doi:10.1007/s11063-018-09977-1
