ORIGINAL RESEARCH article

Front. Microbiol., 06 June 2022

Sec. Systems Microbiology

Volume 13 - 2022 | https://doi.org/10.3389/fmicb.2022.912145

Disease-Ligand Identification Based on Flexible Neural Tree

  • 1. School of Information Science and Engineering, Zaozhuang University, Zaozhuang, China

  • 2. School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, China

  • 3. Xuzhou No.1 People’s Hospital, Xuzhou, China

Abstract

To accurately screen the disease-related compounds of a traditional Chinese medicine prescription in network pharmacology research, a new virtual screening method based on a flexible neural tree (FNT) model, a hybrid evolutionary method, and a negative sample selection algorithm is proposed. A novel hybrid evolutionary algorithm based on grammar-guided genetic programming and the salp swarm algorithm is proposed to infer the optimal FNT. Compounds related to hypertension, diabetes, and Corona Virus Disease 2019 are collected from the up-to-date literature. The unrelated compounds are chosen by the negative sample selection algorithm. ECFP6, MACCS, Macrocycle, and RDKit are each utilized to numerically characterize the chemical structure of every collected compound. The experimental results show that our proposed method performs better than classical classifiers [Support Vector Machine (SVM), random forest (RF), AdaBoost, decision tree (DT), Gradient Boosting Decision Tree (GBDT), KNN, logistic regression (LR), and Naive Bayes (NB)], an up-to-date classifier (gcForest), and a deep learning method (forgeNet) in terms of AUC, ROC, TPR, FPR, Precision, Specificity, and F1. The MACCS descriptor suits the largest number of classifiers. All methods perform poorly with the ECFP6 molecular descriptor.

Introduction

Computer-aided drug design (CADD) has gradually become an indispensable technology in the research and development of new drugs (Leelananda and Steffen, 2016; Tong et al., 2019; Maia et al., 2020). CADD reduces the capital, time, and labor costs of drug development and greatly improves its efficiency (Gomeni et al., 2001). Virtual screening is one of the important techniques in CADD; it is a computational process for discovering new ligands on the basis of biological structure (Guasch et al., 2016; Olubiyi et al., 2020; Rajguru et al., 2020). It is a new technology for innovative drug research. By exploiting high-speed computation, a small number of potentially active compounds are screened from a large number of candidate compounds, which greatly reduces the blindness of subsequent experimental verification. Because of its high efficiency, high speed, and low cost, virtual screening is expected to become an important means of exploring the relevant biochemical space (Zaslavskiy et al., 2019; Guo et al., 2021; Maddah et al., 2021; Selvaraj et al., 2021; Yang et al., 2021).

In the past decade, virtual screening has been widely applied in medical and pharmaceutical research (Meng et al., 2011; Bajusz et al., 2017). The most commonly used virtual screening method is molecular docking, and the software involved includes AutoDock, SLIDE, DOCK, FlexX, etc. (Morris et al., 1996; Kellenberger et al., 2004; Taufer et al., 2005). Fischer et al. (2021) utilized a virtual screening method to screen 25, 56, 750 compounds in order to analyze the binding of small molecules to translationally controlled tumor protein. Baxter et al. (2000) utilized molecular docking, assisted by a tabu search method, to screen ligand-receptor complexes in a virtual database. Talluri (2021) utilized Vina and SMINA for molecular docking to predict potential drugs for the treatment of Corona Virus Disease 2019 (COVID-19). Zhou et al. (2016) screened the compounds of chicory that bind to concentrative nucleoside transporter 2 (CNT2) in order to validate that CNT2, as the potential target of chicory, could reduce the absorption of purine nucleosides in the intestine. Meenakumari et al. (2019) performed docking analysis between 17 coumarin derivatives and carbonic anhydrase IX (CAIX) to screen the ligands. Thiyagarajan et al. (2016) performed molecular docking between the 3D structures of focal adhesion kinase and S6 kinase and 60 natural compounds to obtain new specific inhibitors, and the findings could help the treatment of tumorigenesis and metastasis.

In order to improve the speed and accuracy of virtual screening, some machine learning methods have been utilized to assist or replace molecular docking (Berishvili et al., 2018; Zaki et al., 2021). Wang et al. (2016) proposed a new virtual screening method based on ensemble learning and SVM to handle protein-ligand interaction fingerprints. Zhang Y. et al. (2019) investigated the performances of 8 classifiers, namely decision tree (DT), KNN, SVM, random forest (RF), extremely randomized trees, AdaBoost, gradient boosting tree, and XGBoost, with ACC inhibitor data for research on drug design and discovery. Zhang et al. (2017) proposed a new scoring function based on machine learning to screen compounds targeting the viral neuraminidase protein for anti-influenza therapy. Chen et al. (2011) proposed a ligand screening algorithm based on SVM to discover lead compounds. Bustamam et al. (2021) proposed a dipeptidyl peptidase-4 (DPP-4) inhibitor identification method based on Rotation Forest and Deep Neural Network with fingerprint datasets for the treatment of type 2 diabetes mellitus. Zheng et al. (2020) utilized Naïve Bayesian and recursive partitioning with ECFP_6 and MACCS feature sets to select the important active chemical components from the many compounds in the Xiaoshuan Tongluo formula for treating stroke.

Virtual screening of disease-related compounds can narrow the scope of analysis in network pharmacology research. In this paper, a new virtual screening method based on the flexible neural tree (FNT) model is proposed to screen the disease-related active compounds. A novel hybrid evolutionary algorithm based on grammar-guided genetic programming and the salp swarm algorithm is proposed to infer the structure and parameters of each FNT model. Compounds related to 3 diseases (hypertension, diabetes, and COVID-19) are collected from the up-to-date literature. The unrelated compounds are selected by the negative sample selection algorithm from the DUD-E website. Four kinds of molecular descriptors (ECFP6, MACCS, Macrocycle, and RDKit) are utilized to numerically characterize the chemical structures of the related and unrelated compounds of each disease, and the performances of these 4 molecular descriptors are investigated.

Materials and Methods

Flexible Neural Tree Model

In order to solve the automatic design problem of artificial neural networks, FNT was proposed, which is a hierarchical, multilayer, and irregular artificial neural network (Chen et al., 2012). FNT transforms a single, fixed neural network model into a special tree model that can change flexibly between levels. It overcomes the difficulty of structural optimization of common neural networks, adapts well to various classification and prediction problems, and obtains high classification and prediction accuracy. In this paper, FNT is used to predict active disease-related compounds. An example of the structure of an FNT model is shown in Figure 1. An FNT includes an input layer, several hidden layers, and an output layer. The nodes in the input layer are created randomly from the terminal set T = {x1, x2,…,xn}. The nodes in the hidden layers are selected randomly from the terminal set and the operator set F = { + 2, + 3,…, + n}. The output layer contains one node.

FIGURE 1

In FNT, each layer is randomly generated according to the operator set and the terminal set. The maximum depth of the tree is set in advance. If an operator + n is selected, n branches are created randomly from the sets T and F (terminal variables and operators), and n weights are generated randomly. If a terminal variable is selected, the corresponding branch is terminated. When an FNT is created randomly, its depth must not exceed the maximum depth. The operator + n is depicted in Figure 2 and is calculated as follows.

FIGURE 2

The final output of + n is calculated by activation function, which is given as follows.

Where an and bn are parameters of activation function.
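The node computation described above can be sketched in a few lines. Eqs. (1) and (2) are not reproduced in this text, so the sketch assumes the formulation commonly used in the FNT literature: a weighted sum net_n = Σ w_j x_j followed by the Gaussian-type activation exp(−((net_n − a_n)/b_n)²). The tuple-based tree encoding and the function name `evaluate` are illustrative choices, not part of the source.

```python
import math

def evaluate(node, x):
    """Recursively evaluate an FNT encoded as nested tuples.

    Terminal node: ("x", i)                          -> returns x[i]
    Operator node: ("+", weights, a, b, children)    -> flexible neuron output
    The Gaussian-type activation is an assumption (see lead-in).
    """
    if node[0] == "x":
        return x[node[1]]
    _, weights, a, b, children = node
    if len(weights) != len(children):
        raise ValueError("one weight per branch is required")
    # Weighted sum of the branch outputs (assumed form of Eq. 1).
    net = sum(w * evaluate(child, x) for w, child in zip(weights, children))
    # Activation with parameters a and b (assumed form of Eq. 2).
    return math.exp(-((net - a) / b) ** 2)

# A +2 root whose second branch is itself a +2 node (numbers are illustrative).
inner = ("+", [1.0, 1.0], 0.0, 1.0, [("x", 1), ("x", 2)])
tree = ("+", [0.4, 0.3], 0.5, 1.0, [("x", 0), inner])
output = evaluate(tree, [0.5, 0.2, -0.2])
```

Because the activation is bounded in (0, 1], the root output can be compared directly with a decision threshold.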

Model Optimization Algorithm

Grammar-Guided Genetic Programming

Grammar-guided genetic programming (GGGP) was proposed in order to overcome the shortcomings of genetic programming (Wu and Chen, 2007). In this paper, GGGP is utilized to search the optimal structure of FNT model. In GGGP, context-free grammar (CFG) model is utilized to guide the evolutionary process of GP in order to search the optimal solution faster.

The CFG model is a quadruple, represented as G = {N, T, P, Σ}, where N is the non-terminal symbol set, T is the terminal symbol set, P is the production rule set, and Σ is the start symbol set. The 4 sets satisfy the conditions N ∩ T = ∅ and Σ ∈ N. An element of the production rule set is represented as x → y, where x ∈ N and y is composed of symbols from N ∪ T. Assuming that the terminal set and the operator set of the FNT are T = {x1, x2,…,xn} and F = { + 2, + 3}, the 4 sets of the CFG model are defined as N = {s, exp, var, op2, op3}, T = { + 2, + 3, x1, x2, …, xn}, Σ = {s}, and P is represented with Eq. (3) or Eq. (4).

The initial population is generated randomly. Each individual tree is derived starting from the non-terminal start symbol s. The subtree of each non-terminal node is then derived in top-down, left-to-right order according to the rules of the grammar model. When all non-terminal nodes in the tree have subtrees, the derivation stops and the depth of the tree is checked. If the depth is greater than the predefined maximum depth, the tree is considered invalid; it is deleted and a new tree is generated. If the depth does not exceed the maximum depth, the tree is considered valid and is saved to the population. Then 3 genetic operators (replication, crossover, and mutation) are utilized to generate a new population in each iteration.
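The derivation procedure can be illustrated with a small sketch. The production rules of Eqs. (3) and (4) are not reproduced in this text, so the grammar below is an assumption built from the stated symbol sets (s → exp; exp → op2 exp exp | op3 exp exp exp | var; op2 → +2; op3 → +3; var → x1 | … | xn). As an implementation shortcut that is also an assumption, a derivation is abandoned as soon as it exceeds the depth limit instead of being completed and then discarded, and depth is counted in derivation levels.

```python
import random

class TooDeep(Exception):
    """Signals that a derivation exceeded the maximum depth."""

def make_grammar(n_inputs):
    """Hypothetical production rules: non-terminal -> list of right-hand sides."""
    return {
        "s": [["exp"]],
        "exp": [["op2", "exp", "exp"], ["op3", "exp", "exp", "exp"], ["var"]],
        "op2": [["+2"]],
        "op3": [["+3"]],
        "var": [[f"x{i}"] for i in range(1, n_inputs + 1)],
    }

def derive(symbol, grammar, rng, depth, max_depth):
    """Expand `symbol` top-down and left to right; returns (symbol, children)."""
    if symbol not in grammar:            # terminal symbol: a leaf
        return (symbol, [])
    if depth >= max_depth:               # a non-terminal here would exceed the limit
        raise TooDeep(symbol)
    rhs = rng.choice(grammar[symbol])
    return (symbol, [derive(s, grammar, rng, depth + 1, max_depth) for s in rhs])

def random_tree(grammar, rng, max_depth, max_tries=1000):
    """Regenerate until a derivation tree within the depth limit is obtained."""
    for _ in range(max_tries):
        try:
            return derive("s", grammar, rng, 0, max_depth)
        except TooDeep:
            continue
    raise RuntimeError("no valid tree found within max_tries")

def depth_of(node):
    """Depth of a derivation tree counted in edges."""
    _, children = node
    return 0 if not children else 1 + max(depth_of(c) for c in children)

def leaves(node):
    """Terminal symbols in left-to-right order (prefix form of the FNT)."""
    symbol, children = node
    return [symbol] if not children else [t for c in children for t in leaves(c)]

rng = random.Random(0)
tree = random_tree(make_grammar(3), rng, max_depth=8)
```

Reading the leaves from left to right gives the FNT structure in prefix notation, for example an operator followed by its branches.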

Salp Swarm Algorithm

The salp swarm algorithm (SSA) is a swarm optimization algorithm proposed by Mirjalili et al. (2017). The main idea of SSA comes from simulating the group behavior of a salp chain (Babaei et al., 2020; Ren et al., 2021). In this algorithm, the salp chain is divided into 2 groups: leader and followers. The leader is at the head of the salp chain, and the followers are behind it. In each iteration, the leader directs the followers to move in a chain toward the food. While moving, the leader performs global search and the followers perform local search, which helps to avoid falling into local optima. The leader's influence on the followers becomes weaker toward the end of the chain, so the followers at the back do not blindly move toward the leader, which maintains the diversity of the population. This movement mode therefore gives the salp chain a strong ability for both global search and local exploitation. Because of its simple implementation and fast convergence, SSA is utilized to optimize the parameters of the FNT model. The SSA is described in detail as follows.

(1) Initialize the population. Suppose that the population size is m and the dimension is n, and that the search space has an upper bound and a lower bound. The positions of the salp population are created randomly within these bounds by the following equation.

(2) Compute the fitness values of the population according to the fitness function defined in advance. In the iteration process, the position of the food is not known, so the fitness values of all individual salps are calculated and sorted. The position of the salp with the optimal fitness value is set as the current food position, which is denoted as F = {F1, F2,…, Fn}.

(3) Positions of leader and followers are updated. The leader is responsible for searching food to lead the moving direction of the whole group. The position of the leader is updated as follows (Chen and Mu, 2021).

Where the leader position and Fi are the i-th components of the positions of the leader (the first salp) and the food, respectively. c2 and c3 are random numbers. c1 is the convergence factor in SSA, which balances global search and local exploitation. c1 is calculated as follows.

Where t is the current generation and T is the maximum generation.

The positions of the followers are updated according to Newton’s laws of motion, which is defined as follows.

Where a is acceleration. The difference between two adjacent iterations is 1 and v0 = 0, so Eq. (8) could be defined as follows.

(4) Update the fitness values of new population and the position of food. If the end condition is satisfied, algorithm is stopped; otherwise go to step (3).
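Steps (1) to (4) can be sketched as follows. The update equations are not reproduced in this text, so the sketch assumes the forms given in the original SSA description by Mirjalili et al. (2017): the leader moves to F_j ± c1((ub_j − lb_j)c2 + lb_j), the convergence factor is c1 = 2·exp(−(4t/T)²), and each follower moves to the midpoint between itself and its predecessor. The 0.5 threshold on c3, the clipping to the bounds, the minimization convention, and all function and parameter names are assumptions of this sketch.

```python
import math
import random

def salp_swarm(fitness, lower, upper, pop_size=30, max_iter=200, seed=0):
    """Minimize `fitness` over the box [lower, upper] with a basic salp swarm."""
    rng = random.Random(seed)
    dim = len(lower)
    # Step 1: random initial positions inside the bounds.
    pop = [[lower[j] + rng.random() * (upper[j] - lower[j]) for j in range(dim)]
           for _ in range(pop_size)]
    # Step 2: the best salp found so far plays the role of the food source.
    food = min(pop, key=fitness)[:]
    food_fit = fitness(food)
    history = []
    for t in range(1, max_iter + 1):
        c1 = 2.0 * math.exp(-((4.0 * t / max_iter) ** 2))  # convergence factor
        for i in range(pop_size):
            for j in range(dim):
                if i == 0:
                    # Step 3a: the leader moves around the food position.
                    c2, c3 = rng.random(), rng.random()
                    step = c1 * ((upper[j] - lower[j]) * c2 + lower[j])
                    pop[0][j] = food[j] + step if c3 >= 0.5 else food[j] - step
                else:
                    # Step 3b: a follower moves to the midpoint with its predecessor.
                    pop[i][j] = 0.5 * (pop[i][j] + pop[i - 1][j])
                # Keep the salp inside the search space.
                pop[i][j] = min(max(pop[i][j], lower[j]), upper[j])
        # Step 4: re-evaluate the population and update the food position.
        for salp in pop:
            value = fitness(salp)
            if value < food_fit:
                food, food_fit = salp[:], value
        history.append(food_fit)
    return food, food_fit, history

def sphere(x):
    """Toy objective used only to exercise the optimizer."""
    return sum(v * v for v in x)

best, best_value, history = salp_swarm(sphere, [-5.0, -5.0], [5.0, 5.0])
```

In the proposed method the decision vector would hold the weights and activation parameters of one FNT structure, and the fitness would be its classification error; the sphere function here is only a stand-in.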

Screen Disease-Related Compounds by Our Proposed Method

Virtual screening is needed in the research of network pharmacology to select the disease-related compounds. In this paper, a novel virtual screening method based on FNT, hybrid evolutionary method and negative sample selection algorithm is proposed, which is depicted in Figure 3.

FIGURE 3

(1) Disease-related compound dataset collection. The up-to-date literature on treating a disease is searched according to the name of the disease. By mining this literature, the active compounds for the treatment of the disease are collected as the positive compound samples. In order to generate the unrelated compounds, the positive compounds are input into the DUD-E database to generate the corresponding decoys, which are set as negative samples (Mysinger et al., 2012). Far more decoys are generated than there are positive samples. In order to balance the proportion of positive and negative samples, negative sample selection based on the Tanimoto index (Algorithm 1) is presented to choose a certain number of decoys that are quite different from the positive sample set. The Tanimoto index measures the similarity between 2 sets (Klekota et al., 2005) and is well suited to binary (0/1) fingerprint representations of compounds. The greater the Tanimoto index is, the higher the similarity of the 2 sets is. The Tanimoto index of 2 sets A and B is calculated as follows.
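The formula itself is not reproduced in this text. For reference, the standard definition of the Tanimoto index of two sets, which is presumably the one intended here, is the size of the intersection divided by the size of the union:

```latex
T(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
```

For binary fingerprints, A and B are the sets of bit positions set to 1, so T ranges from 0 (no shared bits) to 1 (identical fingerprints).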

Algorithm 1

Input: disease-related compound set [c1, c2,…,cm] (m is the number of compounds),
   the generated decoy set [g1, g2,…,gn] (n is the number of decoys)
Output: the selected negative compound set [n1, n2,…,n2m]
for i = 1; i ≤ n; i++ do
 sumi = 0;
 for j = 1; j ≤ m; j++ do
  Tij = Tanimotoindex(gi, cj);
  sumi = sumi + Tij;
 end
end
Sort the decoy set in ascending order of [sum1, sum2,…,sumn];
Select the 2m decoys with the smallest Tanimoto index sums as the negative compound set;

Negative sample selection algorithm.
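Algorithm 1 translates almost directly into code. The following minimal sketch assumes that each compound is represented by the set of on-bit positions of a binary fingerprint; the function names and the toy fingerprints are hypothetical and serve only as illustration.

```python
def tanimoto(a, b):
    """Tanimoto index of two fingerprints given as sets of on-bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_negatives(positives, decoys, k=None):
    """Return the k decoys least similar to the positive set (k defaults to 2m)."""
    if k is None:
        k = 2 * len(positives)
    # sum_i: total similarity of decoy g_i to all positive compounds.
    sums = [sum(tanimoto(g, c) for c in positives) for g in decoys]
    # Ascending sort, so the most dissimilar decoys come first.
    order = sorted(range(len(decoys)), key=lambda i: sums[i])
    return [decoys[i] for i in order[:k]]

# Toy example with m = 2 positives and n = 6 decoys (hypothetical fingerprints).
positives = [{1, 2, 3}, {2, 3, 4}]
decoys = [{1, 2, 3}, {7, 8}, {3, 9}, {2, 3}, {10}, {4, 5, 6}]
negatives = select_negatives(positives, decoys)
```

With 2m = 4, the two decoys that overlap most with the positives, {1, 2, 3} and {2, 3}, are left out.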

(2) Screening process. The related and unrelated molecules collected are all chemical structures. To feed the collected compounds into the flexible neural tree model, 4 kinds of molecular descriptors (ECFP6, MACCS, Macrocycle, and RDKit) are utilized to numerically characterize the chemical structure of each compound (Todeschini and Consonni, 2009). ECFP6 contains 2,048 features, which encode the circular neighborhoods around each atom up to radius 3; each bit denotes whether a specific substructure exists. MACCS contains 166 molecular characteristic sites, such as ISOTOPE, ATOMIC NO, 4M RING, and GROUP VIII. Macrocycle contains 1,613 features, which capture information about ring size, sugars, and ester functional groups. RDKit contains 208 features, such as the number of valence electrons, the number of radical electrons, charge information, and the number of aliphatic carbocycles. Cross-validation is utilized to divide the data into training and testing datasets to test the performance of our proposed method. The feature vectors of the compounds in the training dataset are used as the input to train the flexible neural tree model. A hybrid evolutionary method based on grammar-guided genetic programming and the salp swarm algorithm is proposed to search for the optimal structure and parameters of the FNT model. For the unknown compounds in the testing dataset, the feature vectors are used as the input of the optimal FNT model to obtain the output results. If the output is higher than 0.5, the compound is identified as disease-related; otherwise, it is identified as unrelated.
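The data split and the final decision rule can be sketched as below. The source states only that cross-validation is used (10 folds in the experiments) and that outputs above 0.5 are labeled disease-related; shuffling before splitting, the absence of stratification, and the function names are assumptions of this sketch.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def classify(score, threshold=0.5):
    """Map an FNT output to a label: 1 = disease-related, 0 = unrelated."""
    return 1 if score > threshold else 0

# Example: the hypertension dataset has 67 positive and 134 negative samples.
splits = list(k_fold_indices(67 + 134, k=10))
```

Each compound appears in exactly one test fold, so every sample receives one out-of-fold prediction.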

Experiment Results and Discussion

In order to test the effectiveness of our method, the important compounds involved in the treatment of hypertension, diabetes, and COVID-19 were collected. The related compounds of these 3 diseases are regarded as positive samples, and the numbers of samples are 67, 124, and 88, respectively. The negative sample selection method is utilized to select the inactive compounds for hypertension, diabetes, and COVID-19, and the numbers of negative samples are 134, 248, and 176, respectively. The 4 kinds of molecular descriptors (ECFP6, MACCS, Macrocycle, and RDKit) are utilized to numerically characterize the related and unrelated compounds of each disease.

The 10-fold cross-validation method is utilized to test the performance of our method. SVM (Hearst et al., 1998), RF (Breiman, 2001), AdaBoost (Collins et al., 2002), decision tree (DT) (Safavian and Landgrebe, 1991), GBDT (Zhang B. et al., 2019), KNN, logistic regression (LR) (Collins et al., 2002), gcForest (Zhou and Feng, 2017), forgeNet (Kong and Yu, 2020), and Naive Bayes (NB) (Kim et al., 2006) are also utilized to identify the disease-related compounds of the three diseases. In our method, the operator set is F = { + 2, + 3, + 4, + 5}, the population size is 30, and the maximum depth of the tree is 5. In SVM, the linear kernel function is selected. In RF, the number of trees is 100. In GBDT, the number of regression trees is 200. In DT, the CART algorithm is utilized. The parameters of the other algorithms are set by default.

The AUC performances of the 11 methods with the datasets about hypertension, diabetes, and COVID-19 are shown in Figures 4–6, respectively. From Figure 4, it can be seen that with the ECFP6, Macrocycle, and RDKit descriptors, our method has the highest AUC among the 11 methods. With the MACCS descriptor, the AUC values obtained by our method and RF are very close to 1.0, at 0.999889 and 0.997772, respectively. In Figure 5, in terms of AUC, our method clearly performs best with the ECFP6, MACCS, and RDKit descriptors. With the Macrocycle descriptor, our method, gcForest, and SVM obtain better AUC values than the other 8 methods, at 1, 0.99803, and 0.998435, respectively. Among these 3 methods, our method performs best, which shows that it is a good classifier for the disease-compound identification problem. In Figure 6, with the ECFP6 molecular descriptor, our method and SVM obtain higher AUC values than the other 9 methods, at 0.996901 and 0.99703, respectively. With the other molecular descriptors, our method obtains the best performances, which are equal to or very close to 1.0.

FIGURE 4

FIGURE 5

FIGURE 6

TPR, FPR, Precision, Specificity, and F1 are also utilized to test the performances of the 11 methods for compound identification for the 3 diseases. TPR denotes the ratio of correctly identified disease-related compounds to all true disease-related ones. FPR denotes the ratio of compounds erroneously identified as disease-related to all true disease-unrelated ones. Precision denotes the ratio of correctly identified disease-related compounds to all compounds identified as disease-related. Specificity is the ratio of correctly identified disease-unrelated compounds to all true disease-unrelated ones. F1 evaluates a classifier comprehensively by combining Precision and Recall (TPR). The TPR, FPR, Precision, Specificity, and F1 performances of the 11 methods with the datasets about hypertension, diabetes, and COVID-19 are listed in Tables 1–3, respectively. In Table 1, with the ECFP6 descriptor, our method has the highest TPR among the 11 classifiers, which shows that our method identifies more true disease-related compounds. In terms of FPR, Precision, and Specificity, forgeNet, RF, and KNN perform best, which reveals that they identify all the true disease-unrelated compounds correctly. However, our method obtains the highest F1. Overall, our method obtains the more accurate identification results. With MACCS, Macrocycle, and RDKit, our method obtains the best performances of TPR, FPR, Precision, Specificity, and F1.
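The five criteria follow directly from the confusion-matrix counts, as the sketch below shows. The source does not report confusion matrices; the counts in the example (TP = 66, FN = 1, FP = 3, TN = 132) are an assumption chosen because they reproduce the first row of Table 1 (our method with ECFP6). They imply 135 negatives rather than the 134 stated for hypertension, so the true counts may differ.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Compute TPR, FPR, Precision, Specificity, and F1 from raw counts."""
    tpr = _ratio(tp, tp + fn)            # recall on disease-related compounds
    fpr = _ratio(fp, fp + tn)            # unrelated compounds flagged as related
    precision = _ratio(tp, tp + fp)
    specificity = _ratio(tn, tn + fp)    # equals 1 - FPR
    f1 = _ratio(2 * precision * tpr, precision + tpr)
    return {"TPR": tpr, "FPR": fpr, "Precision": precision,
            "Specificity": specificity, "F1": f1}

# Hypothetical counts (see lead-in); not taken from the source.
metrics = classification_metrics(tp=66, fp=3, tn=132, fn=1)
rounded = {name: round(value, 6) for name, value in metrics.items()}
```

Since Specificity is 1 − FPR, the two columns of Tables 1–3 carry the same information; F1 is the only criterion that penalizes a low TPR and a low Precision at once.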

TABLE 1

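Molecular descriptor | Method | TPR | FPR | Precision | Specificity | F1
ECFP6 | Our method | 0.985075 | 0.022222 | 0.956522 | 0.977778 | 0.970588
ECFP6 | gcForest | 0.955224 | 0.155556 | 0.752941 | 0.844444 | 0.842105
ECFP6 | forgeNet | 0.895522 | 0 | 1 | 1 | 0.944882
ECFP6 | SVM | 0.880597 | 0.007407 | 0.983333 | 0.992593 | 0.929134
ECFP6 | RF | 0.880597 | 0 | 1 | 1 | 0.936508
ECFP6 | AdaBoost | 0.835821 | 0.037037 | 0.918033 | 0.962963 | 0.875
ECFP6 | DT | 0.835821 | 0.044444 | 0.903226 | 0.955556 | 0.868217
ECFP6 | GBDT | 0.850746 | 0.051852 | 0.890625 | 0.948148 | 0.870229
ECFP6 | KNN | 0.686567 | 0 | 1 | 1 | 0.814159
ECFP6 | LR | 0.970149 | 0.311111 | 0.607477 | 0.688889 | 0.747126
ECFP6 | NB | 0.731343 | 0.096296 | 0.790323 | 0.903704 | 0.75969
MACCS | Our method | 1 | 0.007407 | 0.985294 | 0.992593 | 0.992593
MACCS | gcForest | 0.970149 | 0.051852 | 0.902778 | 0.948148 | 0.935252
MACCS | forgeNet | 0.925373 | 0.018587 | 0.96124 | 0.981413 | 0.942966
MACCS | SVM | 0.940299 | 0.02963 | 0.940299 | 0.97037 | 0.940299
MACCS | RF | 0.940299 | 0.014815 | 0.969231 | 0.985185 | 0.954545
MACCS | AdaBoost | 0.895522 | 0.044444 | 0.909091 | 0.955556 | 0.902256
MACCS | DT | 0.895522 | 0.051852 | 0.895522 | 0.948148 | 0.895522
MACCS | GBDT | 0.925373 | 0.014815 | 0.96875 | 0.985185 | 0.946565
MACCS | KNN | 0.925373 | 0.02963 | 0.939394 | 0.97037 | 0.932331
MACCS | LR | 0.970149 | 0.066667 | 0.878378 | 0.933333 | 0.921986
MACCS | NB | 0.940299 | 0.192593 | 0.707865 | 0.807407 | 0.807692
Macrocycle | Our method | 0.984375 | 0 | 1 | 1 | 0.992126
Macrocycle | gcForest | 0.9375 | 0.09009 | 0.857143 | 0.90991 | 0.895522
Macrocycle | forgeNet | 0.921875 | 0.018018 | 0.967213 | 0.981982 | 0.944
Macrocycle | SVM | 0.890625 | 0.027027 | 0.95 | 0.972973 | 0.919355
Macrocycle | RF | 0.90625 | 0.027027 | 0.95082 | 0.972973 | 0.928
Macrocycle | AdaBoost | 0.953125 | 0.027027 | 0.953125 | 0.972973 | 0.953125
Macrocycle | DT | 0.921875 | 0.072072 | 0.880597 | 0.927928 | 0.900763
Macrocycle | GBDT | 0.90625 | 0.036036 | 0.935484 | 0.963964 | 0.920635
Macrocycle | KNN | 0.921875 | 0.072072 | 0.880597 | 0.927928 | 0.900763
Macrocycle | LR | 0.9375 | 0.153153 | 0.779221 | 0.846847 | 0.851064
Macrocycle | NB | 0.9375 | 0.09009 | 0.857143 | 0.90991 | 0.895522
RDKit | Our method | 0.985075 | 0 | 1 | 1 | 0.992481
RDKit | gcForest | 0.955224 | 0.02963 | 0.941176 | 0.97037 | 0.948148
RDKit | forgeNet | 0.895522 | 0.022222 | 0.952381 | 0.977778 | 0.923077
RDKit | SVM | 0.940299 | 0.014815 | 0.969231 | 0.985185 | 0.954545
RDKit | RF | 0.865672 | 0.014815 | 0.966667 | 0.985185 | 0.913386
RDKit | AdaBoost | 0.925373 | 0.014815 | 0.96875 | 0.985185 | 0.946565
RDKit | DT | 0.873134 | 0.055762 | 0.886364 | 0.944238 | 0.879699
RDKit | GBDT | 0.895522 | 0.02963 | 0.9375 | 0.97037 | 0.916031
RDKit | KNN | 0.865672 | 0.044444 | 0.90625 | 0.955556 | 0.885496
RDKit | LR | 0.955224 | 0.02963 | 0.941176 | 0.97037 | 0.948148
RDKit | NB | 0.895522 | 0.214815 | 0.674157 | 0.785185 | 0.769231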

Prediction performances of 11 methods with hypertension dataset.

Bold values denote the best performances.

TABLE 2

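Molecular descriptor | Method | TPR | FPR | Precision | Specificity | F1
ECFP6 | Our method | 0.991935 | 0.012048 | 0.97619 | 0.987952 | 0.984
ECFP6 | gcForest | 0.967742 | 0.124498 | 0.794702 | 0.875502 | 0.872727
ECFP6 | forgeNet | 0.916031 | 0.007605 | 0.983607 | 0.992395 | 0.948617
ECFP6 | SVM | 0.935484 | 0.02008 | 0.958678 | 0.97992 | 0.946939
ECFP6 | RF | 0.862903 | 0.008032 | 0.981651 | 0.991968 | 0.918455
ECFP6 | AdaBoost | 0.879032 | 0.036145 | 0.923729 | 0.963855 | 0.900826
ECFP6 | DT | 0.806452 | 0.100402 | 0.8 | 0.899598 | 0.803213
ECFP6 | GBDT | 0.854839 | 0.02008 | 0.954955 | 0.97992 | 0.902128
ECFP6 | KNN | 1 | 0.939759 | 0.346369 | 0.060241 | 0.514523
ECFP6 | LR | 0.967742 | 0.15261 | 0.759494 | 0.84739 | 0.851064
ECFP6 | NB | 0.604839 | 0.052209 | 0.852273 | 0.947791 | 0.707547
MACCS | Our method | 0.975806 | 0 | 1 | 1 | 0.987755
MACCS | gcForest | 0.975806 | 0.02008 | 0.960317 | 0.97992 | 0.968
MACCS | forgeNet | 0.951613 | 0.024096 | 0.951613 | 0.975904 | 0.951613
MACCS | SVM | 0.935484 | 0.024096 | 0.95082 | 0.975904 | 0.943089
MACCS | RF | 0.943548 | 0.012048 | 0.975 | 0.987952 | 0.959016
MACCS | AdaBoost | 0.943548 | 0.032129 | 0.936 | 0.967871 | 0.939759
MACCS | DT | 0.951613 | 0.040161 | 0.921875 | 0.959839 | 0.936508
MACCS | GBDT | 0.975806 | 0.02008 | 0.960317 | 0.97992 | 0.968
MACCS | KNN | 0.951613 | 0.044177 | 0.914729 | 0.955823 | 0.932806
MACCS | LR | 0.975806 | 0.02008 | 0.960317 | 0.97992 | 0.968
MACCS | NB | 0.967742 | 0.417671 | 0.535714 | 0.582329 | 0.689655
Macrocycle | Our method | 0.991453 | 0 | 1 | 1 | 0.995708
Macrocycle | gcForest | 0.982906 | 0.028037 | 0.950413 | 0.971963 | 0.966387
Macrocycle | forgeNet | 0.957265 | 0.009346 | 0.982456 | 0.990654 | 0.969697
Macrocycle | SVM | 0.974359 | 0.018692 | 0.966102 | 0.981308 | 0.970213
Macrocycle | RF | 0.957265 | 0.014019 | 0.973913 | 0.985981 | 0.965517
Macrocycle | AdaBoost | 0.957265 | 0.018692 | 0.965517 | 0.981308 | 0.961373
Macrocycle | DT | 0.91453 | 0.037383 | 0.930435 | 0.962617 | 0.922414
Macrocycle | GBDT | 0.965812 | 0.046729 | 0.918699 | 0.953271 | 0.941667
Macrocycle | KNN | 0.923077 | 0.018692 | 0.964286 | 0.981308 | 0.943231
Macrocycle | LR | 0.982906 | 0.042056 | 0.927419 | 0.957944 | 0.954357
Macrocycle | NB | 0.974359 | 0.042056 | 0.926829 | 0.957944 | 0.95
RDKit | Our method | 0.959677 | 0 | 1 | 1 | 0.979424
RDKit | gcForest | 0.959677 | 0.02008 | 0.959677 | 0.97992 | 0.959677
RDKit | forgeNet | 0.967742 | 0.012048 | 0.97561 | 0.987952 | 0.97166
RDKit | SVM | 0.951613 | 0.008032 | 0.983333 | 0.991968 | 0.967213
RDKit | RF | 0.935484 | 0.012048 | 0.97479 | 0.987952 | 0.954733
RDKit | AdaBoost | 0.943548 | 0.016064 | 0.966942 | 0.983936 | 0.955102
RDKit | DT | 0.943548 | 0.028112 | 0.943548 | 0.971888 | 0.943548
RDKit | GBDT | 0.943548 | 0.008032 | 0.983193 | 0.991968 | 0.962963
RDKit | KNN | 0.903226 | 0.012048 | 0.973913 | 0.987952 | 0.937238
RDKit | LR | 0.959677 | 0.024096 | 0.952 | 0.975904 | 0.955823
RDKit | NB | 0.951613 | 0.204819 | 0.698225 | 0.795181 | 0.805461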

Prediction performances of 11 methods with diabetes dataset.

TABLE 3

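Molecular descriptor | Method | TPR | FPR | Precision | Specificity | F1
ECFP6 | Our method | 0.965909 | 0 | 1 | 1 | 0.982659
ECFP6 | gcForest | 0.965909 | 0.101695 | 0.825243 | 0.898305 | 0.890052
ECFP6 | forgeNet | 0.931818 | 0.00565 | 0.987952 | 0.99435 | 0.959064
ECFP6 | SVM | 0.920455 | 0.011299 | 0.975904 | 0.988701 | 0.947368
ECFP6 | RF | 0.931818 | 0 | 1 | 1 | 0.964706
ECFP6 | AdaBoost | 0.896226 | 0.025882 | 0.945274 | 0.974118 | 0.920097
ECFP6 | DT | 0.909091 | 0.045198 | 0.909091 | 0.954802 | 0.909091
ECFP6 | GBDT | 0.886364 | 0.028249 | 0.939759 | 0.971751 | 0.912281
ECFP6 | KNN | 0.897727 | 0.435028 | 0.50641 | 0.564972 | 0.647541
ECFP6 | LR | 0.988636 | 0.214689 | 0.696 | 0.785311 | 0.816901
ECFP6 | NB | 0.636364 | 0.062147 | 0.835821 | 0.937853 | 0.722581
MACCS | Our method | 1 | 0 | 1 | 1 | 1
MACCS | gcForest | 0.954545 | 0.011299 | 0.976744 | 0.988701 | 0.965517
MACCS | forgeNet | 0.943182 | 0.008499 | 0.982249 | 0.991501 | 0.962319
MACCS | SVM | 0.931818 | 0.011299 | 0.97619 | 0.988701 | 0.953488
MACCS | RF | 0.954545 | 0 | 1 | 1 | 0.976744
MACCS | AdaBoost | 0.886364 | 0.016949 | 0.962963 | 0.983051 | 0.923077
MACCS | DT | 0.931818 | 0.033898 | 0.931818 | 0.966102 | 0.931818
MACCS | GBDT | 0.931818 | 0.00565 | 0.987952 | 0.99435 | 0.959064
MACCS | KNN | 0.954545 | 0.028249 | 0.94382 | 0.971751 | 0.949153
MACCS | LR | 0.954545 | 0.016949 | 0.965517 | 0.983051 | 0.96
MACCS | NB | 0.863636 | 0.090395 | 0.826087 | 0.909605 | 0.844444
Macrocycle | Our method | 0.965517 | 0 | 1 | 1 | 0.982456
Macrocycle | gcForest | 0.954023 | 0.006536 | 0.988095 | 0.993464 | 0.97076
Macrocycle | forgeNet | 0.954023 | 0 | 1 | 1 | 0.976471
Macrocycle | SVM | 0.942529 | 0.006536 | 0.987952 | 0.993464 | 0.964706
Macrocycle | RF | 0.942529 | 0.006536 | 0.987952 | 0.993464 | 0.964706
Macrocycle | AdaBoost | 0.954023 | 0 | 1 | 1 | 0.976471
Macrocycle | DT | 0.908046 | 0.039216 | 0.929412 | 0.960784 | 0.918605
Macrocycle | GBDT | 0.896552 | 0.03268 | 0.939759 | 0.96732 | 0.917647
Macrocycle | KNN | 0.931034 | 0.019608 | 0.964286 | 0.980392 | 0.947368
Macrocycle | LR | 0.954023 | 0.026144 | 0.954023 | 0.973856 | 0.954023
Macrocycle | NB | 0.885057 | 0.039216 | 0.927711 | 0.960784 | 0.905882
RDKit | Our method | 0.965909 | 0 | 1 | 1 | 0.982659
RDKit | gcForest | 0.943182 | 0.022599 | 0.954023 | 0.977401 | 0.948571
RDKit | forgeNet | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538
RDKit | SVM | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538
RDKit | RF | 0.931818 | 0.00565 | 0.987952 | 0.99435 | 0.959064
RDKit | AdaBoost | 0.931818 | 0.016949 | 0.964706 | 0.983051 | 0.947977
RDKit | DT | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538
RDKit | GBDT | 0.943182 | 0.011299 | 0.976471 | 0.988701 | 0.959538
RDKit | KNN | 0.954545 | 0.016949 | 0.965517 | 0.983051 | 0.96
RDKit | LR | 0.943182 | 0.028249 | 0.943182 | 0.971751 | 0.943182
RDKit | NB | 0.897727 | 0.112994 | 0.79798 | 0.887006 | 0.84492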

Prediction performances of 11 methods with COVID-19 dataset.

Bold values denote the best performances.

In Table 2, with the ECFP6 descriptor, KNN has the highest TPR among the 11 classifiers, which is 1.0. The result shows that KNN identifies all true disease-related compounds. In terms of FPR, Precision, and Specificity, forgeNet performs better than the other 10 methods. However, our method again obtains the highest F1. Overall, our method obtains the more accurate identification results. With MACCS and Macrocycle, our method obtains the best performances of TPR, FPR, Precision, Specificity, and F1. With RDKit, our method performs best in terms of FPR, Precision, Specificity, and F1, while forgeNet obtains the best TPR. In Table 3, our method performs best with all 4 kinds of molecular descriptors in terms of the 5 criteria, except that LR obtains a higher TPR (0.988636) with ECFP6. All results show that our method predicts disease-related compounds more accurately than gcForest, forgeNet, SVM, RF, AdaBoost, DT, GBDT, KNN, LR, and NB.

According to the performances of the 11 methods with the datasets from the 3 diseases and the 4 molecular descriptors, the descriptors are ranked for each method. For each molecular descriptor, the averaged ranking results of each method are listed in Table 4. From Table 4, we can see that our method, gcForest, forgeNet, RF, GBDT, and LR perform best with the MACCS feature set, while SVM and DT perform best with the RDKit feature set. AdaBoost, KNN, and NB perform better with the Macrocycle feature set than with the other 3 feature sets. All methods perform poorly with the ECFP6 molecular descriptor. The results also show that different molecular descriptors suit different classifiers, and the ranking results can guide the choice of an appropriate molecular descriptor for each classifier in the future. On the whole, the MACCS descriptor suits the largest number of classifiers. In future research, MACCS can be preferred for a new classifier.

TABLE 4

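Method | ECFP6 | MACCS | Macrocycle | RDKit
Our method | 3.33 | 1.67 | 2 | 2.67
gcForest | 3.67 | 1.83 | 2.33 | 2.17
forgeNet | 2.5 | 2.17 | 2.33 | 3
SVM | 2.83 | 2.5 | 2.5 | 2.17
RF | 2.83 | 1.33 | 2.5 | 3.17
AdaBoost | 3.5 | 2.5 | 1.83 | 2.17
DT | 4 | 1.83 | 2.5 | 1.67
GBDT | 3.5 | 1.33 | 2.83 | 2.33
KNN | 4 | 1.83 | 1.67 | 2.5
LR | 3.83 | 1.17 | 2.83 | 2.17
NB | 3.67 | 2.83 | 1 | 2.33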

Averaged ranking scores of 11 methods with 3 datasets.
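How an averaged ranking score of this kind can be computed is sketched below. The source does not state which criterion or tie rule underlies Table 4, so this sketch makes its own assumptions: the 4 descriptors are ranked by F1 within each dataset (rank 1 is best), tied scores share the mean of their ranks, and the ranks are averaged over the 3 datasets. The example uses the F1 values of our method from Tables 1–3; under these assumptions the result does not reproduce the corresponding row of Table 4, so the authors' criterion evidently differs.

```python
def rank_descending(scores):
    """Rank values so the largest gets rank 1; ties share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks

def average_ranks(score_table):
    """score_table[d][k] is the score on dataset d with descriptor k."""
    per_dataset = [rank_descending(row) for row in score_table]
    n = len(per_dataset)
    return [round(sum(col) / n, 2) for col in zip(*per_dataset)]

# Columns: ECFP6, MACCS, Macrocycle, RDKit (F1 of our method, Tables 1-3).
f1_our_method = [
    [0.970588, 0.992593, 0.992126, 0.992481],  # hypertension
    [0.984, 0.987755, 0.995708, 0.979424],     # diabetes
    [0.982659, 1.0, 0.982456, 0.982659],       # COVID-19
]
avg = average_ranks(f1_our_method)
```

A lower averaged rank means the descriptor suits the classifier better, which is how Table 4 is read in the text.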

We also investigate the performance of our method with different ratios of positive to negative samples. Eight ratios (1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:8, and 1:10) are selected, and the COVID-19 dataset is utilized. The identification results are depicted in Figure 7. From Figure 7, it can be seen that our method has better ROC and AUC performances when the ratio is 1:1, 1:2, 1:3, or 1:4. Excessive imbalance of the data may degrade the classification performance of the algorithm.

FIGURE 7

Conclusion

In order to sort the candidate compounds in a traditional Chinese medicine prescription and accurately narrow the scope of analysis in network pharmacology research, this paper proposes a new virtual screening method based on the flexible neural tree (FNT) model, a hybrid evolutionary method, and a negative sample selection algorithm to screen the disease-related active compounds. Compounds related to 3 diseases (hypertension, diabetes, and Corona Virus Disease 2019) are collected from the up-to-date literature. The unrelated compounds are selected by the negative sample selection algorithm from the DUD-E website. Four kinds of molecular descriptors (ECFP6, MACCS, Macrocycle, and RDKit) are utilized to characterize the features of the related and unrelated compounds of each disease. The experimental results show that our proposed method performs better than classical classifiers (SVM, RF, AdaBoost, DT, GBDT, KNN, LR, and NB), an up-to-date classifier (gcForest), and a deep learning method (forgeNet) in terms of AUC, ROC, TPR, FPR, Precision, Specificity, and F1.

We also investigate the performances of the 11 methods with the 4 kinds of molecular descriptors. The results show that our method, gcForest, forgeNet, RF, GBDT, and LR perform best with the MACCS feature set, SVM and DT perform best with the RDKit feature set, and AdaBoost, KNN, and NB perform best with the Macrocycle feature set. With the ECFP6 molecular descriptor, all methods perform poorly.

In this paper, our proposed method has been successfully applied to hypertension, diabetes, and Corona Virus Disease 2019. In the future, our method will be utilized to identify compounds related to other chronic disorders, such as cancers, coronary heart disease, and rheumatoid disease.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Statements

Data availability statement

The original contributions presented in this study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

Author contributions

WB conceived the method and wrote the main manuscript text. BY designed the method and conducted the experiments. All authors reviewed the manuscript.

Funding

This work was supported by the talent project of “Qingtan scholar” of Zaozhuang University, the Natural Science Foundation of China (Nos. 61702445 and 61902337), Jiangsu Provincial Natural Science Foundation (No. SBK2019040953), Natural Science Fund for Colleges and Universities in Jiangsu Province (No. 19KJB520016), Young Talents of Science and Technology in Jiangsu, Youth Innovation Team of Scientific Research Foundation of the Higher Education Institutions of Shandong Province, China (No. 2019KJM006), the Key Research Program of the Science Foundation of Shandong Province (No. ZR2020KE001), the Fundamental Research Funds for the Central Universities (No. 2020QN89), Xuzhou Science and Technology Plan Project (Nos. KC19142 and KC21047), the Ph.D. research startup foundation of Zaozhuang University (No. 2014BS13), and Zaozhuang University Foundation (No. 2015YY02).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Summary

Keywords

virtual screening, network pharmacology, flexible neural tree, grammar-guided genetic programming, salp swarm algorithm

Citation

Yang B, Bao W and Chen B (2022) Disease-Ligand Identification Based on Flexible Neural Tree. Front. Microbiol. 13:912145. doi: 10.3389/fmicb.2022.912145

Received

15 March 2022

Accepted

06 May 2022

Published

06 June 2022

Volume

13 - 2022

Edited by

Liang Wang, Xuzhou Medical University, China

Reviewed by

Chun-Chun Wang, China University of Mining and Technology, China; Chandrabose Selvaraj, Alagappa University, India

*Correspondence: Wenzheng Bao,

This article was submitted to Microbe and Virus Interactions with Plants, a section of the journal Frontiers in Microbiology
