- 1Department of Acupuncture, Affiliated Hospital of Shandong University of Traditional Chinese Medicine, Jinan, China
- 2School of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, China
- 3Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, China
- 4Department of Ultrasound, Jinan Central Hospital Affiliated to Shandong First Medical University, Jinan, China
The integration of multiple virtual screening strategies facilitates the balance of computational efficiency and prediction accuracy. In this study, we constructed an efficient and reliable “multi-stage virtual screening-in vitro biological validation” system to identify potential inhibitors targeting extracellular signal-regulated protein kinase 2 (ERK2). Firstly, we rapidly obtained 10 candidate ERK2 inhibitors with desirable pharmacokinetic characteristics from thousands of named natural products in ZINC database based on machine learning classification models and ADME/T prediction. The structure-based molecular docking approach was then used to obtain four further hits with lower binding free energy compared to the positive control molecule Magnolipin. Subsequently, the two compounds were purchased for in vitro biological validation considering commercial availability and economic cost, and the results showed that Dodoviscin A exhibited acceptable inhibitory activity on ERK2 (IC50 = 10.79 μm). Finally, the mechanism of action and binding stability of this natural product inhibitor were investigated by binding mode analysis and molecular dynamics simulation.
1 Introduction
Overexpression of kinases is a major contributor to the pathogenesis of many diseases, such as cancer, inflammation and neurodegenerative diseases. The search for inhibitors that target kinases has therefore become one of the hot topics in drug discovery (Gharwan and Groninger, 2016). The mitogen-activated protein kinase (MAPK) family consists of serine/threonine kinases that are widely expressed in many cellular tissues and are implicated in a variety of cellular processes including cell growth, differentiation and apoptosis (Dhillon et al., 2007). Four parallel MAPK signaling pathways have been identified, of which the extracellular signal-regulated protein kinase 1/2 (ERK1/2) pathway has been the most extensively studied. In this pathway, extracellular signals (e.g., epidermal growth factor) first bind to the receptor tyrosine kinase (RTK), which subsequently stimulates the conversion of Ras proteins from GDP-binding to GTP-binding, thereby activating Raf proteins; and then the activated Raf proteins are involved in catalyzing the phosphorylation of MEK1/2, which further catalyzes the phosphorylation of ERK1/2 (Chang and Karin, 2001).
Many inhibitors against upstream targets achieved good clinical efficacy initially, but they all developed resistance after a few months, the main reason for this being the reactivation of the downstream target ERK (Pratilas and Solit, 2010). Many studies have shown that inhibitors against ERK2 are more specific and have a relatively low probability of acquiring resistance, making ERK2 an ideal therapeutic target (Carlino et al., 2014). Like other protein kinases, the primary structure of ERK2 consists of an N-terminal and a C-terminal structural domain, the former containing five antiparallel β-sheet structures (β1 ∼ β5), an αC-helix structure and a glycine-rich loop, while the latter containing six conserved α-helix structures and four shorter β-sheet structures (β6 ∼ β9). Besides, there is an ATP-binding pocket in the hinge region connecting the two, which is the binding site for most current kinase inhibitors (Lefloch et al., 2008).
The high diversity of chemical structures and physicochemical properties of natural products makes them a valuable source for the discovery of novel active compounds with representative success stories including pilocarpine, morphine, and artemisinin (Zhang et al., 2020). Compared to synthetic compounds, natural products are widely available, inexpensive and have lower toxic effects. Unfortunately, there are still relatively few reports of natural products targeting kinases, which means that the search for natural products with ERK2 inhibitory activity is a valuable direction to explore. In the pharmaceutical industry, the discovery and experimental validation of active compounds is a time-consuming and laborious process. Traditional high-throughput screening often requires biochemical testing of over a million compounds individually to identify the active ingredients, which is costly in terms of time and money. In contrast, virtual screening techniques can significantly reduce the number of compounds used for pharmacological activity testing, resulting in significant cost savings (Gupta et al., 2021).
With the rapid development of medical research, increasing experimental data on drugs is available to researchers. In this context, using the wealth of results from the pharmaceutical field to find target compounds is a more accurate and cost-effective way of drug discovery (Paul et al., 2021). With the booming era of big data, artificial intelligence technologies, represented by machine learning, have made significant progress in shortening the drug development cycle, which provides strong support for the virtual screening of lead compounds. In many cases, machine learning exhibits superior performance compared to traditional computational methods and has unique advantages in the screening of compounds (Tsou et al., 2020). Based on the research background, we have designed a “sequential” virtual screening process that combines machine learning with multiple cheminformatics tools to identify natural product inhibitors targeting ERK2, and we have also performed in vitro biological evaluation of the screened compounds to validate the reliability of the process. The technical route involved in this study was shown in Figure 1.
2 Materials and methods
2.1 Data pre-processing, molecular characterization, and feature engineering
In the present study, experimental IC50 values for ERK2 inhibitors were collected from the ChEMBL database. Due to some variation between data from different sources, we handled duplicate compounds according to the following criteria: 1) If a compound had multiple identical IC50 values, only one of them was retained. 2) If a compound had multiple different IC50 values, the average of them was used as the final IC50 value. In addition, if a compound had no clear IC50 value, it was excluded. Next, 422 compounds were labelled as “active” (indicated by number 0) and 442 compounds as “inactive” (indicated by number 1) based on an activity cut-off value of 1 μm. Finally, the entire data set was randomly divided into a training set and a test set in the ratio of 8:2. The above process was completed by writing code in Python 3.7.
The input characterization of compounds in the dataset takes the form of molecular descriptors and molecular fingerprints. Molecular descriptors are numerical representations of chemical and biological properties and are classified into different types such as topological, compositional and geometric descriptors (Wei et al., 2021). Molecular fingerprints encode structural information in the form of a sequence of binary bits, with the corresponding bit being 1 if a predetermined bond or functional group is present in the molecule and 0 otherwise (Racz et al., 2019). Molecular descriptors were calculated using the cheminformatics toolkit RDKit, containing a total of 208 numerical indicators representing the physicochemical and structural characteristics of the molecule. MACCS molecular fingerprints were calculated using the open source software PaDEL-Descriptor, containing a total of 166 binary numbers representing the presence or absence of specific structural fragments in the molecule.
In general, not all features facilitate the construction of machine learning models. Irrelevant and unnecessary features may reduce the generalization ability of the model and lead to overfitting. Therefore, feature engineering is an integral step. Considering the wide range of values for different molecular representations, the corresponding strategies were adopted: for molecular descriptors, all descriptors with a variance of 0 and those not related to the target value (i.e., whether the compound is active or not) were eliminated by variance filtering and mutual information methods. The final 132 molecular descriptors were retained for the construction of machine learning models. For molecular fingerprints, recursive feature elimination combined with learning curves was used to find the optimal number of features, and 80 features with high relevance to the target value were retained.
2.2 Generation and evaluation of machine learning models
Machine learning is an important branch of artificial intelligence and has been widely used in recent years for the discovery of lead compounds and the prediction of physicochemical and biological properties of molecules (Serafim et al., 2020). The decision tree algorithm presents the model building process through a tree-like structure, with the intermediate nodes representing the selected features and the leaf nodes representing the decision results. Random Forest (RF) is an ensemble learning algorithm based on decision trees, which overcomes the shortcomings of single decision trees that are prone to overfitting (Svetnik et al., 2003). The RF algorithm obtains some features randomly from the training set and constructs multiple decision tree models using random sampling to make predictions together. Support Vector Machine (SVM) is a machine learning algorithm based on statistical theory, the core idea of which is to find a hyperplane with a maximum bound to classify the training samples. The kernel function is a unique trick of the SVM algorithm, which solves the problem of indistinguishability by mapping the training samples into a high-dimensional space (Mavroforakis and Theodoridis, 2006). Just as the brain usually obtains information from experience to solve problems, the architecture of an Artificial Neural Network (ANN) consists of multiple interconnected neurons distributed in different layers. ANN first calculates the loss by forward propagation and then updates the weights by backward propagation. This process is iterated until the best weights that minimize the loss of the model are found (Jing et al., 2018). In addition, model fusion can also effectively reduce the prediction error of virtual screening. Voting is a voting-based model fusion strategy, while Stacking is a learning-based model fusion strategy. The above machine learning models were generated with the help of the open source Python toolkit Scikit-learn (Abraham et al., 2014).
We used a 10-fold cross-validation of the training set with the test set to evaluate the prediction performance and generalization ability of the model. Accuracy represents the proportion of samples with correct predictions to the total number of samples, and an excellent machine learning classifier should have an accuracy score close to 1. Precision represents the proportion of samples that are truly positive out of all samples that are predicted to be positive, while recall represents the proportion of samples that are correctly predicted out of all samples that are truly positive. F1-score is the harmonic average of precision and recall, which allows an objective evaluation of the predictive performance in the case of an unbalanced data set. The Area Under Curve (AUC) is one of the most important metrics in the evaluation of machine learning models, with an AUC value closer to 1 meaning that the model is more capable of classification.
2.3 Visualization of the chemical spatial distribution
Principal component analysis (PCA) is a common linear dimensionality reduction method whose main purpose is to compress data and remove redundant noise while minimizing the loss of original information. PCA is able to derive a few principal components from the original variables so that they retain as much information as possible and are uncorrelated with each other (Cortés et al., 2017). In this study, we visualized the chemical distribution of compounds in the three-dimensional space by using the decomposition. PCA module of Scikit-learn.
2.4 Pharmacokinetic analysis and toxicity assessment
Early prediction of ADME/T (absorption, distribution, metabolism, excretion and toxicity) of lead compounds can be effective in avoiding adverse drug reactions in clinical practice (Lee and Chen, 2019). Many filtering methods based on specific drug-likeness are often used as reference standards for screening desirable compounds and thus guide the decision-making process in drug development. We used the server ADMETlab 2.0 to make fast and accurate online predictions of ADME/T for the compounds obtained from the first round of screening, while filtering out those molecules that did not meet the requirements according to a range of parameters (Xiong et al., 2021).
2.5 Molecular docking calculations
Molecular Docking predicts the interaction pattern between a target protein and a candidate compound and calculates the corresponding binding free energy. The theory is based on the fact that the binding process of ligands and receptors depends on the matching of spatial shapes (Ferreira et al., 2015). The crystal structure of the receptor protein used in this study was downloaded from the PDB database (PDB ID: 1TVO) and then preprocessed by AutoDock Tools software as follows: water molecules were removed; hydrogen atoms were added and charges were recalculated (Ohori et al., 2005). In addition, the structure optimization of the target protein was performed with the help of the web server FoldX (Schymkowitz et al., 2005). The structures of the ligand compounds were downloaded from the PubChem database and then energy minimization was performed by Chem3D software. Finally, the structures of both receptor and ligands need to be converted to pdbqt format.
We performed a molecular docking study of the compounds obtained from the second round of screening and the target protein using Autodock Vina software (Trott and Olson, 2010). The parameters of the docking box were set to center_x = 6.429, center_y = -4.372, center_z = 16.444, size_x = size_y = size_z = 10.598, and the number of exhaustiveness was set to 30 to ensure the accuracy of the prediction results. After docking was completed, we visualized the docking conformation of the ligand at the active site of the target protein with the help of LigPlot+ and PyMOL software (Laskowski and Swindells, 2011).
2.6 Kinase activity assay
The inhibitory activity of the candidate compounds against ERK2 kinase was assayed by the Lance® Ultra kinase assay. The principle of this method is that when a ULightTM-labeled peptide substrate is phosphorylated by the kinase, its phosphorylation site is recognized by the fluorescent europium (Eu)-labeled monoclonal antibody and the energy transfer that occurs during the process is captured by the instrument (Townsend et al., 2012).
In this study, the initial concentration of the candidate compounds purchased from Macklin (Shanghai, China) was set at 100 µm and the gradient dilution in the 384-well plate was set to 2-fold. Staurosporine purchased from MedChemExpress (New Jersey, United States) was used as a positive control (Meggio et al., 1995). Firstly, 40 µl of the test compound, 10 µl of the kinase solution and 10 µl of the substrate solution containing ATP were incubated at room temperature for 60 min (reaction step). At the end of the reaction, EDTA was added to stop the reaction. Then 20 µl of the detection solution containing the antibody dilution was added to the well plate and incubated at room temperature for 60 min (detection step). Finally, the IC50 values of the candidate compounds were calculated using a dose-effect curve. Each assay was repeated in triplicates and the results were shown as mean ± standard deviation.
2.7 Molecular dynamics simulation
Molecular dynamics simulations based on Newtonian mechanics were able to assess the binding stability of candidate compound Dodoviscin A to target protein ERK2 on both temporal and spatial scales, and the known natural product inhibitor Magnolin was used as a control (Wang et al., 2018). In this study, we performed molecular dynamics simulations using Gromacs software (Pronk et al., 2013). The topology file of ligand compound was generated by the online web tool Swiss Param (https://www.swissparam.ch/), and the topology file of target protein was obtained from the pdb2gmx module (the force field was specified as CHARMM 36). The complex was then confined in a water-filled cubic box (distance between the complex and the edge of the box was at least 10 Å) and the charge of the whole system was neutralized by the addition of gegenions (101 Na+ and 97 Cl-). Next, the system was energy minimized at 10 kJ/mol using the steepest descent method of 50,000 steps (treatment of long-range electrostatic interactions was chosen as PME). Before the final simulation, the system was pre-equilibrated to stabilize at a suitable temperature and pressure, where the temperature for NVT equilibration was maintained at 300 K (Velocity-Rescale was set as the temperature control method) and the pressure for NPT equilibration was maintained at 1 bar (Parrinello-Rahman was set as the pressure control method), and the time duration was both set to 5,000 ps? After the above process, a molecular dynamics simulation was performed for 100 ns and repeated three times, with the trajectory recorded in atomic coordinates at an interval of 2 fs?
3 Results
3.1 Generation and performance evaluation of machine learning models
In this study, we adopted popular machine learning techniques for the activity prediction of compounds. The hyperparameters of the model were tuned by the Bayesian Optimization algorithm and a range of metrics of the tuned model were compared on the training and test sets. Cross-validation is one of the most reliable methods for assessing model performance (Fourches et al., 2015). In K-fold cross-validation, the dataset was divided into K subsets, one of which was selected as the validation set for each training epoch, and the remaining (K-1) subsets were used as the training set. This process was repeated K times, so that each subset was used exactly once. Finally, the average of the K training results was used as the final result. The results of the 10-fold cross-validation of the machine learning models generated using the best combination of hyperparameters on the training set were shown in Figure 2. The overall prediction accuracy of the three single models constructed based on MACCS fingerprints ranged from 87.6% to 88.1%, with F1-score values between 0.881–0.887 and AUC values between 0.941–0.949; while the overall prediction accuracy of the three single models constructed based on RDKit descriptors ranged from 89.2% to 91.4%, with F1-score values between 0.895–0.917 and AUC values between 0.948–0.960. The prediction performance of each model on the test set was shown in Table 1, where we found that almost all of these single models outperformed the training set on the test set, while employing RDKit descriptors as a form of molecular representation was a better choice for the ERK2 dataset. It should be noted that according to the spatial distribution of the training set/test set (Supplementary Figure S1), we could observe that the training set almost covered the test set (36.07%, 26.26%, and 13.95% explained variance contribution rate in the three dimensions, respectively). This is also the reason why these models had better performance on the test set.
FIGURE 2. 10-fold cross-validation results of five machine learning classification models on the training set. (A) Models constructed based on molecular fingerprint; (B) Models constructed based on molecular descriptor.
TABLE 1. Comparison of the performance of machine learning classification models based on two forms of molecular representation on the test set.
Considering the excellent performance of the above three algorithms on the training and test sets, the model fusion of them is beneficial to integrate the applicability of different algorithms and avoid the local bias of a single model affecting the generalization performance of the prediction. When evaluating binary classification models, AUC is relatively more informative as the most important parameter for measuring model performance. As can be seen in Figure 2; Table 1, the two integrated models (Voting and Stacking) constructed based on RDKit molecular descriptors exhibited AUC values of 0.966 and 0.965 on the training set and 0.980 and 0.982 on the test set, respectively, indicating that performing model fusion produced more desirable classification results and generalization ability. It is worth noting that although a single model could sometimes show better performance in terms of accuracy as well as F1-score, as reported by Gramatica et al., any algorithm has its own priorities and a single model may easily fall into local minima, leading to its poor generalization ability (Gramatica et al., 2007). Therefore, we consider that the two integrated models with the highest AUC values have better predictive reliability.
3.2 Virtual screening based on machine learning
After generating machine learning models and evaluating their predictive performance, we performed a virtual screening of 4,112 named natural products in the ZINC database using two integrated models based on molecular descriptors. The 208 RDKit molecular descriptors for these compounds were first calculated and then the same data pre-processing and feature selection steps were applied. To improve the reliability of the prediction results and further narrow the scope of the search, we set the threshold for both integrated models to 0.75, which ultimately showed that a total of 427 compounds were predicted as potential ERK2 inhibitors by both models. In general, predictions are reliable when the chemical space of the screened molecules lies in the application domain of the model (Yang et al., 2022). We therefore visualized the training set used for modeling and the corresponding compounds in the ZINC database by means of the PCA algorithm (Figure 3). The results showed that the chemical space distribution of most of the predicted compounds overlapped with that of the compounds in the training set, implying that the virtual screening results based on machine learning were dependable. In addition, the three dimensions could represent 41.73%, 23.87%, and 15.46% of the original data information, respectively, which means that the spatial distribution was of reference value. 403 candidate compounds were selected for subsequent studies following the elimination of certain compounds that were far from the application domain of the model.
3.3 Drug-like filtering of potential inhibitors of ERK2 and virtual screening based on molecular docking
Excellent pharmacokinetic properties contribute to reducing the risk of failure of drug candidates in the clinical trial stage. The ADME/T of 403 compounds obtained from the first round of screening was predicted and further filtered according to the set criteria. According to Lipinski’s rule, candidate compounds should satisfy at least three of the five items: Octanol-Water Partition Coefficient (LogP)≤5, Molecular Weight≤500, Number of Hydrogen Bond Acceptors≤10, Number of Hydrogen Bond Donors≤5 and Number of Rotatable Bonds≤10. Drug absorption in the gastrointestinal tract is a key factor in oral drug delivery, and the Human Intestinal Absorption (HIA) rate of a candidate compound should be greater than 30%. Blood-brain barrier (BBB) permeability indicates the effect of the drug on the central nervous system, and an ideal ERK2 inhibitor should have the ability to penetrate the BBB. Water Solubility (LogS) is one of the criteria for molecular absorption, with a reasonable range between −4 and 0.5. In addition, the assessment of the toxicity risk of candidate compounds is more essential in the early stages of drug discovery. All compounds should not show risks of Hepatotoxicity, Acute Oral Toxicity, Carcinogenicity and AMES Mutagenicity. The above criteria resulted in 10 drug-like compounds with desirable ADME/T properties, which would be used in molecular docking studies to compare the binding affinity to ERK2. Partial physicochemical properties of these compounds were listed in Supplementary Table S1.
To ensure the reliability of the docking protocol, the co-crystallized ligand was re-docked to the active site of the target protein and the root mean square deviation (RMSD) between the docked conformation and the original conformation was calculated. The results showed a superimposed RMSD of 0.63 Å (less than 2 Å), which confirms the dependability of the docking process used in this study (Supplementary Figure S2). We next performed docking calculations using the same parameters for the 10 drug-like compounds mentioned above, whose binding free energies were shown in Figure 4. It should be noted that we chose magnolin, a natural product with ERK2 inhibitory activity, as an additional positive control molecule (Lee et al., 2015). We could find that all candidate compounds had binding free energies below −7.5 kcal/mol. Based on the principle that the lower the binding free energy, the stronger the ligand-receptor interaction, we screened four natural products with binding free energies below Magnolin (−8.125 kcal/mol) as potential ERK2 inhibitors. Specifically, Amaronol A exhibited the highest binding affinity (−8.725 kcal/mol), followed by Bruceolide (−8.488 kcal/mol) and finally Massonianoside B (−8.313 kcal/mol) and Dodoviscin A (−8.306 kcal/mol). The confidence levels of these four compounds in the machine learning models and the binding free energy in molecular docking were listed in Table 2. We could find that the confidence levels of compound Dodoviscin A were highest in both integrated models, while those of compound Bruceolide were relatively low.
3.4 Biological validation of the screened compounds
We retrieved the relevant reports of these four screened compounds in the PubMed database, of which there was little literature on Bruceolide and Amaronol A, and their in vitro biological activity was also unclear; while Dodoviscin A could inhibit melanogenesis and Massonianoside B had potential antioxidant activity (Chen et al., 2012; Yan et al., 2013). Therefore, Dodoviscin A and Massonianoside B were selected for the kinase assay given the application values. To determine the ERK2 inhibitory effect of those screened compounds, we performed the kinase activity assay based on the Lance® Ultra method. The dose-response curves were shown in Supplementary Figure S3. At a concentration of 50 μm, Dodoviscin A exhibited 87.3% inhibition of ERK2 with an IC50 value of 10.79 μm. In contrast, Massonianoside B showed relatively weak inhibition of ERK2 with an IC50 value greater than 100 μm. These results confirmed the reliability of our virtual screening strategy.
3.5 Analysis of the binding pattern of screened compounds to target protein
Figure 5 showed the interaction of the target protein ERK2 with the positive control molecule Magnolin and the active candidate Dodoviscin A. We can see that both compounds docked to the hinge region of the kinase. Magnolin formed hydrogen bonds with amino acid residues LYS-151 and ARG-67 with bond lengths of 3.09 Å and 3.32 Å, respectively, while Dodoviscin A showed more hydrogen bonding interactions. Specifically, the key amino acid residues MET-108 and GLU-109 in the hinge region interacted with the hydroxymethyl group in the ligand structure via two hydrogen bonds of 2.51 Å and 3.02 Å, while three other amino acid residues, LYS-54, ASP-167, and ASN-154, were also observed to form hydrogen bonds with Dodoviscin A with bond lengths of 3.11 Å, 3.20 Å, and 3.13 Å, respectively. In addition, several amino acid residues including VAL-39, LEU-156, GLU-71, and SER-153 were involved in the formation of hydrophobic interactions. Previous studies have shown that hydrogen bonding interactions between type I inhibitors targeting ERK2 and MET-108 are critical for their occupancy of the ATP binding pocket (Roskoski, 2019). By comparing the binding patterns of these two compounds, we speculate that the formation of more hydrogen bonding interactions between Dodoviscin A and the target protein may account for its relatively high binding affinity.
FIGURE 5. (A,B) 2D/3D binding interaction of Magnolin with ERK2 protein; (C,D) 2D/3D binding interaction of Dodoviscin A with ERK2 protein (pink structure represents small molecule ligand, blue structure represents hydrogen bond interaction residue, gray structure represents hydrophobic interaction residue).
3.6 Analysis of the binding stability of the screened compounds to the target protein
To further explore the dynamic binding process of the screened compound to the target protein, molecular dynamics simulations were performed. The RMSD values of the protein and ligand can be calculated to detect the change in the position of both during the simulation time, which is crucial for the stability of the complex in the dynamic system. In general, a low RMSD value indicates that there is little change in the conformation, while a less fluctuating RMSD value indicates that the whole system has reached stability (Raju et al., 2022). As shown in Figure 6A, the ligand compound Dodoviscin A bound to the receptor protein ERK2 reached equilibrium after initial fluctuations and remained there until the end of the simulation, with mean RMSD values of 0.116 ± 0.020 nm, while the mean RMSD value of Magnolin was 0.153 ± 0.050 nm. Notably, for the Dodoviscin A-ERK2 complex, although the mean RMSD values were not considerably different from those of the control Magnolin-ERK2 complex, the RMSD fluctuation trajectory was smoother and the binding stability was better.
FIGURE 6. (A) Variation of RMSD values of small molecule ligands during molecular dynamics simulation; (B) Variation of RMSF values of protein amino acid residues during molecular dynamics simulation; (C) Variation of SASA values of ligand-protein complex during molecular dynamics simulation; (D) Variation of hydrogen bond number in ligand-protein complex during molecular dynamics simulation.
Root Mean Square Fluctuations (RMSF) are commonly used to assess the flexibility state of the structural regions of proteins. As shown in Figure 6B, the receptor protein ERK2 bound to the ligand compounds Magnolin and Dodoviscin A had similar fluctuation trends, and most of the amino acid residues that were more flexible were those that formed interactions with the ligands.
The hydrophobicity of amino acid residues in proteins is closely related to the folding of proteins. Figure 6C showed the solvent accessible surface area (SASA) changes of both Dodoviscin A and Magnolin systems during the simulation, which is crucial for the stability of hydrophobic interactions. In general, the fluctuations of both systems were similar, with the SASA of the Dodoviscin A-ERK2 complex fluctuating between 170 and 190 nm2 with a mean value of 180.46 ± 2.81 nm2, while the overall SASA of the Magnolin-ERK2 complex was higher with a mean value of 188.01 ± 3.89 nm2. These results indicated that the amount of hydrophobic amino acids hidden inside the protein did not very much. Hydrogen bonding has a significant effect on the stable binding of ligand-protein complexes (Kuhn et al., 2010). Figure 6D displayed the variation in the number of hydrogen bonds during the simulation. We found that more hydrogen bonds were formed in the Dodoviscin A-ERK2 complex than in the control Magnolin-ERK2 complex, with up to six hydrogen bonding interactions observed, mainly between 2 and 4. To further observe the hydrogen bonding of the candidate compound Dodoviscin A to the target protein ERK2 during the 100 ns simulation, the ligand-protein interactions were visualized at different time intervals. The results were shown in Figure 7, compared to the initial conformation, the hydrogen bonds between Dodoviscin A and the two key amino acid residues MET-108 and ASP-167, which are more important for the binding of the inhibitor to ERK2, were still present despite the partial loss of hydrogen bonding interactions. The hydrogen bonding interactions of the Dodoviscin A-ERK2 complex at different time points were listed in Supplementary Table S2. These results also further illustrated that the candidate compound Dodoviscin A was able to bind steadily to the target protein ERK2.
FIGURE 7. Analysis of hydrogen bonding interactions of Dodoviscin A-ERK2 complex at different time intervals during 100 ns simulation. (A) 10 ns; (B) 25 ns; (C) 50 ns; (D) 75 ns; (E) 100 ns.
4 Discussion
The human protein kinase family has become one of the hottest drug targets in the last few decades due to the critical role that mutations and dysregulation of protein kinases play in the pathogenesis of many human diseases (Cohen and Alessi, 2013). The Ras-Raf-MEK-ERK pathway is the most classical signaling pathway in the MAPK family, which transmits upstream signals to downstream responsive molecules through a continuous phosphorylation process. Studies have shown that inhibitors against the downstream target ERK have shown satisfactory results in terms of activity, selectivity and especially drug resistance compared to inhibitors against the upstream target (Miao and Tian, 2020). It should be noted that most of the ERK inhibitors identified so far are ATP-competitive inhibitors, and the development of more attractive covalent inhibitors still faces many challenges.
In recent years, the application of artificial intelligence in the pharmaceutical field has been increasing (Aguiar-Pulido et al., 2013). In this study, we first constructed multiple machine learning classification models for the kinase target ERK2 to discover potential natural product inhibitors. In contrast to previous work that considered only a single form of molecular characterization, we employed MACCS molecular fingerprints and RDKit molecular descriptors to provide a more comprehensive characterization of compounds in the dataset. Moreover, considering the variability between different algorithms, we have further improved the reliability of the predictions by integrating three single models through two model fusion strategies (Voting and Stacking). The results showed that our integrated models based on molecular descriptors performed better on the test set, with accuracy and AUC values ranging from 0.930 to 0.931 and 0.980–0.982, respectively. We then performed the first round of virtual screening of natural products in the ZINC database using these two integrated models, yielding a total of 403 candidate inhibitors.
The second round of virtual screening based on drug-likeness was carried out using the ADMETlab 2.0 online server, resulting in a total of 10 candidate inhibitors. Subsequently, we ranked the binding affinity of these compounds to ERK2 by Autodock Vina software (the third round of virtual screening) and identified the four most promising candidate inhibitors. Following in vitro biological evaluation, we found that the compound Dodoviscin A (a flavonoid isolated from Dodonaea viscosa) exhibited acceptable inhibitory activity on ERK2 (IC50 = 10.79 μm). In addition, a preliminary analysis of the interaction pattern and dynamic binding properties between the compound and the target protein was carried out. Overall, this work combines theoretical calculation and experimental validation in the search for natural product inhibitors of the kinase target ERK2, and the resulting candidate compound is expected to serve as a template molecule for the design of novel inhibitors.
The methods used in this study also have a number of limitations. Firstly, the data dependency and the lack of generalization may lead to a significant reduction in the performance of an excellent predictive model on a new dataset. Research has shown that an important issue to consider in building machine learning models is the quality of the dataset (Recanatini and Cabrelle, 2020). However, training data from various public databases is not always authentic and reliable, which will have a negative impact on the performance of machine learning models. Secondly, many machine learning algorithms are “black boxes” in terms of model interpretability. For example, we were unable to provide a detailed analysis of how the candidate compounds were screened. Thirdly, the number of compounds tested is relatively small and the screened compound Dodoviscin A still has considerable room for improvement in its inhibitory activity against ERK2. To address these issues, on the one hand, we need to clarify the potential of machine learning techniques to minimize prediction bias while understanding the nature of the problems (computational level), and on the other hand, we will work on structural modification of the screened compounds to improve their efficacy as ERK2 inhibitors (experimental level).
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
BC, RY, and GZ conceived, designed, processed, analysed, and interpreted the experiments. GZ and RY acquired the data. RY analysed the data and prepared the manuscript. LZ, YX, HY, and BY edited the manuscript. All authors contributed to the article and approved the submitted version.
Funding
This research was funded by the Sub-project of the National Ministry of Health Major New Drug Creation Science and Technology Major Project (No. 2014ZX09509001001) and Shandong Provincial Medical and Health Science and Technology Development Plan Project (No. 202002050626).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2022.1077550/full#supplementary-material
References
Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., et al. (2014). Machine learning for neuroimaging with scikit-learn. Front. Neuroinform. 8, 14. doi:10.3389/fninf.2014.00014
Aguiar-Pulido, V., Gestal, M., Cruz-Monteagudo, M., Rabunal, J. R., Dorado, J., and Munteanu, C. R. (2013). Evolutionary computation and QSAR research. Curr. Comput. Aided. Drug Des. 9 (2), 206–225. doi:10.2174/1573409911309020006
Carlino, M. S., Todd, J. R., Gowrishankar, K., Mijatov, B., Pupo, G. M., Fung, C., et al. (2014). Differential activity of MEK and ERK inhibitors in BRAF inhibitor resistant melanoma. Mol. Oncol. 8 (3), 544–554. doi:10.1016/j.molonc.2014.01.003
Chang, L., and Karin, M. (2001). Mammalian MAP kinase signalling cascades. Nature 410 (6824), 37–40. doi:10.1038/35065000
Chen, W. Q., Song, Z. J., and Xu, H. H. (2012). A new antifungal and cytotoxic C-methylated flavone glycoside from Picea neoveitchii. Bioorg. Med. Chem. Lett. 22 (18), 5819–5822. doi:10.1016/j.bmcl.2012.07.089
Cohen, P., and Alessi, D. R. (2013). Kinase drug discovery--what's next in the field? ACS Chem. Biol. 8 (1), 96–104. doi:10.1021/cb300610s
Cortés, V., Rodríguez, A., Blasco, J., Rey, B., Besada, C., Cubero, S., et al. (2017). Prediction of the level of astringency in persimmon using visible and near-infrared spectroscopy. J. Food Eng. 204, 27–37. doi:10.1016/j.jfoodeng.2017.02.017
Dhillon, A. S., Hagan, S., Rath, O., and Kolch, W. (2007). MAP kinase signalling pathways in cancer. Oncogene 26 (22), 3279–3290. doi:10.1038/sj.onc.1210421
Ferreira, L. G., Dos, S. R., Oliva, G., and Andricopulo, A. D. (2015). Molecular docking and structure-based drug design strategies. Molecules 20 (7), 13384–13421. doi:10.3390/molecules200713384
Fourches, D., Muratov, E., and Tropsha, A. (2015). Curation of chemogenomics data. Nat. Chem. Biol. 11 (8), 535. doi:10.1038/nchembio.1881
Gharwan, H., and Groninger, H. (2016). Kinase inhibitors and monoclonal antibodies in oncology: Clinical implications. Nat. Rev. Clin. Oncol. 13 (4), 209–227. doi:10.1038/nrclinonc.2015.213
Gramatica, P., Giani, E., and Papa, E. (2007). Statistical external validation and consensus modeling: A QSPR case study for koc prediction. J. Mol. Graph. Model. 25 (6), 755–766. doi:10.1016/j.jmgm.2006.06.005
Gupta, R., Srivastava, D., Sahu, M., Tiwari, S., Ambasta, R. K., and Kumar, P. (2021). Artificial intelligence to deep learning: Machine intelligence approach for drug discovery. Mol. Divers. 25 (3), 1315–1360. doi:10.1007/s11030-021-10217-3
Jing, Y., Bian, Y., Hu, Z., Wang, L., and Xie, X. Q. (2018). Deep learning for drug design: An artificial intelligence paradigm for drug discovery in the big data era. AAPS J. 20 (3), 58. doi:10.1208/s12248-018-0210-0
Kuhn, B., Mohr, P., and Stahl, M. (2010). Intramolecular hydrogen bonding in medicinal chemistry. J. Med. Chem. 53 (6), 2601–2611. doi:10.1021/jm100087s
Laskowski, R. A., and Swindells, M. B. (2011). LigPlot+: Multiple ligand-protein interaction diagrams for drug discovery. J. Chem. Inf. Model. 51 (10), 2778–2786. doi:10.1021/ci200227u
Lee, C. J., Lee, M. H., Yoo, S. M., Choi, K. I., Song, J. H., Jang, J. H., et al. (2015). Magnolin inhibits cell migration and invasion by targeting the ERKs/RSK2 signaling pathway. BMC Cancer 15, 576. doi:10.1186/s12885-015-1580-7
Lee, C. Y., and Chen, Y. P. (2019). Machine learning on adverse drug reactions for pharmacovigilance. Drug Discov. Today 24 (7), 1332–1343. doi:10.1016/j.drudis.2019.03.003
Lefloch, R., Pouyssegur, J., and Lenormand, P. (2008). Single and combined silencing of ERK1 and ERK2 reveals their positive contribution to growth signaling depending on their expression levels. Mol. Cell. Biol. 28 (1), 511–527. doi:10.1128/MCB.00800-07
Mavroforakis, M. E., and Theodoridis, S. (2006). A geometric approach to support vector machine (SVM) classification. IEEE Trans. Neural Netw. 17 (3), 671–682. doi:10.1109/TNN.2006.873281
Meggio, F., Donella, D. A., Ruzzene, M., Brunati, A. M., Cesaro, L., Guerra, B., et al. (1995). Different susceptibility of protein kinases to staurosporine inhibition. Kinetic studies and molecular bases for the resistance of protein kinase CK2. Eur. J. Biochem. 234 (1), 317–322. doi:10.1111/j.1432-1033.1995.317_c.x
Miao, L., and Tian, H. (2020). Development of ERK1/2 inhibitors as a therapeutic strategy for tumour with MAPK upstream target mutations. J. Drug Target. 28 (2), 154–165. doi:10.1080/1061186X.2019.1648477
Ohori, M., Kinoshita, T., Okubo, M., Sato, K., Yamazaki, A., Arakawa, H., et al. (2005). Identification of a selective ERK inhibitor and structural determination of the inhibitor-ERK2 complex. Biochem. Biophys. Res. Commun. 336 (1), 357–363. doi:10.1016/j.bbrc.2005.08.082
Paul, D., Sanap, G., Shenoy, S., Kalyane, D., Kalia, K., and Tekade, R. K. (2021). Artificial intelligence in drug discovery and development. Drug Discov. Today 26 (1), 80–93. doi:10.1016/j.drudis.2020.10.010
Pratilas, C. A., and Solit, D. B. (2010). Targeting the mitogen-activated protein kinase pathway: Physiological feedback and drug response. Clin. Cancer Res. 16 (13), 3329–3334. doi:10.1158/1078-0432.CCR-09-3064
Pronk, S., Pall, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., et al. (2013). Gromacs 4.5: A high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29 (7), 845–854. doi:10.1093/bioinformatics/btt055
Racz, A., Bajusz, D., and Heberger, K. (2019). Multi-level comparison of machine learning classifiers and their performance metrics. Molecules 24 (15), E2811. doi:10.3390/molecules24152811
Raju, B., Verma, H., Narendra, G., Sapra, B., and Silakari, O. (2022). Multiple machine learning, molecular docking, and ADMET screening approach for identification of selective inhibitors of CYP1B1. J. Biomol. Struct. Dyn. 40 (17), 7975–7990. doi:10.1080/07391102.2021.1905552
Recanatini, M., and Cabrelle, C. (2020). Drug research meets Network science: Where are we? J. Med. Chem. 63 (16), 8653–8666. doi:10.1021/acs.jmedchem.9b01989
Roskoski, R. J. (2019). Targeting ERK1/2 protein-serine/threonine kinases in human cancers. Pharmacol. Res. 142, 151–168. doi:10.1016/j.phrs.2019.01.039
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F., and Serrano, L. (2005). The FoldX web server: An online force field. Nucleic Acids Res. 33, W382–W388. doi:10.1093/nar/gki387
Serafim, M., Kronenberger, T., Oliveira, P. R., Poso, A., Honorio, K. M., Mota, B., et al. (2020). The application of machine learning techniques to innovative antibacterial discovery and development. Expert Opin. Drug Discov. 15 (10), 1165–1180. doi:10.1080/17460441.2020.1776696
Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., and Feuston, B. P. (2003). Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43 (6), 1947–1958. doi:10.1021/ci034160g
Townsend, E. A., Sathish, V., Thompson, M. A., Pabelick, C. M., and Prakash, Y. S. (2012). Estrogen effects on human airway smooth muscle involve cAMP and protein kinase A. Am. J. Physiol. Lung Cell. Mol. Physiol. 303 (10), L923–L928. doi:10.1152/ajplung.00023.2012
Trott, O., and Olson, A. J. (2010). AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31 (2), 455–461. doi:10.1002/jcc.21334
Tsou, L. K., Yeh, S. H., Ueng, S. H., Chang, C. P., Song, J. S., Wu, M. H., et al. (2020). Comparative study between deep learning and QSAR classifications for TNBC inhibitors and novel GPCR agonist discovery. Sci. Rep. 10 (1), 16771. doi:10.1038/s41598-020-73681-1
Wang, Y., Lupala, C. S., Liu, H., and Lin, X. (2018). Identification of drug binding sites and action mechanisms with molecular dynamics simulations. Curr. Top. Med. Chem. 18 (27), 2268–2277. doi:10.2174/1568026619666181212102856
Wei, G., Fu, X., and Wang, Z. (2021). Nature identification of Chinese herbal medicine compounds based on molecular descriptors. J. Aoac Int. 104 (6), 1754–1759. doi:10.1093/jaoacint/qsab002
Xiong, G., Wu, Z., Yi, J., Fu, L., Yang, Z., Hsieh, C., et al. (2021). ADMETlab 2.0: An integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res. 49 (W1), W5–W14. doi:10.1093/nar/gkab255
Yan, G., Zhu, J., Zhang, L., Xu, Z., Wang, G., Zhu, W., et al. (2013). Dodoviscin a inhibits melanogenesis in mouse b16-f10 melanoma cells. Planta Med. 79 (11), 933–938. doi:10.1055/s-0032-1328649
Yang, R., Zhao, G., and Yan, B. (2022). Discovery of novel c-jun N-terminal kinase 1 inhibitors from natural products: Integrating artificial intelligence with structure-based virtual screening and biological evaluation. Molecules 27 (19), 6249. doi:10.3390/molecules27196249
Keywords: extracellular signal-regulated protein kinase 2, virtual screening, inhibitors, machine learning, natural products
Citation: Yang R, Zhao G, Zhang L, Xia Y, Yu H, Yan B and Cheng B (2022) Identification of potential extracellular signal-regulated protein kinase 2 inhibitors based on multiple virtual screening strategies. Front. Pharmacol. 13:1077550. doi: 10.3389/fphar.2022.1077550
Received: 23 October 2022; Accepted: 09 November 2022;
Published: 18 November 2022.
Edited by:
Norberto Sánchez-Cruz, Universidad Nacional Autónoma de México, MexicoReviewed by:
Vinícius Gonçalves Maltarollo, Federal University of Minas Gerais, BrazilMarina Miñarro, Chemotargets SL, Spain
Copyright © 2022 Yang, Zhao, Zhang, Xia, Yu, Yan and Cheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Bin Cheng, 71000883@sdutcm.edu.cn
†These authors share first authorship