Machine learning and molecular docking prediction of potential inhibitors against dengue virus

Hanson, George; Adams, Joseph; Kepgang, Daveson I. B.; Zondagh, Luke S.; Tem Bueh, Lewis; Asante, Andy; Shirolkar, Soham A.; Kisaakye, Maureen; Bondarwad, Hem; Awe, Olaitan I.

doi:10.3389/fchem.2024.1510029

ORIGINAL RESEARCH article

Front. Chem., 24 December 2024

Sec. Medicinal and Pharmaceutical Chemistry

Volume 12 - 2024 | https://doi.org/10.3389/fchem.2024.1510029

This article is part of the Research TopicMedicinal Chemistry for Neglected Tropical Diseases Using In-vitro, In-vivo and In Silico ApproachesView all 9 articles

Machine learning and molecular docking prediction of potential inhibitors against dengue virus

George Hanson¹*^†

Joseph Adams¹^†

Daveson I. B. Kepgang²^†

Luke S. Zondagh³^†

Lewis Tem Bueh⁴^†

Andy Asante⁵^†

Soham A. Shirolkar⁶^†

Maureen Kisaakye⁷^†

Hem Bondarwad⁸^†

Olaitan I. Awe⁹*^†

¹Department of Parasitology, Noguchi Memorial Institute for Medical Research (NMIMR), College of Health Sciences (CHS), University of Ghana, Accra, Ghana
²Department of Biochemistry, Faculty of Sciences, University of Douala, Douala, Cameroon
³Pharmaceutical Chemistry, School of Pharmacy, University of Western Cape Town, Cape Town, South Africa
⁴Department of Computer Engineering, Faculty of Engineering and Technology, University of Buea, Buea, Cameroon
⁵Department of Immunology, Noguchi Memorial Institute for Medical Research (NMIMR), College of Health Sciences (CHS), University of Ghana, Accra, Ghana
⁶College of Engineering, University of South Florida, Florida, United States
⁷Department of Immunology and Molecular Biology, College of Health Sciences, Makerere University, Kampala, Uganda
⁸Department of Biotechnology and Bioinformatics, Deogiri College, Dr. Babasaheb Ambedkar Marathwada University, Sambhajinagar, India
⁹African Society for Bioinformatics and Computational Biology, Cape Town, South Africa

Introduction: Dengue Fever continues to pose a global threat due to the widespread distribution of its vector mosquitoes, Aedes aegypti and Aedes albopictus. While the WHO-approved vaccine, Dengvaxia, and antiviral treatments like Balapiravir and Celgosivir are available, challenges such as drug resistance, reduced efficacy, and high treatment costs persist. This study aims to identify novel potential inhibitors of the Dengue virus (DENV) using an integrative drug discovery approach encompassing machine learning and molecular docking techniques.

Method: Utilizing a dataset of 21,250 bioactive compounds from PubChem (AID: 651640), alongside a total of 1,444 descriptors generated using PaDEL, we trained various models such as Support Vector Machine, Random Forest, k-nearest neighbors, Logistic Regression, and Gaussian Naïve Bayes. The top-performing model was used to predict active compounds, followed by molecular docking performed using AutoDock Vina. The detailed interactions, toxicity, stability, and conformational changes of selected compounds were assessed through protein-ligand interaction studies, molecular dynamics (MD) simulations, and binding free energy calculations.

Results: We implemented a robust three-dataset splitting strategy, employing the Logistic Regression algorithm, which achieved an accuracy of 94%. The model successfully predicted 18 known DENV inhibitors, with 11 identified as active, paving the way for further exploration of 2683 new compounds from the ZINC and EANPDB databases. Subsequent molecular docking studies were performed on the NS2B/NS3 protease, an enzyme essential in viral replication. ZINC95485940, ZINC38628344, 2′,4′-dihydroxychalcone and ZINC14441502 demonstrated a high binding affinity of −8.1, −8.5, −8.6, and −8.0 kcal/mol, respectively, exhibiting stable interactions with His51, Ser135, Leu128, Pro132, Ser131, Tyr161, and Asp75 within the active site, which are critical residues involved in inhibition. Molecular dynamics simulations coupled with MMPBSA further elucidated the stability, making it a promising candidate for drug development.

Conclusion: Overall, this integrative approach, combining machine learning, molecular docking, and dynamics simulations, highlights the strength and utility of computational tools in drug discovery. It suggests a promising pathway for the rapid identification and development of novel antiviral drugs against DENV. These in silico findings provide a strong foundation for future experimental validations and in-vitro studies aimed at fighting DENV.

1 Introduction

Dengue Virus (DENV) is a positive-sense ssRNA virus belonging to the family Flaviviridae, responsible for the most prevalent viral hemorrhagic fever transmitted by mosquitoes (Chao et al., 2018). The disease is transmitted to humans by the mosquitoes Aedes aegypti and Aedes albopictus, especially in hyperendemic regions in Southeast Asia and the Pacific experience the cocirculation of multiple serotypes of the virus (Caminade et al., 2012; Cucunawangsih and Lugito, 2017). There are four unique DENV serotypes (DENV1, DENV2, DENV3, and DENV4); historically, these four serotypes circulated in different geographic areas (Jamal et al., 2024; Murray et al., 2013). The prevalence rates have been deemed 390 million cases as of 2024, with 96 million being symptomatic (Anasir and Ramanathan, 2020; Rachmawati et al., 2024) and annual death recorded at around 25,000 (Yadouleton et al., 2024). The impact of Dengue fever is at its peak in North and South America, the Southeastern part of Asia, and the Western Pacific (Gebhard et al., 2019). Its symptoms usually are myalgia, hemorrhagic features, arthralgia, headache, rash, and retro-orbital discomfort (Chikkaveeraiah et al., 2024; Drago et al., 2021). In severe cases, it may also lead to Dengue Hemorrhagic Fever (DHF) and Dengue Shock Syndrome (DSS), which is an acute vascular permeability syndrome (Chikkaveeraiah et al., 2024). The probability of the disease transitioning into DHF and DSS is considerably higher for patients who have developed secondary DENV infections, around 10 to 100-fold (Martina et al., 2009).

The DENV genome is 11 kb long, comprising 10,723 nucleotides, and encodes large polyprotein precursors of approximately 3,391 amino acid residues (Gautam et al., 2024). After being cleaved by host and viral proteases, these DENV polyproteins form three structural proteins: C, prM, and E (where each stands for capsid, pre-membrane, and envelope, respectively) as well as seven non-structural proteins (NSPs): NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5 (Dwivedi et al., 2017). The structural and non-structural proteins of the viral genome have all been identified as potential drug targets against Dengue infection (M. F. Lee et al., 2024). However, among these proteins, the envelope protein and the NSPs, NS3, and NS5 proteins have been identified as the proteins that play a vital role in viral replication (Chen et al., 2018; M. F. Lee et al., 2024). Due to mutations in specific proteins of the virus, emerging resistance to existing therapeutics has been reported (S. P. Lim, 2019) and thus calls for the urgent need to identify multiple vital drug targets that can effectively halt the replication of the virus in its host.

Significant efforts to contain the spread of Dengue fever can be seen in vaccine development, vector control mechanisms, and efforts to reduce viral load and preventive measures against severe forms of Dengue infection (M. F. Lee et al., 2024). Low et al. (2011) discovered that Narasin is a novel antiviral agent effective against all DENV serotypes with an IC50 of less than 1 μM (Gautam et al., 2024). Brefeldin is a promising antiviral compound with a 54.6–65.7 nM IC50 range for all DENV serotypes (Raekiansyah et al., 2017). There have been research efforts that aimed to model the evolution of viral pathogens like SARS-CoV-2 using genomic sequence data (Awe et al., 2023), HIV-1 evolution in sub-Saharan Africa (Obura et al., 2022) and Ebola Virus using comparative genomics (Oluwagbemi and Awe, 2018). Additionally, several experimental studies have evaluated the activity of repurposed drugs against DENV. Therefore, it is essential to continue exploring more elements and inhibitors to develop potent antivirals with high efficacy against DENV (Punekar et al., 2022). Despite decades of attempts to discover new drugs and vaccines, Dengvaxia is the sole vaccine accepted against DENV (marketed in several countries). Dengvaxia has been noted to be non-efficacious against certain dengue strains which dropped its efficacy rate to 61% (X.-N. Lim et al., 2019; Pintado Silva and Fernandez-Sesma, 2023; Thomas, 2023). TAK-003 and Butantan-DV are newly developed live-attenuated vaccines against DENV that have completed their phase III clinical trials, but the data regarding their efficacy against DENV3 and DEN4 is still insufficient (Biswal et al., 2020; Durbin, 2020; Kallas et al., 2020). As effective prophylactic and therapeutic measures against DENV are not present, the focus of patient management diverts to supportive therapy and controlling further transmission with drugs such as Chloroquine and Prednisolone (Lai et al., 2017).

Machine learning models like Support Vector Machine, Random Forest, Logistic Regression, and Naive Bayesian have been extensively applied in drug discovery, bioinformatics, and cheminformatics (Aniceto et al., 2023; Das et al., 2024; Niazi and Mariam, 2024). Advances in next-generation sequencing also enable the application of bioinformatics in diverse fields in the biomedical sciences and in applications like biomarker discovery (Chikwambi et al., 2023; Nyamari et al., 2023; El Abed et al., 2023; Ben et al., 2024; Alaya et al., 2024), co-infection biomarkers of parasites and viruses (Nzungize et al., 2022), analysis of RNA-seq, ChIP-seq data (Ather et al., 2018), genetics of complex diseases (Abolo et al., 2024) and in agriculture (Die et al., 2019; Omar et al., 2024), protein structure prediction (Pawar et al., 2024) and genomics applications in newborn screening (Wesonga and Awe, 2022). Various researchers used different ML techniques to study DENV, such as Gradient Boosting Machine (GBM), Random Forest (RF), and Support Vector Machine (SVM). Sanchez-Gendriz used an interesting technique (Sanchez-Gendriz et al., 2022) in which he developed a neural networking model with Long Short-Term Memory (LSTM) as the base for his studies in predicting future dengue cases in America. Another interesting study (Andersson et al., 2018) used a Convolution Neural Network to process street-level photos to predict DF and DHF rates in urban areas.

This study sought to build different machine learning models using the DENV2 CPE-Based HTS dataset from PubChem to distinguish between potential anti-dengue and non-anti-dengue compounds. The best-performing model based on the accuracy, specificity, Precision, and F1 score was used to predict active compounds solicited from the AfroDb (Ntie-kang et al., 2013), a catalog of ZINC15 database (Sterling and Irwin, 2015) and compounds present in the East African Natural Product Database (EANPDB) (Simoben et al., 2020). The predicted active compounds were further corroborated by employing molecular docking studies. The most promising drug candidates amongst the predicted compounds from our trained model were highlighted while also visualizing the intermolecular interactions between key residues in the active site and the compounds. The noxiousness of the compounds was estimated using SwissADME and DataWarrior. Molecular Dynamics (MD) simulations with Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) were utilized to evaluate predicted leads.

2 Methods

A visual representation of the methodology applied to this study is presented (Figure 1).

Figure 1

Figure 1. Graphical illustration of study workflow methods and instruments. The study developed five models on data from PubChem and used it for predicting new compounds. The predicted hits were screened through molecular docking, in silico pharmacological and toxicity tests, structural assessment using MD simulations, and estimation of binding free energies.

2.1 Dataset acquisition

This study proceeded in two phases as illustrated in the graphical depiction in Figure 1. First, a high throughput screen (HTS) measured in Cell-Based and Microorganism Combination System bioassay data of 343,305 compounds retrieved from PubChem was used for the study. The dataset with PubChem AID: 651640 was aimed to identify inhibitors of Dengue Virus by treating BHK-21 with various compounds before being infected with Dengue Virus serotype 2. The DENV2 strain New Guinea C, obtained from the American Type Culture Collection (ATCC) with catalog number VR-1584 was adapted (Che et al., 2009). Several viruses have been successfully studied using the cytopathic effect (CPE) assay to find new antiviral substances (Lin et al., 2023; McCormick et al., 2012).The expected outcome was that compounds increased ATP levels, indicating increased cell viability compared to the positive control (uninfected cells), and the neutral control was considered positive and otherwise negative. The dataset retrieved for this study comprised 5,946 active and 321,638 inactive compounds, a ratio of about 1:50 compounds. To compute molecular descriptors, the actives and inactives of the dataset were downloaded in the Structure Data Format (SDF).

2.2 Calculation of molecular descriptors and data preprocessing

Recent studies have demonstrated the utility of molecular descriptors in predicting compound activity, toxicity, and other properties from chemical structures (Comesana et al., 2022; Trinh et al., 2023). The active and inactive datasets were converted from SDF to Simplified Molecular-Input Line-Entry System (SMILES) formats for easier manipulation for machine learning. PaDEL-descriptor calculator (Yap, 2011) was used to compute 1440 molecular descriptors from the canonical SMILES of the compounds. Before descriptor calculations, compound standardization was performed, involving salt removal and nitro group normalization, to ensure uniformity and accuracy of the descriptor data (Viganò et al., 2024). The dataset used for training and testing the models comprised 4470 active compounds and 16780 inactive compounds, after reducing the ratio between the active and inactive compounds to 1:4 to enhance the computational efficiency.

The significant imbalance between active and inactive compounds was acknowledged, but oversampling techniques to balance the classes were not applied. A variance filter was employed to address the issue of dimensionality and enhance the relevance of features for predictive modeling (Velliangiri et al., 2019). Mean imputation was applied to handle any missing values within the dataset, ensuring a complete set of descriptors for each compound. Standardization of the dataset was done by removing the mean and scaling to unit variance using the Standard Scaler from the scikit-learn library. This transformation ensures that the features are centered around zero and have a standard deviation of one, which is crucial for many machine learning algorithms to perform optimally, as it ensures that the model is not biased towards features with larger numerical ranges. The was computed using the formula 1:

Standard Scaler (X) = \frac{X (i) - m e a n}{s t a n d a r d d e v i a t i o n} (1)

Where i represent each value in the feature X.

2.3 Development of machine learning models

Five models were built from the training dataset using five machine learning algorithms and the model with superior performance in terms of classification metrics was chosen (Kee et al., 2023; Tougui et al., 2021). A 70%–30% split of the pre-processed data was used for training and testing the models, respectively. The classification models developed included Support Vector Machine (SVM), k-nearest neighbors (k-NN), Gaussian Naïve Bayes (GaussianNB) (Adams et al., 2022), Random Forest classifier (RF), and Logistic Regression (LR) (Khorshid et al., 2021). For the k-nearest neighbors (k-NN) model, k = 3 was used. The Gaussian Naïve Bayes model was implemented using default settings from the scikit-learn library (Pedregosa et al., 2011). The SVM model was optimized with the probability parameter set to True (Sandhu et al., 2022). The Random Forest model was built with a maximum depth of 8 and 100 estimators (J. Adams et al., 2022). Finally, the Logistic Regression model was constructed with a maximum iteration parameter of 1000.

2.4 Model validation

Prior to comparison with other optimized classifiers, each classifier underwent optimization to determine the optimal hyperparameters that yielded the maximum accuracy. The optimized models were evaluated using 10-fold cross-validation. This method splits the training data into kkk groups, trains the model on k-1k-1k-1 folds, and tests the model on the remaining fold to yield a trustworthy estimate of the model’s performance on unseen data (Jung and Hu, 2015; Vabalas et al., 2019). It computes performance metrics such as accuracy, accuracy, precision, recall, and F1 score based on the confusion matrix, which includes true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) (Tharwat, 2021). These metrics (as shown in Equations 2–6) were used to compare and select the best-performing model for predicting Dengue Virus inhibitors (Adams et al., 2022; Orozco-arias et al., 2020)

Accuracy = \frac{T P + T N}{T P + F P + T N F N} (2)

Precision = \frac{T P}{T P + F P} (3)

Recall = \frac{T P}{T P + F N} (4)

F 1 = 2 x \frac{P r e c i s i o n * R e c a l l}{P r e c i s i o n + R e c a l l} (5)

Specificity = \frac{T N}{T N + F P} (6)

2.5 Prediction of compounds

The best-performing model was used to make predictions against 812 compounds from Afrodb, and 1871 EANPDB compounds. Prior to compound predictions, the model’s predictive power was validated on known Dengue Virus inhibitors post-cross-validation. The molecular descriptors of these inhibitors were calculated and preprocessed similarly to the training and test data. This was done to reinforce the credibility of the cross-validation results for the best-performing model based on the metrics. The number of correct predictions made by the model on the submitted inhibitors defines the accuracy of the classification or prediction. A total of 18 known inhibitors curated from literature were submitted to the LR model as a result of their activity against the Dengue Virus shown in different studies.

2.6 Preparation of target protein and ligand libraries

The Crystal structure of the Dengue 2 Virus nonstructural protein NS2B/NS3 was obtained from the RCSB Protein Data Bank (http://www.rcsb.org/pdb) with PDB ID: 2FOM. Before selecting 2FOM, different suitable structures such as 4M9M, 4M9T, and 4M9I were retrieved, however, 2FOM was chosen based on its low resolution, low R values, and the number of missing residues. The selected protein structure was superimposed with the other suitable structure using PyMOL to measure their root mean square deviation (RMSD). The structure was cleaned using PyMOL (Yuan et al., 2017) to devoid the protein of ions, water molecules, and other structures like ligands before minimization was carried out by employing Groningen Machine for Chemical Simulations (GROMACS) (Abraham et al., 2015). The steepest descent minimization algorithm with a maximum number of 50,000 steps and a minimization step size of 0.01 was used to minimize the protein structure. The three-dimensional structures (.sdf) of the natural compounds predicted by the machine learning model were obtained for the molecular docking stage. Natural products were chosen for this investigation because of their structural and chemical variety, and the therapeutic effects of phytochemicals found in plants.

2.7 Molecular docking and mechanism of binding characterization

The molecular docking procedure for the predicted compounds was carried out using AutoDock Vina (Trott and Olson, 2010). All compounds were energy minimized in 200 steps using the Universal force field (UFF) before being translated to the Protein Data Bank partial charge and atom type (.pdbqt) format using Open Babel software (O’Boyle et al., 2011). Visualization of the resultant energy minimized protein structure and the removal of surrounding water molecules before the virtual screening was done in PyMOL v1.5.0.4. The prepared structure was then saved using PyMOL before applying the “make macromolecule” option in PyRx to prepare for the docking of selected hits. The library was screened against the NS2B/NS3 protease using a grid box dimension of center_x = −5.179Å center_y = −9.575Å center_z = 13.756Å size_x = 18.302Å size_y = 19.821Å size_z = 23.788Å.

All hit compounds that contributed binding affinities of at least −8.0 kcal/mol were considered. The output of AutoDock Vina is ranked in a decreasing order of binding affinity using a negative function; a more negative binding affinity is preferred. The best mode for each compound was applied using the root mean squared deviation. The result was then examined with PyMOL to find the optimal-docked ligands. LigPlot + (v1.4.5) was used to analyze the interactions between key residues in the active site of the protein and the docked compounds (Laskowski and Swindells, 2011). The protein-ligand complexes generated in PyMOL were used as input for LigPlot. The resulting output provides a 2D depiction of intermolecular interactions, including hydrophobic interactions and hydrogen bonds.

2.8 ADMET screening of selected compounds

A pharmacokinetics profile, comprising an assessment of absorption, distribution, metabolism, and excretion (ADME), was applied to a subset of compounds using SwissADME (Daina et al., 2017). Along with Veber’s rule, the ADMET testing also measured five properties: total average molecular weight in g/mol, the number of hydrogen bond donors, hydrogen bond acceptors, rotatable bonds, and partition coefficient—collectively referred to as Lipinski’s rule of five (Devadasu et al., 2018; Jia et al., 2020). Using OSIRIS DataWarrior v06.02.05 (Sander et al., 2015), the toxicity characteristics, including mutagenicity and tumorigenicity, were predicted.

2.9 Molecular dynamics (MD) simulations

A 100 ns MD simulation was run on each protein-ligand complex and the unbound protein using GROMACS-2020.5 on a Dell EMC high-performance computing cluster at the WACCBIP, University of Ghana, Accra. The CHARMM36 all-atom force field produced the protein and ligand topology (July 2022). Utilizing a cubic box for all simulations, the systems were each solvated with water molecules, neutralized with ions, and energy-minimized to optimize the system. To equilibrate each system, NVT, and NPT ensemble were applied for 100 ps a piece before the production run. The parameters for the production run included 50,000,000 steps which translates into 100 ns and a step size of 0.002 (2 fs). Xmgrace (Turner, 2005) was used to visualize and analyze the root mean square deviation (RMSD), root mean square fluctuation (RMSF), and radius of gyration (Rg) obtained from the MD simulations.

2.10 Calculations of MMPBSA parameters

The binding free energies of the protein-ligand complexes and the individual energy contributions of the residues were calculated using the Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) method (Kumari et al., 2014). This is a corroboration technique used to verify the limitations of the existing scoring function utilizing the MD simulation output files (Wang C. et al., 2018). R programming software was utilized to plot the graphs from the MMPBSA computations.

3 Results

3.1 Data acquisition and processing

The bioactive dataset obtained from PubChem was an imbalanced set, with approximately one-third of its constituents being active compounds. As displayed in Figure 2, inactive compounds dominated the dataset. Using PaDEL, 1,444 molecular descriptors were generated, which provide a mathematical representation of the compounds for QSAR modeling by converting chemical information about the compounds into numerical values. The dataset underwent a three-dataset splitting strategy to be divided into training, validation, and test sets. The dataset of 21,250 compounds was split into 14,875 training data, 3,187 test data, and 3,188 externally held data. The first set is for training the algorithm, the validation set tunes hyperparameters, and the test set is used to test model performance and predicting ability. The application of a variance filter reduced the number of descriptors from 1,444 to 684, using a variance threshold of 0.1 to filter out descriptors with minimal variance. This step ensured that only the most informative features were retained for subsequent modeling, as low-variance features are often less useful in distinguishing between classes (Velliangiri et al., 2019).

Figure 2

Figure 2. 3D plot showing the correlation between the active and inactive compounds based on ALogP, XLogP, and Zagreb. [Labels: Blue = Actives, Red = Inactives, Pink = Outliers].

3.2 Model development and evaluation

Five machine learning models were developed to predict Dengue Virus inhibitors: k-Nearest Neighbors (k-NN), Gaussian Naïve Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Each model was evaluated using statistical metrics such as accuracy, precision, recall, and F1 score, as displayed in Table 1 with the confusion matrix shown in Supplementary Table S1.

Table 1

Table 1. Comparison of model performance on withheld data sets; showing results for accuracy, precision, recall, and F1 scores.

Among the models, Logistic Regression demonstrated the highest performance across most metrics, including an accuracy of 94%, a precision of 91%, and an F1 score of 0.83. This superior performance made Logistic Regression the most suitable model for predicting Dengue Virus inhibitors in this study. The SVM model followed closely, with an accuracy of 93%, a precision of 94%, and an F1 score of 0.81. Although the k-NN and Random Forest models performed well, they lagged behind the top two models in recall and F1 scores, indicating that they were less effective in identifying all active compounds.

Gaussian Naïve Bayes was the poorest-performing model, with an accuracy of 81% and an F1 score of 0.51. This model’s low precision (0.55) and recall (0.47) indicate that it struggled to balance identifying true positives and minimizing false positives, particularly in the imbalanced dataset where inactive compounds were predominant. Overall, Logistic Regression emerged as the most reliable model for predicting potential inhibitors due to its robust performance across the different metrics.

3.3 Prediction of known inhibitors and new compounds

The study further validated the performance of the developed models by testing 18 known Dengue Virus inhibitors curated from literature. The prediction made by the Logistic Regression model for these compounds is listed in Table 2. The Logistic Regression model correctly predicted 11 of the 18 inhibitors as active, outperforming other models, including SVM and Random Forest.

Table 2

Table 2. Classification and mechanisms of action of known DENV inhibitors identified by the Logistic Regression model.

Several inhibitors, such as Pentoxifylline, Prochlorperazine, and Balapiravir, were correctly classified as active by Logistic Regression, in line with their established mechanisms of action against the Dengue Virus. Notably, inhibitors like Celgosivir and Bortezomib, which inhibit viral replication, were also predicted accurately. This validation process of known inhibitors provided confidence in the model’s predictive capability, suggesting it could effectively generalize to novel compounds with similar mechanisms of action.

Following this validation, the Logistic Regression model was applied to predict 2,683 new compounds, including 812 from the ZINC database and 1,871 from the EANPDB database. Out of these, 933 compounds were predicted to be active, representing a promising pool of potential Dengue Virus inhibitors for further experimental validation.

These predictions highlight the utility of the developed QSAR models, particularly the Logistic Regression model, in identifying novel drug candidates for Dengue Virus inhibition. The robust performance on known inhibitors and newly predicted compounds highlights its potential as a valuable tool in future drug discovery efforts targeting the Dengue Virus. The study’s ability to handle imbalanced data effectively and generate accurate predictions underscores the importance of appropriate descriptor selection and data preprocessing in QSAR modeling.

3.4 Target selection and molecular docking of predicted compounds (PDB ID 2FOM)

In this study, the NS2B/NS3 protease was preferentially selected amongst the seven nonstructural proteins of the Dengue Virus as the target structure to corroborate the prediction by the logistic regression model. The NS2B/NS3 protease is an essential enzyme for viral replication and assembly, making it a principal antiviral target for developing therapeutics against the virus (Erbel et al., 2006; Norshidah et al., 2023). There are two potential locations for inhibiting DENV protease: the active site and the blocking attachment of protease (NS3) to its protein cofactor (NS2B). The active site on the NS3 which is the prime target is made up of a conserved catalytic triad like His51-Asp75-Ser135 (Noble et al., 2012; Zamri et al., 2019). A search via the Protein Data Bank repository for a solved structure of the NS2B/NS3 for the Dengue Virus serotype showed IDs such as 4M9T, 2FOM, 4M9M, and 4M9I with resolutions 1.74, 1.50, 1.53, and 2.40 Å respectively and R-value work of 0.215, 0.176, 0.203, and 0.215 respectively. The 2FOM, solved using x-ray diffraction, was selected for this study since it had the lowest resolution and R-value, both of which are a measure of the quality of the structure (Wlodawer et al., 2008). Additionally, the RMSD values after superimposing the selected protein structure to 4M9T, 4M9M, and 4M9I were 0.358, 0.151, and 0.276 Å, respectively, underlying their close structural similarity with the 2FOM. The three-dimensional structure of the 2FOM with a ligand docked in the active site is shown Figure 3. The active site on the NS3 used in this study, besides consisting of the catalytic triad which is pivotal in inhibiting its activity, also consists of residues such as Leu128, Pro132, Ser131, and Tyr161 (Norshidah et al., 2023; Purohit et al., 2022).

Figure 3

Figure 3. PyMOL visualization of NS2B/NS3 protease structure (A) Light-green cartoon structure representation; (B). Light-green surface representation of the protein with ZINC000095486052 (blue) docked in the active site.

A total of 853 compounds from the Logistic Regression prediction were docked into the active site of the target using Autodock. Anhydrophlegmacin showed the highest binding affinity of −9.2 kcal/mol towards the protease among all docked ligands. The docked compounds demonstrated binding affinities between −9.2 and −3.6 kcal/mol. This reinforces the prediction ability of the Logistic Regression model. Applying a threshold of −8.0 kcal, 59 ligands with affinities of −8.0 kcal/mol or better were considered for downstream analysis. A higher cutoff of −7.0 kcal/mol, taken as the standard threshold for a compound to be considered active against a particular target, was employed (Kwofie et al., 2019a). The higher the binding affinity, the stronger the bond between the ligands and the target protein. The protein in the complex with the ligands was visually inspected using PyMOL (Figure 4) to select the best-docked possess. In addition, inhibitors such as Leflunomide and Prednisolone were also incorporated into the docking to act as a control. They demonstrated affinities of −7.1 and −7.0 kcal/mol respectively. Table 3 shows the binding affinities of the top 20 of the selected compounds and the inhibitors.

Figure 4

Figure 4. Ligand ZINC38628344 docked in NS2B/NS3 binding pocket; 3D pose and 2D protein-ligand interaction diagram generated using PyMOL and LigPlot, respectively.

Table 3

Table 3. Protein-ligand interactions of top 20 hits with NS2B/NS3 post-docking, including interactions of two known inhibitors.

3.5 Mechanism of binding characterization of selected compounds

In continuation of the structure-based molecular docking to further confirm the binding affinities of the predicted compounds, the interactions of the compounds within the predicted binding pocket were determined. The biomolecular interactions between the NS2B/NS3 protease and the compounds were generated using LigPlot. Studies into these interactions are crucial in determining promising lead compounds. Identification of crucial residues in the corresponding targets’ active sites was made possible by characterizing the binding interactions. To find the suitable compound that inhibits the activities of the NS2B/NS3 protease, hydrogen and hydrophobic interactions between the shortlisted compounds and the residues in the active site were elucidated.

For the interactions of the protease, the ligands docked to the active site were observed to interact with the proposed residues such as His51, Ser135, Leu128, Pro132, Ser131, Tyr161, and Asp75 as shown in Table 3 and Supplementary Table S2. Anhydrophlegmacin and anhydrophlegmacin-9,10-quinones_B2 which had the highest binding affinities interacted with similar residues such as His51, Asp75, Gly151, Leu128, Pro132, and Gly153. They interacted with conserved catalytic triad residues Asp75, Ser135, and His51 through hydrogen bonding with bond lengths of 2.57, 3.06, and 2.86 Å respectively. ZINC38628344, which had an affinity of −8.5 kcal/mol with the NS2B/NS3 protease formed hydrogen bond interaction with His51 (2.89 Å), Ser135 (2.68 Å), Asp75 (2.57 Å), Phe130 (3.06 Å), Tyr150 (3.10 Å) and hydrophobic interactions with residues Pro132, Ser131, Leu128, Tyr161, Gly153, Gly151 (Figure 4). The inhibitor Prednisolone interacted via hydrogen bonding with Gly151 (2.90 Å, 2.71 Å), Asp75 (2.95 Å), His51 (3.21 Å), and Gly153 (2.93 Å, 3.16 Å) as shown in Figure 5. In addition, ZINC14441502 formed hydrogen interactions with Gly151 and Ser135 with bond length 2.86 and 2.99 Å respectively; and hydrophobic bonding with Leu128, Gly153, Asn152, Val72, Asp75, His151 and Phe130 (Supplementary Table S2). 39 out of 56 hits docked firmly and interacted with critical residues in the active site, and these were selected for downstream analysis.

Figure 5

Figure 5. Inhibitor Prednisolone docked in the NS2B/NS3 binding pocket, showing the protein-ligand interactions visualized in LigPlot and the 3D pose in PyMOL.

3.6 ADMET screening of selected compounds

Pharmacokinetics controls how medications are absorbed by the body and eventually eliminated (Wang G. Y. et al., 2018). Analyses were conducted on pharmacokinetic features, including gastrointestinal (GI) absorption. Drugs taken orally can enter the bloodstream through a process known as gastrointestinal absorption (GI) (Suenderhauf et al., 2012). “High” compound absorption occurs in the GI tract. To select druglike compounds, Veber’s criteria were also applied, and the selected hits that do not conform to Lipinski’s rule of five (RO5) were eliminated (Ogbodo et al., 2023). Out of the 39 hits, 20 were in violation of the rule (see Supplementary Table S3). Twelve hits also broke one of the RO5s. The remaining 7 hits: 5,7′-physcion-fallacinol, ZINC000095485956, ZINC000085594516, amentoflavone, ZINC000095486111, voucapane-18,19-di-(4-methyl)-benzenesulphonate, ZINC000095485927 showed the least drug-likeness of two RO5 violations (Table 4). Veberr’s rule, with TPSA ≤140 and rotatable bonds ≤10, was used as the main determinant (Veber et al., 2002). 26 out of the selected hits demonstrated 0 violations with the remaining showing only one violation of the rule. The solubility and pharmacological profiles such as GI absorption were also elucidated. Only ZINC000095485927 was predicted to be insoluble (Supplementary Table S3). Four of the 14 hits had a moderate solubility prediction and four had a soluble prediction. However, 19 of the selected hits were predicted to be poorly soluble (Supplementary Table S3). Compounds are considered to have met the GI absorption criteria if it is denoted as ‘High’ suggesting a high propensity of absorption into the intestinal tract for orally administered drugs. 21 and 18 of the selected hits were estimated to be high and low respectively. The mutagenicity and tumorigenicity levels of the hits were also predicted using DataWarrior (Table 4). From the results obtained, 26 out of the 39 hits tested were neither mutagenic nor tumorigenic.

Table 4

Table 4. Prediction of ADME and toxicity profiles of top 15 selected hits.

3.7 Molecular dynamics simulations

Molecular dynamics simulations were carried out using GROMACS 2020.5 to further elucidate the stability of the predicted lead compounds within the active site of the NS2B/NS3 protein (Mazumder et al., 2017). Understanding the binding mechanisms of the various molecules in the active site is crucial for the design of efficacious drugs. To analyze the dynamic behavior of the unbound proteins and complexes, the root mean square deviation, the radius of gyration, and the root mean square fluctuation were plotted with the use of Xmgrace (Agyapong et al., 2021; Kwofie et al., 2019b; Musyoka et al., 2016). All simulations were carried out for 100 ns.

3.7.1 Root mean square deviation (RMSD)

A reliable indicator of a protein’s stability is the RMSD, which assesses the stability of the complex from the original protein backbone atomic coordinates (Adinortey et al., 2022; Kwofie et al., 2022). From the RMSD plot, the unbound protein and the four lead compounds experienced stability throughout the 100 ns run except for the inhibitor Prednisolone which was observed to be unstable till 70 ns of the run. The unbound protein was observed to have the least fluctuations. The RMSD plot for the NS2B/NS3pro-Prednisolone complex rose sharply from 0 to 0.26 nm after which it remained relatively unstable with large fluctuations until 70 ns where it demonstrated some stability (Figure 6). The RMSD of the NS2B/NS3pro-ZINC38628344 complex increased to 0.25 nm and stabilized, averaging 0.22 nm until the end. The complexes NS2B/NS3pro-ZINC95485940, NS2B/NS3pro-ZINC14441502, and NS2B/NS3pro-2′,4′-dihydroxychalcone showed similar fluctuations with the RMSD averaging around 0.17 nm (Figure 6).

Figure 6

Figure 6. RMSD versus time graph of unbound protein and NS2B/NS3pro-ligand complexes generated over a 100 ns MD run.

3.7.2 Radius of gyration for 100 ns MD simulations

The compactness and folding of the five complexes and the unbound protein were examined in this work by charting the radius of gyration (Rg) throughout a 100 ns simulation duration (Liao et al., 2014). A stably folded protein maintains a reasonably steady Rg throughout the simulation. The Rg of the unbound NS2B/NS3 protease and protein-ligand complexes ranged from 1.51 to 1.59 nm (Figure 7). Considering the unbound protease, it experienced relatively steady fluctuation till the 50 ns mark from which it rose sharply till the simulations ended. For the protein-ligand complexes, they demonstrated similar trends in fluctuation throughout the 100 ns run. The Rg for the NS2B/NS3pro-Prednisolone displayed fluctuations with the highest peak at 1.59 nm (Figure 7). The complex NS2B/NS3pro-2′,4′-dihydroxychalcone had the largest fluctuations compared to their complexes, though most of the fluctuations occurred around 40–80 ns.

Figure 7

Figure 7. Rg graph of the NS2B/NS3pro-ligand complexes and unbound protein.

3.7.3 Root mean square fluctuations (RMSF) for 100 ns MD simulations

Additionally, the RMSF trajectories of the protein-ligand complexes and unbound NS2B/NS3pro were examined (L. Adams et al., 2023; Ashley et al., 2024). According to Cheng and Ivanov (2012), the RMSF reveals a protein’s flexibility in several domains, some of which are connected to crystallographic B-factors. By using this stability profile analysis, residuals that contribute to the structural fluctuation can be evaluated. Greater variations are implied by higher RMSF values. Greater fluctuations occur in protein areas involved in catalysis and ligand binding (Dong et al., 2018). These protein sequence areas that influence the conformational changes of the complex are primarily responsible for adaptive variation in flexibility (Dong et al., 2018).

All the predicted lead compounds caused some degree of changes in comparable regions, according to the RMSF plot (Figure 8). Large fluctuations were observed from residue index 28–33 followed by some fluctuations between residue index 60–65 as well as 116–123. The RMSF graph also showed fluctuations in the unbound protein around residues 102–106 (Figure 8).

Figure 8

Figure 8. Examination of the RMSF trajectories of the NS2B/NS3pro-ligand complexes and the unbound protein residues.

3.8 MMPBSA computations

3.8.1 Contributing energy terms

The binding free energies of the complexes were estimated using the Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) calculation (Genheden and Ryde, 2015) (Table 5). Van der Waals energies, electrostatic, polar solvation, and solvent-accessible surface area energy are factors that contribute to the binding free energy (Asiedu et al., 2021; Boateng et al., 2020). The binding free energies were computed in terms of average and standard deviations. The leads ZIN38628344, ZINC95485940, and ZINC14441502 and 2′,4′-dihydroxychalcone had binding free energy of −44.957, −18.586, −25.881, and −55.805 kJ/mol respectively. 2′,4′-dihydroxychalcone demonstrated the lowest binding free energy while ZINC95485940 was observed to have the highest binding free energy among the four predicted lead compounds. The binding free energy of the known inhibitor Prednisolone was −17.682 kJ/mol. It has been found that compounds that have high polar energies and low electrostatic energies are active against receptors. (Gupta et al., 2018).

Table 5

Table 5. MMPBSA contributing energy terms for NS2B/NS3-ligand complexes displayed as averages ± standard deviations in kJ/mol.

3.8.2 Per-residue energy decomposition

By employing per-residue decomposition, the binding energies of individual residues can be computed using the MMPBSA approach. This entails breaking down each residue by taking into account the interactions that each residue participates in. These offer helpful information on significant interactions between crucial residues in the free energy contribution. Critical residues for binding a ligand to a protein contribute binding free energy of at least ±5 kJ/mol (Kwofie et al., 2019a).

For every complex, the per-residue energy decomposition computation was carried out (Figure 9; Supplementary Figures S1A-D). For the NS2B/NS3-ZINC14441502 complex, only Tyr161 contributed energy of −6.4629 kJ/mol (Figure 9). For the NS2/NS3B-ZINC38628344 complex, Tyr161 and Leu128 contributed individual energies of −6.6957 and −3.4011 kJ/mol respectively (Supplementary Figure S1A). Key residues interacting with ZINC95485940, 2′,4′-dihydroxychalcone, and Prednisolone contributed minor energies (Supplementary Figure 1).

Figure 9

Figure 9. MMPBSA plot of binding free energy contributions per residue for NS2B/NS3-ZINC14441502 complex.

4 Discussion

Despite being one of the leading causes of morbidity and mortality in different countries across the world, there is no approved efficacious drug for DENV infection (Palanichamy Kala et al., 2023; Bhatt et al., 2013). While there are currently no approved antiviral drugs or reliable vaccines, a vast number of plants have been tested against DENV through computational discovery (Malabadi et al., 2018; Powers and Setzer, 2016; Rasool et al., 2019; S et al., 2021). The effect of disrupting the function of NS2B/NS3 protease has already been established, particularly through interactions with critical residues His51, Asp75, and Ser135 (Purohit et al., 2022; Tian et al., 2018). In this study, we employed machine learning (ML) and molecular docking techniques to predict potential inhibitors against Dengue Virus (DENV). The combination of machine learning models, molecular dynamics (MD) simulations, and molecular docking enables the exploration of drug candidates and accelerates the identification of lead compounds (Deo, 2015; Niazi and Mariam, 2023; Sarma et al., 2020).

The bioactive dataset from PubChem revealed a significant imbalance, with a ratio of 1:50 between the active and inactive compounds. This class imbalance phenomenon is commonly and largely observed in many bioinformatics and chemoinformatics studies (Japkowicz and Shah, 2011), which can lead to biased model training if not properly addressed. After processing, the dataset was refined to include 4,470 active compounds and 16,780 inactive compounds, achieving a ratio of approximately 1:4. This truncation was performed not merely due to computational constraints but as part of a deliberate effort to enhance the computational efficiency and balance the representation of active and inactive compounds. One key consideration was that excessively large and imbalanced datasets may introduce noise or over-represent the inactive class, thereby diminishing the model’s ability to identify active compounds accurately.

The application of various machine learning algorithms, including k-NN, Gaussian Naïve Bayes, SVM, Random Forest, and Logistic Regression, highlights the diversity of approaches available for predictive modeling of active compounds in pharmacoinformatics (Leardi, 2001; W. Wu &Herath, 2016). The Logistic Regression model emerged as the most effective of the five tested, achieving the highest accuracy, recall, and F1 scores, which are critical metrics for evaluating classifier performance, particularly in imbalanced datasets (Davis and Goadrich, 2006; Mwanga et al., 2023). The performance of other antiviral ML classification models (Gawriljuk et al., 2021; Gupta and Mohanty, 2021; Sandhu et al., 2022), compared to our models though none of them were trained on the dataset used in this study. The recall of 0.76 shows the number of active compounds classified correctly as active while the F1 score of 0.83 illustrates the overall accuracy of the models by combining precision and recall metrics. The SVM closely followed the performance of the LR with a recall and F1 score of 0.71 and 0.81 respectively. In addition, the accuracy of the LR model at 0.94 was not far off from that of the SVM averaging an accuracy of 0.91. An SVM model (Sandhu et al., 2022) trained on 5692 molecules with inhibitory activity against acetylcholinesterase from the bindingDB database establishing accuracy and F1 score of 0.83 and 0.81, respectively. Validation of the model’s predictive power involved testing against known DENV inhibitors, demonstrating the model’s reliability in predicting active compounds. These predictions expand the pool of potential candidate Dengue Virus inhibitors for further validation.

The NS2B/NS3 protease complex, used in this study is composed of the NS3 protease, which carries out the catalytic activity, and the NS2B cofactor, which stabilizes the active conformation of NS3. A serine protease domain located at the N-terminal of NS3 is structurally responsible for cleaving the viral polyprotein into functional units required for viral replication (Low et al., 2011; Bhatt et al., 2013). Given its central role in viral replication, targeting the NS2B/NS3 complex disrupts a key step in the Dengue virus life cycle making NS2B/NS3 protease a key antiviral target for the development of therapeutic agents against the Dengue virus (Erbel et al., 2006; Norshidah et al., 2023; Bhatt et al., 2013). The selection of the PDB ID 2FOM for docking studies was based on structural quality metrics, underscoring the importance of selecting appropriate protein structures for virtual screening (Wlodaweret al., 2008). The results of the docking simulations indicated that anhydrophlegmacin exhibited superior binding affinity, reinforcing the predictive capabilities of the Logistic Regression model. Applying a high threshold of −8.0 kcal/mol improved selection conditions for further studies on active compounds (Kwofie et al., 2019b). Using a threshold of −8.0 kcal/mol, 56 hit compounds that docked firmly were selected for further analysis.

Characterization of ligand interactions using LigPlot provided insight into the binding mechanisms, which are vital for understanding how compounds can effectively inhibit the target protein (Laskowski and Swindells, 2011). This analysis is critical given that it helps identify key residues that contribute significantly to ligand binding, as a means of guiding future drug design efforts. Therefore, hydrogen and hydrophobic interactions between the shortlisted compounds and the residues in the active site helped to elucidate the top-hit compound that inhibits the activities of the NS2B/NS3 protease. Hydrophobic and hydrogen bonds between the selected compounds and catalytic triad His51-Asp75-Ser135 were the main determinants of good interaction and possible inhibition. A structural similarity check between different NS2B/NS3pro confirms the conserved regions and residues (His51-Asp75-Ser135) on the NS3 protease sequence across all serotypes (Purohit et al., 2022; Wahaab et al., 2022), thus the lead compounds identified in this study can serve as cross-serotype inhibitors. The interactions such as His51, Tyr161, Pro132, Asp75, Gly153, and Ser135 by the compounds in this study are consistent with reported binding interactions through docking (Hariono et al., 2019; Purohit et al., 2022; Rasool et al., 2019). The four lead compounds were superimposed to compare their binding modes, revealing that key residues His51, Asp75, and Tyr161 were consistently present across all four complexes. Notably, the residue Ser135 was observed in three lead compounds, ZINC14441502, ZINC95485940, and ZINC38628344. These findings suggest that the four lead compounds share a similar mechanism of action. In addition, the ADMET helps to predict safe compounds by assessing the pharmacokinetic profiles and toxicity of compounds (Wang C. et al., 2018). The integration of Veber’s rules and other drug-like filters serves to prioritize compounds that are more likely to succeed in preclinical and clinical phases denoted as “druglike compounds” (Veber et al., 2002).

Molecular dynamics simulations serve as a powerful tool to explore the stability of protein-ligand complexes over time (Maricarmen et al., 2016). The stability observed in the RMSD plots correlates with the low binding affinities observed during the molecular docking of the compounds. Lower fluctuations often indicate stronger and more stable interactions (Adinortey et al., 2022). However, the inhibitor Prednisolone was observed to be unstable, with its RMSD peaking at 0.3 nm. The predicted leads, ZINC14441502, 2′,4′-dihydroxychalcone, ZINC95485940, and ZINC38628344 demonstrated high fluctuations compared to the unbound state indicating that binding with compounds may have caused conformation changes in the structure. The Rg analysis further supported this by demonstrating the compactness of the complexes, a key indicator of structural integrity during simulations (Liao et al., 2014). A stable Rg between a range of 1.51–1.59 nm, shows a stably folded complex.

Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) calculations provided detailed insights into the energetic contributions of binding interactions, which are crucial for understanding the thermodynamic viability of the ligand-protein complexes (Genheden and Ryde, 2015). The findings indicate that the leads 2′,4′-dihydroxychalcone, ZINC14441502, ZINC95485940, and ZINC38628344 exhibited the most favorable binding free energies of -55.805, -25.881, and -44.957 kJ/mol, respectively, reinforcing their potential as lead compounds for further development (Abdullah et al., 2023; Saleh & Kamisah, 2021). The lead compounds demonstrated a more negative binding energy to the NS2B/NS3 protease than the inhibitor Prednisolone, which is suggestive of a stronger attraction.

Some limitations associated with the study include; the machine learning model was trained on a dataset heavily imbalanced toward inactive compounds, which could influence the model’s ability to generalize well to external datasets. The results validate the logistic model in handling imbalanced datasets, a common issue in bioinformatics studies. While steps were taken to mitigate this, further work could involve data balancing techniques or increasing the number of active compounds to improve predictive accuracy. Additionally, while the molecular docking results are promising, in vitro and in vivo validations of these compounds are essential to confirm their inhibitory effects against the Dengue virus. Furthermore, pharmacokinetic profiling, though performed in silico, would require experimental validation to ensure the predicted compounds meet safety and efficacy standards in a biological system.

5 Conclusion

Natural products used as a reservoir for novel therapeutic agents must be tapped and repurposed as effective DENV inhibitors. The significance of this study lies in applying a comprehensive approach to drug discovery, integrating Machine Learning, Molecular Docking, and Dynamics simulations to identify novel potential DENV inhibitors. From this study, five classification models were developed, from which the best-performing model based on accuracy, F1 score, precision, and specificity was employed to make predictions. With an accuracy and precision of 0.94 and 0.91 respectively, the Logistic Regression outperformed the other models and thus was used to predict potential inhibitors against the protease. Four lead compounds ZINC38628344, ZINC95485940, 2′,4′-dihydroxychalcone, and ZINC14441502 with high binding affinities of −0.85, −0.81, −8.6, and −0.81 kcal/mol respectively, and interactions with the conserved catalytic triad, His51-Asp75-Ser135 in the active site of the NS2B/NS3 protease were discovered. The successful prediction and characterization of binding interactions enhance our understanding of ligand-target dynamics based on natural compounds which can be biosynthesized, paving the way for further experimental validations and in vitro drug development. The compounds were predicted to possess pharmacokinetic properties and exhibit characteristics of non-tumorigenicity and non-mutagenicity, based on the physicochemical and toxicological characterization, adjourning them to be safe and drug-like. The probable inhibition of the activities of the NS2B/NS3pro of DENV by the leads was corroborated by 100 ns molecular dynamics simulation involving MMPBSA calculations. The study’s approach provides a foundation for the continued use of computational tools in the fight against viral diseases, suggesting a promising path for rapid drug discovery in the future. In addition, the prospective compounds can be considered suitable anti-DENV scaffolds for fragment-based drug design and, thus, are worthy of further experimental validation.

Data availability statement

The dataset used in this study is available at the PubChem Database with AID 651640 (https://pubchem.ncbi.nlm.nih.gov/bioassay/651640) and the GitHub Repo (https://github.com/omicscodeathon/denguedrug/tree/main/data). The comprehensive pipelines and scripts utilized for the models' implementation as well as data and results for the Molecular Docking and Molecular Dynamics Simulations are available at the GitHub repository at https://github.com/omicscodeathon/denguedrug.

Author contributions

GH: Conceptualization, Data curation, Formal Analysis, Methodology, Project administration, Supervision, Visualization, Writing–original draft, Writing–review and editing. JA: Formal Analysis, Methodology, Visualization, Writing–original draft, Writing–review and editing. DK: Formal Analysis, Validation, Visualization, Writing–original draft, Writing–review and editing. LZ: Formal Analysis, Methodology, Writing–original draft, Writing–review and editing. LT: Writing–original draft, Writing–review and editing. AA: Writing–original draft, Writing–review and editing. SS: Writing–original draft, Writing–review and editing. MK: Writing–original draft, Writing–review and editing. HB: Writing–original draft, Writing–review and editing. OA: Conceptualization, Project administration, Resources, Supervision, Validation, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

The authors express their gratitude to the Office of Data Science Strategy (ODSS) of the National Institutes of Health (NIH) for their tremendous support leading up to and throughout the Omics codeathon in October 2024, which was organized by the African Society for Bioinformatics and Computational Biology (ASBCB).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fchem.2024.1510029/full#supplementary-material

Abbreviations

3D, Three-dimensional; ADMET, chemical Absorption, Distribution, Metabolism, Excretion, and Toxicity; AI, Artificial Intelligence; AID, BioAssay Identification number; ANPDB, African Natural Products Database; DENV, Dengue Virus; DF, Dengue Fever; DHF, Dengue Hemorrhagic Fever; DSS, Dengue Shock Syndrome; EANPDB, Eastern Africa Natural Products Database; GaussianNB, Gaussian Naïve Bayes; GBM, Gradient Boosting Machine; GROMACS, Groningen Machine for Chemical Simulations; HTS, High Throughput Screen; KNN, K-Nearest Neighbors; LR, Logistic Regression; MD, Molecular Dynamics; ML, Machine Learning; MMPBSA, Molecular Mechanics Poisson-Boltzmann Surface Area; NB, Gaussian Naïve Bayes; PaDEL, Pharmaceutical Data Exploration Laboratory; QSAR, Quantitative Structure-Activity Relationship; RF, Random Forest classifier; RMSD, Root Mean Square Deviation; RMSF, Root Mean Square Fluctuation; SASA, Solvent-Accessible Surface Area; SDF, Structure Data Format; SMILES, Simplified Molecular-Input Line-Entry System; SVM, Support Vector Machine; UFF, Universal Force Field.

References

Abdullah, Z. L., Chee, H.-Y., Yusof, R., and &ohdFauzi, F. (2023). Finding lead compounds for dengue antivirals from a collection of old drugs through in silico target prediction and subsequent in vitro validation. ACS Omega 8 (36), 32483–32497. doi:10.1021/acsomega.3c02607

PubMed Abstract | CrossRef Full Text | Google Scholar

Abolo, L., Ssenkaali, J., Mulumba, O., and Awe, O. I. (2024). Exploring the causal effect of omega-3 polyunsaturated fatty acid levels on the risk of type 1 diabetes: a Mendelian randomization study. Front. Genet. 15, 1353081. doi:10.3389/fgene.2024.1353081

PubMed Abstract | CrossRef Full Text | Google Scholar

Abraham, M. J., Murtola, T., Schulz, R., Pall, S., Smith, J. C., Hess, B., et al. (2015). Gromacs: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2, 19–25. doi:10.1016/j.softx.2015.06.001

PubMed Abstract | CrossRef Full Text | Google Scholar

Adams, J., Agyenkwa-mawuli, K., Agyapong, O., Wilson, M. D., and &Kwofie, S. K. (2022). EBOLApred: a machine learning-based web application for predicting cell entry inhibitors of the Ebola virus. Comput. Biol. Chem. 101 (March), 107766. doi:10.1016/j.compbiolchem.2022.107766

PubMed Abstract | CrossRef Full Text | Google Scholar

Adams, L., Afiadenyo, M., Kwofie, S. K., Wilson, M. D., Kusi, K. A., Obiri-Yeboah, D., et al. (2023). In silico screening of phytochemicals from dissotisrotundifolia against plasmodium falciparum dihydrofolate reductase. Phytomedicine Plus 3 (2), 100447. doi:10.1016/j.phyplu.2023.100447