Artificial intelligence for antiviral drug discovery in low resourced settings: A perspective

Namba-Nzanguim, Cyril T.; Turon, Gemma; Simoben, Conrad V.; Tietjen, Ian; Montaner, Luis J.; Efange, Simon M. N.; Duran-Frigola, Miquel; Ntie-Kang, Fidele

doi:10.3389/fddsv.2022.1013285

REVIEW article

Front. Drug Discov., 02 November 2022

Sec. In silico Methods and Artificial Intelligence for Drug Discovery

Volume 2 - 2022 | https://doi.org/10.3389/fddsv.2022.1013285

This article is part of the Research Topic: Insights in In silico Methods and Artificial Intelligence for Drug Discovery: 2022

Cyril T. Namba-Nzanguim¹†

Gemma Turon²†

Conrad V. Simoben¹†

Ian Tietjen³

Luis J. Montaner³

Simon M. N. Efange¹

Miquel Duran-Frigola²*

Fidele Ntie-Kang¹,⁴*

¹Department of Chemistry, University of Buea, Buea, Cameroon
²Ersilia Open Source Initiative, Cambridge, United Kingdom
³The Wistar Institute, Philadelphia, PA, United States
⁴Institute of Pharmacy, Martin-Luther University Halle-Wittenberg, Halle (Saale), Germany

Current antiviral drug discovery efforts face many challenges, including development of new drugs during an outbreak and coping with drug resistance due to rapidly accumulating viral mutations. Emerging artificial intelligence and machine learning (AI/ML) methods can accelerate anti-infective drug discovery and have the potential to reduce overall development costs in Low and Middle-Income Countries (LMIC), which in turn may help to develop new and/or accessible therapies against communicable diseases within these countries. While the marketplace currently offers a plethora of data-driven AI/ML tools, most to date have been developed within the context of non-communicable diseases like cancer, and several barriers have limited the translation of existing tools to the discovery of drugs against infectious diseases. Here, we provide a perspective on the benefits, limitations, and pitfalls of AI/ML tools in the discovery of novel therapeutics with a focus on antivirals. We also discuss available and emerging data sharing models including intellectual property-preserving AI/ML. In addition, we review available data sources and platforms and provide examples for low-cost and accessible screening methods and other virus-based bioassays suitable for implementation of AI/ML-based programs in LMICs. Finally, we introduce an emerging AI/ML-based Center in Cameroon (Central Africa) which is currently developing methods and tools to promote local, independent drug discovery and represents a model that could be replicated among LMIC globally.

Introduction

Even with extensive access to resources, funding, and talent, drug research and development is a complex, expensive, and time-consuming endeavour. Despite the advances made toward drug discovery procedures that combine traditional and modern methods, most drug candidates fail to achieve regulatory approval and reach the market, a phenomenon known as attrition (Waring et al., 2015). Currently, over 90% of drug candidates fail between phase I clinical trials and regulatory approval, resulting in substantial loss of financial investment and resources (Fleming, 2018).

Traditional methods of drug discovery include finding and validating a putative drug target, followed by the development of a target-based bioassay and the identification of hit compounds that interact with the target with significant activity. At this stage, hit compounds generally undergo rounds of hit-to-lead optimization to improve stability, activity, and selectivity over toxicity, among other parameters. Additionally, the compounds being examined are investigated in a battery of assays to test their ability to produce the same observed response within living animals (in vivo) or isolated living tissues (ex vivo) (Hughes et al., 2011).

One avenue to reduce the cost and duration of drug discovery is the use of in silico protocols in the early stages of the drug research and development pipeline. In silico methods can lower the attrition rate by identifying drug candidates with predicted suitable therapeutic activities and excluding compounds with undesirable traits such as predicted toxicity or poor pharmacokinetics (Beresford et al., 2004; Hughes J. D. et al., 2008; Hughes L. D. et al., 2008; Gawehn et al., 2016; Zhang et al., 2017). Approaches like molecular docking and quantitative structure-activity relationship (QSAR) modeling are used to identify hits in virtual compound libraries as well as to predict and optimize molecular bioactivity (Golbraikh et al., 2016). Predictions that can be obtained and tested experimentally for accuracy include physicochemical properties (such as logP and solubility) and the binding mode of a ligand (small molecule/protein) to a target (protein). To predict ligand-protein interactions, a high-resolution protein structure is necessary, ideally with previous knowledge of other ligands bound to the intended binding site. Fine-grained molecular dynamics simulations/relaxations, for instance, can be used to understand the atomistic details of the ideal ligand-protein complex, which in turn reduces the number of final molecules suggested to the experimentalists (i.e., medicinal chemists and biologists) and favours molecules that potentially have better activities than the starting/reference compound.
However, while modern physics-based computational methods such as docking and molecular dynamics simulations are able to simulate specific ligand-target interactions, a current challenge of computational drug discovery is the modeling of compound effects at phenotypic and physiological levels in order to improve translation to in vivo experiments, where issues related to efficacy and drug absorption, distribution, metabolism, excretion, and toxicity (ADMET) may emerge (Cherkasov et al., 2014). These predictions are generated by data-driven approaches, which ultimately rely on the notion that similar molecules tend to have similar activities. Limitations of such predictions are traced to the small training sets used to build the models (Zhao, 2017), the narrow chemical space covered by these training sets (Stouch et al., 2003), experimental data errors (Fourches et al., 2010), and a lack of prospective experimental validations (Tropsha, 2010). Additionally, the hypothesis that similar compounds will have similar activities could be limited if only based on chemical structure and target activity (Zhang et al., 2017), potentially resulting in inaccurate predictions in the presence of activity cliffs, i.e. pairs of structurally similar compounds with large differences in potency (Stumpfe et al., 2019).
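The similarity principle and its failure at activity cliffs can be illustrated with a minimal sketch. It assumes that compounds are represented as sets of "on" fingerprint bits and compared with the Tanimoto coefficient, a common choice that the text does not prescribe; the fingerprints and pIC50 values below are hypothetical.

```python
from typing import List, Set, Tuple

def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def nearest_neighbor_prediction(
    query: Set[int], training: List[Tuple[Set[int], float]]
) -> Tuple[float, float]:
    """Predict the query's activity as that of its most similar training compound."""
    best_bits, best_activity = max(training, key=lambda item: tanimoto(query, item[0]))
    return best_activity, tanimoto(query, best_bits)

# Hypothetical fingerprints (bit indices) with hypothetical pIC50 values.
training = [
    ({1, 2, 3, 4, 5}, 7.2),
    ({1, 2, 3, 4, 6}, 4.1),  # close analogue of the first compound, far less active
    ({10, 11, 12}, 5.0),
]
query = {1, 2, 3, 4, 5, 7}
predicted, similarity = nearest_neighbor_prediction(query, training)
print(f"predicted pIC50 = {predicted}, nearest-neighbour similarity = {similarity:.2f}")
```

The first two training compounds share most of their bits yet differ by roughly three log units in this toy data, so a prediction for a query that sits between them depends entirely on which neighbour happens to be closer. This is the kind of situation in which similarity-based predictions become unreliable.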

Data-driven drug discovery, and in particular the application of artificial intelligence and machine learning (AI/ML) tools, has been suggested as a promising strategy to model compound effects that cannot be simulated with physics-based methods alone (Schneider et al., 2020; Jayatunga et al., 2022), as well as to devise sophisticated, more robust, and biologically relevant similarity metrics between compounds (Fernández-Torras et al., 2022a). From a practical perspective, conventional AI/ML methods can be considered to be QSAR models, where a set of predefined physicochemical or structural descriptors of the molecules (molecular weight, number of hydrogen bond donors, etc.) are used as predictor variables of an activity of interest (e.g. cellular growth inhibition). Typically, these models require substantial pre-existing experimental knowledge (Baskin 2019), which limits their potential to generate genuinely novel chemistries or be applied to understudied disease areas. By contrast, modern AI/ML algorithms include approaches that can be trained with only a few training samples (Altae-Tran et al., 2017), are self-trained, and/or can learn from multiple datasets simultaneously (Stanley et al., 2021). Such algorithms may provide a viable data-driven solution to operate in low-data regimes. Moreover, AI/ML models for drug discovery can perform tasks beyond bioactivity prediction, including capturing complex ‘omics’ profiles, designing retrosynthesis pathways, and optimizing hits to leads through generative models, among many others (Schneider et al., 2020).

In principle, AI/ML approaches to drug discovery could be applied to any disease area, ranging from non-communicable diseases such as cancer and Alzheimer’s to communicable diseases such as viral and bacterial infections. To this end, access to biological and chemical data is essential (Gupta et al., 2021). Features like structural properties, gene expression levels and/or gene sequencing, subcellular locations and network topological features can be used to identify or predict drug targets (Hu et al., 2019) as well as estimate factors like toxicity, solubility, selectivity, and kinetics (Brown, 2020). At the moment, the majority of AI/ML tools available to the research community have been trained on historical (public) data collected from large chemical and bioactivity databases, as well as ‘omics’ resources and biomedical knowledge bases. Therefore, the availability and performance of AI/ML models are biased, to a great extent, towards disease areas that have traditionally received more attention and for which richer datasets are consequently available. Indeed, infectious disease research is hampered by the lack of validated targets, poor molecular characterization of the pathogens and scarcity of large screening datasets (De Rycker et al., 2018).

The amount of available data for a particular disease area is tightly bound to research investment. The intrinsic cost and risk of investment in drug discovery have caused pharmaceutical companies and research funding agencies to focus on diseases for which incentives are high, i.e. non-communicable diseases that affect the Global North or High-Income Countries (HIC). Currently, only 15% of the drugs in development target infectious diseases (WHO, 2022), effectively neglecting the needs of Low and Lower Middle-Income Countries (LMIC), which carry most of the world’s communicable disease burden. For example, as of 2016, approved antiviral drugs targeted only about 10 of the over 200 viruses known to infect humans (de Clercq and Li, 2016), with several challenges hampering the antiviral drug discovery pipeline, including not only lack of funding but also lack of knowledge of viral biology (Adamson et al., 2021). Likewise, there is a need for novel antibacterial and antifungal therapies (Perfect, 2017; De Rycker et al., 2018). Many LMIC governments are unable to prioritize investment in scientific innovation, with most countries dedicating less than 0.5% of their gross domestic product to research and development activities (UNESCO, 2020). Arguably, AI/ML methods can have the greatest impact in settings where the cost and time to conduct effective experiments remain prohibitive. Paradoxically, though, these methods are not being developed in these settings precisely because pre-existing datasets and incentives are almost nonexistent. In addition, the shortage of skills and training in data science, computer science, chemoinformatics and bioinformatics in LMIC further hampers the development of AI/ML methods in low-resourced countries. As a result, the research inequality that characterizes drug discovery (i.e. greater investment in non-communicable diseases that affect the Global North and poor investment in communicable diseases that affect the Global South) extends to AI/ML research.

In this review article, we discuss existing and potential attempts to reverse these trends with a focus on antiviral drug discovery on the African continent. In particular, we discuss available data sources and their limitations while emphasizing existing African natural products databases, an untapped resource of novel chemical structures. In addition, we describe new models for data sharing and highlight a set of AI/ML-based initiatives to facilitate access to computational tools worldwide. Finally, we present an emerging initiative for a leading drug discovery center based in Central Africa that will capitalize on such computational tools to provide cost-effective drugs against infectious and communicable diseases.

Available data for antiviral drug discovery

Availability of good quality, task-specific data is perhaps the most important requirement for successful AI/ML modeling. Applied antiviral drug discovery involves knowledge of viral protein targets and their ligands, as well as phenotypic response measurements in infected cells. Knowledge of human targets may also be relevant, especially for host-directed therapies and host-pathogen interaction disruption. Generally, publicly available databases of small molecules and their bioactivities and human targets (ChEMBL (Mendez et al., 2019), PubChem (Kim et al., 2022) and DrugBank (Wishart et al., 2018), among others) provide starting points for experimental testing and AI/ML model training. In the context of research performed in LMIC, three specific regions of the chemical space are very interesting: natural product (NP) databases (especially from endemic plant and marine species) (Newman and Cragg, 2020; Ebob et al., 2021), known antiviral catalogs, and approved/advanced experimental drug databases to be used in drug repurposing (Duran-Frigola et al., 2017). Notably, Table 1 presents a summary of the most remarkable databases for NP-based drug discovery, as well as antiviral-oriented databases. In Table 2 we present a selection of drug databases, with potential for drug repurposing, along with target resources.

TABLE 1. Natural products and antivirals databases.

TABLE 2. Selected gene-centric databases for integrative knowledge graphs, with a focus on drugs and drug-target interactions.

As shown in Table 1, there is a growing number of open databases that provide good starting points for antiviral drug discovery, including a rich repertoire of natural products. For example, many of these NPs have shown antiviral potency against SARS-CoV-2 at concentrations less than 10 µM (Ebob et al., 2021).

However, several challenges need to be addressed to streamline these and other datasets in computational drug discovery pipelines (Krallinger et al., 2015; Tetko et al., 2016). First, data redundancy between the different available databases may bias the extraction of information from the databases and the subsequent analysis (Yonchev et al., 2018). Second, poor-quality metadata hampers the interpretation of the available information (Williams et al., 2012; Lamy et al., 2020), and the lack of computer-readable standard formats makes information extraction difficult (Bauer-Mehren et al., 2009). Finally, links to target- and pathogen-centered databases are typically lacking, creating a disconnect between chemistry-centered and biology-centered resources.
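As a rough illustration of the redundancy problem, the sketch below merges records from two sources on a shared structure identifier and reports which compounds appear more than once. The field names, database names and keys are hypothetical; in practice a standardized structure identifier would play the role of `structure_key`, and structure normalization would be far more involved than the simple string cleanup assumed here.

```python
from collections import defaultdict

def merge_records(*databases):
    """Merge compound records from several sources, keyed by a shared structure identifier."""
    merged = defaultdict(lambda: {"sources": set(), "names": set()})
    for source_name, records in databases:
        for record in records:
            # Assumed cleanup: identifiers differ only by case and surrounding whitespace.
            key = record["structure_key"].strip().upper()
            merged[key]["sources"].add(source_name)
            merged[key]["names"].add(record["name"])
    return dict(merged)

# Hypothetical records from two hypothetical databases.
db_a = ("database_A", [
    {"structure_key": "KEY-0001", "name": "compound X"},
    {"structure_key": "KEY-0002", "name": "compound Y"},
])
db_b = ("database_B", [
    {"structure_key": "key-0001 ", "name": "compound X (synonym)"},
    {"structure_key": "KEY-0003", "name": "compound Z"},
])

merged = merge_records(db_a, db_b)
redundant = sorted(key for key, entry in merged.items() if len(entry["sources"]) > 1)
print(f"{len(merged)} unique compounds; redundant across sources: {redundant}")
```

Keeping the list of sources per compound, rather than silently dropping duplicates, makes it possible to check later whether a compound's apparent prominence simply reflects its presence in several databases.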

New models for data sharing

Despite ongoing efforts by the scientific community to collect experimental data on putative anti-infective molecules, the scarcity of publicly available data in areas of interest such as antivirals hinders the development of novel AI/ML tools. An avenue to overcome this limitation is to leverage the knowledge accumulated over the years by pharmaceutical companies. While the discovery of anti-infectives may not have been a top priority for many companies, it is clear that they still hold the majority of data in this domain, sometimes resulting in remarkable initiatives like the GSK Tres Cantos Open Lab or the Drugs for Neglected Diseases Initiative (DNDi). Although pharmaceutical companies often publish their results in scientific publications, they share only a small subset of the molecules screened, understandably to protect their intellectual property (IP). This trend is particularly acute in primary screenings, where hundreds of thousands of compounds may have been tested. Incomplete disclosure of these experiments hampers the full realization of data-driven drug discovery (Mervin et al., 2015). Although large-scale open-source drug discovery initiatives exist (Antonova-Koch et al., 2018), these are comparatively rare and may still face IP constraints when private stakeholders are involved.

AI/ML offers a unique opportunity to exploit drug screening results without disclosing the identity of proprietary chemical libraries. The so-called privacy-preserving AI/ML approach proposes that IP-sensitive data can be effectively made available in the form of AI/ML models, which retain the essential properties of the training data but do not reveal the identity of the compounds used to train the model. A foundational example of this approach is the MELLODDY Consortium (Burki, 2019), which orchestrates data sharing between 10 pharmaceutical companies, thereby compiling the largest collection of compounds and bioactivity endpoints in an IP-protected setting. A key feature of the MELLODDY approach is the decentralization of data, followed by a training scheme of predictive AI/ML models that prevents exposure of proprietary information. AI/ML models developed by the MELLODDY consortium are likely to have a significant impact on the academic scientific community since they capture a formidable amount of data previously owned by pharmaceutical companies (https://www.melloddy.eu/). Similar consortia have been devised in the medical informatics field, with the goal of improving diagnostic AI/ML models by accessing large patient databases while maintaining confidentiality (Warnat-Herresthal et al., 2021). Along these lines, tools for AI/ML model encryption are flourishing, offering a data-sharing toolbox for data scientists operating at the intersection between private and public stakeholders (Graepel et al., 2013). Researchers based in LMIC are expected to be amongst the greatest beneficiaries of new data sharing models since they will gain access to data collected from external sources that would otherwise be inaccessible or unaffordable.
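The idea of training on decentralized data can be sketched with a generic federated-averaging loop. This is an illustration of the general principle only: the text does not describe the actual MELLODDY training scheme, and the one-parameter model, learning rate and data below are hypothetical. Each partner updates the model on its own data and shares only the resulting weight, never the underlying measurements.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """Run a few gradient-descent steps for y = w * x on one partner's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w  # only the weight leaves the partner, not the data

def federated_round(w_global, partners):
    """Average the locally updated weights, weighted by each partner's dataset size."""
    total = sum(len(data) for data in partners)
    return sum(local_update(w_global, data) * len(data) for data in partners) / total

# Hypothetical private datasets (descriptor value, measured activity) held by three partners.
partners = [
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)],
    [(1.5, 3.0), (2.5, 5.1)],
    [(0.5, 1.0), (4.0, 7.9), (3.5, 7.1), (2.0, 4.0)],
]

w = 0.0
for _ in range(30):
    w = federated_round(w, partners)
print(f"shared model weight after 30 rounds: {w:.3f}")
```

Production systems add safeguards that this sketch omits, such as secure aggregation or encryption of the exchanged parameters, because model updates can themselves leak information about the training data.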

Data integration tools for drug discovery

In addition to greater availability of data to cover the gap in antiviral drug discovery, there is a need to design data integration tools that are able to yield amenable inputs for AI/ML modeling. In the context of non-communicable diseases, and especially in the field of anticancer drug discovery, a plethora of data integration protocols have been suggested, with applications in drug repurposing (Luo et al., 2021), virtual phenotypic screening (Sharifi-Noghabi et al., 2021), and target discovery (Rodrigues and Bernardes, 2020), among others. The underlying principle behind all these data integration methods is that data collected from multiple sources can be unified and harmonised in a single resource that can serve as relevant input data for AI/ML modeling. Examples of the necessary sources to build integrative tools include gene-centric databases, disease annotation databases and, especially, chemical-protein interaction data (Table 2). Today, a favorite structure for a unified resource is a so-called biomedical knowledge graph. Early examples of comprehensive knowledge graphs include Hetionet (Himmelstein et al., 2017) and the Harmonizome (Rouillard et al., 2016), where data related to genes/proteins, small molecules, cells, diseases, etc. is centralized in a large network containing thousands of nodes and millions of edges representing ligand-protein interactions, disease-gene associations, gene expression profiles, etc. Modern versions of these biomedical knowledge graphs may contain up to about a hundred million edges (Santos et al., 2022), and are therefore an extraordinarily rich starting point for AI/ML modeling in many disease areas. Moreover, several resources greatly simplify the adaptation of the data contained within these knowledge graphs into vectorial numerical representations that can be plugged into conventional AI/ML algorithms. For example, the Bioteque contains pre-calculated embeddings (i.e. ready-to-use vectorial representations) for thousands of biological entities, capturing the information contained within a gigantic knowledge graph (Fernández-Torras et al., 2022b). With a focus on small molecules, the Chemical Checker (Duran-Frigola et al., 2020) provides an unprecedented amount of standardized and intensively processed data, in the form of numerical vectors, for almost one million bioactive compounds found in the public domain.
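To make the knowledge graph structure concrete, the sketch below stores a toy graph as (subject, relation, object) triples and follows a two-step path from compounds to a disease through shared genes. All entity and relation names are hypothetical and do not reproduce the schema of Hetionet, the Harmonizome or the Bioteque.

```python
from collections import defaultdict

# Hypothetical triples: typed edges between typed nodes.
triples = [
    ("compound:A", "interacts_with", "gene:G1"),
    ("compound:B", "interacts_with", "gene:G2"),
    ("compound:C", "interacts_with", "gene:G1"),
    ("gene:G1", "associated_with", "disease:D1"),
    ("gene:G2", "associated_with", "disease:D2"),
]

def build_index(triples):
    """Index the graph so that (subject, relation) maps to the set of connected objects."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
    return index

def compounds_for_disease(triples, disease):
    """Follow compound -> gene -> disease paths and return compounds that reach the disease."""
    index = build_index(triples)
    compounds = {s for s, r, _ in triples if r == "interacts_with"}
    hits = set()
    for compound in compounds:
        for gene in index[(compound, "interacts_with")]:
            if disease in index[(gene, "associated_with")]:
                hits.add(compound)
    return sorted(hits)

print(compounds_for_disease(triples, "disease:D1"))
```

Graph embedding methods generalize this kind of path-following by summarizing each node's neighbourhood as a numerical vector, which is what makes the information usable by conventional AI/ML algorithms.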

Unfortunately, though, all the major integrative knowledge graphs are acutely human-centric, meaning they mostly contain information about human genes and cells. Systematic integration of pathogen genomes and biology is currently lacking. As a result, infectious disease biology is difficult to capture with existing resources. Although several attempts have been made to map host-pathogen molecular interactions (most notably in the context of the COVID-19 pandemic) (Gordon et al., 2020), the available data is still far from commensurate with non-communicable disease data, especially cancer data, for which a formidable number of genomic and phenotypic screening experiments have been performed. From a methodological standpoint, exploitation of a knowledge graph containing viral or bacterial data would not differ greatly from the already-available approaches suggested by resources like the Bioteque, since graph embedding techniques are relatively domain-agnostic and can be applied to a broad range of data types (Cai et al., 2018). The main challenge lies in the incorporation of pathogen data into the knowledge graph. A better characterisation of pathogen disease biology, including gene functions, metabolic pathways and signaling networks, and a more detailed description of the mechanisms of host-pathogen interactions, are key to achieving a biomedical knowledge graph that represents non-communicable and communicable diseases with equal depth and scope.

Ready-to-use AI/ML

Despite the growing number of AI/ML methods for drug discovery, many of them are either behind a paywall or not accessible in a user-friendly manner. With limited funding and access to data science expertise, this poses a real barrier to adoption by LMIC researchers. In recent years, the concept of ‘model hubs’ has become popular thanks to initiatives such as HuggingFace (Wolf et al., 2020), PyTorch Hub (https://pytorch.org/hub/) or TensorFlow Hub (https://www.tensorflow.org/hub). In short, these platforms provide access to a wealth of ready-to-use AI/ML models, which are transforming the fields of natural language processing and image analysis. The major stakeholders in the AI/ML industry (including tech corporations, academic groups and data science centers) are actively contributing their models to these hubs. As a result, users can run state-of-the-art AI/ML models with minimal effort, which has facilitated the inclusion of AI/ML assets into a broad range of disciplines and real-world applications. Unfortunately, though, the scope of these resources is generalist, with poor representation of computational biology and chemistry in their catalogs. In the biomedical domain, a few open-source initiatives, such as Kipoi (Avsec et al., 2019) and ModelHub.ai (Hosny et al., 2019) aim at disseminating pre-trained AI/ML models specific to certain areas such as genomics or medical image analysis, although a reference resource including a significant amount of drug discovery AI/ML models is still lacking.

In addition to providing out-of-the-box predictions for experimental researchers through model hubs, new resources containing ready-made datasets for AI/ML modeling in drug discovery are an excellent starting point for modeling endeavors. Particularly relevant is the recently published Therapeutics Data Commons (TDC) (Huang et al., 2021), a curated compendium of datasets covering the major stages of drug discovery. TDC works with the concept of leaderboards, so researchers can test their AI/ML algorithms and benchmark them. Other benchmarking resources include MoleculeNet (Wu et al., 2018), MOSES (Polykovskiy et al., 2020), some of the Kaggle (https://kaggle.com) competitions, and the DREAM challenges (https://dreamchallenges.org). Recently, open-source drug discovery initiatives such as Open Source Malaria (Williamson et al., 2016; Tse et al., 2021) and Open-Source Antibiotics (https://github.com/opensourceantibiotics) have organized AI/ML-oriented challenges as part of their experimental cycle, offering a truly collaborative setting for data scientists and experimentalists.

Finally, the AI/ML community has invested significant efforts towards simplifying the model training procedure, facilitating the creation of competent AI/ML models without the need for advanced data science skills. Overall, automated AI/ML (AutoML) methods like AutoGluon (Erickson et al., 2020), AutoSklearn (Feurer et al., 2022), AutoKeras (Jin et al., 2019), FLAML (Wang et al., 2021), and others, are likely to play a key role in the adoption of AI/ML modeling capacity, freeing the user from algorithm selection and hyperparameter search and optimization. In low-resourced settings where data science skills are typically scarce, AutoML functionalities can offer out-of-the-box solutions with competitive performance. A few attempts have been made to provide AutoML functionality for drug discovery (Shen et al., 2021), although the bulk of the existing AI/ML research in the field is still the result of highly specialized work. Greater availability of such AutoML tools is necessary to ensure the prompt incorporation of AI/ML in the drug discovery cycle, without the need to externalize the model creation step.

Biological assays for generating AI/ML models and functional validation of AI/ML predictions

The flip side of drug development in LMICs includes the challenge of functionally validating predictions generated in virtual settings. While AI/ML-based methods can both reduce and prioritize the number of leads that need to be validated, assays that can incorporate functional testing with high-throughput remain necessary. NP and drug repurposing collections, as exemplified in Tables 1 and 2, as well as ‘pathogen boxes’ distributed by initiatives like Medicines for Malaria Venture (MMV; https://mmv.org) may provide the necessary chemical matter to perform these experiments in LMICs, coupled with the development at a relatively limited throughput of chemical series in local synthetic chemistry laboratories.

To help address these challenges in the context of antiviral therapeutics, our group has developed new assays and leveraged existing ones that can be transferred to laboratories in LMIC for independent research. For example, publicly available cell lines such as J-Lat T cells (Jordan et al., 2003), which contain an inducible but non-infectious HIV clone encoding a GFP reporter, can be probed to monitor the effects of chemical leads on HIV latency reversal or suppression of HIV provirus transcription (Tietjen et al., 2018; Divsalar et al., 2020). If local propagation of live virus is available, infection-based assays that use publicly available, lab-adapted subtype B (Adachi et al., 1986) and subtype C (Ndung’u et al., 2000) HIV strains become possible in replication-competent cell lines or locally acquired peripheral blood mononuclear cells (Leteane et al., 2012; Tietjen et al., 2015). If expression of a protein target of interest in trans affects cell viability, another attractive option is the yeast growth restoration assay (Balgi and Roberge, 2009), where a multicopy DNA plasmid encoding the protein target of interest is placed under the control of an inducible GAL1 promoter. When expressed in yeast in the presence of galactose, the protein target inhibits yeast growth over time, as measured by culture turbidity; growth can in turn be restored by co-incubation with chemical leads that inhibit the target. This approach, for example, allowed us to validate new inhibitors of the influenza A M2 viroporin that were initially found by virtual screening approaches (Duncan et al., 2020). If disruption of protein-protein interactions is desired, another emerging but attractive option is the use of AlphaScreen or homogeneous time-resolved fluorescence (HTRF)-based methods, where tagged proteins of interest are bound to respective donor and acceptor beads. When a binding event occurs in vitro, luminescence or fluorescence is produced, which in turn can be inhibited by binding inhibitors (Yasgar et al., 2016). We used such approaches, for example, to identify natural products that block interactions of the SARS-CoV-2 spike glycoprotein with its host ACE2 entry receptor (Tietjen et al., 2021; Invernizzi et al., 2022). Chemical leads can also be readily assessed for effects on cell viability or toxicity using colorimetric reagents like 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) (Leteane et al., 2012). If viral infection results in extensive cytopathic effects and reduced cell viability in vitro, these reagents can also be used to monitor viral infection and restoration of cell viability by viral inhibitors (Tietjen et al., 2021). Assays like these are also amenable to being scaled up to 96-well format for improved screening throughput across NP or other chemical libraries as well as hits prioritized by AI/ML methods. While these assays do require a level of cell culture and molecular biology infrastructure, luminescence or fluorescence plate readers, and ideally access to flow cytometry, costs for these types of equipment are falling quickly. Universities with synthetic or medicinal chemistry expertise will also be at an advantage to develop their chemical leads even with relatively straightforward synthesis strategies.
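Plate-reader output from colorimetric viability assays has to be normalized before it can feed an AI/ML dataset. A minimal sketch follows, assuming a commonly used normalization against untreated and blank control wells; the formula, well names and absorbance values are assumptions for illustration and are not taken from the protocols cited above.

```python
def percent_viability(sample, untreated, blank):
    """Express a well's absorbance as a percentage of the untreated control, after blank subtraction."""
    window = untreated - blank
    if window <= 0:
        raise ValueError("untreated control must read higher than the blank")
    return 100.0 * (sample - blank) / window

# Hypothetical absorbance readings from one plate.
untreated_control = 1.05  # cells without compound
blank_control = 0.05      # medium and reagent only
plate = {"lead_1": 0.85, "lead_2": 0.25}

viability = {
    name: percent_viability(absorbance, untreated_control, blank_control)
    for name, absorbance in plate.items()
}
for name, value in viability.items():
    print(f"{name}: {value:.0f}% viability")
```

Recording the control readings alongside each normalized value keeps results from different plates and laboratories comparable, which matters when measurements are later pooled for model training.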

However, challenges in many LMIC include ensuring that proper scientific expertise for AI/ML methods or biological assays is perpetuated in local universities and that required infrastructure is optimally maintained. One potential option toward addressing these challenges is to introduce a series of recurring, intensive, and hands-on workforce development laboratory training and instruction sessions, akin to the Wistar Institute’s Biomedical Technician Training Program (https://wistar.org/education-training/biomedical-technician-training-program), designed to train promising students from underserved or related communities to become research technicians that can readily meet the employment needs of local academic institutions and health science industries. Similar programs can be run in LMIC once adapted to train students in computational techniques. Alternatively, equipment technicians from HIC can be involved with these programs to not only train students on instrument use and maintenance but also repair and certify local equipment. This change of paradigm in scientific collaborations between HIC and LMIC, where committed knowledge sharing and capacity building are embedded throughout the project design, is essential to sustainably and permanently increase meaningful research capacity in LMIC. This commitment to develop capacity in LMIC is distinct from “helicopter research,” where scientists from HIC liaise with collaborators in LMIC to merely coordinate data collection or extract local resources.

Building local capacity in AI/ML for antiviral drug discovery

Consistent with objectives discussed above, the University of Buea in Cameroon is initiating a Center for Drug Discovery (UB-CeDD) focused on multiple drug discovery pipelines including the discovery of novel plant-based antivirals (Figure 1), among others. The establishment of an integrative center for drug discovery in Central Africa is key to developing the health research and development in the region, akin to what has been successfully demonstrated by the H3D Centre in Southern Africa (Winks et al., 2022). The overall goal of the UB-CeDD is to discover novel antiviral compounds based on NP core structures. Initial antiviral targets of interest include proteins from human immunodeficiency virus (HIV) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), although other targets are intended to be pursued. The UB-CeDD will combine and implement a virtual screening procedure that couples AI/ML models and physics-based methods like molecular docking and molecular dynamics simulations. Primary hits will be identified by machine learning, and these will then be docked, with the docked poses scored using several protein-ligand scoring algorithms. The goal is to develop a cloud-based virtual screening platform that permits compounds to be screened computationally from the African Natural Products Database and others (ANPDB, Ntie-Kang et al., 2017; Simoben et al., 2020). To develop efficient AI/ML models, we will generate a well-curated dataset of compounds that have been tested in antiviral assays within the same laboratory conditions. Since such data are currently scarce, we are screening several hundred natural and synthesized compounds from collaborative partner laboratories through the Nature-inspired Discovery of Novel Antivirals (NiDNA) network. 
The compounds are being screened, for example, for their ability to inhibit vital SARS-CoV-2 drug targets such as the main protease and the binding of the viral spike protein to angiotensin-converting enzyme 2 (ACE2), and for their potential to reverse latency in HIV-infected cells. Importantly, these assays are transferable to the LMIC laboratories involved in the collaboration. The more compounds are tested in the assays, the more robust the resulting AI/ML models will be. Within an LMIC like Cameroon, the generated models will also go a long way toward training graduate students and postdoctoral researchers in how to implement AI/ML in an academic setting. This will speed up the search for antiviral lead compounds contained in plant biodata and for synthesized leads based on pharmacophores contained in NPs, and will eventually guide the synthesis of novel analogues with high potency and without potential toxicity. Some web tools that could potentially be used for developing ML models are summarized in Table 3.
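The two-stage procedure described above (ML-based selection of primary hits, followed by docking and scoring with several scoring functions) can be sketched as follows. This is a minimal illustration and not the UB-CeDD implementation: the compound identifiers, the predicted probabilities, the docking scores, the 0.5 threshold, and the choice of rank averaging as the consensus scheme are all assumptions made for the example.

```python
from statistics import mean


def select_primary_hits(ml_scores, threshold=0.5):
    """Keep compounds whose predicted activity probability meets the threshold."""
    return [cid for cid, prob in ml_scores.items() if prob >= threshold]


def rank_positions(scores, lower_is_better=True):
    """Map compound id -> rank (1 = best) under a single scoring function."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {cid: pos for pos, cid in enumerate(ordered, start=1)}


def consensus_rank(docking_scores, hits):
    """Order hits by their mean rank across all scoring functions."""
    per_function = []
    for scores in docking_scores.values():
        subset = {cid: scores[cid] for cid in hits}
        per_function.append(rank_positions(subset))
    mean_rank = {cid: mean(r[cid] for r in per_function) for cid in hits}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(hits, key=lambda cid: (mean_rank[cid], cid))


# Hypothetical ML-predicted activity probabilities (stage 1).
ml_scores = {"cpd_A": 0.91, "cpd_B": 0.35, "cpd_C": 0.78, "cpd_D": 0.66}

# Hypothetical docking scores from three scoring functions (stage 2);
# more negative values are assumed to indicate better predicted binding.
docking_scores = {
    "score_fn_1": {"cpd_A": -9.1, "cpd_C": -7.4, "cpd_D": -8.2},
    "score_fn_2": {"cpd_A": -8.0, "cpd_C": -8.5, "cpd_D": -8.3},
    "score_fn_3": {"cpd_A": -10.2, "cpd_C": -6.9, "cpd_D": -9.0},
}

hits = select_primary_hits(ml_scores, threshold=0.5)
ranked = consensus_rank(docking_scores, hits)
print(ranked)
```

Averaging ranks rather than raw scores is one simple way to combine scoring functions that report values on different scales; other consensus schemes could be substituted without changing the overall structure.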

FIGURE 1. Schematic representation of the three main areas being developed at the Center for Drug Discovery at the University of Buea (UB-CeDD), Cameroon, to achieve a sustainable antiviral drug discovery pipeline in Central Africa.

TABLE 3. A short and illustrative list of readily available online AI/ML tools covering several stages of the drug discovery process. The list is not comprehensive; see resources such as the Ersilia Model Hub (https://ersilia.io/model-hub) for a larger compendium.

Conclusion

In this review article, we have discussed the current opportunities to apply AI/ML technologies in underserved research settings. We have focused on the discovery of antiviral drugs, an underserved therapeutic area of great importance in LMIC. Building ML models and using AI to predict the biological activities of drug candidates requires data, in particular chemical structures with known biological activities, which are often available in molecule databases. Such data, including those available in open access platforms, can be used to train a broad array of ML models to make predictions. Databases of known drug targets for NPs have also been included in this survey. There are also ready-to-use models and web-based tools that only require users to supply their own data, generated from in-house chemical libraries or obtained through partnerships with pharmaceutical companies. We have focused on compound libraries and ML tools that could be used to build predictive tools for antiviral lead compound discovery within economically limited settings such as academic institutions in LMICs. We argue that AI/ML can offer a cost-effective solution, although better access to viral assay data and better data integration protocols will be needed for effective adoption of AI/ML tools. We have also described some antiviral assays that we are conducting, or plan to conduct, in partner laboratories to generate data for ML predictions. We propose that a fluid research cycle involving data collection, computational prediction, and experimental testing can be implemented in-country, and we propose the emerging UB-CeDD in Buea as an exemplary case for Western and Central Africa.

Author contributions

Conception: MD-F, FN-K, SE, and IT; Generation of preliminary data: IT, CTN-N, CVS, FN-K, LJM, GT, SE, and MD-F; Writing of the first draft: CVS, CTN-N, GT, IT, MD-F, and FN-K; Editing and approval of the final version: CVS, CTN-N, GT, IT, LJM, SE, MD-F, and FN-K.

Funding

Financial support is acknowledged from the Bill & Melinda Gates Foundation through the Calestous Juma Science Leadership Fellowship awarded to FN-K (award number: INV-036848). LJM and IT are supported by the Robert I. Jacobs Fund of The Philadelphia Foundation; LJM is supported by the Herbert Kean, M.D., Family Professorship.

Acknowledgments

The authors acknowledge Kelly Chibale and Wolfgang Sippl for fruitful scientific discussions.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Adachi, A., Gendelman, H. E., Koenig, S., Folks, T., Willey, R., Rabson, A., et al. (1986). Production of acquired immunodeficiency syndrome-associated retrovirus in human and nonhuman cells transfected with an infectious molecular clone. J. Virol. 59 (2), 284–291. doi:10.1128/JVI.59.2.284-291.1986

Adamson, C. S., Chibale, K., Goss, R. J., Jaspars, M., Newman, D. J., and Dorrington, R. A. (2021). Antiviral drug discovery: Preparing for the next pandemic. Chem. Soc. Rev. 50 (6), 3647–3655. doi:10.1039/d0cs01118e

Ahmed, A., Smith, R. D., Clark, J. J., Dunbar, J. B., and Carlson, H. A. (2015). Recent improvements to Binding MOAD: A resource for protein–ligand binding affinities and structures. Nucleic Acids Res. 43 (1), D465–D469. doi:10.1093/nar/gku1088

Altae-Tran, H., Ramsundar, B., Pappu, A. S., and Pande, V. (2017). Low data drug discovery with one-shot learning. ACS Cent. Sci. 3 (4), 283–293. doi:10.1021/acscentsci.6b00367

Antonova-Koch, Y., Meister, S., Abraham, M., Luth, M. R., Ottilie, S., Lukens, A. K., et al. (2018). Open-source discovery of chemical leads for next-generation chemoprotective antimalarials. Science 362 (6419), eaat9446. doi:10.1126/science.aat9446

Avsec, Ž., Kreuzhuber, R., Israeli, J., Xu, N., Cheng, J., Shrikumar, A., et al. (2019). The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37 (6), 592–600. doi:10.1038/s41587-019-0140-0

Balgi, A. D., and Roberge, M. (2009). Screening for chemical inhibitors of heterologous proteins expressed in yeast using a simple growth-restoration assay. Methods Mol. Biol. 486, 125–137. doi:10.1007/978-1-60327-545-3_9

Banerjee, P., Erehman, J., Gohlke, B. O., Wilhelm, T., Preissner, R., and Dunkel, M. (2015). Super Natural II—A database of natural products. Nucleic Acids Res. 43 (1), D935–D939. doi:10.1093/nar/gku886

Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., et al. (2012). NCBI GEO: Archive for functional genomics data sets—update. Nucleic Acids Res. 41 (1), D991–D995. doi:10.1093/nar/gks1193

Baskin, I. I. (2019). Is one-shot learning a viable option in drug discovery? Expert Opin. Drug Discov. 14 (7), 601–603. doi:10.1080/17460441.2019.1593368

Bauer‐Mehren, A., Furlong, L. I., and Sanz, F. (2009). Pathway databases and tools for their exploitation: Benefits, current limitations and challenges. Mol. Syst. Biol. 5 (1), 290. doi:10.1038/msb.2009.47

Beresford, A. P., Segall, M., and Tarbit, M. H. (2004). In silico prediction of ADME properties: Are we making progress? Curr. Opin. Drug Discov. Devel. 7 (1), 36–42.

Brown N. (Editor) (2020). Artificial intelligence in drug discovery (London, United Kingdom: Royal Society of Chemistry), 75.

Burki, T. (2019). Pharma blockchains AI for drug development. Lancet 393 (10189), 2382. doi:10.1016/S0140-6736(19)31401-1

Burley, S. K., Bhikadiya, C., Bi, C., Bittrich, S., Chen, L., Crichlow, G. V., et al. (2022). RCSB Protein Data Bank: Celebrating 50 years of the PDB with new tools for understanding and visualizing biological macromolecules in 3D. Protein Sci. 31, 187–208. doi:10.1002/pro.4213

Cai, H., Zheng, V. W., and Chang, K. C. C. (2018). A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30 (9), 1616–1637. doi:10.1109/TKDE.2018.2807452
