A machine learning-based strategy to elucidate the identification of antibiotic resistance in bacteria

Parthasarathi, K. T. Shreya; Gaikwad, Kiran Bharat; Rajesh, Shruthy; Rana, Shweta; Pandey, Akhilesh; Singh, Harpreet; Sharma, Jyoti

doi:10.3389/frabi.2024.1405296

ORIGINAL RESEARCH article

Front. Antibiot., 18 June 2024

Sec. Antibiotic Resistance

Volume 3 - 2024 | https://doi.org/10.3389/frabi.2024.1405296

A machine learning-based strategy to elucidate the identification of antibiotic resistance in bacteria

K. T. Shreya Parthasarathi^1,2

Kiran Bharat Gaikwad^1,2

Shruthy Rajesh²

Shweta Rana³

Akhilesh Pandey^4,5

Harpreet Singh³

Jyoti Sharma^1,2*

¹Manipal Academy of Higher Education (MAHE), Manipal, Karnataka, India
²Institute of Bioinformatics, Bangalore, India
³Division of Biomedical Informatics, Indian Council of Medical Research, New Delhi, India
⁴Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, United States
⁵Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States

Microorganisms, crucial for environmental equilibrium, could be destructive, resulting in detrimental pathophysiology to the human host. Moreover, with the emergence of antibiotic resistance (ABR), the microbial communities pose the century’s largest public health challenges in terms of effective treatment strategies. Furthermore, given the large diversity and number of known bacterial strains, describing treatment choices for infected patients using experimental methodologies is time-consuming. An alternative technique, gaining popularity as sequencing prices fall and technology advances, is to use bacterial genotype rather than phenotype to determine ABR. Complementing machine learning into clinical practice provides a data-driven platform for categorization and interpretation of bacterial datasets. In the present study, k-mers were generated from nucleotide sequences of pathogenic bacteria resistant to antibiotics. Subsequently, they were clustered into groups of bacteria sharing similar genomic features using the Affinity propagation algorithm with a Silhouette coefficient of 0.82. Thereafter, a prediction model based on Random Forest algorithm was developed to explore the prediction capability of the k-mers. It yielded an overall specificity of 0.99 and a sensitivity of 0.98. Additionally, the genes and ABR drivers related to the k-mers were identified to explore their biological relevance. Furthermore, a multilayer perceptron model with a hamming loss of 0.05 was built to classify the bacterial strains into resistant and non-resistant strains against various antibiotics. Segregating pathogenic bacteria based on genomic similarities could be a valuable approach for assessing the severity of diseases caused by new bacterial strains. Utilization of this strategy could aid in enhancing our understanding of ABR patterns, paving the way for more informed and effective treatment options.

1 Introduction

Microorganisms/microbes are the oldest known life forms on Earth, dating back to approximately 3.42 billion years (Schopf et al., 2018). As the support system of the biosphere, these ubiquitous organisms are paramount for the survival of more complex organisms. They are involved in various intricate interactions including breakdown of biological components, food spoilage, climate change, and operation of basic metabolic cycles in plants (Omkar Khade, 2024). In addition to exercising these functions, several microorganisms have been reported as potential candidates for causing detrimental effects on other life forms. Such microbes that cause harm to the host form the class of pathogenic microorganisms. Salmonellosis, listeriosis, campylobacteriosis, yersiniosis, tuberculosis, gonorrhea, and syphilis are some of the life-threatening infections in humans caused by pathogenic microorganisms. In addition to the number of increasing infections by these microbes, another threat known as antibacterial resistance (ABR) has now taken a global turnover exhibiting the possibility of a future pandemic. A number of bacterial species have been identified as resistant to the available antibiotics that pose a threat to humanity in the near future (Ventola, 2015).

Several ecologists have come up with a broad spectrum of molecular techniques to investigate microbial communities (Davey and O’toole, 2000; Douterelo et al., 2014; Agrawal et al., 2015; Braga et al., 2016). These techniques aided not only in understanding the diversity among microbes but also in the characterization and selection of treatment strategies to overcome diseases caused by the pathogenic forms. With the extensive diversity and considerable number of known strains, characterization based on experimental techniques makes it expensive, labor-intensive, and time-consuming (Nemati et al., 2016; Qu et al., 2019). This reduces the potential for meta-analysis. Owing to the enormous amounts of data collected, microbiology has now emerged into a field with big data competencies (Falony et al., 2015; Kyrpides et al., 2016; Goodswen et al., 2021). Utilization of machine learning (ML) techniques for analysis of data has become a proven strategy in acquiring insights about microorganisms (Aida et al., 2022; Jiang et al., 2022; Munjal et al., 2022; Wu and Gadsden, 2023). Comprehensive studies on drug target prediction, drug resistance against antimicrobial drugs, prediction of disease outbreaks, and exploration of microbial–host interactions are now being carried out using ML techniques (Cazer et al., 2021; Kim and Ahn, 2021; Salim et al., 2021; Sudhakar et al., 2021; Kuang et al., 2022; Joshi et al., 2024). K-mer analysis and deep learning have been previously carried out to identify 16S short-read sequences from amplicon and shotgun data (Fiannaca et al., 2018). The tool MARVEL based on Random Forest algorithm aided in the prediction of dsDNA bacteriophage sequences from metagenomic studies (Amgarten et al., 2018).

Genetic programming, Random Forest, and logistic regression were previously used for the classification of microbes associated with bacterial vaginosis (Beck and Foster, 2014). Recently, another study showcased a new approach to analyze microbial–disease association through integration of multiple data sources (Fan et al., 2019). Similarly, a method for the diagnosis of malarial parasite(s) and for gastrointestinal parasite diagnosis was developed through binary image classification using convolutional neural network (Rajaraman et al., 2018; Mathison et al., 2020). Utilization of an support vector machine-based model for the prediction of secretory proteins from malarial parasites using amino acid compositions was another study that introduced ML in microbiology (Verma et al., 2008). Prediction of parasite load in the absence of quantitative polymerase chain reaction trained on clinical records of Leishmania infantum-infected dogs also indicated the application of ML in microbiology (Torrecilha et al., 2017). Certain studies have also investigated antimicrobial resistance (AMR) using ML-based approaches and have developed methods to classify genomes into resistant and susceptible against specific antibiotics (Drouin et al., 2016; Naidenov et al., 2019; Hyun et al., 2020). A study also presented the mapping of Acinetobacter baumannii, Streptococcus pneumoniae, Staphylococcus aureus, and Mycobacterium tuberculosis into three classes: susceptible, intermediate, and resistant (Davis et al., 2016). Similar studies by numerous research groups led to the development of methods for prediction of minimum inhibitory concentration (MIC) (Nguyen et al., 2018; Mujeeb et al., 2020; Umar et al., 2020; Valizadehaslani et al., 2020; Khan et al., 2021).

The current study employed ML-based algorithms and nucleotide sequences of pathogenic bacteria with humans as host and resistant to known antibiotics for clustering into groups of microorganisms sharing similar genomic features. Thereafter, a prediction model was developed to predict the cluster that is closest to the organism in question. Furthermore, the study includes the development of a multi-label classifier capable of predicting the antibiotic that the organism is resistant to, based on the cluster information. The clustering model was evaluated using the Silhouette coefficient, the Calinski–Harabasz index, and the Davies–Bouldin index (Caliński and Harabasz, 1974; David and Davies, 1979; Rousseeuw, 1987). The prediction models were evaluated on the basis of sensitivity, specificity, and 5-fold cross-validation (CV) accuracy. Although microbial infections involve the interplay of several molecular features, the pathogenic features corresponding to a certain pathogen remain unique to that pathogen (Voter et al., 2020; Liu et al., 2021; Parthasarathi et al., 2021). Here, the genes corresponding to the features selected in the prediction model were also identified that shed light on the biological importance of the features in distinguishing one strain from another. This study would aid in coming up with improved strategies for the segregation of pathogenic bacteria. Furthermore, based on genomic similarities and differences with other well-studied microorganisms, it may aid in assessing the severity of the disease produced by the bacterium. Furthermore, incorporating ML-based algorithms into clinical practice not only is viable, reproducible, and resilient, but also aids in the production of clinician-friendly outcomes. Overall, the computational prediction analyses directed the benefit of ML in clustering pathogenic bacterial forms, which may aid in the development of better strategies to improve treatment options.

2 Materials and methodology

The workflow of the study is depicted in Figure 1.

Figure 1

Figure 1 Depiction of workflow for building a prediction model for the identification of pathogenic organisms sharing similar characteristics.

2.1 Data collection

A list of pathogenic bacteria resistant to antibiotics was obtained from the Pathosystems Resource Integration Center (PATRIC) database (Gillespie et al., 2011). The complete genomes for the microbes were downloaded from National Center for Biotechnology Information’s (NCBI) RefSeq database (Tatusova et al., 2016) (Supplementary File).

2.2 Generation of k-mer matrix from nucleotide sequences

Customized Python scripts were used to fragment the genomes into k-mers of length 8, 10, 12, and 14 nucleotides on a subset of bacterial sequences. The list of k-mers was filtered to remove duplicates. Thereafter, k-mers were mapped to the sequences, and a matrix containing the details on the presence and absence of the k-mers in each pathogenic strain was generated. R (v3.4.4) libraries seqinr and Biostrings were used to generate the matrix (Charif and Lobry, 2007; Pagès et al.). The clustering was performed on the subset for each k-mer length using the methodology as mentioned in Section 2.3. Thereafter, the optimum k-mer length for further processing was defined by taking into consideration the number of k-mers obtained, the amount of time required to process the k-mers, the intermediate file sizes, and the goodness of clustering.

2.3 Segregation of pathogenic microbes

The distance between the species was calculated using Jaccard distance matrices. Thereafter, the values in the distance matrix were scaled and used as input to perform principal component analysis (PCA) (Prokopenko et al., 2016). Principal component 1 (PC1) was used as input to segregate the organism strains into different clusters using the unsupervised machine learning algorithm—Affinity propagation (Frey and Dueck, 2007). This allowed clustering of organism strains based on their genomic sequence similarities. Thereafter, the Silhouette coefficient, the Calinski–Harabasz index, and the Davies–Bouldin index were used to calculate the goodness of clustering obtained. The Silhouette coefficient ranges from −1 to +1 and a value close to +1 indicates a better-defined cluster. Higher values of the Calinski–Harabasz index indicate better separation between clusters while a lower Davies–Bouldin value corresponds to better separation between clusters. The scikit-learn library from Python v3.10 was used to segregate the organisms into different clusters (Fabian Pedregosa). Seaborn and matplotlib libraries were used for graphical visualizations (Hunter, 2007; Waskom, 2021).

2.4 Development of prediction model to predict the cluster of an organism

The clusters formed using the Affinity propagation algorithm were further used as class labels in supervised Random Forest algorithm to develop a prediction model. The binary matrix with the information on the presence and absence of k-mers along with the clusters was given as input to the Random Forest algorithm. The dataset was split into a train and test set in the ratio 80:20. Feature selection was performed using the Random Forest algorithm, and the k-mers (features) with a score > 0.0001 were selected as most informative k-mers. The train set was further divided into a train and a validation set and the hyperparameters were tuned on the validation set using Python v3.10 library GridSearchCV (Lavalle et al., 2004). The model was evaluated based on sensitivity and specificity. The model was built using customized scripts written in Python v3.10 using the scikit-learn library. The model was then saved using the joblib library (Varoquaux, 2023).

2.5 Determining the biological significance of the most informative k-mers in cluster prediction

The k-mers selected in the Random Forest model were further analyzed to determine their biological significance. Standalone Basic Local Alignment Search Tool (BLAST) (v2.15.0) was used to align the k-mers with the customized database generated using the gene sequences from various reference genomes of bacterial strains downloaded from the Database of Essential Genes (DEG) (Luo et al., 2021) (last update, 2020 September 1) (Camacho et al., 2009). The alignments with 100% identity using the “blastn-short” parameter of Standalone BLAST were saved (Camacho et al., 2009). The list of genes obtained after alignment was compared with the list of known AMR genes obtained from the Comprehensive Antibiotic Resistance Database (CARD) (Alcock et al., 2023).

2.6 Development of prediction model of microbes resistant towards antibiotics

Furthermore, the data collected from the PATRIC database on resistance of a strain towards different antibiotics were used along with the clustering output to develop a multi-label prediction model for predicting the putative antibiotics that may not be useful in treatment against a specific bacterial strain. Given a nucleotide sequence, the model calculated several genomic features including GC content and mononucleotide counts of each of the strains. This information was merged with the clustering output, and a binary matrix with details on whether an organism strain is resistant to a specific antibiotic was generated. A multi-label classification model was built assuming that resistance against each antibiotic was independent of the fact that a strain is resistant to the other antibiotics. A multi-layer perceptron (MLP) with the calculated features representing the input layers, multiple hidden layers, and an output layer representing the resistance to antibiotics was generated. The Rectified Linear Unit (ReLU) activation function was applied for the hidden layers and the binary cross-entropy loss and the Adam version of stochastic gradient descent method was implemented for weight updation (Fukushima, 1975; Diederik and Kingma, 2017). The sigmoid activation function was implemented for the output layers. The model was built using customized scripts written in Python v3.10 utilizing the keras and scikit-learn libraries (Chollet, 2015). 5-fold CV with hamming loss as the accuracy measure was used to grade the performance of the prediction model. Hamming loss evaluates individual label prediction rather than label combination. A lower hamming loss would thus indicate a better model.

3 Results

3.1 Collection of data

A list of 710 strains from seven genera of pathogenic bacterial strains resistant to 63 antibiotics was obtained from the PATRIC database. The bacterial complete genome sequences were downloaded from the NCBI RefSeq database. Table 1 summarizes the number of strains included in the present study.

Table 1

Table 1 Total number of strains from different genera included in the study.

3.2 k-mer matrix generation

The k-mers of length 10 nucleotides (10-mers) were selected as the optimum size of k-mers (Supplementary File). A total of 2,136,154 10-mers were obtained from the 710 strains. Filtering out redundant and 39,032 10-mers that contained nucleotides other than adenine, guanine, thymine, and cytosine yielded 1,048,573 unique 10-mers. A binary matrix with the dimensions 710:1,048,573 was obtained, consisting of rows representing strains, columns representing 10-mers, and cells with binary values signifying the presence or absence of the 10-mer in the individual strain.

3.3 Clustering pathogenic bacteria

The distance between the pathogenic bacterial strains on the basis of the presence and absence of 10-mers was calculated using Jaccard distance matrices (Figure 2). The bacterial strains were clustered into seven clusters. A Silhouette coefficient of 0.82, a Calinski–Harabasz index of 93,672.21, and a Davies–Bouldin index of 0.19 were obtained (Supplementary Table 1; Figure 3).

Figure 2

Figure 2 Heatmap representing the distance between the different bacterial strains calculated using Jaccard distance matrix with binary matrix of 10-mers as input.

Figure 3

Figure 3 Bar plot representing the number of strains clustered in strains from various genera.

3.4 Supervised ML model from the clusters obtained

The dataset with information on the presence and absence of 10-mers as features and the clusters as class labels was split into a train and a test set in the ratio 80:20. This accounted for 568 data points in the train set and 142 data points in the test set. Thereafter, the train set was further split into a train and a validation set in the ratio 80:20, resulting in 454 data points in the train set and 114 data points in the validation set. A total of 876 10-mers were selected as the most informative 10-mers using the Random Forest algorithm for feature selection. A Random Forest model with a maximum depth of 6, minimum samples leaves set to 2, minimum samples split set to 5, number of estimators set to 100, and criterion set as Gini was generated. 5-fold CV on the validation set using the set parameters resulted in an accuracy of 96.49%. Testing the model on the test set resulted in an overall sensitivity of 0.98 and a specificity of 0.99. Table 2 mentions the individual class sensitivity and specificity.

Table 2

Table 2 Sensitivity and specificity of individual cluster prediction.

3.5 Biological significance of 876 10-mers

All the 876 10-mers mapped to 26,058 entries from DEG that corresponded to 81 bacterial strains, 5,179 unique genes, and 8,752 unique proteins including putative and hypothetical proteins (Supplementary Table 2). Table 3 summarizes the number of 10-mers mapped to the strains from species included in the current study.

Table 3

Table 3 Number of 10-mers mapped to reference species included in the current study.

A total of 703 10-mers mapped to 30 genes known to cause AMR from CARD. Of those, 448 10-mers belonged to the organisms considered in the current study and mapped to 22 AMR genes. The 10-mers corresponding to Escherichia coli mapped to 13 AMR-related genes, namely, emrK, emrY, evgA, evgS, gadX, kdpE, marA, mdtA, mgrB, msbA, pgpB, rpsJ, and srmB. Certain 10-mers mapped to AMR-related genes in only one species, namely, efpA, mgtA, and mtrA mapped exclusively to M. tuberculosis-related 10-mers, and acrB mapped exclusively to Salmonella enterica-related 10-mers. Similarly, there were AMR-related genes that mapped exclusively to E.coli, Pseudomonas aeruginosa-, and S. pneumoniae-related 10-mers (Supplementary Table 3). Figure 4 summarizes the number of 10-mers mapped to individual AMR genes in specific species.

Figure 4

Figure 4 Scatter plot indicating the number of 10-mers mapped to ABR drivers in strains of different genera.

3.6 Prediction model for identifying putative antibiotic resistance in bacterial strain

The 710 bacterial strains used in the study were resistant towards 63 different antibiotics. Figure 5 summarizes the top 10 antibiotics found most commonly resistant among different strains in the present study. Supplementary Table 4 summarizes the list of bacterial strains resistant to the 63 antibiotics.

Figure 5

Figure 5 Funnel plot representing the top 10 antibiotics identified to be resistant by the bacterial strains.

The length of the different strains ranged from 391,326 base pairs (bps) to 7,267,567 bps with an average length of 4,017,469 bps. The strains belonging to P. aeruginosa species had the largest genomes along with a high GC content. However, the strains belonging to E. coli had high “A” and “T” mononucleotide counts (Supplementary Table 5). The MLP model with 20 hidden layers was developed. The hamming loss calculated using repeated 5-fold CV was 0.05, indicating the prediction to be false 5% of times.

4 Discussion

Advances in the processing capacity, improvements in the classical data processing algorithms, and the availability of bacterial whole genome sequences (WGS) in public databases allow for a retrospective population study of many bacterial populations. Identifying patterns in the genomic sequences resulting in mosaic structures poses challenges in comprehending and visualizing the diversity and similarities within and across various bacterial strains. However, increasing interest in the quantitative techniques to predict phenotypes from genotypes beginning with bacterial WGS are becoming popular. The pathogenicity and ABR could be the key phenotypes for predicting clinical outcomes and estimating possible treatment options.

The present study emphasizes on the utilization of ML-based techniques to examine the relatedness in the different bacterial strains. Clustering analysis was performed to segregate the pathogenic forms based on their genomic similarities. Various species undergo horizontal gene transfer in the evolution process to increase their chances of survival (Burmeister, 2015). The most evident advantage of horizontal gene transfer is that a cell can acquire a beneficial gene that originated in another cell. The emergence of new beneficial genes is likely extremely rare; therefore, stealing a gene from a neighbor should be considerably faster than waiting for it to evolve independently (Vogan and Higgs, 2011). Moreover, it would also allow a cell to reclaim a gene that had been lost by another member of the population (Vogan and Higgs, 2011). Horizontal gene transfer can also acquire beneficial features that aid adaptation to new environments, such as metabolic and antibiotic resistance genes (Hall et al., 2020). This enables organisms to become interdependent, ensuring cooperation in preserving their relationship (Hall et al., 2020). This phenomenon could be visualized in the present study as the strains from Escherichia, Salmonella, Mycobacterium, and Pseudomonas did not form single clusters. The strains belonging to E. coli and S. enterica clustered together, indicating the strains within these species to share similarities. Among the 10-mers selected, 854 10-mers mapped to Salmonella strains while 876 k-mers mapped to Escherichia strains. All the 10-mers belonging to Salmonella overlapped with the k-mers from Escherichia. A total of 342 genes identified based on the selected 10-mers were common among the two organisms (Supplementary Table 2). The two species are a part of the same family—Enterobacteriaceae. According to evolutionary rate estimates derived from 5S and 16S rRNA sequence analysis, Escherichia and Salmonella species diverged from a common ancestor (Bisi-Johnson et al., 2011). They are estimated to have separated from the common ancestor approximately 140 million years ago (Ochman and Wilson, 1987; Hu et al., 2010). Despite their contrasting lifestyles, there has been no significant rewiring at the level of local regulons involved (Peyman Zarrineh et al., 2014). There is notable conservation in signaling pathways and stress sensing across these phylogenetically similar species (Peyman Zarrineh et al., 2014). Moreover, a similarity of 76% to 100% between their housekeeping genes makes them evolutionarily closely related species (Sharp, 1991; Samuel et al., 2004; Hu et al., 2010). They are foodborne pathogens and create complex biofilms that contribute to their virulence, antibiotic resistance, and surface survival (Milho et al., 2019). Interspecies interactions occur in mixed biofilms, resulting in diverse consequences for each species (Milho et al., 2019).

Furthermore, two clusters, cluster 3 and cluster 6, had a mixture of Mycobacterium and Pseudomonas pathogenic strains. These two clusters comprised a small fraction of strains joining the opponent cluster. Species collaborate when they are mutually advantageous, when their interests are aligned, and when each individual improves the fitness of the other, thus encouraging the advancement of diverse, unique phenotypes or interactions (Hall et al., 2020). One of the most common forms of prokaryote cooperation is the secretion of products required to build biofilms, digest complex chemicals, and modulate a host’s immune response, among other important functions (Hall et al., 2020). Genes involved in such benefit production can be transmitted between organisms, opening up new possibilities for collaboration and adaptation (Hall et al., 2020).

Currently, there are no studies indicating an evolutionary relationship between Mycobacterium and Pseudomonas bacteria. However, there is evidence of the two pathogens interacting to co-colonize the same infection niches and create a mixed-species biofilm that enhances both their immune system and antibiotic resistance (Camus et al., 2022). Further studies are needed to understand their evolutionary and clinical phenotype implications. Another interesting finding from this study was that strains belonging to S. aureus segregated to two separate clusters, indicating the within-species diversity. In the course of evolution, it undergoes both horizontal and vertical gene transfer events that have resulted in the genetically diversified bacterial population (Furqan Awan et al., 2021). Their diversity makes them resistant towards almost all the antimicrobial drugs used (Mlynarczyk-Bonikowska et al., 2022).

The obtained clusters paved the way for the introduction of a strategy based on the Random Forest algorithm to segregate the strains into organisms sharing similar genomic features. The proposed model attempts to integrate genomic sequences of the disease-causing microbes and further cluster them into groups of pathogens sharing similar characteristics. Along with developing the model, the study also identified key 10-mers that were capable of differentiating the strains into clusters. This could significantly accelerate the processing time required to deliver the output in terms of cluster identification. The majority of 10-mers mapping to genes in each organism varied across organisms.

Of the 876 10-mers, majority of them mapped to gltB gene in Acinetobacter baumannii. It codes for glutamate synthase subunit alpha. Glutamate is one of the carbon sources that can support growth of Acinetobacter species, making glutamate synthetases an important protein in these organisms (Ren and Palmer, 2023).

Majority of 10-mers corresponding to E. coli mapped to toxB gene. toxB is a virulence gene present in the virulence plasmid of E. coli species. It functions in enhancing bacterial adhesion and in inhibiting host lymphocyte activation (Tozzoli et al., 2005).

The fas gene stood out within 10-mers in M. tuberculosis strains. Biosynthesis of fatty acids regulated by FAS-I polypeptide is crucial in the formation of mycobacterial cell wall components, specifically mycolic acids that form a protective lipid layer on the cell wall. This is required for the survival of the bacterium in the host environment (Kinsella et al., 2003; Apoorva Bhatt et al., 2007). rpoB gene was the most common gene in the 10-mers mapped to P. aeruginosa and S. aureus. The list of 10-mers mapped were different in both the species, although there were some overlaps. These two species form one of the most commonly observed clinical polymicrobial communities that lead to the emergence of antibiotic-resistant strains (Pajon et al., 2023). rpoB gene is a DNA-directed RNA polymerase, and studies have reported that mutations in this gene lead to resistance against rifampin, an antibiotic used against multidrug-resistant bacterial strains (Yee et al., 1996; Guo et al., 2021). In S. enterica-related 10-mers, majority of them mapped to ftsK. It is involved in cell division and peptidoglycan biosynthesis. Mutations in ftsK could result in increased susceptibility against β-lactams and ciprofloxacin-related tolerance (Curiao et al., 2016). The gene spr0328 was identified as the topmost gene in S. pneumoniae. It encodes for a conserved hypothetical protein with a role in cell wall surface anchorage. The protein was one of the selected candidates for a study related to vaccine testing due to its ability to raise immune response in infected patients (Olaya-Abril et al., 2013).

The mapping of 10-mers to ABR genes led to the exploration of developing an MLP-based model to predict the antibiotics that a specific strain could be resistant to due to its genomic properties. This could thus aid in tracking ABR strains in a time-efficient manner. The 10-mers identified in the present study could open up new avenues in the field of drug designing-based studies. However, the present study is based on a limited number of sequences, although the same model could be implemented to a larger bacterial cohort based on sequence availability.

Amid growing advances in whole genome sequencing and applications of ML-based techniques, the characterization of pathogenic microbial communities could become a rapid process in the near future. The current study demonstrates one such strategy in identifying bacterial strains based on the presence and absence of 10-mers in their genomes. A subset of 10-mer sequences across the strains in the present study could also act as signatures to explore the diversity through understanding their biological significance. Furthermore, the MLP model enabled the classification of strains to ABR and non-ABR strains against various antibiotics. Overall, the computational prediction analyses demonstrated the advantage of ML to uncover the ABR determinants that might facilitate the exploration of better treatment options. However, the study is a data-driven approach, and thus, outcomes of the study may appear in the form of overfitting or underfitting.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author. Customized Python scripts used for developing the model are available through the GitHub repository via the following URL: https://github.com/js-iob/Bacterial_clustering_AMR.

Author contributions

KTSP: Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. KG: Data curation, Methodology, Resources, Software, Writing – review & editing. SRaj: Data curation, Methodology, Writing – review & editing. SRan: Writing – review & editing. AP: Funding acquisition, Supervision, Writing – review & editing. HS: Writing – review & editing. JS: Conceptualization, Funding acquisition, Investigation, Methodology, Supervision, Writing – review & editing.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was funded by the Indian Council of Medical Research (ICMR), Government of India.

Acknowledgments

JS would like to thank ICMR, Government of India [ICMR Ref No. BMI/12(116)2021, ICMR Ref No. BMI/12(95)/2021, and ICMR Ref No. BMI/Adhoc/28/2022–23] for their research support. KTSP is supported by ICMR, Government of India [ICMR Ref No. BMI/12(95)2021], and KG is supported by ICMR, Government of India (ICMR Ref No. BMI/Adhoc/28/2022–23). JS was a recipient of the Bio-CARe Women Scientists award from the Department of Biotechnology (DBT), Government of India (BT/PR19924/BIC/101/568/2016). This work was supported by a grant from DBT/Wellcome Trust India Alliance entitled “Center for Rare Disease Diagnosis, Research, and Training” (IA/CRC/20/1/600002) to AP.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frabi.2024.1405296/full#supplementary-material

References

Agrawal P. K., Agrawal S., Shrivastava R. (2015). Modern molecular approaches for analyzing microbial diversity from mushroom compost ecosystem. 3 Biotech. 5, 853–866. doi: 10.1007/s13205-015-0289-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Aida H., Hashizume T., Ashino K., Ying B. W. (2022). Machine learning-assisted discovery of growth decision elements by relating bacterial population dynamics to environmental diversity. Elife 11. doi: 10.7554/eLife.76846.sa2

PubMed Abstract | CrossRef Full Text | Google Scholar

Alcock B. P., Huynh W., Chalil R., Smith K. W., Raphenya A. R., Wlodarski M. A., et al. (2023). CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Res. 51, D690–D699. doi: 10.1093/nar/gkac920

PubMed Abstract | CrossRef Full Text | Google Scholar

Amgarten D., Braga L. P. P., Da Silva A. M., Setubal J. C. (2018). MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front. Genet. 9, 304. doi: 10.3389/fgene.2018.00304

PubMed Abstract | CrossRef Full Text | Google Scholar

Apoorva Bhatt V. M., Gurdyal S., Besra W. R., , Laurent Kremer J. Jr (2007). The Mycobacterium tuberculosis FAS-II condensing enzymes: their role in mycolic acid biosynthesis, acid-fastness, pathogenesis and in future drug development. Mol. Microbiol. 64, 1442–1454. doi: 10.1111/j.1365-2958.2007.05761.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Beck D., Foster J. A. (2014). Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PloS One 9, e87830. doi: 10.1371/journal.pone.0087830

PubMed Abstract | CrossRef Full Text | Google Scholar

Bisi-Johnson M. A., Obi C. L., Vasaikar S. D., Baba K. A., Hattori T., et al. (2011). Molecular basis of virulence in clinical isolates of Escherichia coli and Salmonella species from a tertiary hospital in the Eastern Cape, South Africa. Gut Pathog. 3. doi: 10.1186/1757-4749-3-9

A machine learning-based strategy to elucidate the identification of antibiotic resistance in bacteria

1 Introduction

2 Materials and methodology

2.1 Data collection

2.2 Generation of k-mer matrix from nucleotide sequences

2.3 Segregation of pathogenic microbes

2.4 Development of prediction model to predict the cluster of an organism

2.5 Determining the biological significance of the most informative k-mers in cluster prediction

2.6 Development of prediction model of microbes resistant towards antibiotics

3 Results

3.1 Collection of data

3.2 k-mer matrix generation

3.3 Clustering pathogenic bacteria

3.4 Supervised ML model from the clusters obtained

3.5 Biological significance of 876 10-mers

3.6 Prediction model for identifying putative antibiotic resistance in bacterial strain

4 Discussion

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

Supplementary material

References

94% of researchers rate our articles as excellent or good