Development of novel parameters for pathogen identification in clinical metagenomic next-generation sequencing

Jiang, Xiwen; Yan, Jinghai; Huang, Hao; Ai, Lu; Yu, Xuegao; Zhong, Pengqiang; Chen, Yili; Liang, Zhikun; Qiu, Wancen; Huang, Huiying; Yan, Wenyan; Liang, Yan; Chen, Peisong; Wang, Ruizhi

doi:10.3389/fgene.2023.1266990

ORIGINAL RESEARCH article

Front. Genet., 17 November 2023

Sec. Computational Genomics

Volume 14 - 2023 | https://doi.org/10.3389/fgene.2023.1266990

Development of novel parameters for pathogen identification in clinical metagenomic next-generation sequencing

Xiwen Jiang¹^†

Jinghai Yan²^†

Hao Huang²

Lu Ai²

Xuegao Yu²

Pengqiang Zhong²

Yili Chen²

Zhikun Liang³

Wancen Qiu³

Huiying Huang³

Wenyan Yan³

Yan Liang³

Peisong Chen²*

Ruizhi Wang²*

¹College of Biological Science and Engineering, Fuzhou University, Fuzhou, China
²Department of Laboratory Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
³Guangzhou Darui Biotechnology Co., Ltd., Guangzhou, China

Introduction: Metagenomic next-generation sequencing (mNGS) has emerged as a powerful tool for rapid pathogen identification in clinical practice. However, the parameters used to interpret mNGS data, such as read count, genus rank, and coverage, lack explicit performance evaluation. In this study, the developed indicators as well as novel parameters were assessed for their performance in bacterium detection.

Methods: We developed several relevant parameters, including 10M normalized reads, double-discard reads, Genus Rank Ratio, King Genus Rank Ratio, Genus Rank Ratio*Genus Rank, and King Genus Rank Ratio*Genus Rank. These parameters, together with frequently used read indicators including raw reads, reads per million mapped reads (RPM), transcript per kilobase per million mapped reads (TPM), Genus Rank, and coverage were analyzed for their diagnostic efficiency in bronchoalveolar lavage fluid (BALF), a common source for detecting eight bacterium pathogens: Acinetobacter baumannii, Klebsiella pneumoniae, Streptococcus pneumoniae, Staphylococcus aureus, Hemophilus influenzae, Stenotrophomonas maltophilia, Pseudomonas aeruginosa, and Aspergillus fumigatus.

Results: The results demonstrated that these indicators exhibited good diagnostic efficacy for the eight pathogens. The AUC values of all indicators were almost greater than 0.9, and the corresponding sensitivity and specificity values were almost greater than 0.8, excepted coverage. The negative predictive value of all indicators was greater than 0.9. The results showed that the use of double-discarded reads, Genus Rank Ratio*Genus Rank, and King Genus Rank Ratio*Genus Rank exhibited better diagnostic efficiency than that of raw reads, RPM, TPM, and in Genus Rank. These parameters can serve as a reference for interpreting mNGS data of BALF. Moreover, precision filters integrating our novel parameters were built to detect the eight bacterium pathogens in BALF samples through machine learning.

Summary: In this study, we developed a set of novel parameters for pathogen identification in clinical mNGS based on reads and ranking. These parameters were found to be more effective in diagnosing pathogens than traditional approaches. The findings provide valuable insights for improving the interpretation of mNGS reports in clinical settings, specifically in BALF analysis.

Introduction

Next-generation sequencing (NGS) technology, also known as high-throughput or large-scale sequencing technology, can simultaneously and independently sequence thousands to billions of DNA fragments (Boers et al., 2019). Recently, the use of NGS for clinical pathogen diagnosis is widely accepted, with three main applications in the clinical microbiology laboratory: whole-genome sequencing (WGS), metagenomic next-generation sequencing (mNGS), and targeted metagenomics sequencing (tNGS) (Mitchell and Simner, 2019). WGS involves sequencing and assembly of a microbial genome, which is applied to the pure culture growth of a bacterial organism or directly from a viral specimen. WGS is used to identify and epidemiologically track food-borne outbreaks and disease surveillance and to identify multi-drug-resistant nosocomial infections and track the transmission of these organisms. However, this approach is time-consuming due to the microbial culture and has limitations when it is difficult to culture the organism or uncultivable (Brown et al., 2015; Nimmo et al., 2017; Votintseva et al., 2017). The implementation of shotgun and targeted metagenomics sequencing directly from a clinical sample, namely, as mNGS and tNGS, offers the major advantage of eliminating the culture process entirely. The method of mNGS does not rely on traditional microbial culture and can extract all nucleic acids from specimens without bias for high-throughput sequencing (Gu et al., 2019; Han et al., 2019). After biological information analysis, human sequences are removed, and the remaining sequences are compared with pathogen databases to obtain the information on suspected pathogenic microorganisms species (Gu et al., 2019; Han et al., 2019). On the other hand, tNGS involves a selection process before library preparation and sequencing to enrich for the microbial sequences of interest. Enrichment can be achieved using various selection methods such as PCR amplification (commonly known as amplicon sequencing), probe hybridization, and CRISPR–Cas9 utilization (Salipante et al., 2014; Gu et al., 2016). The advantage of tNGS when compared to mNGS approaches is that it overcomes the challenge of amplifying low numbers of microbial sequences within highly cellular samples, often referred to as the “needle in the haystack” dilemma. However, the enrichment process, such as multiplex PCR for specific genes, may introduce target bias (Schlaberg et al., 2017). There has been a growing interest in the use of quasi-metagenomics, which lies between culture-independent metagenomics and pure-culture isolate sequencing. Quasi-metagenomics sequencing involves the analysis of modified microbiomes in food and environmental samples using WGS (Hyeon et al., 2018). In this protocol, the microbiome is modified to concentrate the genomic DNA of a specific food-borne pathogen contaminant, enabling the detection and subtyping of the pathogen in a single workflow (Hyeon et al., 2018).

mNGS can theoretically detect all pathogens in clinical samples and is especially suitable for detecting complex, rare, novel, and atypical infectious diseases (Ge et al., 2021). In particular for some viruses, mNGS might be the only feasible method of detection (Babiker et al., 2020; van Boheemen et al., 2020). mNGS can be used for a variety of common clinical microbiology samples, such as cerebrospinal fluid, whole blood, alveolar lavage fluid, pus, and tissue (Wilson et al., 2014; Doan et al., 2016). In recent years, the successful application of mNGS in clinical cases and studies has gradually increased. The advantage of mNGS in the field of infectious disease diagnosis lies in its ability to detect pathogens that may remain elusive to other conventional detection methods, i.e., the capability to detect difficult-to-culture, rare, or unprecedented pathogenic microorganisms (Wylie et al., 2013; Wilson et al., 2014; Frémond et al., 2015; Doan et al., 2016; Simner et al., 2018).

Current bioinformatics pipelines of mNGS mostly rely on the number of mapped reads for pathogen identification. However, there are significant challenges in interpreting and reporting data from mNGS. Although mNGS is increasingly being applied in clinical settings, there is no guideline as a standard for interpreting mNGS data. Different analysts may have varying interpretations based on their own independent criteria. For the same bacteria being tested, one person may identify it as a pathogenic bacterium and report it as such to the clinician, while another may dismiss it as background detection and provide the clinician with a negative report. When interpreting the mNGS report, it is common practice to classify the detected pathogen as either background or pathogenic bacteria in the specimen based on reads per million mapped reads (RPM), transcript per kilobase per million mapped reads (TPM), genus rank, and coverage. However, the performance of these parameters in pathogen identification has been barely investigated, while it is not yet clear if there are better parameters available. Hence, we developed novel parameters and compared their effectiveness with that of the traditional indexes, such as raw reads, RPM, TPM, and in-Genus rank.

Materials and methods

Different indicators of mNGS

In this study, we discussed the diagnostic efficacy of different indicators. According to the results of mNGS, except for the used indicators including raw reads, RPM, TPM coverage, and in Genus Rank, we developed the following novel indicators: 1) read indicators: 10M normalize reads and double-discard reads; 2) rank indicators: Genus Rank Ratio, King Genus Rank Ratio, Genus Rank Ratio*in Genus Rank, and King Genus Rank Ratio*in Genus Rank. The meaning and calculation methods of different indicators are shown in Table 1.

TABLE 1

TABLE 1. Definitions of the 11 indicators.

Study participants

In this study, we collected bronchoalveolar lavage fluid (BALF) samples from patients at the First Affiliated Hospital of Sun Yat-sen University. These samples will undergo both PCR and mNGS. A total of 605 patients were included in the study. The inclusion criteria were as follows: meeting criteria 1–3 and at least one of criteria 4–8, mentioned as follows: 1) being over 18 years of age; 2) showing pulmonary inflammatory lesions on lung imaging; 3) being willing to participate (either the patient or their guardian); 4) testing positive for the pathogen; 5) experiencing cough, sputum, chest pain, dyspnea, or hemoptysis; 6) presenting with acute fever; 7) showing signs of lung consolidation and/or moist rales; 8) exhibiting increased white blood cell count and C-reactive protein (CRP). Exclusion criteria were as follows: 1) less than 18 years of age; 2) having a clear non-infectious diagnosis; 3) having an insufficient number of specimens; 4) not wanting to participate (either the patient or their guardian). Then, the mNGS results were analyzed to determine the detection status of the following eight pathogens: Acinetobacter baumannii, Klebsiella pneumoniae, Streptococcus pneumoniae, Staphylococcus aureus, Hemophilus influenzae, Stenotrophomonas maltophilia, Pseudomonas aeruginosa, and Aspergillus fumigatus in each specimen. The data for each index were obtained. The diagnostic efficacy of mNGS was evaluated using the results of PCR as the gold standard (Figure 1).

FIGURE 1

FIGURE 1. Flowchart of this study.

Sample processing and DNA extraction from mNGS and PCR

An amount of 400 μl of each patient’s BALF was placed in a 2-ml low-adsorption centrifuge tube, centrifuging at 14,000 g for 3 min. The supernatant was then discarded, and 200 μl of PBS was added to the centrifuge tube for re-suspension. The resulting mixture was incubated with 5% saponins and nuclease at 37°C, 1,000 rpm for 10 min. PBS (1 ml) was added to dilute 15,000 g for 3 min. The supernatant was removed. To the centrifuge tube, 400 μl of PBS was added for re-suspension. The resulting mixture was mixed with lysozyme and glass beads and placed on a vortex mixer’s horizontal platform, stirring intensely at 2,800–3,200 rpm for 30 min. We then used the Micro DNA kit (Guangzhou Darui Biotechnology Co., Ltd., DR-HS-A010, China) to extract DNA following standard protocols. The detected samples included a negative control for detection of cross-contamination and a positive control to monitor DNA extraction efficiency. The DNA concentration was estimated using a Qubit fluorometer (Thermo Fisher Scientific).

Library preparation and sequencing construction

Based on the principle of high-throughput sequencing, we added a universal sequencing connector to both ends of the extracted DNA fragments, followed by fragment screening to construct the sequencable DNA libraries. We used the TIANSeq DNA library construction kit (TIANGEN, NG104-T3A, China) to construct the DNA libraries, following the established protocols. In order to monitor the accuracy of the experimental steps and filter microbial contaminant sequence, we added equal amounts of the nucleic acid of spike-in control (a plasmid with known amounts constructed by T4 phages) to all the samples. The DNA libraries were quantified using a Qubit fluorometer, and the sizes of the libraries were measured using a Qsep100 (BiOptic Inc.) to all the samples. The constructed mixed libraries were subsequently sequenced on the Ion Torrent platform, ensuring that the resulting qualified data consisted of at least 10 million reads per sample and a Q30 score of 85% or higher. Quality control was maintained by including a negative control sample, which was processed and sequenced in parallel with each sequencing run.

Bioinformatics analysis and mNGS analysis

After completion of sequencing, the mNGS results were compared and analyzed using the PIP (version: 1.0.0rc4) (DAAN Gene Co.) software in the PIDB_v1.1.1 (DAAN Gene Co.) database. The main steps of the pipeline include trimming of the adapter and low-quality sequences using Fastp (version: 0.21.0), excluding human sequences by mapping to the human reference (GRCh38.p13) using SNAP (version: 1.0.3), aligning the remaining sequence data to microbial databases (PIDB_v1.1.1) consisting of bacteria, fungi, viruses, and parasites by Kraken 2 (version: 2.1.1) and annotating, and then homogenizing the raw taxonomic profile with 10M as the standard.

Following the annotation analysis, double deductions were made for the reads of the detected species to obtain a more realistic taxonomic profile. mNGS experiments with a small amount of nucleic acid input are known to be susceptible to false conclusions due to potential contaminants, especially from molecular biological reagents and the environment. The majority of respiratory samples have a host ratio of over 90%, which means that the effective data ratio is less than 10%. Information on nucleic acid concentration, library concentration, and host ratio can be found in the Supplementary Materials (see Supplementary Table S1). For the first double deduction, we calculated the correlation between the read number of spike-in control added to the negative control sample and in the sample to obtain the read number of the nucleic acid of potential pollutants in the sample. Then, the error is corrected in the sample classification results, which may be due to the introduction of pollutants. Due to the equal amount of spike-in control input in qualified samples and negative control and the positive correlation between the read number of species and the proportion of nucleic acid, the fold change of the spike-in control read number between samples and negative control was equal to the fold change of the contaminated microbe read number between samples and negative control (Zinter et al., 2019). The fold change in the read number of the spike-in control and pollutant in samples and negative control was consistent.

The correlation conversion formula is as follows:

{R e a d s}_{s p i k e - i n_s a m p l e} / {R e a d s}_{s p i k e - i n_N C} = {R e a d s}_{pollutants_sample} / {R e a d s}_{pollutants_NC} .

Reads: read number, NC: negative control, and sample: clinical sample.

Due to the homology among species and the extensive contamination of microbial sequences in public databases, the classification of sequences of a single species is often accompanied by the detection of other species, which we consider to be noise data that will interfere with the judgment of results. For the second double deduction, we utilized a multiple linear regression model to quantify the interference relationship between species and then deducted a certain proportion of species read number according to the model. A clinically prevalent species–strain database of BALF was constructed using 1015 BALF samples from the hospital within the last month, which covered 375 species and 1952 strains. The 50-fold reads per strain was simulated using wgsim (version: 1.15-9-g4be6986) and analyzed by PIP to obtain the taxonomic profile. Each target species y may be produced under the joint action of multiple species $x_{k}$ , and the contribution value of each species $x_{k}$ is different, that is, each species $x_{k}$ has a corresponding weight $b_{k}$ .

The variables of the multiple linear regression model between species are composed of species variables:

y = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{k} x_{k} + u_{t} .

The weight $b_{k}$ is equal to the fold change of the read number of species y and the reads number of species $x_{k}$ in the simulated data of species $x_{k}$ , that is expressed as follows:

b_{k} = {R e a d s}_{y_x_{s i m u l a t e}} / {R e a d s}_{x_x_{s i m u l a t e}} .

We construct a weighted correlation matrix for the interfering strains of each species. By excluding the noise introduced by species $x_{k}$ , a more realistic read number of species y can be obtained.

After the double deduction, RPM and TPM calculations were performed on the remaining data. Meanwhile, several new pathogen-reporting indexes were introduced by applying feature transformation and feature combination in feature engineering, including “In Genus Rank,” “Genus Rank Ratio,” “King Genus Rank Ratio,” “Genus Rank Ratio*in Genus Rank,” and “King Genus Rank Ratio*in Genus Rank.” To reduce the effects of experimental and human-induced errors in the sample, we take a feature construction approach by implementing the bucket sort algorithm to discretize the read number of species (Koenig and Youn, 2011). Species from the same genus were placed in the same bucket, and immediately after that, the species read number in each bucket was ranked from largest to smallest, and a new index “In Genus Rank” was assigned to each species according to the rank in each bucket. Each bucket was sorted from the largest to the smallest by the read number of the genus. According to the rank percent of the target genus in all genera, a new indicator “Genus Rank Ratio” was obtained. For example, with a total of 100 genera, the read number of the target genus ranks first among all genera, and the “Genus Rank Ratio” is equal to 1 divided by 100, that is, 1%. Similar to the “Genus Rank Ratio,” the index “King Genus Rank Ratio” quantifies the ranking proportion of all genera in the kingdom level to which the target Genus belongs, which is different from the “Genus Rank Ratio.” In feature engineering, new features may better characterize the data by combining several different features, i.e., synthetic features that encode non-linear laws in the feature space by multiplying two or more input features. Here, we combine “In Genus Rank” and “Genus Rank Ratio” as a new feature, named “Genus Rank Ratio*in Genus Rank,” “King Genus Rank Ratio” and “In Genus Rank” were combined, and it is called the “King Genus Rank Ratio*in Genus Rank.”

All reports interpret that indicators were obtained for the eight microorganisms (A. baumannii, K. pneumoniae, S. pneumoniae, S. aureus, H. influenzae, S. maltophilia, P. aeruginosa, and A. fumigatus) identified in the mNGS results of each specimen.

PCR assay of the eight pathogens

The PCR assay was performed on the Applied Biosystems™ ProFlex™ PCR system. The primers of the eight pathogens are shown in Table 2. The final reaction volume of 20 µL contained 10 µL of Platinum™ II Taq Hot-Start Green PCR Master Mix (Invitrogen™), 10 μM concentration of each primer (1 mL), and 2 µL of extracted DNA. Thermal cycling conditions were as follows: preheating at 94°C for 2 min, amplification of 40 cycles including denaturation at 94°C for 15 s, annealing at 60°C for 15 s, and extension at 68°C for 15 s. Positive and negative controls were included in each run. PCR products were detected by agarose gel electrophoresis.

TABLE 2

TABLE 2. Primer sequences of pathogens used for PCR assays.

Build machine learning models

Taking all these parameters into consideration, we selected double-discard reads and ranking indicators, such as Genus Rank, Genus Rank Ratio, and King Genus Rank Ratio, for machine learning training. Similarly, the PCR results were used as the standard for comparison. We selected a logistic regression model that randomly splits the data into a 70% training set and a 30% verification set to obtain the calculation formula of the model.

Statistical analysis

Statistical analysis was carried out by an online statistics tool (http://dxonline.deepwise.com/) and GraphPad prism. The ROC curve, sensitivity, specificity, positive predict value (PPV), and negative predict value (NPV) were calculated using the results of PCR as the reference standard. The significance was fixed at p < 0.05.

Result

The results of PCR in the eight pathogens

In this study, in order to explore the diagnostic efficacy of different indicators of mNGS, we took the results of PCR as the gold standard for judgment. The result of PCR in the eight pathogens is shown in Table 3.

TABLE 3

TABLE 3. Identification of the eight pathogens in clinical samples by PCR assay.

The assay sensitivity in detecting pathogens

In clinical specimens, pathogens are often found alongside host cells at varying abundances. Different specimens can have different rates of host cells, which may interfere with the analysis. Therefore, we first conducted a sensitivity analysis of our approach using standard substances. Three pathogen standards in different concentrations were mixed with human DNA (1.25*10⁶ cells/ml) to simulate the host background and were subjected to sequencing. The results showed that even at relatively low concentrations, our approach still demonstrated an excellent performance in detecting microbes (Figure 2).

FIGURE 2

FIGURE 2. Log10 of standardized read number at different concentrations, S1: A. baumannii, K. pneumoniae, H. influenzae, S. maltophilia, and P. aeruginosa: 250 CFU/ml; S. pneumoniae, S. aureus: 500 CFU/ml and A. fumigatus: 1,000 CFU/ml. S: A. baumannii, K. pneumoniae, H. influenzae, S. maltophilia, and P. aeruginosa: 50 CFU/ml; S. pneumoniae, S. aureus: 100 CFU/ml and A. fumigatus: 200 CFU/ml. S2: A. baumannii, K. pneumoniae, H. influenzae, S. maltophilia, and P. aeruginosa: 100 CFU/ml; S. pneumoniae, S. aureus: 20 CFU/ml and A. fumigatus: 40 CFU/ml.

The diagnostic efficacy of different indicators in mNGS of the eight pathogens

After performing the Kraken 2 analysis on all clinical samples, the total number of mapped pathogen reads ranged from 31 to 10,528,777 (Figure 3; Supplementary Table S2). The ROC curve of read indicators and rank indicators of the eight pathogens using mNGS is shown in Figures 4, 5. The ROC curve enables the determination of the cut-off value for each index, along with its corresponding sensitivity, specificity, PPV, and NPV. The areas under the curve (AUC), cut-off value, sensitivity, specificity, PPV, and NPV of the different indicators in A. baumannii, K. pneumoniae, S. pneumoniae, S. aureus, H. influenzae, S. maltophilia, P. aeruginosa, and A. fumigatus are listed in Table 4. The results demonstrated that these indicators exhibited good diagnostic efficacy for the eight pathogens. The AUC values of the five read indicators (raw reads, 10M normalized reads, double discard reads, RPM, and TPM) were all greater than 0.9. The corresponding sensitivity and specificity values were also all greater than 0.8, except for the sensitivity of 10M normalized reads in S. aureus (0.762) and the sensitivity of raw reads in H. influenzae (0.724). Furthermore, the NPV was also higher than 0.95, indicating that these indicators can effectively predict true negative results. Among the five rank indicators, both Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank demonstrated better diagnostic efficiency, with AUC values greater than 0.9. The corresponding sensitivity and specificity values were greater than 0.8, except for the specificity of King Genus Rank Ratio*in Genus Rank in P. aeruginosa (0.774). Moreover, the NPV was higher than 0.99, highlighting that these two indicators are reliable predictors of true negative results.

FIGURE 3

FIGURE 3. All reads of the detected pathogens in each sample.

FIGURE 4

FIGURE 4. ROC curve of read indicators in the eight pathogens, (A) A. baumannii, (B) K. pneumoniae, (C) S. pneumoniae, (D) S. aureus, (E) H. influenzae, (F) S. maltophilia, (G) P. aeruginosa, and (H) A. fumigatus.

FIGURE 5

FIGURE 5. ROC curve of rank indicators in the eight pathogens, (A) A. baumannii, (B) K. pneumoniae, (C) S. pneumoniae, (D) S. aureus, (E) H. influenzae, (F) S. maltophilia, (G) P. aeruginosa, and (H) A. fumigatus.

TABLE 4

TABLE 4. Diagnostic efficacy of 11 indicators in mNGS of eight pathogens.

Pairwise comparisons of different indicators

Using the DeLong test for pairwise comparisons, the AUC of double discard reads was significantly higher than that of raw reads and 10M normalized reads in S. pneumoniae and H. influenzae (p < 0.01). The AUC of double discard reads was significantly higher than that of 10M normalized reads in S. maltophilia (p < 0.05) (Table 5). The AUC of the double discard reads was greater than 0.9 for all eight pathogens, indicating a high diagnostic efficacy. Compared to RPM and TPM, the AUC of double discard reads was significantly higher in K. pneumoniae and P. aeruginosa (p < 0.05), and compared to RPM, the AUC of double-discarded reads was significantly higher in S. pneumoniae (p < 0.05) and A. fumigatus (p < 0.01). By definition, double discard reads further remove host-derived readings from the results, yielding a more accurate representation of pathogen information in the specimen. Therefore, double discard reads could be the preferred option when considering the number of reads in the mNGS report (Table 5).

TABLE 5

TABLE 5. Pairwise comparisons of different indicators.

Compared to in Genus Rank, the AUC of Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank was significantly higher in all the eight pathogens (p < 0.01). Compared to the Genus Rank Ratio, the AUC of Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank was significantly higher in K. pneumoniae (p < 0.01 and p < 0.05, respectively), S. pneumoniae (p < 0.01), S. aureus (p < 0.01), H. influenzae (p < 0.01), and P. aeruginosa (p < 0.01). Compared to the Genus Rank Ratio, the AUC of the Genus Rank Ratio*in Genus Rank was significantly higher in A. fumigatus (p < 0.05) (Table 5). Compared to the King Genus Rank Ratio, the AUC of the Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank was significantly higher in K. pneumoniae (p < 0.05 and p < 0.01, respectively), S. pneumoniae (p < 0.01), S. aureus (p < 0.01), H. influenzae (p < 0.01), P. aeruginosa (p < 0.01), and A. fumigatus (p < 0.01). The results showed that the two joint indicators (Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank) outperformed the other three individual rank indicators (in Genus Rank, Genus Rank Ratio, and King Genus Rank Ratio), indicating that the two joint indicators can be selected for analysis when considering the genus ranking of pathogens in the analysis of mNGS reports (Table 5).

Construction of identification algorithms through a machine learning model

The score of the logistic regression model is as follows: score = 1/(1 + exp (-logit)). The calculation formula of logit for the training model of each pathogen (Table 6) and the evaluation of the model’s effectiveness (Table 7) were obtained through machine learning. The ROC curves corresponding to each model are shown in Figure 6. The AUC value indicates that the model has a better effect. It is evident that after machine learning training, each model demonstrates improved diagnostic performance.

TABLE 6

TABLE 6. Calculation formula of the machine learning in the eight pathogens.

TABLE 7

TABLE 7. Evaluation of the model’s effectiveness of the eight pathogens.

FIGURE 6

FIGURE 6. ROC curve of machine learning model in the eight pathogens, (A,B) A. baumannii, (C,D) K. pneumoniae, (E,F) S. pneumoniae, (G,H) S. aureus, (I,J) H. influenzae, (K,L) S. maltophilia, (M,N) P. aeruginosa, and (O,P) A. fumigatus.

Discussion

Identifying the etiology of infectious diseases is critical in their diagnosis. Traditionally, clinicians make differential diagnoses based on patients’ clinical manifestations and then conduct corresponding tests for each pathogen. However, many pathogenic microorganisms are difficult to cultivate in vitro and hard to diagnose through clinical symptoms, which restricts the use of traditional microbiology tests. In comparison, mNGS is able to cover a wider range of pathogens, making it an essential tool for diagnosis (Goldberg et al., 2015).

The positive threshold for mNGS is determined based on the number of microbial-specific sequences and their genome coverage (Chinese Society of Laboratory Medicine, 2020). Since viruses rarely survive in the environment, even a small number of specific sequences can be detected as positive (Miller et al., 2019). It is crucial to avoid reporting environmental bacteria, symbiotic bacteria, and conditioned pathogens that are not clinically relevant. Typically, the higher the number of general sequences, the greater the likelihood of pathogenic microorganisms (tens of specific sequences) (Chinese Society of Laboratory Medicine, 2020). For pathogens of significant clinical concern and those that are difficult to detect, such as Mycobacterium tuberculosis, Yersinia pestis, and Brucella, independent interpretation criteria can be adopted (Chinese Society of Laboratory Medicine, 2020). For instance, if one specific sequence is detected, it can be judged as positive (Miao et al., 2018; Fan et al., 2019). Since the parasite genome is relatively complex and similar to the human genome, it should be interpreted strictly after the sequence specificity is confirmed (Infect Inflamm Rep, 2020). If the detected sequence is a new species, the threshold is not limited, but the homology comparison results must be provided.

In addition to specimen collection, pretreatment, and detection processes, the accuracy of mNGS reports can also be influenced during the result analysis stage. Numerous published studies have investigated the detection efficiency of mNGS on clinical specimens. These studies have found that when the sample contains a high content of pathogenic microorganisms, mNGS displays overall detection performance that is similar to that of PCR, with no significant difference between the two methods. However, in cases where the viral load is low, mNGS may yield false negatives with a specificity of only around 20 million specific sequences, which is lower than that of PCR. To enhance the detection sensitivity in such cases, it is necessary to increase the amount of data used for analysis (Yang et al., 2011; Fischer et al., 2015; Thorburn et al., 2015; Graf et al., 2016; Xie et al., 2019). It is critical to note that mNGS results should not be solely relied upon for clinical decision-making, and negative results should be verified with the patient’s clinical exclusion of infection (Schlaberg et al., 2017). In the current practical application, priority is given to analyzing the mNGS report for pathogens and their species ranking. However, the presence of background contamination and host interference can impact the analysis results, introducing a certain level of subjectivity when combined with clinical symptoms. Therefore, there is a need for new indicators with improved sensitivity and specificity to accurately reflect the true results.

At present, interpretation of mNGS results primarily focuses on the reads and genus rank of the detected pathogens. However, as mentioned previously, different types of pathogens have different criteria for judgment. Relying solely on these two indicators may lead to misinterpretation of the results. Therefore, it is necessary to develop new parameters that can enhance the sensitivity and specificity of these indicators, thereby improving the accuracy of mNGS results. In this study, we developed several new indicators for the diagnosis of mNGS, including 10M normalized reads, double discard reads, TPM, Genus Rank Ratio, King Genus Rank Ratio, Genus Rank Ratio*in Genus Rank, and King Genus Rank Ratio*in Genus Rank. We compared these novel indexes with the existing indicators such as raw reads, RPM, and in Genus Rank. We then analyzed the diagnostic efficacy of these indicators for eight pathogens, namely, A. baumannii, K. pneumoniae, S. pneumoniae, S. aureus, H. influenzae, S. maltophilia, P. aeruginosa, and A. fumigatus. Based on the analysis of the results for these pathogens, among the five read indicators, double discard reads demonstrated better diagnostic efficiency than the other indicators. It provided a more accurate representation of the actual reads from the pathogens. Among the five ranking indicators, the two combined indicators exhibited superior diagnostic efficiency compared to the three separate indicators. There was no significant difference between the two combined indicators, indicating that either of them can be selected for analyzing the species ranking of pathogens. The analysis results highlight that double discard reads showed higher sensitivity and specificity than raw reads. Additionally, double discard reads better reflected the actual number of detected pathogens during the analysis. The Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank also demonstrated improved sensitivity and specificity compared to the in Genus Rank, which is a novel index developed in this study. Importantly, the new indexes exhibited enhanced diagnostic efficiency over the original indexes, thereby increasing the reliability of mNGS results.

However, it is important to acknowledge that in some cases, a single index may not fully reflect the true results. In practical applications, a combination of double discard reads and the two joint ranking indicators can be used for analysis to complement each other and improve the accuracy of diagnosis. Furthermore, the results of the machine learning analysis also indicate that double discard reads, the King Genus Rank Ratio, and in Genus Rank demonstrate good diagnostic performance after training the model.

As high-throughput technology continues to advance, third-generation sequencing technology, such as Nanopore, is already being implemented in clinical laboratories. The key advantages of third-generation technology include long reads (≥500 bp), low capital cost, and short turnaround time (Petersen et al., 2019). Third-generation sequencing has been utilized to bridge the gaps in unfinished genomes sequenced on short-read platforms, thanks to its ability to generate long reads (Bouchez et al., 2018). However, it is important to note that our results were derived from the utilization of mNGS technology and are specifically applicable to mNGS. Different research platforms may have distinct analysis parameters, and it is necessary to conduct further investigations to determine the compatibility of our developed parameters with third-generation sequencing platforms. Additionally, our findings are based on BALF specimens obtained from clinical patients, and additional research is required to evaluate the feasibility of applying our approach to other types of specimens, such as blood, cerebrospinal fluid, and various other body fluids.

Conclusion

mNGS is a novel technology currently being developed for clinical applications. While it has the potential to identify rare pathogens more quickly than traditional biological detection methods, it still requires improvements to enhance its clinical utility. Therefore, accurate analysis of the mNGS results is crucial. In this study, we analyzed the diagnostic efficiency of several novel indicators. We recommend selecting double discarded reads when considering pathogen reads in the report analysis. For genus ranking, we suggest selecting the two novel indicators: Genus Rank Ratio*in Genus Rank and King Genus Rank Ratio*in Genus Rank. In practical application, when analyzing the mNGS report, using these new indicators in combination can enhance the accuracy of the report, thereby promoting the clinical application of this technology. This will enable the precise detection of pathogens in patients and facilitate timely symptomatic treatment.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the Institutional Review Board and Ethics Committee of the First Affiliated Hospital of Sun Yat-sen University. The studies were conducted in accordance with the local legislation and institutional requirements. The human samples used in this study were acquired from primarily isolated samples as part of our previous study for which ethical approval was obtained. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

XJ: Writing–review and editing. JY: Data curation, Writing–original draft, Writing–review and editing, Investigation. HaH: Writing–review and editing. LA: Writing–review and editing. XY: Writing–review and editing. PZ: Writing–review and editing. ZL: Writing–review and editing. PC: Conceptualization, Investigation, Writing–review and editing. RW: Conceptualization, Investigation, Writing–review and editing.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This study was supported by 2021 Guangdong Province Key Areas Research and Development Plan “Biosafety Technology” Key Project (2022B1111020003) and Guangdong Natural Science Foundation-General Program (2023A1515011252). Guangdong special project on the prevention and treatment of 2019-nCoV (2022B111102003).

Conflict of interest

Authors ZL, WQ, HuH, WY, and YL were employed by Guangzhou Darui Biotechnology Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2023.1266990/full#supplementary-material

Supplementary Table S1 | Nucleic acid concentration, library concentration and host ratio of all samples.

Supplementary Table S2 | Total number of mapped pathogen reads of all samples.

References

Babiker, A., Bradley, H. L., Stittleburg, V. D., Ingersoll, J. M., Key, A., Kraft, C. S., et al. (2020). Metagenomic sequencing to detect respiratory viruses in persons under investigation for COVID-19. J. Clin. Microbiol. 2020, 20178764–20178820. doi:10.1101/2020.09.09.20178764

PubMed Abstract | CrossRef Full Text | Google Scholar

Boers, S. A., Jansen, R., and Hays, J. P. (2019). Understanding and overcoming the pitfalls and biases of next-generation sequencing (NGS) methods for use in the routine clinical microbiological diagnostic laboratory. Eur. J. Clin. Microbiol. Infect. Dis. 38 (6), 1059–1070. doi:10.1007/s10096-019-03520-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Bouchez, V., Baines, S. L., Guillot, S., and Brisse, S. (2018). Complete genome sequences of Bordetella pertussis clinical isolate FR5810 and reference strain Tohama from combined Oxford nanopore and Illumina sequencing. Microbiol. ResourAnnounc 7, e01207-18–e01218. doi:10.1128/MRA.01207-18

PubMed Abstract | CrossRef Full Text | Google Scholar

Brown, A. C., Bryant, J. M., Einer-Jensen, K., Holdstock, J., Houniet, D. T., Chan, J. Z. M., et al. (2015). Rapid whole-genome sequencing of Mycobacterium tuberculosis isolates directly from clinical samples. J. Clin. Microbiol. 53 (7), 2230–2237. doi:10.1128/JCM.00486-15

PubMed Abstract | CrossRef Full Text | Google Scholar

Chinese Society of Laboratory Medicine (2020). Expert consensus on clinical standardized application of metagenomics next-generation sequencing for detection of pathogenic microorganisms. Chin. J. Lab. Med. Dec. 43, 12. doi:10.3760/cma.j.cn114452-20200903-00704

CrossRef Full Text | Google Scholar

Doan, T., Wilson, M. R., Crawford, E. D., Chow, E. D., Khan, L. M., Knopp, K. A., et al. (2016). Illuminating uveitis: metagenomic deep sequencing identifies common and rare pathogens. Genome Med. 8, 90. doi:10.1186/s13073-016-0344-6

PubMed Abstract | CrossRef Full Text | Google Scholar

Fan, S., Wang, X., Hu, Y., Shi, J., Zou, Y., Zhao, W., et al. (2019). Metagenomic next-generation sequencing of cerebrospinal fluid for the diagnosis of central nervous system infections: a multicentre prospective study. bioRxiv. doi:10.21203/rs.2.17334/v1658047

CrossRef Full Text | Google Scholar

Fischer, N., Indenbirken, D., Meyer, T., Lütgehetmann, M., Lellek, H., Spohn, M., et al. (2015). Evaluation of unbiased next-generation sequencing of RNA (RNA-seq) as a diagnostic method in influenza virus-positive respiratory samples. J. Clin. Microbiol. 53 (7), 2238–2250. doi:10.1128/JCM.02495-14

PubMed Abstract | CrossRef Full Text | Google Scholar

Frémond, M.-L., Pérot, P., Muth, E., Cros, G., Dumarest, M., Mahlaoui, N., et al. (2015). Next-generation sequencing for diagnosis and tailored therapy: a case report of astrovirus-associated progressive encephalitis. J. Pediatr. Infect. Dis. Soc. 4 (3), e53–e57. doi:10.1093/jpids/piv040

CrossRef Full Text | Google Scholar

Ge, M., Gan, M., Yan, K., Xiao, F., Yang, L., Wu, B., et al. (2021). Combining metagenomic sequencing with whole exome sequencing to optimize clinical strategies in neonates with a suspected central nervous system infection. Front. Cell Infect. Microbiol. 11, 671109. doi:10.3389/fcimb.2021.671109

PubMed Abstract | CrossRef Full Text | Google Scholar

Goldberg, B., Sichtig, H., Geyer, C., Ledeboer, N., and Weinstock, G. M. (2015). Making the leap from research laboratory to clinic: challenges and opportunities for next-generation sequencing in infectious disease diagnostics. mBio 6 (6), 018888–e1915. doi:10.1128/mBio.01888-15

CrossRef Full Text | Google Scholar

Graf, E. H., Simmon, K. E., Tardif, K. D., Hymas, W., Flygare, S., Eilbeck, K., et al. (2016). Unbiased detection of respiratory viruses by use of RNA sequencing-based metagenomics: a systematic comparison to a commercial PCR panel. J. Clin. Microbiol. 54 (4), 1000–1007. doi:10.1128/JCM.03060-15

PubMed Abstract | CrossRef Full Text | Google Scholar

Gu, W., Crawford, E. D., O’Donovan, B. D., Wilson, M. R., Chow, E. D., Retallack, H., et al. (2016). Depletion of abundant sequences by hybridization (DASH): using Cas9 to remove unwanted high-abundance species in sequencing libraries and molecular counting applications. Genome Biol. 17, 41. doi:10.1186/s13059-016-0904-5

PubMed Abstract | CrossRef Full Text | Google Scholar

Gu, W., Miller, S., and Chiu, C. Y. (2019). Clinical metagenomic next-generation sequencing for pathogen detection. Annu. Rev. Pathol. 14, 319–338. doi:10.1146/annurev-pathmechdis-012418-012751

PubMed Abstract | CrossRef Full Text | Google Scholar

Han, D., Li, Z., Li, R., Tan, P., Zhang, R., and Li, J. (2019). mNGS in clinical microbiology laboratories: on the road to maturity. Crit. Rev. Microbiol. 45 (5-6), 668–685. doi:10.1080/1040841X.2019.1681933

PubMed Abstract | CrossRef Full Text | Google Scholar

Hyeon, J. Y., Mann, D. A., Townsend, A. M., and Deng, X. (2018). Quasi-metagenomic analysis of Salmonella from food and environmental samples. J. Vis. Exp. 140, 58612. doi:10.3791/58612

PubMed Abstract | CrossRef Full Text | Google Scholar

Infect Inflamm Rep (2020). Consensus group of experts on application of metagenomic next generation sequencing in the pathogen diagnosis in clinical moderate and severe infections, et al. Expert consensus for the application of metagenomic next generation sequencing in the pathogen diagnosis in clinical moderate and severe infections (first edition). Infect. Inflamm. Rep. 21 (2). doi:10.3760/cma.j.cn121430-20200228-00095

CrossRef Full Text | Google Scholar

Koenig, L., and Youn, E. (2011). Hierarchical signature clustering for time series microarray data. Adv. Exp. Med. Biol. 696, 57–65. doi:10.1007/978-1-4419-7046-6_6

PubMed Abstract | CrossRef Full Text | Google Scholar

Miao, Q., Ma, Y., Wang, Q., Pan, J., Zhang, Y., Jin, W., et al. (2018). Microbiological diagnostic performance of metagenomic next-generation sequencing when applied to clinical practice. Clin. Infect. Dis. 67 (2), S231. doi:10.1093/cid/ciy693

PubMed Abstract | CrossRef Full Text | Google Scholar

Miller, S., Naccache, S. N., Samayoa, E., Messacar, K., Arevalo, S., Federman, S., et al. (2019). Laboratory validation of a clinical metagenomic sequencing assay for pathogen detection in cerebrospinal fluid. Genome Res. 29 (5), 831–842. doi:10.1101/gr.238170.118

PubMed Abstract | CrossRef Full Text | Google Scholar

Mitchell, S. L., and Simner, P. J. (2019). Next-generation sequencing in clinical microbiology: are we there yet? Clin. Lab. Med. 39 (3), 405–418. doi:10.1016/j.cll.2019.05.003

PubMed Abstract | CrossRef Full Text | Google Scholar

Nimmo, C., Doyle, R., Burgess, C., Williams, R., Gorton, R., McHugh, T. D., et al. (2017). Rapid identification of a Mycobacterium tuberculosis full genetic drug resistance profile through whole genome sequencing directly from sputum. Int. J. Infect. Dis. 62, 44–46. doi:10.1016/j.ijid.2017.07.007

PubMed Abstract | CrossRef Full Text | Google Scholar

Petersen, L. M., Martin, I. W., Moschetti, W. E., Kershaw, C. M., and Tsongalis, G. J. (2019). Third-generation sequencing in the Clinical Laboratory: exploring the advantages and challenges of nanopore sequencing. J. Clin. Microbiol. 58, e01315-19–e01319. doi:10.1128/JCM.01315-19

PubMed Abstract | CrossRef Full Text | Google Scholar

Salipante, S. J., Hoogestraat, D. R., Abbott, A. N., SenGupta, D. J., Cummings, L. A., Butler-Wu, S. M., et al. (2014). Coinfection of Fusobacterium nucleatum and Actinomyces israelii in mastoiditis diagnosed by next-generation DNA sequencing. J. Clin. Microbiol. 52 (5), 1789–1792. doi:10.1128/JCM.03133-13

PubMed Abstract | CrossRef Full Text | Google Scholar

Schlaberg, R., Chiu, C. Y., Miller, S., Procop, G. W., Weinstock, G., et al. Professional Practice Committee and Committee on Laboratory Practices of the American Society for Microbiology (2017). Validation of metagenomic next-generation sequencing tests for universal pathogen detection. Arch. Pathol. Lab. Med. 141 (6), 776–786. doi:10.5858/arpa.2016-0539-RA

PubMed Abstract | CrossRef Full Text | Google Scholar

Simner, P. J., Miller, S., and Carroll, K. C. (2018). Understanding the promises and hurdles of metagenomic next-generation sequencing as a diagnostic tool for infectious diseases. Clin. Infect. Dis. 66 (5), 778–788. doi:10.1093/cid/cix881

PubMed Abstract | CrossRef Full Text | Google Scholar

Thorburn, F., Bennett, S., Modha, S., Murdoch, D., Gunson, R., and Murcia, P. R. (2015). The use of next generation sequencing in the diagnosis and typing of respiratory infections. J. Clin. Virol. 69, 96–100. doi:10.1016/j.jcv.2015.06.082

PubMed Abstract | CrossRef Full Text | Google Scholar

van Boheemen, S., van Rijn, A. L., Pappas, N., Carbo, E. C., Vorderman, R. H. P., Sidorov, I., et al. (2020). Retrospective validation of a metagenomic sequencing protocol for combined detection of RNA and DNA viruses using respiratory samples from pediatric patients. J. Mol. Diagn 22, 196–207. doi:10.1016/j.jmoldx.2019.10.007

PubMed Abstract | CrossRef Full Text | Google Scholar

Votintseva, A. A., Bradley, P., Pankhurst, L., Del Ojo Elias, C., Loose, M., Nilgiriwala, K., et al. (2017). Same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples. J. Clin. Microbiol. 55 (5), 1285–1298. doi:10.1128/JCM.02483-16

PubMed Abstract | CrossRef Full Text | Google Scholar

Wilson, M. R., Naccache, S. N., Samayoa, E., Biagtan, M., Bashir, H., Yu, G., et al. (2014). Actionable diagnosis of neuro leptospirosis by next-generation sequencing. N. Engl. J. Med. 370 (25), 2408–2417. doi:10.1056/NEJMoa1401268

PubMed Abstract | CrossRef Full Text | Google Scholar

Wylie, K. M., Weinstock, G. M., and Storch, G. A. (2013). Virome genomics: a tool for defining the human virome. Curr. Opin. Microbiol. 16, 479–484. doi:10.1016/j.mib.2013.04.006

PubMed Abstract | CrossRef Full Text | Google Scholar

Xie, Y., Du, J., Jin, W., Teng, X., Cheng, R., Huang, P., et al. (2019). Next generation sequencing for diagnosis of severe pneumonia: China, 2010-2018. J. Infect. 78 (2), 158–169. doi:10.1016/j.jinf.2018.09.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Yang, J., Yang, F., Ren, L., Xiong, Z., Wu, Z., Dong, J., et al. (2011). Unbiased parallel detection of viral pathogens in clinical samples by use of a metagenomic approach. J. Clin. Microbiol. 49 (10), 3463–3469. doi:10.1128/JCM.00273-11

PubMed Abstract | CrossRef Full Text | Google Scholar

Zinter, M. S., Mayday, M. Y., Ryckman, K. K., Jelliffe-Pawlowski, L. L., and DeRisi, J. L. (2019). Towards precision quantification of contamination in metagenomic sequencing experiments. Microbiome 7 (1), 62. doi:10.1186/s40168-019-0678-6

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: mNGS, ROC curve, indicators, reads, rank

Citation: Jiang X, Yan J, Huang H, Ai L, Yu X, Zhong P, Chen Y, Liang Z, Qiu W, Huang H, Yan W, Liang Y, Chen P and Wang R (2023) Development of novel parameters for pathogen identification in clinical metagenomic next-generation sequencing. Front. Genet. 14:1266990. doi: 10.3389/fgene.2023.1266990

Received: 28 July 2023; Accepted: 27 October 2023;
Published: 17 November 2023.

Edited by:

Nathan Olson, National Institute of Standards and Technology (NIST), United States

Reviewed by:

Abdolrahman Khezri, Inland Norway University of Applied Sciences, Norway
Jiemin Zhou, Vision Medicals Co., Ltd., China
Han Xia, Hugobiotech Co., Ltd., China

Copyright © 2023 Jiang, Yan, Huang, Ai, Yu, Zhong, Chen, Liang, Qiu, Huang, Yan, Liang, Chen and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ruizhi Wang, d2FuZ3J6aDNAbWFpbC5zeXN1LmVkdS5jbg==; Peisong Chen, Y2hwc0BtYWlsMy5zeXN1LmVkdS5jbg==

^†These authors have contributed equally to this work

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.