Towards an accurate and robust analysis pipeline for somatic mutation calling

Jin, Jingjie; Chen, Zixi; Liu, Jinchao; Du, Hongli; Zhang, Gong

doi:10.3389/fgene.2022.979928

METHODS article

Front. Genet. , 15 November 2022

Sec. Genomic Assay Technology

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.979928

This article is part of the Research Topic Insights in Genomic Assay Technology: 2021 View all 8 articles

Towards an accurate and robust analysis pipeline for somatic mutation calling

Jingjie Jin^1,2^†

Zixi Chen³^†

Jinchao Liu⁴^†

Hongli Du³*

Gong Zhang^1,2,4*

¹Key Laboratory of Functional Protein Research, Guangdong Higher Education Institutes, Jinan University, Guangzhou, China
²MOE Key Laboratory of Tumor Molecular Biology, Institute of Life and Health Engineering, Jinan University, Guangzhou, China
³School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
⁴Chi-Biotech Co. Ltd., Shenzhen, China

Accurate and robust somatic mutation detection is essential for cancer treatment, diagnostics and research. Various analysis pipelines give different results and thus should be systematically evaluated. In this study, we benchmarked 5 commonly-used somatic mutation calling pipelines (VarScan, VarDictJava, Mutect2, Strelka2 and FANSe) for their precision, recall and speed, using standard benchmarking datasets based on a series of real-world whole-exome sequencing datasets. All the 5 pipelines showed very high precision in all cases, and high recall rate in mutation rates higher than 10%. However, for the low frequency mutations, these pipelines showed large difference. FANSe showed the highest accuracy (especially the sensitivity) in all cases, and VarScan and VarDictJava outperformed Mutect2 and Strelka2 in low frequency mutations at all sequencing depths. The flaws in filter was the major cause of the low sensitivity of the four pipelines other than FANSe. Concerning the speed, FANSe pipeline was 8.8∼19x faster than the other pipelines. Our benchmarking results demonstrated performance of the somatic calling pipelines and provided a reference for a proper choice of such pipelines in cancer applications.

Introduction

Somatic mutation is a key to provide insights and treatment of cancer. Most targeted cancer therapies are targeting specific somatic mutations, such as EGFR, BRAF, VEGF and BRCA mutations in various cancers (Gerlinger et al., 2012; Hirsch et al., 2017). Somatic mutation of circulating tumor DNA (ctDNA) may also serve as an indicator of cancer and its progression (Rolfo et al., 2021). Recently, the somatic mutation-derived cancer neoantigens become potential targets for immunotherapy (Ma et al., 2022). Therefore, accurate detection using next-generation sequencing (NGS) at genome-wide scale is crucial in cancer clinical applications. Although many analysis pipelines have been developed, the accuracy of somatic mutation detection is still a great problem. For example, a study showed zero sensitivity in finding pathogenic mutations using whole exome sequencing of 57 patients (Park et al., 2015). The mutation of 40 ctDNA samples, sequenced by two individual companies, showed only 12% congruence (Torga and Pienta, 2018). Such a low reproducibility illustrated the well-known “alarming reproducibility crisis” (Nekrutenko and Taylor, 2012).

When the specimen collection and experimental processes are standardized and quality-controlled, getting high quality sequencing raw data is expected. However, different computational pipelines can produce significantly different results. Many studies exhibited low concordance of variant-calling pipelines (Park et al., 2015; O'Rawe et al., 2013). The analysis pipeline normally contains two steps: mapping and somatic mutation calling. There are many algorithms available for each step. Users often randomly choose tools for analysis, creating a chaos in the field. Most benchmarking efforts used simulated datasets to test various pipelines (Wang et al., 2013; Ewing et al., 2015; Kroigard et al., 2016), but their conclusions hardly match, probably because the features of the simulated datasets vary. This also indicated that their performance in the simulated datasets may not necessarily reflect the performance in real-world applications. A few approaches used datasets of clinical samples (Alioto et al., 2015). However, it is almost impossible to yield true set for somatic mutations at genome-wide scale. In another aspect, with the rapid decreasing NGS experimental cost, the computational cost becomes a major part. The more and more sophisticated variant-calling algorithms often implements multiple filter steps, leading to prolonged running time. However, the clinical applications usually need the analysis as fast as possible. Therefore, it is also important to investigate how to accelerate the analysis when ensuring the accuracy.

Various somatic mutation detection pipelines have been developed, such as VarScan (Koboldt et al., 2012), VarDictJava (Lai et al., 2016), Mutect2 (in GATK (McKenna et al., 2010)), Strelka2 (Kim et al., 2018) and FANSe (Zhang et al., 2021). All these pipelines have been successfully used in clinical cases to detect specific cancer mutation, especially driver mutations (Welch et al., 2012; Desai et al., 2018; Lin et al., 2020; Mathioudaki et al., 2020; Farswan et al., 2021). Although single mutations can be experimentally validated by other methods, the accuracy of genome-wide mutation detections needs to be systematically evaluated using standard benchmarking datasets with known mutation results (“true set”). Chen et al. generated somatic mutation benchmarking datasets using real-world data (Chen et al., 2020). They mixed the whole-exome sequencing (WES) datasets of two normal human cell lines together to generate test datasets, and provided somatic mutation true sets using the germline variations, which is relatively easy to get. Therefore, such benchmarking datasets are representative. In this study, we evaluated the abovementioned 5 somatic mutation detection pipelines for their accuracy (precision and recall) and speed, at various mutation rates and sequencing depths.

Materials and methods

Datasets and true sets

Standard somatic mutation benchmarking files were taken from Chen et al. (Chen et al., 2020). Sequencing depth: 100 ×, 200 ×, 300 ×, 500 ×, 800 ×. Mutation rates: 1%, 5%, 10%, 20%, 30%, and 40%. For each configuration (depth and mutation rate), three independent files were used. The true sets for each dataset were also taken from Chen et al. (Chen et al., 2020)

Hardware

For the speed test, all analyses were run in a workstation with AMD Threadripper 1950X CPU (16 cores, 32 threads) and 128GB RAM.

Calling somatic mutations

BWA-MEM (Li and Durbin, 2009) and FANSe3 (Zhang et al., 2021) were used to map raw reads to the human reference genome hg19. The BWA-MEM were set to default parameters, and FANSe3 was set to “5% error tolerance, indel detection on, unique mapped reads only”.

In the SNV (single nucleotide variation) detection stage, the major parameters of the pipelines were set as follows:

VarDictJava(v1.8.3): “-f 0.01” was both set when runing VarDict program and var2vcf_paired.pl script including in VarDictJava.

VarScan(v2.4.2): “--min-coverage 1 --min-reads2 1 --min-var-freq 0.001” was set when running pileup2snp subprogram to call SNV, then “--min-coverage-tumor 3 --min-coverage 3 --min-coverage-normal 3 --min-var-freq 0.001 --somatic-p-value 1” was set when running somatic subprogram to call somatic variants.

Mutect2(v4.1.0.0, v4.1.5.0, v4.2.0.0, v4.2.5.0): parameter “disable-read-filter” was set to “MateOnSameContigOrNoMappedMateReadFilter”.

Strelka2(2.9.10): parameter “--exome” was set.

FANSe3 (v3.12, commercial version): minimum coverage was set to 3.

The detailed workflow of somatic mutation is illustrated in Figure 1.

FIGURE 1

FIGURE 1. Workflow of 5 somatic mutation pipelines.

Somatic mutations exported by all 5 pipelines were then compared to the true mutation set. The precision, recall and F-score were defined as Chen et al. (Chen et al., 2020) In brief, Precision rate, recall rate and F-score were defined as TP/(TP + FP), TP/(TP + FN) and 2*recall*precision/(recall + precision), respectively, where TP = true positives, FP = false positives, FN = false negatives.

Results

Running speed of the pipelines

In clinical practice, hundreds of cancer samples can be sequenced in one batch due to the capability of modern next-generation sequencers, creating massive computational workloads. Therefore, running speed can be a limiting factor of the entire pipeline. In general, the analysis time comprises three major parts: network transfer/decompression time, mapping time and SNV calling time.

Due to the large file size, transferring and decompressing the FASTQ files from storage server to the analysis server requires considerable time (Figure 2A). The mapping was performed using the BWA-MEM and FANSe3 algorithms, respectively. The speed of FANSe3 was almost 8 times faster than BWA-MEM (Figure 2B). In the SNV-calling stage, Mutect2, Strelka2, Varscan and VarDictJava accepted the .bam files from BWA-MEM, while FANSe needs its specific SNV module to call SNVs due to its special output format (Figure 2C). There are enormous differences between the speed of these pipelines. Strelka2 used ∼20 min to finish the SNV calling for the 800 × WES dataset, which was 20 times faster than Varscan, Mutect2 and VarDictJava. However, FANSe SNV module used only less than 3 min to finish the SNV calling, 6.7 times faster than the Strelka2 and 131 times faster than Mutect2. When summing all the running times together, the FANSe solution showed great advantage in speed (Figure 2D). When deployed in cloud-computing infrastructure, FANSe can be configured to do the mapping when the file transfer is going on. This would further save the time of transfer and thus decrease the total time of processing (Figure 2D).

FIGURE 2

FIGURE 2. The speed of the 5 pipelines tested on an AMD Threadripper 1950X computer with 128G RAM. (A) Transfer (gigabit LAN) and decompression time of the datasets. (B) The mapping time of FANSe3 and BWA-MEM algorithms. (C) The SNV calling time. (D) The sum of running time of 5 pipelines.

Recall and precision of the pipelines

Previous study showed that the Mutect2 and Strelka2 pipelines could not effectively detect the low frequency somatic mutations even at 800x sequencing depth (Chen et al., 2020), which prohibited them from clinical practice. However, after excluding the PCR duplicates (e.g., using UMI), a specific nucleotide mutation should be detectable given high enough sequencing depths. Theoretically, at sequencing depth of 800 × and 0.5% sequencing error, 1% mutation rate should be significantly detected (p = 0.021, Fisher exact test). Therefore, we believe that there should be algorithms that can reliably detect most of the 1% somatic mutations at high sequencing depth. Indeed, the recall rate of FANSe, VarDictJava and VarScan was overwhelming against Mutect2 and Strelka2 (Figure 3A). For 800 × datasets, FANSe reached ∼87% recall rate, and VarDictJava and VarScan reached 81% and 79% in average, respectively. This performance was already informative. In contrast, Mutect2 and Strelka2 reached 26% and 23%, respectively, even lower than the recall rate of FANSe at 100 × sequencing depth (28%). This validated our hypothesis that proper algorithm can effectively detect low frequency somatic mutations.

FIGURE 3

FIGURE 3. Accuracy of the 5 pipelines. All detailed numbers are listed in Supplementary Table S1. (A) The recall rate of the 5 pipelines at different sequencing depths and mutation rates. For each mutation rate and sequencing depth, the data points for the 3 datasets were all shown, and the lines illustrate their mean values. (B) The precision-recall diagram of the 5 pipelines. Color indicates the pipeline, and the dot size indicates the sequencing depths. (C) The F-scores of the 5 pipelines.

At all sequencing depths and mutation rates, FANSe’s recall rate was higher than VarDictJava and VarScan. For example, for 1% and 5% mutation rates, FANSe at 200 × depth was comparable to VarDictJava and VarScan at 300 ×. This indicated that using FANSe lowers the need of sequencing throughput, i.e., reduces the sequencing cost.

The precision of all pipelines were quite high (Figure 3B). At 1% mutation rate, the precision of all pipelines were greater than 95%, and in higher mutation rates, the precision were all greater than 99% except VarDictJava. At 1% mutation rate, FANSe showed the best recall rate but slightly lower precision than VarDictJava and VarScan. However, the precision of FANSe rapidly increased when the mutation rate was higher. When mutation rate was higher than 5%, Mutect2 and VarDictJava showed remarkable disadvantage in precision and recall. When assessed by the criteria F-score (Figure 3C), FANSe was the best in all cases, and the VarScan ranked the second. VarDictJava performed quite well at 1% and 5% mutation rates, but was surpassed by Mutect2 and Strelka2 at higher mutation rates.

Version advances cause remarkable difference in results

It has been noted since decade that the different version of the same software can generate totally different results when analyzing the same NGS dataset, contributing to the “reproducibility crisis”. In the 5 pipelines tested in this study, Mutect2 was constantly updated for newer versions. We tested the Mutect2 versions 4.1.0, 4.1.5, 4.2.0 and 4.2.5 on the benchmarking datasets. The v4.1.5 showed a remarkable drop in both recall rate and precision (Figure 4A). Newer version did not necessarily provide better recall and precision. For higher sequencing throughput and/or higher mutation rate (≥5%), the version-dependent variation was in general smaller. Assessed using the F-score, not a single version was the best in all cases (Figure 4B). For example, v4.1.0 had the best F-score at 30–40% mutation rates and 200–300 × depth, and had a large advantage against all later versions at 1% mutation rate and 800 × depth. The newest v4.2.5 was not the best at all mutation rates and 800 × depth. Moreover, the somatic mutations called by different versions differ largely. As an example, for a dataset of 1% mutation rate and 100x depth, v4.2.0 and v4.2.5 screened 115 and 114 somatic mutations, respectively. 93% of them overlap. However, the consistency of the v4.2.5 and v4.1.0 was only 64% (Figure 4C). Although algorithmic improvements may be implemented during the version iterations, such a high inconsistency set a warning that the users must be aware the discrepancy caused by the software versions.

FIGURE 4

FIGURE 4. Comparison of four versions of the Mutect2. Detailed numbers were listed in Supplementary Table S1. (A) The precision-recall diagram of Mutect2 v4.1.0 to v4.2.5. The size of dots indicates the version, and the color indicates the sequencing depths. For each sequencing depth and each mutation rate, only one dataset was chosen to make the diagram for visibility. (B) The F-scores of the four versions of the Mutect2. (C) The Venn diagram of the somatic mutations detected by four versions of the Mutect2.

Filtering is a key of SNV detection

The somatic mutations detected by the 5 pipelines showed very little consistency. For a dataset of 1% mutation rate and 100 × depth, only 69 somatic mutations were identified by all 5 pipelines (Figure 5A). FANSe identified many more somatic mutations than the other pipelines, and the mutations identified by the other pipelines were almost covered by FANSe (Figure 5B). Considering the high precision of FANSe, the mutations that were solely identified by FANSe were almost all true. It is an interesting question why the other pipelines missed so many mutations. After comparison, we found that 11 mutations were missed by other pipelines because the BWA-MEM failed to map the mismatch-containing reads to the reference genome (Figure 5C, details listed in Supplementary Table S2). Most of such reads contained multiple mismatches against the reference sequences (an example was illustrated in Figure 5D), indicating that the BWA-MEM mapping algorithm is not robust enough to map such reads. The rest 439∼811 (>97.5%) missed mutations were detected by the SNV-calling algorithms but were missed because the algorithms filtered them out (Figure 5C). Obviously, these filters need to be improved. However, the details of these filters were not clear in the literatures, prohibiting us to raise suggestions to improve.

FIGURE 5

FIGURE 5. Comparison of the somatic mutations identified by 5 pipelines on the dataset #1 of 1% mutation rate, 100x sequencing depth. (A) The Venn diagram of the somatic mutations identified by the 5 pipelines. (B) The somatic mutations identified by FANSe pipeline and the other pipelines, respectively. (C) The reason of the missed mutations by the other 4 pipelines compared to FANSe. (D) An example of mutation site (chr11, position 60531346), where the mismatch-containing reads were not mapped by BWA-MEM. Four mismatch-containing reads were mapped by FANSe and piled on the reference sequence. The read #1∼#3 were mapped to the forward strand and the read #4 was mapped to the reverse strand (illustrated as reverse-complement). Red bases illustrated the mismatches against the reference sequences.

Discussion

Taking the advantage of the standard benchmarking WES datasets, we can assess the performance of any somatic mutation detection pipelines. Chen et al. compared the Mutect2 and Strelka2 pipelines. They concluded that the sensitivity at low mutation frequency is inacceptable, and cannot be compensated by the elevation of sequencing depth. However, we found that many other pipelines like VarDictJava, VarScan and FANSe, showed several times higher sensitivity than Mutect2 and Strelka2. This indicated that there are many more sensitive algorithms available; however, most researchers, especially those who are not professional in bioinformatics, did not choose a proper analysis pipeline based on real-world data benchmarking. This algorithm blind pick may generate misleading results that confuse further investigations, and causes inconsistent conclusions, intensifying the “alarming reproducibility crisis”.

The good news from our study is the very high precision of all 5 tested pipelines: at 1% mutation frequency, the precision is above 95%, and at higher mutation frequency, the precision is nearly perfect. This means that the somatic mutations called by these pipelines were nearly true. However, the sensitivity is a major challenge, i.e., most pipelines missed a considerable fraction of somatic mutations. Considering only the computational analysis, flaws in two steps contribute to the loss of sensitivity: 1) The miss of the mapping algorithms. The insufficient robustness and error tolerance of the mapping algorithms failed to map some mismatch-containing reads to the reference genomes, especially when these reads also contains some germline mutations. In this study, we found that 1∼2.5% of the somatic mutations were missed by this reason, which is relatively minor. 2) The improper filter. Each SNV-calling algorithm has its distinguished statistical model, but is rarely publicized in detail. Therefore, it is hard for the users to suggest improvements of the model. In addition, the filter criteria are often changed during the version upgrade, leading to considerable inconsistency over the versions. We demonstrated such inconsistency in Mutect2. The researchers need to be aware of this. Some algorithms like Mutect2 contain many filter steps, creating complexity that might reduce robustness when compared to simpler models like VarScan.

FANSe pipeline showed distinct advantage in both accuracy and speed. It has the best F-score in all cases, especially at low mutation frequency and low sequencing depth. At 1% mutation rate, its sensitivity reached 87% for 800 × sequencing depth, which is quite useful in clinical practice. At 5% mutation rate, its sensitivity reached 95% for 300 × depth; increasing the sequencing depth to 800 × only slightly increased the sensitivity to 97%. At 10% mutation rate, 200 × depth almost reaches the stationary level of sensitivity, and more throughput hardly benefits. This means that the sensitive and accurate somatic mutation detection can be achieved at relatively low sequencing cost. Also, its speed is 8.8 times faster than the Strelka2, and 19 times faster than VarScan, which showed a great advantage in the computational efficiency–and cost. It may be the optimal solution for somatic mutation detection in the precision medicine era.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

Author contributions

GZ and HD conceived and supervized the study. JJ, ZC and JL analyzed the data. JL and GZ visualized the data. JL and GZ wrote the manuscript.

Funding

This study was collectively supported by the Ministry of Science and Technology of China, National Key Research and Development Program (2018YFC0910200/2017YFA0505001) and Guangdong Key R&D Program (2019B020226001).

Conflict of interest

JL and GZ were employed by the company Chi-Biotech Co. Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2022.979928/full#supplementary-material

References

Alioto, T. S., Buchhalter, I., Derdak, S., Hutter, B., Eldridge, M. D., Hovig, E., et al. (2015). A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nat. Commun. 6, 10001. doi:10.1038/ncomms10001

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, Z., Yuan, Y., Chen, X., Chen, J., Lin, S., Li, X., et al. (2020). Systematic comparison of somatic variant calling performance among different sequencing depth and mutation frequency. Sci. Rep. 10 (1), 3501. doi:10.1038/s41598-020-60559-5

PubMed Abstract | CrossRef Full Text | Google Scholar

Desai, P., Mencia-Trinchant, N., Savenkov, O., Simon, M. S., Cheang, G., Lee, S., et al. (2018). Somatic mutations precede acute myeloid leukemia years before diagnosis. Nat. Med. 24 (7), 1015–1023. doi:10.1038/s41591-018-0081-z

PubMed Abstract | CrossRef Full Text | Google Scholar

Ewing, A. D., Houlahan, K. E., Hu, Y., Ellrott, K., Caloian, C., Yamaguchi, T. N., et al. (2015). Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nat. Methods 12 (7), 623–630. doi:10.1038/nmeth.3407

PubMed Abstract | CrossRef Full Text | Google Scholar

Farswan, A., Jena, L., Kaur, G., Gupta, A., Gupta, R., Rani, L., et al. (2021). Branching clonal evolution patterns predominate mutational landscape in multiple myeloma. Am. J. Cancer Res. 11 (11), 5659–5679.

PubMed Abstract | Google Scholar

Gerlinger, M., Rowan, A. J., Horswell, S., MathM., , Larkin, J., Endesfelder, D., et al. (2012). Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366 (10), 883–892. doi:10.1056/NEJMoa1113205

PubMed Abstract | CrossRef Full Text | Google Scholar

Hirsch, F. R., Scagliotti, G. V., Mulshine, J. L., Kwon, R., Curran, W. J., Wu, Y. L., et al. (2017). Lung cancer: Current therapies and new targeted treatments. Lancet 389 (10066), 299–311. doi:10.1016/S0140-6736(16)30958-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Kim, S., Scheffler, K., Halpern, A. L., Bekritsky, M. A., Noh, E., Kallberg, M., et al. (2018). Strelka2: Fast and accurate calling of germline and somatic variants. Nat. Methods 15 (8), 591–594. doi:10.1038/s41592-018-0051-x

PubMed Abstract | CrossRef Full Text | Google Scholar

Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., et al. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22 (3), 568–576. doi:10.1101/gr.129684.111

PubMed Abstract | CrossRef Full Text | Google Scholar

Kroigard, A. B., Thomassen, M., Lænkholm, A. V., Kruse, T. A., and Larsen, M. J. (2016). Evaluation of nine somatic variant callers for detection of somatic mutations in exome and targeted deep sequencing data. PLoS One 11 (3), e0151664. doi:10.1371/journal.pone.0151664

PubMed Abstract | CrossRef Full Text | Google Scholar

Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., et al. (2016). VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44 (11), e108. doi:10.1093/nar/gkw227

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25 (14), 1754–1760. doi:10.1093/bioinformatics/btp324

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, H., Zhang, G., Zhang, X. C., Lian, X. L., Zhong, W. Z., Su, J., et al. (2020). Germline variation networks in the PI3K/AKT pathway corresponding to familial high-incidence lung cancer pedigrees. Bmc Cancer 20 (1), 1209. doi:10.1186/s12885-020-07528-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Ma, W., Pham, B., and Li, T. (2022). Cancer neoantigens as potential targets for immunotherapy. Clin. Exp. Metastasis 39 (1), 51–60. doi:10.1007/s10585-021-10091-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Mathioudaki, A., Ljungstrom, V., Melin, M., Arendt, M. L., Nordin, J., Karlsson, A., et al. (2020). Targeted sequencing reveals the somatic mutation landscape in a Swedish breast cancer cohort. Sci. Rep. 10 (1), 19304. doi:10.1038/s41598-020-74580-1

PubMed Abstract | CrossRef Full Text | Google Scholar

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., et al. (2010). The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20 (9), 1297–1303. doi:10.1101/gr.107524.110

PubMed Abstract | CrossRef Full Text | Google Scholar

Nekrutenko, A., and Taylor, J. (2012). Next-generation sequencing data interpretation: Enhancing reproducibility and accessibility. Nat. Rev. Genet. 13 (9), 667–672. doi:10.1038/nrg3305

PubMed Abstract | CrossRef Full Text | Google Scholar

O'Rawe, J., Jiang, T., Sun, G., Wu, Y., Wang, W., Hu, J., et al. (2013). Low concordance of multiple variant-calling pipelines: Practical implications for exome and genome sequencing. Genome Med. 5 (3), 28. doi:10.1186/gm432

PubMed Abstract | CrossRef Full Text | Google Scholar

Park, J. Y., Clark, P., Londin, E., Sponziello, M., Kricka, L. J., and Fortina, P. (2015). Clinical exome performance for reporting secondary genetic findings. Clin. Chem. 61 (1), 213–220. doi:10.1373/clinchem.2014.231456

PubMed Abstract | CrossRef Full Text | Google Scholar

Rolfo, C., Mack, P., Scagliotti, G. V., Aggarwal, C., Arcila, M. E., Barlesi, F., et al. (2021). Liquid biopsy for advanced nsclc: A consensus statement from the international association for the study of lung cancer. J. Thorac. Oncol. 16 (10), 1647–1662. doi:10.1016/j.jtho.2021.06.017

PubMed Abstract | CrossRef Full Text | Google Scholar

Torga, G., and Pienta, K. J. (2018). Patient-paired sample congruence between 2 commercial liquid biopsy tests. JAMA Oncol. 4 (6), 868–870. doi:10.1001/jamaoncol.2017.4027

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., et al. (2013). Detecting somatic point mutations in cancer genome sequencing data: A comparison of mutation callers. Genome Med. 5 (10), 91. doi:10.1186/gm495

PubMed Abstract | CrossRef Full Text | Google Scholar

Welch, J. S., Ley, T. J., Link, D. C., Miller, C. A., Larson, D. E., Koboldt, D. C., et al. (2012). The origin and evolution of mutations in acute myeloid leukemia. Cell 150 (2), 264–278. doi:10.1016/j.cell.2012.06.023

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, G., Zhang, Y., and Jin, J. (2021). The ultrafast and accurate mapping algorithm FANSe3: Mapping a human whole-genome sequencing dataset within 30 minutes. Phenomics 1 (1), 22–30. doi:10.1007/s43657-020-00008-5

CrossRef Full Text | Google Scholar

Keywords: somatic mutation analysis, algorithm, accuracy, performance, next-generation sequencing—NGS

Citation: Jin J, Chen Z, Liu J, Du H and Zhang G (2022) Towards an accurate and robust analysis pipeline for somatic mutation calling. Front. Genet. 13:979928. doi: 10.3389/fgene.2022.979928

Received: 16 July 2022; Accepted: 01 November 2022;
Published: 15 November 2022.

Edited by:

Youri I. Pavlov, University of Nebraska Medical Center, United States

Reviewed by:

Anna S. Zhuk, Saint Petersburg State University, Russia
Chen Ding, Fudan University, China

Copyright © 2022 Jin, Chen, Liu, Du and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hongli Du, aGxkdUBzY3V0LmVkdS5jbg==; Gong Zhang, emhhbmdnb25nLXVuaUBxcS5jb20=

^†These authors have contributed equally to this work

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Towards an accurate and robust analysis pipeline for somatic mutation calling

Introduction

Materials and methods

Datasets and true sets

Hardware

Calling somatic mutations

Results

Running speed of the pipelines

Recall and precision of the pipelines

Version advances cause remarkable difference in results

Filtering is a key of SNV detection

Discussion

Data availability statement

Author contributions

Funding

Conflict of interest

Publisher’s note

Supplementary material

References

95% of researchers rate our articles as excellent or good

95% of researchers rate our articles as excellent or good