
ORIGINAL RESEARCH article

Front. Microbiol., 06 March 2019
Sec. Infectious Agents and Disease

Validation of a Bioinformatics Workflow for Routine Analysis of Whole-Genome Sequencing Data and Related Challenges for Pathogen Typing in a European National Reference Center: Neisseria meningitidis as a Proof-of-Concept

Bert Bogaerts1, Raf Winand1, Qiang Fu1, Julien Van Braekel1, Pieter-Jan Ceyssens2, Wesley Mattheus2, Sophie Bertrand2, Sigrid C. J. De Keersmaecker1, Nancy H. C. Roosens1 and Kevin Vanneste1*
  • 1Transversal Activities in Applied Genomics, Sciensano, Brussels, Belgium
  • 2Bacterial Diseases, Sciensano, Brussels, Belgium

Despite being a well-established research method, the use of whole-genome sequencing (WGS) for routine molecular typing and pathogen characterization remains a substantial challenge due to the required bioinformatics resources and/or expertise. Moreover, many national reference laboratories and centers, as well as other laboratories working under a quality system, require extensive validation to demonstrate that employed methods are “fit-for-purpose” and provide high-quality results. A harmonized framework with guidelines for the validation of WGS workflows does not yet exist, however, despite several recent case studies highlighting the urgent need for one. We present a validation strategy focusing specifically on the exhaustive characterization of the bioinformatics analysis of a WGS workflow designed to replace conventionally employed molecular typing methods for microbial isolates in a representative small-scale laboratory, using the pathogen Neisseria meningitidis as a proof-of-concept. We adapted several classically employed performance metrics specifically toward three different bioinformatics assays: resistance gene characterization (based on the ARG-ANNOT, ResFinder, CARD, and NDARO databases), several commonly employed typing schemas (including, among others, core genome multilocus sequence typing), and serogroup determination. We analyzed a core validation dataset of 67 well-characterized samples typed by means of classical genotypic and/or phenotypic methods that were sequenced in-house, allowing us to evaluate the repeatability, reproducibility, accuracy, precision, sensitivity, and specificity of the different bioinformatics assays. We also analyzed an extended validation dataset composed of publicly available WGS data for 64 samples by comparing results of the different bioinformatics assays against results obtained from commonly used bioinformatics tools. We demonstrate high performance, with values for all performance metrics >87%, >97%, and >90% for the resistance gene characterization, sequence typing, and serogroup determination assays, respectively, for both validation datasets. Our WGS workflow has been made publicly available as a “push-button” pipeline for Illumina data at https://galaxy.sciensano.be to showcase its implementation for non-profit and/or academic usage. Our validation strategy can be adapted to other WGS workflows for other pathogens of interest and demonstrates the added value and feasibility of integrating WGS into routine use in an applied public health setting.

Introduction

Whole-genome sequencing (WGS) has become a well-established technique, spurred by the rapid development of different next-generation sequencing (NGS) technologies, and ample case studies have been published in recent years that demonstrate the added value of WGS for surveillance monitoring and outbreak cases for many microbial pathogens of interest in public health (Mellmann et al., 2011; Kwong et al., 2015; Aanensen et al., 2016; Charpentier et al., 2017; Harris et al., 2018). WGS offers the potential to replace traditional molecular approaches for typing of microbial pathogens because of several advantages: it is more cost-efficient, less labor-intensive, and faster, and provides more information per sample at a higher resolution (Gilmour et al., 2013; Kwong et al., 2015; Allard, 2016; Deurenberg et al., 2017). WGS has, for instance, enabled the development of novel typing methods such as core genome multilocus sequence typing (cgMLST), which expands the breadth of standard MLST by including several hundreds of loci (Maiden et al., 2013). Additionally, resolution down to the nucleotide level enables pathogen comparison and clustering with unprecedented precision (Carriço et al., 2013; Dallman et al., 2015; Ronholm et al., 2016). A gap nevertheless still exists between the acclaimed success and the everyday implementation and usage of this technology in a public health setting, especially for many national reference laboratories (NRLs) and centers (NRCs) in smaller and/or less developed countries, which do not always have access to the same resources that are available to public health agencies in larger and/or more developed countries that already routinely process large volumes of samples with NGS technologies (WHO, 2018). In Europe, recent surveys in 2016 by both the European Food Safety Authority (EFSA) (García Fierro et al., 2018) and the European Centre for Disease Prevention and Control (ECDC) (Revez et al., 2017) indicated that NGS was being used in 17 out of 30 and 25 out of 29 responding constituents, respectively, and that large discrepancies existed between different European countries in the advancement of implementing this technology for different microbial pathogens of interest, with a lack of expertise and financial resources often quoted as reasons.

The data analysis bottleneck in particular represents a serious obstacle because it typically consists of a stepwise process that is complex and cumbersome for non-experts. An overview of data analysis tools that can be used for capacity building was recently published by the ENGAGE consortium, which aims to establish NGS capacity for genomic analysis in Europe (Hendriksen et al., 2018). Many of these tools still require substantial expertise because they are only available through the command line on Linux, but a subset is also available as web-based platforms with a user-friendly interface open to the scientific community. For instance, an entire suite of tools for pathogen characterization through WGS data has been made available by the Center for Genomic Epidemiology1 hosted at the Technical University of Denmark, allowing, among others, assembly, serotyping, virulence detection, plasmid replicon detection, MLST, and phylogenetic clustering (Deng et al., 2016), and is frequently used by the different enforcement laboratories in Europe (García Fierro et al., 2018). PubMLST2 is another popular web-based platform that maintains databases with sequence typing information and schemas for a wide variety of pathogens, and can be queried with WGS data (Jolley and Maiden, 2010). Some resources have also been developed tailored specifically toward certain pathogens, such as NeisseriaBase as a platform for comparative genomics of Neisseria meningitidis (Katz et al., 2011). While these resources are most definitely useful, they do have some disadvantages. Several databases and tools typically still need to be combined manually, whereas an integrated approach encompassing all required analytical steps is preferred for a public health laboratory (Lindsey et al., 2016). In addition, a set of standardized tools and guidelines has not yet been defined, limiting collaboration and reproducibility between different NRCs and NRLs that all have their own way of analyzing WGS data (Rossen et al., 2018). Many of these resources also lack traceability: database versions, and tool parameters and versions, can be missing in the output or change without notice, making it hard to compare and exchange results with other laboratories. Systematic international collaboration between different NRCs and NRLs across multiple years is, however, only possible when a standardized workflow is used. Finally, the time between submitting data to the webserver and receiving results varies and could be a limiting factor in emergency cases, unless the application is installed locally.

Moreover, most NRCs and NRLs operate according to a strict quality system that requires an extensive validation to demonstrate that methods are “fit-for-purpose,” thereby fulfilling the task for which the method was developed in order to produce high-quality results, which is also important to obtain accreditation (Rossen et al., 2018). For classical typing methods, the process of validation is typically dependent upon the exact type of analysis and the (often limited) number of well-characterized samples. A standardized approach to validate WGS for routine use in public health laboratories for microbiological applications is not available yet and still under development by the International Organization for Standardization (ISO) in the working group “WGS for typing and genomic characterization” (ISO TC34-SC9-WG25) (Portmann et al., 2018). Although this working group is expected to lead to an official standard in the next few years, many NRCs and NRLs already face the need for validation at present, as evidenced by many recent case studies that describe the validation of components of the WGS workflow. Portmann et al. (2018) presented the validation of an end-to-end WGS workflow for source tracking of Listeria monocytogenes and Salmonella enterica. Holmes et al. (2018) reported the validation of a WGS workflow for the identification and characterization of Shiga toxin-producing Escherichia coli (STEC) focusing on standardization between different public health agencies. Mellmann et al. (2017) described external quality assessment options for WGS. Lindsey et al. (2016) documented the validation of WGS for a commercial solution for STEC. Dallman et al. (2015) reported the validation of WGS for outbreak detection and clustering of STEC. Recently, Kozyreva et al. (2017) detailed an entire modular template for the validation of the WGS process not limited to certain species but generally applicable for a public health microbiology laboratory. Such case studies help to propel the implementation of WGS for clinical microbiology, but the comprehensive validation of the underlying bioinformatics analysis has not been documented yet. This is, however, of paramount importance, as bioinformatics analysis is inherently part of the evaluation of every step of the entire WGS workflow, going from sample isolation, DNA extraction, library preparation, and sequencing to the actual bioinformatics assays. It is therefore imperative to thoroughly validate this step before the other levels of the WGS workflow are evaluated (Angers-Loustau et al., 2018). The bioinformatics analysis acts as the “most common denominator” between these different steps, allowing their performance to be compared and evaluated. An exhaustive validation of the bioinformatics analysis for WGS for clinical and/or public health microbiology has, however, not yet been described, and is not an easy task because classical performance metrics cannot be directly applied to bioinformatics analyses, and it is often not possible to obtain a realistic ‘gold standard’ for systematic evaluation (Kozyreva et al., 2017).

As a proof-of-concept, we focus here on N. meningitidis, a Gram-negative bacterium responsible for invasive meningococcal disease, causing symptoms such as meningitis, septicemia, pneumonia, septic arthritis, and occasionally inflammatory heart disorders. The Belgian NRC Neisseria analyzes approximately 100–130 strains per year, and traditionally employed the following molecular techniques for pathogen surveillance: species identification by real-time polymerase chain reaction (qPCR), matrix-assisted laser desorption/ionization, or biochemistry to verify that the infection is caused by N. meningitidis and not by another pathogen that also causes bacterial meningitis, such as Streptococcus pneumoniae or L. monocytogenes; serogroup determination by slide agglutination or qPCR of the capsule genes; drug susceptibility and antibiotic resistance testing by determining the minimum inhibitory concentration on plated samples; and subtyping by PCR followed by Sanger sequencing of several loci of interest, such as the classic seven MLST genes (Maiden et al., 1998) and vaccine candidates such as Factor H-binding protein (fHbp) (Brehony et al., 2009). The rapid progression, high fatality rate, and frequent complications render N. meningitidis an important public health priority in Belgium and an ideal candidate to investigate the feasibility of using WGS while effectively mitigating the data analysis bottleneck. At the same time, the strict requirement for ISO 15189 accreditation, which deals with the quality and competence of medical laboratories for all employed tests and is demanded by the Belgian national stakeholder (the national institute for health and disability insurance), renders it an ideal proof-of-concept to investigate how to validate the bioinformatics workflow.

We describe here the first exhaustive validation of a bioinformatics workflow for microbiological isolate WGS data, extensively documenting the performance of different bioinformatics assays at the genotypic level by means of a set of traditional performance metrics with corresponding definitions and formulas that were adapted specifically for WGS data. The WGS workflow was evaluated on both a set of in-house sequenced reference samples and publicly available data generated on Illumina sequencing platforms, and demonstrated high performance. Our validation strategy can serve as a basis to validate other bioinformatics workflows that employ WGS data, irrespective of the targeted pathogen, and illustrates the feasibility of employing WGS as an alternative to traditional molecular techniques in a relatively small-scale laboratory in a public health context.

Materials and Methods

Bioinformatics Workflow

Data (Pre-)processing and Quality Control

Figure 1 provides an overview of the bioinformatics workflow. The workflow supports all WGS data generated by means of the Illumina technology. Pre-trimming data quality reports are first generated to obtain an overview of raw read quality with FastQC 0.11.5 (available at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) using default settings. Afterward, Trimmomatic 0.36 (Bolger et al., 2014) is used to trim raw reads using the following settings: “LEADING:10” (i.e., first remove all residues at the beginning of reads with a Q-score < 10), “TRAILING:10” (i.e., then remove all residues at the end of reads with a Q-score < 10), “SLIDINGWINDOW:4:20” (i.e., then clip reads as soon as the average Q-score is <20 over a sliding window of four residues), and “MINLEN:40” (i.e., then remove all reads that are <40 residues long after the previous steps). The “ILLUMINACLIP” option can be set to either “Nextera,” “TruSeq2,” or “TruSeq3” dependent upon the used sequencing protocol (see below), and is added to the command as “[adapter]-PE:2:30:10.” All other options are left at their default values. Post-trimming data quality reports are then generated using FastQC to obtain an overview of processed read quality. Afterward, processed paired-end reads are de novo assembled using SPAdes 3.10.0 (Bankevich et al., 2012). Orphaned reads resulting from trimming (i.e., reads where only one read of the pair survived) are provided to the assembler as unpaired reads. The “--careful” option is used. All other options are left at their default values (kmers are chosen automatically based on the maximum read length). Assembly statistics such as N50 and number of contigs are calculated with QUAST 4.4 (Gurevich et al., 2013) using default settings. The processed reads are then mapped back onto the assembled contigs using Bowtie2 2.3.0 (Langmead and Salzberg, 2012) with the following settings: “--sensitive,” “--end-to-end,” and “--phred33” (all other options are left at their default values). The mapped reads are used to estimate the coverage by calculating the median of the per position depth values reported by SAMtools depth 1.3.1 (Li et al., 2009) using default settings (SPAdes by default also maps reads back to the assembly but reports coverage in terms of kmer coverage). Lastly, several quality metrics are checked to determine whether data quality is sufficient before proceeding to the actual bioinformatics assays. Threshold values for these quality metrics were set based on the quality ranges observed during validation by selecting more and less stringent values for metrics exhibiting less and more variation between samples/runs, respectively. An overview of all quality metrics and their corresponding warning and failure thresholds is provided in Table 1.
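To illustrate how these components fit together, the sketch below chains the trimming, assembly, mapping, and coverage-estimation steps in Python by calling the same command-line tools. It is a minimal sketch, assuming the tools are installed and available on the PATH; the input file names and adapter file are illustrative placeholders, not part of our workflow.

```python
import statistics
import subprocess

r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"  # hypothetical input reads
adapters = "NexteraPE-PE.fa"  # choose to match the library preparation protocol

# 1. Trim raw reads with the Trimmomatic settings quoted above.
subprocess.run(
    ["trimmomatic", "PE", r1, r2,
     "R1_paired.fq.gz", "R1_orphan.fq.gz", "R2_paired.fq.gz", "R2_orphan.fq.gz",
     f"ILLUMINACLIP:{adapters}:2:30:10",
     "LEADING:10", "TRAILING:10", "SLIDINGWINDOW:4:20", "MINLEN:40"],
    check=True)

# 2. De novo assembly; orphaned reads (only one mate survived) go in as unpaired.
subprocess.run("cat R1_orphan.fq.gz R2_orphan.fq.gz > orphans.fq.gz",
               shell=True, check=True)
subprocess.run(
    ["spades.py", "--careful", "-1", "R1_paired.fq.gz", "-2", "R2_paired.fq.gz",
     "-s", "orphans.fq.gz", "-o", "assembly"],
    check=True)

# 3. Map the processed reads back onto the contigs with Bowtie2.
subprocess.run(["bowtie2-build", "assembly/contigs.fasta", "contigs_idx"],
               check=True)
subprocess.run(
    ["bowtie2", "--sensitive", "--end-to-end", "--phred33", "-x", "contigs_idx",
     "-1", "R1_paired.fq.gz", "-2", "R2_paired.fq.gz", "-S", "mapped.sam"],
    check=True)
subprocess.run("samtools sort -o mapped.bam mapped.sam && samtools index mapped.bam",
               shell=True, check=True)

# 4. Coverage = median of the per-position depths reported by SAMtools depth.
depth = subprocess.run(["samtools", "depth", "mapped.bam"],
                       capture_output=True, text=True, check=True)
per_position = [int(line.split("\t")[2]) for line in depth.stdout.splitlines()]
print("median coverage:", statistics.median(per_position))
```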


Figure 1. Overview of the bioinformatics workflow. Each box represents a component corresponding to a series of tasks that provide a certain well-defined functionality (indicated in bold). Major bioinformatics utilities employed in each module are also mentioned (indicated in italics). Abbreviations: paired-end (PE).


Table 1. Advanced quality control metrics with their associated definitions and threshold values for warnings and failures.

Resistance Gene Characterization Assay

Genotypic antimicrobial resistance is detected by identifying resistance genes with nucleotide BLAST+ 2.6.0 (Camacho et al., 2009) using default parameters against four widely used resistance gene databases: ARG-ANNOT (Gupta et al., 2014), CARD (Jia et al., 2017), ResFinder (Zankari et al., 2012), and NDARO3. These databases are automatically pulled in-house and updated on a weekly basis (the date of the last database update is always included in the output). First, hits that cover less than 60% of, or have less than 90% identity to, the subject are removed. Second, overlapping hits (i.e., hits located on the same contig with at least one base overlap) are grouped into clusters. The best hit for each cluster is then determined using the method for allele scoring described by Larsen et al. (2012). The different possibilities for “hit types” and their corresponding color codes used in the output are detailed in Supplementary Table S1. Visualizations of pairwise alignments are extracted from the BLAST output generated with the pairwise output format (“-outfmt 1”).
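A minimal sketch of this filtering and clustering logic is given below, assuming BLAST+ tabular output with the listed columns; the best-hit score shown (coverage times identity) is a simplified stand-in for the allele-scoring method of Larsen et al. (2012), not the exact formula used by the workflow.

```python
import csv

def parse_hits(tabular_path):
    """Parse BLAST+ tabular output produced with
    -outfmt '6 qseqid sseqid pident length qstart qend slen'."""
    with open(tabular_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            qseqid, sseqid, pident, length, qstart, qend, slen = row[:7]
            yield {"contig": qseqid, "gene": sseqid,
                   "identity": float(pident),
                   "coverage": 100.0 * int(length) / int(slen),
                   "qstart": min(int(qstart), int(qend)),
                   "qend": max(int(qstart), int(qend))}

def passes(hit):
    # Keep hits covering >=60% of the database gene with >=90% identity.
    return hit["coverage"] >= 60.0 and hit["identity"] >= 90.0

def cluster(hits):
    # Group hits on the same contig that overlap by at least one base.
    clusters = []
    for hit in sorted(hits, key=lambda h: (h["contig"], h["qstart"])):
        if (clusters and clusters[-1][-1]["contig"] == hit["contig"]
                and hit["qstart"] <= max(h["qend"] for h in clusters[-1])):
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters

def best_hits(hits):
    # Report one best hit per cluster (simplified score: coverage * identity).
    return [max(c, key=lambda h: h["coverage"] * h["identity"])
            for c in cluster(filter(passes, hits))]
```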

Sequence Typing Assay

Several relevant databases for sequence typing hosted by the PubMLST platform4 (Jolley and Maiden, 2010) are employed for genotypic sequence typing (Table 2). All sequences and profiles are obtained using the REST API (Jolley et al., 2017) and are automatically pulled in-house and updated on a weekly basis (the date of the last database update is always included in the output). For every database, loci are typed separately by aligning the assembled contigs against all allele sequences of that locus using Blastn and Blastx for nucleotide and protein sequences, respectively (Camacho et al., 2009). Filtering and best hit identification are performed as described previously for resistance gene characterization. If multiple exact matches exist, the longest one is reported. For protein sequences, alignment statistics are calculated based on the translated sequences. The different possibilities for “hit types” and their corresponding color codes used in the output are detailed in Supplementary Table S2. If sequence type definitions are available and the detected allele combination matches a known sequence type, this is reported along with associated metadata in the output (for classic MLST: corresponding clonal complex; for rplF: genospecies and associated comments). For rpoB (rifampicin resistance) and penA (penicillin resistance), the phenotypically tested susceptibility to the corresponding antibiotics of strains carrying that particular allele is also retrieved from PubMLST and included in the output.
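As an illustration of the database retrieval step, the following sketch pulls allele sequences and profile definitions through the PubMLST REST API (Jolley et al., 2017). The endpoint paths follow the published API, but the locus and scheme ID shown are assumptions made for illustration only and should be verified against the API documentation.

```python
import urllib.request

BASE = "https://rest.pubmlst.org/db/pubmlst_neisseria_seqdef"

def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read().decode()

# All allele sequences for one locus, returned as FASTA.
with open("abcZ.fasta", "w") as out:
    out.write(fetch(f"{BASE}/loci/abcZ/alleles_fasta"))

# Profile definitions (allele combinations -> sequence types) for one scheme;
# scheme 1 is assumed here to be the classic seven-gene MLST scheme.
with open("mlst_profiles.tsv", "w") as out:
    out.write(fetch(f"{BASE}/schemes/1/profiles_csv"))
```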


Table 2. Overview of employed typing schemas#.

Serogroup Determination Assay

Serogroup determination is based on the genotypic sequence typing of the capsule loci (Harrison et al., 2013). Serogroup profiles are obtained from PubMLST using the REST API and are automatically pulled in-house and updated on a weekly basis (the date of the last database update is always included in the output). Serogroup profiles are available on PubMLST for the following 10 serogroups: A, B, C, E, H, L, W135, X, Y, and Z. Genotypic profiles for other serogroups are not available and can hence not be detected by the assay. The serogroups are assigned to categories based on the number and type of hits (see Supplementary Table S2) for the corresponding schemas. The first category contains serogroups for which all loci are found as perfect hits. In the second category, all loci are found as perfect or imperfect identity hits. In the third category, loci are detected as perfect hits, imperfect identity hits, imperfect short hits, or multi-hits. In the fourth category, not all loci are detected. The serogroup of the highest possible category is always reported. When there are multiple possible serogroups in the highest category, the serogroup with the highest fraction of perfect hits (i.e., number of perfect hits divided by number of loci in the schema) is reported. Serogroup determination fails when less than 75% of loci are detected for the best serogroup reported according to the above classification.
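The category-based ranking described above can be summarized in the following sketch, where the hit-type labels are illustrative and the category definitions follow the four tiers in the text:

```python
def categorize(hits):
    """hits: hit types for all loci of one serogroup schema (None = not found)."""
    if any(h is None for h in hits):
        return 4                                    # category 4: missing loci
    if all(h == "perfect" for h in hits):
        return 1                                    # category 1: all loci perfect
    if all(h in ("perfect", "imperfect_identity") for h in hits):
        return 2                                    # category 2
    return 3                                        # category 3: short/multi-hits

def call_serogroup(schema_hits):
    """schema_hits maps serogroup -> list of hit types for its capsule loci."""
    ranked = sorted(
        schema_hits.items(),
        key=lambda item: (categorize(item[1]),      # best (lowest) category first
                          -sum(h == "perfect" for h in item[1]) / len(item[1])))
    serogroup, hits = ranked[0]
    detected = sum(h is not None for h in hits) / len(hits)
    return serogroup if detected >= 0.75 else None  # fail below 75% detected loci

# Example: B wins over C because all of its loci are perfect hits.
print(call_serogroup({"B": ["perfect"] * 5,
                      "C": ["perfect"] * 4 + ["imperfect_identity", None]}))
```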

Implementation and Availability

The bioinformatics workflow was implemented in collaboration with the experts from the Belgian NRC Neisseria to ensure it complied with the needs of its actual end users by providing a “push-button” pipeline solution. On the front-end, the bioinformatics workflow was integrated as a stand-alone tool into a local instance of the Galaxy Workflow Management System (Afgan et al., 2016) to ensure a user-friendly interface that only requires uploading the data and selecting the desired bioinformatics assays (the data pre-processing and quality control are always executed by default). An illustration of the input interface is provided in Supplementary Figure S1. The bioinformatics workflow is compatible with data from all Illumina sequencing platforms (other sequencing technologies are not supported). Threshold values for some quality metrics, such as the sequence length distribution (Table 1), are dynamically adapted based on the detected sequence length. Results are presented as an interactive user-friendly HTML output report and a tabular summary file. The HTML output report presents the results of all quality checks and bioinformatics assays, and also contains linked data such as the trimmed reads, assembly, and all alignments, which can easily be accessed and/or downloaded by clicking the interactive links and can afterward be used for additional analyses, not part of the routine requirements, within Galaxy or other bioinformatics software. The tabular summary file contains a compilation of the most important statistics and results in tab-separated format that can be useful for programmatic processing. On the back-end, the bioinformatics workflow was written in Python 2.7 and set up to comply with both the direct and indirect needs of the NRC. All required components (tools, databases, etc.) were directly integrated within the high-performance computing infrastructure at our institute. Employed public databases are pulled in-house automatically on a weekly basis to ensure that results are always up-to-date and that execution can be performed at all times without any direct dependencies on external resources (e.g., for outbreak situations). All tool parameters and options were optimized and validated (see below) to ensure that no parameter tweaking is required. The full workflow including all assays takes on average only 1 h to run to completion for a dataset sequenced at 60× coverage, which is only a fraction of the time required for data generation, which can take multiple days. All individual components are version controlled and traceable, including tool versions (managed through Lmod, available at https://github.com/TACC/Lmod), databases (managed through Git), in-house code (managed through Git), and workflow runs (managed by storing all essential information for rerunning the workflow in a custom-designed SQL database). This “push-button” pipeline is also available at the public instance of the Galaxy Workflow Management System of our institute, which is accessible at https://galaxy.sciensano.be. This is offered as a free resource for academic and non-profit usage (registration required), specifically intended for scientists and laboratories from other smaller European and/or developing countries that process a limited number of samples and/or do not have access to the required expertise and/or financial means to analyze NGS data for N. meningitidis themselves (with the caveat of depending on an external service for which 100% uptime cannot be guaranteed). A specific training video for this resource is also available (see Supplementary Material).

Validation Data

Core Validation Dataset

The core validation dataset consisted of N. meningitidis reference strains selected from the global collection of 107 N. meningitidis strains maintained at the University of Oxford (Bratcher et al., 2014), for which sequence data were generated in-house. Reference strains from this biobank were originally used to validate cgMLST at the genotypic level, and were extensively characterized using several sequencing technologies and platforms, thereby constituting a valuable resource that represents the global diversity of N. meningitidis and for which high-quality genotypic information is available. A subset of 67 samples was selected by specialists from the Belgian NRC Neisseria to ensure that these cover the entire spectrum of clonal complexes that are representative for Belgium (see also the section “Discussion”). The selected samples were originally collected from 26 different countries over a timespan of more than 50 years, encompassing endemic, epidemic, and pandemic disease cases, as well as asymptomatic carriers. At least one sample was selected for each of the disease-causing serogroups (A, B, C, W135, X, Y) (Harrison et al., 2009). An overview of these 67 samples is provided in Supplementary Table S4. Genomic DNA was extracted using the column-based GenElute kit (Sigma) according to the manufacturer’s instructions. Sequencing libraries were prepared with an Illumina Nextera XT DNA sample preparation kit and sequenced on an Illumina MiSeq instrument with a 300-bp paired-end protocol (MiSeq v3 chemistry) according to the manufacturer’s instructions. The 67 selected samples were sequenced three times in total. Runs A and B were performed on the same MiSeq instrument using newly created libraries from the samples, whereas run C was performed on a different MiSeq unit but using the same libraries as prepared for run B. Run A was done by a different operator than runs B and C. All WGS data generated for these samples in the three different runs have been deposited in the NCBI Sequence Read Archive (SRA) (Leinonen et al., 2011) under accession number SRP137803. Individual accession numbers for all sequenced samples for all runs are listed in Supplementary Table S14.

Extended Validation Dataset

The extended validation dataset consisted of 64 N. meningitidis samples selected from publicly available NGS data. This additional dataset was collected to evaluate our bioinformatics workflow on data coming from different laboratories, as is often the case in real-world applications. In this extended dataset, we included additional strains of serogroups Y and W135, which are underrepresented in the global collection of 107 N. meningitidis samples maintained at the University of Oxford, and which are currently causing epidemics in both the United States and Europe (Mustapha et al., 2016; Whittaker et al., 2017) (see also the section “Discussion”). Additionally, the majority of samples in the extended validation dataset were generated by means of the HiSeq instrument, in contrast to the MiSeq used for the core validation dataset, and read lengths for this dataset were consequently typically shorter. An overview of these samples with their corresponding NCBI SRA accession numbers is available in Supplementary Table S20.

Validation of the Bioinformatics Workflow

Validation Strategy

We built upon previously described case studies (Lindsey et al., 2016; Kozyreva et al., 2017; Yachison et al., 2017; Holmes et al., 2018; Portmann et al., 2018) by implementing performance metrics that were adapted toward our purpose of exhaustively validating the bioinformatics workflow: repeatability, reproducibility, accuracy, precision, (diagnostic) sensitivity, and (diagnostic) specificity. A full overview of all performance metrics and their corresponding definitions and formulas is presented in Table 3. Some metrics were not evaluated for all bioinformatics assays, since it was not always possible to find suitable definitions in the context of the specific analysis (see also the section “Discussion”). Precision specifically refers to the positive predictive value rather than to repeatability and reproducibility, as is the case in Kozyreva et al. (2017). “Within-run” replicates refer to duplicate (bioinformatics) analysis by executing the bioinformatics workflow twice on the same dataset, for the calculation of repeatability. “Between-run” replicates refer to duplicate (bioinformatics) analysis by executing the bioinformatics workflow on the same sample sequenced on different sequencing runs, for the calculation of reproducibility. Note that reproducibility could not be calculated for the extended validation dataset because no between-run replicates were available for these samples. The accuracy, precision, sensitivity, and specificity metrics all require the classification of results as either true positives (TPs), false positives (FPs), true negatives (TNs), or false negatives (FNs), which by definition all require comparison against a reference or standard that represents the “truth,” for which we adopted two reference standards. First, genotypic information for the reference strains from the core validation dataset available in the PubMLST isolate database (accessible at https://pubmlst.org/bigsdb?db=pubmlst_neisseria_isolates) was used as a reference to compare results from our bioinformatics workflow against, referred to as the “database standard.” Second, because no such high-quality genotypic information was available for the samples from the extended validation dataset, with the sole exception of the serogroup, results of our bioinformatics workflow were compared against the results of tools commonly used and adopted by the scientific community, referred to as the “tool standard.” This second approach was also employed for the core validation dataset to evaluate consistency between both standards, and also because no genotypic information was available in the associated reference database for the resistance gene characterization assays of the core validation dataset. All analyses were done through a local “push-button” implementation of the bioinformatics workflow (see the section “Implementation and Availability”). All output files were downloaded upon completion, and the performance of the different bioinformatics assays was evaluated by querying the output files using in-house scripts that collected the performance metrics presented in Table 3. All output reports of the bioinformatics workflow are publicly available at Zenodo5.


Table 3. Overview of performance metrics and their corresponding definitions and formulas adopted for our validation strategy.
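For reference, the sketch below implements the conventional confusion-matrix formulas assumed by these metrics (the exact definitions adopted per assay are those given in Table 3); the reproducibility example reuses the ARG-ANNOT counts reported in the section “Results.”

```python
def metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics, in percent; precision is the
    positive predictive value, as stated in the text."""
    return {"accuracy":    100.0 * (tp + tn) / (tp + fp + tn + fn),
            "precision":   100.0 * tp / (tp + fp),
            "sensitivity": 100.0 * tp / (tp + fn),
            "specificity": 100.0 * tn / (tn + fp)}

def agreement(concordant, total):
    """Repeatability/reproducibility as percent concordant replicate calls."""
    return 100.0 * concordant / total

# e.g., 2 discordant between-run replicates out of 128,439 comparisons
# (ARG-ANNOT, core dataset) round to a reproducibility of 100%.
print(round(agreement(128_439 - 2, 128_439), 2))  # 100.0
```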

Resistance Gene Characterization Assay

The reportable range for all four databases corresponds to their respective gene content, and performance was evaluated at the level of the gene. For repeatability and reproducibility, replicates were considered to be in agreement when a gene (also including imperfect hits) was detected or absent in within-run and between-run replicates, respectively. For the database standard, no metrics could be calculated as no such associated information is available for either the core or the extended validation dataset. For the tool standard, no accompanying reference tool exists for the ARG-ANNOT database, for which performance metrics could therefore not be calculated. For the CARD database, Resistance Gene Identifier (RGI) 4.0.3 (Jia et al., 2017) was used as the tool standard. The program was executed on all assemblies using the following settings: database version 1.1.8, input type parameter set to “contig,” alignment tool set to BLAST (all other settings were left at their default values). Loose hits, hits that covered less than 60% of the query sequence or with less than 90% sequence identity, and hits that aligned to the protein variant model were afterward excluded. For the ResFinder database, the online web service6 was used as the reference tool, by analyzing all assemblies using the following settings: 90% identity threshold, 60% minimum length, and all antimicrobial resistance databases selected. For the NDARO database, AMRfinder (alpha version) was used as the tool standard (excluding hits covering less than 60% of the target gene). The following definitions for classification were used: TP as genes detected by both our workflow and the tool standard; FN as genes missed by our workflow but reported by the tool standard; FP as genes detected by our workflow but not reported by the tool standard; and TN as genes detected by neither our workflow nor the tool standard.
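A minimal sketch of this gene-level classification, assuming each method’s output is reduced to a set of detected gene names and the reportable range is the database’s full gene content, is:

```python
def classify(workflow_genes, standard_genes, database_genes):
    """Confusion counts for one sample against a tool standard."""
    workflow, standard = set(workflow_genes), set(standard_genes)
    return {"TP": len(workflow & standard),
            "FP": len(workflow - standard),
            "FN": len(standard - workflow),
            "TN": len(set(database_genes) - workflow - standard)}

# Illustration mirroring the CARD comparison reported below: mtrE is called
# only by our workflow (an FP against RGI); all other calls agree.
print(classify({"farA", "farB", "mtrC", "mtrD", "mtrE"},
               {"farA", "farB", "mtrC", "mtrD"},
               ["farA", "farB", "mtrC", "mtrD", "mtrE", "tetB", "ermF"]))
```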

Sequence Typing Assay

The reportable range for all sequence typing schemas corresponds to their respective gene content (see Table 2 and Supplementary Table S3). cgMLST was used as a proxy for the performance of all typing schemas because this schema contains 1605 loci, whereas no other typing schema included in our workflow contains more than 9 loci (Table 2), so that too few observations would be available to employ the formulas presented in Table 3. Similar to resistance gene characterization, a gene-by-gene approach was taken for calculating performance. For repeatability and reproducibility, replicates were considered to be in agreement when no allele was detected or the same allele was detected as a perfect hit in within-run and between-run sequencing replicates, respectively. Database standard information was extracted directly from the PubMLST isolate database (see Supplementary Table S3 for exact source locations), whereas tool standard information was obtained by using the assemblies as input for the online PubMLST sequence query tool (https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=sequenceQuery). The following definitions for classification were used: TP and FN as alleles where the output of our workflow corresponded, or did not correspond, to the database reference, respectively. In case the database standard contained several alleles for a locus, results were considered concordant as soon as one of them matched the output of our workflow. TN and FP were evaluated by querying the assemblies against a cgMLST database for an unmatched species, for which the sequence typing assay is not expected to identify any alleles (Kozyreva et al., 2017). We employed the L. monocytogenes cgMLST schema that is available through BIGSdb hosted at the Institut Pasteur (Moura et al., 2016). TN and FP were defined as alleles left unidentified or identified by our workflow, respectively. We verified this approach by checking all assemblies with the sequence query tool of BIGSdb hosted at the Institut Pasteur against the L. monocytogenes cgMLST schema, for which no allele was ever reported.

Serogroup Determination Assay

The reportable range for the serogroup determination assay corresponds to the 10 serogroups for which a schema with capsule loci exists in the PubMLST database (see Supplementary Table S3). Performance was evaluated at the level of the serogroup. For repeatability and reproducibility, replicates were considered to be in agreement when the same or no serogroup was detected in within-run and between-run replicates, respectively. Database standard information was extracted directly from the PubMLST isolate database (see Supplementary Table S3 for the exact source location), whereas tool standard information was obtained by using the assemblies as input for the online PubMLST sequence query tool (https://pubmlst.org/bigsdb?db=pubmlst_neisseria_seqdef&page=sequenceQuery). The following definitions for classification were used: TP and FN as serogroups where the output of our workflow corresponded, or did not correspond, to the reference, respectively. Note, however, that for the tool standard the sequence query tool of PubMLST does not output serogroups but rather results for all corresponding capsule loci. We therefore considered a serogroup to be detected when all its corresponding loci were detected, and samples for which this was not the case were considered as missing data. TN and FP were evaluated by querying the assemblies against a cgMLST database for an unmatched species, for which the serogroup determination assay is not expected to identify any serogroups (Kozyreva et al., 2017). We employed the L. monocytogenes serogroup schema that is available through BIGSdb hosted at the Institut Pasteur (Moura et al., 2016). TN and FP were defined as serogroups left unidentified or identified by our workflow, respectively. We verified this approach similarly by checking all assemblies with the sequence query tool of BIGSdb hosted at the Institut Pasteur against the L. monocytogenes serogrouping schema, for which no capsule locus was ever reported.

Supplementary Material

A supplementary manuscript containing all supplementary figures and tables is available as “Bogaerts_Neisseria_supplementaryMaterial.docx.” A supplementary video providing a tutorial for employing the “push-button” pipeline instance of our bioinformatics workflow is also available at Zenodo7.

Results

Evaluation of Sequence Data Quality

Core Validation Dataset

Read counts for raw and trimmed reads for all 67 samples of the core validation dataset in each of the three separate sequencing runs are provided in Supplementary Figure S4 and Supplementary Table S5. The number of raw reads per sample is in the same range for all runs, with medians of 341,547; 338,107; and 354,874 paired-end reads for runs A, B, and C, respectively, although run A displayed more variation in the number of reads per sample. The fraction of forward and reverse reads surviving trimming is generally high, with medians of 76.91, 88.83, and 90.49% for runs A, B, and C, respectively. A larger fraction of forward reads always survived trimming, indicating that the reverse reads are generally of lower quality. Assembly statistics for all samples and runs are provided in Supplementary Figure S5 and Supplementary Table S6. The N50, a metric used as a proxy for assembly quality that is defined as the length at which contigs of equal or longer length contain at least 50% of the assembled sequence (Lander et al., 2001), is comparable across all runs, with medians of 49,389; 54,223; and 56,526 bases for runs A, B, and C, respectively. Assemblies of run A contained substantially lower numbers of contigs compared to runs B and C, with medians of 182, 822, and 832 contigs for runs A, B, and C, respectively. Results for all runs are, however, comparable when only contigs >1,000 bases are considered, with medians of 87, 85, and 82 contigs for runs A, B, and C, respectively. Sample Z4242 in run A is an outlier for both N50 (8,743 bases) and number of contigs >1,000 bases (295 contigs), indicating a highly fragmented assembly. More contigs were hence generated for all samples in runs B and C compared to run A, but these were typically <1,000 bases, so that the overall N50 is similar for all runs. Advanced quality statistics for all samples and runs are provided in Supplementary Figure S6 and Supplementary Table S7. The coverage for samples in run A displays more variation and is lower compared to runs B and C, with medians of 45, 60, and 65× for runs A, B, and C, respectively; all values are nevertheless above recommendations from other studies indicating that 20×–40× is required for high-quality results (Lindsey et al., 2016). The percentage of cgMLST genes identified was very high across all runs, with medians of 97.76, 97.94, and 97.94% for runs A, B, and C, respectively, close to the mean observation of 98.8% across five different laboratories during a ring trial for Staphylococcus aureus (Mellmann et al., 2017). Sample Z4242 in run A is again an outlier (see above), as it is the only sample in all three runs with <95% identified cgMLST genes. The mapping rate was generally also very high, with medians of 99.25, 98.30, and 98.30% for runs A, B, and C, respectively, with the exception of sample Z4242 in run A. All samples for all runs passed all quality checks listed in Table 1, with some warnings but never a failure for any metric.
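As a worked illustration of the N50 definition quoted above, the following sketch computes it as the length of the contig at which the running sum over contigs, sorted from longest to shortest, first reaches half of the total assembled length:

```python
def n50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached >=50% of the assembled sequence
            return length

# Toy assembly of five contigs: total 250 bases; 80 + 70 = 150 >= 125, so N50 = 70.
print(n50([80, 70, 50, 30, 20]))
```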

Extended Validation Dataset

Read counts for raw and trimmed reads for all 64 samples of the extended validation dataset are provided in Supplementary Figure S7 and Supplementary Table S21. The median number of raw reads across all samples is 1,826,497. The fraction of forward and reverse reads surviving read trimming is very high, with a median value of 99.58% across all samples. These observations are in line with the higher coverage and shorter read lengths provided by the HiSeq sequencing instrument for samples of the extended validation dataset. Assembly statistics for all samples are provided in Supplementary Figure S8 and Supplementary Table S22. The median values for the N50, number of contigs, and number of contigs >1,000 bases are 52,341 bases, 650 contigs, and 170 contigs, respectively. These numbers indicate that the assemblies are more fragmented compared to the core validation dataset, as can be expected given their shorter read length. Three samples, ERR314131, ERR314115, and ERR278691, however displayed extremely fragmented assemblies with an N50 <1,000 bases. Advanced quality statistics for all samples are provided in Supplementary Figure S9 and Supplementary Table S23. The median coverage, percentage of identified cgMLST genes, and mapping rate are 142×, 97.82%, and 99.58%, respectively. Samples ERR314131, ERR314115, and ERR278691 did not have a single cgMLST allele detected, in line with their exceptionally low N50, and also failed several of the quality checks listed in Table 1. All other samples passed all of these quality checks, with the mean Q-score drop as the only exception for some samples that had a more pronounced Q-score dip at the beginning of their reads.

Evaluation of Performance Metrics

Resistance Gene Characterization Assay

Table 4 presents results for the core validation dataset for the four different databases, and a detailed overview of all detected genes is available in Supplementary Table S8.

For the ARG-ANNOT database, within-run replicates were always fully consistent, resulting in a repeatability of 100%. No perfect hits were detected in any sample of any run. A full-length imperfect hit with 99.67% identity to tetB was found in sample Z5826 for all sequencing runs. Another imperfect hit, covering 71.79% of ermF with 99.83% identity, was found for sample Z1073 in run B only. As this results in only two between-run replicates not in agreement on a total of 128,439 comparisons (i.e., 639 database genes times 67 samples times 3 runs), the reproducibility rounds up to 100%. Performance metrics could not be calculated for the database standard because no such information was available, nor for the tool standard because no reference tool exists for ARG-ANNOT.

For the ResFinder database, within-run replicates were always fully consistent, resulting in a repeatability of 100%. No perfect hits were detected in any sample of any run. Similar to ARG-ANNOT, a full-length imperfect hit with 99.92% identity to tetB was found in sample Z5826 for all sequencing runs; the identity differs slightly between ResFinder and ARG-ANNOT because each database contains a different variant of tetB. Another imperfect hit, covering 71.79% of ermF with 99.83% identity, was found for sample Z1073 in run B only. As this results in only two between-run replicates not in agreement on a total of 106,932 comparisons (i.e., 532 database genes times 67 samples times 3 runs), the reproducibility rounds up to 100%. Performance metrics could not be calculated for the database standard because no such information was available. For the tool standard, all detected genes were always fully consistent with results of the ResFinder web tool. The ensuing confusion matrix (i.e., a contingency table that lists the performance based on actual and predicted classes) is presented in Supplementary Table S15, and results in a perfect accuracy, precision, sensitivity, and specificity of 100%.

For the CARD database, within-run replicates were always fully consistent, resulting in a repeatability of 100%. Four genes were detected in several samples as perfect hits for all runs: farA, farB, mtrC, and mtrD. An imperfect hit, covering 71.79% of ermF with 99.83% identity, was found for sample Z1073 in run B only. Another imperfect hit, covering 71.79% of mtrR with 99.83% identity, was detected in runs B and C but not in run A for sample Z4242. As these were the only four between-run replicates not in agreement on a total of 158,187 comparisons (i.e., 787 database genes times 67 samples times 3 runs), the reproducibility rounds up to 100%. Performance metrics could not be calculated for the database standard because no such information was available. For the tool standard, all detected genes were always fully consistent with results of the RGI tool, with the exception of mtrE and aac(2′). mtrE was detected in all samples across all runs by our workflow but never by RGI, and was therefore classified as an FP in all samples. This was most likely because the mtrE sequence displayed very high nucleotide identity (∼94%) with the database variant but contained some frameshift mutations; since RGI predicts open reading frames followed by protein alignment of the translated sequences, it never detected mtrE. aac(2′) was not detected by our workflow in sample Z1092 in run B, and was therefore classified as an FN. The ensuing confusion matrix is presented in Supplementary Table S15, and results in an accuracy, precision, sensitivity, and specificity of 99.87, 87.52, 99.93, and 99.87%, respectively.

For the NDARO database, within-run replicates were always fully consistent, resulting in a repeatability of 100%. No perfect hits were detected in any sample of any run. Similar to ARG-ANNOT and ResFinder, a full-length imperfect hit with 99.93% identity to tetB was found in sample Z5826 for all sequencing runs; the identity again differs slightly due to the different variants present in the databases. Another imperfect hit, covering 63.82% of ermF with 99.65% identity, was found for sample Z1073 in run B only (corresponding to the ResFinder output for the same sample). The oxa gene was found in five more samples (Z4690 run C, Z4707 run B, Z5043 run C, Z6414 run B, and Z6430 run B) with over 99% identity and covering around 65% of the reference sequence. As this results in a total of 12 between-run replicates not in agreement on a total of 180,699 comparisons (i.e., 899 database genes times 67 samples times 3 runs), the reproducibility rounds up to 100%. Performance metrics could not be calculated for the database standard because no such information was available. For the tool standard, all detected genes were always fully consistent with results of the AMRfinder tool. The ensuing confusion matrix is presented in Supplementary Table S15, and results in a perfect accuracy, precision, sensitivity, and specificity of 100%.


Table 4. Results for the core validation dataset.

Table 5 presents results for the extended validation dataset for the four different databases. A detailed overview of all detected genes can be found in Supplementary Table S24. Within-run replicates were always fully consistent, resulting in a repeatability of 100%. Reproducibility could not be assessed as no between-run replicates were available. No perfect hits were detected for any of the resistance gene databases, but several imperfect hits were found. As for the core validation dataset, performance metrics could only be calculated for the tool standard for the ResFinder, CARD, and NDARO databases, and the resulting confusion matrices are presented in Supplementary Table S28. For the NDARO database, not a single gene in the extended validation dataset was detected by both our workflow and AMRfinder, meaning that precision and sensitivity could not be calculated. For the ResFinder database, a perfect accuracy, precision, sensitivity, and specificity of 100% were obtained. For the CARD database, there were again several FPs that were all due to the mtrE gene being detected in several samples, resulting in an accuracy, precision, sensitivity, and specificity of 99.88, 87.50, 100, and 99.88%, respectively. Results for the ResFinder and CARD databases were therefore in line with those observed for the tool standard for the core validation dataset (Table 4).


Table 5. Results for the extended validation dataset.

Sequence Typing Assay

Table 4 presents results for the core validation dataset, using cgMLST as a proxy for the performance of all sequence typing schemas (see the section “Materials and Methods”). Within-run replicates were always fully consistent, resulting in a repeatability of 100%. Results for between-run replicates are presented in Figure 2 and Supplementary Table S9. The median reproducibility is 99.56% for run A versus B, 99.63% for run A versus C, and 99.75% for run B versus C, with a reproducibility of 99.65% averaged over all three comparisons. One outlier is present for the comparisons of run A versus B and run A versus C, which is in both cases due to sample Z4242 in run A, which only had 94.14% of its cgMLST loci identified in the first place. Because results for this sample were relatively poor in run A compared to runs B and C, we checked whether contamination had occurred but found no indication thereof (see Supplementary Figure S3). For the database standard, results are presented in Figure 3 and Supplementary Table S10. TPs were defined as loci of the cgMLST schema where our workflow identified the same allele as the database standard, with median values of 97.07, 97.13, and 97.20% of cgMLST loci for runs A, B, and C, respectively. All other loci were classified as FNs, the majority of which were due to cgMLST loci being identified by our workflow while no information was available in the database standard, rather than an actual mismatch between our workflow and the database standard (Figure 3). Values for TN and FP were set at 100 and 0% of cgMLST loci, respectively, to reflect that the database standard represents the “truth” (see also the section “Discussion”). The ensuing confusion matrix is presented in Supplementary Table S16, and results in an accuracy, precision, sensitivity, and specificity of 98.62, 100, 97.12, and 100%, respectively. For the tool standard, results are presented in Figure 4 and Supplementary Table S11. TPs were defined as loci of the cgMLST schema where our workflow identified the same allele as the PubMLST sequence query tool, with medians of 98.69, 98.69, and 98.69% of cgMLST loci for runs A, B, and C, respectively. All other loci were classified as FNs, the majority of which were due to multiple alleles being identified by our workflow, in which case the longest one is reported, in contrast to the reference tool, where an allele is picked randomly, rather than an actual mismatch between our workflow and the tool standard (Figure 4). TN and FP were defined as alleles correctly unidentified, or falsely identified, when challenged with the cgMLST schema of L. monocytogenes, for which, however, neither our workflow nor the reference tool ever picked up a single allele, resulting in values for TN and FP of 100 and 0% of cgMLST loci, respectively. The ensuing confusion matrix is presented in Supplementary Table S17, and results in an accuracy, precision, sensitivity, and specificity of 99.37, 100, 98.68, and 100%, respectively.


Figure 2. Reproducibility of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing runs that are being compared, while the ordinate represents the percentage of cgMLST loci that were concordant between the same samples of different sequencing runs. Note that the ordinate starts at 94% instead of 0% to enable illustrating the variation between run comparisons more clearly. Each comparison is presented as a boxplot based on 67 samples where the boundary of the box closest to the abscissa indicates the 25th percentile, the thick line inside the box indicates the median, and the boundary of the box farthest from the abscissa indicates the 75th percentile. See also Supplementary Table S9 for detailed values for all samples and sequencing runs.


Figure 3. Database standard results of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing run, while the ordinate represents the percentages of cgMLST loci as indicated by the title above each graph. Each sequencing run is presented as a boxplot based on 67 samples (see the legend of Figure 2 for a brief explanation). The upper left graph depicts the percentage of concordant cgMLST loci, i.e., where our workflow identified the same allele as the database standard, which were classified as TPs. Note that the ordinate starts at 93% instead of 0% to enable illustrating the results more clearly. All other cases were classified as FNs, and encompass three categories. First, the upper right graph depicts the percentage of cgMLST loci for which our workflow detected a different allele than present in the database standard. Second, the bottom left graph depicts the percentage of cgMLST loci for which our workflow did not detect any allele but an allele was nevertheless present in the database standard. Third, the bottom right graph depicts the percentage of cgMLST loci for which our workflow detected an allele but for which no allele was present in the database standard. Most FNs are explained by no information being present in the database standard, followed by an actual mismatch, and only a few cases are due to our workflow failing to detect an allele that was present. See also Supplementary Table S10 for detailed values for all samples and runs.


Figure 4. Tool standard results of the sequence typing assay for the core validation dataset. The abscissa depicts the sequencing run, while the ordinate represents the percentages of cgMLST loci as indicated by the title above each graph. Each sequencing run is presented as a boxplot based on 67 samples (see the legend of Figure 2 for a brief explanation). The upper left graph depicts the percentage of concordant cgMLST loci, i.e., where our workflow identified the same allele as the tool standard, which were classified as TPs. Note that the ordinate starts at 98% instead of 0% to enable illustrating the results more clearly. All other cases were classified as FNs, and encompass two categories. First, the upper right graph depicts the percentage of cgMLST loci for which our workflow identified multiple perfect hits, of which at least one corresponded to the tool standard but was reported differently. Second, the lower left graph depicts the percentage of cgMLST loci for which our workflow detected a different allele compared to the tool standard. Most FNs are therefore explained by a different manner of handling multiple perfect hits, and only a small minority are due to an actual mismatch between our workflow and the tool standard. Furthermore, upon closer inspection, these mismatches were due to an artifact of the reference tool used for the tool standard that has been resolved in the meantime (see Supplementary Figure S2). See also Supplementary Table S11 for detailed values for all samples and runs.

Table 5 presents results for the extended validation dataset for the cgMLST assay. Within-run replicates were always fully consistent, resulting in a repeatability of 100%. Reproducibility could not be assessed as no between-run replicates were available. Performance metrics could not be calculated for the database standard either, because no such information was available. Results for the tool standard are detailed in Supplementary Table S25. Our workflow identified the same allele as the tool standard for 99.50% of cgMLST loci, which were classified as TP. All other loci were classified as FN. Neither our workflow nor the reference tool detected any alleles when challenged with the cgMLST schema of L. monocytogenes, resulting in TN and FP values of 100 and 0% of cgMLST loci, respectively. The ensuing confusion matrix is presented in Supplementary Table S29, and results in an accuracy, precision, sensitivity, and specificity of 99.78, 100, 99.52, and 100%, respectively, which is in line with results observed for the tool standard for the core validation dataset.
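For clarity, the four performance metrics reported throughout this section follow directly from the confusion-matrix counts. A minimal sketch using invented counts, not values from the Supplementary Tables:

```python
# Minimal sketch of the four performance metrics used in this study, computed
# from confusion-matrix counts; the example counts below are invented.
def performance(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Classical definitions: accuracy = (TP + TN) / (TP + FN + TN + FP), etc."""
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        "precision": 100.0 * tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": 100.0 * tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": 100.0 * tn / (tn + fp) if tn + fp else float("nan"),
    }

print(performance(tp=995, fn=5, tn=1000, fp=0))
```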

Serogroup Determination Assay

Table 4 presents results for the serogroup determination assay for the core validation dataset. Within-run replicates were always fully consistent, resulting in a repeatability of 100%. Between-run replicates were also fully consistent, with the exception of sample Z4242, for which serogroup B was detected in run A but serogroup C in runs B and C. Serogroups B and C share four common loci, with one and two additional unique loci, respectively. Because one of these unique loci was not detected due to the low quality of sample Z4242 in run A (see before), the sample was classified as serogroup B, resulting in a reproducibility of 99.50%. For the database standard, results are presented in Supplementary Table S12. TP were defined as samples for which our workflow identified the same serogroup as the database standard, which resulted in 89.55, 91.04, and 91.04% correctly predicted serogroups for runs A, B, and C, respectively. All other samples were classified as FN. Values for TN and FP were set at 100 and 0% of serogroups, respectively, to reflect that the database standard represents the “truth” (see also the section “Discussion”). The ensuing confusion matrix is presented in Supplementary Table S18, and results in an accuracy, precision, sensitivity, and specificity of 95.27, 100, 90.55, and 100%, respectively. For the tool standard, results are presented in Supplementary Table S13. TP were defined as samples for which our workflow identified the same serogroup as the tool standard, which resulted in values of 100% for all runs, so that not a single sample was classified as FN. TN and FP were defined as samples for which the serogroup was correctly not identified, or falsely identified, when challenged with the serogroup schema of L. monocytogenes; neither our workflow nor the reference tool ever picked up a single capsule locus, resulting in TN and FP values of 100 and 0% of samples, respectively. The ensuing confusion matrix is presented in Supplementary Table S19, and results in a perfect accuracy, precision, sensitivity, and specificity of 100%.
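The discrepancy for sample Z4242 illustrates a general failure mode of locus-based serogroup prediction: serogroups that share most capsule loci are separated only by their serogroup-specific loci, so a low-quality assembly that drops those loci can tip the call toward the serogroup with the smaller defining set. The sketch below demonstrates this with a deliberately simplified decision rule and hypothetical defining-locus sets; neither the rule nor the locus sets are taken from the actual genogrouping schema:

```python
# Hypothetical illustration only: simplified serogroup call from detected
# capsule loci. Locus sets and the decision rule are assumptions, not the
# actual genogrouping schema.
DEFINING = {
    "B": {"cssA", "cssB", "cssC", "cssD", "csb"},          # 4 shared + 1 unique
    "C": {"cssA", "cssB", "cssC", "cssD", "csc", "cssF"},  # 4 shared + 2 unique
}

def call_serogroup(detected: set) -> str:
    """Prefer a serogroup whose full defining set is present; else best partial match."""
    complete = [sg for sg, loci in DEFINING.items() if loci <= detected]
    if complete:
        return max(complete, key=lambda sg: len(DEFINING[sg]))
    return max(DEFINING, key=lambda sg: len(DEFINING[sg] & detected) / len(DEFINING[sg]))

shared_only = {"cssA", "cssB", "cssC", "cssD"}        # unique loci lost in a poor run
print(call_serogroup(shared_only))                     # -> B (4/5 beats 4/6)
print(call_serogroup(shared_only | {"csc", "cssF"}))   # -> C (complete match)
```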

Table 5 presents results for the extended validation dataset for the serogroup determination assay. Within-run replicates were always fully consistent, resulting in a repeatability of 100%. Reproducibility could not be assessed as no between-run replicates were available. Results for the database standard are presented in Supplementary Table S26. Our workflow identified the same serogroup as the database standard for 92.19% of samples, which were classified as TP. All other samples were classified as FN. Values for TN and FP were set at 100 and 0% of samples, respectively, to reflect that the database standard represents the “truth” (see also the section “Discussion”). The ensuing confusion matrix is presented in Supplementary Table S30, and results in an accuracy, precision, sensitivity, and specificity of 96.09, 100, 92.19, and 100%, respectively. Results for the tool standard are presented in Supplementary Table S27. Our workflow identified the same serogroup as the tool standard for 100% of samples, which were classified as TP, and no samples were classified as FN. Neither our workflow nor the reference tool detected any capsule loci when challenged with the serogroup schema of L. monocytogenes, resulting in TN and FP values of 100 and 0% of samples, respectively. The ensuing confusion matrix is presented in Supplementary Table S31, and results in a perfect accuracy, precision, sensitivity, and specificity of 100%. Results for both the database and tool standards were therefore in line with results observed for the core validation dataset (Table 4).

Discussion

We report here the first exhaustive validation of a bioinformatics workflow for clinical microbiological isolate WGS data. We employed the pathogen N. meningitidis as a proof-of-concept by designing a bioinformatics workflow (Figure 1) that incorporates different quality checks (Table 1) and relevant typing schemas (Table 2), with the aim of either extracting information that ensures backward compatibility with currently existing “classical” molecular biology techniques (e.g., serotyping), or alternatively taking advantage of the full potential offered by WGS by extracting information at the scale of the full genome (e.g., cgMLST). Our study is relevant because recent surveys by both the EFSA (García Fierro et al., 2018) and the ECDC (Revez et al., 2017) have indicated that, at least in Europe, the data analysis and required expertise remain substantial bottlenecks impeding the implementation of NGS for routine use in microbiology. For NRCs and NRLs, as well as other laboratories working under a quality system, the absence of a harmonized framework for validation of the WGS workflow presents an additional obstacle (Rossen et al., 2018). A series of recently published studies have, however, showcased the need for such a framework, presenting validation approaches for certain components of the WGS workflow that focus on a modular template for the validation of WGS processes (Kozyreva et al., 2017), the entire workflow “end-to-end” (Portmann et al., 2018), standardization (Holmes et al., 2018), external quality assessment (Mellmann et al., 2017), commercial bioinformatics software (Lindsey et al., 2016), outbreak clustering (Dallman et al., 2015), or specific assays such as serotyping (Yachison et al., 2017). We complement these studies by proposing a validation strategy focusing specifically on the bioinformatics analysis of the WGS workflow to exhaustively evaluate performance at this level, which is crucial because the bioinformatics component serves as the “common denominator” through which the different steps of the WGS workflow (e.g., library preparation, sequencing, etc.), or even different WGS workflows and/or sequencing technologies, can be compared. Although the workflow components and quality metrics listed in Figure 1 and Table 1 will need to be adapted, the validation strategy proposed in our study is platform-agnostic and can be tailored toward other sequencing technologies. The underlying premise of our validation strategy is to demonstrate that the bioinformatics workflow is “fit-for-purpose,” defined by the ISO/IEC 17025 standard as “confirmation by examination and provision of objective evidence that the particular requirements for a specific intended use are fulfilled” (ISO/IEC 17025:2005). We addressed this by employing several classical performance metrics with definitions and formulas adapted specifically toward the nature of the different bioinformatics assays (Table 3), which were evaluated on both a core and an extended validation dataset. The core validation dataset was constructed by in-house sequencing of a selection of 67 samples from the global collection of 107 N. meningitidis strains maintained at the University of Oxford (Bratcher et al., 2014), because this collection contains strains that encompass much of the genetic diversity encountered within Belgium and can therefore be considered representative for our country.
Moreover, this collection has been extensively characterized, and high-quality genotypic information is available for many relevant assays such as cgMLST and serogrouping, thereby providing a database standard against which to compare the results of our bioinformatics workflow. The extended validation dataset was composed by selecting 64 samples from publicly available NGS data, and therefore allowed us to expand the validation scope to genotypes underrepresented in this reference collection and/or sequenced on different (Illumina) platforms. The Y and W135 serogroups are a notable example of recently endemic cases in Belgium that are underrepresented in the global collection of 107 N. meningitidis samples. However, because no high-quality genotypic information was available for the sequence typing assay for this dataset, we employed an alternative reference based on results obtained through bioinformatics tools and resources commonly used and adopted by the scientific community (Hendriksen et al., 2018), such as the suite of tools for pathogen typing and characterization of the Center for Genomic Epidemiology (Deng et al., 2016) and PubMLST (Jolley and Maiden, 2010), which served as a tool standard against which to compare the results of our bioinformatics workflow. The same approach was also taken for the core validation dataset, both to evaluate consistency between the database and tool standards, and because high-quality genotypic information was not available for all bioinformatics assays of the core validation dataset, such as resistance gene characterization.

Results for the different performance metrics are presented in Tables 4, 5 for the core and extended validation datasets, respectively, and generally demonstrate high to very high performance, in line with results obtained from other case studies (Lindsey et al., 2016; Kozyreva et al., 2017; Yachison et al., 2017; Holmes et al., 2018; Portmann et al., 2018). Repeatability and reproducibility were defined as the agreement of within-run and between-run replicates, respectively, and evaluate concordance between runs of the bioinformatics workflow on either the same NGS dataset, or a different NGS dataset generated for the same sample and/or library. Repeatability was always 100% for all assays for both the core and extended validation datasets. Although certain components of the workflow employ heuristics to accelerate the computation, these do not appear to affect the final results. Reproducibility could only be evaluated for the core validation dataset, and was also found to be very high, reaching values of 100, 99.65, and 99.50% for the resistance gene characterization, sequence typing, and serogroup determination assays, respectively. The small number of discrepancies between sequencing runs could be traced back to differences between run A and runs B/C, whereas results of runs B and C were always much more concordant, as illustrated for the sequence typing assay in Figure 2. This could potentially indicate a library preparation effect, because runs B and C share the same libraries, sequencing instrument, and operator, but the difference is too small to draw any definitive conclusions and could also be explained by stochastic variation or the lower quality of some samples in sequencing run A (see the section “Results”). Accuracy and precision were defined as the likelihoods that results are correct and truly present, respectively. Sensitivity and specificity were defined as the likelihoods that a result will be correctly picked up when present, and not falsely picked up when absent, respectively. These four performance metrics all require classifying the results of the different bioinformatics assays as either TP, FN, TN, or FP, for which assay-specific definitions were formulated (Table 3). Classification also requires a reference that represents the best possible approximation of the “truth,” for which we used either a database or a tool standard (see above). This approach differs from the one employed by Kozyreva et al. (2017), where both standards were used without discrimination for the validation of the bioinformatics component; we consider the differentiation relevant to indicate whether performance is evaluated against high-quality genotypic information or against widely used and adopted bioinformatics tools. Note that our implementation of accuracy for bioinformatics assays also differs from other studies, where it was defined as all correct results divided by all results (Kozyreva et al., 2017; Mellmann et al., 2017; Yachison et al., 2017), whereas we adopted the more classically employed definition of all correct classifications (TP + TN) divided by all classifications (TP + FN + TN + FP) (Hardwick et al., 2017), thereby also incorporating results from the negative controls into the calculation of this metric. Additionally, we introduced precision as a separate performance metric rather than using this term to refer to both repeatability and reproducibility (Kozyreva et al., 2017).
This metric is of particular interest because it does not incorporate TN, and is therefore more informative than the specificity for bioinformatics assays for which an imbalance between the number of positive and negative classes exists, as is for instance the case for all resistance gene characterization assays. These different performance metrics therefore all provide complementary information, for which it can be helpful to consider that “accuracy and precision provide insight into correctness, whereas sensitivity and specificity measure completeness” (Olson et al., 2015).
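The practical difference between the two accuracy definitions can be made concrete with a small worked example (toy counts, not values from this study):

```python
# Toy comparison (invented counts) of the two accuracy definitions discussed
# above: counting only positive results versus the classical definition,
# which also credits the negative controls (TN).
tp, fn, tn, fp = 90, 10, 900, 0

accuracy_positive_only = tp / (tp + fn)               # 0.90: correct results / all results
accuracy_classical = (tp + tn) / (tp + fn + tn + fp)  # 0.99: also rewards the many TN
precision = tp / (tp + fp)                            # 1.00: ignores TN entirely
print(accuracy_positive_only, accuracy_classical, precision)
```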

For the resistance gene characterization assay, accuracy, precision, sensitivity, and specificity could not be calculated for either the core or the extended validation dataset for the database standard, because no such information was available. They could be calculated for the ResFinder, CARD, and NDARO databases (with the exception of precision and sensitivity for the extended validation dataset for NDARO, due to the lack of TP results) for both the core and extended validation datasets for the tool standard, and generally displayed very high performance (all values >99%), with the exception of precision for the CARD database, which was lower (∼87.50%) due to the presence of one FP in all samples that most likely represents an artifact of the RGI tool (see the section “Results”). For the sequence typing assay, precision and specificity displayed perfect scores of 100% for the core validation dataset for the database standard. Accuracy (98.62%) and sensitivity (97.12%) were only slightly lower, which could be explained by the presence of some FN (i.e., the allele detected by the workflow does not match the database standard). These were, however, due to no alleles being present for some loci in the database standard (i.e., missing data) rather than real mismatches (Figure 3), implying that both accuracy and sensitivity are underestimated. Precision and specificity also displayed perfect scores of 100% for both the core and extended validation datasets for the tool standard. Accuracy and sensitivity were again slightly lower, but still attained very high values of 99.37 and 98.68%, and 99.78 and 99.52%, for the core and extended validation datasets, respectively. This was similarly explained by the presence of some FN (i.e., the allele detected by the workflow does not match the tool standard), which were in this case due to the workflow and the tool standard handling loci with multiple exact matches differently, rather than real mismatches (Figure 4). Both the accuracy and sensitivity are hence again underestimated. Unexpectedly, accuracy and sensitivity were higher for the tool standard than for the database standard for the core validation dataset, because the reference tool, like our workflow, did manage to identify an allele for all loci for which an identifier was missing in the database standard. For the serogroup determination assay, precision and specificity displayed perfect scores of 100% for both the core and extended validation datasets for the database standard. Accuracy and sensitivity were lower, but still attained high values of 95.27 and 90.55%, and 96.09 and 92.19%, for the core and extended validation datasets, respectively. Similar to the sequence typing assay, these lower scores were caused by some FN (i.e., the detected serogroup does not match the database standard), which were, however, not due to missing data in the database standard but to actual mismatches between the workflow and the database standard. These mismatches could potentially be explained by the complexity of this assay, which adds an algorithmic layer on top of the “simpler” sequence typing (see the section “Materials and Methods”), or alternatively by inaccurate information in the database standard (see also below). Nevertheless, values for these performance metrics are in line with other reported results for in silico serotyping, as witnessed for instance in the comparison of commonly used in silico serotyping tools for Salmonella (Yachison et al., 2017).
Accuracy, precision, sensitivity, and specificity all displayed perfect scores of 100% for both the core and extended validation datasets for the tool standard. Since the reference tool (i.e., the PubMLST sequence query tool) only presents results for the individual capsule loci, this indicates that both our workflow and the tool standard experience the same difficulties in predicting the serogroup.

The following considerations should be taken into account for our validation strategy. First, all bioinformatics assays are at the level of the genotype, which does not necessarily correspond with the phenotype. It is for instance well documented that antibiotic resistance genes are not always expressed, and therefore do not always result in the corresponding resistant phenotype (Didelot et al., 2012; Holmes et al., 2018; Rossen et al., 2018). The same has been witnessed for the serotype (Lindsey et al., 2016; Yachison et al., 2017). Only systematic and long-term monitoring of genotype versus phenotype relationships will allow their concordance to be evaluated (Koser et al., 2012; Ellington et al., 2017), but in the genomic era, a paradigm shift should be envisaged whereby currently existing methods for pathogen typing based on the phenotype are phased out in favor of pathogen characterization based on the genotype, despite issues of backward compatibility with traditional methods (Yachison et al., 2017). Second, some performance metrics could easily be misinterpreted or even manipulated in certain contexts. For instance, the resistance gene characterization assay for the ResFinder database had a perfect accuracy of 100% for both the core and extended validation datasets for the tool standard, but only very few TP existed because N. meningitidis typically harbors only a limited number of antibiotic resistance genes (Rouphael and Stephens, 2012). This implies that most database genes will never be detected but still contribute to the overall accuracy through an overwhelming number of TN compared to TP. Populating any database with irrelevant genes would therefore artificially increase the accuracy of any such assay, and only properly considering and evaluating other metrics such as precision and sensitivity can put this into its proper context (Hardwick et al., 2017), as illustrated after this paragraph. Third, evaluation by means of a database standard is preferred over a tool standard because the former contains high-quality genotypic information, but the current reality is that such databases are still scarce and/or incomplete. The database standard employed for N. meningitidis still contained some missing data, since both our workflow and the reference tool managed to identify several loci for which no identifier was stored in the database standard. Additionally, for any biobank it always remains possible that mutations are introduced over time through micro-evolutionary events caused by freezing, thawing, and repeated cultivation of the reference collection, leading to differences (Mellmann et al., 2017). This highlights that the construction of “gold standard” databases would greatly benefit the scientific community by providing a high-quality reference standard against which results of different bioinformatics methods and workflows can be compared; notable efforts are already ongoing within the microbial community, such as those spearheaded by the Global Microbial Identifier initiative, and are expected to aid future validation efforts (Timme et al., 2017). Fourth, the current validation strategy is at the level of the isolate, and does not consider any phylogenomic analysis comprising many isolates. The validation of such a bioinformatics assay, however, represents an additional layer of complexity and constitutes an entire study on its own, first requiring consensus in the scientific community about how WGS workflows for isolates can be validated, to which our study represents a significant contribution.
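Returning to the second consideration, a toy calculation (invented numbers) shows how padding a resistance gene database with genes the organism never carries adds only TN results, driving the accuracy toward 100% while the sensitivity, which ignores TN, stays fixed:

```python
# Toy demonstration (illustrative numbers only) of accuracy inflation by TN:
# every irrelevant gene added to the database is correctly not detected in
# every sample, so accuracy climbs while sensitivity does not move.
tp, fn, fp = 3, 1, 0  # a species carrying very few resistance genes

for irrelevant_genes in (10, 100, 1000):
    tn = irrelevant_genes  # each padded gene contributes one TN
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    sensitivity = 100.0 * tp / (tp + fn)
    print(f"db padding {irrelevant_genes:>5}: accuracy {accuracy:6.2f}%, "
          f"sensitivity {sensitivity:.2f}%")
```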

Because access to the required bioinformatics expertise and/or resources remains an obstacle for many NRCs and other laboratories from smaller and/or less developed countries (WHO, 2018), our bioinformatics workflow has been made available as a “push-button” pipeline (compatible with all data generated through the Illumina technology), accessible free-of-charge at the public Galaxy instance of our institute (see footnote 8) for non-profit and academic usage by this target audience. Nevertheless, when using this resource, it remains paramount to properly evaluate performance through a validation strategy as described in this study, on a set of samples containing a representative (sub)set of the genetic variation typically encountered in the population under investigation, to ensure that high-quality results are obtained. Although the larger volume and genetic diversity of samples expected to be analyzed by NRCs and other laboratories from larger and/or more developed countries imply that this resource will not scale to their requirements, our “push-button” implementation can still serve them as a showcase of how the bioinformatics workflow was locally implemented and made available to the specialists of the Belgian NRC Neisseria, who employ the workflow similarly through an in-house Galaxy instance (the volume of samples in Belgium is currently not high enough to motivate automated workflow execution, nor does it present scalability issues when using the Galaxy framework). A training video is also available as a tutorial for using this resource (see Supplementary Material).
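For users who prefer scripted over interactive access, Galaxy instances can in principle also be driven programmatically, for example through the BioBlend library. The sketch below is purely illustrative: the API key, workflow name, and input labels are placeholders, and we assume, without this being stated in the text, that API access is enabled on the public instance:

```python
# Hypothetical sketch of invoking a "push-button" Galaxy workflow via BioBlend.
# The instance URL comes from the text; the key, workflow name, input labels,
# and file names are placeholders/assumptions.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.sciensano.be", key="YOUR_API_KEY")

history = gi.histories.create_history(name="Nmen_validation_sample")
r1 = gi.tools.upload_file("sample_R1.fastq.gz", history["id"])
r2 = gi.tools.upload_file("sample_R2.fastq.gz", history["id"])

# look the workflow up by its (assumed) name and wire in the paired-end reads
workflow = gi.workflows.get_workflows(name="N. meningitidis WGS typing")[0]
inputs = {
    "forward reads": {"src": "hda", "id": r1["outputs"][0]["id"]},
    "reverse reads": {"src": "hda", "id": r2["outputs"][0]["id"]},
}
invocation = gi.workflows.invoke_workflow(
    workflow["id"], inputs=inputs, history_id=history["id"], inputs_by="name"
)
print("invocation id:", invocation["id"])
```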

Conclusion

We reported a validation strategy focusing specifically on the bioinformatics analysis of WGS data from clinical microbiological isolates, demonstrating generally high to very high performance and highlighting the added value and feasibility of employing WGS with the aim of integrating it into routine use under a quality system in an applied public health setting. Our validation strategy can be extended to other bioinformatics assays, other WGS workflow components (e.g., library preparation, etc.), different WGS workflows and/or sequencing technologies, and other pathogens. Several similar endeavors are currently being undertaken for other pathogens of interest to other Belgian NRCs and NRLs, and will in the future help to narrow the gap between the widely acclaimed success of WGS in research environments and its practical implementation in applied settings.

Data Availability

The datasets supporting the conclusions of this study have been deposited in the NCBI SRA under accession number SRP137803 (in-house sequenced data) and in Zenodo (http://doi.org/10.5281/zenodo.1575931) (results of all bioinformatics analyses for both the core and extended validation datasets), and are also included within this manuscript and its Supplementary Files (validation results for both datasets across all bioinformatics assays).

Author Contributions

KV, NR, and SD conceived and designed this study. KV supervised the project. SD supervised the data generation. BB constructed the bioinformatics workflow and performed all bioinformatics analyses. RW, QF, and JV contributed toward the algorithmic implementation of the bioinformatics workflow. P-JC, WM, and SB collected and isolated DNA of N. meningitidis samples to be used for the validation, and provided specialist feedback on the required functionalities of the bioinformatics workflow. BB and KV conceived the validation strategy, for which SD provided input and feedback. BB and KV analyzed the validation results, and wrote the draft of the manuscript. All authors aided in interpretation of the results and writing of the final manuscript.

Funding

This work was supported by the project “NGS & Bioinformatics Platform” funded by Sciensano (Sciensano RP-PJ – Belgium) [0001252].

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors wish to acknowledge the University of Oxford and the Norwegian Institute of Public Health (WHO Collaborating Centre for Reference and Research on Meningococci) for kindly sharing the reference strains used for the core validation dataset described in this study. This publication made use of the PubMLST website (https://pubmlst.org/) developed by Keith Jolley (Jolley and Maiden, 2010) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust. The authors also wish to acknowledge Maud Delvoye and Stefan Hoffman for assistance in the library preparation and data generation performed for this study, and the two reviewers for their interesting comments.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.00362/full#supplementary-material

Footnotes

  1. ^ http://www.genomicepidemiology.org
  2. ^ http://pubmlst.org
  3. ^ https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/
  4. ^ http://pubMLST.org
  5. ^ http://doi.org/10.5281/zenodo.1575931
  6. ^ https://cge.cbs.dtu.dk/services/ResFinder
  7. ^ https://doi.org/10.5281/zenodo.1452401
  8. ^ https://galaxy.sciensano.be

References

Aanensen, D. M., Feil, E. J., Holden, M. T. G., Dordel, J., Yeats, C. A., Fedosejev, A., et al. (2016). Whole-genome sequencing for routine pathogen surveillance in public health: a population snapshot of invasive Staphylococcus aureus in Europe. MBio 7:e00444-16. doi: 10.1128/mBio.00444-16

Afgan, E., Baker, D., van den Beek, M., Blankenberg, D., Bouvier, D., Čech, M., et al. (2016). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 44, W3–W10. doi: 10.1093/nar/gkw343

Allard, M. W. (2016). The future of whole-genome sequencing for public health and the clinic. J. Clin. Microbiol. 54, 1946–1948. doi: 10.1128/JCM.01082-16

Angers-Loustau, A., Petrillo, M., Bengtsson-Palme, J., Berendonk, T., Blais, B., Chan, K.-G., et al. (2018). The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research 7:459. doi: 10.12688/f1000research.14509.1

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477. doi: 10.1089/cmb.2012.0021

Bolger, A. M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120. doi: 10.1093/bioinformatics/btu170

Bratcher, H. B., Corton, C., Jolley, K. A., Parkhill, J., and Maiden, M. C. (2014). A gene-by-gene population genomics platform: de novo assembly, annotation and genealogical analysis of 108 representative Neisseria meningitidis genomes. BMC Genomics 15:1138. doi: 10.1186/1471-2164-15-1138

Brehony, C., Wilson, D. J., and Maiden, M. C. J. (2009). Variation of the factor H-binding protein of Neisseria meningitidis. Microbiology 155, 4155–4169. doi: 10.1099/mic.0.027995-0

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., et al. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10:421. doi: 10.1186/1471-2105-10-421

Carriço, J. A., Sabat, A. J., Friedrich, A. W., and Ramirez, M. (2013). Bioinformatics in bacterial molecular epidemiology and public health: databases, tools and the next-generation sequencing revolution, on behalf of the ESCMID Study Group for Epidemiological Markers (ESGEM). Eurosurveillance 18, 1–9.

Charpentier, E., Garnaud, C., Wintenberger, C., Bailly, S., Murat, J.-B., Rendu, J., et al. (2017). Added value of next-generation sequencing for multilocus sequence typing analysis of a Pneumocystis jirovecii pneumonia outbreak. Emerg. Infect. Dis. 23, 1237–1245. doi: 10.3201/eid2308.161295

Dallman, T. J., Byrne, L., Ashton, P. M., Cowley, L. A., Perry, N. T., Adak, G., et al. (2015). Whole-genome sequencing for national surveillance of shiga toxin-producing Escherichia coli O157. Clin. Infect. Dis. 61, 305–312. doi: 10.1093/cid/civ318

Deng, X., den Bakker, H. C., and Hendriksen, R. S. (2016). Genomic epidemiology: whole-genome-sequencing-powered surveillance and outbreak investigation of foodborne bacterial pathogens. Annu. Rev. Food Sci. Technol. 7, 353–374. doi: 10.1146/annurev-food-041715-033259

Desai, A., Marwah, V. S., Yadav, A., Jha, V., Dhaygude, K., Bangar, U., et al. (2013). Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data. PLoS One 8:e60204. doi: 10.1371/journal.pone.0060204

Deurenberg, R. H., Bathoorn, E., Chlebowicz, M. A., Couto, N., Ferdous, M., García-Cobos, S., et al. (2017). Application of next generation sequencing in clinical microbiology and infection prevention. J. Biotechnol. 243, 16–24. doi: 10.1016/j.jbiotec.2016.12.022

Didelot, X., Bowden, R., Wilson, D. J., Peto, T. E. A., and Crook, D. W. (2012). Transforming clinical microbiology with bacterial genome sequencing. Nat. Rev. Genet. 13, 601–612. doi: 10.1038/nrg3226

Ellington, M. J., Ekelund, O., Aarestrup, F. M., Canton, R., Doumith, M., Giske, C., et al. (2017). The role of whole genome sequencing in antimicrobial susceptibility testing of bacteria: report from the EUCAST Subcommittee. Clin. Microbiol. Infect. 23, 2–22. doi: 10.1016/j.cmi.2016.11.012

García Fierro, R., Thomas-Lopez, D., Deserio, D., Liebana, E., Rizzi, V., and Guerra, B. (2018). Outcome of EC/EFSA questionnaire (2016) on use of Whole Genome Sequencing (WGS) for food- and waterborne pathogens isolated from animals, food, feed and related environmental samples in EU/EFTA countries. EFSA Support. Publ. 15:1432E. doi: 10.2903/sp.efsa.2018.EN-1432

Gilmour, M. W., Graham, M., Reimer, A., and Van Domselaar, G. (2013). Public health genomics and the new molecular epidemiology of bacterial pathogens. Public Health Genomics 16, 25–30. doi: 10.1159/000342709

Gupta, S. K., Padmanabhan, B. R., Diene, S. M., Lopez-Rojas, R., Kempf, M., Landraud, L., et al. (2014). ARG-annot, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob. Agents Chemother. 58, 212–220. doi: 10.1128/AAC.01310-13

Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075. doi: 10.1093/bioinformatics/btt086

Hardwick, S. A., Deveson, I. W., and Mercer, T. R. (2017). Reference standards for next-generation sequencing. Nat. Rev. Genet. 18, 473–484. doi: 10.1038/nrg.2017.44

Harris, S. R., Cole, M. J., Spiteri, G., Sánchez-Busó, L., Golparian, D., Jacobsson, S., et al. (2018). Public health surveillance of multidrug-resistant clones of Neisseria gonorrhoeae in Europe: a genomic survey. Lancet Infect. Dis. 18, 758–768. doi: 10.1016/S1473-3099(18)30225-1

Harrison, L. H., Trotter, C. L., and Ramsay, M. E. (2009). Global epidemiology of meningococcal disease. Vaccine 27(Suppl. 2), B51–B63. doi: 10.1016/j.vaccine.2009.04.063

Harrison, O. B., Claus, H., Jiang, Y., Bennett, J. S., Bratcher, H. B., Jolley, K. A., et al. (2013). Description and nomenclature of Neisseria meningitidis capsule locus. Emerg. Infect. Dis. 19, 566–573. doi: 10.3201/eid1904.111799

Hendriksen, R. S., Pedersen, S. K., Leekitcharoenphon, P., Malorny, B., Borowiak, M., Battisti, A., et al. (2018). Final report of ENGAGE - establishing next generation sequencing ability for genomic analysis in Europe. EFSA Support. Publ. 15:1431E. doi: 10.2903/sp.efsa.2018.EN-1431

Holmes, A., Dallman, T. J., Shabaan, S., Hanson, M., and Allison, L. (2018). Validation of whole-genome sequencing for identification and characterization of Shiga toxin-producing Escherichia coli to produce standardized data to enable data sharing. J. Clin. Microbiol. 56, 1–10. doi: 10.1128/JCM.01388-17

Jia, B., Raphenya, A. R., Alcock, B., Waglechner, N., Guo, P., Tsang, K. K., et al. (2017). CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573. doi: 10.1093/nar/gkw1004

Jolley, K. A., Bray, J. E., and Maiden, M. C. J. (2017). A RESTful application programming interface for the PubMLST molecular typing and genome databases. Database 2017:bax060. doi: 10.1093/database/bax060

Jolley, K. A., and Maiden, M. C. J. (2010). BIGSdb: scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics 11:595. doi: 10.1186/1471-2105-11-595

Katz, L. S., Humphrey, J. C., Conley, A. B., Nelakuditi, V., Kislyuk, A. O., Agrawal, S., et al. (2011). Neisseria Base: a comparative genomics database for Neisseria meningitidis. Database 2011:bar035. doi: 10.1093/database/bar035

Koser, C. U., Ellington, M. J., Cartwright, E. J. P., Gillespie, S. H., Brown, N. M., Farrington, M., et al. (2012). Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. PLoS Pathog. 8:e1002824. doi: 10.1371/journal.ppat.1002824

Kozyreva, V. K., Truong, C. L., Greninger, A. L., Crandall, J., Mukhopadhyay, R., and Chaturvedi, V. (2017). Validation and implementation of clinical laboratory improvements act-compliant whole-genome sequencing in the public health microbiology laboratory. J. Clin. Microbiol. 55, 2502–2520. doi: 10.1128/JCM.00361-17

Kwong, J. C., McCallum, N., Sintchenko, V., and Howden, B. P. (2015). Whole genome sequencing in clinical and public health microbiology. Pathology 47, 199–210. doi: 10.1097/PAT.0000000000000235

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921. doi: 10.1038/35057062

Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. doi: 10.1038/nmeth.1923

Larsen, M. V., Cosentino, S., Rasmussen, S., Friis, C., Hasman, H., Marvig, R. L., et al. (2012). Multilocus sequence typing of total-genome-sequenced bacteria. J. Clin. Microbiol. 50, 1355–1361. doi: 10.1128/JCM.06094-11

Leinonen, R., Sugawara, H., and Shumway, M. (2011). The sequence read archive. Nucleic Acids Res. 39, 2010–2012. doi: 10.1093/nar/gkq1019

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079. doi: 10.1093/bioinformatics/btp352

Lindsey, R. L., Pouseele, H., Chen, J. C., Strockbine, N. A., and Carleton, H. A. (2016). Implementation of whole genome sequencing (WGS) for identification and characterization of Shiga toxin-producing Escherichia coli (STEC) in the United States. Front. Microbiol. 7:766. doi: 10.3389/fmicb.2016.00766

Maiden, M. C., Bygraves, J. A., Feil, E., Morelli, G., Russell, J. E., Urwin, R., et al. (1998). Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc. Natl. Acad. Sci. U.S.A. 95, 3140–3145. doi: 10.1073/pnas.95.6.3140

Maiden, M. C. J., Van Rensburg, M. J. J., Bray, J. E., Earle, S. G., Ford, S. A., Jolley, K. A., et al. (2013). MLST revisited: the gene-by-gene approach to bacterial genomics. Nat. Rev. Microbiol. 11, 728–736. doi: 10.1038/nrmicro3093

Mellmann, A., Andersen, P. S., Bletz, S., Friedrich, A. W., Kohl, T. A., Lilje, B., et al. (2017). High interlaboratory reproducibility and accuracy of next-generation-sequencing-based bacterial genotyping in a ring trial. J. Clin. Microbiol. 55, 908–913. doi: 10.1128/JCM.02242-16

Mellmann, A., Harmsen, D., Cummings, C. A., Zentz, E. B., Leopold, S. R., Rico, A., et al. (2011). Prospective genomic characterization of the german enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology. PLoS One 6:e22751. doi: 10.1371/journal.pone.0022751

Moura, A., Criscuolo, A., Pouseele, H., Maury, M. M., Leclercq, A., Tarr, C., et al. (2016). Whole genome-based population biology and epidemiological surveillance of Listeria monocytogenes. Nat. Microbiol. 2:16185. doi: 10.1038/nmicrobiol.2016.185

Mustapha, M. M., Marsh, J. W., and Harrison, L. H. (2016). Global epidemiology of capsular group W meningococcal disease (1970-2015): multifocal emergence and persistence of hypervirulent sequence type (ST)-11 clonal complex. Vaccine 34, 1515–1523. doi: 10.1016/j.vaccine.2016.02.014

Olson, N. D., Lund, S. P., Colman, R. E., Foster, J. T., Sahl, J. W., Schupp, J. M., et al. (2015). Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front. Genet. 6:235. doi: 10.3389/fgene.2015.00235

Portmann, A. C., Fournier, C., Gimonet, J., Ngom-Bru, C., Barretto, C., and Baert, L. (2018). A validation approach of an end-to-end whole genome sequencing workflow for source tracking of Listeria monocytogenes and Salmonella enterica. Front. Microbiol. 9:446. doi: 10.3389/fmicb.2018.00446

Revez, J., Espinosa, L., Albiger, B., Leitmeyer, K. C., and Struelens, M. J. (2017). Survey on the use of whole-genome sequencing for infectious diseases surveillance: rapid expansion of European National Capacities, 2015–2016. Front. Public Health 5:347. doi: 10.3389/fpubh.2017.00347

Ronholm, J., Nasheri, N., Petronella, N., and Pagotto, F. (2016). Navigating microbiological food safety in the era of whole-genome sequencing. Clin. Microbiol. Rev. 29, 837–857. doi: 10.1128/CMR.00056-16

Rossen, J. W. A., Friedrich, A. W., and Moran-Gilad, J. (2018). Practical issues in implementing whole-genome-sequencing in routine diagnostic microbiology. Clin. Microbiol. Infect. 24, 355–360. doi: 10.1016/j.cmi.2017.11.001

Rouphael, N. G., and Stephens, D. S. (2012). Neisseria meningitidis: biology, microbiology, and epidemiology. Methods Mol. Biol. 799, 1–20. doi: 10.1007/978-1-61779-346-2_1

Smith-Unna, R. D., Boursnell, C., Patro, R., Hibberd, J. M., Kelly, S., Street, D., et al. (2016). TransRate: reference free quality assessment of de novo transcriptome assemblies. Genome Res. 26, 1134–1144. doi: 10.1101/gr.196469.115

Tettelin, H., Saunders, N. J., Heidelberg, J., Jeffries, A. C., Nelson, K. E., Eisen, J. A., et al. (2000). Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science 287, 1809–1815. doi: 10.1126/science.287.5459.1809

Timme, R. E., Rand, H., Shumway, M., Trees, E. K., Simmons, M., Agarwala, R., et al. (2017). Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ 5:e3893. doi: 10.7717/peerj.3893

Whittaker, R., Dias, J. G., Ramliden, M., Ködmön, C., Economopoulou, A., Beer, N., et al. (2017). The epidemiology of invasive meningococcal disease in EU/EEA countries, 2004–2014. Vaccine 35, 2034–2041. doi: 10.1016/j.vaccine.2017.03.007

WHO (2018). Whole Genome Sequencing for Foodborne Disease Surveillance. Available at: http://www.who.int/foodsafety/publications/foodborne_disease/wgs_landscape/en/

Yachison, C. A., Yoshida, C., Robertson, J., Nash, J. H. E., Kruczkiewicz, P., Taboada, E. N., et al. (2017). The validation and implications of using whole genome sequencing as a replacement for traditional serotyping for a national Salmonella reference laboratory. Front. Microbiol. 8:1044. doi: 10.3389/fmicb.2017.01044

Zankari, E., Hasman, H., Cosentino, S., Vestergaard, M., Rasmussen, S., Lund, O., et al. (2012). Identification of acquired antimicrobial resistance genes. J. Antimicrob. Chemother. 67, 2640–2644. doi: 10.1093/jac/dks261

Keywords: Neisseria meningitidis, whole-genome sequencing, validation, public health, national reference center

Citation: Bogaerts B, Winand R, Fu Q, Van Braekel J, Ceyssens P-J, Mattheus W, Bertrand S, De Keersmaecker SCJ, Roosens NHC and Vanneste K (2019) Validation of a Bioinformatics Workflow for Routine Analysis of Whole-Genome Sequencing Data and Related Challenges for Pathogen Typing in a European National Reference Center: Neisseria meningitidis as a Proof-of-Concept. Front. Microbiol. 10:362. doi: 10.3389/fmicb.2019.00362

Received: 11 October 2018; Accepted: 12 February 2019;
Published: 06 March 2019.

Edited by:

Rustam Aminov, University of Aberdeen, United Kingdom

Reviewed by:

María de Toro, Centro de Investigación Biomédica de La Rioja, Spain
Fumito Maruyama, Kyoto University, Japan

Copyright © 2019 Bogaerts, Winand, Fu, Van Braekel, Ceyssens, Mattheus, Bertrand, De Keersmaecker, Roosens and Vanneste. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kevin Vanneste, kevin.vanneste@sciensano.be

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.