AUTHOR=Rosewick Nicolas, Hahaut Vincent, Durkin Keith, Artesi Maria, Karpe Snehal, Wayet Jérôme, Griebel Philip, Arsic Natasa, Marçais Ambroise, Hermine Olivier, Burny Arsène, Georges Michel, Van den Broeke Anne TITLE=An Improved Sequencing-Based Bioinformatics Pipeline to Track the Distribution and Clonal Architecture of Proviral Integration Sites JOURNAL=Frontiers in Microbiology VOLUME=11 YEAR=2020 URL=https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2020.587306 DOI=10.3389/fmicb.2020.587306 ISSN=1664-302X ABSTRACT=
The combined application of linear amplification-mediated PCR (LAM-PCR) protocols with next-generation sequencing (NGS) has had a large impact on our understanding of retroviral pathogenesis. Considerable effort has previously been devoted to optimizing NGS methods for exploring the genome-wide distribution of proviral integration sites and the clonal architecture of clinically important retroviruses such as human T-cell leukemia virus type-1 (HTLV-1). Once sequencing data are generated, rigorous bioinformatics analysis is central to their biological interpretation. To better exploit the information available through these methods, we developed an optimized bioinformatics pipeline to analyze NGS clonality datasets. We found that short-read aligners, which are specifically designed to handle NGS datasets, significantly reduce processing time and computational burden while also accounting for sequencing base quality. We demonstrate the utility of an additional trimming step in the workflow, which adjusts for the number of reads supporting each insertion site. In addition, we developed a recall procedure to reduce the bias associated with proviral integration within low-complexity regions of the genome, providing a more accurate estimate of clone abundance. Finally, we recommend applying a “clean-and-recover” step to clonality datasets generated from large cohorts and longitudinal studies. In summary, we report an optimized bioinformatics workflow for NGS clonality analysis and describe a new set of steps to guide the computational process. We demonstrate that applying this protocol to HTLV-1 and bovine leukemia virus (BLV) clonality datasets improves the quality of data processing and provides a more accurate definition of the clonal landscape in infected individuals. The optimized workflow and analysis recommendations can be implemented in the majority of bioinformatics pipelines developed to analyze LAM-PCR-based NGS clonality datasets.