AUTHOR=Xu Mengyang, Guo Lidong, Qi Yanwei, Shi Chengcheng, Liu Xiaochuan, Chen Jianwei, Han Jinglin, Deng Li, Liu Xin, Fan Guangyi TITLE=Symbiont-screener: A reference-free tool to separate host sequences from symbionts for error-prone long reads JOURNAL=Frontiers in Marine Science VOLUME=10 YEAR=2023 URL=https://www.frontiersin.org/journals/marine-science/articles/10.3389/fmars.2023.1087447 DOI=10.3389/fmars.2023.1087447 ISSN=2296-7745 ABSTRACT=

Metagenomic sequencing facilitates large-scale compositional analysis and functional characterization of complex microbial communities without cultivation. Recent advances in long-read sequencing exploit long-range information to simplify repeat-aware metagenomic assembly and complex genome binning. However, removing host-derived DNA sequences from the microbial community at read resolution remains methodologically challenging because of high sequencing error rates and the absence of reference genomes. Here we present Symbiont-Screener (https://github.com/BGI-Qingdao/Symbiont-Screener), a reference-free approach that uses a trio-based screening model to identify high-confidence host long reads among symbiont and contaminant reads despite low sequencing accuracy. The remaining host sequences are then automatically grouped by unsupervised clustering. When applied to both simulated and real long-read datasets, it achieves higher precision and recall in identifying the host's raw reads than other tools, and hence promises high-quality reconstruction of the host genome and associated metagenomes. Furthermore, we used both PacBio HiFi and nanopore long reads to separate host sequences in a real host-microbe system, an algal-bacterial sample, and obtained a clear improvement in host assembly contiguity, completeness, and purity. More importantly, the residual symbiotic microbiomes show improved genomic profiling and assemblies after screening, which provides a solid data basis for downstream bioinformatic analyses and a novel perspective on symbiotic research.
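
The abstract does not spell out how the trio-based screening model works. As a rough illustration only, the Python sketch below shows one plausible reading: a long read is called host-derived when a sufficient fraction of its k-mers also occur in k-mers collected from parental sequences. This is not the Symbiont-Screener implementation. The function names, the k-mer length `k`, the threshold `min_fraction`, and the toy sequences are all hypothetical. A fractional threshold, rather than an exact match, is what lets such a screen tolerate the sequencing errors of raw long reads.

```python
from typing import Dict, Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_marker_set(parent_reads: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer seen in the parental reads (hypothetical marker set)."""
    markers: Set[str] = set()
    for read in parent_reads:
        markers |= kmers(read, k)
    return markers


def host_fraction(read: str, markers: Set[str], k: int) -> float:
    """Fraction of the read's distinct k-mers that also occur in the marker set."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: no evidence either way
        return 0.0
    return len(read_kmers & markers) / len(read_kmers)


def screen_reads(reads: Dict[str, str], markers: Set[str],
                 k: int = 5, min_fraction: float = 0.5) -> Dict[str, bool]:
    """Label each read True (host-like) if enough of its k-mers hit the markers.

    k and min_fraction are assumed values chosen for this toy example.
    """
    return {name: host_fraction(seq, markers, k) >= min_fraction
            for name, seq in reads.items()}


if __name__ == "__main__":
    # Toy data: one parental sequence, one offspring read carrying a single
    # substitution, and one unrelated read standing in for a symbiont.
    markers = build_marker_set(["ACGTACGTAC"], k=5)
    reads = {
        "host_like": "ACGTACGAAC",
        "symbiont_like": "TTTTTGGGGG",
    }
    for name, is_host in screen_reads(reads, markers, k=5).items():
        print(name, "host" if is_host else "non-host")
```

In this toy case the read with one substitution still shares half of its k-mers with the parental set and is kept as host, while the unrelated read shares none. Real data would call for much longer k-mers and a threshold tuned to the error profile of the sequencing platform.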