AUTHOR=Lupo Valérian , Van Vlierberghe Mick , Vanderschuren Hervé , Kerff Frédéric , Baurain Denis , Cornet Luc TITLE=Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics JOURNAL=Frontiers in Microbiology VOLUME=12 YEAR=2021 URL=https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2021.755101 DOI=10.3389/fmicb.2021.755101 ISSN=1664-302X ABSTRACT=
Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single detection method, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a