AUTHOR=García-López Rodrigo, Vázquez-Castellanos Jorge Francisco, Moya Andrés

TITLE=Fragmentation and Coverage Variation in Viral Metagenome Assemblies, and Their Effect in Diversity Calculations

JOURNAL=Frontiers in Bioengineering and Biotechnology

VOLUME=Volume 3 - 2015

YEAR=2015

URL=https://www.frontiersin.org/journals/bioengineering-and-biotechnology/articles/10.3389/fbioe.2015.00141

DOI=10.3389/fbioe.2015.00141

ISSN=2296-4185

ABSTRACT=Metagenomic libraries consist of DNA fragments from diverse species, with varying genome size and abundance. High-throughput sequencing platforms produce large volumes of reads from these libraries, which may be assembled into contigs, ideally resembling the original larger genomic sequences. The uneven species distribution, along with the stochasticity in sample processing and sequencing bias, impacts the success of accurate sequence assembly.

Several assemblers enable the de novo processing of viral metagenomic data, generally using Overlap Layout Consensus or de Bruijn graph approaches for contig assembly. The success of viral genomic reconstruction in these datasets is limited by the degree of fragmentation of each genome in the sample, which depends on the sequencing effort and the genome length. Depending on ecological, biological, or procedural biases, some fragments have a higher prevalence, or coverage, in the assembly. Assemblers must also face challenges such as the formation of chimeric structures and intra-species variability.

Diversity calculation relies on the classification of the sequences that comprise a metagenomic dataset. Whenever the corresponding genomic and taxonomic information is available, contigs matching the same species can be classified accordingly, and the coverage of that species' genome can be calculated. This may be used to compare populations by estimating abundance and assessing species distribution from these data. Nevertheless, coverage does not take into account the degree of fragmentation, that is, genome completeness, and is not necessarily representative of the actual species distribution in the samples. Furthermore, undetermined sequences are abundant in viral metagenomic datasets, resulting in several independent contigs that cannot be assigned by homology or genomic information. These may only be classified as different Operational Taxonomic Units (OTUs), and so may be treated as unrelated even when they derive from the same genome. Thus, calculations using contigs as different OTUs ultimately overestimate diversity when compared to diversity calculated from species coverage.
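
The overestimation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the Shannon index is chosen here only as a familiar example of a diversity measure, the contig names, species labels, and read counts are hypothetical, and the number of reads mapped to each contig is assumed to be a usable proxy for abundance.

```python
import math
from collections import defaultdict


def shannon_index(abundances):
    """Shannon diversity H' = -sum(p_i * ln p_i) over units with nonzero abundance."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


# Hypothetical toy assembly: (contig id, assigned species, reads mapped to the contig).
# The genome of virus_A is fragmented into three contigs; virus_B is a single contig.
contigs = [
    ("contig_1", "virus_A", 40),
    ("contig_2", "virus_A", 30),
    ("contig_3", "virus_A", 30),
    ("contig_4", "virus_B", 100),
]

# Case 1: every contig is treated as its own OTU.
h_contigs = shannon_index([reads for _, _, reads in contigs])

# Case 2: contigs matching the same species are pooled before computing diversity.
per_species = defaultdict(int)
for _, species, reads in contigs:
    per_species[species] += reads
h_species = shannon_index(list(per_species.values()))

print(f"H' (contigs as OTUs):   {h_contigs:.3f}")
print(f"H' (pooled by species): {h_species:.3f}")
```

In this toy case the two species are equally abundant once their contigs are pooled, so the pooled index equals ln 2, while the per-contig index is higher only because one genome is split into several pieces.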
...
Last time we submitted a long abstract (the limit was 1000 characters), but this time it won't allow it to be longer than 350. Has this changed, and if so, why? For now, we'll just include the complete abstract in the manuscript. Please help us figure this out.