Exploring Virome Diversity in Public Data in South America as an Approach for Detecting Viral Sources From Potentially Emerging Viruses

Mazur, Fernando G.; Morinisi, Leandro M.; Martins, Junior Olímpio; Guerra, Pedro Pontes Bueno; Freire, Caio C. M.

doi:10.3389/fgene.2021.722857

ORIGINAL RESEARCH article

Front. Genet., 21 January 2022

Sec. Evolutionary and Population Genetics

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.722857

This article is part of the Research TopicDecoding the Genetics of Viral EvolutionView all 6 articles

Exploring Virome Diversity in Public Data in South America as an Approach for Detecting Viral Sources From Potentially Emerging Viruses

Fernando G. Mazur

Leandro M. Morinisi

Junior Olímpio Martins

Pedro Pontes Bueno Guerra

Caio C. M. Freire*

Department Genetics and Evolution, UFSCar—Federal University of São Carlos, São Carlos, Brazil

The South American continent presents a great diversity of biomes, whose ecosystems are constantly threatened by the expansion of human activity. The emergence and re-emergence of viral populations with impact on the human population and ecosystem have shown increases in the last decades. In deference to the growing accumulation of genomic data, we explore the potential of South American-related public databases to detect signals that contribute to virosphere research. Therefore, our study aims to investigate public databases with emphasis on the surveillance of viruses with medical and ecological relevance. Herein, we profiled 120 “sequence read archives” metagenomes from 19 independent projects from the last decade. In a coarse view, our analyses identified only 0.38% of the total number of sequences from viruses, showing a higher proportion of RNA viruses. The metagenomes with the most important viral sequences in the analyzed environmental models were 1) aquatic samples from the Amazon River, 2) sewage from Brasilia, and 3) soil from the state of São Paulo, while the models of animal transmission were detected in mosquitoes from Rio Janeiro and Bats from Amazonia. Also, the classification of viral signals into operational taxonomic units (OTUs) (family) allowed us to infer from metadata a probable host range in the virome detected in each sample analyzed. Further, several motifs and viral sequences are related to specific viruses with emergence potential from Togaviridae, Arenaviridae, and Flaviviridae families. In this context, the exploration of public databases allowed us to evaluate the scope and informative capacity of sequences from third-party public databases and to detect signals related to viruses of clinical or environmental importance, which allowed us to infer traits associated with probable transmission routes or signals of ecological disequilibrium. The evaluation of our results showed that in most cases the size and type of the reference database, the percentage of guanine–cytosine (GC), and the length of the query sequences greatly influence the taxonomic classification of the sequences. In sum, our findings describe how the exploration of public genomic data can be exploited as an approach for epidemiological surveillance and the understanding of the virosphere.

Introduction

Viral emergencies showed a progressive increase in the last decades (Jones et al., 2008; Gould and Higgs, 2009; Ewald, 2011; Coffey et al., 2014; Nobre et al., 2016). Many factors linked to the expansion of human activity (agriculture, migration, urbanization, etc.) are indicated as agents that threaten the balance of viral communities in nature (Gould and Higgs, 2009; Woolhouse et al., 2012; Mahy, 2014). Currently, accumulated evidence reveals an overwhelming number of new viruses and routes of interaction that were not considered a few decades ago (Wong et al., 2007; Sobel Leonard et al., 2017; Nouri et al., 2018; Schoeman and Fielding, 2019). This impressive volume of new viral sequences allowed us to investigate the still hidden viral diversity that has great magnitude and actively participates in ecological processes (Koonin and Dolja, 2013; Posada-Cespedes and Seifert, 2017). Thus, the sequencing technologies permitted the observation with better resolution of the viral world, changing the old paradigm of parasitic entities to a category containing the largest and most variable genetic information, the virosphere (Shi et al., 2018). Nevertheless, the viral proportions observed to date are only a small fraction of the virosphere volume (Zhang et al., 2018).

Otherwise, the description of viral diversity presents great challenges. These entities have attributes that confer them a rapid response to the environment. Characteristics such as high mutational rate, high reproductive number, recombination, and reorganization, even in very divergent families, contribute to the wide diversity of viral groups (Skewes-Cox et al., 2014; Shi et al., 2016a; Gilbert et al., 2016). Thus, in contrast to microorganism classification methods where it is possible to classify operational taxonomic units (OTUs) from a few markers with high homology (e.g., RNA16S) (Hackl et al., 2004), virus detection is limited by the wide genetic mosaicism detected in the virosphere (Dolja and Koonin, 2018). Therefore, virus or viral trace detection considers alternative strategies to infer the presence of viral activity (Herath et al., 2017). Thus, the diversity discovered in the virosphere, largely explained by horizontal gene transfer (Cai et al., 2006; Gilbert et al., 2016; Wolf et al., 2018), reveals intricate genetic relationships that began to be described with greater resolution in the age of genomics (Xu, 2006).

The increasing addition of new viruses identified in the reference databases, advances in taxonomic classification (Krishnamurthy and Wang, 2017; Simmonds et al., 2017), and the implementation of new statistical and computational techniques make it possible to increase the informative capacity of viral sequences. The combined detection of multiple and diverse patterns in viral genes, such as protein domains or gene sequences of conserved genes, helps to detect OTUs of viral origin despite divergence between taxonomic groups. These innovations make it possible to detect molecular clues of viral origin that may indicate ecological processes, as well as biological phenomena in hosts (Hurst, 2011; Fermin, 2018; Sadeghi et al., 2021). In this sense, the profile of virus species and the abundance of each viral family detected in an organism can reveal key elements in the interaction with its environment. In addition, viral traces can be used as indicators of anomalous bacterial activity, environmental pollutants, or the presence of serious modifications in the ecosystem (Hurst, 2013; Fermin, 2018). This type of evidence allows the detection of large-scale phenomena in the environment and in non-culturable viruses (Roux et al., 2017; Wolf et al., 2020), thus contributing relevant sources of information for ecological monitoring (Fermin, 2018).

The exploration of the world of viral RNA reached a better resolution through the new tools, which permitted the accumulation of enough data to reveal key events in the evolution of this viral group (Duffy et al., 2008). The abundance and characteristics of replication of RNA viruses are favorable in the rapid expansion of its reproductive substrate, which can result in eventual emergencies (Holmes, 2009; Wang and Crameri, 2014). Conjunctions of factors such as diversity disturbances, increased density, and exposure of organisms to sources of infection (Bordería et al., 2011; Geoghegan and Holmes, 2017) are frequently associated with host jumps. However, several viral emergence events resulted from unknown viruses that had reduced sylvatic cycles, due to alterations in the environment, forcing them to occupy new niches and impacting humans (Weaver and Reisen, 2010; Castro et al., 2019). This is the case of several arboviruses and common viruses that infect vertebrates such as coronaviruses, bunyaviruses, and retroviruses (De Wit et al., 2016; Chouin-Carneiro, 2017; Lu et al., 2020). As a surveillance alternative, different research approaches indicate that viral migration pathways can be mapped by observing molecular evidence from viral lineages in a time–space context (Rosario and Breitbart, 2011; Geoghegan and Holmes, 2017).

Moreover, ecological and epidemiological aspects often overlap in the environment. It was indicated that most of the gastrointestinal diseases transmitted by recreational contact on water surfaces have their origin in human waste (Morace et al., 2002; Aw and Gin, 2010; Hjelmsø et al., 2019). Based on this, many countries and research groups have opted for molecular detection methods in aquatic environments as a solution to this problem. It is an important area of surveillance since fecal–oral transmission viruses show a high prevalence in watercourses. As a key feature to consider, fecal–oral transmission viruses show infective capacity at low frequencies (Lin and Ganesh, 2013). Thus, viruses such as reovirus, adenovirus, hepatitis, rotaviruses (RV), and human polyomaviruses are viral particles of waterborne transmission, which have high stability against environmental perturbations such as high UV degrees and long periods in the environment.

Therefore, the reanalysis of public databases to describe and explore various aspects of the virosphere is an accessible and powerful tool, which is at an early stage. As a first step, this work aims to investigate the large volume of publicly available sequencing data to detect viral sequences. Once viral evidence has been identified, we aim to label, categorize, and contextualize the data, according to the aspect to be analyzed through the collection of suitable metadata. For this purpose, we kept our focus on South America, mainly the Brazilian territory, whose tropical location is associated with the greatest biological diversity. In addition, this area presents a large number of biomes, intense agricultural activity, highly populated cities, and other socioeconomic factors related to the emergence and/or imbalance of viral communities. Depending on the geographical setting and the metagenomes available, our work also evaluates the ability to detect viral signals using the reference viral genomes available to date. In order to do this, we investigated if the knowledge of viral diversity related to a specific context can be used to drive responsible management of human-induced factors with a high impact on the biosphere.

Materials and Methods

To analyze the presence of viral patterns in public databases, we established a workflow that adapts the heterogeneity of the obtained samples to recover viral sequences using long-established and frequently used tools in sequence similarity (Blast-like) and comparison methods (HMMer-like) with an updated viral database.

Metagenomic Sequences

We choose metagenomic sequences of samples from Brazil and other countries in South America, using the keywords “Brazil,” “host,” “mosquitoes,” “South America,” and “virus” in the search engine of Sequence Read Archives (“SRA”) database. Also, we filtered the results to obtain sequences from “metagenomic,” “transcriptomic,” “soil,” “Environmental,” “Water,” “Freshwater,” “Host,” and others related to epidemiologically relevant samples.

Sequence Pre-Processing

The Fastq-dump and Prefetch programs from the SRAtoolkit package v2.3.2 were used to download and manage the files. The integrity of samples was analyzed using FASTQC v0.11.9, and data curation was performed with Fastp (Chen et al., 2018).

Processing and Analysis of Sequencing Reads

Mapping of viral sequences was performed with Bowtie 2 (Borozan et al., 2013) and the clustered-RVDB-v21 database (Goodacre et al., 2018), which allowed us to index more than 28,000 viruses as the subject for the mapping against the metagenome samples. The results in SAM format were filtered according to PRHED 20 quality, which considers a confidence percentage of 9% chance of error (Suzuki et al., 2017). We also filtered the mapped sequences in which the guanine–cytosine (GC) content ranged from 35% to 75%, using Python and R to perform these analyses. To determine the abundance of viral sequences, we calculated the percentage of reads that mapped with RVDB. Later, we checked the quality of these sequences with Aliview (Larsson, 2014). Once we detected the sequences with the first method (Bowtie 2), the sequences of putative viral origin were translated into the six reading frames (ORF) by EMBOSS (Rice et al., 2000) and subsequently mapped against clustered-RVDB protein (C-RVDB-v21, contains 3,899,699 protein sequences) against Blastp (BLAST+ v2.11.0) using a script based on the MetablastR package.

The results were associated with metadata and taxonomical classification and then filtered, and short reads with low-complexity regions were excluded. Was used clustered-RVDB (C-RVDB-v21) as a database, reporting only viral sequences that aligned with e-value >0.001 and coverage >75% with the level of taxonomic assignment (minimum inclusion criteria = family). Subsequently, we proceeded with a third method, where viral sequences were mapped using hidden Markov model viral profiles with the HMMer tool (Skewes-Cox et al., 2014; Goodacre et al., 2018), which used RVDB-prot (3,899,699) as a reference database to detect viral profiles (vFAMs—13,201 clusters) (Bigot et al., 2019).

Contig Assembly and Virus Mapping

We performed de novo assemblies of the filtered reads with SPADES (Bankevich et al., 2012) and CAP3 (Huang and Madan, 1999). In the same way that we proceeded with the reads, the assembled viral sequences with BLAST+ were inspected, using C-RVDB-v21 database as the first method. Then, the assemblies were filtered, using the same criteria of the reads that were based on GC content, e-value, and coverage. As a second method, each assembled sequence was also translated into the six ORFs by EMBOSS (Rice et al., 2000), analyzed with BLASTP against clustered-RVDB (C-RVDB-v21), and filtered by e-value, coverage, and low level of taxonomic assignment. Finally, contigs were mapped with HMMer using RVDB vFAMs, excluding e-value >0.001.

Analyses of Results

The results were analyzed using the R language where scripts were created using the packages Gtsummary, phyloR, Gmaps, Strings, SequinR, Biostrings, Rentrez, MetablastR, Pheatmap, and the n Python 3.8.0 language with the Biopython (https://biopython.org/) and pandas (https://pandas.pydata.org/) modules. Correlation analyses of diversity distribution detected by sample were performed according to the type of mapping method. The R packages used in correlation analysis were R, Hmisc, PerformanceAnalytics, and Corrplot packages.

Analysis of Observed Diversity According to Environmental, Clinical, or Economic Impact

To analyze the host range of sample indicated by the contigs in each sample, metadata information (Fermin, 2018) that shows a taxonomic assignment of hosts in a comprehensive way was used with the following groups of classification: A (Animalia), Ar (Archaea), B (Bacteria), C (Chromista), F (Fungi), P (Plantae), or Pr (Protozoa). In order to identify groups of sequences according to their environmental, economic, veterinary, or clinical impact, our contig sequences were labeled with information from Viral Zone (Hulo et al., 2011). The labeling, taxonomy, and visual representation analyses were performed with the R language using the Rentrez and Sequinr packages. The workflow is shown graphically in Supplementary Figure S1.

In sum, we were able to analyze 19 bioprojects containing 120 SRA experiments, which represented virological diversity from several environments (Supplementary Table S1). In addition, we took into account evidence for, e.g., the presence of families with bioindicator potential, taxonomic groups with high emergence records, or relevant proteins.

Results

Viral Sequences

In the first phase of our analysis, 2,750,113 reads from viruses were detected from a total of 7.10E+08 reads in the metagenomes, corresponding to 0.38% of the reads in the 120 metagenomic samples from 19 projects, which represented the heterogeneous sampling origins (Figure 1). Moreover, clustering of the viral sequences by genome type (Table 1) revealed that the highest percentages of virus (Classified) recovered were dsRNA, ssRNA (+), dsDNA, ssRNA (−), and ssRNA-RT (Table 1), in this order.

FIGURE 1

FIGURE 1. The number of sequences of probable viral origin identified in each metagenome analyzed.

TABLE 1

TABLE 1. Summary of rate of viral sequences by genome type.

The highest proportion was observed in the dsRNA category from the Aedes aegypti samples (Figure 2) and corresponded to sequences identified as Reoviridae, Picobirnaviridae, Partitiviridae, Hypovirus, and Sedoreovirinae. This uneven distribution of sequences according to the genome in Aedes sp. samples could be biased by the methods to obtain and concentrate samples.

FIGURE 2

FIGURE 2. Proportion of viral sequences according to the related genome type by sample origin. The colors allow to identify in the horizontal bars the proportion of sequences clustered according to their genome type (y-axis) by sample origin (x-axis).

RNA (+) viruses were found in more than 21% of the total samples, when they were added to the percentage of other RNA (+) viruses but classified within taxa of variable composition (RNA +/−), totaling more than 40% (Figure 2). In the remaining categories, viruses “ssRNA-RT,” which grouped the Orthoretrovirinae and Retroviridae families, stood out with 11% and a more homogeneous distribution in the samples. The viral sequences corresponding to ssDNA, dsRNA, and dsDNA-RT classes represented the lowest frequencies that we found. In agreement, we found the highest rate of RNA viruses over DNA in the environment (“Amazonia river,” “Marine,” “Activated sludge,” “Wastewater,” and “Bioreactor”) and bat metagenomes (Figure 2).

A broad observation of family rate distribution (Figure 3 and Supplementary Figure S4) indicated that most of the viral sequences were grouped in taxonomic categories, which have not yet been clarified. Otherwise, when grouping the sequences considering orders and families with epidemiological relevance (impact on human health or economy), we found a large number of sequences related to RNA (+) families (Arenaviridae, Hepaciviridae, Togaviridae, etc.) and RNA (−) (Phenuiviridae, Peribunyaviridae, and Hantaviridae). The sequences related to relevant dsDNA viruses were mostly classified in the families Alphaherpesviridae, Betaherpesviridae, Gamaherpesviridae, Picornaviridae, and Cytomegalovirus.

FIGURE 3

FIGURE 3. Rate of viral sequences detected in each metagenome. (A) The proportions of families recovered with Blastn. (B) The proportions of families recovered using contigs translated with Blastp. The RVDB-v21 database reference was used in both methods.

When analyzing the data in a more restrictive way, allowing a minimum level of common taxonomic assignment in the family level, a percentage higher than 50% of all the sequences was indicated to be of viral origin; the methods were described by similarity search (Bowtie 2, and Blastp) and the presence of 78 families in the total of analyzed samples (Figure 3 and Supplementary Figure S4). Likewise, sequences reads and contigs translated into peptides (six ORF) were analyzed with HMMer (Supplementary Figure S9).

An overall evaluation of our results shows that more than 750,000 unique viral sequences were detected in reads (Bowtie 2), 170,000 unique sequences in translated reads (Blastp), 5,800 unique sequences in contigs, and 6,930 unique sequences in translated contigs.

A more restrictive classification, allowing a minimum assignment criterion to include family, showed that the pairwise alignment and HMM comparison methods indicated similar diversity rates in the samples (Figure 3 and Supplementary Figure S4). Based on this, a correlatogram (Pearson’s correlation) was performed, which included the distribution of viral diversity detected in reads (Supplementary Table S8) and contigs (Supplementary Table S8) with each of the methods (Bowtie, Blastn, Blastp, and HMMer) in each type of sample. The correlogram (Figure 4) revealed a high level of correlation with significance between the distribution of diversity detected among contigs with Blastn, Blastp, and HMMer (pearson cor. >0.98). In the read category, it was possible to detect positive correlation >0.97 between the three methods (Bowtie, Blastp, and HMMer), demonstrating convergence between the diversity detected by pairwise alignment (diversity described by family) and comparison by hidden Markov models (diversity described by vFAMs detected) (Figure 4, Supplementary Figure S9). As expected, the contigs reveal a narrower diversity of families due to the exclusion of sequences in the assembly step, which is reflected in the correlation of their results. However, the lengthening of the sequences by assembly indicated a great diversity of proteins that served to indicate the presence of relevant viral species (Figures 5, 6).

FIGURE 4

FIGURE 4. Correlatogram between the assigned family frequency distributions in the metagenomes obtained with all the methods used.

FIGURE 5

FIGURE 5. Heatmaps of the proportions of sequencing reads classified into viral families by sample origin. Graphs (A and B) correspond to the highest proportion (families with >2% of total viral reads by sample) and the lowest proportion of reads (families with <2% of total viral reads by sample). The top columns represent annotations of the genome and host type of each identified family. Uncolored boxes correspond to unclassified families. Genome types are coded as dsDNA, dsRNA, ssDNA(+), ssRNA-RT, and ssRNA(+). Their hosts are coded here as A (Animalia), Ar (Archaea), B (Bacteria), C (Chromista), F (Fungi), and P (Plantae).

FIGURE 6

FIGURE 6. Geographical distribution of the samples containing Arenavirus, Togavirus, and Flavivirus. The colored dots on the map symbolize the coordinates of each sampling group analyzed. The difference in size is proportional to the number of sequences of the family shown.

Some methods detected unique families; likewise, contig assembly in some samples allowed to describe and classify taxa not included by reads. Environmental samples such as rhizosphere, compost, and wastewater metagenome revealed viral sequences only after assembly (Figure 3). A total of 27 common families among reads and translated reads were detected: Baculoviridae, Togaviridae, Arenaviridae, Phycodnaviridae, Reoviridae, Mononegavirales, Protostomia, Retroviridae, Herpesviridae, Picobirnaviridae, Flaviviridae, Partitiviridae, Mitoviridae, Mimiviridae, Closteroviridae, Picornaviridae, Iflaviridae, Poxviridae, Marseilleviridae, Hepeviridae, Potyviridae, Iridoviridae, Astroviridae, Caliciviridae, Betaflexiviridae, and Alphaflexiviridae (Figure 3 and Supplementary Figure S4). In addition, the mapping of untranslated reads identified 76 families, indicating a significant number of sequences belonging to more than 40 viral families not detected in translated reads (Supplementary Figure S4). The distribution of families detected with contigs by Blastn and Blastp showed no significant differences (Figure 3). Moreover, the assemblers differed in the recovery of some viral species, since the efficiency of CAP3 or SPADEs is probably subject to the sequencing characteristics (transcriptomic, shotgun, pair-end, single-end, etc.) (Rosario and Breitbart, 2011; White et al., 2017). The two assembly approaches recovered in similar proportions of protein classes such as hypothetical (25%), photosystem (18%), nonstructural (8%), and structural (8%). Although no viral sequences larger than 4,000 nt were assembled with the techniques used, it was possible to detect important evidence of viruses such as RDRP, glycoproteins, nucleoproteins, and capsids (Figures 7, 8).

FIGURE 7

FIGURE 7. Proportion of contigs obtained related to viral proteins in each metagenome of organisms. The colors allow to identify in the horizontal bars the proportion of sequences related to viral proteins (x-axis) according to the viral family (y-axis) in each animal model.

FIGURE 8

FIGURE 8. Proportion of contigs obtained related to viral proteins in each environmental metagenome. The colors allow to identify in the horizontal bars the rate of sequences related to viral proteins (x-axis) according to the viral family (y-axis) in each environmental model.

Due to the heterogeneity of the experiments, each sample type detected different levels of viral diversity, which were largely OTUs consistent with the expected viruses according to the type of sample (Figures 3, 5 and Supplementary Figures S4, S5). In addition, many samples analyzed showed a diversity of sequences that allowed classification at the species level.

Amazon River

In the metagenomes from the Amazon River, a higher amount of sequences from viruses related to the families Flaviviridae, Togaviridae, and Paramyxoviridae were detected. Flaviviridae sequences were represented by sequences related to the hepatitis C virus (RNA+). In the Alphaviridae group, we detected sequences related to Semliki Forest virus (length, ∼200 nt; e-value, 1.10e−66; identity, ∼91%).

In the low proportion (<1% total viral sequences), we identified sequences from Guanarito mammarenavirus (length, 300 nt; e-value, 8e−116; identity, ∼96%) and Phenuiviridae (Joao virus) families (length, ∼300 nt; e-value, 8e−116; identity, ∼96%). In addition, in the lower frequency of reads (<0.1), we were able to observe viral evidence, corresponding to plants (Betapartitiviurs, Bromoviridae, Virgaviridae, Tobamovirus, Tymovirus, and Goraviurs), fungi (Narnavirus), and Enterovirus, Paramyxoviridae, Bunyaviridae, Pestivirus, Phenuiviridae, Rhaboviridae, and Sedoreovirinae. The assembly also revealed some proteins signals of interesting viral traces, such as movement protein [Elderberry aureusvirus 1], VP1 [Mangshi virus], movement protein [Rice virus A], polymerase [Reticuloendotheliosis virus], RNA-dependent RNA polymerase [Rhizopus microsporus 23S narnavirus], and nonstructural protein [Spodoptera exigua iflavirus 1]. Furthermore, we found sequences indicating a diversity of glycoprotein-specific motifs and structural and nonstructural proteins of Venezuela equine encephalitis virus (length, ∼300 nt; e-value, 8e−116; identity, ∼96%) and Mosso das pedras virus (length, ∼300 nt; e-value, 8e−116; identity, ∼96%).

Aedes aegypti Samples

The number of viral sequences recovered from A. aegypti samples was high as compared with most of the analyzed metagenomes (Supplementary Figure S2). Nonetheless, the highest proportion of viral sequences in A. aegypti samples clustered in few viral families. Thus, the largest number of reads was grouped in unclassified families, endogenous elements, and Retrovirus sequences. Less frequently, close to 1%, we detected Flaviviridae- and Betaherpesvirinae-related sequences (Heatmap, Figure 5). The Flaviviridae family showed a high occurrence of reads grouped in Hepacivirus C or unclassified Flaviviruses. Otherwise, the sequences in very low frequency (<0.5%) were grouped in more than 70 groups, which allowed us to point out fragments of epidemiological interest. Also, the highest proportion of RNA (−) viruses were related to viruses like Phasi Charoen virus, a specific insect virus of the Bunyaviridae family. In an epidemiological context, our observation points out reads related to Togaviridae and fragments that mapped with polyproteins of Mosso das pedras virus (length, ∼300; e-value, 6e−73; identity, 100.00%) and Mucambu virus (length, 300 nt; e-value, 4e−50; identity, 85%), which are below the Venezuelan equine Encephalitis virus complex. In addition, the assembly allowed to detect longer sequences of Togaviridae family, corresponding to glycoproteins, SNPs, and structural and nonstructural proteins related to Venezuelan equine Encephalitis virus, Mucambo virus, and Rio Negro virus (length, 400–1,350; identity, >90%). Phenuiviridae was represented by glycoprotein and RDRP of Phasi Charoen-like phasivirus (Ferreira et al., 1994; Auguste et al., 2009; Lopes et al., 2014; Espósito and da Fonseca, 2015; Esposito and Fonsecada, 2017). The samples from Aedes showed an important rate of RNA viruses; they also indicated a high percentage of DNA sequences, which in most cases correspond to endogenous viruses (Figures 2, 5 and Supplementary Figure S5).

Chiroptera Samples

Our investigations for bat species in public databases revealed six bat metagenomes samples from three projects. The species Sturnira ludovici and Carollia brevicauda, corresponding to the SRA project PRJNA563501, which focuses on the analysis of vomer-nasal censoring; the project PRJNA480298 that analyzed endogenous viral elements in the species Desmodus rotundus, Carollia perspicillata, and Artibeus lituratus (brain tissue samples); and the project PRJNA382386, which studies the presence of Trypanosoma in bat blood samples. First, we observed that the species S. ludovici and C. brevicauda clustered together, presenting a similar profile in both high and low frequencies. These two species presented a high percentage of Flavivirus, Arenavirus, Herpesvirus, and Endogenous viral elements. The remaining species (D. rotundus, C. perspicillata, and A. lituratus), although sampled with the same methodology, showed a different viral profile but overlapping viral families. The diversity profile corresponding to the “Bats (blood)” study was not represented in the heatmap (Figure 5 and Supplementary Figure S5) because the low proportion of reads becomes incomparable when scaling the matrix of viral abundances. In addition, reads grouped in higher proportions showed lower family diversity (N < 4), which contrasted with a higher number of families in lower frequency (N > 8).

The identification of Arenavirus in D. rotundus, C. brevicauda, and S. ludovici was also verified by the presence of contigs from these viruses. These sequences corresponded to fragments between 200 and 700 nt of glycoprotein and L-segment regions of G. mammarenavirus (e-value, ∼2e−57; identity, 98.46%). These viral motifs were also detected with Blastp in reads and contigs.

Marine Metagenome

The detected viral sequences were from dsDNA families Cytomegalovirus, Myoviridae, and Lymphocryptovirus. A large number of viral reads from Prochlorococcus phage P-HM2 were detected (Myoviridae) (Supplementary Tables S2–S4).

Compost and Sugarcane Rhizosphere Metagenomes

Classification of reads pointed to a few viral groups (Beta Baculoviridae), Durnovirales, and Herpesvirales. Nevertheless, the assembly of the sequences revealed the presence of the families Flaviviridae, Mimiviridae, Narnaviridae, Partitiviridae, Poxviridae, and Retroviridae (Supplementary Tables S2, S3). The contigs from Flaviviridae (ssRNA+) were related to (length, 300–500 nt; e-value, 7e−39; identity, >90%) polyprotein of Bovine viral diarrhea virus 2. Mimiviridae (dsDNA) was represented by contigs related to CDS (methionyl-tRNA synthetase, leucyl-tRNA synthetase, glutamyl-tRNA synthetase, cyclophilin type peptidyl-prolyl cis-trans, Capsid, Hsp70) of Edafosvirus, Gaeavirus, Harvfovirus, Terrestrivirus, and Hokovirus HKV1 (length, 150–1,300 nt; e-value, 1e−07; identity, ∼90%). The contigs recovered in Mitoviridae corresponded to RDRP proteins, which at the species level were related to Colletotrichum fructicola (length, 150–580 nt; e-value, 5e−38; identity, ∼75%), R. microsporus (length, 190–450 nt; e-value, 6e−170; identity, ∼95%), and Pleurotus ostreatus virus (length, 200–850 nt; e-value, 6e−170; identity, ∼91%), whose hosts are mushrooms. The contigs grouped in Poxviridae indicated the presence of polyubiquitins related to Magpiepox virus, an avian Poxivirus (length, ∼400 nt; e-value, 2e−48; identity, ∼98%). The largest proportion of contigs was grouped in Retroviridae, related to the gag protein of Porcine endogenous retrovirus.

The second group of samples belongs to the project Metatranscriptome from sugarcane rhizosphere under drought stress, which was collected in February 2015 in São Carlos, Brazil. The analysis in the rhizome samples grouped a low number of sequences, which indicated the presence of Hypovirus, Fabavirus, and Retrovirus porcine viruses.

Wastewater

Although a low number of sequences could be recovered, we grouped sequences in the medical importance category related to G. mammarenavirus (length, 200 nt, 2e−40; identity, 97.06%, Guanarito Mammarenavirus isolate CVH-960201 segment L). Also, we detected sequences corresponding to Astrovirus, Norovirus, and Oxbow Orthohantavirus (length, ∼200 nt; identity, ∼86%). The assembly of reads allows recovery sequences of capsids related to human Astrovirus (length, ∼190 nt; identity, ∼89.19%).

Diversity Observed According to Host Range and Environmental and Clinical Impact

Using data on host range, ecosystem, and societal impact, we were able to classify the viral sequences and motifs detected in each sample, facilitating the broad observation of groups of interest (Figure 9 and Supplementary Figure S10). The observed clustering was consistent with the expected viral diversity in each sample type, showing in environmental models the highest host potential in “Aquatic metagenome,” “Compost metagenome,” “Marine metagenome,” and “Rizosphere metagenome.” Interestingly, we also detected a great diversity of viruses with environmental impact in samples corresponding to C. brevicauda (fruit-eating bat), S. ludovici (fruit-eating bat), and Desmodus rotundus (Hematophage) (Figure 9). The sequences related to viruses of clinical importance showed a proportional distribution in most of the samples, with a number of viral sequences being identified in “Wastewater metagenomic” and “C. brevicauda.” The highest proportion of sequences related to clinical significance was clustered in the Aquatic metagenome and S. ludovici samples. The observation of the number of viral sequences according to hosts (Supplementary Figure S10) was consistent with sample type, where the highest number of viral sequences having animals as hosts was detected in the animal model samples. In contrast, viral sequences related to infection in prokaryotes, fungi, bacteria, and plants were mostly detected in Aquatic, Compost, and Rizosphere metagenomes (Figure 9 and Supplementary Figure S10).

FIGURE 9

FIGURE 9. Rate of sequences classified by their relevance clinical, environmental, economic, or veterinary relevance of the taxonomic group to which they belong. Also shown is the host range of the virome of each sample. Their hosts are coded here as A (Animalia), Ar (Archaea), B (Bacteria), C (Chromista), F (Fungi), and P (Plantae).

Analysis of Families Detected According to Length of Sequences

Depending on the informative capacity of the sequence, its size can directly influence the classification of viral signals in a metagenome. A histogram based on the length of the sequences identified in reads and contigs a heterogeneous distribution (alternate Supplementary Figures S6, S7) in the frequencies, which in reads is explained by the sequencing methods of the experiment. Reads of Illumina grouped in class intervals between 50 and 200 nt and Ion Torrent greater than 300 nt.

In order to evaluate how the size of the sequences influenced the taxonomic assignment (up to family level), we classified and compared the number of families detected as the length of the reads and contigs increased. We could observe, although to a lesser extent, that short sequences allow the identification of viral families, yet the highest informative capacity to classify is observed from 150 nt (Figure 10). This effect of read size on sequence identification is variable according to the type of experiment (Figure 10), but most indicated greater sensitivity with increasing read length.

FIGURE 10

FIGURE 10. Rate of families recovered in the total metagenomes as the length of reads is increased. The number of families recovered is shown (y-axis) according to the length of the sequence (x-axis). The colored boxes represent the viral family.

Description of Diversity

The starting point of our analysis focused on the results obtained in reads and contigs processed with Bowtie 2 and Blastn, where we obtained an interesting diversity classification, which was consistent with expected families in the samples. However, because of a large number of ambiguous short sequences that were related with virus, we resorted to analyzing diversity by alternative methods. In addition, a random sampling of sequences was performed to obtain an approximation of probable false positives due to pattern redundancy in nonviral reference proteins. We selected 10,000 random reads and 100 translated contigs (https://github.com/Fernando-Gmzr/Viral-signals-Public-metagenomes) and processed in Blastp to identify sequences related to nonviral origin with a high percentage of identity. The results of the random sampling of reads against reference proteins identified only sequences with a percentage of identity <60%, which belonged to sequences referenced by the RVDB as sequences of Guanarito virus (Arenaviridae) and Hepacivirus (Supplementary Table S6). These sequences were related to mammals and Actinobacteria. Only three contigs identified sequences with a high percentage of identity (proteomic reference of nonviral OTUs), which were related to the mammalian class (Supplementary Table S7).