Integrating Computational Methods to Investigate the Macroecology of Microbiomes

Mascarenhas, Rilquer; Ruziska, Flávia M.; Moreira, Eduardo Freitas; Campos, Amanda B.; Loiola, Miguel; Reis, Kaike; Trindade-Silva, Amaro E.; Barbosa, Felipe A. S.; Salles, Lucas; Menezes, Rafael; Veiga, Rafael; Coutinho, Felipe H.; Dutilh, Bas E.; Guimarães, Paulo R.; Assis, Ana Paula A.; Ara, Anderson; Miranda, José G. V.; Andrade, Roberto F. S.; Vilela, Bruno; Meirelles, Pedro Milet

doi:10.3389/fgene.2019.01344

REVIEW article

Front. Genet., 17 January 2020

Sec. Computational Genomics

Volume 10 - 2019 | https://doi.org/10.3389/fgene.2019.01344

This article is part of the Research TopicComputational Methods for Microbiome AnalysisView all 15 articles

Integrating Computational Methods to Investigate the Macroecology of Microbiomes

Rilquer Mascarenhas¹

Flávia M. Ruziska¹

Eduardo Freitas Moreira¹

Amanda B. Campos¹

Miguel Loiola¹

Kaike Reis²

Amaro E. Trindade-Silva^1,3

Felipe A. S. Barbosa¹

Lucas Salles⁴

Rafael Menezes^3,5

Rafael Veiga⁶

Felipe H. Coutinho⁷

Bas E. Dutilh^8,9

Paulo R. Guimarães Jr.¹⁰

Ana Paula A. Assis¹⁰

Anderson Ara¹¹

José G. V. Miranda⁵

Roberto F. S. Andrade^5,6

Bruno Vilela¹

Pedro Milet Meirelles^1,3*

¹Institute of Biology, Federal University of Bahia, Salvador, Brazil
²Chemical Engineering Department, Polytechnic School of Federal University of Bahia, Salvador, Brazil
³Department of Ecology, Biosciences Institute, University of Sao Paulo, Sao Paulo, Brazil
⁴Institute of Geology, Federal University of Bahia, Salvador, Brazil
⁵Institute of Physics, Federal University of Bahia, Salvador, Brazil
⁶Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Muniz, Fundação Oswaldo Cruz, Brazil
⁷Evolutionary Genomics Group, Departamento de Producción Vegetal y Microbiología, Universidad Miguel Hernández de Elche, San Juan de Alicante, Spain
⁸Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
⁹Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, Netherlands
¹⁰Department of Ecology, Biosciences Institute, University of Sao Paulo, Butantã, Brazil
¹¹Institute of Mathematics, Federal University of Bahia, Salvador, Brazil

Studies in microbiology have long been mostly restricted to small spatial scales. However, recent technological advances, such as new sequencing methodologies, have ushered an era of large-scale sequencing of environmental DNA data from multiple biomes worldwide. These global datasets can now be used to explore long standing questions of microbial ecology. New methodological approaches and concepts are being developed to study such large-scale patterns in microbial communities, resulting in new perspectives that represent a significant advances for both microbiology and macroecology. Here, we identify and review important conceptual, computational, and methodological challenges and opportunities in microbial macroecology. Specifically, we discuss the challenges of handling and analyzing large amounts of microbiome data to understand taxa distribution and co-occurrence patterns. We also discuss approaches for modeling microbial communities based on environmental data, including information on biological interactions to make full use of available Big Data. Finally, we summarize the methods presented in a general approach aimed to aid microbiologists in addressing fundamental questions in microbial macroecology, including classical propositions (such as “everything is everywhere, but the environment selects”) as well as applied ecological problems, such as those posed by human induced global environmental changes.

Introduction

The purpose of macroecology is to describe spatial patterns of species distribution and abundance, as well as the mechanisms underlying such patterns (McGill, 2003; McGill and Nekola, 2010). The availability of large amounts of data (Hampton et al., 2013) has helped to uncover global ecological patterns in species distribution and abundance, greatly advancing the field of macroecology. This is highlighted by several studies discussing the contribution of microbial community investigations to a unified macroecological theory (Barberán et al., 2014; Blaser et al., 2016; Nelson et al., 2016; Shade et al., 2018). Strong evidence suggests that micro-organisms in deep display biogeographical patterns which are driven by dispersal processes, climate and evolutionary history, such as species-area and distance-decay associations (Horner-Devine et al., 2004; Astorga et al., 2012; Barberán et al., 2015). The field of microbial macroecology has therefore emerged as a promising research path to the unified understanding of ecological processes shaping patterns across different branches in the tree of life.

The contributions of microbiology to macroecology are currently possible largely due to the methodological advances in theoretical and computational tools for investigating microbiomes. Advances in molecular biology and DNA sequencing in the last decade have provided microbial ecologists with new tools allowing the extraction of an unprecedented amount of information from myriads of microbial communities (Snyder et al., 2009). As a result of the growing amount of stored data, new software has been developed for the systematic study of microbial communities on a macroecological scale. Integration among these tools, however, is not a simple task. Microbial macroecology stands to benefit from a formal summary describing the coupling of microbial community characteristics with spatial environmental information.

In this review, we summarize important conceptual challenges as well as computational and methodological opportunities in the study of microbial macroecology, in order to facilitate data integration. We begin by reviewing what has already been described in this field, specifically addressing the conceptual issues of transitioning from micro- to macro- scales when using micro-organisms as model systems. Then, we provide a comprehensive summary of computational tools for describing microbial communities and predicting the distribution of taxa across large spatial scales. Finally, we conclude by proposing a general framework to aid microbiologists in incorporating the peculiarities of micro-organisms into conceptual frameworks of macroecology, in order to promote the unification of microbial ecology and general ecology.

What Have We Done So Far: A Brief Review of Macroecological Studies in Microbiology

Most macroecological studies of microbial communities to date sought primarily to describe patterns in large spatial scales, investigating whether biogeographical patterns exist for the microbiota (Noguez et al., 2005). Most studies were conducted in soil and marine environments and revealed that such patterns do exist. They suggest that environmental predictors for microbiomes could differ from those usually assumed for macroorganisms (i.e., temperature, precipitation and altitude; Fierer and Jackson, 2006); environmental features such as pH, edaphic conditions and land usage are stronger and better predictors for soil microbiomes. However, soil moisture and temperature also appear to be important to predict microbial community composition in some cases (Fierer and Jackson, 2006; Lauber et al., 2009; Drenovsky et al., 2010; Zhou et al., 2016). In marine environments, spatial structure for microbial communities appears to be less prominent (i.e., lower beta-diversity) in comparison to terrestrial and freshwater systems, which is probably due to the more homogeneous abiotic structure of the open ocean (Soininen, 2012) in relation to other environments. Additionally, temperature was a strong predictor for a latitudinal gradient pattern found in planktonic bacteria, with little importance from other variables, such as productivity and salinity (Fuhrman et al., 2008). One study suggested the influence of altitude—a factor that influences that altitudinal patterns of macroorganisms (Lomolino, 2001)—seem to be not relevant for micro-organisms (Fierer et al., 2011). By contrast, Delgado-Baquerizo et al. (2016) stated that altitude gradients are important drivers for microbial diversity considering a wide spatial range (0–4600 m). Finally, it was suggested that micro-organisms in the atmosphere follow a precipitation gradient at continental scales (Barberán et al., 2015). These studies show that some macroecological patterns exist at microbial scales and that they may be similar to those found for macroorganisms in some cases, but not similar in other instances. This raises the question: to which extent are these patterns ubiquitous through all domains of life?

Although much effort has been made to unravel microbial macroecological patterns, so far there is no consensus on which abiotic factors are good predictors of microbial community composition, hampering the implementation of macroecological models to microbial data. Additionally, even though the studies above show strong correlations between variables and microbiome composition, it is still unknown whether the used variables are true drivers of the observed processes, or whether they are actually correlated to unmeasured, confounding factors (Rahbek, 2005). Biotic interactions seem to be equally important in determining community composition; a modeling approach using Artificial Neural Network (Larsen et al., 2012) highlighted the importance of such interactions for creating more accurate models, and a recent study using large microbial community datasets suggested that rarer taxa are better predictors of community structure than environmental factors (Ramirez et al., 2018). Therefore, a modeling framework based on the conceptual idiosyncrasies of microbiomes is required.

Conceptual Challenges for Transitioning Across Spatial and Temporal Scales

An issue arising in all studies addressing microbial macroecology is the proper evaluation of spatial and temporal scales under investigation. The idea that ecological patterns are scale-dependent is pervasive in ecological theory (e.g., Levin, 1992; Crawley and Harral, 2001; Chase and Leibold, 2002; Wu et al., 2002). Two macroecological studies (Willig et al., 2003; Rahbek, 2005) performed at different spatial scales reported distinct patterns for how species richness was associated with latitude and altitude. Hump-shaped patterns dominate species richness and altitude relationships, when the scale of the gradient survey is higher than 1,000 km, but is an uncommon pattern when the scale is below this value. The two studies cited above define two attributes of the sampling design that determine the scale that is being analyzed (Figure 1): the unit of sampling and the geographic space covered. The sampling unit is determined by the grain or focus size, i.e., the size of the common analytical unit in the analysis, whereas the geographic space covered, also called the extent, represents the geographical space on which inferences can be made (Figure 1A), in other words, the spatial extent covered by all sampling sites (Rahbek, 2005). Macroecological studies investigate processes in large geographical spaces, e.g., continental or global scales (Fierer and Jackson, 2006; Fuhrman et al., 2008; Nelson et al., 2016), which in general define a large extent for macroecological inference. The unit of sampling is represented by the degree of resolution in both response and predictor variables utilized, which can vary widely across studies. Communities' abundance or richness profiles (the response variable) might represent samples in a specific point in space, or samples across different spatial points in the same assumed community (Figure 1B). Equally, a single value in a predictor variable (e.g., abiotic conditions, such as temperature, pH, altitude, humidity, etc.) might represent either a 1 km² or a 10 km² geographic area, depending on how coarse the available environmental information is (Nottingham et al., 2018). The choice and evaluation of the available information is an important step in macroecological studies and may have a deep impact on the results obtained.

FIGURE 1

Figure 1 Spatial extent and sampling unit in macroecological analyses. (A) Different spatial extents can be analyzed in a macroecological study, which will reflect on the environmental information available for inference and how much extrapolation can be derived from the conclusions of the study. The figure shows annual mean temperature per cell, ranging from low temperatures in blue and high temperatures in red. Notice that the lowest temperatures (blue and green cells) are different for each extent. For instance, when studying Central America, the lowest temperatures can be found in Mexico highlands, whereas an extent focused on the whole Neotropics show lowest temperatures around the Andes mountains. Therefore, caution is necessary when inferences from studies on the Central America are extrapolated to the Neotropics extent. (B) Example of two different sampling units in macroecological studies: equally distant squared grids and local sites unevenly distributed through the globe. As highlighted by Hillebrand (2004), squared grids consist of a value averaged across sites within the grid, which decreases the effect of local scale factors (e.g., biotic interactions, dispersal and stochasticity) on the latitude gradient diversity pattern.

Several processes that might be important at local scales may have little effect on, and sometimes even confuse, a pattern at larger spatial scales. For example, Hillebrand (2004) compared studies on the latitudinal species richness gradient, a long well-recognized macroecological pattern, where species richness was known from occurrences equally sized squared areas equally distributed across space (i.e., grids) and studies where species richness was known from sampling points from different studies unevenly distributed across the globe (i.e., local sites). The results demonstrated that the decline of diversity towards higher latitudes was steeper in grid-based studies, suggesting the pattern is easier to detect by using a coarse-grained metric of diversity (as exemplified Figure 1B) because local processes (e.g., biotic interactions, dispersal and stochasticity) are averaged out. Additionally, microbial communities seem to be spatially structured mostly at larger study scales (Soininen, 2012), since such scales encompass multiple biogeographical regions separated by dispersal barriers and large variation in climate (Martiny et al., 2006). Therefore, at a smaller spatial scale, community composition may seem stochastic, or greatly vary in short periods of time. The overall conclusion from these studies is that different predictor variables will be biologically relevant at different ecological scales. This suggests that selection a set of predictor variables for model calibration must take into account the ecological scale of the investigated process. Traditionally, in macroecological species distribution models, temperature and precipitation have been successfully used as predictors for macro-organisms, although recent approaches have successfully incorporated biotic interactions into such models (e.g., Araújo and Luoto, 2007; Wisz et al., 2013). A remaining question is whether these same variables are biologically relevant for micro-organisms at large scales. At least for specific and microbiologically diverse ecosystems such as soils, climate—expressed both in terms of climatic factors such as temperature and precipitation, as well as climate-associated attributes such as soil pH, aridity and productivity—is considered a key driver of the structuring and functioning of global microbiomes (Delgado-Baquerizo et al., 2016; Delgado-Baquerizo et al., 2018; Bastida et al., 2019).

There are two main aspects of micro-organisms, which suggest that biologically relevant variables to predict micro-organisms' distribution may indeed be different from those used for macro-organisms. First, micro-organisms exhibit a higher evolutionary rate. Second, due to the organism size, the spatial scale at which micro-organisms perceive the environment is different from that of macro-organisms (Barberán et al., 2014). The first of these aspects indicates that micro-organisms readily adapt to new environments, which means that the distribution range of different microbial taxa is likely to be in equilibrium with environmental variables, which is not always true for macro-organisms (Araújo and Pearson, 2005). Additionally, a high evolutionary rate in micro-organisms indicates that temporal variability in microbiome composition may be high: when environmental changes occur, the microbiome structure is rapidly modified in response, whereas such responses in macro-organisms (expressed in the arrival and disappearance of species, as well as the rise of new adaptations in native species) may take a longer time. This suggests that each microbial sampling is invariably a narrow temporal snapshot of the microbiota, highlighting the importance of time-series sampling to describe for macroecological trends. The very reduced organism size implies that micro-organisms interact with different aspects of the environment, indicating that relevant predictor variables might include, but are certainly not restricted to, large-scale environmental variation. This is still a debatable topic in macroecology of micro-organisms, as some studies argue that micro-organisms respond to continental-scale climatic and environmental variation (e.g., Barberán et al., 2014; Delgado-Baquerizo et al., 2018), whereas others highlight that microscale environmental variation might be more important in predicting distribution patterns (Hendershot et al., 2017). Therefore, when implementing microbiome modeling, one should keep in mind that there is no consensus on which predictor variables should be used. For micro-organisms, the word “environment” might reflect both biotic and abiotic factors surrounding individuals of a species in a defined area, and the relative importance of these two types of factors might be different from what is known for macro-organisms.

The differences between micro- and macro-organisms need to be considered when implementing any of the methods described in this review. For each approach, it is necessary that the macroecological question is clearly stated, and in a way that the scale of sampling and the scale of the studied processes are in agreement with the scale of the proposed questions. In the following sections, we discuss different macroecological approaches for microbiomes, focusing on the description of macroecological patterns and the modeling of microbiomes at macroecological scales. In each case, we highlight how available methods and information can help researchers to answer questions at different spatial and temporal scales.

Describing the Microbiome in Macroecological Scales

Taxonomic Profiling and Exploratory Analyses in Microbial Macroecology

The basic input data for macroecological studies is a matrix displaying the presence-absence or abundance data of a biological entity in any taxonomic level across different sampling units (usually a locality defined by a pair of coordinates, but may reflect finer or coarser areas, depending on the specific question, Shade et al., 2018). For microbial communities, such a matrix is usually obtained through the taxonomic annotation of several short DNA sequences (i.e., reads) derived from the high-throughput sequencing of an environmental sample (Riesenfeld et al., 2004; Hugenholtz and Tyson, 2008). Reads must first be filtered according to quality and to remove possible contaminants, in order to minimize annotation errors; these tasks can be accomplished using tools such as Prinseq (Schmieder and Edwards, 2011) and Trimmomatic (Bolger et al., 2014). A common and desired practice is to deposit filtered reads in public repositories along with associated metadata, providing public access to the information. This is particularly important for macroecological studies, which make use of secondary data for analysis at large spatial scales. The most prominent repositories for metagenomic data are the NCBI short read archive (SRA; Leinonen et al., 2011b), MG-RAST (Meyer et al., 2008) and the European Nucleotide Archive (ENA; Leinonen et al., 2011a), some of which also provide bioinformatics tools for taxonomic annotation and statistical analysis (e.g., MG-RAST and MGnify; Mitchell et al., 2018). Is worth mentioning that the metadata standard for sequences deposited in International Nucleotide Sequence Database Collection (INSDC) is MIxS (Yilmaz et al., 2011).

Multiple approaches currently exist for obtaining taxonomic profiles from metagenomic sequences, and they mostly fall into four categories depending on the type of data used: 1) amplicon reads, 2) Whole Genome Shotgun (WGS) sequencing reads, 3) assembled contigs and 4) Metagenome-assembled Genomes (MAGs; Figure 2A). Each of these has unique advantages and limitations and is suitable to address different scientific questions (Table 1). Amplicon analysis consists mostly of PCR amplification of the 16S rRNA gene through the use of degenerate primers designed to cover as much of the diversity of Bacteria and Archaea as possible (Schmidt et al., 1991; McDonald et al., 2012). Next, amplicon sequences are mapped to reference databases, such as RDP (Cole et al., 2014), SILVA (Quast et al., 2013) and Greengenes (DeSantis et al., 2006), which contain pre-computed high-quality alignments of 16S rRNA genes, allowing for fast taxonomic assignments for millions of sequences. This approach tends to be accurate at low taxonomical levels (e.g., genera) and is cost effective, considering the coverage of sequencing per sample, making it possible to sample many more replicates per study. On the other hand calculating taxa abundances across samples can be a limitation due to the presence of multiple copies of the 16S rRNA gene in a single genome. Additionally, the so-called universal primers used for amplicon analysis usually do not amplify genes derived from major fractions of the diversity of Bacteria and Archaea, such as the candidate phyla radiation (Hug et al., 2016a).

FIGURE 2

Figure 2 A workflow summary for taxonomic annotation and exploratory analyses. Taxonomic annotation methods are used to generate, for instance, presence-absence matrices (A), which can be combined with environmental variables into correlation analyses (B). The biological variation in environmental variables can be simplified through ordination analyses (such as PCA and MDS). Finally, distance matrices can be created for both ecological and environmental variation, and distance matrix correlation can be used to infer if environmental distances correlate with ecological differences among sampling sites.

TABLE 1

Table 1 Approaches for obtaining taxonomic profiles from metagenomic samples.

One common alternative to amplicon sequencing is Whole Genome Shotgun (WGS), i.e., the sequencing of DNA fragments covering the whole diversity of genes in an environmental sample. Similar to amplicon based studies, WGS reads are annotated by comparing them to previously characterized sequences deposited in reference databases, encompassing genes from multiple taxa. This comparison can be based on homology or the search for similar k-mer profiles (i.e., the set of all possible sub-strings of different lengths for a DNA sequence). Due to redundancy in the genetic code, proteins are more conserved than nucleotide sequences; using homology to detect similar protein sequences is more sensitive and suitable for detecting distant evolutionary relationships, allowing more sequences to be classified. Because the degree of identity between the sequences of naturally occurring microbes and those available in reference databases is often very low, annotations of WGS reads often require using permissive cutoffs (i.e., reads are assigned to a taxon even if the identity is low, e.g., only 30%), provided that it falls within other assumed cutoffs of alignment, length and e-value. Several reference databases are currently available, as well as tools to detect protein-protein and protein-nucleotide homology (Table 1). As an alternative to homology searches, k-mer composition profiles are significantly faster and make it possible to rapidly analyze a large number of samples (Table 1).

Using WGS sequencing further allows for the assembly of raw reads into larger contigs, and, in some cases, later binning into metagenome-assembled genomes (MAGs; Figure 2). This approach may improve taxonomic classification by assessing longer genomic fragments that derive from such sequence assembly. The Critical Assessment of Metagenome Interpretation (CAMI) challenge reviewed several metagenomics tools (Sczyrba et al., 2017). This study distinguished between taxonomic binners (which allow taxonomic abundances to be inferred by clustering individual sequences, then assessing longer genomic fragments Lin and Liao, 2016; Wu et al., 2016), from taxonomic profilers (which focus on predicting a taxonomic abundance profile without necessarily classifying every sequence, often assessing only raw reads Ounit et al., 2015; Koslicki and Falush, 2016). They show that classifiers in general were more accurate than profilers in estimating the relative abundances of taxa. This increased performance is due to the fact that longer sequences contain more phylogenetic information than short reads, leading to less noise in the taxonomic profile. Moreover, because sequence assembly reduces the total volume of sequence data to be classified, more sensitive homology searches that are computationally more demanding may be applied than the rapid searches that are used for classification of short, raw reads. Two recently developed tools that explicitly exploit the added information in assembled contigs are MEGAN-LR (Huson et al., 2018) and the Contig Annotation Tool [CAT, (von Meijenfeldt et al., 2019); https://github.com/dutilh/cat] that exploit all sequences in the full GenBank reference database for taxonomic classification. A limitation of metagenomic assembly is that it is susceptible to possible errors arising during the assembly, which is aggravated when population diversity of the sampled microbial community is high (Sczyrba et al., 2017). Moreover, high levels of sequence heterogeneity between related strains may lead to abundant genomes in the sample being misassembled as chimeras, and potentially misclassified. The subsampling of shotgun metagenomic reads before assembly has been applied to resolve this problem (Hug et al., 2016b).

Once contigs have been assembled into longer fragments of the genomes present in the community, metagenome-assembled genomes (MAGs) may be reconstructed by binning contigs from the same genome together. Several software tools are available to perform MAG reconstruction (Table 1). At this stage, phylogenetic and phylogenomic methods can be used to determine the taxonomic affiliation of these MAGs with even more confidence than that of individual contigs. Additionally, MAGs and assembled contigs can be used to build custom sample-specific reference databases for read mapping (e.g., Speth et al., 2016). The main advantage of using such databases is that often many more reads can be assigned, because the contig sequences represent the strains that are reconstructed from the same sample, minimizing the occurrence of false positives. Therefore, the obtained taxonomic profile contains less noise and more comprehensively represents the data.

The taxonomic profiles obtained from the methods above can be assembled into presence-absence or abundance matrices and further explored using classic multivariate exploratory analyses, such as multivariate ordination/canonical methods (Hanson et al., 2012; Xue et al., 2018). Under the macroecological rationale, exploratory analyses are used to describe the biological variation across a global or continental gradient in potential explanatory variables (e.g., describing diversity or abundance variation across the latitudinal temperature gradient, continental atmospheric variation, etc.; Shade et al., 2018). Correlation among explanatory variables is a common issue in biological statistics, and multivariate ordination is then used to reduce dimensionality and yield new mathematically uncorrelated axis from the original correlated explanatory variables (Legendre and Legendre, 2012; Figure 2B). A few approaches widely used for this purpose are: 1) Principal components analysis (PCA), which is based on covariance or correlation matrices and is suitable for sets of linearly correlated measures; 2) principal coordinates analysis (PCoA), which differs from the PCA by extracting eigenvalues from similarity or distance matrices, therefore being appropriate for non-linear relationships; 3) multidimensional scaling (MDS) that, unlike PCA and PCoA, is not based on eigenvalues decomposition and, like PCoA, is limited to Euclidean distances matrices and 4) correspondence analysis (CA), based on contingency table of categorical variables (Bray and Curtis, 1957; Clarke, 1993). The new mathematical axes provide a mathematical space where measurements from the actual environmental samples can be placed and compared. The associations between variables (e.g., diversity and temperature) can also be tested by classic statistical analyses like regression and correlation, which can be based on both original explanatory variables and new mathematical axes created by ordination analyses. Additionally, ecological similarity between localities can be explored using distance measures (e.g. Euclidean, Mahalanobis, Jaccard, and Bray-Curtis) and compared against a distance matrix for a potential explanatory variable in the same localities and statistical significance can then be assessed by using a test such as the Mantel test (Figure 2B). Such approaches are commonly used in macroecological studies to statistically assess the correlation between two distance matrices based on variables of interest (e.g., Duarte et al., 2009; Bell, 2010).

Describing Community Structure With Co-Occurrence Networks

Co-occurrence networks (CNs) has been used to describe associations within microbial community (Figure 3). Usually, in these networks, the nodes represent taxa and the edges represent statistically significant positive or negative correlations in the abundance of taxa across several samples in a given environment or host (Faust and Raes, 2012). A few authors may also include abiotic factors as nodes (e.g. Li et al., 2017). Using CNs can reveal insights about possible ecological interactions and distribution patterns of microbial taxa (Faust and Raes, 2012; Cardona et al., 2016). Two important types of information can be retrieved from CNs: 1) changes in community structure across environmental gradients, that is, variation not only in the species abundance, but especially in the degree of correlation between taxa across environmental gradients; and 2) potential biotic interactions that can be useful for macroecological modeling (Predicting Microbial Distribution and Community |Composition). Since CNs are based on abundance correlation, it is desirable that they are built over a large number of sampling units, and therefore hold great potential for application in macroecological studies (Berry and Widder, 2014). Distinct approaches have been used to construct CNs and derive information from their structure, such as distance or similarity matrix metrics among the samples used to construct the networks (Fan et al., 2018; Jackson et al., 2018; Marasco et al., 2018; Box 1). Overall, the same matrix generated by the software tools listed in the previous section can be used as input for CN calculation. Samples can be grouped according to the macroecological variable of interest (e.g., temperature variation across latitudes, atmospheric variation across a continent, variation in land cover across the globe) and the structure of CNs from each of these groupings can be compared across global or continental scales (Figure 3). Note that comparison of microbial community structure has often been performed across different ecosystems (e.g., comparing the structure of networks between fresh and saline water environment), but the macroecological approach supports the rationale of a comparison within the same environment (e.g., soil samples) across an environmental gradient (e.g., temperature, pH, etc.; Barberán et al., 2012). Several measures exist to describe network structure, such as symmetry, degree distribution, checkerboard index (Horner-Devine et al., 2007; Araújo et al., 2011; Layeghifard et al., 2017), but the best usage of such metrics is an ongoing debate (Layeghifard et al., 2017) and is highly dependent on the ecological question being asked.

FIGURE 3

Figure 3 Co-occurrence networks applied to microbial macroecology. (A) A hypothetical example of a co-occurrence network. Circles represent different taxa and edges connecting two circles indicate statistically significant co-occurrence between those two taxa, i.e., they co-occur more than expected by chance in the set of samples analyzed. Network structure can indicate ecosystem properties, and these can be translated into statistics summarizing network topology (see Box 1). For instance, this hypothetical network shows two subunits (or modules) separated by the taxon indicated as a red circle. This taxon is also a node with high betweenness centrality (i.e., indirect connections between any two nodes in the network has a high probability of going through this node), whereas the green circle represents a node with high degree (i.e., showing a connection to many other taxa). (B) A hypothetical example of a macroecological study using co-occurrence networks. Red squares represent an area where several samples were gathered and analyzed, yielding a single abundance matrix and a corresponding co-occurrence network (two sites pointing to the same network represent areas in which networks are highly similar). The topology of the network changes in different ecosystems across the globe, and the overall hypothetical pattern is represented in the graphics below: network modularity (i.e., defined as the number of subunits within the network, as well as the relative proportion between connections within and between modules) decreases as precipitation and temperature increases (but the change is less intense for temperature).

Box 1. Building and Interpreting Co-Occurrence Networks.
Several tools are available to build and interpret co-occurrence networks. The software CoNet (Faust and Raes, 2016), developed in Cytoscape (Shannon et al., 2003), allows the usage of several measures for dependency, similarity and dissimilarity, to build and visualize co-occurrence networks. In order to build these CNs, the microbial composition data is provided in relative abundances. Some annotation tools provide microbial composition in read counts, in this case one can use SparCC (Friedman and Alm, 2012), which calculates abundance correlations among taxa without the issues associated with compositional data (Mendes et al., 2018), for further CNs analysis. Alternatives to SparCC are REBACCA (Ban et al., 2015) and CCLasso (Fang et al., 2015). Kurtz et al. (2015) presented another tool: SPIEC-EASY, a pipeline that transforms relative abundance data and estimates interaction graphs. Finally, a few approaches are based on information theory, for instance: using mutual information combined with other metrics, implemented in CoNet (Lima-Mendez et al., 2015). Choosing a correlation method for network construction is critical once networks generated by different methods can provide contrasting results (Weiss et al., 2016). Methods should be chosen taking into consideration if microbial community data are presented as relative abundance or in absolute read counts.

Keystones in CN

There is no consensus on the operational definition of keystone for microbial ecology (reviewed in Banerjee et al., 2018). However, a usually proposed definition is that keystones are highly connected microbial taxa presenting a unique and crucial role for community structure and functioning, so their loss or removal should have large impacts on microbial community (Banerjee et al., 2018). In this sense, network theory provides us with quantitative ways to characterize how connected a given microbial taxa is. One criterion, based in network theory, to determine a putative keystone taxon is high betweenness centrality (BC; e.g., Lupatini et al., 2014; Banerjee et al., 2016; Jiao et al., 2016; Li et al., 2017; Mendes et al., 2018), albeit an investigation based on dynamical modeling found lower BC to be correlated with higher probability of a taxon being keystone (Berry and Widder, 2014). The BC of a node A is the number of shortest paths connecting two nodes which pass through the node A. Nodes with high BC connect portions of the network that would otherwise be sparsely or not connected at all. Therefore, removing high BC nodes leads to a sparser network, disconnecting modules in several cases. The number of connections a node presents, which is called the node's degree, is also a frequent metric used as a keystone index (Comte et al., 2016; Hartman et al., 2018). This is based on the idea that, taxa (nodes) that are connected with multiple others are important to network structure, and their potential removal would have a high impact to the community. It is interesting to highlight that, whereas one node can have both high degree and high BC (in which case this taxa would be considered keystone by both definitions), it is also possible to find nodes in which BC is high and degree is low, or vice-versa, leading to a disagreement between these two keystones definitions. Therefore, it is important to have in mind the biological process of interest because this will determine the more important features in a given community and what keystone definition one should use.

A different approach, based on metabolic networks (Guimera and Amaral, 2005), assumes that the network is formed by modules (i.e., semi-independent groups of cohesive, interacting taxa). In this approach, one can calculate the z-score, which is a measure of the number of interactions a taxon has within its module; and the c-score, which describes how evenly distributed are the interactions of a given taxon across multiple modules. These two values allow us to classify the taxa in network hubs (z-score > 2.5; c-score > 0.6), module hubs (z-score > 2.5; c-score < 0.6), connectors (z-score < 2.5; c-score > 0.6) and peripherals (z-score < 2.5; c-score < 0.6) (Poudel et al., 2016; Fan et al., 2018). Putative keystones taxa would then be the nodes identified as network hubs, module hubs and connectors. One advantage is that this definition takes into account multiple features that might make a node important to a network (e.g., participating in a network within a hub or as connectors between hubs), whereas, when one looks only at BC or high degree, a single type of keystone feature is taken into account.

Indirect Effects From CNs

In networks, species that do not directly interact can influence each other through cascading effects that spread through the network (indirect effects). Guimarães et al. (2017) developed an analytical framework to quantify the total amount and the importance of the indirect effects in a given network. Their results show that network structure is what drives how the indirect effects spread through the network (Guimarães et al., 2017). Networks of micro-organisms, which are species-rich networks formed by a small core of highly connected species and many species poorly connected (Banerjee et al., 2018), are predicted to show a higher amount of indirect effects than poor, highly modular networks. Therefore, quantifying indirect effects might be an important aspect in the study of which micro-organisms are keystones to a given community relevant to maintain relevant ecosystems functions and contribution to resilience and stability in face of global environmental changes (Berry and Widder, 2014).

In addition to measuring indirect effects, it is possible to explore the consequences of such effects. Resilience and stability are important aspects of network structure that can be measured by using approaches derived from the study of dynamical systems. Coyte et al. (2015) proposed an extremely general and suitable framework that can be used to analyze species-rich microbial networks. Their approach uses the eigenvalues of the matrix that describes the effects of ecological interactions at the equilibrium (Jacobian matrix) associated to a given network, to analyze the stability and resilience of microbiome networks. Their approach can be used in networks that possess any combination of different types of interactions (cooperation, competition, exploitation, amensalism and commensalism). One important result of their analyses is that cooperation tends to destabilize microbial networks. The destabilization effect happens because of the presence of positive feedbacks between the species when they cooperate, which leads to cascading effects. For example, a decrease in population size of one species might lead to all the species they positively interact with to decrease as well. On the other hand, competition gives a stabilizing effect in the network; compensating the destabilizing effect that increasing richness can have in an ecological community (May, 1972).

Co-occurrence networks may also be used to identify keystone taxa (Box 1). The keystone concept was first coined by Paine (1966), who demonstrated that the removal of the sea-star Pisaster ochraceus caused a dramatic change in community structure on a rocky shore, concluding that the species functioned as an important element for maintaining community integrity, most likely due to its non-redundant role (Paine, 1969)⁠. This definition can be applied in the microbial ecosystem and be empirically investigated by using network approaches. Keystone taxa can be compared across macroecological scales to investigate whether and how the importance of specific groups as key taxa in communities across an environment varies on global scales. Since keystone taxa usually perform important and non-redundant functions, their identification may be important to understanding ecosystem functioning.Thus, an approach coupling keystone identification with measurements of functional diversity across macroecological scales holds potential to bring numerous insights (see below). Finally, another insight derived from CNs is how the network structure may favor or constrain cascading effects (Box 1), which may favor or imperil the resilience of the communities against perturbations (another ongoing debate within ecosystem ecology; Oliver et al., 2015). Cascading effects often propagate across networks, connecting the dynamics of taxa that do not directly interact with each other. In fact, networks of taxa are subject to influences from taxa they directly interact with, as well as to indirect effects that pervade the network, i.e. from taxa with which they do not interact directly. Under certain conditions, indirect effects can be more important to the network dynamics than the direct effects (Ohgushi, 2005). Indirect effects can be measured across macroecological scales to assess, in a spatially explicit manner, in which ecosystems indirect effects seem to play a more important role to maintain microbial community stability (Guimarães et al., 2017).

Revealing Macroecological Patterns From Microbiome Functional Diversity

Functional ecology, defined as the study of the roles that organisms play in their ecosystems, also holds great potential for microbial macroecology. Studies investigating levels of functional diversity across macroecological scales are already common for macro-organisms (Fu et al., 2017; Jarzyna and Jetz, 2018)⁠, both in theoretical investigations of processes determining functional diversity (Safi et al., 2011) and in more practical inquiries such as the conservation of ecosystem functions (Devictor et al., 2010). Yet similar studies have not been performed for micro-organisms. For instance, previous studies have explored like global patterns of mammalian functional diversity (Safi et al., 2011) as well as global scale marine macroecological patterns (Amend et al., 2013) have no equivalent investigation concerning microbial functional diversity. Macroecological studies might yield insights on the patterns observed for the functional diversity of micro-organisms across different environments in the globe, and address their relation to ecosystem functioning and service provision (Mace et al., 2012).

Functional diversity is one of the three main biodiversity dimensions investigated in macroecology, alongside taxonomic and phylogenetic diversity (Webb et al., 2002; Devictor et al., 2010). Functional diversity is usually defined as the amount, variation and distribution of traits in a community (Dıaz and Cabido, 2001), originally measured by the calculation of the total branch length of the functional dendrogram constructed from information about taxa' functional traits (Petchey and Gaston, 2002). From this initial method, several new conceptual and mathematical approaches have been developed and implemented (a few revised in Petchey et al., 2004), but none of them dismiss the need to 1) choose the functional traits through which organisms will be distinguished, 2) define how the diversity of the trait information will be summarized into a measure of functional diversity, and 3) validate the measurements through quantitative analyses and experimental tests (Petchey and Gaston, 2006). In micro-organisms, functional traits are usually viewed as the genetic and biochemical characteristics of organisms affecting ecosystem functioning, such as the production of metabolic inhibitors or enhancers, or enzymes playing a role in ecosystem metabolic pathways (Dıaz and Cabido, 2001). In this sense, the function of micro-organisms in an ecosystem is defined by their genetic composition, which ultimately dictates the molecules they metabolize (Faure and Joly, 2016). Similar to taxonomic annotation, functional traits can be derived by direct functional annotation of metagenomic short-reads from an environmental sample (with no taxonomic annotation). Alternatively, prior metataxonomic approaches (e.g., 16S rRNA) can be used to taxonomically assign individuals in a sample, and then functional annotation can be derived from their phylogenetic position. Software tools to perform both approaches are summarized in Table 2, with their respective references. All of these metagenomic and metataxonomic functional annotation approaches are based on genomic databases and the accuracy of annotation depends on the quality of software databases. Furthermore, many genes are still unassigned, and their functions are unknown, making it challenging to infer ecological functions from genetic content alone (Faure and Joly, 2016).

TABLE 2

Table 2 Tools used to annotate functional potential profiles from metagenomic reads or to infer them from 16S taxonomic annotation.

The degree of functional diversity has been used to investigate two main macroecological patterns in microbial communities: 1) relationships between community taxonomic and functional composition among microbial communities (Louca et al., 2016; Vieira-Silva et al., 2016; Galand et al., 2018) and; 2) how microbial functions vary in time and space (Dinsdale et al., 2008; Ren et al., 2017; Galand et al., 2018). Usually the most accessed functional measures are diversity (including functional richness, evenness and divergence), composition, redundancy and rarity. Several algorithms and computational tools have been published in order to assess and quantify these functional features (Table 3, also reviewed in Mouchet et al., 2010; Schleuter et al., 2010; Song et al., 2014; Bond-Lamberty et al., 2016; Ricotta et al., 2016). Addressing the above-cited questions, one of the emerging patterns in micro-organisms is a decoupling between functional and taxonomic composition (Louca et al., 2016). This trait suggests that microbial communities may present a high degree of functional redundancy, meaning that shifts in taxonomic community composition do not lead to shifts in functional community composition. It has been hypothesized that the mechanisms underlying microbial assemblage are distinct from mechanisms governing functional composition, and that environmental factors are potential predictors of functional composition (Louca et al., 2018). We further suggest that approaches for characterizing functional diversity should also be coupled with estimates of function turn-over and nestedness; metrics that in macroecology are commonly used to measure shifts in species composition mostly along abiotic gradients, the so called beta-diversity (Legendre et al., 2005; Anderson et al., 2006; Jost, 2007). This information would allow us to answer questions such as whether a specific subset of functions is filtered and maintained in a specific environment; or how functions are changing across abiotic gradients.

TABLE 3

Table 3 Tools to calculate functional diversity features.

Predicting Microbial Distribution and Community Composition

Macroecologists describe spatial patterns of biodiversity aiming to ultimately create accurate models that can predict biodiversity under different scenarios. The patterns described are analyzed, and the underlying biotic and abiotic drivers of species distribution and abundance are tested in a statistical framework. Understanding the mechanisms behind these patterns allows macroecologists to predict biodiversity in geographic areas not yet studied, contributing to decrease biodiversity shortfalls (Hortal et al., 2015) as well as how biodiversity would respond to changes in the environment (Kerr et al., 2007). The BAM (as an abbreviation for ‘biotic, abiotic and movements') diagram is a conceptual framework used in macroecological modeling to summarize the determinants of species distribution on global scales (Figure 4; Soberón and Nakamura, 2009).

FIGURE 4

Figure 4 The BAM Diagram. (A) A scheme of a hypothetical BAM diagram (abbreviation for “biotic, abiotic, and movements”), highlighting the intersection between the different aspects determining the presence-absence of species. The b circle, colored in green, represents biological aspects allowing the presence of the species; the a circle, colored in blue, represents the abiotic aspects; finally, the m circle, colored in orange, represents the movement aspect, which consists in the dispersal capacity of the species. The intersection represents areas where more than one of those aspects allows the existence of the species. For instance, the green intersection represents an area where both biotic and abiotic conditions allow the species to exist, but the species is unlikely to disperse to that area. Similarly, the purple intersection represents an area where abiotic conditions allow the species to exist and is within the species' dispersal capacity; however, biotic conditions (for example, presence or absence of important species with which it interacts) do not allow their existence. All species occur only in areas represented by the dark green intersection, i.e. the intersection of all three factors. Mathematical models, however, can calibrate species niche based, solely on abiotic factors (which is the case of most SDM approaches), and, in these cases, the BAM diagram is a good conceptual framework to interpret the results. (B) A geographical projection of the BAM diagram for a hypothetical microorganism in South America. The grey areas across the continent represent sites to where the species can potentially disperse to (based on the idea that micro-organisms have high dispersal capacity, see Predicting Microbial Distribution and Community Composition in text). Assuming our hypothetical species prefer freshwater conditions, rivers in South America are colored in brown, to represent the intersection between factors a and m in the diagram. Finally, the green color of the Amazon river indicates an area where all factors allow the existence of the species (i.e., the species can disperse to the area, it is a freshwater environment, and it shows biotic conditions favorable to its establishment, e.g. the presence of specific species with which it cooperates).

In the BAM framework, the presence of a focal species in a specific site is determined by: (1) the presence, absence and/or abundance of other species in the same environment (i.e., biotic factors, the B in BAM); (2) the availability of the environmental attributes that are suitable for the focal species (i.e., abiotic factors, the A in BAM) and; (3) the focal species capacity to migrate into biotically and abiotically suitable areas (i.e., movement capacity, the M in BAM; Figure 4). This idea is described in a more formal manner in the Hutchinsonian concept of ecological niche, i.e., the n-dimensional hypervolume in which a species can exist (Colwell and Rangel, 2009; Holt, 2009; Figure 4). This conceptual framework is important for models that attempt to predict the occurrence of taxa, since it highlights which factors are expected to affect taxa presence in different locations. For macroorganisms, models are usually calibrated with the usage of abiotic factors at large spatial scales, specifically temperature and precipitation, which were shown to be good predictors of terrestrial species distribution range (e.g., Soberón, 2010). Such models usually show acceptable accuracy, but several studies highlight the importance of accounting for migration capacity and species interactions in distribution modeling (Araújo and Luoto, 2007; Wisz et al., 2013).

When it comes to micro-organisms, it is necessary to clearly understand which factors affect the distribution of microbial species. The BAM diagram offers an adequate conceptual framework to start addressing this question. Several authors have suggested that the dispersal capacity of micro-organisms is much higher than that of macroorganisms (Finlay and Clarke, 1999; Martiny et al., 2006; Barberán et al., 2014). In this aspect, the movement feature of the BAM diagram would have little effect on the distribution of species, since several studies indicate that micro-organisms are highly dispersive (Bovallius et al., 1980; Fenchel and Finlay, 2004; Martiny et al., 2006; Barberán et al., 2014; but see, e.g., Peay et al., 2010), and that spatial structuring of microbial communities are only perceivable on large spatial scales. This leaves us with the biotic and abiotic factors as major drivers of micro-organisms' distribution. As previously discussed in Conceptual Challenges for Transitioning Across Spatial and Temporal Scales, a few studies have highlighted the importance of different abiotic factors in structuring microbial community, which are not always related to the environmental predictors used in distribution modeling of macroorganisms. Such variables include, besides temperature and precipitation, edaphic conditions, soil pH and concentrations of different chemical molecules (Lauber et al., 2009; Drenovsky et al., 2010; Zhou et al., 2016). Additionally, the biotic interactions among species have been advocated as important determinants of species occurrence (Larsen et al., 2012; Ramirez et al., 2015; Ramirez et al., 2018). Therefore, in the following sections we describe how to access available spatial-explicit environmental data for micro-organisms modeling, as well as modeling approaches that can account for both biotic and abiotic factors.

Using Abiotic Variables to Model Microbial Communities

Each sample taken from the environment is under the influence of a huge number of variables in many spatial and temporal scales. In order to model the composition of microbiomes, and therefore the distribution of micro-organisms across the globe, it is important to have available environmental data on the relevant spatial and temporal scales. The variables used to model micro-organisms will depend on the specific environment under study. Micro-organisms living in the soil are affected by different environmental factors than those living in a freshwater lake or in the ocean. This is different than what is seen for macroorganisms, where global temperature and precipitation play major roles defining biogeographic realms (McGill, 2010). While acknowledging that global variation in temperature and precipitation might define biogeographic areas for micro-organisms (Martiny et al., 2006), we argue that this definition will differ when comparing between micro-organisms living in different environment types (e.g., soil vs freshwater micro-organisms).

Physical properties are usually important in several environments, such as temperature, precipitation, moisture and solar radiation. These variables can be measured or modeled via remote sensing platforms and remote sensing-based modeling tools. Due to the advent of environmental monitoring satellites and the creation of on-line data processing and distribution platforms, there is a wealth of environmental data with global coverage available to the general public, ranging from raw satellite images to validated measurements of parameters, such as land surface temperature, precipitation rates, the concentration of gases such as CO₂ in the troposphere and photosynthetic activity (Table 4). These databases contain climatic spatially explicit information such as land surface temperature, net primary productivity, vegetation and leaf area indexes, evapotranspiration, detailed landcover map and precipitation rate. Additionally, since other aspects of soil and atmosphere might also be necessary to fully characterize the abiotic environment of micro-organisms. Information pertaining to soil physical (e.g., clay content) and chemical (e.g., pH) conditions, as well as soil classification across the globe can be retrieved from these databases. Similarly, when investigating the atmosphere microbiome, the atmospheric chemical composition may play a large role on community composition by changing the chemical properties such as pH and playing an important role on ecological processes, such as nitrification (Keller et al., 2006; Hutchins et al., 2009; Hatzenpichler, 2012). An example of atmospheric chemical composition data available, such as the products based on the Atmospheric Infrared Sounder (AIRS), is a hyperspectral instrument on board of Aqua satellite (Table 4). By decomposing the infrared radiation in 2,378 bands, AIRS can provide daily measurements of trace components abundances in the atmosphere, including ozone, carbon monoxide, carbon dioxide, methane, and sulfur dioxide in different strata of the atmosphere, among other parameters (Morgan et al., 2004; Maddy et al., 2008; Xiong et al., 2008; Engelen et al., 2009; Lin et al., 2013).

TABLE 4

Table 4 Databases for spatially explicit abiotic ecological data for use in community modeling.

Furthermore, the data gathered from satellites and ground observations, are used in the parameterization of climatic models, which allows the calculation of additional climatic variables. The Global Land Data Assimilation System (GLDAS) is a good example of this kind of climatic modeling (Rodell et al., 2004; Rodell et al., 2009). It models land surface states and fluxes, using advanced land surface modeling techniques based on optimal fields (Rodell et al., 2004). Currently GLDAS includes datasets from four land surface models implemented in NASA's software LIS (Land Information System), namely Mosaic, Noah, the Community Land Model (CLM), and the Variable Infiltration Capacity (VIC), resulting in massive archive maps of up to 40 climatic parameters, water and energy flux, as well as underground temperature and moisture, with maximum depth of 1.1 m and with temporal coverage ranging from 1979-01-01 to nowadays (Kumar et al., 2006; Peters-Lidard et al., 2007). Another good example of a climatic model available is the Worldclim, one of the most used climatic datasets in ecological modeling. It comprises a set of 19 climatic variables relevant to many ecological processes, with a global coverage of 1000 m spatial resolution (Fick and Hijmans, 2017). This set of variables is a result of the averaging of climatic parameters from 1970 to 2000, modeled through the usage of general circulation models (GCM), which are suitable to model worldwide geographic variation in ecological processes that respond to spatial patterns of climatic heterogeneity. The calculation methods to produce this set of variables were implemented in R and are available through the function biovars, from the Package ‘dismo', version 1.1-4 (Hijmans et al., 2017). In addition, Worldclim also provides future projections for the same set of 19 climatic variables for two periods, 2050 (average for 2041–2060) and 2070 (average for 2061–2080), based on the set of models used in the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC) for the four scenarios of greenhouse gases concentration (Stocker et al., 2014). These future projections provided by Worldclim have the advantage of being bias corrected, using the current climate Worldclim data as base line, making the three sets of variables compatible. In addition, the AIRS, TRMM, GPM, and GLDAS products are available in NASA’s Goddard Earth Sciences Data and Information Services Center (GES DISC), which is part of the Earthdata platform, specialized in processing and distribution of climatic data.

Given the huge amount of climatic and environmental data available to the global landscape, microbial ecologists are now using those same analytical tools used in traditional macroecological studies. This allows them to select the most important drivers in predicting microbial diversity distribution patterns and to predict the structure of microbial communities across the globe, thereby accessing cause and effect associations. In these efforts, machine learning approaches, especially classification or regression Random Forest analysis and structural equation modeling (SEM) should be highlighted (Breiman, 2001; Grace, 2006). Specifically, Random Forest analysis constitutes specific algorithms of statistical methods of classification and regression trees (CARTs) that use binary division or regression, respectively, to form a set of trees where the importance of each predictor is inferred by decreased prediction accuracy through the random permutation of the values of these predictors (Liaw and Wiener, 2002; Wei et al., 2010). SEM routines are then used in microbial ecology studies coupled with Random Forest in order to reveal the relation between those ‘a priori' selected abiotic drivers and the target-variable in question, such as the Shannon Index, used as a proxy for microbial diversity (Delgado-Baquerizo et al., 2016). Therefore, SEM is a valuable alternative when the objective is to detail the specific relationships between multiple predictors and the modeled variable, separating them as individual pathways in the network of relationships that characterizes natural systems (Delgado-Baquerizo et al., 2017).

Incorporating Biotic Interactions in Modeling Microbial Communities

Another important issue in macroecological modeling is the inclusion of biotic interactions as predictor variables. There is an increasing evidence that species interactions improve the explanatory and predictive power of species distribution models, based on environmental variables for macroorganisms (Araújo and Luoto, 2007). Usually the inclusion of biotic interactions in species distribution models is based on previous biological knowledge of the studied species and uses a limited number of species/taxa per model, while considering their geographical distribution (Araújo and Luoto, 2007; Wisz et al., 2013; de Araújo et al., 2014). These models are usually based on species distribution models and use a maximal entropy approach—e.g., Maxent for modeling (Phillips and Dudík, 2008). However, there are also integrative modeling approaches that incorporate co-occurrence patterns into species distribution models (Pollock et al., 2014). Other modeling techniques use machine learning approaches, such as neural networks, which do not make assumptions related to species occurrence probabilities and linear relationships among environmental and biological variables, and so provide more realistic assemblage models (Harris, 2015).

Studies with micro-organisms have also suggested that including biotic interactions is necessary to build suitable predictive models (Larsen et al., 2012). However, despite their importance, these interactions can be elusive to detect, and unraveling the interactions network in microbial communities is an ongoing challenge (Faust and Raes, 2012). Biotic interactions can be inferred to some extent from co-occurrence networks (Describing Community Structure With Co-ccurrence Networks), but the increase of computational capacity and the development of accurate machine learning and network modeling methods has made possible to explore new approaches to statistically assess biotic interactions from large abundance datasets, such as Bayesian networks (BNs) and Genetic Algorithms (GA). The BNs are graphical models consisting of a set of variables (represented as nodes in the network) and directed arcs that describe the sets of conditional dependencies between these variables, as well as the joint probability distribution among then (Pearl, 2014; see also Figure in Box 2). The variables set in BNs may be both abiotic factors as well as biotic interactions, and the model can be calibrated with the same input abundance matrices generated by taxonomic annotation methods (Taxonomic Profiling and Exploratory Analyses in Microbial Macroecology). Additional columns representing abiotic aspects of each sampling site can be added to the abundance matrix to represent the abiotic environment experienced by a specific microorganism. This approach allows the creation of species distribution models by taking into account both biotic and abiotic aspects simultaneously in a model across large geographical scales (Staniczenko et al., 2017). These models can be further used to predict the change in the abundance of an organism when any other node (either an abiotic aspect or another species abundance) changes in the environment. A few microbial studies have already used a BN approach to study, e.g., the bacterial diversity in gut microbiota for patients with psoriatic arthritis (Scher et al., 2014) and the gut microbiota in HIV positive patients (Vázquez-Castellanos et al., 2015). Similarly, in macroecology, a few studies have used the BN approach, e.g., for range prediction of California grassland community (Staniczenko et al., 2017) and assessment of threat status of pacific walrus population in Russian and Alaskan waters at four different time periods (scenarios) throughout the twenty-first century (Jay et al., 2011).

Box 2. Bayesian Networks: Advantages and Drawbacks.
Bayesian networks show several advantages that support their recent application in complex fields, such as: 1) network modularity, being able to integrate multiple ecosystem components (Chen and Pollino, 2012; Nojavan et al., 2014; Nojavan et al., 2017; Uusitalo, 2007), such as in management decisions field, where it is possible to integrate several sub-models as social, ecological and economic aspects (Chen and Pollino, 2012); 2) the capability of dealing with complex and nonlinear systems (Uusitalo, 2007; Aguilera et al., 2011; Phan et al., 2016; Beuzen et al., 2018); 3) possibility of incorporating expert knowledge (Uusitalo, 2007; Aguilera et al., 2011; Alameddine et al., 2011; Death et al., 2015; Phan et al., 2016), through blacklists (i.e., unrealistic relationships that are not allowed in the model) and whitelist (i.e., relationships already known in the literature); 4) being able to use a small number of samples (Uusitalo, 2007; Phan et al., 2016) 5) simplicity and little difficulty in interpreting outputs, even for non-modelers (Aguilera et al., 2011; Death et al., 2015); 6) being a rather “open” approach, different from other methods, which can be considered complicated “black-box” approaches (Chen and Pollino, 2012); 7) being able to handle high dimensional systems with the proper number of samples (Aguilera et al., 2011); 8) dealing with missing data through conditional probabilities or Bayes theorem (Uusitalo, 2007; Aguilera et al., 2011; Death et al., 2015), and finally 9) presenting less computational cost to analyze and compare different scenarios, such as climatic changes, by setting variables states in the model (Chen and Pollino, 2012; Death et al., 2015).

The main weakness of the BN approach is the lack of feedback possibilities in the model, due to it being directed acyclic graph (DAG; Phan et al., 2016). This can be bypassed by integrating models. The most critical drawback pointed in most studies is the discretization of continuous variables (Uusitalo, 2007; Aguilera et al., 2011; Nojavan A. et al., 2014; Death et al., 2015; Phan et al., 2016). The principal argument is that it causes an inevitable loss of information from data, linear relationships and consequently model performance (Uusitalo, 2007; Nojavan A. et al., 2017; Beuzen et al., 2018). However, using discrete values allows for better modeling of non-linear relationships between variables, as well as complex distributions such as bi- or multimodal distributions and can introduce greater robustness against error (Hartemink, 2001). As alternatives, there are models that could handle continuous data and not have mathematical restrictions, such as Mixture of Truncated Exponentials (MTE) models and the BN created for continuous variables (Qian and Miltner, 2015). However, it is hard to find simple examples and they are not easily found in any commercial software, which makes implementation difficult for non-modelers.

FIGURE IN BOX 2

Figure In Box 2 A graphical example of a hypothetical Bayesian Network (BN), showing both biological taxa (green circles) and predictor abiotic variables (blue circles). NDVI = Normalized difference vegetation index.

Similarly, the use of predictive models based on the genetic algorithm (GA) method holds great potential to infer microbial interactions but has not been explored by microbiologists so far, to the best of our knowledge. The GA is an approach to solve problems inspired by the process of natural selection. Genetic programming (GP) is a particular type of GA that can be used to generate computational artifacts, such as computer programs, mathematical models, and logical models, that help to explain an observed data (Koza, 1992). The GP approach usually starts from a population of programs (algorithms) that show random levels of success in solving a task (in this case, describing the significant biotic interactions observed in a microbiome dataset). The fittest programs, that is, those best describing the data, are selected for reproduction and may undergo some “mutation” according to predefined parameters. This process is repeated over several generations in an analogy to natural selection, and the final generations are expected to show a population of much fitter programs than the initial ones. This procedure is essentially a heuristic search technique that looks for an optimal or at least suitable program among the space of all programs available. Since the construction of the models is totally guided by data, without the need of a priori hypotheses, the greatest potential of this technique is to generate hypotheses about the relationship between micro-organisms, as well as between micro-organisms and environment, that can be assessed by other approaches (such as BNs, dynamical modeling or common correlative statistics, described above). Applications of GP include designing electrical circuits (Koza et al., 2000), reverse engineering biochemical reactions (Sugimoto et al., 2005) and describing epidemiological relationships (Veiga et al., 2018).

Another promising approach to resolve microbial interactions is the use of dynamical models (Widder et al., 2016), which can bridge the gap between fundamental ecological knowledge and empirical interactions between taxa, by relying on explicit and mechanistically sound hypotheses. For such purpose, several modelling approaches are available (reviewed by Song et al., 2014 and by Succurro and Ebenhöh, 2018), each presenting its own set of assumptions concerning biotic and abiotic components of community. The most widespread approach is assuming direct biotic interactions among taxa and representing these interactions by using the generalized Lotka-Volterra model (gLV). This is a particular case of the population dynamic model, which can then serve to investigate concepts related to community dynamics such as co-occurrence networks and keystone taxa (Berry and Widder, 2014; see Box 1). Some authors also advocate the use of metabolic-explicit dynamical models that integrate aspects of community and environmental variables, such as stoichiometry-based models and flux balance analysis (FBA; Song et al., 2014). While these approaches avoid black-box modeling and provide valuable insights into community functioning across environments, they present parameterization challenges, in gLV for instance, the number of parameters increases with the square of the number of interacting species, hindering model analysis. Future developments integrating dynamical modeling and statistical parameterization techniques are thus poised to improve the suitability of dynamical modeling approaches to exploration of microbial community interactions; meanwhile, dynamical modeling is readily available to investigate important subsystems with fewer interacting organisms.

Species Distribution Modeling for Community Prediction

The steps described in Using Abiotic Variables to Model Microbial Communities and Incorporating Biotic Interactions in Modeling Microbial Communities allow us to highlight important abiotic environmental factors as well as biotic interactions necessary to model our focal microbial communities. Although few of the techniques presented, such as BNs, can model community composition on their own, another approach largely used in macroecology for this purpose is the set of modeling tools known as species distribution modeling (SDM). The use of SDM has been regarded as a well-established approach that can be used to overcome the lack of species spatial data, and holds great advantages for micro-organisms, a group in which the Wallacean deficit (i.e., the lack of information about species distribution) tends to be high. The SDM techniques are generally based on the concept of species ecological niches, which is the set of biotic and abiotic conditions that allows a species to persist indefinitely in a location (Soberón, 2007). Evidence so far suggests that biotic interactions should have a larger importance at smaller scales (but see Gotelli et al., 2010 and Araújo and Rozenfeld, 2013), while abiotic conditions, such as climate, should have a larger influence at larger spatial scales (McGill, 2010). Based on this, macroecologists have used the set of climatic conditions where a macroorganisms lives to estimate its potential geographic distribution. Whereas this is largely efficient for macroorganisms, more empirical evidence is necessary to evaluate these premises for micro-organisms.

Two sets of approaches can be used for SDMs: the mechanistic and correlative species distribution modeling (Figure 5). Mechanistic SDMs use information obtained from ex-situ experiments that indicate the environmental conditions that a species can tolerate (e.g., maximum and minimum temperature). This information on physiological tolerances can then be used to map areas that are environmentally suitable for the species, which can be transformed into presence/absence information (Kearney and Porter, 2009; Figure 5). The lack of experimental information indicating species tolerance have limited the use of mechanistic approaches; however, in areas where experimental data is abundant, such as agricultural science, mechanistic models have been used to predict potential areas for determined crop varieties (e.g., Nabout et al., 2012). This approach can be potentially useful for microbial macroecology, since these organisms can be easily manipulated ex-situ, because of their small, short life span and large population sizes (Jessup et al., 2004).

FIGURE 5

Figure 5 A workflow on techniques for species distribution modelling. Ecological niches can be modeled both by using mechanistic models (upper left figure, representing temperature laboratory manipulative experiments on plants) or by using correlative models (lower left figure, representing the use of spatial-explicit environmental data combined with the knowledge about occurrence points for the species). The ecological niche is then calibrated on an n-hyperdimensional volume defined by all predictor variables used in the study (only three dimensions are shown in the cube to the center). Green points indicate known occurrence for the species projected into the environmental space; dashed green lines represent the ecological niche inferred from those points. The inferred ecological niche can then be projected into geographical space, which consists on the geographical areas having environmental conditions within those inferred to be the species' niche (are highlighted as suitable areas for the species in the map). Since the niche is statistically calibrated, i.e., as a statistical relation between predictor environmental variables and presence-absence response variables, the final map shows a gradient of environmental suitability for the species across the space.

The correlative approach, on the other hand, uses statistical associations between acknowledged species occurrences and environmental conditions to estimate the Grinellian Niche (Figure 5). The type of statistical model used for this approach is then chosen upon the type of occurrence data available: continuous (abundance data), binary (i.e., presence/absence data) or presence-only data (usually the latter, since abundance information is not always available and real absence data is challenging to confirm). Presence-only models of species distributions are largely used for macroecological studies, with several algorithms available, from simple ones, such as the BIOCLIM, up to more complex models based on machine learning techniques - e.g, Random Forest and MAXENT (Elith and Leathwick, 2009). While some authors claim that some algorithms have a better performance than others, the current view is that the choice of the algorithm also depends on the context in which SDMs are applied (see Peterson et al., 2010). Despite the known importance of abiotic conditions to determine large-scale species distributions, one must consider also current and historical movement limitations, such as geographical barriers, dispersal capacity and biogeographical history (Barve et al., 2011). However, it is still necessary to identify whether and how movement limitations are important to model microbial distributions, because of their overall high dispersal capacity.

Several computational tools can be used to apply SDMs, many of them freely available, open source, and collaborative (e.g., Naimi and Araújo, 2016; Kass et al., 2018). Microbiology can benefit from these methods in many research lines, since SDMs have been used not only to predict individual species distribution, but also species richness and composition (e.g., Guisan and Rahbek, 2011), species potential invasive areas (e.g., Smolik et al., 2010), as well as to understand niche evolution and speciation patterns (e.g., Silva et al., 2014; Silva et al., 2016a), and past species dynamics (e.g., Nogués-Bravo, 2009); and to model geographical range responses to climate change (e.g., Pecl et al., 2017). Specifically, SDMs present an important method to understand how species geographic range may respond to climate change. However, because of high microbial adaptation capacity, it may be a methodological challenge for microbiologists to incorporate evolution when trying to model species distribution into other time periods (Ofori et al., 2017).

Conclusion

The vast amount of microbial community data available represents an exciting prospect for advancing the field of microbial macroecology. In this review, we outlined the main questions in macroecology, community ecology and addressed how microbial ecologists can address them with bioinformatics, statistical and modeling tools. We covered fundamental aspects of biodiversity, reviewed classical approaches used in microbial ecology in a macroecological context, and highlighted the existing caveats and solutions to implement ecological modeling of microbial communities, which is a crucial research area for both the theoretical and practical aspects of macroecology. These approaches can serve as a general framework for microbial macroecology, addressing the two-part focus of macroecology: describing community patterns (and their drivers) at large scales and predicting community composition across the globe (Figure 6). The framework we present here consists of 1) gathering biological data to generate an abundance matrix, and environmental data to generate an environmental matrix; 2) exploring the associations between biological and environmental data at macroecological scales, using exploratory and network approaches; 3) incorporating insights from the previous step into modeling tools for community prediction.

FIGURE 6

Figure 6 A methodological framework to investigate the macroecology of micro-organisms. The framework shows methods related to (A) gathering taxonomic data on environmental samples, (B) exploring the data with exploratory analyses as well as statistical tests (e.g., correlation and regression analyses), and (C) using the data to create predictive models about the presence/absence of species across different environments. Solid red arrows indicates input and output data that is used as input for analyses, and blue arrows indicate the output of these analyses. Dashed red arrows indicate data that can yield indirect insights for an analysis (although they are not commonly used as direct data input for the method). Grey boxes indicate external information sources and green boxes indicate the methodological approaches reviewed in this manuscript. Dark green boxes within green boxes indicate the specific techniques used in each approach. White boxes indicate the final outputs for the macroecological approach, i.e., models explaining how environment and biotic interactions affect species presence-absence and ultimately community composition. (A) Data from metagenomic databases can be annotated taxonomically to yield presence-absence or abundance matrixes for several ecosystems. (B) Spatial-explicit environmental data can be incorporated into exploratory analyses (such as PCA and MDS) as well as correlation analyses (such as regression and Mantel test) to investigate micro-organisms diversity patterns on global scales. Functional diversity can also be investigated on macroecological scales (both directly inferred from sequence reads or from the taxonomic annotation of samples). Co-occurrence networks are commonly used in microbiology studies and can yield interesting insights when different groups of samples are compared across an environmental gradient. The understanding of functional diversity and functional redundancy can be coupled with co-occurrence networks to infer the existence of keystone taxa, as well as the extent of direct and indirect effects throughout a network, and then describe the community structure and ecosystem functioning. Such structure can then be compared across macroecological scales (e.g., analyzing how the importance of specific taxa as keystone taxa varies across different environments). (C) Spatial-explicit environmental data can also be incorporated into models to understand community structure (such as Bayesian network modeling and genetic programming) as well as models to calibrate ecological niche (such as mechanistic and correlative niche models). These models can incorporate insights from analyses shown in (B). Similarly, insights on biotic interactions, derived from community structure models, can be incorporated into ecological niche models (which commonly only use abiotic environmental variables as predictors). The final predictive models will allow microbiologists to understand interaction rules structuring microbial communities, predict the present of important taxa in different environments and infer microbial community composition across the globe.

The main difficulties for this research avenue are the theoretical implications derived from the biology of micro-organisms, such as higher dispersal capacity, higher evolutionary rate and the putative environmental drivers of community composition. New studies are necessary to address which environmental factors are relevant for modeling microbial distribution and to define whether the high dispersal capacity of micro-organisms makes this aspect uninformative for biogeographic patterns (i.e. the classic statement of “Everything is everywhere”). Also to evaluate whether the adaptive potential of micro-organisms is indeed high enough to violate the usual assumption of niche conservatism applied to ecological modeling. The insights from these future studies will have great impact on microbial ecological model interpretation. We predict that the development of modeling methods and approaches used in microbial macroecology, an exciting and flourishing field, will significantly contribute to the unification of microbial ecology and macroecology.

Author Contributions

RM designed all figures. AT-S built Table 1. AC built Tables 2 and 3. EM built Table 3. All authors listed co-authored and proof-read the manuscript. All authors approved the final manuscript.

Funding

This work was primarily supported by the Serrapilheira Institute (grant number Serra-1709-17818). Further support was provided by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001. AT-S thanks CAPES (88887.301758/2018-00). PMM thanks to PROPESQ-UFBA (11268). FC was supported by APOSTD/2018/186 Post-doctoral fellowship from Generalitat Valenciana. BD was supported by the Netherlands Organization for Scientific Research (NWO) Vidi grant 864.14.004. PG was supported by FAPESP (2017/08406-7) and CNPq. AC was supported by CNPq (132261/2018-9). RM was supported by CAPES (88882.460937/2019-01). AA was supported by São Paulo Research Foundation (FAPESP) grant number 2016/14277-2, and CAPES. RV was supported by CNPq, grant number 151157/2018-9.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Gustavo Athayde for improving figures design and Noriko Cassman for language and grammar editing of the manuscript text.

References

Aguilera, P. A., Fernández, A., Fernández, R., Rumí, R., Salmerón, A. (2011). Bayesian networks in environmental modelling. Environ. Model. Sofftw. 26, 1376–1388. doi: 10.1016/j.envsoft.2011.06.004