- 1Department of Plant Life Science, Faculty of Agriculture, Ryukoku University, Otsu, Japan
- 2Institute of Vegetable and Floriculture Science, National Agriculture and Food Research Organization, Tsukuba, Japan
- 3Institute of Crop Science, National Agriculture and Food Research Organization, Tsukuba, Japan
- 4Institute for Agro-Environmental Sciences, National Agriculture and Food Research Organization, Tsukuba, Japan
- 5Department of Agronomy, Iowa State University, Ames, IA, United States
- 6Graduate School of Agricultural and Life Science, The University of Tokyo, Nishitokyo, Japan
It has not been fully understood in real fields what environment stimuli cause the genotype-by-environment (G
Introduction
Genotype-by-environment (G
Here, we propose a novel method termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to reveal quantitative environmental stimuli (referred to as environmental covariates) and genetic architecture underpinning the G
Materials and Methods
Model Description
As described above, ECGC searches environmental covariates whose similarity matrices between environments are correlated with genetic correlation matrices between environments. Here it will be illustrated how the similarity of environmental covariates is associated with genetic correlation between environments.
For the jth environment, the phenotype adjusted for variations due to years and management conditions,
where
where
by letting
and the genetic correlation between these environments,
Equation 1 illustrates how the similarity of environmental covariates (
Data Analysis Overview
To apply ECGC to real data, five steps are required.
(1). Calculation of environmental covariates (
(2). Calculation of similarities of environmental covariates between environments (
(3). Estimation of genetic correlation between environments (
(4). Scanning environmental covariates associated with genetic correlation based on
(5). Estimation of slopes (
These steps correspond with Figures 1B–F. In the following sections whose titles start as “ECGC step”, we explain how these steps were conducted in our analyses using a large-scale multi-environment data of soybean. Preceding to these steps, we also conducted model development and fitting for each environment-trait combination to calculate the phenotypes adjusted with fixed effects (
FIGURE 1. Illustration of the framework of ECGC. (A) Forty-one fields and 52 environments were analysed in this study. The blue points indicate fields where multiple management conditions were regarded as different environments. E and L denote early and late sowing, and D and S denote dense and sparse plant densities, respectively. (B) Calculation of environmental covariates representing each environment. Precipitation is illustrated as a meteorological factor. The whole growth periods of plants were divided into 30 stages (10 from sowing to flowering and 20 from flowering to maturity), and the meteorological values within each stage were averaged. These averaged values were again averaged across all plants at the environment, thus creating 30 environmental covariates (red and blue boxes in the upper-right triangle). The environmental covariates were then averaged across all spans within the 30 stages (1st to 2nd stages, 1st to 3rd stages, etc.), thus generating 435 additional covariates (pale-blue boxes in the triangle). In total, 465 (30 + 435) environmental covariates for each trait-meteorological factor combination were generated. Because 14 meteorological factors were considered, 465 × 14 = 6,510 environmental covariates were generated for each trait. (C) Calculation of the similarity matrix of environmental covariates between environments. The figure illustrates the similarity matrix of precipitation at the 1st to 3rd growth stage, as an example. For each environmental covariate, values were extracted from the 52 environments and a 52 × 52 similarity matrix was calculated using the linear kernel. (D) Estimation of genetic correlations between environments. Using mixed models and genome-wide SNPs, genetic correlations were estimated for the 52 environments in a pairwise manner (E) Calculation of the Pearson correlation coefficient (r) of off-diagonal elements between the similarity and genetic correlation matrices. r was calculated for all similarity matrices, and −log10 p values are presented as heat maps. (F) Genome-wide association mapping for uncovering genetic architecture. For the environmental covariates significantly detected in (E), slopes were estimated for each variety by regressing the additive genetic effects estimated using mixed models on the environmental covariates. Genome-wide association mapping was conducted on the slopes.
Multi-Environment Data of Soybean
The data set used in this study consisted of 25,158 records of 624 varieties evaluated at 41 fields (Supplementary Data S1). This data set was extracted as described below from a historical data of soybean including 72,829 records of 6,106 varieties evaluated at 440 fields over 55 years in Japan (from 1961 to 2015). The overview of this historical data is provided in Supplementary Methods S1. In nine fields, multiple management conditions (e.g., early or late sowing dates) were consistently conducted in multiple years. For these fields, different management conditions were defined as different environments as described later. As a result, a total of 52 environments were defined from 41 fields (Supplementary Table S2, Figure 1A). Thus, in this study, “environments” denote the combinations of fields (locations) and management conditions.
DNA Extraction and SNP Genotyping
Seeds of varieties included in the historical data were collected from the breeding centres and NARO genebank (as many seeds as possible). The seeds were sown in pots and grown until the first trifoliate leaves emerged in greenhouses. Total genomic DNA was extracted from the first trifoliate leaves of one plant using a method based on guanidine hydrochloride and proteinase K (Khosla et al., 1999), with modifications. Among the ∼2,000 varieties with extracted DNA, 573 varieties were genotyped using the Axiom SNP custom array (ThermoFisher, MA, United States), which was designed for Japanese soybean varieties. These 573 varieties consisted mainly of breeding lines that had been evaluated until later generations (typically F8 or later) as promising lines. In addition, from the ∼2,000 varieties, 149 varieties that were registered as major varieties by the Ministry of Agriculture, Forestry and Fisheries and 187 varieties included in the soybean core collection (Kaga et al., 2012) were also genotyped using the array. The 573 varieties were selected to avoid overlap with the 336 (149 + 187) varieties. Among the SNPs genotyped, SNPs that showed high genotyping quality (i.e., SNPs termed “PolyHighResolution” in the Axiom genotyping system) and could be mapped to the Williams 82 genome assembly version 2.0 (Wm82.a2.v1/Glyma 2.0) were extracted (138,555 SNPs) (Supplementary Table S1). The average call rate of SNPs was 0.998 (SD ± 0.003). Missing genotypes were imputed using Beagle 4.1 (08Jun17.d8b) (Browning and Browning, 2016).
Extraction of Phenotypic Records
The varieties that had SNP genotypes from the Axiom custom array were subjected to subsequent analyses. Because the names of varieties usually change with the advancement of generations, the variety names were integrated using the names at the last generations. Among the 440 fields included in the historical data, 41 fields were selected for analyses because of the number of records and balanced geological locations across Japan (Figure 1A). For each field, phenotypic values that were out of the mean ± 3SD (standard deviation) range were treated as missing values.
The following six traits were selected for analyses: DTF (days), DTM (days), SL (cm), PR (%), YI (kg/a), and SW (g/100 seeds). These traits were selected because of their importance for breeding and the number of phenotyped records. In Japanese soybean breeding, to unify the measuring methods of traits, regulations on trait measurement were established in 1954 by a committee where the Ministry of Agriculture and Forestry at the time took the leading role. Although these regulations seem to have undergone several minor updates to date, trait definitions were largely common across the study period. Briefly, DTF was defined as the date when 40–50% of the buds of the strains reached flowering. DTM was the date when 80–90% of the pods of the strains showed variety-unique colours at maturity. SL was the length of the main culm between the ground surface and the growth point. PR was mainly measured using near-infrared spectroscopy, but there seemed to be some variations in the measuring methods and used machines. YI and SW were defined as the weight with 15% moisture, although there also seemed to be some variations in the moisture percentage among stations and years.
Records that had at least one observation for these six traits were extracted for subsequent analyses. As a result, the extracted records consisted of 25,158 records of 624 varieties evaluated at 41 fields (Supplementary Data S1). The summaries of the extracted phenotypic data and used varieties are presented in Supplementary Tables S2, S3, respectively.
Meteorological Factors
Meteorological information was taken from MeteoCropDB ver. 1.0 (Kuwagata et al., 2011). The database was designed to help the utilisation of crop growth models and provides daily values at each weather station of the Japan Meteorological Agency (JMA). Meteorological information for each environment was taken from the nearest JMA station. In this study, 13 meteorological factors, i.e., mean temperature (°C), maximum temperature (°C), minimum temperature (°C), precipitation (mm), vapour pressure (hPa), vapour pressure deficit (hPa), relative humidity (%), minimum relative humidity (%), wind speed (m/s), maximum wind speed (m/s), hours of sunshine (h), solar radiation (W/m2) and potential evapotranspiration (mm), were used. Among these factors, wind speed and maximum wind speed were adjusted as observed at 2.5 m of altitude by the developer of the database from the original values of the JMA. Solar radiation was estimated by the database developer from the observations of the JMA regarding hours of sunshine. Potential evapotranspiration was also estimated by the developer based on mean temperature, vapour pressure, wind speed, solar radiation and air pressure. The values of the other factors were according to the observations of the JMA. In addition to these factors, photoperiod (h) was calculated for each environment. For simplicity, daily length was also considered as a meteorological factor. The values of these meteorological factors are included in Supplementary Data S1, together with the phenotype records.
ECGC Step 0: Developing Mixed Models at Each Environment
Single-trait mixed models were fitted to the records at each environment. This process had two purposes; the first was to eliminate variations attributable to years and management conditions from phenotypic values, that is, to create
The year effects were added as fixed effects in the mixed models. The conditions of three management methods (plant density, fertiliser and sowing date) were modelled in various ways. In nine fields (F04, F07, F08, F09, F22, F24, F25, F27, and F34), multiple management conditions were consistently conducted in multiple years. For these fields, different conditions were defined as different environments resulting in a total of 52 environments (Supplementary Table S2, Figure 1A).
In most fields, however, the management conditions often varied across years. When the effects of management conditions were indistinguishable from those of year effects, the effects were absorbed to the year effects and not modelled. Otherwise, the effects were added in the mixed models. The modelling schemes varied according to the environments and management methods. Although the conditions of plant density, fertilizer, and sowing date are intrinsically continuous variables, when the number of conditions at an environment was few, these conditions were regarded as categorical variables and the effects were modelled as fixed effects. Otherwise, the conditions were regarded as continuous and the effects were model using basis functions (e.g., B-splines, Hastie et al., 2009). In each case, the interactions between genotypes and management were also considered. Variance components were estimated using REML implemented in airemlf90 (Misztal et al., 2002), and model selection was conducted using the AICs provided by the programme. The covariance structure of additive genetic effects among varieties (i.e., genomic relationship matrix) was defined using the genome-wide SNPs (VanRaden, 2008). The A.mat function provided by the R package, rrBLUP (ver. 4.6.1) (Endelman, 2011; R Core Team, 2020), was used for calculation (Supplementary Data S2). Thus, the mixed models applied here were the so-called genomic or genome-enabled BLUP (GBLUP) (de los Campos et al., 2013). The details of this procedure are described for each environment in the Supplementary Methods S2. Narrow-sense SNP heritability estimated by the selected models are presented in Supplementary Table S4.
ECGC Step 1: Calculation of Environmental Covariates
Environmental covariates were calculated for each environment, i.e., combination of fields and management conditions. Because trials had been conducted for multiple years at each field (Supplementary Table S1), in principle, environmental covariates were calculated by averaging meteorological values across years. The details are explained below.
Even on the same calendar day, the growth stages of plants can vary depending on the sowing date and growth speed, which depend on the environmental conditions (e.g., temperature) and genotypes. Thus, a meteorological event (e.g., high or low temperature) on a calendar day can have different effects on plants with different growth stages. To consider the difference in growth stage, the growth period between the sowing dates and days of flowering was divided into 10 equal-sized stages (on average 5.5 ± 1.2 days per stage), and the period between the days of flowering and days of maturity was divided into 20 equal-sized stages (3.8 ± 0.5 days per stage). For each variety at each environment and year, the daily meteorological values within stages were averaged (Figure 1B).
Meteorological values representing each environment were then obtained by averaging the above-mentioned meteorological values of all plants included in the environment, yielding 30 values for the 30 growth stages for each environment (Figure 1B, Supplementary Tables S5–S10). In addition, meteorological values were averaged across stages, e.g., averages across the 1st to 2nd stages, across the 1st to 3rd stages, etc. This procedure was inspired by the joint genomic regression analysis (Li et al., 2018). As a result, 435 (30 × 29/2) additional representative values were generated, resulting in 465 values for each environment (Figure 1B). That is, 465 × 14 (meteorological factors) = 6,510 environmental covariates were generated.
Three issues were notable in this procedure. First, the number of stages will depend on the prior knowledge or assumptions on growth stages. For soybean, typically five and eight growth stages are assumed for the vegetative and reproductive phases (Fehr and Caviness, 1977). The number of stages used here (30) were determined to be able to discriminate these known growth stages. If a crop experiences more growth stages, setting a larger number will facilitate interpretation. Second, although the growth stages of plants can be represented using cumulative temperature and/or photoperiod, we avoided using these indices, to simplify the procedures as much as possible. Lastly, although equal-sized stages were used here, unequal-sized stages also can be used. In particular, because environments during anthesis are important for plant growth, finer stages during anthesis may bring more precise information on G × E interactions.
ECGC Step 2: Calculation of Similarities of Environmental Covariates Between Environments
For each environmental covariate (i.e., combination of growth stage and meteorological factor), 52 representative values corresponding to 52 environments were used to calculate the similarity between environments (Figure 1C). The similarity was defined using a linear kernel. That is, the similarity between the jth and kth environments were calculated as xjxk where xj and xk indicate the deviations (i.e., differences from the mean of the environmental covariate) of the environmental covariate values at environments j and k, respectively. As a result, a 52 × 52 similarity matrix was obtained for each environmental covariate. Total 465 growth stages × 14 meteorological factors = 6,510 similarity matrices were obtained for each trait.
ECGC Step 3: Estimation of Genetic Correlation Between Environments
Genetic correlations between environments were estimated bivariate mixed models in a pairwise manner. The model can be written as
where
Subsequently, for each trait, genetic correlations between environments were estimated in a pairwise manner. First, the bivariate mixed model (Eq. 2) was fitted to all pairs of environments (52 × 51/2 = 1,326 pairs), to estimate the genetic covariances between environments (
It is notable that genetic correlations between environment can be estimated using methods other than the pairwise estimation described here (e.g., standard multivariate models or factor analytic models). But the pairwise estimation will be one of the easiest methods to conduct in particular when the number of environments is great.
ECGC Step 4: Scanning Environmental Covariates Associated With the Genetic Correlation
Now we have the 6,510 (465 growth stages × 14 meteorological factors) similarity matrices of environmental covariates and six (number of traits) genetic correlation matrices. The sizes of both kind matrices were 52 (number of environments) × 52 for the traits except for PR where the sizes were 30 × 30. The Pearson’s product moment correlation coefficients between the upper (or lower) triangle off-diagonals of the similarity matrices and genetic correlation matrices were then calculated (Figure 1E) using the cor.test function of the R package stats (ver. 4.0.3). p values of the coefficients were also calculated using the function. The significance of the coefficients was judged after Bonferroni correction of the p values. Considering the number of environmental covariates (6,510) and traits (6), the threshold of significance was set at 0.05/(6,510 × 6) = 1.28e–6.
Even after Bonferroni correction, 10,397 environmental covariate/trait combinations remained significant. This was attributed to redundancy in the environmental covariates. For example, a value at the 10th to 12th growth stages was often highly correlated with a value at the 9th to 13th growth stages. To eliminate this redundancy, environmental covariates were clustered using the hierarchical clustering implemented in the hclust function of R. Similarities were defined based on Euclidean distance, and complete-linkage clustering was used. The number of clusters was determined using the gap statistic (Tibshirani et al., 2001). The clustering results are presented in Supplementary Figure S1. When multiple environmental covariates were significant at a cluster, the combination with the highest r2 was selected, resulting in 555 significant environmental covariate/trait combinations (Supplementary Table S17).
ECGC Step 5: Estimation of Slopes and GWA Mapping on the Slopes
For each significant environmental covariate/trait combination, the additive genetic effects of the 624 varieties at each environment were regressed on the environmental covariates, and the slopes (
To verify the validity of these associations, subsampling analyses were conducted for these 270 combinations. Specifically, 500 (80%) varieties were randomly selected from the 624 varieties and subjected to GWA mapping. Random sampling was adopted because the aim of this subsampling is to perturbate the genetic structure which might produce false positives. This procedure was repeated 10 times. Then replications where the regions detected using the full data also showed significant associations were counted (threshold −log10p = 5.041614; Supplementary Table S19). Regions with counts >4 were regarded as reliable results. As a result, 948 regions were remained for 179 environmental covariate/trait combinations. These regions were found to be constituted with 39 regions by removing overlaps between combinations (Supplementary Table S19).
Orthologs of Arabidopsis thaliana mapped in the detected regions were extracted from JBrowse provided by Phytozome ver.12.1 (https://phytozome.jgi.doe.gov/jbrowse/index.html?data=genomes/Gmax). Gene ontology of these orthologs were surveyed with overrepresentation analyses provided by PANTHER ver.16.0 (http://pantherdb.org/) using Arabidopsis genes as a reference.
Simulation Analyses
Simulation analyses were conducted to assess the performance to ECGC to detect environment covariates. As illustrated in Eq. 1, the detection power of ECGC depends on the proportion of the variance of
(1). Assign values for r2 and M from grids 0.01, 0.018, 0.025, 0.035, 0.053, 0.075, 0.1, and 0.15, and 5, 10, 15, 20, 30, 40, 50, and 60, respectively.
(2). Generate
(3). Determine the variance of
(4). Generate
(5). Generate a symmetric matrix by adding
(6). Convert the symmetric matrix of (5) to the correlation matrix.
For each combination of r2 and M, the genetic correlation matrix was simulated 2000 times. For each simulated matrix, 99 additional environment covariates that were not associated with the correlation matrix were also simulated as true negatives. Then the Pearson correlation between
Results and Discussion
Interpretation of ECGC
Eq. 1 shows that environmental covariates (
Detection of Environmental Covariates
The multi-environment data set used here consisted of 25,158 records of 624 varieties that were evaluated from 1961 to 2015 at 41 fields. The included varieties consisted of cultivars and breeding lines developed in Japan for the sake of food production, such as tofu (Supplementary Table S3). Among the 41 fields, in nine fields, multiple management conditions (early/late sowing dates and sparse/dense plant densities) were consistently conducted in multiple years. For these fields, different management conditions were defined as different environments as described in Materials and Methods. As a result, a total of 52 environments were defined from 41 fields (Figure 1A and Supplementary Table S2).
The steps of ECGC are shown in Figures 1B–F. In the first step, environmental covariates representing each environment were calculated (Figure 1B). Here, we considered 14 meteorological factors as the environmental stimuli (Materials and Methods). We divided the whole growth period of a plant, from sowing to maturity, into 30 stages; this was achieved by dividing the growth period (from sowing to flowering) into 10 stages, and the period from flowering to maturity into 20 stages (Figure 1B). For each meteorological factor (e.g., daily mean temperature), the meteorological values within each stage were averaged, yielding 30 values for each plant. These values were then averaged across plants evaluated at the environment for all the years, thus yielding 30 values for each environment (Supplementary Tables S5–S10). Subsequently, the values were further averaged across the 30 stages, i.e. across the 1st to 2nd stages, the 1st to 3rd stages, etc. This procedure yielded 465 (30 + 30 × 29/2) environmental covariates for each trait−meteorological factor combination. Thus, total 465 × 14 (meteorological factors) = 6,510 environmental covariates were generated.
In the second step, for each environmental covariate (e.g., precipitation at the 1st to 3rd stages), values were extracted from the 52 environments, and the linear kernel (i.e., the similarity matrix) between environments was calculated (Figure 1C). In the third step, genetic correlations of additive genetic effects between environments were estimated using mixed models for each trait (Figure 1D, Supplementary Tables S11–S16, Supplementary Figure S3). Estimated genetic correlations suggest strong G
FIGURE 2. Associations of environmental covariates with genetic correlations between environments. The heat maps represent the −log10 p values of correlation coefficients of off-diagonal elements between the similarity matrix of each environmental covariate and the genetic correlation matrix. The diagonal elements of the triangles correspond with the 1st to 30th growth stages, from the lower left to the upper right. The off-diagonal elements correspond to the growth periods that span multiple stages, where the x and y axes denote the start and end of the periods, respectively. The broken lines indicate flowering time. Abbreviations: YI, yield; SW, seed weight; DTF, days to flowering; DTM, days to maturity; SL, stem length; PR, protein content; T, mean temperature; Tmax, maximum temperature; Tmin, minimum temperature; Pr, precipitation; e, vapour pressure; VPD, vapour pressure deficit; N, hours of sunshine; Sd, solar radiation; EP, potential evapotranspiration; Ph, photoperiod.
For YI, three regions in Figure 2 showed high −log10 p values. The first region was detected around the 2nd to 22nd growth stages (upper-left part of the triangles in Figure 2) of the temperature-related environmental covariates, such as mean temperature, minimum temperature and vapour pressure. The highest −log10 p values of these meteorological factors were observed at the 2nd to 22nd growth stages (r2 = 0.036,
The highest −log10 p value for SW was observed in relation to potential evapotranspiration at the 8th to 10th stages (r2 = 0.050,
For DTF, the significant environmental covariates included mean temperature at the 6th stage (r2 = 0.140,
The p value patterns of PR were different from those of the other traits. Meteorological factors generally exert their effects after flowering, around the 21st to 27th stages, as observed for mean temperature (highest
Simulation Results
To assess the validity of these detected environmental covariates, the detective power of ECGC was verified with simulations. As illustrated in the Methods, the performance of ECGC is affected by r2 and the number of environment M. The receiver operating characteristic (ROC) curves obtained under different combinations of r2 and M are shown in Figure 3. As expected, the performance of ECGC gained as r2 and M increased. In Figures 4A,B, observed r2 values for each trait are shown for comparison with Figure 3. The r2 values of the detected environmental covariates were greater than 0.053 for PR (M = 30) and greater than 0.018 for the other traits (M = 52) (Figure 4B). The ROC curves under the corresponding combinations of r2 and M in simulation results (0.053 and 30, and 0.018 and 50, respectively) suggest that ECGC detected the environmental covariates with reasonable accuracy under these r2 and M values.
FIGURE 3. Receiver operating characteristic (ROC) curves in simulation analyses. The titles of the plots denote the number of environments (M). The ROC curves of different r2 values are drawn with different colours. The x and y axes are the false positive rate and the true positive rate, respectively. In these simulations, 99 true negatives were simulated for each true positive. Thus, the coordinate (x, y) = (1, 0.01) means that one true positive is detected with 0.99 false positives.
FIGURE 4. Observed r2 values of off-diagonals between the similarity matrix of environmental covariates and the genetic correlation matrix. (A) Histograms for all environmental covariates. (B) Histograms for the environmental covariates significantly detected. Abbreviations: YI, yield; SW, seed weight; DTF, days to flowering; DTM, days to maturity; SL, stem length; PR, protein content.
Association Mapping on Genotype Sensitivity
In the final step of ECGC, the genetic architecture underpinning the sensitiveness of the varieties to the detected environmental covariates was examined using association mapping (Figure 1F). For the environmental covariates detected in the fourth step, slopes were estimated for each variety by regressing the additive genetic effects at each environment on the environmental covariates (Supplementary Table S18). The slope represented the sensitivity of the variety to the changes in the environmental covariates, and genome-wide association (GWA) mapping was conducted on the slopes. By pruning with false discovery rate (FDR) < 0.05 and subsampling analyses, 39 chromosomal regions were significantly detected for traits except for SW (Supplementary Figures S5–S10 and Supplementary Table S19). DTF shared one and two regions with DTM and SL, respectively, and these three traits shared one region (Figure 5A) which is located at near the known flowering gene (E2) (Watanabe et al., 2011). On the other hand, YI and PR shared no regions with each other, and nor with DTF, DTM, and SL (Figure 5A). These results suggest that major genes responsible for the G
FIGURE 5. Summary of chromosomal regions detected by genome-wide association mapping. (A) Number of chromosomal regions significantly detected by association mapping for each trait. (B) Distributions of meteorological factor/growth stage combinations where significant associations were detected. Red dots indicate the combinations with significant associations. See Supplementary Figure S11 for the other traits and meteorological factors. The diagonal elements of the triangles correspond with the 1st to 30th growth stages, from the lower left to the upper right. The off-diagonal elements correspond to the growth periods that span multiple stages, where the x and y axes denote the start and end of the periods, respectively. The broken lines indicate flowering time. Abbreviations: T, mean temperature; Tmax, maximum temperature; Tmin, minimum temperature; Pr, precipitation; e, vapour pressure; VPD, vapour pressure deficit; N, hours of sunshine; EP, potential evapotranspiration; Ph, photoperiod.
GWA mapping of ECGC could narrow down a genomic region that was suggested to be responsible for the G
FIGURE 6. Examples of the genome-wide association mapping results. (A) Slopes of YI for precipitation at the 1st to 3rd growth stages. The x and y axes are the environmental covariates and the additive genetic effects, respectively. Both axes are standardised. (B) Manhattan and QQ plots of the association analysis of the slopes illustrated in (A). The horizontal dashed line indicates the false discovery rate (0.05) threshold. The x and y axes of the QQ plot are the expected and observed −log10 p values, respectively. (C) Slopes of days to flowering for maximum temperature at the 5th to 14th growth stages. (D) Manhattan and QQ plots for the slopes illustrated in (D).
Strong associations were observed on multiple adjacent regions on chromosome 10 spanning 44.46–46.61 Mbp for various meteorological factors, including temperature-related covariates and photoperiod (Figures 6C,D, Supplementary Table S19, and Supplementary Figure S9). These regions were also detected for DTM and SL, reflecting the similar tendencies of the G
Insights on Selection History
The slopes provide insights on how the G
FIGURE 7. Correlations between environmental covariates and the averages of the slopes of the varieties evaluated at each environment. Positive/negative correlations indicate directional selection for the environmental covariates. (A) Weighted averages of slopes for each trait. YI, yield; SW, seed weight; DTF, days to flowering; DTM, days to maturity; SL, stem length; PR, protein content. (B) Examples of correlations between the environmental covariates and the averages of the slopes observed for yield. The environment covariates that showed strong associations with the G
Advantages of ECGC
The base model of ECGC is a conventional model for reaction norm where the genotypic value is divided into a component representing sensitivity to environmental covariates and a component free from them (van Eeuwijk et al., 2005; Hayes et al., 2016). The novelty of ECGC is to associate similarity matrices of environmental covariates with the genetic correlation matrix, and the idea of ECGC can be derived by extending the base model as described in Materials and Methods. This approach brings the following advantages. First, the environmental covariate search by ECGC is directly related to the G
Conclusion
Owing to these advantages, our proposed ECGC was applicable to the large-scale multi-environment data set of soybean, and able to depict comprehensive landscapes on how environment stimuli are involved in the G
Data Availability Statement
The data used in this study is subject to the following licenses/restrictions: the SNP data cannot be freely disclosed. Requests to access the dataset should be directed to onogiakio@gmail.com. The other data (phenotype and meteorological values, Supplementary Data 1–4) is freely available at Dryad (https://doi.org/10.5061/dryad.rr4xgxd6r).
Author Contributions
AO collected and organised historical data, conceived the ECGC framework, conducted all statistical analyses, drafted the manuscript and collaborated on DNA extraction and genotyping. DS extracted DNA and collaborated in drafting the manuscript. AK designed the SNP array, extracted DNA and collaborated on genotyping and drafting the manuscript. SN, TY, and JY collaborated on drafting the manuscript. SN suggested the importance of historical data and collaborated on drafting the manuscript.
Funding
This study was supported by Japan Science and Technology Agency PRESTO (grant number 15656049) and a grant from the Ministry of Agriculture, Forestry and Fisheries of Japan (Smart-breeding system for Innovative Agriculture (BAC1001)).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Acknowledgments
The authors thank Naohiro Yamada of the Nagano Vegetable and Ornamental Crops Experiment Station, Kaori Hirata, Akio Kikuchi, Nobuhiko Oki, Yoshitake Takada and Koji Takahashi of National Agricultural and Research Organization (NARO), Shohei Fujita, Satoshi Kobayashi, Fumiko Kousaka, Hideki Kurosaki, Tomoaki Miyoshi and Keiichi Senda of Hokkaido Research Organization for providing historical data and seeds. The seeds were also provided by Genebank of NARO. The authors also appreciate Makito Hajika of NARO and Hiroyoshi Iwata of the University of Tokyo for their assistance in collaborative works and Ryoma Takeshima, Hiroe Yoshida and Kaori Sasaki of NARO for fruitful discussions. The authors thank two private companies Kumogamisya and Seisyou for digitalising historical booklets.
Supplementary Material
The Supplementary Material (Supplementary Methods S1-5 and Figures S1-12) for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.803636/full#supplementary-material. The Supplementary Tables S1-21 and an example R script to execute ECGC are available at Dryad (https://doi.org/10.5061/dryad.rr4xgxd6r)
References
Bernard, R. L. (1972). Two Genes Affecting Stem Termination in Soybeans 1. Crop Sci. 12, 235–239. doi:10.2135/cropsci1972.0011183X001200020028x
Browning, B. L., and Browning, S. R. (2016). Genotype Imputation with Millions of Reference Samples. Am. J. Hum. Genet. 98, 116–126. doi:10.1016/j.ajhg.2015.11.020
Covarrubias-Pazaran, G. (2016). Genome-assisted Prediction of Quantitative Traits Using the R Package Sommer. PLoS One 11, e0156744. doi:10.1371/journal.pone.0156744
Crews, D. H., and Kemp, R. A. (2001). Genetic Parameters for Ultrasound and Carcass Measures of Yield and Quality Among Replacement and slaughter Beef Cattle. J. Anim. Sci. 79, 3008–3020. doi:10.2527/2001.79123008x
de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., and Calus, M. P. L. (2013). Whole-genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics 193, 327–345. doi:10.1534/genetics.112.143313
Des Marais, D. L., Hernandez, K. M., and Juenger, T. E. (2013). Genotype-by-environment Interaction and Plasticity: Exploring Genomic Responses of Plants to the Abiotic Environment. Annu. Rev. Ecol. Evol. Syst. 44, 5–29. doi:10.1146/annurev-ecolsys-110512-135806
Egli, D. B. (2010). “Soybean Yield Physiology: Principles and Processes of Yield Production,” in The Soybean Botany, Production and Uses. Editor G. Singh (Wallingford, Oxfordshire: CAB International). doi:10.1079/9781845936440.0113
Endelman, J. B. (2011). Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. The Plant Genome 4, 250–255. doi:10.3835/plantgenome2011.08.0024
Fehr, W. R., and Caviness, C. E. (1977). Stages of Soybean Development. Special Report 87. Ames, Iowa: Iowa State University, Agricultural and Home Economics Experiment Station. Available at: https://lib.dr.iastate.edu/specialreports/87.
Fieuws, S., and Verbeke, G. (2006). Pairwise Fitting of Mixed Models for the Joint Modeling of Multivariate Longitudinal Profiles. Biometrics 62, 424–431. doi:10.1111/j.1541-0420.2006.00507.x
Gayler, K. R., and Sykes, G. E. (1981). β-Conglycinins in Developing Soybean Seeds. Plant Physiol. 67, 958–961. doi:10.1104/pp.67.5.958
Guo, T., Mu, Q., Wang, J., Vanous, A. E., Onogi, A., Iwata, H., et al. (2020). Dynamic Effects of Interacting Genes Underlying rice Flowering-Time Phenotypic Plasticity and Global Adaptation. Genome Res. 30, 673–683. doi:10.1101/gr.255703.119
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. New York: Springer.
Hayes, B. J., Daetwyler, H. D., and Goddard, M. E. (2016). Models for Genome × Environment Interaction: Examples in Livestock. Crop Sci. 56, 2251–2259. doi:10.2135/cropsci2015.07.0451
Jarquín, D., Crossa, J., Lacaze, X., Du Cheyron, P., Daucourt, J., Lorgeou, J., et al. (2014). A Reaction Norm Model for Genomic Selection Using High-Dimensional Genomic and Environmental Data. Theor. Appl. Genet. 127, 595–607. doi:10.1007/s00122-013-2243-1
Kaga, A., Shimizu, T., Watanabe, S., Tsubokura, Y., Katayose, Y., Harada, K., et al. (2012). Evaluation of Soybean Germplasm Conserved in NIAS Genebank and Development of Mini Core Collections. Breed. Sci. 61, 566–592. doi:10.1270/jsbbs.61.566
Khosla, S., Augustus, M., and Brahmachari, V. (1999). Sex-specific Organisation of Middle Repetitive DNA Sequences in the Mealybug Planococcus Lilacinus. Nucleic Acids Res. 27, 3745–3751. doi:10.1093/nar/27.18.3745
Kokubun, M. (2013). Genetic and Cultural Improvement of Soybean for Waterlogged Conditions in Asia. Field Crops Res. 152, 3–7. doi:10.1016/j.fcr.2012.09.022
Kuwagata, T., Yoshimoto, M., Ishigooka, Y., Hasegawa, T., Utsumi, M., Nishimori, M., et al. (2011). MeteoCrop DB: an Agro-Meteorological Database Coupled with Crop Models for Studying Climate Change Impacts on rice in Japan. J. Agric. Meteorol. 67, 297–306. doi:10.2480/agrmet.67.4.9
Li, X., Guo, T., Mu, Q., Li, X., and Yu, J. (2018). Genomic and Environmental Determinants and Their Interplay Underlying Phenotypic Plasticity. Proc. Natl. Acad. Sci. USA 115, 6679–6684. doi:10.1073/pnas.1718326115
Liu, B., Kanazawa, A., Matsumura, H., Takahashi, R., Harada, K., and Abe, J. (2008). Genetic Redundancy in Soybean Photoresponses Associated with Duplication of the Phytochrome A Gene. Genetics 180, 995–1007. doi:10.1534/genetics.108.092742
Malosetti, M., Ribaut, J.-M., and van Eeuwijk, F. A. (2013). The Statistical Analysis of Multi-Environment Data: Modeling Genotype-By-Environment Interaction and its Genetic Basis. Front. Physiol. 4, 44. doi:10.3389/fphys.2013.00044
Mather, K., and Jones, R. M. (1958). Interaction of Genotype and Environment in Continuous Variation: I. Description. Biometrics 14, 343–359. doi:10.2307/2527879
McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Second edition. Boca Raton, FL: CRC Press.
Millet, E. J., Kruijer, W., Coupel-Ledru, A., Alvarez Prado, S., Cabrera-Bosquet, L., Lacube, S., et al. (2019). Genomic Prediction of maize Yield across European Environmental Conditions. Nat. Genet. 51, 952–956. doi:10.1038/s41588-019-0414-y
Misztal, I., Tsuruta, S., Strabel, T., Auvray, B., Druet, T., and Lee, D. H. (2002). “BLUPF90 and Related Programs,” in Proceedings of the 7th World Congress On Genetics Applied To Livestock Production, Montpellier, France.
R Core Team (2020). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: http://www.R-project.org/.
Sakazono, S., Nagata, T., Matsuo, R., Kajihara, S., Watanabe, M., Ishimoto, M., et al. (2014). Variation in Root Development Response to Flooding Among 92 Soybean Lines during Early Growth Stages. Plant Prod. Sci. 17, 228–236. doi:10.1626/pps.17.228
Sinclair, T. R., Kitani, S., Hinson, K., Bruniard, J., and Horie, T. (1991). Soybean Flowering Date: Linear and Logistic Models Based on Temperature and Photoperiod. Crop Sci. 31, 786–790. doi:10.2135/cropsci1991.0011183X003100030049x
Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. (2005). ROCR: Visualizing Classifier Performance in R. Bioinformatics 21, 3940–3941. doi:10.1093/bioinformatics/bti623
Storey, J. D., and Tibshirani, R. (2003). Statistical Significance for Genomewide Studies. Proc. Natl. Acad. Sci. 100, 9440–9445. doi:10.1073/pnas.1530509100
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the Number of Clusters in a Data Set via the gap Statistic. J. R. Stat. Soc. Ser. B. Stat. Methodol. 63, 411–423. doi:10.1111/1467-9868.00293
Tsubokura, Y., Watanabe, S., Xia, Z., Kanamori, H., Yamagata, H., Kaga, A., et al. (2014). Natural Variation in the Genes Responsible for Maturity Loci E1, E2, E3 and E4 in Soybean. Ann. Bot. 113, 429–441. doi:10.1093/aob/mct269
van Eeuwijk, F. A., Malosetti, M., Yin, X., Struik, P. C., and Stam, P. (2005). Statistical Models for Genotype by Environment Data: from Conventional ANOVA Models to Eco-Physiological QTL Models. Aust. J. Agric. Res. 56, 883–894. doi:10.1071/AR05153
Van Nguyen, L., Takahashi, R., Githiri, S. M., Rodriguez, T. O., Tsutsumi, N., Kajihara, S., et al. (2017). Mapping Quantitative Trait Loci for Root Development under Hypoxia Conditions in Soybean (Glycine max L. Merr.). Theor. Appl. Genet. 130, 743–755. doi:10.1007/s00122-016-2847-3
VanRaden, P. M. (2008). Efficient Methods to Compute Genomic Predictions. J. Dairy Sci. 91, 4414–4423. doi:10.3168/jds.2007-0980
Wand, M. P., and Jones, M. C. (1995). Kernel Smoothing. Volume 60 of Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman and Hall/CRC. .
Watanabe, S., Hideshima, R., Xia, Z., Tsubokura, Y., Sato, S., Nakamoto, Y., et al. (2009). Map-based Cloning of the Gene Associated with the Soybean Maturity Locus E3. Genetics 182, 1251–1262. doi:10.1534/genetics.108.098772
Watanabe, S., Xia, Z., Hideshima, R., Tsubokura, Y., Sato, S., Yamanaka, N., et al. (2011). A Map-Based Cloning Strategy Employing a Residual Heterozygous Line Reveals that the GIGANTEA Gene Is Involved in Soybean Maturity and Flowering. Genetics 188, 395–407. doi:10.1534/genetics.110.125062
Keywords: genotype-by-environment interactions, genetic correlation, genome-wide association, historical data, multi-environmental trial, environmental covariate
Citation: Onogi A, Sekine D, Kaga A, Nakano S, Yamada T, Yu J and Ninomiya S (2021) A Method for Identifying Environmental Stimuli and Genes Responsible for Genotype-by-Environment Interactions From a Large-Scale Multi-Environment Data Set. Front. Genet. 12:803636. doi: 10.3389/fgene.2021.803636
Received: 28 October 2021; Accepted: 06 December 2021;
Published: 22 December 2021.
Edited by:
Zitong Li, Commonwealth Scientific and Industrial Research Organisation (CSIRO), AustraliaReviewed by:
Kyuya Harada, Osaka University, JapanSang He, La Trobe University, Australia
Yanjun Zan, Swedish University of Agricultural Sciences, Sweden
Copyright © 2021 Onogi, Sekine, Kaga, Nakano, Yamada, Yu and Ninomiya. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Akio Onogi, onogiakio@gmail.com
†ORCID: Akio Onogiorcid.org/0000-0003-1707-9539