Integrating a growth degree-days based reaction norm methodology and multi-trait modeling for genomic prediction in wheat

Raffo, Miguel Angel; Sarup, Pernille; Andersen, Jeppe Reitan; Orabi, Jihad; Jahoor, Ahmed; Jensen, Just

doi:10.3389/fpls.2022.939448

ORIGINAL RESEARCH article

Front. Plant Sci. , 02 September 2022

Sec. Plant Breeding

Volume 13 - 2022 | https://doi.org/10.3389/fpls.2022.939448

Integrating a growth degree-days based reaction norm methodology and multi-trait modeling for genomic prediction in wheat

$\r\nMiguel Angel Raffo*$ Miguel Angel Raffo^1*

Pernille Sarup^1,2

Jeppe Reitan Andersen²

Jihad Orabi²

Ahmed Jahoor^2,3

Just Jensen¹

¹Center for Quantitative Genetics and Genomics, Aarhus University, Tjele, Denmark
²Nordic Seed A/S, Odder, Denmark
³Department of Plant Breeding, Swedish University of Agricultural Sciences, Uppsala, Sweden

Multi-trait and multi-environment analyses can improve genomic prediction by exploiting between-trait correlations and genotype-by-environment interactions. In the context of reaction norm models, genotype-by-environment interactions can be described as functions of high-dimensional sets of markers and environmental covariates. However, comprehensive multi-trait reaction norm models accounting for marker × environmental covariates interactions are lacking. In this article, we propose to extend a reaction norm model incorporating genotype-by-environment interactions through (co)variance structures of markers and environmental covariates to a multi-trait reaction norm case. To do that, we propose a novel methodology for characterizing the environment at different growth stages based on growth degree-days (GDD). The proposed models were evaluated by variance components estimation and predictive performance for winter wheat grain yield and protein content in a set of 2,015 F6-lines. Cross-validation analyses were performed using leave-one-year-location-out (CV1) and leave-one-breeding-cycle-out (CV2) strategies. The modeling of genomic [SNPs] × environmental covariates interactions significantly improved predictive ability and reduced the variance inflation of predicted genetic values for grain yield and protein content in both cross-validation schemes. Trait-assisted genomic prediction was carried out for multi-trait models, and it significantly enhanced predictive ability and reduced variance inflation in all scenarios. The genotype by environment interaction modeling via genomic [SNPs] × environmental covariates interactions, combined with trait-assisted genomic prediction, boosted the benefits in predictive performance. The proposed multi-trait reaction norm methodology is a comprehensive approach that allows capitalizing on the benefits of multi-trait models accounting for between-trait correlations and reaction norm models exploiting high-dimensional genomic and environmental information.

Introduction

Genomic selection (GS, Meuwissen et al., 2001) is an efficient selection method based on whole-genome prediction (WGP) that has been successfully applied for a variety of complex traits in animals and plants (De los Campos et al., 2009; Hayes and Goddard, 2010; Gianola and Rosa, 2015; Crossa et al., 2017). Over the last few decades, genomic models have been extended to accommodate greater availability of data from new technologies and increasing computational resources that allow incorporation of high dimensional data. In plant breeding, there has been an increasing interest in using multi-trait (MT) multi-environmental (ME) models over single-trait (ST) single-environmental models, as it may result in an improvement in accuracy of selection and prediction (Malosetti et al., 2007; Montesinos-Lopez et al., 2016; Montesinos-Lopez et al., 2019). MTME models combine the benefits of exploiting between-trait correlations from MT models and characterizing genotype-by-environment interactions (G×E) from ME evaluations. Efficient G×E modeling is required to counteract the negative impacts of climate change, either by selecting more resilient lines with decreased G×E or by developing lines well adapted to specific environments.

The advantages of MT models lie in that they can represent and exploit the correlation between traits, which can increase prediction accuracy and reduce bias in predictions compared to single-trait models (Henderson and Quaas, 1976; Pollak et al., 1984). The use of between-trait correlations is especially beneficial to enhance accuracy of predicted breeding values when genetic and environmental correlations are of opposite sign, and when MT models are used to infer low-heritability traits genetically correlated with high-heritability traits (Thompson and Meyer, 1986; Jia and Jannink, 2012; Jiang et al., 2015). However, MT models increase statistical complexity in the number of parameters to estimate, which could reduce the accuracy of estimation and their benefits for prediction (Haile et al., 2018; Lado et al., 2018). MT models are also useful for predicting genetic values of individuals not phenotyped for specific traits of interest but having phenotypic records for correlated traits. This may represent an additional advantage for the MT analysis, particularly if the trait of interest is difficult to measure or has high phenotyping cost.

Modeling G×E has been valuable for wheat genomic prediction (GP), as it has shown increases in prediction accuracy ranging from 10 to 40 % (reviewed by Crossa et al., 2017) and allows prediction of breeding values for lines not tested in the target environment, but genetically related with other tested lines. G×E can be identified in ME evaluations and incorporated into statistical models to detect the responses of genotypes to changing environments (also termed Macro-environmental sensitivity); the differential response of genotypes to environments is called the reaction norm (RN, Falconer, 1990; Falconer and Mackay, 1996; Calus and Veerkamp, 2003). The G×E can be modeled in an RN framework as a function of markers and environmental covariates (ECs) collected from weather stations (e.g., temperature, precipitation, solar radiation) and soil characterization (e.g., water storage capacity, sand content, hydraulic conductivity). However, given the high dimensional nature of genomic data and ECs, modeling marker × ECs interactions can lead to computational challenges.

A computationally efficient approach to quantify the G×E is using (co)variance structures. Burgueño et al. (2012) model G×E in a ME version of the genomic best linear unbiased predictor model (G-BLUP), where (co)variances structures of molecular markers and pedigree were used to represent genetic relationships within environments. The Burgueño et al. (2012) approach can also be seen as an MT methodology since environments were considered different traits, representing heterogeneous variance and covariances among environments. However, their approach did not include ECs to model G×E. Jarquin et al. (2014) developed an RN model, where the main and interaction effects of markers and ECs were introduced using high-dimensional (co)variance structures of markers and ECs. The ECs were specifically computed for different phenological stages summarizing water availability, temperature, and radiation. A similar definition of ECs has been implemented by Heslot et al. (2014), but the phenological stages were simulated using crop modeling instead of the empirical measure of phenology. Such approaches can be interpreted as RN models (Falconer, 1990; Falconer and Mackay, 1996; Calus and Veerkamp, 2003; Su et al., 2006) since phenotypes are -implicitly- linearly regressed on ECs. The RN model proposed by Jarquin et al. (2014) has been applied successfully in breeding programs of cotton and wheat (Pérez-Rodríguez et al., 2015; Crossa et al., 2016) and has recently been applied in combination with historical weather records and simulation to tackle the problem of predicting cultivars’ future performance under uncertain conditions (De Los Campos et al., 2020). Other methods have also been proposed to incorporate marker by environment interaction into the GS framework, for instance, models using Gaussian Kernel (Cuevas et al., 2016; Cuevas et al., 2017) or deep learning methodologies (Montesinos-Lopez et al., 2018) and have been successfully applied for GP.

Approaches for incorporating both MT and ME information in GS have been developed using different methodologies (Malosetti et al., 2007; Montesinos-Lopez et al., 2016; Montesinos-Lopez et al., 2019; Ward et al., 2019). However, to the best of our knowledge MT reaction norm models (MTRN) incorporating ECs information are lacking. In our study, we propose to extend a RN model incorporating G×E through high-dimensional (co)variance structures of markers and ECs (Jarquin et al., 2014; Pérez-Rodríguez et al., 2015; De Los Campos et al., 2020) to the MT case (MTRN). To do that, we propose a novel methodology for characterizing the environment at different growth stages based on growth degree-days (GDD) instead of phenological stages. Thus, GDD periods (e.g., computed each 100-GDD) encompassing the crop stages throughout the full season can be used. The GDD have been broadly recognized as one of the main forces driving phenology in wheat (Nuttonson, 1955; Slafer and Rawson, 1994; Salazar-Gutierrez et al., 2013; Aslam et al., 2017). We hypothesize that using GDD to define wheat growth periods can be useful for several reasons:

I. It simplifies the implementation of RN models since phenological records or crop simulations are no longer needed to infer the phenology of the lines. This may be convenient for wheat breeding because, in most cases, the full phenological stages are not recorded, or alternatively, it avoids the use of crop simulations that can represent a challenge for breeders since crop simulations are not commonly used in breeding programs.

II. GDD periods represent a convenient approach to deal with the differences in growth stages of breeding lines. Accurately capturing differences in the growth stages of breeding lines can be difficult using crop simulation because the critical periods for different breeding lines can occur at different moments within the same environment (year-location combinations). Nevertheless, this is no longer a problem when GDD periods are used.

III. Using GDD periods can be helpful when ECs is determined for traits where the relationships between the environmental conditions and resulting phenotype are not well established; while this may not represent a problem for wheat grain yield since these relationships have been extensively defined (Meynard and Sebillotte, 1994), it can be relevant for quality or diseases traits.

In addition, the proposed MTRN model using (co)variance structures to incorporate G×E represent a convenient choice to reduce the high demand on computational resources, which have often been a relevant restriction for developing MTME models. This study uses a large set of winter wheat breeding lines phenotyped for grain yield and protein content in multiple environments. The proposed models were evaluated for both traits using cross-validation (CV) and trait-assisted genomic prediction (TA-GP) in two prediction scenarios relevant to plant breeders: (i) predicting the performance of breeding lines that have been tested in some environments but not in others (CV1, leave-one-year-location-out), and (ii) predicting the performance of new lines across breeding cycles (CV2, leave-one-breeding-cycle-out). The TA-GP was performed considering the phenotypes of the additional trait to exploit between-trait correlations, which is intended to improve predictions of breeding values for the trait of interest.

Materials and methods

Plant material and phenotyping

In this study, 2,015 sixth-generation (F₆) winter wheat lines (T. aestivum L.) from the breeding company Nordic Seed A/S were used; a subset of these data has been used in earlier studies (Cericola et al., 2017; Kristensen et al., 2018; Tsai et al., 2020; Raffo et al., 2022). The breeding lines came from seven breeding cycles (BC) tested in years 2014–2019, and each breeding cycle originated from approximately 60 parental line-crosses followed by five generations of selfing, including creating single seed descent (SSD) lines in generation F₄. Each BC were composed of around 330 lines sown in one or 2 years (cycle 1: 2014, cycle 2: 2014–2015, cycle 3: 2015–2016, cycle 4: 2016–2017, cycle 5: 2017, cycle 6: 2018, cycle 7: 2019) at three locations in Denmark (DK): Odder (Central DK), Holeby (South DK), and Skive (North-west DK). For breeding cycles from 2014 to 2016 (cycle 2, 3 and 4) lines were tested in 2 years instead of one in order to have a quick construction of a reliable training population for genomic prediction. Each year the field trials were designed in 15 blocks of 46 line plots of 8.25 m², having two replicates of 21 F₆ lines and two checks randomly assigned per year-location combination. The traits analyzed were grain yield measured as kg per plot (8.25 m²) and protein content (%) determined by near-infrared spectroscopy (NIRS), and both traits were chosen according to their high relevance for breeding. Similar agronomic management was applied for all trials (e.g., sowing and assessment time, fertilization, application of treatments etc.).

Genotyping

A modified CTAB method was used to perform DNA extractions (Rogers and Bendich, 1985). The plant material was genotyped using the 15K Illumina Infinium iSelect HD Custom Genotyping BeadChip technology (Wang et al., 2014). Quality control was carried out by removing genotyped SNPs with minor allele frequency (MAF) lower than 5% and call rate lower than 0.90. In total, 12,893 SNPs remained after quality control.

Environmental data and environmental covariates

Climatic variables were collected on a daily basis and described temperature, relative humidity, wind speed, vapour-pressure deficit, and global radiation from weather stations within 20 km from field trials; precipitation within 10 km; minimum and maximum temperature and potential evaporation from the closest weather stations located in Silstrup, Askov, and Flakkebjerg for Skive, Dyngby, and Holeby, respectively (Plauborg, pers. comm.). Soil information was available in each locality for depths 0–30, 30–60, 60–100, 100–200 cm and described texture (clay, sand, silt), carbon content, hydraulic conductivity and plant available water (Adhikari et al., 2013; Kotlar et al., 2020).

The climatic variables were used to compute ECs summarizing environmental descriptors linked to water availability, radiation, and temperature at different crop stages (see Figure 1 and Table 1), which together with the soil information, were used to describe the environments at the level of year-location. The crop stages were defined each 100 GDD, where GDD were estimated as the thermal sum of daily average temperature over 0°C (McMaster and Wilhelm, 1997; Salazar-Gutierrez et al., 2013) from sowing date until August 15 (estimated end of harvest season in Denmark). The 100-GDD stages were intended to summarize environmental conditions in short periods of days that encompass the crop phases throughout the full crop period. In total, 17 climatic variables were computed for each GDD stage, and seven soil variables specific for each locality were defined for four depths (0–30, 30–60, 60–100, 100–200 cm). Quality control of ECs was performed by removing variables with more than 10% of missing values (NA) or 30% of repeated values as indicated in Jarquin et al. (2014). After quality control, 300 climatic ECs from the first to the 27 GDD stages remained for each year-location combination plus 23 soil ECs (300 climatic ECs + 23 soil ECs = 323 ECs; see Supplementary material 1 for a complete description of variables obtained after quality control). Climatic ECs for GDD stages above 2,700 GDD did not pass quality control due to presenting more than 10% on NA values. The maturity date was simulated for the different environments following Pullens et al. (2021), and it was confirmed that all relevant growth stages until maturity were within the 27 GDD stages obtained after quality control.

FIGURE 1

Figure 1. Computation of growth stages and environmental covariates (ECs) for year 2014 in the locality of Skive. In total, the complete crop cycle was divided into 27 shorter periods of 100 GDD (27 growth stages). For each growth stage, a set of ECs summarizing environmental descriptors linked to water availability, radiation and temperature were computed as described in Supplementary material 1. The same approach was used for all year-location combinations. GDD: growth degree days, the GDD were estimated as the thermal sum of daily average temperature over 0°C.

TABLE 1

Table 1. Environmental covariates (ECs) description, modified from Lecomte (2005) and Heslot et al. (2014).

Statistical analysis

In this study, five models were proposed to evaluate the effect of including G×E interaction and MT modeling for grain yield and protein content. The full data set was used for VCs estimation to obtain estimates as accurate as possible. First, a baseline mixed model without genomic information (M1) was used as a starting point for constructing the other models; a similar baseline model has been used in earlier works with Nordic Seed A/S data (Cericola et al., 2017; Tsai et al., 2020). Second, the baseline model was extended by a genomic [SNPs] effect (M2), capturing the main additive genetic effects (Habier et al., 2007; VanRaden, 2008). Third, a G×E effect based on genomic [SNPs] by ECs interactions extended M2 to the RN framework (M3). Lastly, M2 and M3 were extended to the MT case (M4 and M5, respectively), considering grain yield and protein content. A detailed model description is provided below.

Model 1 (M1, baseline model)

The M1 (Equation 1) is a mixed model considering the main sources of variability affecting the data and the experimental design. It was defined as:

y_{n} = X b + Z_{1} l + Z_{2} f + Z_{3} s + e (1)

where y_n is the vector of phenotypes for grain yield (n = 1) or protein content (n = 2); X and Z_n(n = 1, 2, 3) are design matrices for fixed and random effects, respectively; b is the vector of fixed trial effects nested within year-location-breeding cycle, which is intended to capture the variation due to overall effects of year, location, the interaction between year and location as well as effects of the spatial location of the trial within the field.; l is a vector of line effect with $l \sim N I I D (0, I σ_{l}^{2})$ , where I is an identity matrix and $σ_{l}^{2}$ is the variance due to uncorrelated line effects; f is a vector of line × environment interaction (L×E) with $f \sim N I I D (0, I σ_{f}^{2})$ , where $σ_{f}^{2}$ is the variance due to uncorrelated L×E effects; s is a vector of spatial effect, which follows a multivariate normal density (MVN) with $s \sim M V N (0, S σ_{s}^{2})$ , where S is a spatial relationship matrix and $σ_{s}^{2}$ is the spatial effect variance. The spatial effect was defined as the combination of 9-spatial sub-components, where one sub-component is the spatial effect for the square centered on the plot of the observation (i.e., target plot), and the eight remaining sub-components are the spatial effects for the square centered on the eight plots surrounding the target plot (Figure 2). Virtual plots were added over trial borders and into empty spaces of the X by Y grid to control border effects and ensure all plots have eight surrounding plots. The spatial effects were identified by assigning a unique id given by its X (row) and Y (column) coordinates. The spatial relationship matrix S was computed as $S = \frac{{XX}^{'}}{t r ({XX}^{'}) / n}$ , where X_{n× q} is a n × q matrix, with n = number of observations (i.e., number of real plots), and q = number of real plus virtual plots. The X_{n× q} is an indicator matrix relating observations to spatial effects in S. A heatmap of the S matrix for a subset of the data (Dyngby, 2017) is shown in Figure 3. Additional explanations for the modeling of the spatial effect can be found in Tsai et al. (2020), where the spatial effect was equivalently modeled by the first time, but with the difference that for our case we used the relationship matrix S, and they used the regression of the 9-spatial covariates for each plot observation directly as random effects. However, although these two implementations are mathematically equivalent, our implementation is computationally faster. The e is a vector of random residuals with $e \sim N I I D (0, I σ_{e}^{2})$ , where $σ_{e}^{2}$ is the residual variance.

FIGURE 2

Figure 2. Representation of spatial information in a field trial. The target and eight surrounding plots were used together to correct the spatial variability across the field. The trial borders’ effect was considered by adding virtual plots to complete the eight surrounding plots for all observations. Virtual plots were also added in empty X-Y coordinates (with no plot observation registered) to ensure all plots have the eight surrounding plots. Hence, the spatial effects on an individual plot is the sum of effects with the square centered on the plot itself plus the effects of the eight surrounding plots with a square centered on those plots.

FIGURE 3

Figure 3. Heatmap of the spatial relationship matrix for locality Dyngby in the year 2017. A total of 311 plots were observed at Dyngby 2017, and the spatial relationship between plots are represented. Higher to lower relationships are represented from yellow to light-blue colors; the dark blue represents a lack of relationship between plots (not neighboring connections).

Model 2 (M2)

The M2 (Equation 2) extend M1 by adding a genomic [SNP] additive effect using the VanRaden (2008) genomic relationship matrix as (co)variance structure, and therefore, M2 can be seen as a G-BLUP model (Habier et al., 2007; VanRaden, 2008). M2 was defined as:

y_{n} = X b + Z_{1} l + Z_{2} f + Z_{3} s + Z_{4} g + e (2)

where y_n, X, Z_n, b, l, f, s, and e were defined as in M1; g is a vector of the genomic additive breeding values (GEBVs), with $g \sim M V N (0, G σ_{g}^{2})$ , $σ_{g}^{2}$ is the genomic additive variance, and G is the genomic relationship matrix (GRM) based on the first method proposed by VanRaden (2008): $G = \frac{{ZZ}^{'}}{2 \sum p_{i} (1 - p_{i})}$ , where p_i is the minor allele frequency (MAF) of the i^thSNP;Z is M−P; M is a SNP matrix coded −1, 0, 1, and P is a matrix with the MAF of SNP i calculated as 1(2(p_i−0.5)) for column i.

The line effect (l) was kept in all genomic models since it is intended to capture non-additive variance and additive variance not captured by SNPs in the genomic effect; therefore, this model definition helps to improve the characterization for the specific additive genomic variance captured by the SNPs.

The narrow (h²) and broad-sense (H²) plot heritabilities were estimated for M2 as $h^{2} = d (G) σ_{g}^{2} / σ_{P}^{2}$ and $H^{2} = (σ_{l}^{2} + d (G) σ_{g}^{2}) / σ_{P}^{2}$ , where d(G) is the mean diagonal value of G with d(G) = 1.869, the d(G) value can be interpreted as 1 plus the average genomic inbreeding coefficient for the population (VanRaden, 2008); $σ_{g}^{2}$ and $σ_{l}^{2}$ are the genomic estimated additive and line variance as defined for M2 and M1, respectively; $σ_{P}^{2}$ is the estimated phenotypic variance of the plot calculated as: $σ_{P}^{2} = σ_{l}^{2} + d (G) σ_{g}^{2} + σ_{f}^{2} + σ_{s}^{2} + σ_{e}^{2}$ , with $σ_{l}^{2}$ , d(G), $σ_{g}^{2}$ , $σ_{f}^{2}$ , $σ_{s}^{2}$ , and $σ_{e}^{2}$ as defined previously in M2 and M1. Note that h² and H² were reported at the plot level and not for family means as sometimes are used in plant breeding.

Model 3 (M3)

The M3 (Equation 3) is the most developed model in terms of definition of effects as it extends M2 by including a genomic [SNPs] × ECs interaction effect (gw), and therefore, it can be viewed as a linear RN model where the genetic and environmental gradients are specified as regressions on markers and ECs. As pointed in Jarquin et al. (2014) and Pérez-Rodríguez et al. (2015), the M3 uses the (co)variance patterns induced by linear-by-linear reaction norm, where the intercept of the reaction norm are implicitly in the main effects of lines and environment and the slope in the genomic [SNPs] × ECs interaction effect. The M3 was defined as:

y_{n} = X b + Z_{1} l + Z_{2} f + Z_{3} s + Z_{4} g + Z_{5} g w + e (3)

where y_n, X, Z_n, b, l, f, s, g, and e were defined as in M1 and M2; gw is a vector of genomic [SNPs] additive × ECs interaction values with $gw \sim M V N (0, (Z_{g} {GZ}_{g}^{'}) ° Ω σ_{g w}^{2})$ , where Z_g is a design matrix connecting phenotypic observations with cultivars, G is the GRM, andΩ is the (co)variance matrix computed as in De Los Campos et al. (2020) and derived from ECs as $Ω = \frac{{WW}^{'}}{q}$ , where W_nxq is a matrix of centered and scaled ECs with n rows (number of phenotypic observations) and q columns (number of ECs). The operator “°” represents the cell-by-cell (or Hadamard) product between matrices, and $σ_{g w}^{2}$ is the variance term associated with the genomic [SNPs] × ECs interaction effect.

Note that the inclusion of the line × environment interaction (f) and the genomic [SNPs] × ECs interaction effects together in M3 is justified since ECs do not capture all the environmental variability, and therefore, this definition of interactions with the environment contribute to specifying what is not captured by ECs. A similar concept is also extended to the main line effect (L) effect, which are present in all the defied models and are intended to cover the misspecification of the SNPs on genetic effects. The inclusion of an additional model term defining the main effect of ECs have also been used in previous works (Jarquin et al., 2014; Pérez-Rodríguez et al., 2015; De Los Campos et al., 2020); however, such an effect is not possible to estimate in our case since the main environmental effect is implicitly defined and estimated by the fixed effects of our models.

All models presented so far have been implemented in a single-trait (ST) approach for grain yield and protein content. In the following section we summarize the extension of ST models to the MT case.

Multi–trait models (M4-M5)

The M2 and M3 were extended to the MT case by M4 and M5 (Equation 4), respectively. The MT models considered grain yield and protein content in a bivariate analysis. The MT methodology allows exploiting between traits (co)variances for the different model effects, which modified the definition of the random effects from ST models. Following, we present the definition of the effects for the M5 (the most developed model in terms of effects and traits included):

[\begin{matrix} y_{1} \\ y_{2} \end{matrix}] = [\begin{matrix} X_{1} & 0 \\ 0 & X_{2} \end{matrix}] [\begin{matrix} b_{1} \\ b_{2} \end{matrix}] + [\begin{matrix} Z_{1} & 0 \\ 0 & Z_{2} \end{matrix}] [\begin{matrix} l_{1} \\ l_{2} \end{matrix}] + [\begin{matrix} Z_{3} & 0 \\ 0 & Z_{4} \end{matrix}] [\begin{matrix} f_{1} \\ f_{2} \end{matrix}]

+ [\begin{matrix} Z_{5} & 0 \\ 0 & Z_{6} \end{matrix}] [\begin{matrix} s_{1} \\ s_{2} \end{matrix}] + [\begin{matrix} Z_{7} & 0 \\ 0 & Z_{8} \end{matrix}] [\begin{matrix} g_{1} \\ g_{2} \end{matrix}]

+ [\begin{matrix} Z_{9} & 0 \\ 0 & Z_{10} \end{matrix}] [\begin{matrix} g w_{1} \\ g w_{2} \end{matrix}] + [\begin{matrix} e_{1} \\ e_{2} \end{matrix}] (4)

where y_n, X, Z_n, b, l, f, s, g, gw are the same as defined in the previous models for grain yield and protein content, except for the random effect variances which under the MT framework becomes: $[\begin{array}{l} l_{1} \\ l_{2} \end{array}]$ ∼NIID(0,I⊗L), where ⊗ denotes the Kronecker product, and L the line (co)variance matrix $[\begin{array}{l} σ_{l 1}^{2} σ_{l 12}^{2} \\ σ_{l 21}^{2} σ_{l 2}^{2} \end{array}]; [\begin{array}{l} f_{1} \\ f_{2} \end{array}]$ ∼NIID(0,I⊗F) with F the line × environment (co)variance matrix $[\begin{matrix} σ_{f 1}^{2} σ_{f 12}^{2} \\ σ_{f 21}^{2} σ_{f 2}^{2} \end{matrix}]$ ; $[\begin{array}{l} s_{1} \\ s_{2} \end{array}]$ ∼MVN(0,S⊗H) with H the spatial (co)variance matrix $[\begin{matrix} σ_{s 1}^{2} σ_{s 12}^{2} \\ σ_{s 12}^{2} σ_{s 2}^{2} \end{matrix}]$ ; $[\begin{array}{l} g_{1} \\ g_{2} \end{array}]$ ∼MVN(0,G⊗K) with K the genomic (co)variance matrix $[\begin{matrix} σ_{g 1}^{2} σ_{g 12}^{2} \\ σ_{g 12}^{2} σ_{g 2}^{2} \end{matrix}]$ ; $[\begin{matrix} g w_{1} \\ g w_{2} \end{matrix}] \sim M V N (0, (Z_{g} {GZ}_{g}^{^{'}}) ° Ω \otimes P)$ with P the genomic [SNPs] × ECs (co)variance matrix $[\begin{matrix} σ_{g w 1}^{2} σ_{g w 12}^{2} \\ σ_{g w 12}^{2} σ_{g w 2}^{2} \end{matrix}]$ , and $[\begin{array}{l} e_{1} \\ e_{2} \end{array}]$ ∼NIID(0,I⊗R) with R the residual (co)variance matrix $[\begin{matrix} σ_{e 1}^{2} σ_{e 12}^{2} \\ σ_{e 21}^{2} σ_{e 2}^{2} \end{matrix}]$ . Moreover, for MT models (M4-M5), we calculated the between-trait correlation for all model effects; for example, the genetic correlation was: $C O V_{g 12} / \sqrt{σ_{g 11}^{2} σ_{g 22}^{2}}$ , where COV_g12 is the between-trait covariance for the genomic effect, and $σ_{g 11}^{2}$ and $σ_{g 22}^{2}$ are the variances associated to grain yield and protein content, respectively.

A summary of the effects included in the models and the ST or MT case is presented in Table 2. All proposed models were developed under the context of a multivariate normal distribution assuming Gaussian priors for random effects and the analyses were performed using the BGLR R package (Pérez and de los Campos, 2014) with 50,000 iterations, burn-in of 10,000, and a thinning of 10. The initial specifications in BGLR were set to an R² = 0.90 to approximate the R² expected of our models (known from previous data analysis using Average Information Restricted Maximum Likelihood, Cericola et al., 2017; Raffo et al., 2022) and degree of freedom (df) = 0.0001 to set uninformative priors. The posterior standard deviation (PSD) for VCs estimates was computed. In addition, the convergence on parameter estimation was analyzed using the package CODA in R (R Core Team, 2014) by estimating the Monte Carlo standard errors (MCMC error), effective sample (ESS) and by graphical analysis of Markov chains trace plots and posterior density plot. The convergence analysis is presented in Supplementary material 2.

TABLE 2

Table 2. Summary of the effect included in the models and single-trait (ST) or multi-trait (MT) case.

Cross-validation analysis and model validation

The predictive performance of the proposed models was evaluated using two CV approaches. The CV1 (leave-one-year-location-out) was carried out by masking the phenotypes of one year-location in the validation set and using the remaining phenotyped lines to predict the masked lines (Supplementary material 3). This process was repeated n times (n = number of year-locations = 17) until predictions for all year-locations were obtained. The CV1 simulates the prediction problem where breeding lines have been tested in some environments but not in others, and the genetic values for lines in the untested environment are desired. The CV2 (leave-one-breeding-cycle-out) was carried out by masking the phenotypes of one BC in the validation set and using the remaining phenotyped lines to predict the masked lines (Supplementary material 3). This process was repeated n-times (n = no. of breeding cycles = 7) until all BCs were predicted. The CV2 simulates the prediction problem where a new generation (newly developed lines) is predicted from parental and historical records. In addition, the MT models (M4-M5) were evaluated using TA-GP, where the phenotypic data in CV1 and CV2 was masked only for one of the traits in the validation set, and the phenotypes for the second trait were available for all lines.

The predictive ability (PA, $r_{\hat{g}, p}$ ) of the models was calculated as the Pearson correlation between the average value of lines in each year-location after correcting by fixed effects and the vector of predictions [ $ρ (\bar{y}, \hat{g_{g}})$ and $ρ (\bar{y_{c}}, \hat{g_{g}} + \hat{g_{g w}})$ ]. The fixed effects were estimated for each model using the full dataset in order to obtain as accurate estimates as possible. The lines corrected by fixed effects were computed subtracting the fixed effects from each corresponding plot observation, and averaging the resulting lines values within year-locations. The PA was obtained for predictions of the main additive effect [ $ρ (\bar{y_{c}}, \hat{g_{g}})$ ] of M2, M3, M4, and M5, and for predictions of the main additive effect plus the genomic [SNPs] × ECs interactions effect [ $ρ (\bar{y_{c}}, \hat{g_{g}} + \hat{g_{g w}})$ ] for M3 and M5. We used an ordinary non-parametric bootstrap with replacement based on full sample size (n = 2,015), and 10,000 replicates to obtain PA standard errors. The PA between models was contrasted using a two-tailed paired t-test (critical P-value = 0.01). The maximum PA was calculated for g and g + gw predictions in the different models; for M2, M4 it was calculated as $\sqrt{n h_{f}^{2} / (1 + (n - 1) h_{f}^{2})}$ , where n was the average number of lines repetitions within year-location, $h_{f}^{2}$ was the family-based heritability and hence is affected by the experimental design (number of replications). For M3 and M5 it was calculated using the same formula but substituting $h_{f}^{2}$ with the proportion of total variance explained by the genomic additive plus the genomic interaction effects (genomic [SNPs] × ECs interaction).

A test for variance inflation in the predicted genetic effects was performed as the slope of the regression of predicted values obtained with whole information (subscript w, estimations with complete phenotypic information for all genotypes) on the estimated with partial information (subscript p, predictions for all genotypes from CVs when their phenotypes were masked): $b_{w, p} = \frac{c o v (\hat{u_{w}} \hat{u_{p}})}{v a r (\hat{u_{p}})}$ (Legarra and Reverter, 2018).

Results

Phenotyping

The descriptive statistics for grain yield and protein content are shown in Table 3. In total, 2,015 lines and 14,430 plot observations were obtained, and normal distribution was observed for both traits. Grain yield had an average value of 8.85 kg grain/8.25 m² with a coefficient of variation of 11.63%. Protein content had an average value of 9.84%, with a coefficient of variation of 8.11%. The between traits correlation based on the observed phenotypes was in general negative for the different breeding cycles and had an average of −0.18 (Table 3). The variance for descriptive statistics was 1.06 for grain yield and 0.64 for protein content.

TABLE 3

Table 3. Descriptive statistics for the grain yield and protein content of F6 wheat breeding lines.

Variance components, heritability and trait correlations

Five proposed models differing in the incorporation (M2–M5) or not (M1) of genomic information, the incorporation of genomic [SNPs] × ECs interaction (M3 and M5), and the single-trait (M1, M2 and M3) or multi-trait case (M4 and M5) were used to estimate VCs for grain yield (Table 4) and protein content (Table 5).

TABLE 4

Table 4. Posterior mean of variance components for grain yield (kg grain/8.25 m²).

TABLE 5

Table 5. Posterior mean of variance components for protein content (%).

The phenotypic variances ( $σ_{P}^{2}$ ) for the different models represented a proportion of 31.9–35.8 % and 34.1–38.4 % of the descriptive statistical variance for grain yield and protein content, respectively. The difference between the variance from descriptive statistics and $σ_{P}^{2}$ can be attributed to the considerable amount of variation captured by the fixed effects. The M1 (model without genomic information) captured lower variance related to the main genetic effect than models including the additive genomic [SNPs] term. For example, comparing line variance from M1 to line plus genomic variance from M2, the M2 captured around 17 and 33 % more genetic variance for grain yield and protein content, respectively; this is attributed to correctly accounting for covariances among lines when the additive genomic [SNPs] effect is included in the models. The narrow- and broad-sense heritabilities at plot level were estimated using M2, and higher values were observed for protein content: h² = 0.282 (PSD = 0.025) and H² = 0.414 (PSD = 0.017) than for grain yield: h² = 0.168 (PSD = 0.020), H² = 0.307 (PSD = 0.153). The G×E variability was accounted for the line × environment (f) and genomic [SNPs] × ECs interaction (gw) effects, and a higher amount of G×E variance was observed for grain yield (~40% of $σ_{P}^{2}$ ) than for protein content (~ 17% of $σ_{P}^{2}$ ).

The VCs estimates for single-trait models M2 and M4 were similar to MT models M3 and M5, respectively. The between-trait correlations were analyzed for all terms in M3 and M5 (Table 6), and negative correlations were observed for all the effects. High negative between-trait correlations (>0.40) were observed for the main genetic model effects [line (l), additive (g)], and the interaction effects (line × environment (f), and genomic [SNPs] × ECs interaction (gw).

TABLE 6

Table 6. Between-trait correlations for model effects of multi-trait models (M4 and M5).

The MCMC errors, effective sample (ESS), and graphical analysis of trace and posterior density plots were performed as control of convergence for VCs estimates (Supplementary material 2). The parameters estimated for all models had good convergence as revealed by the MCMC error in the order of 1 × 10^–4 or lower, the high ESS, and the appropriate trace and posterior density plots.

Predictive ability of genomic predictions

The PAs ( $r_{\hat{g}, p}$ ) of M2 to M5 in leave-one-year-location-out (CV1) and leave-one-breeding-cycle-out (CV2) are shown in Figure 4 for grain yield and Figure 5 for protein content.

FIGURE 4

Figure 4. Barplot of predictive abilities (PAs) for grain yield in leave-one-year-location-out (CV1, upper panel) and leave-one-breeding-cycle-out (CV2, lower panel) cross-validations. M2: line + genomic [SNPs] additive effect + spatial effect + line × environment interaction. M3 expand M2 by adding a genomic [SNPs] additive × ECs interaction. M4 expand M2 to the multi-trait case. M4-TA is the M4 using trait-assisted (TA) genomic prediction. M5 expand M3 to the multi-trait case. M5-TA is the M4 using TA genomic prediction. Black bars are the 95% confidence interval. Differences in the letter above the bar represent significant differences between models (P-value < 0.01). Green lines are the theoretical maximum PAs.

FIGURE 5

Figure 5. Barplot of predictive abilities (PAs) for protein content in leave-one-year-location-out (CV1, upper panel) and leave-one-breeding-cycle-out (CV2, lower panel) cross-validations. M2: line + genomic [SNPs] additive effect + spatial effect + line × environment interaction. M3 expand M2 by adding a genomic [SNPs] additive × ECs interaction. M4 expand M2 to the multi-trait case. M4-TA is the M4 using trait-assisted (TA) genomic prediction. M5 expand M3 to the multi-trait case. M5-TA is the M4 using TA genomic prediction. Black bars are the 95% confidence interval. Differences in the letter above the bar represent significant differences between models (P-value < 0.01). Green lines are the theoretical maximum PAs.

The PA in CV1 varied from 0.296 to 0.528 for grain yield and 0.463 to 0.601 for protein content. A significant improvement in PA was observed for using predictions of the main additive effect plus the genomic [SNPs] × ECs interaction effect (g + gw) compared to using only predictions from the main additive effect (g). The inclusion of the gw effect represented a significant increase of 16.4% for grain yield and 7.1% for protein content (M3 compared to M2). The MT models performing TA-GP (M4 and M5) showed a significant improvement in PA compared to the ST models and MT models without TA-PA. The TA-GP showed a significant PA increase of 37.9% for grain yield and 17.2% for protein content (M5 compared to M3). The highest PAs were obtained for both traits when combined TA-GP and G×E modeling through the genomic [SNPs] × ECs interaction effect in M5 (grain yield PA: 0.528, protein content PA: 0.601).

In CV2, the PA showed similar trends as CV1. A significant PA improvement was observed for models using the g + gw predictions compared to models using only predictions from the g effect. Using g + gw predictions represented a significant increase of 40.2% for grain yield and 14.1% for protein content (M3 compared to M2). The TA-GP for MT models provided a significant increase compared to the ST models and MT models without TA-PA, revealing a PA increase of 94.8% for grain yield and 59.7% for protein content (M5 compared to M3). The highest PA was obtained for both traits when TA-GP and G×E modeling through the genomic [SNPs] × ECs interaction effect in M5 (grain yield PA: 0.448, protein content PA: 0.452). In comparison, the CV2 had lower PA than CV1 for both traits, and higher benefits were observed in CV2 for modeling G×E and using TA-GP.

The maximum potential PA followed a similar trend for both traits, and the highest values were observed for the models accounting for genomic [SNPs] × ECs predictions (M3 and M5). The M1 (baseline) was not included in this section because such a model has no PA in CVs due to independence between lines is assumed.

Variance inflation analysis and model validation

The estimates for variance inflation (b_w,p) of the predicted genetic effects (Legarra and Reverter, 2018) for CV1 and CV2 are shown in Table 7 for grain yield and Table 8 for protein content.

TABLE 7

Table 7. Slope of regression (b_w,p) of estimated genetic values with whole information on genetic values with partial information for grain yield in leave-one-year-location-out and leave-one-breeding-cycle-out cross-validations.

TABLE 8

Table 8. Slope of regression (b_w,p) of estimated genetic values with whole information on genetic values with partial information for protein content in leave-one-year-location-out and leave-one-breeding-cycle-out cross-validations.

Predictions for the main additive effect (g) did not present variance inflation in CV1 since b_w,p values were close to one for both analyzed traits. The b_w,p for predictions of the genomic [SNPs] × ECs interaction effect (gw) indicated a moderate over-dispersion for M3 (b_w,p = 0.87) and M5 (b_w,p = 0.85, without TA-GP) for grain yield. The use of TA-GP resulted in a reduction of over-dispersion in predictions of the gw effect for grain yield (M5-TA-GP: b_w,p = 0.92). For protein content the b_w,p values for predictions of the gw effect varied from 0.92 to 1.0, revealing lower over-dispersion than grain yield.

CV2 had higher variance inflation of predicted values than CV1, and trends were similar for both analyzed traits. The predictions of the main additive effect (g) presented the highest variance inflation for M2 (grain yield: b_w,p = 0.80, protein content b_w,p = 0.80) and the MT model M4 without TA-GP (grain yield: b_w,p = 0.80, protein content b_w,p = 0.81). A systematic improvement in the variance inflation for predictions of the main additive effect (g) was observed for both traits when the genomic [SNPs] × ECs interaction was included in the models (b_w,p from 0.80 in M2 to 0.89 in M3 for both traits). The b_w,p for predictions of the genomic [SNPs] × ECs interaction effect (gw) varied from 0.79 to 0.98 for grain yield and 0.86 to 0.98 for protein content, where in general, the lowest values were associated in both traits to M3 and the MT model M4 without TA-GP, and the highest b_w,p to the M5 using TA-GP.

Discussion

In this study, we proposed a growth degree-day (GDD) based reaction norm (RN) methodology to introduce genomic [SNPs] × environmental covariates (ECs) interactions via (co)variance structures in single-trait (ST) and multi-trait (MT) frameworks. The developed models were used for VCs estimation and genomic prediction of grain yield and protein content in a large set of advanced wheat breeding lines from the commercial company Nordic Seed A/S. The proposed models were evaluated using cross-validation (CV) analysis and trait-assisted genomic prediction (TA-GP) in two prediction problems relevant for plant breeding: (i) predicting breeding values for lines that have been tested in some environments but not in others (CV1, leave-one-year-location-out), and (ii) predicting breeding values for new lines across breeding cycles (CV2, leave-one-breeding-cycle-out). In CV1, the RN developed models have the potential to predict in environments where no lines have been tested, and thus, it could be used either to retrieve information from lines whose phenotyping has failed in a target environment or to predict for a completely new environment where no phenotyping has been performed before. Combining the proposed RN methodology and TA-GP proved to be an efficient approach to improve the predictive ability (PA) and reduce variance inflation of grain yield and protein content predictions in both CVs.

Variance components

The VCs were estimated for the different models in ST and MT scenarios for grain yield (Table 4) and protein content (Table 5). A significant proportion of genetic variance was captured for both traits, as it can be observed in the narrow-sense (h²) and broad-sense (H²) heritabilities for M2. In agreement with previous studies using Nordic Seed A/S data, higher heritability for protein content than grain yield was observed (Kristensen et al., 2018; Kristensen et al., 2019; Guo et al., 2020; Raffo et al., 2022). The M2 was extended in order to better account for G×E interactions, and the genomic [SNPs] × ECs interaction term (gw) was included in M3. The gw effect captured a significant proportion of variance for grain yield, and models including the G×E interaction terms f (line × environment) and gw (M3 and M5) captured higher proportion of G×E variance than models only including the f effect (M1, M2 and M4). The higher G×E variance captured by M3 and M5 can be likely attributed to better accounting for genetic and environmental covariances between lines. The trends were similar for protein content, but the variance for G×E interaction was considerably lower than for grain yield. As reported in previous studies (Jarquin et al., 2014; Pérez-Rodríguez et al., 2015; De Los Campos et al., 2020), the modeling of gw explain a limited proportion of G×E variance. It can happen due to SNPs and ECs not fully capturing the additive genetic effects (due to incomplete LD of QTLs and SNPs) and the environmental variation (e.g., due to distance between weather stations and the test field), respectively. To deal with this issue, the line × environment interaction effect is defined in the models to account for the mentioned misspecifications. In addition, another model term specifying genomic [SNPs] × environment interactions have been considered in the models; however, no significant contribution was observed and data has not been displayed since such an effect does not have the capacity of performing predictions in CV1. The inclusion of an additional term defining the main effect of ECs has been used in previous studies (Jarquin et al., 2014; Pérez-Rodríguez et al., 2015; De Los Campos et al., 2020); nevertheless, such an effect was not possible to estimate in our case as the main environmental effect is already defined at the level of fixed effects. Lastly, a significant proportion of spatial variance ( $σ_{s}^{2}$ ) was observed for both traits, revealing a high spatial variability in the experimental fields. Our proposed methodology for modeling spatial effects has the advantage of not assuming any gradient or specific pattern across the field, allowing for spatial heterogeneity in any direction. This can present an advantage compared to previous methodologies that use X and Y coordinates (row-columns) as covariates to model spatial variation, as they may assume a gradient in X and Y directions (Bernal-Vasquez et al., 2014; Cericola et al., 2017); however, although it can be practical given its simplicity, the spatial variation can generally vary randomly without following a gradient across the field. The estimate obtained for $σ_{s}^{2}$ of grain yield in our study was consistent with the observed for a similar dataset and spatial effect definition in Raffo et al. (2022), and it was higher than in Cericola et al. (2017), where spatial effects were modeled in a row-column setting for a subset of our data. The differences between studies could be, at least in part, attributed to the mentioned differences in model assumptions. Given the significant proportion observed for $σ_{s}^{2}$ in our study, the inclusion of spatial effects in selection models is justified as it may help to improve the model specification.

The MT models showed similar VCs estimates to ST models. In general, high negative between-trait correlations were observed for the different model effects (Table 6). From a breeding perspective, the negative genetic correlation between grain yield and protein content implies an unfavorable response in one trait when selecting on another (Falconer and Mackay, 1996). The negative genetic trait correlations can be due to pleiotropy or tight gene linkage disequilibrium (Chen and Lübberstedt, 2010). Understanding the genetic basis for the negative correlation can help to define an appropriate breeding program and avoid affecting long-term breeding prospects. Note that for pleiotropic gene effect it can be recommended to select against those genes, while for undesirable linkage between genes the strategy would aim to break the linkage through recombination (Falconer and Mackay, 1996; Lynch and Walsh, 1998).

Genomic predictive ability

The PA of the proposed models (Figures 4, 5) was calculated as the correlation between the average value of lines in each year-location after correcting by fixed effects and the vector of predictions (prediction of g effect for M2 to M5; predictions of g + gw effects for M3 and M5). The predictions for the main additive effect (g) had the lowest values of PAs, and they were significantly outperformed by the models accounting for the genomic [SNPs] × ECs interaction when the sum of g + gw effects was used. In CV1, using the sum of g + gw effects results in an increase of 16.4% for grain yield and 7.1% for protein content (comparing M3 to M2). Predictions for g + gw in CV1 are year-location specific and cannot be generalized to future years. In CV2, similar trends to CV1 were observed, but the increments for including the gw predictions were considerably higher than in CV1, representing a 40.2% and 14.1% increase for grain yield and protein content, respectively (comparing M3 to M2). As shown in Figures 4, 5, the PAs were higher for protein content than for grain yield. These results could be related to the underlying architecture of the traits (Momen et al., 2018) and the differences found in h², where for the lower h² values as in grain yield, lower PA is expected (Wimmer et al., 2013; Muranty et al., 2015).

The CV2 had lower PAs for all models. The decline in PA can be explained by the higher genetic/pedigree distance between the training and validation population in CV2 (Habier et al., 2007; Wolc et al., 2011); while in CV1, the same line can be present in different year-locations folds, and full sibs are included in the training and validation sets, for CV2, the replication of the same line or close relatives are not shared between training and validation sets. This fact can also be evidenced in the average genomic relationships between reference and validation sets for CV1 and CV2. For example, the average of genomic relationships for reference/validation sets in CV2 was −0.009 with a standard deviation (SD) of 0.002; the negative value of relationships can be interpreted as lines less related than average. Conversely, in CV1, lines from the same breeding cycle were included in reference/validation sets, and the average genomic relationship within breeding cycles was 0.055 (SD: 0.015). The positive value obtained for relationships within breeding cycles can be interpreted as lines more related than average. As reported by Habier et al. (2007), the decline in PA can result from the break-up of linkage disequilibrium between SNPs and QTL between the training and validation populations when relationships become more distant.

Multi-trait and trait-assisted genomic prediction

The MT models were evaluated for GP with and without TA-GP. The PAs of MT models without TA-GP were similar to ST models for the two CVs and traits (Figures 4, 5). Similar results have been found for wheat (Lado et al., 2018; Kristensen et al., 2019), and other species (Jia and Jannink, 2012; Schulthess et al., 2016), and it has been associated with high trait complexity and low traits heritabilities (Jia and Jannink, 2012; Lado et al., 2018). Conversely, MT models with TA-GP significantly improve PA compared to ST models and MT models without TA-GP. The superior performance of MT models with TA-GP can be explained by the use of “borrowed” information from the trait with complete phenotypic records, which is possible due the use of the between-trait correlations. The highest benefits for TA-GP were observed for grain yield PA, where it significantly increased (P-value < 0.01) by 14.0% (M2 compared to M4-TA) and 37.9% (M3 compared to M5-TA) for CV1, and by 57.9% (M2 compared to M4-TA) and 94.8% (M3 compared to M5-TA) for CV2. For protein content, the benefits provided by TA-GP were lower but still significant (P-value < 0.01), increasing PA by 5.4% (M2 compared to M4-TA) and 14.3% (M3 compared to M5-TA) for CV1, and by 28.2% (M2 compared to M4-TA) and 59.7% (M3 compared to M5-TA) for CV2. As reported by Jia and Jannink (2012) and Guo et al. (2014), the highest benefits conferred by the MT approach are expected for low heritability traits when they are used together with high heritability traits. Accordingly, our study empirically addressed this issue by combining a low heritability trait as grain yield with an intermediate to high heritability trait as protein content in bivariate analysis. Our results were consistent with the literature, revealing the highest benefits for the lowest heritability trait in all scenarios. Another factor contributing to the large benefit observed in TA-GP is the substantial correlation observed between grain yield and protein content (from −0.404 to −0.687 for genetic and G×E effects, Table 6). High between-trait correlations are statistically useful since measurements of one trait can more informative on the genetic values of the other correlated traits (Henderson and Quaas, 1976; Schaeffer, 1984; Thompson and Meyer, 1986; Montesinos-Lopez et al., 2016).

Inflation of variance

The variance inflation (b_w,p) for prediction of the main additive (g) and the genomic [SNPs] × ECs (gw) interaction effects in CV1 and CV2 are shown in Tables 7, 8. The CV1, did not present variance inflation for g predictions as expected due to the high information in the training population for the g effect. However, gw predictions revealed over-dispersion in CV1. A possible explanation for this can be associated with the design of CV1, where no phenotypes for the environment predicted are kept in the training population, and therefore, differences in the QTL effect/expression in the specific environment may not be captured by SNPs, resulting in inflation of the predicted effects. In CV2, over-dispersion was observed in g and gw predictions for both traits. This was expected as the genetic relationships between training and validation population are low, and over-dispersion due to genomic erosion by recombination can occur (Dekkers et al., 2021). Despite this, as described in the “Result” section, the over-dispersion of g predictions was reduced when the gw effect was included in the models. In addition, a reduction in the over-dispersion of g and gw predictions was observed for both traits in the two CVs when TA-GP was used in MT models. The benefits of using TA-GP can be related to a reduction of genomic erosion since phenotypic records of the additional trait are complete, and multi-trait modeling allows to indirectly capture the genetic effects for the trait of interest via the between-trait genetic correlations. The reduction in variance inflation conferred by TA-GP is consistent with previous studies, as reported by Pollak et al. (1984); Jia and Jannink (2012), Hayes et al. (2017), and Lado et al. (2018).

Final remarks and future prospects

The proposed GDD based methodology provides a simplified approach to incorporate environmental information into prediction models and assists in the implementation of multi-trait reaction norm models. Our study showed that the weather and soil information was efficiently exploited without explicitly linking it to specific critical crop periods, which might open an opportunity to work with traits and species where the relationships between the environmental conditions and resulting phenotype are not fully established. A future avenue for developing the proposed models could be exploiting the potentiality of reaction norms in predicting G×E for future years (e.g., using average weather or soil variables to infer future years). The developed models work under the Bayesian Ridge Regression (BRR) framework and assume Gaussian distribution for random effects, implying no specific QTLs or ECs with large effects. Further studies exploring alternative methodologies capable of performing differential shrinkage or variable selection for genetic and environmental effects is warranted.

Conclusion

In this study, we proposed a growth degree-day (GDD) based reaction norm methodology to introduce genomic [SNPs] × environmental covariates interactions through (co)variance structures in a single-trait or multi-trait framework. The growth degree-day based methodology provides a simplified approach to introduce environmental information in prediction models and assists in the implementation of multi-trait reaction norm models. The modeling of genomic [SNPs] × environmental covariates interactions, and the use of trait-assisted genomic prediction in multi-trait models, significantly enhanced the predictive ability and reduced variance inflation in the predicted genetic values for grain yield and protein content in leave-one-year-location-out (CV1) and leave-one-breeding-cycle-out (CV2) cross-validations. The genotype by environment interaction modeling via genomic [SNPs] × environmental covariates interactions, combined with trait-assisted genomic prediction, boosted the benefits in predictive performance.

Data availability statement

The original contributions presented in this study are publicly available. This data can be found here: https://doi.org/10.7910/DVN/WEXJWW.

Author contributions

JJ, MR, PS, and JA: conceptualization. MR, JA, PS, and JO: data curation. MR: formal analysis. JJ, MR, JA, JO, and AJ: funding acquisition. MR and JJ: investigation, methodology, and project administration. JJ and PS: resources and supervision. MR: software, validation, visualization, and writing—original draft. JJ, PS, JA, JO, and AJ: writing—review and editing. All authors contributed to the article and approved the submitted version.

Funding

This research received funding from the National Research and Innovation Agency (ANII) of Uruguay (POS_EXT_ 2018_1_154284) and the IFD RadiBooster project (9090-00052A) for funding contributions of JJ. We would like to thank Aarhus University for financial support for MR and JJ, and the commercial partner Nordic Seed A/S for PS, JA, JO, and AJ funding and for providing the data. The authors declare that this study also received funding from Nordic Seed A/S. The funder had the following involvement in the study: experimental design, data collection and analysis, and reviewing of previous manuscript versions.

Acknowledgments

We would like to thank the two reviewers for their insightful comments and suggestions on earlier versions of the manuscript. We would also like to thank the Danish Meteorological Institute and Dr. Finn Plauborg, Dr. Mette Balslev Greve, and Dr. Johannes Pullens from the Department of Agroecology of Aarhus University for providing climatic data, soil data, and simulated maturity dates, respectively.

Conflict of interest

PS, JA, JO, and AJ were employed by Nordic Seed A/S.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2022.939448/full#supplementary-material

References

Adhikari, K., Kheir, R. B., Greve, M. B., Bøcher, P. K., Malone, B. P., Minasny, B., et al. (2013). High-resolution 3-D mapping of soil texture in Denmark. Soil Sci. Soc. Am. J. 77, 860–876.

Integrating a growth degree-days based reaction norm methodology and multi-trait modeling for genomic prediction in wheat

Introduction

Materials and methods

Plant material and phenotyping

Genotyping

Environmental data and environmental covariates

Statistical analysis

Model 1 (M1, baseline model)

Model 2 (M2)

Model 3 (M3)

Multi–trait models (M4-M5)

Cross-validation analysis and model validation

Results

Phenotyping

Variance components, heritability and trait correlations

Predictive ability of genomic predictions

Variance inflation analysis and model validation

Discussion

Variance components

Genomic predictive ability

Multi-trait and trait-assisted genomic prediction

Inflation of variance

Final remarks and future prospects

Conclusion

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

Supplementary material

References

95% of researchers rate our articles as excellent or good