ORIGINAL RESEARCH article

Front. Plant Sci., 15 May 2024
Sec. Functional and Applied Plant Genomics
This article is part of the Research Topic Insights in Functional and Applied Plant Genomics: 2023

Feature engineering of environmental covariates improves plant genomic-enabled prediction

Osval A. Montesinos-López1, Leonardo Crespo-Herrera2, Carolina Saint Pierre2, Bernabe Cano-Paez3, Gloria Isabel Huerta-Prado4, Brandon Alejandro Mosqueda-González5, Sofia Ramos-Pulido6, Guillermo Gerard2, Khalid Alnowibet7, Roberto Fritsche-Neto8, Abelardo Montesinos-López6*, José Crossa2,8,9,10*†
  • 1Facultad de Telemática, Universidad de Colima, Colima, Mexico
  • 2International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Edo. de Mexico, Mexico
  • 3Facultad de Ciencias, Universidad Nacional Autónoma de México (UNAM), México City, Mexico
  • 4Independent consultant, Zinacatepec, Puebla, Mexico
  • 5Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), México City, Mexico
  • 6Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, Jalisco, Mexico
  • 7Department of Statistics and Operations Research, King Saud University, Riyadh, Saudi Arabia
  • 8Louisiana State University, Baton Rouge, LA, United States
  • 9Distinguished Scientist Fellowship Program, King Saud University, Riyadh, Saudi Arabia
  • 10Instituto de Socioeconomía, Estadística e Informática, Colegio de Postgraduados, Montecillos, Edo. de México, Texcoco, Mexico

Introduction: Because genomic selection (GS) is a predictive methodology, it must deliver high prediction accuracy to be practical to implement. However, because many factors affect the prediction performance of this methodology, its practical implementation still needs improvement in many breeding programs. For this reason, many strategies have been explored to improve its prediction performance.

Methods: When environmental covariates are incorporated as inputs in genomic prediction models, this information does not always help increase prediction performance. For this reason, this investigation explores the use of feature engineering on the environmental covariates to enhance the prediction performance of genomic prediction models.

Results and discussion: We found that, across data sets, feature engineering reduced prediction error by 761.625% on average across predictors, relative to including the environmental covariates without feature engineering. These results are very promising regarding the potential of feature engineering to enhance prediction accuracy. However, since a significant gain in prediction accuracy was observed in only some data sets, further research is required to guarantee a robust feature engineering strategy for incorporating the environmental covariates.

Introduction

The global population’s rapid growth is increasing food demand, but climate change impacts crop productivity. Plant breeding is essential for high-yield, quality cultivars. Wheat production soared from 200 million tons in 1961 to 775 million tons in 2023 without expanding cultivation, thanks to improved cultivars and agricultural practices (FAO, 2023). Traditional methods used pedigree and observable traits, but DNA sequencing introduced genomic insights. Genomic selection (GS) relies on DNA markers, offering advantages over traditional methods (Crossa et al., 2017).

Numerous studies have investigated the efficacy of GS compared to traditional phenotypic selection across various crops and livestock. Butoto et al. (2022) observed that both GS and phenotypic selection were equally effective in enhancing resistance to Fusarium ear rot and reducing fumonisin contamination in maize. Similarly, Sallam and Smith (2016) demonstrated that integrating GS into barley breeding programs targeting yield and Fusarium head blight (FHB) resistance yielded comparable gains in selection response to traditional phenotypic methods. Moreover, GS offered the added benefits of shorter breeding cycles and reduced costs. In contrast, research in maize breeding conducted by Beyene et al. (2015) and Gesteiro et al. (2023) revealed that GS outperformed phenotypic selection, resulting in superior genetic gains. These comparative findings underscore the considerable advantages of GS in optimizing breeding outcomes across diverse agricultural settings.

GS revolutionizes plant and animal breeding by leveraging high-density markers across the genome. It operates on the principle that at least one genetic marker is in linkage disequilibrium with a causative QTL (Quantitative Trait Locus) for the desired trait (Meuwissen et al., 2001). This method transforms breeding in several ways: a) Identifying promising genotypes before planting; b) Improving precision in selecting superior individuals; c) Saving resources by reducing extensive phenotyping; d) Accelerating variety development by shortening breeding cycles; e) Intensifying selection efforts; f) Facilitating the selection of traits difficult to measure; g) Enhancing the accuracy of the selection process (Bernardo and Yu, 2007; Heffner et al., 2009; Desta and Ortiz, 2014; Abed et al., 2018; Budhlakoti et al., 2022).

The GS methodology, embraced widely, expedites genetic improvements in plant breeding programs (Desta and Ortiz, 2014; Bassi et al., 2016; Xu et al., 2020). Utilizing advanced statistical and machine learning models (Montesinos-López et al., 2022), GS efficiently selects individuals within breeding populations. Deep learning, a subset of machine learning, has also shown promise in GS (Montesinos-López et al., 2021; Wang et al., 2023). This selection process relies on data from a training population, encompassing both phenotypic and genotypic information (Crossa et al., 2017).

The Deep Neural Network Genomic Prediction (DNNGP) method of Wang et al. (2023) represents a novel deep-learning approach to genomic prediction. The authors compared DNNGP with other genomic prediction methods for various traits using genotypic and transcriptomic maize data. They demonstrated that DNNGP outperformed GBLUP in most datasets. For instance, for the maize days-to-anthesis (DTA) trait, DNNGP showed superiority over GBLUP of 619.840% and 16.420% using gene expression and single nucleotide polymorphism (SNP) data, respectively. When utilizing genotypic data, DNNGP achieved a prediction accuracy of 0.720 for DTA, while GBLUP reached 0.580. However, the study found varied patterns in prediction accuracy for other traits.

Following rigorous training, these models utilize genotypic data to predict breeding or phenotypic values for traits within a target population (Budhlakoti et al., 2022). The GS methodology is versatile, accommodating various scenarios including multi-trait considerations (Calus and Veerkamp, 2011), known major genes and marker-trait associations, Genotype × Environment interaction (GE) (Crossa et al., 2017), and integration of other omics data (Hu et al., 2021; Wu et al., 2022) such as transcriptomics, metabolomics, and proteomics. GE influences phenotypic trait values across diverse environments, underscoring its importance in association and prediction models. Jarquin et al. (2014) introduced a framework significantly improving prediction accuracy in the presence of GE, yet without considering environmental covariates. To enhance accuracy further, recent studies are integrating environmental information into genomic prediction models.

The Jarquin et al. (2014) framework lacks consideration of environmental covariates, prompting recent studies to integrate such information to enhance prediction accuracy. For instance, Montesinos-López et al. (2023) and Costa-Neto et al. (2021a, 2021b) demonstrated significant improvements. Conversely, studies by Monteverde et al. (2019), Jarquin et al. (2020), and Rogers et al. (2021) showed modest or negligible enhancements, revealing the ongoing challenge of effectively integrating environmental data into genomic prediction models.

Achieving high prediction accuracy in GS faces significant challenges due to genetic complexities, environmental variations, and data constraints (Juliana et al., 2018). Complex traits involve multiple gene influences, while environmental conditions can alter trait expression (Desta and Ortiz, 2014; Crossa et al., 2017). Phenotyping and marker data quality are critical, and issues like overfitting and population structure can compromise prediction precision (Budhlakoti et al., 2022). Ongoing research focuses on improving models, increasing marker density, and enhancing data quality to refine genomic prediction accuracy (Crossa et al., 2017; Budhlakoti et al., 2022).

Ongoing efforts focus on refining GS accuracy through various optimizations. This includes fine-tuning training and testing sets for improved precision (Rincent et al., 2012; Akdemir et al., 2015). Researchers are also evaluating diverse statistical machine learning methods to develop robust models with minimal fine-tuning yet high accuracy (Montesinos-López et al., 2022). Moreover, integrating additional omics data, such as phenomics and transcriptomics, aims to bolster GS accuracy and identify potent predictors for target traits (Montesinos-López et al., 2017; Krause et al., 2019; Monteverde et al., 2019; Hu et al., 2021; Costa-Neto et al., 2021a, b; Rogers and Holland, 2022; Wu et al., 2022). These endeavors seek to enhance GS predictive capabilities by leveraging diverse information sources.

Feature engineering (FE) is crucial in improving machine learning model performance by selecting, modifying, or creating new features from raw data. It transforms input data into a more representative and informative format, capturing relevant patterns and relationships and enhancing the model’s generalization ability. FE involves various tasks such as selecting optimal features, generating new features, normalization/scaling, handling missing values, and encoding categorical variables. For instance, techniques like Principal Component Analysis (PCA) can transform correlated features into uncorrelated ones (Lam et al., 2017; Dong and Liu, 2018; Khurana et al., 2018). FE’s popularity is rising due to its ability to enhance model performance, extract meaningful information from complex data, improve interpretability, and boost efficiency. Successful implementations include sentiment analysis, image recognition, and predictive maintenance, showcasing FE’s effectiveness across domains (Nargesian et al., 2017; Carrillo-de-Albornoz et al., 2018; Yurek and Birant, 2019). In genomic prediction, FE has also been successful, as demonstrated by Bermingham et al. (2015) and Afshar and Usefi (2020). These examples underscore FE’s critical role in various domains, leading to more accurate machine learning applications (Dong and Liu, 2018).
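
To make two of these routine steps concrete, the minimal Python sketch below standardizes a set of covariates and applies PCA to obtain uncorrelated components. It is purely illustrative, runs on synthetic data, and is not code from this study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                   # 100 samples, 10 raw covariates
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)   # introduce a correlated pair

X_scaled = StandardScaler().fit_transform(X)     # zero mean, unit variance
X_pca = PCA(n_components=5).fit_transform(X_scaled)  # uncorrelated components
print(X_pca.shape)                               # (100, 5)
```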

The impact of feature engineering (FE) on reducing prediction error varies depending on the dataset, problem, and quality of FE. Well-crafted features can notably minimize prediction error in some cases, but the exact improvement is context-specific and not guaranteed. Effective FE can enhance model performance significantly, albeit its extent varies case by case (Heaton, 2016; Dong and Liu, 2018).

To optimize genomic selection’s predictive accuracy, it’s vital to adopt innovative methodologies that account for its multifaceted influences. FE in genomic prediction offers a promising approach by enhancing prediction quality, uncovering genetic insights, customizing models to specific needs, improving interpretability, and minimizing data noise. In this paper, we investigate FE applied to environmental covariates to assess its potential in enhancing prediction performance within the context of genomic selection.

Materials and methods

Dataset USP

The University of São Paulo (USP) Maize, Zea mays L., dataset is sourced from germplasm developed by the Luiz de Queiroz College of Agriculture at the University of São Paulo, Brazil. An experiment was conducted between 2016 and 2017 involving 49 inbred lines, yielding a total of 906 F1 hybrids, of which 570 were assessed across eight diverse environments for grain yield (GY). These environments were created by combining two locations, two years, and two nitrogen levels. However, we specifically used data from four distinct environments for this research, each containing 100 hybrids. It’s important to note that these environments had varying soil types and climatic conditions, and the study integrated data from 248 covariates related to these environmental factors. The parent lines underwent genotyping through the Affymetrix Axiom Maize Genotyping Array, resulting in a dataset of 54,113 high-quality SNPs after applying stringent quality control procedures. Please refer to Costa-Neto et al. (2021a) for further comprehensive information on this dataset.

Dataset Japonica

The Japonica dataset comprises 320 rice (Oryza sativa L.) genotypes drawn from the Japonica tropical rice population. This dataset underwent evaluations for the same four traits (GY, PHR: percentage of head rice, GC: percentage of chalky grains, PH: plant height) as the Indica population, but in this case, it was conducted across five distinct environments spanning from 2009 to 2013. Covariates were meticulously measured three times a year, covering three developmental stages (maturation, reproductive, and vegetative). This dataset comprises a non-balanced set of 1,051 assessments recorded across these five diverse environments. Additionally, each genotype within this dataset was meticulously evaluated for 16,383 SNP markers that remained after rigorous quality control procedures, with each marker being represented as 0, 1, or 2. For more comprehensive information on this dataset, please refer to Monteverde et al. (2019).

Dataset G2F

These three distinct datasets correspond to maize (Zea mays L.) for the years 2014 (G2F_2014), 2015 (G2F_2015), and 2016 (G2F_2016) from the Genomes to Fields maize project (Lawrence-Dill, 2017), as outlined by Rogers and Holland (2022). These datasets collectively encompass a wealth of phenotypic, genotypic, and environmental information. To narrow the focus, our analysis primarily includes four specific traits: Grain_Moisture_BLUE (GM_BLUE), Grain_Moisture_weight (GM_Weight), Yield_Mg_ha_BLUE (YM_BLUE), and Yield_Mg_ha_weight (YM_Weight), carefully selected from a larger pool of traits detailed by Rogers and Holland (2022). Across these three years, the study involves 18, 12, and 18 distinct environments for the years 2014 (G2F_2014), 2015 (G2F_2015), and 2016 (G2F_2016), respectively. Regarding genotype numbers, the dataset for 2014 consisted of 781 genotypes, the dataset for 2015 featured 1,011 genotypes, and the dataset for 2016 comprised 456 genotypes. The analysis relies on 20,373 SNP markers that have already undergone imputation and filtering, following the methodology outlined by Rogers et al. (2021) and Rogers and Holland (2022). Additive allele calls are documented as minor allele counts, represented as 0, 1, or 2. For more detailed insights into these datasets, we recommend consulting the comprehensive description provided in Lawrence-Dill (2017) and Rogers and Holland (2022).

It is worth noting that each data set presents a unique set of environments. However, concerning traits, the G2F_2014, G2F_2015, and G2F_2016 datasets share identical traits, as does the Japonica dataset.

Statistical models

The four predictors under a genomic best linear unbiased predictor (GBLUP; Habier et al., 2007; VanRaden, 2008) model are described below.

Predictor P1: E+G

This predictor is represented as

$$Y_{ij} = \mu + E_i + g_j + \epsilon_{ij}, \qquad (1)$$

where $Y_{ij}$ denotes the response variable in environment $i$ for genotype $j$; $\mu$ denotes the population mean; $E_i$ are the random effects of environments; $g_j$, $j = 1, \dots, J$, denotes the random effects of lines; and $\epsilon_{ij}$ denotes the random error components in the model, assumed to be independent normal random variables with mean 0 and variance $\sigma^2$. In the context of this predictor (E+G), $X$ denotes the matrix of markers and $M$ the matrix of centered and standardized markers. Then $G = MM^T/p$ (VanRaden, 2008), where $p$ is the number of markers. $Z_g$ is the design matrix of genotypes (lines) of order $n \times J$, and $G$ is the genomic relationship matrix computed using markers (VanRaden, 2008). Therefore, the random effect of lines is distributed as $g = (g_1, \dots, g_J)^T \sim N_J(0, \sigma_g^2 Z_g G Z_g^T)$. Model (1) was implemented in the BGLR library of Pérez and de los Campos (2014). The linear kernel matrix for the genotype effect was determined by calculating the “covariance” structure of the genotype predictor ($Z_g g$) as $K_g = Z_g G Z_g^T$.
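
As a minimal NumPy sketch of the genotype kernel just described (our own illustration, not the paper’s BGLR code), the following computes $G = MM^T/p$ from centered and standardized markers and expands it to $K_g = Z_g G Z_g^T$; the toy `markers` and `Zg` inputs are hypothetical.

```python
import numpy as np

def genotype_kernel(markers: np.ndarray, Zg: np.ndarray) -> np.ndarray:
    """K_g = Zg G Zg^T with G = M M^T / p (VanRaden, 2008).
    Assumes no monomorphic markers (zero standard deviation)."""
    M = (markers - markers.mean(axis=0)) / markers.std(axis=0)  # center/standardize
    G = M @ M.T / M.shape[1]       # genomic relationship matrix (J x J)
    return Zg @ G @ Zg.T           # expanded kernel over all observations

# Toy usage: 20 lines, 500 SNPs coded 0/1/2, each line observed in 4 environments.
rng = np.random.default_rng(1)
markers = rng.integers(0, 3, size=(20, 500)).astype(float)
Zg = np.kron(np.ones((4, 1)), np.eye(20))   # (80 x 20) incidence matrix
Kg = genotype_kernel(markers, Zg)           # (80 x 80) genotype kernel
```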

On the other hand, the linear kernel matrix for the environment effect was computed using three different techniques: without environmental covariates (NoEC), with environmental covariates (EC), and with environmental covariates plus feature engineering (FE).

NoEC: Under the NoEC technique, the resulting linear kernel of environments was computed as $K_E = X_E X_E^T / I$, where $I$ denotes the number of environments and $X_E$ the design matrix of environments, containing ones in the positions of the specific environments and zeros elsewhere.

EC: The EC technique involved selecting and scaling the environmental covariates (ECs) that exhibited a relevant Pearson's correlation with the response variable. Covariates were selected if their Pearson's correlation with the response variable exceeded 0.5 in each training set per trait. Notably, covariate selection excluded the response variables of the testing set, which represents the environment to be predicted. Covariates meeting a correlation of at least 0.5 were used; otherwise, lower thresholds such as 0.4 or 0.3 were considered. If correlations fell below even these values, the model was trained without environmental covariates.

∘ The resulting set of selected ECs was then used to compute an environmental linear kernel, denoted $K_{EC}$, of order $I \times I$. From this kernel, the expanded environmental kernel was computed as $K_{E_{EC}} = X_E K_{EC} X_E^T / I$, which was used in the Bayesian model. Each environmental covariate was scaled by subtracting its mean and dividing by its standard deviation.

FE: The feature engineering (FE) technique involved computing mathematical transformations between all possible pairs of ECs (addition, difference, product, and ratio), as well as other commonly used transformations of each individual EC, such as inverses ($1/x$), squares ($x^2$), square roots ($\sqrt{x}$), natural logarithms ($\ln(x)$), and some Box-Cox transformations. The four pairwise transformations each produced $\binom{n_{cov}}{2}$ new covariates, where $n_{cov}$ denotes the number of environmental covariates in each data set, whereas each unary transformation produced one new covariate per EC. The original and new environmental covariates were then concatenated into a single matrix and submitted to the selection process explained above. Under the FE approach, the resulting covariates are used to compute the new environmental kernel matrix ($K_{E_{FE}}$); a sketch of these computations is given below.
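
The sketch below illustrates our reading of the EC and FE pipelines: generate the pairwise and unary transformations (Box-Cox omitted for brevity), scale the covariates, select those whose Pearson correlation with the training response exceeds the threshold (falling back from 0.5 to 0.4 and 0.3), and build the expanded environmental kernel. Function names, the NaN guards for ratios and logarithms, and the use of environment-level mean responses are our own assumptions, not the authors' code.

```python
from itertools import combinations
import numpy as np

def engineer_features(W: np.ndarray) -> np.ndarray:
    """W: I x n_cov matrix of ECs; returns original plus engineered covariates."""
    W = np.asarray(W, dtype=float)
    cols = [W]
    for i, j in combinations(range(W.shape[1]), 2):   # n_cov choose 2 pairs
        a, b = W[:, i], W[:, j]
        ratio = np.divide(a, b, out=np.full_like(a, np.nan), where=b != 0)
        cols += [(a + b)[:, None], (a - b)[:, None],
                 (a * b)[:, None], ratio[:, None]]
    with np.errstate(divide="ignore", invalid="ignore"):
        cols += [1.0 / W, W ** 2, np.sqrt(np.abs(W)),   # unary transforms
                 np.where(W > 0, np.log(np.where(W > 0, W, 1.0)), np.nan)]
    return np.hstack(cols)

def env_kernel(W, y_train, train_envs, XE, thresholds=(0.5, 0.4, 0.3)):
    """W: I x n_cov covariates; y_train: mean response of the training
    environments; train_envs: boolean mask of length I (testing environment
    set to False); XE: n x I environment incidence matrix."""
    W = (W - np.nanmean(W, axis=0)) / np.nanstd(W, axis=0)   # scale covariates
    for thr in thresholds:
        r = np.abs([np.corrcoef(W[train_envs, j], y_train)[0, 1]
                    for j in range(W.shape[1])])
        keep = np.nan_to_num(r) > thr
        if keep.any():
            W_sel = W[:, keep]
            K_EC = W_sel @ W_sel.T / W_sel.shape[1]  # I x I environmental kernel
            return XE @ K_EC @ XE.T / XE.shape[1]    # expanded kernel (n x n)
    return None  # correlations too low: fall back to NoEC

# FE variant: select from the original plus engineered covariates, e.g.
# K_E_FE = env_kernel(engineer_features(W), y_train, train_envs, XE)
```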

Predictor P2: E+G+GE

The E+G+GE predictor is similar to P1 (Equation 1) but also accounts for the differential response of cultivars across environments, that is, GE. This is achieved by taking the element-wise (Hadamard) product of the kernel matrices of the genotype (G) and environment (E) predictors, computed as $K_g \circ K_{E_{NoEC}}$ (for NoEC), $K_g \circ K_{E_{EC}}$ (for EC), or $K_g \circ K_{E_{FE}}$ (for FE), which serves as the kernel matrix for GE. In general, adding the GE interaction to the statistical machine learning model increases genomic prediction accuracy (Jarquin et al., 2014; Crossa et al., 2017). It is also important to point out that, under this predictor (P2), the variance components and heritability of each trait in each data set were obtained under a Bayesian framework using the complete data set (i.e., no missing values allowed). For this computation, all terms were entered into the model as random effects, without taking the environmental covariates into account.
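
As a one-line illustration (our own, assuming both kernels are aligned $n \times n$ matrices), the GE kernel is simply the element-wise product of the genotype and environment kernels:

```python
import numpy as np

def ge_kernel(Kg: np.ndarray, Ke: np.ndarray) -> np.ndarray:
    """K_GE = K_g ∘ K_E: Hadamard (element-wise) product of aligned n x n kernels."""
    return Kg * Ke
```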

Predictor P3: E+G+BRR

The E+G+BRR predictor is similar to P1 (Equation 1) but incorporates the ECs as fixed effects in a Bayesian Ridge Regression (BRR) framework; that is, the regression coefficients are assigned independent and identically distributed normal distributions with mean zero and variance $\sigma_\beta^2$. See details of BRR in Pérez and de los Campos (2014).

Predictor P4: E+G+GE+BRR

The E+G+GE+BRR predictor is similar to P2 but also incorporates ECs as fixed effects in a Bayesian Ridge Regression (BRR) framework (see the Appendix for brief details on BRR). The priors used for GBLUP and BRR in BGLR are the default settings, which are described in detail in Pérez and de los Campos (2014). In this study, we found these default settings to be suitable, as we experimented with various configurations of the prior hyperparameters for the GBLUP and BRR models on the USP and G2F_2014 datasets. Remarkably, all configurations yielded identical predictions. Consequently, for the remaining datasets, we opted to use only the default settings.

Evaluation of prediction performance

The cross-validation approach used in this study involved leaving one environment out. In each iteration, the data from a single environment served as the testing set, while the data from all other environments constituted the training set (Montesinos-López et al., 2022). The number of iterations was equal to the number of environments, so that each environment was used as the testing set exactly once. This method was employed to assess the model’s ability to predict a complete environment using data from the other environments.
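
A skeleton of this leave-one-environment-out scheme is sketched below; `fit` and `predict` are hypothetical placeholders standing in for the BGLR model fitting and prediction used in the paper.

```python
import numpy as np

def leave_one_env_out(y, env_labels, fit, predict):
    """Hold out each environment once; train on the rest and predict it."""
    y = np.asarray(y, dtype=float)
    env_labels = np.asarray(env_labels)
    preds = np.full(len(y), np.nan)
    for env in np.unique(env_labels):      # one iteration per environment
        test = env_labels == env           # testing set: one whole environment
        model = fit(y, train_mask=~test)   # train on all other environments
        preds[test] = predict(model, test_mask=test)
    return preds
```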

To evaluate predictive performance we used the mean squared error (MSE), which quantifies the prediction error by measuring the squared deviation between observed and predicted values on the testing set. The MSE was computed for each scenario evaluated (NoEC, EC, and FE), and the three scenarios were then compared through the following relative efficiencies:

$$RE_{NoEC\_vs\_EC} = \frac{MSE(NoEC)}{MSE(EC)}, \qquad RE_{NoEC\_vs\_FE} = \frac{MSE(NoEC)}{MSE(FE)}, \qquad RE_{EC\_vs\_FE} = \frac{MSE(EC)}{MSE(FE)}$$

$RE_{NoEC\_vs\_EC}$ compares the prediction performance of EC vs. NoEC, $RE_{NoEC\_vs\_FE}$ compares FE vs. NoEC, and $RE_{EC\_vs\_FE}$ compares FE vs. EC. When $RE_{NoEC\_vs\_EC} > 1$, the best prediction performance was obtained by the EC strategy, whereas when $RE_{NoEC\_vs\_EC} < 1$, the NoEC strategy was best; a relative efficiency equal to 1 means both methods had equal prediction performance. The same interpretation applies to the other comparisons.
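
In code, the MSE and relative-efficiency computations reduce to a few lines (a sketch with illustrative numbers, not results from the paper):

```python
import numpy as np

def mse(y_obs, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((np.asarray(y_obs) - np.asarray(y_pred)) ** 2))

mse_noec, mse_ec, mse_fe = 1.20, 0.95, 0.80   # illustrative values only
re_noec_vs_ec = mse_noec / mse_ec   # > 1: EC beat NoEC
re_noec_vs_fe = mse_noec / mse_fe   # > 1: FE beat NoEC
re_ec_vs_fe = mse_ec / mse_fe       # > 1: FE beat EC
```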

Results

The results are given in three sections for the three datasets (Japonica, USP, and G2F_2016). In each section we provide the results for the four predictor models under study (E+G, E+G+GE, E+G+BRR, E+G+GE+BRR), and under each predictor we compare three strategies for using the environmental covariates: NoEC, using environmental covariates (EC), and using environmental covariates with FE. Additionally, Appendix A contains comprehensive details of the BRR model utilized in this study. Furthermore, Appendix B offers extensive information on the outcomes for the Japonica, USP, and G2F_2016 datasets, which are outlined in Tables B1–B2, B3–B4, and B5–B6, respectively. Additionally, Table B7 in this appendix presents the variance components and heritability of each trait within every dataset. For the results pertaining to datasets G2F_2014 and G2F_2015, please refer to the Supplementary Materials section.


Table B1 The prediction performance and the relative efficiency (RE) for Japonica dataset in terms of mean squared error (MSE) for each Environment and for each trait, for the predictors E+G and E+G+GE under three different techniques to compute the Kernel for the effect of the Environment: without Environmental Covariates (NoEC), using Environmental covariates (EC) and using Environmental Covariates with Feature Engineering (FE).


Table B2 The prediction performance and the relative efficiency (RE) for Japonica dataset in terms of mean squared error (MSE) for each Environment and for each trait, for the predictors E+G+BRR and E+G+GE+BRR under three different techniques to compute the Kernel for the effect of the Environment: without Environmental Covariates (NoEC), using Environmental covariates (EC) and using Environmental Covariates with Feature Engineering (FE).


Table B3 The prediction performance and the relative efficiency (RE) for USP dataset in terms of mean squared error (MSE) for each Environment and for each trait, for the predictors E+G and E+G+GE under three different techniques to compute the Kernel for the effect of the Environment: without Environmental Covariates (NoEC), using Environmental covariates (EC) and using Environmental Covariates with Feature Engineering (FE).


Table B4 The prediction performance and the relative efficiency (RE) for USP dataset in terms of mean squared error (MSE) for each Environment and for each trait, for the predictors E+G+BRR and E+G+GE+BRR under three different techniques to compute the Kernel for the effect of the Environment: without Environmental Covariates (NoEC), using Environmental covariates (EC) and using Environmental Covariates with Feature Engineering (FE).


Table B5 The prediction performance and the relative efficiency (RE) for G2F_2016 dataset in terms of mean squared error (MSE) for each Environment and for each trait, for the predictors E+G and E+G+GE under three different techniques to compute the Kernel for the effect of the Environment: without Environmental Covariates (NoEC), using Environmental covariates (EC) and using Environmental Covariates with Feature Engineering (FE).


Table B6 The prediction performance and the relative efficiency (RE) for G2F_2016 dataset in terms of mean squared error (MSE) for each Environment and for each trait, for the predictor E+G+BRR and E+G+GE+BRR under three different techniques to compute the Kernel for the effect of the Environment: without Environmental Covariates (NoEC), using Environmental covariates (EC) and using Environmental Covariates with Feature Engineering (FE).


Table B7 Variance components (Var_Comp) for environment (Env), line (Line), and genotype-by-environment interaction (Env:Line) for each data set. CV denotes the coefficient of variation and n_Env denotes the average number of environments in each data set.

Japonica dataset

Predictor: E+G

Figure 1A provides a summary of Table B1 across traits and reveals that FE outperformed EC in most environments, with improvements of 20.260% (2010), 38.920% (2011), 1.750% (2012), and 25.470% (2013), resulting in an average RE of 1.157. EC, on the other hand, outperformed NoEC in most environments, with improvements of 121.200% (2009), 48.080% (2010), and 8.140% (2012), resulting in an average RE of 1.277. Likewise, FE outperformed NoEC by 101.240% (2009), 59.560% (2010), and 4.710% (2012), with slight losses in the other environments, but an average RE of 1.281. This indicates that using EC and FE surpassed NoEC by 27.730% and 28.140%, respectively. These calculations are derived from the results presented in Table B1.


Figure 1 The three relative efficiencies, considering EC_vs_FE, NoEC_vs_EC, and NoEC_vs_FE, for Japonica dataset, for predictors (A) E+G, (B) E+G+GE, (C) E+G+BRR and (D) E+G+GE+BRR in terms of mean squared error (MSE) for each Environment across traits.

Predictor: E+G+GE

Figure 1B summarizes the findings from Table B1 across traits, illustrating the comparative performance of FE, EC, and NoEC techniques in various environments. The results indicate that FE outperformed EC in the majority of environments, with improvements of 4.280% (2010), 40.050% (2011), and 20.220% (2013), resulting in an average RE of 1.099. On the other hand, EC outperformed NoEC in most environments, with improvements of 78.070% (2009), 16.100% (2012), and 147.980% (2013), yielding an average RE of 1.430. Furthermore, FE surpassed the conventional NoEC technique by 68.990% (2009), 1.780% (2012), and 178.280% (2013), with an average RE of 1.462. These results indicate that using EC and FE techniques outperformed the conventional NoEC technique by 43.040% and 46.150%, respectively. The calculations are derived from the outcomes presented in Table B1.

Predictor: E+G+BRR

Figure 1C provides an overview of Table B2 across traits. It reveals that FE outperformed EC only in environments 2010 (9.630%) and 2011 (25.340%), resulting in an average RE of 0.975. On the other hand, EC outperformed NoEC in all environments, with percentages of improvement of 92.640% (2009), 20.690% (2010), 15.960% (2011), 36.170% (2012), and 9.070% (2013), and an average RE of 1.349. Additionally, FE outperformed the NoEC technique by 80.390% (2009), 34.120% (2010), 13.690% (2011), and 21.950% (2012), with a slight loss in 2013, but an average RE of 1.269. These findings indicate that, on average, the EC and FE techniques surpassed NoEC by 34.910% and 26.940%, respectively. The calculations are based on the results presented in Table B2.

Predictor: E+G+GE+BRR

Figure 1D summarizes the findings from Table B2 across traits. It reveals that FE displayed superior performance over EC in environments 2010 (14.770%), 2011 (21.700%), and 2013 (17.870%), resulting in an average RE of 1.064. On the other hand, EC outperformed NoEC in most environments, namely by 67.750% (2009), 28.390% (2010), 27.210% (2011), and 183.970% (2013), with an average RE of 1.614. Moreover, FE outperformed NoEC in most environments, specifically by 54.260% (2009), 35.520% (2010), 33.140% (2011), and 197.980% (2013), with an average RE of 1.604. These findings indicate that, on average, using EC and FE surpassed NoEC by 61.390% and 60.460%, respectively. The computations for these results were based on the findings presented in Table B2.

USP dataset

Predictor: E+G

Figure 2A and Table B3 provide the results of our comparison between the NoEC and FE techniques using the RE metric. FE outperformed the NoEC technique only in Env1 (1.107), displaying an improvement of 10.670%. However, in Env2 (0.910), Env3 (0.8123), and Env4 (0.989), the NoEC technique surpassed FE, resulting in an average RE of 0.955. This average RE indicates a general loss of 4.520% when using FE compared to NoEC (see Table B3).


Figure 2 The three relative efficiencies, considering EC_vs_FE, NoEC_vs_EC, and NoEC_vs_FE, for USP dataset, for predictors (A) E+G, (B) E+G+GE, (C) E+G+BRR and (D) E+G+GE+BRR in terms of mean squared error (MSE) for each Environment.

Predictor: E+G+GE

Figure 2B and Table B3 provide the results of our comparison between the NoEC and FE techniques based on the RE metric. The use of FE outperformed NoEC in environments Env1 (1.167), Env2 (1.016), and Env4 (1.064), resulting in respective improvements of 16.670%, 1.550%, and 6.390%. However, in Env3 (0.912), the NoEC technique outperformed FE, resulting in an average RE of 1.040. This average RE indicates a general improvement of 4.000% of the FE technique over the NoEC method. For more detailed information, see Table B3.

Predictor: E+G+BRR

Based on Figure 2C and Table B4, our comparison between the NoEC and FE techniques using the RE metric reveals that FE outperformed the NoEC technique in environments Env1 (1.216), Env3 (1.189), and Env4 (1.435), displaying improvements of 21.580%, 18.890%, and 43.500%, respectively. However, in Env2 (0.768), the NoEC technique outperformed FE. In general, FE outperformed NoEC by 15.200%, since an average RE of 1.152 was observed (see Table B4).

Predictor: E+G+GE+BRR

Finally, based on the analysis presented in Figure 2D and Table B4, we compared the NoEC and FE techniques using the RE metric. The results indicate that FE outperformed NoEC in Env1 (1.231), Env3 (1.368), and Env4 (1.491), displaying improvements of 23.090%, 36.760%, and 49.080%, respectively. However, in Env2 (0.901), the NoEC technique outperformed FE, although FE outperformed NoEC in general terms, since an average RE of 1.248 was observed (see Table B4).

G2F_2016 dataset

Predictor: E+G

Figure 3A summarizes Table B5 across different environments for each trait. It reveals that FE outperformed EC in all traits, achieving improvements of 87.970% (Grain_Moisture_BLUE), 58.100% (Grain_Moisture_weight), 21.030% (Yield_Mg_ha_BLUE), and 89.600% (Yield_Mg_ha_weight), resulting in an average RE of 1.642. In contrast, EC outperformed NoEC in most traits, with improvements of 63.960% (Grain_Moisture_BLUE), 1682.340% (Grain_Moisture_weight), and 52.860% (Yield_Mg_ha_weight), yielding an average RE of 5.497. Additionally, FE surpassed NoEC in all traits, with enhancements of 119.370% (Grain_Moisture_BLUE), 245.980% (Grain_Moisture_weight), 1.400% (Yield_Mg_ha_BLUE), and 22.630% (Yield_Mg_ha_weight), resulting in an average RE of 1.974. These findings indicate that both the EC and FE techniques outperformed NoEC by 449.740% and 97.350%, respectively. The computations are based on the results presented in Table B5.


Figure 3 The three relative efficiencies, considering EC_vs_FE, NoEC_vs_EC, and NoEC_vs_FE, for G2F_2016 dataset, for predictors (A) E+G, (B) E+G+GE, (C) E+G+BRR and (D) E+G+GE+BRR in terms of mean squared error (MSE) for each trait across environments.

Predictor: E+G+GE

Figure 3B and Table B5 show that for the Yield_Mg_ha_weight trait, the NoEC technique achieved the best performance in most environments, as shown by the MSE values (DEH1_2016 [0.051], GAH1_2016 [0.026], IAH1_2016 [2.914], IAH2_2016 [0.069], MIH1_2016 [0.055], MNH1_2016 [0.146], NEH1_2016 [0.033], NYH2_2016 [0.449], and OHH1_2016 [1.202]). On average, there were slight losses of 2.210% and 2.570% when comparing EC versus NoEC and FE versus NoEC, respectively. This suggests that the EC and FE techniques did not perform better than the conventional NoEC technique here. However, comparing the EC and FE techniques in terms of RE showed that FE outperformed EC in most environments, resulting in an average RE of 1.339, indicating a superiority of 33.930% for FE (see Table B5).

Predictor: E+G+BRR

Figure 3C summarizes the findings from Table B6 across environments for each trait. It shows that FE outperformed EC in all traits, with improvements of 67.090% (Grain_Moisture_BLUE), 167.270% (Grain_Moisture_weight), 10.650% (Yield_Mg_ha_BLUE), and 3.960% (Yield_Mg_ha_weight), resulting in an average RE of 1.622. Additionally, EC outperformed NoEC in all traits, with improvements of 84.880% (Grain_Moisture_BLUE), 249.510% (Grain_Moisture_weight), 3.780% (Yield_Mg_ha_BLUE), and 51.630% (Yield_Mg_ha_weight), resulting in an average RE of 1.975. Furthermore, FE outperformed NoEC only in the traits Grain_Moisture_BLUE (129.850%) and Grain_Moisture_weight (25.410%), with an average RE of 1.360. These results indicate that, on average, the EC and FE techniques outperformed the conventional NoEC technique by 62.240% and 36.020%, respectively. These calculations are derived from the results presented in Table B6.

Predictor: E+G+GE+BRR

Figure 3D summarizes the results from Table B6 across different traits. It shows that FE outperformed EC in the majority of traits, specifically by 29.090% for Grain_Moisture_BLUE, 689.960% for Grain_Moisture_weight, and 38.420% for Yield_Mg_ha_weight. This leads to an average RE of 2.893. On the other hand, EC outperformed NoEC in all traits, with improvements of 65.180% for Grain_Moisture_BLUE, 408.510% for Grain_Moisture_weight, 11.690% for Yield_Mg_ha_BLUE, and 22.200% for Yield_Mg_ha_weight. The average RE for EC compared to NoEC is 2.269. Furthermore, FE outperformed NoEC in all traits, with improvements of 125.150% for Grain_Moisture_BLUE, 240.900% for Grain_Moisture_weight, 9.490% for Yield_Mg_ha_BLUE, and 11.380% for Yield_Mg_ha_weight. The average RE for FE compared to NoEC is 1.967. These results indicate that using EC and FE outperformed NoEC by 126.890% and 96.730%, respectively. These computations are derived from the outcomes of Table B6.

Summary across data sets for each predictor

In Table 1 we can observe that, for each of the four predictors, using environmental covariates improved prediction accuracy by at least 61.400% relative to not using them (NoEC_vs_EC). The same table shows that, for the four predictors, FE improved prediction performance by at least 347.300% relative to using the original environmental covariates (EC_vs_FE). Comparing FE with not using environmental covariates (NoEC_vs_FE), FE outperformed NoEC by at least 113.100% in all four predictors. We also observed that in many cases adding the environmental covariates directly (EC) did not improve (and even reduced) prediction performance; for this reason, the gain in prediction performance of NoEC_vs_FE is less pronounced than that of EC_vs_FE.


Table 1 Summary of relative efficiencies (RE) across data sets for each predictor.

Discussion

The practical implementation of the GS methodology remains challenging because it is not always possible to guarantee high genomic-enabled prediction accuracy, and many strategies have been developed to improve the prediction ability of statistical machine learning models (Sallam and Smith, 2016). For this reason, since the GS methodology is still not optimal, this investigation explored FE on the environmental covariates. FE is a crucial step in machine learning and data science that involves creating new features or modifying existing ones to improve the performance of a model. FE is a creative and essential aspect of the machine learning workflow and can significantly impact the success of one's models; it is a skill that improves with experience and a deep understanding of the data and the problem. For this reason, FE has been applied successfully in natural language processing, computer vision, time series, and other problems.

FE is not new in the context of GS, since some studies have explored feature engineering from the feature selection point of view. For example, Long et al. (2011) used dimension reduction and variable selection for genomic selection to predict milk yield in Holsteins. Tadist et al. (2019) presented a systematic and structured literature review of the feature-selection techniques used in studies related to big genomic data analytics, while Meuwissen et al. (2017) proposed variable selection models for genomic selection using whole-genome sequence data and singular value decomposition. More recently, Montesinos-López et al. (2023) proposed feature selection methods for selecting environmental covariables to enhance genomic prediction accuracy. However, these studies focused only on feature selection and did not create new features from the original inputs.

From our results across traits and data sets, we can state that including environmental covariates significantly improves prediction performance: comparing no environmental covariates (NoEC) vs. adding environmental covariates (EC), the resulting improvements were 167.900% (RE=2.679 for NoEC_vs_EC), 142.100% (RE=2.242 for NoEC_vs_EC), 56.100% (RE=1.561 for NoEC_vs_EC), and 421.300% (RE=5.213 for NoEC_vs_EC) under predictors E+G, E+G+GE, E+G+BRR, and E+G+GE+BRR, respectively. However, it is very interesting to point out that prediction performance can be improved even further when the covariates are included using FE. We found that the improvement in prediction performance of FE relative to including only the ECs was 816.600% (RE=9.166 for EC_vs_FE), 372.900% (RE=4.729 for EC_vs_FE), 616.100% (RE=7.161 for EC_vs_FE), and 1240.900% (RE=13.409 for EC_vs_FE) under predictors E+G, E+G+GE, E+G+BRR, and E+G+GE+BRR, respectively. The largest gain in prediction performance was observed under the most complex predictor (E+G+GE+BRR), while the lowest gain was observed under predictor E+G+GE. Our results show that FE in genomic prediction holds tremendous potential for advancing our understanding of genetics and improving predictions related to various aspects of genomics. For this reason, FE should be considered an important tool for unlocking the potential of genomic data in research and practical applications of genomic prediction.

Although our results are very promising for the use of FE, its practical implementation is very challenging, since we observed a significant improvement in some data sets but not in all of them; for practical implementations, we need to be able to identify, with a high degree of accuracy, when the use of FE will be beneficial and when it will not. It is also important to point out that we opted against using Pearson's correlation coefficient as a performance metric for prediction. This decision is principally rooted in the lack of substantial improvement we observed for this measure. The marginal benefits observed with this metric can be partly ascribed to our exclusive focus on feature selection within the realm of environmental covariates, and additionally to the fact that the environmental covariates are assessed not at the genotype level but at the environmental (location) level.

Three reasons why FE works well for some data sets but not for others are: (1) the data sets with low efficiency under FE are those in which the environmental covariates are less correlated with the response variable; (2) we speculate that the type of FE we implemented is not efficient for every data set; and (3) FE captures complex relationships between the inputs and the response variable that are present only in some data sets. This means that the nature of each data set substantially affects the performance of any FE strategy. For these reasons, some challenges for its implementation are: a) Domain knowledge requirement: effective FE often requires a deep understanding of the domain, since domain expertise makes it easier to identify relevant features or transformations that could enhance model performance; b) Data quality and quantity: obtaining high-quality and sufficient data for FE can be challenging in many practical scenarios, and limited or noisy data can hinder the creation of meaningful features; c) Time and resource constraints: implementing FE can be time-consuming, and in some real-world applications there might be strict time and resource constraints, making it challenging to explore and experiment with a wide range of FE techniques; d) Dynamic data: real-world data often change over time, so features that are effective at one point may become less relevant or even obsolete as the data distribution evolves, and maintaining and updating features in dynamic environments can be challenging; e) Overfitting risks: aggressive FE can lead to overfitting, especially when the number of features is large compared to the amount of available data, and overfit models perform well on training data but generalize poorly to new, unseen data; f) Complexity and interpretability: as the number and complexity of features increase, the resulting models can become difficult to interpret, which is problematic in applications where understanding the model's decisions is crucial; g) Automated feature selection: while manual FE can be effective, the process is often subjective and time-consuming, and although automated feature selection methods exist, selecting the right techniques and parameters can be challenging; h) Curse of dimensionality: as the number of features increases, the curse of dimensionality becomes more pronounced, which can increase computational requirements and decrease model performance, making it challenging to strike the right balance.

The results of this study demonstrate that the feature engineering strategy for incorporating environmental covariates effectively enhances genomic prediction accuracy. However, further research is warranted to refine the methodology for integrating environmental covariates into genomic prediction models, particularly in the context of modeling genotype-environment interactions (GE). For instance, employing the factor analytic (FA) multiplicative operator to describe cultivar effects in different environments has shown promise as a robust and efficient machine learning approach for analyzing multi-environment breeding trials (Piepho, 1998; Smith et al., 2005). Factor analysis offers solutions for modeling GE with heterogeneous variances and covariances, either alongside the numerical relationship matrix (based on pedigree information) (Crossa et al., 2006) or utilizing the genomic similarity matrix to assess GE (Burgueño et al., 2012). Further research is needed to comprehensively explore the application of the FA approach for feature engineering of environmental covariates within the framework of genomic prediction.

Conclusions

This study delved into the impact of feature engineering on environmental covariates to enhance the predictive capabilities of genomic models. Our findings demonstrate a consistent improvement in prediction performance, as measured by MSE, across most datasets when employing feature engineering techniques compared to models without such enhancements. While some datasets showed no significant gains, others exhibited notably substantial improvements. These results underscore the potential of feature engineering to bolster prediction accuracy in genomic studies. However, it’s imperative to acknowledge the inherent complexity and challenges associated with practical implementation, as various factors can influence its efficacy. Therefore, we advocate for further exploration and adoption of feature engineering methodologies within the scientific community to accumulate more empirical evidence and harness its full potential in genomic prediction.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Author contributions

OM: Writing – review & editing, Writing – original draft, Software, Methodology, Investigation, Conceptualization. LC: Writing – review & editing, Conceptualization. CS: Writing – review & editing, Supervision, Project administration, Investigation. BC: Writing – review & editing, Software, Methodology, Formal Analysis, Data curation, Conceptualization. GH: Writing – review & editing, Software, Conceptualization. BA: Writing – review & editing, Software, Methodology, Investigation, Data curation. SR: Writing – review & editing, Software, Methodology, Investigation. GG: Writing – review & editing, Methodology, Investigation, Data curation. KA: Writing – review & editing, Methodology, Investigation. RF: Writing – review & editing, Methodology, Investigation, Conceptualization. AM: Writing – review & editing, Software, Methodology, Investigation, Conceptualization. JC: Writing – review & editing, Writing – original draft, Investigation, Conceptualization.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. Open Access fees were received from the Bill & Melinda Gates Foundation. We acknowledge the financial support provided by the Bill & Melinda Gates Foundation (INV-003439 BMGF/FCDO Accelerating Genetic Gains in Maize and Wheat for Improved Livelihoods (AGG)) as well as the USAID projects (Amend. No. 9 MTO 069033, USAID-CIMMYT Wheat/AGGMW, AGG-Maize Supplementary Project, AGG (Stress Tolerant Maize for Africa)), which generated the CIMMYT data analyzed in this study. We are also thankful for the financial support provided by the Foundation for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) through the Research Council of Norway for grants 301835 (Sustainable Management of Rust Diseases in Wheat) and 320090 (Phenotyping for Healthier and more Productive Wheat Crops). We acknowledge the support of the Window 1 and 2 funders to the Accelerated Breeding Initiative (ABI).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor HL declared a past co-authorship with the author JC.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2024.1349569/full#supplementary-material

References

Abed, A., Pérez-Rodríguez, P., Crossa, J., Belzile, F. (2018). When less can be better: how can we make genomic selection more cost-effective and accurate in Barley? Theor. Appl. Genet. 131, 1873–1890. doi: 10.1007/s00122-018-3120-8

Afshar, M., Usefi, H. (2020). High-dimensional feature selection for genomic datasets. Knowledge-Based Syst. 206, 106370. doi: 10.1016/j.knosys.2020.106370

Akdemir, D., Sanchez, J. I., Jannink, J. L. (2015). Optimization of genomic selection training populations with a genetic algorithm. Genet. Selection Evolution. 47, 1–10. doi: 10.1186/s12711-015-0116-6

Bassi, F. M., Bentley, A. R., Charmet, G., Ortiz, R., Crossa, J. (2016). Breeding schemes for the implementation of genomic selection in wheat (Triticum spp.). Plant Sci. 242, 23–36. doi: 10.1016/j.plantsci.2015.08.021

Bermingham, M. L., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., et al. (2015). Application of high-dimensional feature selection: evaluation for genomic prediction in man. Sci. Rep. 5, 10312. doi: 10.1038/srep10312

Bernardo, R., Yu, J. (2007). Prospects for genomewide selection for quantitative traits in maize. Crop Sci. 47, 1082–1090. doi: 10.2135/cropsci2006.11.0690

Beyene, Y., Semagn, K., Mugo, S., Tarekegne, A., Babu, R., Meisel, B., et al. (2015). Genetic gains in grain yield through genomic selection in eight bi-parental maize populations under drought stress. Crop Sci. 55, 154–163. doi: 10.2135/cropsci2014.07.0460

Budhlakoti, N., Kushwaha, A. K., Rai, A., Chaturvedi, K. K., Kumar, A., Pradhan, A. K., et al. (2022). Genomic selection: A tool for accelerating the efficiency of molecular breeding for development of climate-resilient crops. Front. Genet. 13. doi: 10.3389/fgene.2022.832153

Burgueño, J., de los Campos, G., Weigel, K., Crossa, J. (2012). Genomic prediction of breeding values when modeling genotype × environment interaction using pedigree and dense molecular markers. Crop Sci. 52, 707–719. doi: 10.2135/cropsci2011.06.0299

Butoto, E. N., Brewer, J. C., Holland, J. B. (2022). Empirical comparison of genomic and phenotypic selection for resistance to Fusarium ear rot and fumonisin contamination in maize. Theor. Appl. Genet. 135, 2799–2816. doi: 10.1007/s00122-022-04150-8

Calus, M. P. L., Veerkamp, R. F. (2011). Accuracy of multi-trait genomic selection using different methods. Genet. Sel Evol. 43, 1–14. doi: 10.1186/1297-9686-43-26

Carrillo-de-Albornoz, J., Rodriguez Vidal, J., Plaza, L. (2018). Feature engineering for sentiment analysis in e-health forums. PloS One 13, e0207996. doi: 10.1371/journal.pone.0207996

Costa-Neto, G., Crossa, J., Fritsche-Neto, R. (2021b). Enviromic assembly increases accuracy and reduces costs of the genomic prediction for yield plasticity in maize. Front. Plant Sci. 12. doi: 10.3389/fpls.2021.717552

Costa-Neto, G., Fritsche-Neto, R., Crossa, J. (2021a). Nonlinear kernels, dominance, and envirotyping data increase the accuracy of genome-based prediction in multi-environment trials. Heredity 126, 92–106. doi: 10.1038/s41437-020-00353-1

Crossa, J., Burgueño, J., Cornelius, P. L., McLaren, G., Trethowan, R., Krishnamachari, A. (2006). Modeling genotype X environment interaction using additive genetic covariances of relatives for predicting breeding values of wheat genotypes. Crop Sci. 46, 1722–1733. doi: 10.2135/cropsci2005.11-0427

Crossa, J., Pérez-Rodríguez, P., Cuevas, J., Montesinos-López, O., Jarquín, D., de los Campos, G., et al. (2017). Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 22, 961–975. doi: 10.1016/j.tplants.2017.08.011

Desta, Z. A., Ortiz, R. (2014). Genomic selection: genome-wide prediction in plant improvement. Trends Plant Sci. 19, 592–601. doi: 10.1016/j.tplants.2014.05.006

Dong, G., Liu, H. (2018). Feature engineering for machine learning and data analytics (California, USA: CRC press).

FAO. (2023). “The state of food security and nutrition in the world 2023,” in Urbanization, agrifood systems transformation and healthy diets across the rural–urban continuum (Rome, Italy: FAOSTAT).

Gesteiro, N., Ordás, B., Butrón, A., de la Fuente, M., Jiménez-Galindo, J. C., Samayoa, L. F., et al. (2023). Genomic versus phenotypic selection to improve corn borer resistance and grain yield in maize. Front. Plant Sci. 14. doi: 10.3389/fpls.2023.1162440

Habier, D., Fernando, R. L., Dekkers, J. C. M. (2007). The impact of genetic relationship information on genome-assisted breeding values. Genetics 177, 2389–2397. doi: 10.1534/genetics.107.081190

Heaton, J. (2016). “An empirical analysis of feature engineering for predictive modeling,” in SoutheastCon 2016 (Norfolk, Virginia, USA: IEEE), pp. 1–6. doi: 10.1109/SECON.2016.7506650

Heffner, E. L., Sorrells, M. E., Jannink, J.-L. (2009). Genomic selection for crop improvement. Crop Sci. 49, 1–12. doi: 10.2135/cropsci2008.08.0512

Hu, H., Campbell, M. T., Yeats, T. H., Zheng, X., Runcie, D. E., Covarrubias-Pazaran, G., et al. (2021). Multi-omics prediction of oat agronomic and seed nutritional traits across environments and in distantly related populations. Theor. Appl. Genet. 134 (12), 4043–4054. doi: 10.1007/s00122-021-03946-4

Jarquin, D., Crossa, J., Lacaze, X., Du Cheyron, P., Daucourt, J., Lorgeou, J., et al. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 127 (3), 595–607. doi: 10.1007/s00122-013-2243-1

Jarquin, D., de Leon, N., Romay, C., Bohn, M., Buckler, E. S., Ciampitti, I., et al. (2020). Utility of climatic information via combining ability models to improve genomic prediction for yield within the genomes to fields maize project. Front. Genet. 11, 592769. doi: 10.3389/fgene.2020.592769

PubMed Abstract | CrossRef Full Text | Google Scholar

Juliana, P., Singh, R. P., Poland, J., Mondal, S., Crossa, J., Montesinos-López, O. A., et al. (2018). Prospects and challenges of applied genomic selection-A new paradigm in breeding for grain yield in bread wheat. Plant Genome 11, 1–17. doi: 10.3835/plantgenome2018.03.0017

CrossRef Full Text | Google Scholar

Khurana, U., Samulowitz, H., Turaga, D. (2018). “Feature engineering for predictive modeling using reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence (New Orleans, LA, USA), Vol. 32.

Google Scholar

Krause, M. R., González-Pérez, L., Crossa, J., Pérez-Rodríguez, P., Montesinos-López, O., Singh, R. P., et al. (2019). Hyperspectral reflectance derived relationship matrices for genomic prediction of grain yield in wheat. G3 Genes Genomes Genet. 9, 1231–1247. doi: 10.1534/g3.118.200856

CrossRef Full Text | Google Scholar

Lam, H. T., Thiebaut, J. M., Sinn, M., Chen, B., Mai, T., Alkan, O. (2017). One button machine for automating feature engineering in relational databases. arXiv. doi: 10.48550/arXiv.1706.00327

CrossRef Full Text | Google Scholar

Lawrence-Dill, C. J. (2017) Genomes to fields: GxE Field Experiment. Available online at: https://www.genomes2fields.org/resources/#sop (Accessed 2021 January 11).

Google Scholar

Long, N., Gianola, D., Rosa, G. J., Weigel, K. A. (2011). Dimension reduction and variable selection for genomic selection: application to predicting milk yield in Holsteins. J. Anim. Breed Genet. 128, 247–257. doi: 10.1111/jbg.2011.128.issue-4

PubMed Abstract | CrossRef Full Text | Google Scholar

Meuwissen, T. H. E., Hayes, B. J., Goddard, M. E.. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157 (12), 1819–1829. doi: 10.1093/GENETICS/157.4.1819

PubMed Abstract | CrossRef Full Text | Google Scholar

Meuwissen, T. H. E., Indahl, U. G., Ødegård, J. (2017). Variable selection models for genomic selection using whole-genome sequence data and singular value decomposition. Genet. Sel Evol. 49, 94. doi: 10.1186/s12711-017-0369-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Montesinos-López, O. A., Montesinos-López, A., Pérez-Rodríguez, P., Barrón-López, J. A., Martini, J. W. R., Fajardo-Flores, S. B., et al. (2021). A review of deep learning applications for genomic selection. BMC Genomics 22 (1), 19. doi: 10.1186/s12864-020-07319-x

PubMed Abstract | CrossRef Full Text | Google Scholar

Montesinos-López, O. A., Montesinos-López, A., Crossa, J. (2022). Multivariate statistical Machine Learning Methods for Genomic Prediction (Cham, Switzerland: Springer International Publishing). doi: 10.1007/978-3-030-89010-0

CrossRef Full Text | Google Scholar

Montesinos-López, O. A., Crespo-Herrera, A., Saint Pierre, J., Bentley, J., de la Rosa-Santamaria, J., Ascencio-Laguna, J., et al (2023). Do feature selection methods for selecting environmental covariables enhance genomic prediction accuracy?. Front. Genet. 14, 1209275. doi: 10.3389/fgene.2023.1209275

CrossRef Full Text | Google Scholar

Montesinos-López, A., Montesinos-López, O. A., Cuevas, J., Mata-López, W. A., Burgueño, J., Mondal, S., et al. (2017). Genomic Bayesian functional regression models with interactions for predicting wheat grain yield using hyper-spectral image data. Plant Methods 13, 1–29. doi: 10.1186/s13007-017-0212-4

PubMed Abstract | CrossRef Full Text | Google Scholar

Monteverde, E., Gutierrez, L., Blanco, P., Pérez de Vida, F., Rosas, J. E., Bonnecarrère, V., et al. (2019). Integrating molecular markers and environmental covariates to interpret genotype by environment interaction in rice (Oryza sativa L.) grown in subtropical areas. G3 (Bethesda). 9, 1519–1531. doi: 10.1534/g3.119.400064

PubMed Abstract | CrossRef Full Text | Google Scholar

Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E. B., Turaga, D. S. (2017). “Learning feature engineering for classification,” in Ijcai (Melbourne Australia), Vol. 17. 2529–2535.

Google Scholar

Pérez, P., de los Campos, G. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495. doi: 10.1534/genetics.114.164442

PubMed Abstract | CrossRef Full Text | Google Scholar

Piepho, H. P. (1998). Empirical best linear unbiased prediction in cultivar trials using using factor analytic variance covariance structure. Theor. Appl. Genet. 97, 195–201. doi: 10.1007/s001220050885

CrossRef Full Text | Google Scholar

Rincent, R., Laloë, D., Nicolas, S., Altmann, T., Brunel, D., Revilla, P., et al. (2012). Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192, 715–728. doi: 10.1534/genetics.112.141473

PubMed Abstract | CrossRef Full Text | Google Scholar

Rogers, A. R., Dunne, J. C., Romay, C., Bohn, M., Buckler, E. S., Ciampitti, I. A., et al. (2021). The importance of dominance and genotype-by-environment interactions on grain yield variation in a large-scale public cooperative maize experiment. G3 (Bethesda) 11, jkaa050. doi: 10.1093/g3journal/jkaa050

PubMed Abstract | CrossRef Full Text | Google Scholar

Rogers, A. R., Holland, J. B. (2022). Environment-specific genomic prediction ability in maize using environmental covariates depends on environmental similarity to training data. G3 Genes|Genomes|Genetics 12, jkab440. doi: 10.1093/g3journal/jkab440

CrossRef Full Text | Google Scholar

Sallam, A. H., Smith, K. P. (2016). Genomic selection performs similarly to phenotypic selection in Barley. Crop Sci. 56, 2871–2881. doi: 10.2135/cropsci2015.09.0557

CrossRef Full Text | Google Scholar

Smith, A. B., Cullis, B. R., Thompson, R. (2005). The analysis of crop cultivar breeding and evluation trials: An overview of current mixed model approaches. J. Agric. Sci. 143, 1–14. doi: 10.1017/S0021859605005587

CrossRef Full Text | Google Scholar

Tadist, K., Najah, S., Nikolov, N. S., Mrabti, F., Zahi, A. (2019). Feature selection methods and genomic big data: a systematic review. J. Big Data 6, 79. doi: 10.1186/s40537-019-0241-0

CrossRef Full Text | Google Scholar

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. doi: 10.3168/jds.2007-0980

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, K., Abid, M. A., Rasheed, A., et al. (2023). DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol. Plant 16, 279–293. doi: 10.1016/j.molp.2022.11.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Wu, P. Y., Stich, B., Weisweiler, M., Shrestha, A., Erban, A., Westhoff, P., et al. (2022). Improvement of prediction ability by integrating multi-omic datasets in barley. BMC Genomics 23, 200. doi: 10.1186/s12864-022-08337-7

PubMed Abstract | CrossRef Full Text | Google Scholar

Xu, Y., Liu, X., Fu, J., Wang, H., Wang, J., Huang, C., et al. (2020). Enhancing genetic gain through genomic selection: from livestock to plants. Plant Commun. 1, 100005. doi: 10.1016/j.xplc.2019.100005

PubMed Abstract | CrossRef Full Text | Google Scholar

Yurek, O. E., Birant, D. (2019). “Remaining useful life estimation for predictive maintenance using feature engineering,” in 2019 Innovations in Intelligent Systems and Applications Conference (ASYU). 1–5 (Izmir, Turkeyand: IEEE). doi: 10.1109/ASYU48272.2019.894639

CrossRef Full Text | Google Scholar

Appendix A

Bayesian ridge regression

Bayesian Ridge Regression (BRR) is a probabilistic approach to linear regression that extends the traditional model by placing a prior distribution over the regression coefficients. The prior both expresses uncertainty in the model parameters and acts as a regularizer that helps prevent overfitting.

As in traditional linear regression, the model assumes a linear relationship between the independent variables and the dependent variable. In addition, BRR assumes that the regression coefficients follow a Gaussian (normal) distribution, which introduces a regularization term that penalizes large coefficients and helps prevent overfitting.

The model formulation assumes a matrix of independent variables X and a dependent variable y, such that the BRR model can be written as

y = Xβ + ϵ

where y is the vector of dependent-variable values, X is the matrix of independent variables, β is the vector of regression coefficients, and ϵ is the residual (error) term. From a Bayesian perspective, the prior distribution for β is assumed to be Gaussian (normal), β ~ N(0, α⁻¹I), where α is a hyperparameter controlling the strength of the regularization and I is the identity matrix. The goal is to estimate the posterior distribution of β given the data, which is proportional to the product of the likelihood and the prior: P(β | X, y) ∝ P(y | X, β) · P(β). Under a Gaussian likelihood with residual variance σ², this posterior is itself Gaussian, and its mean coincides with the ridge regression estimator (XᵀX + σ²αI)⁻¹Xᵀy. Once the posterior distribution is obtained, Bayesian inference can be performed: point estimates (the posterior mean or mode) serve as the regression coefficients, and credible intervals can be computed to quantify uncertainty.
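To make the above concrete, the following is a minimal sketch of BRR fitted to simulated data. It uses scikit-learn's BayesianRidge, which is one of several possible implementations and not necessarily the software used in this study; the simulated matrix X merely stands in for a set of environmental covariates, and all dimensions are hypothetical. Note that scikit-learn parameterizes the model with separate precision hyperparameters for the coefficients and the noise, both estimated from the data by maximizing the marginal likelihood.

# Minimal Bayesian ridge regression sketch on simulated (hypothetical) data.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n, p = 100, 10                      # hypothetical sample size and covariates
X = rng.standard_normal((n, p))     # stand-in for environmental covariates
beta_true = rng.standard_normal(p)  # simulated "true" coefficients
y = X @ beta_true + 0.5 * rng.standard_normal(n)

model = BayesianRidge().fit(X, y)   # hyperparameters estimated from the data
y_hat, y_std = model.predict(X, return_std=True)  # posterior mean and std
print("MSE:", np.mean((y - y_hat) ** 2))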

Appendix B

Japonica dataset

Predictor: E+G

Table B1 shows adequate performance under NoEC for the GC trait across all environments. The MSE values for 2009, 2010, 2011, 2012, and 2013 were 0.0035, 0.0110, 0.0019, 0.0281, and 0.0017, respectively. Comparing the NoEC results to the EC and FE techniques using Relative Efficiency (RE), all RE values were below 1; on average, NoEC performed 50.050% better than EC and 42.230% better than FE. However, when comparing the EC and FE techniques based on RE, FE outperformed EC in 2010, 2011, 2012, and 2013, with RE values of 1.287, 2.686, 1.139, and 1.586, respectively; in 2009, the RE of 0.522 favored EC. On average, the use of FE outperformed EC by 44.410%. Please refer to Table B1 for more detailed information.
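To make the RE comparisons used throughout this appendix concrete, the short sketch below computes per-environment RE values and the average percentage gain. It assumes, as the reported percentages imply (e.g., an average RE of 2.219 corresponding to an improvement of 121.860%), that the RE for a comparison "A_vs_B" is the MSE under technique A divided by the MSE under technique B, so that RE > 1 favors B and the percentage gain is (RE − 1) × 100. The MSE values used here are hypothetical, not taken from Table B1.

# Relative Efficiency sketch with hypothetical per-environment MSE values;
# RE(A_vs_B) = MSE_A / MSE_B, so RE > 1 means technique B has the lower MSE.
import numpy as np

mse_noec = np.array([0.0040, 0.0100, 0.0020, 0.0300, 0.0015])  # hypothetical
mse_fe = np.array([0.0042, 0.0085, 0.0007, 0.0247, 0.0011])    # hypothetical

re_per_env = mse_noec / mse_fe      # one RE value per environment
avg_re = re_per_env.mean()          # averaged across environments
gain_pct = (avg_re - 1.0) * 100.0   # > 0 means FE improves on the baseline
print(re_per_env.round(3), round(float(avg_re), 3), round(float(gain_pct), 3))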

Concerning the GY trait, Table B1 shows that the use of EC led to superior performance in most environments based on MSE (796,963 [2009], 2,488,872 [2010], and 1,157,280 [2012]). The exceptions occurred in 2011 and 2013, when FE achieved the best MSE values of 2,615,758 and 377,719, respectively. Accordingly, when comparing NoEC versus EC and NoEC versus FE using RE, most RE values were greater than 1. On average, the EC technique displayed an improvement of 105.610% (NoEC_vs_EC) over the NoEC method, and an improvement of 77.570% (NoEC_vs_FE) was observed for FE over the conventional NoEC technique. Nonetheless, when assessing the performance of the EC and FE techniques based on RE, FE only outperformed EC in 2011 (RE = 1.091) and 2013 (RE = 1.087), whereas EC outperformed FE in 2009 (RE = 0.777), 2010 (RE = 0.817), and 2012 (RE = 0.806), resulting in an average RE of 0.916. This indicates an overall performance loss of 8.450% when using FE compared to EC. Table B1 provides further details.

In terms of MSE for the PH trait, Table B1 shows that the use of FE achieved the best performance in most environments (15.872 [2009], 10.959 [2010], and 164.039 [2012]). However, there were exceptions in 2011 and 2013, where the best MSE values were 28.573 (EC) and 18.363 (NoEC), respectively. On the other hand, when comparing the NoEC versus EC and NoEC versus FE techniques using RE, most RE values were greater than 1. On average, the use of EC and FE displayed improvements of 61.570% and 70.210%, respectively, compared to the use of NoEC. Furthermore, when comparing the performance of the EC and FE techniques based on RE, FE outperformed EC in all environments, resulting in an average RE of 1.0389. This indicates that using FE surpassed EC by 3.880% (Table B1).

In terms of MSE for the PHR trait, Table B1 indicates that the use of FE yielded the best performance in most environments (0.001 [2009], 0.001 [2010], and 0.001 [2013]). However, exceptions were found in 2011 and 2012, when the best MSE values were 0.001 (EC) and 0.006 (NoEC), respectively. On the other hand, when comparing the EC versus FE and NoEC versus FE techniques using Relative Efficiency (RE), most RE values were at least 1. On average, the use of FE displayed a general improvement of 22.790% compared to EC and of 7.020% compared to the conventional NoEC technique. However, evaluating the performance of the EC versus NoEC techniques based on RE showed that NoEC outperformed EC in most environments, resulting in an average RE of 0.938. This indicates a general accuracy loss of 6.200% when using EC compared to the conventional NoEC technique (Table B1).

Predictor: E+G+GE

Table B1 shows that, in most environments, the conventional NoEC technique yielded the best performance for the GC trait, with MSE values of 0.001 (2009), 0.013 (2010), and 0.002 (2011). The exceptions occurred in 2012 and 2013, with the best MSE values of 0.025 (EC) and 0.0023 (FE). The average RE for the comparison of NoEC versus EC and NoEC versus FE techniques across environments was 0.919 and 0.9023, respectively, indicating general losses of 8.080% and 9.740% for EC and FE compared to the conventional NoEC.

Regarding the GY trait, MSE values from Table B1 reveal that the use of EC achieved the best performance in most environments (1,152,261.030 [2009], 3,653,811.510 [2010], and 989,127.170 [2012]). However, exceptions were observed in 2011 and 2013, where the best MSE values were 1,834,248.25 (NoEC) and 30,980.32 (FE), respectively. On the other hand, when comparing the NoEC versus EC and NoEC versus FE techniques using RE, most RE values were greater than 1. The average RE for NoEC versus EC and NoEC versus FE was 2.219 and 2.075, respectively, indicating general improvements of 121.860% and 107.520% compared to the use of NoEC. However, an evaluation of the performance of the EC and FE techniques based on RE showed that FE outperformed EC only in 2011 (1.0267) and 2013 (1.122), while EC outperformed FE in 2009 (0.789), 2010 (0.849), and 2012 (0.7278). Consequently, the average RE for EC versus FE was 0.9029, implying a general loss of 9.710% when using FE compared to EC (Table B1).

Concerning the PH trait, the analysis of MSE values from Table B1 reveals that the use of FE yielded the best performance in environments 2009 (17.631) and 2012 (23.544). However, exceptions were observed in 2010, 2011, and 2013, where the best MSE values were 12.954 (EC), 44.689 (NoEC), and 164.891 (NoEC), respectively. On the other hand, comparing the NoEC versus EC and NoEC versus FE techniques using RE showed that most RE values were greater than 1. The average RE for NoEC versus EC and NoEC versus FE was 1.618 and 1.700, respectively, indicating general improvements of 61.810% and 70.000% compared to the conventional NoEC technique. Furthermore, when evaluating the performance of the EC and FE techniques based on RE, FE outperformed EC in most environments. The average RE for EC versus FE was 1.047, indicating a 4.710% advantage in favor of FE (Table B1).

Moreover, in the case of the PHR trait, the analysis of MSE values from Table B1 shows that the use of FE yielded the best performance in most environments (0.001 [2009], 0.002 [2010], and 0.001 [2013]). However, there were exceptions in 2011 and 2012, where the best MSE values were 0.001 (EC) and 0.005 (NoEC), respectively. Furthermore, when comparing the RE values between the NoEC versus EC and NoEC versus FE techniques, the average RE values of 0.966 and 1.168 indicate a slight loss of 3.440% and an improvement of 16.800%, respectively, for the use of EC and FE compared to the conventional NoEC technique. Nevertheless, when evaluating the performance of the FE versus EC techniques based on RE, FE outperformed EC in most environments. The average RE for FE versus EC was 1.282, indicating a significant improvement of 28.240% in accuracy for using FE compared to EC (Table B1).

Predictor: E+G+BRR

According to Table B2, the GC trait displayed superior performance with the conventional NoEC technique in most environments, yielding MSE values of 0.004 (2009), 0.002 (2011), and 0.0012 (2013). However, exceptions were found in 2010 and 2012, where FE achieved the best MSE values of 0.0680 and 0.009, respectively. Comparing the RE values between the NoEC versus EC and NoEC versus FE techniques showed that most RE values were below 1. Nonetheless, the average RE of 1.104 (NoEC_vs_EC) and 1.189 (NoEC_vs_FE) indicated that EC and FE outperformed the conventional NoEC technique by 10.360% and 18.930%, respectively; a few large per-environment RE values can pull the average above 1, as illustrated below. Furthermore, when evaluating the performance of the EC and FE techniques based on RE, FE presented the best performance in 2009 (1.151), 2010 (1.353), 2011 (2.044), and 2012 (1.0623), while EC outperformed FE in 2013 (0.529). Overall, the average RE of 1.228 indicated that FE outperformed EC by 22.800% (Table B2).
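A note on interpreting these averages: because RE is a ratio, its arithmetic mean across environments can exceed 1 even when most per-environment values are below 1, since a single large ratio dominates the mean. A tiny illustration with hypothetical RE values:

# Hypothetical RE values: three of four are below 1, yet the mean exceeds 1.
import numpy as np

re_values = np.array([0.80, 0.90, 0.95, 2.50])   # hypothetical
print((re_values < 1).sum(), "of", re_values.size, "below 1;",
      "mean RE =", re_values.mean())             # mean RE = 1.2875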

Regarding the GY trait, Table B2 indicates that the conventional NoEC technique displayed superior performance in most environments, with MSE values of 5,683,515.750 (2010), 2,749,626.080 (2012), and 405,886.860 (2013). However, exceptions were observed in 2009 and 2011, where FE achieved the best MSE values of 3,049,246.320 and 4,024,422.450, respectively. When comparing the RE values between the NoEC_vs_EC and NoEC_vs_FE techniques, most values were below 1. Nevertheless, the average RE of 1.124 (NoEC_vs_EC) and 0.896 (NoEC_vs_FE) indicated an overall improvement of 12.430% for EC and a general loss of 10.450% for FE compared to the conventional NoEC technique. However, when comparing the performance of the EC and FE techniques based on RE, FE presented a superior performance only in 2010 (1.029), resulting in an average RE of 0.797, which indicates a general loss of 20.350% for FE compared to EC (Table B2).

For the PH trait, Table B2 shows that FE yielded the best performance in environments 2009 (15.281) and 2012 (159.312), while EC led to superior performance in environments 2010 (22.962) and 2013 (10.981). Most notably, the RE values for the NoEC_vs_EC and NoEC_vs_FE comparisons generally exceeded 1. The average RE values of 1.634 (NoEC_vs_EC) and 1.5434 (NoEC_vs_FE) indicated substantial improvements of 63.350% and 54.350%, respectively, for using EC and FE compared to the conventional NoEC technique. However, in evaluating the performance of EC and FE based on RE, FE exhibited a superior performance in most environments but still resulted in an average RE of 0.954. This suggests that EC marginally outperformed FE by 4.650%. For further details, see Table B2.

Additionally, for the PHR trait, the use of FE displayed a superior performance in most environments, as indicated in Table B2. The best MSE values were observed in 2009 (0.001), 2010 (0.001), and 2013 (0.001). However, exceptions were noted in 2011 and 2012, where the use of EC and NoEC resulted in the best MSE values of 0.0008 and 0.0055, respectively. Furthermore, most RE values comparing the NoEC_vs_EC and NoEC_vs_FE techniques were greater than 1. The average RE values of 1.535 (NoEC_vs_EC) and 1.449 (NoEC_vs_FE) indicate significant improvements of 53.530% and 44.930%, respectively, compared to the conventional NoEC technique. However, when comparing the performance of the EC versus FE techniques, the RE values were lower than 1 in most environments, resulting in an average RE of 0.9212. This suggests a general accuracy loss of 7.820% for using FE compared to the EC technique (Table B2).

Predictor: E+G+GE+BRR

Regarding the GY trait, the analysis in Table B2 reveals that the use of EC yielded superior results in most environments (2009 [1,333,530.864], 2012 [1,690,390.524], and 2013 [584,945.854]). However, exceptions were observed in 2010 and 2011, where the NoEC approach resulted in the best MSE values of 4,339,466.437 and 1,834,248.259, respectively. Moreover, most RE values for the comparison of the NoEC_vs_EC and NoEC_vs_FE techniques were greater than 1. The average RE values of 1.570 (NoEC_vs_EC) and 1.198 (NoEC_vs_FE) indicate general improvements of 57.030% and 19.790% for the use of EC and FE, respectively, compared to the use of NoEC. However, when comparing the performance of the EC and FE techniques based on RE, FE outperformed EC only in 2010, resulting in an average RE of 0.773. This suggests a general accuracy loss of 22.670% for using FE compared to EC.

Regarding the PH trait, Table B2 shows that the use of FE achieved the best performance in environments 2009 (17.332) and 2011 (22.026), while the use of EC achieved the best performance in environments 2010 (14.9561) and 2013 (11.071). Similarly, most of the RE values for the comparison of the NoEC_vs_EC and NoEC_vs_FE techniques were greater than 1. The average RE values of 2.5259 (NoEC_vs_EC) and 2.362 (NoEC_vs_FE) indicate general improvements of 152.590% and 136.210% for using EC and FE, respectively, compared to the conventional NoEC technique. However, when comparing the performance of the EC and FE techniques based on RE, EC outperformed FE in most environments, resulting in an average RE of 0.909. This indicates that using EC achieved a 9.100% improvement compared to using FE. For more detailed information, refer to Table B2.

Table B2 displays that using EC yielded the best performance for the PHR trait in most environments, as indicated by the MSE. Specifically, the MSE values were as follows: 2009 (0.001), 2010 (0.001), 2011 (0.001), and 2013 (0.001). However, in 2012, the best MSE value was 0.005, achieved using both EC and NoEC. Comparing the NoEC_vs_EC and NoEC_vs_FE techniques, most RE values were at least 1, with average improvements of 60.350% and 48.570% when using EC and FE, respectively, compared to NoEC. Conversely, when comparing the EC versus FE techniques, EC performed better in most environments, resulting in an average RE of 0.877 and indicating a 12.260% decrease in accuracy when using FE compared to EC (Table B2).

USP dataset

Predictor: E+G

Upon examining Table B3, it becomes apparent that the conventional NoEC technique achieved the best performance in terms of MSE in environments Env2 (4.073) and Env3 (5.246). However, exceptions were found in Env1 and Env4, where the optimal MSE values were 3.141 (FE) and 7.814 (EC), respectively. For further detail, refer to Table B3.

Table B3 presents our comparison results between the NoEC and EC techniques, assessed through the RE metric. The EC technique displayed its best performance in environments Env1 (1.059) and Env4 (1.046), showcasing improvements of 5.920% and 4.610% over the NoEC technique, respectively. However, NoEC outperformed EC in environments Env2 (0.869) and Env3 (0.831), resulting in an average RE of 0.951. This average RE indicates a general loss of 4.890% in accuracy when using EC compared to NoEC (see Table B3).

The EC and FE techniques were compared, using the RE metric to assess their performance. The findings indicate that the FE technique achieved its best performance in environments Env1 (1.045) and Env2 (1.048), displaying improvements of 4.480% and 4.790% over EC. However, EC exhibited a slightly better performance in environments Env3 (0.979) and Env4 (0.946), resulting in an average RE of 1.004. This average RE suggests a modest improvement of 0.430% when using FE compared to EC (see Table B3).

Predictor: E+G+GE

Table B3 reveals the performance of the FE technique in terms of MSE across different environments. The FE technique achieved its best performance in environments Env1 (2.789) and Env2 (4.636), although exceptions were found in Env3 and Env4, where the optimal MSE values were 5.833 (NoEC) and 7.792 (EC), respectively (see Table B3).

Table B3 presents our comparison results between the NoEC and EC techniques, based on the RE metric. The EC technique displayed its best performance in environments Env1 (1.107) and Env4 (1.120), showing improvements of 10.720% and 12.040% over the NoEC technique. However, the NoEC technique outperformed EC in environments Env2 (0.961) and Env3 (0.925), resulting in an average RE of 1.028. This average RE indicates a general improvement of 2.840% for the EC method over the NoEC technique (see Table B3).

The EC and FE techniques were compared, using the RE metric to assess their performance. The findings indicate that the FE technique achieved its best performance in environments Env1 (1.054) and Env2 (1.057), displaying improvements of 5.380% and 5.650% over EC. However, using EC exhibited a better performance in environments Env3 (0.986) and Env4 (0.949), resulting in an average RE of 1.012. This average RE indicates a 1.150% improvement of the FE technique over EC (see Table B3).

Predictor: E+G+BRR

Table B4 presents the results of our MSE analysis for the FE technique. The FE technique performed best in environments Env1 (2.859) and Env3 (4.413). However, exceptions were observed in Env2 and Env4, where the optimal MSE values were 4.073 (NoEC) and 5.638 (EC), respectively. For further details, see Table B4.

The results of our comparison between the NoEC and EC techniques, based on the RE metric, are presented in Table B4. The EC technique exhibited its best performance in environments Env1 (1.171) and Env4 (1.450), suggesting improvements of 17.100% and 45.000%, respectively, compared to the NoEC technique. However, the NoEC technique outperformed EC in environments Env2 (0.823) and Env3 (0.836), resulting in an average RE of 1.070. This average RE indicates a general improvement of 7.000% for EC over the NoEC technique (see Table B4).

We compared the EC and FE techniques, evaluating their performance with the RE metric. The findings indicate that the FE technique achieved its best performance in environments Env1 (1.038) and Env3 (1.423), displaying respective improvements of 3.840% and 42.290% over EC. However, EC performed better in environments Env2 (0.934) and Env4 (0.990), resulting in an average RE of 1.096. This average RE indicates a 9.600% better performance of the FE technique over EC (see Table B4).

Predictor: E+G+GE+BRR

Table B4 presents the performance results of the FE technique in terms of MSE. The best performance was observed in environments Env1 (2.644), Env3 (4.265), and Env4 (5.856). The only exception was Env2, where the optimal MSE value was 4.708, achieved using NoEC. For further information, see Table B4.

Based on the RE metric, the results of our comparison between the NoEC and EC techniques are presented in Table B4. EC performed best in environments Env1 (1.175) and Env4 (1.465), with improvements of 17.510% and 46.530%, respectively, compared to the NoEC technique. However, the NoEC technique outperformed EC in environments Env2 (0.958) and Env3 (0.915), resulting in an average RE of 1.128. This average RE indicates a general improvement of 12.830% for EC over NoEC. For more specific information, see Table B4.

We compared the EC and FE techniques based on the RE metric. The analysis revealed that the FE technique displayed its best performance in Env1 (1.047), Env3 (1.494), and Env4 (1.017). These results indicate improvements of 4.740%, 49.430%, and 1.740%, respectively, when compared to using EC. However, EC displayed a better performance in Env2 (0.941), but in general, the FE technique outperformed EC by 12.500%, since an average RE of 1.125 was observed (see Table B4).

G2F_2016 dataset

Predictor: E+G

Table B5 illustrates that FE yielded the best performance for the Grain_Moisture_BLUE trait in most environments. MSE values were 4.645 (DEH1_2016), 2.154 (GAH1_2016), 2.703 (IAH1_2016), 0.467 (IAH4_2016), 0.668 (MOH1_2016), 3.598 (NCH1_2016), 2.092 (NYH2_2016), and 1.601 (WIH2_2016). The average RE values showed that FE outperformed EC and NoEC by 87.970% and 119.370%, respectively. Additionally, EC displayed an average RE improvement of 63.960% over NoEC. For further detail, see Table B5.

For the Grain_Moisture_weight trait, EC presented the best performance based on MSE values in several environments listed in Table B5 (ARH1_2016 [24.235], DEH1_2016 [0.207], IAH1_2016 [2.568], ILH1_2016 [2.172], INH1_2016 [0.210], MOH1_2016 [7.450], OHH1_2016 [0.454] and WIH2_2016 [0.194]). The average RE values revealed that EC and FE outperformed the conventional NoEC technique by 1682.340% and 245.980%, respectively. Furthermore, FE displayed a 58.100% improvement over EC (see Table B5).

Regarding the Yield_Mg_ha_BLUE trait, NoEC displayed a superior performance in most environments based on the MSE values listed in Table B5 (GAH1_2016 [3.579], IAH4_2016 [2.576], MIH1_2016 [4.045], MNH1_2016 [1.268], NYH2_2016 [16.252], OHH1_2016 [1.830] and WIH1_2016 [3.665]). The average RE values indicated that FE resulted in general improvements of 21.030% and 1.400% over EC and NoEC, respectively. However, a comparison between NoEC and EC showed a slight decrease of 0.190% in average RE for EC (see Table B5).

For the Yield_Mg_ha_weight trait, NoEC showed the best performance based on MSE values in most environments (DEH1_2016 [0.078], IAH4_2016 [0.091], ILH1_2016 [0.351], MIH1_2016 [0.1156], MNH1_2016 [0.391], NYH2_2016 [0.087], WIH1_2016 [0.063] and WIH2_2016 [0.019]). The average RE values indicated general improvements of 52.860% and 22.630% for EC and FE, respectively, compared to NoEC. Moreover, on average, FE outperformed EC by 89.600% (see Table B5).

Predictor: E+G+GE

Table B5 shows that FE yielded the best performance for the Grain_Moisture_BLUE trait in the majority of environments, with MSE values ranging from 0.519 to 5.813 (IAH4_2016, ILH1_2016, MNH1_2016, NEH1_2016, NYH2_2016, OHH1_2016 and WIH1_2016). Comparing RE values, using FE outperformed EC and NoEC techniques by 42.480% and 114.740%, respectively. Additionally, EC outperformed NoEC with an average RE of 1.552, indicating a superiority of 55.210% for EC. For further details, see Table B5.

For the Grain_Moisture_weight trait, Table B5 reveals that FE displayed a better performance in most environments, as indicated by the MSE values (DEH1_2016 [0.132], IAH3_2016 [0.418], IAH4_2016 [139.446], MIH1_2016 [1.668], MNH1_2016 [1.316], NCH1_2016 [6.953], NYH2_2016 [5.565], OHH1_2016 [0.195] and WIH1_2016 [1.508]). Moreover, the average RE values showed that FE outperformed EC and NoEC by 831.910% and 825.260%, respectively. Comparing NoEC and EC techniques, there was a general improvement of 357.000% for EC over NoEC, with an average RE of 4.570 (see Table B5).

Regarding the Yield_Mg_ha_BLUE trait, Table B5 shows that the use of NoEC achieved the best performance in most environments, as indicated by the MSE values (GAH1_2016 [3.379], IAH1_2016 [2.287], IAH2_2016 [7.505], IAH4_2016 [3.565], MIH1_2016 [4.748], NYH2_2016 [17.271], WIH1_2016 [2.210] and WIH2_2016 [4.667]). However, most RE values comparing the NoEC_vs_EC and NoEC_vs_FE techniques were greater than 1. On average, EC displayed a 7.450% improvement and FE showed an 11.690% improvement compared to the conventional NoEC technique. Furthermore, comparing the EC and FE techniques, an average RE of 1.227 was observed, indicating that FE outperformed EC by 22.700% (see Table B5).

In terms of the Yield_Mg_ha_weight trait, Table B5 shows that the use of NoEC achieved the best performance in most environments, as evident from the MSE values (DEH1_2016 [0.051], GAH1_2016 [0.026], IAH1_2016 [2.914], IAH2_2016 [0.0689], MIH1_2016 [0.055], MNH1_2016 [0.146], NEH1_2016 [0.033], NYH2_2016 [0.449] and OHH1_2016 [1.202]). The average RE values indicated slight losses of 2.210% and 2.570% when comparing EC versus NoEC and FE versus NoEC, respectively. This implies that EC and FE techniques did not perform as adequately as the conventional NoEC technique. However, comparing EC and FE techniques based on RE showed that FE outperformed EC in most environments, resulting in an average RE of 1.339, indicating a 33.930% superiority of FE over EC. For more detailed information, see Table B5.

Predictor: E+G+BRR

In Table B6, it is evident that for the Grain_Moisture_BLUE trait, the use of FE provided the best performance in most environments, as indicated by the MSE values (DEH1_2016 [4.376], GAH1_2016 [2.002], IAH1_2016 [2.036], IAH3_2016 [1.237], IAH4_2016 [0.496], MNH1_2016 [3.685], MOH1_2016 [0.678], NCH1_2016 [3.499], NYH2_2016 [2.213] and WIH2_2016 [1.648]). On average, the RE values indicate that FE outperformed EC and NoEC by 67.090% and 129.850%, respectively. Additionally, comparing NoEC and EC techniques showed that EC outperformed NoEC by an average of 84.880%. For further information, see Table B6.

For the Grain_Moisture_weight trait, Table B6 shows that the use of NoEC provided the best performance in most environments, as indicated by the MSE values (GAH1_2016 [1.272], IAH1_2016 [401.574], IAH3_2016 [0.199], ILH1_2016 [5.447], MIH1_2016 [0.715], NCH1_2016 [1.174] and NEH1_2016 [42.758]). On average, the RE values indicate that FE outperformed EC and NoEC by 167.270% and 25.410%, respectively. Furthermore, comparing NoEC and EC shows that EC outperformed NoEC with an average RE of 3.495, representing a general improvement of 149.510%. For more detailed information, see Table B6.

For the Yield_Mg_ha_BLUE trait, Table B6 shows that the use of NoEC led to the best performance in most environments, as indicated by the MSE values (ARH1_2016 [3.713], GAH1_2016 [3.579], IAH4_2016 [2.576], INH1_2016 [2016], MIH1_2016 [4.045], MNH1_2016 [1.268], NYH2_2016 [16.252], OHH1_2016 [1.829] and WIH2_2016 [4.629]). On average, the RE values indicate general improvements of 10.650% for FE compared to EC, and 3.780% for EC compared to NoEC. However, when comparing the performance of the NoEC and FE techniques, an average RE of 0.986 indicates a slight loss for FE compared to NoEC. For more detailed information, see Table B6.

For the Yield_Mg_ha_weight trait, the use of NoEC achieved the best performance in most environments, as indicated by the MSE values (ARH1_2016 [0.989], GAH1_2016 [0.035], IAH2_2016 [0.175], IAH3_2016 [0.783], MIH1_2016 [0.1201], MNH1_2016 [0.393], MOH1_2016 [0.232], NYH2_2016 [0.402], OHH1_2016 [0.533] and WIH2_2016 [0.055]). On average, the RE values indicate a general improvement of 51.630% for EC compared to NoEC and 3.960% for FE compared to EC. However, when comparing the performance of NoEC and FE based on RE, the best performance was displayed by NoEC in most environments, resulting in an average RE of 0.9012, indicating that NoEC outperformed FE by 9.820%. For more detailed information, see Table B6.

Predictor: E+G+GE+BRR

Table B6 shows that EC yielded the most favorable results for the Grain_Moisture_BLUE trait in various environments. The corresponding MSE values for EC were 4.2801 (DEH1_2016), 0.519 (IAH4_2016), 4.964 (ILH1_2016), 4.762 (MNH1_2016), 6.047 (NEH1_2016), 4.030 (NYH2_2016), 3.072 (OHH1_2016), and 3.495 (WIH1_2016). Additionally, the average RE values indicated that using FE outperformed both EC and NoEC, by 29.090% and 125.150%, respectively (1.291 for EC_vs_FE and 2.252 for NoEC_vs_FE). Furthermore, when comparing the NoEC and EC techniques, an average RE of 1.6512 indicates that EC outperformed NoEC by 65.180%. For more comprehensive information, see Table B6.

When considering the Grain_Moisture_weight trait, the use of EC presented a more adequate performance in most environments based on the MSE values provided in Table B6 (DEH1_2016 [0.595], IAH1_2016 [360.363], IAH2_2016 [1.219], IAH4_2016 [120.354], ILH1_2016 [10.357], NCH1_2016 [2.0245], NYH2_2016 [2.534], and OHH1_2016 [4.124]). Moreover, the average RE values reveal that EC and FE outperformed the conventional NoEC by 408.510% and 240.900%, respectively (5.085 for NoEC_vs_EC and 3.409 for NoEC_vs_FE). Furthermore, a comparison between the EC and FE techniques yields an average RE of 7.899, suggesting that FE outperformed EC by 689.960%. For more detailed information, see Table B6.

When examining the Yield_Mg_ha_BLUE trait, the use of NoEC displayed the best performance in most environments based on the MSE values presented in Table B6 (ARH1_2016 [3.928], GAH1_2016 [3.379], IAH1_2016 [2.287], IAH2_2016 [7.505], IAH4_2016 [2.565], MIH1_2016 [4.748], NYH2_2016 [17.271], WIH1_2016 [2.210] and WIH2_2016 [4.667]). However, it is worth noting that EC and FE outperformed the conventional NoEC by 11.690% and 9.490% in terms of average RE values (1.117 for NoEC_vs_EC and 1.095 for NoEC_vs_FE). Nevertheless, when comparing FE versus EC techniques, a slight loss of 0.400% was observed for using FE compared to EC, as indicated by an average RE of 0.996. For more detailed information, see Table B6.

Regarding the Yield_Mg_ha_weight trait, Table B6 shows that the use of EC yielded the best performance in most environments, as evidenced by the following MSE values: ARH1_2016 (0.719), DEH1_2016 (0.020), IAH1_2016 (2.808), IAH4_2016 (0.058), ILH1_2016 (0.594), NYH2_2016 (0.418), and WIH2_2016 (0.073). The average RE values indicated improvements of 22.200% (NoEC_vs_EC) and 11.380% (NoEC_vs_FE), highlighting the superior performance of EC and FE over the conventional NoEC technique. Conversely, when comparing the EC and FE techniques, FE performed better in most environments, with an average RE of 1.384 indicating that FE outperformed EC by 38.420%. For additional information, see Table B6.

Keywords: genomic selection, plant breeding, environmental covariates, feature engineering, feature selection

Citation: Montesinos-López OA, Crespo-Herrera L, Pierre CS, Cano-Paez B, Huerta-Prado GI, Mosqueda-González BA, Ramos-Pulido S, Gerard G, Alnowibet K, Fritsche-Neto R, Montesinos-López A and Crossa J (2024) Feature engineering of environmental covariates improves plant genomic-enabled prediction. Front. Plant Sci. 15:1349569. doi: 10.3389/fpls.2024.1349569

Received: 04 December 2023; Accepted: 11 April 2024;
Published: 15 May 2024.

Edited by:

Huihui Li, Chinese Academy of Agricultural Sciences, China

Reviewed by:

Tingxi Yu, Chinese Academy of Agricultural Sciences (CAAS), China
João Ricardo Bachega Feijó Rosa, RB Genetics & Statistics Consulting, Brazil

Copyright © 2024 Montesinos-López, Crespo-Herrera, Pierre, Cano-Paez, Huerta-Prado, Mosqueda-González, Ramos-Pulido, Gerard, Alnowibet, Fritsche-Neto, Montesinos-López and Crossa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Abelardo Montesinos-López, abelardo.montesinos0233@academicos.udg.mx; José Crossa, j.crossa@cgiar.org

ORCID: José Crossa, orcid.org/0000-0001-9429-5855

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.