
ORIGINAL RESEARCH article

Front. Genet., 24 July 2023
Sec. Genomics of Plants and the Phytoecosystem
This article is part of the Research Topic Enriching Genomic Breeding With Environmental Covariates, Crop Models, and High-Throughput Phenotyping

Do feature selection methods for selecting environmental covariables enhance genomic prediction accuracy?

Osval A. Montesinos-López1, Leonardo Crespo-Herrera2, Carolina Saint Pierre2, Alison R. Bentley2, Roberto de la Rosa-Santamaria3, José Alejandro Ascencio-Laguna4, Afolabi Agbona5,6, Guillermo S. Gerard2, Abelardo Montesinos-López7*, José Crossa2,8*
  • 1Facultad de Telemática, Universidad de Colima, Colima, Mexico
  • 2International Maize and Wheat Improvement Center (CIMMYT), El Batán, Mexico
  • 3Colegio de Postgraduados, Campus Tabasco, Tabasco, Mexico
  • 4Instituto Mexicano del Transporte, Querétaro, Mexico
  • 5International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria
  • 6Molecular & Environmental Plant Sciences, Texas A&M University, College Station, TX, United States
  • 7Centro Universitario de Ciencias Exactas e Ingenierías (CUCEI), Universidad de Guadalajara, Guadalajara, JA, Mexico
  • 8Colegio de Postgraduados, Campus Montecillos, Montecillos, Mexico

Genomic selection (GS) is transforming plant and animal breeding, but its practical implementation for complex traits and multi-environmental trials remains challenging. To address this issue, this study investigates the integration of environmental information with genotypic information in GS. The study proposes the use of two feature selection methods (Pearson's correlation and Boruta) for the integration of environmental information. Results indicate that simply incorporating environmental covariates may increase or decrease prediction accuracy depending on the case. However, optimal incorporation of environmental covariates using feature selection significantly improved prediction accuracy in four out of six datasets, by between 14.25% and 218.71% in terms of Normalized Root Mean Squared Error under a leave-one-environment-out cross-validation scenario, although no relevant gain was observed in terms of Pearson's correlation. In the two datasets where the environmental covariates are unrelated to the response variable, feature selection was unable to enhance prediction accuracy. Therefore, this study provides empirical evidence supporting the use of feature selection to improve the prediction power of GS.

Introduction

To meet the demands of the growing worldwide population, it is of paramount importance to increase food production. However, achieving a significant increase is a complex task because of the deterioration of natural resources, the lack of land for agricultural use, significant fluctuations in climatic conditions, etc. For this reason, novel alternatives, such as the genomic selection (GS) methodology proposed by Meuwissen et al. (2001), are required for genetic improvement to increase yield stability, yield productivity, disease resistance, nutrition, and the subsequent end-use quality of key crops like wheat, rice, and maize (Crespo-Herrera et al., 2021).

GS is a transformative approach that employs high-density markers across the entire genome and operates under the assumption that at least one genetic marker is in linkage disequilibrium with a causative QTL for the desired trait (Meuwissen et al., 2001). This methodology is reshaping the field of plant and animal breeding through various means: (a) preemptively identifying promising genotypes prior to planting, (b) enhancing the accuracy of selecting superior individuals, (c) yielding significant resource savings primarily by reducing the need for extensive phenotyping, (d) accelerating the variety development process by reducing cycle length, (e) intensifying selection efforts, (f) facilitating the selection of traits that are challenging to measure, and (g) enhancing the accuracy of the selection process.

Most of the referenced advantages of GS arise because the genomic selection methodology is predictive (Meuwissen et al., 2001). For this reason, the GS methodology trains statistical machine learning models on a reference population that contains phenotypic and genotypic records and, after the training process, produces predictions for candidate individuals that only have genotypic records (Desta and Ortiz, 2014). The term “statistical machine learning methods” refers to methodologies that originate from the domains of statistics and machine learning, as defined by Montesinos-López et al. (2022).

The GS methodology is being applied in breeding programs of many crops, such as groundnut, maize, cassava, wheat, rice, and chickpea (Roorkiwal et al., 2016; Crossa et al., 2017; Wolfe et al., 2017; Huang et al., 2019). However, the practical implementation of GS is still challenging because it depends on many factors that should be optimized to guarantee high prediction accuracy. The simultaneous optimization of all these factors is very complex, so unexpected results are often found.

Several factors influence the application of the GS methodology, including: (a) the specific prediction objectives pertaining to tested lines in untested environments, untested lines in untested environments, tested lines in tested environments, or untested lines in tested environments, (b) the selection of an appropriate statistical machine learning method for accurate predictions (Montesinos-López et al., 2022), (c) the extent of coverage and marker density employed, (d) the influence of population structure (Frouin et al., 2019), (e) the heritability of the trait being considered, (f) the degree of mismatch between individuals in the testing and training sets (Habier et al., 2007), and (g) the sample size of the testing and training sets.

The calibration process necessary for the effective implementation of the GS methodology is not a simple task. However, accumulating empirical evidence from both small- and large-scale breeding programs indicates the feasibility of utilizing this methodology successfully. Nevertheless, it is crucial to acknowledge that achieving success requires careful consideration of various factors influencing its accuracy. Failing to account for these factors may result in unexpected outcomes following implementation. Specifically, important details such as the computation method for best linear unbiased estimates (BLUEs) used as response variables and the trait heritability play a pivotal role in avoiding low prediction accuracies.

Ongoing research aims to enhance the accuracy of GS by optimizing various factors. One area of focus is training- and testing-set optimization, which aims to improve the precision of this methodology (Rincent et al., 2012; Akdemir et al., 2015). Additionally, extensive evaluations of different statistical machine learning methods are underway to achieve better prediction performance and develop robust models that require minimal tuning while maintaining high accuracy (Montesinos-López et al., 2022). Furthermore, researchers are investigating the integration of other omics data (such as phenomics, transcriptomics, metabolomics, and enviromics) to enhance the accuracy of GS and identify powerful predictors for the traits of interest (Montesinos-López et al., 2017; Krause et al., 2019; Monteverde et al., 2019; Costa-Neto et al., 2021a; Costa-Neto et al., 2021b; Hu et al., 2021; Wu et al., 2022). These efforts aim to leverage additional information sources to improve the predictive capabilities of the GS methodology.

Incorporating environmental information, also known as enviromic data, in genomic prediction models has yielded varying results. Some studies have reported significant improvements by including this information (Costa-Neto et al., 2021a; Costa-Neto et al., 2021b), while others have observed modest or no improvements (Monteverde et al., 2019; Jarquin et al., 2020; Rogers and Holland, 2022). These mixed findings highlight the absence of a robust and precise method for effectively integrating environmental information into genomic prediction models. To address this gap, our present study focuses on evaluating the use of feature selection to identify the optimal environmental predictors. The goal is to enhance the accuracy of the GS methodology when incorporating environmental covariates.

Feature selection involves choosing a subset of relevant features from a larger available dataset. It plays a vital role in statistical machine learning as selecting the appropriate features can significantly improve model accuracy and efficiency. By selecting only the most pertinent features, feature selection reduces model complexity, leading to faster training times, better generalization to new data, and more interpretable outcomes. Additionally, feature selection helps mitigate overfitting, which occurs when a model becomes excessively complex and memorizes the training data rather than learning general patterns applicable to new data. Overall, feature selection is a critical step in the statistical machine learning pipeline, with the potential to greatly impact the performance and interpretability of a model (Chandrashekar and Sahin, 2014).

In this research, two feature selection methods (Pearson's correlation and Boruta) were evaluated, and their performance was compared against models without environmental covariates to quantify the gain in prediction accuracy achieved by adding the environmental covariates selected with each of the two methods.

Materials and methods

A summary of the six datasets utilized in this study is given in Table 1.


TABLE 1. Summary of the six datasets. Num. Lines denotes the number of lines evaluated in each dataset. Num. denotes number. USP denotes the University of São Paulo dataset.

Data

Dataset 1. USP

The University of São Paulo (USP) dataset is derived from germplasm developed by the Luiz de Queiroz College of Agriculture at the University of São Paulo, Brazil. Between 2016 and 2017, an experiment was conducted using 49 inbred lines, resulting in 906 F1 hybrids, of which 570 were evaluated across eight different environments. The environments were created using a combination of two locations, 2 years, and two nitrogen levels. However, in this research, we used information from four environments and 100 hybrids in each environment. The environments used in the study were distinct, with different soil types and climates. Environmental data were collected and used to create 248 covariables. The parent lines were genotyped using the Affymetrix Axiom Maize Genotyping Array, resulting in 54,113 high-quality SNPs after quality control procedures were applied. For more details about this dataset, see Costa-Neto et al. (2021a).

Dataset 2. Indica

Indica (Monteverde et al., 2019) is a rice dataset comprising phenotypic information for four traits: Percentage of Head Rice Recovery (PHR, measured in grams as the weight of whole milled kernels in a 100 g sample of rough rice), Grain Yield (GY, of paddy rice in kilograms per hectare), Plant Height (PH, measured in cm from the soil surface to the tip of the flag leaf), and percentage of Chalky Grain (GC, measured as the % of chalky kernels in a 50 g subsample of total milled rice). These traits were measured at three developmental stages (maturation, reproductive, and vegetative) across three environments (2010, 2011, and 2012). In each environment, 327 genotypes were evaluated and 18 environmental covariates were measured. These covariates include MaxTemp (maximum temperature in °C), MinTemp (minimum temperature in °C), TankEv (tank water evaporation in mm), Wind (wind speed at 2 m in km/24 h), PicheEv (Piche evaporation in mm), MinRelH (minimum relative humidity in %), AccumPpit (accumulated precipitation in mm), Sunhs (sunshine duration in hours), MinT15 (days with minimum temperature below 15°C), ThermAmp (thermal amplitude in °C), RelSun (relative sunshine duration in %), SolRad (solar radiation in cal/cm2/day), EfPpit (effective precipitation in mm), DegDay (degree days in rice in °C), RelH (relative humidity in hours), PpitDay (precipitation days), MeanTemp (mean temperature in °C, the average of temperature over 24 h), and AvTemp (average temperature in °C, calculated daily as (Max + Min)/2). The dataset contains 981 assessments, with each line assessed once in each environment. For each line, 16,383 SNP markers remained after quality control, with each marker coded as 0, 1, or 2.

Dataset 3. Japonica

Japonica is a dataset of 320 genotypes from the tropical Japonica rice population. It was evaluated for the same four traits (PH, PHR, GY, GC) as the Indica population in five environments (2009–2013), with covariates measured three times in each year, once for each developmental stage (maturation, reproductive, and vegetative). A total of 1,051 assessments were made in the five environments, and the dataset is not balanced. Additionally, each genotype was evaluated for the 16,383 SNP markers that remained after quality control, with each marker coded as 0, 1, or 2. For more details about this dataset, see Monteverde et al. (2019).

Datasets 4–6

The three datasets included in this study correspond to the years 2014 (dataset 4), 2015 (dataset 5), and 2016 (dataset 6) of the Genomes to Fields maize project, as described by Rogers et al. (2021). These datasets comprise phenotypic, genotypic, and environmental information. Specifically, the phenotypic data utilized in this study focused on four traits, namely, Grain_Moisture_BLUE (GM_BLUE), Grain_Moisture_weight (GM_Weight), Yield_Mg_ha_BLUE (YM_BLUE), and Yield_Mg_ha_weight (YM_Weight), selected from a larger set of traits reported by Rogers et al. (2021).

For the years 2014 (dataset 4), 2015 (dataset 5), and 2016 (dataset 6) there were 18, 12, and 18 available environments, respectively (see Table 1). In terms of the number of genotypes evaluated each year, there were 781, 1,011, and 456 genotypes for 2014, 2015, and 2016, respectively. The analysis utilized a set of 20,373 SNP markers that had already been imputed and filtered, as described by Rogers et al. (2021). The additive allele calls were recorded as counts of the minor allele (0, 1, 2). For further details regarding these datasets, please refer to Rogers et al. (2021).

Each dataset has its own unique set of environments, while for traits, the G2F_2014, G2F_2015, and G2F_2016 datasets share the same traits, as do Indica and Japonica.

Models

Bayesian model

The Bayesian model used with all predictors given in Table 2 is

$$Y_{ij}=\mu+E_i+g_j+gE_{ij}+\sum_{k=1}^{r}X_{ik}\beta_k+\epsilon_{ij}\qquad(1)$$


TABLE 2. Description of the 15 predictors implemented. Environmental covariates (0 denotes not used, 1 denotes used); selection method of environmental covariates (C = Pearson's correlation, B = Boruta). TC denotes the threshold correlation, which takes values 0.3, 0.4, 0.5, 0.6, and 0.7. The largest TC value was evaluated first; if no covariate satisfied it, the next largest was used, and so on.

where $Y_{ij}$ is the response variable for genotype $j$ in environment $i$; $\mu$ is a general mean; $E_i$ are the random effects of locations (environments), distributed as $\boldsymbol{E}=(E_1,\dots,E_I)^T\sim N(\boldsymbol{0},\sigma_E^2\boldsymbol{H}_e)$, where $\boldsymbol{H}_e$ is the environmental relationship matrix as computed by Cuevas et al. (2016), but computed with environmental variables in place of genomic information; that is, $\boldsymbol{H}_e=\frac{\boldsymbol{X}_e\boldsymbol{X}_e^T}{r}$, where $\boldsymbol{X}_e=(\boldsymbol{X}_1,\dots,\boldsymbol{X}_r)$ is the standardized (centered and scaled) matrix of dimension $I\times r$ containing the $r$ environmental covariates measured in each of the $I$ environments. $X_{ik}$ denotes environmental covariate $k$ measured in environment $i$, and $\beta_k$ is the beta coefficient corresponding to covariate $X_{ik}$. $g_j$, $j=1,\dots,J$, are the random effects of genotypes (lines), $gE_{ij}$ are the random effects of the genotype × environment interaction (GE), and $\epsilon_{ij}$ are the random error components, assumed to be independent normal random variables with mean 0 and variance $\sigma^2$. Furthermore, it is assumed that $\boldsymbol{g}=(g_1,\dots,g_J)^T\sim N(\boldsymbol{0},\sigma_g^2\boldsymbol{K}_g)$, where $\boldsymbol{K}_g$ is the genomic relationship matrix as computed by Cuevas et al. (2016), slightly different from that proposed by VanRaden (2008), using the marker data ($\boldsymbol{K}_g=\frac{\boldsymbol{M}_e\boldsymbol{M}_e^T}{p}$), where $\boldsymbol{M}_e$ is the standardized (centered and scaled) matrix of dimension $J\times p$ containing the $p$ markers measured on the $J$ genotypes. $\boldsymbol{gE}=(gE_{11},\dots,gE_{1J},\dots,gE_{IJ})^T\sim N(\boldsymbol{0},\sigma_{gE}^2\boldsymbol{K}_{gec})$, where $\boldsymbol{K}_{gec}=\boldsymbol{K}_{ec}\odot\boldsymbol{Z}_g\boldsymbol{K}_g\boldsymbol{Z}_g^T$, with $\boldsymbol{K}_{ec}=\boldsymbol{Z}_e\boldsymbol{H}_e\boldsymbol{Z}_e^T$; $\boldsymbol{Z}_e$ is the design matrix of environments, $\odot$ denotes the Hadamard product, and $\boldsymbol{Z}_g$ is the design matrix of genotypes. It is important to point out that the dimension of $\boldsymbol{X}_e$ is reduced after variable selection: in place of $I\times r$ it becomes $I\times r_s$ with $r_s\le r$.
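The relationship matrices above can be sketched numerically. The following Python snippet is an illustrative reimplementation (the study itself used R); the dimensions, random seed, and synthetic data are arbitrary assumptions, not values from the study. It builds $H_e = X_eX_e^T/r$ and $K_g = M_eM_e^T/p$ from standardized covariate and marker matrices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins: I environments with r covariates, J genotypes with p markers.
I, r = 5, 18     # environments x environmental covariates
J, p = 100, 500  # genotypes x SNP markers

X_raw = rng.normal(size=(I, r))          # raw environmental covariates
M_raw = rng.integers(0, 3, size=(J, p))  # SNPs coded 0/1/2

def standardize(A):
    """Center and scale each column; columns with zero variance are dropped."""
    A = A - A.mean(axis=0)
    sd = A.std(axis=0, ddof=1)
    return A[:, sd > 0] / sd[sd > 0]

Xe = standardize(X_raw)                  # I x r
Me = standardize(M_raw.astype(float))    # J x p

# Relationship matrices as in the text: H_e = Xe Xe'/r and K_g = Me Me'/p
H_e = Xe @ Xe.T / Xe.shape[1]
K_g = Me @ Me.T / Me.shape[1]

print(H_e.shape, K_g.shape)  # (5, 5) (100, 100)
```

Both matrices are symmetric; their dimensions depend only on the number of environments and genotypes, not on the number of covariates or markers.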

Predictors implemented

To better understand the content of Table 2, we next describe how some components of the predictors given in Table 2 were computed. For example, $K_e$ was computed as $K_e=\frac{Z_eZ_e^T}{I}$ and $K_{ge}=K_e\odot Z_gK_gZ_g^T$. $X_{e.avg}$ denotes an average covariate computed from the environmental covariates $X_e$; $X_e$ contains all available environmental covariates when no feature selection was applied, and only the selected covariates when feature selection was applied. This average covariate ($X_{e.avg}$) was computed from $X_e$ of order $I\times r_s$ after variable selection as follows:

Step 1. First, we identify the direction (positive or negative) of the correlation of each column of $X_e$ with the response variable, computed only in the training set.

Step 2. The columns of $X_e$ with negative correlation are multiplied by −1 to guarantee a positive correlation with the response variable; this new matrix is called $X_e^*$.

Step 3. Then $X_{e.avg}$ is computed for the whole dataset as the average of each row of $X_e^*$; for this reason $X_{e.avg}$ has order $I\times 1$. But because the covariates are measured at the environment (location) level, to agree with the design matrices of model (1), $X_{e.avg}$ is expanded to order $IJ\times 1$, since each covariate is the same for all lines in the same environment.
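Steps 1–3 can be sketched in a short function. This is an illustrative sketch: the function name is hypothetical, and the use of a per-environment summary of the training response to determine correlation signs is our assumption about how to condense the training-set response to the environment level.

```python
import numpy as np

def average_covariate(Xe, y_env):
    """Collapse the selected covariates into one averaged covariate (Xe.avg).

    Xe    : I x r_s matrix of (standardized) selected covariates.
    y_env : length-I vector of the training-set response summarized per
            environment (used only to find the correlation signs).
    """
    # Step 1: sign of the correlation of each covariate with the response.
    signs = np.sign([np.corrcoef(Xe[:, k], y_env)[0, 1] for k in range(Xe.shape[1])])
    signs[signs == 0] = 1.0
    # Step 2: flip negatively correlated columns so all correlate positively.
    Xe_star = Xe * signs
    # Step 3: row-wise average gives a single covariate of order I x 1.
    return Xe_star.mean(axis=1)
```

The resulting $I\times 1$ vector would then be expanded to $IJ\times 1$ by repeating each environment's value for all lines in that environment, as the text describes.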

Using only one covariate, $X_{e.avg}$, implies that only one beta coefficient needs to be estimated, in place of the $r_s$ beta coefficients required when the $X_e$ matrix is used as input. For predictor M9, it is important to point out that $X_{g.ec}$ denotes the selected covariates, but instead of selecting only from the environmental covariates, we performed Boruta selection on the markers and environmental covariates together. For feature selection, Pearson's correlation and the Boruta method were used; these are explained in the next section. All predictors given in Table 2 were implemented in the BGLR package of Pérez and de los Campos (2014) in the R statistical software (R Core Team, 2023).

Feature selection methods and algorithms for selecting environmental covariates

The training of each model differs in the use of the environmental covariates of each dataset. Model M0 differs from the other models by making predictions without any information from the environmental covariates; for this reason, in model M0 the linear kernels $K_e=\frac{Z_eZ_e^T}{I}$ and $K_{ge}=K_e\odot Z_gK_gZ_g^T$ were computed only with the design matrix of environments ($Z_e$). Model M1 is equal to model M0 plus all available environmental information as covariates $X_e$, without variable selection. Model M2 is equal to model M0, but the computation of the linear kernels ($K_{ec}=Z_eH_eZ_e^T$ and $K_{gec}=K_{ec}\odot Z_gK_gZ_g^T$) takes into account the environmental covariates after variable selection with Pearson's correlation. Model M3 is equal to model M1 but uses the covariates $X_e$ after variable selection with Pearson's correlation. Model M4 is equal to model M2 but also includes the environmental information as covariates ($X_e$) after variable selection with Pearson's correlation. Model M5 is equal to model M3, but in place of using only $X_e$ as covariates after variable selection with Pearson's correlation, the square of each column of $X_e$ was also used; that is, the covariates used were $X_{e2}=[X_e,\,X_e\odot X_e]$. Model M6 is equal to model M3, but in place of $X_e$ after variable selection with Pearson's correlation, only the average covariate ($X_{e.avg}$) was used. Model M7 is equal to model M4, but in place of $X_e$ after variable selection with Pearson's correlation, the average covariate ($X_{e.avg}$) was used as well. Model M8 is equal to model M7, but the variable selection process was done with both Pearson's correlation and Boruta simultaneously. Model M9 performed variable selection of markers and environmental covariates simultaneously using the Boruta algorithm; the resulting selected covariates are called $X_{g.ec}$, while $K_e$ and $K_{ge}$ were computed only with the design matrix of environments ($Z_e$).
Model M10 is equal to model M3, but the environmental covariates were selected with the Boruta algorithm, keeping both tentative and confirmed covariates. It is important to point out that in model M10 the Boruta algorithm was also applied to select the markers, and the selected markers were then used to compute the linear kernels of lines ($K_g$) and of the genotype × environment interaction ($K_{ge}$). Model M11 is equal to model M8, but the environmental and marker covariates were selected with the Boruta algorithm, keeping only confirmed covariates. Model M12 is equal to model M11, but the Boruta algorithm kept both tentative and confirmed covariates. Model M13 is equal to model M12, but the selected environmental covariates were also used to compute the linear kernels of environments ($K_{ec}$) and of the genotype × environment interaction ($K_{gec}$). Finally, model M14 is equal to model M13, but the Boruta algorithm kept only confirmed features. More details of each predictor are given in Table 2.

It is important to point out that only model M1 used all environmental covariates; the remaining models that include environmental covariates use a selected subset of all those available. The choice of feature selection method depends on the specific problem and the characteristics of the data. It is often necessary to try different methods and evaluate their performance to find the most suitable one for a particular task.

The first feature selection method is based on the correlation between the environmental covariates and the response variable, where selection is made according to the largest correlations present in each training set of each trait. It is important to note that the selection of these covariates is done discarding the response variables in the testing set (a complete environment in this case, since only a leave-one-environment-out cross-validation was implemented), that is, the environment to be predicted. The threshold correlations used to select environmental covariates were 0.3, 0.4, 0.5, 0.6, and 0.7. When all correlations were lower than 0.3, the training process was done without any environmental covariates. If only a few covariates satisfied the threshold of 0.7, only these covariates were used in the training process; if none satisfied this threshold, those that satisfied the next lower threshold (0.6) were used, and so on.
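The threshold-with-fallback rule described above can be sketched as follows. This is an illustrative sketch, not the study's code; the function and variable names are hypothetical.

```python
import numpy as np

def select_by_correlation(X_train, y_train, thresholds=(0.7, 0.6, 0.5, 0.4, 0.3)):
    """Pick the covariate columns whose absolute Pearson correlation with the
    training response meets the highest threshold that at least one covariate
    satisfies. Returns the selected column indices (possibly empty)."""
    cors = np.abs([np.corrcoef(X_train[:, k], y_train)[0, 1]
                   for k in range(X_train.shape[1])])
    for tc in thresholds:        # try the strictest threshold first
        idx = np.where(cors >= tc)[0]
        if idx.size > 0:
            return idx.tolist()
    return []                    # all below 0.3: train without environmental covariates
```

Note that the correlations are computed on the training rows only, consistent with the leave-one-environment-out scheme in which the held-out environment's responses are never seen.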

The second feature selection method consists of implementing the Boruta algorithm, which seeks to capture covariates that are strongly or weakly relevant to the response variable. Here too, the covariates included in the training process of the models were selected using the response variables of the training set only; the observations that will be part of the testing set are not used for selecting the important environmental covariates.

Boruta is a feature selection algorithm designed to handle high-dimensional datasets with noisy features (Kursa and Rudnicki, 2010). The algorithm works by creating a shadow feature set, which is a copy of the original feature set with its values randomly permuted. The shadow features are used as a control to determine whether a feature is statistically significant. The original features are considered relevant if their importance scores are significantly higher than the importance scores of their shadow features. Boruta is useful for datasets with a large number of noisy features, where traditional feature selection methods may struggle. However, it can be computationally expensive and may require careful parameter tuning to achieve optimal results (Kursa and Rudnicki, 2010).

The Boruta algorithm works in the following steps.

1. Create a shadow feature set by randomly permuting the values of each feature.

2. Train a random forest model on the original and shadow feature sets.

3. Calculate the feature importance scores for each original feature by comparing them to the importance scores of their shadow features.

4. Determine the maximum importance score for each feature.

5. Check whether the maximum importance score is statistically significant using the Binomial test. If it is significant, mark the feature as important. Otherwise, mark the feature as unimportant. The Binomial test is a statistical test used in Boruta to evaluate the significance of feature importance scores. It compares the observed number of successes (e.g., the number of times a feature’s importance score exceeds a threshold) with the expected number of successes under a null hypothesis. The test assesses whether the observed results are statistically significant or can be attributed to chance. In Boruta, the Binomial test is applied to determine if the feature importance scores are significantly higher than the importance scores of shadow features, indicating the importance of the original features (Kursa and Rudnicki, 2010).

6. Repeat steps 1–5 for a predetermined number of iterations.

7. Rank the features by their importance scores and select the top n features for the final feature set. The Boruta algorithm selects the important features based on their labels. “Confirmed” features are considered important, while “Rejected” features are deemed unimportant. “Tentative” features can be considered less important or require further investigation.
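The steps above can be sketched in a dependency-free toy version. Note that this sketch substitutes squared Pearson correlation with the response for the random forest importance that Boruta actually uses, so it only illustrates the shadow-feature and binomial-test logic; all names are hypothetical.

```python
import numpy as np
from math import comb

def boruta_sketch(X, y, n_iter=50, rng=None):
    """Toy Boruta loop: compare each feature's importance with the best shadow
    (permuted) feature across iterations, then label features with a binomial
    test. Squared correlation stands in for random forest importance."""
    rng = rng or np.random.default_rng(0)
    imp = lambda x: np.corrcoef(x, y)[0, 1] ** 2
    n, p = X.shape
    imp_orig = np.array([imp(X[:, j]) for j in range(p)])
    hits = np.zeros(p, dtype=int)
    for _ in range(n_iter):
        shadows = rng.permuted(X, axis=0)                 # step 1: permute each column
        shadow_max = max(imp(shadows[:, j]) for j in range(p))
        hits += imp_orig > shadow_max                     # steps 3-4: beat the best shadow?
    def upper_tail(k):
        # P(hits >= k) under Binomial(n_iter, 0.5), the null of random guessing
        return sum(comb(n_iter, i) for i in range(k, n_iter + 1)) / 2.0 ** n_iter
    labels = []
    for h in hits:                                        # step 5: binomial decision
        if upper_tail(h) < 0.05:
            labels.append("Confirmed")
        elif upper_tail(n_iter - h) < 0.05:
            labels.append("Rejected")
        else:
            labels.append("Tentative")
    return labels
```

A feature that beats the best shadow in nearly every iteration is labeled "Confirmed", one that almost never does is "Rejected", and anything in between remains "Tentative", mirroring the three labels described in step 7.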

Evaluation of prediction accuracy

To assess the prediction accuracy, a leave-one-environment-out (LOEO) cross-validation approach was employed for each dataset. This involved iteratively constructing the training set from all but one environment (I − 1 environments) while using the remaining environment as the testing set. The evaluation methodology follows the approach described by Montesinos-López et al. (2022). However, as pointed out before, since some of the predictors given in Table 2 involve selected environmental covariates, the selection of these covariates was done after splitting the data into training and testing sets, and only the training set was used for selecting the important covariates, because using the whole dataset before splitting gives overly optimistic results due to information leakage. Data leakage, a significant issue in machine learning, happens when the data used to train an algorithm includes information about what the model is attempting to predict. Information leakage is a primary error in machine learning that can significantly degrade production performance despite high validation accuracy, since the overly optimistic results do not translate to real applications.
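A minimal LOEO split generator might look like the following. This is an illustrative sketch with hypothetical names; the key point, as in the text, is that any feature selection must run inside each fold on the training rows only.

```python
import numpy as np

def loeo_splits(env_labels):
    """Leave-one-environment-out: yield (env, train_idx, test_idx) triples,
    holding out all observations of one environment at a time."""
    env_labels = np.asarray(env_labels)
    for env in np.unique(env_labels):
        test = np.where(env_labels == env)[0]
        train = np.where(env_labels != env)[0]
        yield env, train, test
```

With I environments this produces I folds, each using I − 1 environments for training and the remaining one for testing.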

The prediction accuracy was reported in terms of the Normalized Root Mean Squared Error (NRMSE) and Pearson's correlation (Cor). We also computed the absolute value of the intercept (b0) and the absolute value of b = 1 − slope by regressing the observed values against the predicted values, to inform about the quality of Pearson's correlation. In terms of Pearson's correlation, the closer the value to 1, the better the predictions; in terms of NRMSE, b0, and b, the closer to zero, the better. Furthermore, we counted the number of instances where model m outperformed model m′ in terms of NRMSE, for m = 0, …, 14 and m′ = 0, …, 14 with m ≠ m′. This count was performed for each dataset, taking into consideration the specific traits and environments under evaluation. We also computed the Relative Efficiency (RE), in terms of NRMSE, of each model with respect to model M0 with the following expression:

$$RE=\left(\frac{NRMSE_{M_0}}{NRMSE_{M_k}}-1\right)\times 100\qquad(2)$$

where $M_0$ denotes model M0 without environmental covariates and $k=1,\dots,14$ in each dataset. The relative efficiencies for Cor were computed using the values of Cor with the following equation (Eq. 3):

$$RE=\left(\frac{Cor_{M_k}}{Cor_{M_0}}-1\right)\times 100\qquad(3)$$

REs were also computed for b0 and b using Eq. 2, with the values of b0 and b in place of NRMSE.
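Equations 2 and 3 are straightforward to compute; the following sketch (function names are ours) makes the direction of each ratio explicit:

```python
def relative_efficiency(nrmse_m0, nrmse_mk):
    """Eq. 2: RE in percent of model Mk vs. baseline M0 in terms of NRMSE.
    Positive values mean Mk reduces NRMSE relative to M0."""
    return (nrmse_m0 / nrmse_mk - 1) * 100

def relative_efficiency_cor(cor_mk, cor_m0):
    """Eq. 3: RE in percent in terms of Pearson's correlation.
    Positive values mean Mk correlates better than M0."""
    return (cor_mk / cor_m0 - 1) * 100
```

For example, a model with NRMSE = 1.0 against a baseline NRMSE of 1.2 yields an RE of 20%, i.e., a 20% gain over the baseline.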

Results

The results are provided in sections for each data set.

USP data

The results for this dataset are presented in Figure 1, with details in Table 3. In the count by environments, model M9 (47/60) turns out to be the best, being better than the other models 47 times out of a total of 60 possible combinations. The second-best model is M13 (46/60), while the worst models are M6 and M1, with counts of 11/60 and 20/60, respectively. There is therefore a marked difference between the best and worst models, while model M0 turns out to be the seventh-worst model. By trait, model M9 (14/14) is the best model with the maximum count (14 out of 14 possible combinations), and model M13 (13/14) is the second best. Meanwhile, the worst model turns out to be M6 (0/14), and the second worst is M10 (1/14). Thus, the results by environments and traits consistently point to model M9 as the best and M6 as the model with the most deficient predictions. Furthermore, model M0 appears as the fourth-worst model by environments and the fifth-worst model by trait.


FIGURE 1. USP data. (A) Count of the number of times a model is better than another, by environment. (B) Count of the number of times a model is better than another, by trait. (C) Performance of the models’ prediction measured by Normalized Root Mean Squared Error (NRMSE). (D) Relative efficiency (RE) of each predictor compared to model M0, the model without environmental information.


TABLE 3. University of São Paulo (USP) dataset. Count of the number of times a model was better than another in terms of Normalized Root Mean Squared Error (NRMSE), both by environments and by traits. Prediction accuracy in terms of NRMSE. b0 denotes the absolute value of the intercept and b denotes the absolute value of 1 − slope. Under an ideal model both b0 and b should be equal to zero. The relative efficiency (RE) of each model, in percentage, was computed with respect to model M0, without environmental covariates. A positive percentage indicates a gain in prediction accuracy relative to M0, while a negative percentage indicates a loss.

According to NRMSE, model M9 is the best overall with NRMSE = 1.19; the second-best model is M13 (NRMSE = 1.26). Meanwhile, the worst model is M6 (NRMSE = 1.50), followed by the second worst, M10 (NRMSE = 1.49). In terms of the RE for NRMSE compared to model M0, the gain is up to 19.13% for the best model, M9; but in terms of Cor, model M9 was worse than model M0 by 51.80%. In terms of Cor, the best model was M4, which outperformed M0 by 6.05%. Interestingly, model M4 also improved over M0 in terms of the intercept (b0) and b, by 884.46% and 500.55%, respectively. See further details in Table 3.

Indica data

The results of this dataset are presented in Figure 2, with details in Table 4. In the count by environments, model M12 is the best, winning 124 of 180 possible comparisons (124/180). The second-best model is M11 (112/180), while the worst models are M9, M10, and M3 (with counts of 24/180, 32/180, and 75/180, respectively). Meanwhile, by trait, model M12 (48/60) is the best model, while model M11 (44/60) is positioned as the second best. In the opposite direction, the worst model is M9 (2/60) and the second worst is M10 (15/60). Thus, the results by environments and traits consistently point to model M12 as the best and M9 as the model with the poorest prediction, with model M0 among the top five.


FIGURE 2. Indica data. (A) Count of the number of times a model is better than another, by environments. (B) Count of the number of times a model is better than another, by trait. (C) Prediction performance of each predictor in terms of Normalized Root Mean Squared Error (NRMSE). (D) Relative efficiency (RE) of each model compared to model M0, the model without environmental information.


TABLE 4. Indica data. Count of the number of times a model was better than another in terms of Normalized Root Mean Squared Error (NRMSE), both by environments and by traits. Prediction accuracy in terms of NRMSE. b0 denotes the absolute value of the intercept and b denotes the absolute value of 1 − slope; under an ideal model both b0 and b should be equal to zero. Relative efficiency (RE) of each model, in percentage, was computed with respect to model M0 without environmental covariates: a positive percentage indicates a gain in prediction accuracy over M0, while a negative percentage indicates a loss.

By NRMSE, model M12 is the best (NRMSE = 0.98), and the second-best model is M11 (NRMSE = 1.02). Meanwhile, the worst model is M9 (NRMSE = 1.17) and the second worst is M3 (NRMSE = 1.13). In terms of NRMSE, the RE of M12 compared to model M0 showed a gain of only 5.15%, and in terms of Cor, model M12 gained only 0.27%. In terms of Cor, the best model was M6, which outperformed M0 by only 0.44%. Model M12 improved on M0 in terms of the intercept (b0) and b by 22.65% and 43.85%, respectively (see Table 4 for more details).

Model M12 uses the Boruta method for variable selection. In this dataset it is very interesting that model M0, without environmental covariates, appears as one of the best, indicating that for this dataset adding the environmental covariates is not necessarily helpful for increasing prediction accuracy. One of the main reasons for this result is the low correlation of the environmental covariates with the response variable, which causes them to add noise to the predictions when included.
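The study applied the Boruta R package (Kursa and Rudnicki, 2010). As a minimal sketch of the shadow-feature idea behind Boruta, the following single-pass Python version uses scikit-learn's random forest; the real algorithm iterates this comparison with statistical tests, so this is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def boruta_like_selection(X, y, n_trees=500, seed=0):
    """One round of the shadow-feature idea behind Boruta: permute each
    column to create 'shadow' covariates that are by construction unrelated
    to the response, fit a random forest on the augmented matrix, and keep
    the real covariates whose importance exceeds the largest shadow
    importance. (Boruta proper repeats this and applies statistical tests.)"""
    rng = np.random.default_rng(seed)
    # Permute each column independently to break its link with y
    shadows = np.apply_along_axis(rng.permutation, 0, X)
    augmented = np.hstack([X, shadows])
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(augmented, y)
    p = X.shape[1]
    real_imp = forest.feature_importances_[:p]
    shadow_max = forest.feature_importances_[p:].max()
    return [j for j in range(p) if real_imp[j] > shadow_max]
```

A covariate survives only if it is more informative than the best purely random shadow, which is why this family of methods tends to discard the noise-adding covariates discussed above.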

Japonica data

The results of this dataset are presented in Figure 3 (with details in Table 5). In the count by environments, model M14 is the best, winning 189 of 300 (189/300) possible comparisons. The second-best model is M13 (185/300), while the worst models are M2 and M9 (with counts of 92/300 and 107/300, respectively). Meanwhile, by trait, model M13 (42/60) is the best model, while models M14 (41/60) and M12 (41/60) are positioned as the second best. In the opposite direction, the worst model is M2 (7/60) and the second worst is M0 (18/60). Thus, the results by environments and traits consistently indicate model M13 as the best and M2 as the model with the poorest prediction, and model M0 is among the worst five models.


FIGURE 3. Japonica data. (A) Count of the number of times a model is found to be better than another, by environment. (B) Count of the number of times a model is found to be better than another, by trait. (C) Prediction accuracy of each predictor (M0 to M14) in terms of Normalized Root Mean Squared Error (NRMSE). (D) Relative efficiency (RE) of each model compared to model M0, the model without environmental information.


TABLE 5. Japonica data. Count of the number of times a model was better than another in terms of Normalized Root Mean Squared Error (NRMSE), both by environments and by traits. Prediction accuracy in terms of NRMSE. b0 denotes the absolute value of the intercept and b denotes the absolute value of 1 − slope; under an ideal model both b0 and b should be equal to zero. Relative efficiency (RE) of each model, in percentage, was computed with respect to model M0 without environmental covariates: a positive percentage indicates a gain in prediction accuracy over M0, while a negative percentage indicates a loss.

In terms of NRMSE, model M10 ranks best (NRMSE = 1.70), followed by the second-best model, M13 (NRMSE = 1.77). Meanwhile, the worst model is M2 (NRMSE = 3.75) and the second worst is M6 (NRMSE = 2.11). The gain of model M10 over model M0 in terms of NRMSE was 14.25%, but this model reduced its accuracy in terms of Cor by 33.19%. In terms of Cor, the best model was M5, which outperformed M0 by only 0.51%. Model M5 improved on M0 in terms of the intercept (b0) by 299.51%, but was worse in terms of b by 26.04% (Table 5).

G2F_2014 data

The results of this dataset are presented in Figure 4 (see Table 6). In the count by environments, model M13 (608/1440) is the best, winning 608 of 1440 possible comparisons. The second-best model is M1 (590/1440), while the worst models are M9, M8, and M7 (with counts of 422/1440, 424/1440, and 438/1440, respectively). Thus, a marked difference between the best and worst models can be observed, with model M0 being the fifth-worst model.


FIGURE 4. G2F_2014 data. (A) Count of the number of times one model is better than another, by environment. (B) Count of the number of times one model is better than another, by trait. (C) Prediction accuracy of each predictor (M0 to M14) in terms of Normalized Root Mean Squared Error (NRMSE). (D) Relative efficiency (RE) of each model compared to model M0, the model without environmental information.


TABLE 6. G2F_2014 data. Count of the number of times a model was better than another in terms of Normalized Root Mean Squared Error (NRMSE), both by environments and by traits. Prediction accuracy in terms of NRMSE. b0 denotes the absolute value of the intercept and b denotes the absolute value of 1 − slope; under an ideal model both b0 and b should be equal to zero. Relative efficiency (RE) of each model, in percentage, was computed with respect to model M0 without environmental covariates: a positive percentage indicates a gain in prediction accuracy over M0, while a negative percentage indicates a loss.

By trait, models M14 and M6 (45/80) are the best models, each winning 45 of 80 possible comparisons. Models M13 and M1 (41/80) are positioned as the second best. Meanwhile, the worst model is M7 (13/80) and the second worst is M8 (14/80). Thus, the results by environments and traits consistently point to models M13 and M14 as the best and models M7 and M8 as those with the poorest prediction.

In terms of NRMSE, model M0 is the best of all (NRMSE = 4.30), and the second best is M1 (NRMSE = 4.67). Meanwhile, the worst model is M4 (NRMSE = 12.26), and the second- and third-worst models are M7 (NRMSE = 9.66) and M11 (NRMSE = 8.66). Regarding Cor, only model M1 was slightly better than model M0 (by 0.66%), but it was not better than M0 in terms of intercept and slope. In this dataset, the environmental covariates were not significantly related to the response variable, which explains why adding them as inputs is not beneficial under these circumstances. See details in Table 6.

G2F_2015 data

The results of this dataset are presented in Figure 5, with details in Table 7. In the count by environments, model M12 (445/720) is the best, winning 445 of 720 possible comparisons. The second-best model is M13 (442/720), while the worst models are M9 and M4 (with counts of 246/720 and 264/720, respectively). Therefore, a marked difference between the best and worst models is observed. By trait, model M13 (50/80) is the best model, with 50 wins out of 80 possible comparisons. Model M12 (46/80) is positioned as the second best, followed by M14 (39/80). Meanwhile, the worst model is M9 (6/80) and the second worst is M4. Thus, the results by environments and traits consistently point to models M12 and M13 as the best and M9 as the model with the poorest prediction.


FIGURE 5. G2F_2015 data. (A) Count of the number of times one model performs better than another, by environment. (B) Count of the number of times one model performs better than another, by trait. (C) Prediction accuracy of each predictor (M0 to M14) in terms of Normalized Root Mean Squared Error (NRMSE). (D) Relative efficiency (RE) of each model compared to model M0, the model without environmental information.


TABLE 7. G2F_2015 data. Count of the number of times a model was better than another in terms of Normalized Root Mean Squared Error (NRMSE), both by environments and by traits. Prediction accuracy in terms of NRMSE. b0 denotes the absolute value of the intercept and b denotes the absolute value of 1 − slope; under an ideal model both b0 and b should be equal to zero. Relative efficiency (RE) of each model, in percentage, was computed with respect to model M0 without environmental covariates: a positive percentage indicates a gain in prediction accuracy over M0, while a negative percentage indicates a loss.

Regarding NRMSE, model M10 ranks best of all (NRMSE = 2.95), and the second-best model is M13 (NRMSE = 3.14). Meanwhile, the worst model is M0 (NRMSE = 9.41) and the second worst is M2 (NRMSE = 9.01). In terms of RE, compared to the worst model (M0), the gains are up to 218.71% for the best model, M10, and 199.81% for M13. It is also noteworthy that the best model, M10, was better by 23.38% than model M1, the model that takes into account all the environmental covariates without variable selection. In terms of Cor, model M1 was the best, with a gain of only 0.21% over M0; however, it was worse than M0 in terms of b0 and b, by 76.97% and 8.39%, respectively. Model M10, the best in terms of NRMSE, was worse than M0 in terms of Cor and b0 by 16.46% and 65.11%, respectively, although it was better than M0 in terms of b by 72.63% (Table 7).

G2F_2016 data

The results of this dataset are presented in Figure 6 (Table 8). In the count by environments, model M13 (618/1440) is the best, winning 618 of 1440 possible comparisons. The second-best model is M14 (604/1440), while the worst models are M9 and M2 (with counts of 297/1440 and 440/1440, respectively). By trait, model M13 (43/80) is the best model, with 43 wins out of 80 possible comparisons. Model M14 (42/80) is positioned as the second best. Meanwhile, the worst model is M9 (3/80) and the second worst is M2 (17/80). The results by environments and traits consistently point to model M13 as the best model and M9 as the model with the poorest prediction.


FIGURE 6. G2F_2016 data. (A) Count of the number of times a model is better than another, by environment. (B) Count of the number of times a model is better than another, by trait. (C) Prediction accuracy of each predictor (M0 to M14) in terms of Normalized Root Mean Squared Error (NRMSE). (D) Relative efficiency (RE) of each model compared to model M0, the model without environmental information.


TABLE 8. G2F_2016 data. Count of the number of times a model was better than another in terms of Normalized Root Mean Squared Error (NRMSE), both by environments and by traits. Prediction accuracy in terms of NRMSE. b0 denotes the absolute value of the intercept and b denotes the absolute value of 1 − slope; under an ideal model both b0 and b should be equal to zero. Relative efficiency (RE) of each model, in percentage, was computed with respect to model M0 without environmental covariates: a positive percentage indicates a gain in prediction accuracy over M0, while a negative percentage indicates a loss.

Regarding NRMSE, model M3 is the best of all (NRMSE = 3.40), and the second best is M14 (NRMSE = 3.44). Meanwhile, the worst model is M11 (NRMSE = 6.06) and the second worst is M1 (NRMSE = 6.01). The gain of model M3 with respect to M0, measured through RE, was 26.85%, while the best model, M3, outperformed model M1, which uses all the environmental covariates without variable selection, by 76.76%. That is, including all the environmental covariates decreased rather than improved prediction accuracy, since even model M0, with no environmental covariates, performed significantly better than M1. It is important to point out that model M3 was worse than model M0 in terms of Cor and b0, by 0.22% and 11.68% respectively, but better than M0 by 205.09% in terms of b. Also, none of the models M1 to M14 outperformed the model without environmental covariates (M0) in terms of Cor. See details in Table 8.

Discussion

Many statistical machine learning algorithms struggle to produce accurate genomic predictions because the relationship between the phenotype, the genotypic information, and other inputs is considerably intricate and complex. Even with these difficulties, statistical machine learning methods have been adopted for genomic prediction and are helping to solve challenging problems in genetics and genomics (Ramon et al., 2020; Montesinos-López et al., 2022).

However, care needs to be taken in the application of statistical machine learning methods, since if the inputs are not related to the response variable, using them to train our prediction models adds noise to the modeling process. For this reason, feature selection is of paramount importance, because it helps identify the most relevant and informative features in a given dataset. The goal of feature selection is to reduce the dimensionality of the data by selecting a subset of features that provides the most accurate predictions while minimizing the risk of overfitting. Empirical evidence supports the importance of feature selection because it: 1) improves model performance, since by selecting the most relevant features the model can be trained more efficiently, leading to improved accuracy and reduced error rates; 2) reduces the risk of overfitting by retaining only the most important features; 3) improves interpretability, since with the most important features the model becomes more understandable, which helps to better illustrate the underlying relationships between features and the outcome; and 4) reduces data collection and storage costs, since collecting and storing large amounts of data can be costly. Overall, feature selection is a crucial step in the statistical machine learning pipeline that can help to improve model performance, reduce the risk of overfitting, improve interpretability, and reduce data collection and storage costs.

In our application within the context of GS with six real datasets, we found that in four of the six datasets performing variable selection significantly improved the prediction accuracy, which corroborates the empirical evidence that feature selection improves prediction accuracy. In these cases, the improvement in prediction accuracy relative to ignoring the environmental covariates was between 14.25% and 218.71% in terms of NRMSE; however, no relevant improvement was observed in terms of Pearson's correlation. In the other two datasets, instead of improving the prediction accuracy with environmental covariates, we obtained worse results in some scenarios, namely when all the environmental covariates were included without any variable selection. For this reason, the selection of important features should be done with much care, including only moderately or highly correlated inputs, to guarantee an improvement in prediction accuracy.
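The correlation-based filter used as one of the two feature selection methods can be sketched as follows. This is an illustrative Python version; the threshold value and the function names are assumptions, since the study's actual cutoff is not restated here.

```python
import math

def select_by_correlation(covariates, y, threshold=0.5):
    """Filter-style selection: keep the environmental covariates whose
    absolute Pearson correlation with the response exceeds `threshold`.
    `covariates` maps covariate name -> list of values; the threshold
    is illustrative, not the one used in the study."""
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
        sa = math.sqrt(sum((x - ma) ** 2 for x in a))
        sb = math.sqrt(sum((z - mb) ** 2 for z in b))
        return cov / (sa * sb)
    # Return the surviving covariates together with their correlations
    return {name: pearson(col, y) for name, col in covariates.items()
            if abs(pearson(col, y)) > threshold}
```

Covariates that correlate only weakly with the response, as in the Indica and G2F_2014 datasets, are dropped by such a filter rather than being allowed to add noise to the predictor.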

Also, we observed that the models with the highest predictive capacity differ between datasets, owing to differences in the structural composition of each dataset. In general, however, models with selected covariates have a higher predictive capacity in terms of NRMSE. According to Buntaran et al. (2021), the prior selection of features is advantageous for model fitting when there are many features.

Regarding the predictors (M6, M7, M8, M11, M12, M13, and M14) that exclusively incorporated a single covariate, computed as the mean of all the selected environmental covariates, we observed competitive predictive outcomes in five of the six datasets in terms of NRMSE. This approach offers the advantage of estimating only one parameter (beta coefficient). However, it should be noted that when computing the average covariate, we changed the direction (sign) of those covariates that exhibited a negative correlation with the response variable, ensuring a positive association for each covariate. Consequently, accurately determining the direction of the association is crucial to ensure optimal performance when using this average covariate.
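The averaged covariate described above can be sketched as follows; this is a minimal illustration, assuming that plain sign flipping of negatively correlated covariates is the adjustment described, with function names invented for the example.

```python
def average_covariate(covariates, y):
    """Build a single averaged environmental covariate, as used by the
    single-covariate predictors described in the text: flip the sign of
    covariates negatively correlated with the response so that all point
    in the same direction, then average them per observation."""
    n = len(y)
    my = sum(y) / n
    adjusted = []
    for col in covariates:
        mc = sum(col) / n
        cov = sum((c - mc) * (v - my) for c, v in zip(col, y))
        sign = -1.0 if cov < 0 else 1.0   # flip negatively correlated covariates
        adjusted.append([sign * c for c in col])
    # Element-wise mean across the sign-adjusted covariates
    return [sum(vals) / len(adjusted) for vals in zip(*adjusted)]
```

The resulting single column replaces the full covariate matrix, so the downstream model estimates one beta coefficient instead of one per covariate.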

Also, from our research we deduce that it is very challenging to incorporate some of the environmental covariates, since those included during the training process should be selected using only the training set, to avoid a data-leakage problem. In the context of GS with environmental covariates, data leakage occurs when we inadvertently select the environmental covariates using the whole dataset (training and testing) instead of only the training set; we then end up with overoptimistic prediction accuracies, and when the model is used for real predictions it fails, because data leakage leads to overfitting: the model accurately predicts the training data but performs poorly on new data. This can create a false sense of confidence in the model's performance and lead to costly errors when the model is deployed in the real world.

To avoid data leakage, it is important to carefully separate the training and test data and to ensure that the test data represent real-world scenarios. To guarantee reasonable predictions in the testing set, we assume that the training and testing sets come from the same distribution, so that the covariates selected with the training information are also useful for obtaining good predictions in the testing set. However, in two of the six datasets this assumption is not fulfilled, so even with the variable selection process it is not possible to improve the prediction accuracy.
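The leakage-free protocol described above can be sketched as a leave-one-environment-out loop in which covariate selection only ever sees the training environments. The callables `select_features`, `fit`, and `predict` are illustrative placeholders for the study's selection and prediction steps, not its actual interface.

```python
def leave_one_environment_out(data_by_env, select_features, fit, predict):
    """Leave-one-environment-out cross validation in which the
    environmental-covariate selection is performed *only* on the training
    environments, avoiding the data-leakage problem described above.
    `select_features`, `fit`, and `predict` are user-supplied callables."""
    results = {}
    for held_out in data_by_env:
        # The held-out environment is excluded before any selection happens
        train = {e: d for e, d in data_by_env.items() if e != held_out}
        chosen = select_features(train)      # selection sees training data only
        model = fit(train, chosen)
        results[held_out] = predict(model, data_by_env[held_out], chosen)
    return results
```

Because `select_features` is called inside the loop on the training split, the covariates chosen for each fold can never be informed by the environment being predicted.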

When deciding which metric to use for assessing the genomic prediction performance of different machine learning models, it is important to study and investigate the advantages and disadvantages of metrics such as NRMSE and the Pearson correlation between observed and predicted values. Both metrics have their advantages and disadvantages. NRMSE is beneficial when the magnitude of model errors is important. If accurately predicting absolute values is crucial for the application, NRMSE provides a measure that considers the magnitude of errors. It penalizes larger errors more heavily and can help identify models with smaller overall deviations from the actual values. On the other hand, Pearson correlation is useful when capturing the strength and direction of the relationship between variables is critical. If the focus is on assessing the linear relationship between predicted and actual values, correlation provides a measure that quantifies the strength and direction of the association.

Correlation can also be sensitive to outliers, helping identify extreme values that deviate from the overall pattern. However, correlation may not capture complex non-linear relationships well. It should be noted that this study performed genomic prediction for entire environments (or years); therefore, larger prediction errors are expected and difficult to avoid, regardless of whether NRMSE, correlation, or any other metric is used. Further research is needed to thoroughly assess the advantages and disadvantages of metrics such as NRMSE or correlation (or any other) as the basis for assessing the prediction ability of models.
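A small numerical illustration (invented numbers, not from the study) of why the NRMSE and correlation rankings can disagree, as seen repeatedly in the results above: a prediction can be perfectly correlated with the observations yet systematically biased, so Cor rewards it while NRMSE penalizes it.

```python
import math

y_obs = [2.0, 4.0, 6.0, 8.0]
y_bias = [y + 10.0 for y in y_obs]          # shifted predictions: correlation is still 1

# NRMSE: large, because every prediction is off by 10 units
rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_bias)) / len(y_obs))
nrmse = rmse / (sum(y_obs) / len(y_obs))    # = 10 / 5 = 2.0

# Pearson correlation: perfect, because the linear relationship is intact
mo = sum(y_obs) / len(y_obs)
mp = sum(y_bias) / len(y_bias)
cov = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_bias))
cor = cov / math.sqrt(sum((o - mo) ** 2 for o in y_obs)
                      * sum((p - mp) ** 2 for p in y_bias))
```

Here `cor` equals 1.0 while `nrmse` equals 2.0, so a model ranked best by correlation can simultaneously be ranked among the worst by NRMSE.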

In conclusion, it is important to bear in mind that the primary objective of this study was to present a proficient and pragmatic approach for integrating environmental data into the modeling process, aiming to improve prediction accuracy through the implementation of genomic selection methodology. While our results are promising in terms of NRMSE, further empirical assessments are necessary to validate our findings and refine the proposed methodology, since the improvement is not reflected in terms of Pearson's correlation. This involves not only incorporating feature (variable) selection techniques but also integrating feature engineering methods to enhance the predictive capabilities of genomic prediction models.

Conclusion

In this research, using six real datasets, we proposed the use of feature selection to optimally incorporate environmental covariates in the modeling process for training genomic prediction models. We found that feature selection significantly increases the prediction accuracy in terms of Normalized Root Mean Squared Error relative to ignoring the environmental covariates or adding all of them without feature selection (naïve incorporation). However, the gain in prediction accuracy is data dependent, ranging from 0% to 218.71% in terms of NRMSE. The key to improving prediction accuracy is to select environmental covariates that are highly correlated with the response variable. We also point out that the selection of the environmental covariates should be done using only the training set, to avoid information-leakage problems. Finally, we encourage other researchers to apply feature selection in genomic prediction, because it is an extremely powerful tool in the context of many inputs and few observations. We are convinced that feature selection can also help to efficiently incorporate other omics data in genomic prediction models.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding authors.

Author contributions

OM-L: Conceptualization, Investigation, Methodology, Writing; LC-H: Investigation, Data, Methodology, Writing; CS: Investigation, Data, Methodology, Writing; AB: Investigation, Data, Methodology, Writing; RR-S: Investigation, Data Analyses; JA-L: Investigation, Data Analyses; AA: Investigation, Data Analyses; GG: Methodology, Investigation, Data Analyses; AM-L: Conceptualization, Investigation, Methodology, Writing; JC: Conceptualization, Investigation, Methodology, Writing. All authors contributed to the article and approved the submitted version.

Funding

We are thankful for the financial support provided by the Bill and Melinda Gates Foundation [INV-003439, BMGF/FCDO, Accelerating Genetic Gains in Maize and Wheat for Improved Livelihoods (AG2MW)], the USAID projects [USAID Amend. No. 9 MTO 069033, USAID-CIMMYT Wheat/AGGMW, AGG-Maize Supplementary Project, AGG (Stress Tolerant Maize for Africa], and the CIMMYT CRP (maize and wheat). We acknowledge the financial support provided by the Foundation for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) through the Research Council of Norway for grants 301835 (Sustainable Management of Rust Diseases in Wheat) and 320090 (Phenotyping for Healthier and more Productive Wheat Crops).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Akdemir, D., Sanchez, J. I., and Jannink, J. L. (2015). Optimization of genomic selection training populations with a genetic algorithm. Genet. Sel. Evol. 47 (1), 38–10. doi:10.1186/s12711-015-0116-6


Buntaran, H., Forkman, J., and Piepho, H-P. (2021). Projecting results of zoned multienvironment trials to new locations using environmental covariates with random coefficient models: Accuracy and precision. Theor. Appl. Genet. 134 (5), 1513–1530. doi:10.1007/s00122-021-03786-2


Chandrashekar, G., and Sahin, F. (2014). A survey on feature selection methods. Comput. Electr. Eng. 40 (1), 16–28. doi:10.1016/j.compeleceng.2013.11.024


Costa-Neto, G., Fritsche-Neto, R., and Crossa, J. (2021a). Nonlinear kernels, dominance, and envirotyping data increase the accuracy of genome-based prediction in multi-environment trials. Heredity 126, 92–106. doi:10.1038/s41437-020-00353-1


Costa-Neto, G., Crossa, J., and Fritsche-Neto, R. (2021b). Environics assembly increases accuracy and reduces costs of the genomic prediction for yield plasticity in maize. Front. Plant Sci. 12, 717552. doi:10.3389/fpls.2021.717552


Crespo-Herrera, L., Howard, R., Piepho, H. P., Pérez-Rodríguez, P., Montesinos-López, O. A., Burgueño, J., et al. (2021). Genome-enabled prediction for sparse testing in multi-environmental wheat trials. Plant Genome 14 (3), e20151. doi:10.1002/tpg2.20151


Crossa, J., Pérez-Rodríguez, P., Cuevas, J., Montesinos-López, O. A., Jarquín, D., de Los Campos, G., et al. (2017). Genomic selection in plant breeding: Methods, models, and perspectives. Trends Plant Sci. 22 (11), 961–975. doi:10.1016/j.tplants.2017.08.011


Cuevas, J., Crossa, J., Soberanis, V., Pérez-Elizalde, S., Pérez-Rodríguez, P., de los Campos, G., et al. (2016). Genomic prediction of genotype × environment interaction kernel regression models. Plant Genome 9 (3), 1–20. doi:10.3835/plantgenome2016.03.0024


Desta, Z. A., and Ortiz, R. (2014). Genomic selection: Genome-wide prediction in plant improvement. Trends Plant Sci. 19, 592–601. doi:10.1016/j.tplants.2014.05.006


Frouin, J., Labeyrie, A., Boisnard, A., Sacchi, G. A., and Ahmadi, N. (2019). Genomic prediction offers the most effective marker assisted breeding approach for ability to prevent arsenic accumulation in rice grains. PLoS One 14 (6), e0217516. doi:10.1371/journal.pone.0217516


Habier, D., Fernando, R. L., and Dekkers, J. C. M. (2007). The impact of genetic relationship information on genome-assisted breeding values. Genetics 177, 2389–2397. doi:10.1534/genetics.107.081190


Hu, H., Campbell, M. T., Yeats, T. H., Zheng, X., Runcie, D. E., Covarrubias-Pazaran, G., et al. (2021). Multi-omics prediction of oat agronomic and seed nutritional traits across environments and in distantly related populations. Theor. Appl. Genet. 134, 4043–4054. doi:10.1007/s00122-021-03946-4


Huang, M., Balimponya, E. G., Mgonja, E. M., McHale, L. K., Luzi-Kihupi, A., Guo-Liang Wang, G.-L., et al. (2019). Use of genomic selection in breeding rice (Oryza sativa L) for resistance to rice blast (Magnaporthe oryzae). Mol. Breed. 39, 114. doi:10.1007/s11032-019-1023-2


Jarquin, D., de Leon, N., Romay, C., Bohn, M., Buckler, E. S., Ciampitti, I., et al. (2020). Utility of climatic information via combining ability models to improve genomic prediction for yield within the genomes to fields maize project. Front. Genet. 11, 592769. doi:10.3389/fgene.2020.592769


Krause, M. R., González-Pérez, L., Crossa, J., Pérez-Rodríguez, P., Montesinos-López, O., Singh, R. P., et al. (2019). Hyperspectral reflectance derived relationship matrices for genomic prediction of grain yield in wheat. G3 Genes., Genomes, Genet. 9, 1231–1247. doi:10.1534/g3.118.200856


Kursa, M., and Rudnicki, W. (2010). Boruta—A system for feature selection. Comput. Statistics Data Analysis 54 (10), 2434–2445. doi:10.3233/FI-2010-288


Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi:10.1093/genetics/157.4.1819


Montesinos-López, A., Montesinos-López, O. A., Cuevas, J., Mata-López, W. A., Burgueño, J., Mondal, S., et al. (2017). Genomic Bayesian functional regression models with interactions for predicting wheat grain yield using hyper-spectral image data. Plant Methods 13 (62), 62–29. doi:10.1186/s13007-017-0212-4


Montesinos-López, O. A., Montesinos-López, A., and Crossa, J. (2022). Multivariate statistical machine learning methods for genomic prediction. Cham, Switzerland: Springer International Publishing. ISBN 978-3-030-89010-0.


Monteverde, E., Gutierrez, L., Blanco, P., Pérez de Vida, F., Rosas, J. E., Bonnecarrère, V., et al. (2019). Integrating molecular markers and environmental covariates to interpret genotype by environment interaction in rice (oryza sativa L) grown in subtropical areas. G3 (Bethesda) 9 (5), 1519–1531. doi:10.1534/g3.119.400064


Pérez, P., and de los Campos, G. (2014). Genome-Wide regression and prediction with the BGLR statistical package. Genetics 198 (2), 483–495. doi:10.1534/genetics.114.164442


R Core Team (2023). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.


Ramon, G. E., Belanche Muñoz, L. A., Perez-Enciso, M., and Prados, A. (2020). Kernel approaches for complex phenotype prediction. https://ddd.uab.cat/record/241440.

Rincent, R., Laloë, D., Nicolas, S., Altmann, T., Brunel, D., Revilla, P., et al. (2012). Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: Comparison of methods in two diverse groups of maize inbreds (Zea mays L.). Genetics 192 (2), 715–728. doi:10.1534/genetics.112.141473

Rogers, A. R., and Holland, J. B. (2022). Environment-specific genomic prediction ability in maize using environmental covariates depends on environmental similarity to training data. G3 Genes, Genomes, Genet. 12 (2), jkab440. doi:10.1093/g3journal/jkab440

Rogers, A. R., Dunne, J. C., Romay, C., Bohn, M., Buckler, E. S., Ciampitti, I. A., et al. (2021). The importance of dominance and genotype-by-environment interactions on grain yield variation in a large-scale public cooperative maize experiment. G3 (Bethesda) 11 (2), jkaa050. doi:10.1093/g3journal/jkaa050

Roorkiwal, M., Rathore, A., Das, R. R., Singh, M. K., Jain, A., Srinivasan, S., et al. (2016). Genome-enabled prediction models for yield related traits in Chickpea. Front. Plant Sci. 7, 1666. doi:10.3389/fpls.2016.01666

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci. 91 (11), 4414–4423. doi:10.3168/jds.2007-0980

Wolfe, M. D., Del Carpio, D. P., Alabi, O., Ezenwaka, L. C., Ikeogu, U. N., Kayondo, I. S., et al. (2017). Prospects for genomic selection in cassava breeding. Plant Genome 10, 15. doi:10.3835/plantgenome2017.03.0015

Wu, P. Y., Stich, B., Weisweiler, M., Shrestha, A., Erban, A., Westhoff, P., et al. (2022). Improvement of prediction ability by integrating multi-omic datasets in barley. BMC Genomics 23 (1), 200. doi:10.1186/s12864-022-08337-7

Keywords: genomic prediction, feature selection, environmental covariables, genotype × environment interaction, genomic selection

Citation: Montesinos-López OA, Crespo-Herrera L, Saint Pierre C, Bentley AR, de la Rosa-Santamaria R, Ascencio-Laguna JA, Agbona A, Gerard GS, Montesinos-López A and Crossa J (2023) Do feature selection methods for selecting environmental covariables enhance genomic prediction accuracy? Front. Genet. 14:1209275. doi: 10.3389/fgene.2023.1209275

Received: 20 April 2023; Accepted: 03 July 2023;
Published: 24 July 2023.

Edited by:

Alencar Xavier, Corteva Agriscience, United States

Reviewed by:

Balpreet Kaur Dhatt, University of Nebraska-Lincoln, United States
Matheus Dalsente Krause, Iowa State University, United States

Copyright © 2023 Montesinos-López, Crespo-Herrera, Saint Pierre, Bentley, de la Rosa-Santamaria, Ascencio-Laguna, Agbona, Gerard, Montesinos-López and Crossa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Abelardo Montesinos-López, aml_uach2004@hotmail.com; José Crossa, j.crossa@cgiar.org

ORCID: José Crossa, orcid.org/0000-0001-9429-5855

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.