
METHODS article

Front. Genet., 14 June 2024
Sec. Statistical Genetics and Methodology

SABO-ILSTSVR: a genomic prediction method based on improved least squares twin support vector regression

Rui Li1,2, Jing Gao1,2,3*, Ganghui Zhou1,2, Dongshi Zuo1,2 and Yao Sun1,2
  • 1College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
  • 2Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application for Agriculture and Animal Husbandry, Hohhot, China
  • 3Inner Mongolia Autonomous Region Big Data Center, Hohhot, China

In modern breeding practices, genomic prediction (GP) uses high-density single nucleotide polymorphism (SNP) markers to predict genomic estimated breeding values (GEBVs) for crucial phenotypes, thereby speeding up the selection process and shortening generation intervals. However, because genotype data typically contain far fewer samples than SNP markers, overfitting commonly arises during model training. To address this, the present study builds upon the Least Squares Twin Support Vector Regression (LSTSVR) model by incorporating a Lasso regularization term, yielding a model named ILSTSVR. Because parameter tuning differs in complexity across datasets, the subtraction-average-based optimizer (SABO) is further introduced to optimize ILSTSVR, resulting in the GP model named SABO-ILSTSVR. Experiments conducted on four different crop datasets demonstrate that SABO-ILSTSVR outperforms, or is equivalent in efficiency to, widely-used genomic prediction methods. Source codes and data are available at: https://github.com/MLBreeding/SABO-ILSTSVR.

1 Introduction

With the decreasing cost of high-throughput sequencing, genomic prediction (GP) has emerged as a novel breeding approach that uses high-density single nucleotide polymorphisms (SNPs) to capture associations between markers and phenotypes, thereby enabling prediction of genomic estimated breeding values (GEBVs) at an early stage of breeding (Meuwissen et al., 2001). Compared with conventional breeding methods, such as phenotypic selection and marker-assisted selection, GP greatly shortens generation intervals, reduces costs, and enhances the efficiency and accuracy of new variety selection (Heffner et al., 2010).

Since the concept of genomic prediction was proposed, a multitude of models have emerged. Early models primarily focused on improving best linear unbiased prediction (BLUP), such as ridge regression-based best linear unbiased prediction (rrBLUP) (Henderson, 1975) and genomic best linear unbiased prediction (GBLUP) (VanRaden, 2008). In addition, researchers have proposed various Bayesian methods, including BayesA and BayesB (Meuwissen et al., 2001), BayesC (Habier et al., 2011), BayesLasso (Park and Casella, 2008), Bayesian ridge regression (BayesRR) (da Silva et al., 2021) and BSLMM (Zhou et al., 2013). Bayesian methods generally exhibit higher prediction accuracy than GBLUP in the majority of cases (Rolf et al., 2015), but the Markov Chain Monte Carlo (MCMC) steps involved in their parameter estimation can significantly increase computational costs. With advances in high-throughput sequencing technologies, the increasing dimensionality of genotype data poses new challenges for GP models. To address this problem, some researchers have employed regularization terms to mitigate overfitting, such as ridge regression (Ogutu et al., 2012), Lasso (Usai et al., 2009) and the elastic net (Wang et al., 2019). Meanwhile, machine learning (ML) methods such as support vector regression (SVR) (Maenhout et al., 2007; Ogutu et al., 2011), random forest (RF) (Svetnik et al., 2003), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost) (Chen and Guestrin, 2016) and light gradient boosting machine (LightGBM) (Ke et al., 2017) have performed well in genomic prediction. With the development of deep learning (DL), researchers have also applied it to genomic prediction, for example DeepGS, proposed by Ma et al. (2018) and based on convolutional neural networks (CNNs), and DNNGP, proposed by Wang et al. (Kelin et al., 2023) for multi-omics data, both of which achieve better performance than other classic models. However, genotype data for most species exhibit high-dimensional, small-sample characteristics, so models often fail to learn effective features from the training data. Moreover, most GP models contain a large number of parameters, and genotype data differ markedly among species, making parameter tuning for each species tedious and significantly increasing breeding costs. Therefore, enhancing the prediction performance of GP models and reducing the complexity of parameter tuning are of crucial importance for shortening generation intervals and reducing breeding costs.

This study explores a machine learning model, least squares twin support vector regression (LSTSVR), to address the above problems. LSTSVR, proposed by Zhong et al. (2012), is a regression model that integrates the ideas of the least squares method and the twin support vector machine (TSVM) (Jayadeva et al., 2007). LSTSVR employs a kernel function; data that are linearly inseparable in the original input space can become linearly separable after being mapped into a higher-dimensional feature space through an appropriate kernel function (Kung, 2014). LSTSVR improves the computational efficiency of traditional SVR training by introducing the least squares paradigm in place of the ε-insensitive loss function in SVR, thereby transforming the originally nonlinear optimization problem into an easier-to-solve system of linear equations and offering more stable performance. Simultaneously, adopting two sets of support vectors for regression enhances the model's learning capability and robustness (Huang et al., 2013). However, as a general-purpose regression model, LSTSVR still has certain limitations when dealing with various high-dimensional, small-sample genotype datasets.

To better address model overfitting and complex parameter tuning when the number of genotype samples is far smaller than the number of SNP markers (Crossa et al., 2017; Tong and Nikoloski, 2020), this study improves and optimizes LSTSVR. Firstly, inspired by Lasso regularization, a penalty term was introduced into LSTSVR to constrain model complexity. Concurrently, to reduce the complexity of model parameter tuning, the subtraction-average-based optimizer (SABO) (Trojovský and Dehghani, 2023) was adopted to optimize the parameters of the ILSTSVR model. By combining the Lasso-regularized ILSTSVR with the efficient optimization of SABO, this study developed a genomic prediction model named SABO-ILSTSVR. To validate its effectiveness, comparative experiments were conducted on four species datasets (maize, potato, wheat, and Brassica napus) against commonly used genomic prediction models (LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP). The results demonstrate that SABO-ILSTSVR exhibits equivalent or superior performance compared with widely-used genomic prediction methods. Finally, to lower the barrier to using the model, this study provides an easy-to-use Python-based tool for breeders.

2 Materials and methods

2.1 Dataset

Four crop datasets are used in this study: potato, wheat, maize, and Brassica napus. The following provides detailed descriptions of the genotype and phenotype data for each dataset.

The potato dataset (Selga et al., 2021) is derived from a total of 256 cultivated varieties grown at three locations in northern and southern Sweden. Over 2,000 SNP markers used for genome-wide prediction were obtained from germplasm resources at the Centro Internacional de la Papa (CIP, Lima, Peru) and in the United States. According to Selga et al. (2021), this number of SNPs is sufficient for predicting genomic estimated breeding values (GEBVs) without loss of information. In this dataset, the total weight of tubers serves as the phenotype data.

The wheat dataset (Crossa et al., 2014) originates from the Global Wheat Program at CIMMYT and comprises information on 599 wheat lines. The program carried out numerous experiments across various environmental settings, and the dataset is divided into four core environments according to distinct environmental parameters. Average grain yield (GY) serves as the phenotypic trait in this dataset. It contains 1,279 SNP markers, obtained after removing markers with minor allele frequencies below 0.05 and imputing missing genotypes with samples from the marginal distribution of marker genotypes.

Maize dataset (Crossa et al., 2014) originates from CIMMYT’s maize project, comprising 242 maize lines and 46,374 SNP markers. The project encompasses multiple phenotypic data points, and we use the most significant yield-related traits as our phenotype data for this study.

Brassica napus dataset (Kole et al., 2002) is part of the MTGS package. The dataset comprises 50 lines derived from two varieties, 100 SNP markers, and phenotype information on flowering days across three distinct time periods (flower0, flower4, flower8).

2.2 SABO-ILSTSVR model

The SABO-ILSTSVR model integrates the subtraction-average-based optimizer (SABO) and an improved LSTSVR (ILSTSVR). Its overall framework is depicted in Figure 1; the model is elaborated below.

Figure 1. Illustration of the SABO-ILSTSVR model framework.

2.2.1 LSTSVR

Least squares twin support vector regression (LSTSVR) is an improvement by Yang et al. (Lu Zhenxing, 2014) of twin support vector regression (TSVR) (Peng, 2010; Shao et al., 2013). TSVR extends the twin support vector machine (TSVM) (Jayadeva et al., 2007) to regression and is suitable for continuous-value prediction problems, which it addresses by determining two regression functions, shown as follows,

$f_1(X) = w_1^{T}X + b_1, \qquad f_2(X) = w_2^{T}X + b_2 \qquad (1)$

where Eq. (1) defines the ε-insensitive down-bound function f1 and the ε-insensitive up-bound function f2, w1, w2 are weight vectors, and b1, b2 are bias terms. These can be obtained by solving the following quadratic programming problems (QPPs),

$\min_{w_1, b_1} \; \frac{1}{2}\left\| Y - e\varepsilon_1 - (Xw_1 + eb_1) \right\|_2^2 + C_1 e^{T}\xi_1 \quad \text{s.t.} \quad Y - (Xw_1 + eb_1) \ge e\varepsilon_1 - \xi_1, \;\; \xi_1 \ge 0, \qquad (2)$
$\min_{w_2, b_2} \; \frac{1}{2}\left\| Y + e\varepsilon_2 - (Xw_2 + eb_2) \right\|_2^2 + C_2 e^{T}\xi_2 \quad \text{s.t.} \quad (Xw_2 + eb_2) - Y \ge e\varepsilon_2 - \xi_2, \;\; \xi_2 \ge 0, \qquad (3)$

where C1, C2 are positive penalty parameters, ε1, ε2 are the down- and up-bound parameters, ξ1, ξ2 are slack vectors, and e is a vector of ones of appropriate dimensions. The final regression function is the mean of the down- and up-bound functions, as in Eq. (4),

$f(X) = \frac{1}{2}\left( f_1(X) + f_2(X) \right) = \frac{1}{2}\left( w_1^{T} + w_2^{T} \right)X + \frac{1}{2}\left( b_1 + b_2 \right) \qquad (4)$

In the spirit of LSTSVM, Yang et al. apply the least squares method to TSVR. TSVR finds the optimal weight vectors and bias terms by solving two QPPs, whereas LSTSVR transforms the original TSVR problem into two systems of linear equations, which is typically faster and more stable than solving the QPPs directly, with the loss in accuracy within an acceptable range. For the LSTSVR model, the inequality constraints of (2) and (3) are replaced with equality constraints as follows,

$\min_{w_1, b_1} \; \frac{1}{2}\left\| Y - e\varepsilon_1 - (Xw_1 + eb_1) \right\|_2^2 + \frac{C_1}{2}\xi_1^{T}\xi_1 \quad \text{s.t.} \quad Y - (Xw_1 + eb_1) = e\varepsilon_1 - \xi_1, \qquad (5)$
$\min_{w_2, b_2} \; \frac{1}{2}\left\| Y + e\varepsilon_2 - (Xw_2 + eb_2) \right\|_2^2 + \frac{C_2}{2}\xi_2^{T}\xi_2 \quad \text{s.t.} \quad (Xw_2 + eb_2) - Y = e\varepsilon_2 - \xi_2, \qquad (6)$

In formulas (5) and (6), the squared L2-norm of the slack variables ξ1, ξ2 is used instead of the L1-norm in (2) and (3), which makes the constraints ξ1 ≥ 0, ξ2 ≥ 0 redundant, so the following formulas are obtained,

$\min_{w_1, b_1} \; \frac{1}{2}\left\| Y - e\varepsilon_1 - (Xw_1 + eb_1) \right\|_2^2 + \frac{C_1}{2}\left\| (Xw_1 + eb_1) - Y - e\varepsilon_1 \right\|_2^2, \qquad (7)$
$\min_{w_2, b_2} \; \frac{1}{2}\left\| Y + e\varepsilon_2 - (Xw_2 + eb_2) \right\|_2^2 + \frac{C_2}{2}\left\| (Xw_2 + eb_2) - Y + e\varepsilon_2 \right\|_2^2, \qquad (8)$

(7) and (8) are two unconstrained QPPs, hence the solutions for w and b can be directly obtained by setting the derivatives to zero, as Eq. (9),

$-H^{T}\left( Y - \varepsilon_1 e - Hu \right) + C_1 H^{T}\left( Hu - Y - \varepsilon_1 e \right) = 0, \qquad (9)$

where $u = \begin{bmatrix} w \\ b \end{bmatrix}$ and $H = \begin{bmatrix} X & e \end{bmatrix}$; then we have Eq. (10) and Eq. (11),

$u_1 = \begin{bmatrix} w_1 \\ b_1 \end{bmatrix} = \left( H^{T}H \right)^{-1}H^{T}Y + \frac{C_1 - 1}{1 + C_1}\left( H^{T}H \right)^{-1}H^{T}\varepsilon_1 e, \qquad (10)$
$u_2 = \begin{bmatrix} w_2 \\ b_2 \end{bmatrix} = \left( H^{T}H \right)^{-1}H^{T}Y - \frac{C_2 - 1}{1 + C_2}\left( H^{T}H \right)^{-1}H^{T}\varepsilon_2 e, \qquad (11)$

thus, the final regression function is $f(X) = \frac{1}{2}X\left( w_1 + w_2 \right) + \frac{1}{2}\left( b_1 + b_2 \right)$.
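To make the notation concrete, the following NumPy sketch (our own illustration, not the authors' released code) builds the augmented matrix H = [X e], stacks u = [w; b], and averages the two bound regressors into the final prediction of Eq. (4).

```python
import numpy as np

def augment(X):
    """Append a column of ones so that H @ u equals X @ w + b with u = [w; b]."""
    e = np.ones((X.shape[0], 1))
    return np.hstack([X, e])

def lstsvr_predict(X, u1, u2):
    """Final LSTSVR prediction: the mean of the down- and up-bound regressors (Eq. 4)."""
    H = augment(X)
    f1 = H @ u1  # down-bound function f1(X) = X w1 + b1
    f2 = H @ u2  # up-bound function  f2(X) = X w2 + b2
    return 0.5 * (f1 + f2)

# Illustrative shapes: u1 and u2 are (n_snps + 1,) vectors obtained from Eqs (10)-(11).
```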

2.2.2 Improved LSTSVR (ILSTSVR)

When LSTSVR is applied to high-dimensional genotype datasets with small sample sizes, the model is prone to overfitting: it may fit noise and feature details in the training set, leading to decreased generalization performance on new samples and thereby affecting the effectiveness and reliability of the predictions. Therefore, this study adds a Lasso regularization term on the weight parameter w within the LSTSVR framework. For linear problems, the objective of ILSTSVR is as follows,

$\min_{w_1, b_1} \; \frac{1}{2}\left\| Y - \varepsilon_1 e - (Xw_1 + b_1 e) \right\|_2^2 + \frac{C_1}{2}\xi_1^{T}\xi_1 + \frac{C_3}{2}\left( \left\| w_1 \right\|_1 + b_1^2 \right) \quad \text{s.t.} \quad Y - (Xw_1 + b_1 e) = \varepsilon_1 e - \xi_1, \qquad (12)$
$\min_{w_2, b_2} \; \frac{1}{2}\left\| Y + \varepsilon_2 e - (Xw_2 + b_2 e) \right\|_2^2 + \frac{C_2}{2}\xi_2^{T}\xi_2 + \frac{C_4}{2}\left( \left\| w_2 \right\|_1 + b_2^2 \right) \quad \text{s.t.} \quad (Xw_2 + b_2 e) - Y = \varepsilon_2 e - \xi_2, \qquad (13)$

where C1, C2, C3, C4, ε1, ε2 are positive parameters, ξ1, ξ2 are slack variables, and e is a vector of ones of appropriate dimensions. To handle the non-differentiability of the L1 regularization and for convenience of calculation, we assume $\left\| w \right\|_1 = \left\| \alpha * w \right\|_2^2$; then (12) and (13) can be converted as follows,

$\min_{w_1, b_1} \; \frac{1}{2}\left\| Y - \varepsilon_1 e - (Xw_1 + b_1 e) \right\|_2^2 + \frac{C_1}{2}\left\| (Xw_1 + b_1 e) - Y - \varepsilon_1 e \right\|_2^2 + \frac{C_3}{2}\left( \left\| \alpha_1 * w_1 \right\|_2^2 + b_1^2 \right), \qquad (14)$
$\min_{w_2, b_2} \; \frac{1}{2}\left\| Y + \varepsilon_2 e - (Xw_2 + b_2 e) \right\|_2^2 + \frac{C_2}{2}\left\| (Xw_2 + b_2 e) - Y + \varepsilon_2 e \right\|_2^2 + \frac{C_4}{2}\left( \left\| \alpha_2 * w_2 \right\|_2^2 + b_2^2 \right), \qquad (15)$

where α1, α2 are vectors of appropriate dimensions and * represents element-wise multiplication. Expanding (14) yields,

$L = \min \; \frac{1}{2}\left( Y - \varepsilon_1 e - (Xw_1 + b_1 e) \right)^{T}\left( Y - \varepsilon_1 e - (Xw_1 + b_1 e) \right) + \frac{C_1}{2}\left( (Xw_1 + b_1 e) - Y - \varepsilon_1 e \right)^{T}\left( (Xw_1 + b_1 e) - Y - \varepsilon_1 e \right) + \frac{C_3}{2}\left( \operatorname{diag}(\alpha_1) w_1 \right)^{T}\left( \operatorname{diag}(\alpha_1) w_1 \right) + \frac{C_3}{2} b_1^2, \qquad (16)$

Setting the derivatives of (16) with respect to w1 and b1 to zero, we obtain Eq. (17) and Eq. (18),

$\frac{\partial L}{\partial w_1} = -X^{T}\left( Y - Xw_1 - b_1 e - \varepsilon_1 e \right) + C_1 X^{T}\left( Xw_1 + b_1 e - \varepsilon_1 e - Y \right) + C_3 \operatorname{diag}(\alpha_1)^{T}\operatorname{diag}(\alpha_1) w_1 = 0, \qquad (17)$
$\frac{\partial L}{\partial b_1} = -e^{T}\left( Y - Xw_1 - b_1 e - \varepsilon_1 e \right) + C_1 e^{T}\left( Xw_1 + b_1 e - \varepsilon_1 e - Y \right) + C_3 b_1 = 0, \qquad (18)$

which can be written in matrix form as Eq. (19),

$\begin{bmatrix} (1+C_1)X^{T}X + C_3 D_1 & (1+C_1)X^{T}e \\ (1+C_1)e^{T}X & (1+C_1)e^{T}e + C_3 \end{bmatrix}\begin{bmatrix} w_1 \\ b_1 \end{bmatrix} = \begin{bmatrix} (1+C_1)X^{T} & (C_1 - 1)X^{T}e \\ (1+C_1)e^{T} & (C_1 - 1)e^{T}e \end{bmatrix}\begin{bmatrix} Y \\ \varepsilon_1 \end{bmatrix} \qquad (19)$

where $D_1 = \operatorname{diag}(\alpha_1)^{T}\operatorname{diag}(\alpha_1)$. Similarly, the result in Eq. (20) can be obtained from (15),

$\begin{bmatrix} (1+C_2)X^{T}X + C_4 D_2 & (1+C_2)X^{T}e \\ (1+C_2)e^{T}X & (1+C_2)e^{T}e + C_4 \end{bmatrix}\begin{bmatrix} w_2 \\ b_2 \end{bmatrix} = \begin{bmatrix} (1+C_2)X^{T} & (1 - C_2)X^{T}e \\ (1+C_2)e^{T} & (1 - C_2)e^{T}e \end{bmatrix}\begin{bmatrix} Y \\ \varepsilon_2 \end{bmatrix} \qquad (20)$

where $D_2 = \operatorname{diag}(\alpha_2)^{T}\operatorname{diag}(\alpha_2)$. Because of the assumption $\left\| w \right\|_1 = \left\| \alpha * w \right\|_2^2$, we set an initial $\alpha^{(0)}$ and calculate the final w and b through the alternately updated iterative formula in Eq. (21),

$\alpha_i^{(k+1)} = \frac{1}{\left| w_i^{(k)} \right|} \qquad (21)$

To make the process clear, the computational procedure is summarized in Table 1, and a code sketch is given after the table.

Table 1. Iterative algorithm to solve the L1 regularization problem.
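The following is a minimal NumPy sketch of this iterative reweighting loop for the linear down-bound problem, following Eqs (19) and (21); the function and variable names, the convergence tolerance, and the small constant added to avoid division by zero are our own illustrative choices, not the authors' released implementation. The up-bound problem of Eq. (20) is solved analogously.

```python
import numpy as np

def ilstsvr_down_bound(X, Y, C1, C3, eps1, n_iter=20, tol=1e-6, delta=1e-8):
    """Iteratively reweighted solution of the linear ILSTSVR down-bound problem.

    Each pass solves the linear system of Eq. (19) for (w1, b1) and then
    updates the reweighting vector alpha with Eq. (21)."""
    n, m = X.shape
    e = np.ones((n, 1))
    Y = Y.reshape(-1, 1)
    alpha = np.ones(m)                        # initial reweighting vector alpha^(0)
    w = np.zeros((m, 1))
    b = 0.0
    for _ in range(n_iter):
        D1 = np.diag(alpha) @ np.diag(alpha)  # D1 = diag(alpha)^T diag(alpha)
        # Left-hand block matrix of Eq. (19)
        A = np.block([
            [(1 + C1) * X.T @ X + C3 * D1, (1 + C1) * X.T @ e],
            [(1 + C1) * e.T @ X,           (1 + C1) * e.T @ e + C3],
        ])
        # Right-hand side of Eq. (19), with scalar bound parameter eps1
        rhs = np.vstack([
            (1 + C1) * X.T @ Y + (C1 - 1) * eps1 * (X.T @ e),
            (1 + C1) * e.T @ Y + (C1 - 1) * eps1 * (e.T @ e),
        ])
        sol = np.linalg.solve(A, rhs)
        w_new, b = sol[:m], float(sol[m, 0])
        alpha = 1.0 / (np.abs(w_new.ravel()) + delta)  # Eq. (21)
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, b
```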

Then we obtain Eq. (22),

$f(X) = \frac{1}{2}X\left( w_1 + w_2 \right) + \frac{1}{2}\left( b_1 + b_2 \right) \qquad (22)$

For nonlinear problems, kernel functions typically provide a good solution; here we choose the radial basis function (RBF) as the kernel for our model. The RBF formula is given in Eq. (23),

$K(x, x') = \exp\left( -\frac{\left\| x - x' \right\|^2}{2\sigma^2} \right), \qquad (23)$

where σ is the kernel parameter; its value has a significant impact on prediction performance and can easily lead to overfitting or underfitting.
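As a small illustration of Eq. (23), the sketch below computes the RBF kernel matrix between two sets of genotype samples with plain NumPy; the function name and the example shapes are assumptions of this illustration.

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma):
    """RBF kernel of Eq. (23): K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Example: H = rbf_kernel_matrix(X_train, X_train, sigma) gives the n x n matrix
# used in place of X in the nonlinear problems of Eqs (24)-(28).
```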

We map the training data through $K(X, X^{T})$ into a high-dimensional reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950), obtaining the matrix H. Thus, we obtain the formulation of the nonlinear problem as Eq. (24) and Eq. (25),

$\min_{w_1, b_1} \; \frac{1}{2}\left\| Y - \varepsilon_1 e - (Hw_1 + b_1 e) \right\|_2^2 + \frac{C_1}{2}\xi_1^{T}\xi_1 + \frac{C_3}{2}\left( \left\| w_1 \right\|_1 + b_1^2 \right) \quad \text{s.t.} \quad Y - (Hw_1 + b_1 e) = \varepsilon_1 e - \xi_1, \qquad (24)$
$\min_{w_2, b_2} \; \frac{1}{2}\left\| Y + \varepsilon_2 e - (Hw_2 + b_2 e) \right\|_2^2 + \frac{C_2}{2}\xi_2^{T}\xi_2 + \frac{C_4}{2}\left( \left\| w_2 \right\|_1 + b_2^2 \right) \quad \text{s.t.} \quad (Hw_2 + b_2 e) - Y = \varepsilon_2 e - \xi_2, \qquad (25)$

Similarly, we can obtain Eq. (26) and Eq. (27),

$\begin{bmatrix} (1+C_1)H^{T}H + C_3 D_1 & (1+C_1)H^{T}e \\ (1+C_1)e^{T}H & (1+C_1)e^{T}e + C_3 \end{bmatrix}\begin{bmatrix} w_1 \\ b_1 \end{bmatrix} = \begin{bmatrix} (1+C_1)H^{T} & (C_1 - 1)H^{T}e \\ (1+C_1)e^{T} & (C_1 - 1)e^{T}e \end{bmatrix}\begin{bmatrix} Y \\ \varepsilon_1 \end{bmatrix}, \qquad (26)$
$\begin{bmatrix} (1+C_2)H^{T}H + C_4 D_2 & (1+C_2)H^{T}e \\ (1+C_2)e^{T}H & (1+C_2)e^{T}e + C_4 \end{bmatrix}\begin{bmatrix} w_2 \\ b_2 \end{bmatrix} = \begin{bmatrix} (1+C_2)H^{T} & (1 - C_2)H^{T}e \\ (1+C_2)e^{T} & (1 - C_2)e^{T}e \end{bmatrix}\begin{bmatrix} Y \\ \varepsilon_2 \end{bmatrix}, \qquad (27)$

and the final nonlinear model is given by Eq. (28),

$f(H) = \frac{1}{2}H\left( w_1 + w_2 \right) + \frac{1}{2}\left( b_1 + b_2 \right) \qquad (28)$

2.2.3 Optimizing ILSTSVR using SABO

For the nonlinear ILSTSVR model, the choice of the RBF kernel parameter σ has a significant impact on prediction performance. Moreover, in this study we set C1 = C2 and C3 = C4 for the ILSTSVR parameters C1, C2, C3, C4 given in (12) and (13). Parameter optimization plays an important role in machine learning, as selecting appropriate parameters can significantly enhance a model's predictive capability and accuracy. Different combinations of parameters may lead to vastly different performance on both training and testing data. Furthermore, parameters affect model complexity and learning capacity; by adjusting them, we can better strike a balance between overfitting and underfitting (Young et al., 2015). Although grid search can find the global optimum, it consumes enormous computational resources and neglects the correlations among parameters (Vincent and Jidesh, 2023). Instead, we use the recently proposed SABO for parameter optimization, which updates the positions of population members in the search space using subtraction averages of individuals and is characterized by strong optimization capability and fast convergence (Moustafa et al., 2023).

The basic inspiration for the design of SABO comes from mathematical concepts such as averages, the differences in the positions of the search agents, and the sign of the difference between two objective-function values. The idea of using the arithmetic mean location of all the search agents (i.e., the population members of the tth iteration), instead of only the location of the best or worst search agent, to update the positions of all the search agents (i.e., to construct the population members of the (t + 1)th iteration) is not new, but SABO's computation of the arithmetic mean is unique, as it is based on a special operation "−v", called the v-subtraction of the search agent B from the search agent A, which is defined in Eq. (29):

$A -_{v} B = \operatorname{sign}\left( F(A) - F(B) \right)\left( A - v * B \right), \qquad (29)$

where v is a vector of dimension m, the operation "∗" represents the Hadamard (element-wise) product of two vectors, F(A) and F(B) are the objective-function values of the search agents A and B, respectively, and sign is the signum function (Trojovský and Dehghani, 2023).

In the proposed SABO, the displacement of any search agent Xi in the search space is calculated by the arithmetic mean of the v-subtraction of each search agent Xj, j = 1,2, … , N, from the search agent Xi. Thus, the new position for each search agent is calculated using (30).

$X_i^{new} = X_i + r_i \cdot \frac{1}{N}\sum_{j=1}^{N}\left( X_i -_{v} X_j \right), \quad i = 1, 2, \ldots, N, \qquad (30)$

where Xi^new is the proposed new position of the ith search agent Xi, N is the total number of search agents, and ri is a vector of dimension m. If this proposed new position improves the value of the objective function, it is accepted as the new position of the corresponding agent, according to (31),

$X_i = \begin{cases} X_i^{new}, & F_i^{new} < F_i \\ X_i, & \text{otherwise} \end{cases} \qquad (31)$

where Fi and Finew are the fitness function values of the search agents Xi and Xinew, respectively.

Similar to other optimization algorithms, the initial positions of the search agents in the search space are randomly generated using (32).

$x_{i,d} = lb_d + r_{i,d}\left( ub_d - lb_d \right), \quad i = 1, \ldots, N, \; d = 1, \ldots, m, \qquad (32)$

where xi,d is the dth dimension of Xi, N is the number of search agents, m is the number of decision variables, ri,d is a random number in the interval [0, 1], and lbd and ubd are the lower and upper bounds of the dth decision variables, respectively (Trojovský and Dehghani, 2023).

Here, we define the fitness function for SABO as the MSE function shown in (34), with C1 ∈ [C1_min, C1_max], C3 ∈ [C3_min, C3_max] and σ ∈ [σ_min, σ_max]. The parameter combination corresponding to the smallest fitness value obtained through the iterations is the optimal combination for ILSTSVR. Finally, the model is trained with the obtained parameter combination.
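Below is a compact NumPy sketch of the SABO loop described by Eqs (29)–(32), written here to minimize an arbitrary fitness function such as the cross-validated MSE of ILSTSVR over (C1, C3, σ). The population size, iteration count, and the sampling distributions used for the vectors v and ri are illustrative assumptions of this sketch, not the authors' settings.

```python
import numpy as np

def sabo_minimize(fitness, lb, ub, n_agents=20, n_iter=50, seed=0):
    """Subtraction-average-based optimizer (SABO) sketch for minimization.

    fitness: callable mapping a parameter vector to a scalar (e.g., CV MSE of ILSTSVR).
    lb, ub:  arrays of lower/upper bounds for each decision variable."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, dtype=float), np.asarray(ub, dtype=float)
    m = lb.size
    # Eq. (32): random initialization inside the search bounds
    X = lb + rng.random((n_agents, m)) * (ub - lb)
    F = np.array([fitness(x) for x in X])
    for _ in range(n_iter):
        for i in range(n_agents):
            v = 1.0 + rng.random((n_agents, m))          # v vectors (sampling is an assumption)
            # Eq. (29): X_i -v X_j = sign(F_i - F_j) * (X_i - v * X_j)
            diffs = np.sign(F[i] - F)[:, None] * (X[i] - v * X)
            # Eq. (30): step by the arithmetic mean of the v-subtractions
            x_new = np.clip(X[i] + rng.random(m) * diffs.mean(axis=0), lb, ub)
            f_new = fitness(x_new)
            if f_new < F[i]:                             # Eq. (31): greedy acceptance
                X[i], F[i] = x_new, f_new
    best = int(np.argmin(F))
    return X[best], F[best]

# Illustrative use: bounds for (C1, C3, sigma); cv_mse would train ILSTSVR with the
# candidate parameters and return the ten-fold cross-validated MSE of Eq. (34).
# best_params, best_mse = sabo_minimize(cv_mse, lb=[1e-3, 1e-3, 1e-2], ub=[1e3, 1e3, 1e2])
```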

2.2.4 Performance evaluation

To evaluate the model's prediction of GEBVs while avoiding the problem that the Pearson correlation coefficient does not measure the distance between true and predicted values, we adopt both the Pearson correlation coefficient and MSE as evaluation metrics for the relationship between predicted and true values. Furthermore, we use ten-fold cross-validation to assess the model's performance. The original dataset is divided into ten (nearly) equally sized folds; each fold in turn serves as the validation set, while the remaining nine folds constitute the training set. A model is trained on the training set and its performance is evaluated on the validation set. After completing all ten iterations, the performance measures obtained on the validation sets are averaged, yielding an overall assessment of the model's performance.

The Pearson correlation coefficient (PCC) is used to measure the strength and direction of the linear relationship between two continuous variables, and is defined as follows,

$\rho_{y,\hat{y}} = \frac{\operatorname{cov}(y, \hat{y})}{\sigma_{y}\sigma_{\hat{y}}} \qquad (33)$

where y and ŷ represent the true and predicted values respectively, cov(y, ŷ) denotes the covariance of the vectors y and ŷ, and σy and σŷ are the standard deviations of y and ŷ respectively.

Mean squared error (MSE) measures the degree of difference between predicted values and actual values, and is defined as follows,

$MSE(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n}\left( y_i - \hat{y}_i \right)^2 \qquad (34)$
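The sketch below shows how the ten-fold cross-validated PCC of Eq. (33) and MSE of Eq. (34) can be computed; the helper name and the use of scikit-learn's KFold splitter are assumptions of this illustration rather than the authors' exact evaluation code.

```python
import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(fit_predict, X, y, seed=0):
    """Average PCC (Eq. 33) and MSE (Eq. 34) over ten cross-validation folds.

    fit_predict(X_train, y_train, X_val) should train a model and return
    predictions for X_val."""
    pccs, mses = [], []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        y_true = y[val_idx]
        pccs.append(np.corrcoef(y_true, y_pred)[0, 1])   # Pearson correlation coefficient
        mses.append(np.mean((y_true - y_pred) ** 2))     # mean squared error
    return float(np.mean(pccs)), float(np.mean(mses))
```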

3 Results

3.1 Comparison of SABO-ILSTSVR with the base model

To validate the effectiveness of ILSTSVR, this section compares SABO-ILSTSVR with its base models (SABO-SVR and SABO-LSTSVR) on four datasets (potato, wheat, maize and Brassica napus). The results on the potato dataset (Figure 2) show that SABO-ILSTSVR exhibits a 4% increase in Pearson correlation coefficient and a 2% decrease in MSE compared with SABO-LSTSVR. Furthermore, compared with SABO-SVR, SABO-ILSTSVR demonstrates a 9% improvement in Pearson correlation coefficient and a 6% decrease in MSE.

Figure 2. Performance of Pearson correlation coefficient and MSE prediction for SABO-SVR, SABO-LSTSVR, and SABO-ILSTSVR on the potato dataset.

The results on other datasets are shown in Table 2. On the wheat dataset, SABO-ILSTSVR improves the Pearson correlation coefficient in four environments by an average of 2%, 2%, 8%, and 2% and reduces the MSE in four environments by an average of 1%, 3%, 1% and 1%, respectively, compared with SABO-SVR and SABO-LSTSVR. On the maize dataset, SABO-ILSTSVR improves the Pearson correlation coefficient by an average of 6% and reduces the MSE by 1%, respectively, compared with SABO-SVR and SABO-LSTSVR. On the brassica napus dataset, SABO-ILSTSVR improves the Pearson correlation coefficient in three traits by an average of 11%, 6%, and 24% and reduces the MSE in three traits by an average of 17%, 1% and 12%, respectively, compared with SABO-SVR and SABO-LSTSVR.

Table 2. Prediction performance in ten-fold cross-validation for each trait in the wheat, maize and brassica napus datasets.

3.2 Comparison of SABO-ILSTSVR with other methods

Considering the complexity of genetic architecture, this study uses real data to evaluate the prediction performance of the models, including publicly available datasets for maize, wheat, potato, and Brassica napus. The phenotype data of all datasets were standardized. Because of the small sample sizes of the adopted datasets, random sampling errors may be relatively large, leading to decreased model prediction accuracy, reduced statistical power, and difficulty in obtaining stable and reliable statistical inferences (Bengio and Grandvalet, 2004). Consequently, ten-fold cross-validation is applied to each dataset in this study, with the average of the results over the ten iterations used to represent the final prediction performance of the models.

This section compares the prediction performance of LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP, and SABO-ILSTSVR across the four datasets (potato, wheat, maize and Brassica napus). The LightGBM model uses the LightGBM Python package developed by Microsoft. The BayesRR, Lasso, RF, and SVR models are from the scikit-learn Python library. The GBLUP, BSLMM and rrBLUP models use the sommer R package (Covarrubias-Pazaran, 2016), the hibayes R package (Yin et al., 2022) and the rrBLUP R package (Endelman, 2011), respectively. The DNNGP model follows the paper by Kelin et al. (2023).

3.2.1 Potato dataset

This paper first compares the prediction performance of SABO-ILSTSVR with LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP on the potato dataset. Detailed information about SNPs and phenotypes in the potato dataset has been described in the Materials and Methods section. As shown in Figure 3, SABO-ILSTSVR outperforms comparative models across key performance metrics. Specifically, SABO-ILSTSVR improves the Pearson correlation coefficient by 18%, 11%, 50%, 9%, 18%, 18%, 9%, 23% and 30% and reduces the MSE by an average of 19%, 11%, 32%, 73%, 26%, 20%, 11%, 23% and 33%, respectively, compared with LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR, DNNGP.

Figure 3. Performance of prediction for various models on the potato dataset in terms of Pearson correlation coefficients and MSE.

3.2.2 Wheat dataset

Similarly, prediction performance comparisons were conducted for SABO-ILSTSVR, LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR and DNNGP on the wheat dataset. As shown in Figure 4, the DNNGP model exhibits the highest Pearson correlation coefficients in predicting yield under environments env1 and env2 of the wheat dataset, whereas its performance in env3 and env4 is inferior to that of other models. In contrast, our proposed SABO-ILSTSVR model demonstrates high Pearson correlation coefficients and low mean squared errors for the yield data across all four environments compared with the other models. Specifically, SABO-ILSTSVR achieves the best performance in env3 and env4 and is second only to DNNGP in env1 and env2. This difference may arise because the models' prediction performance varies across agroclimatic regions.

Figure 4. Performance of prediction for various models on the wheat dataset. Pearson correlation coefficient (A) and MSE (B) metrics of four phenotypes (env1, env2, env3 and env4), as evaluated through ten-fold cross-validation.

3.2.3 Maize dataset

Detailed information on the SNPs and phenotypes in the maize dataset is given in the Materials and Methods section. Unlike the potato and wheat datasets, this section performs ten replicates of ten-fold cross-validation for SABO-ILSTSVR, LightGBM, rrBLUP, GBLUP, BayesRR, Lasso, RF, SVR and DNNGP on the maize dataset. BSLMM was not included in this comparison because of its unstable results. The Pearson correlation coefficients of the ten replicates are shown in panel (A) of Figure 5, while the mean squared errors (MSEs) of the ten replicates are averaged and depicted in panel (B). As shown in Figure 5, Lasso exhibits the lowest prediction performance, whereas DNNGP, despite having the highest Pearson correlation coefficient, displays large variation across the ten runs, resulting in elongated boxes in the box plot. In contrast, SABO-ILSTSVR shows stable Pearson correlation coefficients and exhibits the lowest MSE.

Figure 5. Performance of prediction for various models on the maize dataset. (A) Pearson correlation coefficients of maize yield traits, represented by box plots, after ten replicates of ten-fold cross-validation. (B) Average MSE of maize yield traits after ten replicates of ten-fold cross-validation.

3.2.4 Brassica napus dataset

Similarly, on the brassica napus dataset, ten replicates of ten-fold cross-validation were performed for SABO-ILSTSVR, LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR and DNNGP. The Pearson correlation coefficients of the ten replicates are shown in panel (A) of Figure 6, while the averaged mean squared errors (MSEs) are depicted in panel (B). As shown in panel (A) of Figure 6, SABO-ILSTSVR exhibits the highest Pearson correlation coefficient and lowest MSE on flower0 and flower8, whereas BSLMM achieves the highest Pearson correlation coefficient on flower4. On flower4, the prediction performance of SABO-ILSTSVR is slightly lower than that of the GBLUP and BSLMM models but higher than that of the other comparative models.

Figure 6. Performance of prediction for various models on the brassica napus dataset. (A) Pearson correlation coefficients of three phenotypes (flower0, flower4 and flower8), represented by box plots, after ten replicates of ten-fold cross-validation. (B) Average MSE of three phenotypes after ten replicates of ten-fold cross-validation.

4 Discussion

In this study, we integrate least squares twin support vector regression (LSTSVR) with Lasso regularization to construct a GP model named ILSTSVR, and we use the SABO optimization algorithm to effectively optimize the parameters of the model. To address the problem that the number of genotype samples is far smaller than the number of SNP markers, we introduce a Lasso regularization term into LSTSVR. Owing to the feature-selection property of Lasso, the coefficients of non-key features can be shrunk to zero, achieving parameter sparsity and effectively preventing overfitting. Meanwhile, to cope with potential nonlinear relationships in genotype data, we adopt the radial basis function (RBF) kernel, mapping the raw data into a high-dimensional space to attain linear separability.

Considering that differences in genotype data among species may lead to distinct optimal parameters, we used the SABO optimization algorithm to automatically tune the parameters of the ILSTSVR model. To validate the effectiveness of this model, we conducted evaluations on multiple datasets spanning potato, maize, wheat, and Brassica napus, and compared its prediction performance against a series of widely-used models such as LightGBM, rrBLUP, GBLUP, BSLMM, BayesRR, Lasso, RF, SVR and DNNGP.

The results show that the SABO-ILSTSVR model delivers outstanding prediction performance on the potato dataset, outperforming the other benchmark models. On the wheat and brassica napus datasets, which contain multiple phenotypic traits, our model consistently exhibited higher prediction accuracy for most traits compared with other models. The box plot analyses of the maize and brassica napus datasets further revealed the robustness of the SABO-ILSTSVR model's predictions.

In future work, confronted with the high dimensionality of genomic data, we will investigate how to perform feature extraction efficiently (Burges, 2009). With the continuous decline in sequencing costs, large-scale genomic sequencing of samples is poised to become a reality, and the increased sample sizes may be more favorable for DL models (Khan et al., 2020; Yang et al., 2024). We will therefore also investigate innovative DL models within the field of breeding.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

RL: Software, Writing–review and editing, Writing–original draft, Methodology. JG: Writing–review and editing. GZ: Writing–review and editing, Data curation. DZ: Writing–review and editing. YS: Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This project was funded by the Major Science and Technology Projects of the Inner Mongolia Autonomous Region (2019ZD016, 2021ZD0005) and the 2023 Graduate Research Innovation Project of the Inner Mongolia Autonomous Region (S20231117Z).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404. doi:10.21236/ada296533

Bengio, Y., and Grandvalet, Y. (2004). No unbiased estimator of the variance of K-fold cross-validation. J. Mach. Learn. Res. 5, 1089–1105.

Burges, C. J. C. (2009). Dimension reduction: a guided tour. Found. Trends Mach. Learn. 2, 275–364. doi:10.1561/9781601983794

Chen, T., and Guestrin, C. (2016). XGBoost: a scalable tree boosting system. New York, NY, USA: Association for Computing Machinery, 785–794.

Covarrubias-Pazaran, G. (2016). Genome-assisted prediction of quantitative traits using the R package sommer. PLOS ONE 11, e0156744. doi:10.1371/journal.pone.0156744

Crossa, J., Pérez, P., Hickey, J., Burgueño, J., Ornella, L., Cerón-Rojas, J., et al. (2014). Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity 112, 48–60. doi:10.1038/hdy.2013.16

Crossa, J., Pérez-Rodríguez, P., Cuevas, J., Montesinos-López, O. A., Jarquín, D., Campos, G. D. L., et al. (2017). Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 22 (11), 961–975. doi:10.1016/j.tplants.2017.08.011

Da Silva, F. A., Viana, A. P., Correa, C. C. G., Santos, E. A., De Oliveira, J. A. V. S., Andrade, J. D. G., et al. (2021). Bayesian ridge regression shows the best fit for SSR markers in Psidium guajava among Bayesian models. Sci. Rep. 11, 13639. doi:10.1038/s41598-021-93120-z

Endelman, J. (2011). Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4, 250–255. doi:10.3835/plantgenome2011.08.0024

Habier, D., Fernando, R. L., Kizilkaya, K., and Garrick, D. J. (2011). Extension of the Bayesian alphabet for genomic selection. BMC Bioinforma. 12, 186. doi:10.1186/1471-2105-12-186

Heffner, E. L., Lorenz, A. J., Jannink, J.-L., and Sorrells, M. E. (2010). Plant breeding with genomic selection: gain per unit time and cost. Crop Sci. 50, 1681–1690. doi:10.2135/cropsci2009.11.0662

Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics 31 (2), 423–447. doi:10.2307/2529430

Huang, H.-J., Ding, S.-F., and Shi, Z.-Z. (2013). Primal least squares twin support vector regression. J. Zhejiang Univ. Sci. C 14, 722–732. doi:10.1631/jzus.ciip1301

Jayadeva, Khemchandani, R., and Chandra, S. (2007). Twin support vector machines for pattern classification. IEEE Trans. Pattern Analysis Mach. Intell. 29, 905–910. doi:10.1109/tpami.2007.1068

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. Red Hook, NY, USA: Curran Associates Inc., 3149–3157.

Kelin, W., Muhammad Ali, A., Rasheed, A., Crossa, J., Hearne, S., and Li, H. (2023). DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol. Plant 16, 279–293. doi:10.1016/j.molp.2022.11.004

Khan, A., Sohail, A., Zahoora, U., and Qureshi, A. S. (2020). A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 53, 5455–5516. doi:10.1007/s10462-020-09825-6

Kole, C., Thormann, C. E., Karlsson, B. H., Palta, J. P., Gaffney, P., Yandell, B., et al. (2002). Comparative mapping of loci controlling winter survival and related traits in oilseed Brassica rapa and B. napus. Mol. Breed. 9, 201–210. doi:10.1023/a:1019759512347

Kung, S. Y. (2014). Kernel methods and machine learning. Cambridge: Cambridge University Press.

Lu Zhenxing, Y. Z. G. A. O. X. (2014). Least square twin support vector regression. Comput. Eng. Appl. 50, 140–144.

Ma, W., Qiu, Z., Song, J., Li, J., Cheng, Q., Zhai, J., et al. (2018). A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248, 1307–1318. doi:10.1007/s00425-018-2976-9

Maenhout, S., De Baets, B., Haesaert, G., and Van Bockstaele, E. (2007). Support vector machine regression for the prediction of maize hybrid performance. Theor. Appl. Genet. 115, 1003–1013. doi:10.1007/s00122-007-0627-9

Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. doi:10.1093/genetics/157.4.1819

Moustafa, G., Tolba, M. A., El-Rifaie, A. M., Ginidi, A., Shaheen, A. M., and Abid, S. (2023). A subtraction-average-based optimizer for solving engineering problems with applications on TCSC allocation in power systems. Biomimetics 8, 332. doi:10.3390/biomimetics8040332

Ogutu, J. O., Piepho, H.-P., and Schulz-Streeck, T. (2011). A comparison of random forests, boosting and support vector machines for genomic selection. BMC Proc. 5, S11. doi:10.1186/1753-6561-5-S3-S11

Ogutu, J. O., Schulz-Streeck, T., and Piepho, H.-P. (2012). Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. BMC Proc. 6, S10. doi:10.1186/1753-6561-6-S2-S10

Park, T., and Casella, G. (2008). The Bayesian lasso. J. Am. Stat. Assoc. 103, 681–686. doi:10.1198/016214508000000337

Peng, X. (2010). TSVR: an efficient twin support vector machine for regression. Neural Netw. 23, 365–372. doi:10.1016/j.neunet.2009.07.002

Rolf, M. M., Garrick, D. J., Fountain, T., Ramey, H. R., Weaber, R. L., Decker, J. E., et al. (2015). Comparison of Bayesian models to estimate direct genomic values in multi-breed commercial beef cattle. Genet. Sel. Evol. 47, 23. doi:10.1186/s12711-015-0106-8

Selga, C., Koc, A., Chawade, A., and Ortiz, R. (2021). A bioinformatics pipeline to identify a subset of SNPs for genomics-assisted potato breeding. Plants 10, 30. doi:10.3390/plants10010030

Shao, Y.-H., Zhang, C.-H., Yang, Z.-M., Jing, L., and Deng, N.-Y. (2013). An ε-twin support vector machine for regression. Neural Comput. Appl. 23, 175–185. doi:10.1007/s00521-012-0924-3

Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., and Feuston, B. P. (2003). Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958. doi:10.1021/ci034160g

Tong, H., and Nikoloski, Z. (2020). Machine learning approaches for crop improvement: leveraging phenotypic and genotypic big data. J. Plant Physiology 257, 153354. doi:10.1016/j.jplph.2020.153354

Trojovský, P., and Dehghani, M. (2023). Subtraction-average-based optimizer: a new swarm-inspired metaheuristic algorithm for solving optimization problems. Biomimetics 8, 149. doi:10.3390/biomimetics8020149

Usai, M. G., Goddard, M. E., and Hayes, B. J. (2009). LASSO with cross-validation for genomic selection. Genet. Res. 91, 427–436. doi:10.1017/S0016672309990334

VanRaden, P. M. (2008). Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. doi:10.3168/jds.2007-0980

Vincent, A. M., and Jidesh, P. (2023). An improved hyperparameter optimization framework for AutoML systems using evolutionary algorithms. Sci. Rep. 13, 4737. doi:10.1038/s41598-023-32027-3

Wang, X., Miao, J., Chang, T., Xia, J., An, B., Li, Y., et al. (2019). Evaluation of GBLUP, BayesB and elastic net for genomic prediction in Chinese Simmental beef cattle. PLOS ONE 14, e0210442. doi:10.1371/journal.pone.0210442

Yang, T., Sun, X., Yang, H., Liu, Y., Zhao, H., Dong, Z., et al. (2024). Integrated thermal error modeling and compensation of machine tool feed system using subtraction-average-based optimizer-based CNN-GRU neural network. Int. J. Adv. Manuf. Technol. 131, 6075–6089. doi:10.1007/s00170-024-13369-2

Yin, L., Zhang, H., Li, X., Zhao, S., and Liu, X. (2022). Hibayes: an R package to fit individual-level, summary-level and single-step Bayesian regression models for genomic prediction and genome-wide association studies. bioRxiv. doi:10.1101/2022.02.12.480230

Young, S. R., Rose, D. C., Karnowski, T. P., Lim, S.-H., and Patton, R. M. (2015). Optimizing deep learning hyper-parameters through an evolutionary algorithm. New York, NY, USA: Association for Computing Machinery.

Zhong, P., Xu, Y., and Zhao, Y. (2012). Training twin support vector regression via linear programming. Neural Comput. Appl. 21, 399–407. doi:10.1007/s00521-011-0525-6

Zhou, X., Carbonetto, P., and Stephens, M. (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLOS Genet. 9, e1003264. doi:10.1371/journal.pgen.1003264

Keywords: genomic prediction, LSTSVR, LASSO regularization, subtraction average based optimizer, high-dimensional data

Citation: Li R, Gao J, Zhou G, Zuo D and Sun Y (2024) SABO-ILSTSVR: a genomic prediction method based on improved least squares twin support vector regression. Front. Genet. 15:1415249. doi: 10.3389/fgene.2024.1415249

Received: 10 April 2024; Accepted: 29 May 2024;
Published: 14 June 2024.

Edited by:

Ruzong Fan, Georgetown University Medical Center, United States

Reviewed by:

Yutong Luo, Georgetown University, United States
Shibo Wang, University of California, Riverside, United States

Copyright © 2024 Li, Gao, Zhou, Zuo and Sun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jing Gao, gaojing@imau.edu.cn
