Identification of Candidate Genes and Genomic Selection for Seed Protein in Soybean Breeding Pipeline

Qin, Jun; Wang, Fengmin; Zhao, Qingsong; Shi, Ainong; Zhao, Tiantian; Song, Qijian; Ravelombola, Waltram; An, Hongzhou; Yan, Long; Yang, Chunyan; Zhang, Mengchen

doi:10.3389/fpls.2022.882732

ORIGINAL RESEARCH article

Front. Plant Sci., 16 June 2022

Sec. Technical Advances in Plant Science

Volume 13 - 2022 | https://doi.org/10.3389/fpls.2022.882732

This article is part of the Research Topic: Novel Technologies for Soybean Improvement

Jun Qin¹

Fengmin Wang¹

Qingsong Zhao¹

Ainong Shi²*

Tiantian Zhao¹

Hongzhou An¹

Mengchen Zhang¹*

¹National Soybean Improvement Center Shijiazhuang Sub-Center, North China Key Laboratory of Biology and Genetic Improvement of Soybean, Ministry of Agriculture, Hebei Laboratory of Crop Genetics and Breeding, Cereal & Oil Crop Institute, Hebei Academy of Agricultural and Forestry Sciences, Shijiazhuang, China
²Department of Horticulture, University of Arkansas, Fayetteville, AR, United States
³Soybean Genomics and Improvement Lab, United States Department of Agriculture - Agricultural Research Service (USDA-ARS), Beltsville, MD, United States
⁴Department of Soil and Crop Sciences, Texas A&M University, College Station, TX, United States

Soybean is a primary meal protein for human consumption, poultry, and livestock feed. In this study, quantitative trait loci (QTLs) controlling protein content were explored via genome-wide association studies (GWAS) and linkage mapping based on 284 soybean accessions and 180 recombinant inbred lines (RILs), respectively, which were evaluated for protein content for 4 years. A total of 22 single nucleotide polymorphisms (SNPs) associated with protein content were detected using mixed linear model (MLM) and general linear model (GLM) methods in TASSEL, and 5 QTLs were detected using Bayesian interval mapping (BIM), single-trait multiple interval mapping (SMIM), single-trait composite interval mapping maximum likelihood estimation (SMLE), and single marker regression (SMR) models in Q-gene and IciMapping. Major QTLs were detected on chromosomes 6 and 20 in both populations. The new QTL genomic region on chromosome 6 (Chr6_18844283–19315351) included 7 candidate genes, and Hap.X^AA at the Chr6_19172961 position was associated with high protein content. Genomic selection (GS) of protein content was performed using Bayesian Lasso (BL) and ridge regression best linear unbiased prediction (rrBLUP) based on all the SNPs and on the SNPs significantly associated with protein content in GWAS. The results showed that BL and rrBLUP performed similarly; GS accuracy was dependent on the SNP set and training population size. GS efficiency was higher for the SNPs derived from GWAS than for random SNPs and reached a plateau when the number of markers was >2,000. The SNP markers identified in this study and other information are essential for establishing efficient marker-assisted selection (MAS) and GS pipelines for improving soybean protein content.

Introduction

Soybean [Glycine max (L.) Merr.] provides about 60% of the vegetable-derived proteins worldwide and is a primary meal protein for human consumption, poultry, and livestock feed (Wolf, 1970; Patil et al., 2017). Improving protein content is one of the major objectives in breeding programs (Li S. et al., 2019; Stewart-Brown et al., 2019). Traditional soybean breeding methods require phenotyping and multigeneration selection. Although molecular marker-assisted selection (MAS), which tags the desired genes during breeding selection, can make selection more efficient (Collard et al., 2005), it is only relatively effective for traits with high heritability that are controlled by major genes (Xu and Crouch, 2008; Xu et al., 2012; Patil et al., 2017). Genomic selection (GS) was developed for the selection of traits controlled by multiple genes, but it has not been practically applied because prediction accuracy varies widely among populations and efficient genotyping platforms have been lacking (Zhang A. et al., 2017; Liu et al., 2018). With the rapid development of genomic tools and DNA sequencing technology, breeders and geneticists are able to explore molecular approaches to increase genetic gain for seed protein (Song et al., 2004, 2013; Schmutz et al., 2010; Wang et al., 2020).

Linkage analysis (Hyten et al., 2004; Nichols et al., 2006; Pathan et al., 2013; Teng et al., 2017; Whiting et al., 2020) and genome-wide association study (GWAS) are powerful tools to identify markers associated with seed protein content in soybean (Hwang et al., 2014; Leamy et al., 2017; Lee et al., 2019; Li S. et al., 2019); to date, a total of 262 loci have been reported through linkage analysis and 107 loci have been reported through GWAS (Patil et al., 2017; Gangurde et al., 2020) per SoyBase.¹ These loci were on all the chromosomes, especially chromosome (Chr.) 15 and Chr. 20 (see text footnote 1). Among these, several quantitative trait loci (QTLs), such as cqPro-20 on Chr. 20 and cqPro-15 on Chr. 15, were confirmed based on a low error rate (lower than 0.01) and in different populations (Patil et al., 2017). More than 150 candidate genes have been suggested to control seed protein content in soybean (Zhang D. et al., 2017; Zhang J. et al., 2018; Zhang Y. et al., 2018; Li S. et al., 2019; Zhang et al., 2019; Wang et al., 2020). The most frequently described genes affecting seed protein content were the sugar efflux transporters SWEET39 (Glyma15g05470) and SWEET24 (Glyma08g19580) (Wang et al., 2020).

The populations used for mapping protein content in previous reports included pedigree-based F2 and F4:6 populations (Csanádi et al., 2001; Chapman et al., 2003), recombinant inbred line (RIL) populations (Qi et al., 2014; Hacisalihoglu et al., 2018), backcross populations (Sebolt et al., 2000; Liang et al., 2010), multiline populations (Brummer et al., 1997; Wang et al., 2014; Whiting et al., 2020), a nested association mapping population (Gangurde et al., 2020), and natural populations (Hwang et al., 2014; Bandillo et al., 2015; Li D. et al., 2019). Most studies used a single population, but some studies used two populations for QTL verification (Vaughn et al., 2014; Zhang D. et al., 2017; Zhang et al., 2019); a few studies analyzed QTL using both linkage mapping and association mapping methods (Zhang et al., 2019).

The annual wild soybean (Glycine soja) is an important resource to improve soybean (Lam et al., 2010; Yao et al., 2020). Therefore, the objectives of this study were to: (1) identify QTL conferring seed protein content in RILs derived from cultivated and wild soybeans; (2) identify single nucleotide polymorphism (SNP) markers associated with seed protein content in GWAS and candidate genes controlling the trait; and (3) assess the accuracy of GS based on different SNP sets, training population sizes, and statistical models.

Materials and Methods

Plant Materials

Recombinant Inbred Line

A population of 180 F9-derived RILs was developed from a cross of Jidou12 (Glycine max) and Ye9 (Glycine soja). Jidou12 is a high-yield cultivar with a high protein content that is grown in Shandong Jiaodong Peninsula, Hebei Province, and south-central Shanxi. The seed protein content averaged 46.48% for Jidou12 and 48.78% for Ye9 on a dry weight basis.

Natural Population

A total of 284 soybean genotypes, including 250 accessions selected from germplasm collection by Dr. Lijuan Qiu’s laboratory at the Chinese Academy of Agricultural Sciences and 34 cultivars from Hebei Province, were used for the protein association analysis (Supplementary Table 1). These genotypes were originally from 10 provinces in China (202, 67.5%) and 6 states in the United States (76, 30.1%), South Korea (3, 1.2%), and Japan (2, 0.8%).

Field Design

Field experiments were conducted at Shijiazhuang (114°83′E, 38°03′N) in Hebei Province in a randomized complete block design with three replications in 2008, 2010, 2019, and 2020. The plot size was 3 m × 6 m with six rows and 50 cm between rows in all the trials. The planting density was 225,000 plants per ha. Each year, the plots were irrigated once at the seed-filling stage. Plants were harvested after 95% of the leaves had fallen off. When 95% of the plants in a plot had matured, ten plants were randomly chosen from the middle of the plot for laboratory analysis of seed protein content.

Statistical Analysis of Phenotypic Data

Seed protein content was quantified using Fourier transform near-infrared spectroscopy (Bruker MPA, Karlsruhe, Germany) at the North China Key Laboratory of Biology and Genetic Improvement of Soybean, Ministry of Agriculture. Under the Quant 2 method of OPUS (https://www.bruker.com/en/products-and-solutions/infrared-and-raman/opus-spectroscopy-software/downloads.html) version 5.5 software (Bruker MPA, Karlsruhe, Germany), the samples’ protein content data were calculated using the dry basis model (Yan et al., 2008). Each RIL and accession from each replication of each environment was measured three times using about 100–150 dry seeds, and the average was used for statistical analysis. Analysis of variance was performed using JMP® (https://www.jmp.com/en_us/home.html) Genomics 7 (Sall et al., 2017). The least-squares mean (LSM) of the protein content of each soybean genotype from JMP was used as the phenotypic data in the association mapping.

Genotyping by Sequencing and Single Nucleotide Polymorphism Discovery

Genomic DNA was extracted from leaves of soybean plants using the QIAGEN DNeasy Plant Mini Kit (250). DNA was digested using the restriction enzyme ApeKI following the genotyping by sequencing (GBS) protocol described by Elshire et al. (2011). The 90 bp paired-end sequencing of accessions was performed using an Illumina HiSeq 2000 machine at the Genetic Research Institute, Chinese Academy of Sciences. GBS data alignment, mapping, and SNP discovery were done using the Short Oligonucleotide Analysis Package (SOAP) family of software. An average of 3.26 M short reads for each accession was aligned to the soybean whole-genome sequence (Wm82.a2.v1) using SOAPaligner/soap2. SOAPsnp version 1.05 was used for SNP calling (Li et al., 2009; Li, 2011). Approximately half a million SNPs were discovered among the 284 soybean germplasm accessions. The SNPs were filtered before genetic diversity and association analyses. Soybean accessions with >5% missing SNPs and >2% heterozygous SNP genotypes were eliminated. After the SNP dataset was filtered to remove SNPs with minor allele frequency (MAF) <2%, missing data >5%, and heterozygous genotypes >25%, a total of 10,115 SNPs were used for genetic diversity and association analysis (Supplementary Figure 1).
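The SNP-level filters described above (MAF <2%, missing data >5%, heterozygous genotypes >25%) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `filter_snps`, the 0/1/2 genotype coding, and the use of NaN for missing calls are all assumptions made for the example.

```python
import numpy as np

def filter_snps(geno, maf_min=0.02, max_missing=0.05, max_het=0.25):
    """Return a boolean mask of SNPs (columns) that pass all three filters.

    geno: accessions x SNPs array coded 0/1/2 (copies of the alternate
    allele), with np.nan for missing calls (assumed coding).
    """
    geno = np.asarray(geno, dtype=float)
    called = ~np.isnan(geno)
    n_called = called.sum(axis=0)
    missing_rate = 1.0 - n_called / geno.shape[0]
    # Avoid division by zero for SNPs with no calls at all.
    safe_n = np.maximum(n_called, 1)
    het_rate = (geno == 1).sum(axis=0) / safe_n
    alt_freq = np.nansum(geno, axis=0) / (2.0 * safe_n)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (
        (n_called > 0)
        & (maf >= maf_min)
        & (missing_rate <= max_missing)
        & (het_rate <= max_het)
    )
```

Applying the mask with `geno[:, filter_snps(geno)]` keeps only the SNPs that pass; the thresholds are exposed as parameters so the same sketch covers other cutoffs.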

Genetic Maps

The genetic maps were constructed with JoinMap 4.0 (Van Ooijen, 2006) at a logarithm of odds (LOD) threshold of 3.0 based on the 180 F9 RILs. QTL analysis of protein content in the RIL population was performed using the single-trait Bayesian interval mapping (BIM), single-trait multiple interval mapping (SMIM), single-trait composite interval mapping maximum likelihood estimator (SMLE), and single marker regression (SMR) methods of Q-gene software (Joehanes and Nelson, 2008), together with inclusive composite interval mapping (ICIM, http://www.isbreeding.net) (Meng et al., 2015). Variance components, QTL heritability, and QTL effects for seed protein content were estimated by QTLNetwork version 2.1 based on the phenotypic data (Yang et al., 2008). Only a QTL that was mapped to similar physical locations (<1,500 kb apart) on the same chromosome by all five methods was defined as a reliable QTL. The selected SNP markers were further tested for their effect by variance analysis using JMP Pro 10 (Sall et al., 2017).
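The "reliable QTL" rule above can be illustrated with a short sketch. It assumes each method reports QTL peaks as (chromosome, position in bp) pairs and treats a QTL as reliable when every method has a peak on the same chromosome and all of those peaks fall within 1,500 kb of each other. The function name, the input layout, and this reading of "similar physical locations" are assumptions for illustration only.

```python
def reliable_qtl(peaks_by_method, max_span_bp=1_500_000):
    """Return (chrom, min_pos, max_pos) for QTL supported by every method.

    peaks_by_method: dict mapping a method name to a list of
    (chrom, position_bp) peaks (hypothetical input layout).
    """
    methods = list(peaks_by_method)
    if not methods:
        return []
    reliable = []
    # Seed with the first method's peaks, then require support from the rest.
    for chrom, pos in peaks_by_method[methods[0]]:
        cluster = [pos]
        for method in methods[1:]:
            same_chrom = [p for c, p in peaks_by_method[method] if c == chrom]
            if not same_chrom:
                break  # this method has no peak on the chromosome
            cluster.append(min(same_chrom, key=lambda p: abs(p - pos)))
        else:
            if max(cluster) - min(cluster) < max_span_bp:
                reliable.append((chrom, min(cluster), max(cluster)))
    return reliable
```

With five methods as in the study, the dictionary would simply have five keys; the positions passed in would come from each program's own output.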

Population Genetic Diversity and Association Analysis

STRUCTURE is a program that uses Bayesian methods to analyze multilocus data in population genetics (Kaeuffer et al., 2007). This study used an admixture (hybrid) model and a non-correlated allele frequency model to examine the population structure of the soybean germplasm. The number of subpopulations (K) was assumed to be between 1 and 12. Each K was run 10 times, the length of the Markov Chain Monte Carlo (MCMC) burn-in period was 20,000, and the number of MCMC iterations after the burn-in was 50,000. Delta K was used to determine appropriate K-values (Earl and vonHoldt, 2012). Next, CLUMPP was used to integrate the STRUCTURE-generated results with the “repeat 1,000” parameter. In addition, two association mapping models were used to analyze the association between the molecular markers and traits: the TASSEL general linear model (GLM-Q) and the mixed linear model (MLM) combining kinship with population structure (Q-matrix) (Yu et al., 2006; Bradbury et al., 2007).
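For readers unfamiliar with Delta K, the statistic as it is commonly defined divides the absolute second-order change in the mean log-likelihood across successive K values by the standard deviation of the log-likelihood among replicate runs at K. The sketch below assumes the ln P(D) values from the replicate runs are already collected in a dictionary; the function name and input layout are hypothetical.

```python
import numpy as np

def delta_k(lnp_by_k):
    """Compute Delta K for each interior K.

    lnp_by_k: dict mapping K to a list of ln P(D) values, one per
    replicate STRUCTURE run (hypothetical input layout).
    """
    ks = sorted(lnp_by_k)
    mean = {k: float(np.mean(lnp_by_k[k])) for k in ks}
    sd = {k: float(np.std(lnp_by_k[k], ddof=1)) for k in ks}
    out = {}
    for k in ks:
        # Delta K needs both neighbours, so the smallest and largest K are skipped.
        if (k - 1) in mean and (k + 1) in mean and sd[k] > 0:
            second_diff = abs(mean[k + 1] - 2 * mean[k] + mean[k - 1])
            out[k] = second_diff / sd[k]
    return out
```

The K with the largest value, `max(result, key=result.get)`, is the usual choice. With K tested from 1 to 12, Delta K is defined for K = 2 to 11 only.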

Identification of Candidate Genes

Linkage disequilibrium (LD) analysis was performed in the regions containing SNPs significantly associated with protein content; SNPs with r² > 0.5 in a 1-Mb window were considered to be in one LD block in the heterochromatic regions. Haplotype analysis was conducted on all the SNPs within the LD block containing significant loci. Two databases, namely, SoyBase (see text footnote 1) and the Arabidopsis Information Resource,² were used for gene annotation, and candidate genes were preliminarily screened by combining bioinformatic and statistical evidence.
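The block criterion (r² > 0.5 within a 1-Mb window) can be sketched for a pair of SNPs as below. This is an approximation under stated assumptions: r² is computed as the squared correlation of 0/1/2 genotype codes, which is reasonable for largely homozygous lines but is not necessarily the estimator used by the software in the study. Function names and the NaN coding for missing calls are hypothetical.

```python
import numpy as np

def ld_r2(snp_a, snp_b):
    """Squared genotype correlation between two SNPs, ignoring missing calls."""
    a = np.asarray(snp_a, dtype=float)
    b = np.asarray(snp_b, dtype=float)
    keep = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[keep], b[keep]
    # r2 is undefined when either SNP is monomorphic in the retained samples.
    if a.size < 2 or a.std() == 0 or b.std() == 0:
        return float("nan")
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def same_block(pos_a, pos_b, snp_a, snp_b, r2_min=0.5, window_bp=1_000_000):
    """True when two SNPs lie within the window and exceed the r2 cutoff."""
    return abs(pos_a - pos_b) <= window_bp and ld_r2(snp_a, snp_b) > r2_min
```

A full block would be built by applying `same_block` to every pair of SNPs around a significant locus; an undefined r² (NaN) compares as false, so monomorphic SNPs are never joined to a block.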

Genomic Selection

Ridge regression best linear unbiased prediction (rrBLUP) and Bayesian Lasso Regression (BLR) were used to predict genomic estimated breeding value (GEBV) in GS (Endelman, 2011; Legarra et al., 2011). The packages “rrBLUP” (Endelman, 2011) and “BGLR” (Pérez and de los Campos, 2014) containing the GS models rrBLUP and Bayesian Lasso (BL), respectively, were run in R software.
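To make the idea behind ridge regression BLUP concrete, the sketch below estimates all marker effects jointly with a single shrinkage penalty and sums them to obtain a GEBV. It is written in Python for illustration and is not the R code used in the study; in particular, the rrBLUP package estimates the amount of shrinkage from the data, whereas here the penalty `lam` is a fixed, hypothetical value supplied by the caller.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Solve (Z'Z + lam*I) u = Z'(y - mean(y)) for marker effects u.

    Z: lines x markers genotype matrix (0/1/2 coding assumed).
    y: phenotype vector. lam: ridge penalty (assumed fixed here).
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    col_means = Z.mean(axis=0)
    Zc = Z - col_means  # centre markers so the intercept is the phenotype mean
    mu = y.mean()
    A = Zc.T @ Zc + lam * np.eye(Z.shape[1])
    u = np.linalg.solve(A, Zc.T @ (y - mu))
    return mu, col_means, u

def predict_gebv(Z_new, mu, col_means, u):
    """GEBV = overall mean + sum of centred genotypes times marker effects."""
    return mu + (np.asarray(Z_new, dtype=float) - col_means) @ u
```

A larger `lam` shrinks every marker effect toward zero, which is what lets the model use more markers than lines. Bayesian Lasso differs mainly in allowing marker-specific shrinkage and is not sketched here.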

Prediction accuracy of seed protein was evaluated for different SNP sets, including 22 significant SNPs detected from GWAS, 22 random SNPs, 100 random SNPs, 250 random SNPs, 500 random SNPs, 1,000 random SNPs, 2,000 random SNPs, 5,000 random SNPs, and 10,115 SNPs. The effect of training population size on GS accuracy was investigated by conducting cross-validation at different levels with 100 replications for each cross-validation fold from two to ten.
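The cross-validation scheme can be sketched as follows: the lines are split into k folds, each fold is predicted from a model trained on the remaining folds, and accuracy is taken as the Pearson correlation between predicted and observed values. This is a minimal sketch under assumptions: it uses a ridge model with a fixed hypothetical penalty `lam` as a stand-in for the R packages, and the function name and seed handling are illustrative. Repeating the call with different seeds would mimic the 100 replications.

```python
import numpy as np

def kfold_accuracy(Z, y, k, lam=1.0, seed=0):
    """Mean Pearson r between predicted and observed values over k folds."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        means = Z[train].mean(axis=0)
        Zc = Z[train] - means
        mu = y[train].mean()
        # Ridge solution for marker effects on the training lines only.
        A = Zc.T @ Zc + lam * np.eye(Z.shape[1])
        u = np.linalg.solve(A, Zc.T @ (y[train] - mu))
        pred = mu + (Z[test] - means) @ u
        accuracies.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(accuracies))
```

Passing a subset of columns of `Z` (for example, only GWAS-significant SNPs or a random sample of a given size) gives the SNP-set comparison described above.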

Results

Seed Protein Content Variations in Two Populations

The seed protein content of the 180 RILs showed a skewed normal distribution, ranged from 34.69 to 58.71%, and had a coefficient of variation (CV) of 23.39% (Supplementary Figure 2A). The seed protein content of the 284 accessions also showed a skewed normal distribution, ranged from 35.65 to 50.99%, and had a CV of 9.53% (Supplementary Figure 2B).

Genetic Map Construction and Quantitative Trait Locus Mapping in Recombinant Inbred Line Population

The RIL population was genotyped by sequencing. After filtering, a total of 2,498 polymorphic SNP markers were obtained and mapped to the 20 soybean chromosomes, and the genetic maps were built for the RILs (Supplementary Figure 3A). According to their physical positions in the genome assembly, these markers were approximately evenly distributed across the 20 chromosomes. The 20 combined maps of physical distance against genetic position showed a good match (Supplementary Figure 3B). Chr. 14 had the fewest markers (68) and Chr. 18 had the most (184). A genetic linkage map with a total length of 4,476.2 cM was constructed, and the average distance between two adjacent markers was 1.8 cM (Supplementary Figure 3). The average distance between adjacent markers was smallest on Chr. 20 (1.32 cM) and largest on Chr. 9 (2.26 cM).

A total of 5 QTLs on chromosomes 6, 8, 15, 17, and 20 were detected; the LOD values of the markers associated with the QTLs ranged from 3.3 to 14.1, and the QTLs explained 6.6–29.6% of the genetic variation (Figure 1A and Supplementary Figure 4). Among these, the positive allele of one QTL was from Jidou12 and the positive alleles of the other four QTLs were from Ye9 (Table 1). The QTL qtl-chr6_prot had the highest LOD and explained 22.3–29.6% of the genetic variation (Table 1 and Figure 1A). The qtl-chr6_prot was in the heterochromatic region (Figure 1B).

Figure 1. (A) QTL mapping of seed protein content on soybean chromosome 6 based on single-trait multiple interval mapping (SMIM) in Q-gene. (B) The QTL qtl-chr6_prot mapped on the combined map of physical distance and genetic position for chromosome 6, where the x-axis shows physical distance (Mbp) and the y-axis shows genetic position (cM).

Table 1. Single nucleotide polymorphism (SNP) markers/quantitative trait locus (QTL) detected in recombinant inbred line (RIL) and natural populations.

Genome-Wide Association Study in Natural Population and Candidate Genes Selection

A total of 10,115 high-quality SNPs were used to perform population structure analysis of the 284 accessions using the STRUCTURE software (Kaeuffer et al., 2007). When K = 4, delta K was maximal with a relatively stable α value (Figures 2A,B). Cluster I comprised 102 accessions, including 77 cultivars, 21 landraces, and 5 exotic accessions; cluster II comprised 19 accessions, namely, 18 exotic accessions and 1 cultivar; cluster III comprised 93 accessions, namely, 57 exotic accessions, 34 cultivars, and 2 landraces; and cluster IV comprised 70 accessions, namely, 51 landraces, 16 cultivars, and 3 exotic accessions. Principal component analysis (PCA) also showed the four groups (Figure 2C).

Figure 2. Structure analysis: (A) delta K-values for different numbers of populations (K) from the STRUCTURE analysis, the x-axis shows different numbers of populations (K), the y-axis shows delta K-values for different numbers of subpopulations (K). (B) Classification of 284 accessions into four subpopulations using STRUCTURE version 2.3.4, where the x-axis shows accessions and the y-axis shows the probability (from 0 to 1) of each accession belonging to subpopulation (Q = K) membership. The membership of each accession belonging to subpopulations is indicated by different colors (Q1, red; Q2, green; Q3, blue; and Q4, yellow). (C) Principal component analysis (PCA) of the population structure. Distribution of the accessions in the association panel under PC1 and PC2.

A significant association [−log10(P) > 5.35] with seed protein was observed for 22 SNPs from 22 haplotype blocks on 13 of the 20 chromosomes using GLM and MLM (Table 2). The LOD of the 22 markers ranged from 6.6 to 20.1 in the GLM analysis and from 6.3 to 26.3 in the MLM analysis (Table 2 and Supplementary Figure 5), indicating that these markers were strongly associated with seed protein. Eighteen of these markers were in euchromatic regions and four were in heterochromatic regions (Table 2).
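The text gives the significance cutoff as 5.35 and the Figure 3 caption refers to a Bonferroni threshold, but the exact derivation is not stated. As a rough cross-check only, the sketch below computes a Bonferroni-style cutoff on the −log10 scale, assuming a family-wise α of 0.05 and one test per SNP; both assumptions and the function name are illustrative.

```python
import math

def bonferroni_neg_log10(n_tests, alpha=0.05):
    """Bonferroni cutoff expressed as -log10(alpha / n_tests)."""
    return -math.log10(alpha / n_tests)

# Assumed inputs: the 10,115 filtered SNPs and alpha = 0.05.
threshold = bonferroni_neg_log10(10_115)
```

Under these assumptions the cutoff is about 5.31, close to but slightly below the reported 5.35, so the study's threshold is at least as stringent as this approximation.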

Table 2. Significant SNPs associated with protein content over 4 years, chromosome (Chr.) and physical position (bp) of the significant SNPs, logarithm of odds (LOD) [−log10(p-value)] values from the general linear model (GLM) and mixed linear model (MLM), and allele with positive effect at the SNP locus.

Two significant SNP loci on Chr. 6 and 20 were detected in both linkage analysis and GWAS, and the SNP loci detected on Chr. 6 by GWAS were within the QTL intervals obtained by linkage analysis. This SNP region on Chr. 6 had a high phenotypic variation explained (PVE) (22.3–29.6%) and LOD (6.696–25.762). The region on Chr. 20 was associated with protein content with a PVE of 12.30% and an LOD of 7.208 (Tables 1, 2).

A 471-kb haplotype block from Chr6_18844283 to Chr6_19315351 included 7 SNP markers and 17 genes (Figure 3A). Pairwise LD analysis of the imputed SNP data showed that the candidate gene region was from Chr6_18842491 bp to Chr6_19015855 bp (Figure 3B). Seven candidate genes were in the regions, which included polynucleotidyl transferase (Glyma.06G202900 and Glyma.06G203100), polygalacturonase activity (Glyma.06G202600 and Glyma.06G203000), ATP synthase (Glyma.06G203200), and genes without annotation (Glyma.06G202700 and Glyma.06G202800) (Figure 3B).

Figure 3. (A) The extent of linkage disequilibrium (LD) in the regions based on pairwise r² values. The r² values are indicated using the color intensity index. Heatmap showing LD between each pair of markers that passed the Bonferroni threshold in genome-wide association study (GWAS). (B) Candidate genes for each single nucleotide polymorphism (SNP) locus. The bottom panel depicts the extent of linkage disequilibrium in the regions based on pairwise r² values. The r² values are indicated using the color intensity index shown. (C) Boxplot of seed protein based on different genotypes in soybean accessions. (D) Boxplot of seed protein based on Hap.X^GG and Hap.X^AA phenotypic differences between genotype combinations of the two SNPs.

There were 7 QTL haplotypes in the LD block from Chr6_18844283 to Chr6_19315351 in the natural population that showed differences in protein content (Supplementary Table 2 and Figure 3C). The haplotypes Hap.B, Hap.C, and Hap.F had higher protein content than other haplotypes. Hap.B had the highest protein content, but no significant difference was observed among Hap.B, Hap.C, Hap.F, and Hap.G (Figure 3C). Further analysis showed that the SNP located at Chr6_19172961 may be more important; varieties carrying Hap.X^AA showed higher protein content than Hap.X^GG (Figure 3D).

Prediction Accuracy of Seed Protein Content

Prediction accuracy for seed protein was evaluated at different SNP densities using the 22 significant SNPs resulting from GWAS and sets of 22 to 10,115 random SNPs. The prediction accuracy ranged from 0.44 to 0.77 using the rrBLUP model and from 0.44 to 0.78 using the BLR model (Figure 4 and Supplementary Table 3). BLR and rrBLUP performed similarly; the average prediction accuracy was 0.63 and 0.53, respectively. The prediction accuracy of the 22 SNPs obtained from GWAS was higher than that of 22 random SNPs and 250 random SNPs (Figure 4 and Supplementary Table 3). Thus, regardless of the GS model, the accuracy of GS was higher when the significant SNPs from GWAS were used. Prediction accuracy for seed protein increased with higher SNP density. However, there was minimal difference in prediction accuracy once the SNP number reached 2,000 (Figure 4 and Supplementary Table 3).

Figure 4. Boxplots show the effect of different SNP density sets on genomic selection in the Bayesian Lasso Regression (BLR) model and ridge regression best linear unbiased prediction (rrBLUP) models.

The effect of training population size on GS accuracy was also investigated by conducting cross-validation at different folds with 100 replications for each cross-validation (Figure 5 and Supplementary Tables 4, 5). On average, the prediction accuracy of the BLR model was 0.62 using GWAS-derived SNPs and 0.77 using the whole set of SNPs (Figure 5 and Supplementary Table 4). The prediction accuracy of rrBLUP was lower than that of BLR using GWAS-derived SNPs (0.5) and the same using the whole set of SNPs (0.77) (Figure 5 and Supplementary Table 5). Considering the average r-value and standard deviation (Sn), sevenfold cross-validation gave a high r-value and low Sn in BLR models, and sixfold cross-validation gave a high r-value and low Sn in rrBLUP models.
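The link between the number of folds and training population size is simple arithmetic: in k-fold cross-validation roughly (k − 1)/k of the lines are used for training. The sketch below tabulates this, assuming the 284-accession panel was the population used for GS (the text does not state this explicitly) and taking the largest test fold at each k; the function name is hypothetical.

```python
import math

def training_sizes(n_lines, folds=range(2, 11)):
    """Smallest training-set size at each fold number k."""
    sizes = {}
    for k in folds:
        largest_test_fold = math.ceil(n_lines / k)
        sizes[k] = n_lines - largest_test_fold
    return sizes

# Assumed panel size: the 284 accessions of the natural population.
sizes = training_sizes(284)
```

Under this assumption, twofold cross-validation trains on 142 lines and tenfold on 255, so the fold numbers from two to ten span training sets from half to about 90% of the panel.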

Figure 5. Boxplots show the effect of training population size on genomic selection accuracy by conducting cross-validation at different folds with 100 replications for each cross-validation fold using rrBLUP.

Discussion

Quantitative Trait Locus Mapping and Candidate Genes Identification for Soybean Seed Protein

Wild soybean with desired traits may improve the yield, quality, and other traits of cultivated soybeans. In this study, we performed QTL mapping for protein content in a RIL population derived from the cross of cultivated Jidou12 and wild soybean Ye9. Five major stable QTLs were detected on Chr. 6, 8, 15, 17, and 20 using BIM, SMIM, SMLE, and SMR models in Q-gene and IciMapping. Among these QTLs, qtl-chr6_prot contributed an average of 25.77% of the phenotypic variance, and its positive additive allele was from the cultivated soybean Jidou12. The qtl-chr6_prot neither overlapped with nor was adjacent to any previously reported QTL for seed protein content. The other QTLs, qtl-chr8_prot, qtl-chr15_prot, qtl-chr17_prot, and qtl-chr20_prot, explained an average of 13.99, 9.1, 9.85, and 12.47% of the phenotypic variance, respectively; the positive additive alleles of these QTLs were from the wild soybean parent. The QTL qtl-chr8_prot (7.27–8.29 Mb) overlapped with the QTLs previously reported by Pathan et al. (2013). In addition, the QTL qtl-chr15_prot (3.30–4.71 Mb) overlapped with qPro15-1 (Zhang et al., 2019) and qtl-chr17_prot (12.80–13.81 Mb) overlapped with protein 26-2 (Reinprecht et al., 2006). The position of QTL qtl-chr20_prot (26.57–33.51 Mb) was consistent with that of the confirmed QTL cqPro-20 (Diers et al., 1992; Pandurangan et al., 2012; Vaughn et al., 2014; Sonah et al., 2015; Warrington et al., 2015; Zhang Y. et al., 2018; Fliege et al., 2022). Fliege et al. (2022) concluded that a transposon insertion within the CONSTANS, CO-like, and TOC1 (CCT) domain protein encoded by the Glyma.20G85100 gene accounted for the high/low seed protein alleles of the cqSeed protein-003 QTL (31.74–31.84 Mb).

Seven candidate genes were identified in the novel QTL region, qtl-chr6_prot. Of these, Glyma.06G202900 and Glyma.06G203100 were annotated as polynucleotidyl transferase, ribonuclease H-like superfamily proteins, which are homologous to the AT5G61090 gene in Arabidopsis. The protein encoded by AT5G61090 has RNA–DNA hybrid ribonuclease activity (Stoppel and Meurer, 2012). Glyma.06G202600 was annotated as plasmodesmata callose-binding protein 3, homologous to AT1G18650, which has callose-binding activity and regulates intercellular trafficking in Arabidopsis (Simpson et al., 2009). Glyma.06G203000 was annotated as a pectin lyase-like superfamily protein homologous to AT3G07820, which has polygalacturonase activity in Arabidopsis (Kim et al., 2006). Glyma.06G203200 was annotated as a gamma subunit of mitochondrial (mt) ATP synthase, homologous to AT2G33040, one of the mt ATP synthase subunits. Reduced expression of these subunits of the mt ATP synthase was proposed to disturb cellular redox states (Robison et al., 2009).

Genomic Selection in Soybean

Genomic selection overcomes some limitations of traditional breeding methods and MAS and provides a new way to select quantitative traits controlled by genes with minor effects. GS allows for the estimation of the effects of all the markers across the genome. These effects can be used to predict the performance of lines (Meuwissen et al., 2001). Since the target trait phenotype of an individual is predicted using the GS model, the materials can be screened and selected before planting, thus reducing costs and improving breeding efficiency (Heslot et al., 2012; Longin et al., 2015; Spindel et al., 2015). Matei et al. (2018) showed that the selection cycle for yield and seed weight can be significantly shortened using GS.

So far, GS studies have mainly been conducted on maize, wheat, and rice, and GS studies in soybean remain limited. Shu et al. (2013) performed GS for 100-seed weight and reported a prediction accuracy of 0.904. Subsequent GS studies reported an accuracy of 0.59–0.67 for soybean cyst nematode (SCN) resistance (Bao et al., 2014) and 0.64 for soybean yield (Jarquín et al., 2014).

GS has also been performed for amino acid concentration (Qin et al., 2019), soybean chlorophyll content, soybean cyst nematode tolerance (Ravelombola et al., 2019), yield, and yield-related traits, such as maturity, plant height, and 100-seed weight (Ravelombola et al., 2021). These studies have shown the feasibility of GS for soybean yield and quality-related traits (Matei et al., 2018; Stewart-Brown et al., 2019).

However, few reports have focused on the GS of seed protein in soybean. Stewart-Brown et al. (2019) evaluated the potential of GS for soybean seed protein using 483 elite breeding lines from 26 biparental populations and reported predictive abilities of 0.81 in all the populations, 0.55 across populations, and 0.60 within each biparental population. Duhnen et al. (2017) compared the genomic prediction accuracy of seed protein obtained using models calibrated across or within two subpopulations: early lines and late lines. The results showed that calibrations within subpopulations were more efficient. Five Bayesian models were also compared with genomic best linear unbiased prediction (GBLUP) and did not show improved prediction accuracy. In this study, we performed GS based on different SNP sets, training population sizes, and statistical models. The results showed that the use of GWAS-derived SNPs for conducting GS significantly improved the accuracy of prediction, which was consistent with the results reported by Qin et al. (2019). The model selection criteria, SNP sets, and training population size were critical factors when conducting GS, as reported in previous studies (Ravelombola et al., 2019, 2020, 2021). Other studies had demonstrated that 1,000–2,000 genome-wide markers across all the lines/accessions were needed to reach maximum efficiency of genomic prediction in the populations, and that increasing marker density further would not improve prediction efficiency (Poland et al., 2012; Bao et al., 2014; Zhang J. et al., 2016; Song et al., 2020). This study showed that there was minimal difference in prediction accuracy for seed protein content once the SNP number reached 2,000.

Conclusion

This study reported mapping and GS for seed protein content. Molecular markers associated with seed protein content were identified in RIL and natural populations and a novel QTL for seed protein content was detected and mapped on Chr. 6 in both populations. In addition, seven candidate genes that were related to seed protein content were identified. This is one of a few reports investigating seed protein content using RILs derived from cultivated and wild soybean crosses. Our results showed that GS accuracy was dependent on the SNP set and training population size; a set of GWAS-derived SNPs could increase GS accuracy. No significant GS accuracy difference was observed between rrBLUP and BL models. The results demonstrated the potential of using GS to improve soybean seed protein content.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author Contributions

JQ and AS: data curation. JQ, CY, and LY: funding acquisition. FW, QZ, and TZ: investigation. JQ, AS, and WR: methodology. MZ, LY, and CY: project administration. AS: software. FW and TZ: validation. JQ: writing – original draft preparation. JQ, QS, and AS: writing - review and editing. All authors contributed to the article and approved the submitted version.

Funding

This study was funded by: (1) the National Natural Science Foundation of China (32072092); (2) the Basic Research Funds of Hebei Academy of Agriculture and Forestry Sciences (2021060205); (3) the Special Innovation Program of Hebei Academy of Agriculture and Forestry Sciences (2022KJCXZX-LYS-6); (4) the S&T Program of Hebei, Soybean Modern Seed Industry Science and Technology Innovation Team (21326313D); (5) the Hebei Natural Science Foundation (2020301020); and (6) the China Agriculture Research System of The Ministry of Finance (MOF) and The Ministry of Agriculture and Rural Affairs (MARA) (CARS-04).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2022.882732/full#supplementary-material

Footnotes

References

Bandillo, N., Jarquin, D., Song, Q., Nelson, R. L., Cregan, P., Specht, J., et al. (2015). A population structure and genome-wide association analysis on the USDA soybean germplasm collection. Plant Genome 8, 1–13. doi: 10.3835/plantgenome2015.04.0024

Bao, Y., Vuong, T., Meinhardt, C., Tiffin, P., Denny, R., Chen, S., et al. (2014). Potential of association mapping and genomic selection to explore PI 88788 derived soybean cyst nematode resistance. Plant Genome 7, 2840–2854.