Comparing Alternative Single-Step GBLUP Approaches and Training Population Designs for Genomic Evaluation of Crossbred Animals

Alvarenga, Amanda B.; Veroneze, Renata; Oliveira, Hinayah R.; Marques, Daniele B. D.; Lopes, Paulo S.; Silva, Fabyano F.; Brito, Luiz F.

doi:10.3389/fgene.2020.00263

ORIGINAL RESEARCH article

Front. Genet., 09 April 2020

Sec. Livestock Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.00263

This article is part of the Research Topic Advances in Genomics of Crossbred Farm Animals


Amanda B. Alvarenga¹,²

Renata Veroneze²

Hinayah R. Oliveira¹,³

Daniele B. D. Marques²

Paulo S. Lopes²

Fabyano F. Silva²

Luiz F. Brito¹*

¹Department of Animal Sciences, Purdue University, West Lafayette, IN, United States
²Department of Animal Science, Federal University of Viçosa, Viçosa, Brazil
³Department of Animal Biosciences, Centre for Genetic Improvement of Livestock, University of Guelph, Guelph, ON, Canada

As crossbreeding is extensively used in some livestock species, we aimed to evaluate the performance of single-step GBLUP (ssGBLUP) and weighted ssGBLUP (WssGBLUP) methods to predict Genomic Estimated Breeding Values (GEBVs) of crossbred animals. Different training population scenarios were evaluated: (SC1) ssGBLUP based on a single-trait model considering purebred and crossbred animals in a joint training population; (SC2) ssGBLUP based on a multiple-trait model to enable considering phenotypes recorded in purebred and crossbred training animals as different traits; (SC3) WssGBLUP based on a single-trait model considering purebred and crossbred animals jointly in the training population (both populations were used for SNP weights' estimation); (SC4) WssGBLUP based on a single-trait model considering only purebred animals in the training population (crossbred population only used for SNP weights' estimation); (SC5) WssGBLUP based on a single-trait model and the training population characterized by purebred animals (purebred population used for SNP weights' estimation). A complex trait was simulated assuming alternative genetic architectures. Different scaling factors to blend the inverse of the genomic (G⁻¹) and pedigree ( $A_{22}^{- 1}$ ) relationship matrices were also tested. The predictive performance of each scenario was evaluated based on the validation accuracy and regression coefficient. The genetic correlations across simulated populations in the different scenarios ranged from moderate to high (0.71–0.99). The scenario mimicking a completely polygenic trait ( $h_{Q T L}^{2} =$ 0) yielded the lowest validation accuracy (0.12; for SC3 and SC4). The simulated scenarios assuming 4,500 QTLs affecting the trait and $h_{Q T L}^{2} = h^{2}$ resulted in the greatest GEBV accuracies (0.47; for SC1 and SC2). The regression coefficients ranged from 0.28 (for SC3 assuming polygenic effect) to 1.27 (for SC2 considering 4,500 QTLs). 
In general, SC3 and SC5 resulted in inflated GEBVs, whereas other scenarios yielded deflated GEBVs. The scaling factors used to combine G⁻¹ and $A_{22}^{- 1}$ had a small influence on the validation accuracies, but a greater effect on the regression coefficients. Due to the complexity of multiple-trait models and WssGBLUP analyses, and a similar predictive performance across the methods evaluated, SC1 is recommended for genomic evaluation in crossbred populations with similar genetic structures [moderate-to-high (0.71–0.99) genetic correlations between purebred and crossbred populations].

Introduction

Crossbreeding schemes are paramount for some livestock production systems in enabling the exploitation of complementarity among genetically-divergent breeds and heterosis effects (Wei and van der Werf, 1994). In tropical countries, crosses between two cattle sub-species are widely used to combine climatic adaptability (e.g., from Bos taurus indicus; Zebu breeds) and productive performance (e.g., from Bos taurus taurus; Taurine breeds) traits (Gregory and Cundiff, 1980; Mendonça et al., 2019). Genetic selection is performed on purebred animals in these production systems, aiming to optimize the performance of crossbred progeny. However, this poses various challenges to the breeding programs. For instance, there are large differences in additive and non-additive genetic parameters in traits measured in purebred or crossbred animals (Bijma and van Arendonk, 1998), which might constrain the pooling of all animals into a single training population for genomic analysis (Ribeiro et al., 2019). However, the large majority of livestock breeding programs do not account for non-additive genetic effects when estimating breeding values, and most economically important traits in livestock are not largely influenced by non-additive genetic effects (Varona et al., 2018).

Recording large-scale phenotypes on crossbred animals raised in commercial herds is usually a challenge, especially for hard- or expensive-to-measure traits, such as individual feed intake (Ibáñez-Escriche et al., 2009). Over time, several methods to perform genetic evaluations accounting for purebred and crossbred information have been proposed (Bijma and van Arendonk, 1998; Nayee et al., 2016; Junqueira et al., 2017). For instance, Wei and van der Werf (1994) proposed a model of breeding value prediction for both purebred and crossbred animals that maximizes the genetic response in crossbred animals, even for unknown or inappropriate values of correlations of purebred and crossbred performances, and crossbreeding heritability. However, in the genomic era, Ibáñez-Escriche et al. (2009) have suggested that genomic information can increase the response to selection for crossbred performance even when selecting only purebred animals.

Genomic selection (Meuwissen et al., 2001) has been proven to be a useful tool to increase genetic gain, especially for difficult or expensive-to-measure and/or low-heritability traits. In this context, several methods have been proposed to calculate Genomic Estimated Breeding Values (GEBV) for livestock, such as the single-step Genomic Best Linear Unbiased Prediction (ssGBLUP; Misztal et al., 2009; Aguilar et al., 2010; Christensen and Lund, 2010). The ssGBLUP enables combining the pedigree-based relationship matrix (A) with the genomic relationship matrix (G) into a hybrid matrix (H). This increases the accuracy and reduces the prediction bias of GEBVs when compared to those yielded from multi-step genomic predictions (Aguilar et al., 2010; Lourenco et al., 2015; Guarini et al., 2018). Recent studies have evaluated the use of purebred information to predict crossbred performance using the ssGBLUP method (Lourenco et al., 2016; Tusell et al., 2016; Pocrnic et al., 2019). In this context, Lourenco et al. (2016), using simulated crossbred pig datasets, concluded that the highest GEBV accuracies were attained when using a training population combining both purebred and crossbred animals' datasets. However, the ssGBLUP assumes equal variances for all Single Nucleotide Polymorphisms (SNPs), which may not be the most appropriate assumption from a biological point of view (Meuwissen et al., 2001; VanRaden, 2008; Goddard and Hayes, 2009). In a recent study, Porto-Neto et al. (2014) reported that nine out of ten traits evaluated were influenced by major genes. Consequently, methods that account for locus-specific variance (e.g., weighted ssGBLUP, WssGBLUP; Zhang et al., 2016) have been proposed. The main aim of these methods is to increase the predictive performance of GEBVs using computationally efficient tools that can be easily implemented in commercial breeding programs. In the WssGBLUP method, different SNP weights are used when calculating the G matrix.

The WssGBLUP has been successfully applied to several genomic prediction studies (Zhang et al., 2016; Lourenco et al., 2017; Guarini et al., 2019). However, to our best knowledge, there are no reports evaluating the prediction ability of WssGBLUP in crossbred animals, especially in F1 populations. Therefore, we aimed to compare the predictive performance of ssGBLUP and WssGBLUP using different training populations (based on purebred and/or crossbred animals) and alternative statistical models (single- or multiple-trait). One alternative for evaluating the predictive performance of genomic models is comparing GEBVs and True Breeding Values (TBVs). However, in practice, the TBVs are usually unknown and therefore simulated datasets can be advantageous when comparing models and genomic prediction approaches. In this context, we evaluated five simulated scenarios mimicking beef cattle populations (two purebred lines and four F1 populations), in which the trait under evaluation differed in terms of the number of Quantitative Trait Loci (QTLs) and the trait heritability (h²) explained by them ( $h_{Q T L}^{2}$ ). Furthermore, the impact of the genetic distance between training and validation populations used in the crossbreeding scheme was also investigated.

Materials and Methods

Only (computationally) simulated datasets were used in this study. Therefore, the approval of an Institutional Animal Care and Use Committee was not required.

Simulated Population

Datasets of purebred and crossbred animals were simulated based on a beef cattle production system. The purebred populations were simulated to mimic Bos taurus indicus (Line1; Zebu cattle) or Bos taurus taurus (Line2; Taurine cattle) animals. Crossbred animals (F1) originated from crossing females from Line1 with males from Line2. Phenotypes and TBVs were simulated for a trait with an h² equal to 0.33 and a phenotypic variance equal to 0.13. This was done to mimic the trait residual feed intake (RFI; an indicator of feed efficiency), which is a very important trait in beef cattle breeding programs (Branco et al., 2014) and has a genetic architecture similar to that of many other economically important (quantitative) traits in livestock.

The historical population consisted of 1,020 generations (Figure 1). During the first 1,000 generations (i.e., from generation −1,020 to generation −20), 2,000 individuals (1,000 males and 1,000 females) were randomly mated (Brito et al., 2011; Lourenco et al., 2016). From generation −19 to generation zero, a first “bottleneck” (i.e., population reduction) was created by reducing the total number of individuals from 2,000 to 1,500 (750 males and 750 females), which were also randomly mated. Thereafter, a second “bottleneck” was created by randomly sampling 100 males and 100 females from generation zero (1,500 individuals) of the historical population. These 200 individuals were used to create the expansion population (POP) containing 64,000 individuals. The population reductions (“bottlenecks”) were simulated to create an initial level of linkage disequilibrium (LD), which will be further explained.


Figure 1. Simulated population scheme representing bottleneck in historical population, breed differentiation, and origin of F1 for all simulated scenarios. The Bos taurus indicus population is represented by Line1, Bos taurus taurus is represented by Line2.

Animals in POP were subjected to random selection, mating, and culling for eight generations. To increase the number of animals in POP, we assumed that each female had five offspring, with the same proportion of males and females. At the end of the eighth generation, 64,000 animals were available in POP, which was then used to create Line1 and Line2. Line1 was developed based on 32,000 females and 640 males, and Line2 was developed based on 3,200 females and 64 males; all of them were randomly selected from the eighth generation of POP. In subsequent generations of Line1 and Line2, each female had one offspring (with the same probability of being male or female), and the replacement ratio for sires and dams was 0.60 and 0.20, respectively. Selection and culling in both Line1 and Line2 were performed based on the lowest and highest Estimated Breeding Values (EBVs), respectively. EBVs were estimated based on the Best Linear Unbiased Prediction method (Henderson, 1975), through an Animal Model and considering the True Additive Genetic Variance. After 10 generations in Line1 (Bos taurus indicus), and 30 in Line2 (Bos taurus taurus), the average LD values (between adjacent SNPs) were similar to those reported for Bos taurus indicus (r² = 0.20) and Bos taurus taurus (r² = 0.33) (Villa-Angulo et al., 2009). Both LD values were assessed in the last generation using the distance between SNPs up to 0.05 cM.

The F1 population originated from the random mating of 3,000 females from Line1 with 60 males from Line2. A total of four F1 populations were created, and they differed with regard to the parental generation used in the crossbreeding scheme. Parental animals of the F1 populations were from: (i) F1-1: generations seven and 27; (ii) F1-2: eight and 28; (iii) F1-3: nine and 29; (iv) F1-4: ten and 30; in Line1 and Line2, respectively. The differences in the generation of Line1 and Line2 (e.g., seven for Line1 and 27 for Line2) are due to the simulation scheme designed to mimic the current pattern of LD and genetic distance between Nellore and Angus, represented by Line1 and Line2, respectively.

Simulated Genotypes

The genomic prediction was performed using simulated genotypes for animals from generations six to eight (for Line1), generations 26 to 28 (for Line2), and all F1 individuals. Animals from the last two generations of the purebred lines (i.e., generations nine and ten for Line1, and 29 and 30 for Line2) were not included in the analyses in order to maintain a genetic distance between training and validation populations (described below). The simulated genotypes consisted of 52,886 bi-allelic SNPs distributed across 29 chromosomes (autosomes), mimicking the bovine genome. The size of the whole genome was ~2,696.54 cM. The number of SNPs and the size of each chromosome were defined based on information retrieved from the Illumina Bovine 50 K Beadchip (https://support.illumina.com/downloads/bovinesnp50v2.html), as suggested by Matukumalli et al. (2009). The SNPs were evenly spaced within each chromosome, and the initial allele frequencies for SNPs and QTLs were equal to 0.50 in the first generation of the historical population.

Different $h_{QTL}^{2}$ values and numbers of QTLs were used in this study: (i) $h_{QTL}^{2}$ equal to zero, to represent a completely polygenic trait (SIM1); (ii) $h_{QTL}^{2}$ equal to 1/3 of the trait h² (i.e., $h_{QTL}^{2}$ equal to 0.11), and 198 QTLs (SIM2); (iii) $h_{QTL}^{2}$ equal to 1/3 of the trait h² and 4,500 QTLs (SIM3); (iv) $h_{QTL}^{2}$ equal to the trait h² (i.e., 0.33), and 198 QTLs (SIM4); (v) $h_{QTL}^{2}$ equal to the trait h² and 4,500 QTLs (SIM5). The heritability only due to the QTL effects, $h_{QTL}^{2}$, represents the proportion of the total genetic variation of a trait that is due to a limited number of QTLs (i.e., 198 or 4,500) out of all the markers simulated. In other words, it does not indicate the complete inheritance mode of the trait, but the proportion of the total genetic variance explained by the simulated QTLs. The number of QTLs (198) was defined based on a systematic review performed for RFI in beef cattle (Duarte et al., 2019). In addition, simulations considering 4,500 QTLs were also performed, assuming that not all QTLs for RFI are currently known.
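The arithmetic behind the five simulated scenarios can be illustrated with a short Python sketch. The heritability (0.33), phenotypic variance (0.13), and QTL numbers are taken from the text. Two points are assumptions made only for illustration: that the additive variance not explained by QTLs is treated as polygenic, and that SIM1 carries zero QTLs. The function and variable names are hypothetical.

```python
# Values taken from the text: trait heritability and phenotypic variance.
H2 = 0.33
PHENOTYPIC_VARIANCE = 0.13

# (h2_QTL, number of QTLs) per simulated scenario.
# Assumption: SIM1 (completely polygenic) is represented with zero QTLs.
SCENARIOS = {
    "SIM1": (0.0, 0),
    "SIM2": (H2 / 3, 198),
    "SIM3": (H2 / 3, 4500),
    "SIM4": (H2, 198),
    "SIM5": (H2, 4500),
}


def partition_variance(h2_qtl, h2=H2, var_p=PHENOTYPIC_VARIANCE):
    """Split the phenotypic variance into QTL, polygenic, and residual parts.

    Assumption: additive variance not explained by QTLs is polygenic.
    """
    var_qtl = h2_qtl * var_p
    var_polygenic = (h2 - h2_qtl) * var_p
    var_residual = (1.0 - h2) * var_p
    return var_qtl, var_polygenic, var_residual


for name, (h2_qtl, n_qtl) in SCENARIOS.items():
    var_qtl, var_poly, var_res = partition_variance(h2_qtl)
    print(f"{name}: QTLs={n_qtl:>5}  var_qtl={var_qtl:.4f}  "
          f"var_polygenic={var_poly:.4f}  var_residual={var_res:.4f}")
```

Under these assumptions, the three components always add up to the simulated phenotypic variance, and only the split between QTL and polygenic variance changes across scenarios.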

The effect of each QTL was sampled from a Gamma distribution with a shape parameter of 0.40. The mutation rate for both SNPs and QTLs was considered as 10⁻⁵ per generation and locus. The QTL effect captured by the SNP marker can potentially change across populations and generations due to the population-specific allele frequency and LD levels between SNP markers and QTLs. In order to minimize the effects of the simulation (starting values) in the results, ten independent replicates were carried out for each scenario. Simulations were performed using the QMSim software (Sargolzaei and Schenkel, 2009).
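To illustrate what a Gamma distribution with a shape parameter of 0.40 implies for the QTL effects, the following Python sketch draws effects with the standard library. It is not the QMSim implementation: the scale parameter, the random sign, and the seed are assumptions chosen for illustration only.

```python
import random


def sample_qtl_effects(n_qtl, shape=0.40, scale=1.0, seed=1):
    """Draw QTL effects with Gamma(shape, scale) magnitudes.

    The shape (0.40) comes from the text; the scale and the random sign
    are illustrative assumptions.
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_qtl):
        magnitude = rng.gammavariate(shape, scale)
        sign = rng.choice((-1.0, 1.0))
        effects.append(sign * magnitude)
    return effects


effects = sample_qtl_effects(198)
magnitudes = sorted((abs(e) for e in effects), reverse=True)
top_share = sum(magnitudes[:20]) / sum(magnitudes)
print(f"Share of total absolute effect held by the 20 largest QTLs: {top_share:.2f}")
```

A shape parameter below one yields an L-shaped distribution, so a small number of QTLs tends to carry a disproportionate share of the total effect, while most QTLs have effects close to zero.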

Genotypic Quality Control

Genotypic quality control was performed independently for each population (Line1, Line2, and F1 populations) and replicate. The quality control kept SNPs with a minor allele frequency (MAF) higher than 0.05 and a departure from Hardy–Weinberg equilibrium (estimated as the difference between the expected and observed frequencies of heterozygotes) lower than 0.15. Only SNPs common to all populations were kept for further analyses. A summary of the descriptive statistics for Line1, Line2, and F1 in each scenario is shown in Table 1. Detailed descriptive statistics for each replicate are shown in the Supplementary Material (Tables S1A–S1E). The PREGSF90 software (Aguilar et al., 2014) was used to perform the genotypic quality control.
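The two filters and the intersection step can be sketched in Python as follows. This is a minimal illustration of the thresholds stated in the text, not the PREGSF90 code. Genotypes are assumed to be coded as allele counts (0, 1, or 2), and the toy data and function names are hypothetical.

```python
def passes_qc(genotypes, min_maf=0.05, max_hwe_diff=0.15):
    """Return True if one SNP passes the MAF and HWE filters.

    `genotypes` holds allele counts (0, 1, or 2) for one SNP in one population.
    """
    n = len(genotypes)
    p = sum(genotypes) / (2 * n)
    maf = min(p, 1 - p)
    observed_het = sum(1 for g in genotypes if g == 1) / n
    expected_het = 2 * p * (1 - p)
    return maf > min_maf and abs(expected_het - observed_het) < max_hwe_diff


def common_snps_after_qc(populations):
    """Apply QC within each population and keep SNPs that pass in all of them.

    `populations` maps population name -> {snp_id: [genotypes]}.
    """
    kept = None
    for snps in populations.values():
        passing = {snp for snp, g in snps.items() if passes_qc(g)}
        kept = passing if kept is None else kept & passing
    return sorted(kept or [])


# Hypothetical toy genotypes for three SNPs in two populations.
populations = {
    "Line1": {
        "snp1": [0, 1, 2, 1, 1, 0, 2, 1],   # passes both filters
        "snp2": [0, 0, 0, 0, 0, 0, 0, 0],   # monomorphic, fails MAF
        "snp3": [0, 2, 0, 2, 0, 2, 0, 2],   # no heterozygotes, fails HWE
    },
    "Line2": {
        "snp1": [1, 1, 0, 2, 1, 1, 2, 0],
        "snp2": [0, 1, 1, 2, 1, 0, 1, 2],
        "snp3": [1, 0, 1, 2, 1, 1, 0, 2],
    },
}

print(common_snps_after_qc(populations))
```

Because the filters are applied within each population before intersecting, a SNP that is monomorphic in one line is dropped even if it segregates in the others.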


Table 1. Mean and standard deviation (inside parentheses) of phenotypes ( $\bar{X}$ ), inbreeding coefficients (F), average allele A frequency (ρ_A), average linkage disequilibrium (LD), and number of markers before (SNP_beforeQC), and after (SNP_afterQC) genotypic quality control for Line1, Line2, and F1 populations, in the different scenarios (SIM).

Genetic Connectedness Between Populations

Principal Component Analysis (PCA)

In order to better assess the population composition of the animals and to graphically display the results, we performed a PCA by decomposition of the genomic relationship matrix (G). Principal components were assessed using the flag “--pca” of PLINK 2.0 (Chang et al., 2015).

Consistency of Gametic Phase

The consistency of gametic phase was defined as the Pearson correlation of signed LD values (measured by r) between two populations [Line1 vs. Line2; Line1 vs. F1 (F1-1, F1-2, F1-3, and F1-4); Line2 vs. F1 (F1-1, F1-2, F1-3, and F1-4)]. The LD level between two SNP markers was measured by r², in which $r^{2} = \frac{D^{2}}{f(A) f(a) f(B) f(b)}$; where D = f(AB)−f(A)f(B), and f(AB), f(A), f(a), f(B), and f(b) are the observed frequencies of haplotype AB and alleles A, a, B, and b, respectively (Hill and Robertson, 1968). The LD levels were obtained with the flag “--r2 dprime” using the PLINK 2.0 software (Chang et al., 2015). The signed r value was obtained by taking the square root of the r² value and assigning the sign of the D value. Data were sorted into bins based on pair-wise SNP marker distance to determine the breakdown in the consistency of gametic phase across distances, and to assess the consistency of gametic phase at the smallest distances in the current panel, given the number of genotyped SNPs. For each distance bin, the signed r values were correlated between all pairs of populations using the base cor function of the R statistical software (R Core Team, 2019).
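The calculation of the signed r values and their correlation between two populations can be sketched in Python. The haplotype and allele frequencies below are hypothetical, and a single distance bin is assumed; in the analysis described above, the same correlation would be computed separately within each distance bin.

```python
import math


def signed_r(f_ab, f_a, f_b):
    """Signed LD (r) from the haplotype AB frequency and allele A, B frequencies."""
    d = f_ab - f_a * f_b
    r2 = d * d / (f_a * (1 - f_a) * f_b * (1 - f_b))
    return math.copysign(math.sqrt(r2), d)


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (f_AB, f_A, f_B) for the same four SNP pairs in two populations.
line1_pairs = [(0.30, 0.50, 0.40), (0.10, 0.40, 0.50),
               (0.35, 0.60, 0.50), (0.20, 0.30, 0.40)]
line2_pairs = [(0.28, 0.45, 0.45), (0.15, 0.50, 0.50),
               (0.30, 0.50, 0.50), (0.16, 0.30, 0.40)]

r_line1 = [signed_r(*pair) for pair in line1_pairs]
r_line2 = [signed_r(*pair) for pair in line2_pairs]
consistency = pearson(r_line1, r_line2)
print(f"Consistency of gametic phase (Line1 vs. Line2): {consistency:.2f}")
```

Keeping the sign of D matters here: two populations can share the same r² for a SNP pair while carrying opposite linkage phases, which only the signed r reveals.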

Allele A Frequency Correlation

Assessment of the allele A frequency correlation across populations was based on the Pearson correlation. The allele frequency was calculated for each population individually using the option “--freq” from PLINK 2.0 (Chang et al., 2015).

Genomic Prediction of Breeding Values

Methodological Scenarios

Comparisons between the ssGBLUP and WssGBLUP methods were based on the predictive ability of the GEBVs of the F1 animals. In other words, we aimed to identify the best scenario where the selection of purebred animals would result in the greatest crossbred performance (indicated by the GEBVs of crossbred animals). A total of five alternative scenarios (SC) were investigated: (SC1) ssGBLUP based on a single-trait model considering both purebred and crossbred animals in the training population; (SC2) ssGBLUP based on a multiple-trait model considering phenotypes recorded on purebred and crossbred training animals as different traits; (SC3) WssGBLUP based on a single-trait model including both purebred and crossbred animal datasets in the training population (and information from the three populations to estimate the SNP weights—further described); (SC4) WssGBLUP based on a single-trait model considering only purebred animals in the training population (and only the information from crossbred animals to estimate the SNP weights); (SC5) WssGBLUP based on a single-trait model considering only purebred animals in the training population (and their information to estimate the SNP weights). The main goal of SC4 was to account for the crossbred allele frequencies during the G calculation, and SC5 was performed to evaluate the use of only purebred information to predict crossbred performance.

The animals included in the training populations were purebred animals from generations six, seven, and eight (Line1), and generations 26, 27, and 28 (Line2). When crossbred animals were included in the training population, animals from F1-1 and F1-2 populations were used. The scenarios used to create the different training populations are summarized in Table 2. F1-3 and F1-4 were used as two different validation populations in all scenarios, in order to assess the impact of the genetic distance between training and validation populations in the genomic predictions.


Table 2. Structure of scenarios (SC) using the single-step Genomic Best Linear Unbiased Prediction (ssGBLUP) or weighted ssGBLUP (WssGBLUP) approaches, in terms of training population and single nucleotide polymorphism (SNP) weights.

ssGBLUP and WssGBLUP

The ssGBLUP and WssGBLUP methods were used to combine phenotypic, pedigree, and genotypic information. Therefore, the inverse of the H matrix (Misztal et al., 2009; Aguilar et al., 2010; Christensen and Lund, 2010) used in this study was created as:

$$H^{-1} = A^{-1} + \begin{bmatrix} 0 & 0 \\ 0 & τ (0.95 G + 0.05 A_{22})^{-1} - ω A_{22}^{-1} \end{bmatrix} \tag{1}$$

Where A is the pedigree-based relationship matrix, which included up to five generations of animals with phenotypes or genotypes, A₂₂ is the subset of the A matrix related to genotyped animals, the τ and ω values will be described further, and G is the genomic relationship matrix, which was created as follows (VanRaden, 2008):

$$G = \frac{Z D Z'}{k}, \quad \text{with } Z = (M - P) \tag{2}$$

Where D is a diagonal matrix with weights, k is a scale parameter defined as $2 \sum_{j = 1}^{n} p_{j} (1 - p_{j})$ , M is a matrix of n SNPs for each animal, and P is a matrix containing two times the allele frequency of the second allele p at locus j (p_j). In the ssGBLUP analyses, the D matrix was assumed as an identity matrix. In the WssGBLUP analyses, D was a diagonal matrix with values given by weights derived from the SNP solutions, as described by Wang et al. (2012). The SNP weights were obtained by back solving the GEBVs using the software BLUPF90 (Strandén and Garrick, 2009; Wang et al., 2012). First of all, the ssGBLUP was performed by using D matrix as an identity matrix (I). Then, the SNP weights were derived based on Strandén and Garrick (2009) and Wang et al. (2012):

$$\hat{u} = λ D M' G^{-1} (GEBVs) \tag{3}$$

Where û is a vector of estimated SNP effects, λ is the ratio of SNP variance to genetic variance, and GEBVs are the genomic estimated breeding values. The SNP weights to be considered in the next iteration (second iteration) were derived from the SNP effects as SNP variances:

$$d_{j} = \hat{u}_{j}^{2} \, 2 p_{j} (1 - p_{j}) \tag{4}$$

Where d_j is the weight of the jth SNP (equivalent to the variance of the jth SNP), û_j is the estimated effect of the jth SNP, and p_j is the allele frequency of the jth SNP.

Consequently, a total of two iterations (i.e., using the identity matrix plus one iteration using the D matrix derived from SNP solutions) were used in the WssGBLUP because the second iteration provided higher GEBV accuracies in the preliminary analysis (Table S2). The SNP solutions were estimated using the POSTGSF90 software (Aguilar et al., 2014).
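The construction of G (Equation 2) and the derivation of SNP weights from SNP effects (Equation 4) can be sketched in plain Python. This is only an illustration of the two formulas, not the BLUPF90 or POSTGSF90 implementation. The toy genotypes and SNP effects are hypothetical, allele frequencies are computed from the toy sample (an assumption), and the back-solving step of Equation 3 is skipped by supplying the SNP effects directly. No rescaling of the weights is applied, because the text does not describe one.

```python
def genomic_relationship(M, p, d=None):
    """G = Z D Z' / k with Z = M - P and k = 2 * sum(p_j * (1 - p_j)).

    M: genotype rows (one per animal) coded as allele counts 0/1/2.
    p: allele frequency per SNP.
    d: SNP weights (diagonal of D); None means D = I, as in ssGBLUP.
    """
    n_snp = len(p)
    weights = d if d is not None else [1.0] * n_snp
    k = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    Z = [[row[j] - 2.0 * p[j] for j in range(n_snp)] for row in M]
    return [[sum(zi[j] * weights[j] * zk[j] for j in range(n_snp)) / k
             for zk in Z] for zi in Z]


def snp_weights(u_hat, p):
    """Equation 4: d_j = u_hat_j^2 * 2 * p_j * (1 - p_j)."""
    return [u * u * 2.0 * pj * (1.0 - pj) for u, pj in zip(u_hat, p)]


# Hypothetical genotypes: 3 animals x 4 SNPs.
M = [
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 0],
]
n_animals = len(M)
# Assumption: allele frequencies are estimated from the toy sample itself.
p = [sum(row[j] for row in M) / (2 * n_animals) for j in range(len(M[0]))]

G = genomic_relationship(M, p)              # first iteration, D = I
u_hat = [0.10, -0.40, 0.05, 0.20]           # hypothetical SNP effects
d = snp_weights(u_hat, p)                   # weights for the second iteration
G_weighted = genomic_relationship(M, p, d)  # second iteration, D from Equation 4

print("weights:", [round(w, 4) for w in d])
print("G diagonal:", [round(G[i][i], 3) for i in range(n_animals)])
```

SNPs with larger estimated effects and intermediate allele frequencies receive larger weights, so they contribute more to the relationships in the weighted G matrix.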

As genomic datasets were simulated, all individuals included in the pedigree also had genotypes. In order to make the G⁻¹ and A₂₂⁻¹ matrices compatible (Misztal et al., 2017; Oliveira et al., 2019), different values for the τ (from 0.9 to 2.5, in steps of 0.1) and ω (from 0.5 to 1.2, in steps of 0.1) parameters were tested. These ranges were chosen based on the literature (Misztal et al., 2017; Oliveira et al., 2019). As the G⁻¹ and A₂₂⁻¹ matrices were basically the same in all scenarios (i.e., the A₂₂⁻¹ matrix was the same in all scenarios, and the G⁻¹ matrix was the same in SC1, SC2, and SC3; training crossbred animals were excluded from SC4 and SC5, but the validation crossbred animals remained in all SCs), the τ and ω parameters were only tested using SC1. Thereafter, the tuning parameters that increased the accuracy and reduced the prediction bias of GEBVs were used in all analyses. Details about the methods used to calculate the accuracy and bias (based on the regression coefficient) of GEBVs are described in the section “Accuracy and Regression Coefficient.” The inbreeding coefficient was estimated using the BLUPF90 family software (Misztal et al., 2002).
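The role of τ and ω in Equation 1 can be made concrete with a small Python sketch that assembles H⁻¹ for a toy pedigree. This is an illustration under stated assumptions, not the BLUPF90 implementation: the three-animal pedigree (two unrelated parents and their offspring), the G matrix, and the helper names are hypothetical, and the dense Gauss-Jordan inversion is only suitable for very small matrices.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def h_inverse(A, G, genotyped, tau=1.0, omega=1.0):
    """Equation 1: add tau*(0.95G + 0.05A22)^-1 - omega*A22^-1 to A^-1.

    A: pedigree relationship matrix for all animals.
    G: genomic relationship matrix, ordered as `genotyped`.
    genotyped: indices (within A) of the genotyped animals.
    """
    A22 = [[A[i][j] for j in genotyped] for i in genotyped]
    blended = [[0.95 * g + 0.05 * a for g, a in zip(g_row, a_row)]
               for g_row, a_row in zip(G, A22)]
    blended_inv = invert(blended)
    A22_inv = invert(A22)
    H_inv = invert(A)
    for r, i in enumerate(genotyped):
        for c, j in enumerate(genotyped):
            H_inv[i][j] += tau * blended_inv[r][c] - omega * A22_inv[r][c]
    return H_inv


# Hypothetical pedigree: animals 0 and 1 are unrelated parents of animal 2.
A = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
# Hypothetical genomic relationships for the same three animals.
G = [[1.02, 0.03, 0.48],
     [0.03, 0.98, 0.52],
     [0.48, 0.52, 1.01]]
genotyped = [0, 1, 2]  # all animals genotyped, as in the simulated data

for tau, omega in [(1.0, 1.0), (1.5, 0.7)]:
    H_inv = h_inverse(A, G, genotyped, tau, omega)
    print(tau, omega, [round(v, 3) for v in H_inv[2]])
```

A useful sanity check on the formula: when G equals A₂₂ and τ = ω = 1, the added block cancels and H⁻¹ reduces to A⁻¹. Increasing τ gives more emphasis to the genomic information, whereas ω controls how much of the pedigree information for genotyped animals is removed.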

Statistical Models

The ssGBLUP and WssGBLUP analyses were performed using the BLUPF90 software (Misztal et al., 2002), based on single- and multiple-trait models. The single-trait models used in SC1, SC3, SC4, and SC5 are described as:

$$y = Xb + Zu + e \tag{5}$$

Where y, b, u and e are the vectors of observations; fixed effects (mean, sex, and population); additive genetic random effects, u ~ N(0, $σ_{u}^{2} H$ ); and random residuals, e ~ N(0, $σ_{e}^{2} I$ ), respectively. X and Z are the incidence matrices for b and u, respectively. $σ_{u}^{2}$ and $σ_{e}^{2}$ are the additive genetic and residual variances, respectively. Variance components were independently estimated for each scenario using the AIREMLF90 software (Misztal et al., 2002) and the A matrix, since it has been currently recommended in several ssGBLUP and WssGBLUP studies (Ali et al., 2019; Oliveira et al., 2019; Pocrnic et al., 2019). The multiple-trait model used in SC2 can be described as:

$$y_{3} = X_{3} b_{3} + Z_{3} u_{3} + e_{3} \tag{6}$$

Where y₃ is a vector of observations considering records from Line1, Line2, and F1 as three different traits; b₃, u₃, and e₃ are the vectors of fixed effects (mean and sex), additive genetic random effects, u₃ ~ N(0,G₀ ⊗ H), and, random residuals, e₃ ~ N(0,R ⊗ I), respectively. X₃ and Z₃ are the incidence matrices for the fixed and additive genetic effects, respectively. G₀ and R are the additive genetic and residual variance-covariance matrices, respectively, described as:

$$G_{0} = \begin{bmatrix} σ_{u_{Line1}}^{2} & σ_{u_{Line1}, u_{Line2}} & σ_{u_{Line1}, u_{F1}} \\ σ_{u_{Line2}, u_{Line1}} & σ_{u_{Line2}}^{2} & σ_{u_{Line2}, u_{F1}} \\ σ_{u_{F1}, u_{Line1}} & σ_{u_{F1}, u_{Line2}} & σ_{u_{F1}}^{2} \end{bmatrix} \tag{7}$$

$$R = \begin{bmatrix} σ_{e_{Line1}}^{2} & 0 & 0 \\ 0 & σ_{e_{Line2}}^{2} & 0 \\ 0 & 0 & σ_{e_{F1}}^{2} \end{bmatrix} \tag{8}$$

Where $σ_{u_{Line1}}^{2}$, $σ_{u_{Line2}}^{2}$, and $σ_{u_{F1}}^{2}$ are the additive genetic variances for Line1, Line2, and F1, respectively; $σ_{u_{i}, u_{j}}$ is the additive genetic covariance between populations i and j; $σ_{e_{Line1}}^{2}$, $σ_{e_{Line2}}^{2}$, and $σ_{e_{F1}}^{2}$ are the residual variances for Line1, Line2, and F1, respectively. The residual covariances are zero because each animal has a record in only one population.
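The genetic correlations between populations follow directly from the additive genetic (co)variance matrix of the multiple-trait model. The Python sketch below shows the conversion; the numerical values in G0 are hypothetical and are not estimates from this study.

```python
import math


def genetic_correlations(G0, labels):
    """Convert an additive (co)variance matrix into pairwise genetic correlations."""
    correlations = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            correlations[(labels[i], labels[j])] = (
                G0[i][j] / math.sqrt(G0[i][i] * G0[j][j])
            )
    return correlations


labels = ["Line1", "Line2", "F1"]
# Hypothetical additive genetic (co)variances, ordered as in `labels`.
G0 = [
    [0.040, 0.030, 0.036],
    [0.030, 0.045, 0.038],
    [0.036, 0.038, 0.042],
]

for pair, r_g in genetic_correlations(G0, labels).items():
    print(pair, round(r_g, 2))
```

A genetic correlation close to one between a purebred line and the F1 population indicates that the trait recorded in the two populations behaves almost as the same trait, in which case the multiple-trait model is expected to add little over a single-trait model.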

Accuracy and Regression Coefficient

The predictive ability of the tested scenarios was evaluated based on a comparison of GEBVs and True Breeding Values (TBVs) of the F1 populations. The main goal of the current study was to evaluate the predictive performance of genomic models when purebred parents are selected to produce crossbred progeny with higher genetic breeding value and improved performance, both indicated by higher GEBVs. Therefore, accuracies of genomic predictions were estimated as the Pearson correlation coefficients between GEBVs and TBVs for the validation populations (F1-3 and F1-4). In addition, the regression coefficient (an indicator of inflation or deflation of the GEBVs) was assessed using a linear regression of TBVs on GEBVs for the validation animals. A paired Student's t-test (Rosner, 1982) was applied to verify significant differences (P < 0.05) between the accuracies and regression coefficients of pairs of scenarios, using the t.test function available in the R software (R Core Team, 2019).
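Both validation statistics can be computed from the same sums of squares and cross-products, as in the following Python sketch. The TBV and GEBV values are hypothetical, and the function name is chosen for illustration only.

```python
import math


def accuracy_and_slope(tbv, gebv):
    """Return (Pearson correlation of GEBV with TBV, slope of TBV on GEBV)."""
    n = len(tbv)
    mean_t = sum(tbv) / n
    mean_g = sum(gebv) / n
    cov = sum((t - mean_t) * (g - mean_g) for t, g in zip(tbv, gebv))
    var_t = sum((t - mean_t) ** 2 for t in tbv)
    var_g = sum((g - mean_g) ** 2 for g in gebv)
    accuracy = cov / math.sqrt(var_t * var_g)
    slope = cov / var_g
    return accuracy, slope


# Hypothetical true and estimated breeding values for six validation animals.
tbv = [0.12, -0.05, 0.30, -0.22, 0.08, -0.10]
gebv = [0.20, -0.02, 0.35, -0.40, 0.01, -0.18]

accuracy, slope = accuracy_and_slope(tbv, gebv)
print(f"accuracy = {accuracy:.2f}, regression coefficient = {slope:.2f}")
```

A regression coefficient of one indicates that the GEBVs are on the same scale as the TBVs. Values below one indicate inflated GEBVs (differences among animals are overstated), and values above one indicate deflated GEBVs.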

Results

Variance and Covariance Components

Genetic parameters and (co)variance components estimated in the different simulated scenarios using the A matrix are shown in Table 3. In general, variance components estimated from SIM1, SIM2, SIM3, and SIM5 ranged from 0.03 to 0.05 for the additive genetic variance, and from 0.08 to 0.09 for the residual variance. Heritability estimates in SIM1, SIM2, SIM3, and SIM5 ranged from 0.26 to 0.40, which is consistent with the value used in the simulation process (h² equal to 0.33). For the Line2 and F1 populations in SIM4, the additive genetic variance and h² were underestimated (additive genetic variance equal to 0.01, and h² ranging from 0.11 to 0.13) in comparison to the other scenarios. Genetic correlations across populations in the different scenarios ranged from moderate to high (from 0.71 to 0.99).


Table 3. Mean and standard deviation (in parentheses) of variance and covariance components and genetic parameters estimated for Line1, Line2, and F1 populations.

Genetic Connectedness Between Populations

Principal Component Analysis

The two purebred populations and the F1 populations clustered separately, with the F1 animals located between the two purebred lines (as expected). This is shown by the first and second principal components (PC) of the genomic relationship matrix, in which the first principal component explained from 79 to 82% of the total variation (Figure 2). The projections did not overlap in any of the five simulated scenarios, indicating that the populations were genetically divergent based on the relationships calculated from segregating SNPs.


Figure 2. Principal component decomposition of the genomic relationship matrix of repetition 1 colored by breed-group. Letters represent the simulated scenarios: (A) Simulated scenario with heritability explained by the quantitative trait loci ( $h_{Q T L}^{2}$ ) equal to zero (SIM1); (B) $h_{Q T L}^{2}$ equal to 1/3 of trait heritability (h²) (i.e., $h_{Q T L}^{2}$ equal to 0.11), and the number of QTLs equal to 198 (SIM2); (C) $h_{Q T L}^{2}$ equal to 0.11 and the number of QTLs equal to 4,500 (SIM3); (D) $h_{Q T L}^{2}$ equal to trait h² (0.33), and the number of QTLs equal to 198 (SIM4); and (E) $h_{Q T L}^{2}$ equal to 0.33 and the number of QTLs equal to 4,500 (SIM5).