
94% of researchers rate our articles as excellent or good
Learn more about the work of our research integrity team to safeguard the quality of each article we publish.
Find out more
METHODS article
Front. Genet. , 30 April 2014
Sec. Statistical Genetics and Methodology
Volume 5 - 2014 | https://doi.org/10.3389/fgene.2014.00107
Exploring heritability of complex traits is a central focus of statistical genetics. Among various previously proposed methods to estimate heritability, variance component methods are advantageous when estimating heritability using markers. Due to the high-dimensional nature of data obtained from genome-wide association studies (GWAS) in which genetic architecture is often unknown, the most appropriate heritability estimator model is often unclear. The Haseman–Elston (HE) regression is a variance component method that was initially only proposed for linkage studies. However, this study presents a theoretical basis for a modified HE that models linkage disequilibrium for a quantitative trait, and consequently can be used for GWAS. After replacing identical by descent (IBD) scores with identity by state (IBS) scores, we applied the IBS-based HE regression to single-marker association studies (scenario I) and estimated the variance component using multiple markers (scenario II). In scenario II, we discuss the circumstances in which the HE regression and the mixed linear model are equivalent; the disparity between these two methods is observed when a covariance component exists for the additive variance. When we extended the IBS-based HE regression to case-control studies in a subsequent simulation study, we found that it provided a nearly unbiased estimate of heritability, more precise than that estimated via the mixed linear model. Thus, for the case-control scenario, the HE regression is preferable. GEnetic Analysis Repository (GEAR; http://sourceforge.net/p/gbchen/wiki/GEAR/) software implemented the HE regression method and is freely available.
So-called “missing heritability” can occur due to various reasons, such as small sample size, underrepresented variant spectrum, experimental design, and improper methodological assumptions (Manolio et al., 2009). Because of the high-dimensional nature of genome-wide association study (GWAS) data, in which the number of markers (M) is far greater than the number of individuals (N), estimating heritability is difficult. For instance, if the statistical power is insufficient, variants associated with a small effect may not be captured under a stringent p-value threshold (~ 10−8). This obstacle can be partially bypassed by implementing the mixed linear model, which uses the genetic relationship between individuals estimated from single nucleotide polymorphism (SNP) markers in lieu of fitting hundreds of thousands of markers together (Yang et al., 2010). Nevertheless, it has been recently disputed how an estimator should be adjusted under genetic architecture. Speed et al. (2012) suggested using a weighted genetic relationship matrix under different genetic architecture, which is often unknown. As demonstrated in large-scale empirical data studies (Lee et al., 2013), Speed's ad-hoc weighing method depends on the genetic architecture and does not often outperform plain weight methods upon comparison. As the genetic architecture, such as the relationship between variant frequency and variant effect, is often unknown, criteria should be established to justify the model used to estimate heritability.
For GWAS, as many samples are collected to study diseases, many studies eventually adopt a case-control design. Due to ascertainment in case-control studies, scale transformation is necessary. Without scale transformation, the heritability on the observed scale can be greater than 1, rendering the estimated heritability meaningless, as it is not representative of its heritability on the liability scale, which is more interpretable (Falconer, 1966) for disease data. An equation (Lee et al., 2011) that transforms heritability from the observed scale to the liability scale has been proposed (as the equation was indexed as the 23rd equation in Sang Hong Lee's paper, it is henceforth denoted as Hong23) and was investigated under the infinitesimal model, for which the number of casual loci is infinite. However, in practice, disease loci are reasonably limited for many diseases (Yang et al., 2011), which raises the question of whether or not Hong23 works well for mixed linear model estimates if the infinitesimal model does not hold.
All of the above concerns are related to the heritability estimated via variance component methods implemented thus far in mixed linear models. The Haseman–Elston (HE) regression is a prestigious method for estimating variance components (Haseman and Elston, 1972). The HE regression, a well-known tool for linkage studies that uses identity by descent (IBD) (Lynch and Walsh, 1998; Hill and Weir, 2011) scores, however, seems a rusty weapon in the genomics analysis armory of the GWAS era. This is because the HE regression relies on relatedness measured on IBD but not identity by state (IBS). Although IBS has been employed for linkage analysis, such as under affected-pedigree-member design (Lange, 1986b; Weeks and Lange, 1988; Bishop and Williamson, 1990), its performance is largely dependent on marker polymorphisms and may cause high false positives when ad-hoc weighting functions or incorrect frequencies are adopted. As an underrepresented concept of the linkage era, IBS is neither well-adapted to linkage studies nor employed in the original HE regression framework.
Taken together, the following questions remain.
(1) Can the HE regression be applied to the IBS content such as for GWAS? If the answer is affirmative, what is the theoretical basis and the genetic interpretation in this new context?
(2) An equilibrium has been established between the HE regression and the variance component method (Sham et al., 2002) for linkage studies. Does this equilibrium stand for high-dimensional data such as GWAS data and what are its assumptions?
(3) If the IBS-based HE regression is applied to case-control studies, can it estimate heritability better than the mixed linear models?
(4) Given the recent dispute regarding heritability estimation of complex traits, can HE regression provide further justification?
Recently, a new theory using like-standardized IBS has paved another route to assess genetic relatedness (Ritland, 1996; Powell et al., 2010; Yang et al., 2010) between unrelated individuals (conventional sense). The IBS score resembles the conventional IBD score (Powell et al., 2010), which raises the question of whether this IBS score can be used in the HE regression for unrelated individuals. In this study, by replacing the IBD scores with standardized IBS scores, we used the HE regression to conduct association studies for GWAS data. Assuming random mating, biallelic loci, and additive genetic effects only on the genetic architecture of quantitative trait loci (QTLs) underlying a complex trait, this report establishes the theoretical basis for using the HE regression for GWAS. Two generic scenarios were investigated, and their regression coefficients were derived and have genetically meaningful interpretations. In scenario I, the IBS score was assessed via a marker that was in linkage disequilibrium (LD) with a QTL. This enabled the HE regression to be a tool for single-marker GWAS. In scenario II, IBS score was assessed on multiple markers, each of which could be in LD with multiple QTLs. This allowed the HE regression to be used to estimate the variance component tagged by markers.
The second scenario has implications for estimation of heritability for complex trait using whole genome-wide markers together, similar to the mixed linear model (Yang et al., 2010; Lee et al., 2011). Using an analytical method that establishes the equivalence between the IBS-based HE regression and the mixed linear model, a simple criterion is proposed to justify the estimates in this study. A similar equivalence between the HE regression and the variance component analysis with the mixed linear model was determined in the context of linkage analysis (Sham and Purcell, 2001). In this study, their equivalence is established under the context of GWAS, and the conditions for equivalence are explored analytically as well as in silico. After extending the established HE regression into case-control scenarios, we demonstrated that Hong23 fits the estimate from the HE regression better than that from the mixed linear model.
Furthermore, as the IBS-based HE regression uses least squares, it is advantageous in its computational efficiency and is N (N is the sample size) times faster than the mixed linear model. In order to facilitate the application, the HE regression algorithm for GWAS data was implemented in Java software, GEnetic Analysis Repository (GEAR), which is freely available online.
As the first half of this report is focused on establishing the mathematical basis of the IBS-based HE regression, many mathematical symbols are introduced (Table 1). In the text below, the HE regression is the IBS-based HE regression unless explicitly noted otherwise.
For an individual, the phenotype is denoted as yi, which follows the normal distribution of N(μy, σ2y), and the genotype is xi = [xi1, xi2, …, xiM], in which M is the number of markers. For the ith individual, the genotype at the kth locus is xik, which counts the reference alleles at the kth locus. The reference allele is denoted as Ak and the alternative is ak. The frequency of Ak is pk and the frequency of ak is qk. gk is the set of possible genotypes, say {akak, Akak, AkAk}, at the kth locus. Consequently, akak, Akak, and AkAk are coded as 0, 1, and 2, respectively. After standardization, xi is expressed as . For xik, given a genotype of akak, Akak, and AkAk, their standardized scores are , and , respectively. The additive effect of the lth QTL is denoted as βl. Throughout the study, we assume a polygenic model with L QTLs.
Haseman and Elston (1972) proposed a linear model, Yij = μ + bπij + eij, for detecting linkage between a marker and a QTL in a full-sib design. Yij represents the squared difference between a pair of full sibs, and πij is the proportion of IBD at an observed marker locus; μ is the intercept of the regression, b is the regression coefficient, and eij is the residual. The mathematical expectation of the regression coefficient is b = −2(1 − 2c)2 σ2A, in which c is the recombination fraction between the marker locus and the QTL, and σ2A is the additive genetic variance of the QTL.
Now consider a sample consisting of N unrelated individuals. If the phenotype for the ith individual is yi, we can modify the HE original regression as below
in which Yij = (yi − yj)2 represents the squared difference, Ωij is the measure of the genetic relatedness of a pair of individuals, and eij is the residual. Given N unrelated individuals, there are such individual pairs. Ωij is the similarity score between a pair of individuals based on the IBS, as recently proposed (Powell et al., 2010; Yang et al., 2010).
For the linear model in Equation (1), the expectation of the regression coefficient is . var(Ωij) is the variance of the genetic relatedness. cov(Yij, Ωij) = E(ΩijYij) − E(Ωij)E(Yij) = E(ΩijYij) because E(Ωij) = 0 [see the definition for Ωij in section The Derivation of var(Ωij) and Effective Number of Markers (Me)], and is the mathematical expectation of the joint distribution for Ωij and Yij. In order to derive var(Ωij) and cov(Yij, Ωij), we need to introduce the haplotype distribution of a biallelic loci pair (section Haplotypes of a Biallelic Loci Pair). When the haplotype is constructed on a pair of markers, it leads to the derivation of var(Ωij) [section The Derivation of var(Ωij) and Effective Number of Markers (Me)]; when the haplotype is constructed for a marker and a QTL, it leads to E(yi|xik), the conditional expectation of the phenotype based on a marker [section The Derivation of E(yi|xik)].
For a pair of biallelic loci, there are four haplotype phases, and their conditional probabilities are as summarized in Table S1. rkl = p(al|ak) and Rkl = p(Al|Ak) are defined as the conditional probabilities of the haplotypes in the coupling phases, such as akal and AkAl, respectively; 1 − rkl and 1 − Rkl represent the conditional probabilities of the alleles in their repulsion phases, such as akAl and Akal, respectively. Dkl = fAkAl − pkpl, in which fAkAl is the frequency of the haplotype AkAl; Dkl is the covariance between the loci, quantifying the LD between them.
The correlation of a pair of biallelic loci can be expressed as a 2 × 2 correlation
ρ2kl is often used to parameterize the LD of a loci pair (Hill and Robertson, 1968). For more LD parameterization, please refer to Devlin and Risch (1995) and Wray (2005).
Once the conditional probabilities of the haplotypes are defined, it is straightforward to obtain the joint probabilities of the genotypes for a pair of loci. For example, under random mating, the probability of the genotype AkAkAlAl is p(AkAkAlAl) = p(AkAk|AlAl)p(AkAk) = p(Al|Ak)p(Ak) p(Al|Ak)p(Ak) = p2kR2kl. Analogously, this leads to the joint probabilities of the other eight two-locus genotypes (See Table 2).
For a sample consisting of unrelated individuals, their pairwise genetic relationships, say additive genetic relationships, can be estimated with genetic markers, such as SNP markers (Powell et al., 2010; Yang et al., 2010). The genetic relatedness Ωij between the ith individual and the jth individual is measured by the dot product of their standardized genotypes and then divided by the number of markers.
The possible relatedness scores of a pair of individuals are summarized in Table 3A, totaling nine products. After combining the same score values, there are seven unique terms as in Table 3B. It is easy to derive that E(Ωij) = 0 and , in which cov(Ωijk, Ωijl) = E(ΩijkΩijl) − E(Ωijk)E(Ωijk) = E(ΩijkΩijl) because E(Ωij.) = 0.
Table 3B. A reorganization of Table 3A to illustrate the relatedness joint distribution of a pair of individuals.
Ωij is informative in revealing hidden relatedness. For example, for the duplicated individual in the sample, E(Ωij) = 1; for first-degree relatives, E(Ωij) = 0.5; for second-degree relatives, E(Ωij) = 0.25. Consequently, it can control the entry of samples that are under the expected cutoff for relatedness.
After some additional algebra (see Supplementary Note I), we arrived at the following equation.
When the kth locus and the lth locus are in linkage equilibrium, cov(Ωijk, Ωijl) = 0; when the kth locus and the lth locus are at the same locus, cov(Ωijk, Ωijl) = 1.
The distribution of ρ2kl varies with pk and pl (Wray, 2005). We can also interpret var(Ωij) as the mean of the squared Pearson's correlation between the markers along the genome, denoted as ρ2M.
For simplicity of the following derivation, the concept of an effective number of markers, Me, is introduced here. Intuitively, as markers are often in linkage disequilibrium, the real number of “independent” markers is smaller than the total number of the markers genotyped. This concept was previously introduced under the context of risk prediction (Purcell et al., 2009), and Me was evaluated using Monte Carlo simulation. As indicated in Supplementary Note II, 1/var(Ωij) is the mathematical expectation of the effective number of markers evaluated under the simulation method (Purcell et al., 2009). For example, for 100 equifrequent biallelic loci, if the correlation for each pair of consecutive markers is 0, 0.25, 0.5, and 0.75, the effective number of markers is approximately 100, 90, 61, and 29, respectively. Real GWAS data are often at a magnitude of 104 (Vinkhuyzen et al., 2013).
The expected phenotype of yi given genotype xik depends on the QTL genotype, say the lth locus, in LD with xik. Assuming a biallelic QTL in LD with the marker, the conditional expectation of the marker is E(yi|xik = AkAk) = Σxilglsilp(xil|xik = AkAk), in which gl = {l
l,
lql, qlql}, and E(yi|xik = AkAk) = βl × Rkl2 + 0 × 2Rkl (1 − Rkl) − βl × (1 − Rkl)2 = (2Rkl − 1)βl. Analogously, we can derive the expected values of E(yi|xik = Akak) = (Rkl − rkl)βl and E(yi|xik = akak) = (1 − 2rkl)βl (See Table 4). Once E(yi|xik) is defined, the distribution of E(Yij|xik, xjk) can be tabulated as in Table 5.
In this section, we investigated two scenarios to derive the expected value of the regression coefficient for Equation (1). In scenario I, genetic similarity is estimated at a single marker, which is in LD with one or more QTLs. In scenario II, genetic similarity is estimated based on M markers, each of which can be in LD with L QTLs.
Under the scenario of one marker, say the kth marker, and one QTL, say the lth QTL, since var(Ωijk) = 1, E(b) = E(ΩijYij), which is E(ΩijYij) = ΣxikΣxjksiksjk[E(yi|xik) − E(yj|xjk)]2p(xik)p(xjk). Consequently, we can derive the regression coefficient as
and
in which τkl = 1 − rkl − Rkl.
When the QTL overlaps with the marker, or the correlation between the QTL and the marker is 1, E(b) = −4pkqk β2l because rkl = Rkl = 1. When the QTL is in linkage equilibrium with the marker, rkl = ql and Rkl = pl, 1 − rkl − Rkl = 0, and consequently E(b) = 0.
According to Table S1, . Consequently, the expression of Equation (6) can be rearranged as . The correlation of a pair of biallelic loci [Equation (2)], and consequently E(b) = − 2ρ2klσ2l, in which . Alternatively, we can write
In the GWAS context, squared LD (Pearson's correlation) is in lieu of the recombination fraction for linkage. The mathematical expectation of the regression coefficient resembles the one in the original HE regression. However, it should be noted that here the interpretation of the regression coefficient is based on linkage disequilibrium and association, whereas the original interpretation is based on linkage between the marker and the QTL.
When multiple QTLs are in LD with the marker, the conditional expectation for yi given xik is , and , respectively. The joint distribution of Ωij and Yij is as summarized in Table S2, which resembles Table 5. Still using E(ΩijYij) = ΣxikΣxjksiksjk[E(yi|xik) − E(yj|xjk)]2p(xik)p(xjk), the regression coefficient can be derived as below.
Equation (8) can be rearranged as
It is easy to see that when L = 1, Equation (9) can be simplified to Equation (7).
When the genetic relatedness matrix is constructed with M markers, each of which may be in LD with L QTLs, the HE regression becomes Yij = a + bΩij, in which . For convenience, Ωijk denotes the relatedness fraction constructed with the kth marker between the ith and the jth individuals. According to the definition of the regression coefficient, .
, in which cov(Ωijk, Xijl) = ρkl2, as expressed in Equation (4).
After rearrangement
in which and , summarizing the between-locus variance.
is the average squared LD between a marker and a QTL across the genome, and is the averaged LD between every pair of markers, including the LD between each marker to itself. The interpretation of Equation (11) will be clear in Simulation III and Simulation IV.
If the phenotype is standardized, heritability equals the additive variance component. It is straightforward to obtain an estimate of the heritability for a single QTL, as in scenario I, or all QTLs, as in scenario II (See Supplementary Note IV)
The sum of square error (SSE) is
var(Yij) = 8σ4y (Supplementary Note III), and
, in which
and d is the number of the regression coefficient (here d = 1).
For scenario I, as only one marker is used, var(Ωij) = 1 and ρ2M = 1.
Given the current GWAS data, which incorporates thousands of individuals and often up to one million markers, it is reasonable to assume and 8Me » 4σ4A Λ2.
For real GWAS data with about one million markers, ranges from 30,000 to 50,000 markers due to the strong LD pattern (Vinkhuyzen et al., 2013).
When the phenotype is standardized, the sampling variance of the regression coefficient is half of the additive variance component.
The expectation of the intercept is E(Yij) = E[(yi − yj)2] = E(yi2) + E(yj2) − 2E(yiyj). E(yi2) = var(yi) − E(yi)2, E(yj2) = var(yj) − E(yj)2, E(yiyj) = cov(yi, yj) + E(yi)E(yj). As the individuals are not related to each other, assuming no common environment, cov(yi, yj) = 0. So, E(Yij) = var(yi) + var(yj) = 2σ2A + 2σ2e, twice the phenotypic variance. The negative ratio between the regression coefficient and the intercept provides an estimate of the heritability if the phenotype is not standardized.
The derived regression coefficients and their sampling variance at the completion of the derivation are summarized in Table 6.
The additive variance of a trait is defined as . However, for a complex trait with polygenic genetic architecture, if the QTLs are randomly allocated along the genome, (Supplementary Note V), a phenomenon that the between-locus covariances tradeoff. This is often true for a trait without ascertainment or selection. When each QTL is tagged perfectly and randomly allocated along the genome, Λ = 1. Equation (11) zeros out the Δ term and directly gives the unbiased estimate of twice negative of the additive variance. Removing the scale makes the heritability estimate unbiased. In practice, due to imperfect LD, the heritability is reduced to h2Λ.
In fact, the HE regression and the mixed model are equivalent and can agree on the heritability estimate (see Simulation III). However, this equivalence can be disturbed when QTL effects are not randomly distributed (Simulation IV).
Like the debut application of the original HE regression for schizophrenia (Elston et al., 1973), the IBS HE regression is also extended to case-control GWAS data in this study. However, due to scale issues and ascertainment (Dempster and Lerner, 1950; Falconer, 1966), the estimated heritability needs to be transformed to the liability scale, which is genetically meaningful for ascertained samples. One transformation was proposed by Lee et al. (2011), denoted here as Hong23. It is expressed as , in which h2l is the heritability on the liability scale, h2o is the heritability on the observed scale directly estimated based on the case-control data, K is the prevalence of the disease, P is the proportion of the cases in the data, and z is the height of the standard normal distribution in which the prevalence K is located.
Once the heritability is estimated by the HE regression on the observed scale with Hong23, it can be easily transformed from the observed scale to the liability scale. Simulation studies will be conducted to investigate whether the HE regression better estimates heritability than does the mixed linear model (Simulation IV).
In addition, Y in the HE regression can also be expressed as a cross-product, and then E(b) = −2pkqkτ2klβ2l, which is half that of Equation (7) (See Supplementary Note VI).
In the Monte Carlo simulation, we will investigate the precision of the derived equations.
This simulation investigated the accuracy of Equation (7) for a single-marker application. One thousand unrelated individuals were simulated. One marker and one QTL were simulated, both of which were equifrequent and biallelic. The heritability of the QTL was 0.5. The LD between the marker and the QTL was set at three levels: ρ = 0.25, ρ = 0.5, and ρ = 0.75. The single marker was used to construct the genetic relatedness, Ω. Then a single-marker-based HE regression was conducted. After standardizing the phenotype, the negative half of the regression coefficient returned the unbiased heritability estimate.
As indicated by Equation (7), given ρ = 0.25, ρ = 0.5, and ρ = 0.75, the regression coefficient expectation was −0.062, 0.125, and 0.57, respectively. After 100 rounds of simulation, the derived expectation of the regression coefficient, as well as the sampling variance (Table 6), were in good agreement with the simulation results listed in Table 7. This simulation indicates that the single-marker HE regression is a competitive tool for QTL mapping.
For the single-marker HE regression, as the expectation and the sampling variance of the regression coefficient were already derived, a t-test could be constructed as , in which the linkage disequilibrium between the kth marker and the lth QTL is ρkl. When the sample size is sufficiently large, the t-test approaches the z-score distribution, and the non-centrality parameter of χ21 is consequently , a χ2-test with one degree of freedom. In Table 8, the required sample size to detect association with a SNP for a GWAS (type-I error rate of 10−8) and the required sample size to detect a QTL are indicated.
Table 8. The sample size required for the single-marker HE regression to detect a QTL associated with the target marker.
In contrast, for a conventional one-marker association linear regression, yi = μ + bksik + ei, if the phenotype and the genotypes are both standardized, E(bk) = ρklβl, and its standard error is , a t-test can be constructed as . Taking the square of the t statistic, the non-centrality parameter of χ21 is .
These two χ2 tests differ by the factor . Once , the single-marker HE regression is more powerful than the conventional liner regression; otherwise, the conventional linear regression method is more powerful. As listed in Table 9, given that the heritability of a QTL is 0.01, if the LD between the target marker is low (ρkl = 0.25), medium (ρkl = 0.5), or high (ρkl = 0.75), the sample size required to allow HE to outperform the linear regression is 6400, 1600, and 712, respectively. If the heritability is even smaller, say h2 = 0.001, the required sample size is 12,800, 3200, and 1423 to make the HE regression more powerful under the low, medium, and high LD, respectively.
Table 9. The required sample size that makes the HE regression more powerful than the conventional single-marker linear regression.
Depending on the sample size, heritability, and LD patterns between the QTL and the target marker, the power of the HE regression may or may not be greater than the conventional linear regression. However, when the sample size is large, or the heritability of the QTL is large, HE regression is a more powerful tool for association studies. These results are based on the assumption that the real sampling variance agrees with the derived theoretical result.
In this simulation, 100 equifrequent and biallelic QTLs were simulated, and the additive effect of each QTL was sampled from N(0, 1). Four LD levels (ρl1, l2 = 0, 0.25, 0.5, 0.75) were adopted for each of two consecutive QTLs, and the effective number of markers decreased correspondingly (Me ≈ 100, 90, 61, 29). One thousand unrelated individuals were simulated, and the genetic relatedness of each pair of individuals was estimated on these 100 QTLs. The heritability of the simulated polygenic model was 0.5, which is calculated as . And .
Both the HE regression and the mixed linear model were employed to estimate the additive variance component. The mixed linear model (Yang et al., 2010) can be expressed as yi = μ + xijaj + ei, where yi is the phenotype of the ith individual, μ is the mean, xij is the indicator variable with values of 0, 1, or 2 depending on the reference allele counts, and ej is the residual. Restricted maximum likelihood (REML) was employed to estimate the variance components of the mixed linear model (Yang et al., 2010).
As shown in Table 10, the estimated heritability from either the HE regression or the mixed linear model was equal and not biased, demonstrating the equivalence between the HE regression and the mixed linear model when the QTLs are randomly distributed regardless of their pairwise LD.
Table 10. Simulation evaluations of Equation (11) and comparison between the HE regression and the mixed linear model method (Δ = 0).
E(b) = −2σ2A Λ (ignoring Δ) sheds light on the inference of the general LD pattern between the tagged markers and the causal variance. , and ρ2M can be estimated from markers. If the heritability of the trait is known (not likely though), it is possible to estimate ρ2Q. For example, the heritability of height is estimated at around 0.8 (Visscher et al., 2006; Perola et al., 2007) in linkage, but is 0.4 as estimated in an association study (Yang et al., 2010). If the estimate from linkage was considered to be the true heritability, = 0.5. Assuming the effective number of markers is Me = 10,000, ρ2M = 0.0001, ρ2Q = ρ2M = 0.00005. The average absolute value of the LD between a QTL and a marker is 0.007.
The general setting for this simulation was similar to the last one, but the QTL effects were sorted such that the additive effects were increased along the simulated chromosomal segment. The covariance between any two QTLs can be predicted by cov. The heritability is defined as , in which . Different from the last simulation, .
With this set-up, which is not likely to be true in practice but illustrates an extreme case, the HE regression and the mixed model gave very different estimations. With increased correlation between markers, the HE gave inflated estimates and the mixed model gave deflated estimates. Although both methods gave biased estimates, Equation (11) still could predict the results of the HE regression correctly (See Table 11).
Table 11. Simulation evaluations of Equation (11) when the covariance summation is not zero (Δ ≠ 0).
The HE regression was applied to case-control data. A polygenic model of L equifrequent diallelic QTLs was simulated, and each locus was in Hardy–Weinberg equilibrium and any pair of QTLs was in linkage equilibrium. The heritability on the liability scale was h2l, the heritability on the liability scale. The effect of each QTL was sampled from N(0, σ2b), and σ2b = hl2 /[2 × p × (1 − p) × L], in which p = 0.5. The phenotype of each individual under the liability scale was scaled to unit. The ascertainment of cases on the liability scale was K. Individuals were sampled from the described reference population until 1000 cases and 1000 controls were recruited.
The heritability on the liability scale was 0.5. In order to cover a broad range of scenarios, three levels of QTL number, L = 100, 1000, and 10,000, and three levels of disease prevalence at the population level, K = 0.1, 0.01, and 0.001, were adopted. Nine scenarios were simulated in total, and 30 independent simulation replications were implemented for each scenario.
The genetic relationship matrix was constructed using all individuals and the allele frequencies were estimated from the sample. The genetic additive variance components were estimated with the HE regression and the mixed model method. As the directly estimated variance component was on the observed scale and could be greater than 1, we employed both the REML and non-constrained REML for mixed model methods, which allowed the heritability to be greater than 1.
As illustrated in Figure 1, the estimated h2l was compared across all three methods. In general the HE regression resulted in a more precise estimate than that of the REML and non-constrained REML. For the mixed model methods, either with or without constraints, REML often underestimated the variance components. The bias was caused by two factors: the number of QTLs (in each row panel) and the prevalence of the disease (in each column panel). With fewer QTLs, a lower prevalence could exacerbate underestimation by the mixed model.
Figure 1. Estimation of heritability on the liability scale using the HE regression and mixed linear model methods. In each row, from left to right, each panel represents the case-control sample simulated under the same heritability on the liability scale (h2l) but with different prevalence. In each panel, the vertical axis indicates the estimated heritability on the liability scale (h2l), whereas the horizontal axis indicates which of the three methods (REML, non-constrained REML, and HE regression [least square estimate]) was used. The standard error of the mean (SEM) is indicated at the top of each bar.
The analytical results summarized in Table 6 were evaluated using Monte Carlo simulation, and were highly precise in general. The single-marker HE regression is a competitive tool for QTL mapping, particularly with a large sample size (Simulations I, II). The HE regression and the mixed model method were equivalent, with both providing a precise heritability estimate for a typical polygenic trait (Simulation III). However, if QTL effects are correlated, neither the HE regression nor the mixed model method gave an unbiased estimate (Simulation IV). For case-control studies, the HE regression should be preferred in general (Simulation V).
In order to facilitate application of the HE regression method to estimate complex trait heritability, GEAR software was developed. GEAR was developed on Java and can run across many operating systems, such as Windows, Mac, and Linus/Unix, as long as a Java virtual machine is available. GEAR has been demonstrated to function in the following situations.
(1) It can generate genetic relatedness of unrelated individuals, as formulated in Equation (3), based on whole-genome markers.
(2) It can estimate the effective number of markers based on a genetic-relatedness matrix.
(3) It can estimate heritability with the HE regression. GEAR can read genotype data saved in PLINK binary format (Purcell et al., 2007).
GEAR can be downloaded from the website: https://sourceforge.net/projects/gbchen/files/GEAR/
The online GEAR manual can be found at https://sourceforge.net/p/gbchen/wiki/GEAR/
Historically, linkage was the major tool for QTL mapping of complex traits since the 1970s, which was gradually replaced by association analysis when GWAS became popular (The Wellcome Trust Consortium, 2007). The transmission/disequilibrium test (TDT; Ott, 1989; Spielman et al., 1993) triggered the transition from linkage to association for family-based studies. In the year 2000, generalized TDT was proposed (Laird et al., 2000), which is robust for population stratification. Shortly after that, population-based design emerged as the major flow in genetic data, and GWAS became the leading method for estimating heritability up until now. Extension of the original HE regression to association studies can be seen as an effort to increase the diversity of GWAS analysis tools.
In this study, we established a theory for a modified HE regression, in which IBS scores replace IBD scores. Although IBS is used to detect IBD in linkage studies (Lange, 1986a,b; Bishop and Williamson, 1990), it is considered to be a way of inferring IBD for relatives, such as sib pairs, when founder genotypes are unavailable. In this study, IBS served as the key concept to detect association of unrelated samples rather than relatives. Linkage and association have both been proposed to estimate heritability of complex traits. For example, for height, the heritability estimated from linkage studies is around 0.8 (Visscher et al., 2006; Perola et al., 2007), but around 0.4 from association studies (Yang et al., 2010). Thus, far there is no clear conclusion regarding the fundamental difference between the heritability estimated from these two kinds of methods. Despite their mathematical similarity, application, and interpretation differences should be appreciated.
Under various scenarios, the mathematical expectations of the regression coefficients, as well as the sampling variances, were derived. There is substantial mathematical similarity between the IBD HE regression and the IBS HE regression. For example, for both models under the single-marker scenario, their regression coefficients can be expressed in a unified form, E(b) = − 2ρ2 σ2A. As these two models are based on different genetic mechanisms, the interpretations of their respective regression coefficients are reasonably different. In the IBD-based HE regression, E(b) = −2(1 − 2c)2 σAl2, 1 − 2c ranges from 0 to 1; whereas in the IBS-based HE regression, E(b) = −2τ2kl σAl2, in which τkl, ranges from −1 to 1. As the values of r and R rely upon the allele frequencies of the biallelic marker and the biallelic QTL, they reach either −1 or 1 only given that the marker has the same allele frequency as that of the QTL. However, after taking the square, both (1 − 2c)2 and τ2 lie between 0 and 1, inclusive.
Equation (11), E(b) = −2σ2A Λ (ignoring Δ), provides a possible way to estimate the LD pattern between causal loci and markers. If the true heritability is not readily known, it is possible to estimate ρ2Q, the average LD between QTLs and markers. As demonstrated in simulation III, it may be possible to estimate ρ2. However, the causal loci can be in any possible form, such as SNPs, chromatin markers, or methylation markers, and different methods capture genetic variation in different forms. In practice, the obstacle in estimating ρ2
lies in how heritability estimated from different methods, such as linkage and association, or genotyping platforms, such as SNP markers and methylation markers, can be connected to each other. Equation (11) sheds light on the investigation for how QTLs are distributed along the genome.
Application of the HE regression to heritability estimation of complex traits revealed that the HE regression seems to be equivalent to the mixed model approaches in general (Δ = 0). A similar equivalence was previously established for linkage analysis (Sham and Purcell, 2001). However, for GWAS data, it should be noted that the equivalence is conditional on the genetic architecture of a trait. As indicated in the simulation, the equivalence stands only for typical polygenic genetic architecture, which may be true for many traits without ascertainment or selection, such as height (Yang et al., 2010). However, when substantial covariance exists between causal loci, the equivalence does not stand and neither the HE regression nor the mixed model method gave unbiased estimates. The equivalence may break down under other circumstances that have not been investigated. In real studies, this kind of covariance may be a result of selection in active regions, such as HLA loci, which harbors many signals; then, the HE regression and the mixed model estimates may differ. The equivalence may break down under other circumstances that have not been investigated in this study.
In GWAS, many samples are collected for complex diseases, which are often in a case-control design. Complex disease prevalence is often low; consequently, the cases are under strong ascertainment, which disrupts the assumptions underlying the mixed linear model. As observed in the simulation studies, the HE regression is more precise in estimating heritability than the mixed linear model for case-control studies across a broad range of scenarios. Use of HE regression is advantageous when the disease prevalence is low and the number of causal loci is few. In their original work, Lee et al. (2011) assumed an infinitesimal model of complex diseases. However, when this assumption was disrupted during simulation (likely in practice as well), the mixed linear model method gave biased estimates of heritability. Thus, whenever possible, the HE regression method is preferable to estimate heritability of complex traits.
As derived in this work, the HE regression and the mixed model method are equivalent under polygenic genetic architecture. In other words, when the estimates generated by these two methods significantly differ for the same data, caveats should be presented. As investigated in the simulation, the real heritability may lie between the estimates of these two methods. Speed et al. (2012) previously investigated the assumptions underlying the mixed model method and proposed alternative weighting methods to adjust the heritability estimation. However, as their weighting method depends on genetic architecture, which is often unknown, it is difficult to justify which weighting method is appropriate to adopt for certain data (Gusev et al., 2013). Thus, simply comparing the estimates from the HE regression and the mixed model method may offer an alternative way of justification.
It should be noted that the HE regression method is on the basis of the least square framework rather than the maximal likelihood framework as many mixed model based on (Yang et al., 2010; Speed et al., 2012; Lee et al., 2013). As a numerical method, maximal likelihood methods give estimates optimizing the likelihood under the assumptions, which may break down in practice. Given recent interests in comparing estimates with or without imputation for the genome (Gusev et al., 2013), controversial results have been observed. It is not sure what the increased or decreased estimation of heritability indicates after imputation. A reasonable guess will be that the local covariance structure, as indicated in Equation (11), changes and eventually bring out different estimates. The proposed IBS HE regression, which depends on fewer assumptions compared with maximal likelihood methods, may help melt the controversy.
In practice, undocumented relatedness may creep into samples, and eventually bring about suspiciously high relatedness. As discussed previously (Powell et al., 2010), Equation (3) gives a score of 0 for a pair of unrelated individuals, 0.5 for first-degree relatives, and 1 for duplicated individuals or monozygotic twins. It seems easy to eliminate related individuals if a cutoff, say a relatedness of less than 0.05, is applied to the sample. For association studies, population stratification may increase false positive rates. To reduce the threat of population stratification, phenotypes can be adjusted by principal components (Price et al., 2006) and then fit into the HE regression. If a sample is admixed, the power of the HE regression may be reduced if, in the ancestral populations, the allele frequency spectrums are different from each other or genetic heterogeneity exists in the genetic architecture of the underlying trait in question. More investigation will be required to overcome this challenge.
The variance components have often been estimated via REML (Yang et al., 2010; Lee et al., 2011). Given its various merits, REML is computationally expensive, particularly for large sample sizes. The computational complex is on the scale of O(tN3), which indicates that it is cubic to the sample size and t rounds of iterations. The time complex of the HE regression is far lower, asymptotically O(2N2), given two parameters, the intercept and the regression coefficient, included in the model. Given the large sample sizes often employed in GWAS, the computational burden can be dramatically reduced. Although the HE regression method is derived on a simple-regression scenario, its extension to a multiple-regression scenario is straightforward. For instance, the genetic relatedness between each pair of individuals can be constructed on each chromosome and then all chromosome-based relatedness scores can be fit into the regression framework. In addition, the difference between a pair of phenotypes can also be expressed as a cross product and squared sum (Sham and Purcell, 2001).
Guo-Bo Chen conceived the study, derived the equations, conducted the simulation studies, and calculated the analytical results. Guo-Bo Chen developed and maintained the GEnetic Analysis Repository (GEAR) software. Guo-Bo Chen wrote the manuscript.
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This article benefited greatly from detailed feedback by reviewers. Thanks to Professors Peter M. Visscher and Naomi R. Wray for helpful discussion. Thanks to Mr. Zhi-Xiang Zhu for contributing to GEAR. The publication fee was in part covered by Frontiers Publishing Grants and by Australia National Health and Medical Research Council Grant 1011506 to Peter M. Visscher and Naomi R. Wray.
The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fgene.2014.00107/abstract
Bishop, D. T., and Williamson, J. A. (1990). The power of identity-by-state methods for linkage analysis. Am. J. Hum. Genet. 46, 254–265.
Dempster, E. R., and Lerner, I. M. (1950). Heritability of threshold characters. Genetics 35, 212–236.
Devlin, B., and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322. doi: 10.1006/geno.1995.9003
Elston, R. C., Kringlen, E., and Namboodiri, K. K. (1973). Possible linkage relationships between certain blood groups and schizophrenia or other psychoses. Behav. Genet. 3, 101–106. doi: 10.1007/BF01067650
Falconer, D. S. (1966). The inheritance of liability t o certain diseases, estimated from the incidence among relatives. Ann. Hum. Genet. 29, 51–76.
Gusev, A., Bhatia, G., Zaitlen, N., Vilhjalmsson, B. J., Diogo, D., Stahl, E. A., et al. (2013). Quantifying missing heritability at known GWAS loci. PLoS Genet. 9:e1003993. doi: 10.1371/journal.pgen.1003993
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3–19. doi: 10.1007/BF01066731
Hill, W. G., and Robertson, A. (1968). Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231. doi: 10.1007/BF01245622
Hill, W. G., and Weir, B. S. (2011). Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genet. Res. (Camb.). 93, 47–64. doi: 10.1017/S0016672310000480
Laird, N. M., Horvath, S., and Xu, X. (2000). Implementing a Unified Approach Tests of Association to family-based tests of association. Genet. Epidemiol. 19, S36–S42. doi: 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M
Lange, K. (1986a). A test statistic for the affected-sib-set method. Ann. Hum. Genet. 50, 283–90. doi: 10.1111/j.1469-1809.1986.tb01049.x
Lange, K. (1986b). The affected sib-pair method using identity by state relations. Am. J. Hum. Genet. 39, 148–150.
Lee, S. H., Wray, N. R., Goddard, M. E., and Visscher, P. M. (2011). Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88, 294–305. doi: 10.1016/j.ajhg.2011.02.002
Lee, S. H., Yang, J., Chen, G.-B., Ripke, S., Stahl, E. A., Hultman, C. M., et al. (2013). Estimation of SNP heritability from dense genotype data. Am. J. Hum. Genet. 93, 1151–1155. doi: 10.1016/j.ajhg.2013.10.015
Lynch, M., and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sunderland, MA: Sinauer Associates, Inc.
Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., et al. (2009). Finding the missing heritability of complex diseases. Nature 461, 747–753. doi: 10.1038/nature08494
Ott, J. (1989). Statistical properties of the haplotype relative risk. Genet. Epidemiol. 6, 127–130. doi: 10.1002/gepi.1370060124
Perola, M., Sammalisto, S., Hiekkalinna, T., Martin, N. G., Visscher, P. M., Montgomery, G. W., et al. (2007). Combined genome scans for body stature in 6,602 European twins: evidence for common Caucasian loci. PLoS Genet. 3:e97. doi: 10.1371/journal.pgen.0030097
Powell, J. E., Visscher, P. M., and Goddard, M. E. (2010). Reconciling the analysis of IBD and IBS in complex trait studies. Nat. Rev. Genet. 11, 800–805. doi: 10.1038/nrg2865
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909. doi: 10.1038/ng1847
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575. doi: 10.1086/519795
Purcell, S. M., Wray, N. R., Stone, J. L., Visscher, P. M., O'Donovan, M. C., Sullivan, P. F., et al. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752. doi: 10.1038/nature08185
Ritland, K. (1996). Estimators for pairwise relatedness and individual inbreeding coefficients. Genet. Res. 67, 175–185. doi: 10.1017/S0016672300033620
Sham, P. C., and Purcell, S. (2001). Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs. Am. J. Hum. Genet. 68, 1527–1532. doi: 10.1086/320593
Sham, P. C., Purcell, S., Cherny, S. S., and Abecasis, G. R. (2002). Powerful regression-based quantitative-trait linkage analysis of general pedigrees. Am. J. Hum. Genet. 71, 238–253. doi: 10.1086/341560
Speed, D., Hemani, G., Johnson, M. R., and Balding, D. J. (2012). Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 91, 1011–1021. doi: 10.1016/j.ajhg.2012.10.010
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506–516.
The Wellcome Trust Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678. doi: 10.1038/nature05911
Vinkhuyzen, A. A. E., Wray, N. R., Yang, J., Goddard, M. E., and Visscher, P. M. (2013). Estimation and partition of heritability in human populations using whole-genome analysis methods. Annu. Rev. Genet. 47, 75–95. doi: 10.1146/annurev-genet-111212-133258
Visscher, P. M., Medland, S. E., Ferreira, M. A. R., Morley, K. I., Zhu, G., Cornes, B. K., et al. (2006). Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2:e41. doi: 10.1371/journal.pgen.0020041
Weeks, D. E., and Lange, K. (1988). The affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet. 42, 315–326.
Wray, N. R. (2005). Allele frequencies and the r2 measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Res. Hum. Genet. 8, 87–94. doi: 10.1375/1832427053738827
Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569. doi: 10.1038/ng.608
Keywords: Haseman–Elston regression, GWAS, identity by state, variance component, missing heritability, case-control, mixed linear model, REML
Citation: Chen G-B (2014) Estimating heritability of complex traits from genome-wide association studies using IBS-based Haseman–Elston regression. Front. Genet. 5:107. doi: 10.3389/fgene.2014.00107
Received: 23 October 2013; Accepted: 10 April 2014;
Published online: 30 April 2014.
Edited by:
Mariza De Andrade, Mayo Clinic, USACopyright © 2014 Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Guo-Bo Chen, Queensland Brain Institute, The University of Queensland, Building 79, St. Lucia, QLD 4072, Australia e-mail:Y2hlbmd1b2JvQGdtYWlsLmNvbQ==
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
Research integrity at Frontiers
Learn more about the work of our research integrity team to safeguard the quality of each article we publish.