- 1INRA, Univ. Paris-Sud, CNRS, AgroParishTech, Université Paris-Saclay, Gif-sur-Yvette, France
- 2Department of Mathematics, An-Najah National University, Nablus, Palestine
Recombinant Inbred Lines (RILs) are obtained through successive generations of inbreeding. In 1931 Haldane and Waddington published a landmark paper where they provided the probabilities of achieving any combination of alleles in 2-way RILs for 2 and 3 loci. In the case of sibling RILs where sisters and brothers are crossed at each generation, there has been no progress in treating 4 or more loci, a limitation we overcome here without much increase in complexity. In the general situation of L loci, the task is to determine 2L probabilities, but we find that it is necessary to first calculate the 4L “identical by descent” (IBD) probabilities that a RIL inherits at each locus its DNA from one of the four originating chromosomes. We show that these 4L probabilities satisfy a system of linear equations that follow from self-consistency. In the absence of genetic interference—crossovers arising independently—the associated matrix can be written explicitly in terms of the recombination rates between the different loci. We provide the matrices for L up to 4 and also include a computer program to automatically generate the matrices for higher values of L. Furthermore, our framework can be generalized to recombination rates that are different in female and male meiosis which allows us to show that the Haldane and Waddington 2-locus formula is valid in that more subtle case if the meiotic recombination rate is taken as the average rate across female and male. Once the 4L IBD probabilities are determined, the 2L probabilities of RIL genotypes are obtained via summations of these quantities. In fine, our computer program allows to determine the probabilities of all the multilocus genotypes produced in such sibling-based RILs for L<=10, a huge leap beyond the L = 3 restriction of Haldane and Waddington.
Introduction
There are numerous inference problems in population and quantitative genetics that require comparing experimental frequencies of genotypes to those expected “theoretically.” Examples include genetic mapping of genomic markers, localizing causal factors of diseases and quantitative traits, performing marker assisted selection etc. (Lander and Schork, 1994; Weir, 1996; Walsh and Lynch, 2018). The expected frequencies of genotypes, hereafter referred to as probabilities, of interest in such studies often involve multiple loci (Buckler et al., 2009) and are strongly dependent on population structure. In population genetics studies, the structure of natural populations is rarely perfectly known. That partly explains why, in both animal and plant genetics, controlled crosses are widely produced to ensure a specific population structure. Arranging the crosses to lead to homozygous lines is greatly advantageous as such lines can be reproduced “identically and indefinitely.” The simplest situation satisfying these criteria is that of recombinant inbred lines (RILs) (Crow, 2007) founded from two parents as displayed in Figure 1. Given two (generally homozygous) parents that are the founders of the RIL construction (F0), one first produces the associated hybrids (F1). Second, starting with these F1 individuals, one produces a sequence of generations F2, F3, etc by iterative inbreeding, crossing male and female siblings until formally at F∞ one reaches full homozygozity (fixation of the alleles at all loci). As seen in Figure 1, the genomes of the homozygous lines produced by this process are mosaics of the parental genomes.
Figure 1 The production of recombinant inbred lines by sibling mating (SIB RILs). The homologous chromosomes (numbered 0, 1, 2, and 3) inherit genomic segments from their parents, the boundaries of which identify crossover positions. The allelic content becomes fixed for “enough” generations, and so we introduce the limit of generations with the notation F∞.
Consider the allelic content at some set of L genomic markers or loci. There are then 2L possible RIL genotypes, each having a probability that depends on how meioses generate recombinations between these different loci. In the case of plants that allow for selfing, the same individual is both the mother and the father of its offspring; the RILs are then produced via single seed descent (SSD) as opposed to via sibling (hereafter denoted SIB) mating, this second case being the focus of the present work.
There are numerous generalizations of the RIL construction just given. Instead of using two parents to initiate the inbreeding, the use of 2k parents leads to 2k-way RILs (Broman, 2005). 2k-way RILs start with 2k parents to form 2k-1 offspring that are themselves crossed iteratively following a funnel (specifically a binary tree) pattern. Once the root of this tree is reached, the usual RIL inbreeding process is applied. For instance, the so called “Collaborative Cross” which has been a key community tool for mouse genetics, corresponds to k = 3; the choice there of using 8 founding parents at the top of the funnel allows for significantly greater allelic diversity than when using just 2-way RILs. Another generalization is the so called Advanced Intercross RIL (AI-RIL, sometimes referred to as Intermated RIL or IRIL) in which several generations of panmixia are inserted before applying the inbreeding to produce the RILs (Darvasi and Soller, 1995; Winkler et al., 2003; Rockman and Kruglyak, 2008). Other generalizations include Multi-parent Advanced Generation Inter-Cross (MAGIC) (El-din El-Assal et al., 2001), nested association mapping (NAM) populations (Buckler et al., 2009) etc. All of these population constructions involve some initial generations of allelic shuffling followed by the RIL (inbreeding) construction per se. Those early generations produce in effect initial conditions on the genotypes that are at the origin of the RILs and these initial conditions can be computed by direct recurrence from one generation to the next. In contrast, the RIL phase requires crossings that continue until all loci are homozygous and thus—at least mathematically—this phase involves an infinite number of generations. As a result, the computation of the probabilities of multilocus genotypes in RILs does not follow from a simple recursion over a fixed number of generations: either an extrapolation has to be made to deal with the infinite number of generations or some mathematical trick has to be devised to bypass the infinite nature of the process. This fact is at the heart of the difficulty of obtaining exact probabilities of multilocus genotypes in RILs.
The mathematical derivation of such RIL probabilities for two and three loci was provided by Haldane and Waddington (Haldane and Waddington, 1931) for bi-parental RILs in 1931. For two loci, by considering successive generations, they produced recursion equations for the probabilities of the corresponding (fixed or not) SIB genotypes which they then extrapolated to an infinite number of generations. This was quite a feat as they had to solve 22 simultaneous equations, leading in fine to their celebrated relation:
In this formula, R is the probability for a RIL two-locus genotype to be recombinant (have the allele of one F0 parent at one locus and the allele of the other at the other locus) while r is the recombination rate per meiosis between the two loci, assumed identical across male and female meiosis. We will rederive this formula using our framework in the section Case of Two Loci: Recovering the Haldane–Waddington Result and Allowing for Sex-Dependent Recombination Rates because to our knowledge, the generalization of the Haldane–Waddington formula to situations where male and female recombination rates differ has not been published and our framework allows to deal with this extension.
Given R, it is easy to derive the probabilities of the four different RIL genotypes (each of the two loci can be fixed for either of the two parental alleles). Indeed, the two recombinant genotypes have the same probability and the sum of these two probabilities is precisely R. The probability of each of the two recombinant (respectively non-recombinant) RIL genotypes is then R/2 (respectively (1-R)/2).
Haldane and Waddington further showed that this two-locus result also determined the three-locus probabilities. A way to see this is to notice that for three loci (L = 3) there are 2L = 8 different RIL genotypes (at each locus the homozygous allelic state comes from one of the two parents). These 8 genotypes can be grouped into 4 pairs such that within each pair one genotype is obtained from the other by exchanging the alleles of the parents; for instance if the alleles of the parents are denoted by (A, B, C) and (a, b, c) at the three successive loci, the 4 pairs are {(A, B, C), (a, b, c)}, {(A, B, c), (a, b, C)}, {(A, b, C),(a, B, c)} and {(a, B, C), (A, b, c)}. In each pair, the two complementary genotypes have the same probability so in effect it is enough to find the probabilities of each of the 4 pairs. These probabilities add up to one, providing a first equation. Then, labeling the loci as 1, 2, and 3, if the three meiotic recombination rates r1,2, r2,3 and r1,3 are known, the three RIL recombination rates R1,2, R2,3 and R1,3 are also. These quantities provide three further equations relating the four pair probabilities. These four equations uniquely determine the four pair probabilities and thus the probabilities of the 8 RIL genotypes.
Since that 1931 Haldane–Waddington landmark paper, some works have provided generalizations of Eq. 1, for instance in the case of 2k-way RILs (Broman, 2005; Teuscher and Broman, 2007) and in the case of IRILs (Winkler et al., 2003; Teuscher and Broman, 2007). However, the problem of dealing with more than three loci seems substantially more difficult. Following the Haldane–Waddington algebraic approach, if there are L loci, there are 16L possible allelic combinations at each generation and so it is necessary to diagonalize a 16L×16L matrix; that task takes on the order of 163L operations and thus cannot be done on a standard computer even for L = 4. To our knowledge, the only work providing closed-form expressions for 4 or more loci is that of (Samal and Martin, 2015), but their framework for determining exact probabilities of RIL multilocus genotypes applies only to single seed descent RILs, not to SIB RILs. The contribution of the present work is to show that the case of SIB RILs is also to a large extent tractable. In particular, (i) we give the analytic expressions for treating four loci in the absence of crossover interference, and (ii) we show that our framework allows to tackle more loci, though at a computational cost (CPU time and also computer memory) that increases roughly as 16L. Specifically, our computer scripts, written in R (Ihaka and Gentleman, 1996), can treat L = 8 loci in approximately 5 min when run on a desktop computer while a high-end server allows us to go up to L = 10 loci. Lastly, to illustrate an application of our theoretical framework to a practical situation, we construct a maximum likelihood algorithm to impute missing data in RIL populations. In contrast to the standard approach which infers probabilities using machine learning, our method exploits the exact multilocus RIL genotype probabilities. By comparing the two approaches we show that the use of the exact probabilities significantly increases the reliability of the missing data imputation.
Overview of the Method
In the less complex case of single seed descent RILs, it was possible to determine the probabilities of the 2L RIL multilocus genotypes by writing self-consistent equations directly associated with these unknowns (Samal and Martin, 2015). However, in the case of SIB RILs, the situation is more subtle because the allele carried by a RIL genotype may come from either of the two siblings at the F1 generation and thus “identical by descent” (IBD) does not reduce to identity by state (having the same allelic content) as can be seen in Figure 1. As a result, it is necessary to first work with the 4L probabilities that a RIL inherits IBD at the L loci from any of the four F1 homologous chromosomes. After introducing in the Section Probabilities of Multilocus IBD Inheritances in RILs and the Set of Non-Equivalent Q’s the 4L RIL multilocus IBD probabilities, we show in the section Self-Consistent Equations for the IBD Probabilities that each of these unknowns satisfies a self-consistent equation relating it to the others. These equations allow to overcome the technical obstacle of there being an unlimited number of generations in the process of generating RILs all the way to complete fixation. Although these 4L self-consistent equations constrain the 4L unknowns, we show in the section Adding One Linear Inhomogeneous Equation to Uniquely Specify All IBD Probabilities that one additional equation is necessary to specify the solution. For that last constraint we use the fact that the sum of all probabilities is 1. In the section Reducing the System of Equations to Treat Only the Non-Equivalent Q’s, we show how the complexity of the problem can be reduced by working with a subset only of the unknowns. Finally, upon solving the system of equations to determine the IBD quantities, each of the 2L RIL multilocus genotype probabilities follows by summing the probabilities of all compatible IBDs as will be shown in the section Extracting the Probabilities of RIL Genotypes.
Probabilities of Multilocus IBD Inheritances in Rils and the Set of Non-Equivalent Q’s
For a given RIL L-locus genotype (specified formally at generation F∞), the genomic content at any locus ℓ (ℓ ∈{1,…, L}) will be IBD with exactly one of the four F1 homologous chromosomes. (One may note that the allelic fixation can happen before the IBD fixation, but no matter what, after an infinite number of generations both the IBD and the allelic states are fixed, that is they are identical across the four chromosomes of the SIB pair.) We number those four chromosomes 0, 1, 2 and 3 as indicated in Figure 2 and use the same labeling for the later generations too. The IBD case illustrated is such that the RIL inherits from the F1 chromosome 2 at the first locus and from the F1 chromosome 1 at the second locus. (By convention we order the loci from left to right.) More generally, let us introduce the probability Q(i1, i2…,iL) that a RIL inherits IBD from F1 chromosome iℓ for locus ℓ, ℓ = 1,…, L where iℓ = 0, 1, 2, 3. Naturally the sum of these 4L probabilities (there are four possible values of iℓ at each locus ℓ) is equal to 1.
Figure 2 Inheritance during SIB mating and illustration of the construction of a self-consistent equation for any IBD probability. At each generation the homologous chromosomes are labeled 0, 1 (for the female) and 2, 3 (for the male). Note that the chromosomes labeled 0 and 2 are the outcomes of female meiosis while the chromosomes labeled 1 and 3 are the outcomes of male meiosis. The drawing illustrates the transition probability T[(2,1) → (1,2)] Q(1,2) entering the self-consistent equation (cf. Eq. 3 when the left-hand side is Q(2, 1).
For L = 1, there are four IBD probabilities: Q(0), Q(1), Q(2) and Q(3). We shall assume Mendelian segregation with no bias in favor of any particular allele and so in particular the two homologues within each sex are equivalent. Then Q(i) = 1/4 for all ∈ {0, 1, 2, 3}. Moving on to L = 2 for which there are 16 Q’s, the equivalence of homologues leads to the equalities Q(0,0) = Q(1,1), Q(0,1) = Q(1,0), Q(2,2) = Q(3,3), and Q(2,3) = Q(3,2) but also to equalities between mixed terms, Q(0,2) = Q(0,3), Q(1,2) = Q(1,3) etc. Furthermore, if female and male meiosis behave in the same way (so that in particular they have the same recombination rates), we can also conclude that Q(0,0) = Q (2,2) etc so that finally there are just three probabilities to determine, Q(0,0), Q(0,1) and Q(0,2) instead of the initial 16. More generally, if there are L loci, how many non-equivalentQ’s are there? We shall assume there is no segregation bias and that female and male meioses have statistically identical behavior. Then it is possible to show (see Supplementary Material for details) that the number of non-equivalent Q’s is exactly
For example L = 1 leads to NQ (L) = 1 while L = 2 leads to NQ (L) = 3. The number of these non-equivalent Q’s grows roughly as (1/8) × 4L to be compared with the total number ignoring equivalence of 4L. The factor (1/8) clearly makes it worthwhile to use such a reduction in the number of unknowns to simplify the task of writing and solving the equations. The proof of Eq. 2 in the Supplementary Material provides a way to enumerate the Q’s to be kept and schematically goes as follows. First, because all four chromosomes play equivalent roles, we can force i1 to be 0. Second, i2 can be constrained not to take the value 3 since that value can be replaced by 2, this time by equivalence of chromosomes 2 and 3. If i2 takes the value 0 or 1, we can again constrain i3 to be different from 3 by the same reasoning. If instead i2 = 2, then i3 must be allowed to take all values 0, 1, 2 and 3. We can proceed in this way to define the rules to be applied to the successive iℓ. As long as the current list consist of 0’s and 1’s, the next i can be constrained to not take the value 3 by equivalence between chromosomes 2 and 3, but for all entries after the first occurrence of a 2, all values must be allowed (see the Supplementary Material for the final steps required to prove Eq. 2). As an illustration, the reader can check that for L = 3 loci, this construction leads to 10 non-equivalent Q’s, namely Q(0,0,0), Q(0,0,1), Q(0,0,2), Q(0,1,0), Q(0,1,1), Q(0,1,2), Q(0,2,0), Q(0,2,1), Q(0,2,2), and Q(0,2,3).
Self-Consistent Equations for the 4L IBD Probabilities
The IBD inheritance needs an infinite number of generations to become fixed with certainty, at least in principle. Our strategy consist in mapping such an infinite process into a finite one by relying on self-consistency. The probability for F∞ siblings to inherit IBD the sequence of “indices” (i1,i2,…,iL) from the F1 chromosomes can be decomposed into trajectories where the inheritance indices at the F2 level are also made explicit. If we denote these by (iʹ1,iʹ2,…,iʹL), we can reinterpret Q(i1,i2,…,iL) as a sum of contributions:
where T[(·)→(·)] is the transition probability of having the IBD propagate from the first list of indices to the second list of indices when going from the F1 to the F2 generation. T[(·)→(·)] is illustrated graphically in Figure 2 by considering the case of two loci and having i1 = 2, i2 = 1, iʹ1 = 1 and iʹ2 = 2 .
Clearly T [(·) → (·)] depends on the meiotic process, and thus in particular on the recombination rates between loci. To simplify the notation, let us set u = (i1,i2,…iL) and v = (iʹ1,iʹ2,…,iʹL). These transition probabilities T [(·) → (·)] satisfy three properties. First, if ik = 0 or 1, T [u → v] = 0 unless iʹk = 0 or 2. Similarly, if ik = 2 or 3, T [u → v] = 0 unless iʹk = 1 or 3. We summarize this via the rules
where i and iʹ ∈ {0,1,2,3}. Second, it turns out that the matrix T is “doubly stochastic” meaning that the sum of its entries in any row or in any column is exactly 1. The result that the sum over elements in a row is 1 follows from the fact that this sum gives the probability of having any of the possible outcomes of inheritances for a given starting point. Analogously, the result that the sum over all elements in a column is 1 corresponds to the fact that a given v is reached by some u and that summing over all possibilities for u again leads to 1. Third, each element of T decomposes into four factors,
where the subscript of each P labels the chromosome of interest (and therefore the meiosis) at the F2 generation, thus Pj is a probability associated with the meiosis that produces chromosome j when going from F1 to F2. Consider for specificity the term P3. For the computation of this probability, only the entries in v equal to 3 matter. The corresponding indices specify which loci are thereafter IBD from chromosome 3 when considering the F∞ inheritance from the F2 generation. If those loci numbers are say 2, 5, and L – 1, then P3[u → v] is the probability for the loci 2, 5 and L – 1 to inherit IBD from i2, i5 and iL-1 during the meiosis producing chromosome 3 when going from the F1 generation to the F2 generation. Note that all the other loci and chromosomes are irrelevant for this factor. The probability of that event is 0.5 (for the probability that the locus 2 will inherit IBD from chromosome i2) times the probability that the successive intervals 2-5 and 5-(L–1) will be as required – recombinant or not – by the values of i5 and iL–1. Let us suppose that meioses arise without genetic interference, that is, according to the so-called Haldane model (Haldane et al., 1919). (Note that the values of these P's are the only part of our framework where crossover interference affects our computations; if these single-meiosis probabilities are known, then our framework provides the probabilities of all RIL multilocus genotypes just as in the case of no interference.) For specificity, if there is no interference and both intervals 2–5 and 5– (L – 1) are recombinant, the associated (meiotic) probability P is simply 0.5 × r2,5 × r5,L – 1. Such a reasoning is easily extended to any situation, leading to the formula
where the locus indices l and lʹ are such that vl = vlʹ = j , j being the index appearing in the probability Pj. In addition, the el,lʹ are defined as
For Eq. 6, an interval is called “recombinant” if and only if il and ilʹ differ. Lastly, we need to specify the actual pairs of loci l and lʹ that are to be used in that equation. To do so, we first construct the list of ordered indices that satisfy the constraint vl = vlʹ = j. The product in Eq. 6 is then over the successive pairs of this list. If the list is empty, Pj = 1 while if there is only one element in the list, Pj = 0.5. The interpretation of Eq. 6 is then as follows: there is a factor rl,lʹ if the u list imposes that the interval be recombinant and a factor 1 – rl,lʹ otherwise. Putting together Eqs.3 and 5 specifies the 4L linear homogeneous equations for the Q’s. In our computer software, we determine the matrix elements of T as formal mathematical functions of the rl,lʹ. In these general expressions it is possible to substitute the numerical values of the rl,lʹ when necessary.
Adding One Linear Inhomogeneous Equation to Uniquely Specify All 4L IBD Probabilities
Eq. 3 can be rewritten as
for all choices of u, corresponding to a set of 4L linear homogeneous equations. Given one has as many equations as unknowns, one might hope that this system would determine the Q’s but that is not the case because these 4L equations are not independent. Indeed, consider the sum of all the equations in the system:
By interchanging the order of the sums this becomes
which is automatically satisfied because T is doubly stochastic so that . To overcome the problem coming from this dependence amongst the homogeneous self-consistent equations, we need to include further information. We choose to do that by adding the constraint that the sum of all 4L IBD probabilities equals 1:
The inclusion of this (inhomogeneous) linear equation then uniquely specifies the values of all Q’s.
Reducing the System of Equations to Treat Only the NQ(L) Non-Equivalent Q’s
As mentioned previously, it is advantageous to work with a subset of non-equivalent Q’s because this substantially reduces the complexity of the operations to be performed. Specifically, we modify the above approach by considering self-consistent equations only for the reduced list of unknowns—the NQ(L) non-equivalent Q’s chosen in the section Probabilities of Multilocus IBD Inheritances in RILs and the Set of Non-Equivalent Q’s—so instead of having 4L homogeneous equations of the type Eq. 8 we have only NQ(L) of them. In these NQ(L) equations, we replace each Q(v) by an equivalent Q(vʹ) where Q(vʹ) belongs to our list of NQ(L) unknowns. This recipe leads to NQ(L) linear homogeneous equations for our unknowns. Furthermore, we also apply these substitutions to the inhomogeneous equation Eq. 11, with the previously mentioned rule. As a result, by counting the number of Q’s arising in each equivalence class defined in the section Probabilities of Multilocus IBD Inheritances in RILs and the Set of Non-Equivalent Q’s, Q(u) occurs with weight 4 if the entries of u are all different from 2 and with weight 8 otherwise.
In practice, to solve this set of equations, it is convenient to have as many equations as unknowns so we remove exactly one of the homogeneous equations. In our computer algorithm we remove the last of these homogeneous equations but any other choice is just as valid. Having obtained as many independent equations as there are unknowns, the direct solution of this linear system (a linear algebra problem) provides the (unique) values of our NQ(L) non-equivalent Q’s.
Extracting the 2L Probabilities of RIL Genotypes
Once the Q’s are determined, the probabilities of RIL multilocus genotypes can be computed by summing all IBD probabilities that are compatible with the RIL allelic content. Let us refer to the allelic content of parent 1 as a series of A alleles and that of parent 2 as a series of a alleles. Consider then a RIL multilocus for all k \in {1,..,L} genotype, written as a list .G = (α1,α2,…,αL) of L alleles, αk being A or a. The probability of a genotype G is obtained by summing over all Q(u) for which the u is compatible with the allelic content of G. The compatibility rule can be summarized as follows: if αk = A, then uk must be 0 or 2, while if αk = a, then uk must be 1 or 3. This is formalized mathematically by the following equation
where the sum is restricted to the u’s satisfying the compatibility rule. Note that the Q’s on the right-hand side of Eq. 12 in general will not belong to our list of non-equivalent Q’s. As before, just omit all the terms associated with Q’s that are not in this list and multiply the other terms by either 8 or 4 depending on whether the associated u has one of its indices uk equal to 2 or not, again because of the size of the equivalence classes.
Results
We illustrate the power of our framework by considering increasing number of loci. The case of two loci is presented both for pedagogical reasons and to give the novel (as far as we know) values of the IBD probabilities when allowing for sex-dependent recombination rates. For three loci we detail the derivation of the coefficients of the self-consistent equations by giving associated graphical representations in the Supplementary Material. For four loci the analytical expression of the 40 × 40 matrix is also given explicitly. For more loci, the mathematical steps become too cumbersome to be dealt with by hand, but our computer code (in the form of R functions) can be used to first generate the analytic expressions for the linear system of equations, then to solve that system for the Q’s, and finally to produce the probabilities of all the RIL multilocus genotypes. The complexity of the computations provided by our framework can be summarized via the dimensionality of the linear system of equations used to compute the Q’s. This dimension increases roughly by a factor 4 for each additional locus for the simple reason that the number of unknowns increases in that way (cf. Eq. 2). Lastly, in the section Application to Imputing Missing Data we will apply our method to the problem of imputing missing values in RIL genotyping data, demonstrating the benefit of using exact multilocus genotypes.
Case of Two Loci: Recovering the Haldane–Waddington Result and Allowing for Sex-Dependent Recombination Rates
Haldane and Waddington (1931) derived the formula for the probabilities of 2-locus RIL genotypes and (Teuscher and Broman, 2007) gave an alternative more compact approach. We will derive the Haldane–Waddington result here using our self-consistency approach. Then we show how to extend our framework to the case where female and male recombination rates differ.
Let rl,lʹ = r1,2 denote the recombination fraction between the two loci (this recombination rate is for the moment taken to be the same in female and male as assumed by Haldane and Waddington). Furthermore, let al denote the allele at locus l, l ∈ {1,…, L}, on any of the homologous chromosomes in the RIL. By Eq. 2, for L = 2 there are 3 unknown Q’s. The indices u for each of these Q’s are such that they are not related by the symmetry between chromosomes. Our choice is to use Q(0,0), Q(0,1) and Q(0,2). To build the 3 × 3 system of equations, begin with the inhomogeneous linear equation
where the respective factors 8 and 4 follow from whether or not the u list of indices contains a 2. The next step is to write the self-consistent equation for each of the NQ(L) – 1 non-equivalent Q’s. For instance for u = (0,0), by Eq. 3 applied to this case and using the rules for the vanishing of the elements of the matrix T, one has
The matrix elements T [u → v] are determined by Eqs. 5 and 6. Direct calculation gives (1 – r1,2)/2, 1/4, 1/4, and (1 – r1,2)/2 respectively. To obtain a self-consistent equation involving only our three non-equivalent Q’s, we rewrite Eq. 14 by replacing Q(2,0) by Q(0,2) and Q(2,2) by Q(0,0), leading to
The self-consistent equation for Q(0,1) is obtained by the same method. Eq. 13 together with Eq. 15 and its analogue for Q(0,1) then lead to the system
(Compared to Eq. 8, we have changed the signs of each homogeneous equation to obtain a more readable matrix.) This system can be solved by hand, leading to
Given these three values, we can compute the RIL recombination rate R by summing all the probabilities of IBD events that produce recombinant RILs:
Using the equivalences (Q(3,0) = Q(0,2) etc), this gives R = 4Q(0,1) + 4Q(0,2); substituting the values from Eq. 17 leads directly to the Haldane-Waddington formula, Eq. 1.
How do these results extend to the case where female and male have different recombination rates, rf and rm? The main complication comes from the fact that the symmetries of the system are reduced: one can no longer exchange the roles of female and male SIBs. As a result, there are 6 non-equivalent IBD probabilities. Without loss of generality, we take these to be Q(0,0), Q(0,1), Q(0,2), Q(2,0), Q(2,2), and Q(2,3). The determination of these six unknowns follows the same logic as when rf = rm. First, use the inhomogeneous equation specifying that the Q’s are probabilities that add up to 1:
Second, determine the homogeneous equations associated with the self-consistency for the first NQ(L) – 1 non-equivalent Q’s. This then leads to the following system of equations:
For the matrix elements in this system of equations, we have used the notation to designate the complementary value of the recombination rate, such a notation allowing for more compact expressions. The linear system of Eq. 20 can be solved by hand, leading to
Note that except for Q(0,2) and Q(2,0), all the Q’s are asymmetric functions of rf and rm. Furthermore, the equality Q(0, 2) = Q(2, 0) follows from the special symmetry of replacing the left-right convention that orients chromosomes by one using the right-left orientation.
Given the non-trivial result of Eq. 21, we can ask what is the consequence for R, the RIL recombination rate. The calculation is straightforward:
Interestingly, this result depends only on the mean of the female and male recombination rates, in spite of the fact that such a property does not hold at the level of the individual Q’s. Furthermore, it shows that the Haldane-Waddington relation (Eq. 1) can be used when recombination rates are sex-dependent if in that formula the (sex-independent) recombination rate is replaced by the sex-averaged recombination rate.
Although this example was very simple (it involved only two loci), it should be clear that our framework is generally applicable, for any number of loci, whether the female and male recombination rates are identical or not.
Case of Three Loci
Haldane and Waddington showed that the probabilities of two-locus RIL genotypes may be used to derive the probabilities of the three-locus RIL genotypes. Teuscher and Broman also provided this result when they introduced their approach (Broman, 2005; Teuscher and Broman, 2007). In the introduction we explained why such a relation holds and so one might expect a similar conclusion to hold for the Q’s, but this is not so. Indeed, for this L = 3 case, as mentioned in the section Probabilities of Multilocus IBD Inheritances in RILs and the Set of Non-Equivalent Q’s, there are NQ(L) = 10 unknown Q’s to determine, corresponding to 9 degrees of freedom, but the information from the L = 2 level only provides 6 constraints, two for each pair of loci (6 = 2 × 3).
To determine the values of all the IBD probabilities, we simply apply our framework when using L = 3. We begin by specifying the set of non-equivalent Q’s that are our unknowns, following the logic of the general case as exposed in the section Probabilities of Multilocus IBD Inheritances in RILs and the Set of Non-Equivalent Q’s. We thus choose Q(0, 0, 0), Q(0, 0, 1), Q(0, 0, 2), Q(0, 1, 0), Q(0, 1, 1), Q(0, 1, 2), Q(0, 2, 0), Q(0, 2, 1), Q(0, 2, 2), and Q(0, 2, 3). Second, we write the single inhomogeneous equation that sums all Q’s (before applying equivalences). Third, we construct the self-consistent equations for the first 9 of our non-equivalent Q’s, assuming no genetic interference. The Supplementary Material provides a graphical representation of the T [u → v] entries to be explicit, our R code constructs this matrix automatically. These successive steps lead to the following linear system for our 10 unknowns:
where as before denotes the complementary value of the recombination rate and rij = ri,j. The solution of Eq. 23 can be obtained either numerically or analytically—that is as an explicit function of the three recombination rates—using e.g., Maple or Mathematica since a treatment by hand would be very tedious.
Four and More Loci
The previous methodology can be extended to more loci but quickly becomes too cumbersome to manage manually. For illustration, in the case L = 4, there are 40 Q’s to determine (cf. Eq. 2). The system of 40 linear inhomogeneous equations determining these unknowns is given in Eq. 24.
In that display including a 40 × 40 matrix, we have used the same compact notation as for L = 3. Our software produces this system of equations and then can solve for the Q’s for any particular values of the rij. Computing the corresponding probabilities of RIL genotypes is then straightforward and in practice the computer does this very quickly.
It is of course possible to go to larger values of L but then it becomes unweildly to show the corresponding matrix. As expected, the computation time required by our R code grows fast with L, by about a factor 16 for each unit increase of L. The required computer memory also grows in the same way. At L = 8 the code takes about 5 min to solve the problem, and for still larger values of L it is best to use a server with large memory capacity (we have gone up to L = 10).
Application to Imputing Missing Data
Genotyping arrays can provide calls for thousands and even millions of single nucleotide polymorphisms. When dealing with such large numbers, the raw data of some markers will inevitably be unambiguous and so generally these cases are called as “missing data.” On the other hand, some technologies such as genotyping by sequencing of low coverage in fact lead predominantly to missing data calls. To deal with either of these cases, one typically imputes a posteriori to transform the missing calls into the most plausible values, exploiting the values of the calls at neighboring loci. Such imputation is a general problem and is typically treated by machine learning approaches that attempt to infer probabilities from the data. For the current context where we are focused on SIB RIL populations, one may expect that having the exact probabilities for RIL multilocus genotypes will allow for more reliable imputation than when using algorithms which resort to statistical inference.
To test this idea, we have developed an algorithm that exploits our exact probabilities and compared it to a standard imputation algorithm. The comparison is based on applying these two algorithms to simulated RIL genotyping data, and doing so for many replicates. Specifically, for each replicate, we started with two homozygous parents having two homologous chromosomes of total length 150 cM on which we randomly positioned 100 markers. After producing from these a SIB RIL population of 100 individuals, we took the genotypes of each individual and transformed the calls by introducing missing data, selecting at random 10, 20, 30, 40, 50 or 70% of the markers for this change from a parental allele to “missing data.”
For a standard imputation algorithm, we used the R package “missForest” (Stekhoven and Buelhmann, 2012; Stekhoven, 2013) that uses machine learning to estimate the most probable values underlying the missing data. In effect, it applies a hidden Markov model that adjusts its parameters to the dataset. It outputs the imputed genotypes from which one can determine an associated error rate. This error rate depends on the realization of the RIL population, which is why we perform replicates. These values are displayed on the X axis of Figure 3 for each of the studied values of the percentage of missing data. One clearly sees that the error rate increases with that percentage, a feature that is of course expected.
Figure 3 Scatterplot comparison of imputation error rates. For each of the replicates simulated, the X axis gives the error rate for the missForest R package while the Y axis gives the error rate when using our code that exploits the exact multilocus genotype probabilities. For each fraction of missing data studied we display the points using a different color.
The second method (ours) is based on exploiting the fact that one can now access (cf. previous sections) the exact multilocus genotype probabilities in SIB RILs. For each genotyped RIL, we construct the blocks of adjacent markers that are called as missing data. They thus have one or two flanking markers. If there is just one such marker (the block is at the end or beginning of the chromosome), we impute the values to be that of this flanking marker. If there are instead two flanking markers (one on each side), again we just impute all the missing data to be that of these two markers if they are both of the same parental type (non-recombinant). The remaining case is where the considered individual is recombinant for those two flanking markers. To impute here, we calculate the multilocus genotype probabilities for all 2L allelic combinations when considering these two flanking markers and the L – 2 markers in the missing data block. Then we select the L-locus genotype of maximum probability that is compatible with the calls at the two flanking markers, and this selected genotype specifies our imputation. The corresponding imputation errors are displayed on the Y axis of Figure 3. Clearly, our imputation method systematically out performs missForest, as expected since using the exact probabilities should be more reliable than using approximate ones. This is further quantified in the Supplementary Material. The imputation code is available online at https://github.com/olivier-c-martin/PMG_SIB_RILs.git and is internal documentation explains in greater detail the different algorithmic steps.
Discussion
The construction of RILs involves successive generations of inbreeding. In realistic situations, SIB based inbreedings are performed for 10 to 20 generations and that leads to some low level of residual heterozygosity. One way to deal with such residual heterozygosity is to follow the probabilities of all possible combinations of allelic values for the siblings from one generation to the next. As shown by Haldane and Waddington (Haldane and Waddington, 1931), that means applying at each generation a 16L × 16L matrix to the vector of those probabilities, where L is the number of loci considered (see also Hospital et al., 1996). Because this is very tedious and just not possible for 5 or more loci, a shortcut is used whereby one considers that the statistics are given by the limiting case in which fixation should be complete and whenever a locus is in the heterozygous state one replaces it by missing data. Softwares that construct genetic maps or that perform QTL mapping then either just ignore such missing data or first perform imputation on those missing values.
That brings us to the challenge of determining RIL probabilities when fixation is indeed complete; note that there are far fewer combinations of allelic values in this situation than when one allows for residual heterozygosity, so one might hope for a simple way to obtain the corresponding multilocus genotype probabilities. But in this mathematical idealization where fixation is complete, the difficulty is that fixation formally requires an infinite number of generations. Thus, either the recursions must be taken “sufficiently far” to obtain numerical convergence or a mathematical trick has to be found. For L = 2, Haldane and Waddington succeeded in the second path thanks to much mathematical ingenuity, and interestingly, that L = 2 solution automatically determines the probabilities in the L = 3 case. However, since that founding work—going back to 1931—no solution had been proposed to tackle the problem of determining probabilities in SIB RILs with four or more loci.
Using a novel method, we have successfully overcome that long-standing challenge here. Our approach provides an algebraic solution, albeit at a computational cost that grows roughly as 16L for L loci. That exponential growth rate is far less drastic than that of the iterative method mentioned above using 16L × 16L matrices and even more dramatically less than applying diagonalization methods as in of the original proposition of Haldane and Waddington of 1931. As a result, not only did we break the L = 4 barrier but in fact we were able to rather easily treat L’s up to 8. We also pointed out that our framework can deal with different female and male recombination rates, a situation that seems to have never been considered before in the context of SIB RILs, even for L = 2.
The ability to compute probabilities of RIL multilocus genotypes opens up to a number of applications. For instance, when building genetic maps, the ordering of markers is determined by comparing likelihoods of different orderings. That calculation can now be done using exact rather than approximate multilocus genotype frequencies, putting those mapping algorithms on a more solid footing. Similarly, when RIL genotypes must be imputed because of missing data, determining the most likely value of an allele marked as missing data requires comparing multilocus genotype probabilities. In the absence of these probabilities, imputation algorithms use approximations. We showed that it was in fact possible to avoid doing so, leading to systematically more reliable imputation results. Finally, beyond specific uses in the case of RILs, our mathematical framework that exploits self-consistency might be useful in certain population genetics problems involving an infinite number of generations.
Software Availability
R code implementing the methodology described in this paper as well as the study of the imputation application is available online at https://github.com/olivier-c-martin/PMG_SIB_RILs.git
Data Availability
All datasets generated for this study are included in the manuscript/Supplementary Files.
Author Contributions
OM proposed the project and with MP conceived and implemented a first approach. KJ introduced the analytic formulation and this led to major enhancements to the algorithmic. KJ and MP developed the R scripts and all authors wrote, edited, and approved the manuscript.
Funding
This work has benefited from a French State grant (LabEx Saclay Plant Sciences-SPS, ANR-10-LABX-0040-SPS), managed by the French National Research Agency under an “Investments for the Future” program (ANR-11-IDEX-0003-02) which funded the salary of KJ. Also, the public Ph.D. grant from the French National Research Agency (ANR) as part of the Investissement d’Avenir program, through the Initiative Doctoral Interdisciplinaire (IDI) 2015 project funded by the Initiative d’Excellence (IDEX) Paris-Saclay, ANR-11-IDEX-0003-02 funded the salary of MP.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
The authors are grateful to Prof. D. de Vienne and C. Dillmann for insightful comments.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00833/full#supplementary-material
References
Broman, K. W. (2005). The genomes of recombinant inbred lines. Genetics 169 (2), 1133–1146. doi: 10.1534/genetics.104.035212
Buckler, Edward, S., Holland, J. B., Bradbury, P. J., Acharya, C. B., Brown, P. J., Browne, C., et al. (2009). The genetic architecture of maize flowering time. Science 325 (5941), 714–718. doi: 10.1126/science.1174276
Crow, J. F. (2007). Haldane, Bailey, Taylor and recombinant-inbred lines. Genetics 176 (2), 729–732.
Darvasi, A., Soller, M. (1995). Advanced intercross lines, an experimental population for fine genetic mapping. Genetics 141 (3), 1199–1207.
El-Din El-Assal, S., Alonso-Blanco, C., Peeters, A. J., Raz, V., Koornneef, M. (2001). A QTL for flowering time in Arabidopsis reveals a novel allele of CRY2. Nat. Genet. 29 (4), 435–440. doi: 10.1038/ng767
Haldane, J. S., Meakins, J. C., Priestley, J. G. (1919). The effects of shallow breathing. J. Physiol. 52 (6), 433–453. doi: 10.1113/jphysiol.1919.sp001842
Hospital, F., Dillmann, C., Melchinger, A. E. (1996). A general algorithm to compute multilocus genotype frequencies under various mating systems. Comput. Appl. Biosci.: CABIOS 12 (6), 455–462. doi: 10.1093/bioinformatics/12.6.455
Ihaka, R., Gentleman, R. (1996). R: A Language for data analysis and graphics. J. Comput. Graph. Stat. 5 (3), 299–314. doi: 10.1080/10618600.1996.10474713
Lander, E. S., Schork, N. J. (1994). Genetic dissection of complex traits. Sci. (New York, N.Y.) 265 (5181), 2037–2048. doi: 10.1126/science.8091226
Rockman, M. V., Kruglyak, L. (2008). Breeding designs for recombinant inbred advanced intercross lines. Genetics 179 (2), 1069–1078. doi: 10.1534/genetics.107.083873
Samal, A., Martin, O. C. (2015). Statistical physics methods provide the exact solution to a long-standing problem of genetics. Phys. Rev. Lett. 114 (23), 238101. doi: 10.1103/PhysRevLett.114.238101
Stekhoven, D. J. (2013). missForest: Nonparametric missing value imputation using random forest. R package version 1.4.
Stekhoven, D. J., Buehlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 (1), 112–118. doi: 10.1093/bioinformatics/btr597
Teuscher, F., Broman, K. W. (2007). Haplotype probabilities for multiple-strain recombinant inbred lines. Genetics 175 (3), 1267–1274. doi: 10.1534/genetics.106.064063
Walsh, B., Lynch, M. (2018). Evolution and Selection of Quantitative Traits. Sunderland, Massachusetts, USA: Oxford University Press. doi: 10.1093/oso/9780198830870.001.0001
Weir, B. S. (1996). Genetic Data Analysis II: Methods for Discrete Population Genetic Data. 2 sub edition edn. Sunderland, Mass: Sinauer Associates is an imprint of Oxford UniversityPress.
Keywords: Recombinant Inbred Line population, sibling mating, crossover, self-consistency, structure population
Citation: Jebreen K, Petrizzelli M and Martin OC (2019) Probabilities of Multilocus Genotypes in SIB Recombinant Inbred Lines. Front. Genet. 10:833. doi: 10.3389/fgene.2019.00833
Received: 21 June 2019; Accepted: 13 August 2019;
Published: 01 October 2019.
Edited by:
Chaeyoung Lee, Soongsil University, South KoreaReviewed by:
Christophe Lambing, University of Cambridge, United KingdomXuehui Huang, Shanghai Normal University, China
Copyright © 2019 Jebreen, Petrizzelli and Martin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Olivier C. Martin, olivier.c.martin@inra.fr
†These authors have contributed equally to this work