A Larger Chocolate Chip—Development of a 15K Theobroma cacao L. SNP Array to Create High-Density Linkage Maps

Livingstone, Donald; Stack, Conrad; Mustiga, Guiliana M.; Rodezno, Dayana C.; Suarez, Carmen; Amores, Freddy; Feltus, Frank A.; Mockaitis, Keithanne; Cornejo, Omar E.; Motamayor, Juan C.

doi:10.3389/fpls.2017.02008

ORIGINAL RESEARCH article

Front. Plant Sci. , 05 December 2017

Sec. Plant Breeding

Volume 8 - 2017 | https://doi.org/10.3389/fpls.2017.02008

A Larger Chocolate Chip—Development of a 15K Theobroma cacao L. SNP Array to Create High-Density Linkage Maps

Updated

A correction has been applied to this article in:

Corrigendum: A Larger Chocolate Chip—Development of a 15K Theobroma cacao L. SNP Array to Create High-Density Linkage Maps
1. Read correction

$\r\nDonald Livingstone III&#x;$ Donald Livingstone III¹^†

Conrad Stack¹^†

Guiliana M. Mustiga¹

Dayana C. Rodezno¹

Carmen Suarez^2,3

Freddy Amores^2,4

¹Mars, Inc., Miami, FL, United States
²Estación Experimental Tropical Pichilingue, National Institute of Agricultural Research, Quevedo, Ecuador
³Agricultural Faculty, Technical University of Quevedo, Quevedo, Ecuador
⁴Universidad Técnica Estatal Quevedo (UTEQ), Quevedo, Ecuador
⁵Department of Genetics and Biochemistry, Clemson University, Clemson, SC, United States
⁶Department of Biology and Pervasive Technology Institute, Indiana University Bloomington, Bloomington, IN, United States
⁷School of Biological Sciences, Washington State University, Pullman, WA, United States

Cacao (Theobroma cacao L.) is an important cash crop in tropical regions around the world and has a rich agronomic history in South America. As a key component in the cosmetic and confectionary industries, millions of people worldwide use products made from cacao, ranging from shampoo to chocolate. An Illumina Infinity II array was created using 13,530 SNPs identified within a small diversity panel of cacao. Of these SNPs, 12,643 derive from variation within annotated cacao genes. The genotypes of 3,072 trees were obtained, including two mapping populations from Ecuador. High-density linkage maps for these two populations were generated and compared to the cacao genome assembly. Phenotypic data from these populations were combined with the linkage maps to identify the QTLs for yield and disease resistance.

Introduction

With a preferred growing range between 10°N and 10°S of the equator, Theobroma cacao L. (cacao) is a diploid understory tree crop with 10 chromosomes (2n = 2x = 20). Cacao has a long history of cultural significance for both the Aztec and Mayan civilizations, with some of the earliest archeological data suggesting that cacao was used as early as 1000 BC (Henderson et al., 2007). While primarily an outbreeding crop, self-fertilization has been known to occur in certain genetic backgrounds of cacao particularly Criollo and Amelonado. The mechanism controlling self-incompatibility is only beginning to be understood (Royaert et al., 2011; Lanaud et al., 2017), however initial domestication from wild ancestors likely occurred over centuries of cultivation in Central America and Mexico contributing to the few self-compatible trees observed now (Bartley, 2005). Today, cacao remains an important cash crop in the tropics of Central and South America, Asia, and Africa (Cuatrecasas, 1964; Hunter, 1990; Motamayor et al., 2002). Prized for its seeds, commonly referred to as cocoa beans, T. cacao L. is the signature ingredient of chocolate and is produced in over 50 countries (Knight, 2000). The unique fat composition of cocoa beans, which are also used extensively in the cosmetics industry in the form of cocoa butter, provides a smooth texture to chocolate, while the cocoa solids provide its distinctive chocolate flavor. Due to increases in the demand for cocoa products, which have been exacerbated by increased disease pressure in its growing regions and the fact that it is mainly grown on smallholder farms, the importance of cacao germplasm improvement is more essential now than ever (Lattre-Gasquet et al., 1998; ICCO Production Values, 2012). Efforts to make cacao production more sustainable are of great importance, and building on the assembled cacao genomes to assist traditional breeding methods by developing molecular tools holds the promise of accelerating germplasm improvement toward this goal.

Colonization of the Americas led to some of the earliest attempts to cultivate cacao by the English, Dutch, and Spanish in the seventeenth century. More modern cacao breeding efforts occurred in the early 1900s (Fredholm, 1911), when cacao was classified into three major morphological/geographic groups that were still recognized up until a decade ago, when the use of genetic markers proposed a new classification of the cacao germplasm into at least 10 major groups: Marañon, Curaray, Criollo, Iquitos, Nanay, Contamana, Amelonado, Purús, Nacional, and Guiana (Motamayor et al., 2008). These groups differ greatly in both their agronomic and commercial traits, including flavor, yield, and susceptibility to diseases. The heterosis that occurs between crosses of trees from different genetic groups is exploited by the breeders to improve cacao production (Warren, 1992).

Ecuador is the leading cacao producer in the Americas, with exported cacao beans its third largest commodity (Gill, 2014). Breeding programs in Ecuador have succeeded in furthering the generation of improved cacao varieties. The foundation of our current knowledge of frosty pod and witches' broom, two devastating cacao diseases, was built upon work done in Ecuador, where Moniliophthora roreri was first identified as the causal agent for frosty pod and where the first reports of witches' broom disease (Moniliophthora perniciosa) originated (Evans et al., 1977; Gutiérrez et al., 2016). The Tropical Experimental Station in Pichilingue was established, in part, to breed for resistance to these and other cacao diseases (Evans et al., 1977). The most widely adopted cacao variety in South America, CCN-51, was developed in Ecuador by Homero Castro in the 1960s and is known for its high yield and disease resistance (Boza et al., 2013). While improved yield and resistance remain persistent breeding targets, Ecuadorian chocolate is prized by the makers of fine chocolates for its “Arriba” flavor profile. Recent efforts have focused on incorporating other desired traits, along with improved flavor profiles, while incorporating the yield and resistance alleles.

Representing distinct alleles within the genome, molecular markers represent a particular genetic location that can be associated with phenotypic traits. An important aspect of an integrated cacao genetics program is to develop and apply molecular markers for use in breeding improved varieties and curating the germplasm resources of cacao (Risterucci et al., 2000; Kuhn et al., 2003, 2008; Borrone et al., 2004; Schnell et al., 2007; Lima et al., 2009; Irish et al., 2010; Motilal et al., 2010, 2016). As sequencing technology continues to improve, identification of Single Nucleotide Polymorphisms (SNPs), which can serve as molecular markers, has substantially increased the amount of resources available to develop new cacao varieties. T. cacao is fortunate to have two genetically diverse genome sequence references, the Belizean Criollo genotype B97-61/B2 (Argout et al., 2011) and an Amelonado genotype, Matina 1-6 (Motamayor et al., 2013). These sequenced genomes provide a solid reference platform for SNP discovery. In cacao, SNPs have been developed and used for comparative genomic studies, genetic maps, marker assisted breeding, and determining off-types in clonal collections (Kuhn et al., 2010, 2012; Livingstone et al., 2011; Allegre et al., 2012; Cosme et al., 2016; Motilal et al., 2016).

Crop improvement efforts receive the greatest benefit when genetic markers are associated with specific phenotypic traits. The first linkage map in cacao, representing all 10 chromosomes, was produced by Lanaud et al. (1995) and it has served as the reference for naming the linkage groups of cacao. Since then, additional cacao linkage maps, utilizing a variety of marker types, have been created and used for QTL analysis to identify marker-trait associations (Pugh et al., 2004; Brown et al., 2005, 2007; Cervantes-Martinez et al., 2006; Lima et al., 2009; Royaert et al., 2011, 2016). Resolution of these QTLs is partially limited by the number of available genetic markers, as the maps were composed mostly of microsatellite markers, which have low technical precision when assayed across different equipment. SNP markers can be assayed at larger scales with lower cost compared to most other available markers. SNPs are also easily ported to different platforms, making it straightforward to compare SNP genotypes over time. In other crops, such as soybean, cotton, potato, and rice, SNP arrays have been developed and successfully used to generate linkage maps, identify QTL, track introgression, and verify pedigree (Akond et al., 2013; Cai et al., 2017; Schumann et al., 2017; Thomson et al., 2017). SNP-based linkage maps are already serving the cacao breeding community. The marker-trait associations identified in several recent cacao studies are now being used to help guide the planning of new crosses and the selection of improved varieties (Allegre et al., 2012; Motamayor et al., 2013; Livingstone et al., 2015; Motilal et al., 2016; Royaert et al., 2016).

Here we describe a new 15,000 bead-type SNP array (Illumina Infinium) that combines the sequencing variants identified from two previous projects (Motamayor et al., 2013; Livingstone et al., 2015). Selection of markers was prioritized to target gene models within the QTL regions. (Brown et al., 2005; Lanaud et al., 2009; Feltus et al., 2011; Royaert et al., 2011) To demonstrate the usefulness of the genotypes generated from this array, linkage maps and QTL analysis were performed on two large F1 mapping populations from Ecuador. Due to the dense genotype data provided by this SNP array and the relatively large population size of one of the mapping populations, our analyses identified new QTLs for agronomically important traits, and further allowed us to insert orphan sequence contigs into the cacao genome assembly.

Materials and Methods

Diversity Panel Description for SNP Identification

A diversity panel of 11 cacao accessions was used for SNP discovery (Table 1). The accessions were chosen because they are key clones in the breeding programs (Matina 1-6, Criollo 13, MVP30, MVT85) or parents of previously studied mapping populations (Pound 7,UF273 Type 1, UF273 Type 2, KA2-101, K82, TSH1188, CCN51). Plant material for the diversity panel was obtained from the USDA-ARS SHRS germplasm collection in Miami; the Centro Agronomico Tropical de Investigacion y Ensenanza (CATIE) in Turrialba, Costa Rica; the Cacao and Coconut Institute in Papua New Guinea (PNG); the Mars Center for Cacao Science (MCCS) in Brazil; or the Cocoa Research Institute of Ghana (CRIG) as described in Table 1. The majority of the accessions on the diversity panel come from mixed genetic backgrounds with the two predominant genetic groups presented (Table 1). In all, half of the total diversity in cacao is represented within the predominant background of these accessions.

TABLE 1

Table 1. Summary of diversity panel accessions used for SNP discovery.

Transcript-Based SNP Identification (Dataset 1)

Transcript-based SNP calls had previously been generated for four of the diversity panel members (Table 1) as described in Livingstone et al. (2015) and it contains 48,408 filtered SNPs. In short, leaf RNA samples isolated from each accession were sequenced on the Illumina GAII platform and aligned to the T. cacao 454 sequencing derived from the Matina 1-6 leaf transcriptome (Livingstone et al., 2015). The SNPs were identified and filtered as follows: variant bases had a Phred-type quality score of Q20 or higher, an observed frequency of 20% or higher at the position calling the variant allele, and two or more sequence reads calling the variant. Non-bivariate SNPs and In/Dels were also removed. A total of 48,408 filtered SNPs were identified in this manner, designated as dataset 1, used for SNP selection on the Cacao15kSNP array (Figure 1).

FIGURE 1

Figure 1. SNP discovery and filtering to select the 13,530 SNPs for inclusion on the Cacao15kSNP array. The transcript (dataset 1) and genomic resequencing (dataset 2) variants were identified by alignment to the early Matina reference genome. Initial filtering was applied to provide confidence in the existence of variants. The two variant sets were combined into the merged dataset, and filtering was applied to reduce the number of SNPs by enriching for markers within the QTL regions, identifying those SNPs most likely to perform well on the chip (ADT score), and selecting for SNPs that represented the most genes possible. SNPs were added from the genomic resequencing data (dataset 2) to reach the targeted number of markers in the final filtered SNPs for inclusion on the Cacao15kSNP array.

Genome-Based SNP Identification (Dataset 2)

Genomic DNA from each of the 11 diversity panel members was resequenced, aligned to the Matina 1-6 reference genome, and the sequence variants were identified (Motamayor et al., 2013). The same initial filtering for dataset 1 was also applied here: variant bases had a Phred-type quality score of Q20 or higher, a frequency of 20% or higher at the position calling the variant allele, and two or more sequence reads calling the variant. In/Dels were removed, as were any SNPs within 60 bp of another variant to ensure proper genotyping on the Cacao15kSNP array (Figure 1). The 194,803 filtered SNPs identified in this manner are designated as dataset 2.

QTL-Enriched Gene Models

The gene model annotation from the Matina 1-6 genome (Motamayor et al., 2013) was used to create a list of enriched gene models that are present within the QTL regions. The genomic position of the published QTL regions based on microsatellites (Borrone et al., 2004; Brown et al., 2005, 2007; Lanaud et al., 2009; Feltus et al., 2011; Royaert et al., 2011) was used in conjunction with the Matina 1-6 gene models to identify all the genes that fall within those QTL regions. By looking for the enrichment of discrete functions [e.g., Gene Ontology (GO), Protein Domains (INTERPRO), Enzyme (EC) & Pathway (KEGG) annotations] in the summed QTL space for a trait relative to the genome background of functional term counts, it was possible to scan for significant, QTL-enriched functional annotations and the genes that code for that function. The expectation is that each of these candidate genes would have a higher probability of driving the QTL effect and should therefore be higher priority targets for polymorphism discovery.

Final SNP Selection

Filtered SNP datasets 1 and 2 were merged, keeping only the SNPs common to both, and then an additional round of filtering was performed on the combined dataset. One thousand SNPs from the combined dataset that were segregating in at least one of the three mapping populations genotyped with the Cacao6kSNP array (Livingstone et al., 2015) were chosen to serve as anchor markers to link the data from the previous Cacao6kSNP array and Cacao15kSNP array described here. Preference was given to the markers in an enriched gene model or the markers that were observed to be polymorphic within all three previously genotyped mapping populations. To make the number of anchor markers from each chromosome proportional to the number of predicted gene models in that chromosome, markers were then removed at random (where they were overrepresented) or were added at random from the combined data set (where they were underrepresented; Figure 2). SNP selection continued by reducing the merged dataset (Figure 1). First, the SNP sequences were sent to Illumina for Assay Design Tool (ADT) scoring, which is a measure of the likelihood of success for a potential probe sequence based on the GC content and sequence adjacent to the SNP site. A value from zero to one, with 1 being the most likely to succeed, is assigned and only those SNPs with an ADT score greater than 0.9 were kept. Next, the SNPs within the QTL-enriched gene model list were selected. Previously, two SNPs per transcriptome contig had been the target to ensure transcript representation, but this requirement proved to be redundant as the marker success was high (Livingstone et al., 2015). For the Cacao15kSNP array, we targeted one SNP per QTL-enriched gene model. If more than one SNP existed with a gene model, we favored, in decreasing order, those SNPs with unique flanking regions (non-duplicated flanking regions), SNPs that were predicted to produce a missense or non-synonymous amino acid change (according to alignments to the reference genome; Motamayor et al., 2013), SNPs using a single bead-type (A/C, A/G, T/C, T/G), and lastly, SNPs with high ADT scores. This filtering approach resulted in slightly less than the required 15,000 bead-types (Figure 1). To increase the number of bead-types to 15,000, additional SNPs from dataset 2 that met all of the above listed filters were selected by filling in the largest gaps present between any two selected SNPs. The final list of 15,000 bead-type variants, representing 13,530 SNP sequences (12,643 gene models), was submitted to Illumina for Infinium SNP Array production and subsequent genotyping (Supplementary Table 1). A summary of the SNPs selected for inclusion on the array is presented in Table 2.

FIGURE 2

Figure 2. Comparison of SNP locations in the Matina 1-6 genome. The chromosomes (TcmChr) and unanchored scaffolds (TcmScaf) of the Matina 1-6 v1.1 genome assembly are presented as concentric circles composed of colored arcs with their physical size in Mbp indicated. The outer circle comprises the 10 chromosomes, while the unanchored scaffolds are the inner circle. The location of the enriched QTL regions used in SNP selection are shaded in gray. Colored tick marks represent the SNPs selected for inclusion on the Cacao15kSNP array (green tick marks), SNPs present on the Cacao6kSNP array (blue tick marks), and SNPs in common between the two arrays (purple tick marks).

TABLE 2

Table 2. Summary of the SNP content present on the array.

Plant Material for Genotyping and Linkage Mapping

Leaf samples were collected from a total of 3,072 cacao trees at INIAP, Ecuador. The DNA from these samples was extracted using the Zymo Research plant DNA extraction kit, following the manufacturer's protocol (Zymo Research, Irvine, CA), and were submitted to Illumina for genotyping on the custom Infinium II BeadArray. The trees that were sampled came from the following populations / breeding trials: EET 95 × Silecia 1 (Stalin A) mapping population (733 individuals, including both parents), Scavina 12 × Unknown (Stalin B) mapping population (251 individuals, including the single, known parent), representatives of the clones from the Tecas (655 individuals), Ganaderia (409 individuals), and Malvinas field (401 individuals) clonal breeding trials, along with a sampling of their progenitors (66) from INIAP, Ecuador. Additional samples were collected from the Trinidad germplasm collection (136 individuals), the L12 progeny trial, with parents at CATIE (151 individuals), and selected clones from the SHRS, MCCS, and MRCC germplasm collections (172 individuals). The remaining 98 samples were internal control samples from the Matina 1-6, Criollo 13, or Pound 7 trees located in Miami.

SNP Array Genotype Processing

Data generated from all 3,072 samples were analyzed with GenomeStudio (V2011.1, Illumina) using the Genotyping module (1.9.4, Illumina) and SNP clustering was performed, as part of the Illumina Genotyping module, with the custom GenTrain clustering algorithm that incorporates several biological heuristics. The genotype data were filtered to remove the SNP markers that failed to generate data for any sample. Additionally, data points with low GenTrain score (<0.4) and GenCall score (<0.2) scores were removed. GenTrain score is a statistical score that evaluates clustering based on the shape of the clusters, their relative distance to each other, and mimics evaluations made by human experts. GenCall score is a quality metric ranging from 0 to 1. The lower the value the farther away from the center of the cluster a data point is; values below 0.15 are considered no calls. The data were exported as a whole, and subsets of the Stalin A and Stalin B mapping populations were generated.

F1 Mapping Populations

The Stalin A mapping population consisted of 731 F1 full-sib progeny from a cross between two parents, EET 95 and Silecia 1. The mother, EET-95, is a heterozygous (90.1%), high-yielding clone that is an F1 from an open-pollinated Nacional-type mother. It is often known for providing the fine flavor characteristics of Ecuadorian chocolate. The father, Silecia 1, is a homozygous (79.5%) cultivar from the upper region of the Amazon and has been shown to be tolerant to witches' broom (Eskes, 2010). This cross was made in 1968 as part of a breeding effort to improve yield and disease resistance (Hardy, 1960; Hunter, 1961; Rudgard et al., 1993; Eskes, 2010). The Stalin B mapping population comprises 250 F1 full-sib progeny from a cross between two parents, Scavina 12 (mother, 50% heterozygous) and a single, but unknown, father (58.8% heterozygous). The father tree was either not characterized or this information has been lost. The tree had reportedly died before the start of this study, but it was likely a high-yielding selection from a local farm. Scavina 12 is known to carry resistance to witches' broom (Eskes, 2010). From Stalin A, both parents and 705 out of the 731 progeny samples were successfully genotyped. From Stalin B, the mother and 249 out of the 250 progeny samples were successfully genotyped. Two additional progeny were removed from Stalin B because their genotypes were identical, and they clearly represented a mistake during leaf collection or sample processing. A genotype for the unknown father of Stalin B was inferred from allelic segregation in the progeny and the genotype of the mother. This was done for each locus by calculating the chi-squared goodness of fit of the genotype segregation against the hypothetical genotype segregation patterns for a population of F1 progeny, assuming Mendelian inheritance and given the maternal genotype. The model with the smallest chi-squared test statistic was assumed to represent the true segregation pattern at a locus, and the father genotype was assigned accordingly. The scripts for this inference can be found in the Supplemental Material (Supplementary File 1).

Plink version 1.9 (Chang et al., 2015) software was used to calculate the frequency of Mendelian errors in the progeny genotypes of both Stalin A and B, which, in the latter case, used the inferred father (Supplementary File 1). Progeny where 3% or more of the genotyped loci violated Mendelian inheritance were considered to be off-types. A total of 129 progenies from the Stalin A cross and 9 from the Stalin B cross were identified as off-types and were excluded from further analysis.

Development of Linkage Maps

Genetic linkage maps were created for the two F1 populations using JoinMap (version 4.1; Kyazma BV, Wageningen, the Netherlands; Stam, 1993; Van Ooijen, 2011). All members of the mapping populations described above were genotyped for the 13,530 SNPs, but not all of the SNPs segregated in the mapping populations. The resulting SNP genotype data were manually transformed to include only those markers segregating in the progeny and were formatted for use with the JoinMap software (Stam, 1993). The maps for both Stalin A and B were created in JoinMap using the outbreeding, full-sib family (“CP”) population type. The following rules were maintained to provide robust genotype data: any SNP loci that had missing values for more than 5% of the trees, along with SNP markers showing high Chi (Royaert et al., 2011) values (>15), were removed from the analysis, and finally, individual trees that were missing data for more than 15% of the SNP markers were removed. Additionally, the similarity of the loci was examined and identical loci were excluded, keeping only the first listed marker. Identical markers will map to the identical map position, and removing them results in faster calculations (JoinMap Manual). Likewise, trees with identical genotypes across every marker were removed from all analyses (two individuals from Stalin B). The maps were made using the maximum likelihood (ML) mapping algorithm with the following changes to the default settings: set recombination frequency start to 0.5 and set recombination frequency end to 0.1. After an initial mapping was performed for each linkage group, the Fit and Stress values for the markers were examined. Any marker with a Fit and Stress value, a measure of uncertainty in placement, above 2 was removed, and the map was remade until no marker showed high stress values (Kyazma/Johan W. Van Ooijen personal communication). For each of the linkage groups, the ordering of the maps was calculated via JoinMap's Metropolis-Hasting algorithm as part of a Gibbs sampler run with random starts. Since the ordering can depend on the proposed random start value, the maps were calculated several times until the most consistent version of the map was obtained as a way to check convergence, provided that no high fit and stress values were present. For each population, the maps were first constructed by linkage group and by parental segregation types (lmxll, nnxnp, hkxhk). For each linkage group, four separate maps were created, which consisted of markers that segregate only in the mother (lmxll), markers segregating only in the father (nnxnp), markers that segregate in both parents (hkxhk), and all the markers for a particular LG. The positions of the markers in each map was compared to the overall map and checked for collinear marker order. The final maps selected were those with best collinear agreements.

Phenotypic Data Collection

The F1 progenies from both populations were planted at 4 × 4 m in 1969 in Lot 7A at INIAP. The parents of the trials are planted as clones at a different lot as part of a germplasm collection, except for the unknown father from Stalin B. The experimental site is at 75 m above sea level. Annual temperature and rainfall averages for the last 25 years are 24.9°C and 2166 mm. There is a seasonal pattern of rainfall with wet season in the first half of the year and a dry season in the second half of the year. Although data collection was discontinued for long periods, in 2002, trees were pruned to obtain uniformity, making it possible to start data collection on a monthly basis during 2003–2008, which is the dataset used in these analyses.

Phenotypic data were collected on both mapping populations by field workers at INIAP over the course of ~5.5 years (January 2003–July 2008). For each tree, the number of healthy pods, number of pods infected with Frosty Pod (FP) rot (M. roreri), number of sick pods (not including those with FP), number of wilted cherelles (dried, aborted fruits still attached to the tree), and wet bean weight (grams) were observed at ~30-day intervals. Counts of the number of trunk cankers (multiple etiological agents) and the incidence of witches' broom (WB) disease (M. perniciosa) were made annually, typically in July or August. Incidences of WB were counted for three classes of plant tissue: vegetative growth (Vegetativas), flower cushions (Coginete), and pods (Chirimoya). For the traits, broad-sense line-mean heritabilities were calculated with Genstat software (VSN International, 2015) with the method described in Cullis et al. (2006). In addition, for each trait, the genotype, year, and genotype by year effect was evaluated with two generalized linear models; one model that assumes binomial distribution for traits involving ratios (e.g., percentage of pods infected monilia) and the second model assumes Gaussian distributions otherwise.

QTL Mapping

QTL mapping analysis was performed on the final linkage maps, and the phenotype statistics were summarized using Genstat (VSN International, 2015). Monthly phenotypic measurements were summed over the data collection period, and these aggregates and means were used to create the response variables for Genstat. Each trait was mapped separately, first by using the Interval Mapping (IM) procedure, followed by Composite Interval Mapping (CIM) with cofactor selection. CIM was used with the aim to narrow down the QTL region and break apart the correlations, if any, between the markers of largest effects (peak) with surrounding markers. Hence, the markers of largest effect from the IM are used as cofactors in every iteration. To test the consistency of the QTL candidates, CIM was run several times on every trait; however, no additional changes in QTL candidates were seen after a maximum of two iterations of CIM. The threshold for significance was set to the default, corrected-Bonferroni option in Genstat, as described in Li and Ji (2005), with a 0.05 significance level. The traits of interest in both the Stalin A and Stalin B populations consisted of disease resistance and yield. More specifically, the traits included the aggregate counts per year over the course of the trial; the means of witches' broom disease that affected the flowers (cushion brooms), developing pods (chirimoya), and branches (vegetative brooms); and the counts of total pods, pods infected with frosty pod, and sick pods. Sick pods are pods that are affected by a pathogen. The percent of sick pods and the percent of frosty pods are ratios that are relative to the total pods. The cherelle wilt ratio is the number of cherelles relative to the potential yield, which is defined by the sum of the cherelle counts and total pods. The probability of the monilia infected pods was calculated as the number of observations in which at least one pod was infected with monilia, divided by the total number of times pods that were present.

Results

Filtering and SNP Statistics

SNPs were identified across a panel of 11 genotypes representing parents of the mapping populations and select clones and were based on a combination of two separate datasets. Figure 1 shows the workflow used for final selection of the SNPs ultimately leading to the inclusion of 13,530 SNPs in the Cacao15kSNP array (Figure 1). The 10 chromosomes of the Matina 1-6 genome are represented in Figure 2 (outermost ring), with the enriched QTL regions shaded in gray. SNPs selected for inclusion on the Cacao15kSNP array are indicated (green tick marks) in relationship to those present on the Cacao6kSNP array (blue tick markers) and SNPs in common between the two arrays (purple tick marks). The inner rings (TcmScaf11-13, 26) represent some of the largest unanchored genome assembly scaffolds of the Matina 1-6 assembly (Figure 2, inner), with the SNPs included on the Cacao15kSNP array also presented as green tick marks. Assembly scaffolds 1-10 represent the 10 chromosomes of cacao respectively. The inclusion of SNPs from unanchored genome assembly scaffolds aimed to place these scaffolds onto the 10 cacao chromosomes.

SNP Array Evaluation

The number of successfully synthesized markers on the array amounted to 11,930 out of the 13,530 originally submitted SNPs, which represents an 88% successful conversion rate. SNPs that failed to produce genotype data for any sample (951), along with those with low GenTrain (<0.4) and GenCall (<0.2) scores, were removed prior to analysis. Ultimately, data were obtained for 10,688 SNPs. The number of DNA samples submitted for genotyping on the Cacao15kSNP array totaled 3,072, including replicated controls. Only 34 DNA samples failed to provide any genotype information. In total, 32,470,144 data points were obtained (10,688 SNP × 3,038 DNA samples) out of a possible 36,648,960 data points (11,903 SNPs × 3,072 samples), representing 11% missing data.

Comparison of Allele Calls from Chip Array and Sequencing

An examination of the allele calls generated in the unfiltered sequencing datasets that were used to select SNPs and the genotype data generated by the Cacao15kSNP array were performed. Six of the clones used in the diversity panel were placed on the Cacao15kSNP array as controls, and the genotype calls from the array are compared to the genotype calls obtained from sequencing during SNP discovery in Table 3. The right-hand side of Table 3 shows a contingency table for each of the six varieties, comparing the count of genotype calls made by the Cacao15kSNP Array (rows) and the genotype calls made from the short-read sequencing that we used to identify the SNPs. The comparisons were broken down in to three classes based on whether the calls were “Homozygous Reference,” indicating that the alleles at a locus were homozygous for the Matina 1-6 reference allele; “Homozygous Variant,” indicating that the alleles at a locus were homozygous for the non-Matina 1-6 reference allele; or “Heterozygous,” indicating that one of the two alleles at a locus matched the allele from the reference genome, while the other allele did not match the reference allele. These six clones represent the two sequenced genomes (Matina 1-6 and Criollo 13), as well as the parents from two of the mapping populations run on the Cacao6kSNP array (TSH 1188, CCN 51, Pound 7, and UF273 Type 1). The comparison consisted of 63,543 total data points, of which 95% did not vary between the diversity panel sequencing and the array genotype calls (Table 3). This value is slightly better than the concordance found in the Cacao6kSNP array (Brown et al., 2007). The identical genotype calls within a particular clone range from 89% (TSH 1188) to 99% (Matina 1-6 and Criollo 13). The genotype calls from the Cacao15kSNP array can be observed vertically in Table 3, while the genotype determined by sequencing is shown horizontally for each clone. Concordant calls can be found along the diagonal from upper left to lower right.

TABLE 3

Table 3. Comparison of the Cacao15kSNP array and sequencing calls across common accessions.

Linkage Mapping

Subsets of the data from the Cacao15kSNP array that represent markers segregating in either the Stalin A (EET 95 × Silecia 1) or Stalin B (Scavina 12 × unknown) mapping populations were generated. Only SNP markers that are heterozygous for at least one parent within these F1 populations were retained for mapping, while uninformative (identical or opposing homozygous) markers were removed. SNP markers that segregate in at least one of the mapping populations genotyped on the Cacao6kSNP array were chosen for inclusion on the Cacao15kSNP array to serve as anchor markers (Livingstone et al., 2015). The markers that are common across maps may be used as anchor points to combine the linkage maps from different populations. Stalin A contains 6,202 segregating SNPs, while Stalin B, with its unknown parent, contains 3,636 segregating SNPs. The two F1 mapping populations share 2,603 segregating SNP markers. Additionally, 909 anchor markers segregate in the two populations (393 in Stalin A, 110 in Stalin B, and 406 in both). Although it is outside the scope of this manuscript, these 909 markers could be used to a generate a composite linkage map that combines markers from all the mapping populations used in both Cacao SNP arrays, thereby providing higher overall precision than the use of markers from any one individual map (Brown et al., 2007; Allegre et al., 2012). Included within Supplementary Table 1 is a list of which marker subset each SNP belongs to (i.e., Stalin A, Stalin B, or anchor).

Genetic linkage maps were created for both of the Stalin mapping populations using the JoinMap software (Stam, 1993; Van Ooijen, 2011), as described. Ultimately, the final linkage map of Stalin A (EET 95 × Silecia 1) was generated with 576 individuals and 3,636 SNP markers, generating a map that was 834 cM in length (Table 4). For Stalin B (Scavina 12 × unknown), a linkage map totaling 1,269 cM was created using the genotypes from 238 individuals and 1,862 SNP markers (Table 4). For both maps, the 10 identified linkage groups (LG) correspond to the 10 chromosomes of cacao, and the linkage group designations were assigned by a comparison of the mapped markers to the Matina 1-6 genome assembly, using the convention defined by Lanaud et al. (1995) and Motamayor et al. (2013). The marker order and cM position for the Stalin A and Stalin B maps can be found in Supplementary Tables 2, 3, respectively. Marker position, as determined by linkage mapping, is compared to the Matina 1–6 assembly in Figure 3.

TABLE 4

Table 4. Summary of the generated Stalin A (top) and Stalin B (bottom) linkage maps.

FIGURE 3

Figure 3. A comparison of the Matina 1-6 genome assembly and the two Stalin linkage maps. SNP marker position is compared across the Matina 1-6 genome assembly and the (A) Stalin A and (B) Stalin B genetic maps. The Stalin A linkage groups are designated by StA and the corresponding linkage group number. The Stalin B linkage groups are designated by StB and the corresponding linkage group number. Chromosomes from the Matina assembly are denoted with TcmChr. Unanchored scaffolds in the Matina assembly are identified as TcmScaf 11, TcmScaf 12, TcmScaf 13, and TcmScaf 26. Each linkage group is identified by a unique color. SNPs positioned on the Matina assembly are connected by a line to their corresponding locations within the linkage map.