Development of a targeted genotyping platform for reproducible results within tetraploid and hexaploid blueberry

Clare, Shaun J.; Driskill, Mandie; Millar, Timothy R.; Chagné, David; Montanari, Sara; Thomson, Susan; Espley, Richard V.; Muñoz, Patricio; Benevenuto, Juliana; Zhao, Dongyan; Sheehan, Moira J.; Mengist, Molla F.; Rowland, Lisa J.; Ashrafi, Hamid; Melmaiee, Kalpalatha; Kulkarni, Krishnanand P.; Babiker, Ebrahiem; Main, Dorrie; Olmstead, James W.; Gilbert, Jessica L.; Havlak, Paul; Hung, Hsiaoyi; Kniskern, Joel; Percival, David; Edger, Patrick; Iorizzo, Massimo; Bassil, Nahla V.

doi:10.3389/fhort.2023.1339310

ORIGINAL RESEARCH article

Front. Hortic., 15 January 2024

Sec. Breeding and Genetics

Volume 2 - 2023 | https://doi.org/10.3389/fhort.2023.1339310


Shaun J. Clare^1,2†

Mandie Driskill^1,3

Timothy R. Millar^4,5†

David Chagné^6†

Sara Montanari^7†

Susan Thomson⁴

Richard V. Espley⁸

Patricio Muñoz^9†

Juliana Benevenuto^9†

Dongyan Zhao^10†

Moira J. Sheehan^10†

Molla F. Mengist¹¹

Lisa J. Rowland¹²

Hamid Ashrafi¹³

Kalpalatha Melmaiee¹⁴

Krishnanand P. Kulkarni¹⁴

Ebrahiem Babiker¹⁵

Dorrie Main¹⁶

James W. Olmstead¹⁷

Jessica L. Gilbert¹⁷

Paul Havlak¹⁷

Hsiaoyi Hung¹⁷

Joel Kniskern¹⁷

David Percival¹⁸

Patrick Edger¹⁹

Massimo Iorizzo^13,20

Nahla V. Bassil^1*†

¹USDA-ARS National Clonal Germplasm Repository, United States Department of Agriculture (USDA) Agricultural Research Service (ARS), Corvallis, OR, United States
²Department of Crop and Soil Science, Washington State University, Pullman, WA, United States
³Fall Creek Farm & Nursery, Inc., Lowell, OR, United States
⁴The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
⁵Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
⁶The New Zealand Institute for Plant & Food Research Ltd, Palmerston North, New Zealand
⁷The New Zealand Institute for Plant & Food Research Ltd, Motueka, New Zealand
⁸The New Zealand Institute for Plant & Food Research Ltd, Auckland, New Zealand
⁹Horticultural Sciences Department, University of Florida, Gainesville, FL, United States
¹⁰Breeding Insight, Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, United States
¹¹Agriculture Research Station, College of Agriculture, Virginia State University, Petersburg, VA, United States
¹²United States Department of Agriculture (USDA) Agricultural Research Service (ARS), Genetic Improvement of Fruits and Vegetables Lab, Beltsville, MD, United States
¹³Department of Horticultural Science, North Carolina State University, Raleigh, NC, United States
¹⁴Department of Agriculture and Natural Resources, Delaware State University, Dover, DE, United States
¹⁵United States Department of Agriculture (USDA) Agricultural Research Service (ARS) Southern Horticultural Laboratory, Poplarville, MS, United States
¹⁶Department of Horticulture, Washington State University, Pullman, WA, United States
¹⁷Driscoll’s, Inc., Watsonville, CA, United States
¹⁸Department of Plant, Food, and Environmental Sciences, Faculty of Agriculture, Dalhousie University, Halifax, NS, Canada
¹⁹Department of Horticulture, Michigan State University, East Lansing, MI, United States
²⁰Plants for Human Health Institute, North Carolina State University, Kannapolis, NC, United States

Blueberry (Vaccinium spp.) is one of the most economically important berry crops worldwide. Validation of genetic mapping studies is often hindered by asynchronous marker technology. The development of a standardized genotyping platform that targets a specific set of polymorphic loci can be a practical solution to unify the scientific and breeding community toward blueberry improvement. The objective of this study was to develop and evaluate a targeted genotyping platform for cultivated blueberries that is affordable, reproducible, and sufficiently high density to warrant large-scale adoption for genomic studies. The Flex-Seq platform was developed in a two-step procedure that resulted in 22,000 loci that yielded 194,365 single nucleotide polymorphisms when assessed in a diversity set of 192 samples including cultivated and other related wild Vaccinium species. Locus recovery averaged 89.4% in the cultivated polyploid blueberry (northern highbush [NHB], southern highbush [SHB], and rabbiteye [RE]) and on average 88.8% were polymorphic. While recovery of these loci was lower in the other Vaccinium species assayed, recovery remained high and ranged between 60.8% and 70.4% depending on the taxonomic distance to the cultivated blueberry targeted in this platform. NHB had the highest mean number of variants per locus at 9.7, followed by RE with 9.1, SHB with 8.5, and a range between 7.7 and 8.5 in other species. As expected, the total number of unique-in-state haplotypes exceeded the total number of variants in the domesticated blueberries. Phylogenetic analysis using a subset of the SNPs and haplotypes mostly conformed to known relationships. The platform also offers flexibility in the number of loci, the depth of sequencing for accurate dosage calling, and locus and haplotype reconstruction from increased fragment length. This genotyping platform will accelerate the development and improvement of blueberry cultivars through genomic-assisted breeding tools.

Introduction

Vaccinium is a genus of shrub and small tree species in the Ericaceae family that includes cultivated fruit crops such as blueberry (V. corymbosum and V. virgatum), cranberry (V. macrocarpon), and lingonberry (V. vitis-idaea), as well as many edible wild stands of berry crops such as lowbush blueberry (V. angustifolium and V. myrtilloides), bilberry (V. cespitosum, V. deliciosum, V. myrtillus, V. uliginosum), sparkleberry (V. arboreum), deerberry (V. stamineum) and some species of huckleberry (V. deliciosum, V. membranaceum, V. parvifolium, V. ovatum) (Ballington, 2001; Manzanero et al., 2023). This genus is divided into 30 sub-generic sections according to Stevens (Stevens, 1969; Luby et al., 1991). Sections Cyanococcus, Oxycoccus, Vitis-idaea, Myrtillus, and Vaccinium include species that are either cultivated or extensively collected from native stands for their edible fruit, of which blueberries are in the Cyanococcus section (Galletta and Ballington, 1996).

The wild progenitors of the cultivated blueberry are native to North America along with cranberry, some strawberry, and caneberry crops (Carter et al., 2019; Colle et al., 2019). Cultivation of blueberry is recent in comparison to other established crops, with cultivation of the lowbush blueberry (V. angustifolium) beginning in the mid-19th century (Ballington, 2001), whereas domestication and cultivation of the more widespread tetraploid highbush blueberry (V. corymbosum L.) began in the early 20th century (Colle et al., 2019). Production continued to increase in subsequent decades due to the status of blueberry as a “superfood” like other berry species (Burton-Freeman et al., 2016; Davidson et al., 2018). Cultivated highbush blueberry is differentiated into northern highbush blueberry (NHB) and southern highbush blueberry (SHB) based on the chilling requirement, with the low or no chill requirement introgressed from V. darrowii into SHB allowing expansion to warmer climates (Manzanero et al., 2023). In addition, the hexaploid rabbiteye (RE) blueberry (V. virgatum syn. ashei) is also cultivated but in smaller quantities (Ballington, 2001). Lowbush blueberries (V. angustifolium and V. myrtilloides), also referred to as the wild blueberry, are grown in limited native stands in Northern US (Maine) and Canada (Maritime provinces and Quebec) (Strik and Yarborough, 2005). A key contributing factor to the appeal of blueberries is the presence of phenolic compounds, sugars, acids and volatile organic compounds that contribute to flavor perception (Gilbert et al., 2015; Klee and Tieman, 2018). In 2021, approximately 320,000 tons of fresh weight cultivated blueberry valued at $1.02 billion was produced in the United States (USDA NASS, 2022). The top five producing states include Washington (180 million pounds, 26.9%), Oregon (151 million pounds, 22.6%), Georgia (86.5 million pounds, 12.9%), California (74.5 million pounds, 11.1%), and Michigan (72.9 million pounds, 10.9%). The blueberry market is divided approximately evenly between fresh and processed blueberries. Fresh blueberries are produced in multiple states whereas processed blueberries are primarily from Washington and Oregon.

To meet the demands from industry and consumers, modern breeding programs have been developing and employing genomics tools to accelerate cultivar development. However, the utilization of asynchronous marker technologies across studies hampers utilization of their discoveries. A crucial tool missing from the blueberry community is an affordable genotyping platform that can facilitate genome-wide association mapping, yield reproducible results, and consolidate the vast array of genetic studies being conducted. Considerable technological advancements have been made in the field of genotyping from the initial, gel-based genotyping methods (Jones et al., 1997). These include the widespread development of rapid fluorescent SNP marker technologies such as Kompetitive Allele Specific PCR (KASP™, LGC Biosearch Technologies), PCR Allelic Competitive Extension (PACE®, 3CR Bioscience) and RNase H-dependent PCR (RhAmp™, IDT Technologies) (Dobosy et al., 2011; Semagn et al., 2014); the scale and repeatability of array-based sequencing (Schenk et al., 2000); complexity reduction for repetitive genomes of Restriction site-Associated Digest-Genotyping By Sequencing (RAD-GBS, Elshire et al., 2011); and cost-effectiveness of targeted amplicon sequencing (Lundberg et al., 2013). Other recent developments aim to increase the cost-effectiveness of array-based technology by developing multi-species arrays that can multiplex multiple organisms together with the same barcode (Montanari et al., 2022). There is a wealth of platforms and novel ideas aimed at encouraging more repeatable and more affordable genotyping for different end goals such as mapping or predicting target traits, genome-wide association mapping, pedigree analysis, DNA fingerprinting, or diversity analysis. Amongst these, Capture-Seq (LGC Biosearch Technologies, do Amaral et al., 2015) and DArTag (Diversity Array Technologies, Jaccoud et al., 2001; Wenzl et al., 2004) are similar to Allegro (Tecan Genomics) and SeqSNP (discontinued, LGC Biosearch Technologies) using Single Primer Extension Technology (Scolnick et al., 2015), in that a single oligonucleotide probe is used for enrichment of a target variant, either through hybridization or amplification. These platforms have advantages over array-based or RAD-GBS approaches that were either repeatable but expensive to develop or cheap but have low repeatability.

The application of marker-assisted selection (MAS) using genetic markers linked to traits of economic importance has not yet been widely implemented in blueberry despite progress in identifying a small number of loci controlling fruit characteristics and high-quality genome assemblies becoming available (reviewed by Edger et al., 2022). To fill the existing gaps toward MAS in blueberry, the U.S. Vaccinium breeders, allied scientists, and extension specialists, with strong international participants, are collaborating in the VacCAP project to address major bottlenecks for growth of the U.S. Vaccinium industry (Iorizzo et al., 2023). This project is an international coordinated transdisciplinary research approach that was funded in 2019 by the United States Department of Agriculture to develop and implement MAS for fruit quality traits in Vaccinium breeding programs. Its objectives are reviewed by Iorizzo et al. (2023) and described at https://www.vacciniumcap.org/. The first objective was to develop a cost-effective high throughput genotyping platform that works across the cultivated NHB and SHB germplasm to advance genetic studies (in particular genome-wide association mapping) and enable downstream application of MAS in blueberry. In this study, we describe the development of a Flex-Seq platform panel for blueberry that uses two probes for increased specificity and haplotype reconstruction. A total of 22,000 targeted loci were designed and assessed in a diversity panel of 192 accessions made up of 72 NHB, 72 SHB, 21 RE, and 27 wild relatives of interest to the blueberry research community.

Materials & methods

Plant materials

A total of 192 diverse Vaccinium accessions (Supplementary Table 1) obtained from blueberry researchers worldwide from public and private institutions were submitted for DNA extraction, library preparation, and genotyping to RAPiD Genomics (LGC Group, Gainesville, FL). The 192 accessions consisted of 72 NHB, 72 SHB, 21 RE, and 27 accessions considered wild for the purpose of the study. Wild accessions included one to two accessions of each Vaccinium species that were split further into Cyanococcus species: lowbush blueberry (V. angustifolium), common/Canadian blueberry (V. myrtilloides), evergreen blueberry (V. darrowii), Elliott’s blueberry (V. elliottii), small black blueberry (V. tenellum); and Non-Cyanococcus species: Madeira blueberry (V. padifolium), northern bilberry (V. uliginosum), deerberry (V. stamineum), evergreen huckleberry (V. ovatum), lingonberry (V. vitis-idaea) and sparkleberry (V. arboreum). In addition, hybrids between NHB with common blueberry, evergreen blueberry, and northern bilberry, as well as evergreen blueberry with Azores blueberry (V. cylindraceum) were grouped with wild Cyanococcus.

Catalogue data collection and quality control

Genomic and transcriptomic sequencing data files were collected from NCBI and collaborators to obtain a de novo variant catalogue. Files collected from NCBI were downloaded as fastq files using Fastq-dump v2.10.9 (Sayers et al., 2021). The data consisted of 50 cultivars and seven projects containing transcriptomic data of 16 additional cultivars from NCBI (Supplementary Table 2). The data comprised paired-end whole genome sequences, 15 paired-end transcriptomic sequences, and one single-end transcriptomic sequence. Collaborator data files were provided as files with flanking sequences approximately 150 to 250 bp in length with the variant positioned in the middle of the sequence, as fastq files, or as a coordinate file that consisted of a chromosome and position on an associated reference genome. Data were obtained from nine collaborators that encompassed 13 different studies that included eight diversity panels and four mapping populations (Supplementary Table 3).

Fastq files were evaluated with FastQC v0.11.9 using default settings to identify sequence contamination such as sequence adapters, poly-tail SNP repeats, and over-represented fungal, bacterial, and plasmid sequences. Contamination was removed with BBDuk v03.28.2018 (Bushnell, 2022) using the following parameters: ktrim right, mink 11, hdist 2, tpe, tbo, maq 25, minlen 25. Files were re-evaluated with FastQC to determine the quality of the curated data. Curated genomic fastq files were indexed and aligned to the W85 Phase 0 (P0) reference genome (Mengist et al., 2023) with bwa-mem v0.7.12 using default settings (Li et al., 2009) and bam files were sorted by genomic coordinates with samtools v1.18 sort (Li et al., 2009).

Transcriptomic fastq files were indexed and aligned to the W85 P0 reference genome (Mengist et al., 2023) with STAR v2.7.10a in two-pass mode (Dobin et al., 2013). For more accurate mapping, intron length statistics were calculated with a publicly available AWK v10.29.2014 script (Weeks, 2014). The minimum and maximum intron lengths were supplied to STAR during the first round of mapping using the alignIntronMin and alignIntronMax options, respectively. After mapping, STAR produced a tab-delimited file with the SJ.out.tab suffix for each alignment. These files contain coordinates for high confidence collapsed splice junctions with associated strand orientation. A second tab-delimited file with the SJ.in.tab suffix was created from the SJ.out.tab file using a publicly available AWK v05.14.2014 script (Dobin, 2014). This second file contained four columns: the chromosome, the first base of the intron, the last base of the intron, and strand orientation (e.g., + or -). During the second round of indexing and mapping, all program settings were kept the same, except that the SJ.in.tab file was supplied to STAR during the genomeGenerate run mode using the sjdbFileChrStartEnd option. The resulting bam files were coordinate sorted with samtools v1.18 sort (Li et al., 2009). Genomic and transcriptomic bam files were processed with GATK v4.2.0 MarkDuplicates to remove PCR duplicates arising from library construction and single amplification clusters (McKenna et al., 2010; Van der Auwera et al., 2013). The bam files were processed to add read groups for traceability with GATK v4.2.0 AddOrReplaceReadGroups (McKenna et al., 2010; Van der Auwera et al., 2013). To determine the depth and coverage across the genome, each bam file was processed with samtools v1.18 coverage (Li et al., 2009).
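The conversion of splice junction files between the two mapping rounds can be illustrated with a minimal Python sketch. This is not the AWK script cited above; the function name sj_out_to_sj_in is hypothetical, and the sketch assumes the standard STAR SJ.out.tab layout in which the fourth column encodes strand as 0 (undefined), 1 (+), or 2 (-).

```python
# Hypothetical helper: reduce STAR SJ.out.tab rows to the four columns
# (chromosome, intron start, intron end, strand) described in the text.
# Assumption: column 4 of SJ.out.tab is 0 (undefined), 1 (+) or 2 (-).
STRAND_CODES = {"0": ".", "1": "+", "2": "-"}


def sj_out_to_sj_in(lines):
    """Return four-column junction rows built from SJ.out.tab rows."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        chrom, start, end, strand_code = line.split("\t")[:4]
        rows.append("\t".join([chrom, start, end, STRAND_CODES[strand_code]]))
    return rows


# Two made-up junctions in SJ.out.tab format (nine columns each).
example = [
    "chr01\t1001\t1500\t1\t1\t0\t25\t0\t40",
    "chr01\t2001\t2600\t2\t1\t0\t12\t3\t38",
]
converted = sj_out_to_sj_in(example)
```

The example rows are invented for illustration; in practice the rows would be read from each sample's SJ.out.tab file.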

De novo variant calling

Files with the highest depth and uniform coverage across the genome were used in building the de novo variant catalogue to ensure the most accurate variant calling. All bam files were merged into a single bam file using samtools v1.18 merge and sorted by genomic coordinates with samtools v1.18 sort (Li et al., 2009). The bam file was split into smaller regions, with each chromosome divided into approximately 300 equally sized regions and larger scaffolds split one to three times. Freebayes v1.3.2-38-g71a3e1c-dirty (Garrison and Marth, 2012) fasta_generate_regions.py and a custom Snakemake v6.3.0 (Mölder et al., 2021) script were used to generate bed files containing the start and stop positions of each split region for each chromosome. All of the chromosome region bed files were concatenated together into a single bed file that was supplied to samtools v1.18 view to perform partitioning (Li et al., 2009).
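The partitioning of a chromosome into roughly equal regions can be sketched in Python as follows. This is an illustrative stand-in for the region-generation step, not the fasta_generate_regions.py or Snakemake code used in the study; the function name split_chromosome and the use of 0-based half-open BED coordinates are assumptions.

```python
# Hypothetical sketch: divide one sequence into n near-equal BED intervals.
# Coordinates are 0-based and half-open, as in the BED convention.
def split_chromosome(chrom, length, n_regions=300):
    """Return (chrom, start, end) tuples that tile [0, length) exactly."""
    n_regions = max(1, min(n_regions, length))
    base, extra = divmod(length, n_regions)
    regions = []
    start = 0
    for i in range(n_regions):
        # The first `extra` regions absorb the remainder, one base each.
        size = base + (1 if i < extra else 0)
        regions.append((chrom, start, start + size))
        start += size
    return regions


# Small toy example: a 1,000 bp sequence split into three regions.
regions = split_chromosome("chr01", 1000, n_regions=3)
```

Because the intervals tile the sequence without gaps or overlaps, per-region variant calls can later be concatenated without producing duplicate sites at region boundaries.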

Variant calling was performed on each split bam file using Freebayes and Snakemake with the options cnv-map to enter the correct ploidy per sample and use-best-n-alleles to use the best 3 alleles (Garrison and Marth, 2012). The resulting VCF files were concatenated per chromosome using bcftools v1.9 concat (Danecek et al., 2021). The program vcflib v09.28.2015 vcfuniq (Garrison et al., 2022) was used to ensure there were no duplicate calls. The concatenated files were sorted by coordinates with bcftools v1.9 sort and calls were filtered for the following requirements using bcftools v1.9 view option i and bcftools filter option e: 1) call quality >=20, 2) minor allele frequency >= 10%, 3) maximum depth of 2 standard deviations above the mean depth, 4) alternative and reference alleles each supported by at least 5 reads, and 5) missing genotypes >= 20% (Danecek et al., 2021).
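The five site-level criteria listed above can be expressed as a single predicate, sketched here in Python. This is not the bcftools expression used in the study: the dictionary keys are hypothetical, the population standard deviation is an assumed choice, and criterion 5 is interpreted as excluding sites at which 20% or more of genotypes are missing.

```python
from statistics import mean, pstdev


def depth_ceiling(site_depths):
    """Mean depth plus two standard deviations (population SD assumed)."""
    return mean(site_depths) + 2 * pstdev(site_depths)


def passes_filters(site, max_depth):
    """Apply the five criteria to one site described by a plain dict.

    The keys below are illustrative names, not VCF INFO tags.
    """
    return (
        site["qual"] >= 20                    # 1) call quality
        and site["maf"] >= 0.10               # 2) minor allele frequency
        and site["depth"] <= max_depth        # 3) depth ceiling
        and site["ref_reads"] >= 5            # 4) reference allele support
        and site["alt_reads"] >= 5            # 4) alternative allele support
        and site["missing_fraction"] < 0.20   # 5) assumed: drop sites >= 20% missing
    )


# Toy data: the ceiling comes from five made-up site depths.
ceiling = depth_ceiling([80, 90, 100, 110, 120])
good = {"qual": 50, "maf": 0.25, "depth": 100,
        "ref_reads": 60, "alt_reads": 40, "missing_fraction": 0.05}
too_deep = dict(good, depth=300)
kept = [s for s in (good, too_deep) if passes_filters(s, ceiling)]
```

Sites with depth far above the ceiling are typically collapsed repeats or paralogous regions, which is why an upper depth bound is applied alongside the minimum read support.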

The flanking sequences 250 base pairs in length from the filtered calls were extracted from the W85 P0 reference genome and converted to fastq format using a custom python script. The flanking sequences with the variant were re-aligned to the W85 P0 reference genome with bwa-mem and sorted with samtools sort. A custom python script was created to identify single mapping variants using the bitwise FLAG, XA, and SA alignment fields. The filtered bwa single mapping variant sequences were aligned a second time with BLASTn v2.14.0 (Sayers et al., 2022). A second python script was created to identify single mapping variants based on alignments that only had one hit.
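A minimal Python sketch of how single mapping sequences might be identified from SAM records using the bitwise FLAG together with the XA and SA tags is given below. It is an illustration under stated assumptions rather than the custom script used in the study: the function name is hypothetical, and a record is treated as single mapping only if it is mapped, is neither a secondary nor a supplementary alignment, and carries no XA (alternative hits) or SA (chimeric alignment) tag.

```python
# SAM FLAG bits taken from the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_single_mapping(sam_line):
    """Return True if a SAM record shows no evidence of multiple placements."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
    tags = {field[:2] for field in fields[11:]}
    return "XA" not in tags and "SA" not in tags


# Made-up records: one unique hit, one with an alternative hit reported in XA.
unique = "v1\t0\tchr01\t1000\t60\t4M\t*\t0\t0\tACGT\t*\tNM:i:0"
multi = "v2\t0\tchr01\t5000\t0\t4M\t*\t0\t0\tACGT\t*\tNM:i:0\tXA:Z:chr05,+7000,4M,1;"
single_hits = [r.split("\t")[0] for r in (unique, multi) if is_single_mapping(r)]
```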

A priori variant calling

Transcriptomic files not used for de novo variant calling were used in a second catalogue called the a priori variant catalogue, comprised mostly of collaborator data that had a priori variant calling. Freebayes was used to joint call all the transcript files using the same options in the de novo variant calling. Additionally, the transcriptomic VCF files went through the same filtering and single mapping process as the de novo pipeline.

Collaborator data provided as genomic coordinates had 250 base pair flanking sequences extracted from the associated reference genome as a fasta file using a custom python script. Collaborator data provided as a file with flanking sequences were converted to fasta format. All fasta files were aligned to the W85 P0 reference genome with bwa-mem. Each alignment file, the associated fasta file with the variant in square brackets (e.g., [A]), and the W85 P0 reference genome, were utilized in a custom python script that created liftover coordinates for the sequences between the two assemblies while considering and adjusting for the CIGAR string alignment information. This custom script generated two data files, a visual alignment file between the reference and query sequence and an alignment data file that contained the following information: 1) primary and secondary alignment, 2) strand orientation (forward or reverse), and 3) chromosome and SV position(s). Each of the alignment data files for the 13 collaborator studies and the single mapping transcriptomic VCF files were parsed to extract the chromosome and position for each SV into a text file.
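The adjustment of variant coordinates for CIGAR alignment information can be sketched as follows. This Python example is a simplified illustration and not the custom liftover script described above; the function name query_to_ref is hypothetical, and it assumes the variant offset is counted from 0 along the sequence as stored in the SAM record (which is reverse-complemented for reverse-strand alignments).

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def query_to_ref(pos, cigar, query_offset):
    """Map a 0-based query offset to a 1-based reference position.

    pos is the 1-based leftmost reference position of the alignment.
    Returns None when the offset falls in an insertion or soft clip,
    i.e. when the base has no counterpart in the reference.
    """
    ref = pos   # reference coordinate of the next reference-consuming base
    qry = 0     # query offset of the next query-consuming base
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":            # consumes query and reference
            if qry <= query_offset < qry + length:
                return ref + (query_offset - qry)
            ref += length
            qry += length
        elif op in "IS":           # consumes query only
            if qry <= query_offset < qry + length:
                return None
            qry += length
        elif op in "DN":           # consumes reference only
            ref += length
        # H and P consume neither sequence
    return None


# A variant at query offset 7 in an alignment with a 2 bp deletion.
lifted = query_to_ref(1000, "5M2D5M", 7)
```

In the example, the two deleted reference bases shift the lifted position two bases to the right relative to an ungapped alignment.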

Final variant catalogue development

To create both the de novo and a priori variant catalogues, the text file that contained the existing chromosome and position information, the W85 P0 hard and soft core pangenome (Yocca et al., 2023), the W85 GFF3, and the de novo single mapping variants per chromosome were put into a custom Snakemake and python script. The function of these scripts was to compile existing variants and associated information and to identify variants shared between all the collaborator studies, the transcriptomic data, and the de novo data for each chromosome. The same information was entered for the de novo variant catalogue, except that the text file containing the existing chromosome and position information was omitted and the a priori variant catalogue per chromosome was provided instead. The function of this script was to compile for each chromosome the de novo variants and associated information that were not present in the a priori variant catalogue. The a priori and de novo variant catalogues are tab-separated text files that contain the following columns: 1) the variant name, 2) the chromosome, 3) the position(s), 4) the origin of the data (existing collaborators, transcriptomic, or de novo as E, TR, or D, respectively), 5) the type (DP for diversity panel, MP for mapping population, or TR for transcriptome) as counts, respectively, 6) the collaborator, 7) variant location in the hard core, soft core, or intergenic region of the genome, and 8) gene information if available.

Additionally, a trait text data file was created that contained the collaborator variants associated with a study and corresponding trait data. The trait file contained four columns that had the variant name, the importance or “Priority” coded as 1 for important and 0 as not important, the trait, and the reference publication. This file along with the a priori and de novo variant catalogues for each chromosome were the input for a custom Snakemake and python script to generate a per chromosome master catalogue. The catalogue had the same columns as the a priori and de novo catalogues with the addition of three columns that contained the priority, trait, and citation information.

Flex-seq platform design and genotyping

The de novo SNPs and the existing SNPs were provided to RAPiD Genomics/LGC Group (Gainesville, FL, USA) for filtering down to 50,000 variants. SNPs located within core or accessory genes, and SNPs previously associated with a trait were prioritized over intergenic SNPs. Flex-Seq probes were designed by RAPiD Genomics using a proprietary in-house pipeline based on target SNPs identified through previous genotyping efforts. A 1 bp sliding window was applied to the 300-500 bp of the flanking sequence for each target, generating a list of all possible probe sequences. Candidate probes were filtered to remove probes with extreme GC content, homopolymer runs, and ambiguous bases. The remaining probes were evaluated for uniqueness within the reference genome using BLASTn (Camacho et al., 2009). Probes were assigned a weighted score based on their properties, maximizing uniqueness within the genome (i.e., specificity), uniformity of GC content and Tm (melt temperature) across all loci, and minimizing dimer and hairpin potential. The final probe pairs for each locus were selected based on the optimization of the previous parameters while ensuring that the target SNP would be fully sequenced by 2x150bp Illumina sequencing and that a minimum fragment length of 300bp was maintained. The design and selection of the final panel utilized a two-step approach, with step one consisting of the design and synthesis of a larger ~50,000 panel which was then tested by genotyping a set of representative test samples that included an equal number of NHB and SHB accessions (Supplementary Table 1). In the second step, the final probe panel was selected based on the analysis of step one to empirically identify the loci that performed the best in the Flex-Seq reaction. Loci were selected to optimize recovery across all samples in the initial test set, minimizing variation within a locus among samples, and optimizing data uniformity across loci. This resulted in a final panel consisting of 22,000 loci (Flex-Seq Panel Code: FS_1903). Samples were paired-end sequenced using 2x150 Illumina NovaSeq Platform. Individual fastq and haplotype files are provided for each sample from RAPiD Genomics as well as one raw and one curated VCF file for all samples combined. The raw and filtered VCF files were generated using a tetraploid calling format using FreeBayes for all accessions.
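The sliding-window generation and initial filtering of candidate probes can be illustrated with a short Python sketch. The probe design pipeline itself is proprietary, so everything here is an assumption made for illustration: the function name, the probe length, the GC bounds, and the homopolymer limit are hypothetical values and do not reflect the settings used by RAPiD Genomics.

```python
import re


def candidate_probes(flank, probe_len=40, gc_min=0.35, gc_max=0.65,
                     max_homopolymer=4):
    """Enumerate probes with a 1 bp sliding window and apply basic filters.

    All thresholds are illustrative defaults, not values from the study.
    Returns a list of (start_offset, probe_sequence) tuples.
    """
    flank = flank.upper()
    # Matches any base repeated more than max_homopolymer times in a row.
    homopolymer = re.compile(r"(.)\1{%d,}" % max_homopolymer)
    kept = []
    for start in range(len(flank) - probe_len + 1):
        probe = flank[start:start + probe_len]
        if set(probe) - set("ACGT"):
            continue  # ambiguous base such as N
        gc = (probe.count("G") + probe.count("C")) / probe_len
        if not gc_min <= gc <= gc_max:
            continue  # extreme GC content
        if homopolymer.search(probe):
            continue  # long homopolymer run
        kept.append((start, probe))
    return kept


# Toy flank with an ambiguous base in the middle and 4 bp probes.
probes = candidate_probes("ACGTNACGT", probe_len=4)
```

Probes surviving these filters would still need the genome-wide uniqueness check and weighted scoring described above, which are not modelled in this sketch.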

Platform assessment

Marker overlap between platforms

Loci overlapping between the Flex-Seq platform and the previously developed Capture-Seq and DArTag genotyping platforms were evaluated. The Flex-Seq platform consists of 22,000 loci, whereas the Capture-Seq and DArTag platforms consist of 10,000 and 3,000 loci, respectively. Genomic positions of the Capture-Seq and DArTag probes were first obtained using blastn 2.13.0+ (Camacho et al., 2009) against the W85 P0 genome assembly, retaining only the top hit with query coverage and percent identity of at least 90%. The resulting outputs were converted to bed files, and overlap among all three bed files was compared using bedtools 2.30.0 intersect (Quinlan and Hall, 2010) with the -wa and -wb options, whereas unique loci were identified using the -v option.
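The logic of the overlap comparison can be sketched in plain Python. This is a conceptual stand-in for the bedtools intersect calls, not a reimplementation of bedtools: the function names are hypothetical, intervals are (chromosome, start, end) tuples in 0-based half-open BED coordinates, and the quadratic loop is only suitable for small illustrative inputs.

```python
def overlaps(a, b):
    """True if two BED intervals on the same chromosome share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect(a_intervals, b_intervals):
    """Return (shared pairs, intervals unique to A).

    Shared pairs mimic reporting both records for each overlap; the
    unique list mimics reporting A records with no overlap in B.
    """
    shared = [(a, b) for a in a_intervals for b in b_intervals if overlaps(a, b)]
    unique = [a for a in a_intervals
              if not any(overlaps(a, b) for b in b_intervals)]
    return shared, unique


# Made-up loci from two platforms.
platform_a = [("chr01", 100, 400), ("chr01", 900, 1200), ("chr02", 50, 350)]
platform_b = [("chr01", 350, 500), ("chr02", 400, 600)]
shared, unique = intersect(platform_a, platform_b)
```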

Variant filtering and locus recovery

The unfiltered VCF file was subsampled into respective classes (NHB, SHB, RE, Cyanococcus, and Non-Cyanococcus, Supplementary Table 1) using the bcftools 1.9 view subcommand. Data were filtered using vcflib vcffilter (Garrison et al., 2022) for biallelic SNPs, quality scores of more than 30, forward and reverse mapping scores of more than 30, and sequencing depth exceeding ten reads per sample to ensure realistic odds that each of the four homologous chromosomes within tetraploid NHB and SHB were sampled. Raw and filtered sequencing depth across variants was assessed using vcftools 0.1.16. A locus was considered recovered if at least one variant was retained within the locus post-filtering, and recovery was investigated by chromosome and SNP class to ascertain any performance bias. Loci were further filtered to remove monomorphic sites within a specific subset. The number of polymorphisms per locus and mean number of variants per locus were calculated based on post-filtering.
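The rationale for the depth threshold can be made concrete with a short calculation, sketched here in Python. The function below is illustrative only and rests on simplifying assumptions that are not stated in the study: reads are independent, each homolog is equally likely to be sampled, and there is no allelic or mapping bias. Under those assumptions, inclusion-exclusion gives the probability that every homolog is observed at least once at a given read depth.

```python
from math import comb


def prob_all_homologs_sampled(depth, ploidy=4):
    """Probability that `depth` reads cover every one of `ploidy` homologs.

    Inclusion-exclusion over the number k of homologs that are missed,
    assuming independent reads drawn uniformly from the homologs.
    """
    return sum(
        (-1) ** k * comb(ploidy, k) * ((ploidy - k) / ploidy) ** depth
        for k in range(ploidy + 1)
    )


# Probability of sampling all four homologs of a tetraploid at 10 reads.
p_tetraploid_10x = prob_all_homologs_sampled(10)
# The same calculation at a higher depth for comparison.
p_tetraploid_50x = prob_all_homologs_sampled(50)
```

Under these idealized assumptions, ten reads give roughly a 78% chance of observing all four homologs, and the probability approaches one as depth increases, which is consistent with deeper sequencing being favored for accurate dosage calling.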

Haplotype reconstruction

More stringent filtering of the raw VCF file was performed, requiring variant, forward, and reverse mapping quality scores higher than 30, biallelic SNPs, alternative alleles present in at least ten individuals, population level sequencing depth of 9,600 (50x/individual mean), and allelic balance between 0.25 and 0.75, resulting in 53,557 SNPs. Haplotypes of each locus were constructed using only recovered high-quality biallelic SNPs identified from the previous filtering steps using MCHap 0.8.1 assemble (https://github.com/PlantandFoodResearch/MCHap). Firstly, raw fastq files were trimmed using fastp 0.22.0 (Chen et al., 2018), after being assessed with fastqc 0.12.1 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and multiqc 1.14 (Ewels et al., 2016). Trimmed reads were aligned to W85 P0 using bwa-mem2 2.2.1 (Vasimuddin et al., 2019) and converted to bam files using samtools 1.6 view, sort, and index commands (Li et al., 2009). MCHap 0.8.1 assemble was supplied with a list of all bam files, a bed file containing all target loci, a compressed and indexed VCF file containing all high-quality variants, and an indexed reference genome of W85 P0. The options for ploidy were set to 4 and the prior for the inbreeding coefficient to 0.01. The number of called haplotypes per locus was extracted from the subsequent haplotype VCF file.

Phylogenetic tree

The 53,557 high-quality SNP set used for haplotype reconstruction was filtered to remove any SNPs with missing data, leaving a total of 10,388 SNPs. For haplotypes constructed using the high-quality SNP set, monomorphic haplotypes were removed, leaving a total of 10,683 haplotype blocks. The 10,388 SNP set and 10,683 haplotype blocks were considered sufficiently similar to construct and compare phylogenetic trees. Phylogenetic trees were constructed from the 10,388 SNPs and the 10,683 haplotype blocks using unweighted pair group method with arithmetic mean (UPGMA) hierarchical clustering with 100 bootstraps, using the aboot command and Nei's genetic distance within poppr 2.9.3 (Kamvar et al., 2014). The phylogenetic tree was plotted using the ggtree 3.4.2 (Yu, 2020) extension of ggplot2 3.3.6 (Wickham, 2016) within R 4.2.1 (R Core Team, 2021).
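The clustering step can be illustrated conceptually with a compact Python sketch of UPGMA. This is not the poppr workflow used in the study, which was run in R: the function below is a hypothetical minimal implementation that takes a precomputed pairwise distance matrix, returns only the tree topology as nested tuples, and omits both the distance calculation and the bootstrap resampling.

```python
def upgma(names, matrix):
    """Cluster labelled samples by UPGMA from a symmetric distance matrix.

    Returns the topology as nested tuples; branch lengths are not tracked.
    """
    # Each cluster id maps to (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted mean equals the average over all leaf pairs.
            merged[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(merged)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy distances among four made-up samples.
labels = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
topology = upgma(labels, distances)
```

In the toy example the two closest samples are joined first, and the most distant sample joins last, which is the behavior expected of average-linkage clustering.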

Results

SNP catalogue composition

The SNP catalogue contained 7,571,026 variants that included 444,908 submitted by collaborators based on previous studies in mapping populations or diversity panels (Supplementary Table 3). Of these, 4,675 variants were indicated to be associated with traits evaluated in blueberry including fruit firmness, weight, diameter, size, volatiles, color, flavor, titratable acidity, soluble solids, full bloom, chilling requirement, and cold hardiness. A total of 7,527,230 (or 99.4%) of these variants were distributed across the 12 chromosomes while the remaining (30,514 and 13,282) mapped to 191 contigs and 12 scaffolds, respectively of the W85 P0 assembly (Mengist et al., 2023).

Data curation and Flex-Seq platform composition

This SNP catalogue was used to develop a probe set targeting ~50,000 loci (Flex-Seq Panel Code: FS_1902, RAPiD Genomics) within blueberry. These loci contained 194,365 variants, the majority of which were located in core genes of the genome (124,629, 64.1% of total), with the remaining in accessory genes (12,928, 6.7%), intergenic regions (56,494, 29%), or mixed regions (core and accessory, 314, 0.2%). Up to 1,345 variants in this design were associated with blueberry traits as provided by collaborators. This design was assessed in an equal number of SHB and NHB samples and reduced to a final probe set targeting 22,000 loci (Flex-Seq Panel Code: FS_1903, RAPiD Genomics) distributed evenly throughout the genome (Figure 1). These loci were selected by RAPiD Genomics to optimize recovery across the genotyped samples, minimizing variation within a locus among samples, and optimizing data uniformity across loci. Of these 22,000 loci: 15,992 (72.7%) were in core genes, 1,872 (8.5%) were in accessory genes, 4,037 (18.4%) were in intergenic regions, and 34 (0.2%) were mixed (core and accessory). The remaining 65 loci (0.3%) were not assigned. Up to 205 loci were in regions reported to control blueberry traits. A total of 99% of the loci in the final design were distributed across the 12 chromosomes (between 1,620 loci on Chromosome 07 and 2,156 loci on Chromosome 02) and the remaining 1% on contigs and scaffolds (215 loci).

Figure 1

Figure 1 Loci density plot of the Flex-Seq platform across the W85 Phase 0 reference genome assembly. Loci density is displayed with low (green) to high (red) density across the twelve chromosomes within the W85 Phase 0 reference genome assembly in a window size of 100 kb. Contigs and scaffolds are removed to aid visualization. Figure produced using CMplot.

The 192 blueberry samples (Supplementary Table 1) were genotyped with the Flex-Seq 22K (FS_1903). A total of 3.96 million variants were identified including single nucleotide polymorphisms (SNPs), multi-nucleotide variants, insertions and deletions, and other complex variants with an average read depth of 66x. The combined dataset was subsampled to each blueberry subclass and filtered for high-quality biallelic SNPs resulting in approximately 430,000 variants for NHB (81x average read depth), SHB (75x average read depth), and RE (84x average read depth), approximately 350,000 variants in Cyanococcus (55x average read depth) and approximately 275,000 variants in Non-Cyanococcus (52x average read depth). A total of four samples were observed to skew data retention within Cyanococcus and Non-Cyanococcus and were therefore removed from further analysis. These included one Cyanococcus (V. tenellum) and three Non-Cyanococcus (one V. stamineum and both V. arboreum). On closer inspection, the low recovery in these samples appeared to be directly correlated with the DNA concentration provided by RAPiD Genomics.