Genomic Evidence That Governmentally Produced Cannabis sativa Poorly Represents Genetic Variation Available in State Markets

Vergara, Daniela; Huscher, Ezra L.; Keepers, Kyle G.; Pisupati, Rahul; Schwabe, Anna L.; McGlaughlin, Mitchell E.; Kane, Nolan C.

doi:10.3389/fpls.2021.668315

ORIGINAL RESEARCH article

Front. Plant Sci., 14 September 2021

Sec. Plant Metabolism and Chemodiversity

Volume 12 - 2021 | https://doi.org/10.3389/fpls.2021.668315

This article is part of the Research Topic: Behind the Smoke and Mirrors: Reflections on Improving Cannabis Production and Investigating Medical Potential

Daniela Vergara¹*

Ezra L. Huscher¹†

Kyle G. Keepers¹†

Rahul Pisupati²

Anna L. Schwabe³

Mitchell E. McGlaughlin³

Nolan C. Kane¹*

¹Kane Laboratory, Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, CO, United States
²Austrian Academy of Sciences, Vienna Biocenter, Gregor Mendel Institute, Vienna, Austria
³School of Biological Sciences, University of Northern Colorado, Greeley, CO, United States

The National Institute on Drug Abuse (NIDA) is the sole producer of Cannabis for research purposes in the United States, including medical investigation. Previous research established that cannabinoid profiles in the NIDA varieties lacked diversity and potency relative to the Cannabis produced commercially. Additionally, microsatellite marker analyses have established that the NIDA varieties are genetically divergent from varieties produced in the private legal market. Here, we analyzed the genomes of multiple Cannabis varieties from diverse lineages, including two produced by NIDA, and we provide further support that NIDA’s varieties differ from widely available medical, recreational, or industrial Cannabis. Furthermore, our results suggest that NIDA’s varieties lack diversity in the single-copy portion of the genome, the maternally inherited genomes, the cannabinoid genes, and in the repetitive content of the genome. Therefore, results based on NIDA’s varieties are not generalizable regarding the effects of Cannabis after consumption. For medical research to be relevant, material that is more widely used would have to be studied. Clearly, having research to date dominated by a single, non-representative source of Cannabis has hindered scientific investigation.

Introduction

Public perception of recreational and medicinal Cannabis sativa L. (marijuana, hemp) use has shifted, with Cannabis-derived products quickly becoming a multibillion-dollar legal industry. However, the National Institute on Drug Abuse (NIDA), a United States governmental agency, continues to be the sole producer of Cannabis for research. Additionally, high-tetrahydrocannabinol (THC) producing Cannabis continues to be classified as a Schedule I drug, along with heroin, LSD, and ecstasy, according to the DEA (DEA, 2020). This Schedule I classification restricts the acquisition of Cannabis from the private markets, making NIDA the only federally legal source for research. In addition to this limitation, research on Cannabis requires a multitude of permits and supervision (Nutt et al., 2013; Hutchison et al., 2019). Despite these restrictions, the medical and recreational Cannabis industries in North America are predicted to grow to 7.7 and 14.9 billion dollars, respectively, by late 2021 (Hutchison et al., 2019).

Cannabis sativa (marijuana, hemp) is an angiosperm member of the family Cannabaceae (Bell et al., 2010). It appears to be one of the oldest domesticated plants, utilized by numerous ancient cultures, including Egyptians, Chinese, Greeks, and Romans (Li, 1973, 1974; Russo, 2007). This versatile plant has many known uses, including fiber for paper, rope and clothing, oil for cooking and consumption, and numerous medicinal applications. The plant produces secondary metabolites known as cannabinoids that interact with the human body in physiological (Russo, 2011; Swift et al., 2013; Volkow et al., 2014) and psychoactive (Russo and John, 2003; ElSohly and Desmond, 2005) ways. The cannabinoid compounds are manufactured in the trichomes, which are abundant on the female flowers (Sirikantaramas et al., 2005). The remarkable properties of cannabinoids are partly responsible for driving the growth of the thriving Cannabis industry. Two of the main cannabinoids, Δ-9-tetrahydrocannabinolic acid (THCA) and cannabidiolic acid (CBDA), are converted when heated to the neutral forms Δ-9 THC and cannabidiol (CBD), respectively (Russo, 2011). Two well-characterized enzymes, Δ-9-tetrahydrocannabinolic acid synthase (THCAS) and cannabidiolic acid synthase (CBDAS), are responsible for the production of these cannabinoids in the plant.

Despite the regulatory hurdles and the limited scope of contributions from the United States government, Cannabis research is growing at a rapid pace (Vergara et al., 2016; Kovalchuk et al., 2020) and United States scientists have made significant advances in Cannabis research from multiple disciplines. Researchers in the United States have produced one of the most complete publicly available Cannabis genome assemblies to date, along with the locations of the cannabinoid family of genes in the genome (Grassa et al., 2018). However, oversight is needed to assure the quality and consistency of Cannabis testing across laboratories (Jikomes and Zoorob, 2018). Regulation and supervision will allow for a deeper understanding of all the compounds produced by the plant, particularly minor cannabinoids which are not always measured (Vergara et al., 2020) and are produced using multiple genes with complex interactions (Vergara et al., 2019). This is particularly important because medical Cannabis use has outpaced its research (Hutchison et al., 2019). Collaborative research between American academic institutions and private companies has shown that the cannabinoid content and genetic profile of Cannabis provided by NIDA is not reflective of what consumers have access to from the private markets (Vergara et al., 2017; Schwabe et al., 2019). Therefore, research with these varieties may not reflect the physiological effects of Cannabis consumed by the general public.

In 2017, we compared the cannabinoid chemotypes from the Cannabis produced in the private market to the chemotypes from the governmentally produced Cannabis for NIDA by the University of Mississippi (Vergara et al., 2017). We found that NIDA’s Cannabis lacked potency and chemotypic variation and had an excess of cannabinol (CBN), which is a degradation product of THC. The cannabinoid diversity from the governmentally produced Cannabis was a fraction (only 27% of the THC) of that from the private markets. A study using microsatellite markers also showed that NIDA’s Cannabis was genetically different from commercially available recreational and medical varieties. This study concluded that results from research using flower material supplied by NIDA may not be comparable to consumer experiences with Cannabis from legal private markets (Schwabe et al., 2019).

Here, we present the results of an analysis that further examines the genetic diversity in governmentally produced Cannabis. We acquired DNA from two NIDA-produced samples which had been previously analyzed using ten variable microsatellite regions (Schwabe et al., 2019). After sequencing, we compared their overall genomic diversity to that of previously sequenced varieties including hemp- and marijuana-types (Lynch et al., 2016; Vergara et al., 2019). We report here the genomic characteristics of the two NIDA samples, including overall genetic variation, as well as genetic variation within the cannabinoid family of genes, the maternally inherited organellar genomes (mitochondrial and chloroplast), and the repetitive genomic content. We compare this diversity to the publicly available genomes from other Cannabis lineages within the species, to characterize the relationships with other well-studied lineages.

Materials and Methods

NIDA’s Samples

Bulk Cannabis supplied for research purposes is referred to as “research grade marijuana” by NIDA and is characterized by the level of THC and CBD (NIDA, 2016). NIDA offers 12 different categories of Cannabis for research that vary in the levels of THC (low < 1%, medium 1–5%, high 5–10%, and very high > 10%) and CBD (low < 1%, medium 1–5%, high 5–10%, and very high > 10%). The high THC NIDA sample (Supplementary Table 1) has an RTI log number 13494–22, reference number SAF 027355, and the high THC/CBD sample has an RTI log number 13784-1114-18-6, reference number SAF 027355. DNA from both samples was extracted by Schwabe et al. (2019) and provided to the University of Colorado Boulder. These two samples were sequenced using standard Illumina multiplexed library preparation protocols as described in Lynch et al. (2016), which yielded an approximate coverage of 17–20x (Supplementary Table 1).

Genome Assembly, Whole Genome Libraries, and Nuclear Genome Exploration

We aligned sequences from 73 different Cannabis plants to the previously developed CBDRx assembly Cs10 (Grassa et al., 2018). These genomes were sequenced using the Illumina platform by different groups (Supplementary Table 1) and are, or will be, publicly available on GenBank. For detailed information on sequencing and library preparation of the 57 genomes sequenced by our group at the University of Colorado Boulder, please refer to Lynch et al. (2016). The remaining 16 genomes were sequenced and provided by different groups (Supplementary Table 1); however, most of these genomes have been used previously in other studies (Lynch et al., 2016; Vergara et al., 2019).

We aligned the 73 libraries to the CBDRx assembly using Burrows-Wheeler alignment (ver. 0.7.10-r789; Li and Durbin, 2009), then calculated the depth of coverage using SAMtools (ver. 1.3.1-36-g613501f; Li et al., 2009) as described in Vergara et al. (2019). We used GATK (ver. 3.0) to call single nucleotide polymorphisms (SNPs). We filtered for SNPs lying in the single-copy portion of the genome (Lynch et al., 2016), which resulted in 7,738,766 high-quality SNPs. The single-copy portion of the genome does not include repetitive sequences such as transposable elements or microsatellites. We were then able to estimate the expected coverage at single-copy sites as in Vergara et al. (2019). We performed a STRUCTURE analysis (ver. 2.3.4; Pritchard et al., 2000) with K = 3 in accordance with previous research (Sawler et al., 2015; Lynch et al., 2016). With these STRUCTURE results, we classified the varieties into four groupings: Broad-leaf marijuana-type (BLMT), Narrow-leaf marijuana-type (NLMT), Hemp, and Hybrid (Supplementary Table 2). Hybrid individuals had less than 60% population assignment probability to any particular group. We found 12 individuals in the BLMT group, 16 in the Hemp group, 14 in the Hybrid group, and 31 in the NLMT group. We then used SplitsTree (ver. SplitsTree4; Huson, 1998) to visualize relationships among the 73 individuals, VCFtools (ver. 4.0; Danecek et al., 2011) to calculate genome-wide heterozygosity as a measure of overall variation, and PLINK (ver. 1.07; Purcell et al., 2007) for a principal component analysis (PCA).
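The grouping rule described above can be illustrated with a minimal Python sketch. It assumes each individual is represented by its three STRUCTURE assignment probabilities (K = 3); the function name, the constant name, and the example probabilities are hypothetical and are not taken from the study. The text does not say how a probability of exactly 60% is handled, so the sketch treats it as assigned.

```python
from typing import Dict

# The 60% cutoff comes from the text; the constant name is hypothetical.
HYBRID_THRESHOLD = 0.60


def classify(assignment: Dict[str, float],
             threshold: float = HYBRID_THRESHOLD) -> str:
    """Return the group with the highest assignment probability,
    or "Hybrid" when that probability falls below the threshold.

    Assumption: a probability of exactly `threshold` counts as assigned.
    """
    group, prob = max(assignment.items(), key=lambda kv: kv[1])
    return group if prob >= threshold else "Hybrid"


# Made-up probabilities for illustration only (not data from the study).
examples = {
    "sample_a": {"Hemp": 0.91, "NLMT": 0.06, "BLMT": 0.03},
    "sample_b": {"Hemp": 0.45, "NLMT": 0.50, "BLMT": 0.05},
}
labels = {name: classify(q) for name, q in examples.items()}
print(labels)
```

In this sketch, an individual split between Hemp and NLMT with neither probability reaching 60% is labeled Hybrid, which mirrors how the Hybrid category is defined in the text.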

Cannabinoid Gene Pathway Exploration

Using BLAST, we found 12 hits for putative CBDA/THCA synthase genes in the CBDRx assembly (Supplementary Table 3) with more than 80% identity and an alignment length of greater than 1,000 bp. For this BLAST analysis, we used the CBCA synthase (Page and Stout, 2017), the THCA synthase with accession number KP970852.1, and the CBDA synthase with accession number AB292682.1.
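The hit-selection criteria above (more than 80% identity and an alignment length greater than 1,000 bp) can be sketched as a simple filter. The sketch assumes BLAST tabular output with the default column order (query, subject, percent identity, alignment length, and so on); the text does not state which output format was used, and the function name, query names, contig names, and rows below are made up for illustration.

```python
import csv
import io

MIN_IDENTITY = 80.0  # percent identity cutoff stated in the text
MIN_LENGTH = 1000    # alignment length cutoff (bp) stated in the text


def filter_hits(tabular_text, min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Keep hits with identity > min_identity and alignment length > min_length.

    Assumption: tab-separated BLAST output whose first four columns are
    query id, subject id, percent identity, and alignment length.
    """
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        pident, length = float(row[2]), int(row[3])
        if pident > min_identity and length > min_length:
            kept.append((row[0], row[1], pident, length))
    return kept


# Made-up rows for illustration only (not results from the study).
demo = "\n".join([
    "\t".join(["THCAS_query", "contig_1", "99.10", "1635", "15", "0",
               "1", "1635", "5000", "6634", "0.0", "2900"]),
    "\t".join(["THCAS_query", "contig_2", "78.20", "1500", "327", "0",
               "1", "1500", "100", "1599", "1e-150", "540"]),
    "\t".join(["CBDAS_query", "contig_1", "92.00", "640", "51", "0",
               "1", "640", "9000", "9639", "0.0", "900"]),
])
hits = filter_hits(demo)
print(hits)
```

Only the first made-up row passes both cutoffs; the second fails on identity and the third on alignment length.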

We estimated the gene copy-number (CN) for the cannabinoid genes (Vergara et al., 2019) and calculated summary statistics of the CN for each of the 12 genes by variety (Supplementary Table 1). Differences in the estimated gene CN between the cultivars for each of the 12 genes in the cannabinoid synthase gene family were determined using one-way ANOVAs on the CN of each gene as a function of the lineages (BLMT, Hemp, Hybrid, and NLMT), followed by a post hoc analysis to establish pairwise group differences, using the R statistical platform (R Core Team, 2013).
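As a rough illustration of this step, the sketch below computes a depth-based CN estimate and the F statistic of a one-way ANOVA across lineages. Two assumptions are made loudly here: first, that CN is approximated as the mean read depth over a gene divided by the expected depth at single-copy sites (the exact estimator is described in Vergara et al., 2019, not in this section); second, the analysis in the study was run in R, whereas this sketch uses plain Python and stops at the F statistic without computing a P-value or the post hoc comparisons. All names and values are hypothetical.

```python
from statistics import mean


def copy_number(gene_depth, single_copy_depth):
    """Assumed estimator: depth over the gene relative to single-copy depth."""
    return gene_depth / single_copy_depth


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> list of values."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                     for vals in groups.values())
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up CN estimates for one gene, grouped by lineage (illustration only).
cn_by_lineage = {
    "BLMT": [2.1, 1.8, 2.4],
    "Hemp": [0.9, 1.1, 1.0],
    "Hybrid": [1.5, 2.0, 1.2],
    "NLMT": [2.2, 2.6, 1.9],
}
f_stat, df_b, df_w = one_way_anova_f(cn_by_lineage)
print(round(f_stat, 3), df_b, df_w)
```

In practice the F statistic would be compared against an F distribution with the reported degrees of freedom, which a statistics package handles directly.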

We used BLAST to search for the two enzymes upstream in the cannabinoid pathway using the methodology from Vergara et al. (2019). We found one hit to the olivetolate geranyltransferase enzyme and two hits to olivetolic acid synthase (Supplementary Table 1).

Maternally Inherited Genomes

We used the publicly available chloroplast (Vergara et al., 2015) and mitochondrial (White et al., 2016) genome assemblies to construct haplotype networks in PopART (ver. 1.7; Leigh and Bryant, 2015), retaining only variants with a high quality score in the variant call file. The chloroplast and mitochondrial haplotype networks comprised 508 and 1,929 SNPs, respectively.

Repetitive Genomic Content

We used RepeatExplorer (ver. 2; Novák et al., 2010) to determine the repetitive content in 71 of the 73 genomes (Pisupati et al., 2018). We excluded the “Jamaican Lion” (NLMT) and “Feral Nebraska” (hemp) genomes due to low-quality reads that led to dubious results. We estimated the repetitive content of the genome and annotated repeat families using custom Python scripts¹.

Results

Nuclear Genome Exploration

Our analysis of the nuclear genome used 7,738,766 high-quality SNPs from the inferred single-copy portion of the genome. STRUCTURE analysis (Figure 1A) shows the population assignment probabilities for all 73 varieties, including both of NIDA’s varieties. This analysis established that NIDA’s samples cluster with both the Hemp and NLMT groupings, with less than 60% in either group, and therefore we categorized them as Hybrid (Supplementary Table 2). The individuals that are part of the Hemp (orange, n = 16), NLMT (blue, n = 31), or BLMT (purple, n = 12) groups had a population assignment probability of more than 60% to that particular group. Individuals with a probability of less than 60% to any particular population were assigned to the Hybrid group (gray, n = 14), which includes both of NIDA’s samples.

Figure 1. STRUCTURE and Principal Component Analyses. The proportion of each color in a bar indicates the probability of assignment to the Hemp (orange), NLMT (blue), or BLMT (purple) group. Both of NIDA’s strains, outlined with black margins, are assigned to both the NLMT and Hemp groups with less than 60% probability, and therefore we assigned them to the Hybrid group (A). The two NIDA samples, in green, cluster with each other and away from other varieties (B).

In addition to clustering probability results (Figure 1A) from STRUCTURE, we colored the varieties in the PCA (Figure 1B) and SplitsTree (Figure 2) according to their color scheme from the STRUCTURE analysis. The first two principal components in the PCA explain 28.71% of the variation (Figure 1B), and the two NIDA varieties cluster together, a pattern also seen in the SplitsTree analysis (Figure 2). Both the PCA and SplitsTree indicate high genetic similarity between the NIDA samples, and neither of them clusters with any other strains.

Figure 2. SplitsTree graph. Genetically similar individuals cluster together, such as the NIDA cluster, “Afghan Kush” cluster, and “Carmagnola” cluster. NIDA samples are highlighted in green. Hemp, NLMT, and BLMT are shown in orange, blue, and purple, respectively.

The Hybrid group, which contains NIDA’s samples, shows the widest range of heterozygosity (μ = 0.131, s.d. = 0.0545) in the single-copy portion of the genome. However, it is not significantly different from any other group (Figure 3). This wide range of heterozygosity in the Hybrid group is expected given that we are grouping individuals that do not belong to one particular genetic group but rather have some assignment probability to two or three genetic groups. Therefore, varieties that are not related to each other, or that belong to more than one group, are found in the Hybrid category. This may explain why the Hybrid group has the highest mean heterozygosity in this study (Hemp: μ = 0.0817, s.d. = 0.0352; BLMT: μ = 0.0959, s.d. = 0.0405; and NLMT: μ = 0.112, s.d. = 0.0411).

Figure 3. Genome wide heterozygosity. The Hemp lineage differs significantly from the Hybrid grouping with a P < 0.03. The two NIDA samples are presented within the Hybrid grouping by two green triangles.
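The per-group heterozygosity summaries reported above (μ and s.d.) could be produced along the lines of the following sketch. It assumes that each individual's heterozygosity is the fraction of genotyped single-copy sites that are not homozygous, and that the reported s.d. is a sample standard deviation; neither detail is stated in this section, and the function names and numbers are hypothetical.

```python
from statistics import mean, stdev


def observed_heterozygosity(n_sites, n_homozygous):
    """Fraction of genotyped sites that are heterozygous (assumed definition)."""
    return (n_sites - n_homozygous) / n_sites


def summarize(het_by_group):
    """Return group -> (mean, sample standard deviation) of heterozygosity."""
    return {group: (mean(values), stdev(values))
            for group, values in het_by_group.items()}


# Made-up per-individual counts (sites genotyped, sites homozygous).
counts = {
    "Hemp": [(1000, 930), (1000, 910), (1000, 920)],
    "Hybrid": [(1000, 800), (1000, 900), (1000, 880)],
}
het_by_group = {
    group: [observed_heterozygosity(n, hom) for n, hom in rows]
    for group, rows in counts.items()
}
summary = summarize(het_by_group)
for group, (mu, sd) in summary.items():
    print(group, round(mu, 4), round(sd, 4))
```

A group that mixes individuals with differing ancestry, like the made-up Hybrid rows here, tends to show a larger spread, which is the pattern the text describes for the Hybrid category.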