In-depth investigation of the point mutation pattern of HIV-1

Sun, Nan; Yau, Stephen S.-T.

doi:10.3389/fcimb.2022.1033481

ORIGINAL RESEARCH article

Front. Cell. Infect. Microbiol., 15 November 2022

Sec. Extra-intestinal Microbiome

Volume 12 - 2022 | https://doi.org/10.3389/fcimb.2022.1033481

This article is part of the Research TopicOmics-based Novel Computational Methods Revealing Microbe-Disease AssociationsView all 7 articles

In-depth investigation of the point mutation pattern of HIV-1

Nan Sun¹

Stephen S.-T. Yau^1,2*

¹Department of Mathematical Sciences, Tsinghua University, Beijing, China
²Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China

Mutations may produce highly transmissible and damaging HIV variants, which increase the genetic diversity, and pose a challenge to develop vaccines. Therefore, it is of great significance to understand how mutations drive the virulence of HIV. Based on the 11897 reliable genomes of HIV-1 retrieved from HIV sequence Database, we analyze the 12 types of point mutation (A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) from multiple statistical perspectives for the first time. The global/geographical location/subtype/k-mer analysis results report that A>G, G>A, C>T and T>C account for nearly 64% among all SNPs, which suggest that APOBEC-editing and ADAR-editing may play an important role in HIV-1 infectivity. Time analysis shows that most genomes with abnormal mutation numbers comes from African countries. Finally, we use natural vector method to check the k-mer distribution changing patterns in the genome, and find that there is an important substitution pattern between nucleotides A and G, and 2-mer CG may have a significant impact on viral infectivity. This paper provides an insight into the single mutation of HIV-1 by using the latest data in the HIV sequence Database.

1 Introduction

Human immunodeficiency virus (HIV) is a virus that attacks the host’s immune system. It makes people more vulnerable to other diseases (Weiss, 1993). Without treatment, the two types of HIV (HIV-1 and HIV-2) infection will cause acquired immunodeficiency syndrome (AIDS). HIV-1 is more infective than HIV-2 and is the main cause of global HIV infection (Gilbert et al., 2003). According to the HIV statistics data of UNAIDS (Joint United Nations Programme on HIV/AIDS), about 84.2 million people have become infected with HIV and 40.1 million people have died from AIDS-related illnesses since the start of the epidemic. Approximately 1.5 million new HIV infections occurred last year. The COVID-19 pandemic led to disruptions to key HIV treatment and prevention services, millions of people are in danger, the emergency situation makes the effective HIV treatment and tools to prevent, detect and treat infections more significant (UNAIDS, 2022). HIV-1 is a single-stranded, positive-sense RNA retrovirus, its RefSeq (Accession number: NC_001802) length is 9181 bp. HIV has nine major genes (three structural genes: Gag, Pol, Env, two regulatory genes: Tat, Rev, and four accessory genes: Nef, Vif, Vpr, Vpu), encoding nineteen proteins. The mutation of nucleotide sequence encoding protein may affect the ability or efficiency of virus infectivity. Therefore, the identification of mutation type is of great significance for understanding the mechanism of virus infection.

The virus constantly mutates to adapt to the surroundings. Sequence mutation will cause the translocation, insertion, deletion and substitution of bases. The variation of nucleotides may result in codon changes, which may lead to changes in the coding sequences for proteins and their functions (Nishikura, 2010). Many mutations are minor and will not affect the transmission speed or infection degree, even make the virus less infectious. However, some genetic mutations are dangerous. They may cause the surface of the virus to look different from the original virus, leading to immune escape and stronger infectivity (Smyth et al., 2012). On the other hand, mutation is the driving force of species evolution. Evolution is a function of both mutations and selective pressures, eliminating unfavorable mutations and selecting for favorable mutations (Sun et al., 2022a). HIV-1 has high replication rate and is vulnerable to natural selection (Coffin, 1995). The high mutation rate creates genetic diversity and new virus subtypes, which poses a greater challenge to the development of vaccines.

There are many molecular mechanisms that contribute to HIV-1 replication rate such as the DNA repair process associated with the uracil DNA glycosylase (UNG) (Mansky et al., 2000) or the nuclear form of uracil DNA glycosylase (UNG2) (Chen et al., 2004), RNA editing process by deamination. There are two kinds of RNA editing mechanisms by deamination in human cells: APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) (Zheng et al., 2004) and ADAR (adenosine deaminases acting on double-stranded regions of RNA) (Nishikura, 2016). APOBEC-induced editing involves cytidine deaminase that deaminates a cytidine base into a uridine base (C-to-U mutation). ADAR enzyme deaminates adenine to inosine (A-to-I editing) and causes A-to-G mutation. HIV-1 RNAs contain several double-stranded regions such as the Rev responsive element, trans-activation responsive element and dimerization domain (Phuphuakrat et al., 2008; Doria et al., 2009). Many studies provide evidence that expression levels of the ADAR protein are responsible for A to G mutations in the HIV-1 genome and HIV-1 infection potential (Sharmeen et al., 1991; Clerzius et al., 2009; Weiden et al., 2014). Particularly, APOBEC3G, as a member of APOBEC protein family, induces G-to-A hypermutation in plus-stranded cDNA, which can be readily detected in infected persons; but the virus encodes the Vif protein to counteract APOBEC3G (Malim, 2009). So maybe the interaction of APOBEC3G and ADAR has the influence on the HIV-1 infectivity. In this paper, we hypothesize that gene editing via APOBEC and ADAR is a driving force for RNA viral evolution.

We analyze the 12 single-nucleotide polymorphisms (SNPs) types of 11897 HIV-1 complete genomes. SNP is a variation at a single position in a nucleotide sequence among individuals. Looking for SNPs can evaluate the viral infectivity mechanism (Single-nucleotide polymorphism / SNP). We observe that mutations A>G, G>A, C>T and T>C own nearly 64% ratio in non-unique case, which suggests that APOBEC and ADAR editing may drive the virus evolution. In addition, we investigate the distribution of 12 SNP types in different time, countries and subtypes to understand whether mutations have time, geographical location and subtype preferences. We also report the mutation preferences in local regions of the genome, i.e. k-mer. Finally, we use our previously proposed method, k-mer natural vector to identify and validate the mutation preferences in the genome.

2 Materials and methods

2.1 HIV-1 datasets analyzed

The complete genomes of HIV-1 analyzed in this study are downloaded from HIV sequence Database (https://www.hiv.lanl.gov), including 16844 sequences up to April 8, 2022. The sequences with degenerate bases (i.e. R, Y, M, K, S, W, H, B, V, D, N) and without time or country labels are deleted, and 11897 reliable genomes are retained. The accession number, sampling year, sampling country, subtype of each sequence can be found in the Supplementary Data.

2.2 Method overview

The genome sequences are aligned with reference sequence of HIV-1 (GenBank accession number is NC_001802, it sampled in France in 1983 (Barré-Sinoussi et al., 1983; Gallo et al., 1983), and sequence length is 9181 base pairs) using Clustal Omega with default parameters. Then the SNP types (A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) can be counted in the corresponding positions from the aligned genomes. For example, there are 4 sequences totally: RefSeq: ACGTGTGAC, Seq1: CCTTGTGAC, Seq2: TCGTTGTC, Seq3: CCGGGATAAC. The pairwise sequences are aligned as follows:

● The first pairwise sequence alignment:

www.frontiersin.org

● The second pairwise sequence alignment:

www.frontiersin.org

● The third pairwise sequence alignment:

www.frontiersin.org

We only consider those mutation positions without gaps. The non-unique mutation indicates that the same mutation in different genomes is counted repeatedly, and the unique mutation indicates that the same mutation in different genomes is only counted once. For above three alignments, the non-unique mutation numbers of A1C, A1T, A8T, G3T, T4G, and G7A are 2, 1, 1, 1, 1, 1, respectively, then the non-unique mutation numbers of A>C, A>T, G>T, T>G, and G>A are 2, 2, 1, 1, 1, respectively, and the other 7 SNP types’ number is 0. The unique mutation numbers of A1C, A1T, A8T, G3T, T4G, and G7A are 1, 1, 1, 1, 1, 1, respectively, then the non-unique mutation numbers of A>C, A>T, G>T, T>G, and G>A are 1, 2, 1, 1, 1, respectively, and the other 7 SNP types’ number is 0.

3 Results

3.1 Single mutation analysis

3.1.1 Global analysis

We first analyze 11897 HIV-1 aligned genomes globally. All genomes are aligned with the reference sequence of HIV-1, then the mutations are counted. The proportion of 12 SNP types is shown in Figure 1. The distribution of 12 non-unique mutations is displayed in Figure 1A. There are 10,972,545 single mutations for 11897 HIV-1 isolates, and each HIV-1 genome has about 922 mutations, which shows that the variation of HIV-1 is diverse (Fischer et al., 2021). The frequency of transition (A>G, G>A, C>T, T>C) is higher than that of transversion (A>C, C>A, A>T, T>A, G>C, C>G, G>T, T>G) (Collins and Jukes, 1994). If the mutation is random, the probability of each mutation is 8.33%, but the four mutations, A-to-G, G-to-A, C-to-T and T-to-C, accounts for 63.87% (=6898807/10972545). One potential explanation is that A>G, G>A, C>T may be driven by the APOBEC and ADAR editing, and these patterns have a significant impact on HIV-1 infectivity. The ratio of G>A is a little higher than that of A>G. It is noticed that ADAR induces A>G mutation and APOBEC3G induce G>A mutation in cDNA, but virus will fight back against the APOBEC3G editing mechanism using encoding protein (Malim, 2009). On the basis of these notifications, we predict that there may be a balance between these mechanisms or processes. The counts of pairwise mutation (A>G and G>A, C>T and T>C, A>C and C>A, A>T and T>A, G>C and C>G, G>T and T>G) are similar. This is because for a specific biological sequence, the proportion of each base in the whole sequence will not vary too much even if mutations happen. We also analyze the unique mutations in Figure 1B. There are 23803 unique mutations. A>G owns the highest ratio, C>G owns the lowest ratio (5.43%=1293/23803), and SARS-CoV-2 also has the lowest ratio of C>G (1.25%) (Wang et al., 2020).

FIGURE 1

Figure 1 (A) The distribution of 12 non-unique mutations counted from the 11897 HIV-1 aligned genomes. The non-unique mutation indicates that the same mutation in different genomes is counted repeatedly. (B) The distribution of 12 unique mutations counted from the 11897 HIV-1 aligned genomes. The unique mutation indicates that the same mutation in different genomes is only counted once. The number above the vertical bar is the count of corresponding mutation.

3.1.2 Time analysis

Whether the SNP of each HIV-1 genome changes over time is counted. The 11897 genomes are aligned with the reference sequence of HIV-1 first, and the non-unique single mutations of each genome is counted, as illustrated in Figure 2. The x-axis represents the year, y-axis represents total SNP count, and each dot represents a sequence. As time went on, the total number of mutations did not change significantly. The mutations in most genomes ranged from 500 to 1500, but 21 genomes have abnormal mutation numbers, which are distributed in 20 years (from 1991 to 2011, marked with No. 1 to No. 13). Of the 21 sequences, 9 are from Cameroon, 7 from France, 2 from Gabon, 1 from Senegal, 1 from Spain, and 1 from the United States. About half of these highly mutated sequences are located in African countries, but they are all distributed in Atlantic coastal countries (Supplementary Figure 1).

FIGURE 2

Figure 2 The distribution of SNP counts for each HIV-1 genome. Each genome with year label is aligned with the reference sequence of HIV-1, then the total SNP mutation of each genome is counted. Each dot represents a sequence. The total number of mutations in most sequences ranged from 500 to 1500 over time, but there are 21 genomes with abnormal mutation number, distributed in 13 years (from 1991 to 2011, marked with No. 1 to 13). No.- (Year, SNP Count): Accession Number| Patient Code| Country.

3.1.3 Geographic analysis

The distribution of 12 SNP types in different countries is analyze. The 11897 sequences are distributed in 78 countries, as listed in Supplementary Table S1. To better exhibit the quantitative distribution of HIV-1 genome, we plot them in Figure 3. There are many HIV-1 cases in the Atlantic coastal countries, and the transmission may be related to traffic and living environment. The top 20 countries with larger number of genome include South Africa, which has the largest population of people with HIV of any country in the world (Ritchie and Max, 2018); South and South East Asia (For example, Thailand), which is the second most affected (UNAIDS,); United States, where over 675 thousand people died of AIDS since the beginning of the HIV epidemic (Prevention, C.f.D.C.a, 2016); Cameroon, where HIV-1 appears to have originated (Sharp and Hahn, 2011), etc. Among the 20 countries, 9 countries or regions belong to Africa (South Africa, Rwanda, Zambia, Botswana, Cameroon, Malawi, Kenya, Uganda, Tanzania); 3 countries belong to America (United States, Brazil, Canada); 5 countries belong to Europe (Germany, United Kingdom, Belgium, Sweden, Cyprus); 3 countries belong to Asia (China, Thailand, South Korea).

FIGURE 3

Figure 3 The number of HIV-1 complete genomes in each country. The 11897 sequences in our dataset are distributed in 78 countries.

The distribution of 12 non-unique mutations in these top 20 countries is shown in Figure 4 and Supplementary Table S2. The ratios of A>G, G>A, C>T and T>C are the top four highest among the 12 SNP types in each country, which is further convinced that these kinds of mutations are driven by RNA-APOBEC and RNA-ADAR editing. The APOBEC/ADAR protein will lead to C>T/A>G mutation and inhibit replication of HIV (Weiden et al., 2014), reducing the infectivity virus. The pairwise mutations (A>G and G>A, C>T and T>C, A>C and C>A, A>T and T>A, G>C and C>G, G>T and T>G) among these 20 countries also present similar counts. The distribution of 12 unique mutations is described in Supplementary Figure S2. The mutation A>G is relatively higher than G>A.

FIGURE 4

Figure 4 The distribution of 12 non-unique mutations in the top 20 countries with the larger number of genomes.

3.1.4 Subtype analysis

The high mutations create many new HIV variants. During viral infection, a mutant genotype that is different from the infected genotype may be produced at any time. The proportion of 12 SNP types in different subtypes of HIV-1 is worth studied. HIV-1 is divided into a major group M (Taylor et al., 2008) and three minor groups O (D'Arc et al., 2015), N (Mourez et al., 2013) and possibly a group P (Plantier et al., 2009). The M group is subdivided further into subtypes: A, B, C, D, E, F, G, H, I, J, K, L (Hemelaar et al., 2006). There are also CRFs (circulating recombinant forms) from the recombination between different subtypes (Smith et al., 2005). For example, CRF01_AE is the recombination between subtypes A and E. Unique recombinant forms (URFs) are the viruses that have not been widely spread geographically (Fang et al., 2004). The study of the subtype diversity and high variability of HIV is of great significance for the diagnosis and treatment of AIDS. Supplementary Figure S3 to S4 and Supplementary Table S3 elucidate the distribution of 12 unique and non-unique mutations in the top 20 subtypes with the larger number of genomes. The mutation A>G, G>A, C>T and T>C are more likely to occur than other mutations.

3.2 Mutation preferences on sequence context

There may be some interaction between nucleotides, so understanding point mutations in local regions of the genome is of great biological significance. The single nucleotide indicates A, C, G, T (also known as 1-mer). Dinucleotide sequences indicate AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT (also known as 2-mer). Trinucleotide sequences consist of 4³ sequence segments (also known as 3-mer). The k-mer (k=1, 2, 3, 4) counts are shown in Supplementary Figure S5 to S8. There are only 4 letters in the genetic code (A, C, G, T), so the probability of any 1-mer (or 2-mer or 3-mer or 4-mer) is $\frac{1}{4}$ ( $\frac{1}{16}$ or $\frac{1}{64}$ or $\frac{1}{256}$ , respectively), but there seems to be more k-mer containing A. Therefore, studying the local spelling of the virus genetic code may be able to recode its genomes, thereby reducing its infectivity, which is helpful to develop safe vaccines (Takata et al., 2017).

Single mutation counts of 2-mers are shown in Figure 5, and we observe the following mutation patterns:

FIGURE 5

Figure 5 Single mutation counts of 2-mers. There are 16 dinucleotide sequences: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT. (A) Single non-unique mutation counts at the first position of 2-mers. (B) Single non-unique mutation counts at the second position of 2-mers. (C) Single unique mutation counts at the first position of 2-mers. (D) Single unique mutation counts at the second position of 2-mers.

● For non-unique mutation, there are 10,899,995 SNPs at the first position (Figure 5A), and 10,900,019 SNPs at the second position (Figure 5B).

● For unique mutation, there are 368,673 SNPs totally at the first position (Figure 5C), and 367,302 SNPs at the second position (Figure 5D).

● 2-mer AN or NA (N is one of A or C or G or T) for A>G mutation is the predominant context.

● 2-mer GN or NG for G>A mutation is the predominant context.

● 2-mer CH (H is one of A or C or T, the other degenerate base symbols can be found in Supplementary Table 4) or NC for C>T mutation is the predominant context.

● 2-mer TN or NT for T>C mutation is the predominant context.

● At the first position of CG, the count of C>A is slightly greater than that of C>T (Figures 5A, C). CG may have special significance. From Supplementary Figure S6, HIV-1 lacks CG sequence, and CG is less than other 2-mer. Due to evolutionary coincidence, the CG sequence will modify the letter C through chemical reaction, resulting in it being replaced by another letter. Therefore, the host may have some kind of cell monitoring system to recognize and destroy CG sequences, so as to prevent virus infection (Takata et al., 2017).

Single mutation counts of 3-mers are shown in Supplementary Figure S9 and Supplementary Figure S10, and we observed the following mutation patterns:

● 3-mer ANN or NAN or NNA (N is one of A or C or G or T) for A>G mutation is the predominant context.

● 3-mer GNN or NGN or NNG for G>A mutation is the predominant context.

● 3-mer CMN, CTD (M is either A or C, D is one of A or G or T) or ACH (H is one of A or C or T) or DNC for C>T mutation is the predominant context.

● 3-mer THN, TGY (Y is either C or T) or WNT, CWT, GHT (W is either A or T) for T>C mutation is the predominant context.

Single mutation counts of 4-mers are shown in Supplementary Figure S11 and Supplementary Figure S12, and we could observe the stable mutation patterns: A>G, G>A, C>T and T>C.

3.3 Identification and validation of mutation preferences using k-mer natural vector

The sequence comparison of the mutants is necessary, and the method of comparing molecular sequences in above analysis is based on alignment. The other approach is based on alignment-free, which mainly include methods based on k-mer frequency (Sims et al., 2009), the length of common substrings (Leimeister and Morgenstern, 2014), graphical representation (Jeffrey, 1990), micro-alignments (Yi and Jin, 2013), and the number of word matches (Bromberg et al., 2016). Our team proposed k-mer natural vector to compare genomic sequences (Deng et al., 2011), which has been successfully applied to many classification and phylogenetic tasks (Sun et al., 2021; Sun et al., 2022a; Sun et al., 2022b). K-mer natural vector characterizes the statistical distribution of k-mers. Mutations can cause the changes in k-mers distributions. We take 1-mer as an example and give the definition of 1-mer natural vector. For the genomic sequence S=s₁ s₂ s₃…s_n with length n, L={A,C,G,T or U} , the indicator functions are $w_{k} (s_{i}) = {\begin{matrix} 1, if s_{i} = k \\ 0, otherwise \end{matrix},$ and $w_{k l} (s_{i}) = w_{lk} (s_{i}) = {\begin{matrix} 1, if s_{i} = k or l \\ 0, otherwise \end{matrix}$ , where s_i,k,l∈L,i=1,2,3,…,n. The 1-mer natural vector with covariance component (1-mer NVC) is an 18-diemensional vector: $(n_{A}, n_{C}, n_{G}, n_{T}, μ_{A}, μ_{C}, μ_{G}, μ_{T}, D_{2}^{A}, D_{2}^{C}, D_{2}^{G}, D_{2}^{T}, C o v (A, C), C o v (A, G), C o v (A, T), Cov (C, G), C o v (C, T), C o v (G, T))$ , where n_k is the count of nucleotide k within sequence S: $n_{k} = \sum_{i = 1}^{n} w_{k} (s_{i})$ , μ_k is the average location of nucleotide k within sequence S: $μ_{k} = \sum_{i = 1}^{n} i \frac{w_{k} (s_{i})}{n_{k}}$ , $D_{2}^{k}$ is the second central moment of positions of nucleotide k within sequence S: $D_{2}^{k} = \sum_{i = 1}^{n} \frac{{(i - μ_{k})}^{2} w_{k} (s_{i})}{n_{k} n}$ , Cov (k, l) is the covariance between nucleotide k and nucleotide l within sequence S: $C o v (k, l) = \sum_{i = 1}^{n} \frac{[i - μ_{k}] [i - μ_{l}] w_{k l} (s_{i})}{n \sqrt{n_{k}} \sqrt{n_{l}}}$ . Similarly, 2-mer natural vector with covariance component (2-mer NVC) characterizes the statistical features of 2-mers, which is a $(4^{k} \cdot 3 + C_{4^{k}}^{2})$ -dimensional vector:

\begin{array}{l} (n_{AA}, n_{AC}, n_{A G}, n_{A T}, n_{C A}, n_{C C}, n_{C G}, n_{C T}, n_{GA}, n_{GC}, n_{GG}, n_{GT}, n_{TA}, n_{TC}, n_{TG}, n_{TT}, \\ μ_{AA}, μ_{AC}, μ_{A G}, μ_{A T}, μ_{C A}, μ_{C C}, μ_{C G}, μ_{C T}, μ_{GA}, μ_{GC}, μ_{GG}, μ_{GT}, μ_{TA}, μ_{TC}, μ_{TG}, μ_{TT}, \\ D_{2}^{AA}, D_{2}^{AC}, D_{2}^{AG}, D_{2}^{AT}, D_{2}^{CA}, D_{2}^{CC}, D_{2}^{CG}, D_{2}^{CT}, D_{2}^{GA}, D_{2}^{GC}, D_{2}^{GG}, D_{2}^{GT}, D_{2}^{TA}, D_{2}^{TC}, D_{2}^{TG}, D_{2}^{TT}, \\ Cov (AA, AC), Cov (AA, AG), Cov (AA, AT), Cov (AA, CA), Cov (AA, CC), \dots, Cov (TG, TT)) \end{array}

We use the alignment-free method, k-mer natural vector, to identify and validate the mutation preference conclusion obtained from the previous analysis. To achieve the goal, we first calculate the k-mer natural vector of HIV-1 RefSeq (NC_001802), and use the k-mer natural vectors of the rest sequences to subtract it. Mathematically, suppose sequence S₁ and sequence S₂ correspond to k-mer natural vectors NV₁ and NV₂ respectively, the nucleotide distribution difference (NDD) of the two sequences is described as NDD = NV_{2 –} NV₁. K-mer natural vector can measure the changes of sequence context distribution and then identify the mutations in the genomes (Sun et al., 2022a). The changes results of k-mer (k=1, 2, 3) distribution are shown in Figure 6; Supplementary Figure S13 and Supplementary Figure S14. We observe the following k-mer distribution changing patterns:

● The change of count of 1-mer G is the largest (mean=-71, std=70), and C is second-ranked. The change of count of 1-mer A is minimum (mean=-21, std=63). The change of average location of 1-mer A is the largest (mean=-115, std=127), and C is second-ranked. The change of second central moment of 1-mer A is the minimum (mean=0, std=127). The change of covariance between 1-mer C and G is the largest (mean=-58, std=53).

● The change of count of 2-mer AG is the largest (mean=-35, std=27), and the change of count of 2-mer GA is the second largest (mean=-27, std=23). The average location of 2-mer CG grows the most (mean=156, std=361). The second central moment of 2-mer AA grows the most (mean=15), but the change is small (std=13). The change of second central moment of 1-mer CG is the largest (std=130). The change of covariance between 2-mer AA and CG is the largest (std=145).

● The change of count of 3-mer AGA is the largest (mean=-19, std=14), and the change of count of 3-mer GAG is the second largest (mean=-18, std=11). The count of 3-mer AAA grows the most. The changes of average location of 3-mer GCG is the largest (mean=600, std=639).

FIGURE 6

Figure 6 The difference between the 1-mer NVC of each sequence in our dataset and the 1-mer NVC of the reference sequence (NC_001802). The 1-mer NVC of the RefSeq is calculated first, and the 1-mer NVCs of the rest sequence subtracted it.

The results of Figure 6, Supplementary Figure S13 and Supplementary Figure S14 give us reason to believe that the substitutions between nucleotide A and G, C and T are of great significance, and 2-mer CG may play an important role in viral infectivity. This analysis also shows the potential of natural vector in identifying the mutations in the genomes.

4 Discussion

4.1 Gene analysis

Furthermore, the ratios of 12 SNP types per gene are studied. HIV has nine major genes: Gag, Pol, Env, Tat, Rev, Nef, Vpr, Vpu, Vif. We extract each gene according to the gene location records of each genome in our dataset, then align the gene sequences with the gene sequence of reference sequence, and count the unique and non-unique mutations, as shown in Table 1. We can see that for each gene, mutation A>G has the highest ratio in unique case. In non-unique case, either A>G or G>A has the highest ratio among the 12 SNP types, which indicates the substitution of these two bases plays a vital role in viral infection.

TABLE 1A

Table 1A The ratios of 12 unique mutations per gene.

TABLE 1B

Table 1B The ratios of 12 non-unique mutations per gene.

4.2 Comparison of SIV mutations

HIV-1 is believed to originate in southern Cameroon (A country of West-central Africa) (Keele et al., 2006), and shows the high similarity with simian immunodeficiency virus (SIV, which infects wild chimpanzees or non-human primates) (Gao et al., 1999; Chahroudi et al., 2012). SIV has two types: SIVsmm in sooty mangabeys and SIVcpz in chimpanzees. A recent study on wild chimpanzee SIVcpz shows that infected chimpanzees would experience AIDS-like diseases. The late stage of SIV infection develops into SAIDS (Simian Acquired Immunodeficiency Syndrome), much like how HIV infection develops into AIDS. This inspires us to explore whether SIV and HIV-1 share the similar RNA editing behavior.

We download all 38 SIV complete genomes from https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html, and regrad SIV ViralProj15501 as the reference (GenBank accession number: NC_001549, sequence length is 9623 base pairs). The structure comparison of reference sequences of HIV-1 and SIV can be found in Supplementary Figure S15. Their genome structures are similar, both viruses have Vif, Nef, Tat, Env, Gag-Pol genes. We present the distribution of 12 unique and non-unique mutations in Supplementary Figure S16. Mutation A>G is the highest ranked unique mutation, the reversed mutations G>A is the highest ranked non-unique mutation, which indicates the ADAR-editing may also play a crucial role in SIV infection.

4.3 Further discussion

In this research, we hypothesize that gene editing via APOBEC (C>T) and ADAR (A>G) is a driving force for RNA viral evolution (Wang et al., 2020), and focus on the mechanisms of HIV-1 from statistical perspective. Knowing how it works is very complicated, it depends more on the biological or clinical experiments. Our analysis gives a statistical basis for the effect study of ADAR gene on HIV replication. Many details are worth discussing. First, we download all data from the public HIV sequence Database, they are reliable. But there may exist sampling biases. For example, some relatively backward regions with AIDS cases do not have the perfect technologies for sequencing and sampling, which will result in a lack of data. The conclusion will be more persuasive if more data can be added. Moreover, if more labels are recorded (for example, virus source: blood or cell), the credibility of the results will be further improved. Second, we compare all sequences against HIV-1 RefSeq, it is a thoughtful choice. Literatures record that HIV-1 is a lentivirus discovered in 1983 (Barré-Sinoussi et al., 1983; Gallo et al., 1983). The HIV-1 RefSeq (Accession number is NC_001802) was sampled in France in 1983, which is a very early time. It is reasonable that the other genome sequences are aligned with HIV-1 RefSeq in the field of bioinformatics. Third, we only study the point mutation in local regions of the genome, but it will be of great significance to discuss the specific regions or the direction of the k-mer change count, which can be considered in future study.

5 Conclusions

We count and analyze 12 types of point mutations (A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) from multiple perspectives. The global/geographical location/subtype/k-mer analysis results disclose that mutation A>G, G>A, C>T and T>C occupy the main superiority, which reveals the effectiveness of host APOBEC-editing and ADAR-editing. For non-unique mutations in population, the number of paired mutations (A>G and G>A, C>T and T>C, A>C and C>A, A>T and T>A, G>C and C>G, G>T and T>G) is similar. Time analysis displays that most genomes with abnormal mutation numbers are from African countries. In addition, our previously proposed method, k-mer natural vector, is applied to identify the k-mer distribution changing patterns in the genome. It is found that there is an important substitution pattern between nucleotides A and G, and 2-mer CG may be of very great significance for viral virulence.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author contributions

SS-TY conceived the project and designed the study. NS collected data and carried out the data analysis including figures drawing and wrote the preliminary version of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by National Natural Science Foundation of China (NSFC) grant (12171275) and Tsinghua University Education Foundation fund (042202008).

Acknowledgments

SS-TY is grateful to the National Center for Theoretical Sciences (NCTS) for providing an excellent research environment while part of this research was done.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcimb.2022.1033481/full#supplementary-material

References

Single-nucleotide polymorphism / SNP. Available at: https://web.archive.org/web/20151110112814/http://www.nature.com/scitable/definition/single-nucleotide-polymorphism-snp-295.

Google Scholar

Barré-Sinoussi, F., Chermann, J. C., Rey, F., Nugeyre, M. T., Chamaret, S., Gruest, J., et al. (1983). Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS). Science 220 (4599), 868–871. doi: 10.1126/science.6189183

PubMed Abstract | CrossRef Full Text | Google Scholar

Bromberg, R., Grishin, N. V., Otwinowski, Z. (2016). Phylogeny reconstruction with alignment-free method that corrects for horizontal gene transfer. PloS Comput. Biol. 12 (6), e1004985. doi: 10.1371/journal.pcbi.1004985

PubMed Abstract | CrossRef Full Text | Google Scholar

Chahroudi, A., Bosinger, S. E., Vanderford, T. H., Paiardini, M., Silvestri, G. (2012). Natural SIV hosts: showing AIDS the door. Science 335 (6073), 1188–1193. doi: 10.1126/science.1217550

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, R., Le Rouzic, E., Kearney, J. A., Mansky, L. M., Benichou, S.. (2004). Vpr-mediated incorporation of UNG2 into HIV-1 particles is required to modulate the virus mutation rate and for replication in macrophages. J. Biol. Chem. 279 (27), 28419–28425. doi: 10.1074/jbc.M403875200

PubMed Abstract | CrossRef Full Text | Google Scholar

Clerzius, G., Gelinas, J. F., Daher, A., Bonnet, M., Meurs, E. F., Gatignol, A. (2009). ADAR1 interacts with PKR during human immunodeficiency virus infection of lymphocytes and contributes to viral replication. J. Virol. 83 (19), 10119–10128.

PubMed Abstract | Google Scholar

Coffin, J. M. (1995). HIV Population dynamics in vivo: implications for genetic variation, pathogenesis, and therapy. Science 267 (5197), 483–489. doi: 10.1126/science.7824947

PubMed Abstract | CrossRef Full Text | Google Scholar

Collins, D. W., Jukes, T. H. (1994). Rates of transition and transversion in coding sequences since the human-rodent divergence. Genomics 20 (3), 386–396. doi: 10.1006/geno.1994.1192

PubMed Abstract | CrossRef Full Text | Google Scholar

D'Arc, M., Ayouba, A., Esteban, A., Learn, G. H., Boue, V., Liegeois, F., et al. (2015). Origin of the HIV-1 group O epidemic in western lowland gorillas. Proc. Natl. Acad. Sci. U.S.A. 112 (11), E1343–E1352. doi: 10.1073/pnas.1502022112

PubMed Abstract | CrossRef Full Text | Google Scholar

Deng, M., Yu, C., Liang, Q., He, R. L., Yau, S. S. T. (2011). A novel method of characterizing genetic sequences: Genome space with biological distance and applications. PloS One 6 (3), e17293. doi: 10.1371/journal.pone.0017293

PubMed Abstract | CrossRef Full Text | Google Scholar

Doria, M., Neri, F., Gallo, A., Farace, M. G., Michienzi, A. (2009). Editing of HIV-1 RNA by the double-stranded RNA deaminase ADAR1 stimulates viral infection. Nucleic Acids Res. 37 (17), 5848–5858. doi: 10.1093/nar/gkp604

PubMed Abstract | CrossRef Full Text | Google Scholar

Fang, G., Weiser, B., Kuiken, C., Philpott, S. M., Rowland-Jones, S., Plummer, F., et al. (2004). Recombination following superinfection by HIV-1. AIDS 18 (2), 153–159. doi: 10.1097/00002030-200401230-00003

PubMed Abstract | CrossRef Full Text | Google Scholar

Fischer, W., Giorgi, E. E., Chakraborty, S., Nguyen, K., Bhattacharya, T., Theiler, J., et al. (2021). HIV-1 and SARS-CoV-2: Patterns in the evolution of two pandemic pathogens. Cell Host Microbe 29 (7), 1093–1110. doi: 10.1016/j.chom.2021.05.012

PubMed Abstract | CrossRef Full Text | Google Scholar

Gallo, R. C., Sarin, P. S., Gelmann, E. P., Robert-Guroff, M., Richardson, E., Kalyanaraman, V. S, et al. (1983). Isolation of human T-cell leukemia virus in acquired immune deficiency syndrome (AIDS). Science 220 (4599), 865–867. doi: 10.1126/science.6601823

PubMed Abstract | CrossRef Full Text | Google Scholar

Gao, F., Bailes, E., Robertson, D. L., Chen, Y., Rodenburg, C. M., Michael, S. F., et al. (1999). Origin of HIV-1 in the chimpanzee pan troglodytes. Nature 397 (6718), 436–441. doi: 10.1038/17130

PubMed Abstract | CrossRef Full Text | Google Scholar

Gilbert, P. B., McKeague, I. W., Eisen, G., Mullins, C., Guéye-Ndiaye, A., Mboup, S, et al. (2003). Comparison of HIV-1 and HIV-2 infectivity from a prospective cohort study in Senegal. Stat Med. 22 (4), 573–593. doi: 10.1002/sim.1342

PubMed Abstract | CrossRef Full Text | Google Scholar

Hemelaar, J., Gouws, E., Ghys, P. D., Osmanov, S. (2006). Global and regional distribution of HIV-1 genetic subtypes and recombinants in 2004. AIDS 20, W13–W23. doi: 10.1097/01.aids.0000247564.73009.bc

PubMed Abstract | CrossRef Full Text | Google Scholar

Jeffrey, H. J. (1990). Chaos game representation of gene structure. Nucleic Acids Res. 18 (8), 2163–2170. doi: 10.1093/nar/18.8.2163

PubMed Abstract | CrossRef Full Text | Google Scholar

Keele, B. F., Heuverswyn, F. V., Li, Y., Bailes, E., Takehisa, J., Santiago, M. L., et al. (2006). Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science 313 (5786), 523–526. doi: 10.1126/science.1126531

PubMed Abstract | CrossRef Full Text | Google Scholar

Leimeister, C.-A., Morgenstern, B. (2014). Kmacs: the k -mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30 (14), 2000–2008. doi: 10.1093/bioinformatics/btu331

PubMed Abstract | CrossRef Full Text | Google Scholar

Malim, M. H. (2009). APOBEC proteins and intrinsic resistance to HIV-1 infection. Philos. Trans. R Soc. Lond B Biol. Sci. 364 (1517), 675–687. doi: 10.1098/rstb.2008.0185

PubMed Abstract | CrossRef Full Text | Google Scholar

Mansky, L. M., Preveral, S., Selig, L., Benarous, R., Benichou, S. (2000). The interaction of vpr with uracil DNA glycosylase modulates the human immunodeficiency virus type 1 In vivo mutation rate. J. Virol. 74 (15), 7039–7047. doi: 10.1128/JVI.74.15.7039-7047.2000

PubMed Abstract | CrossRef Full Text | Google Scholar

Mourez, T., Simon, F., Plantier, J. C. (2013). Non-m variants of human immunodeficiency virus type 1. Clin. Microbiol. Rev. 26 (3), 448–461. doi: 10.1128/CMR.00012-13

PubMed Abstract | CrossRef Full Text | Google Scholar

Nishikura, K. (2010). Functions and regulation of RNA editing by ADAR deaminases. Annu. Rev. Biochem. 79, 321–349. doi: 10.1146/annurev-biochem-060208-105251

PubMed Abstract | CrossRef Full Text | Google Scholar

Nishikura, K. (2016). A-to-I editing of coding and non-coding RNAs by ADARs. Nat. Rev. Mol. Cell Biol. 17 (2), 83–96. doi: 10.1038/nrm.2015.4

PubMed Abstract | CrossRef Full Text | Google Scholar

Phuphuakrat, A., Kraiwong, R., Boonarkart, C., Lauhakirti, D., Lee, T. H., Auewarakul, P., et al. (2008). Double-stranded RNA adenosine deaminases enhance expression of human immunodeficiency virus type 1 proteins. J. Virol. 82 (21), 10864–10872. doi: 10.1128/JVI.00238-08

PubMed Abstract | CrossRef Full Text | Google Scholar

Plantier, J. C., Leoz, M., Dickerson, J. E., De Oliveira, F., Cordonnier, F., Lemee, V., et al. (2009). A new human immunodeficiency virus derived from gorillas. Nat. Med. 15 (8), 871–872. doi: 10.1038/nm.2016

PubMed Abstract | CrossRef Full Text | Google Scholar

Prevention, C.f.D.C.a (2016). Today's HIV/AIDS epidemic factsheet. Available at: https://www.cdc.gov/nchhstp/newsroom/docs/factsheets/todaysepidemic-508.pdf.

Google Scholar

Ritchie, H., Max, R. (2018). HIV / AIDS. Our World Data. Available at: https://ourworldindata.org/hiv-aids.

Google Scholar

Sharmeen, L., Bass, B., Sonenberg, N., Weintraub, H., Groudine, M. (1991). Tat-dependent adenosine-to-inosine modification of wild-type transactivation response RNA. Proc. Natl. Acad. Sci. 88 (18), 8096–8100. doi: 10.1073/pnas.88.18.8096

CrossRef Full Text | Google Scholar

Sharp, P. M., Hahn, B. H. (2011). Origins of HIV and the AIDS pandemic. Cold Spring Harb. Perspect. Med. 1 (1), a006841. doi: 10.1101/cshperspect.a006841

PubMed Abstract | CrossRef Full Text | Google Scholar

Sims, G. E., Jun, S. R., Wu, G. A., Kim, S. H. (2009). Whole-genome phylogeny of mammals: Evolutionary information in genie and nongenic regions. Proc. Natl. Acad. Sci. United States America 106 (40), 17077–17082. doi: 10.1073/pnas.0909377106

CrossRef Full Text | Google Scholar

Smith, D. M., Richman, D. D., Little, S. J. (2005). HIV Superinfection. J. Infect. Dis. 192 (3), 438–444. doi: 10.1086/431682

PubMed Abstract | CrossRef Full Text | Google Scholar

Smyth, R. P., Davenport, M. P., Mak, J. (2012). The origin of genetic diversity in HIV-1. Virus Res. 169 (2), 415–429. doi: 10.1016/j.virusres.2012.06.015

PubMed Abstract | CrossRef Full Text | Google Scholar

Sun, N., Pei, S., He, L., Yin, C., He, R. L., Yau, S. S. T. (2021). Geometric construction of viral genome space and its applications. Comput. Struct. Biotechnol. J. 19, 4226–4234. doi: 10.1016/j.csbj.2021.07.028

PubMed Abstract | CrossRef Full Text | Google Scholar

Sun, N., Yang, J., Yau, S. S. T (2022a). Identification of HIV rapid mutations using differences in nucleotide distribution over time. Genes (Basel) 13 (2), 170. doi: 10.3390/genes13020170

PubMed Abstract | CrossRef Full Text | Google Scholar

Sun, N., Zhao, X., Yau, S. S. T. (2022b). An efficient numerical representation of genome sequence: natural vector with covariance component. PeerJ 10, e13544. doi: 10.7717/peerj.13544

PubMed Abstract | CrossRef Full Text | Google Scholar

Takata, M. A., Goncalves-Carneiro, D., Zang, T. M., Soll, S. J., York, A., Blanco-Melo, D., et al. (2017). CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature 550 (7674), 124–127. doi: 10.1038/nature24039

PubMed Abstract | CrossRef Full Text | Google Scholar

Taylor, B. S., Sobieszczyk, M. E., McCutchan, F. E., Hammer, S. M. (2008). The challenge of HIV-1 subtype diversity. N Engl. J. Med. 358 (15), 1590–1602. doi: 10.1056/NEJMra0706737

PubMed Abstract | CrossRef Full Text | Google Scholar

UNAIDS Fact sheet – latest statistics on the status of the AIDS epidemic. Available at: www.unaids.org.

Google Scholar

UNAIDS (2022) Global HIV & AIDS statistics — fact sheet. Available at: https://www.unaids.org/en/resources/fact-sheet.

Google Scholar

Wang, R., Hozumi, Y., Zheng, Y. H., Yin, C., Wei, G. W. (2020). Host immune response driving SARS-CoV-2 evolution. Viruses 12 (10). doi: 10.3390/v12101095

CrossRef Full Text | Google Scholar

Weiden, M. D., Hoshino, S., Levy, D. N., Li, Y., Kumar, R., Burke, S. A., et al. (2014). Adenosine deaminase acting on RNA-1 (ADAR1) inhibits HIV-1 replication in human alveolar macrophages. PloS One 9 (10), e108476. doi: 10.1371/journal.pone.0108476

PubMed Abstract | CrossRef Full Text | Google Scholar

Weiss, R. A. (1993). How does HIV cause AIDS? Science 260 (5112), 1273–1279. doi: 10.1126/science.8493571

PubMed Abstract | CrossRef Full Text | Google Scholar

Yi, H., Jin, L. (2013). Co-Phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 41 (7), e75–e75. doi: 10.1093/nar/gkt003

PubMed Abstract | CrossRef Full Text | Google Scholar

Zheng, Y. H., Irwin, D., Kurosu, T., Tokunaga, K., Sata, T., Peterlin, B. M. (2004). Human APOBEC3F is another host factor that blocks human immunodeficiency virus type 1 replication. J. Virol. 78 (11), 6073–6076. doi: 10.1128/JVI.78.11.6073-6076.2004

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: mutation, HIV, SNP, natural vector, nucleotide distribution

Citation: Sun N and Yau SS-T (2022) In-depth investigation of the point mutation pattern of HIV-1. Front. Cell. Infect. Microbiol. 12:1033481. doi: 10.3389/fcimb.2022.1033481

Received: 31 August 2022; Accepted: 25 October 2022;
Published: 15 November 2022.

Edited by:

ShiXing Tang, Southern Medical University, China

Reviewed by:

Redmond Patrick Smyth, Helmholtz Institute for RNA-based Infection Research (HIRI), Germany
Alessandra Borsetti, National Institute of Health (ISS), Italy

Copyright © 2022 Sun and Yau. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Stephen S.-T. Yau, eWF1QHVpYy5lZHU=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.