- ImmunityBio, Inc., Culver City, CA, United States
To provide a unique global view of the relative potential for evasion of CD8+ and CD4+ T cells by SARS-CoV-2 lineages as they evolve over time, we performed a comprehensive analysis of predicted HLA-I and HLA-II binding peptides in Spike (S) and Nucleocapsid (N) protein sequences of all available SARS-CoV-2 genomes as provided by NIH NCBI at a bi-monthly interval between March and December of 2021. A data supplement of all B.1.1.529 (Omicron) genomes from GISAID in early December was also used to capture the rapidly spreading variant. A key finding is that throughout continued viral evolution and increasing rates of mutations occurring at T-cell epitope hotspots, protein instances with worst-case binding loss did not become the most frequent for any Variant of Concern (VOC) or Variant of Interest (VOI) lineage; suggesting T-cell evasion is not likely to be a dominant evolutionary pressure on SARS-CoV-2. We also determined that throughout the course of the pandemic in 2021, there remained a relatively steady ratio of viral variants that exhibit conservation of epitopes in the N protein, despite significant potential for epitope loss in S relative to other lineages. We further localized conserved regions in N with high epitope yield potential, and illustrated heterogeneity in HLA-I binding across the S protein consistent with empirical observations. Although Omicron’s high volume of mutations caused it to exhibit more epitope loss potential than most frequently observed versions of proteins in almost all other VOCs, epitope candidates across its most frequent N proteins were still largely conserved. This analysis adds to the body of evidence suggesting that N may have merit as an additional antigen to elicit immune responses to vaccination with increased potential to provide sustained protection against COVID-19 disease in the face of emerging variants.
Introduction
The ability to understand and predict the potential of new SARS-CoV-2 variants to escape immune responses elicited by previous lineages is vital to meeting and overcoming the challenge of the COVID-19 pandemic. Amidst the multitude of parallel efforts to characterize the adaptive immune response to the virus (1–3) and the potential for it to be sustained against variants, many have highlighted T-cell responses as playing an important role in disease control, resolution, and immunity (4–10). Underscoring the importance of T cells, earlier reports revealed that CD4+ and CD8+ T-cell responses against the Spike (S), Membrane (M), and Nucleocapsid (N) proteins of original SARS-CoV persisted for up to 11 years post infection (11) and N-specific T-cell reactivity to 17 years post infection (12).
Persistence of T-cell reactivity against viral proteins does not, however, indicate that such pre-existing cellular immunity will be sufficiently protective against highly mutated viral variants. This risk has come to the forefront in the present COVID-19 pandemic because SARS-CoV-2 variants more transmissible than the original strain have emerged at a rapid rate and spread throughout the globe.
There is some encouraging evidence that, due to the broad CD4+ and CD8+ T-cell responses against multiple SARS-CoV-2 structural proteins (7, 10, 13–15), it is unlikely that variants will escape the majority of T-cell immunity conferred by prior infection or vaccination. In a study of peripheral blood mononuclear cells (PBMCs) from 57 recovered or vaccinated subjects, Tarke et al. (16) reported that on average 93% of CD4+ and 97% of CD8+ T-cell epitopes verified from the reference SARS-CoV-2 genome (14) were conserved across genomes selected to represent 4 Variants of Concern (VOC). Across the 4 variants, these conserved epitopes were found to account for an average of 91.5% and 98.1% of CD4+ and CD8+ T-cell recognition, respectively. The authors went on to demonstrate that COVID-19 patients showed no significant difference in T-cell reactivity to peptide pools spanning 4 VOC lineages compared to peptides sourced from the SARS-CoV-2 reference/ancestral genome (NCBI Reference Sequence: NC_045512). This was true not only for peptide pools specific to S, but also those spanning proteins across the whole viral genome. Individuals who had received mRNA vaccines showed some diminished T-cell response to S B.1.351 and B.1.427/B.1.429 variant peptide pools, but overall, responses were considered robust. Conversely, Lucas et al. (17) found decreased CD8+ (but not CD4+) T-cell responses against S peptides from the P.1 lineage compared to peptides from the reference/ancestral lineage in individuals who had received mRNA vaccines. Further, in recovered COVID-19 patients, Agerer et al. (18) confirmed that single mutations in CD8+ T-cell epitopes led to diminished HLA-I binding, a weaker T-cell response, and ineffective cytotoxicity in HLA-matched COVID-19 patients. Thus, mutations in SARS-CoV-2 may evade aspects of the adaptive immune response at the narrow scope of specific epitopes and HLAs.
Additionally, individuals or populations with significantly different HLA phenotypes than those present in the referenced study cohorts may exhibit different levels of evasion risk. This may be one possible explanation for conflicting findings across studies (16, 17). As another example, even within a geographically limited cohort, CD8+ T-cell epitopes were found to vary significantly based on patient HLA-I repertoires (14).
Even if the risk for T-cell evasion by variants has not, thus far, appeared to be great, given the emergence of highly-mutated new variants such as B.1.1.529 (Omicron) (19), continued monitoring of the potential impact on T-cell reactivity is warranted. Our approach to understanding the implications of viral evolution and provide a comprehensive view of CD4+ and CD8+ evasion potential is the assessment of the dynamics of epitope – and specifically predicted HLA binding - changes with time. HLA presentation is an essential precursor to T-cell recognition, and recent results have demonstrated that immunodominant CD4+ T-cell epitopes specific to SARS-CoV-2 correlate with HLA binding promiscuity (14). We analyzed both the S and N peptidomes, motivated by next generation vaccines currently in development that include both S and N antigens (20–22) and findings discussed above indicating broad T-cell response across structural proteins.
We identified viral protein regions predicted to be the highest frequency sources of HLA-I and HLA-II binding peptides, and tracked the evolution of binding loss (and thus latent potential for T-cell evasion) across all available viral genomes at bi-monthly time points throughout 2021. Narrowing our focus to specific protein locations with peak frequencies of potential epitopes, that is, potential epitope hotspots, improved robustness to noise when ranking viral variants according to risk of epitope loss, and simplified identification of critical regions or HLAs most impacted by epitope loss.
Our predictive analysis differs from that of others by focusing on the evolution of SARS-CoV-2 viral lineages, integrating results across a broad range of peptide lengths, considering impacts of observed mutation co-occurrence by analyzing full protein sequences versus independently considering key mutations, demonstrating efficacy of CD4+ T-cell epitope predictions, and by offering a distinct approach to capturing a comprehensive set of HLA-I and HLA-II alleles. In an effort to address fairness in studying the potential impact of SARS-CoV-2 variants on world populations, we sampled a representative set of HLA-I and HLA-II alleles based not only on documented frequency of appearance, but also leveraged learned embeddings to capture functionally similar HLA clusters. These functional groupings were then used in the hope of ensuring that a uniquely behaving set of HLAs was not excluded due to absence in majority populations.
Access to functional embeddings was enabled by in-house neural networks trained to predict peptide binding to HLA-I and HLA-II molecules. In our tests, the models compared favorably to the state-of-the-art (NetMHCpan-4.1), and opened the door to additional capabilities including abstaining from classification in cases of ambiguity, and the ability to visualize and leverage intermediate representations of HLAs or peptides.
Over the time period analyzed, there remained a relatively steady ratio of viral variants that exhibited conservation of N epitopes among lineages exhibiting most potential for epitope loss in S. VOC lineages generally had greater overall loss in binding for both HLA-I and HLA-II than VOI lineages across both S and N proteins. This correlates with the expectation of increased potential for immune evasive viral variants to emerge over time as the pandemic progresses, highlights that lineages exhibiting more rapid spread also have more opportunity to develop evasion, and underscores the critical role of genomic surveillance to uncovering how the immune landscape of viral variants may be shifting. Interestingly, we also found that within evolutionary trajectories of VOC and VOI lineages, protein instances exhibiting worst-case binding loss have not become the most frequent versions of S or N for any lineage; which suggests T-cell evasion may not be a dominant evolutionary pressure on SARS-CoV-2. This observation held true throughout the initial surge of the Omicron variant, which despite its high rate of mutations in S compared to other VOCs, also demonstrated epitope conservation in the N protein. Finally, we used our analyses to identify conserved regions in N most likely to yield immunodominant CD4+ and CD8+ T-cell epitopes, and illustrate heterogeneity of CD8+ T-cell epitope loss potential in the S protein across our comprehensive set of HLAs.
We provide our snapshots of the evolution of the SARS-CoV-2 peptidome as interactive visualizations at: https://research.immunitybio.com/scov2_epitopes/. The interactive visualization link also includes in-depth examples demonstrating localization of potential epitope loss for most frequent versions of S and N proteins corresponding to VOC lineages.
Results
Localization of Potential Epitope Hotspots on Viral Proteins
We observe that when HLA binding predictions are evaluated at all possible peptides in a sliding window along a source protein (see Methods), certain regions exhibit an increased frequency of predictions with higher binding probability relative to elsewhere on the same protein. When overlapping peptides agree on higher probabilities of predicted binding, the chance that one or more HLA binders do in fact exist within their region of overlap is increased. This observation was leveraged to narrow the search space of our analysis only to protein regions most likely to impact CD8+ or CD4+ T-cell responses. Potential epitope hotspot selection serves to improve robustness to noise (from protein locations less likely to be impactful to immune response) when globally comparing viral variants, and enables quick localization of where potential epitope loss is most significant when more resolution is desired.
For each protein considered, hotspot regions were identified for a representative set of HLA-I only, HLA-II only, and an aggregate that integrated all HLA-I and HLA-II binding prediction information to create a map of pan-HLA potential epitope hotspots. Representative sets of HLA-I and HLA-II alleles used to inform hotspot selection were chosen by combining both global allele frequency information as well as functional clustering based on embeddings from our trained HLA binding prediction models, to ensure rare but functionally unique HLAs were represented. Potential epitope hotspot regions are shown in Figure 1 for both S and N proteins of the SARS-CoV-2 reference genome, NCBI Reference Sequence NC_045512 (23). Hotspot regions identified for four additional reference proteins highlighted by Saini et al. (15) as containing the top T-cell epitope counts are also shown in Supplementary Figure S1: Membrane (M), Envelope (E), Open Reading Frame 1ab (ORF1ab) and ORF3a. Despite not being a major epitope contributor, E was included in the latter list so that all structural proteins were included in validation.
Figure 1 Epitope hotspots in SARS-CoV-2 proteins. Protein regions with peak frequency of predicted binding peptides (potential epitope hotspots) across HLAs are indicated in red for our key proteins of interest from the SARS-CoV-2 reference genome (NCBI Reference Sequence: NC_045512): Spike (A, C, E) and Nucleocapsid (B, D, F). Red lines (pooled max score) show the value of the nearest maxima of the aggregate signal (avg bind score in blue) within a set sliding window size (9 amino acids for HLA-I, 15 for HLA-II, 12 for pan-HLA). For each protein we show hotspots on aggregate signals across all HLA-I molecules only (A, B), HLA-II only (C, D), as well as the combined pan-HLA signal (E, F). The legend in panel (B) applies to all panels. Aggregate signals (avg bind score in the legend) are obtained by a filtered averaging of predicted binding values across our representative HLA set. See Methods for details.
S was found to have 18 HLA-I binding hotspots, 8 HLA-II binding hotspots, and 12 pan-HLA binding hotspots (Figures 1A, C, and E); whereas the much shorter N protein featured 4 HLA-I hotspots, 3 HLA-II hotspots, and 4 pan-HLA hotspots (Figures 1B, D, and F). Note that pan-HLA hotspots are not a simple union of HLA-I and HLA-II hotspots, so regions that satisfied HLA-I or HLA-II hotspot criteria independently may not satisfy pan-HLA hotspot thresholds upon fusion of predictions (see Methods for further details).
Several notable locations on the S protein fall within our hotspot locations. For example, the receptor binding domain (RBD) of S that interfaces with host cell angiotensin-converting enzyme 2 (ACE2) (24) overlaps pan-HLA hotspots 4, 5; HLA-I hotspots 6, 7, 8; and HLA-II hotspot 3. From among all the Centers for Disease Control (CDC) listed substitutions of therapeutic concern (implicated in contributing to increased transmissibility, severity, or reduced susceptibility to therapies) based on data up to July 31, 2021 (25); only E484K does not fall within any hotspots. For all others: L452R falls within HLA-I hotspot 7; K417N/K417T is in pan-HLA hotspot 8, HLA-I hotspot 6, HLA-II hotspot 3; and N501Y falls in pan-HLA hotspot 5 and HLA-I hotspot 8.
In alignment with our analysis, response frequencies from empirically confirmed epitopes aggregated across 25 studies spanning 1197 human subjects (9) aligned well with our predictions (Figure 2 and Table 1). Further, epitopes verified at the scope of individual studies such as by Saini et al. (15) were shown to consistently be in zones that align with our hotspot predictions, and were consistently assigned high HLA binding confidence values (Supplementary Tables S1 and S2).
Figure 2 Epitope response frequency correlates with aggregated binding predictions in SARS-CoV-2 proteins. For CD8+ T-cell epitopes collected from across 25 studies by Grifoni et al. [9] we found position-specific response frequency (RF) (dash red) and RF lower bound (95% confidence interval) averaged with a 10 amino acid sliding window (solid red) correlated with our aggregated HLA-I (turquoise) and pan-HLA (purple) binding prediction scores in the S (A) and N (B) proteins. CD4+ T-cell epitope RF (dash blue) and RF lower bound (solid blue) also correlated with HLA-II (green) and pan-HLA aggregated predictions for S (C) and N (D). Legend in (A) applies to (B) legend in (C) applies to (D).
Table 1 Correlation and intersection at peaks of epitope response frequency with aggregated HLA binding predictions.
Verified T-Cell Epitope Response Frequencies Correlate With Aggregated Predictions
Grifoni et al. (9) collected confirmed CD8+ and CD4+ epitopes across 25 studies and used the Immunome Browser (26, 27) hosted by IEDB [https://www.iedb.org/] (28) to compute position-specific T-cell response frequency (RF) along with confidence interval (CI) for key viral proteins. The RF values captured the number of individuals and assays reporting positive responses to a peptide including that particular residue. Using their provided data, we replicated the RF calculation to validate alignment with our aggregate HLA binding predictions and epitope hotspot locations for both the S and N proteins (Figures 2A–D).
In N, the CD8+ RF lower bound (CI at 95%) correlated with our aggregated HLA-I predictions with Spearman rs = 0.522 (p = 1.46e-30); and with our pan-HLA aggregate with rs = 0.507 (p = 1.14e-28). The correlation of the CD4+ RF lower bound in N was at rs = 0.560 (p = 6.27e-36) and rs = 0.634 (p = 2.51e-48) with our HLA-II and pan-HLA predictions, respectively. For the S protein, the CD8+ RF lower bound correlates with our HLA-I and pan-HLA aggregated predictions at rs = 0.412 (p = 2.51e-53) and rs = 0.236 (p = 1.46e-17); and the CD4+ RF lower bound correlates with our HLA-II and pan-HLA predictions at rs = 0.201 (p = 4.21e-13) and rs = 0.302 (p = 2.65e-28).
Note that the lower bound CI for RF is a function of number of subjects for which a certain protein position was interrogated. Thus we do not expect perfect correlation due to non-uniform coverage across studies, as well as the understanding that HLA binding is necessary but not sufficient for T-cell response. Further, the HLA composition of aggregated studies (9) is limited by the regions in which each study was performed, and thus not necessarily reflective of the same global distribution used to inform our HLA set. However, these correlations indicate that when binding predictions are aggregated across a diverse set of HLAs and peptide lengths, the resulting representation of binding promiscuity is useful for predicting immunodominant protein regions yielding peak T-cell epitope frequencies across large samples of subjects.
Although correlation of aggregated HLA predictions with RF lower bound was lower in S than N, significant peaks of the RF were effectively captured in both proteins even when relative magnitudes differed. Employing the same methodology used to select our potential epitope hotspots on the unsmoothed RF lower bound, we found that the majority of regions identified based on the RF signal were indeed covered by our predicted epitope hotspot regions (Table 1).
To layer insight from empirical epitope evidence across studies on top of our predictive analysis, we ranked our predicted epitope hotspots according to the maximum RF lower bound within each hotspot range (Figure 3). This helped interpret our subsequent results with an observation driven approach to prioritizing the impact of HLA binding changes that lead to potential for epitope loss.
Figure 3 Observation-driven prioritization of epitope hotspots. Shown are our potential epitope hotspots based on binding predictions aggregated across HLAs and peptide lengths, sorted according to the maximum position-specific response frequency (RF) lower bound [9] within their ranges. We ranked HLA-I hotspots according to CD8+ RF (A, B), HLA-II hotspots according to CD4+ RF (C, D), and pan-HLA hotspots according to CD8+ RF (E, F), for S and N, respectively, for each. Each plot illustrates the maximum RF within hotspot ranges, as well as the max of both the 95% confidence interval upper and lower bounds.
Verified CD8+ T-Cell Epitopes Coincide With Predicted Epitope Hotspots
We also validated predictions against an individual study in which DNA-barcoded peptide-MHC complex (pMHC) multimers were used to identify 122 unique epitopes recognized by SARS-CoV-2-specific CD8+ T cells across 10 HLA-I molecules in 18 COVID-19 patients (15). Of the total unique epitopes from across the viral genome, 118 unique epitopes (119 unique pMHC complexes) were found to originate from the set of 6 proteins considered in Figure 1 and Supplementary Figure S1. The most frequent sources of epitopes reported were the ORF1, S, and ORF3 proteins. Further, 4 immunodominant epitopes (recognized by T cells in > 50% of analyzed patients) were found to come from ORF1 and one from N.
For all proteins illustrated in Figures 1 and S1 except ORF3a, the majority of verified epitopes occurred at our pan-HLA potential epitope hotspots (Supplementary Table S1). As previously shown with NetMHCpan-4.1 rank score results (15), we also verified that immunogenic pMHC complexes were dominated by high-confidence binding predictions from our system (Supplementary Table S2). Following the prior example of Saini et al. (15), we reported Mann-Whitney test p-values when comparing predictions for immunogenic pMHCs versus those that did not trigger immune response. Due to the equivalence of the Mann-Whitney test with ROC AUC, the latter was also included for clarity. Note that both our classifiers and NetMHCpan-4.1 were trained to predict MHC binding, whereas the ROC AUC scores in Supplementary Table S2 reflect how much predicted binding contributes to separability of immunogenic vs. non-immunogenic peptides within the 18 patient samples in the study.
If hotspot selection had no benefit for the identification of immunogenic peptides, then restricting the set of non-immunogenic examples to only those outside of pan-HLA hotspots when computing ROC AUC (or the Mann-Whitney test) would not impact the score. However, we found that in these scenarios, ROC AUC consistently increased (Supplementary Table S2); indicating that hotspot selection is in fact a helpful heuristic for rejecting peptides likely to be non-immunogenic.
Notably, when considering the S and N proteins on which we focused our surveillance, the impact of pan-HLA hotspot selection enhanced separability from an ROC AUC of 0.5914 (when all RNN predictions were considered), to 0.6366 after negatives were restricted to samples outside of hotspot locations. As a baseline comparison on these two proteins: NetMHCpan-4.1 rank values (with no hotspot selection) led to an ROC AUC of 0.5406.
Since MHC binding does not guarantee immunogenicity, it is not surprising that at individual peptide resolution these scores were low compared to what one might expect when a classifier is trained and tested on the same task. It is precisely because individual peptide-MHC binding predictions are not directly predictive of immunogenicity - but are a prerequisite - that we focus on a holistic perspective of binding loss at critical locations as an approach to track and highlight viral variants with most potential for T-cell epitope loss.
Mutations in S and N Protein Across SARS-CoV-2 Genomes Frequently Overlap Potential Epitope Hotspots
At bi-monthly time points throughout 2021, a comprehensive analysis was performed of HLA-I and HLA-II binding predictions for S and N proteins across all available SARS-CoV-2 genomes provided by NIH NCBI. In order to focus on impactful variations and minimize erroneous samples, only proteins whose amino acid sequences were observed at least 3 times were considered.
Potential epitope hotspots were mapped from the reference protein sequences to each unique version of S and N to enable direct comparison of predicted binder counts at hotspots. Figure 4A and B summarizes the fraction of unique versions of the S and N proteins, respectively, that are impacted by hotspot mutations from the reference strain, NCBI Reference Sequence NC_045512 (23). This alignment led to the observation that a majority of unique versions of both S and N proteins included at least one mutation at a T-cell epitope hotspot region. For example, in data from July 27, there were 3752 unique versions of the S protein, of which a fraction of 0.868 had mutations at HLA-I hotspots, 0.686 had mutations at HLA-II hotspots, and 0.779 had mutations at pan-HLA hotspots. At the earlier time point of March 19th, there were only 1081 unique versions of S and a fraction of 0.691 had HLA-I hotspot mutations, 0.398 had HLA-II hotspot mutations, and 0.563 had pan-HLA hotspot mutations. In the first half of 2021, as the pandemic progressed, the fraction of unique proteins that included mutations impacting hotspots continued to increase rapidly. It finally plateaued in the latter half of 2021, at a state where an overwhelming majority of unique proteins included mutations at epitope hotspots.
Figure 4 Unique protein count and fraction of unique proteins with hotspot mutations over time for Spike (S) and Nucleocapsid (N) proteins. For (A) S and (B) N proteins, the unique protein count (right y-axis) and fraction of unique proteins with mutations at HLA-I, HLA-II, and pan-HLA hotspots (left y-axis) are shown. Unique protein counts included all versions of a viral protein whose amino acid sequence appeared at least 3 times in the viral genomes available through NIH NCBI at bi-monthly time points throughout 2021. Every unique protein that included at least one mutation relative to the SARS-CoV-2 reference genome occurring within an HLA binding hotspot was counted in the fraction of unique proteins with hotspot mutations. For (C–F), total sample count (right y-axis) and the number of unique versions (left y-axis) of S (C, E) and N (D, F) proteins classified as VOC (C, D) or VOI (E, F) are shown at each data sample time. December 6 included November 26 data plus supplementary genomes from GISAID to capture the emergence of B.1.1.529 (WHO: Omicron). Average bind loss per HLA-I (G-J) and HLA-II (K-N) were tracked over time where lineages were represented by their worst-case bind loss (solid lines) or most frequently occurring versions of S and N (dashed lines). VOC lineages: B.1.1.7 (Alpha), B.1.351 (Beta), B.1.617.2 (Delta), P.1 (Gamma), and B.1.1.529 (Omicron); VOI lineages: B.1.427 and B.1.429 (Epsilon), B.1.525 (Eta), B.1.526 (Iota), and B.1.617.1 (Kappa).
A noteworthy counter example to the above trend are versions of the N protein with mutations at HLA-II hotspots; the fraction of which has stayed steady throughout 2021. The key distinction between HLA-II hotspots identified in N is that regions in the middle and end of the protein that were prominently predicted as promiscuous HLA-I binders (and included in HLA-I and pan-HLA hotspots) were predicted to have minor success binding with HLA-II, and thus did not pass hotspot selection thresholds for class II (Figures 1B, D, F). The minor contribution of T-cell epitopes from regions omitted in our HLA-II hotspots was confirmed by empirical CD4+ epitope response frequency data (Figures 3D and F). Overall, we did not observe the emergence of mutations in the N protein over the course of 2021 that significantly affected predicted N-directed CD4+ T-cell epitope repertoires.
We also compared mutation frequency within versus outside of hotspot regions. In November 26 data, the ratio of mutation frequency in the S protein at pan-HLA hotspots versus outside of pan-HLA hotspots was 0.563, and 0.490 when normalized for length. For HLA-I and HLA-II these ratios were 1.094 (0.933 normalized) and 0.341 (0.412 normalized) respectively.
In the N protein the mutation frequency ratios were 1.319 (1.094 normalized), 1.181 (1.614 normalized), and 0.259 (0.306 normalized), for pan-HLA, HLA-I, and HLA-II hotspots respectively.
These differences in distribution of mutation frequency further highlight our identification of conserved regions of N with high epitope potential, which were captured in better isolation from often mutated regions by HLA-II hotspots. Also, the nearly equal mutation distribution at HLA-I hotspots versus the rest of the S protein connects to a previously made observation that CD8+ T-cell response frequencies aggregated over many studies led to a more homogeneous positional distribution than for CD4+ T-cells or other proteins (9). A more uniform distribution of mutation frequencies, as well as considerable heterogeneity due to patient HLA-I repertoire (14), are likely both factors leading to the greater uniformity in aggregate. We illustrate this HLA-I binding heterogeneity later in VOCs (Figure 8).
Sorting Protein Variants by Fraction of Binders Lost Across Representative HLAs
To track the evolution of predicted binding loss - thus potential for epitope loss - across viral variants, we counted the number of predicted binders at each pan-HLA hotspot for every unique version of S and N. This enabled us to measure the fractional change in binding between any two versions of a protein at any specific hotspot location, or in aggregate over all hotspots.
Interactive visualizations illustrating per-HLA aggregate binder count fraction relative to the reference SARS-CoV-2 genome (NCBI Reference Sequence NC_045512) for all S and N epitopes have been made available at https://research.immunitybio.com/scov2_epitopes/, and a static example is illustrated in Figure 5.
Figure 5 Illustration of per-HLA aggregate binder count fraction at pan-HLA hotspots illustrated for all HLAs across all unique versions of S and N. All plots were sorted according to sum fraction of binders lost, such that proteins with the most loss appear on the left. Each plot has 5 vertical black lines indicating average bind loss per HLA thresholds at 0.01, 0.005, 0.002, 0.001, and 0. For compactness, binder count fractions in data only from March 13, 2021 are shown here: HLA-I binder count fractions for all unique versions of (A) S and (B) N; and HLA-II binder count fractions for (C) S and (D) N variants. Interactive plots from all data sample times are available online.
Independently for our comprehensive HLA-I and HLA-II sets, all versions of the S and N proteins were ranked according to sum fraction of binders lost (see Methods and Figure 5) such that protein versions with the most significant overall loss in HLA binders relative to the reference genome were ranked above proteins with less epitope loss potential. A fractional value (bind loss rank fraction) was assigned to represent ranking position; where values near 0 correspond to the top of the ranking order (most loss of HLA binders), and values near 1 indicate a rank at the end (least loss of HLA binders). This was done at each bi-monthly data time point. Since the number of unique proteins increased through time and therefore impacted relative rankings, to enable comparison of lineages between sample times sum fraction of binders lost was divided by the number of HLAs considered to obtain an average bind loss per HLA.
To prioritize browsing protein versions with most potential for epitope loss, all interactive plots were sorted as described above. Visualizations were also supplemented with average bind loss per HLA thresholds drawn at 0.01, 0.005, 0.002, 0.001, and 0 to clearly delineate zones where binding loss is likely to have little or no impact.
High Fraction of Lineages With the Most Binder Loss in S Show Conserved HLA Binding in N
Motivated by vaccine approaches currently in trials that include both the S and N proteins (20–22), we investigated the relationship between HLA binder loss in S and N across all viral lineages in the data.
Note that one viral lineage is often defined to span multiple alternative versions of each protein. Conversely, one specific protein sequence may be produced by several viral lineages. To summarize results, each lineage was represented by the sample with worst-case potential epitope losses in S and N proteins (highest sum fraction of binders lost), and alternatively by the most frequent instance of each protein occurring across genomes assigned to a single lineage classification. Worst-case scenarios were considered independently so that bind loss rank fraction values assigned to a lineage were allowed to come from different genomes for S and N. This assumed that versions of the lineage may already exist outside the data, or come to exist through continued evolution, where proteins with most significant loss co-occur in the same viral genome. Most frequent versions of S and N representing each lineage were also selected independently.
Figure 6 shows the relationship between lineages with representative versions of the S protein ranked at the top according to most significant loss in HLA-I or HLA-II binders (bind loss rank fraction) and different thresholds of HLA binder loss in N (average bind loss per HLA). In data analyzed from November 26, it was found that lineages whose worst-case S protein ranked within the top 1% most potential for epitope loss, 17.5% had an average bind loss per HLA less than 0.005 in N for HLA-I (Figure 6A). Further, as the scope was increased to consider lineages with S ranked in the top 10% most epitope loss potential, 33.3% of those lineages demonstrated the same level of conservation in N (Figure 6A). The picture was similarly optimistic when it came to conservation of potential CD4+ epitopes. The same level of HLA-II binder conservation (average bind loss per HLA < 0.005) in N was exhibited by 26.5% of lineages that fell in the top 10% according to binder loss in S (Figure 6C). When a stricter requirement was imposed that an equal threshold of binder conservation be observed across both HLA-I and HLA-II in N, there was little change in outcome: the strict N conservation criteria was satisfied by 19.1% of lineages in the top 10% of those with most loss in S (Figure 6E).
Figure 6 N epitope conservation in SARS-CoV-2 lineages with representative S protein ranking in the top percent potential epitope loss. All viral lineages within the data were represented either by the S and N protein exhibiting the worst-case epitope loss potential (A, C, E), or by the most frequently occurring versions of S and N (B, D, F). For lineages with the S protein ranking in the top percent for the most significant loss in HLA-I or HLA-II binders (x-axis), shown are the fractions of those top loss lineages that exhibited (A, B) HLA-I binder conservation in the N protein, (C, D) HLA-II binder conservation in the N protein, or (E, F) both HLA-I and HLA-II binder conservation in the N protein; all at various conservation thresholds of average (avg.) bind loss per HLA, as represented by the colors shown in the legend. The strictest N conservation criteria was used to illustrate the dynamics of the relationship throughout the span of the 2021 (E, F).
In addition to being robust to a strict requirement of binder conservation across both HLA-I and HLA-II, this conclusion also held robustly throughout the course of the pandemic in 2021. Looking at the lineages with worst-case S demonstrating binder loss within the top 10% at data sample times in March, May, July, September, and November; the percentage of lineages satisfying the same N conservation criteria as above was 23.5%, 19.3%, 18.1%, 18.8%, 19.1%, respectively (Figure 6E).
Even more compelling is the observation that the most commonly occurring versions of S and N within each lineage (Figures 6B, D, and F) were not typically the versions exhibiting the worst-case HLA binder loss (Figures 6A, C, and E). Lineages with their most frequent S ranking within the top 10% for binder loss, demonstrated strict HLA-I and HLA-II binder conservation in N (at average bind loss per HLA < 0.005) at percentages of 40.0%, 37.5%, 50.0%, 40.5%, 37.5% across time (March, May, July, September, and November respectively) (Figure 6F).
Variants of Concern (VOC) and Variants of Interest (VOI) Over Time
Our global analysis was leveraged to highlight the evolution of the peptidome in VOC as defined by the Centers for Disease Control and Prevention (CDC): B.1.17 (World Health Organization (WHO): Alpha), B.1.351 (WHO: Beta), B.1.617.2 (WHO: Delta), P.1 (WHO: Gamma), and B.1.1.529 (WHO: Omicron). The evolution of VOI was also tracked: B.1.427 and B.1.429 (WHO: Epsilon), B.1.525 (WHO: Eta), B.1.526 (WHO: Iota), and B.1.617.1 (WHO: Kappa). Note that some WHO labels such as Beta or Delta can correspond to multiple PANGO lineages, thus the complete history of Delta may not be fully captured only by its originating PANGO lineage B.1.617.2.
To supplement NIH NCBI data from November 26, 2021 to cover the emergence and rapid spread of the B.1.1.529 (Omicron) lineage in late 2021, we obtained all Omicron genomes from GISAID (29–31) up to December 6, 2021. All unique versions of S and N proteins that included less than 1 percent unspecified amino acids and were classified by GISAID as Omicron were appended to the November 26 data. This augmented dataset is labeled as December 6 throughout figures. We also later verified that according to all GISAID data available up to two additional time points (December 20, 2021 and December 29, 2021), the versions of Omicron S and N proteins identified as the top two most frequent remained unchanged in their rankings.
With each of the VOC and VOI lineages, we observed that as both overall sample count and variety of unique proteins increased over time for S and N (Figures 4C and D; and Figures 4E and F, respectively), there was a distinct correlation with increased worst-case HLA-I and HLA-II binder loss (Figures 4G, H, K, L and Figures 4I, J, M, N, respectively). As a result of this evolution, lineages increased their chances of successful CD8+ and CD4+ T-cell evasion. However, we also observed that this did not occur at the same rate in all lineages. For example, worst-case HLA-I binder loss in S changed extremely rapidly between July 27 and September 25 for both VOC B.1.617.2 (Delta) (Figure 4G) and VOI B.1.429 (Epsilon) (Figure 4I) compared to any other variants. We also saw from the emergence of B.1.1.529 (Omicron) that novel strains can feature high average bind loss per HLA in the context of existing lineages. The most frequent versions of B.1.1.529 proteins exhibited the highest potential for epitope loss compared to the most frequent proteins corresponding to almost all other VOCs and VOIs (Figures 4G, H, K and Figures 4I, J, M, N). The one exception was a close match in average bind loss per HLA-II with B.1.1.7 in the N protein (Figure 4L). Emergence of novel variants and differences in rate of evolution occasionally impact the relative rankings of linages according to most epitope loss potential, and thus underscore the importance of continued genomic surveillance.
VOCs Demonstrate Increased Potential for Loss of Epitopes When Compared to VOI
Across both the S and N proteins, we observed that VOC lineages demonstrate a consistently higher average bind loss per HLA than lineages classified as VOI; especially when focusing on the most frequent proteins within each lineage. This held true for VOC versus VOI across both HLA-I (Figures 4G and H versus Figures 4I and J, respectively) as well as HLA-II (Figures 4K and L versus Figures 4M and N, respectively). In summary, lineages which have spread most successfully, and have been identified as the most concerning, consistently demonstrate a higher rate of epitope loss potential relative to VOI and other lineages.
Proteins Instances With Worst-Case Binding Loss Did Not Become the Most Frequent for Any VOC or VOI Lineage
Perhaps the most notable observation from our tracking of VOC and VOI lineages is that in no instance did the S or N proteins exhibiting the worst-case HLA-I or HLA-II binding loss become the most frequent protein version for a lineage (Figures 4G–N). Even though we found an increasing occurrence of mutations at T-cell epitope hotspots over time (Figures 4A and B), genomes with the most potential for T-cell evasion within the trajectory of each lineage were not the most successful at spreading across the population. This appears consistent with hypotheses expressed earlier (7, 10, 13–15), suggesting that T-cell evasion is unlikely to be a primary evolutionary pressure on SARS-CoV-2. Instead of a gradual evolution, we observed step changes in T-cell evasion potential when tracking most frequent versions of S and N within VOCs, as demonstrated by the emergence of B.1.617.2 (Delta) and B.1.1.529 (Omicron) during our sampling period (Figures 4G, H, K, and L).
High Levels of Epitope Conservation Hold Across Most Frequent Versions of N Protein in VOC and VOI Lineages
For both VOC and VOI lineages, regardless of epitope loss potential in versions of their S protein, a high degree of epitope conservation was observed across nearly all most frequent versions of N. Even lineages whose most common N protein exceeded our earlier defined conservation threshold (average bind loss per HLA < 0.005, Figure 6) did so only by a narrow margin: B.1.617.2 and B.1.1.529 (Figure 4H), B.1.617.1 and B.1.525 (Figure 4J) with respective average bind loss per HLA-I values: 0.0072, 0.0098, 0.0072, 0.0088; B.1.1.7 and B.1.1.529 (Figure 4L) with average bind loss per HLA-II values: 0.0072, 0.0070.
Localization of Significant Epitope Change Across VOC and VOI Lineages Highlights Regions of N Conservation
On the S protein the hotspot locations that exhibited some of the most significant potential drops in epitope count across sequenced genomes of VOCs and VOIs also corresponded to regions confirmed to have the highest frequency of T-cell epitopes across aggregated empirical studies (9). Specifically pan-HLA hotspots 2 and 4, which ranked first and third in terms of confirmed epitope response frequency (Figure 3E), were regions where both B.1.351 (Beta) and B.1.1.529 (Omicron) demonstrated notably lower HLA-I binder count fraction values (0.953 and 0.961 at hotspot 2, and 0.922 and 0.938 at hotspot 4; respectively for the two lineages) relative to other VOCs and VOIs (Figure 7A). B.1.617.2 (Delta) also stood out at pan-HLA hotspot 5 with a low HLA-I binder count fraction of 0.958 relative to others. However, hotspot 5 ranked in the middle according to empirical response frequencies.
Figure 7 Distribution of binder count fraction relative to reference SARS-CoV-2 at pan-HLA hotspots for VOC and VOI lineages. The box plots illustrate the distribution of HLA-I (A, B) and HLA-II (C, D) binder count fraction averaged across HLAs (y-axis) at S (A, C) and N (B, D) pan-HLA hotspots (x-axis) for all unique proteins classified as VOC and VOI lineages. Binder count fractions specific to the most frequent protein versions of each lineage are annotated with an ‘X’ on top of the box plots. VOC and VOI lineage labels are indicated in the legend at right.
In terms of HLA-II binder count fraction across S protein hotspots, only Omicron stood out at hotspot 3 with a distinctly lower value than other lineages (Figure 7C). Considering hotspot 3 is toward the end of rankings according to response frequency, CD4+ epitopes may, in general, be less subject to T-cell evasion across VOCs and VOIs than CD8+ epitopes.
For N protein, pan-HLA hotspots 1 and 3 were verified as the highest frequency contributors of both CD4+ and CD8+ epitopes according to empirical evidence (Figure 3F). As previously discussed, we were able to infer that these hotspots were conserved, that is, less subject to mutation in the course of viral evolution (Figure 4B); and - as expected - we observe that they are the most stable across all variants in terms of HLA-I and HLA-II binder count fraction at or near 1 (Figures 7B and D).
VOCs Demonstrate Heterogeneity of Epitope Loss Potential at pan-HLA Hotspots Across Our Comprehensive HLA Set
Finally, to illustrate that epitope loss is not uniform across HLAs, we visualized the HLA-I binder count fraction relative to reference proteins at pan-HLA hotspots for the most frequent S (Figures 8A–E) and N (Figures 8G–K) proteins of all VOCs. We also included the second most frequent instances of Omicron S (Figure 8F) and N (Figure 8L). Again the HLA binding stability of N hotspots 1 and 3 was apparent across HLAs, even when not averaged as in Figure 7B. In contrast, multiple hotspots on the S protein demonstrated significant diversity in binder count fraction across our HLA-I set (Figures 8A-F). This was consistent with findings that S regions yielding CD8+ epitopes vary with patient repertoire of HLA-I alleles (14), thus leading to a more homogeneous distribution of position specific response frequency in S across aggregated studies (9) than observed for CD4+ epitopes or the N protein. For closer inspection, interactive visualizations of Figure 8 are available online along with other supplementary content.
Figure 8 Change in HLA-I binder count from SARS-CoV-2 reference at pan-HLA epitope hotspots for most frequent S and N proteins of VOC lineages. Change in potential epitope count as a fraction of reference count is shown for all HLA-I in our analysis set (y-axis) for the most frequent versions of S (A–F) and N (G-L) in each of the VOC lineages: (A, G) B.1.1.7 (Alpha), (B, H) B.1.351 (Beta), (C, I) B.1.617.2 (Delta), (D, J) P.1 (Gamma), (E, K) B.1.1.529 (Omicron), as well as the second most frequent versions of Omicron proteins (F, L). For both S and N proteins pan-HLA hotspots (x-axis) were used. Interactive versions of all plots above are available online.
Discussion
With our unique approach to integrating HLA binding predictions across a broad range of peptide lengths and a representative HLA set, we were able to demonstrate that our aggregate signals representing candidate epitope frequency and HLA promiscuity correlated well with empirical evidence of T-cell response frequencies, and correctly identified protein regions yielding immunodominant epitopes across multiple different cohorts (9, 14, 15) (Figures 1, 2 and Table 1 and Supplementary Tables S1, S2). It is particularly noteworthy that multiple prior studies had omitted predictive analysis when investigating CD4+ T-cell reactivity, attributing this omission to HLA-II binding prediction algorithms not effectively predicting epitope recognition (14). Our results have demonstrated that our binding prediction algorithms coupled with our approach to integrating predictions across lengths and HLAs were similarly effective at identifying regions of both immunodominant CD4+ epitopes and CD8+ epitopes (Figure 2 and Table 1).
Our approach to selecting representative HLA sets also benefitted from our binding prediction models. We leveraged learned HLA representations to construct a global HLA set based not only on population frequency, but also on uniqueness of function. The goal was to ensure we did not miss consideration of HLAs that may not be frequently studied or well characterized in datasets, but were predicted to be distinct from other HLA groups in how they interact with peptides.
In contrast to in-silico analyses of singular (9, 13) or select (13, 32) sequences derived from representative viral genomes or impacts of individual mutations, our focus on full protein sequences obtained from all available viral genomes at each sample time enabled us to capture deeper co-occurrence relationships between mutations and their impact on epitope loss potential within and across lineages. This yielded a unique view of viral evolution from the perspective of potential for T-cell evasion.
Notably, we observed that throughout 2021 there were no cases among VOC or VOI lineages where the S or N protein versions with the most epitope loss potential became the most frequently observed within a lineage. This occurred despite trends of increasing rates of mutations at potential epitope hotspots overall (Figure 4), and worst-case protein versions exhibiting more epitope loss within lineages as time progressed (Figures 7 and 8). This suggests that T-cell evasion is not a dominant evolutionary pressure on SARS-CoV-2 evolution, in agreement with prior hypotheses (7, 10, 13–15).
When looking across all SARS-CoV-2 lineages, we observed that among lineages with their S protein demonstrating worst-case binder loss within the top 10%, 19.1% had a high level of epitope conservation in their worst-case version of N. This relationship proved even more distinct when representing lineages by their most frequent versions of S and N; boosting the fraction of lineages with N conservation to 37.5% among lineages with their S in the top 10% of binder loss. We observed that these relationships remained stable over time throughout the course of the pandemic, even as new variants emerged and existing lineages continued to mutate (Figure 6).
Delving deeper, we found that predicted hotspots aligning with the highest frequencies of empirically verified epitopes were among the most impacted by HLA binding loss on the S protein. Whereas on the N protein, the top two epitope contributing hotspots happened to be conserved across time, showing almost no loss in HLA binding (Figure 7). We further found that candidate epitope regions most impacted by N protein mutations were specific to CD8+ T cells and that the virus did not show an accelerated rate of mutation that significantly impacted the predicted CD4+ T-cell epitope repertoire of the N protein over the course of 2021.
A key limitation of this study is that our analysis is built upon predictions of peptide binding to HLA-I and HLA-II molecules, which is known to be necessary but not sufficient, for T-cell recognition. Not all predicted binders are likely to be T-cell epitopes, and thus not all changes in HLA binding prediction are guaranteed to lead to epitope loss. We aimed to minimize this disparity with our approach to integrating predictions across peptides, lengths, and HLAs; and by focusing on relative change at hotspots of potential epitopes. But ultimately there are unmodeled factors between HLA presentation and T-cell response, and an additional layer of predictive tools trained directly for the task may be necessary to improve precision in the future.
A secondary limitation to the analyses presented herein is that our conclusions are based around a representative set of HLAs selected to cover the most general picture of the world population (and be inclusive of any minority but distinctly functioning HLA groups). Thus conclusions may vary if the analysis were to be repeated with a different HLA set, for example one specific to a population or region. We also encountered very sparse coverage in population statistics of HLA-II alpha-beta haplotypes. To overcome this, we selected our HLA-II set based on all individual alpha and beta chain frequencies and considered all permutations, and may have incurred a risk of overrepresenting the impact of some alleles.
Finally, it is important to note that our conclusions were based on viral genome data curated by the NIH NCBI and GISAID. Therefore, there may be multitudes of unaccounted factors such as regional variability in available testing, processing times, sequencing methodologies, etc.; which might affect how representative the data was of the true state of the virus at any one point in time.
Overall, our work yielded a novel perspective into the evolving landscape of T-cell evasion potential over the time period studied, and added to the mounting body of evidence suggesting that the inclusion of proteins such as N, along with S, in vaccines is likely to be an effective strategy for improving robustness and duration of immune memory protective against SARS-CoV-2 infection and severe COVID-19 in the face of evolving viral variants.
Methods
HLA-I and -II Binding Prediction With RNN and CNN
A convolutional neural network (CNN) as well as an attention augmented recurrent neural network (RNN) were developed for HLA-I and HLA-II binding prediction using data from IEDB [https://www.iedb.org/] (28) downloaded in January 2020. Training data included 436,805 unambiguously labeled peptide-HLA examples across 159 unique HLA-I A, B, and C alleles; and 96,818 examples spanning 17 unique HLA-II alpha chains and 73 unique HLA-II beta chains.
Both neural net architectures utilize the full HLA protein sequence as well as the full peptide amino acid sequence as inputs. HLA-I and HLA-II binding prediction systems were trained independently, but weights learned from the best HLA-I systems were used to initialize corresponding architecture components for HLA-II training. Rather than estimating affinity, both systems predict binding as classification. This design choice alleviates issues due to high variance in affinity measurements observed in IEDB data (Supplementary Figure S2), which required prior systems to add ranking to normalize predictions across HLAs (33–35).
As represented in Supplementary Figure S2A, IEDB affinity measurements were found to exhibit high variance across curated qualitative assessments of binding. Although some biases may have been due to measurement method, even when the same assay was used, significant variance in affinity values across binding categories remained (Supplementary Figure S2B). The commonly selected cutoff of 500 nM for confidently binding peptides is thus likely to mislabel or ignore many valid binders. Transforming the problem to classification of qualitative labels entirely removed the need for such a threshold, enabling models to learn to assign a probability of being an HLA binder independent of any affinity biases between HLA alleles, assay methods, or other potential noise factors. We therefore expected that classification of peptide binding and presentation was likely to be a better posed problem than attempting to predict extremely noisy binding affinities.
Validating our selection of classification over regression, our RNN classifier’s raw predictions were consistently confident of binding for verified eluted ligands across 27 HLA-I molecules in the Pearson et al. dataset (36) (Supplementary Figure S3). Large variance of predicted affinity values across HLAs in this dataset was the motivation for re-scoring raw NetMHCpan-4.0 predictions (12) by instead reporting HLA specific ranking values. By achieving comparable outcomes to affinity prediction followed by re-ranking, but without needing the additional independent ranking step to re-calibrate predictions, we confirmed that classification based on IEDB’s curated qualitative labels was better posed when considering a multitude of HLAs.
During each RNN and CNN training epoch, data for every distinct HLA was augmented with an additional 20 percent background negative peptides (or a minimum of 150 for each HLA-I and 25 for each HLA-II) randomly sampled from the Swiss-Prot human proteome (https://www.uniprot.org/proteomes/UP000005640) and required to not already exist in training.
To help address data imbalance between HLAs, as well as positive/negative sample imbalance within each HLA, we adapted the sample weighting strategy proposed by Cui et al. (37) independently across both scenarios (weight smoothing parameter beta = 0.99 was used within each HLA to balance positive/negative examples, and beta = 0.999 was used to balance contributions from each HLA). To ensure average batch loss magnitudes were not impacted by the re-weighting scheme, the combined weights were re-normalized to sum to twice the number of distinct HLAs in training.
Both trained RNN and CNN neural networks were evaluated to compare favorably on the test set of eluted ligands across 36 HLA-I molecules published with the release of NetMHCpan-4.1 (test data and results from other evaluated systems (33, 34, 38, 39) are available at: https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1).
To compare directly to the state-of-the-art (34), we used the same methodology to estimate PPV for each HLA: based on the fraction of true positive peptides in the top N predictions, where N was defined by the total number of true positives for the HLA of interest multiplied by a factor of 0.95. The mean of this PPV estimate across HLAs is reported in Table 2 along with mean ROC AUC.
Table 2 Performance evaluation on eluted ligand test NetMHCpan-4.1, [34] dataset including 77,053 peptides (lengths 8-14) across 36 distinct HLA-I molecules.
Our RNN and CNN systems both incorporate an ensemble of 5 models with different random weight initializations and different random training/validation splits, but otherwise identical model architecture and hyperparameters. The variance of predictions from such an ensemble has been shown to be an effective proxy for model uncertainty (40), particularly in cases such as unseen data. Using this additional metric, our models are able to abstain from forcing a classification decision when uncertainty is high. Specifically, ensemble predictions for which the classification decision boundary falls within a standard deviation of the mean are classified as ambiguous and can be disregarded from consideration as either binders or non-binders. The mean and standard deviation of predicted values from the ensemble are provided for all peptide HLA binding predictions, and where not otherwise specified, the mean value is used in subsequent analyses. Additional columns in Table 2 indicate that this ability to abstain from classification improves accuracy for both the RNN and CNN. In addition to the mean accuracy metrics across HLAs provided in Table 2, individual metrics for each HLA (Supplementary Data Sheet 2) and binding predictions for all peptides in the test set (Supplementary Data Sheet 3, 4) are provided as Supplementary Data. On average, across all HLAs, only 2.2% of samples were excluded as ambiguous by both RNN and CNN systems, with HLAs showing the lowest levels of ambiguous sample counts being at 0.7%, and highest ambiguous sample counts at 6%.
Our HLA binding prediction software is currently proprietary, but will be made commercially available to researchers. Please contact us if interested.
HLA-I Grouping and Selection
HLA nomenclature was originated by the World Health Organization (WHO) Nomenclature Committee for Factors of the HLA System, which issued its first report in 1968 based on serology analysis (41) and has continued to refine the resolution of its methods by incorporating molecular and genetic information (42).
In this study, we sought to ensure that examples from uniquely functioning HLA groups were represented in our list of alleles to analyze, even if they were not present in majority populations. To achieve this, we needed to characterize functional similarity relationships between all HLAs in a way that would estimate the magnitude of difference between distinct HLAs and enable clustering at resolutions that vary from the curated nomenclature. Prior efforts have constructed HLA-I clusters and hierarchies based on distances defined by binding peptide set overlap or motif similarity identified from peptide binding data (2, 43). However, by the nature of their methods, such studies were limited to HLAs for which abundant data was available at the time and thus did not cover the full set we wished to consider.
To leverage all data used to train our binding prediction systems, we utilized a fixed length embedding of the full HLA amino acid sequence generated at an intermediate step in our trained neural networks as a basis for calculating distances between HLAs. For robustness, the embeddings from each model in the ensemble were stacked into one vector, and all clustering operations were performed on these concatenated ensemble embeddings. This approach to HLA grouping enabled consideration of all HLAs whose full sequence was known to our system (all data from the IPD-IMGT/HLA Database up to July 2019 (44, 45), including those for which limited empirical binding data is currently available.
To capture both local and global structure in high dimensional data, PHATE (46) was used to approximate relationships between our HLA embeddings in two dimensions (knn = 20 was used to focus more on global relations). For additional robustness to noise in cluster assignment, agglomerative clustering was used in the simplified PHATE coordinate space to group HLAs. Using a sweeping threshold across agglomerative clustering runs, a stable point was identified at 38 HLA-I clusters that maximized the silhouette score (Supplementary Figure S4A).
Cluster assignments were then paired with data from the Allele Frequency Net Database (AFND) (47, 48) to select the set of HLA-I molecules for consideration in analysis. To first ensure all functional clusters were represented, the set was initially populated by the top 2 highest frequency HLAs from each of the 38 clusters. Subsequently, all HLAs with fHLA ≥ min (µ,M)- 0.6 * σ were added to the set (where fHLA is the frequency of appearance of an HLA, and µ,M, σ are the mean, median and standard deviation of the initial selection of 76 HLA-Is). The 0.6 multiplier was used so that the threshold ensured all top 100 most frequent HLA-I molecules were included. The final selection of 114 HLA-I molecules thus spanned all functional groups at the resolution of our clusters as well as the highest frequency alleles.
HLA-II Grouping and Selection
The same studies that had previously approached HLA-I functional grouping (2, 43) did not attempt clustering of HLA-II. Using our trained neural nets for predicting HLA-II binding, we followed a similar pipeline as with HLA-I for creating functional clusters and selecting a comprehensive set for analysis.
In the case of HLA-II, our neural network architectures first process the full amino acid sequence of the alpha and beta chains independently, before allowing the information to be combined for binding prediction. These independent intermediate representations constitute the embeddings used to represent HLA-II alpha and beta chains for clustering. As with HLA-I, each representation was a stack of embeddings from all 5 models in an ensemble.
HLA-II clusters were assigned at 3 resolutions. For alpha chains alone, we found 18 clusters via agglomerative clustering and threshold sweep; and 16 clusters for all beta chains considered. Note that only alpha and beta chains with full amino acid sequences available (IPD-IMGT/HLA Database up to July 2019 (44, 45) were used. Finally, we considered the space of all compatible permutations of alpha + beta pairs. Each compatible pair (for example: DQA1*05:01-DQB1*06:02) was represented by the concatenation of the independent alpha and beta embedding, and all such pairs were represented in two dimensions with PHATE (at knn = 20). A total of 93 clusters were assigned to explain all alpha + beta permutations with our agglomerative approach (Supplementary Figure S4B). The full lists of HLA-I and HLA-II cluster membership assignments are provided as Supplementary CSV files and interactive plots.
Selecting a comprehensive HLA-II set for analysis utilized all 3 clustering assignments to ensure a fair sampling. The top 2 highest frequency alpha chains were selected from each of the 18 alpha clusters (4 singular clusters only contributed 1 alpha chain), and all other alpha chains that satisfied fα ≥ min (µ,M)- 0.5 * σ were added to the set (where fα is the frequency of appearance of an HLA-II alpha chain, and µ,M, σ are the mean, median and standard deviation of the initial set of 32 alpha chains).
Independently, the top 3 highest frequency beta chains from each of the 16 beta clusters were selected, and supplemented with all other beta chains that met fβ ≥ min (µ,M)- 0.5 * σ (where fβ is the frequency of appearance of an HLA-II beta chain, and µ,M, σ are the mean, median and standard deviation of the initial set of 47 beta chains).
The above sets were then combined to create 645 compatible permutations of alpha + beta chain pairs. The top 2 HLA-II alpha + beta pairs from any of the 93 clusters that were not covered by the selected set were added; bringing the augmented total to 661 HLA-II alpha + beta pairs. Since haplotype information in the AFND (47, 48) was less comprehensive than allele frequencies for independent alpha and beta chains, pairs within clusters were sorted by max ( fα ,fβ ); first at 4 digit allele nomenclature resolution and secondarily at 2 digit allele nomenclature resolution.
Finally, a pruning stage removed some HLA-II alpha + beta pairs from each of the 93 clusters in the analysis set only if: they were not in the top 2 samples representing a unique combination of alpha cluster (18 labels) and beta cluster (16 labels), and fβ ≤ 0.05 at the 4-digit HLA-II nomenclature resolution. This ensured that we kept all functionally unique HLA-II alpha + beta combinations and spanned the entire set of 93 clusters in our analysis set, but did not heavily oversample functionally similar pairs unless they appeared with very high frequency in the world population.
The final HLA-II set for analysis included 373 alpha + beta pairs, with 37 unique alpha chains and 66 unique beta chains represented. See supplementary files in online content for details.
HLA-I Potential Epitope Hotspot Localization
Protein regions that featured an increased frequency of peptides with high predicted HLA binding confidence relative to elsewhere on the same protein were identified as potential epitope hotspots. These hotspots helped narrow our search space, added robustness to noise for global comparisons, and enhanced the interpretability of global analysis by enabling localization of regions at most risk for potential epitope loss.
To detect hotspots, first an aggregate signal was independently obtained for each HLA-I in our analysis set. All peptides of lengths from 8 to 12 were considered in a sliding window fashion, and the value at each protein position was the average binding prediction of all overlapping peptides. Figure 9A illustrates these aggregate signals for the reference S protein of SARS-CoV-2 across all 114 HLA-I (rows).
Figure 9 Method for detection of binder hotspots. (A) Binding prediction scores of all overlapping peptides at each position along the S protein (x-axis), averaged independently for each HLA-I molecule (y-axis). (B) Top binding score locations based on thresholding max pooled averaged prediction scores independently for each HLA (per-HLA hotspots). Unselected regions are masked out and appear in gray. Values before pooling are shown to keep relation to the prior plot clear. (C) Averaged scores of all peptides classified as binders that overlap per-HLA hotspots. Color bars in (A–C) indicate scale of averaged binding prediction values for each sub-plot. See Methods for details. (D) The final HLA-I hotspot locations (red) selected by max pooling and thresholding the masked binder signal averaged across HLAs (blue).
Max pooling with a window size based on the dominant binding peptide length (9 amino acids) was then applied to the aggregate signal for each HLA, and all values in the top 10% of maxima were selected as hotspots (Figure 9B). The max thresholding operation was applied independently per HLA to help discount remaining uncorrected biases potentially causing systematic shift in prediction scores for some HLAs (ensuring peak binder locations of less well characterized HLAs were not discarded in later analysis).
Once potential epitope hotspots were selected independently for each HLA-I, they were used to narrow the scope of the aggregate signal to average only peptides classified as binders that overlap their respective HLA-I hotspots (Figure 9C). These masked binder signals were averaged across HLAs to create a global HLA-I average binder score, which was then max pooled (size = 9), and the top 25% of maxima were selected as general HLA-I binding hotspots for the protein of interest (Figure 9D).
HLA-II Potential Epitope Hotspot Localization
The same procedure was followed for identification of HLA-II potential epitope hotspots as for HLA-I, except all peptide sizes considered ranged from 11 to 21, and all max pooling steps utilized the most frequent HLA-II ligand length of 15 (instead of 9 used in HLA-I). The set of selected HLA-II for analysis was also considerably larger: 373 alpha + beta pairs.
Pan-HLA Candidate Epitope Hotspot Localization
To identify locations along a protein with the highest concentration of peptides predicted to bind both HLA-I and HLA-II with high confidence, we averaged the global HLA-I and HLA-II signals before their respective final max pooling steps were applied. This pan-HLA average binder signal was then max pooled (with size = 12) and hotspot locations were selected based on the top 25% of the max pooled signal. Figure 1 and Supplementary Figure S1 illustrate HLA-I, HLA-II, and pan-HLA hotspot locations for 6 key proteins from the SARS-CoV-2 reference genome.
SARS-CoV-2 Genome Data
SARS-CoV-2 genome and protein data were downloaded from the NIH NCBI portals (https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/, https://www.ncbi.nlm.nih.gov/datasets/coronavirus/proteins/ respectively) on the dates of March 19, May 11, July 27, September 25, and November 26, 2021. Upon download all genomes were classified with the latest version of the Pangolin tool available at each time point (49), to enable analysis with respect to the PANGO lineage classifications (50).
To base our viral surveillance on the most confident and critical data, our analyses of the S and N proteins considered only samples for which the S or N protein sequences were observed at least 3 times (Figure 4).
Potential Epitope Hotspot Alignment Across Variants
Potential epitope hotspots for all considered proteins were localized based on the SARS-CoV-2 reference genome (NCBI Reference Sequence: NC_045512 (23). To map hotspot locations onto each new mutated version of a protein as it was processed, dynamic time warping (DTW) (51) was used. This enabled direct comparison of relative predicted HLA binder counts at hotspot locations between viral variants.
Ranking Protein Variants by Sum Fraction of Binders Lost
To summarize how mutations in a protein variant ,p, impact its predicted peptide binding, we defined the binder count fraction for each HLA molecule, m as:
Bh,p,m is the number of peptides at the hotspot, h, in protein variant, p, that were classified as binders for a specific HLA, m, Bh,WT,m is the corresponding binder count in the wildtype WT version of the protein being considered; and H is the set of all corresponding hotspot regions in WT and p. All wildtype proteins sequences were obtained from the SARS-CoV-2 reference genome (NCBI Reference Sequence: NC_045512).
The overall potential for epitope loss of a protein variant when compared to wild type was captured by the sum fraction of binders lost, defined as:
where M is our set of all HLA molecules selected for analysis. Cases where epitope count for a specific HLA increased compared to the reference were intentionally omitted from the summation to focus only on the risk of losing epitopes that may be critical to immune response.
To compare overall binder loss magnitudes across time and across class I and class II, sum fraction of binders lost was divided by the number of HLAs considered to obtain average bind loss per HLA.
Independently for HLA-I and HLA-II, all versions of the S and N protein at each data time point were ranked according to sum fraction of binders lost, such that protein versions with the most significant overall loss in HLA binders appeared at the top. Prior to ranking and all later stages of analysis, any protein version that featured unspecified amino acids with a quantity greater than 1% of protein length were removed from consideration. A fractional value (bind loss rank fraction) was assigned to represent ranking position; where values near 0 correspond to the top of the ranking order (most loss of HLA binders), and values near 1 indicate a rank at the end (least loss of HLA binders). Bind loss rank fraction was used in the analysis illustrated in Figure 6 to select lineages whose representative S ranked within a percentage of unique proteins exhibiting most binder loss, as captured by the plot’s x-axis.
Capturing the Emergence of B.1.1.529 With Supplementary Data From GISAID
To supplement NIH NCBI data from November 26, 2021 to cover the emergence and rapid spread of the B.1.1.529 (Omicron) lineage, we obtained all Omicron genomes from GISAID (29–31) up to December 6, 2021. Genomes were required to be complete, and those with low coverage were excluded from the query. Each downloaded genome was aligned to reference using MAFFT (52). All unique versions of S and N proteins that included less than 1% unspecified amino acids and were classified by GISAID as Omicron were combined with the November 26 data. This augmented December 6 dataset was then used to capture a more timely picture of relative potential for epitope loss across VOC lineages. To verify that our early Omicron data remained representative of the rapidly spreading lineage, we used two additional time points (December 20, 2021 and December 29, 2021) to confirm that the top most frequent versions of S and N proteins representing the lineage had in fact not changed. Thus, our analysis remained representative throughout the remainder of 2021.
Characterizing Binding Loss Variation at pan-HLA Hotspots Across Lineages
The B.1.1.529 augmented December 6, 2021 data was used to analyze the full range of epitope loss potential in VOC and VOI lineages at all pan-HLA hotspot regions (Figure 7). To summarize changes, binder count fraction relative to reference (WT) was first computed independently per hotspot, h, then averaged across the set of all HLAs, M, for each protein,
For each VOC and VOI lineage, box plots in Figure 7 were created by collecting bh,pvalues of all unique protein sequence instances that included at least one corresponding source genome classified as the lineage being considered. Finally, bh,p values (averaged across HLA-I) for all most frequent instances of S and N proteins representing all VOC lineages are illustrated in Figure 8.
Data Availability Statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.
Author Contributions
KW conceived the project, curated the data, developed the RNN model and methods for HLA selection, prediction aggregation, hotspot identification, visualization, and wrote the manuscript. JS conceived and developed the CNN model, collaborated closely on validation of data and HLA binding prediction models, assisted with data curation from GISAID, and edited the manuscript. PS assisted with creation of figures and edited the manuscript. PS-S provided feedback from a vaccine development perspective, suggested in-depth analysis of VOC and VOI variants, and suggested edits to the manuscript. All authors contributed to the article and approved the submitted version.
Conflict of Interest
All authors are affiliated with Immunitybio, Inc. a company that is developing SARS-CoV-2 vaccines. The study received funding from ImmunityBio, Inc. The funder designed study, performed all analyses and wrote the submitted manuscript.
Proprietary software used for HLA binding predictions will be made commercially available to interested researchers and is covered by the following intellectual property: International Application Number PCT/US2019/046582 (International Publication Number WO 2020/046587 Α2); USPTO Application 17670385.
Publisher’s Note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Acknowledgments
We would like to thank Adrian Rice, Peter Sieling, Courtney Fleenor, Andy Nguyen, Kayvan Niazi, Shahrooz Rabizadeh, Hermes Garbán, and others with whom we have shared countless insightful discussions around immunology over many years.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2022.918928/full#supplementary-material
Additionally, interactive visualizations are also available online at: https://research.immunitybio.com/scov2_epitopes/
References
1. Prévost J, Finzi A. The Great Escape? SARS-CoV-2 Variants Evading Neutralizing Responses. Cell Host Microbe (2021) 293:322–4. doi: 10.1016/j.chom.2021.02.010
2. Sidney J, Peters B, Frahm N, Brander C, Sette A. HLA Class I Supertypes: A Revised and Updated Classification. BMC Immunol (2008) 9:1–15. doi: 10.1186/1471-2172-9-1
3. Stoddard CI, Galloway J, Chu HY, Shipley MM, Sung K, Itell HL, et al. Epitope Profiling Reveals Binding Signatures of SARS-CoV-2 Immune Response in Natural Infection and Cross-Reactivity With Endemic Human CoVs. Cell Rep (2021) 358:109164. doi: 10.1016/j.celrep.2021.109164
4. Moderbacher CR, Ramirez SI, Dan JM, Grifoni A, Hastie KM, Weiskopf D, et al. Antigen-Specific Adaptive Immunity to SARS-CoV-2 in Acute COVID-19 and Associations With Age and Disease Severity. Cell (2020) 183:996–1012. doi: 10.1016/j.cell.2020.09.038
5. Sekine T, Perez-Potti A, Rivera-Ballesteros O, Strålin K, Gorin J-B, Olsson A, et al. Robust T Cell Immunity in Convalescent Individuals With Asymptomatic or Mild COVID-19. Cell (2020) 1831:158–68. doi: 10.1016/j.cell.2020.08.017
6. Tan AT, Linster M, Tan CW, Le Bert N, Chia WN, Kunasegaran K, et al. Early Induction of Functional SARS-CoV-2-Specific T Cells Associates With Rapid Viral Clearance and Mild Disease in COVID-19 Patients. Cell Rep (2021) 346:108728. doi: 10.1016/j.celrep.2021.108728
7. Sette A, Crotty S. Adaptive Immunity to SARS-CoV-2 and COVID-19. Cell (2021) 1844:861–80. doi: 10.1016/j.cell.2021.01.007
8. Bertoletti A, Le Bert N, Qui M, Tan AT. SARS-CoV-2-Specific T Cells in Infection and Vaccination. Cell Mol Immunol (2021) 1810:2307–12. doi: 10.1038/s41423-021-00743-3
9. Grifoni A, Sidney J, Vita R, Peters B, Crotty S, Weiskopf D, et al. SARS-CoV-2 Human T Cell Epitopes: Adaptive Immune Response Against COVID-19. Cell Host Microbe (2021) 297:1076–92. doi: 10.1016/j.chom.2021.05.010
10. Heide J, Schulte S, Kohsar M, Brehm TT, Herrmann M, Karsten H, et al. Broadly Directed SARS-CoV-2-Specific CD4+ T Cell Response Includes Frequently Detected Peptide Specificities Within the Membrane and Nucleoprotein in Patients With Acute and Resolved COVID-19. PloS Pathog (2021) 179:e1009842. doi: 10.1371/journal.ppat.1009842
11. Ng OW, Chia A, Tan AT, Jadi RS, Leong HN, Bertoletti A, et al. Memory T Cell Responses Targeting the SARS Coronavirus Persist Up to 11 Years Post-Infection. Vaccine (2016) 3417:2008–14. doi: 10.1016/j.vaccine.2016.02.063
12. Le Bert N, Tan AT, Kunasegaran K, Tham CYL, Hafezi M, Chia A, et al. SARS-CoV-2-Specific T Cell Immunity in Cases of COVID-19 and SARS, and Uninfected Controls. Nature (2020) 584:457–62. doi: 10.1038/s41586-020-2550-z
13. Grifoni A, Weiskopf D, Ramirez SI, Mateus J, Dan JM, Moderbacher CR, et al. Targets of T Cell Responses to SARS-CoV-2 Coronavirus in Humans With COVID-19 Disease and Unexposed Individuals. Cell (2020) 1817:1489–501. doi: 10.1016/j.cell.2020.05.015
14. Tarke A, Sidney J, Kidd CK, Dan JM, Ramirez SI, Yu ED, et al. Comprehensive Analysis of T Cell Immunodominance and Immunoprevalence of SARS-CoV-2 Epitopes in COVID-19 Cases. Cell Rep Med (2021) 22:100204. doi: 10.1016/j.xcrm.2021.100204
15. Saini SK, Hersby DS, Tamhane T, Povlsen HR, Amaya Hernandez SP, Nielsen M, et al. SARS-CoV-2 Genome-Wide T Cell Epitope Mapping Reveals Immunodominance and Substantial CD8(+) T Cell Activation in COVID-19 Patients. Sci Immunol (2021) 658:eabf7550. doi: 10.1101/2020.10.19.344911
16. Tarke A, Sidney J, Methot N, Yu ED, Zhang Y, Dan JM, et al. Impact of SARS-CoV-2 Variants on the Total CD4+ and CD8+T-Cell Reactivity in Infected or Vaccinated Individuals. Cell Rep Med (2021) 27:100355. doi: 10.1016/j.xcrm.2021.100355
17. Lucas C, Vogels CBF, Yildirim I, Rothman JE, Lu P, Monteiro V, et al. Impact of Circulating SARS-CoV-2 Variants on mRNA Vaccine-Induced Immunity. Nature (2021) 600:523–9. doi: 10.1038/s41586-021-04085-y
18. Agerer B, Koblischke M, Gudipati V, Montaño-Gutierrez LF, Smyth M, Popa A, et al. SARS-CoV-2 Mutations in MHC-I-Restricted Epitopes Evade CD8(+) T Cell Responses. Sci Immunol (2021) 657:eabg6461. doi: 10.1126/sciimmunol.abg6461
19. Khandia R, Singhal S, Alqahtani T, Kamal MA, El-Shall NA, Nainu F, et al. Emergence of SARS-CoV-2 Omicron (B.1.1.529) Variant, Salient Features, High Global Health Concerns and Strategies to Counter It Amid Ongoing COVID-19 Pandemic. Environ Res (2022) 209:112816. doi: 10.1016/j.envres.2022.112816
20. Gabitzsch E, Safrit JT, Verma M, Rice A, Sieling P, Zakin L, et al. Dual-Antigen COVID-19 Vaccine Subcutaneous Prime Delivery With Oral Boosts Protects NHP Against SARS-CoV-2 Challenge. Front Immunol (2021) 12:729837. doi: 10.3389/fimmu.2021.729837
21. Krammer F. SARS-CoV-2 Vaccines in Development. Nature (2020) 5867830:516–27. doi: 10.1038/s41586-020-2798-3
22. Sieling P, King T, Wong R, Nguyen A, Wnuk K, Gabitzsch ER, et al. Prime Had5 Spike Plus Nucleocapsid Vaccination Induces Ten-Fold Increases in Mean T-Cell Responses in Phase 1 Subjects That are Sustained Against Spike Variants. medRxiv (2021), 2021.04.05.21254940. doi: 10.1101/2021.1104.1105.21254940
23. GenBank. NIH National Library of Medicine. NCBI Reference Sequence Nc_045512.2. Bethesda, MD, USA: Genebank, National Library of Medicine (2020). Available at: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.
24. Lan J, Ge J, Yu J, Shan S, Zhou H, Fan S, et al. Structure of the SARS-CoV-2 Spike Receptor-Binding Domain Bound to the ACE2 Receptor. Nature (2020) 5817807:215–20. doi: 10.1038/s41586-020-2180-5
25. CDC. SARS-CoV-2 Variant Classifications and Definitions. Atlanta, GA, USA: Centers for Disease Control and Prevention (CDC) Website (2021). Available at: https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html#unweighted-proportions-substitutions-of-therepeutic-concern.
26. Dhanda SK, Vaughan K, Schulten V, Grifoni A, Weiskopf D, Sidney J, et al. Development of a Novel Clustering Tool for Linear Peptide Sequences. Immunology (2018) 1553:331–45. doi: 10.1111/imm.12984
27. Dhanda SK, Vita R, Ha B, Grifoni A, Peters B, Sette A. ImmunomeBrowser: A Tool to Aggregate and Visualize Complex and Heterogeneous Epitopes in Reference Proteins. Bioinformatics (2018) 3422:3931–3. doi: 10.1093/bioinformatics/bty463
28. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 Update. Nucleic Acids Res (2019) 47(D1):D339–43. doi: 10.1093/nar/gky1006
29. Khare S, Gurry C, Freitas L, Schultz MB, Bach G, Diallo A, et al. GISAID's Role in Pandemic Response. China CDC Wkly (2021) 349:1049–51. doi: 10.46234/ccdcw2021.255
30. Elbe S, Buckland-Merrett G. Data, Disease and Diplomacy: GISAID's Innovative Contribution to Global Health. Global Challenges (Hoboken NJ) (2017) 11:33–46. doi: 10.1002/gch2.1018
31. Shu Y, McCauley J. GISAID: Global Initiative on Sharing All Influenza Data - From Vision to Reality. Euro Surveill (2017) 2213:30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494
32. Nguyen A, David JK, Maden SK, Wood MA, Weeder BR, Nellore A, et al. Human Leukocyte Antigen Susceptibility Map for Severe Acute Respiratory Syndrome Coronavirus 2. J Virol (2020) 94:e00510–00520. doi: 10.1128/JVI.00510-20
33. O'Donnell TJ, Rubinsteyn A, Bonsack M, Riemer AB, Laserson U, Hammerbacher J. MHCflurry: Open-Source Class I MHC Binding Affinity Prediction. Cell Syst (2018) 71:129–32.e124. doi: 10.1016/j.cels.2018.05.014
34. Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: Improved Predictions of MHC Antigen Presentation by Concurrent Motif Deconvolution and Integration of MS MHC Eluted Ligand Data. Nucleic Acids Res (2020) 48(W1):W449–w454. doi: 10.1093/nar/gkaa379
35. Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, Nielsen M. NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. J Immunol (Baltimore Md 1950 (2017) 1999:3360–8. doi: 10.4049/jimmunol.1700893
36. Pearson H, Daouda T, Granados DP, Durette C, Bonneil E, Courcelles M, et al. MHC Class I-Associated Peptides Derive From Selective Regions of the Human Genome. J Clin Invest (2016) 12612:4690–701. doi: 10.1172/JCI88590
37. Cui Y, Jia M, Lin T-Y, Song Y, Belongie S. Class-Balanced Loss Based on Effective Number of Samples. CVPR (2019) 2019:9268–77. doi: 10.1109/CVPR.2019.00949
38. Gfeller D, Guillaume P, Michaux J, Pak HS, Daniel RT, Racle J, et al. The Length Distribution and Multiple Specificity of Naturally Presented HLA-I Ligands. J Immunol (Baltimore Md 1950 (2018) 20112:3705–16. doi: 10.4049/jimmunol.1800914
39. Bassani-Sternberg M, Chong C, Guillaume P, Solleder M, Pak H, Gannon PO, et al. Deciphering HLA-I Motifs Across HLA Peptidomes Improves Neo-Antigen Predictions and Identifies Allostery Regulating HLA Specificity. PLoS Comput Biol (2017) 138:e1005725. doi: 10.1101/098780
40. Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. Proceedings Neurips CC. In: 31st Conference on Neural Information Processing Systems (NIPS) Red Hook, NY, USA: Curran Associates, Inc. (2017).
42. Marsh SGE, Albert ED, Bodmer WF, Bontrop RE, Dupont B, Erlich HA, et al. Nomenclature for Factors of the HLA System, 2010. Tissue Antigens (2010) 754:291–455. doi: 10.1111/j.1399-0039.2010.01466.x
43. Gfeller D, Bassani-Sternberg M. Predicting Antigen Presentation-What Could We Learn From a Million Peptides? Front Immunol (2018) 9:1716. doi: 10.3389/fimmu.2018.01716
45. Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE. IPD-IMGT/HLA Database. Nucleic Acids Res (2020) 48(D1):D948–d955. doi: 10.1093/nar/gkz950
46. Moon KR, van Dijk D, Wang Z, Gigante S, Burkhardt DB, Chen WS, et al. Visualizing Structure and Transitions in High-Dimensional Biological Data. Nat Biotechnol (2019) 3712:1482–92. doi: 10.1038/s41587-019-0336-3
47. Gonzalez-Galarza FF, McCabe A, Santos E, Jones J, Takeshita L, Ortega-Rivera ND, et al. Allele Frequency Net Database (AFND) 2020 Update: Gold-Standard Data Classification, Open Access Genotype Data and New Query Tools. Nucleic Acids Res (2020) 48(D1):D783–d788. doi: 10.1093/nar/gkz1029
48. AFND Database. Allele Frequency Net Database (AFND). Liverpool, UK. At: allelefrequencies.net (2021).
49. O’Toole Á, Scher E, Underwood A, Jackson B, Hill V, McCrone JT, et al. Assignment of Epidemiological Lineages in an Emerging Pandemic Using the Pangolin Tool. Virus Evol (2021) 72:veab064. doi: 10.1093/ve/veab064
50. Rambaut A, Holmes EC, O’Toole Á, Hill V, McCrone JT, Ruis C, et al. A Dynamic Nomenclature Proposal for SARS-CoV-2 Lineages to Assist Genomic Epidemiology. Nat Microbiol (2020) 511:1403–7. doi: 10.1038/s41564-020-0770-5
51. Giorgino T. Computing and Visualing Dynamic Time Warping Alignments in R: The Dtw Package. J Stat Soft (2009) 31:i07. doi: 10.18637/jss.v031.i07
Keywords: SARS-CoV-2, HLA, epitope, binding, conservation, variants, spike, nucleocapsid
Citation: Wnuk K, Sudol J, Spilman P and Soon-Shiong P (2022) Peptidome Surveillance Across Evolving SARS-CoV-2 Lineages Reveals HLA Binding Conservation in Nucleocapsid Among Variants With Most Potential for T-Cell Epitope Loss in Spike. Front. Immunol. 13:918928. doi: 10.3389/fimmu.2022.918928
Received: 12 April 2022; Accepted: 23 May 2022;
Published: 23 June 2022.
Edited by:
Kristen W. Cohen, Ozette Technologies, United StatesReviewed by:
Manon Nayrac, University of Montreal Hospital Centre (CRCHUM), CanadaAlba Grifoni, La Jolla Institute for Immunology (LJI), United States
Koshlan Mayer-Blackwell, Fred Hutchinson Cancer Research Center, United States
Copyright © 2022 Wnuk, Sudol, Spilman and Soon-Shiong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Kamil Wnuk, a2FtaWwud251a0BpbW11bml0eWJpby5jb20=