- 1Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA, United States
- 2Department of Mathematics and Statistics, Wright State University, Dayton, OH, United States
T-cell receptor (TCR) plays critical roles in recognizing antigen peptides and mediating adaptive immune response against disease. High-throughput technologies have enabled the sequencing of TCR repertoire at the single nucleotide level, allowing researchers to characterize TCR sequences with high resolutions. The TCR sequences provide important information about patients’ adaptive immune system, and have the potential to improve clinical outcome prediction. However, it is challenging to incorporate the TCR repertoire data for prediction, because the data is unstructured, highly complex, and TCR sequences vary widely in their compositions and abundances across different individuals. We introduce TCRpred, an analytic tool for incorporating TCR repertoire for clinical outcome prediction. The TCRpred is able to utilize features that can be extracted from the TCR amino acid sequences, as well as features that are hidden in the TCR amino acid sequences and are hard to extract. Simulation studies show that the proposed approach has a good performance in predicting clinical outcome and tends to be more powerful than potential alternative approaches. We apply the TCRpred to real cancer datasets and demonstrate its practical utility in clinical outcome prediction.
1 Introduction
T cell is one of the most important components of the adaptive immune system and plays fundamental roles in fighting diseases (Kumar et al., 2018). The functions of T cells critically depend on the T-cell receptor (TCR), a protein complex that is expressed on the surface of T cells and can recognize an astronomical number of antigens from pathogens or tumor cells (Chopp et al., 2022). Recent research has shown that TCR repertoire can be indicative of the functional activity of tumor infiltrating T cells and predict disease course in cancer progression (Valpione et al., 2021; Shafer et al., 2022). Indeed, sequencing the TCR repertoire and utilizing the sequence information for clinical outcome prediction have become a vital task in cancer research.
The TCR complex is a transmembrane heterodimer linked by disulfide bonds. In humans, about 95% of T cells are comprised of the alpha and beta chains, and the remaining 5% of T cells consist of the gamma and delta chains (Shah et al., 2021). For the TCR alpha chain, its diversity is mainly generated by the random rearrangement of the variable (V) and the joining (J) gene segments, while for the beta chain, the random rearrangement involves the V and J segments plus the diversity (D) gene segment (Tanno et al., 2020). Due to the extra D segment, the beta chain is more diverse than the alpha chain, and thus we focused on the beta chain. Within the beta chain, a region of particular interest is the complementarity-determining region 3 (CDR3), which is generally considered to be the principal binding site for antigens (Rock et al., 1994). In this article, we focus on the CDR3 of the TCR beta chain for risk prediction.
It is challenging to incorporate TCR repertoire for clinical outcome prediction because TCR repertoires are highly diverse and little overlapped among individuals. In fact, the TCR diversity involves not only the recombination of the V(D)J gene segments, but also the random addition or deletion of nucleotides at the junctions between gene segments. It is estimated that the degree of TCR diversity can reach up to the order of 1 × 1015 (Clambey et al., 2014). Meanwhile, TCR repertoires from different individuals generally have distinct profiles, i.e., their TCR compositions and abundances differ substantially. Animal model studies showed that the overlap between TCR repertoires of two genetically identical organisms is around only 20% (Nikolich-Žugich et al., 2004). Thus, TCR repertoire data carry few common features that can be used for clinical outcome prediction. For this reason, a common practice in TCR analysis is to calculate the Shannon entropy for each individual and then use this quantity for risk prediction (Li et al., 2020). However, the Shannon entropy, by its definition, accounts for only the proportions of TCR sequences, while the rich information embedded in the TCR amino-acid sequences is largely neglected. Feature extraction is needed for a more efficient use of the TCR sequences. In addition, some features hidden in the TCR sequences may be difficult to extract due to the complexity of the structure and functions of the TCR. Examples include structural motifs, 3D conformations, and amino acid interactions (Parras-Moltó et al., 2013; Stiffler et al., 2020). It is desirable to incorporate these hidden features in risk modeling to potentially improve the prediction accuracy.
Recently, a number of tools have been developed for studying the TCR repertoire, such as the powerTCR (Desponds et al., 2016), Immunarch (ImmunoMind Team, 2019), ImSpectR (Cordes et al., 2020), and VisTCR (Ni et al., 2020). These tools provide a variety of functionalities for TCR analysis, ranging from comparing clonal distribution and tracking clonotype to quantifying repertoire diversity and data visualization. Few methods have been developed for predicting clinical outcome of interest using the TCR repertoire information, which is believed to play important roles in immune responses to tumor progression. The TCR-L method (Liu et al., 2022) is for conducting genetic association analysis, not for risk prediction. The DeepTCR (Sidhom et al., 2021) was proposed to utilize TCR repertoire for prediction, but does not accommodate adjusting covariates such as demographic and clinical variables. Here, we propose a powerful approach, TCRpred, for predicting continuous or binary outcome by incorporating TCR repertoire with existing demographic and clinical factors. In TCRpred, the effect of the TCR repertoire is characterized by two components: 1) the effect from features that can be extracted from the TCR sequences, such as amino acid k-mers or V(D)J gene usage, and 2) the effect from hidden features that are modeled through kernel machine techniques. Then, we relate the two types of effects (along with other risk factors’ effects) to the clinical outcome through a generalized linear model. An effective algorithm is proposed to optimize the objective function to estimate the regression coefficients, which are then used to predict clinical outcomes for new observations.
Our article is organized as follows. We describe the TCRpred method in detail in the Model and method section. In the Simulation section, we conduct simulation studies under various scenarios to evaluate the performance of the proposed approach and compare it to potential alternative methods. In the Real data analysis section, we apply TCRpred to lung cancer datasets from the Cancer Genome Atlas (TCGA) and show that the TCRpred method performs well in practical data analysis.
2 Model and method
2.1 Notation and model
Assume that there are n individuals in a study. For the ith individual, let Yi be a binary or continuous response, and
Given two individuals (i and i′), Ri often differs substantially from Ri’ in their compositions and abundances, and hence there are few common features that can be directly used for clinical outcome prediction. We propose extracting features from the TCR repertoire based on TCR’s sequence information. Given that each TCR-CDR3 sequence is a string of amino acid letters (such as CASSHGRAEAFF), we consider the strategy of extracting k-mers from each sequence and then aggregate the k-mers across all the sequences in a TCR repertoire. An example is shown in Supplementary Figure S1. This strategy shares spirit with the natural language processing where k-grams, contiguous sequences of k items from a given document, are extracted for text classification (Zhang and Rao, 2020). Because the number of amino acids is 20, the number of possible k-mers is 20k, which increases rapidly with k. For example, the number of possible 4-mers is 204 = 160, 000. The extremely high dimensionality poses tremendous challenges to data analysis and can potentially harm the prediction accuracy. Hence, in practical analysis of amino acid sequences, k is often chosen to be between 2 and 5 (ValizadehAslani et al., 2020). Besides the k-mers extraction, other ways to extract features from the TCR sequence, such as counting the V(D)J gene usage, can be adopted as well. Let Z (Ri) denote the vector of all the features extracted from Ri. For ease of notation, we use Zi to represent Z (Ri) in the remainder of this article.
While some features can be explicitly extracted from the TCR sequences, other features that involve tertiary structure or long-range amino acid interactions are often difficult to extract. To accommodate such hidden features, we consider the following semi-parametric model where the effect of the hidden features is modeled through kernel machines. Let π(⋅) denote a link function. For continuous traits, π(⋅) is the identity function, and for binary traits, π(x) = exp(x)/(1 + exp(x)). Then the mean of Yi can be represented by
where β0 is an intercept, β and γ are regression coefficients, and h (Ri) represents the effect of the hidden features. Under the kernel machine framework, we assume that h (⋅) belongs to a reproducing kernel Hilbert space
where Mi = {1, …, mi} and Mi’ = {1, …, mi′} for i, i′ = 1, …, n. Let K be an n × n matrix defined based on k (Ri, Ri’). The k (Ri, Ri’) accounts for both the amino acid information and the abundances of the TCR sequences, and fully characterizes the functional space of the hidden effect h (Ri). A workflow of the TCRpred is shown in Figure 1.
2.2 Continuous outcome
To build the prediction model, we need to estimate parameters in Eq 1 and obtain an explicit form for h (⋅). First, considering that the extracted features can be high dimensional, we impose a penalty to the regression coefficients γ to reduce dimensions (i.e., remove noise features). Then, under the kernel machine framework, the estimation proceeds by minimizing the following penalized loss function
where λ0 and λ1 are regularization parameters, |⋅|1 is the L1 norm, and
We propose the following procedure to solve the above optimization problem. Notice that when β0, β, γ are fixed, by the Representer’s Theorem, a general solution for h (Ri) can be expressed as
Let
For initialization, let Z(0) = Z and the starting value
Step 1. Fix
By minimizing Eq 2, we obtain estimates
Step 2. Fix
Then, we estimate α as follows,
where
Step 3. Compute the error term
and then calculate the
Define MSE(0) to be +∞. Iterate Steps 1 - 3 until a convergence criterion is met, i.e., |MSE(t) − MSE(t−1)| ≤ ϵ for a small value of ϵ or the maximum number of iterations is reached.
Once the model has been trained, we can use the trained model to conduct prediction tasks. Suppose that a new sample consists of Xi’ and Ri’. We first extract features from Ri’. Based on the features in the final Z(t), we extract the corresponding features from Ri’ to form a feature vector Zi’. Then, we calculate the TCR homology between the new individual and the previous n training individuals, yielding a n × 1 vector which is denoted by Ki’. Then, we plug in Xi’, Zi’, and Ki’ into the trained model,
to obtain the predicted value
2.3 Binary outcome
For the binary outcome, we have the objective function as
where λ2 and λ3 are regularization parameters.
As for the linear outcome, we propose an iterated procedure to solve the optimization problem. With a similar argument and by the Representer’s Theorem, we aim to solve the following objective function
To minimize this objective function, we propose to transform the binary outcome into a linearized form (Park and Hastie, 2008) and then conduct the optimization accordingly. For initialization, we fit a regularized logistic regression for Y with respect to X and Z, i.e.,
to obtain estimates
Step 1. Compute the working response
where
Step 2. Let
to obtain estimates
Step 3. Following a similar argument to the linear outcome, we estimate α by
where
Step 4. Compute the cross entropy
Iterate Steps 1 - 4 above until a convergence criterion is met, i.e.,
In practice, because the number of extracted features is ultra high dimensional, one may need to conduct a feature-screening step before applying the above algorithm. To do this, we propose to first exclude k-mers whose frequencies are less than 5%. Then, we conduct a sure-independence screening to reduce the dimension of Z to a moderate number, e.g., n/(2 log n).
3 Simulation
In this section, we conducted simulation studies to examine the performance of the proposed approach. We first built a pool of TCR repertoires using the Cancer Genome Atlas (TCGA) data. We extracted TCR beta-chain’s CDR3 sequences from TCGA’s RNA-Seq data following Chen et al. (2021). We removed TCR sequences that had abundance equal to 1 or contained abnormal amino acid letters. Individuals with a single TCR sequence were excluded. After data processing, we obtained 8,044 individuals’ TCR repertoires as a TCR pool for the subsequent numerical experiments.
Following the outline of Eq 1, we simulated an adjusting variable Xi1 from N (0, 1) for i = 1, …, n. The intercept and the coefficient for Xi1 were set as β0 = 2, β1 = −1, respectively. For each individual, we randomly sampled a TCR repertoire, i.e., Ri, from the TCGA-TCR pool. We considered k = 3, 4 for the binary outcome and k = 3, 4, 5 for the continuous outcome. Then, for a given k, we extracted all k-mers from Ri (i = 1, …, n), and recorded the frequencies of the k-mers as a feature matrix Z. Since Z may have uneven variances for its columns, we normalized Z by dividing each column by its 0.75 quantile of values. Then, the six most frequent k-mers were set as important features. The corresponding regression coefficients for the 6 features, γj (j = 1, …, 6), were simulated from c0×Uniform (-1, 1). We considered c0 = 1, 3, 5 for the binary outcome, and c0 = 1 for the continuous outcome. Given Ri (i = 1, …, n), we calculated the n × n TCRhom matrix K (Liu et al., 2022). The homology matrix K was constructed based on the BLOSUM62 or the PAM250. The ith column of K represents the similarities between the ith individual and the other individuals. Then, in line with Sun et al. (2013), we simulated the hidden effects h (Ri) ∼ N (0, τK), where τ is a scale factor. We set τ to 5 in our simulations. A low-rank approximation via eigen-decomposition was used to ensure that the homology matrix K is positive semi-definite. For the binary outcome, the proportion of cases was between 0.3 and 0.7 for the analyzed datasets. For each replicate, we simulated 500 training samples and 500 testing samples. We replicated 500 times for each parameter setting. For performance evaluation, we used classification error and area under the ROC curve (AUC) for the binary outcome, and the mean squared error (MSE) for the continuous outcome.
We compared the proposed TCRpred with potential alternative approaches: Basic-GLM, tcrLASSO, tcrRidge, and DeepTCR. Since the DeepTCR does not consider adjusting-covariates X, we first simulated data without adjusting-covariates to compare the considered approaches. For the Basic-GLM, we fitted a GLM model with an intercept. The true underlying value of k was used to extract the feature matrix Z. For the tcrLASSO and tcrRidge, we conducted a screening on the extracted features Z to obtain the top n/(2 log n) k-mers, and then fit a regularized regression (via either LASSO or Ridge) for the top k-mers. For TCRpred, depending on whether K was based on BLOSUM62 or PAM250, this approach yielded two versions, TCRpred_B and TCRpred_P. Our simulation results are shown in Supplementary Tables S1, 2, and it can be seen that the proposed TCRpred achieved the highest prediction accuracy among the compared approaches.
Next, we simulated data that included adjusting-covariates, and then compared TCRpred with Basic-GLM, tcrLASSO, and tcrRidge by incorporating X. The DeepTCR was omitted because it does not accommodate adjusting covariates. The results are shown in Tables 1-4 for the binary outcome, and in Table 5 for the continuous outcome. TCRpred tends to have a better prediction performance in terms of lower classification errors and higher AUC than the compared approaches. When the data were simulated based on BLOSUM62, the TCRpred_P still performed well. Similarly, when the data were simulated based on PAM250, the TCRpred_B also performed well. This indicated that TCRpred was robust to the choice of the substitution matrices. We also simulated data with τ = 8, and the results showed a similar pattern (Supplementary Tables S3–7). In order to examine the influence of choice of k in model fitting on the performance of the considered methods, we used k = 2, 4 in method application for the binary outcome when 3-mers were employed in data generation. The results showed that the proposed methods still performed well even when k was misspecified (Supplementary Tables S8, 9).
TABLE 1. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 3-mers (k = 3) and BLOSUM62.
TABLE 2. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 4-mers (k = 4) and BLOSUM62.
TABLE 3. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 3-mers (k = 3) and PAM250.
TABLE 4. Classification error (C.Err) and AUC for the binary outcome. Data were simulated based on 4-mers (k = 4) and PAM250.
TABLE 5. Mean squared error for the continuous outcome. Data were simulated based on BLOSUM62 (left panel) or PAM250 (right panel).
4 Real data analysis
Lung cancer is the leading cause of cancer-associated death, and non-small cell lung cancer (NSCLC) accounts for approximately 85% of total lung cancer cases (Blandin Knight et al., 2017). Both lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) are common subtypes of NSCLC. Evaluating prediction errors requires large sample sizes. Since each of the two datasets has a limited sample size, we used LUSC as the training dataset and LUAD as the testing dataset to evaluate the proposed approach.
We obtained TCR β-chain’s CDR3 sequences of LUSC and LUAD by following the same processing and filtering procedure described in Section 3. The details of data processing were given in Supplementary Material S5. We focused on stage I patients because the immune profiles of early stage patients were less likely to be altered by clinical treatments. Following Liu et al. (2022), we dichotomized the overall survival (OS) time into short/long-term survival based on the median survival time in the LUSC and the LUAD data, respectively. We wish to compare the performance of the considered models on the classification of the survival status. The Basic-GLM model included age, gender and Shannon entropy. We adjusted for age, gender and Shannon entropy for tcrLASSO, tcrRidge, and TCRpred. After removing individuals with missing adjusting covariates, 78 and 65 individuals remained in the training and the testing datasets, respectively. The Shannon entropy was computed as
We considered k = 3 for constructing the extracted-feature matrix Z. Each column in Z was scaled by the 0.75 quantile of the non-zero entries in the corresponding column. PAM250 was used to construct the homology matrix for TCRpred. For DeepTCR, the V(D)J gene usages were included and the default setting was used. The results of the prediction performance for the compared methods were included in Table 6. Our results showed that the TCRpred had the lowest classification error and highest AUC among these methods. Compared to the tcrLASSO and tcrRidge, the TCRpred additionally considered the TCR-repertoire homology, which harnesses the effects of the hidden features to improve the prediction performance.
The effect of the hidden features for the LUAD dataset is shown in Figure 2. The two survival groups appeared to have different means in their effects of the hidden features. Specifically, for the short survival group, the mean for the effects of the hidden features is close to 0, while for the long survival group, the counterpart is 0.114. For the extracted features, the TCRpred approach identified 7 TCR-sequence features, GNE, ETQ, AGG, GGR, GDT, RYN, and PDR. The tcrLASSO also identified the same set of 3-mers as the TCRpred. The estimated regression coefficients for these 3-mers in the TCRpred model were included in Supplementary Table S10. Further analysis indicated that the GNE was often harbored in the longer motif GNEQFF, and the ETQ was often included in the motif ETQYF (Figure 3). The motif GNEQFF belongs to the T cell receptor beta joining 2-1 (TRBJ2-1) segment (Demarest, 1996). The TRBJ2-1 segment is enriched in lymphoid tissue (Farmanbar et al., 2019) which is closely related to tumor metastasis in resected NSCLC (Rakaee et al., 2021). The motif ETQYF belongs to the TRBJ2-5 segment (Demarest, 1996) which plays potential prognostic roles in predicting postoperative recurrence of NSCLC (Song et al., 2020). It will be interesting to study the potential antigen targets for these motifs, i.e., GNEQFF and ETQYF. Further experimental studies are needed to shed light on the functional importance of the identified k-mers and motifs.
5 Discussion
We have developed an approach, TCRpred, for incorporating TCR repertoire data for predicting clinical outcomes. Our approach harnesses information from both extracted features and hidden features in the TCR repertoire, and is applicable to both binary and continuous outcomes. With TCR profiling being increasingly used in diagnosis and monitoring of cancer patients, our proposed approach provides a powerful tool for assessing patients’ disease risks and informing decision making in clinical treatment.
It is worth to note that the problem of predicting antigen-cognate for TCR sequences (Hudson et al., 2023) is different from the problem of using TCR repertoire to predict clinical outcome. For the former problem, the predictor is a single TCR sequence, while for the latter, the predictors are a large set of TCR sequences with different lengths and compositions, along with demographic and clinical variables. The latter problem requires aggregating information across different TCR sequences and further integrate genetic and clinical variables for predicting the outcome. Nevertheless, both problems are highly challenging, and more research efforts are needed to study these important problems.
In our analysis, the extracted feature matrix was constructed based on an exhaustive search of amino acid k-mers in the studied TCR repertoire. Such an agnostic approach ensures that every possible k-mers is interrogated, but on the other hand, the exhaustive search brings in the high-dimensionality issue. The dimension of the extracted features increases exponentially with k, posing tremendous challenges to data analysis when k is large. To overcome such challenges, a possible strategy is to utilize prior biological knowledge to narrow down the scope of k-mers being searched, and then focus on a smaller set of k-mers for risk prediction. Multiple databases have been built to include both antigen information and the corresponding TCR sequences, such as the VDJdb (Bagaev et al., 2020). While many of the collections in these databases are derived from infectious disease studies, collections on tumor antigens and their TCR sequences are expected to grow significantly in the coming years. It will be highly meaningful to explore the use of such databases for improving the power of risk prediction.
TCR repertoire involves dynamic changes along disease progression and clinical treatment, and some studies have been designed to monitor the TCR repertoire at multiple time points (Öjlert et al., 2021). Such studies capture not only the immuno profiles at different stages, but also the shift of certain sub-populations of the T cells which may be critical for evaluating treatment responses. On the other hand, the involvement of longitudinal data adds one more layer of complexity to the TCR repertoire analysis, and how to effectively analyze such data remains to be investigated.
To conclude, the proposed method, TCRpred, can be used for clinical outcome prediction by harnessing both the compositions and the sequence information of the TCR repertoire. Our simulation studies showed that the TCRpred outperformed the compared alternative approaches under various parameter settings. In real data analysis, the proposed method performed well and identified a group of k-mers that are potentially related to the survival status of lung cancer patients. Overall, the TCRpred adds a useful tool to the existing toolbox for the analysis of TCR repertoire.
Data availability statement
Clinical data can be accessed at https://xenabrowser.net/datapages/. The TCR data were extracted from RNASeq data, which can be accessed at https://portal.gdc.cancer.gov/ with approval from GDC.
Author contributions
ML: Conceptualization, Data curation, Formal Analysis, Methodology, Software, Validation, Writing–original draft, Writing–review and editing. YL: Conceptualization, Methodology, Writing–review and editing. LH: Conceptualization, Writing–review and editing. QH: Conceptualization, Supervision, Methodology, Writing–original draft, Writing–review and editing, Funding acquisition.
Funding
The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research is supported by the National Institutes of Health, R01CA223498, R01CA189532, R01GM105785, and P30CA015704.
Acknowledgments
We thank the reviewers for their very helpful and constructive comments.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2024.1345559/full#supplementary-material
References
Bagaev, D. V., Vroomans, R. M., Samir, J., Stervbo, U., Rius, C., Dolton, G., et al. (2020). Vdjdb in 2019: database extension, new analysis infrastructure and a t-cell receptor motif compendium. Nucleic Acids Res. 48, D1057–D1062. doi:10.1093/nar/gkz874
Blandin Knight, S., Crosbie, P. A., Balata, H., Chudziak, J., Hussell, T., and Dive, C. (2017). Progress and prospects of early detection in lung cancer. Open Biol. 7, 170070. doi:10.1098/rsob.170070
Chen, K., Bai, J., Reuben, A., Zhao, H., Kang, G., Zhang, C., et al. (2021). Multiomics analysis reveals distinct immunogenomic features of lung cancer with ground-glass opacity. Am. J. Respir. Crit. care Med. 204, 1180–1192. doi:10.1164/rccm.202101-0119OC
Chopp, L., Redmond, C., O’Shea, J. J., and Schwartz, D. M. (2022). From thymus to tissues and tumors: a review of t cell biology. J. Allergy Clin. Immunol. 151, 81–97. doi:10.1016/j.jaci.2022.10.011
Clambey, E. T., Davenport, B., Kappler, J. W., Marrack, P., and Homann, D. (2014). Molecules in medicine mini review: the αβ t cell receptor. J. Mol. Med. 92, 735–741. doi:10.1007/s00109-014-1145-2
Cordes, M., Pike-Overzet, K., van Eggermond, M., Vloemans, S., Baert, M. R., Garcia-Perez, L., et al. (2020). Imspectr: R package to quantify immune repertoire diversity in spectratype and repertoire sequencing data
Demarest, J. F. (1996). Analysis of the T cell receptor V (beta) repertoire during primary HIV infection. The George Washington University.
Desponds, J., Mora, T., and Walczak, A. M. (2016). Fluctuating fitness shapes the clone-size distribution of immune repertoires. Proc. Natl. Acad. Sci. 113, 274–279. doi:10.1073/pnas.1512977112
Farmanbar, A., Kneller, R., and Firouzi, S. (2019). Rna sequencing identifies clonal structure of t-cell repertoires in patients with adult t-cell leukemia/lymphoma. NPJ Genomic Med. 4, 10. doi:10.1038/s41525-019-0084-9
He, Q., Cai, T., Liu, Y., Zhao, N., Harmon, Q. E., Almli, L. M., et al. (2016). Prioritizing individual genetic variants after kernel machine testing using variable selection. Genet. Epidemiol. 40, 722–731. doi:10.1002/gepi.21993
Hudson, D., Fernandes, R. A., Basham, M., Ogg, G., and Koohy, H. (2023). Can we predict t cell specificity with digital biology and machine learning? Nat. Rev. Immunol. 23, 511–521. doi:10.1038/s41577-023-00835-3
ImmunoMind Team (2019). Immunarch: an R Package for Painless Bioinformatics analysis of T-cell and B-cell immune repertoires. doi:10.5281/zenodo.3367200
Kumar, B. V., Connors, T. J., and Farber, D. L. (2018). Human t cell development, localization, and function throughout life. Immunity 48, 202–213. doi:10.1016/j.immuni.2018.01.007
Li, N., Yuan, J., Tian, W., Meng, L., and Liu, Y. (2020). T-cell receptor repertoire analysis for the diagnosis and treatment of solid tumor: a methodology and clinical applications. Cancer Commun. 40, 473–483. doi:10.1002/cac2.12074
Liu, M., Goo, J., Liu, Y., Sun, W., Wu, M. C., Hsu, L., et al. (2022). Tcr-l: an analysis tool for evaluating the association between the t-cell receptor repertoire and clinical phenotypes. BMC Bioinforma. 23, 152–216. doi:10.1186/s12859-022-04690-2
Ni, Q., Zhang, J., Zheng, Z., Chen, G., Christian, L., Grönholm, J., et al. (2020). Vistcr: an interactive software for t cell repertoire sequencing data analysis. Front. Genet. 11, 771. doi:10.3389/fgene.2020.00771
Nikolich-Žugich, J., Slifka, M. K., and Messaoudi, I. (2004). The many important facets of t-cell repertoire diversity. Nat. Rev. Immunol. 4, 123–132. doi:10.1038/nri1292
Öjlert, Å. K., Nebdal, D., Snapkov, I., Olsen, V., Kidman, J., Greiff, V., et al. (2021). Dynamic changes in the t cell receptor repertoire during treatment with radiotherapy combined with an immune checkpoint inhibitor. Mol. Oncol. 15, 2958–2968. doi:10.1002/1878-0261.13082
Park, M. Y., and Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics 9, 30–50. doi:10.1093/biostatistics/kxm010
Parras-Moltó, M., Campos-Laborie, F. J., García-Diéguez, J., Rodríguez-Griñolo, M. R., and Pérez-Pulido, A. J. (2013). Classification of protein motifs based on subcellular localization uncovers evolutionary relationships at both sequence and functional levels. BMC Bioinforma. 14, 229–311. doi:10.1186/1471-2105-14-229
Rakaee, M., Kilvaer, T. K., Jamaly, S., Berg, T., Paulsen, E.-E., Berglund, M., et al. (2021). Tertiary lymphoid structure score: a promising approach to refine the tnm staging in resected non-small cell lung cancer. Br. J. Cancer 124, 1680–1689. doi:10.1038/s41416-021-01307-y
Rock, E. P., Sibbald, P. R., Davis, M. M., and Chien, Y.-H. (1994). Cdr3 length in antigen-specific immune receptors. J. Exp. Med. 179, 323–328. doi:10.1084/jem.179.1.323
Shafer, P., Kelly, L. M., and Hoyos, V. (2022). Cancer therapy with tcr-engineered t cells: Current strategies, challenges, and prospects. Front. Immunol. 13, 835762. doi:10.3389/fimmu.2022.835762
Shah, K., Al-Haidari, A., Sun, J., and Kazi, J. U. (2021). T cell receptor (tcr) signaling in health and disease. Signal Transduct. Target. Ther. 6, 412. doi:10.1038/s41392-021-00823-w
Sidhom, J.-W., Larman, H. B., Pardoll, D. M., and Baras, A. S. (2021). Deeptcr is a deep learning framework for revealing sequence concepts within t-cell repertoires. Nat. Commun. 12, 1605. doi:10.1038/s41467-021-21879-w
Song, Z., Chen, X., Shi, Y., Huang, R., Wang, W., Zhu, K., et al. (2020). Evaluating the potential of t cell receptor repertoires in predicting the prognosis of resectable non-small cell lung cancers. Mol. Therapy-Methods Clin. Dev. 18, 73–83. doi:10.1016/j.omtm.2020.05.020
Stiffler, M. A., Poelwijk, F. J., Brock, K. P., Stein, R. R., Riesselman, A., Teyra, J., et al. (2020). Protein structure from experimental evolution. Cell Syst. 10, 15–24. doi:10.1016/j.cels.2019.11.008
Sun, J., Zheng, Y., and Hsu, L. (2013). A unified mixed-effects model for rare-variant association in sequencing studies. Genet. Epidemiol. 37, 334–344. doi:10.1002/gepi.21717
Tanno, H., Gould, T. M., McDaniel, J. R., Cao, W., Tanno, Y., Durrett, R. E., et al. (2020). Determinants governing t cell receptor α/β-chain pairing in repertoire formation of identical twins. Proc. Natl. Acad. Sci. 117, 532–540. doi:10.1073/pnas.1915008117
ValizadehAslani, T., Zhao, Z., Sokhansanj, B. A., and Rosen, G. L. (2020). Amino acid k-mer feature extraction for quantitative antimicrobial resistance (amr) prediction by machine learning and model interpretation for biological insights. Biology 9, 365. doi:10.3390/biology9110365
Valpione, S., Mundra, P. A., Galvani, E., Campana, L. G., Lorigan, P., De Rosa, F., et al. (2021). The t cell receptor repertoire of tumor infiltrating t cells is predictive and prognostic for cancer survival. Nat. Commun. 12, 4098. doi:10.1038/s41467-021-24343-x
Keywords: CDR3, clinical outcome prediction, high dimensions, high throughput sequencing, T-cell receptor, TCR repertoire
Citation: Liu M, Liu Y, Hsu L and He Q (2024) TCRpred: incorporating T-cell receptor repertoire for clinical outcome prediction. Front. Genet. 15:1345559. doi: 10.3389/fgene.2024.1345559
Received: 28 November 2023; Accepted: 26 February 2024;
Published: 13 March 2024.
Edited by:
Ni Zhao, Johns Hopkins University, United StatesReviewed by:
Hiren Karathia, National Cancer Institute at Frederick (NIH), United StatesMo Li, University of Louisiana at Lafayette, United States
Copyright © 2024 Liu, Liu, Hsu and He. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Qianchuan He, cWhlQGZyZWRodXRjaC5vcmc=