How Can Gene-Expression Information Improve Prognostic Prediction in TCGA Cancers: An Empirical Comparison Study on Regularization and Mixed Cox Models

Yu, Xinghao; Wang, Ting; Huang, Shuiping; Zeng, Ping

doi:10.3389/fgene.2020.00920

ORIGINAL RESEARCH article

Front. Genet., 21 August 2020

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.00920

How Can Gene-Expression Information Improve Prognostic Prediction in TCGA Cancers: An Empirical Comparison Study on Regularization and Mixed Cox Models

$\r\nXinghao Yu$ Xinghao Yu¹

Ting Wang¹

Shuiping Huang^1,2*

Ping Zeng^1,2*

¹Department of Epidemiology and Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, China
²Center for Medical Statistics and Data Analysis, School of Public Health, Xuzhou Medical University, Xuzhou, China

Background: Previous cancer prognostic prediction models often consider only the most important transcriptomic expressions, and their power is limited. It is unknown whether prediction power can be further improved when additional transcriptomic information is incorporated.

Methods: To integrate transcriptomes, four models are compared based on 32 types of cancer in the Cancer Genome Atlas, including the general Cox model with only clinical covariates, the Cox model with a lasso penalty (coxlasso), the Cox model with an elastic net penalty (coxenet), and the mixed-effects Cox model (coxlmm). Furthermore, we partition the survival variance into the relative contribution of clinical and transcriptomic components within the framework of coxlmm. Finally, the influence of different numbers of genes was evaluated in the context of coxlmm.

Results: Compared with the clinical covariates–only Cox model, the average prediction gain was 2.4% for coxlasso, 4.2% for coxenet, and 7.2% for coxlmm across 16 low-censored cancers; a significant elevation of prediction power was observed for SARC, SKCM, LGG, PAAD, and HNSC. Similar findings were observed for all 32 cancers with the average prediction gain of 2.7, 3.8, and 5.8% for coxlasso, coxenet, and coxlmm. Coxlmm always had comparable or better prediction performance relative to coxlasso and coxenet with an average of 2.8% prediction improvement across the 16 low-censored cancers. In addition, it is shown that the predictive accuracy of coxlmm generally increases with the number of genes included. The survival variance partition analysis demonstrates that the transcriptomic contribution was higher for some cancers (e.g., LGG, CESC, PAAD, SKCM, and SARC) and lower for others (e.g., BRCA, COAD, KIRC, and STAD).

Conclusion: This study demonstrates that the integration of transcriptomic information can substantially improve prognostic prediction accuracy, but the prediction performance is cancer-specific and varies across cancer types. It further reveals that gene expression exhibits distinct contributions to survival variation across cancers.

Introduction

Cancer is one of the primary causes of death worldwide, leading to a growing severe threat to public health (Kyu et al., 2018; Roth et al., 2018; Siegel et al., 2019). Developing accurate models for prognostic prediction has long been an active research topic in cancer epidemiological study. Traditional prognostic prediction models consider only clinical and lifestyle information (e.g., tumor features, patient conditions, a few social and environmental factors) (Mallett et al., 2010); the predictive power is, hence, limited. The past few years have witnessed a great advent of high-throughput biotechnologies, and cancer research has already entered the era of precision medicine, in which patients’ individual genomic information is profiled to develop better diagnoses and treatment strategies that are tailored to their own tumors (Ashley, 2015; Collins and Varmus, 2015). As one of the largest genomics programs in the field of cancer biology, the Cancer Genome Atlas (TCGA) has generated multiple high-dimensional omic profiles (e.g., genomic, epigenomic, and transcriptomic biomarkers) on more than 11,000 patients with various types of cancer, offering an unprecedented opportunity for designing individualized treatments in terms of omics information (Weinstein et al., 2013; Tomczak et al., 2015).

Meanwhile, numerous prediction models were established by efficiently integrating available omics data sets for evaluating the progress and prognosis of tumors (Zhu et al., 2017), including head and neck squamous cell carcinoma (Shen et al., 2017a), oral squamous cell carcinoma (Shen et al., 2017b), and gastric cancer (Zhou et al., 2017). However, in terms of our literature review, we find many previous predictions only incorporate a small set of biomarkers into models in addition to clinical covariates (Supplementary Table S1); for example, only seven CpG-based methylation signatures were employed in Shen et al. (2017a), and alternatively, only some important biomarker information was extracted with dimensional reduction methods (e.g., principal component analysis, partial least squares, or variable selection methods; Zhao et al., 2014; Tang et al., 2017a, b) and was employed for prediction. This may be due to statistical difficulties when modeling high-dimensional omic data sets or due to the consideration of clinical application, in which a small set of biomarker predictors allow developing an implementable assay in clinical practice. From a methodological perspective, those existing approaches can be deemed to be sparse models as they explicitly assume only a small fraction of omics information is useful for prognostic prediction.

Although it has been shown that sparse models have the ability to improve the predictive accuracy of patients’ prognoses, we still have some concerns with regards to their prediction power due to the following limitations. First, the sparsity is a relatively strong assumption with little biological support. In practice, the truly genomic architecture for tumor progress and prognosis is rarely known in advance and is likely to vary dramatically among cancers (Zeng and Zhou, 2017; Zhu et al., 2017). Second, the performance of those prognostic methods generally depends on the correlation of selected biomarkers (e.g., CpG sites or mRNAs) with the target cancer. However, the selection of associated genes with cancer prognosis remains challenging (Chen and Hunter, 2005; Bouvard et al., 2009; El Ghissassi et al., 2009; Secretan et al., 2009; Plummer et al., 2016); as a consequence, the biomarkers incorporated into models are not necessarily highly predictive of tumor prognosis. Third, many potential biomarkers, which are excluded due to moderate or weak effect sizes, are possibly jointly important for survival variation (Eskin, 2015) and are actually useful for prognostic prediction (Yang et al., 2010; Golan and Rosset, 2011; Zhou et al., 2013). Therefore, the sparse prognostic prediction methods may be suboptimal.

Indeed, the prediction accuracy for prognostic prediction was recently shown to increase from 0.58, 0.62 to 0.64 when integrating 100, 300, or 5000 mRNAs into the model (Zhu et al., 2017), implying that, to some extent, incorporating more informative biomarkers can improve prediction performance. In contrast to sparse models, mixed-effects models, which include all available biomarkers, have shown promising accuracy in genetic prediction (Makowsky et al., 2011; Zhou et al., 2013; Zeng and Zhou, 2017). Therefore, a natural consideration is whether integrating a large number of biomarkers (e.g., genome-wide transcriptomic expressions) into prognostic models can further improve prediction performance. The difficulty is to determine how many genes should be included. Note that the inclusion of whole genome-wide transcriptomic expressions is one of the most commonly representative choices. It is also of great interest to empirically evaluate and compare the prediction performance of mixed models with sparse models when a large amount of omics information is available.

As it has been demonstrated that gene expressions possess the best predictive power for cancer prognostic assessment compared with other genomic measurements related to survival risk (Zhao et al., 2005, 2014; Zhu et al., 2017; Kim et al., 2018), in the present study, we only focus on this kind of omic information to explore how transcriptome data can be leveraged to improve prediction accuracy relative to prior sparse methods. First, a general Cox model with only clinical covariates is considered to be a benchmark. Next, we employ the sparse Cox model with an elastic net penalty and the linear mixed-effects Cox model to integrate clinical information as well as gene-expression levels. Moreover, we quantify the relative contribution of clinical and transcriptomic information to survival variance within the framework of a mixed-effects Cox model. In order to deeply evaluate the four kinds of Cox prediction models for the survival prognosis of cancer patients, we empirically apply them to 32 types of cancer in TCGA. Our results illustrate that the aggregation of genome-wide transcriptomic information can improve the prediction power substantially and further reveal that expression measurements show varying contributions to survival variation across cancers.

Materials and Methods

Overview of Four Prognostic Cox Prediction Models

We here offer an overview of the four kinds of Cox prognostic prediction models used in the present study with detailed descriptions of those models relegated to Supplementary File. Let X_i be a p-dimensional vector for available clinical covariates (e.g., disease stage, age, and gender) for individual i in the TCGA data set (Hoadley et al., 2018) and assume each X is standardized to have mean zero and variance one. Denote the observed survival time by t_i and the true survival time by T_i with d_i indicating the censored status (i.e., d_i = 1 if T_i = t_i, whereas d_i = 0 if T_i < t_i). We then employ the most widely used Cox model (Cox, 1972) $h (t_{i} | X_{i}) = h_{0} (t_{i}) e^{X_{i}^{T} a}$ to link survival risk with clinical information, where h₀(t) is an arbitrary baseline hazard function and a = (a₁, a_2,…, a_p) is a p-dimensional vector of effect sizes for covariates.

Let G_i be an m-dimensional vector for a set of expression levels for individual i and assume each expression is standardized. The Cox model including both X_i and G_i is written as $h (t_{i} | X_{i}, G_{i}) = h_{0} (t_{i}) e^{X_{i}^{T} a + G_{i}^{T} b}$ with b = (b₁, b_2, …, b_m) an m-dimensional vector of effect sizes for expressions. Because of the high dimension (i.e., m ≫ n, see Table 1), we have to apply regularization methods for parameter estimation to avoid model overfitting (Fan and Li, 2001; Zou and Hastie, 2005; Tibshirani, 2011; Hastie et al., 2015) and only focus on the Cox model with the lasso or elastic net penalty (denoted by coxlasso or coxenet, respectively) (Zou and Hastie, 2005; Zeng et al., 2017). We adopt the coordinate descent algorithm (Friedman et al., 2010) to fit coxenet and select the optimal tuning parameter via a subsampling strategy (Hastie et al., 2009). Note that, with a suitable tuning parameter, most of the effect sizes for gene expressions in coxlasso or coxenet would be shrunk to be exactly zero, leading to the so-called sparse model. We set α = 0.50 in the elastic net penalty as done in prior work (Gamazon et al., 2015; Zeng et al., 2017) and implement model fitting with the R glmnet (version 2.0-5) package (Friedman et al., 2010).

TABLE 1

Table 1. Basic information of the raw data sets and quality control for 32 cancers in TCGA used in the present study.

Unlike coxenet, which assumes only a few of genes are involved in the survival risk, the linear mixed-effects Cox model (denoted by coxlmm) explicitly supposes that all genes may be implicated in cancer progress and have nonzero effects (Zhou et al., 2013; Ott, 2016) (i.e., $b_{j} \sim N (0, σ_{b}^{2})$ with $σ_{b}^{2}$ the variance). We fit coxlmm with the R coxme (version 2.2-10) package (Therneau, 2019) via the Laplace approximation method based on the second order Taylor series expansion (Therneau et al., 2003).

Relative Overall Importance of Clinical and Transcriptomic Information

To quantify the relative overall importance/contribution of clinical and transcriptomic information to survival phenotypes (Korsgaard et al., 1998; Yazdi et al., 2002; Gorfine et al., 2017), we first make a log-transformation for the hazard function and then define two quantities: the proportion of the survival variation explained by the clinical information (PCE) and the proportion of the survival variation explained by the transcriptome information (PGE). The summation of PCE and PGE is the proportion of the survival variation explained (PVE) by currently available clinical and transcriptomic information together. We apply the Jackknife method to yield the confidence interval for PCE or PGE (Efron and Tibshirani, 1994). Further computational details for PCE and PGE are described in Supplementary File.

TCGA Cancer Data Sets and Quality Control

We now apply these Cox models to cancer data sets publicly available from TCGA (Hoadley et al., 2018). For those cancers, we obtained their clinical information and RNAseq expression levels. Note that, for comparison across all types of cancer in TCGA, we employed pan-cancer normalized gene expression RNAseq (IlluminaHiSeq) data sets provided by UCSC Xena¹. We selected overall survival time and status and included age, gender, and pathologic tumor stage because only these clinical variables are available for most of the patients. When the pathologic tumor stage is unavailable, we alternatively employed the clinical stage (i.e., CESC and OV) or histological grade (i.e., LGG). All three stage variables are missing for five cancer data sets (e.g., SARC). Cancer-specific covariates were also considered for some cancers; for example, two binary variables (i.e., the status of estrogen or progesterone receptor) were added to BRCA.

For each cancer, we first merged clinical information and gene expressions measured from the primary cancer tissue and then excluded samples soaked in the formalin-fixed paraffin-embedded tissue. Cancers with a sample size greater than 175 and a proportion of censored event less than 15% were defined as low-censored data sets. To some extent, these two threshold values (i.e., the sample size of 175 and the proportion of 15%) were selected arbitrarily. Testicular germ cell tumor (TGCT) was excluded as nearly all TGCT patients were alive during follow-up. Therefore, among the 33 publicly available types of TCGA cancer, we reserved 32 cancer data sets in the subsequent prediction analysis and divided them into 16 low-censored and 16 high-censored cancer data sets according to our criterion above. In addition, we removed genes with more than 50% zero expressions and variances smaller than 20% quantile of expressions (Tang et al., 2017a; Yu et al., 2019). Finally, we standardized the remaining gene-expression levels and clinical covariates. The data sets used in this study are summarized in Table 2.

TABLE 2

Table 2. Summary information of 32 types of cancer in TCGA.

Model Comparison and Implementation

Following prior work (Zeng and Zhou, 2017), we conducted a subsampling strategy to evaluate prediction performance. We performed 100 Monte Carlo cross-validations (MCCVs) by randomly dividing the total cancer data set into two parts with 80% of the samples as training data and the remaining 20% as test data. Then, we fitted prediction models in the training data and calculated Harrell’s concordance index (C-index) to measure the prediction accuracy in the test data (Harrell et al., 1982). The C-index is a measure of goodness of fit for survival models, which produce risk scores. Specifically, assume all the subjects under consideration are randomly paired. In survival analysis, if one of them has a longer survival time and predicted survival time or predicted survival probability than the other, it is said that the predicted result is consistent with the actual result, which is referred to as the concordance. Then, Harrell’s C-index can be simply calculated as the proportion of the number of concordant pairs in the total number of concordant and discordant pairs. A C-index statistic of 0.5 means complete inconsistency, indicating the model has little predictive ability, and a C-index statistic of one represents complete consistency, indicating the prediction results of the model are completely in line with the actual situations. To yield the prediction gain of one model (say M1) relative to another model (say M2), we computed (C_M2-C_M1)/C_M1 with C_M2 and C_M1 the C-index statistic values for the two models.

In our prediction analysis, as is shown, compared with other models (e.g., coxenet), coxlmm displays a robust predictive performance by integrating whole gene-expression profiles. However, it remains unknown whether the improved performance of coxlmm is due to the incorporation of useful transcriptomic information or just a consequence of overfitting owing to more parameters being involved. Therefore, in order to further validate the prediction performance of coxlmm, for each cancer, we simultaneously generated a new data set by only permuting expression levels. Doing this is equivalent to adding a set of noise expressions into the prediction model. For each type of cancer, we re-performed coxlmm in terms of MCVC based on unchanged covariates (and survival time and status) and permuted expressions. Theoretically, the newly permuted expression levels are little predictive, and the resulting C-index statistic is similar to or even lower than that obtained with the total original data sets because the permuted expressions are redundant for prediction once they are included into models. In order to further validate our assumption that a large number of weak-effect genes also may be useful in the prediction of cancer prognoses and to examine whether the whole transcriptome model is optimal, we compare the accuracy of models with different sets of genes via coxlmm based on 20 replications of MCCV. We considered two schemes to select genes: first, genes were randomly selected; second, genes were selected in terms of marginal p-values of the univariate Cox model in the training data. Finally, the prediction accuracy was evaluated in the testing data set via C-index.

Results

Evaluation of Prognostic Prediction

A total of 32 TCGA cancers are included in our analysis, and the summary results of those cancers are displayed in Tables 1, 2. Here, we mainly consider three covariates (i.e., age, sex, and stage or grade) if they are available for some cancers and make a comparison for the four Cox models in terms of the difference in C-index statistic. We observe that the clinical information displays varying prognostic prediction ability across cancers. Specifically, the C-index statistic ranges from 0.54 for PAAD to 0.81 for KIRP with an average of 0.66 for the low-censored cancers; the C-index statistic ranges from 0.46 for PRAD to 0.81 for KIRP with an average of 0.66 for all 32 cancers.

Prediction Performance for the 16 Low-Censored TCGA Cancers

We first focus on the performance of 16 special cancers that have a relatively low-censored rate and, thus, are of our main interest in prognostic prediction. It is worth noting that, when gene expressions are incorporated into the model, the prediction performance is substantially improved for almost all the cancers (Figure 1). Compared with the Cox model with clinical information alone, the averages of the prediction gain for coxlasso, coxenet, and coxlmm are 2.4, 4.2, and 7.2% across cancers, respectively (Supplementary Table S2).

FIGURE 1

Figure 1. Comparison of predictive performance of four models in 16 low-censored cancers. Performance is measured by C-index difference with respect to Cox model with only clinical covariates; therefore, a negative value (i.e., values below the horizontal line) indicates worse performance than the Cox model with only clinical covariates, and the predictive performance was assessed across 100 replicates.

In particular, a much more significant elevation of C-index statistic is discovered for several cancers (e.g., SARC, SKCM, LGG, PAAD, and HNSC) for coxlasso, coxenet, and coxlmm. For example, compared to the Cox model with only clinical covariates, coxlasso displays the maximum improvement for SARC (13.8%), followed by LGG (7.8%), PAAD (6.4%), and HNSC (6.0%). The corresponding gains are 15.6, 7.6, 7.4, and 6.8% for coxenet and 22.8, 8.9, 14.8, and 10.2% for coxlmm for these four cancers, respectively. Furthermore, the prediction improvement for coxlmm is also evident for other cancers, including CESC (14.3%), LIHC (7.9%), KIRP (6.2%), and BLCA (6.2%). Nevertheless, relative to the clinical information–only Cox model, we find there is nearly no remarkable increase in the prediction accuracy of coxlmm, coxenet, or coxlasso for COAD and OV after incorporating transcriptomic information.

Finally, we find coxlmm always has a comparable or the best prediction performance among the methods. For example, coxlmm demonstrates an average of 2.8% prediction improvement across all 16 cancers compared with coxenet with the maximum gain for CESC (9.1%, from 4.8% for coxenet to 14.3% for coxlmm relative to the clinical information–only Cox model), followed by PAAD (6.9%, from 7.4% for coxenet to 14.8% for coxlmm relative to the clinical information–only Cox model). In addition, coxlmm also displays an obvious gain in the prediction performance for LIHC and SARC relative to coxenet with 6.3 and 6.1% increases in the C-index statistic, respectively. These findings imply that the Cox model incorporating all the transcriptome information can further increase prediction ability relative to that including only a small set of important genes.

Prediction Performance for All 32 TCGA Cancers

With regards to all 32 cancers, compared with the Cox model with clinical information alone, the maximum gain is observed for MESO (35.6% for coxlasso, 39.6% for coxenet, and 50.0% for coxlmm) (Supplementary Figures S1, S2). The averages of prediction gain for coxlasso, coxenet, and coxlmm are 2.7, 3.8, and 5.8% across cancers, respectively, and are lower than performance for those low-censored cancer data sets. Compared to coxlasso and coxenet, the average gains of coxlmm are 3.0 and 1.3%, which are also lower than performance for those low-censored cancer data sets.

Prediction Performance of Coxlmm With Permuted Data Sets

When predicting with the permuted data set, we find that, compared with coxlmm with permuted expressions, the original coxlmm model has an average of 4.7% higher C-index statistics across the 16 low-censored cancers and shows an average of 9.0% performance gain for the five most promising cancers with the highest PVE (i.e., CESC, LGG, PAAD, SARC, and SKCM). As anticipated, coxlmm even demonstrates slightly worse performance using those permuted cancer data sets compared with the general Cox model with only covariates. The detailed performance for each permuted cancer is showed in Figure 2. These observations suggest that integrating all the gene expressions into coxlmm does improve the prediction performance and the predictive gain of coxlmm relative to other existing models is not by chance alone.

FIGURE 2

Figure 2. Performance comparison of coxlmm using the original data sets (denoted by coxlmm) and the permuted data sets (denoted by coxlmm-permuted) across the 32 TCGA cancers. The cancers located to the left of the red dotted line are low-censored, and the cancers on the right side are high-censored.

Estimates of PCE and PGE

The estimated PCE and PGE are shown in Figure 3 with the confidence intervals displayed in Supplementary Figure S3 and Supplementary Table S3. Again, here we mainly discuss the estimates for the 16 low-censored cancers. Several interesting findings can be observed. First, except for LUSC (PVE = 4.6%), it is shown that both the clinical and transcriptomic information plays an important role in the survival variation of TCGA cancers (e.g., PVE is greater than 10.0%) with PVE ranging from 10.6% for OV to 63.0% for LGG. Second, for five cancers (i.e., LGG, CESC, PAAD, SKCM, and SARC), PGE is 15.3, 12.1, 17.8, 6.3, and 29.9% higher than PCE (with an average of 16.3%), suggesting the transcriptomic information is relatively more important than the clinical information for those cancers, and the prediction performance would improve substantially if gene-expression levels are included (see results above and Supplementary Table S1; the average increase in C-index is 3.8% for coxlasso, 8.7% for coxenet, and 14.1% for coxlmm for the five cancers). On the other hand, PCE is higher than PGE for four cancers (i.e., BRCA, COAD, KIRC, and STAD), implying that the prediction performance would improve little when integrating transcriptomic information (see results above and Supplementary Table S1; the average increase in C-index is 1.3% for coxlasso, 1.1% for coxenet, and 1.7% for coxlmm for the four cancers). Finally, the remaining cancers (i.e., BLCA, HNSC, KIRP, LIHC, LUAD, LUSC, and OV) show similar estimates for PCE and PGE. We also perform coxlmm to estimate the PCE and PGE in the other 16 high-censored cancers. However, their results are rather unstable, reflected by the wide confidence intervals (Supplementary Figure S3).

FIGURE 3

Figure 3. Estimated PCE and PGE for the 32 TCGA cancer types. The cancers located to the left of the red dotted line are low-censored, and the cancers on the right side are high-censored. PCE represents the proportion of the survival variation explained by the clinical information alone. PGE represents the proportion of the survival variation explained by the transcriptome information alone.

Influence of Different Numbers of Genes Included in Coxlmm

It is found that the prediction performance of coxlmm with different numbers of genes is cancer-specific (Figure 4 and Supplementary Figure S4). For example, integrating more genes in several cancers (e.g., LUSC, LGG, and KIRC) does not lead to the improvement of prediction although doing this indeed increases the prediction performance for other cancers (e.g., CESC, HNSC, and SARC). Generally, no matter how the genes were selected, the prediction accuracy of the model (denoted by thick lines in Figure 4) shows a slightly increasing trend with the included genes and approaches the peak when 3000 genes were employed.

FIGURE 4

Figure 4. (A) The predictive accuracy with a different number of genes after sorting by importance for 16 low-censored cancers. (B) The predictive accuracy with a different number of randomly selected genes for 16 low-censored cancers. The thick pink line represents the average C-index of all cancers.

Discussion and Conclusion

In this study, we focused on four Cox prediction approaches (i.e., the general Cox model, coxlasso, coxenet, coxlmm) and systematically assessed the prognostic values of transcriptomes with the publicly available TCGA pan-cancer data sets (Hoadley et al., 2018). It should be first emphasized that prediction performance is cancer-specific; that is, the prediction accuracy, quantified by C-index statistic, is substantially different across cancer types, consistent with the conclusions found in prior studies (Zhao et al., 2014; Zhu et al., 2017). In particular, our results indicate that coxlmm holds preferable predictive performance compared with the other three methods and reveal that integrating both clinical and transcriptomic information can elevate the prediction accuracy greatly. Furthermore, it also implies that incorporating a large number of gene expressions rather than a small set of selective genes can obtain more gain in prediction accuracy, which generally continues to improve with the increasing number of genes included. We further assess the contributions of clinical covariates and gene expressions to the survival variation within the framework of coxlmm. We can expect that PCE and PGE together should capture almost all the survival variation explained by available information as it has been illustrated that no significant power gain can be achieved by combining other omic measurements into clinical covariates except for gene expression (Zhao et al., 2014). This result is biologically informative and illustrates that the performance of prediction models would improve considerably when transcriptomic information accounts for a large proportion of the variation as quantified by PGE.

Note that our main objective here is not to explore how to efficiently integrate multiple omic profiles into prediction models (Zhao et al., 2014; Yang et al., 2016, 2018; Huang et al., 2017; Zhu et al., 2017) although this is a very interesting and important research area. Instead, we aim to evaluate how a single omic profile (i.e., gene expression) can be utilized to achieve better prognostic prediction in cancers. To the best of our knowledge, the present study is among the first work to comprehensively investigate prognostic prediction with the transcriptome profile on dozens of cancer types (i.e., a total of 32 common complex cancers). Therefore, our results provide complementary insights for prior work in which gene expressions were already shown to be very predictive for cancer prognosis assessment relative to other omic profiles (Zhao et al., 2014; Zhu et al., 2017). In addition, the analysis we carried out here also can be applicable to other types of omic measurements.

We finally highlight some interesting topics for future exploration. First, as mentioned before, we only consider transcriptomic information in our analysis and ignore other omics data sets (e.g., copy number alteration and DNA methylation), and we also cannot incorporate more useful clinical covariates (e.g., smoking or drinking history) as that information is missing during the period of data collection in TCGA. Thus, the performance of our prediction models may be underestimated. Integrating multi-omics information into the model can unquestionably achieve better prediction accuracy and has the potential to provide a deeper understanding of the mechanism of tumor progression.

Second, although TCGA includes many cancer types, its effective sample size is still relatively small, and the censored proportion is high (Hoadley et al., 2018), which may inevitably attenuate the prediction accuracy and undermine our ability to construct prognostic models that can be applicable in practice. In addition, the small sample size may also lead to unstable estimation for PCE and PGE. Thus, the external validation of our models with larger a sample size is warranted.

Third, in the present study, the proposed coxlmm actually can be treated to be a kernel machine learning–based prediction method, and only the linear kernel was considered here. The selection of optimal kernel function to measure the similarity among transcriptomes and to better improve prognostic evaluation needs further investigation (Yang et al., 2016, 2018). In addition, coxlmm also can be viewed as a penalty-type Cox model with ridge regularization (denoted by coxridge), which has the square penalty on effect sizes. We compared coxlmm and coxridge in 16 low-censored cancers and demonstrated that two methods generally had consistent performance (Supplementary Figure S5) although sometimes coxlmm performed slightly better in several cancers.

Fourth, compared with coxlasso and coxenet, we indeed observed that coxlmm, integrating more genes, had a better prediction performance. In general, the predictive accuracy of coxlmm would continue to improve when more genes were integrated. However, the model with whole transcriptomic information may be suboptimal for some cancers (e.g., CESC, HNSC, and SARC) because of the inclusion of redundant genes that were not useful for prediction. Therefore, exploring adaptive models that can select optimal genes is an interesting direction in the future. In addition, following the idea of functional gene enrichment analysis, we can first classify genes into various groups in terms of similar function in the same pathway and then conduct a separate prediction in each group and finally aggregate the individual predictions into an omnibus prediction with some reasonable weighted manners. This will probably be a considerably promising avenue in our further investigation.

Fifth, we find that the mixed-effects Cox model (i.e., coxlmm) sometimes showed low prediction accuracy for some cancers (e.g., KIRC) compared with the sparse Cox model (i.e., coxlasso and coxenet) although it demonstrated promising performance for most of the TCGA cancer types, indicating the two kinds of models have their own advantages depending on the omic architecture of a specific cancer type (Zhou et al., 2013; Tang et al., 2017a, b; Zeng and Zhou, 2017). A natural extension is to combine the two models and generate a hybrid of the mixed and sparse prediction approaches by following prior work in genetic prediction (Zhou et al., 2013; Zeng and Zhou, 2017). The mixed-sparse Cox model is anticipated to provide better prognostic prediction for different cancer types. Simultaneously, this type of model has the ability to select significant genes that may be biologically important for cancers.

Overall, the present study demonstrates that the aggregation of genome-wide transcriptomic information can lead to great improvement in prediction accuracy, but the prediction performance is cancer-specific and varies across cancer types. It further reveals that gene expression shows varying contributions to survival variation across cancers.

Data Availability Statement

The TCGA data sets can be publicly available from https://xenabrowser.net/. Additionally, all data generated or analyzed during this study are included in this article/Supplementary Material. The codes that were employed to perform coxlasso, coxenet and coxlmm can be available from https://github.com/biostatYu/coxlmm.

Author Contributions

PZ and SH conceived the idea for the study. PZ, XY, and TW obtained the data, performed the data analyses, and interpreted the results of the data analyses. PZ and XY drafted the manuscript, and all authors approved the manuscript and provided relevant suggestions.

Funding

This study was supported by the Youth Foundation of Humanity and Social Science funded by the Ministry of Education of China (18YJC910002), Natural Science Foundation of Jiangsu Province (BK20181472), China Postdoctoral Science Foundation (2018M630607 and 2019T120465), Postdoctoral Science Foundation of Xuzhou Medical University, QingLan Research Project of Jiangsu Province for Outstanding Young Teachers, Six-Talent Peaks Project in Jiangsu Province of China (WSN-087), Social Development Project of Xuzhou City (KC19017), National Natural Science Foundation of China (81402765), Statistical Science Research Project from National Bureau of Statistics of China (2014LY112), Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX19_2250), and Training Project for Youth Science and Technology Innovation Team at Xuzhou Medical University.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We are very grateful to the TCGA program, from which the cancer data sets and gene expressions are publicly available. We thank all the patients who participated in TCGA to provide their invaluable information for scientific research, and we also thank two reviewers for their constructive and insightful comments that substantially improved our manuscript. The data analyses in the present study were supported by the high-performance computing cluster at Xuzhou Medical University.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00920/full#supplementary-material

Abbreviations

Coxenet, the Cox model with elastic net penalty; coxlasso, Cox model with lasso penalty; coxlmm: the mixed Cox model; MCCV, Monte Carlo Cross Validation; the abbreviations of cancers were summarized in Table 1; PCE, the proportion of the survival variation explained by the clinical information alone; PGE, the proportion of the survival variation explained by the transcriptome information alone; PVE, proportion of the survival variation explained; TCGA, the Cancer Genome Atlas.

Footnotes

^ https://xenabrowser.net/

References

Ashley, E. A. (2015). The precision medicine initiative: a new national effort. JAMA 313, 2119–2120.