High-Dimensional Mediation Analysis Based on Additive Hazards Model for Survival Data

Cui, Yidan; Luo, Chengwen; Luo, Linghao; Yu, Zhangsheng

doi:10.3389/fgene.2021.771932

ORIGINAL RESEARCH article

Front. Genet., 23 December 2021

Sec. Statistical Genetics and Methodology

Volume 12 - 2021 | https://doi.org/10.3389/fgene.2021.771932

Yidan Cui¹,²

Chengwen Luo³

Linghao Luo¹,²

Zhangsheng Yu¹,²,⁴*

¹Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
²SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China
³Public Laboratory, Taizhou Hospital of Zhejiang Province, Wenzhou Medical University, Linhai, Zhejiang, China
⁴Clinical Research Institute, Shanghai Jiao Tong University School of Medicine, Shanghai, China

Mediation analysis has been extensively used to identify potential pathways between exposure and outcome. However, analytical methods for high-dimensional mediation analysis of survival data remain underdeveloped, especially for non-Cox model approaches. We propose a procedure that combines “two-step” variable selection and indirect effect estimation for the additive hazards model with high-dimensional mediators. We first apply sure independence screening and smoothly clipped absolute deviation regularization to select mediators. Then we use the Sobel test and the BH method for indirect effect hypothesis testing. Simulation results demonstrate its good performance, with a higher true-positive rate and accuracy, as well as a lower false-positive rate. We apply the proposed procedure to analyze DNA methylation markers mediating the effect of smoking on the survival time of lung cancer patients in a TCGA (The Cancer Genome Atlas) cohort study. The real data application identifies four mediating CpGs, three of which are newly found.

1 Introduction

Lung cancer continues to be the most common cancer type worldwide, with the highest death rate (18%) among all malignant tumors (Wild et al., 2020). Zeilinger et al. (2013) found that tobacco smoking has an extensive genome-wide influence on DNA methylation. Meanwhile, Tsou et al. (2002) discovered that DNA methylation has a strong relationship with lung cancer. It is therefore of interest to study how DNA methylation mediates the causal pathway between smoking and lung cancer patients’ survival.

Mediation analysis, for potential indirect effects (IEs) detection, was first applied to psychological theory and research (Baron and Kenny, 1986). Then this idea was generally applied to sociological and biomedical fields (Kahler et al., 2017; Lapointe-Shaw et al., 2018; Vansteelandt et al., 2019; Arora et al., 2020; Song et al., 2020). The mediation model can be expressed in the following equations:

Y = c + γ X + ε (1)

M = c_{m} + α X + ε (2)

Y_{M} = c_{y} + γ^{'} X + β M + ε, (3)

where Y and Y_M are the outcomes, M is the mediator, and X is the exposure. Eq. 1 is the original regression model. Eq. 2 models the effect of X on M, and Eq. 3 models the effect of X on Y adjusting for M. Estimation and inference of the IE are essential to mediation analysis. According to MacKinnon et al. (2002), the available approaches include the causal steps tests (Judd and Kenny, 1981; Baron and Kenny, 1986), the coefficients difference tests (Freedman and Schatzkin, 1992), and the coefficients product tests (Sobel, 1987). Mediation analysis has been extended from univariate to multivariate and even high-dimensional mediators. Meanwhile, the outcome can be continuous, binary, or longitudinal (Selig and Preacher, 2009), as well as survival data (VanderWeele, 2011).
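For the linear models in Eqs 1–3, the link between the coefficients difference and coefficients product approaches can be made explicit. The short derivation below is only a sketch for the single-mediator, continuous-outcome case. It treats Y and Y_M as the same outcome and writes ε_m and ε_y for the error terms of Eqs 2 and 3; these two labels are introduced here for illustration and do not appear in the equations above.

```latex
\begin{aligned}
Y &= c_{y} + γ' X + β M + ε_{y} \\
  &= c_{y} + γ' X + β (c_{m} + α X + ε_{m}) + ε_{y} \\
  &= (c_{y} + β c_{m}) + (γ' + α β) X + (β ε_{m} + ε_{y}) \\
\Rightarrow \quad γ &= γ' + α β, \qquad IE = α β = γ - γ'
\end{aligned}
```

Under these assumptions the product αβ and the difference γ − γ' describe the same quantity, which is why both can serve as the basis of an IE test in the linear setting.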

While the Cox model is widely used in survival data analysis, the additive hazards model, which can model time-varying effects directly (Aalen, 1989), is becoming increasingly common. Lin and Ying (1994) studied a semiparametric estimation method that mimics the estimation method of the Cox proportional hazards model. Yin and Cai (2004) proposed an estimation method for multivariate failure time data and demonstrated its convergence properties. Mediation analysis has also been applied to the additive hazards model. An early study of natural IE estimation was presented by Lange and Hansen (2011). This work has since been extended to multiple mediators (Taylor et al., 2008; Huang and Yang, 2017) and time-dependent mediators (Deboeck and Preacher, 2016; Aalen et al., 2020).

In recent years, the additive hazards model has been used to analyze high-dimensional time-to-event data. Lin and Lv (2013) compared five penalized regularization methods and found that SCAD (smoothly clipped absolute deviation; Fan and Li, 2001), MCP (minimax concave penalty; Zhang, 2010), and SICA (smooth integration of counting and absolute deviation; Lv and Fan, 2009) perform better. Chen et al. (2019) proposed a screening method based on a sparsity-restricted pseudo-score estimator for ultrahigh-dimensional sparse data with an additive hazards model. In parallel, extensive work has been done on high-dimensional mediation analysis. Zhang et al. (2016) applied high-dimensional mediation analysis to investigate DNA methylation sites mediating the causal pathway from smoking to reduced lung function. Latent variables, the Cox model, nonlinear mediators, and sparse PCA have also been discussed for high-dimensional mediation analysis (Derkach et al., 2019; Loh et al., 2020; Luo et al., 2020; Zhao et al., 2020), as have IE testing methods (Djordjilović et al., 2019; Gao et al., 2019; Dai et al., 2020; Liu et al., 2021).

However, an analytical approach for high-dimensional mediators based on the additive hazards model is still lacking. We aim to establish such a procedure and to investigate DNA methylation markers with an IE between tobacco smoking and lung cancer patients’ survival. The main idea of the proposed procedure is to reduce the high-dimensional mediators by the “two-step” sure independence screening (SIS)–SCAD method and to identify positive mediators by the Sobel test. We apply SIS in the first step for its sure screening property and large-scale dimensionality reduction, studied by Fan and Lv (2008), who also demonstrated that combining SIS and SCAD can perform variable selection and parameter estimation simultaneously. SIS has been extended to survival analysis with the Cox proportional hazards model (Zhao and Li, 2012) and the additive hazards model (Gorst-Rasmussen and Scheike, 2013). We apply the SCAD penalty in the second step using the R package “ahaz” (Gorst-Rasmussen and Scheike, 2012).

The rest of this article proceeds as follows. In the next section, we present the methodology, including notation, assumptions, and the detailed procedure. Then, we provide simulation studies to evaluate the performance of the proposed procedure and a real data application to identify CpGs mediating the effect of smoking on lung cancer patients’ survival time. Conclusions and discussion follow at the end.

2 Materials and Methods

2.1 Notation and Models of the Proposed Procedure

For each individual i = 1, 2, …, n, T_i = min(D_i, C_i) denotes the observed survival time, where D_i is the time from the beginning of follow-up to the event and C_i is the censoring time. δ_i = I(D_i ≤ C_i) is the failure indicator, and I(⋅) is the indicator function. When D_i > C_i, the participant is said to be right-censored, which is the type of censoring we consider in this article. The censoring rate, i.e., the proportion of participants whose event time is not observed because of loss to follow-up or nonoccurrence of the event of interest within the trial duration, is important in survival analysis (Prinja et al., 2010).

Figure 1 is a directed acyclic graph showing the relationship between exposure, outcome, covariates, and high-dimensional mediators. X is the exposure. $M = (M_{1}, M_{2}, \dots, M_{p})^{T}$ denotes the high-dimensional mediators, with p ≫ n. Y is the survival outcome. Z represents covariates. The additive hazards model with mediators is:

λ_{i} (t | X_{i}, M_{i}, Z_{i}) = λ_{0 i} (t) + γ X_{i} (t) + θ^{T} Z_{i} (t) + \sum_{k = 1}^{p} β_{k} M_{k i}, \quad i = 1, 2, \dots, n, (4)

M_{k i} = c_{k} + α_{k} X_{i} (t) + ϑ^{T} Z_{i} (t) + e_{k i}, \quad k = 1, 2, \dots, p. (5)

FIGURE 1. Directed acyclic graph with the exposure, outcome, and high-dimensional mediators.

Eq. 4 is an additive hazards model for each individual’s hazard rate. λ_i depends on the exposure, covariates, and high-dimensional mediators. λ_0i(t) is the time-varying intercept. Eq. 5 describes how the exposure and covariates linearly influence the mediators. c_k is the intercept and e_ki is the random error.

2.2 Assumptions

To draw causal conclusions from the mediation analysis, we make some assumptions about mediators and confounders. Here, T(x, m₁, m₂, …, m_p) denotes the survival time that would be observed if the exposure were set to x and the mediators to m_k (k = 1, 2, …, p). M_k(x*) denotes the value of the kth mediator when the exposure takes the value x*. The proposed procedure also relies on the consistency assumption, which requires that the outcome is fixed once the exposure and mediators are set (VanderWeele and Vansteelandt, 2009; Rehkopf et al., 2016). Based on Luo et al. (2020) and Huang and Yang (2017), the assumptions for the proposed procedure are as follows:

1) X ⊥ T(x, m₁, m₂, …, m_p)|Z; there is no unmeasured confounding effect between X and T conditional on Z.

2) For any k = 1, 2, …, p, M_k ⊥ T(x, m₁, m₂, …, m_p)|X, Z; there is no unmeasured confounding effect between M_k and T conditional on X and Z.

3) For any k = 1, 2, …, p, X ⊥ M_k|Z; there is no unmeasured confounding effect between X and M_k conditional on Z.

4) For any k = 1, 2, …, p, $M_{k}^{x^{*}} \perp T (x, m_{1}, m_{2}, \dots, m_{p}) | Z$; there is no X-induced factor confounding the pathway from M to T conditional on Z, where x* is an intervention value for X different from x.

2.3 Proposed Procedure

Referring to the counting process notation, N_i(t) = I(T_i ≤ t, δ_i = 1) represents the observed failure counting process, where δ_i = I(D_i ≤ C_i). Y_i(t) = I(T_i ≥ t) is the at-risk indicator. And

M_{i} (t) = N_{i} (t) - \int_{0}^{t} Y_{i} (s) \{λ_{0 i} (s) + γ X_{i} (s) + θ^{T} Z_{i} (s) + β^{T} M_{i} (s)\} d s

is the additive martingale process. Let P = (γ, θ, β) and Q_i = (X_i, Z_i, M_i). Then the martingale can be simplified as $M_{i} (t) = N_{i} (t) - \int_{0}^{t} Y_{i} (s) \{λ_{0 i} (s) + P^{T} Q_{i} (s)\} d s$.

According to Lin and Ying (1994), the pseudo-likelihood score function of the proposed model is:

U (P) = \sum_{i = 1}^{n} \int_{0}^{\infty} \{Q_{i} (t) - \bar{Q} (t)\} \{d N_{i} (t) - Y_{i} (t) P^{T} Q_{i} (t) d t\},

where $\bar{Q} (t) = \sum_{j = 1}^{n} Y_{j} (t) Q_{j} (t) / \sum_{j = 1}^{n} Y_{j} (t)$. Referring to Lin and Lv (2013), after scaling by 1/n the score function can be written as

\frac{1}{n} U (P) = b - V P,

where

b = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{\infty} \{Q_{i} (t) - \bar{Q} (t)\} d N_{i} (t), V = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{\infty} Y_{i} (t) {\{Q_{i} (t) - \bar{Q} (t)\}}^{\otimes 2} d t,

and a^⊗2 = aa^T. Then the least-squares type loss function of the proposed model is:

L (P) = \frac{1}{2} P^{T} V P - b^{T} P . (6)
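To make the quantities b and V concrete, the following Python sketch computes them for time-invariant covariates and right-censored data with no tied times. It is an illustration only, not the authors' R implementation; the function name `lin_ying_b_V` and the toy data are hypothetical. Between two consecutive ordered observation times the risk set, and hence Q̄(t), is constant, so the integral in V reduces to a sum over intervals.

```python
import numpy as np

def lin_ying_b_V(time, status, Q):
    """Compute b and V of the least-squares type loss L(P) = P'VP/2 - b'P.

    Assumes time-invariant covariates Q (n x q) and no tied observation times.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n, q = Q.shape

    order = np.argsort(time)                 # process subjects by observed time
    time, status, Q = time[order], status[order], Q[order]

    b = np.zeros(q)
    V = np.zeros((q, q))
    previous_time = 0.0
    for j in range(n):
        at_risk = Q[j:]                      # subjects with T_i >= time[j]
        centered = at_risk - at_risk.mean(axis=0)   # Q_i - Qbar(t) on this interval
        # Integral of Y_i(t) {Q_i - Qbar(t)}^{(x)2} over (previous_time, time[j]]
        V += (time[j] - previous_time) * (centered.T @ centered)
        # Jump of N_j at time[j] contributes only if subject j failed
        b += status[j] * centered[0]
        previous_time = time[j]
    return b / n, V / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.binomial(1, 0.5, 1000).astype(float)
    # Constant hazard 1 + 0.5 x, i.e. an additive hazards model with P = 0.5
    t = rng.exponential(1.0 / (1.0 + 0.5 * x))
    b, V = lin_ying_b_V(t, np.ones_like(t), x[:, None])
    print("closed-form minimizer of L(P):", np.linalg.solve(V, b))
```

In low dimensions the minimizer of Eq. 6 is simply the solution of VP = b, which is what the usage example prints.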

However, directly minimizing Eq. 6 is not feasible when p ≫ n, because the matrix V is then singular. To identify the true-positive mediators, we consider the “two-step” method for dimension reduction. First, we apply SIS to reduce the dimension from an ultrahigh level to a moderate one (Fan and Lv, 2008). Then we perform regularization with the SCAD penalty on the SIS-selected subset. Finally, the Sobel test is applied to identify true mediators in the SCAD-selected subset. Figure 2 shows the overall workflow of the proposed procedure. We introduce the details below.

FIGURE 2. Overall workflow of the proposed procedure.

Step 1. (Screening) Using SIS to reduce the candidate mediators from dimension p to dimension d, we identify a subset S₁ ⊂ {M_k: 1 ≤ k ≤ p} containing d mediators. Here we select d = [2n/ log(n)] mediators instead of the [n/ log(n)] recommended by Fan and Lv (2008) so that S₁ contains more positive mediators, because mediators are related to exposure and outcome simultaneously.
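The screening step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the ranking statistic is taken to be the absolute marginal correlation between the exposure and each mediator (the statistic described in the application section), and the bracket in d = [2n/ log(n)] is read as rounding up, which reproduces the value 248 reported there for n = 833. The function names are hypothetical.

```python
import math
import numpy as np

def sis_size(n):
    """Number of mediators kept by screening, d = [2n / log(n)] (rounded up here)."""
    return math.ceil(2 * n / math.log(n))

def sis_screen(X, M, d=None):
    """Return the sorted column indices of the d mediators most correlated with X.

    X: exposure vector of length n; M: n x p mediator matrix.
    """
    X = np.asarray(X, dtype=float)
    M = np.asarray(M, dtype=float)
    n, p = M.shape
    d = min(sis_size(n) if d is None else d, p)

    X_centered = X - X.mean()
    M_centered = M - M.mean(axis=0)
    denom = np.sqrt((X_centered ** 2).sum() * (M_centered ** 2).sum(axis=0))
    # Constant columns get correlation 0 instead of a division-by-zero warning
    corr = np.abs(X_centered @ M_centered) / np.where(denom > 0, denom, np.inf)

    keep = np.argsort(-corr)[:d]             # indices of the d largest |correlation|
    return np.sort(keep)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.binomial(1, 0.6, 200).astype(float)
    M = rng.standard_normal((200, 1000))
    M[:, 7] += 2.0 * X                       # one mediator truly driven by X
    print(sis_size(200), 7 in sis_screen(X, M))
```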

Step 2. (SCAD-penalized selection) Among the mediators M_k ∈ S₁, we select the subset $S_{2} = \{M_{k} : {\hat{β}}_{k} \neq 0\}$ by minimizing the following penalized objective function:

Q (β) = L (β) + \sum_{j = 1}^{p} p_{λ} (|β_{j}|),

where L(β) has been shown in Eq. 6, and

p_{λ}^{'} (| β |) = λ I (| β | \leq λ) + \frac{{(a λ - | β |)}_{+}}{a - 1} I (| β | > λ), \quad a > 2, \ λ > 0.

Here we choose the regularization parameters by 5-fold cross-validation. Gorst-Rasmussen and Scheike (2012) implemented the SCAD-penalized method for the additive hazards model in the R package ahaz.
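As a small illustration of the SCAD penalty derivative given above, the following Python sketch evaluates it elementwise. The default a = 3.7 is the value suggested by Fan and Li (2001) and is an assumption here; the text does not state which value of a is used in the proposed procedure. The function name is hypothetical.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(|beta|) for lam > 0 and a > 2."""
    if lam <= 0 or a <= 2:
        raise ValueError("SCAD requires lam > 0 and a > 2")
    abs_beta = np.abs(np.asarray(beta, dtype=float))
    # Lasso-like constant slope up to lam, then linear decay to zero at a * lam
    tail = np.maximum(a * lam - abs_beta, 0.0) / (a - 1.0)
    return np.where(abs_beta <= lam, lam, tail)

if __name__ == "__main__":
    print(scad_derivative([0.5, 2.0, 5.0], lam=1.0))
```

Small coefficients are penalized at the constant rate λ, as with the lasso, while coefficients larger than aλ receive no further shrinkage, which is the feature that distinguishes SCAD from the lasso penalty used in the comparison method.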

Step 3. (Effect decomposition and IE test) Following the single-mediator setting (Lange and Hansen, 2011) and the two-mediator setting based on the additive hazards model (Huang and Yang, 2017), we use the counterfactual hazard difference to measure the change in effect when the exposure changes from x to x*. The counterfactual hazard difference, also called the total effect (TE), has two parts: the direct effect (DE) and the IE. The DE is the effect caused directly by the exposure, and the IE is the effect of the exposure that acts indirectly through the mediators.

Defining $M_{k}^{x}$ and $M_{k}^{x^{*}}$ as the mediator values under exposure values x and x*, respectively, we have the following decomposition of TE (for more details see Supplementary Appendix):

\begin{array}{rl}
TE = & λ (T (x^{*}, M_{1} (x^{*}), \dots, M_{p} (x^{*})); t | Z) - λ (T (x, M_{1} (x), \dots, M_{p} (x)); t | Z) \\
= & λ (T (x^{*}, M_{1} (x^{*}), \dots, M_{p} (x^{*})); t | Z) - λ (T (x^{*}, M_{1} (x), \dots, M_{p} (x)); t | Z) \\
& + λ (T (x^{*}, M_{1} (x), \dots, M_{p} (x)); t | Z) - λ (T (x, M_{1} (x), \dots, M_{p} (x)); t | Z) \\
= & γ (x^{*} - x) + (α_{1} β_{1} + \dots + α_{p} β_{p}) (x^{*} - x) \\
= & DE + IE
\end{array}
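The last two lines of the decomposition can be checked numerically. The Python sketch below is only an illustration: the function name is hypothetical, and the coefficient values are the first eight entries of α and β and the exposure coefficient of 1 from the simulation design in Section 3.1.1.

```python
import numpy as np

def decompose_effects(gamma, alpha, beta, x, x_star):
    """Return (DE, per-mediator IE, TE) on the hazard-difference scale."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    dx = x_star - x
    direct = gamma * dx                      # DE = gamma (x* - x)
    indirect_each = alpha * beta * dx        # IE_k = alpha_k beta_k (x* - x)
    return direct, indirect_each, direct + indirect_each.sum()

if __name__ == "__main__":
    alpha = [1, 1, 1, 1, 0.5, 0.5, 0, 0]
    beta = [1, 1, 1, 1, 0, 0, 0.5, 0.5]
    de, ie_each, te = decompose_effects(1.0, alpha, beta, x=0, x_star=1)
    print(de, ie_each.sum(), te)
```

With these values only the first four mediators have a nonzero product α_kβ_k, so mediators 5 to 8 are related to either the exposure or the outcome but carry no IE.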

Then we apply the Sobel mediation significance test to the subset S₂ to pick out true-positive mediators from the candidates by their significant IE. According to Sobel (1987), we have the null hypothesis H₀: α_kβ_k = 0 and the following formula for the p value:

P_{r a w, k} = 2 \{1 - Φ (\frac{| {\hat{α}}_{k} {\hat{β}}_{k} |}{{\hat{σ}}_{α_{k} β_{k}}})\}, (7)

where Φ(⋅) is the standard normal cumulative distribution function, ${\hat{σ}}_{α_{k} β_{k}} = \sqrt{{\hat{α}}_{k}^{2} {\hat{σ}}_{β_{k}}^{2} + {\hat{β}}_{k}^{2} {\hat{σ}}_{α_{k}}^{2}}$ is the estimated standard error, ${\hat{α}}_{k}$ is the estimator of α_k, ${\hat{β}}_{k}$ is the estimator of β_k, ${\hat{σ}}_{α_{k}}^{2}$ is the estimated variance of ${\hat{α}}_{k}$, and ${\hat{σ}}_{β_{k}}^{2}$ is the estimated variance of ${\hat{β}}_{k}$.
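The Sobel p value of Eq. 7 can be computed with the Python standard library alone, as in the sketch below. It assumes the two-sided normal tail is taken from the standard normal distribution, and uses the identity 2{1 − Φ(z)} = erfc(z/√2) for z ≥ 0. The function name and the numbers in the usage example are illustrative and are not estimates from the article.

```python
import math

def sobel_p_value(alpha_hat, beta_hat, se_alpha, se_beta):
    """Two-sided Sobel p value for H0: alpha_k * beta_k = 0 (Eq. 7)."""
    # First-order standard error of the product alpha_hat * beta_hat
    se_product = math.sqrt(alpha_hat ** 2 * se_beta ** 2
                           + beta_hat ** 2 * se_alpha ** 2)
    if se_product == 0:
        raise ValueError("standard error of the product is zero")
    z = abs(alpha_hat * beta_hat) / se_product
    # 2 * (1 - Phi(z)) written with the complementary error function
    return math.erfc(z / math.sqrt(2))

if __name__ == "__main__":
    print(sobel_p_value(alpha_hat=0.8, beta_hat=0.5, se_alpha=0.1, se_beta=0.2))
```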

3 Results

3.1 Simulation Studies

This section presents the results of a series of simulation studies that evaluate the proposed procedure for high-dimensional mediator selection and IE estimation.

3.1.1 Simulation Design

We generate the hazard rate of the survival outcome from the additive hazards model $λ_{i} (t | X_{i}, Z_{i}, M_{i}) = 5 t + X_{i} + 0.4 Z_{1 i} + 0.4 Z_{2 i} + \sum_{k = 1}^{p} β_{k} M_{k i}$ and the high-dimensional mediators from the linear model M_ki = c_k + α_kX_i + 0.4Z_1i + 0.4Z_2i + e_ki. The simulation data are generated according to the following parameter settings with different sample sizes (n = 500 and 1,000) and mediator dimensions (p = 10,000, 20,000, 50,000, and 100,000). The censoring time follows the uniform distribution U(0, c). By adjusting the constant c, we vary the censoring rate from 15% to 50% in steps of 5% to assess the sensitivity of the proposed procedure to the censoring rate. For each scenario, we generate 500 replicates.

• X_i ∼ B(1, 0.6) is the exposure.

• c_k ∼ U(0, 0.5) is the intercept and e_ki ∼ N(0, 1) is the random error.

• $α^{T} = (1, 1, 1, 1, 0.5, 0.5, 0, 0, \underbrace{0, \dots, 0}_{9992})$ and $β^{T} = (1, 1, 1, 1, 0, 0, 0.5, 0.5, \underbrace{0, \dots, 0}_{9992})$.

• Z_1i ∼ B(1, 0.3), Z_2i ∼ U(0, 1).
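The design above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code. Survival times are drawn by inverting the cumulative hazard 2.5t² + ηt, where η is the time-constant part of the hazard; when η is negative the model hazard is negative near t = 0, and this sketch simply takes the positive root, which is an assumption about how such cases are handled. The function name and the `censor_upper` argument (the constant c of U(0, c)) are hypothetical.

```python
import numpy as np

def simulate_dataset(n=500, p=10_000, censor_upper=1.0, seed=0):
    """Draw one replicate from the simulation design (p must be at least 8)."""
    if p < 8:
        raise ValueError("p must be at least 8")
    rng = np.random.default_rng(seed)

    alpha = np.zeros(p)
    beta = np.zeros(p)
    alpha[:4], alpha[4:6] = 1.0, 0.5         # alpha = (1,1,1,1,0.5,0.5,0,0,0,...)
    beta[:4], beta[6:8] = 1.0, 0.5           # beta  = (1,1,1,1,0,0,0.5,0.5,0,...)

    X = rng.binomial(1, 0.6, n).astype(float)
    Z1 = rng.binomial(1, 0.3, n).astype(float)
    Z2 = rng.uniform(0.0, 1.0, n)
    intercepts = rng.uniform(0.0, 0.5, p)    # c_k
    errors = rng.standard_normal((n, p))     # e_ki

    M = intercepts + np.outer(X, alpha) + 0.4 * (Z1 + Z2)[:, None] + errors

    # Time-constant part of the hazard: X + 0.4 Z1 + 0.4 Z2 + sum_k beta_k M_k
    eta = X + 0.4 * Z1 + 0.4 * Z2 + M @ beta
    # Solve 2.5 t^2 + eta t = E with E ~ Exp(1); positive root of the quadratic
    E = rng.exponential(1.0, n)
    D = (-eta + np.sqrt(eta ** 2 + 10.0 * E)) / 5.0

    C = rng.uniform(0.0, censor_upper, n)    # censoring time ~ U(0, c)
    time = np.minimum(D, C)
    delta = (D <= C).astype(int)
    return {"X": X, "Z": np.column_stack([Z1, Z2]), "M": M,
            "time": time, "delta": delta, "alpha": alpha, "beta": beta}

if __name__ == "__main__":
    data = simulate_dataset(n=500, p=1000, censor_upper=1.0, seed=1)
    print("censoring rate:", 1.0 - data["delta"].mean())
```

In practice `censor_upper` would be tuned by trial until the empirical censoring rate matches the target for a scenario.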

Candidates with nonzero IEs are positive mediators, and candidates with zero IEs are negative mediators. We use the TPR (true-positive rate), FP (number of false positives), and FDP (false discovery proportion) to evaluate mediator selection, and we use the estimated IE, coverage probability, estimated standard error, and empirical standard error to evaluate IE estimation. To control the error rate under multiple hypothesis testing, we apply the BH method (Benjamini and Hochberg, 1995) to adjust the estimated p values. However, the BH method assumes independent hypotheses, an assumption that is not satisfied in some cases. We therefore also consider the BY method (Benjamini and Yekutieli, 2001) for dependent situations, and we apply both adjustments to compare their performance under different scenarios.
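The two adjustments and the selection metrics can be sketched in Python as below. This is a minimal illustration, not the authors' implementation. The FDP is taken here as the number of false positives divided by the number of selected mediators (0 when nothing is selected), which is the usual definition but is an assumption, since the text does not spell it out. The function names are hypothetical.

```python
import numpy as np

def adjust_p_values(p_values, method="BH"):
    """Step-up adjusted p values: BH (independence) or BY (arbitrary dependence)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    ranks = np.arange(1, m + 1)
    factor = m / ranks
    if method == "BY":
        factor = factor * np.sum(1.0 / ranks)   # extra harmonic-sum penalty
    elif method != "BH":
        raise ValueError("method must be 'BH' or 'BY'")

    order = np.argsort(p)
    scaled = p[order] * factor
    # Enforce monotonicity from the largest p value downwards
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(monotone, 1.0)
    return adjusted

def selection_metrics(selected, true_mediators):
    """TPR, FP and FDP of a selected index set against the true mediator set."""
    selected, truth = set(selected), set(true_mediators)
    true_pos = len(selected & truth)
    false_pos = len(selected - truth)
    tpr = true_pos / len(truth) if truth else float("nan")
    fdp = false_pos / len(selected) if selected else 0.0
    return {"TPR": tpr, "FP": false_pos, "FDP": fdp}

if __name__ == "__main__":
    raw = np.array([0.01, 0.04, 0.03, 0.005])
    bh = adjust_p_values(raw, "BH")
    chosen = np.flatnonzero(bh < 0.05)
    print(bh, selection_metrics(chosen, true_mediators=[0, 3]))
```

Because the BY factor is never smaller than the BH factor, BY-adjusted p values are at least as large as BH-adjusted ones, so BY can only select the same or fewer mediators at a given threshold.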

3.1.2 Simulation Results

We demonstrate the performance of the proposed procedure with the simulation results summarized in Tables 1, 2 and visualized in Figures 3, 4. Figure 3 and Table 1 both show the accuracy of mediator selection with censoring rates ranging from 15% to 50%, 10,000 mediators, and sample sizes of 500 and 1,000. In general, selection performs better with a sample size of 1,000 than with 500, and the BH method (shown in the first row) performs better (higher TPR and acceptable FDP) than the BY method (shown in the second row). Given the independence assumption on the mediators, we adopt the BH method in the proposed procedure. Under the BH adjustment, the lowest TPR is 0.5485, obtained with a sample size of 500 and a censoring rate of 50%. The TPR approaches 1 as the sample size increases and the censoring rate decreases. The scenario with 1,000 samples and a 30% censoring rate has both the highest FP (0.3340) and the highest FDP (0.0617). The naive method estimates the IE for each mediator separately and applies the multiple hypothesis adjustment to all candidate mediators without variable selection. The simulation results demonstrate that the proposed procedure has better selection performance than the naive method.

TABLE 1. Selection accuracy of the proposed procedure compared with the naive method.

TABLE 2. Indirect effect estimation of the proposed procedure.

FIGURE 3. Selection accuracy of the proposed procedure. (A) shows the TPR of the proposed procedure for different censoring rates and sample sizes, (B) shows the FP, and (C) shows the FDP.

FIGURE 4. Indirect effect estimation of the proposed procedure. (A) is the estimated coefficients of four mediators with sample size 500 in simulation studies, (B) is the coverage probability of four mediators with sample size 500, (C) is the empirical standard error and estimated standard error of four mediators with sample size 500. (D), (E), (F) represent the same simulation results as (A), (B), (C) correspondingly with sample size 1000.

To verify the advantage of the proposed procedure, we compare it with the joint method, the lasso method, and the Cox model method. The joint method uses the joint significance test in place of the Sobel test, with the same “two-step” variable selection as the proposed procedure. The FP and FDP of the proposed procedure are much lower than those of the joint method; the comparison is shown in Supplementary Table S1. The lasso method replaces the SCAD penalty with the lasso penalty in the regularization step. For both lasso- and SCAD-penalized selection, we apply 5-fold cross-validation to optimize the regularization parameters. The TPR of the proposed procedure is higher than that of the lasso method; the comparison is shown in Supplementary Table S2. The Cox model method fits the Cox proportional hazards model instead of the additive hazards model in the regularization and IE estimation steps. In the penalized step, we apply 5-fold cross-validation to optimize the regularization parameters for both the Cox and additive hazards models. The TPR of the proposed procedure is higher than that of the Cox model method; the comparison is shown in Supplementary Table S3.

We also examine the performance of the proposed procedure with more mediators (20,000, 50,000, and 100,000). Under these settings, the TPR hardly changes, whereas the FP and FDP rise slowly as the mediator dimension increases. These results are available in Supplementary Table S4. To bring the simulations closer to real data, we consider dependent mediators in another scenario. The results show that as the correlation among mediators increases, the TPR decreases and the FP and FDP increase. Because the assumption of the BH method is not satisfied with dependent mediators, we use the BY method to adjust the dependent p values. The variable selection results for dependent mediators are available in Supplementary Table S5. We also examine the selection performance of the proposed procedure under four different coefficient settings, and the results are shown in Supplementary Table S6.

In addition, we evaluate the IE estimation performance. We show the results for 10,000 mediators and sample sizes of 500 and 1,000 in Figure 4 and Table 2 (results for censoring rates of 15%, 25%, 35%, and 50% are in Table 2, and the rest are shown in Supplementary Table S7). In summary, the estimation performs well and improves as the sample size increases. The estimated IE is close to the true value, with a slight bias. The coverage probabilities are approximately 0.95. The estimated standard error and empirical standard error are close to each other.

In short, the proposed procedure performs well in high-dimensional mediation analysis based on the additive hazards model, with high selection accuracy and accurate estimation. Therefore, we apply it to the TCGA (The Cancer Genome Atlas) lung cancer data.

3.2 Application

Lung cancer is still the most fatal cancer worldwide, with many pathogenic factors such as tobacco smoking and air pollution; 80% to 85% of lung cancers are caused by smoking (Wild et al., 2020). Nicotine in tobacco may result in genetic mutations. To find out whether smoking affects lung cancer survival through DNA methylation, we applied the proposed procedure to the TCGA lung cancer cohort study, which includes DNA methylation data (907 samples measured by the Illumina Infinium HumanMethylation450 platform), phenotype data (1,299 samples), and survival data (1,145 samples) for lung squamous cell carcinoma and lung adenocarcinoma. DNA methylation values recorded via BeadStudio software are continuous from 0 to 1 and represent the intensity ratio. Thus, a higher value represents a higher degree of methylation, and a lower value a lower degree.

After sample matching and data cleaning among the above data sets, we obtained 833 patients; 41.2% (343) were female, and 68.4% (570) were smokers. The patients’ ages ranged from 33 to 90 years, with a median of 67 years. The overall survival time was the number of days from first diagnosis to death or the last follow-up date. The median survival time was 652 days (1.79 years).

SIS based on the marginal correlation between tobacco smoking status and DNA methylation was first applied to reduce the DNA methylation sites from 365,306 to 2n/log(n) (= 248). Then we applied the SCAD penalty for a further dimension reduction and obtained a subset of 25 sites. We applied the Sobel test and the BH method to that subset for IE hypothesis testing. cg19757631, cg08636115, cg05147638, cg24720672, and cg08530838 were significant DNA methylation sites, with adjusted p value < 0.05. We are interested in mediating DNA methylation markers that increase lung cancer patients’ survival hazards. Therefore, we focus on the CpG sites with positive IEs ($\hat{α}_{k} \hat{β}_{k} > 0$): cg19757631, cg08636115, cg05147638, and cg24720672.

Table 3 shows the mediating CpG sites with positive IE. The estimated IE is represented by $\hat{α} \hat{β}$. The TE (effect of exposure on outcome adjusting for covariates) of tobacco smoking on lung cancer patients’ survival was 0.0137 (95% CI = −0.0252 to 0.0526), and its DE (effect of exposure on outcome adjusting for mediators and covariates) was 0.0171 (95% CI = −0.0244 to 0.0585). The IEs of the four significant mediating CpG sites cg19757631, cg08636115, cg05147638, and cg24720672 were 0.0296 (95% CI = 0.0129 to 0.0464), 0.0263 (95% CI = 0.0093 to 0.0433), 0.0185 (95% CI = 0.0047 to 0.0323), and 0.0269 (95% CI = 0.0100 to 0.0438), respectively.

TABLE 3. Significant mediating CpG sites with positive indirect effect.

Bakulski et al. (2019) studied DNA methylation sites associated with smoking exposure in TCGA lung adenocarcinoma tissue samples and found that cg19757631 is significant (FDR-adjusted p value < 0.05). In their study, the estimated methylation change of smokers versus never smokers is −12.28% (adjusted p value = 4.81E-06), which is consistent with ours ($\hat{β} = -0.2806$). The experimental results of Fei et al. (2019) on the gene PRDM16 (where cg08636115 is located) suggest that PRDM16 is a metastasis suppressor and a potential therapeutic target for lung adenocarcinomas, which agrees with our result ($\hat{β} = -0.3811$). Shtutman et al. (2011) explained the operational mechanism of COPZ1 (where cg05147638 is located) in the tumor cell: function-based genomic screening identified the COPZ1 gene as essential in different tumor cell types but not in normal cells. Methylation of the COPZ1 gene is harmful, which supports our result ($\hat{β} = 0.4918$). As for the CpG site cg24720672, we found several studies on leukemia, another type of cancer, and we infer that it has a similar mechanism in lung tumor tissue (Nair, 2016; Zhang et al., 2018; Jiang et al., 2020).

The real data application identifies four significant mediating DNA methylation sites with positive IEs between tobacco smoking and lung cancer patients’ survival. CpG site cg19757631 is a mediator with a known relationship with tobacco smoking (Bakulski et al., 2019). CpG sites cg05147638, cg08636115, and cg24720672 are newly identified mediators. We also applied the naive method to the TCGA lung cancer data set, but it identified no mediators.

4 Discussion

High-dimensional data analysis methods are becoming increasingly important with the development of sequencing technologies. Mediation analysis is effective for identifying potential pathways. High-dimensional mediation models provide a new tool for biomarker discovery (e.g., identifying DNA methylation sites as potential mediators between smoking and cancer patients’ survival). In this article, we propose an approach for high-dimensional mediation analysis based on the additive hazards model, which identifies true mediators and estimates IEs. We first use the “two-step” variable selection method (SIS followed by the SCAD-penalized method) to reduce the high-dimensional mediators. Then we apply the Sobel test and the BH method for multiple IE hypothesis testing. We also use the BY method, a more conservative adjustment designed for dependent hypotheses, to examine the effect of applying an adjustment that does not match the setting (the results show that it does yield a lower TPR). Simulation studies show good performance of the proposed procedure. The real data application identifies four DNA methylation sites with positive IEs between smoking and lung cancer patients’ survival time. The proposed procedure and its application results are valuable both theoretically and practically for high-dimensional mediation analysis based on the additive hazards model.

High-dimensional mediation analysis is still at an early stage and needs further development. For example, the proposed procedure assumes no unmeasured confounding. Potential confounders could affect IE estimation in many observational studies. Methods to incorporate confounders in the high-dimensional mediation model using propensity scores or other approaches are still under development. In addition, high-dimensional mediation analysis for longitudinal or repeated-measures data remains to be considered. IE estimation methods for correlated high-dimensional mediators are also of interest.

Data Availability Statement

The TCGA lung cancer data used in the real data application can be accessed without restriction at https://xenabrowser.net/. The proposed procedure is implemented in R, and the corresponding R code can be found at https://github.com/Cui-yd/HMA.

Author Contributions

YC and ZY conceived the idea, designed the study, implemented the method, and drafted the manuscript. YC and CL wrote the code. CL and LL were involved in the data analysis. All authors read and approved the final manuscript.

Funding

This work was supported by 3-year plan of Shanghai public health system construction (No: GWV-10.1-XK05), Shanghai Commission of Science and Technology (No: 21ZR1436300), and Shanghai Jiao Tong University STAR Grant (No: 20190102).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2021.771932/full#supplementary-material

References

Aalen, O. O. (1989). A Linear Regression Model for the Analysis of Life Times. Statist. Med. 8, 907–925. doi:10.1002/sim.4780080803
