HFS-SLPEE: A Novel Hierarchical Feature Selection and Second Learning Probability Error Ensemble Model for Precision Cancer Diagnosis

Meng, Yajie; Jin, Min

doi:10.3389/fcell.2021.696359

ORIGINAL RESEARCH article

Front. Cell Dev. Biol. , 30 June 2021

Sec. Epigenomics and Epigenetics

Volume 9 - 2021 | https://doi.org/10.3389/fcell.2021.696359

This article is part of the Research Topic Finding New Epigenomics and Epigenetics Biomarkers for Complex Diseases and Significant Developmental Events with Machine Learning Methods View all 36 articles

HFS-SLPEE: A Novel Hierarchical Feature Selection and Second Learning Probability Error Ensemble Model for Precision Cancer Diagnosis

$\r\nYajie Meng$ Yajie Meng

Min Jin^*

College of Computer Science and Electronic Engineering, Hunan University, Changsha, China

The emergence of high-throughput RNA-seq data has offered unprecedented opportunities for cancer diagnosis. However, capturing biological data with highly nonlinear and complex associations by most existing approaches for cancer diagnosis has been challenging. In this study, we propose a novel hierarchical feature selection and second learning probability error ensemble model (named HFS-SLPEE) for precision cancer diagnosis. Specifically, we first integrated protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to provide rich information; afterward, we designed a novel hierarchical feature selection method, which takes the CpG-gene biological associations into account and can select a compact set of superior features; next, we used four individual classifiers with significant differences and apparent complementary to build the heterogeneous classifiers; lastly, we developed a second learning probability error ensemble model called SLPEE to thoroughly learn the new data consisting of classifiers-predicted class probability values and the actual label, further realizing the self-correction of the diagnosis errors. Benchmarking comparisons on TCGA showed that HFS-SLPEE performs better than the state-of-the-art approaches. Moreover, we analyzed in-depth 10 groups of selected features and found several novel HFS-SLPEE-predicted epigenomics and epigenetics biomarkers for breast invasive carcinoma (BRCA) (e.g., TSLP and ADAMTS9-AS2), lung adenocarcinoma (LUAD) (e.g., HBA1 and CTB-43E15.1), and kidney renal clear cell carcinoma (KIRC) (e.g., IRX2 and BMPR1B-AS1).

Introduction

Cancer has the characteristics of concealed onset, low cure rate, and high mortality. Traditional surgery, radiotherapy, and chemotherapy have limited effects on patients with advanced cancer. The cancer diagnosis is currently a hot research topic. With the acquisition of numerous gene expression profiles in various tissue samples, it is possible to perform cancer diagnosis at the molecular level. Early biology supposed that cancer is closely related to the mutations of protein-coding genes (Stratton et al., 2009). Thus, mass studies confirmed the feasibility of cancer diagnosis based on messenger RNA (mRNA) expression profiles, achieving some good results (Ben-Dor et al., 2000; Furey et al., 2000; Li et al., 2001). Recently, biological studies have found that without changing the sequence of the protein-coding genome, there are a lot of epigenetic variations involving multiaspects such as non-coding RNA (ncRNA) and DNA methylation in cancer (Baylin and Ohm, 2006; Luo et al., 2016; Xiao et al., 2018a). Numerous studies utilized epigenetic data such as microRNA (miRNA) (Saha et al., 2015), long non-coding RNA (lncRNA) expression profiles (Zhang et al., 2018), and DNA methylation (Al-Juniad et al., 2018) for cancer diagnosis and subtype classification, obtaining some achievements (Raweh et al., 2018; Tang et al., 2018). However, the latest biological research indicates that multibiomarkers can improve the accuracy and robustness of cancer diagnosis (Modelska et al., 2015). Zhao et al. (2018) integrated protein-coding gene, miRNA, and lncRNA expression profiles for lung adenocarcinoma (LUAD) diagnosis. Alghunaim and Al-Baity (2019) fused protein-coding gene expression profiles and DNA methylation for breast invasive carcinoma (BRCA) diagnosis. Classical genetics and epigenetics are two separate mechanisms participating in carcinogenesis (Network, 2012). Additionally, epigenetics data such as ncRNA and DNA methylation are not independent of each other, and they often have synergistic effects (Xu et al., 2018). Therefore, only using protein-coding gene expression profiles and/or ncRNA expression profiles or DNA methylation data leads to the lack of information and prevents the high-performance and robustness of cancer diagnosis from being significantly improved.

The emergence of large-scale RNA-seq data and DNA methylation data has offered unprecedented opportunities for developing cancer diagnosis approaches. However, integrating transcriptome profiling (i.e., protein-coding gene and ncRNA expression profiles) and DNA methylation data for cancer diagnosis faces challenges. Model et al. (2001); Ang et al. (2015), Gao et al. (2017); Lu et al. (2017), and Sun et al. (2019) pointed out that transcriptome profiling and DNA methylation data are featured with high dimensionality, high redundancy, and complex interaction associations. To solve the problem of high dimensionality, feature scoring functions such as differences and distances between normal and tumor samples, correlation coefficients, and information metrics between features and categories are commonly applied for filtering feature (Lazar et al., 2012). For example, Yoon and Lim (2013) used t-test and Euclidean distance, and Cao et al. (2015) adopted the fold-change (FC) and false discovery rate (FDR) for filtering feature. These filter methods efficiently remove irrelevant features to reduce dimensions, which has the characteristics of strong universality and low complexity and are suitable for processing high-dimension data. However, these filter methods are from the perspective of a single feature, without considering the high redundancy between features. Subsequently, Peng et al. (2005) proposed the minimal redundancy and maximal relevance criterion, named mRMR, which based on the maximum correlation between features and categories and the minimum redundancy between features in feature subsets. Lyu et al. (2017) designed LLRFCscore+ algorithm, which first sorts features in descending order via LLRFC criteria and then use the dynamic correlation analysis strategy to eliminate redundant features further. Raweh et al. (2018) developed a hybrid feature selection algorithm with a filtering method and a new feature extraction algorithm. Based on informatics theory, these methods can effectively remove redundancy for single-type data. However, these methods calculated the many-to-many modification associations between DNA methylation CpG sites and genes as redundant correlations. Thus, feature selection methods for transcriptome profiling and DNA methylation data are necessary to study further.

The diagnosis models are also essential for cancer diagnosis. Due to the diversity of classifiers, ensemble models tended to have better performance than single models (Dietterich, 2000a; Yang et al., 2010; Zhou and Jin, 2017). Generally speaking, there are three common types of ensemble strategies in the cancer diagnosis field, namely, the voting method, average method, and learning method. For example, Huang et al. (2017) proposed an support vector machine (SVM) ensemble model, which constructed SVM base classifiers with different kernel functions based on bagging and boosting sampling and used majority voting and weighted average ensemble strategies. The SVM ensemble model solves the problems of easy fitting and limited generalization of a single model, but it is limited to the same type classifiers and cannot fully guarantee the difference between classifiers. Cho and Won (2003) trained four different types of individual classifiers and obtained the final ensemble model by majority voting method. This method takes advantage of the complementarity among the different individual classifiers and breaks through the limitation of the application scope of single classifiers. Although it is relatively simple to integrate homogeneous and heterogeneous classifiers by the voting and average methods, it cannot ensemble the nonlinear relationship between classifiers. To further ameliorate the voting and average methods, Xiao et al. (2018b) utilized the stacking learning ensemble strategy, which based on cross-validation to train five different classifiers and put the training results of the classifiers as the input of a deep learning algorithm. The learning ensemble strategy effectively integrate the nonlinear relationships between the heterogeneous classifiers. However, the deep learning algorithm highly depends on the size of samples, whose performance needs to be enhanced by a large increase in the samples of data. Therefore, these ensemble methods show a limited performance for precise cancer diagnosis.

To address the above limitations, we proposed a novel hierarchical feature selection and second learning probability error ensemble model, called HFS-SLPEE, for precision cancer diagnosis. At the dataset level, we integrated protein-coding genes expression profiles, ncRNAs expression profiles, and DNA methylation data to construct a triple dataset that provides a multiview perspective and diverse information. At the feature selection level, due to the significant differences in dimensions, abundance, and association relationships of the triple dataset, we designed a novel hierarchical feature selection algorithm. In stage 1, we developed a CpG sites aggregation feature selection algorithm, termed CSAFS, to non-destructively store the biological associations between ultra-high dimensions DNA methylation CpG sites and genes and rapidly reduce the nearly 500,000 dimensions DNA methylation data to nearly 30,000 dimensions methylated gene data. In stage 2 feature selection, we used different thresholds to further select significantly differentially features via FC and FDR. In stage 3 feature selection, we adopted the mRMR algorithm to select a compact set of superior features. At the model diagnosis level, we developed a second learning probability error ensemble model, named SLPEE. Specifically, we selected four individual classifiers with significant differences and apparent complementary effects to build heterogeneous classifiers and then obtain the classifiers-predicted class probability predictions. To thoroughly learn the nonlinear data, SLPEE integrated the classifiers-predicted class probability predictions and the actual class label in the validating set to construct a new training set, which implicitly included the probability error of each classifier. Furthermore, we utilized eXtreme Gradient Boosting (Xgboost) as ensemble learner to secondly learn the new training set. Via training on three cancers in The Cancer Genome Atlas (TCGA) based on 10-fold cross-validation, HFS-SLPEE achieved 100% multi-indicators of LUAD and kidney renal clear cell carcinoma (KIRC), and obtained 99.65% accuracy, 99.61% sensitivity, 100% specificity, and 99.81% F1-score of BRCA, outperforming previously published approaches. The results indicate that HFS-SLPEE is an accurate and robust approach for cancer diagnosis.

Theoretically, the three contributions of this work are as follows: (i) we integrate protein-coding gene expression profiles, ncRNA expression profiles, and DNA methylation data to solve the problem of lacking information on cancer diagnosis; (ii) we take the biological complex associations into account and develop a novel hierarchical feature selection approach, which integrates the proposed CSAFS, FC&FDR, mRMR, to efficiently select a group of superior and compact features; and (iii) we design an SLPEE model, which makes predictions by secondly learning the error rules between the predicted class probability values of the heterogeneous classifiers and the real values to realize the self-correction of the diagnosis errors via the nonlinear ensemble of heterogeneous classifiers.

Materials and Methods

As shown in Figure 1, HFS-SLPEE consists of four parts: (i) the construction of a triple dataset, (ii) the novel hierarchical feature selection method, (iii) heterogeneous classifiers, and (iv) SLPEE. First, we integrated three biological data including protein-coding gene expression profiles, ncRNA expression profiles, and DNA methylation data to construct a triple dataset, which contains the rich information. Afterward, we hierarchically selected features for the triple dataset. In the first-stage feature selection, we designed the CSAFS algorithm, which could quickly reduce nearly 500,000 dimensions of DNA methylation data to tens of thousands of dimensions and non-destructively preserved the biological complex modification associations between DNA methylation CpG sites and genes. In the second-stage feature selection, we adopted the FC and FDR to select the features with the highest relevance to the target class. In the third-stage feature selection, we applied the mRMR algorithm to select a compact set of superior features with minimal redundancy and maximal relevance. Next, we trained four heterogeneous classifiers in the training set with the features selected via (ii) and optimize the parameters of heterogeneous classifiers via the grid search algorithm in the validating set. Finally, we developed the SLPEE model to ensemble the class probability predictions of the heterogeneous classifiers under the optimal parameters. SLPEE was utilized to predict the testing set in each fold and obtained the novel informative biomarkers.

FIGURE 1

Figure 1. A flowchart of HFS-SLPEE. We first integrate the protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to get rich information. Afterward, considering the CpG-gene biological associations, we design a novel hierarchical feature selection method to get a compact group of superior features. Next, we train four heterogeneous classifiers in the training set with the selected features and optimize the parameters of heterogeneous classifiers via the grid search algorithm in the validating set. Finally, we develop a second learning probability error ensemble model (named SLPEE) to ensemble the class probability predictions of the heterogeneous classifiers under the optimal parameters. SLPEE is utilized to predict the testing set in each fold. HFS-SLPEE is a precision cancer diagnosis framework, which is powerful tool for precision cancer diagnosis.

Construction of the Triple Dataset

The central dogma of classical genetics indicated that genetic information was stored in protein-coding genes (Crick et al., 1961). It had been thought that the formation of tumors is due to the mutations of protein-coding genes. In recent years, plenty of evidence showed that epigenetics also played an essential role in tumor progression (Xu et al., 2019, 2020; Meng et al., 2020). Epigenetics did not involve the DNA sequence changes but changed the structure of chromosomes via different mechanisms that affected the activity of surrounding genes to induce cancer. These known mechanisms commonly included the regulation of ncRNA, DNA methylation, and histone modification, but there are many mechanisms that are unknown so far. Herein, we integrated protein-coding gene expression profiles, ncRNA expression profiles, and DNA methylation data to construct a triple dataset, which contained rich information and provided a multiview perspective for precision cancer diagnosis.

The Novel Hierarchical Feature Selection

The triple dataset exhibits complex characteristics: (i) There were many-to-many complex modification associations between methylation CpG sites and genes; (ii) DNA methylation data were nearly 500,000 dimensions, while the transcriptome profiling were nearly 60,000 dimensions; (iii) there were many noise and redundancy features, and only a small part of the features was positively related to the two phenotypes as tumor and normal tissues; and (iv) the expression abundance of the transcriptome profiling was significantly different from the DNA methylation data. We proposed a novel hierarchical feature selection algorithm against the complex characteristics of the triple dataset.

Stage 1: CpG Sites Aggregation Feature Selection

In this study, we used Illumina Human Methylation 450 array methylation chip data, which measures the level of methylation at known CpG sites as follows:

β = \frac{M}{M + U} (1)

where M denotes the methylated array intensity, M+U denotes the unmethylated array intensity, and β represents the ratio between the methylated array intensity and the total array intensity, ranging from 0 to 1. A CpG site could modify multigenes, while a gene might be related to multi CpG sites. To rapidly reduce the dimensions of the ultra-high dimensionality DNA methylation data and preserve the many-to-many complex biological associations between CpG sites and genes in advance, we proposed a novel CpG sites aggregation feature selection method, called CSAFS, as the stage 1 feature selection algorithm to obtain methylated genes. Specifically, we defined the methylated genes as MG_gj, which represents the arithmetic mean value of the methylation level of all CpG sites related to the gene g_j. The MG_gj is calculated as follows:

M G_{g_{j}} = \frac{\sum_{i = 1}^{P} C p G_{i}}{P} (2)

where P denotes the dimensions of the DNA methylation data, P = 485,577, CpG_i(i = 1, 2…,P) represents the level of methylation at the known ith CpG site, and $\sum_{i = 1}^{P} C p G_{i}$ denotes the aggregate value of all relevant CpG sites of the gene g_j.

Stage 2: FC and FDR

The dimensions of methylated genes data and protein-coding gene and ncRNA expression profiles belonged to tens of thousands of dimensions after the stage 1 feature selection. However, the dimensions were still relatively high. Therefore, we utilized FC and FDR statistical methods to further perform overall rapid dimension reduction.

Fold Change

FC is a well-known method to screen the differentially expressed genes of microarray data (DeRisi et al., 1996; Schena et al., 1996), which measures the difference through calculating the ratio of the mean value of two groups data (DeRisi et al., 1997). The FC value of the gth gene is calculated as follows:

F C_{g} = \frac{\frac{\sum_{t = 1}^{T} x_{g}^{t}}{T}}{\frac{\sum_{n = 1}^{N} y_{g}^{n}}{N}} (3)

where T is the size of tumor samples, N is the size of normal samples, $x_{g}^{t}$ is the gth gene expression value of the tth tumor sample, $y_{g}^{n}$ is the gth gene expression value of the nth normal sample. Similarly, the FC values of methylated genes are obtained by calculating the ratio of the mean methylation level of methylated genes in normal and tumor samples. In this study, the thresholds for screening significantly differentially methylated genes and expressed genes are, respectively, (|log₂⁡FC| > 0.5) and (FDR < 0.05) and (|log₂⁡FC| > 3) and (FDR < 0.05). Although the differential features selected by FC method have strong repeatability, the false-positive results rate is relatively high in the absence of false-positive control (McCarthy and Smyth, 2009).

False Discovery Rate

Without any control, the probability of making the type I error will increase rapidly with the number of hypothesis tests. The FDR could test as many features as possible and effectively control the overall false-positive rate within an acceptable range (Norris and Kahn, 2006). In this study, we applied the Benjamini–Hochberg method to perform multihypothesis test FDR correction on the significant p-value. We adopted FDR < 0.05 as the threshold of screening the significant difference feature.

In the stage 3 feature selection, we used FC and FDR as the stage 2 feature selection algorithm to balance the repeatability and false positive rate of the differential features. To test whether methylated genes and transcriptome profiling are significantly different between normal and tumor samples, we performed heatmap analysis on the differentially methylated genes of BRCA with 892 samples after the stage 2 feature selection (Figure 2A). We found that the level of the methylated gene in each normal or tumor samples was similar, and differentially methylated genes after stage 2 feature selection showed significant differences in normal and tumor tissues. Similarly, we implemented the volcano plot analysis on the transcriptome profiling of BRCA with 1,211 samples after the stage 2 feature selection (Figure 2B). Figure 2B indicates that the joint screening of |log₂⁡FC| and FDR not only ensured the difference of features but also effectively controlled the overall false-positive rate.

FIGURE 2

Figure 2. Heatmap analysis of differentially methylated genes and volcano plot analysis of differentially expression genes for the BRCA dataset. In panel (A), the row represents the methylation level of the genes, and the column represents the normal and tumor samples. Dark red shades indicate the higher level of methylation, and dark blue shades indicate the lower level of methylation. Color keys indicate the intensity associated with normalized beta values. In panel (B), the x-axis represents the log₂FC, and the y-axis represents −log₁₀(FDR), and each dot represents a gene. The significantly upregulated genes are highlighted in red, and the significantly downregulated genes in blue.

Stage 3: mRMR

We adopted the mRMR algorithm (Peng et al., 2005) as the stage 3 feature selection, which could remove redundant features on the premise of preserving the regulation, modification, and collaborative associations in the triple dataset to the utmost extent. The mRMR algorithm maximizes the relevance between features x_i and categorical target variables t while minimizing the redundancy between features by solving the equation as follows:

m a x (\frac{1}{| S |} \sum_{x_{i} \in S} I (x_{i}; t) - \frac{1}{{| S |}^{2}} \sum_{x_{i}, x_{j \in S}} I (x_{i}; x_{j})) (4)

where I(;;) is the mutual information of the two random variables, S is a feature subset, and |S| is the number of features in S. I(;) is defined in terms of the probabilistic density functions p(x),p(y), and p(x,y):

I (x; y) = \iint p (x, y) log \frac{p (x, y)}{p (x) p (y)} dxdy (5)

The input data of the mRMR algorithm needed to be discrete. We defined the discrete function as follows:

\begin{matrix} x_{i, j}^{*} = {\begin{matrix} 2 x_{i, j} \geq \bar{x_{j}} + k * δ_{j}, \\ 0 o t h e r s, \\ - 2 x_{i, j} \leq \bar{x_{j}} - k * δ_{j}, \end{matrix} \end{matrix} (6)

where δ_j represents the standard deviation of the jth feature, $\bar{x_{j}}$ is the average value of the jth feature, and k is the threshold parameter. In this study, we set k = 0.5, and the dataset has been discretized into three states{−2,0,2}.

Heterogeneous Classifiers

After the hierarchical feature selection, we used four different models, i.e., SVM, decision tree (DT), random forest (RF), and AdaBoost, to capture the inherent information of the data with the characteristics of low dimension, nonlinear, imbalance class, and singular values. The large differences and obvious complementarity of the four models are as follows. First, SVM mapped data to higher dimensional space by kernel function and found the best separation hyperplane to maximize the margin between two types of training samples (Noble, 2006). SVM is good at solving the problems of low-dimension and nonlinear inseparable data. However, SVM is sensitive to the singular value, and its performance mainly depends on the selection of kernel function and related parameters. Second, DT was a tree structure, starting from the root node and recursively constructing from top to bottom. Non-leaf nodes were attribute features, with each branch representing the output of the judgment result, and leaf nodes were categories (Quinlan, 1986; Safavian and Landgrebe, 1991). DT was suitable for dealing with all kinds of discrete data. It has the characteristics of simple structure, strong interpretability, and few parameters. However, DT was biased toward the class with a large number of samples, was susceptible to singular values, and ignored the correlation between features (Breiman, 1996; Bauer and Kohavi, 1999; Dietterich, 2000b). Third, RF and AdaBoost were ensemble classifiers that used decision trees as individual classifiers. RF considered the correlation between features and had a high tolerance for singular values. However, if the training set of each tree was unbalanced in the process of random sampling, the performance would be very low on the small sample dataset (Breiman, 2001). Fourth, AdaBoost has certain adaptability for the unbalanced dataset, but it is sensitive to the abnormal samples.

Second Learning Probability Error Ensemble Model

Existing ensemble strategies could not fully integrate the nonlinear relationships between different classifiers and unbiasedly estimate the nonlinear change rules of the triple dataset. Thus, we proposed the SLPEE. First, we combined the class probability prediction values of the first-learning heterogeneous classifiers and the actual class labels of the validating set to form a new dataset, so the diagnosis errors of individual classifiers were implicit in the new dataset. Then, we conducted a second-learning on the new data set via Xgboost, which used the residual between the real value and the predicted value as the next iteration of the learning goal. Xgboost could effectively learn errors and had a strong ability of nonlinear fitting, self-learning, and self-correction (Chen and Guestrin, 2016). Algorithm 1 described the SLPEE in detail. D was the data matrix, y was the label set that has two labels {−1,1}, corresponding to the normal and tumor samples, h was the heterogeneous classifier, and E represented the SLPEE model. L_f was the learning set in the fth fold, Test_f was the testing set in the fth fold, and T_fk represented the k-fold training set in the f-fold learning set.

Algorithm 1: The SLPEE based on F-fold CV.

www.frontiersin.org

Datasets and Preprocessing

We downloaded the protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data of BRCA, LUAD, and KIRC from the TCGA official website¹. For the transcriptome profiling, the amount of gene expression is HTSeq-Counts, and the total dimension of protein-coding gene and ncRNA expression profiles is 60,244. For the DNA methylation data, we used Illumina Human Methylation 450 array methylation chip data. The chip has 485,577 probes, which can detect nearly 450,000 methylation CpG sites in the entire human genome, covering 96% of CpG islands. Based on the same sample ID, we integrated the protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data to construct a triple dataset. The sample numbers of the original datasets for the three cancers are shown in Table 1.

TABLE 1

Table 1. Summary of the original different datasets for three cancers.

We preprocessed the samples and data successively. First, due to the limitations of experiments conditions and manual operation, some metastatic samples are wrongly classified as primary tumor samples, resulting in one patient who may correspond to multiple tumor samples. For example, in the DNA methylation dataset of BRCA, patient TCGA-BH-A1ES corresponds to two tumor samples, of which TCGA-BH-A1ES-06A-12D-A244-05 is a metastatic sample. In this work, we kept the primary tumor sample and the solid tissue normal sample related to the study and excluded outliers. Second, we preprocessed the data, including removing duplicate features, removing features with severe missing values, and correcting the normalized data. For example, when the missing value of a feature accounts for 100% of the total sample size, we think that the missing feature is too serious and should be deleted. We used the normalizeBetweenArrays function of the limma package in R to correct the normalized data. The datasets after preprocessing of the three cancers are shown in Table 2.

TABLE 2

Table 2. Summary of preprocessed datasets for three cancers.

Parameter Settings

In order to reproduce all experimental results in our paper, we set the specific random seeds for BRCA, KIRC, and LUAD as 14, 14, 20, and the other parameters setting of three cancers were same. In the stage 3 feature selection, we assumed that the size of the optimal feature subset is n, which increased from 5 to 30 with a step of one. Our proposed model optimizes the parameters of individual classifiers via grid-search over a parameter grid. For DT, we used the default parameters. For SVM, we set the kernel function as RBF, C = {0.001, 0.01, 0.1}, and gamma = {1.0, 10.0, 100.0}. For RF, we set n_estimators = {50, 100}. For Xgboost and AdaBoost, we, respectively, set n_estimators = {100, 200, 300} and n_estimators = 50.

Performance Evaluation of HFS-SLPEE

Prediction of Cancer Diagnosis

In this work, we implemented 10-fold cross-validation (CV) on three cancers in TCGA to evaluate the prediction performance of HFS-SLPEE. In the 10-fold CV experiment, the triple dataset was randomly divided into 10-folds with equal size, 9 of which were taken as the learning set, and the remaining 1-fold was the testing set. The process is repeated 10 times until all samples are predicted once. In each learning set, we divided the learning set into K disjoint subsets by performing K-fold CV again, of which K−1 subsets were the training set and the remaining one subset was the validating set.

Evaluation Metrics

We considered that the primary tumor was a positive class, and the solid normal tissue was the negative class. There were four results of cancer diagnosis in the testing set: true positive (TP), true negative (TN), false negative (FN), and false positive (FP). Among the results, TP represented the number of correctly classified tumor tissue samples, TN denoted the number of correctly classified normal tissue samples, FN indicated the number of samples predicted to be normal tissue but actually tumor tissue, and FP stood for the number of samples predicted to be tumor tissue but actually normal tissue. Therefore, accuracy, sensitivity/recall, specificity, and F1-score were defined as follows:

A c c u r a c y = \frac{T P + T N}{T P + F P + T N + F N} (7)

S e n s i t i v i t y / R e c a l l = \frac{T P}{T P + F N} (8)

S p e c i f i c i t y = \frac{T N}{T N + F P} (9)

F 1 - s c o r e = 2 \cdot \frac{P r e c i s o n \cdot R e c a l l}{P r e c i s i o n + R e c a l l} = \frac{2 T P}{2 T P + F P + F N} (10)

Results

Performance of HFS-SLPEE Based on the Cross-Validation

To verify whether HFS-SLPEE can generalize the diagnosis of different cancers, we researched on three high-incidence cancers BRCA, LUAD, and KIRC. We recorded the accuracy corresponding to the feature variable n (see Figure 3). It indicates that as n continues to increase, the accuracy increases; when n = 21, n = 12, n = 16, the accuracy reached the peak value and then show a downward trend, which means that the added features contain more noise than information. In this study, we take the features subset when the highest point is reached at first as the optimal. That is, n = 21, n = 12, n = 16, respectively, as the number of features finally selected for the BRCA, LUAD, and KIRC.

FIGURE 3

Figure 3. The relationship curves of the features and the accuracy of three cancers.

The predicted results of HFS-SLPEE for three cancers in TCGA are shown in Table 3. We used 21 key features to make BRCA achieve 99.65% accuracy, 99.61% sensitivity, 100% specificity, and 99.81% F1-score, and only three samples were misdiagnosed. For LUAD and KIRC, we selected 12 and 16 key features to achieve four-indicator 100% precision diagnosis. The results show that HFS-SLPEE achieves an excellent performance and has the generalization ability for three high-incidence cancers diagnosis.

TABLE 3

Table 3. The diagnosis results of three cancer by HFS-SLPEE (%).

Performance of HFS-SLPEE by Ablation Analysis

Our proposed approach mainly consists of three parts, namely, the triple dataset (TDS), the novel hierarchical feature selection algorithm, and SLPEE model. To examine the contribution of each component, we compared the proposed approach with several combinations.

First, we compared the triple dataset with the other seven datasets, including mRNA, miRNA, lncRNA, ncRNA, DNA methylation, transcriptomic, and mRNA and DNA methylation, to inspect the contribution of the triple dataset. We found that the triple dataset achieved the best performance compared with the other seven datasets (see Figures 4A,B). Specifically, miRNA, mRNA, ncRNA, and DNA methylation are all single-type datasets with no absolute dominance, and their contribution rates in different cancers and diagnostic performances are different. The duplex-type datasets (transcriptomic, mRNA, and DNA methylation) have improved performance in many cases compared with the contained single-type dataset. The results indicated that the triple dataset contained more comprehensive and useful information and provided a robust data support.

FIGURE 4

Figure 4. The comparison results of different datasets. (A) The histogram of comparison results. (B) The annotated heatmap of comparison results. As is shown, compared with the other seven datasets, the integrated data of the protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation can improve the performance of the model.

Next, to examine the contribution of CSAFS, which saves the information of all methylation CpG sites related to genes in an aggregated form, we compared the hierarchical feature selection algorithm without and with CSAFS. Owing to the BRCA with the largest number of samples among three cancers, we take BRCA as an example, when the feature variable n is 21, to compare the performance of diagnosis and the time of the feature selection process. On the one hand, the experimental results showed that the hierarchical feature selection with CSAFS achieved better performance, and the accuracy, sensitivity, specificity, and F1-score of BRCA are improved by 0.23, 0.12, 1.2, and 0.13%, respectively. In the final feature subset selected by the hierarchical feature selection with CSAFS, we screened out 11 methylated genes such as WT1-AS and AL513523.2, accounting for about 30% of the final feature subset. On the other hand, the time consumed by the hierarchical feature selection without and with CSAFS is 2.6 and 0.15 h, respectively. It indicated that CSAFS, as the stage 1 feature selection approach, is capable of non-destructively preserving the complex and essential many-to-many modification associations between methylation CpG sites and genes and improving the efficiency of cancer diagnosis.

Finally, to verify the contribution of SLPEE further, we compared the accuracy of SLPEE with DT, RF, SVM, and Adaboost by using the triple dataset and the proposed hierarchical feature selection algorithm that traversed each feature in the range of 5–30. The results showed that the four models have different accuracies for different cancers with different feature numbers and have significant differences and complementary effects (see Figure 5). We found that the SLPEE accounts for 61.5, 88.5, and 92.3% with higher accuracy than the other four models for BRCA, KIRC, and LUAD datasets, respectively. That is, the accuracies of SLPEE model for three cancers are generally better than the other four models, which fully reveals that SLPEE absorbs the advantages of four models and overcomes their respective shortcomings. SLPEE breaks through the limitation of the single model with a limited scope of application and enhances the generalization of different features and different types of cancer diagnoses.

FIGURE 5

Figure 5. The comparison results of SLPEE and the other four models.

Comparison With the Published State-of-the-Art Research

We compared the performance of HFS-SLPEE with state-of-the-art research (Raweh et al., 2018; Xiao et al., 2018b; Alghunaim and Al-Baity, 2019; see Table 4). Table 4 shows that most of the latest studies generally gave a single performance indicator and at most three performance indicators. In this study, we presented four performance indicators to comprehensively measure the performance of HFS. Alghunaim and Al-Baity (2019) used protein-coding gene expression profiles, DNA methylation, and their integrated data in TCGA and treated all data as features with SVM, DT, and RF three models for BRCA diagnosis. Comparing with Alghunaim and Al-Baity (2019), Table 4 shows that the accuracy and sensitivity of the HFS-SLPEE based on the protein-coding gene expression profiles and DNA methylation data are improved by 1.97 and 2.54%, respectively. The triple dataset was an effective solution to the shortcomings of the duplex dataset, which not only achieved the same 100% specificity but also improved the accuracy and sensitivity by 2.32 and 2.79%, respectively. Raweh et al. (2018) proposed a hybrid feature selection algorithm based on the DNA methylation data and applied naive base, RF, and SVM for BRCA, LUAD, and KIRC diagnosis. Comparing with Raweh et al. (2018), we found that HFS-SLPEE steadily improved the accuracy and F1-score by 0.03–0.35% and 0.29–4.21% for the three cancers, respectively. Xiao et al. (2018b) utilized the DESeq feature selection method to screen for the differentially expressed genes and adopted a deep neural network to learn the predictions of KNN, SVM, DT, RF, and GBDT. It was indicated that HFS-SLPEE improved the accuracy by 0.76%, comparing with the results of Xiao et al. (2018b). In summary, the proposed approach is a powerful framework for precision cancer diagnosis, which refers to the three aspects of the dataset, feature selection, and diagnosis model, outperforming the previously published state-of-the-art methodologies.

TABLE 4

Table 4. Comparison results with the state-of-the-art approaches (%).

HFS-SLPEE Uncovers New Potential Biomarkers for Three Cancers

According to the optimal feature variable, we obtained 10 group features corresponding to the BRCA, LUAD, and KIRC with a total of 210, 120, and 160 features. After removing the overlapping features, the optimal features subsets were, respectively, reduced to 37, 35, and 40 features for the three cancers. The average repetition rate of the 10 group features for three cancers is 70.83–82.38%. On the one hand, the features selected in each fold is different, verifying that cancer does not generate along a fixed trajectory, but there are many different signal pathways. On the other hand, although the features subset selected in each fold changes dynamically, some feature genes recurring, and part of them will appear steadily in every fold.

We organized the features into three categories: protein-coding genes, ncRNAs, and methylated genes (see Table 5). We have two findings. First, the proportion of epigenetic factors containing DNA methylation and non-coding genes in carcinogenesis is not less than 1/3. Specifically, the proportion of epigenetic factors (including methylated genes and non-coding genes) in features subset are 40.54, 34.29, and 37.5% for BRCA, LUAD, and KIRC, respectively. Second, a limited number of specific protein-coding genes and lncRNAs (antisense and or lincRNA) appear steadily in each cross-validation experiment, which can be regarded as potential biomarkers for these three cancers.

TABLE 5

Table 5. The summary of selected features for the three cancers.

Discussion

In this work, we developed a novel hierarchical feature selection and second learning probability error ensemble model, called HFS-SLPEE, for cancer diagnosis. HFS-SLPEE is a precision cancer diagnosis framework, constituted by the integrated data of protein-coding gene expression profiles, non-coding RNA expression profiles, and DNA methylation data, the novel HFS algorithm, and the SLPEE model. We experimentally studied three high-incidence cancer as BRCA, LUAD, and KIRC in the TCGA database. The results have demonstrated that HFS-SLPEE achieves higher performance in comparison to several state-of-the-art methodologies. Therefore, HFS-SLPEE could be a powerful tool for cancer diagnosis. Moreover, HFS-SLPEE is universal, not limited to the field of cancer diagnosis. It is also suitable for cancer subtype classification, tumor origin detection, etc.

Herein, we acknowledge some limitations of our proposed method. Since a recent related study (Liang et al., 2020) has demonstrated that sufficient samples may enhance performance of models. Despite the enormous availability of cancer datasets in TCGA, the number of samples is still not enough. As a machine-learning-based model, HFS-SLPEE needed much time to train because of the high number of combinations of trainable hyperparameters. The number of normal samples is much less than that of tumor samples in practice, which has been a big challenge when building a gold-standard dataset for cancer diagnosis. A study by Liang et al. (2017) showed that somatically acquired structural variation (SV) may induce tumor formation; we will explore some other data, such as copy number alteration (CNA) and SV, for better performances of HFS-SLPEE in future work.

Data Availability Statement

Source data and code can be downloaded from https://github.com/luckymengmeng/HFS-SLPEE.

Author Contributions

YM and MJ conceived the concept of the work. YM collected data, performed the experiments, and wrote the manuscript. MJ helped in revising and supervising the manuscript. Both authors contributed to the article and approved the submitted version.

Funding

The work was supported by the National Natural Science Foundation of China Grant 61773157.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

^ https://portal.gdc.cancer.gov/

References

Alghunaim, S., and Al-Baity, H. H. (2019). On the scalability of machine-learning algorithms for breast cancer prediction in big data context. IEEE Access 7, 91535–91546. doi: 10.1109/ACCESS.2019.2927080

HFS-SLPEE: A Novel Hierarchical Feature Selection and Second Learning Probability Error Ensemble Model for Precision Cancer Diagnosis

Introduction

Materials and Methods

Construction of the Triple Dataset

The Novel Hierarchical Feature Selection

Stage 1: CpG Sites Aggregation Feature Selection

Stage 2: FC and FDR

Fold Change

False Discovery Rate

Stage 3: mRMR

Heterogeneous Classifiers

Second Learning Probability Error Ensemble Model

Datasets and Preprocessing

Parameter Settings

Performance Evaluation of HFS-SLPEE

Prediction of Cancer Diagnosis

Evaluation Metrics

Results

Performance of HFS-SLPEE Based on the Cross-Validation

Performance of HFS-SLPEE by Ablation Analysis

Comparison With the Published State-of-the-Art Research

HFS-SLPEE Uncovers New Potential Biomarkers for Three Cancers

Discussion

Data Availability Statement

Author Contributions

Funding

Conflict of Interest

Footnotes

References

95% of researchers rate our articles as excellent or good