Construction and Identification of a Novel 5-Gene Signature for Predicting the Prognosis in Breast Cancer

Guo, Lingling; Jing, Yu

doi:10.3389/fmed.2021.669931

ORIGINAL RESEARCH article

Front. Med., 14 October 2021

Sec. Gene and Cell Therapy

Volume 8 - 2021 | https://doi.org/10.3389/fmed.2021.669931

Lingling Guo¹

Yu Jing²*

¹Department of Ultrasound, The First Affiliated Hospital of Jinzhou Medical University, Jinzhou, China
²Clinical Trial Ward of the First Affiliated Hospital of Jinzhou Medical University, Jinzhou, China

Background: Breast cancer is one of the most common malignancies in women worldwide. The purpose of this study was to identify hub genes and construct a prognostic signature that could predict the survival of patients with breast cancer (BC).

Methods: We identified differentially expressed genes between the responder group and non-responder group based on the GEO cohort. Drug-resistance hub genes were identified by weighted gene co-expression network analysis, and a multigene risk model was constructed by univariate and multivariate Cox regression analysis based on the TCGA cohort. Immune cell infiltration and mutation characteristics were analyzed.

Results: A 5-gene signature (GP6, MAK, DCTN2, TMEM156, and FKBP14) was constructed as a prognostic risk model. The 5-gene signature demonstrated favorable prediction performance in different cohorts and was confirmed to be an independent risk indicator. The nomogram comprising the 5-gene signature performed better than other clinical features. Further, in the high-risk group, high M2 macrophage scores were associated with a poor prognosis, and TP53 mutations were more frequent than in the low-risk group. In the low-risk group, high CD8+ T cell scores were associated with a good prognosis, and CDH1 mutations were more frequent than in the high-risk group. In addition, patients in the low-risk group responded well to immunotherapy. Immunohistochemistry showed that MAK, GP6, and TMEM156 were significantly more highly expressed in tumor tissues, whereas DCTN2 was more highly expressed in normal tissues.

Conclusions: Our study may identify potential new targets against breast cancer and provide new insight into the underlying mechanisms.

Introduction

Breast cancer is one of the most frequently diagnosed malignancies in women and the major cause of cancer-associated mortality worldwide. Recently, the World Health Organization's International Agency for Research on Cancer released the latest global cancer burden data for 2020. The most obvious change is that the incidence of breast cancer has increased rapidly with 2.26 million new cases, thereby replacing lung cancer (2.20 million new cases) as the most common cancer (1). From 2012 to 2016, the incidence of breast cancer increased by 0.3% per year, and the mortality rate continued to decline (2, 3). Assessing and improving breast cancer patients' outcomes are still tasks of considerable importance.

In recent years, the diagnosis of BC has depended mainly on pathological examination, imaging examination, and evaluation of tumor biomarkers. BC has a high recurrence rate, and its age of onset is gradually becoming younger (4). As a potential non-invasive option for monitoring the risk of recurrence in BC patients, gene signatures have attracted increasing attention. Integrating multiple biomarkers into a single model can improve prediction accuracy compared with a single clinical biomarker. Therefore, it is necessary to construct new biomarkers for predicting curative effect. Such gene signatures may have the clinical potential to predict patient prognosis and aid treatment selection. Previous studies have established prognostic signatures for breast cancer by bioinformatics. For example, Zhong et al. established an autophagy-related gene-based prognostic signature for breast cancer patients that was of great significance in predicting overall survival (5). Zhang et al. developed an 11-gene signature associated with glycolysis to predict survival in breast cancer patients (6). In addition, Xie et al. developed a 12-gene prognostic signature that provided new insights for assessing the high risk of death from breast cancer and for the individualized use of immunotherapy (7). Although several gene signatures associated with breast cancer have been published, some still have shortcomings, and prognostic prediction for breast cancer patients remains insufficiently addressed. Therefore, there is an urgent need to construct a breast cancer gene signature to predict prognosis and optimize treatment.

This study aimed to identify prognostic differentially expressed genes (DEGs) and to construct and validate a risk model for breast cancer. Moreover, the differences in immune cell infiltration and mutation characteristics between high- and low-risk patients were evaluated. We built a 5-gene prognostic risk model with excellent stability and reliability for predicting prognosis in breast cancer patients.

Materials and Methods

Data Download and Preprocessing

The RNA-seq and clinical data of breast cancer were downloaded from The Cancer Genome Atlas (TCGA; https://portal.gdc.cancer.gov/). The RNA sequencing data were pre-processed in the following steps: (1) the samples without clinical data were removed; (2) the median expression value was selected for gene symbols corresponding to multiple probes.

The GSE59515 data set, which contained information on drug sensitivity during neoadjuvant treatment as evaluated by ultrasound, and the GSE20685 and GSE31448 data sets, which contained information on the survival time of breast cancer patients, were obtained from the Gene Expression Omnibus (GEO). The GEO data sets were pre-processed in the following steps: (1) the samples without clinical data were removed; (2) the probes were converted to gene symbols; (3) the probes corresponding to more than 1 gene were eliminated; (4) the median expression value was selected for gene symbols corresponding to multiple probes. After preprocessing, the GSE59515 data set contained a total of 50 samples, including 34 responders and 16 non-responders. There were 1,034 samples in the TCGA data set, 327 in the GSE20685 data set, and 246 in the GSE31448 data set. The clinical information for all cohorts is shown in Table 1.

Table 1. Sample information.
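The probe-level preprocessing steps described above can be illustrated with a minimal Python sketch. It assumes one plausible reading of "median expression value", namely that a gene measured by several probes receives the per-sample median across those probes. The function name, probe identifiers, and gene names are hypothetical and do not come from the study.

```python
from collections import defaultdict
from statistics import median


def collapse_probes_to_genes(probe_values, probe_to_genes):
    """Collapse probe-level expression to gene-level expression.

    probe_values:   {probe_id: [expression per sample]}
    probe_to_genes: {probe_id: [gene symbols the probe maps to]}

    Probes that map to no gene or to more than one gene are dropped.
    Genes with several probes get the per-sample median across probes
    (an assumed interpretation of the text, not a confirmed one).
    """
    rows_per_gene = defaultdict(list)
    for probe, values in probe_values.items():
        genes = probe_to_genes.get(probe, [])
        if len(genes) != 1:  # unmapped or ambiguous probe
            continue
        rows_per_gene[genes[0]].append(values)

    collapsed = {}
    for gene, rows in rows_per_gene.items():
        # zip(*rows) iterates over samples; take the median across probes
        collapsed[gene] = [median(sample_values) for sample_values in zip(*rows)]
    return collapsed


# Toy data: three probes for GENE_A, one ambiguous probe (p4).
probe_values = {
    "p1": [1.0, 2.0],
    "p2": [3.0, 4.0],
    "p3": [5.0, 6.0],
    "p4": [7.0, 8.0],
}
probe_to_genes = {
    "p1": ["GENE_A"],
    "p2": ["GENE_A"],
    "p3": ["GENE_A"],
    "p4": ["GENE_B", "GENE_C"],
}
gene_values = collapse_probes_to_genes(probe_values, probe_to_genes)
```

In this toy input, the ambiguous probe p4 is discarded and GENE_A is summarized by the median of its three probes in each sample.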

Identification of Differentially Expressed Genes Associated With Neoadjuvant Chemosensitivity and Functional Annotation

The limma package was used to identify the DEGs between the responder and non-responder groups in the GSE59515 data set. Further, Gene Ontology (GO) functional enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of these DEGs were performed with the R package WebGestaltR (v0.4.3).

Identification of Co-expression Modules

Weighted gene co-expression network analysis (WGCNA) was applied to the GSE59515 expression profiles, using its R software package, to identify co-expressed genes and modules. The log(k) of a node with connection degree k was inversely associated with the log(P(k)) of the probability of that node occurring, and genes were included at a correlation coefficient > 0.85. Subsequently, the expression matrix was converted into an adjacency matrix, and the topological overlap matrix was calculated from the adjacency matrix. Average linkage hierarchical clustering was then performed on the basis of the topological overlap dissimilarity measure. Co-expression modules were determined using the dynamic hybrid tree cut algorithm with a minimum of 100 genes per module. The eigenvectors of each module were then calculated, and closely related modules were merged into a new module.
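To make the step from expression matrix to adjacency matrix to topological overlap more concrete, the following pure-Python sketch computes an unsigned adjacency a_ij = |cor(i, j)|^beta and the standard unsigned topological overlap measure. It is only an illustration under stated assumptions: the study used the WGCNA R package, the soft-threshold power `beta` is not reported in this section (the value 2 below is arbitrary), and the toy expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # fails for constant genes


def adjacency_matrix(expr, beta):
    """Unsigned adjacency a_ij = |cor(gene_i, gene_j)| ** beta, diagonal 0."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(expr[i], expr[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The zero diagonal means u = i and u = j contribute nothing.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n))
            denom = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / denom
    return tom


# Toy expression matrix: 3 genes (rows) x 4 samples (columns).
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 3.0, 2.0, 4.0],
]
adj = adjacency_matrix(expr, beta=2)  # beta=2 is an arbitrary choice
tom = topological_overlap(adj)
# Dissimilarity used as the distance for hierarchical clustering.
dissimilarity = [[1.0 - t for t in row] for row in tom]
```

The dissimilarity matrix (1 minus the topological overlap) is the quantity that average linkage hierarchical clustering would then operate on.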

Identification of Hub Genes and Protein-Protein Interaction Network Analysis

STRING (https://string-db.org/) is a public database of known and predicted protein interactions, covering 9.6 million proteins and 13.8 million protein-protein interactions from more than 2,031 species. It is a comprehensive database derived from experimental data, co-expression data, and automated text mining, and it also contains the results of bioinformatics predictions. Studying the protein-protein interaction network helps to identify hub regulatory genes. There are many databases of protein-protein interactions, but STRING covers the most species and has the most information about interactions.

We identified the genes shared between the DEGs (responder vs. non-responder) and the yellow module genes and drew a Venn diagram. We then analyzed the protein-protein interaction (PPI) network of these common genes with the STRING database and visualized it with Cytoscape (v3.7.2) to find the network module.

The Prognostic Risk Model Construction

Training Set Samples Random Grouping

First, all samples (n = 1,034) from the TCGA database were randomly assigned to a training cohort and a validation cohort at a ratio of 1:1. To reduce the influence of random allocation bias on model stability, the random assignment was repeated 100 times by resampling with replacement. An eligible pair of training and validation cohorts was selected according to the following criteria: (1) the two cohorts were well matched for age, sex, follow-up time, and patient mortality ratio; and (2) after the gene expression profiles of the 2 randomly grouped data sets were clustered, the numbers of samples in the binary classes were close.
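The repeated random grouping can be sketched as follows. This is a simplified illustration, not the authors' procedure: it repeats a 1:1 shuffle-and-split 100 times and keeps the split with the smallest difference in death rate between the two cohorts, which stands in for only one of the matching criteria listed above. The sample identifiers and outcomes are synthetic, and the function names are hypothetical.

```python
import random


def split_half(sample_ids, seed):
    """Shuffle the samples with a fixed seed and split them 1:1."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def death_rate(ids, died):
    """Fraction of samples in `ids` with a recorded death event."""
    return sum(died[s] for s in ids) / len(ids)


def best_balanced_split(sample_ids, died, n_repeats=100):
    """Try n_repeats random splits; keep the one with the closest death rates."""
    best = None
    for seed in range(n_repeats):
        train, valid = split_half(sample_ids, seed)
        gap = abs(death_rate(train, died) - death_rate(valid, died))
        if best is None or gap < best[0]:
            best = (gap, train, valid)
    return best[1], best[2]


# Synthetic cohort of 1,034 samples with an invented death indicator.
sample_ids = [f"S{i:04d}" for i in range(1034)]
died = {s: (i % 7 == 0) for i, s in enumerate(sample_ids)}
train_ids, valid_ids = best_balanced_split(sample_ids, died)
```

A fuller version would score each candidate split on age, sex, and follow-up time as well, for example by combining per-variable test statistics.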

Univariate Survival Analysis of the Training Cohort

Univariate survival analysis was performed with the coxph function of the R survival package (P < 0.05) to identify the prognostic hub genes in the training cohort. Stepwise regression based on the Akaike Information Criterion (AIC) was used to balance goodness of model fit against the number of parameters required. Variables were dropped sequentially by evaluating the effect of their removal on the model's AIC, where a lower AIC indicates a better model.
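The AIC-based elimination can be sketched in Python using the standard definition AIC = 2k − 2 ln(L), where k is the number of parameters and L is the maximized likelihood. The sketch below is an assumption-laden illustration: `fit_loglik` is a hypothetical callback that would return the log partial likelihood of a Cox model fitted on the given variables, and here it is replaced by an invented toy function so the example runs without a survival library. The variable names G1, G2, N1, and N2 are placeholders, not genes from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def backward_stepwise(variables, fit_loglik):
    """Drop one variable at a time while doing so lowers the AIC."""
    current = list(variables)
    best_aic = aic(fit_loglik(current), len(current))
    while len(current) > 1:
        # AIC of every model obtained by removing a single variable.
        candidates = []
        for drop in current:
            reduced = [v for v in current if v != drop]
            candidates.append((aic(fit_loglik(reduced), len(reduced)), drop))
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best_aic:  # no removal improves the AIC
            break
        best_aic = candidate_aic
        current.remove(drop)
    return current, best_aic


# Toy stand-in for a Cox fit: each variable adds a fixed log-likelihood gain.
toy_gain = {"G1": 5.0, "G2": 4.0, "N1": 0.2, "N2": 0.1}


def toy_fit_loglik(variables):
    return -100.0 + sum(toy_gain[v] for v in variables)


selected, selected_aic = backward_stepwise(["G1", "G2", "N1", "N2"], toy_fit_loglik)
```

In the toy setting, the two weakly informative variables are removed because the 2-point penalty per parameter outweighs their small likelihood gain, while the two informative variables are kept.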

Immunohistochemical Staining Evaluation

To validate the expression of the 5-gene signature, tissue microarrays (TMAs) comprising 89 cases (45 BRCA tissues and 44 paired normal samples) were purchased from Shanghai Outdo Biotech Co., Ltd. The studies were conducted in accordance with the International Ethical Guidelines for Biomedical Research Involving Human Subjects (CIOMS), and the research protocols were approved by the Clinical Research Ethics Committee of the First Affiliated Hospital of Jinzhou Medical University.

The TMA slides were incubated overnight at 4°C with anti-GP6 antibody (1:200 dilution; SAB - 47582), FKBP14 antibody (1:100 dilution; GENX SPAN- GXP155296), DCTN2 antibody (1:100 dilution; GENX SPAN- GXP187676), MAK antibody (1:100 dilution; GENX SPAN- GXP309295), and TMEM156 antibody (1:100 dilution; Proteintech-25159-1-AP).

The staining scores were evaluated by three pathologists who were blinded to the patients' clinical characteristics. The scoring system was based on the proportion of positive cells among all tissue cells and the staining intensity of these positive cells. The intensity of staining was classified as 0 (negative), 1 (weak), 2 (moderate), or 3 (strong). The proportion of positive cells was scored as 0 (<5%), 1 (5–25%), 2 (26–50%), 3 (51–75%), or 4 (>75%). According to the staining intensity and the proportion of positive cells, the immunohistochemical results were graded as 0–1, negative (–); >1–4, weakly positive (+); >4–8, moderately positive (++); and >8–12, strongly positive (+++).
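The scoring scheme can be written as a small Python function. The text does not state how intensity and proportion are combined; the sketch assumes the final score is their product, which is consistent with the 0–12 range of the grades (3 × 4 = 12). The function names are hypothetical.

```python
def proportion_score(percent_positive):
    """Map the percentage of positive cells to a 0-4 proportion score."""
    if percent_positive < 5:
        return 0
    if percent_positive <= 25:
        return 1
    if percent_positive <= 50:
        return 2
    if percent_positive <= 75:
        return 3
    return 4


def ihc_grade(intensity, percent_positive):
    """Return (score, grade) for one stained sample.

    intensity: 0 (negative), 1 (weak), 2 (moderate), or 3 (strong).
    The score is assumed to be intensity * proportion score.
    """
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    score = intensity * proportion_score(percent_positive)
    if score <= 1:
        return score, "-"
    if score <= 4:
        return score, "+"
    if score <= 8:
        return score, "++"
    return score, "+++"


strong_example = ihc_grade(3, 80)    # strong staining, >75% positive cells
moderate_example = ihc_grade(2, 60)  # moderate staining, 51-75% positive cells
```

For instance, strong staining in more than 75% of cells gives 3 × 4 = 12 and falls in the strongly positive grade.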

Results

Identification of Differentially Expressed Genes and Functional Enrichment

The DEGs were filtered with thresholds of P < 0.05 and fold change (FC) > 1.2. A total of 979 DEGs, including 367 upregulated and 612 downregulated genes, were identified between the responder and non-responder groups. The volcano plot and heat map showed that most of these DEGs were downregulated (Figures 1A,B). The DEGs are listed in Supplementary Table 1.
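As an illustration of the filtering thresholds, the following Python sketch classifies a gene from its log2 fold change and P-value. It assumes the FC > 1.2 cutoff is applied symmetrically, so that a gene counts as downregulated when FC < 1/1.2, that is, when |log2FC| > log2(1.2) ≈ 0.263. The function name and labels are hypothetical.

```python
import math


def classify_deg(log2_fc, p_value, fc_cutoff=1.2, p_cutoff=0.05):
    """Label a gene as 'up', 'down', or 'ns' (not significant).

    Assumes a symmetric fold-change cutoff on the log2 scale.
    """
    threshold = math.log2(fc_cutoff)
    if p_value >= p_cutoff:
        return "ns"
    if log2_fc > threshold:
        return "up"
    if log2_fc < -threshold:
        return "down"
    return "ns"


# Toy genes: (log2 fold change, P-value)
toy_genes = {
    "gene_1": (0.50, 0.01),
    "gene_2": (-0.50, 0.01),
    "gene_3": (0.10, 0.01),
    "gene_4": (1.00, 0.20),
}
labels = {gene: classify_deg(lfc, p) for gene, (lfc, p) in toy_genes.items()}
```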

Figure 1. Identification of differentially expressed genes between the responder and non-responder groups and functional enrichment. (A) Volcano plot; red represents upregulation, and green represents downregulation. The abscissa represents log2FC, and the ordinate represents the p-value. (B) Heat map of the expression of differentially expressed genes in the responder and non-responder groups. (C) Differentially expressed genes enriched in biological process terms. (D) Differentially expressed genes enriched in cellular component terms. (E) Differentially expressed genes enriched in molecular function terms. (F) Differentially expressed genes enriched in pathways. The abscissa represents the percentage of gene enrichment; the ordinate represents the enriched function or pathway.

The 979 DEGs between the responder and non-responder groups were subjected to GO functional enrichment analysis and KEGG pathway analysis with the R package WebGestaltR (v0.4.3). There were 10 significantly enriched biological process terms with a false discovery rate (FDR) < 0.05 (Figure 1C), 56 significantly enriched cellular component terms (Figure 1D), and 3 significantly enriched molecular function terms (Figure 1E). In addition, 10 KEGG pathways were identified (Figure 1F, Supplementary Table 2).