m5Cpred-XS: A New Method for Predicting RNA m5C Sites Based on XGBoost and SHAP

Liu, Yinbo; Shen, Yingying; Wang, Hong; Zhang, Yong; Zhu, Xiaolei

doi:10.3389/fgene.2022.853258

ORIGINAL RESEARCH article

Front. Genet., 30 March 2022

Sec. Computational Genomics

Volume 13 - 2022 | https://doi.org/10.3389/fgene.2022.853258

Yinbo Liu^†

Yingying Shen^†

Hong Wang

Yong Zhang*

Xiaolei Zhu*

School of Sciences, Anhui Agricultural University, Hefei, China

As one of the most important post-transcriptional modifications of RNA, 5-cytosine-methylation (m5C) is reported to closely relate to many chemical reactions and biological functions in cells. Recently, several computational methods have been proposed for identifying m5C sites. However, the accuracy and efficiency are still not satisfactory. In this study, we proposed a new method, m5Cpred-XS, for predicting m5C sites of H. sapiens, M. musculus, and A. thaliana. First, the powerful SHAP method was used to select the optimal feature subset from seven different kinds of sequence-based features. Second, different machine learning algorithms were used to train the models. The results of five-fold cross-validation indicate that the model based on XGBoost achieved the highest prediction accuracy. Finally, our model was compared with other state-of-the-art models, which indicates that m5Cpred-XS is superior to other methods. Moreover, we deployed the model on a web server that can be accessed through http://m5cpred-xs.zhulab.org.cn/, and m5Cpred-XS is expected to be a useful tool for studying m5C sites.

Introduction

RNA modification plays pivotal roles in many biological processes (Tang et al., 2001; Matzke et al., 2004; Xu et al., 2013; Jespersen et al., 2017; Xue Chen et al., 2020). To date, about 170 types of RNA modifications have been discovered (Xuan et al., 2018), among which 5-methylcytosine (m5C) is one of the most abundant post-transcriptional modifications (PTCM). In this modification, a methyl group is transferred to the fifth carbon atom of cytosine by an RNA methyltransferase (Jespersen et al., 2017). The m5C modification plays important roles in many biochemical processes (Catania and Fairweather 1991; Fasolino et al., 2017; Yang et al., 2017; He et al., 2020; Xue MiaoMiao et al., 2020; Zhang et al., 2020), such as the pathogenesis of various cancers (He et al., 2020; Xue MiaoMiao et al., 2020; Zhang et al., 2020), rRNA assembly (Zhang et al., 2020), and cellular aging (Catania and Fairweather 1991). Thus, it is meaningful to pinpoint m5C sites in RNA sequences.

Several experimental methods have been developed to identify m5C sites, including miCLIP-seq (Hussain et al., 2013), Aza-IP-seq (Khoddami and Cairns 2013), bisulfite sequencing (Agris 2008; Schaefer et al., 2010), and m5C-RIP-seq (Khoddami et al., 2019). However, each of these methods has its own shortcomings (Fu et al., 2012). For example, bisulfite sequencing cannot detect m5C sites in low-abundance RNA. Moreover, the existing experimental methods are time-consuming and expensive. In recent years, with the development of computer technology, several computational methods, especially machine learning-based methods, have been developed for RNA m5C site identification (Feng et al., 2016; Qiu et al., 2017; Sabooh et al., 2018; Zhang et al., 2018).

The computational methods are mainly classified into two categories: random forest (RF)-based models and support vector machine (SVM)-based models. Based on RF, Qiu et al. (2017) proposed iRNAm5C-PseDNC based on pseudo dinucleotide composition (PseDNC) feature encoding, and Li et al. (2018) constructed RNAm5Cfinder by using mononucleotide binary encoding (MNBE) to encode the RNA sequences. Based on these two feature encodings and K-tuple nucleotide frequency component (KNFC), Song et al. (2018) developed a predictor named PEA-m5C. By using SVM as the classifier, Feng et al. (2016) developed m5C-PseDNC based on features of PseDNC. Fang et al. (2019) built RNAm5CPred based on the features of PseDNC, KNFC, and MNBE. By integrating multiple SVM methods, Zhang et al. (2018) developed an ensemble model, m5C-HPCR, by incorporating different physical–chemical properties into PseDNC. Chen Xiao et al. (2020) proposed another SVM-based model, m5CPred-SVM, which uses six sequence-based features, including k-nucleotide frequency (KNF), K-spaced nucleotide pair frequency (KSNPF), position-specific nucleotide propensity (PSNP), K-spaced position-specific dinucleotide propensity (KSPSDP), PseDNC, and chemical property with density (CPD).

As mentioned above, different kinds of features have been generated for predicting m5C sites, and the dimension of these features can be very high; however, not all the features are relevant for building machine learning models. Moreover, the features with ultrahigh dimensions also pose a great challenge to computer performance (Li et al., 2021). Selecting the optimal feature subset by appropriate feature selection methods can not only improve the accuracy of the prediction model, but also effectively reduce the huge computing power required for model training.

Recently, different feature selection methods have been used in developing models for predicting RNA modification sites. Wang et al. (2018) used the minimum redundancy maximum relevance (mRMR) algorithm to select discriminative features from the features encoded based on RNA sequences. Sabooh et al. (2018) developed a new computational method, pm5CS-Comp-mRMR, which also uses mRMR to select the discriminative features. Furthermore, Visentini et al. (2016) first sorted the features according to the F-score obtained in the eXtreme gradient boosting (XGBoost) (Chen, 2016) package and then selected the top 50 features based on the incremental feature selection (IFS) strategy as the optimal feature subset. To reduce the dimension of features, Chai et al. (2021a) proposed an efficient m5C site prediction approach, Staem5, based on features selected by F-score. The SHapley Additive exPlanations (SHAP) (Wang and Gribskov 2019; Bi et al., 2020) method, which can interpret the importance of features, is another effective method for selecting relevant features. The method was also used in several recent works (Bi et al., 2020; Pathy et al., 2020; Effrosynidis and Arampatzis 2021).

In this study, we established a new method to predict m5C sites by using XGBoost based on features selected by SHAP. We named this method m5Cpred_XS, which can be used to predict m5C sites in multiple species. Extensive experiments demonstrated that the proposed predictor, m5Cpred_XS, outperformed other existing prediction methods. Finally, a web server (http://m5cpred-xs.zhulab.org.cn/) was deployed for the users.

Materials and Methods

Overall Framework of m5Cpred_XS

For building our model reasonably, we conducted our study in six steps. I) A benchmark data set was collected. The benchmark data set was divided into the training set and the independent test set. II) The features were extracted from RNA sequences. III) The SHAP-based feature selection was carried out to select the optimal feature subset. IV) The XGBoost was used to train the model. V) The comparison and analysis of different models was conducted. VI) A web server for predicting m5C sites was developed for the community. The overall flow chart of our study is shown in Figure 1.

FIGURE 1. The flowchart of m5Cpred_XS.

Benchmark Data Sets

For fair comparison, we used the same data sets as in Chen Xiao et al. (2020). In their work, they collected data for three species: H. sapiens, M. musculus, and A. thaliana. As shown in Table 1, the data sets contain 269, 5563, and 6289 positive samples for the three species, respectively, and the numbers of negative samples are the same as positive samples. The positive samples of H. sapiens, M. musculus, and A. thaliana were collected from the work of Yang et al. (2017), Khoddami et al. (2019), and Cui et al. (2017), respectively. For the details about how the data sets were obtained, please refer to Chen Xiao et al. (2020).

TABLE 1. Training and test data sets of three species.

To build and evaluate the models, the benchmark data sets were divided into two parts: the training data sets and the independent test sets. The training data sets were used for the model construction, cross-validation, and the determination of the hyperparameters of machine learning algorithms, whereas the independent test sets were used for testing the prediction performance and generalization ability of the models. For A. thaliana, 1000 positive and 1000 negative samples were randomly selected from the data set as the independent test set, and the remaining 5289 positive and 5289 negative samples were used as the training data set. Similarly, 1000 positive and 1000 negative samples from the M. musculus benchmark data set were selected as the independent test set, and the remaining 4563 positive and 4563 negative samples were used as the training data set. For H. sapiens, 69 positive and 69 negative samples were randomly selected as the independent test set, and the remaining 200 positive and 200 negative samples were used as the training data set. The specific partitioning of the data sets is shown in Table 1.

For each RNA segment, it can be expressed in the following form:

R_{λ} (C) = N_{- λ} N_{- (λ - 1)} \dots N_{- 1} C N_{1} \dots N_{λ - 1} N_{λ} .

In this formula, $N_{λ}$ and $N_{- λ}$ represent the $λ$-th downstream and the $λ$-th upstream nucleotide, respectively, counted from the cytosine (C) at the center. Previous studies (Hussain et al., 2013; Khoddami and Cairns 2013; Qiu et al., 2017; Sabooh et al., 2018; Zhang et al., 2018; Khoddami et al., 2019) show that the performance is better when $λ$ is set to 20. Therefore, in this study, we also set $λ = 20$, which means that all the RNA segments have a length of 41 bp ($2λ + 1$).
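As an illustration of the segment definition above, the following minimal Python sketch extracts every window of $2λ + 1$ nucleotides centered on a cytosine. The function name `cytosine_windows` and the toy sequence are assumptions made for this example and are not taken from the m5Cpred-XS implementation.

```python
def cytosine_windows(rna, flank=20):
    """Return (index, segment) for every C that has `flank` nucleotides on both sides."""
    windows = []
    for i, nt in enumerate(rna):
        # Skip cytosines that are too close to either end of the sequence.
        if nt == "C" and i >= flank and i + flank < len(rna):
            windows.append((i, rna[i - flank:i + flank + 1]))
    return windows


# Toy sequence (not real data): a single C flanked by 20 nucleotides on each side.
toy_rna = "A" * 20 + "C" + "G" * 20
toy_windows = cytosine_windows(toy_rna)
print(len(toy_windows), len(toy_windows[0][1]))
```

Cytosines closer than `flank` positions to either end are skipped in this sketch; how such sites are handled in practice is a design choice that the text does not specify.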

Feature Encoding Extraction

Enhanced Nucleic Acid Composition

ENAC encoding (Ahmad and Shatabda 2019) is used for feature extraction from equal-length RNA sequences. It first fixes the length of a window, and the window is then slid from the 5′ terminus to the 3′ terminus of the RNA segment one nucleotide at a time. The features of ENAC are expressed as follows (Han et al., 2019):

V = \left[ \frac{N_{A, win_{1}}}{S}, \frac{N_{C, win_{1}}}{S}, \frac{N_{G, win_{1}}}{S}, \frac{N_{U, win_{1}}}{S}, \frac{N_{A, win_{2}}}{S}, \frac{N_{C, win_{2}}}{S}, \dots, \frac{N_{C, win_{L - S + 1}}}{S}, \frac{N_{G, win_{L - S + 1}}}{S}, \frac{N_{U, win_{L - S + 1}}}{S} \right] .

In this formula, $S$ represents the size of the sliding window, $L$ is the length of the RNA segment, and $N_{t, win_{r}}$ represents the number of nucleotides of type $t$ in window $r$ $(r = 1, 2, \dots, L - S + 1; t \in \{A, C, G, U\})$. In this paper, the value of $S$ is set to five; thus, for $L = 41$, there are 37 windows with four frequencies each, and the dimension of ENAC is 148.
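The ENAC encoding described above can be sketched in a few lines of Python. This is a minimal illustration under the stated window size, not the authors' implementation; the function name `enac` and the toy segment are assumptions.

```python
def enac(seq, window=5):
    """Enhanced nucleic acid composition: A, C, G, U frequencies in each sliding window."""
    alphabet = "ACGU"
    features = []
    # Slide the window one position at a time from the 5' end to the 3' end.
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        features.extend(chunk.count(nt) / window for nt in alphabet)
    return features


toy_segment = "ACGU" * 10 + "A"  # toy 41-nucleotide segment, not real data
print(len(enac(toy_segment)))    # (41 - 5 + 1) windows x 4 nucleotides
```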

The Composition of K-Spaced Nucleic Acid Pairs

The CKSNAP feature encoding scheme (Cui et al., 2017; Ju and Wang 2020) is based on the frequency of k-spaced nucleotide pairs (k = 0, 1, 2, 3, 4, 5). For a given k, for example k = 1, there are 16 possible k-spaced nucleotide pairs (i.e., “A*A”, “A*C”, “A*G”, …, “U*G”, “U*U”), and CKSNAP can be expressed by the following formula:

\left( \frac{N_{A*A}}{N_{total}}, \frac{N_{A*C}}{N_{total}}, \frac{N_{A*G}}{N_{total}}, \dots, \frac{N_{U*U}}{N_{total}} \right)_{16},

where $*$ represents k arbitrary nucleotides, and $N_{A*A}$ represents the number of nucleotide pairs $A*A$ appearing in the entire RNA sequence. $N_{total}$ represents the total number of nucleotide pairs with the interval k appearing in the RNA sequence. In total, 96 (16 × 6) features were generated by CKSNAP encoding.
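A minimal Python sketch of the CKSNAP encoding follows, assuming that $N_{total}$ is the number of k-spaced pairs in the segment (that is, the segment length minus k minus one). The function name `cksnap` is hypothetical.

```python
from itertools import product


def cksnap(seq, max_gap=5):
    """Frequencies of the 16 nucleotide pairs separated by k = 0..max_gap nucleotides."""
    pairs = ["".join(p) for p in product("ACGU", repeat=2)]
    features = []
    for k in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
        for i in range(total):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(counts[p] / total for p in pairs)
    return features


toy_segment = "ACGU" * 10 + "A"  # toy 41-nucleotide segment, not real data
print(len(cksnap(toy_segment)))  # 16 pairs x 6 gap values
```

For each value of k, the 16 frequencies sum to one when the segment contains only A, C, G, and U.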

Accumulated Nucleotide Frequency

ANF, also known as nucleotide density (ND), considers both the frequency of each nucleotide and its distribution along the RNA sequence (Chen Zhen et al., 2020). The density $d_{i}$ of the nucleotide $n_{i}$ at position $i$ of the RNA sequence can be expressed as follows:

d_{i} = \frac{1}{i} \sum_{j = 1}^{i} f (S_{j}), \quad f (q) = \begin{cases} 1, & q = n_{i} \\ 0, & \text{otherwise} \end{cases},

where $S_{j}$ represents the type of nucleotide at sequence position $j$, so that $d_{i}$ is the fraction of the first $i$ positions occupied by the nucleotide found at position $i$. For example, for the RNA sequence ‘AUCUCAUGAG’, the densities of A at positions 1, 6, and 9 are 1.00 (1/1), 0.33 (2/6), and 0.33 (3/9), respectively, and the densities of U at positions 2 and 4 are 0.50 (1/2) and 0.50 (2/4), respectively. In this way, the whole RNA sequence can be expressed as (1.00, 0.50, 0.33, 0.50, 0.40, 0.33, 0.43, 0.13, 0.33, 0.20). ANF produces a 41-dimensional feature vector for a 41-bp RNA sequence.
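The density definition can be checked with a short Python sketch that keeps a running count of each nucleotide while scanning the sequence once. The function name `anf` is an assumption for this example.

```python
def anf(seq):
    """Accumulated nucleotide frequency: density of the nucleotide seen at each position."""
    counts = {}
    densities = []
    for position, nt in enumerate(seq, start=1):
        # Number of occurrences of this nucleotide among the first `position` positions.
        counts[nt] = counts.get(nt, 0) + 1
        densities.append(counts[nt] / position)
    return densities


densities = anf("AUCUCAUGAG")
print(densities)
```

The returned list has one value per position, so a 41-nucleotide segment yields 41 features.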

Nucleotide Chemical Property

Adenine (A), guanine (G), cytosine (C), and uracil (U) are the four types of nucleotides in RNA, each of which has unique chemical properties and physical structure. The four nucleotides can be classified according to three chemical properties: ring structure, functional group, and hydrogen bonding strength (Chen et al., 2016). The details are shown in Table 2.

TABLE 2. Chemical structure of each nucleotide.

Based on the three types of chemical properties, A, C, G, and U can be expressed as (1, 1, 1), (0, 1, 0), (1, 0, 0), and (0, 0, 1), respectively. Each nucleotide thus contributes three features, and the feature dimension generated by NCP for a 41-bp segment is 123 (41 × 3).

Binary Encoding

The method of using a four-dimensional binary vector to encode the nucleotide is called the binary encoding scheme (Foster et al., 2003) by which A, C, G, and U are encoded as (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1), respectively. Thus, we obtained a 164-dimensional feature vector for an RNA segment containing 41 nucleotides.
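The binary encoding amounts to a lookup table followed by concatenation, as in the minimal Python sketch below. The names `BINARY_CODE` and `binary_encode` are assumptions for this example.

```python
# One-hot code for each nucleotide, in the order A, C, G, U.
BINARY_CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
}


def binary_encode(seq):
    """Concatenate the four-dimensional binary vectors of all nucleotides in `seq`."""
    return [bit for nt in seq for bit in BINARY_CODE[nt]]


toy_segment = "ACGU" * 10 + "A"         # toy 41-nucleotide segment, not real data
print(len(binary_encode(toy_segment)))  # 41 nucleotides x 4 bits
```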

Series Correlation Pseudo Dinucleotide Composition

The expression of SCPseDNC (Chen et al., 2014) coding is as follows:

D = {[d_{1}, d_{2}, d_{3}, \dots d_{16}, d_{16 + 1}, \dots, d_{16 + λ}, \dots, d_{16 + λ Λ}]}^{T},

where $d_{k}$ represents

d_{k} = \begin{cases} \frac{f_{k}}{\sum_{i = 1}^{16} f_{i} + w \sum_{j = 1}^{λΛ} θ_{j}}, & 1 \leq k \leq 16 \\ \frac{w θ_{k - 16}}{\sum_{i = 1}^{16} f_{i} + w \sum_{j = 1}^{λΛ} θ_{j}}, & 17 \leq k \leq 16 + λΛ \end{cases} .

Here, $f_{k} (k = 1, 2, \dots, 16)$ is the normalized occurrence frequency of the $k$-th of the 16 types of dinucleotides in a sequence, $λ$ represents the highest counted rank (or tier) of the correlation along the nucleotide sequence, $w$ is the weight factor, which ranges from zero to one, and $Λ$ is the number of physicochemical indices. Six indices were used here: ‘Roll (RNA)’, ‘Rise (RNA)’, ‘Shift (RNA)’, ‘Twist (RNA)’, ‘Slide (RNA)’, and ‘Tilt (RNA)’. $θ_{j} (j = 1, 2, \dots, λΛ)$ is the $j$-tier correlation factor, defined as follows:

\begin{cases} θ_{1} = \frac{1}{L - 3} \sum_{i = 1}^{L - 3} J_{i, i + 1}^{1} \\ θ_{2} = \frac{1}{L - 3} \sum_{i = 1}^{L - 3} J_{i, i + 1}^{2} \\ \dots \\ θ_{Λ} = \frac{1}{L - 3} \sum_{i = 1}^{L - 3} J_{i, i + 1}^{Λ} \\ \dots \\ θ_{λΛ - 1} = \frac{1}{L - λ - 2} \sum_{i = 1}^{L - λ - 2} J_{i, i + λ}^{Λ - 1} \\ θ_{λΛ} = \frac{1}{L - λ - 2} \sum_{i = 1}^{L - λ - 2} J_{i, i + λ}^{Λ} \end{cases} \quad (λ < L - 2),

where the correlation function $J_{i, i + m}^{ς}$ is defined as

\begin{cases} J_{i, i + m}^{ς} = P_{ς} (R_{i} R_{i + 1}) \cdot P_{ς} (R_{i + m} R_{i + m + 1}) \\ ς = 1, 2, \dots, Λ; \; m = 1, 2, \dots, λ; \; i = 1, 2, \dots, L - λ - 2 \end{cases},

where $ς$ indexes the physicochemical indices, $P_{ς} (R_{i} R_{i + 1})$ is the value of the $ς$-th physicochemical index of the dinucleotide $R_{i} R_{i + 1}$ at position $i$, and $P_{ς} (R_{i + m} R_{i + m + 1})$ is the value of the same index for the dinucleotide $R_{i + m} R_{i + m + 1}$ at position $i + m$. In this paper, we set $λ = 20$ and $w = 0.9$; with $Λ = 6$, this generates a 136-dimensional feature vector ($16 + λΛ = 16 + 20 \times 6$).

Word2Vec by FastText

FastText is a natural language processing library released by Facebook (Joulin et al., 2017). By considering the RNA segments as sentences, we used the FastText program to build a word2vec model and then encoded the RNA segments as word vectors. Both skipgram and cbow models can be trained in FastText; in this study, we trained a cbow model to generate word embeddings for the RNA segments. A 100-dimensional feature vector was generated for each segment by using FastText.

Feature Selection

Feature selection is an important step in building effective machine learning models when high-dimensional features are generated. In this study, three different feature selection methods were employed to select the optimal feature subsets. As one of the frameworks for explaining prediction models, the SHAP algorithm was proposed to characterize feature importance and assess feature behavior (Swann et al., 2011). The contribution of each feature can be evaluated by the SHAP value, which is calculated by

Γ_{i} = \sum_{S \subseteq F \setminus \{i\}} \frac{| S |! (| F | - | S | - 1)!}{| F |!} [f_{S \cup \{i\}} (x_{S \cup \{i\}}) - f_{S} (x_{S})],

where $Γ_{i}$ represents the importance score of feature $i$, $F$ denotes the set of all features, and $S$ ranges over all feature subsets of $F$ that do not contain feature $i$. For each subset, the predictions of two models, $f_{S \cup \{i\}}$ trained with feature $i$ and $f_{S}$ trained without it, are compared on the current input through the difference $f_{S \cup \{i\}} (x_{S \cup \{i\}}) - f_{S} (x_{S})$, where $x_{S}$ represents the values of the input features in the set $S$. Because the exact computation requires on the order of $2^{| F |}$ such differences, the SHAP method approximates the Shapley values by Shapley sampling or Shapley quantitative influence.
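To make the weighting in this formula concrete, the Python sketch below evaluates it exactly by enumerating all subsets for a toy problem with three features. It is a brute-force illustration only: the SHAP library uses approximations instead, and the value function here (a sum of fixed, hypothetical per-feature effects) stands in for the model outputs $f_{S}(x_{S})$.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley value of each feature for a set function `value`."""
    n = len(features)
    scores = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|F| - |S| - 1)! / |F|! for subsets S of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        scores[i] = total
    return scores


# Hypothetical additive "model": the output is the sum of per-feature effects.
effects = {"f1": 0.5, "f2": 0.25, "f3": 0.0}
scores = shapley_values(list(effects), lambda s: sum(effects[f] for f in s))
print(scores)
```

For an additive value function such as this one, each Shapley value equals the effect of the corresponding feature, which gives a simple sanity check. The enumeration visits $2^{|F| - 1}$ subsets per feature, which is why it is impractical for the 808 features used in this study.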

The F-score (Polat and Guenes 2009) is another feature selection method that measures the discriminatory ability of two sets of real values. The F-score value of each feature in the data set can be calculated by the following equation:

F_{i} = \frac{{({\bar{x}}_{i}^{(+)} - {\bar{x}}_{i})}^{2} + {({\bar{x}}_{i}^{(-)} - {\bar{x}}_{i})}^{2}}{\frac{1}{n_{+} - 1} \sum_{k = 1}^{n_{+}} {(x_{k, i}^{(+)} - {\bar{x}}_{i}^{(+)})}^{2} + \frac{1}{n_{-} - 1} \sum_{k = 1}^{n_{-}} {(x_{k, i}^{(-)} - {\bar{x}}_{i}^{(-)})}^{2}},

where $F_{i}$ represents the F-score value of the $i$th feature; ${\bar{x}}_{i}$, ${\bar{x}}_{i}^{(+)}$, and ${\bar{x}}_{i}^{(-)}$ are the averages of the $i$th feature over all, positive, and negative samples of the data set, respectively; $n_{+}$ and $n_{-}$ are the numbers of positive and negative samples in the data set, respectively; $x_{k, i}^{(+)}$ is the $i$th feature of the $k$th positive sample; and $x_{k, i}^{(-)}$ is the $i$th feature of the $k$th negative sample. Thus, the numerator measures the discrimination between the positive and negative samples (the squared deviations of the two class means from the overall mean), and the denominator is the sum of the within-class variances of the positive and negative samples. The larger the F-score, the more discriminative the feature is likely to be.
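The F-score of a single feature can be computed directly from this definition. The following Python sketch is a minimal illustration with made-up values; the function name `f_score` is an assumption, and the sketch does not guard against a zero denominator (a feature that is constant within both classes).

```python
def f_score(pos, neg):
    """F-score of one feature given its values in the positive and negative samples."""
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: squared deviations of the class means from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the sample variances within each class.
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    return between / (var_pos + var_neg)


# Made-up feature values: well separated classes give a larger score.
print(f_score([1, 2, 3], [4, 5, 6]), f_score([1, 2, 3], [1, 2, 3]))
```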

The third feature selection method used in this study is maximum relevance minimum redundancy (mRMR), which was developed by Peng et al. (Hanchuan et al., 2005). In this method, mutual information (MI) is used to evaluate the relationships among the features and the labels, and the goal is to identify features that maximize the relevance between features and labels while minimizing the redundancy among the selected features. The following criterion is used to select features incrementally:

\max_{f_{j} \in Ω_{r}} [I (f_{j}, l) - \frac{1}{| Ω_{s} |} \sum_{f_{i} \in Ω_{s}} I (f_{j}, f_{i})],

where $Ω_{s}$ represents the subset of selected features and $Ω_{r}$ represents the subset of remaining features; $f_{j}$ and $f_{i}$ represent features in $Ω_{r}$ and $Ω_{s}$, respectively; $l$ represents the label vector; and $I (x, y)$ is the mutual information between vectors $x$ and $y$, which can be calculated as follows:

I (x, y) = \iint p (x, y) \log \frac{p (x, y)}{p (x) p (y)} dxdy,

where $p (x, y)$ is the joint probabilistic density and $p (x)$ , $p (y)$ are the marginal probabilistic densities.
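The greedy selection rule can be sketched in Python for discrete features, for which the integral reduces to a sum over observed value pairs. This is an illustrative sketch, not the mRMR implementation used in the study; the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a, b) * log(p(a, b) / (p(a) p(b))), with probabilities estimated from counts.
    return sum(c / n * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def mrmr(features, labels, n_select):
    """Greedily pick features with high relevance to `labels` and low mutual redundancy."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < n_select:
        def score(name):
            relevance = mutual_information(features[name], labels)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: f2 duplicates f1, and the label is 1 when f1 or f3 is 1.
toy = {
    "f1": [0, 0, 0, 0, 1, 1, 1, 1],
    "f2": [0, 0, 0, 0, 1, 1, 1, 1],
    "f3": [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_labels = [0, 0, 1, 1, 1, 1, 1, 1]
print(mrmr(toy, toy_labels, 2))
```

In the toy data, `f2` is as relevant as `f1` but fully redundant with it, so the redundancy penalty keeps it out of the selected pair.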

Classifier

XGBoost is a distributed gradient boosting library that is widely used in classification scenarios (Ji et al., 2019; Zhao et al., 2019; Ding et al., 2020; Samat et al., 2020). It has many advantages, such as flexibility, efficiency, and portability. The basic principle of the algorithm is to assign a quantitative weight to each leaf node of a series of decision trees, and XGBoost provides parallel tree boosting. The algorithm handles sparse and high-dimensional data well, and it also inherits the high accuracy of the original boosting algorithm. It has been applied in bioinformatics, for example, in the prediction of m6A (Qiang et al., 2018; Zhao et al., 2019) and m7G sites (Bi et al., 2020). In this paper, we used a Python package to build the XGBoost model and used grid search to optimize three hyperparameters: max_depth, learning_rate, and n_estimators. The candidate values of these three hyperparameters are (2, 4, 6, 8, 10, 12, 14, 16), (0.005, 0.01, 0.02, 0.05, 0.1), and (1600, 1800, 2000, 2200, 2400, 2600, 2800), respectively. Different optimal hyperparameters were obtained for different species, as shown in Table 3.
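The grid search over the three hyperparameters can be sketched as an exhaustive loop over all combinations of candidate values. In the Python sketch below, the function `placeholder_score` is a hypothetical stand-in for the cross-validated AUROC of an XGBoost model trained with the given parameters; it is not a real evaluation, and its optimum is arbitrary.

```python
from itertools import product

# Candidate values for the three hyperparameters, as listed in the text.
param_grid = {
    "max_depth": [2, 4, 6, 8, 10, 12, 14, 16],
    "learning_rate": [0.005, 0.01, 0.02, 0.05, 0.1],
    "n_estimators": [1600, 1800, 2000, 2200, 2400, 2600, 2800],
}


def grid_search(grid, evaluate):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_score(params):
    # Hypothetical stand-in for a cross-validated AUROC; peaks at an arbitrary point.
    return -(abs(params["max_depth"] - 6)
             + abs(params["learning_rate"] - 0.05)
             + abs(params["n_estimators"] - 2000) / 1000)


n_combinations = len(list(product(*param_grid.values())))
best_params, best_score = grid_search(param_grid, placeholder_score)
print(n_combinations, best_params)
```

With 8, 5, and 7 candidate values, the grid contains 280 combinations, each of which would require a full cross-validation run in practice.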

TABLE 3. The optimal hyperparameters of XGBoost for three species.

Evaluation Criteria

Cross-validation is often used to evaluate the performance and generalization ability of machine learning models. In this paper, five-fold cross-validation was used to evaluate the models, and random sampling was used to divide the training data set into five subsets of nearly equal size (Fushiki 2011). In each round, one of the five subsets was used as the validation data set, and the other four were used for training the model. Thus, a total of five m5C site prediction models were obtained. Finally, the prediction results of these five models were evaluated, and the five evaluation values were averaged to give the final evaluation indices. The same five-fold cross-validation was also adopted for hyperparameter selection and algorithm comparison.
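The five-fold partitioning can be sketched as follows. This is a minimal Python illustration that shuffles sample indices and deals them into five folds; the function name `five_fold_indices` and the fixed seed are assumptions, and the sketch does not stratify by class.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (training, validation) index lists for each of the folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds whose sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation


# Example with 400 samples, the size of the H. sapiens training set (200 + 200).
for training, validation in five_fold_indices(400):
    print(len(training), len(validation))
```

A metric computed on each validation fold would then be averaged over the five rounds.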

Different evaluation metrics are used in bioinformatics classification. In this study, we selected the accuracy (Acc), sensitivity (Sen), specificity (Spe), precision (Pre), Matthews correlation coefficient (Mcc), and F1-score as the main evaluation metrics (Zhang et al., 2019; Lv et al., 2020). Counts of true positive, true negative, false positive, and false negative predictions were recorded as TP, TN, FP, and FN, respectively. Thus, the six metrics can be represented as follows:

\begin{aligned} Sen &= \frac{TP}{TP + FN} \\ Spe &= \frac{TN}{TN + FP} \\ Pre &= \frac{TP}{TP + FP} \\ Acc &= \frac{TP + TN}{TP + FP + TN + FN} \\ Mcc &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}} \\ F1 &= \frac{2 \times TP}{2 \times TP + FP + FN} \end{aligned}
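The six metrics follow directly from the four confusion-matrix counts, as the Python sketch below shows. The function name `classification_metrics` and the example counts are assumptions for illustration; the sketch assumes that no denominator is zero.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute Sen, Spe, Pre, Acc, Mcc, and F1 from confusion-matrix counts."""
    return {
        "Sen": tp / (tp + fn),
        "Spe": tn / (tn + fp),
        "Pre": tp / (tp + fp),
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "Mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up counts for 50 positive and 50 negative samples.
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```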

In addition to the above evaluation indicators, the precision-recall curve (PRC) (Keilwagen et al., 2014; Saito and Rehmsmeier 2017) and the receiver operating characteristic (ROC) curve (Fawcett 2006; Li et al., 2019) were also used to evaluate the models. These two curves assess the prediction performance of a method over the whole range of decision thresholds, and the areas under the curves (AUPRC and AUROC) were calculated to quantify the performance of the models.
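The AUROC has a threshold-free interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The Python sketch below computes it from this pairwise definition rather than by integrating the ROC curve; the function name `auroc` and the toy scores are assumptions.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts as 1, a tie as 0.5, and a wrong order as 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores (not real data).
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form costs time proportional to the product of the class sizes, which is acceptable for a sketch but slower than the sort-based computation used by standard libraries.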

Results

Models Based on Features Selected by SHAP

Seven kinds of features were generated from the RNA segments of the three species, giving 808 dimensions in total. Considering the redundancy among the features, SHAP was used to select the optimal feature subsets: the importance scores of the 808 features were calculated based on the XGBoost algorithm, and the features were ranked accordingly. Figure 2 shows the cross-validation AUROC values of models based on the top n features. The highest AUROCs were obtained when the top 48, 228, and 208 features were used for H. sapiens, M. musculus, and A. thaliana, respectively. The corresponding AUROC values are 0.935, 0.834, and 0.787 for the three species, respectively.

FIGURE 2. The cross-validation AUROC values of models based on the top n features selected by SHAP, mRMR, and F-score.

In addition, Table 4 shows all the evaluation metrics for the models based on features selected by SHAP and the models based on the original 808 features. It indicates that the models based on features selected by SHAP achieved higher values than the model based on the original 808 features for most of the metrics, which demonstrates the advantages of using SHAP for feature selection.

TABLE 4. The five-fold cross-validation results for models based on features selected by SHAP or the original 808 features.

Comparison With Other Feature Selection Methods

In addition, two other feature selection methods, F-score (Polat and Guenes 2009) and mRMR (Li et al., 2017; Bugata and Drotar 2020), were used to select the optimal feature subsets. The cross-validation AUROCs of the models based on the top n features selected by these two methods are also plotted in Figure 2. As shown in Figure 2, the models based on features selected by SHAP are generally superior to the models based on features selected by the other two methods. Thus, we used the feature subsets selected by SHAP as the optimal feature subsets.

Models Based on Different Classifiers

To verify the effectiveness of the XGBoost algorithm in m5C site prediction, two other learning algorithms, random forests (Biau 2012; Ziegler and Konig 2014; Li et al., 2018) and support vector machine (Boopathi et al., 2019; Chen et al., 2019; Liu et al., 2020), were also used to build models based on the optimal feature subsets selected by SHAP. The hyperparameters of RF and SVM were also optimized by grid search.

Table 5 shows the five-fold cross-validation performances of the models based on the three different learning algorithms. For A. thaliana, the AUROC of the model based on XGBoost is 0.787, which is higher than those of the models based on RF (0.780) and SVM (0.768). For M. musculus, the AUROC of the model based on XGBoost is 0.834, which is also higher than those of the models based on RF (0.795) and SVM (0.824). For H. sapiens, the AUROC of the model based on XGBoost is 0.935, which is also higher than those of the models based on RF (0.911) and SVM (0.903). The ROC and PRC curves for the three species are shown in Figure 3. For H. sapiens, the AUPRC of the model based on XGBoost is 0.942, which is higher than those of the models based on RF (0.910) and SVM (0.897). Similarly, for A. thaliana, the AUPRC of the model based on XGBoost is 0.794, which is higher than those of the models based on RF (0.784) and SVM (0.771). In addition, for M. musculus, the AUPRC of the model based on XGBoost is 0.827, which is higher than those of the models based on SVM (0.812) and RF (0.791). Thus, the models built by using XGBoost were selected as our final models.

TABLE 5. The five-fold cross-validation performance of models built based on different classifiers with the features selected by SHAP.

FIGURE 3. The ROC curves and PRC curves of five-fold cross-validation results based on three learning algorithms for the three species.

Comparison With Other Existing Methods

To further evaluate the generalization of our models, the predictive results of our models on the independent test sets were compared with those of other existing methods: iRNA-m5C (Lv et al., 2020), m5CPred-SVM (Chen Xiao et al., 2020), RNAm5Cfinder (Li et al., 2018), iRNAm5C-PseDNC (Qiu et al., 2017), RNAm5CPred (Fang et al., 2019), PEA-m5C (Song et al., 2018), and Staem5 (Chai et al., 2021b). However, not all of these methods can predict m5C sites in all three species. For example, RNAm5Cfinder (Li et al., 2018) can predict m5C sites for H. sapiens and M. musculus but not for A. thaliana. iRNAm5C-PseDNC (Qiu et al., 2017) and RNAm5CPred (Fang et al., 2019) can only predict the m5C sites of H. sapiens, and PEA-m5C (Song et al., 2018) can only be used for A. thaliana. With the default decision threshold, Table 6 shows that, for H. sapiens, our model achieved the highest performance among the state-of-the-art methods on all seven evaluation metrics except specificity. For M. musculus, our model obtained the best AUROC, MCC, accuracy, and FOR (false omission rate). For A. thaliana, our model achieved the best values for all seven evaluation metrics. These results demonstrate the superiority of our m5Cpred_XS model for predicting m5C sites in the three species. With the other decision thresholds shown in Table 6, the precisions, specificities, accuracies, and MCCs of our models can be improved; however, other evaluation metrics, such as the sensitivities and F1 scores, decrease.

TABLE 6. Comparison with other existing models on the independent test sets.

It is noted that the predictive accuracies of iRNA-m5C and PEA-m5C on the independent test sets are even less than 0.50. A possible reason is that the training data sets used to build these models are small. For example, the iRNA-m5C model for H. sapiens is based on a data set that contains only 120 positive samples, and PEA-m5C is based on a data set that contains 1196 positive samples. Both data sets are smaller than the corresponding data sets used in this study. The small size of a training data set limits the generalization of the model on the independent test set. In addition, the iRNA-m5C model was not evaluated on an independent test set in its original paper, and the redundancy of the data set used for PEA-m5C was not removed.

Implementation of the m5Cpred-XS Web Server

To facilitate the use of our model, we built a web server that is freely available at http://m5cpred-xs.zhulab.org.cn/. The server was implemented using Flask, Docker, and Nginx. Users can carry out a prediction with the following procedure. First, users type the query RNA sequences into the input box or upload a FASTA format file. (Note that the input sequences should be in FASTA format, and the length of each query sequence should be longer than 41 bp.) Next, one of the three species, H. sapiens, M. musculus, or A. thaliana, should be chosen. Users can also provide their email address to receive the query results. Then, after the “submit” button is clicked, the server generates a unique task ID and runs the calculation until the final result is reached. During this process, users can query the task status by the task ID. When the task is done, the results are sent to the users as an email attachment.

Discussions

Analysis of Features Selected by SHAP

To further analyze the features selected by SHAP, the top 20 most important features for the three species are shown in Figure 4, in which the horizontal axis shows the distribution of the SHAP values and the vertical axis shows the features. If the SHAP value of a feature is positive, the feature pushes the prediction toward the positive class (m5C site); otherwise, it pushes the prediction toward the negative class.

FIGURE 4. Top 20 features sorted by SHAP for the three species.

Figure 5 shows the distribution of the top 20 features among the seven types of features for the three species. Overall, the top 20 most important features are not evenly distributed among the seven types of features for the three species. ENAC and SCPseDNC are the two types of features that appear in the top 20 features of all three species. ENAC represents the detailed distribution of nucleotides in each sliding window. SCPseDNC represents the detailed distribution of dinucleotides and the distribution of their physical–chemical properties. Our results indicate that the distribution of nucleotides and their properties are related to the modification. Specifically, when identifying m5C sites of H. sapiens, features belonging to ENAC account for the largest proportion of the top 20 most important features, with a total of seven features. The three types of features binary, ANF, and word2vec are not included in the top 20 most important features, which indicates that these features contribute little to the prediction of m5C sites of H. sapiens. For M. musculus, five features from NCP and SCPseDNC appeared in the top 20 features, and ANF and CKSNAP did not appear. For A. thaliana, five features of SCPseDNC and FastText appeared in the top 20 features, and NCP was not included. These results indicate that the relevant features depend on the data set and that feature selection is helpful for building high-performance models.

FIGURE 5. Distribution of top 20 features in the seven types of features for the three species.
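The per-type counts summarized in Figure 5 amount to a simple tally of the top-ranked features by feature type. The sketch below assumes a hypothetical naming scheme of the form `TYPE_index` (for example `ENAC_12`); the actual feature names used in the study are not given here.

```python
from collections import Counter
from typing import Dict, Iterable

# The seven feature types named in the text; the spelling of each label
# as a name prefix is an assumption made for this example.
FEATURE_TYPES = ("ENAC", "CKSNAP", "ANF", "NCP", "binary", "SCPseDNC", "word2vec")


def count_by_type(top_features: Iterable[str]) -> Dict[str, int]:
    """Count how many top-ranked features belong to each feature type."""
    counts = Counter(name.split("_", 1)[0] for name in top_features)
    # Report every type, including those absent from the top features.
    return {ftype: counts.get(ftype, 0) for ftype in FEATURE_TYPES}


if __name__ == "__main__":
    # Hypothetical feature names, for illustration only.
    top = ["ENAC_3", "ENAC_17", "SCPseDNC_5", "NCP_40"]
    print(count_by_type(top))
```

Types that report a count of zero correspond to the feature types described above as not appearing in the top 20 for a given species.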

Moreover, principal component analysis (PCA) was used to visualize the effectiveness of the selected features. Figure 6 shows that, for all three species, the boundaries between positive and negative samples are slightly clearer with the features selected by SHAP than with the original 808-dimensional features.

FIGURE 6. PCA plots for the original 808 dimensional features and features selected by SHAP for the three species. Upper panel: the original 808 dimensional features; Lower panel: the features selected by SHAP.

Cross-Species Validation

To further evaluate the generalization of our models, we conducted a cross-species validation to analyze their species specificity and transferability: the model of each species was tested on the independent test sets of all three species. Figure 7 shows that the models of all three species perform well (AUROC > 0.7) on the independent test set of H. sapiens. However, the model of H. sapiens does not perform well on the independent test sets of the other two species. Figure 7 also shows that the model of M. musculus performs even better on the independent test set of H. sapiens than on that of M. musculus. In addition, the model of A. thaliana performs worse on the independent test set of M. musculus. We suspect that the small size of the H. sapiens benchmark data set is one possible reason for these results. Another is that both M. musculus and H. sapiens are mammals.

FIGURE 7. The heat map for the cross species predictive AUROCs. The models (y-axis) were tested on the three independent test sets (x-axis).
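The cross-species evaluation behind Figure 7 can be sketched as a nested loop that scores every model on every test set. The code below is an illustration under stated assumptions, not the authors' pipeline: each model is represented as a plain callable that maps a feature vector to a score, AUROC is computed with the pairwise (Mann-Whitney) formulation, and all names and toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]
TestSet = Tuple[List[Sequence[float]], List[int]]


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cross_species_auroc(
    models: Dict[str, Model], test_sets: Dict[str, TestSet]
) -> Dict[str, Dict[str, float]]:
    """Return result[model_species][test_species] = AUROC."""
    result: Dict[str, Dict[str, float]] = {}
    for model_name, model in models.items():
        result[model_name] = {}
        for test_name, (features, labels) in test_sets.items():
            scores = [model(x) for x in features]
            result[model_name][test_name] = auroc(labels, scores)
    return result


if __name__ == "__main__":
    # Toy stand-ins for trained models and independent test sets.
    models = {"A": lambda x: x[0], "B": lambda x: -x[0]}
    test_sets = {"s1": ([[0.1], [0.9], [0.4], [0.8]], [0, 1, 0, 1])}
    print(cross_species_auroc(models, test_sets))
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be preferable for large test sets.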

Conclusion

In this study, we proposed a new computational model, m5Cpred-XS, for predicting m5C sites. Three different feature-selection methods were used to select the optimal subset from the 808-dimensional data of seven kinds of features. The features selected by SHAP proved more relevant than those selected by the other two methods. The selected feature subsets were used to build our models. Our results show that the models based on XGBoost are superior to the models trained with RF and SVM. m5Cpred-XS was further compared with other existing methods on the independent test sets, and the comparison shows that our model outperforms the other methods in terms of AUROC.

Data Availability Statement

Publicly available data sets were analyzed in this study. The data are available at: https://github.com/yinboliu-git/m5Cpred-XS.

Author Contributions

XZ and YZ conceived the study; XZ and YL designed the experiments; YL and YS performed the experiments. YL, YS and HW analyzed the data. YL, XZ and YZ wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (grant number: 21403002).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Agris, P. F. (2008). Bringing Order to Translation: the Contributions of Transfer RNA Anticodon‐domain Modifications. EMBO Rep. 9, 629–635. doi:10.1038/embor.2008.104

PubMed Abstract | CrossRef Full Text | Google Scholar

Ahmad, A., and Shatabda, S. (2019). EPAI-NC: Enhanced Prediction of Adenosine to Inosine RNA Editing Sites Using Nucleotide Compositions. Anal. Biochem. 569 (569), 16–21. doi:10.1016/j.ab.2019.01.002

PubMed Abstract | CrossRef Full Text | Google Scholar

Bi, Y., Xiang, D., Ge, Z., Li, F., Jia, C., and Song, J. (2020). An Interpretable Prediction Model for Identifying N7-Methylguanosine Sites Based on XGBoost and SHAP. Mol. Ther. - Nucleic Acids 22 (22), 362–372. doi:10.1016/j.omtn.2020.08.022

PubMed Abstract | CrossRef Full Text | Google Scholar

Biau, G. (2012). Analysis of a Random Forests Model. J. Mach Learn. Res. Apr 13, 1063–1095.

Google Scholar

Boopathi, V., Subramaniyam, S., Malik, A., Lee, G., Manavalan, B., and Yang, D. C. (2019). mACPpred: A Support Vector Machine-Based Meta-Predictor for Identification of Anticancer Peptides. Int. J. Mol. Sci. 20, 20. doi:10.3390/ijms20081964

PubMed Abstract | CrossRef Full Text | Google Scholar

Bugata, P., and Drotar, P.(2020). On Some Aspects of Minimum Redundancy Maximum Relevance Feature Selection. Sci. China Inform. Sci. Jan;63. doi:10.1007/s11432-019-2633-y
