ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 26 February 2020

Sec. Computational Genomics

Volume 8 - 2020 | https://doi.org/10.3389/fbioe.2020.00134

RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites

  • 1. Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China

  • 2. Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China

  • 3. Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China

Abstract

One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.

Introduction

More than 150 types of chemical modification have been identified in cellular RNA, including adenosine methylation, cytosine modification, isomerization of uridine, and ribose modification (Boccaletto et al., 2018). These modifications have critical roles in cellular biological and physiological processes (Song and Yi, 2017). For instance, one of the most prevalent RNA modifications in eukaryotes, N6-methyladenosine (m6A), affects RNA stability (Wang et al., 2014), RNA-protein interaction (Liu et al., 2015b), RNA splicing and translation (Meyer and Jaffrey, 2014), the circadian clock (Fustin et al., 2013), immune response (Winkler et al., 2019), etc. Another widespread RNA modification is 5-methylcytosine (m5C), which has functions including preservation of the secondary structure of tRNA (Motorin and Helm, 2010), control of amino-acylation (Helm, 2006), codon identification and metabolic stability (Agris, 2008; Li et al., 2017). The pseudouridine modification is another common post-transcriptional modification in various living organisms (Zaringhalam and Papavasiliou, 2016). In 1951, pseudouridine was first identified, and experiments in 1960 revealed that it was abundant in tRNA and rRNA (Cohn, 1960). Pseudouridine results from an isomerization of uridine by breaking the glycosidic bond with 180° base rotation (Karijolich et al., 2015). This modification has been shown to have vital roles, for instance, in stabilizing RNA and in the stress response (Zhao and He, 2015; Cheng et al., 2019a; Wang et al., 2019b).

Although RNA pseudouridylation was discovered decades ago, the first transcriptome-wide RNA pseudouridylation map was not published until 2014, following the rapid development of next-generation sequencing technology (Goodwin et al., 2016). Carlile et al. (2014) developed the PseudoU-seq technology, which they used to identify more than 200 pseudouridylation sites in the regulated mRNA of yeast and human cells; in the same year, Schwartz et al. (2014) performed transcriptome-wide mapping using a similar protocol, finding more than 300 dynamic-regulated pseudouridine sites in non-coding RNA and mRNA. Li et al. (2015a) presented a chemical labeling method (CeU-Seq) that they used to pull down more than 2000 pseudouridine sites in human mRNA. Other RNA pseudouridylation sequencing protocols were also developed (Carlile et al., 2015).

As an alternative to costly and labor-intensive laboratory experiments, robust, swift, and inexpensive computational methods for RNA chemical modification prediction have emerged recently, owing to the increasing amount of data generated in this post-genomics era (Libbrecht and Noble, 2015). A large number of m6A (Chen et al., 2015, 2018a,b, 2019a; Zhou et al., 2016; Zhao et al., 2019; Zou et al., 2019) and m5C (Feng et al., 2016; Qiu et al., 2017; Li et al., 2018; Sabooh et al., 2018; Zhang et al., 2018; Yin et al., 2019) site predictors based on traditional machine learning and emerging deep learning algorithms have been proposed. However, few computational tools have been developed to predict pseudouridine sites. Li et al. (2015b) used a support vector machine (SVM) classifier to design a web server called PPUS for the identification of pseudouridine sites in Saccharomyces cerevisiae and Homo sapiens. Chen et al. (2016) constructed another SVM-based web server for pseudouridine site prediction, using the frequency composition of the nucleotides and pseudo K-tuple nucleotide composition (PseKNC) for feature representation. He et al. (2018) presented another model, PseUI, to identify pseudouridine sites in RNA sequences from three species (H. sapiens, S. cerevisiae, and M. musculus); this was an SVM-based model incorporating multiple feature-extraction technologies. Tahir et al. (2019) used convolutional neural networks to design a new predictor, iPseU-CNN; and Liu et al. (2019b) developed the eXtreme gradient boosting (XGboost) method for RNA pseudouridine site prediction (XG-PseU). Cross-validation scores for RNA pseudouridine site identification in the abovementioned three species showed the best accuracy for iPseU-CNN (66.9%) in H. sapiens, whereas XG-PseU and iPseU-CNN had the best accuracy (68.2%) in S. cerevisiae, and XG-PseU was the most accurate (72.0%) in M. musculus. 
According to independent testing scores, iPseU-CNN outperformed the other models, with 69.0% accuracy in H. sapiens and 73.6% accuracy in S. cerevisiae. Although the iPseU-CNN predictor had a high average cross-validation accuracy (68.9%) and independent testing accuracy (71.3%) scores, there was still room for improvement in comparison with some high-performing m6A site predictors (Chen et al., 2019a; Zou et al., 2019).

In this work, a model is developed based on the random forest algorithm, RF-PseU, for pseudouridine site recognition. The modeling overview is shown in Figure 1. RF-PseU incorporates multiple sequence feature representation technologies, and the light gradient boosting machine (LGBM) algorithm is employed to remove redundant features and rank the remaining features. Evaluation with leave-one-out (LOO) cross-validation demonstrated the robustness of the model. The average cross-validation accuracy (71.3% for 10-Fold and 71.4% for LOO) of RF-PseU was improved by 3.48–10.3% compared with existing state-of-the-art predictors, and the average independent testing accuracy (74.7%) showed a 4.8–19% increase. A user-friendly web server was also implemented, which can be accessed at http://148.70.81.170:10228/rfpseu. RF-PseU is expected to be a useful supplement to the existing tools for pseudouridine site identification.

FIGURE 1

Materials and Methods

Data Sets

Given that there were small differences between the benchmark data sets used in the studies of Chen et al. (2018a) and Liu et al. (2019b), data sets obtained from Chen et al. (2018a) were used to train and test our models. The training data sets covered three species: the H. sapiens training data set contained 495 pseudouridine-site-containing sequences and 495 non-pseudouridine-site-containing sequences; the S. cerevisiae training data set contained 314 positive and 314 negative sequences; and the M. musculus training data set consisted of 944 sequences, half of which were positive samples. The independent testing data sets covered only two species, H. sapiens and S. cerevisiae, each with 100 positive and 100 negative samples. For the H. sapiens and M. musculus data sets, the window size was 21, i.e., the positive samples were 21-nucleotide sequences centered on a pseudouridine site, whereas for S. cerevisiae the window size was 31, with pseudouridine-site-centered sequences of 31 nucleotides. Negative samples, in which no pseudouridine sites were detected, likewise consisted of 21 nucleotides for H. sapiens and M. musculus and 31 nucleotides for S. cerevisiae. The benchmark data sets can be downloaded from http://lin-group.cn/server/iRNAPseu/data.

Feature Representation

Several widely used and convenient bio-sequence feature representation tools have been developed (Mrozek et al., 2013; Liu et al., 2015a, 2019c; Yu et al., 2015, 2019; Cheng and Hu, 2018; Hu et al., 2019; Muhammod et al., 2019). The two main tools used in this work were iLearn (Hu et al., 2019) and PyFeat (Muhammod et al., 2019).

Nucleotide Binary Profiles

Binary profiles encode the four bases (ACGU) as (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1), whereas dibinary profiles encode the 16 dinucleotides (AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG, and UU) as (0,0,0,0), (0,0,0,1), (0,0,1,0), (0,0,1,1),…, (1,1,1,1).
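
The two encodings above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the use of overlapping dinucleotides for the dibinary profile is an assumption, since the text does not state how dinucleotides are taken from the sequence.

```python
from itertools import product

BASES = "ACGU"

# One-hot code for each base: A -> (1,0,0,0), C -> (0,1,0,0), G -> (0,0,1,0), U -> (0,0,0,1)
BINARY = {b: tuple(int(i == j) for j in range(4)) for i, b in enumerate(BASES)}

# 4-bit code for each dinucleotide in the order AA, AC, ..., UU:
# AA -> (0,0,0,0), AC -> (0,0,0,1), ..., UU -> (1,1,1,1)
DIBINARY = {
    a + b: tuple(int(bit) for bit in format(idx, "04b"))
    for idx, (a, b) in enumerate(product(BASES, repeat=2))
}


def binary_profile(seq):
    """Concatenate the one-hot code of every base (length 4 * len(seq))."""
    return [bit for base in seq for bit in BINARY[base]]


def dibinary_profile(seq):
    """Concatenate the 4-bit code of every overlapping dinucleotide (assumption)."""
    return [bit for i in range(len(seq) - 1) for bit in DIBINARY[seq[i:i + 2]]]
```

Under these assumptions, a 21-nucleotide window yields 84 binary features and 80 dibinary features.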

Accumulated Nucleotide Frequency

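Suppose s_i is the base (A, C, G, or U) at the ith position of an RNA sequence. The density d_i of s_i in the ith prefix subsequence of the sequence is then:

$$d_i = \frac{1}{i}\sum_{j=1}^{i} f(s_j), \qquad f(s_j) = \begin{cases} 1, & \text{if } s_j = q \\ 0, & \text{otherwise} \end{cases}, \qquad i = 1, 2, \ldots, L$$

where L is the sequence length and q is one of the four nucleotides (ACGU), namely the base s_i found at position i.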
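
As an illustration, the density described above (the fraction of positions in the ith prefix that carry the same base as position i) might be computed as follows. This is a sketch with a hypothetical function name; it assumes that the prefix includes position i itself.

```python
def accumulated_nucleotide_frequency(seq):
    """Return d_1..d_L, where d_i is the share of the first i bases equal to base i."""
    counts = {}      # running count of each base seen so far
    densities = []
    for i, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1
        densities.append(counts[base] / i)
    return densities
```

For the sequence AACA, for example, the third position (C) has density 1/3 because C occurs once among the first three bases.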

Nucleotide Chemical Properties

The four RNA nucleotides (ACGU) differ from each other in chemical structure and chemical bonding. On the basis of these differences, ACGU can be grouped according to three chemical properties (ring structure, hydrogen bond strength, and functional group; Table 1) and encoded as a three-dimensional coordinate, i.e., A is denoted by (1,1,1), C by (0,1,0), G by (1,0,0), and U by (0,0,1).

TABLE 1

| Nucleotides | Class | Chemical property |
| --- | --- | --- |
| C, U | Pyrimidine | Ring structure |
| A, G | Purine | Ring structure |
| A, U | Weak | Hydrogen bond |
| C, G | Strong | Hydrogen bond |
| G, U | Keto | Functional group |
| A, C | Amino | Functional group |

ACGU categories based on chemical properties.
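
A minimal sketch of this encoding in Python is given below. The coordinates are those stated in the text; reading them as (purine = 1, amino = 1, weak hydrogen bond = 1) is an interpretation that happens to be consistent with Table 1, and the function name is hypothetical.

```python
# Coordinates as given in the text; one reading consistent with Table 1 is
# (ring structure: purine = 1, functional group: amino = 1, hydrogen bond: weak = 1).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def ncp_encode(seq):
    """Concatenate the three chemical-property bits of every base."""
    return [value for base in seq for value in NCP[base]]
```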

Electron-Ion Interaction Pseudopotentials (EIIP)

Nair and Sreenadhan (2006) used the EIIP values of A, G, C, and T (A: 0.1260, G: 0.0806, C: 0.1340, T: 0.1335) to directly represent the nucleotides in a DNA sequence. Here, iLearn was used to encode each nucleotide in the RNA sequences into EIIP feature vectors.
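
The per-nucleotide mapping can be sketched as follows. This is not the iLearn implementation; in particular, assigning the thymine value (0.1335) to uracil is an assumption made here so that RNA sequences can be encoded, and the function name is hypothetical.

```python
# EIIP values quoted in the text for A, G, C, and T.
# Assumption: U in RNA is given the value listed for T.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def eiip_encode(seq):
    """Map every nucleotide to its EIIP value (one feature per position)."""
    return [EIIP[base] for base in seq.upper().replace("T", "U")]
```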

Enhanced Nucleic Acid Composition

The nucleotide composition was calculated for a fixed-length window of the RNA sequence, allowing the fixed window (length = 5) to continuously slide from the 5′ to the 3′ terminus. RNA sequences were then encoded into feature vectors of equal length.
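
The sliding-window composition can be sketched as below. The window length of 5 follows the text; dividing each count by the window length and advancing the window one nucleotide at a time are assumptions, and the function name is hypothetical.

```python
def enac(seq, window=5, alphabet="ACGU"):
    """Nucleotide composition of every length-`window` slice, 5' to 3'."""
    features = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        # Frequency of each base inside the current window (assumed normalization).
        features.extend(chunk.count(base) / window for base in alphabet)
    return features
```

Under these assumptions, a 21-nucleotide sequence gives 17 windows and therefore 68 features, so all sequences of the same length are encoded into vectors of equal length.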

Xmer k-Spaced Ymer Composition Frequency

This method counts the composition of subsequences formed by X and Y consecutive nucleotides separated by a gap of k positions, e.g., AGU@AU, A@CU, and GU@@@A, where @ indicates a one-position gap, @@ a two-position gap, and so on. Generally, encoding an RNA sequence with Xmer k-spaced Ymer generates a feature vector of 4^X × 4^Y dimensions. In this study, X, Y, and k were each set to 1, 2, or 3, and eight XYK combinations (all except 3mer-kspaced-3mer) were used for encoding. The PyFeat tool developed by Rafsanjani et al. (Muhammod et al., 2019) was used to convert RNA sequences into vectors.
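
To make the pattern concrete, the following sketch counts Xmer k-spaced Ymer patterns for one choice of X, Y, and k. It is not the PyFeat code; the normalization by the number of pattern positions and the function name are assumptions.

```python
from itertools import product


def kspaced_composition(seq, x=1, y=1, k=1, alphabet="ACGU"):
    """Frequency of every (Xmer, gap of k, Ymer) pattern found in seq."""
    keys = [
        "".join(left) + "@" * k + "".join(right)
        for left in product(alphabet, repeat=x)
        for right in product(alphabet, repeat=y)
    ]
    counts = dict.fromkeys(keys, 0)
    span = x + k + y                      # total length covered by one pattern
    total = len(seq) - span + 1           # number of start positions
    for i in range(total):
        left = seq[i:i + x]
        right = seq[i + x + k:i + span]   # the k gap positions are skipped
        counts[left + "@" * k + right] += 1
    # Assumed normalization: divide by the number of start positions.
    return {key: (counts[key] / total if total > 0 else 0.0) for key in keys}
```

With X = 1, Y = 2, and k = 1, the sequence AGUAU contains the patterns A@UA and G@AU, and the resulting vector has 4^1 × 4^2 = 64 entries.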

Feature Selection

Feature selection is an effective way to remove redundant information and prevent over-fitting in machine learning modeling (Tang et al., 2017; Xu et al., 2018a; Cheng et al., 2019a; Liu, 2019; Sun et al., 2019; Yu et al., 2019). Several feature selection technologies, including ANOVA (Lv et al., 2019b) and MRMD (Zou et al., 2016), have been developed and are widely used for DNA, RNA, and protein identification (Xu et al., 2018b). In this work, an LGBM (Ke et al., 2017) wrapper was used to select appropriate feature spaces for model training. In this process, raw training data were fed into the LGBM model, and the features were ranked by the importance values calculated by the LGBM algorithm. Features with importance values greater than the average were selected to compose the feature space for modeling.
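
The thresholding step can be sketched independently of the model that supplies the importance values. The helper below is hypothetical; in practice the input might be, for example, the `feature_importances_` attribute that LightGBM's scikit-learn style estimators expose after fitting, although the LGBM settings used in this work are not stated here.

```python
def select_above_mean(importances):
    """Indices of features whose importance exceeds the mean, most important first."""
    mean = sum(importances) / len(importances)
    kept = [i for i, value in enumerate(importances) if value > mean]
    return sorted(kept, key=lambda i: importances[i], reverse=True)
```

For importance values [0, 5, 1, 10], the mean is 4, so features 3 and 1 are kept, in that order.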

Model Evaluation Metrics and Methods

The proposed models were evaluated by five commonly used metrics: accuracy (ACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (auROC). The first four metrics were calculated using the following equations:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively (Cheng et al., 2016, 2019b,c; Wei et al., 2017d, e; Liu et al., 2019a). For the ROC curve, 1-specificity was plotted on the horizontal axis, and sensitivity on the vertical axis.

LOO, K-Fold cross-validation, and independent testing are the most widely used methods for predictor evaluation (Mrozek et al., 2015; Cao and Cheng, 2016; Chen et al., 2017, 2018a, 2019b; Pan et al., 2017; He et al., 2018, 2019; Jiang et al., 2018; Xiong et al., 2018; Yu et al., 2018; Zhang et al., 2018; Ding et al., 2019; Feng et al., 2019; Kong and Zhang, 2019; Li and Liu, 2019; Lv et al., 2019a; Manavalan et al., 2019; Shan et al., 2019; Wang et al., 2019a; Wei et al., 2019a, b; Xu et al., 2019; Yu and Dai, 2019). That is, the general machine learning evaluation scheme (training, validation, and testing) was used to evaluate the optimized models. In LOO cross-validation of a data set containing n items, n-1 items are used for training and the remaining one for validation; this procedure is repeated until every sequence in the training data set has been used once as the validation sample. LOO cross-validation is robust but time-consuming for a large data set. To compare the performance of the model with that of existing predictors, 10-Fold cross-validation was also used: the training data set was randomly divided into 10 subsets, with one subset used for validation and the remaining nine for training; this process was repeated 10 times, so that each subset served once for validation, and the average results were used to evaluate the model. Finally, independent testing was performed on a data set completely distinct from the training data set to evaluate the trained model.
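
The two cross-validation schemes can be sketched with scikit-learn as follows. The toy data, the number of trees, the random seeds, and the function name are all assumptions made for illustration and do not reproduce the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score


def cv_accuracy(X, y, n_trees=100, seed=0):
    """Mean accuracy under leave-one-out and shuffled 10-fold cross-validation."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    # LOO: n fits, each validated on the single held-out sample.
    loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
    # 10-fold: random split into 10 subsets, each used once for validation.
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    fold_scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return loo_scores.mean(), fold_scores.mean()


# Toy stand-in for an encoded data set: 40 samples, 8 features, balanced labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = np.array([0, 1] * 20)
X[y == 1] += 1.5                       # shift the positive class so it is learnable
loo_acc, ten_fold_acc = cv_accuracy(X, y, n_trees=20)
```

The LOO loop fits one model per sample, which is why it becomes time-consuming on larger data sets, whereas 10-fold cross-validation always fits ten models.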

Algorithm

The random forest method is a bagging-type ensemble learning algorithm (Cheng et al., 2018a, b). By combining multiple weak classifiers, the final results can be voted or averaged to obtain an overall model with higher accuracy, better general performance, and resistance to overfitting. This algorithm has been extensively used in bioinformatics and other areas, and has been confirmed to be an effective modeling technique in various domains (Ding et al., 2016a,b; Mrozek et al., 2016; Qiu et al., 2016; Wang et al., 2017; Wei et al., 2017a,b,c; Yu et al., 2017a; Zheng et al., 2017; Tang et al., 2018, 2019a; Xue et al., 2018; Degenhardt et al., 2019; Xu et al., 2019). In this study, the scikit-learn toolkit, available at https://scikit-learn.org, was used to establish the models.

Support vector machine (Cortes and Vapnik, 1995) is a generalized linear classifier trained by supervised learning; its decision boundary is the maximum-margin hyperplane fitted to the training samples. SVM has been widely used in a variety of fields (Xiong et al., 2012; Ding et al., 2017; Yu et al., 2017b; Fu et al., 2018; Fang et al., 2019; Lai et al., 2019; Meng et al., 2019; Shen et al., 2019; Tang et al., 2019b; Zhang et al., 2019; Zhu et al., 2019). Here, it was used for modeling comparisons. SVM was also implemented via the scikit-learn toolkit, using the Gaussian radial basis function kernel, with the critical hyper-parameters (C and γ) optimized over the range 10^-6 to 10^6 with a multiplicative step of 10^0.5 (i.e., exponent steps of 0.5).
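
The hyper-parameter grid described above can be written out explicitly. The sketch below only builds the candidate values; the helper name is hypothetical, and how the grid was searched in this study (for example, which cross-validation scheme scored each candidate) is not specified in the text.

```python
def log10_grid(low_exp=-6.0, high_exp=6.0, step=0.5):
    """Values 10**low_exp, 10**(low_exp + step), ..., 10**high_exp."""
    n_points = int(round((high_exp - low_exp) / step)) + 1
    return [10 ** (low_exp + i * step) for i in range(n_points)]


# Candidate values for an RBF-kernel SVM: 25 values of C times 25 values of gamma.
param_grid = {"C": log10_grid(), "gamma": log10_grid(), "kernel": ["rbf"]}
```

A dictionary of this shape is the form accepted by the `param_grid` argument of scikit-learn's `GridSearchCV` when the estimator is an `SVC`; the full grid contains 625 (C, γ) pairs.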

Results and Discussion

Optimization With Different Feature Spaces

To determine optimal feature spaces, we first used the LGBM algorithm to sort the features from maximum to minimum according to their importance value. All the features with importance value greater than the average were kept. Second, we used an incremental feature selection strategy; as shown in Figure 2A, the 10-Fold cross-validation and independent testing accuracy varied as features were added. Initially, the accuracy increased rapidly for each species. As shown in Figure 2 (A1) and Figure 2 (A2), when the feature dimensions for H. sapiens and S. cerevisiae reached 257 and 397, the model achieved maximum independent testing accuracies of 75.0 and 77.0%, respectively. Owing to the lack of independent test data sets for M. musculus, Figure 2 (A3) shows only the cross-validation accuracy curve, with its peak value (74.8%) at a feature dimension of 161. The optimal feature space dimensions selected for each species were 257, 397, and 161, respectively. These values were used for further experiments and optimization.
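
The incremental feature selection strategy described above can be sketched generically. Here `score_fn` is a hypothetical callback that returns an accuracy for a given feature subset (for example, from cross-validation or independent testing), and adding exactly one feature per step is an assumption, since the step size is not stated.

```python
def incremental_feature_selection(ranked_features, score_fn):
    """Grow the feature set in ranked order and keep the best-scoring prefix."""
    best_k, best_score, curve = 0, float("-inf"), []
    for k in range(1, len(ranked_features) + 1):
        score = score_fn(ranked_features[:k])   # evaluate the top-k features
        curve.append(score)
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, curve


# Toy scoring function (illustrative only): accuracy rises up to three
# features and then degrades slowly as further features are added.
def toy_score(subset):
    return 0.6 + 0.05 * min(len(subset), 3) - 0.01 * max(0, len(subset) - 3)


selected, peak, curve = incremental_feature_selection([4, 2, 7, 1, 9], toy_score)
```

The returned curve corresponds to the kind of accuracy-versus-dimension plot shown in Figure 2A, and the selected prefix is the feature space at its peak.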

FIGURE 2

Comparison With SVM Predictors

Given that PPUS (Li et al., 2015b), iRNA-PseU (Chen et al., 2016), and PseUI (He et al., 2018) were all based on SVM, an optimized SVM model for pseudouridine site identification with the same feature spaces as the RF model was constructed to determine the effects of the SVM and RF on prediction performance. The performances of the two models are shown in Table 2. Overall, the models based on RF showed markedly better performance than those based on SVM. For instance, in terms of 10-Fold cross-validation accuracy, the RF models for H. sapiens, S. cerevisiae, and M. musculus outperformed the corresponding SVM models by 3.71%, 10.8%, and 5.80% in relative terms, respectively. The independent testing accuracy scores showed an even greater contrast. For example, the RF model had 75.0% accuracy for H. sapiens, approximately 1.17 times that of the SVM model (64.0%). The ROC curve and auROC value shown in Figure 2B also demonstrate that the optimized RF models performed better than the optimized SVM models for the same feature spaces. Thus, non-SVM models such as XG-PseU (Liu et al., 2019b), iPseU-CNN (Tahir et al., 2019), and our RF-PseU model might be more suitable for distinguishing pseudouridine sites from non-pseudouridine sites.

TABLE 2

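| Species | Algorithm | 10-fold CV ACC | 10-fold CV MCC | 10-fold CV Sn | 10-fold CV Sp | 10-fold CV auROC | Indep. test ACC | Indep. test MCC | Indep. test Sn | Indep. test Sp | Indep. test auROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| H. sapiens | SVM | 62.0% | 0.240 | 61.4% | 62.6% | 0.656 | 64.0% | 0.280 | 66.0% | 62.0% | 0.679 |
| H. sapiens | RF | 64.3% | 0.287 | 66.1% | 62.6% | 0.700 | 75.0% | 0.501 | 78.0% | 72.0% | 0.800 |
| S. cerevisiae | SVM | 67.5% | 0.352 | 73.7% | 61.2% | 0.720 | 72.5% | 0.45 | 73.0% | 73.0% | 0.786 |
| S. cerevisiae | RF | 74.8% | 0.497 | 77.2% | 72.4% | 0.810 | 77.0% | 0.540 | 75.0% | 79.0% | 0.838 |
| M. musculus | SVM | 70.7% | 0.42 | 65.9% | 75.4% | 0.759 | / | / | / | / | / |
| M. musculus | RF | 74.8% | 0.50 | 73.1% | 76.5% | 0.796 | / | / | / | / | / |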

Cross-validation and independent testing scores of two different classifiers for three species.

Comparison With Previous Predictors

The performance of RF-PseU was also compared with that of state-of-the-art predictors including iRNA-PseU (Chen et al., 2016), PseUI (He et al., 2018), iPseU-CNN (Tahir et al., 2019), and XG-PseU (Liu et al., 2019b). First, we compared the evaluation scores for the three species. Table 3 compares the cross-validation and independent testing scores for the state-of-the-art pseudouridine site predictors with those of RF-PseU. In terms of cross-validation scores, the LOO accuracy values for S. cerevisiae and M. musculus were 75.4% and 74.5%, respectively, representing increments of approximately 10.5% and 3.47% over the values for the existing predictor (XG-PseU) with the best cross-validation score. However, the LOO accuracy of RF-PseU for H. sapiens, at 64.0%, showed a decrease of 4.0% compared with the best H. sapiens pseudouridine site predictor, iPseU-CNN. In terms of independent testing, as shown in Table 3, RF-PseU scored higher than the existing predictors in all aspects. For comprehensive comparison, the average scores for different species were calculated. The results, shown in Table 4, demonstrate that RF-PseU performed better overall than the other four predictors. The cross-validation accuracy scores of RF-PseU were 3.48% higher than those of the best existing predictor, iPseU-CNN; in terms of independent testing scores, RF-PseU showed a marked improvement of 4.7–10.6% compared with iPseU-CNN. The overall performance of RF-PseU was also significantly better than those of the other predictors, indicating that RF-PseU can discriminate true pseudouridine sites from non-pseudouridine sites more precisely than the existing predictors.

TABLE 3

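| Species | Classifier | CV ACC | CV MCC | CV Sn | CV Sp | CV auROC | Indep. test ACC | Indep. test MCC | Indep. test Sn | Indep. test Sp | Indep. test auROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| H. sapiens | iRNA-PseU (LOO) [a] | 60.4% | 0.21 | 61.0% | 59.8% | 0.640 | 65.0% | 0.30 | 60.0% | 70.0% | / |
| H. sapiens | PseUI (LOO) [a] | 64.2% | 0.28 | 64.9% | 63.6% | 0.68 | 65.5% | 0.31 | 63.0% | 68.0% | / |
| H. sapiens | iPseU-CNN (5F) [b] | 66.7% | 0.34 | 65.0% | 68.8% | / | 69.0% | 0.40 | 77.7% | 60.8% | / |
| H. sapiens | XG-PseU (10F) [c] | 66.1% | 0.32 | 63.5% | 68.7% | 0.700 | 67.5% | / | / | / | / |
| H. sapiens | RF-PseU (10F) [d] | 64.3% | 0.29 | 66.1% | 62.6% | 0.700 | 75.0% | 0.50 | 78.0% | 72.0% | 0.800 |
| H. sapiens | RF-PseU (LOO) [e] | 64.0% | 0.29 | 65.9% | 62.6% | 0.694 | 74.0% | 0.48 | 74.0% | 74.0% | 0.814 |
| S. cerevisiae | iRNA-PseU (LOO) | 64.5% | 0.29 | 64.7% | 64.3% | 0.81 | 60.0% | 0.20 | 63.0% | 57.0% | / |
| S. cerevisiae | PseUI (LOO) | 64.1% | 0.30 | 64.7% | 67.5% | 0.69 | 68.5% | 0.37 | 65.0% | 72.0% | / |
| S. cerevisiae | iPseU-CNN (5F) | 68.2% | 0.37 | 66.4% | 70.5% | / | 73.5% | 0.47 | 68.8% | 77.8% | / |
| S. cerevisiae | XG-PseU (10F) | 68.2% | 0.37 | 66.8% | 69.5% | 0.77 | 71.0% | / | / | / | / |
| S. cerevisiae | RF-PseU (10F) | 74.8% | 0.49 | 77.2% | 72.4% | 0.810 | 77.0% | 0.54 | 75.0% | 79.0% | 0.838 |
| S. cerevisiae | RF-PseU (LOO) | 75.8% | 0.52 | 78.2% | 73.4% | 0.819 | 74.5% | 0.49 | 70.0% | 79.0% | 0.823 |
| M. musculus | iRNA-PseU (LOO) | 69.1% | 0.38 | 73.3% | 64.8% | 0.75 | / | / | / | / | / |
| M. musculus | PseUI (LOO) | 70.4% | 0.41 | 79.9% | 70.3% | 0.71 | / | / | / | / | / |
| M. musculus | iPseU-CNN (5F) | 71.8% | 0.44 | 74.8% | 69.1% | / | / | / | / | / | / |
| M. musculus | XG-PseU (10F) | 72.0% | 0.45 | 76.5% | 67.6% | 0.74 | / | / | / | / | / |
| M. musculus | RF-PseU (10F) | 74.8% | 0.50 | 73.1% | 76.5% | 0.796 | / | / | / | / | / |
| M. musculus | RF-PseU (LOO) | 74.5% | 0.48 | 72.7% | 75.2% | 0.794 | / | / | / | / | / |

Comparison of cross-validation and independent testing scores of existing state-of-the-art pseudouridine site predictors and RF-PseU.

[a] Predictors based on support vector machine, leave-one-out cross-validation (LOO); [b] predictors based on convolutional neural nets, five-fold cross-validation (5F); [c] predictors based on XGboost, 10-fold cross-validation (10F); [d] predictors based on random forest, 10-fold cross-validation; [e] LOO: leave-one-out cross-validation.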

TABLE 4

| Score type | RF-PseU (10-fold) [c] | RF-PseU (LOO) [d] | iRNA-PseU (LOO) | PseUI (LOO) | iPseU-CNN (5-fold) [e] | XG-PseU (10-fold) |
| --- | --- | --- | --- | --- | --- | --- |
| Cross-validation [a] | 71.3% | 71.4% | 64.7% | 66.2% | 68.9% | 68.7% |
| Independent testing [b] | 76.0% | 74.7% | 62.5% | 67.0% | 71.3% | 69.3% |

Comparison of average accuracies for state-of-the-art predictors.

[a] Average values of H. sapiens, S. cerevisiae, and M. musculus; [b] average values of H. sapiens and S. cerevisiae; [c] model with 10-fold cross-validation; [d] model with leave-one-out cross-validation; [e] model with five-fold cross-validation.

Web Server Implementation

For convenience, a webserver with an easy-to-use interface was developed (see screenshot in Figure 3), which can be accessed freely at http://148.70.81.170:10228/rfpseu. A step-by-step user guide is given here. First, users select a species from the drop-down box and paste or type the query RNA sequences in FASTA format into the textbox. Second, after clicking the submit button, the query results will be shown in a table on the same page after a wait. Note that once a query task has been submitted, the submit button will be disabled. Third, the user can click the clear button to empty the input text box and enable the submit button, and return to step one to enter a new query task.

FIGURE 3

Conclusion

In this study, a new model, named RF-PseU, for predicting RNA pseudouridine sites in multiple species is presented. For given feature spaces, the random forest algorithm was shown to be more efficient than SVM models for discriminating pseudouridine sites from non-pseudouridine sites. In terms of average cross-validation and independent testing accuracy scores, RF-PseU showed improvements of 3.6–10% and 4.8–21%, respectively, compared with state-of-the-art predictors. Moreover, a web server with a user-friendly interface is available. It is anticipated that RF-PseU will be a useful tool for RNA pseudouridine site analysis. However, the model requires further development via combination with other technologies before it is suitable for use as a classifier for RNA pseudouridine sites. Future work will explore emerging methods such as Gene2Vec (Zou et al., 2019), m6Acomet (Wu et al., 2019), and iterative feature representation (Wei et al., 2019b) to improve the model’s performance.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: http://lin-group.cn/server/iRNAPseu/data.

Author contributions

ZL and JZ were responsible for the experiments and manuscript preparation. HD participated in discussions. QZ supervised all procedures.

Funding

The work was supported by the National Natural Science Foundation of China (Nos. 61922020, 61771331, and 91935302), and the Scientific Research Foundation in Shenzhen (JCYJ20170818100431895 and JCYJ20180306172207178).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1

    Agris, P. F. (2008). Bringing order to translation: the contributions of transfer RNA anticodon-domain modifications. Embo Rep. 9, 629–635. doi: 10.1038/embor.2008.104

  • 2

    Boccaletto, P., Machnicka, M. A., Purta, E., Piatkowski, P., Baginski, B., Wirecki, T. K. (2018). MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res. 46, D303–D307. doi: 10.1093/nar/gkx1030

  • 3

    Cao, R., Cheng, J. (2016). Protein single-model quality assessment by feature-based probability density functions. Sci. Rep. 6:23990. doi: 10.1038/srep23990

  • 4

    Carlile, T. M., Rojas-Duran, M. F., Gilbert, W. V. (2015). Pseudo-seq: genome-wide detection of pseudouridine modifications in RNA. Methods Enzymol. 560, 219–245. doi: 10.1016/bs.mie.2015.03.011

  • 5

    Carlile, T. M., Rojas-Duran, M. F., Zinshteyn, B., Shin, H., Bartoli, K. M., Gilbert, W. V. (2014). Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature 515, 143–146. doi: 10.1038/nature13802

  • 6

    Chen, K. Q., Wei, Z., Zhang, Q., Wu, X. Y., Rong, R., Lu, Z. L., et al. (2019a). WHISTLE: a high-accuracy map of the human N-6-methyladenosine (m(6)A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 47:e41. doi: 10.1093/nar/gkz074

  • 7

    Chen, W., Ding, H., Zhou, X., Lin, H., Chou, K. C. (2018a). iRNA(m6A)-PseDNC: identifying N-6-methyladenosine sites using pseudo dinucleotide composition. Anal. Biochem. 561–562, 59–65. doi: 10.1016/j.ab.2018.09.002

  • 8

    Chen, W., Feng, P. M., Ding, H., Lin, H., Chou, K. C. (2015). iRNA-Methyl: identifying N-6-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem. 490, 26–33. doi: 10.1016/j.ab.2015.08.021

  • 9

    Chen, W., Feng, P. M., Yang, H., Ding, H., Lin, H., Chou, K. C. (2018b). iRNA-3typeA: identifying three types of modification at RNA’s adenosine sites. Mol. Ther. Nucleic Acids 11, 468–474. doi: 10.1016/j.omtn.2018.03.012

  • 10

    Chen, W., Lv, H., Nie, F., Lin, H. (2019b). i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics (Oxf. Engl.) 35, 2796–2800. doi: 10.1093/bioinformatics/btz015

  • 11

    Chen, W., Tang, H., Ye, J., Lin, H., Chou, K. C. (2016). iRNA-PseU: identifying RNA pseudouridine sites. Mol. Ther. Nucleic Acids 5:e332. doi: 10.1038/mtna.2016.37

  • 12

    Chen, W., Yang, H., Feng, P., Ding, H., Lin, H. (2017). iDNA4mC: identifying DNA N-4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 33, 3518–3523. doi: 10.1093/bioinformatics/btx479

  • 13

    Cheng, L., Hu, Y. (2018). Human disease system biology. Curr. Gene Ther. 18, 255–256. doi: 10.2174/1566523218666181101143116

  • 14

    Cheng, L., Hu, Y., Sun, J., Zhou, M., Jiang, Q. (2018a). DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics 34, 1953–1956. doi: 10.1093/bioinformatics/bty002

  • 15

    Cheng, L., Jiang, Y., Ju, H., Sun, J., Peng, J., Zhou, M., et al. (2018b). InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics 19(Suppl. 1):919. doi: 10.1186/s12864-017-4338-6

  • 16

    Cheng, L., Qi, C., Zhuang, H., Fu, T., Zhang, X. (2019a). gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 48(Suppl. 1):gkz843. doi: 10.1093/nar/gkz843

  • 17

    Cheng, L., Sun, J., Xu, W. Y., Dong, L. X., Hu, Y., Zhou, M. (2016). OAHG: an integrated resource for annotating human genes with multi-level ontologies. Sci. Rep. 6, 1–9. doi: 10.1038/srep34820

  • 18

    Cheng, L., Yang, H., Zhao, H., Pei, X., Shi, H., Sun, J., et al. (2019b). MetSigDis: a manually curated resource for the metabolic signatures of diseases. Brief. Bioinform. 20, 203–209. doi: 10.1093/bib/bbx103

  • 19

    Cheng, L., Zhuang, H., Ju, H., Yang, S., Han, J., Tan, R., et al. (2019c). Exposing the causal effect of body mass index on the risk of type 2 diabetes mellitus: a mendelian randomization study. Front. Genet. 10:94. doi: 10.3389/fgene.2019.00094

  • 20

    Cohn, W. E. (1960). Pseudouridine, a carbon-carbon linked ribonucleoside in ribonucleic acids: isolation, structure, and chemical characteristics. J. Biol. Chem. 235, 1488–1498. doi: 10.1002/jbmte.390020410

  • 21

    Cortes, C., Vapnik, V. (1995). Support-vector networks. Mach. Learn. 20, 273–297. doi: 10.1007/BF00994018

  • 22

    Degenhardt, F., Seifert, S., Szymczak, S. (2019). Evaluation of variable selection methods for random forests and omics data sets. Brief. Bioinform. 20, 492–503. doi: 10.1093/bib/bbx124

  • 23

    Ding, Y., Tang, J., Guo, F. (2016a). Identification of protein–protein interactions via a novel matrix-based sequence representation model with amino acid contact information. Int. J. Mol. Sci. 17:1623. doi: 10.3390/ijms17101623

  • 24

    Ding, Y., Tang, J., Guo, F. (2016b). Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinformatics 17:398. doi: 10.1186/s12859-016-1253-9

  • 25

    Ding, Y., Tang, J., Guo, F. (2017). Identification of drug-target interactions via multiple information integration. Inf. Sci. 418–419, 546–560. doi: 10.1016/j.ins.2017.08.045

  • 26

    Ding, Y., Tang, J., Guo, F. (2019). Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing 325, 211–224. doi: 10.1016/j.neucom.2018.10.028

  • 27

    Fang, T., Zhang, Z., Sun, R., Zhu, L., He, J., Huang, B., et al. (2019). RNAm5CPred: prediction of RNA 5-methylcytosine sites based on three different kinds of nucleotide composition. Mol. Ther. Nucleic Acids 18, 739–747. doi: 10.1016/j.omtn.2019.10.008

  • 28

    Feng, P., Yang, H., Ding, H., Lin, H., Chen, W., Chou, K.-C. (2019). iDNA6mA-PseKNC: identifying DNA N-6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 111, 96–102. doi: 10.1016/j.ygeno.2018.01.005

  • 29

    Feng, P. M., Ding, H., Chen, W., Lin, H. (2016). Identifying RNA 5-methylcytosine sites via pseudo nucleotide compositions. Mol. Biosyst. 12, 3307–3311. doi: 10.1039/c6mb00471g

  • 30

    Fu, J., Tang, J., Wang, Y., Cui, X., Yang, Q., Hong, J., et al. (2018). Discovery of the consistently well-performed analysis chain for SWATH-MS Based pharmacoproteomic quantification. Front. Pharmacol. 9:681. doi: 10.3389/fphar.2018.00681

  • 31

    Fustin, J.-M., Doi, M., Yamaguchi, Y., Hida, H., Nishimura, S., Yoshida, M., et al. (2013). RNA-methylation-dependent RNA processing controls the speed of the circadian clock. Cell 155, 793–806. doi: 10.1016/j.cell.2013.10.026

  • 32

    Goodwin, S., McPherson, J. D., McCombie, W. R. (2016). Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351. doi: 10.1038/nrg.2016.49

  • 33

    He, J. J., Fang, T., Zhang, Z. Z., Huang, B., Zhu, X. L., Xiong, Y. (2018). PseUI: pseudouridine sites identification based on RNA sequence information. BMC Bioinformatics 19:11. doi: 10.1186/s12859-018-2321-0

  • 34

    HeW.JiaC.ZouQ. (2019). 4mCPred: machine learning methods for DNA N-4-methylcytosine sites prediction.Bioinformatics35593601. 10.1093/bioinformatics/bty668

  • 35

    HelmM. (2006). Post-transcriptional nucleotide modification and alternative folding of RNA.Nucleic Acids Res.34721733. 10.1093/nar/gkj471

  • 36

    HuY.ZhaoT.ZhangN.ZhangY.ChengL. (2019). A review of recent advances and research on drug target identification methods.Curr. Drug Metab.20209216. 10.2174/1389200219666180925091851

  • 37

    JiangL.DingY.TangJ.GuoF. (2018). MDA-SKF: similarity kernel fusion for accurately discovering miRNA-disease association.Front. Genet.9:618. 10.3389/fgene.2018.00618

  • 38

    KarijolichJ.YiC. Q.YuY. T. (2015). Transcriptome-wide dynamics of RNA pseudouridylation.Nat. Rev. Mol. Cell Biol.16581585. 10.1038/nrm4040

  • 39

    KeG.MengQ.FinleyT.WangT.ChenW.MaW.et al (2017). “LightGBM: a highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems 30, eds GuyonI.LuxburgU. V.BengioS.WallachH.FergusR.VishwanathanS. (Red Hook, NY: Curran Associates, Inc), 3146–3154.

  • 40

    KongL.ZhangL. (2019). i6mA-DNCP: computational identification of DNA N6-methyladenine sites in the rice genome using optimized dinucleotide-based features.Genes10:828. 10.3390/genes10100828

  • 41

    LaiH. Y.ZhangZ. Y.SuZ. D.SuW.DingH.ChenW.et al (2019). iProEP: a computational predictor for predicting promoter.Mol. Ther. Nucleic Acids17337346. 10.1016/j.omtn.2019.05.028

  • 42

    LiB.TangJ.YangQ.LiS.CuiX.LiY. (2017). NOREVA: normalization and evaluation of MS-based metabolomics data.Nucleic Acids Res.45W162W170. 10.1093/nar/gkx449

  • 43

    LiC.-C.LiuB. (2019). MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks.Brief. Bioinform.bbz133. 10.1093/bib/bbz133

  • 44

    LiX. Y.ZhuP.MaS. Q.SongJ. H.BaiJ. Y.SunF. F.et al (2015a). Chemical pulldown reveals dynamic pseudouridylation of the mammalian transcriptome.Nat. Chem. Biol.11592597. 10.1038/nchembio.1836

  • 45

    LiY. H.ZhangG. G.CuiQ. H. (2015b). PPUS: a web server to predict PUS-specific pseudouridine sites.Bioinformatics3133623364. 10.1093/bioinformatics/btv366

  • 46

    LiY. Z.FanY.-X.YangH.-H. (2018). KELMPSP: pseudouridine sites identification based on kernel extreme learning machine.Chin. J. Biochem. Mol. Biol.34785793. 10.13865/j.cnki.cjbmb2018.07.14

  • 47

    LibbrechtM. W.NobleW. S. (2015). Machine learning applications in genetics and genomics.Nat. Rev. Genet.16321332. 10.1038/nrg3920

  • 48

    LiuB. (2019). BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches.Brief. Bioinform.2012801294. 10.1093/bib/bbx165

  • 49

    LiuB.LiC.-C.YanK. (2019a). DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks.Brief. Bioinform.bbz098. 10.1093/bib/bbz098

  • 50

    LiuB.LiuF. L.WangX. L.ChenJ. J.FangL. Y.ChouK. C. (2015a). Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences.Nucleic Acids Res.43W65W71. 10.1093/nar/gkv458

  • 51

    LiuK.ChenW.LinH. (2019b). XG-PseU: an eXtreme gradient boosting based method for identifying pseudouridine sites.Mol. Genet. Genomics2951321. 10.1007/s00438-019-01600-9

  • 52

    LiuN.DaiQ.ZhengG.HeC.ParisienM.PanT. (2015b). N-6-methyladenosine-dependent RNA structural switches regulate RNA-protein interactions.Nature518560564. 10.1038/nature14234

  • 53

    LiuS.ZhengB.ShengY.KongQ.JiangY.YangY.et al (2019c). Identification of cancer dysfunctional subpathways by integrating DNA methylation, copy number variation, and gene-expression data.Front. Genet.10:441. 10.3389/fgene.2019.00441

  • 54

    LvH.DaoF.-Y.GuanZ.-X.ZhangD.TanJ.-X. (2019a). iDNA6mA-Rice: a computational tool for detecting N6-methyladenine sites in rice.Front. Genet.10:793. 10.3389/fgene.2019.00793

  • 55

    LvZ.JinS.DingH.ZouQ. (2019b). A random forest sub-Golgi protein classifier optimized via dipeptide and amino acid composition features.Front. Bioeng. Biotechnol.7:215. 10.3389/fbioe.2019.00215

  • 56

    ManavalanB.BasithS.ShinT. H.WeiL.LeeG. (2019). Meta-4mCpred: a sequence-based meta-predictor for accurate DNA 4mC site prediction using effective feature representation.Mol. Ther. Nucleic Acids16733744. 10.1016/j.omtn.2019.04.019

  • 57

    MengC. L.JinS. S.WangL.GuoF.ZouQ. (2019). AOPs-SVM: a sequence-based classifier of antioxidant proteins using a support vector machine.Front. Bioeng. Biotechnol.7:10. 10.3389/fbioe.2019.00224

  • 58

    MeyerK. D.JaffreyS. R. (2014). The dynamic epitranscriptome: N-6-methyladenosine and gene expression control.Nat. Rev. Mol. Cell Biol.15313326. 10.1038/nrm3785

  • 59

    MotorinY.HelmM. (2010). tRNA stabilization by modified nucleotides.Biochemistry4949344944. 10.1021/bi100408z

  • 60

    MrozekD.GoskP.Malysiak-MrozekB. (2015). Scaling ab initio predictions of 3D protein structures in Microsoft Azure cloud.J. Grid Comput.13561585. 10.1007/s10723-015-9353-8

  • 61

    MrozekD.Malysiak-MrozekB.SiaznikA. (2013). Search GenBank: interactive orchestration and ad-hoc choreography of web services in the exploration of the biomedical resources of the National Center for Biotechnology Information.BMC Bioinformatics14:18. 10.1186/1471-2105-14-73

  • 62

    MrozekD.SochaB.KozielskiS.Malysiak-MrozekB. (2016). An efficient and flexible scanning of databases of protein secondary structures.J. Intell. Inf. Syst.46213233. 10.1007/s10844-014-0353-0

  • 63

    MuhammodR.AhmedS.Md FaridD.ShatabdaS.SharmaA.DehzangiA. (2019). PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences.Bioinformatics (Oxf. Engl.)3538313833. 10.1093/bioinformatics/btz165

  • 64

    NairA. S.SreenadhanS. P. (2006). A coding measure scheme employing electron-ion interaction pseudopotential (EIIP).Bioinformation1197202.

  • 65

    PanG.TangJ.GuoF. (2017). Analysis of co-associated transcription factors via ordered adjacency differences on motif distribution.Sci. Rep.7:43597. 10.1038/srep43597

  • 66

    QiuW.-R.JiangS.-Y.XuZ.-C.XiaoX.ChouK.-C. (2017). iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition.Oncotarget84117841188. 10.18632/oncotarget.17104

  • 67

    QiuW. R.SunB. Q.XiaoX.XuZ. C.ChouK. C. (2016). iPTM-mLys: identifying multiple lysine PTM sites and their different types.Bioinformatics3231163123. 10.1093/bioinformatics/btw380

  • 68

    SaboohM. F.IqbalN.KhanM.KhanM.MaqboolH. F. (2018). Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou’s PseKNC.J. Theor. Biol.45219. 10.1016/j.jtbi.2018.04.037

  • 69

    SchwartzS.BernsteinD. A.MumbachM. R.JovanovicM.HerbstR. H.León-RicardoB. X. (2014). Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA.Cell159148162. 10.1016/j.cell.2014.08.028

  • 70

    ShanX.WangX.LiC. D.ChuY.ZhangY.XiongY. I. (2019). Prediction of CYP450 enzyme-substrate selectivity based on the network-based label space division method.J. Chem. Inf. Model.5945774586. 10.1021/acs.jcim.9b00749

  • 71

    ShenY.TangJ.GuoF. (2019). Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC.J. Theor. Biol.462230239. 10.1016/j.jtbi.2018.11.012

  • 72

    SongJ. H.YiC. Q. (2017). Chemical modifications to RNA: a new layer of gene expression regulation.ACS Chem. Biol.12316325. 10.1021/acschembio.6b00960

  • 73

    SunW.HanY.YangS.ZhuangH.ZhangJ.ChengL. (2019). The assessment of Interleukin-18 on the risk of coronary heart disease.Med. Chem. 10.2174/1573406415666191004115128 [Epub ahead of print].

  • 74

    TahirM.TayaraH.ChongK. T. (2019). iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks.Mol. Ther. Nucleic Acids16463470. 10.1016/j.omtn.2019.03.010

  • 75

    TangH.CaoR.-Z.WangW.LiuT.-S.WangL.-M.HeC.-M. (2017). A two-step discriminated method to identify thermophilic proteins.Int. J. Biomath.10:1750050. 10.1142/s1793524517500504

  • 76

    TangJ.FuJ.WangY.LiB.LiY.YangQ. (2019a). ANPELA: analysis and performance assessment of the label-free quantification workflow for metaproteomic studies.Brief. Bioinform.10:bby127. 10.1093/bib/bby127

  • 77

    TangJ.FuJ.WangY.LuoY.YangQ.LiB.et al (2019b). Simultaneous improvement in the precision, accuracy, and robustness of label-free proteome quantification by optimizing data manipulation chains.Mol. Cell. Proteomics1816831699. 10.1074/mcp.RA118.001169

  • 78

    TangW.WanS. X.YangZ.TeschendorffA. E.ZouQ. (2018). Tumor origin detection with tissue-specific miRNA and DNA methylation markers.Bioinformatics34398406. 10.1093/bioinformatics/btx622

  • 79

    WangP.ZhangX.FuT.LiS.LiB.XueW.et al (2017). Differentiating physicochemical properties between addictive and nonaddictive ADHD drugs revealed by molecular dynamics simulation studies.ACS Chem. Neurosci.814161428. 10.1021/acschemneuro.7b00173

  • 80

    WangX.LuZ.GomezA.HonG. C.YueY.HanD. (2014). N-6-methyladenosine-dependent regulation of messenger RNA stability.Nature505117120. 10.1038/nature12730

  • 81

    WangX.ZhuX.YeM.WangY.LiC.-D.XiongY. (2019a). STS-NLSP: a network-based label space partition method for predicting the specificity of membrane transporter substrates using a hybrid feature of structural and semantic similarity.Front. Bioeng. Biotechnol.7:306. 10.3389/fbioe.2019.00306

  • 82

    WangY.ZhangS.LiF.ZhouY.ZhangY.WangZ. (2019b). Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics.Nucleic Acids Res.48D1031D1041. 10.1093/nar/gkz981

  • 83

    WeiL.LuanS.NagaiL. A. E.SuR.ZouQ. (2019a). Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species.Bioinformatics3513261333. 10.1093/bioinformatics/bty824

  • 84

    WeiL.SuR.LuanS.LiaoZ.ManavalanB.ZouQ.et al (2019b). Iterative feature representations improve N4-methylcytosine site prediction.Bioinformatics3549304937. 10.1093/bioinformatics/btz408

  • 85

    WeiL.XingP.SuR.ShiG.MaZ.ZouQ. (2017a). CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency.J. Proteome Res.1620442053. 10.1021/acs.jproteome.7b00019

  • 86

    WeiL.XingP.TangJ.ZouQ. (2017b). PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only.IEEE Trans. Nanobioscience16240247. 10.1109/TNB.2017.2661756

  • 87

    WeiL. Y.TangJ. J.ZouQ. (2017c). Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information.Inf. Sci.384135144. 10.1016/j.ins.2016.06.026

  • 88

    WeiL. Y.WanS. X.GuoJ. S.WongK. K. L. (2017d). A novel hierarchical selective ensemble classifier with bioinformatics application.Artif. Intell. Med.838290. 10.1016/j.artmed.2017.02.005

  • 89

    WeiL. Y.XingP. W.ZengJ. C.ChenJ. X.SuR.GuoF. (2017e). Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier.Artif. Intell. Med.836774. 10.1016/j.artmed.2017.03.001

  • 90

    WinklerR.GillisE.LasmanL.SafraM.GeulaS.SoyrisC. (2019). m(6)A modification controls the innate immune response to infection by targeting type I interferons.Nat. Immunol.20173182. 10.1038/s41590-018-0275-z

  • 91

    WuX. Y.WeiZ.ChenK. Q.ZhangQ.SuJ. L.LiuH.et al (2019). m6Acomet: large-scale functional prediction of individual m(6)A RNA methylation sites from an RNA co-methylation network.BMC Bioinformatics20:223. 10.1186/s12859-019-2840-3

  • 92

    XiongY.LiuJ.ZhangW.ZengT. (2012). Prediction of heme binding residues from protein sequences with integrative sequence profiles.Proteome Sci.10(Suppl. 1):S20. 10.1186/1477-5956-10-S1-S20

  • 93

    XiongY.WangQ.YangJ.ZhuX.WeiD. Q. (2018). PredT4SE-Stack: prediction of bacterial Type IV secreted effectors from protein sequences using a stacked ensemble method.Front. Microbiol.9:2571. 10.3389/fmicb.2018.02571

  • 94

    XuL.LiangG.LiaoC.ChenG.-D.ChangC.-C. (2018a). An efficient classifier for Alzheimer’s disease genes identification.Molecules23:3140. 10.3390/molecules23123140

  • 95

    XuL.LiangG.LiaoC.ChenG.-D.ChangC.-C. (2019). k-Skip-n-Gram-RF: a random forest based method for Alzheimer’s disease protein identification.Front. Genet.10:33. 10.3389/fgene.2019.00033

  • 96

    XuL.LiangG.WangL.LiaoC. (2018b). A novel hybrid sequence-based model for identifying anticancer peptides.Genes9:158. 10.3390/genes9030158

  • 97

    XueW.YangF.WangP.ZhengG.ChenY.YaoX.et al (2018). What contributes to serotonin-norepinephrine reuptake inhibitors’ dual-targeting mechanism? The key role of transmembrane domain 6 in human serotonin and norepinephrine transporters revealed by molecular dynamics simulation.ACS Chem. Neurosci.911281140. 10.1021/acschemneuro.7b00490

  • 98

    YinJ.SunW.LiF.HongJ.LiX.ZhouY. (2019). VARIDT 1.0: variability of drug transporter database.Nucleic Acids Res.48D1042D1050. 10.1093/nar/gkz779

  • 99

    YuH.DaiZ. (2019). SNNRice6mA: a deep learning method for predicting DNA N6-methyladenine sites in rice genome.Front. Genet.10:1071. 10.3389/fgene.2019.01071

  • 100

    YuL.HuangJ. B.MaZ. X.ZhangJ.ZouY. P.GaoL. (2015). Inferring drug-disease associations based on known protein complexes.BMC Med. Genomics8:13. 10.1186/1755-8794-8-s2-s2

  • 101

    YuL.SuR.WangB.ZhangL.ZouY.ZhangJ.et al (2017a). Prediction of novel drugs for hepatocellular carcinoma based on multi-source random walk.IEEE ACM Trans. Comput. Biol. Bioinform.14966977. 10.1109/tcbb.2016.2550453

  • 102

    YuL.YaoS.GaoL.ZhaY. (2019). Conserved disease modules extracted from multilayer heterogeneous disease and gene networks for understanding disease mechanisms and predicting disease treatments.Front. Genet.9:745. 10.3389/fgene.2018.00745

  • 103

    YuL.ZhaoJ.GaoL. (2017b). Drug repositioning based on triangularly balanced structure for tissue-specific diseases in incomplete interactome.Artif. Intell. Med.775363. 10.1016/j.artmed.2017.03.009

  • 104

    YuL.ZhaoJ.GaoL. (2018). Predicting potential drugs for breast cancer based on miRNA and tissue specificity.Int. J. Biol. Sci.14971980. 10.7150/ijbs.23350

  • 105

    ZaringhalamM.PapavasiliouF. N. (2016). Pseudouridylation meets next-generation sequencing.Methods1076372. 10.1016/j.ymeth.2016.03.001

  • 106

    ZhangM.LiF. Y.Marquez-LagoT. T.LeierA.FanC.KwohC. K.et al (2019). MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters.Bioinformatics3529572965. 10.1093/bioinformatics/btz016

  • 107

    ZhangM.XuY.LiL.LiuZ.YangX.YuD.-J. (2018). Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble.Anal. Biochem.5504148. 10.1016/j.ab.2018.03.027

  • 108

    ZhaoB. S.HeC. (2015). Pseudouridine in a new era of RNA modifications.Cell Res.25153154. 10.1038/cr.2014.143

  • 109

    ZhaoX. W.ZhangY.NingQ.ZhangH. R.JiJ. C.YinM. H. (2019). Identifying N-6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer.J. Theor. Biol.4673947. 10.1016/j.jtbi.2019.01.035

  • 110

    ZhengG.XueW.YangF.ZhangY.ChenY.YaoX.et al (2017). Revealing vilazodone’s binding mechanism underlying its partial agonism to the 5-HT1A receptor in the treatment of major depressive disorder.Phys. Chem. Chem. Phys.192888528896. 10.1039/c7cp05688e

  • 111

    ZhouY.ZengP.LiY. H.ZhangZ. D.CuiQ. H. (2016). SRAMP: prediction of mammalian N-6-methyladenosine (m(6)A) sites based on sequence-derived features.Nucleic Acids Res.44:e91. 10.1093/nar/gkw104

  • 112

    ZhuX.HeJ.ZhaoS.TaoW.XiongY.BiS. (2019). A comprehensive comparison and analysis of computational predictors for RNA N6-methyladenosine sites of Saccharomyces cerevisiae.Brief. Funct. Genomics18367376. 10.1093/bfgp/elz018

  • 113

    ZouQ.XingP. W.WeiL. Y.LiuB. (2019). Gene2vec: gene subsequence embedding for prediction of mammalian N-6-methyladenosine sites from mRNA.RNA25205218. 10.1261/rna.069112.118

  • 114

    ZouQ.ZengJ. C.CaoL. J.JiR. R. (2016). A novel features ranking metric with application to scalable visual and bioinformatics data classification.Neurocomputing173346354. 10.1016/j.neucom.2014.12.123

Summary

Keywords

pseudouridine sites, light gradient boosting, random forest, machine learning, RNA

Citation

Lv Z, Zhang J, Ding H and Zou Q (2020) RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Front. Bioeng. Biotechnol. 8:134. doi: 10.3389/fbioe.2020.00134

Received

17 January 2020

Accepted

10 February 2020

Published

26 February 2020

Volume

8 - 2020

Edited by

Yongchun Zuo, Inner Mongolia University, China

Reviewed by

Yi Xiong, Shanghai Jiao Tong University, China; Hongmin Cai, South China University of Technology, China

*Correspondence: Jun Zhang, Quan Zou,

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
