RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites

Lv, Zhibin; Zhang, Jun; Ding, Hui; Zou, Quan

doi:10.3389/fbioe.2020.00134

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 26 February 2020

Sec. Computational Genomics

Volume 8 - 2020 | https://doi.org/10.3389/fbioe.2020.00134

1. Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
2. Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
3. Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China

Abstract

One of the ubiquitous chemical modifications in RNA, pseudouridine modification is crucial for various cellular biological and physiological processes. To gain more insight into the functional mechanisms involved, it is of fundamental importance to precisely identify pseudouridine sites in RNA. Several useful machine learning approaches have become available recently, with the increasing progress of next-generation sequencing technology; however, existing methods cannot predict sites with high accuracy. Thus, a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for prediction of pseudouridylation sites. To optimize feature representation and obtain a better model, the light gradient boosting machine algorithm and incremental feature selection strategy were used to select the optimum feature space vector for training the random forest model RF-PseU. Compared with previous state-of-the-art predictors, the results on the same benchmark data sets of three species demonstrate that RF-PseU performs better overall. The integrated average leave-one-out cross-validation and independent testing accuracy scores were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% versus the best existing predictor. Moreover, the final RF-PseU model for prediction was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.

Introduction

More than 150 types of chemical modification have been identified in cellular RNA, including adenosine methylation, cytosine modification, isomerization of uridine, and ribose modification (Boccaletto et al., 2018). These modifications have critical roles in cellular biological and physiological processes (Song and Yi, 2017). For instance, one of the most prevalent RNA modifications in eukaryotes, N⁶-methyladenosine (m6A), affects RNA stability (Wang et al., 2014), RNA-protein interaction (Liu et al., 2015b), RNA splicing and translation (Meyer and Jaffrey, 2014), the circadian clock (Fustin et al., 2013), immune response (Winkler et al., 2019), etc. Another widespread RNA modification is 5-methylcytosine (m5C), which has functions including preservation of the secondary structure of tRNA (Motorin and Helm, 2010), control of amino-acylation (Helm, 2006), codon identification and metabolic stability (Agris, 2008; Li et al., 2017). The pseudouridine modification is another common post-transcriptional modification in various living organisms (Zaringhalam and Papavasiliou, 2016). In 1951, pseudouridine was first identified, and experiments in 1960 revealed that it was abundant in tRNA and rRNA (Cohn, 1960). Pseudouridine results from an isomerization of uridine by breaking the glycosidic bond with 180° base rotation (Karijolich et al., 2015). This modification has been shown to have vital roles, for instance, in stabilizing RNA and in the stress response (Zhao and He, 2015; Cheng et al., 2019a; Wang et al., 2019b).

Although RNA pseudouridylation was discovered decades ago, the first transcriptome-wide RNA pseudouridylation map was not published until 2014, following the rapid development of next-generation sequencing technology (Goodwin et al., 2016). Carlile et al. (2014) developed the PseudoU-seq technology, which they used to identify more than 200 pseudouridylation sites in the regulated mRNA of yeast and human cells; in the same year, Schwartz et al. (2014) performed transcriptome-wide mapping using a similar protocol, finding more than 300 dynamic-regulated pseudouridine sites in non-coding RNA and mRNA. Li et al. (2015a) presented a chemical labeling method (CeU-Seq) that they used to pull down more than 2000 pseudouridine sites in human mRNA. Other RNA pseudouridylation sequencing protocols were also developed (Carlile et al., 2015).

As an alternative to costly and labor-intensive laboratory experiments, robust, swift, and inexpensive computational methods for RNA chemical modification prediction have emerged recently, owing to the increasing amount of data generated in this post-genomics era (Libbrecht and Noble, 2015). A large number of m6A (Chen et al., 2015, 2018a,b, 2019a; Zhou et al., 2016; Zhao et al., 2019; Zou et al., 2019) and m5C (Feng et al., 2016; Qiu et al., 2017; Li et al., 2018; Sabooh et al., 2018; Zhang et al., 2018; Yin et al., 2019) site predictors based on traditional machine learning and emerging deep learning algorithms have been proposed. However, few computational tools have been developed to predict pseudouridine sites. Li et al. (2015b) used a support vector machine (SVM) classifier to design a web server called PPUS for the identification of pseudouridine sites in Saccharomyces cerevisiae and Homo sapiens. Chen et al. (2016) constructed another SVM-based web server for pseudouridine site prediction, using the frequency composition of the nucleotides and pseudo K-tuple nucleotide composition (PseKNC) for feature representation. He et al. (2018) presented another model, PseUI, to identify pseudouridine sites in RNA sequences from three species (H. sapiens, S. cerevisiae, and M. musculus); this was an SVM-based model incorporating multiple feature-extraction technologies. Tahir et al. (2019) used convolutional neural networks to design a new predictor, iPseU-CNN; and Liu et al. (2019b) developed the eXtreme gradient boosting (XGboost) method for RNA pseudouridine site prediction (XG-PseU). Cross-validation scores for RNA pseudouridine site identification in the abovementioned three species showed the best accuracy for iPseU-CNN (66.9%) in H. sapiens, whereas XG-PseU and iPseU-CNN had the best accuracy (68.2%) in S. cerevisiae, and XG-PseU was the most accurate (72.0%) in M. musculus. 
According to independent testing scores, iPseU-CNN outperformed the other models, with 69.0% accuracy in H. sapiens and 73.6% accuracy in S. cerevisiae. Although the iPseU-CNN predictor had high average cross-validation (68.9%) and independent testing (71.3%) accuracy scores, there was still room for improvement in comparison with some high-performing m6A site predictors (Chen et al., 2019a; Zou et al., 2019).

In this work, a model based on the random forest algorithm, RF-PseU, is developed for pseudouridine site recognition. The modeling overview is shown in Figure 1. RF-PseU incorporates multiple sequence feature representation technologies, and the light gradient boosting machine (LGBM) algorithm is employed to remove redundant features and rank the remaining features. Evaluation with leave-one-out (LOO) cross-validation demonstrated the robustness of the model. The average cross-validation accuracy of RF-PseU (71.3% for 10-Fold and 71.4% for LOO) was 3.48–10.3% higher than that of existing state-of-the-art predictors, and the average independent testing accuracy (74.7%) showed a 4.8–19% increase. A user-friendly web server was also implemented, which can be accessed at http://148.70.81.170:10228/rfpseu. RF-PseU is expected to be a useful supplement to the existing tools for pseudouridine site identification.

FIGURE 1

Materials and Methods

Data Sets

Given that there were small differences between the benchmark data sets used in the studies of Chen et al. (2018a) and Liu et al. (2019b), data sets obtained from Chen et al. (2018a) were used to train and test our models. The training data sets covered three species: the H. sapiens training data set contained 495 pseudouridine-site-containing sequences and 495 sequences without pseudouridine sites; the S. cerevisiae training data set contained 314 positive and 314 negative sequences; and the M. musculus training data set consisted of 944 sequences, half of which were positive samples. The independent testing data sets covered only two species, H. sapiens and S. cerevisiae, each with 100 positive samples and 100 negative samples. For the H. sapiens and M. musculus data sets, the window size was 21, i.e. the positive samples were sequences of 21 nucleotides centered on the pseudouridine site, whereas for S. cerevisiae the window size was 31, with pseudouridine-site-centered sequences of 31 nucleotides. Negative samples, in which no pseudouridine sites were detected, consisted of 21 nucleotides for H. sapiens and M. musculus, and 31 nucleotides for S. cerevisiae. The benchmark data sets can be downloaded from http://lin-group.cn/server/iRNAPseu/data.

Feature Representation

Several widely used and convenient bio-sequence feature representation tools have been developed (Mrozek et al., 2013; Liu et al., 2015a, 2019c; Yu et al., 2015, 2019; Cheng and Hu, 2018; Hu et al., 2019; Muhammod et al., 2019). The two main tools used in this work were iLearn (Hu et al., 2019) and PyFeat (Muhammod et al., 2019).

Nucleotide Binary Profiles

Binary profiles encode the four bases (ACGU) as (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1), whereas dibinary profiles encode the 16 dinucleotides (AA, AC, AG, AU, CA, CC, CG, CU, GA, GC, GG, GU, UA, UC, UG, and UU) as (0,0,0,0), (0,0,0,1), (0,0,1,0), (0,0,1,1),…, (1,1,1,1).
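The two encodings can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and encoding every overlapping dinucleotide is an assumption of this example.

```python
from itertools import product

BASES = "ACGU"

# One-hot code per base: A -> (1,0,0,0), C -> (0,1,0,0), G -> (0,0,1,0), U -> (0,0,0,1)
BINARY = {b: tuple(int(i == j) for j in range(4)) for i, b in enumerate(BASES)}

# 4-bit code per dinucleotide in the order AA, AC, ..., UU: (0,0,0,0) ... (1,1,1,1)
DIBINARY = {
    "".join(pair): tuple(int(bit) for bit in format(index, "04b"))
    for index, pair in enumerate(product(BASES, repeat=2))
}


def binary_profile(seq):
    """Concatenate the one-hot codes of every base in seq."""
    return [bit for base in seq for bit in BINARY[base]]


def dibinary_profile(seq):
    """Concatenate the 4-bit codes of every overlapping dinucleotide (assumption)."""
    return [bit for i in range(len(seq) - 1) for bit in DIBINARY[seq[i:i + 2]]]
```

Under these assumptions a 21-nucleotide window yields 84 binary features and 80 dibinary features.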

Accumulated Nucleotide Frequency

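Suppose s_i is the base (A, C, G, or U) at the i^th position of an RNA sequence. Then the density d_i of s_i in the i^th prefix subsequence of the RNA sequence can be determined as follows:

$$d_i = \frac{1}{i} \sum_{j=1}^{i} f(s_j), \qquad f(s_j) = \begin{cases} 1, & s_j = q \\ 0, & \text{otherwise} \end{cases} \qquad (i = 1, 2, \ldots, L)$$

where L is the sequence length and q is one of the four nucleotides (ACGU), here the base s_i found at position i.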
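As a worked illustration, the sketch below computes the accumulated nucleotide frequency under the usual definition, in which d_i is the number of occurrences of the base s_i among the first i positions, divided by i. The function name is hypothetical and the code is not taken from iLearn.

```python
def accumulated_nucleotide_frequency(seq):
    """Return d_i for i = 1..L: the share of base s_i among the first i bases."""
    counts = {}
    densities = []
    for i, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1  # occurrences of s_i so far
        densities.append(counts[base] / i)
    return densities
```

For the sequence ACAU, the third position holds the second A out of three bases, so its density is 2/3.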

Nucleotide Chemical Properties

The four RNA nucleotides (ACGU) differ from each other in chemical structure and chemical bonds. On the basis of these differences, the nucleotides can be grouped according to three chemical properties (ring structure, hydrogen bond strength, and functional group; Table 1) and encoded using a three-dimensional coordinate, i.e. A is denoted by (1,1,1), C by (0,1,0), G by (1,0,0), and U by (0,0,1).

TABLE 1

Nucleotides	Class	Chemical property
C,U	Pyrimidine	Ring structure
A,G	Purine	Ring structure
A,U	Weak	Hydrogen bond
C,G	Strong	Hydrogen bond
G,U	Keto	Functional group
A,C	Amino	Functional group

ACGU categories based on chemical properties.
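The coordinates quoted in the text can be reproduced from Table 1 with a short sketch. The coordinate order used here (ring structure, functional group, hydrogen bond) is an inference: it is the order that reproduces the four stated codes, and the source does not spell it out.

```python
PURINE = {"A", "G"}  # ring structure: purine -> 1, pyrimidine -> 0
AMINO = {"A", "C"}   # functional group: amino -> 1, keto -> 0
WEAK = {"A", "U"}    # hydrogen bond: weak -> 1, strong -> 0


def ncp(base):
    """Three-dimensional chemical-property code of a single base."""
    return (int(base in PURINE), int(base in AMINO), int(base in WEAK))


def ncp_encode(seq):
    """Concatenate the chemical-property codes of every base in seq."""
    return [value for base in seq for value in ncp(base)]
```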

Electron-Ion Interaction Pseudopotentials (EIIP)

Nair and Sreenadhan (2006) used the EIIP values of A, G, C, and T (A: 0.1260, G: 0.0806, C: 0.1340, T: 0.1335) to directly represent the nucleotides in a DNA sequence. Here, iLearn was used to encode each nucleotide in the RNA sequences into EIIP feature vectors.
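A minimal sketch of this encoding follows. How iLearn treats uracil is not stated in the text, so assigning U the published value for T is an assumption of this example, and the function name is hypothetical.

```python
# EIIP values from Nair and Sreenadhan (2006); U reuses the T value (assumption).
EIIP = {"A": 0.1260, "G": 0.0806, "C": 0.1340, "U": 0.1335}


def eiip_encode(seq):
    """Replace every base by its EIIP value, giving one feature per position."""
    return [EIIP[base] for base in seq]
```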

Enhanced Nucleic Acid Composition

The nucleotide composition was calculated for a fixed-length window of the RNA sequence, allowing the fixed window (length = 5) to continuously slide from the 5′ to the 3′ terminus. RNA sequences were then encoded into feature vectors of equal length.
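The sliding-window calculation can be sketched as follows. The function name and the step size of one nucleotide are assumptions of this example rather than details confirmed by the text.

```python
def enac(seq, window=5, alphabet="ACGU"):
    """Nucleotide frequencies inside each window as it slides from 5' to 3'."""
    features = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        # Four frequencies (A, C, G, U) per window position.
        features.extend(chunk.count(base) / window for base in alphabet)
    return features
```

With a window of 5 and a step of 1, a 21-nucleotide sequence has 17 window positions and therefore 68 features, which is why sequences of equal length produce vectors of equal length.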

Xmer k-Spaced Ymer Composition Frequency

This method is used to count the composition of a subsequence of X and Y consecutive nucleotides with intervals k, e.g. AGU@AU, A@CU, GU@@@A, where @ indicates a one-interval space, @@ a two-interval space, and so on. Generally, using Xmer k-spaced Ymer to encode an RNA sequence will generate a 4^X × 4^Y feature vector. In this study, X, Y, and k were set to 1, 2, or 3; and eight XYK combinations (except for 3mer-kspaced-3mer) were used for encoding. The PyFeat tool developed by Rafsanjani et al. (Muhammod et al., 2019) was used to convert RNA sequences into vectors.
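The counting scheme can be sketched as below. This is not the PyFeat implementation: the function name is hypothetical, and dividing counts by the number of windows is an assumption of this example.

```python
from itertools import product


def xmer_kspaced_ymer(seq, x, y, k, alphabet="ACGU"):
    """Map each Xmer/Ymer pair (written like 'AGU@AU') to its frequency in seq."""
    gap = "@" * k
    counts = {
        "".join(left) + gap + "".join(right): 0
        for left in product(alphabet, repeat=x)
        for right in product(alphabet, repeat=y)
    }
    span = x + k + y
    windows = max(len(seq) - span + 1, 0)
    for start in range(windows):
        left = seq[start:start + x]
        right = seq[start + x + k:start + span]
        counts[left + gap + right] += 1
    # Normalizing by the number of windows is an assumption of this sketch.
    return {key: (n / windows if windows else 0.0) for key, n in counts.items()}
```

For example, `xmer_kspaced_ymer("AGUCAU", 3, 2, 1)` returns 4^3 × 4^2 = 1024 entries, of which only `AGU@AU` is non-zero.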

Feature Selection

Feature selection is an effective way to remove redundant information and prevent over-fitting in machine learning modeling (Tang et al., 2017; Xu et al., 2018a; Cheng et al., 2019a; Liu, 2019; Sun et al., 2019; Yu et al., 2019). Several feature selection technologies, including ANOVA (Lv et al., 2019b) and MRMD (Zou et al., 2016), have been developed and are widely used for DNA, RNA, and protein identification (Xu et al., 2018b). In this work, an LGBM (Ke et al., 2017)¹ wrapper was used to select appropriate feature spaces for model training. In this process, raw training data were fed into the LGBM model and their features were ranked by importance value as calculated with the LGBM algorithm. Features with importance values greater than the average were selected to compose the feature space for modeling.
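The thresholding step can be sketched in a few lines. The importance values are taken as given here; in practice they could come from a fitted LightGBM model, for instance the `feature_importances_` attribute of `lightgbm.LGBMClassifier`, although the text does not say which interface the authors used. The function name is hypothetical.

```python
def select_above_average(importances):
    """Indices of features whose importance exceeds the mean, most important first."""
    mean = sum(importances) / len(importances)
    kept = [i for i, value in enumerate(importances) if value > mean]
    return sorted(kept, key=lambda i: importances[i], reverse=True)
```

With importances `[0, 5, 1, 10, 0, 2]` the mean is 3, so only features 3 and 1 are retained, in that order.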

Model Evaluation Metrics and Methods

The proposed models were evaluated by five commonly used metrics: accuracy (ACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (auROC). ACC, Sn, Sp, and MCC were calculated using the following equations (Cheng et al., 2016, 2019b,c; Wei et al., 2017d, e; Liu et al., 2019a):

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. For the ROC curve, 1-specificity was plotted on the horizontal axis, and sensitivity on the vertical axis.
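As a worked example, the sketch below computes ACC, Sn, Sp, and MCC from confusion-matrix counts using the standard definitions. The function name is hypothetical. The counts in the usage note are not reported in the paper; they are derived from the Sn and Sp values given for the H. sapiens RF model in Table 2 and the 100 positive and 100 negative independent test samples.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}
```

With TP = 78, FN = 22, TN = 72, and FP = 28 this gives ACC = 0.75 and MCC ≈ 0.501, in agreement with the independent testing row for the H. sapiens RF model in Table 2.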

LOO, K-Fold cross-validation, and independent testing are the most widely used methods for predictor evaluation (Mrozek et al., 2015; Cao and Cheng, 2016; Chen et al., 2017, 2018a, 2019b; Pan et al., 2017; He et al., 2018, 2019; Jiang et al., 2018; Xiong et al., 2018; Yu et al., 2018; Zhang et al., 2018; Ding et al., 2019; Feng et al., 2019; Kong and Zhang, 2019; Li and Liu, 2019; Lv et al., 2019a; Manavalan et al., 2019; Shan et al., 2019; Wang et al., 2019a; Wei et al., 2019a, b; Xu et al., 2019; Yu and Dai, 2019). That is, the general machine learning evaluation stages (training, validation, and testing) were used to evaluate the optimized model. In LOO cross-validation of a data set containing n items, n-1 items are used for training and the remaining one is used for validation. This procedure is repeated until every sequence in the training data set has been used once as the validation sample. LOO cross-validation is robust but time-consuming for a large data set. To compare the performance of the model with that of existing predictors, 10-Fold cross-validation was also used. The training data set was randomly divided into 10 subsets, with one subset for validation and the remaining nine for training. This process was repeated 10 times, so that each subset served once for validation, and the average results were used to evaluate the model. Finally, independent testing was performed on a data set completely distinct from the training data set to evaluate the trained model.
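The two splitting schemes can be sketched as index generators. The function names, the fixed random seed, and the way samples are dealt into folds are assumptions of this example, not details reported by the authors.

```python
import random


def leave_one_out(n):
    """Yield (train, validation) index lists; each sample is held out once."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, [held_out]


def k_fold(n, k=10, seed=0):
    """Yield (train, validation) index lists for k randomly assigned folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal shuffled indices into k folds
    for held_out in range(k):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, folds[held_out]
```

For the 990 H. sapiens training sequences, LOO requires 990 model fits whereas 10-fold cross-validation requires 10, each validated on 99 sequences, which illustrates why LOO is the slower option.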

Algorithm

The random forest method is a bagging-type ensemble learning algorithm (Cheng et al., 2018a, b). By combining multiple weak classifiers, the final results can be voted or averaged to obtain an overall model with higher accuracy, better general performance, and resistance to overfitting. This algorithm has been extensively used in bioinformatics and other areas, and has been confirmed to be an effective modeling technique in various domains (Ding et al., 2016a,b; Mrozek et al., 2016; Qiu et al., 2016; Wang et al., 2017; Wei et al., 2017a,b,c; Yu et al., 2017a; Zheng et al., 2017; Tang et al., 2018, 2019a; Xue et al., 2018; Degenhardt et al., 2019; Xu et al., 2019). In this study, the scikit-learn toolkit, available at https://scikit-learn.org, was used to establish the models.
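The two ingredients of bagging mentioned above, resampling the training set for each tree and voting over the per-tree predictions, can be sketched as follows. This is an illustration with hypothetical function names, not the scikit-learn implementation, and it omits the tree fitting itself.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement, as done for each tree in bagging."""
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(predictions):
    """Combine the class labels predicted by all trees for one sample."""
    return Counter(predictions).most_common(1)[0][0]
```

For instance, if five trees predict the labels 1, 0, 1, 1, 0 for a candidate site, `majority_vote` returns 1.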

The support vector machine (Cortes and Vapnik, 1995) is a generalized linear classifier trained by supervised learning; its decision boundary is the maximum-margin hyperplane fitted to the training samples. SVM has been widely used in a variety of fields (Xiong et al., 2012; Ding et al., 2017; Yu et al., 2017b; Fu et al., 2018; Fang et al., 2019; Lai et al., 2019; Meng et al., 2019; Shen et al., 2019; Tang et al., 2019b; Zhang et al., 2019; Zhu et al., 2019). Here, it was used for modeling comparisons. SVM was also implemented via the scikit-learn toolkit, using the Gaussian radial basis function kernel, with the critical hyper-parameters (C and γ) optimized over the range from 10^-6 to 10^6 in multiplicative steps of 10^0.5.
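The search grid described above can be written out explicitly. This sketch assumes that both end points are included and that C and γ are searched over the same values; neither detail is stated in the text.

```python
# Exponents -6.0, -5.5, ..., 6.0, i.e. multiplicative steps of 10**0.5.
exponents = [e / 2 for e in range(-12, 13)]
grid_values = [10.0 ** e for e in exponents]

# Every (C, gamma) pair that an exhaustive grid search would evaluate.
param_grid = [(c, gamma) for c in grid_values for gamma in grid_values]
```

Under these assumptions each hyper-parameter takes 25 values, giving 625 candidate (C, γ) pairs per species.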

Results and Discussion

Optimization With Different Feature Spaces

To determine optimal feature spaces, we first used the LGBM algorithm to sort the features from maximum to minimum according to their importance value. All the features with importance value greater than the average were kept. Second, we used an incremental feature selection strategy; as shown in Figure 2A, the 10-Fold cross-validation and independent testing accuracy varied as features were added. Initially, the accuracy increased rapidly for each species. As shown in Figure 2 (A1) and Figure 2 (A2), when the feature dimensions for H. sapiens and S. cerevisiae reached 257 and 397, the model achieved maximum independent testing accuracies of 75.0 and 77.0%, respectively. Owing to the lack of independent test data sets for M. musculus, Figure 2 (A3) shows only the cross-validation accuracy curve, with its peak value (74.8%) at a feature dimension of 161. The optimal feature space dimensions selected for each species were 257, 397, and 161, respectively. These values were used for further experiments and optimization.
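The incremental feature selection strategy can be sketched as a loop over growing prefixes of the ranked feature list. The `score` callback is hypothetical: it stands for training a random forest on the given features and returning its accuracy, and adding one feature per step is an assumption of this example.

```python
def incremental_feature_selection(ranked_features, score):
    """Grow the feature set one ranked feature at a time and keep the best prefix."""
    best_size, best_score = 0, float("-inf")
    for size in range(1, len(ranked_features) + 1):
        current = score(ranked_features[:size])
        if current > best_score:  # ties keep the smaller feature set
            best_size, best_score = size, current
    return ranked_features[:best_size], best_score
```

The size of the returned prefix corresponds to the optimal feature dimension reported for each species (257, 397, and 161).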

FIGURE 2

Comparison With SVM Predictors

Given that PPUS (Li et al., 2015b), iRNA-PseU (Chen et al., 2016), and PseUI (He et al., 2018) were all based on SVM, an optimized SVM model for pseudouridine site identification with the same feature spaces as the RF model was constructed to determine the effects of the SVM and RF algorithms on prediction performance. The performances of the two models are shown in Table 2. Overall, the models based on RF showed markedly better performance than those based on SVM. For instance, in terms of 10-Fold cross-validation accuracy, the RF models for H. sapiens, S. cerevisiae, and M. musculus outperformed the corresponding SVM models by 3.71%, 10.8%, and 5.80% in relative terms, respectively. The independent testing accuracy scores showed an even greater contrast. For example, the RF model had 75.0% accuracy for H. sapiens, approximately 1.17 times that of the SVM model (64.0%). The ROC curve and auROC value shown in Figure 2B also demonstrate that the optimized RF models performed better than the optimized SVM models for the same feature spaces. Thus, non-SVM models such as XG-PseU (Liu et al., 2019b), iPseU-CNN (Tahir et al., 2019), and our RF-PseU model might be more suitable for distinguishing pseudouridine sites from non-pseudouridine sites.

TABLE 2

Species	Algorithm	10-fold cross-validation					Independent testing
		ACC	MCC	Sn	Sp	auROC	ACC	MCC	Sn	Sp	auROC
H. sapiens	SVM	62.0%	0.240	61.4%	62.6%	0.656	64.0%	0.280	66.0%	62.0%	0.679
	RF	64.3%	0.287	66.1%	62.6%	0.700	75.0%	0.501	78.0%	72.0%	0.800
S. cerevisiae	SVM	67.5%	0.352	73.7%	61.2%	0.720	72.5%	0.45	73.0%	73.0%	0.786
	RF	74.8%	0.497	77.2%	72.4%	0.810	77.0%	0.540	75.0%	79.0%	0.838
M. musculus	SVM	70.7%	0.42	65.9%	75.4%	0.759	/	/	/	/	/
	RF	74.8%	0.50	73.1%	76.5%	0.796	/	/	/	/	/

Cross-validation and independent testing scores of two different classifiers for three species.

Comparison With Previous Predictors

The performance of RF-PseU was also compared with that of state-of-the-art predictors including iRNA-PseU (Chen et al., 2016), PseUI (He et al., 2018), iPseU-CNN (Tahir et al., 2019), and XG-PseU (Liu et al., 2019b). First, we compared the evaluation scores for the three species. Table 3 compares the cross-validation and independent testing scores for the state-of-the-art pseudouridine site predictors with those of RF-PseU. In terms of cross-validation scores, the LOO accuracy values for S. cerevisiae and M. musculus were 75.4% and 74.5%, respectively, representing increments of approximately 10.5% and 3.47% over the values for the existing predictor (XG-PseU) with the best cross-validation score. However, the LOO accuracy of RF-PseU for H. sapiens, at 64.0%, showed a decrease of 4.0% compared with the best H. sapiens pseudouridine site predictor, iPseU-CNN. In terms of independent testing, as shown in Table 3, RF-PseU scored higher than the existing predictors in all aspects. For comprehensive comparison, the average scores for different species were calculated. The results, shown in Table 4, demonstrate that RF-PseU performed better overall than the other four predictors. The cross-validation accuracy scores of RF-PseU were 3.48% higher than those of the best existing predictor, iPseU-CNN; in terms of independent testing scores, RF-PseU showed a marked improvement of 4.7–10.6% compared with iPseU-CNN. The overall performance of RF-PseU was also significantly better than that of the other predictors, indicating that RF-PseU can discriminate true pseudouridine sites from non-pseudouridine sites more precisely than the existing predictors.

TABLE 3

Species	Classifier	Cross-validation					Independent testing
		ACC	MCC	Sn	Sp	auROC	ACC	MCC	Sn	Sp	auROC
H. sapiens	iRNA-PseU(LOO)^a	60.4%	0.21	61.0%	59.8%	0.640	65.0%	0.30	60.0%	70.0%	/
	PseUI(LOO)^a	64.2%	0.28	64.9%	63.6%	0.68	65.5%	0.31	63.0%	68.0%	/
	iPseU-CNN(5F)^b	66.7%	0.34	65.0%	68.8%	/	69.0%	0.40	77.7%	60.8%	/
	XG-PseU (10F)^c	66.1%	0.32	63.5%	68.7%	0.700	67.5%	/	/	/	/
	RF-PseU(10F)^d	64.3%	0.29	66.1%	62.6%	0.700	75.0%	0.50	78.0%	72.0%	0.800
	RF-PseU(LOO)^e	64.0%	0.29	65.9%	62.6%	0.694	74.0%	0.48	74.0%	74.0%	0.814
S. cerevisiae	iRNA-PseU(LOO)	64.5%	0.29	64.7%	64.3%	0.81	60.0%	0.20	63.0%	57.0%	/
	PseUI(LOO)	64.1%	0.30	64.7%	67.5%	0.69	68.5%	0.37	65.0%	72.0%	/
	iPseU-CNN(5F)	68.2%	0.37	66.4%	70.5%	/	73.5%	0.47	68.8%	77.8%	/
	XG-PseU(10F)	68.2%	0.37	66.8%	69.5%	0.77	71.0%	/	/	/	/
	RF-PseU(10F)	74.8%	0.49	77.2%	72.4%	0.810	77.0%	0.54	75.0%	79.0%	0.838
	RF-PseU(LOO)	75.8%	0.52	78.2%	73.4%	0.819	74.5%	0.49	70.0%	79.0%	0.823
M. musculus	iRNA-PseU(LOO)	69.1%	0.38	73.3%	64.8%	0.75	/	/	/	/	/
	PseUI(LOO)	70.4%	0.41	79.9%	70.3%	0.71	/	/	/	/	/
	iPseU-CNN(5F)	71.8%	0.44	74.8%	69.1%	/	/	/	/	/	/
	XG-PseU(10F)	72.0%	0.45	76.5%	67.6%	0.74	/	/	/	/	/
	RF-PseU(10F)	74.8%	0.50	73.1%	76.5%	0.796	/	/	/	/	/
	RF-PseU(LOO)	74.5%	0.48	72.7%	75.2%	0.794	/	/	/	/	/

Comparison of cross-validation and independent testing scores of existing state-of-the-art pseudouridine site predictors and RF-PseU.

^aPredictors based on support vector machine, leave-one-out cross-validation (LOO); ^bPredictors based on convolutional neural networks, five-fold cross-validation (5F); ^cPredictors based on XGboost, 10-fold cross-validation (10F); ^dPredictors based on random forest, 10-fold cross-validation (10F); ^ePredictors based on random forest, leave-one-out cross-validation (LOO).

TABLE 4

Scores type	RF-PseU (10 Fold^c)	RF-PseU (LOO^d)	iRNA-PseU (LOO)	PseUI (LOO)	iPseU-CNN (5 Fold^e)	XG-PseU (10 Fold)
Cross-validation^a	71.3%	71.4%	64.7%	66.2%	68.9%	68.7%
Independent testing^b	76.0%	74.7%	62.5%	67.0%	71.3%	69.3%

Comparison of average accuracies for state-of-the-art predictors.

^aAverage values of H. sapiens, S. cerevisiae and M. musculus; ^bAverage values of H. sapiens and S. cerevisiae; ^cmodel with 10-fold cross-validation; ^dmodel with leave-one-out cross-validation; ^emodel with five-fold cross-validation.
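The percentage increments quoted in the text are relative gains, not differences in percentage points. The short sketch below, with a hypothetical helper name, reproduces several of them from the accuracies in Tables 2 and 4.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Table 4: RF-PseU (LOO) versus iPseU-CNN.
cv_gain = round(relative_gain(71.4, 68.9), 2)     # cross-validation
test_gain = round(relative_gain(74.7, 71.3), 2)   # independent testing

# Table 2: RF versus SVM, 10-fold cross-validation accuracy for H. sapiens.
rf_vs_svm = round(relative_gain(64.3, 62.0), 2)
```

These evaluate to 3.63, 4.77, and 3.71, matching the increments reported in the Abstract and in the comparison with SVM predictors.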

Web Server Implementation

For convenience, a webserver with an easy-to-use interface was developed (see screenshot in Figure 3), which can be accessed freely at http://148.70.81.170:10228/rfpseu. A step-by-step user guide is given here. First, users select a species from the drop-down box and paste or type the query RNA sequences in FASTA format into the textbox. Second, after clicking the submit button, the query results will be shown in a table on the same page after a wait. Note that once a query task has been submitted, the submit button will be disabled. Third, the user can click the clear button to empty the input text box and enable the submit button, and return to step one to enter a new query task.

FIGURE 3

Conclusion

In this study, a new model, named RF-PseU, for predicting RNA pseudouridine sites in multiple species is presented. For given feature spaces, the random forest algorithm was shown to be more efficient than SVM models for discriminating pseudouridine sites from non-pseudouridine sites. In terms of average cross-validation and independent testing accuracy scores, RF-PseU showed improvements of 3.6–10% and 4.8–21%, respectively, compared with state-of-the-art predictors. Moreover, a web server with a user-friendly interface is available. It is anticipated that RF-PseU will be a useful tool for RNA pseudouridine site analysis. However, the model requires further development via combination with other technologies before it is suitable for use as a classifier for RNA pseudouridine sites. Future work will explore emerging methods such as Gene2Vec (Zou et al., 2019), m6Acomet (Wu et al., 2019), and iterative feature representation (Wei et al., 2019b) to improve the model’s performance.

Statements

Data availability statement

Publicly available datasets were analyzed in this study. These data can be found here: http://lin-group.cn/server/iRNAPseu/data.

Author contributions

ZL and JZ were responsible for the experiments and manuscript preparation. HD participated in discussions. QZ supervised all procedures.

Funding

The work was supported by the National Natural Science Foundation of China (Nos. 61922020, 61771331, and 91935302), and the Scientific Research Foundation in Shenzhen (JCYJ20170818100431895 and JCYJ20180306172207178).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

1.^https://lightgbm.readthedocs.io

References

1. Agris, P. F. (2008). Bringing order to translation: the contributions of transfer RNA anticodon-domain modifications. Embo Rep. 9, 629–635. doi: 10.1038/embor.2008.104
2. Boccaletto, P., Machnicka, M. A., Purta, E., Piatkowski, P., Baginski, B., and Wirecki, T. K. (2018). MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res. 46, D303–D307. doi: 10.1093/nar/gkx1030
3. Cao, R., and Cheng, J. (2016). Protein single-model quality assessment by feature-based probability density functions. Sci. Rep. 6:23990. doi: 10.1038/srep23990
4. Carlile, T. M., Rojas-Duran, M. F., and Gilbert, W. V. (2015). Pseudo-seq: genome-wide detection of pseudouridine modifications in RNA. Methods Enzymol. 560, 219–245. doi: 10.1016/bs.mie.2015.03.011
5. Carlile, T. M., Rojas-Duran, M. F., Zinshteyn, B., Shin, H., Bartoli, K. M., and Gilbert, W. V. (2014). Pseudouridine profiling reveals regulated mRNA pseudouridylation in yeast and human cells. Nature 515, 143–146. doi: 10.1038/nature13802
6. Chen, K. Q., Wei, Z., Zhang, Q., Wu, X. Y., Rong, R., Lu, Z. L., et al. (2019a). WHISTLE: a high-accuracy map of the human N-6-methyladenosine (m(6)A) epitranscriptome predicted using a machine learning approach. Nucleic Acids Res. 47:e41. doi: 10.1093/nar/gkz074
7. Chen, W., Ding, H., Zhou, X., Lin, H., and Chou, K. C. (2018a). iRNA(m6A)-PseDNC: identifying N-6-methyladenosine sites using pseudo dinucleotide composition. Anal. Biochem. 561–562, 59–65. doi: 10.1016/j.ab.2018.09.002
8. Chen, W., Feng, P. M., Ding, H., Lin, H., and Chou, K. C. (2015). iRNA-Methyl: identifying N-6-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem. 490, 26–33. doi: 10.1016/j.ab.2015.08.021
9. Chen, W., Feng, P. M., Yang, H., Ding, H., Lin, H., and Chou, K. C. (2018b). iRNA-3typeA: identifying three types of modification at RNA’s adenosine sites. Mol. Ther. Nucleic Acids 11, 468–474. doi: 10.1016/j.omtn.2018.03.012
10. Chen, W., Lv, H., Nie, F., and Lin, H. (2019b). i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics (Oxf. Engl.) 35, 2796–2800. doi: 10.1093/bioinformatics/btz015
11. Chen, W., Tang, H., Ye, J., Lin, H., and Chou, K. C. (2016). iRNA-PseU: identifying RNA pseudouridine sites. Mol. Ther. Nucleic Acids 5:e332. doi: 10.1038/mtna.2016.37
12. Chen, W., Yang, H., Feng, P., Ding, H., and Lin, H. (2017). iDNA4mC: identifying DNA N-4-methylcytosine sites based on nucleotide chemical properties. Bioinformatics 33, 3518–3523. doi: 10.1093/bioinformatics/btx479
13. Cheng, L., and Hu, Y. (2018). Human disease system biology. Curr. Gene Ther. 18, 255–256. doi: 10.2174/1566523218666181101143116
14. Cheng, L., Hu, Y., Sun, J., Zhou, M., and Jiang, Q. (2018a). DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics 34, 1953–1956. doi: 10.1093/bioinformatics/bty002
15. Cheng, L., Jiang, Y., Ju, H., Sun, J., Peng, J., Zhou, M., et al. (2018b). InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics 19(Suppl. 1):919. doi: 10.1186/s12864-017-4338-6
16. Cheng, L., Qi, C., Zhuang, H., Fu, T., and Zhang, X. (2019a). gutMDisorder: a comprehensive database for dysbiosis of the gut microbiota in disorders and interventions. Nucleic Acids Res. 48(Suppl. 1):gkz843. doi: 10.1093/nar/gkz843
17. Cheng, L., Sun, J., Xu, W. Y., Dong, L. X., Hu, Y., and Zhou, M. (2016). OAHG: an integrated resource for annotating human genes with multi-level ontologies. Sci. Rep. 6, 1–9. doi: 10.1038/srep34820
18. Cheng, L., Yang, H., Zhao, H., Pei, X., Shi, H., Sun, J., et al. (2019b). MetSigDis: a manually curated resource for the metabolic signatures of diseases. Brief. Bioinform. 20, 203–209. doi: 10.1093/bib/bbx103
19. Cheng, L., Zhuang, H., Ju, H., Yang, S., Han, J., Tan, R., et al. (2019c). Exposing the causal effect of body mass index on the risk of type 2 diabetes mellitus: a mendelian randomization study. Front. Genet. 10:94. doi: 10.3389/fgene.2019.00094
20. Cohn, W. E. (1960). Pseudouridine, a carbon-carbon linked ribonucleoside in ribonucleic acids: isolation, structure, and chemical characteristics. J. Biol. Chem. 235, 1488–1498. doi: 10.1002/jbmte.390020410
21. Cortes, C., and Vapnik, V. (1995). Support-vector networks. Mach. Learn. 20, 273–297. doi: 10.1007/BF00994018
22. Degenhardt, F., Seifert, S., and Szymczak, S. (2019). Evaluation of variable selection methods for random forests and omics data sets. Brief. Bioinform. 20, 492–503. doi: 10.1093/bib/bbx124
23. Ding, Y., Tang, J., and Guo, F. (2016a). Identification of protein–protein interactions via a novel matrix-based sequence representation model with amino acid contact information. Int. J. Mol. Sci. 17:1623. doi: 10.3390/ijms17101623
24. Ding, Y., Tang, J., and Guo, F. (2016b). Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinformatics 17:398. doi: 10.1186/s12859-016-1253-9
25. Ding, Y., Tang, J., and Guo, F. (2017). Identification of drug-target interactions via multiple information integration. Inf. Sci. 418–419, 546–560. doi: 10.1016/j.ins.2017.08.045
26. Ding, Y., Tang, J., and Guo, F. (2019). Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing 325, 211–224. doi: 10.1016/j.neucom.2018.10.028
27. Fang, T., Zhang, Z., Sun, R., Zhu, L., He, J., Huang, B., et al. (2019). RNAm5CPred: prediction of RNA 5-methylcytosine sites based on three different kinds of nucleotide composition. Mol. Ther. Nucleic Acids 18, 739–747. doi: 10.1016/j.omtn.2019.10.008
28. Feng, P., Yang, H., Ding, H., Lin, H., Chen, W., and Chou, K.-C. (2019). iDNA6mA-PseKNC: identifying DNA N-6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 111, 96–102. doi: 10.1016/j.ygeno.2018.01.005
29. Feng, P. M., Ding, H., Chen, W., and Lin, H. (2016). Identifying RNA 5-methylcytosine sites via pseudo nucleotide compositions. Mol. Biosyst. 12, 3307–3311. doi: 10.1039/c6mb00471g
30. Fu, J., Tang, J., Wang, Y., Cui, X., Yang, Q., Hong, J., et al. (2018). Discovery of the consistently well-performed analysis chain for SWATH-MS Based pharmacoproteomic quantification. Front. Pharmacol. 9:681. doi: 10.3389/fphar.2018.00681
31. Fustin, J.-M., Doi, M., Yamaguchi, Y., Hida, H., Nishimura, S., Yoshida, M., et al. (2013). RNA-methylation-dependent RNA processing controls the speed of the circadian clock. Cell 155, 793–806. doi: 10.1016/j.cell.2013.10.026
32. Goodwin, S., McPherson, J. D., and McCombie, W. R. (2016). Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351. doi: 10.1038/nrg.2016.49
33. He, J. J., Fang, T., Zhang, Z. Z., Huang, B., Zhu, X. L., and Xiong, Y. (2018). PseUI: pseudouridine sites identification based on RNA sequence information. BMC Bioinformatics 19:11. doi: 10.1186/s12859-018-2321-0
34. He, W., Jia, C., and Zou, Q. (2019). 4mCPred: machine learning methods for DNA N-4-methylcytosine sites prediction. Bioinformatics 35, 593–601. doi: 10.1093/bioinformatics/bty668
35. Helm, M. (2006). Post-transcriptional nucleotide modification and alternative folding of RNA. Nucleic Acids Res. 34, 721–733. doi: 10.1093/nar/gkj471
36. Hu, Y., Zhao, T., Zhang, N., Zhang, Y., and Cheng, L. (2019). A review of recent advances and research on drug target identification methods. Curr. Drug Metab. 20, 209–216. doi: 10.2174/1389200219666180925091851
37. Jiang, L., Ding, Y., Tang, J., and Guo, F. (2018). MDA-SKF: similarity kernel fusion for accurately discovering miRNA-disease association. Front. Genet. 9:618. doi: 10.3389/fgene.2018.00618
38. Karijolich, J., Yi, C. Q., and Yu, Y. T. (2015). Transcriptome-wide dynamics of RNA pseudouridylation. Nat. Rev. Mol. Cell Biol. 16, 581–585. doi: 10.1038/nrm4040
39. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). “LightGBM: a highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems 30, eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, and S. Vishwanathan (Red Hook, NY: Curran Associates, Inc), 3146–3154.
- Google Scholar
40
KongL.ZhangL. (2019). i6mA-DNCP: computational identification of DNA N6-methyladenine sites in the rice genome using optimized dinucleotide-based features.Genes10:828. 10.3390/genes10100828
41
LaiH. Y.ZhangZ. Y.SuZ. D.SuW.DingH.ChenW.et al (2019). iProEP: a computational predictor for predicting promoter.Mol. Ther. Nucleic Acids17337–346. 10.1016/j.omtn.2019.05.028
42
LiB.TangJ.YangQ.LiS.CuiX.LiY. (2017). NOREVA: normalization and evaluation of MS-based metabolomics data.Nucleic Acids Res.45W162–W170. 10.1093/nar/gkx449
43
LiC.-C.LiuB. (2019). MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks.Brief. Bioinform.bbz133. 10.1093/bib/bbz133
44
LiX. Y.ZhuP.MaS. Q.SongJ. H.BaiJ. Y.SunF. F.et al (2015a). Chemical pulldown reveals dynamic pseudouridylation of the mammalian transcriptome.Nat. Chem. Biol.11592–597. 10.1038/nchembio.1836
45
LiY. H.ZhangG. G.CuiQ. H. (2015b). PPUS: a web server to predict PUS-specific pseudouridine sites.Bioinformatics313362–3364. 10.1093/bioinformatics/btv366
46
LiY. Z.FanY.-X.YangH.-H. (2018). KELMPSP: pseudouridine sites identification based on kernel extreme learning machine.Chin. J. Biochem. Mol. Biol.34785–793. 10.13865/j.cnki.cjbmb2018.07.14
- CrossRef
- Google Scholar
47
LibbrechtM. W.NobleW. S. (2015). Machine learning applications in genetics and genomics.Nat. Rev. Genet.16321–332. 10.1038/nrg3920
48
LiuB. (2019). BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches.Brief. Bioinform.201280–1294. 10.1093/bib/bbx165
49
LiuB.LiC.-C.YanK. (2019a). DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks.Brief. Bioinform.bbz098. 10.1093/bib/bbz098
50
LiuB.LiuF. L.WangX. L.ChenJ. J.FangL. Y.ChouK. C. (2015a). Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences.Nucleic Acids Res.43W65–W71. 10.1093/nar/gkv458
51
LiuK.ChenW.LinH. (2019b). XG-PseU: an eXtreme gradient boosting based method for identifying pseudouridine sites.Mol. Genet. Genomics29513–21. 10.1007/s00438-019-01600-9
52
LiuN.DaiQ.ZhengG.HeC.ParisienM.PanT. (2015b). N-6-methyladenosine-dependent RNA structural switches regulate RNA-protein interactions.Nature518560–564. 10.1038/nature14234
53
LiuS.ZhengB.ShengY.KongQ.JiangY.YangY.et al (2019c). Identification of cancer dysfunctional subpathways by integrating DNA methylation, copy number variation, and gene-expression data.Front. Genet.10:441. 10.3389/fgene.2019.00441
54
LvH.DaoF.-Y.GuanZ.-X.ZhangD.TanJ.-X.Jiu-XinTan (2019a). iDNA6mA-Rice: a computational tool for detecting N6-methyladenine sites in rice.Front. Genet.10:793. 10.3389/fgene.2019.00793
55
LvZ.JinS.DingH.ZouQ. (2019b). A random forest sub-Golgi protein classifier optimized via dipeptide and amino acid composition features.Front. Bioeng. Biotechnol.7:215. 10.3389/fbioe.2019.00215
56
ManavalanB.BasithS.ShinT. H.WeiL.LeeG. (2019). Meta-4mCpred: a sequence-based meta-predictor for accurate DNA 4mC site prediction using effective feature representation.Mol. Ther. Nucleic Acids16733–744. 10.1016/j.omtn.2019.04.019
57
MengC. L.JinS. S.WangL.GuoF.ZouQ. (2019). AOPs-SVM: a sequence-based classifier of antioxidant proteins using a support vector machine.Front. Bioeng. Biotechnol.7:10. 10.3389/fbioe.2019.00224
58
MeyerK. D.JaffreyS. R. (2014). The dynamic epitranscriptome: N-6-methyladenosine and gene expression control.Nat. Rev. Mol. Cell Biol.15313–326. 10.1038/nrm3785
59
MotorinY.HelmM. (2010). tRNA stabilization by modified nucleotides.Biochemistry494934–4944. 10.1021/bi100408z
60
MrozekD.GoskP.Malysiak-MrozekB. (2015). Scaling Ab initio predictions of 3D protein structures in microsoft azure cloud.J. Grid Comput.13561–585. 10.1007/s10723-015-9353-8
- CrossRef
- Google Scholar
61
MrozekD.Malysiak-MrozekB.SiaznikA. (2013). search GenBank: interactive orchestration and ad-hoc choreography of web services in the exploration of the biomedical resources of the national center For biotechnology information.BMC Bioinformatics14:18. 10.1186/1471-2105-14-73
62
MrozekD.SochaB.KozielskiS.Malysiak-MrozekB. (2016). An efficient and flexible scanning of databases of protein secondary structures.J. Intell. Inf. Syst.46213–233. 10.1007/s10844-014-0353-0
- CrossRef
- Google Scholar
63
MuhammodR.AhmedS.Md FaridD.ShatabdaS.SharmaA.DehzangiA. (2019). PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences.Bioinformatics (Oxf. Engl.)353831–3833. 10.1093/bioinformatics/btz165
64
NairA. S.SreenadhanS. P. (2006). A coding measure scheme employing electron-ion interaction pseudopotential (EIIP).Bioinformation1197–202.
- Google Scholar
65
PanG.TangJ.GuoF. (2017). Analysis of co-associated transcription factors via ordered adjacency differences on motif distribution.Sci. Rep.7:43597. 10.1038/srep43597
66
QiuW.-R.JiangS.-Y.XuZ.-C.XiaoX.ChouK.-C. (2017). iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition.Oncotarget841178–41188. 10.18632/oncotarget.17104
67
QiuW. R.SunB. Q.XiaoX.XuZ. C.ChouK. C. (2016). iPTM-mLys: identifying multiple lysine PTM sites and their different types.Bioinformatics323116–3123. 10.1093/bioinformatics/btw380
68
SaboohM. F.IqbalN.KhanM.KhanM.MaqboolH. F. (2018). Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou’s PseKNC.J. Theor. Biol.4521–9. 10.1016/j.jtbi.2018.04.037
69
SchwartzS.BernsteinD. A.MumbachM. R.JovanovicM.HerbstR. H.León-RicardoB. X. (2014). Transcriptome-wide mapping reveals widespread dynamic-regulated pseudouridylation of ncRNA and mRNA.Cell159148–162. 10.1016/j.cell.2014.08.028
70
ShanX.WangX.LiC. D.ChuY.ZhangY.XiongY. I. (2019). Prediction of CYP450 enzyme-substrate selectivity based on the network-based label space division method.J. Chem. Inf. Model.594577–4586. 10.1021/acs.jcim.9b00749
71
ShenY.TangJ.GuoF. (2019). Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC.J. Theor. Biol.462230–239. 10.1016/j.jtbi.2018.11.012
72
SongJ. H.YiC. Q. (2017). Chemical modifications to RNA: a new layer of gene expression regulation.ACS Chem. Biol.12316–325. 10.1021/acschembio.6b00960
73
SunW.HanY.YangS.ZhuangH.ZhangJ.ChengL. (2019). The assessment of Interleukin-18 on the risk of coronary heart disease.Med. Chem.10.2174/1573406415666191004115128[Epub ahead of print].
74
TahirM.TayaraH.ChongK. T. (2019). iPseU-CNN: identifying RNA pseudouridine sites using convolutional neural networks.Mol. Ther. Nucleic Acids16463–470. 10.1016/j.omtn.2019.03.010
75
TangH.CaoR.-Z.WangW.LiuT.-S.WangL.-M.HeC.-M. (2017). A two-step discriminated method to identify thermophilic proteins.Int. J. Biomath.10:1750050. 10.1142/s1793524517500504
- CrossRef
- Google Scholar
76
TangJ.FuJ.WangY.LiB.LiY.YangQ. (2019a). ANPELA: analysis and performance assessment of the label-free quantification workflow for metaproteomic studies.Brief. Bioinform.10:bby127. 10.1093/bib/bby127
77
TangJ.FuJ.WangY.LuoY.YangQ.LiB.et al (2019b). Simultaneous improvement in the precision, accuracy, and robustness of label-free proteome quantification by optimizing data manipulation chains.Mol. Cell. Proteomics181683–1699. 10.1074/mcp.RA118.001169
78
TangW.WanS. X.YangZ.TeschendorffA. E.ZouQ. (2018). Tumor origin detection with tissue-specific miRNA and DNA methylation markers.Bioinformatics34398–406. 10.1093/bioinformatics/btx622
79
WangP.ZhangX.FuT.LiS.LiB.XueW.et al (2017). Differentiating physicochemical properties between addictive and nonaddictive ADHD drugs revealed by molecular dynamics simulation studies.ACS Chem. Neurosci.81416–1428. 10.1021/acschemneuro.7b00173
80
WangX.LuZ.GomezA.HonG. C.YueY.HanD. (2014). N-6-methyladenosine-dependent regulation of messenger RNA stability.Nature505117–120. 10.1038/nature12730
81
WangX.ZhuX.YeM.WangY.LiC.-D.XiongY. (2019a). STS-NLSP: a network-based label space partition method for predicting the specificity of membrane transporter substrates using a hybrid feature of structural and semantic similarity.Front. Bioeng. Biotechnol.7:306. 10.3389/fbioe.2019.00306
82
WangY.ZhangS.LiF.ZhouY.ZhangY.WangZ. (2019b). Therapeutic target database 2020: enriched resource for facilitating research and early development of targeted therapeutics.Nucleic Acids Res.48D1031–D1041. 10.1093/nar/gkz981
83
WeiL.LuanS.NagaiL. A. E.SuR.ZouQ. (2019a). Exploring sequence-based features for the improved prediction of DNA N4-methylcytosine sites in multiple species.Bioinformatics351326–1333. 10.1093/bioinformatics/bty824
84
WeiL.SuR.LuanS.LiaoZ.ManavalanB.ZouQ.et al (2019b). Iterative feature representations improve N4-methylcytosine site prediction.Bioinformatics354930–4937. 10.1093/bioinformatics/btz408
85
WeiL.XingP.SuR.ShiG.MaZ.ZouQ. (2017a). CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency.J. Proteome Res.162044–2053. 10.1021/acs.jproteome.7b00019
86
WeiL.XingP.TangJ.ZouQ. (2017b). PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only.IEEE Trans. Nanobioscience16240–247. 10.1109/TNB.2017.2661756
87
WeiL. Y.TangJ. J.ZouQ. (2017c). Local-DPP: an improved DNA-binding protein prediction method by exploring local evolutionary information.Inf. Sci.384135–144. 10.1016/j.ins.2016.06.026
- CrossRef
- Google Scholar
88
WeiL. Y.WanS. X.GuoJ. S.WongK. K. L. (2017d). A novel hierarchical selective ensemble classifier with bioinformatics application.Artif. Intell. Med.8382–90. 10.1016/j.artmed.2017.02.005
89
WeiL. Y.XingP. W.ZengJ. C.ChenJ. X.SuR.GuoF. (2017e). Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier.Artif. Intell. Med.8367–74. 10.1016/j.artmed.2017.03.001
90
WinklerR.GillisE.LasmanL.SafraM.GeulaS.SoyrisC. (2019). m(6)A modification controls the innate immune response to infection by targeting type I interferons.Nat. Immunol.20173–182. 10.1038/s41590-018-0275-z
91
WuX. Y.WeiZ.ChenK. Q.ZhangQ.SuJ. L.LiuH.et al (2019). m6Acomet: large-scale functional prediction of individual m(6)A RNA methylation sites from an RNA co-methylation network.BMC Bioinformatics20:223. 10.1186/s12859-019-2840-3
92
XiongY.LiuJ.ZhangW.ZengT. (2012). Prediction of heme binding residues from protein sequences with integrative sequence profiles.Proteome Sci.10(Suppl. 1):S20. 10.1186/1477-5956-10-S1-S20
93
XiongY.WangQ.YangJ.ZhuX.WeiD. Q. (2018). PredT4SE-Stack: prediction of bacterial Type IV secreted effectors from protein sequences using a stacked ensemble method.Front. Microbiol.9:2571. 10.3389/fmicb.2018.02571
94
XuL.LiangG.LiaoC.ChenG.-D.ChangC.-C. (2018a). An efficient classifier for Alzheimer’s disease genes identification.Molecules23:3140. 10.3390/molecules23123140
95
XuL.LiangG.LiaoC.ChenG.-D.ChangC.-C. (2019). k-Skip-n-Gram-RF: a random forest based method for Alzheimer’s disease protein identification.Front. Genet.10:33. 10.3389/fgene.2019.00033
96
XuL.LiangG.WangL.LiaoC. (2018b). A novel hybrid sequence-based model for identifying anticancer peptides.Genes9:158. 10.3390/genes9030158
97
XueW.YangF.WangP.ZhengG.ChenY.YaoX.et al (2018). What contributes to serotonin-norepinephrine reuptake inhibitors’ dual-targeting mechanism? The key role of transmembrane domain 6 in human serotonin and norepinephrine transporters revealed by molecular dynamics simulation.ACS Chem. Neurosci.91128–1140. 10.1021/acschemneuro.7b00490
98
YinJ.SunW.LiF.HongJ.LiX.ZhouY. (2019). VARIDT 1.0: variability of drug transporter database.Nucleic Acids Res.48D1042–D1050. 10.1093/nar/gkz779
99
YuH.DaiZ. (2019). SNNRice6mA: a deep learning method for predicting DNA N6-methyladenine sites in rice genome.Front. Genet.10:1071. 10.3389/fgene.2019.01071
100
YuL.HuangJ. B.MaZ. X.ZhangJ.ZouY. P.GaoL. (2015). Inferring drug-disease associations based on known protein complexes.BMC Med. Genomics8:13. 10.1186/1755-8794-8-s2-s2
101
YuL.SuR.WangB.ZhangL.ZouY.ZhangJ.et al (2017a). Prediction of novel drugs for hepatocellular carcinoma based on multi-source random walk.IEEE ACM Trans. Comput. Biol. Bioinform.14966–977. 10.1109/tcbb.2016.2550453
102
YuL.YaoS.GaoL.ZhaY. (2019). Conserved disease modules extracted from multilayer heterogeneous disease and gene networks for understanding disease mechanisms and predicting disease treatments.Front. Genet.9:745. 10.3389/fgene.2018.00745
103
YuL.ZhaoJ.GaoL. (2017b). Drug repositioning based on triangularly balanced structure for tissue-specific diseases in incomplete interactome.Artif. Intell. Med.7753–63. 10.1016/j.artmed.2017.03.009
104
YuL.ZhaoJ.GaoL. (2018). Predicting potential drugs for breast cancer based on miRNA and tissue specificity.Int. J. Biol. Sci.14971–980. 10.7150/ijbs.23350
105
ZaringhalamM.PapavasiliouF. N. (2016). Pseudouridylation meets next-generation sequencing.Methods10763–72. 10.1016/j.ymeth.2016.03.001
106
ZhangM.LiF. Y.Marquez-LagoT. T.LeierA.FanC.KwohC. K.et al (2019). MULTiPly: a novel multi-layer predictor for discovering general and specific types of promoters.Bioinformatics352957–2965. 10.1093/bioinformatics/btz016
107
ZhangM.XuY.LiL.LiuZ.YangX.YuD.-J. (2018). Accurate RNA 5-methylcytosine site prediction based on heuristic physical-chemical properties reduction and classifier ensemble.Anal. Biochem.55041–48. 10.1016/j.ab.2018.03.027
108
ZhaoB. S.HeC. (2015). Pseudouridine in a new era of RNA modifications.Cell Res.25153–154. 10.1038/cr.2014.143
109
ZhaoX. W.ZhangY.NingQ.ZhangH. R.JiJ. C.YinM. H. (2019). Identifying N-6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer.J. Theor. Biol.46739–47. 10.1016/j.jtbi.2019.01.035
110
ZhengG.XueW.YangF.ZhangY.ChenY.YaoX.et al (2017). Revealing vilazodone’s binding mechanism underlying its partial agonism to the 5-HT1A receptor in the treatment of major depressive disorder.Phys. Chem. Chem. Phys.1928885–28896. 10.1039/c7cp05688e
111
ZhouY.ZengP.LiY. H.ZhangZ. D.CuiQ. H. (2016). SRAMP: prediction of mammalian N-6-methyladenosine (m(6)A) sites based on sequence-derived features.Nucleic Acids Res.44:e91. 10.1093/nar/gkw104
112
ZhuX.HeJ.ZhaoS.TaoW.XiongY.BiS. (2019). A comprehensive comparison and analysis of computational predictors for RNA N6-methyladenosine sites of Saccharomyces cerevisiae.Brief. Funct. Genomics18367–376. 10.1093/bfgp/elz018
113
ZouQ.XingP. W.WeiL. Y.LiuB. (2019). Gene2vec: gene subsequence embedding for prediction of mammalian N-6-methyladenosine sites from mRNA.RNA25205–218. 10.1261/rna.069112.118
114
ZouQ.ZengJ. C.CaoL. J.JiR. R. (2016). A novel features ranking metric with application to scalable visual and bioinformatics data classification.Neurocomputing173346–354. 10.1016/j.neucom.2014.12.123
- CrossRef
- Google Scholar

Summary

Keywords

pseudouridine sites, light gradient boosting, random forest, machine learning, RNA

Citation

Lv Z, Zhang J, Ding H and Zou Q (2020) RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Front. Bioeng. Biotechnol. 8:134. doi: 10.3389/fbioe.2020.00134

Received

17 January 2020

Accepted

10 February 2020

Published

26 February 2020

Volume

8 - 2020

Edited by

Yongchun Zuo, Inner Mongolia University, China

Reviewed by

Yi Xiong, Shanghai Jiao Tong University, China; Hongmin Cai, South China University of Technology, China

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jun Zhang, zhangjun13902003@163.com; Quan Zou, zouquan@nclab.net

This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology.

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
