Abstract
The 2-oxoglutarate/Fe(II)-dependent (2OG) oxygenase superfamily is mainly responsible for protein modification, nucleic acid repair and/or modification, and fatty acid metabolism, and plays important roles in cancer, cardiovascular disease, and other diseases. Its members are likely to become new targets for the treatment of cancer and other diseases, so the accurate identification of 2OG oxygenases is of great significance. Many computational methods have been proposed to predict functional proteins and thereby compensate for time-consuming and expensive experimental identification. However, machine learning has not yet been applied to the study of 2OG oxygenases. In this study, we developed OGFE_RAAC, a prediction model that identifies whether a protein is a 2OG oxygenase. To improve the performance of OGFE_RAAC, 673 reduced amino acid alphabets were used to recode the protein sequences and determine the optimal feature representation scheme. The 10-fold cross-validation test showed that the accuracy of the model in identifying 2OG oxygenases is 91.04%. In addition, the results on the independent dataset indicated that the model generalizes well and is robust. It is expected to become an effective tool for the identification of 2OG oxygenases. On further analysis, we also found that the function of 2OG oxygenases may be related to their polarity and hydrophobicity, which will help follow-up studies on the catalytic mechanism of 2OG oxygenases and the way they interact with their substrates. Based on the model we built, a user-friendly web server was established and can be freely accessed at http://bioinfor.imu.edu.cn/ogferaac.
Introduction
2-Oxoglutarate/Fe(II)-dependent (2OG) oxygenases (EC:1.14.11), which generally use nonheme iron as an active-site cofactor, couple substrate oxidation to the oxidative decarboxylation of 2-oxoglutarate, producing carbon dioxide and succinic acid (Hausinger, 2004; Hewitson et al., 2005; Islam et al., 2018). 2OG oxygenases, which can catalyze many different oxidation reactions, form a superfamily whose members are widely distributed in animals, plants, and microorganisms. In animals, their catalytic range includes hydroxylation and N-demethylation proceeding via hydroxylation; in plants and microbes, they catalyze a wider range of reactions, including hydroxylation, ring formation, cleavage, oxidation, rearrangement, desaturation, and halogenation (Farrow and Facchini, 2014; Kawai et al., 2014). The proteins of this superfamily can be divided into 2OG oxygenase domain-containing oxygenases and JmjC domain-containing oxygenases (Jia et al., 2017). Figure 1 is a schematic diagram of the structure of 2OG oxygenases.
FIGURE 1
Due to the diversity of 2OG oxygenases and the wide range of substrates they bind, these oxygenases play an important role in physiology and have high therapeutic potential as targets in cancer and many other diseases (Rose et al., 2011). For example, JMJD6, a JmjC domain-containing protein located in the nucleus, catalyzes lysine hydroxylation and arginine demethylation of histone and non-histone peptides (Chang et al., 2007; Liu et al., 2013). JMJD6 promotes cell proliferation and migration in vitro and accelerates tumor growth in vivo, so it may become an attractive target for a new generation of anticancer drugs (Lin et al., 2006; Lee et al., 2012). Prolyl 4-hydroxylase (P4H) plays a vital role in the synthesis of collagen and the regulation of oxygen homeostasis. Collagen P4Hs are considered attractive targets for inhibitor drugs in the treatment of fibrotic diseases and cancer metastasis (Vasta and Raines, 2018). Hypoxia-inducible transcription factor-prolyl 4-hydroxylase inhibitors are believed to have beneficial effects in the treatment of diseases such as myocardial infarction, stroke, peripheral vascular disease, diabetes, and severe anemias (Myllyharju, 2008; Liao and Zhang, 2020). ALKB homologs (ALKBH) can regulate the physiological and pathological processes of cardiovascular diseases (CVDs); they have great potential in the development of CVD drugs and are expected to become targets for the treatment of CVD (Xiao et al., 2020). Changes in the catalytic activity or expression level of lysine demethylases (KDMs) are closely related to many diseases, including cancer genesis and progression, neurological disorders, inflammatory and immune disorders, metabolic diseases, and regenerative diseases. Modulators/inhibitors of KDMs may be used as new treatments for cancer and other diseases (Arifuzzaman et al., 2020).
Therefore, it is particularly meaningful to predict 2OG oxygenases and to find more potential members of the superfamily. Because the experimental identification of 2OG oxygenases is time-consuming and expensive, machine learning offers a fast and effective way to predict them.
In the past, many machine learning methods for the prediction of metal ion-binding proteins have achieved excellent results. For example, Lin et al. (2006) applied a support vector machine (SVM) to sequence-derived information to predict metal ion-binding proteins and obtained good prediction results. Mohan et al. (2010) used a set of physicochemical parameters of the metal ion-binding proteins encoded by the three genes CzcA, CzcB, and CzcD as the training set of a supervised classifier, establishing a model to identify metal ion-binding proteins among unknown proteins. Valasatava et al. (2016) developed MetalPredator, a web server used to predict iron–sulfur cluster-binding proteomes, which performed well in terms of precision and recall. Many studies have also achieved good results in the prediction of metal ion-binding sites, including iron ion-binding sites (Liu and Hu, 2011; Liou et al., 2014), zinc ion-binding sites (Shu et al., 2008; Chen et al., 2013; Yan et al., 2019), and copper ion-binding sites (Levy et al., 2009; Brylinski and Skolnick, 2011). These studies indicate that machine learning is well suited to the study of metal ion-binding proteins (Valasatava et al., 2016). In addition, studies have shown that using the reduced amino acid cluster (RAAC) strategy to predict protein types can reduce noise and achieve higher accuracy (Zheng et al., 2019). The reduced amino acid approach has shown superior performance in the prediction of human and nonhuman enzymes (Wang H. et al., 2021), ion channel-targeted conotoxins (Sun et al., 2020), plasmodium secretory proteins (Zhang et al., 2020), and defensin peptides (Zuo et al., 2019).
In this study, we established an SVM-based prediction model that can effectively identify 2OG oxygenases. A feature representation scheme based on reduced amino acid clusters was used in this work. The RAAC strategy can greatly decrease the complexity of protein sequences and substantially reduce computer memory usage (Zuo et al., 2017; Zheng et al., 2019). The workflow for constructing OGFE_RAAC is shown in Figure 2. First, a benchmark dataset was established, which contains 734 2OG oxygenases from the InterPro database and 385,381 non-2OG oxygenases from SwissProt. Subsequently, reduced amino acid composition combined with the K-mer strategy was used to represent sequence features, and the optimal scheme was selected from 673 reduction schemes (Zuo et al., 2015). We then obtained the best feature combination through analysis of variance (ANOVA) combined with incremental feature selection (IFS) and applied SVM to establish the model. The results of 10-fold cross-validation and of the independent test set showed that OGFE_RAAC could accurately predict 2OG oxygenases.
FIGURE 2
Materials and Methods
Dataset
The 2OG oxygenase superfamily can be classified into 2OG oxygenase domain-containing oxygenases and JmjC domain-containing oxygenases, so we collected all 734 verified proteins containing these two domains (InterPro entries IPR005123 and IPR003347) from the InterPro public database as positive samples. Concurrently, 385,381 proteins from SwissProt, the manually annotated and reviewed section of UniProt, were gathered as negative samples. Then, CD-HIT (Huang et al., 2010) was used to remove sequences with more than 50% similarity (Zou et al., 2020), and 480 samples were selected as the training set (Fu et al., 2012). We chose 150 samples from the rest as the test set, and the dataset was named 2OG-SwissProt. To obtain a better model, we also constructed a dataset with iron-binding proteins as negative samples. We acquired 593 iron-binding proteins (GO:0005506, 2OG oxygenase proteins removed) from the InterPro public database and processed them in the same way as the 2OG-SwissProt dataset to obtain 471 training samples and 159 test samples; this dataset was named 2OG-Fe.
For further research, we manually extracted the domain sequences of the 2OG oxygenases and the iron-binding proteins. They were processed in the same way as above: to better verify the prediction results, the sequences retained by CD-HIT at a similarity threshold of 50% were used as the training set, and the rest were used as the independent test set. The independent test set contains 1,036 samples (621 positive and 415 negative), and the training set contains 283 samples (113 positive and 170 negative). This dataset was named 2OG-domain (Table 1).
TABLE 1
| Dataset | Group | Training set | Test set |
| 2OG-SwissProt | Positive | 240 | 75 |
| 2OG-SwissProt | Negative | 240 | 75 |
| 2OG-Fe | Positive | 240 | 75 |
| 2OG-Fe | Negative | 231 | 84 |
| 2OG-domain | Positive | 113 | 621 |
| 2OG-domain | Negative | 170 | 415 |
Data composition of each dataset.
Reduce Protein Sequence
Proteins are normally composed of 20 natural amino acids. Amino acids with similar characteristics can be merged based on their physicochemical properties and atomic arrangement. For instance, fuzzy clustering techniques and matrices have been used to cluster amino acids, so that a sequence can be rewritten in a new, smaller alphabet (Georgiou et al., 2009; Zuo and Li, 2009). The RAAC strategy can effectively reduce the complexity of the sequence and improve computational efficiency. In this study, we used 673 amino acid reduction schemes generated from 74 reduction types to predict 2OG oxygenases; each type provides reduced alphabets with sizes between 2 and 19 (Zuo et al., 2019; Zheng et al., 2020).
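As an illustration of the recoding step, the following minimal Python sketch rewrites a sequence with the type 33, size 15 alphabet listed in Table 2. The function names, the choice of the first letter of each cluster as its representative, and the handling of non-standard residues are assumptions made for this example; they are not taken from the OGFE_RAAC implementation.

```python
# Type 33, size 15 clusters, copied from Table 2 of this article.
TYPE33_SIZE15 = "S-T-A-N-D-G-RQ-EK-H-P-IV-L-M-WYF-C"


def build_reduction_map(scheme: str) -> dict:
    """Map every residue to a single letter that stands for its cluster."""
    mapping = {}
    for cluster in scheme.split("-"):
        representative = cluster[0]  # assumed convention: first letter of the cluster
        for residue in cluster:
            mapping[residue] = representative
    return mapping


def reduce_sequence(sequence: str, scheme: str) -> str:
    """Recode a protein sequence with a reduced alphabet.

    Residues that are not in the scheme (for example 'X') are skipped,
    which is an assumption made for this sketch.
    """
    mapping = build_reduction_map(scheme)
    return "".join(mapping[r] for r in sequence.upper() if r in mapping)


# Hypothetical toy peptide, not a real 2OG oxygenase fragment.
reduced = reduce_sequence("MKQEWYFIVL", TYPE33_SIZE15)
print(reduced)
```

In this toy example, K and E collapse into one letter, W, Y, and F into another, and I and V into a third, so the recoded sequence uses fewer distinct symbols than the original.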
Extract Features Based on K-mer
The typical K-mer (N-peptide) composition can effectively capture detailed information about the amino acid composition of a sequence (Zhu et al., 2019; Jaillard et al., 2020). We use K-mer (K = 1, 2, 3) to extract amino acid sequence information. Due to limited memory, the maximum K value is 3, and a total of $20^K$ features can be obtained from the original amino acid alphabet. The composition of K-mer (K = 2) can be expressed as follows:
$$P = R_1 R_2 R_3 \cdots R_L$$
$$D = [d_1, d_2, \ldots, d_{400}]^{T}$$
Here, $R_i$ represents the i-th residue of the protein sequence, and L represents the total length of the amino acid sequence. $d_i$ (i = 1, 2, …, 400) is the frequency of the i-th dipeptide among the 400 possible dipeptides, and T is the transposition operator. The $d_i$ can be calculated as follows:
$$d_i = \frac{n_i}{L - 1}$$
Here, $n_i$ denotes the number of occurrences of the i-th dipeptide in the sequence. Combined with the RAAC strategy, the feature extraction method can be expressed as follows:
where the expression denotes the N-peptide composition computed with a given RAAC descriptor. N denotes the peptide length (N-peptide), T denotes the type of reduced amino acid alphabet, and C denotes the cluster size of the reduced amino acid alphabet. Following the settings described above, the parameters are limited as follows: N takes the values 1, 2, or 3; T ranges over the 74 reduction types; and C ranges from 2 to 19.
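The K-mer composition over an arbitrary (full or reduced) alphabet can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the lexicographic ordering of the features, and the decision to ignore windows containing letters outside the alphabet are assumptions.

```python
from itertools import product


def kmer_composition(sequence: str, alphabet: str, k: int) -> list:
    """Return the frequency of every possible k-mer over `alphabet`.

    The vector has len(alphabet) ** k entries, ordered by alphabet position;
    each count is divided by the number of overlapping windows, L - k + 1.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1
    if windows <= 0:
        return [0.0] * len(kmers)
    for i in range(windows):
        word = sequence[i:i + k]
        if word in counts:  # ignore windows with letters outside the alphabet
            counts[word] += 1
    return [counts[w] / windows for w in kmers]


# Toy example with a hypothetical 3-letter reduced alphabet.
features = kmer_composition("ABABC", "ABC", 2)
print(len(features), features)
```

For K = 2 the denominator is L − 1, matching the dipeptide case above. The dimension grows as (alphabet size)^K, so a reduced alphabet of size 15 gives 15^3 = 3,375 tripeptide features instead of 20^3 = 8,000 for the full alphabet.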
Support Vector Machine
Support vector machine (SVM) is a supervised machine learning model for classification that has been widely used in bioinformatics (Beer, 2017; Huang et al., 2018; Manavalan et al., 2018; Meng et al., 2020; Tahir and Idris, 2020). Four kernel functions are commonly used: linear, polynomial, sigmoid, and radial basis function (RBF). The RBF kernel has performed well in previous protein prediction studies, and a comparison of the four kernels confirmed that it also performs best in our model. Accordingly, we used the LIBSVM package with the RBF kernel as the classifier; it can be obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvm (Chang and Lin, 2011). LIBSVM provides a grid search program to optimize the regularization parameter C and the kernel parameter γ, which together are tuned to obtain the best model performance. The selection ranges of C and γ are as follows:
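For reference, the RBF kernel whose width γ is tuned in this grid search has the following standard form, which is also the form documented for LIBSVM. The symbols $x_i$ and $x_j$ (two feature vectors) are introduced here for illustration only.

```latex
K(x_i, x_j) = \exp\left(-\gamma \, \lVert x_i - x_j \rVert^{2}\right), \qquad \gamma > 0
```

A larger γ makes the kernel more local, so each training sample influences a smaller neighborhood, while C controls the penalty on misclassified training samples.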
Feature Screening
The initial features extracted by K-mer are not necessarily the optimal combination of features (Zou et al., 2016; He et al., 2020). ANOVA is a popular feature selection method that measures the weight of each feature (Saeys et al., 2007; Tang et al., 2018). We then used IFS to determine the dimensionality of the best feature set according to the feature weights obtained by ANOVA. The ANOVA score is calculated as follows:
$$F = \frac{S_B^2}{S_W^2}$$
where F is the variance ratio (F value) of the feature, $S_B^2$ is the sample variance between groups, and $S_W^2$ denotes the sample variance within groups.
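The ranking and the nested subsets examined by IFS can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the between-group and within-group variances use the usual one-way ANOVA degrees of freedom (K − 1 and N − K), and the classifier that would score each subset is left out.

```python
def anova_f(feature_values, labels):
    """One-way ANOVA F statistic for a single feature.

    F = (between-group mean square) / (within-group mean square),
    with K - 1 and N - K degrees of freedom for K groups and N samples.
    """
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(label, []).append(value)
    n_total = len(feature_values)
    k = len(groups)
    grand_mean = sum(feature_values) / n_total
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    if ms_within == 0:
        # Constant feature scores 0; perfectly separating feature scores infinity.
        return 0.0 if ms_between == 0 else float("inf")
    return ms_between / ms_within


def rank_features(matrix, labels):
    """Return feature indices sorted by decreasing F statistic."""
    n_features = len(matrix[0])
    scores = [anova_f([row[j] for row in matrix], labels)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def ifs_subsets(ranking):
    """Yield the nested feature subsets evaluated by IFS: top-1, top-2, ..."""
    for size in range(1, len(ranking) + 1):
        yield ranking[:size]


# Hypothetical toy data: feature 0 separates the classes, feature 1 barely does.
X = [[1.0, 5.0], [2.0, 5.1], [8.0, 4.9], [9.0, 5.0]]
y = [0, 0, 1, 1]
ranking = rank_features(X, y)
print(ranking, list(ifs_subsets(ranking)))
```

In a full pipeline, each subset yielded by `ifs_subsets` would be scored by cross-validated accuracy, and the subset with the highest score would be kept.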
Performance Evaluation
In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling (K-fold cross-validation) test, and the jackknife test. Among the three, the jackknife test is deemed the least arbitrary because it always yields a unique result for a given benchmark dataset, and hence it has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (Chou and Shen, 2008; Chou, 2011; Chou et al., 2012; Zhang et al., 2021). However, since the current study involves feature selection as described above, the 10-fold cross-validation test and the independent dataset test were adopted to reduce computational time, as done by many investigators using SVM as the prediction engine. The performance is measured in terms of Sensitivity (Sn), Specificity (Sp), F1 score, Matthews correlation coefficient (MCC), and Accuracy (Acc; Li et al., 2020; Shen and Zou, 2020; Yang et al., 2021), which are expressed as follows:
$$Sn = \frac{TP}{TP + FN}$$
$$Sp = \frac{TN}{TN + FP}$$
$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN represent true-positive, true-negative, false-positive, and false-negative samples, respectively.
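The five measures can be computed from the confusion-matrix counts as in the short sketch below. The function name and the example counts are hypothetical and are not results from this study; the formulas are the standard definitions of these measures.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Sn, Sp, Acc, F1, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)       # F1 score
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F1": f1, "MCC": mcc}


# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike the other four measures, MCC ranges from −1 to 1; it is reported as a percentage in this article.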
Results
Predictive Performance of Different Reducing Amino Acid Cluster
To obtain the optimal amino acid reduction scheme and the appropriate K value (K = 1, 2, 3), we calculated the accuracy of the 673 reduction schemes listed in RAACBook (Zheng et al., 2019) with the different K values. We found that all three models performed best at K = 3, and most of the reduction schemes had higher accuracy when K = 3 (Figure 3). We reasoned that K = 3 yields more features, which better reflect the properties of the protein and lead to a more accurate model.
FIGURE 3
After confirming that the model performs better when K = 3, we selected the best scheme from the 673 RAAC schemes to construct the model. For the 2OG-SwissProt model, we tested each size of each reduction type and compared the different sizes across reduction types (Figure 4A). The highest accuracy, 83.75%, was obtained when t = 33 (Table 2) and s = 15, where t denotes the t-th reduction type in RAACBook and s denotes the size of the RAAC (Figure 4B). For the 2OG-Fe dataset, the highest accuracy also appeared with reduction type 33, reaching 90.04% when s = 16 (Supplementary Figure 1B). The accuracy at s = 15 was also very high, reaching 88.76% (Supplementary Figure 1A). The type 33 reduction method uses a database of aligned protein structures to derive a clustering based on substitution scores, which at the coarsest level aggregates the 20 amino acids into two groups, namely, a hydrophobic group and a polar group (Li and Wang, 2007). Therefore, we speculated that the function of 2OG oxygenases may be related to their polarity and hydrophobicity.
FIGURE 4
TABLE 2
| Size | Reduced amino acid cluster |
| 2 | STANDGRQEKHPIVLMWYF-C |
| 3 | STANDGRQEKHP-IVLMWYF-C |
| 4 | STANDG-RQEKHP-IVLMWYF-C |
| 5 | STAND-G-RQEKHP-IVLMWYF-C |
| 6 | STAND-G-RQEK-HP-IVLMWYF-C |
| 7 | STA-ND-G-RQEK-HP-IVLMWYF-C |
| 8 | STA-ND-G-RQ-EK-HP-IVLMWYF-C |
| 9 | STA-ND-G-RQ-EK-HP-IVLM-WYF-C |
| 10 | ST-A-ND-G-RQ-EK-HP-IVLM-WYF-C |
| 11 | ST-A-ND-G-RQ-EK-H-P-IVLM-WYF-C |
| 12 | ST-A-N-D-G-RQ-EK-H-P-IVLM-WYF-C |
| 13 | ST-A-N-D-G-RQ-EK-H-P-IV-LM-WYF-C |
| 14 | S-T-A-N-D-G-RQ-EK-H-P-IV-LM-WYF-C |
| 15 | S-T-A-N-D-G-RQ-EK-H-P-IV-L-M-WYF-C |
| 16 | S-T-A-N-D-G-RQ-E-K-H-P-IV-L-M-WYF-C |
| 17 | S-T-A-N-D-G-RQ-E-K-H-P-IV-L-M-WY-F-C |
| 18 | S-T-A-N-D-G-R-Q-E-K-H-P-IV-L-M-WY-F-C |
| 19 | S-T-A-N-D-G-R-Q-E-K-H-P-I-V-L-M-WY-F-C |
Cluster size of reduced amino acid alphabet of type 33.
To further test whether polarity and hydrophobicity may be related to the function of 2OG oxygenases, we manually extracted the 2OG oxygenase domain and JmjC domain sequences, as well as other iron-binding domain sequences, for prediction. A protein functions mainly through its domain regions, and 2OG oxygenases bind Fe(II) and 2-oxoglutarate within their domains to perform their functions. Therefore, regions outside the domain may act as noise in feature extraction, and extracting features from the domain sequence alone can better reflect the function of 2OG oxygenases (Shen and Zou, 2020). As expected, the highest accuracy was obtained when t = 33 and s = 15 (Supplementary Figure 1B). This matches the result obtained with the complete sequences, which further supports the idea that polarity and hydrophobicity may be related to the function of 2OG oxygenases.
The functional domain of 2OG oxygenases contains Fe2+-binding sites and α-ketoglutarate-binding sites, and their amino acid composition is almost completely conserved. The Fe2+-binding motif (HXD-H) and α-KG-binding motif (N-Y-R-R) of the ALKBH family are entirely conserved among the homologs (Bjornstad et al., 2011; Fedeles et al., 2015; Alemu et al., 2016; Xu et al., 2021), and other 2OG oxygenases have similar structures (Bleijlevens et al., 2008; Islam et al., 2018; Wang et al., 2021). They all bind Fe2+ and α-ketoglutarate through conserved polar amino acid regions, which may be the reason why polarity is an essential feature for identifying 2OG oxygenases. In addition, in the best reduction scheme, phenylalanine (F), tryptophan (W), and tyrosine (Y) are merged into a single letter, and all three are aromatic amino acids. We speculate that the function of 2OG oxygenases may be related to the hydrophobicity of aromatic amino acids and the unique properties of their aromatic rings.
Feature Selection
Although more features are available when K = 3, not every feature helps the prediction of 2OG oxygenases; some features may even act as noise and affect the final result. Therefore, we used ANOVA combined with IFS to select the best feature combination. In 10-fold cross-validation, the 2OG-SwissProt model achieved an optimal performance of 91.46% with 812 features (Figure 4C); the 2OG-Fe model achieved an optimal performance of 96.61% with 1,181 features (Supplementary Figure 1C); and the 2OG-domain model achieved an optimal performance of 96.07% with 350 features (Supplementary Figure 1C). To show more clearly that the selected features better reflect the nature of 2OG oxygenases, we used t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the unreduced, reduced, and screened feature sets in a 2D feature space (Figures 5A–C). The results show that the feature set obtained after feature screening clusters better and can effectively separate 2OG oxygenases from non-2OG oxygenases.
FIGURE 5
Performance Evaluation
We evaluated our model by 10-fold cross-validation to verify that our model is effective (Table 3). At the same time, we drew the receiver operating characteristic (ROC) curve through the 10-fold cross-validation (Figures 5D–F).
TABLE 3
| Model | Acc (%) | Sn (%) | Sp (%) | MCC (%) | F1 score (%) | AUC (%) |
| 2OG-SwissProt | 91.04 | 93.33 | 88.75 | 82.34 | 91.26 | 97.15 |
| 2OG-Fe | 97.23 | 97.92 | 96.53 | 94.48 | 97.31 | 99.57 |
| 2OG-domain | 97.87 | 98.23 | 97.65 | 95.60 | 97.37 | 99.89 |
The results of each evaluation index of the three models.
Acc, accuracy; AUC, area under the curve; MCC, Matthews correlation coefficient; Sn, sensitivity; and Sp, specificity.
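The partitioning behind a 10-fold cross-validation such as the one summarized in Table 3 can be sketched as follows. This is an illustrative sketch only: the article does not state whether the folds were stratified or how they were shuffled, so the stratification, the random seed, and the function names are assumptions. The class sizes (240 positive and 240 negative) are those of the 2OG-SwissProt training set in Table 1.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal indices round-robin
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = stratified_kfold(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx


# 240 positive and 240 negative training samples, as in Table 1 (2OG-SwissProt).
labels = [1] * 240 + [0] * 240
splits = list(train_test_splits(labels))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample is used for testing exactly once, and the reported measures are obtained by combining the predictions or scores across the ten folds.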
To further evaluate our predictor, we tested the 2OG-SwissProt, 2OG-Fe, and 2OG-domain models on independent test sets. The 2OG-SwissProt model correctly predicted 143 of the 150 test samples, an accuracy of 95.33%. The 2OG-Fe model correctly predicted 149 of the 159 test samples, an accuracy of 93.71%. The 2OG-domain model correctly predicted 963 of the 1,036 test samples, an accuracy of 92.95%. These results show that our predictor is effective and robust.
Web Server Guidance
To make our model more convenient for other researchers, an easy-to-use web server implementing our predictor was established; it can be freely accessed at http://bioinfor.imu.edu.cn/ogferaac. To use the tool, click the “Service” module and then either paste protein sequences in FASTA format into the input box or use the upload button to upload a protein data file. Depending on the sequences provided, different modules (2OG-SwissProt, 2OG-Fe, and 2OG-domain) can be chosen for prediction. After the task is submitted, the website provides a prediction report that displays the predicted result and probability for each sequence in the form of tables and flowcharts (Figure 6).
FIGURE 6
Discussion
Research on 2OG oxygenases has become increasingly in-depth, and their many functions (such as demethylation) occupy an important position in disease research (Liu et al., 2019; Ao et al., 2021). In this work, a prediction model for 2OG oxygenases was constructed based on the RAAC strategy and SVM. The t-SNE results show that RAAC can effectively reduce protein sequence complexity, extract conserved features hidden in noise, and improve prediction accuracy. OGFE_RAAC is robust and generalizes well in predicting 2OG oxygenases. We anticipate that OGFE_RAAC can accurately and rapidly identify 2OG oxygenases from their protein sequences and promote the development of related drug research. In addition, we found during the prediction process that the function of 2OG oxygenases may be related to their hydrophobicity and polarity, which provides a new direction for future studies of 2OG oxygenases.
Statements
Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: http://bioinfor.imu.edu.cn/ogferaac/public/Download.
Author contributions
YZ conceived and designed the study. JZ and PL organized and collected the data and carried out the computation. LZ designed and developed the web server. JZ and HW wrote the manuscript. SB participated in all subsequent revisions of the manuscript. YZ planned overall and revised the manuscript. All authors read and approved the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (Nos. 62061034 and 61861036), the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (NJYT-18-B01), the Fund for Excellent Young Scholars of Inner Mongolia (2017JQ04), and the Science and Technology Major Project of Inner Mongolia Autonomous Region of China to the State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock (2019ZD031).
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcell.2021.707938/full#supplementary-material
References
1. Alemu, E. A., He, C., and Klungland, A. (2016). ALKBHs-facilitated RNA modifications and de-modifications. DNA Repair 44, 87–91. doi: 10.1016/j.dnarep.2016.05.026
2. Ao, C., Yu, L., and Zou, Q. (2021). Prediction of bio-sequence modifications and the associations with diseases. Brief. Funct. Genomics 20, 1–18. doi: 10.1093/bfgp/elaa023
3. Arifuzzaman, S., Khatun, M. R., and Khatun, R. (2020). Emerging of lysine demethylases (KDMs): from pathophysiological insights to novel therapeutic opportunities. Biomed. Pharmacother. 129:110392. doi: 10.1016/j.biopha.2020.110392
4. Beer, M. A. (2017). Predicting enhancer activity and variant impact using gkm-SVM. Hum. Mutat. 38, 1251–1258. doi: 10.1002/humu.23185
5. Bjornstad, L. G., Zoppellaro, G., Tomter, A. B., Falnes, P. O., and Andersson, K. K. (2011). Spectroscopic and magnetic studies of wild-type and mutant forms of the Fe(II)- and 2-oxoglutarate-dependent decarboxylase ALKBH4. Biochem. J. 434, 391–398. doi: 10.1042/bj20101667
6. Bleijlevens, B., Shivarattan, T., Flashman, E., Yang, Y., Simpson, P. J., Koivisto, P., et al. (2008). Dynamic states of the DNA repair enzyme AlkB regulate product release. EMBO Rep. 9, 872–877. doi: 10.1038/embor.2008.120
7. Brylinski, M., and Skolnick, J. (2011). FINDSITE-metal: integrating evolutionary information and machine learning for structure-based metal-binding site prediction at the proteome level. Proteins 79, 735–751. doi: 10.1002/prot.22913
8. Chang, B. S., Chen, Y., Zhao, Y. M., and Bruick, R. K. (2007). JMJD6 is a histone arginine demethylase. Science 318, 444–447. doi: 10.1126/science.1145801
9. Chang, C.-C., and Lin, C.-J. (2011). LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27. doi: 10.1145/1961189.1961199
10. Chen, Z., Wang, Y., Zhai, Y. F., Song, J., and Zhang, Z. (2013). ZincExplorer: an accurate hybrid method to improve the prediction of zinc-binding sites from protein sequences. Mol. Biosyst. 9, 2213–2222. doi: 10.1039/c3mb70100j
11. Chou, K. C. (2011). Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247. doi: 10.1016/j.jtbi.2010.12.024
12. Chou, K. C., and Shen, H. B. (2008). Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protoc. 3, 153–162. doi: 10.1038/nprot.2007.494
13. Chou, K. C., Wu, Z. C., and Xiao, X. (2012). iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst. 8, 629–641. doi: 10.1039/c1mb05420a
14. Farrow, S. C., and Facchini, P. J. (2014). Functional diversity of 2-oxoglutarate/Fe(II)-dependent dioxygenases in plant metabolism. Front. Plant Sci. 5:524. doi: 10.3389/fpls.2014.00524
15. Fedeles, B. I., Singh, V., Delaney, J. C., Li, D. Y., and Essigmann, J. M. (2015). The AlkB Family of Fe(II)/alpha-ketoglutarate-dependent dioxygenases: repairing nucleic acid alkylation damage and beyond. J. Biol. Chem. 290, 20734–20742. doi: 10.1074/jbc.r115.656462
16. Fu, L. M., Niu, B. F., Zhu, Z. W., Wu, S. T., and Li, W. Z. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152. doi: 10.1093/bioinformatics/bts565
17. Georgiou, D. N., Karakasidis, T. E., Nieto, J. J., and Torres, A. (2009). Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition. J. Theor. Biol. 257, 17–26. doi: 10.1016/j.jtbi.2008.11.003
18. Hausinger, R. P. (2004). FeII/alpha-ketoglutarate-dependent hydroxylases and related enzymes. Crit. Rev. Biochem. Mol. Biol. 39, 21–68. doi: 10.1080/10409230490440541
19. He, S., Guo, F., Zou, Q., and Ding, H. (2020). MRMD2.0: a python tool for machine learning with feature ranking and reduction. Curr. Bioinform. 15, 1213–1221. doi: 10.2174/1574893615999200503030350
20. Hewitson, K. S., Granatino, N., Welford, R. W., Mcdonough, M. A., and Schofield, C. J. (2005). Oxidation by 2-oxoglutarate oxygenases: non-haem iron systems in catalysis and signalling. Philos. Trans. A Math. Phys. Eng. Sci. 363, 807–828; discussion 1035–1040. doi: 10.1098/rsta.2004.1540
21. Huang, S., Cai, N., Pacheco, P. P., Narrandes, S., Wang, Y., and Xu, W. (2018). Applications of Support Vector Machine (SVM) Learning in Cancer Genomics. Cancer Genomics Proteomics 15, 41–51.
22. Huang, Y., Niu, B. F., Gao, Y., Fu, L. M., and Li, W. Z. (2010). CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 26, 680–682. doi: 10.1093/bioinformatics/btq003
23. Islam, M. S., Leissing, T. M., Chowdhury, R., Hopkinson, R. J., and Schofield, C. J. (2018). 2-oxoglutarate-dependent oxygenases. Annu. Rev. Biochem. 87, 585–620.
24. Jaillard, M., Palmieri, M., Van Belkum, A., and Mahe, P. (2020). Interpreting k-mer-based signatures for antibiotic resistance prediction. Gigascience 9:giaa110. doi: 10.1093/gigascience/giaa110
25. Jia, B., Tang, K., Chun, B. H., and Jeon, C. O. (2017). Large-scale examination of functional and sequence diversity of 2-oxoglutarate/Fe(II)-dependent oxygenases in Metazoa. Biochim. Biophys. Acta Gen. Sub. 1861, 2922–2933. doi: 10.1016/j.bbagen.2017.08.019
26. Kawai, Y., Ono, E., and Mizutani, M. (2014). Evolution and diversity of the 2-oxoglutarate-dependent dioxygenase superfamily in plants. Plant J. 78, 328–343. doi: 10.1111/tpj.12479
27. Lee, Y. F., Miller, L. D., Chan, X. B., Black, M. A., Pang, B., Ong, C. W., et al. (2012). JMJD6 is a driver of cellular proliferation and motility and a marker of poor prognosis in breast cancer. Breast Cancer Res. 14:R85.
28. Levy, R., Edelman, M., and Sobolev, V. (2009). Prediction of 3D metal binding sites from translated gene sequences based on remote-homology templates. Proteins 76, 365–374. doi: 10.1002/prot.22352
29. Li, F. Y., Leier, A., Liu, Q. Z., Wang, Y. A., Xiang, D. X., Akutsu, T., et al. (2020). Procleave: predicting protease-specific substrate cleavage sites by combining sequence and structural information. Genomics Proteomics Bioinformatics 18, 52–64. doi: 10.1016/j.gpb.2019.08.002
30. Li, J., and Wang, W. (2007). Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids. Sci. China Series C Life Sci. 50, 392–402. doi: 10.1007/s11427-007-0023-3
31. Liao, C. H., and Zhang, Q. (2020). ASIP COTRAN EARLY CAREER INVESTIGATOR AWARD LECTURE Understanding the oxygen-sensing pathway and its therapeutic implications in diseases. Am. J. Pathol. 190, 1584–1595. doi: 10.1016/j.ajpath.2020.04.003
32. Lin, H. H., Han, L. Y., Zhang, H. L., Zheng, C. J., Xie, B., Cao, Z. W., et al. (2006). Prediction of the functional class of metal-binding proteins from sequence derived physicochemical properties by support vector machine approach. BMC Bioinform. 7(Suppl. 5):S13. doi: 10.1186/1471-2105-7-S5-S13
33. Liou, Y. F., Charoenkwan, P., Srinivasulu, Y., Vasylenko, T., Lai, S. C., Lee, H. C., et al. (2014). SCMHBP: prediction and analysis of heme binding proteins using propensity scores of dipeptides. BMC Bioinform. 15(Suppl. 16):S4. doi: 10.1186/1471-2105-15-S16-S4
34. Liu, D. Y., Li, G. P., and Zuo, Y. C. (2019). Function determinants of TET proteins: the arrangements of sequence motifs with specific codes. Brief. Bioinform. 20, 1826–1835. doi: 10.1093/bib/bby053
35. Liu, R., and Hu, J. (2011). HemeBIND: a novel method for heme binding residue prediction by combining structural and sequence information. BMC Bioinform. 12:207. doi: 10.1186/1471-2105-12-207
36. Liu, W., Ma, Q., Wong, K., Li, W. B., Ohgi, K., Zhang, J., et al. (2013). Brd4 and JMJD6-associated anti-pause enhancers in regulation of transcriptional pause release. Cell 155, 1581–1595. doi: 10.1016/j.cell.2013.10.056
37. Manavalan, B., Shin, T. H., and Lee, G. (2018). PVP-SVM: sequence-based prediction of phage virion proteins using a support vector machine. Front. Microbiol. 9:476. doi: 10.3389/fmicb.2018.00476
38. Meng, C., Guo, F., and Zou, Q. (2020). CWLy-SVM: a support vector machine-based tool for identifying cell wall lytic enzymes. Comput. Biol. Chem. 87:107304. doi: 10.1016/j.compbiolchem.2020.107304
39. Mohan, A., Anishetty, S., and Gautam, P. (2010). Global metal-ion binding protein fingerprint: a method to identify motif-less metal-ion binding proteins. J. Bioinform. Comput. Biol. 8, 717–726. doi: 10.1142/s0219720010004884
40. Myllyharju, J. (2008). Prolyl 4-hydroxylases, key enzymes in the synthesis of collagens and regulation of the response to hypoxia, and their roles as treatment targets. Ann. Med. 40, 402–417. doi: 10.1080/07853890801986594
41. Rose, N. R., Mcdonough, M. A., King, O. N. F., Kawamura, A., and Schofield, C. J. (2011). Inhibition of 2-oxoglutarate dependent oxygenases. Chem. Soc. Rev. 40, 4364–4397.
42. Saeys, Y., Inza, I., and Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517. doi: 10.1093/bioinformatics/btm344
43. Shen, Z. J., and Zou, Q. (2020). Basic polar and hydrophobic properties are the main characteristics that affect the binding of transcription factors to methylation sites. Bioinformatics 36, 4263–4268. doi: 10.1093/bioinformatics/btaa492
44. Shu, N., Zhou, T., and Hovmöller, S. (2008). Prediction of zinc-binding sites in proteins from sequence. Bioinformatics 24, 775–782. doi: 10.1093/bioinformatics/btm618
45. Sun, Z. J., Huang, S. H., Zheng, L., Liang, P. F., Yang, W. R. T., and Zuo, Y. C. (2020). ICTC-RAAC: an improved web predictor for identifying the types of ion channel-targeted conotoxins by using reduced amino acid cluster descriptors. Comput. Biol. Chem. 89:107371. doi: 10.1016/j.compbiolchem.2020.107371
46. Tahir, M., and Idris, A. (2020). MD-LBP: an efficient computational model for protein subcellular localization from HeLa cell lines using SVM. Curr. Bioinform. 15, 204–211. doi: 10.2174/1574893614666190723120716
47. Tang, H., Zhao, Y. W., Zou, P., Zhang, C. M., Chen, R., Huang, P., et al. (2018). HBPred: a tool to identify growth hormone-binding proteins. Int. J. Biol. Sci. 14, 957–964. doi: 10.7150/ijbs.24174
48. Valasatava, Y., Rosato, A., Banci, L., and Andreini, C. (2016). MetalPredator: a web server to predict iron-sulfur cluster binding proteomes. Bioinformatics 32, 2850–2852. doi: 10.1093/bioinformatics/btw238
49. Vasta, J. D., and Raines, R. T. (2018). Collagen Prolyl 4-Hydroxylase as a therapeutic target. J. Med. Chem. 61, 10403–10411. doi: 10.1021/acs.jmedchem.8b00822
50. Wang, H., Xi, Q. L. M. G., Liang, P. F., Zheng, L., Hong, Y., and Zuo, Y. C. (2021). IHEC_RAAC: a online platform for identifying human enzyme classes via reduced amino acid cluster strategy. Amino Acids 53, 239–251. doi: 10.1007/s00726-021-02941-9
51. Wang, Z., Liu, D., Xu, B., Tian, R., and Zuo, Y. (2021). Modular arrangements of sequence motifs determine the functional diversity of KDM proteins. Brief. Bioinform. 22:bbaa215. doi: 10.1093/bib/bbaa215
52
XiaoM. Z.LiuJ. M.XianC. L.ChenK. Y.LiuZ. Q.ChengY. Y. (2020). Therapeutic potential of ALKB homologs for cardiovascular disease.Biomed. Pharmacother.131:110645. 10.1016/j.biopha.2020.110645
53
XuB.LiuD.WangZ.TianR.ZuoY. (2021). Multi-substrate selectivity based on key loops and non-homologous domains: new insight into ALKBH family.Cell. Mol. Life Sci.78129–141. 10.1007/s00018-020-03594-9
Yan, R., Wang, X., Tian, Y., Xu, J., Xu, X., and Lin, J. (2019). Prediction of zinc-binding sites using multiple sequence profiles and machine learning methods. Mol. Omics 15, 205–215. doi: 10.1039/c9mo00043g
Yang, H., Luo, Y., Ren, X., Wu, M., He, X., Peng, B., et al. (2021). Risk Prediction of Diabetes: Big data mining with fusion of multifarious physical examination indicators. Inf. Fusion 75, 140–149. doi: 10.1016/j.inffus.2021.02.015
Zhang, D., Chen, H.-D., Zulfiqar, H., Yuan, S.-S., Huang, Q.-L., Zhang, Z.-Y., et al. (2021). iBLP: an XGBoost-based predictor for identifying bioluminescent proteins. Comput. Math. Methods Med. 2021:6664362.
Zhang, H. Y., Xi, Q., Huang, S. H., Zheng, L., Yang, W., and Zuo, Y. C. (2020). iSP-RAAC: identify secretory proteins of malaria parasite using reduced amino acid composition. Comb. Chem. High Throughput Screen. 23, 536–545. doi: 10.2174/1386207323666200402084518
Zheng, L., Huang, S., Mu, N., Zhang, H., Zhang, J., Chang, Y., et al. (2019). RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou’s five-step rule. Database (Oxford) 2019:baz131.
Zheng, L., Liu, D., Yang, W., Yang, L., and Zuo, Y. (2020). RaacLogo: a new sequence logo generator by using reduced amino acid clusters. Brief. Bioinform. 22:bbaa096. doi: 10.1093/bib/bbaa096
Zhu, X. J., Feng, C. Q., Lai, H. Y., Chen, W., and Lin, H. (2019). Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl. Based Syst. 163, 787–793. doi: 10.1016/j.knosys.2018.10.007
Zou, Q., Lin, G., Jiang, X., Liu, X., and Zeng, X. (2020). Sequence clustering in bioinformatics: an empirical study. Brief. Bioinform. 21, 1–10.
Zou, Q., Zeng, J. C., Cao, L. J., and Ji, R. R. (2016). A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 173, 346–354. doi: 10.1016/j.neucom.2014.12.123
Zuo, Y., Lv, Y., Wei, Z., Yang, L., Li, G., and Fan, G. (2015). iDPF-PseRAAAC: a web-Server for identifying the defensin peptide family and subfamily using pseudo reduced amino acid alphabet composition. PLoS One 10:e0145541. doi: 10.1371/journal.pone.0145541
Zuo, Y. C., Chang, Y., Huang, S. H., Zheng, L., Yang, L., and Cao, G. F. (2019). iDEF-PseRAAC: identifying the defensin peptide by using reduced amino acid composition descriptor. Evol. Bioinform. 15:1176934319867088.
Zuo, Y. C., and Li, Q. Z. (2009). Using reduced amino acid composition to predict defensin family and subfamily: integrating similarity measure and structural alphabet. Peptides 30, 1788–1793. doi: 10.1016/j.peptides.2009.06.032
Zuo, Y. C., Li, Y., Chen, Y. L., Li, G. P., Yan, Z. H., and Yang, L. (2017). PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics 33, 122–124. doi: 10.1093/bioinformatics/btw564
Summary
Keywords: 2-oxoglutarate/Fe (II)-dependent oxygenase, reduced amino acid cluster, machine learning, ANOVA, incremental feature selection, 10-fold cross-validation test
Citation: Zhou J, Bo S, Wang H, Zheng L, Liang P and Zuo Y (2021) Identification of Disease-Related 2-Oxoglutarate/Fe (II)-Dependent Oxygenase Based on Reduced Amino Acid Cluster Strategy. Front. Cell Dev. Biol. 9:707938. doi: 10.3389/fcell.2021.707938
Received: 11 May 2021; Accepted: 10 June 2021; Published: 16 July 2021.
Volume: 9 - 2021
Edited by: Liang Cheng, Harbin Medical University, China
Reviewed by: Leyi Wei, Shandong University, China; Shihua Zhang, Wuhan University of Science and Technology, China
Copyright © 2021 Zhou, Bo, Wang, Zheng, Liang and Zuo.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Yongchun Zuo, yczuo@imu.edu.cn
†These authors share first authorship
This article was submitted to Molecular Medicine, a section of the journal Frontiers in Cell and Developmental Biology
Disclaimer
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.