The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility

Hu, Xiuzhen; Feng, Zhenxing; Zhang, Xiaojin; Liu, Liu; Wang, Shan

doi:10.3389/fgene.2020.00214

ORIGINAL RESEARCH article

Front. Genet., 19 March 2020

Sec. Computational Genomics

Volume 11 - 2020 | https://doi.org/10.3389/fgene.2020.00214

The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility

Xiuzhen Hu

Zhenxing Feng^*

Xiaojin Zhang^*

Liu Liu

Shan Wang

College of Sciences, Inner Mongolla University of Technology, Hohhot, China

Many proteins realize their special functions by binding with specific metal ion ligands during a cell’s life cycle. The ability to correctly identify metal ion ligand-binding residues is valuable for the human health and the design of molecular drug. Precisely identifying these residues, however, remains challenging work. We have presented an improved computational approach for predicting the binding residues of 10 metal ion ligands (Zn^2+, Cu²⁺, Fe²⁺, Fe³⁺, Co²⁺, Ca²⁺, Mg²⁺, Mn²⁺, Na⁺, and K⁺) by adding reclassified relative solvent accessibility (RSA). The best accuracy of fivefold cross-validation was higher than 77.9%, which was about 16% higher than the previous result on the same dataset. It was found that different reclassification of the RSA information can make different contributions to the identification of specific ligand binding residues. Our study has provided an additional understanding of the effect of the RSA on the identification of metal ion ligand binding residues.

Introduction

Proteins act as an indispensable material in the life process. However, many special functions of protein are realized by binding with specific ligands, and more than one-third of the proteins need to bind with metal ion ligands. Thus, depending on the interaction between the metal ion ligands and specific binding residues, many metal ion ligands can affect the special protein functions (Caspers et al., 1990; Supek et al., 1997; Selvarengan and Kolandaivel, 2005). For instance, Mn²⁺ is used as catalyst in photosynthesis (Degtyarenko, 2000; Reed and Poyner, 2000), Ca²⁺ can lead to anxiety and Alzheimer’s disease (Jiang et al., 2015; Cao et al., 2017), and Cu²⁺ can cause Coronary Heart Disease (Sodhi et al., 2004; Lin et al., 2005). The basic principle of molecular drug design is that the interaction between the receptor and ligand must conform to the “Lock and Key Model,” and the interaction between the protein and ion ligands we studied also conforms to the “Lock and Key Model.” In the experiment of molecular drug design, protein crystallization, structure confirmation, and the interaction between ligands and protein residues are required. Thus, the experimental method is a time-consuming and expensive process, and it cannot be processed in batches, however, theoretical prediction of binding residues between proteins and ligands can overcome these shortcomings, and accurate prediction can provide theoretical information for drug design experiments. Therefore, correctly identifying metal ion ligand-binding residues is helpful for the human health and the design of molecular drug.

In the past two decades, experimental methods have been developed to identify metal ion ligand-binding residues, such as the Nuclear Magnetic Resonance Spectroscopy (Sletten, 1997) and fluorescence method (Kawahashi, 2003). However, due to the time-consuming nature and other limitations of experimental methods, the high-throughput computational methods were developed to predict the binding residues of metal ion ligands. Among the computational methods, many efforts were made to improve the databases, feature parameters, and algorithms. First, the databases were generally acquired from Protein Data Bank (PDB) (Tainer et al., 1991; Bernstein et al., 1997; Sodhi et al., 2004; Lin et al., 2005; Bordner, 2008; Babor et al., 2010; Lu et al., 2012), Structural Classification of Protein (SCOP) (Hubbard et al., 1997; Sodhi et al., 2004; Chauhan et al., 2010; Sobolev and Edelman, 2013), Ligand Protein Contact (LPC) (Sobolev et al., 1999; Chauhan et al., 2010), and BioLip (Yang et al., 2013a, b; Hu et al., 2016a, b, Wang et al., 2019). Second, the feature parameters generally contained the composition information of the amino acid (Cao et al., 2017; Wang et al., 2019), hydrophilicity-hydrophobicity (Lin et al., 2005; Lin et al., 2006; Cao et al., 2017), charge (Lin et al., 2005; Cao et al., 2017; Wang et al., 2019), position specific score matrix (PSSM) (Hu et al., 2016a), relative solvent accessibility (RSA) (Lin et al., 2006; Hu et al., 2016a; Cao et al., 2017; Wang et al., 2019) and three-dimensional structure information (Babor et al., 2010; Roy et al., 2012; Yang et al., 2015; Hu et al., 2016a). Finally, the classification algorithms used were artificial neural network (ANN) (Lin et al., 2005), Support Vector Machine (SVM) (Lin et al., 2006; Jiang et al., 2015; Cao et al., 2017; Hu et al., 2016a), Naïve Bayes (Ebert and Altman, 2010), COFACTOR (Lin et al., 2006; Yang et al., 2015), TargetSeq, TargetCom (Hu et al., 2016b), COACH (Yang et al., 2015), and SMO (Wang et al., 2019). Among the three aspects in the prediction mentioned above, the key step of feature extraction was generated by one of two ways: (1) the three-dimensional structure information or (2) primary sequence information of the protein. However, the precise three-dimensional structure information of many proteins was not available in the recent databases. Thus, feature extraction from sequence information is more popular in current research. Among the sequence information, RSA is one of the important parameters. In the previous works, researchers only divided it to burial and exposure by a certain threshold. However, the effects of different classifications of the RSA on prediction results have not been explored. In this paper, based on the semi-manually curated database of BioLip for biologically relevant ligand–protein interactions, we performed a statistical analysis for RSA and further reclassified the RSA. By integrating the optimized sequence information, we mainly used the Gradient Boosting Machine (GBM) algorithm and obtained better predicted results by using fivefold cross-validation and an independent test.

Materials and Methods

Benchmark Dataset

We selected non-redundant datasets of metal ion-binding proteins that were constructed in our group (Cao et al., 2017; Wang et al., 2019). The benchmark datasets were entirely from the BioLip database (Yang et al., 2013a). The proteins were filtered with a resolution less than 3 Å, the length of sequences was greater than 50, and the sequence identity was below 30%. Among the ∼250 ligands, there were only 10 ligands that could meet the above conditions to contribute to our further statistical analysis and prediction. The statistical information of the datasets containing ten metal ion ligands is shown in Table 1. In the protein sequence, residue binding with ion ligands was not only determined by the residue itself but also by how this was affected by the surrounding residues. Thus, a sliding window method was used to cut the protein sequence into overlapping residue segments with different sizes ranging from 5 to 21. In order to ensure that each residue was in the center of the segments, we added (L-1)/2 dummy residues “X” at both terminals of the proteins, where L was the window length. The optimal window length for each ligand was determined based on the evaluation results of the proposed computation method. If a binding residue was located at the segment center, it was defined as a positive sample; otherwise, it was defined as a negative sample. The number of non-binding segments was much larger than that of the binding segments, which led to a heavy imbalance in the datasets (Table 1). According to the methods of previous works (Yen and Lee, 2006; Roy et al., 2015), we took the number of positive samples as the standard and randomly extracted the equal number of negative samples. In this way, the negative samples were randomly selected 10 times to ensure the credibility of the results. Finally, we averaged the 10 results to calculate our overall accuracy.

TABLE 1

Table 1. The benchmark datasets of 10 metal ion ligands.

Selection and Extraction of Feature Parameters

According to the biological background of protein–ligand interactions and the statistical analysis of protein sequences, we extracted features of the position conservation information, which was acquired from the protein backbone and side chains.

Secondary Structure and Relative Solvent Accessibility

Analyzing the three-dimensional (3D) structure of a protein is critical to the understanding of its function. However, 3D models of only a small fraction of the sequenced proteins were made. The prediction of a secondary structure and RSA is a crucial step from the sequence to the 3D structure, reflecting the spatial structure information of the backbone and side chains, respectively. We therefore selected the predicted secondary structure information and RSA information. The prediction was helpful when simplifying the problem from the 3D structure to sequence information (Chen and Zhou, 2005; Lin et al., 2005; Hu et al., 2016a, b; Cao et al., 2017; Wang et al., 2019). In this paper, they were predicted by using ANGLOR software (Wu and Zhang, 2008). We obtained three secondary structure types, including alpha-helix (H), beta-strand (E), and coil (C). The relative solvent accessibility (RSA) was generally represented as a Boolean value, indicating whether the residue was buried (RSA < 0.25) or exposed (RSA > 0.25).

Physicochemical Properties of Amino Acids

Physicochemical properties affected the protein–ligand interactions, and different physicochemical properties of amino acids were caused by their different side chains (Lin et al., 2005, 2006; Cao et al., 2017; Wang et al., 2019). Metal ion ligands bind to a residue, probably preferring to bind to a specific side-group of this residue. The information from the side chains is therefore important for the prediction of metal ion ligand-binding residues. Since different standards can cause different classifications, the amino acids were divided into six categories according to the hydrophilicity and hydrophobicity (Panek and Eidhammer, 2010) (Supplementary Figure S1) and three categories according to the charge (Taylor, 1986) (Supplementary Figure S2).

Construction of Position Weight Matrix

The ion-binding residues tend to be more conserved than others during the process of evolution, and the residue conservation is a crucial indicator for the presence of functionally important residues. The PWSM has been successfully used in the prediction of transcription factor binding sites and ligand binding sites (Kel et al., 2003; Hu et al., 2016a). Thus, the position weight scoring matrix (PWSM) was used to extract the position conservation information of the basic feature parameters, and the scoring matrix based on amino acid residues was constructed from the sequence segments with a specific window length. The position-specific occurrence frequency of an amino acid is calculated as follows:

P_{i j} = \frac{n_{i j} + \sqrt{N_{i}} / 21}{N_{i} + \sqrt{N_{i}}} (1)

where i is the position index in the sequence segment, j is one of the 20 kinds of amino acids or vacancy, n_ij is the frequency of the jth amino acids at the ith position, and N_i is total number of all amino acids occurring at the ith position. The position weight matrix is then calculated as follows:

W_{i j} = \log \frac{P_{i j}}{P_{o j}} (2)

where P_oj is background probability of the jth amino acid. Therefore, based on the positive and negative training sets, two standard scoring matrices can be obtained. In a testing set, we got 2^∗L dimensional values for every sequence segment. Finally, the 5^∗2L dimensional values from the above five features can be used as the input parameters in the subsequent algorithm.

Gradient Boosting Machine

The Gradient Boosting Machine (GBM) is an improved Boosting algorithm proposed by Friedman (2001, 2002), Rawi et al. (2018) and Jain et al. (2018). The GBM algorithm is different from the original Boosting algorithm. The core of the Boosting algorithm is to set different weights to different samples during the iterative process. Based on the results of the previous iteration, the Boosting algorithm will increase the weight of wrong classification samples and reduce the weight of correct classification samples. Then, a weak classifier will be generated in each iterative process; after m iterations, a strong classifier an improved performance will be obtained by setting weight for each weak classifier. In the iterative process, GBM algorithm classifies the sample residual of the previous iteration and not the sample itself. After the end of the iteration, our classifier F_m(x) was obtained as Equation (3), where m is the number of iterations in the calculation process, ρ_m is the weight value and also the distance of the loss function decreases in its gradient direction, and h_m(x) is the function that fits the sample residuals in the iterations.

F_{m} (x) = F_{m - 1} (x) + ρ_{m} h_{m} (x) (3)

In addition, the GBM algorithm can handle mixture data and its robustness against outliers in the output space is very strong. In this paper, we implemented the GBM algorithm in the R platform by using the “gbm” package. In the classifier, parameters were optimized: “n.trees” ranged from 1 to 500, “n.minobsinnode” ranged from 10 to 50, “interaction.depth” ranged from 3 to 9, and “shrinkage” ranged from 0.01 to 0.1.

The Validation and Evaluation Metrics

As general validation methods, cross-validation and independent tests have been commonly used in previous literature (Hu et al., 2016a, b; Sun et al., 2016; Cao et al., 2017; Wang et al., 2019). In the five cross-validations, the dataset was randomly divided into five equal subsets. Four subsets were then used as training sets, and the remaining subset was used as a testing set. This process was repeated five times in such a way that each subset was used once for testing, and the average performance of the five subsets was then taken as the final performance.

We used several following metrics to evaluate our proposed method: sensitivity (Sn), specificity (Sp), False positive rate (FPR), accuracy of prediction (Acc), and Matthew’s correlation coefficient (MCC). They are defined as follows:

S_{n} = \frac{T P}{T P + F N} \times 100 % (4)

S_{p} = \frac{T N}{T N + F P} \times 100 % (5)

F P R = \frac{F P}{T N + F P} \times 100 % (6)

A c c = \frac{T P + T N}{T P + T N + F P + F N} \times 100 % (7)

M C C = \frac{(T P \times T N) - (F P \times F N)}{\sqrt{(T P \times F P) (T P \times F N) (T N \times F P) (T N \times F N)}} (8)

where TP is the number of correctly predicted metal ion ligand binding residues, FN is the number of binding residues predicted as non-binding residues, TN is the number of correctly predicted non-binding residues, and FP is the number of non-binding residues predicted as binding residues. To explain the above prediction method more directly and clearly, see our detailed flowchart in Figure 1.

FIGURE 1

Figure 1. Flowchart of the method for the identification of metal ion ligand-binding residues.

Results and Discussion

The Classification of Relative Solvent Accessibility

For each metal ion ligand, based on the optimized window length, we gradually added the parameters from the position conservation information of amino acids (WA), hydrophilic-hydrophobic (QS), charge (DH), secondary structure (SS), and RSA to the GBM algorithm. It was found that the predicted result was significantly improved by successively adding each of the features.

Predicted Results for K⁺ Ligand Binding Residues

Table 2 shows the prediction results of the K + ligand by gradually adding parameters to the model. By gradually adding parameters to the model, we found that the different parameters had different effects on the predicted results. In this work, we used the initial classification of Boolean value thresholds (marked as SA_2) and added it to the model; the predicted result was significantly improved, and the Acc and MCC increased by nearly 12 and 24%, respectively. However, the predicted results did not change much by adding other parameters. It indicated that the RSA played an important role in the whole parameters for identifying the metal ion ligand-binding residues.

TABLE 2

Table 2. Predicted results for K⁺ ligand-binding residues.

Statistical Analysis of the Relative Solvent Accessibility

Due to the importance of RSA and the particularity of metal ion ligands, we performed the statistical analysis of the RSA information for different metal ion ligands. Then, we found that the classification was not the same for different metal ion ligands. Therefore, we reclassified the thresholds of the Boolean value for different metal ion ligands. For instance, Figure 2 shows the statistical distribution of the RSA in a positive set and negative set for the K⁺ ligand (the statistical distribution of other metal ion ligands is shown in Supplementary Material 1). In Figure 2, the abscissa indicates the predicted values of amino acid RSA; the ordinate indicates the number of amino acids corresponding to each predicted value in the positive and negative samples.

FIGURE 2

Figure 2. The statistical distribution of relative solvent accessibility in positive and negative set for K⁺ ligand. Note: the abscissa axis is the values of the relative solvent accessibility, and the ordinate is the number of amino acids corresponding to each predicted value. The solid red line represents the positive set, and the dotted blue line represents the negative set.

If it is reclassified by the distribution, it can be divided into four regions (marked as SA_V), namely [0, 0.15), [0.15, 0.25), [0.25, 0.35), and [0.35, 0.85], which are represented by h(x) and four letters.

h (x) = {\begin{matrix} A, x \in [0, 0.15) \\ B, x \in [0.15, 0.25) \\ C, x \in [0.25, 0.35) \\ D, x \in [0.35, 0.85] \end{matrix}

If it is reclassified according to the peak value, it can be divided into four regions (marked as SA_P), namely [0, 0.15), [0.15, 0.4), [0.4, 0.5), and [0.5, 0.85], which are expressed by y(x) and four letters.

y (x) = {\begin{matrix} I, x \in [0, 0.15) \\ K, x \in [0.15, 0.4) \\ L, x \in [0.4, 0.5) \\ M, x \in [0.5, 0.85] \end{matrix}

Besides, we also used the previous four regions (Cao et al., 2017), which were suitable for most metal ion ligands (marked as SA_4), namely [0, 0.2], (0.2, 0.45], (0.45, 0.6], and (0.6, 0.85], which were represented by g(x) and four letters. The four kinds of grouping methods (SA_2, SA_4, SA_P, and SA_V) of other metal ion ligands are shown in Supplementary Material 2.

g (x) = {\begin{matrix} E, x \in [0, 0.2] \\ F, x \in (0.2, 0.45] \\ G, x \in (0.45, 0.6] \\ H, x \in (0.6, 0.85] \end{matrix}

The Predicted Results of Four General RSA Classifications

Then, for each metal ion ligand, four different classification groups of RSA were added to the parameters, and four general prediction models were obtained. The four different predicted results of K⁺ ligand binding residues are shown in Table 3.

TABLE 3

Table 3. Predicted results of K⁺ ligand-binding residues.

We found that the predicted results of the same metal ion ligand were different for the four general prediction models, and the optimal predicted results of ten metal ion ligand-binding residues were from the differently specific prediction model. An additional file shows this in more detail (see Supplementary Material 3). For example, the K⁺ ligand obtained the optimal predicted result from the specific classification namely SA_V, but the Fe²⁺ ligand obtained this from SA_4.

The Optimal Predicted Results of Ten Metal Ion Ligand-Binding Residues

By comparing the four general prediction models, the optimal predicted results for ten metal ion ligand-binding residues were obtained and listed in Table 4.

TABLE 4

Table 4. The optimal predicted results of 10 metal ion ligand-binding residues and corresponding specific classifications of relative solvent accessibility.

Based on the different classifications of RSA, we obtained the optimal predicted results of ten metal ion ligand-binding residues and corresponding specific prediction models.

The Predicted Results (by Use of the Boruta Algorithm)

We used the 5^∗2L dimensional features in the above calculations. However, different features made varied contributions to the predicted results, and the combination of different features did not necessarily result in a good classification performance. Therefore, we used a Boruta algorithm (Kursa and Rudnicki, 2010; Kursa et al., 2010; Feng and Li, 2017, Feng et al., 2018) to make a main feature selection. The algorithm iteratively removed the features that were less relevant than random probes. From this we could obtain the optimal features combination. The algorithm was implemented by the “Boruta” package in R environment. In this way, after a large-scale computation, the confirmed features were obtained, and the rejected features were removed from the combination of all the features. The rejected features are shown in Table 5.

TABLE 5

Table 5. The features rejected by using the Boruta feature selection algorithm.

When using the Boruta algorithm to reduce the dimension of the features, it was found that the reduced dimensions of different metal ion ligands were different. For example, the dimensions of the Ca²⁺ and Mg²⁺ ligands were not reduced, the dimension of the Zn²⁺ ligand was reduced by 5 dimensions, the dimension of the Mn²⁺ ligand was reduced by 13 dimensions, etc. In order to prove the justifiability of the features eliminated by the Boruta algorithm, we analyzed the importance of the features by using the “randomForest” package in R environment. The larger the MeanDecreaseAccuracy and MeanDecreaseGini values, the higher the importance of the feature parameters. Taking the Zn²⁺ ligand as an example, it can be seen from Figure 3 that the important features of the first 30 dimensions were consistent with the confirmed features by the Boruta algorithm.

FIGURE 3

Figure 3. The feature importance of Zn²⁺ ligand indicated by MeanDecreaseAccuracy value (A) and MeanDecreaseGini value (B) from Random Forest. Note: the larger the MeanDecreaseAccuracy and MeanDecreaseGini values, the higher the importance of the feature parameters. WA1-WA18 is the features of amino acid, QS1-QS18 is the features of hydrophobic, DH1-DH18 is the features of charge, SS1-SS18 is the features of secondary structure, and SA1-SA18 is the features of relative solvent accessibility.

The obtained subset features were then input into the GBM, and the predicted results were shown in Table 6. Table 6 shows that we obtained similar results based on subset features. This suggested that, under the premise of ensuring the accuracy, the Boruta algorithm was efficient in its ability to reduce the dimensions of features for predicting metal ion ligand-binding residues. The decline of the subset predicted results showed that all the selected features had certain contributions to the recognition of the binding residues. In addition, the predicted results of the subset were still higher than those of SVM. Our method was therefore relatively reliable for predicting the metal ion ligand binding residues.

TABLE 6

Table 6. Comparison of predicted results based on the full feature and Boruta’s feature.

The Predicted Results of GBM by Using an Independent Test

We used equal samples of positive and negative in the previous calculations. However, the positive and negative samples were not equal when we intercepted segments by using the sliding window method. In order to verify the practicability of the proposed method, we divided the total dataset into two parts: the training dataset was used to construct the predicted methods by fivefold cross-validation, and the independent testing dataset was used to test the extrapolation ability of the predicted methods. The protein chains in the independent testing dataset accounted for 20% of the total dataset, which was consistent with the published work (Cao et al., 2017). The statistical information of the datasets is shown in Table 7.

TABLE 7

Table 7. The statistics of the training dataset and the independent testing dataset.

In the independent test, the 5^∗2L dimension position information was input into the GBM algorithm to obtain the predicted ligand-specific models, and the testing dataset was input into the predicted model to test. The number of positive and negative samples was not balanced, and the MCC values in Table 8 therefore reflect the stability of the predicted model. In order to compare these results more obviously, we added them to Table 8. The comparative results indicated that the selected features and algorithm had better identification abilities for predicting metal ion ligand-binding residues.

TABLE 8

Table 8. Comparison of our independent test results with previous results.

Comparison With Other Methods

It is necessary to compare our proposed methods with previous models using the same dataset, classification strategy, and evaluation methods. For the purposes of comparison with the previous results (Cao et al., 2017; Wang et al., 2019), our predicted results of fivefold cross-validation and independent test are displayed in Tables 8, 9, respectively. Comparing the previous results in our group (Cao et al., 2017; Wang et al., 2019), most of the metal ion ligands were improved to different degrees. With the same dataset, the same feature parameters, classification strategy, and evaluation methods, we further made a comparison between the GBM algorithm and several other machine learning methods, including SVM, Random Forest, and Artificial Neural Network. Using the same features, the ACC and MCC values of each classifier for ten ligands are displayed in Figure 4. The results showed that accuracies of the GBM classifier were higher than other machine learning methods, indicating that the GBM classifier was a powerful tool for predicting metal ion ligand binding residues.

TABLE 9

Table 9. Comparison of our optimal predicted results in fivefold cross-validation with previous results.

FIGURE 4

Figure 4. The comparison of prediction performances between several machine learning methods based on the same features by using five fold cross-validation test.

Conclusion

The interactions between metal ion ligands (e.g., Na⁺, Mn²⁺, Ca²⁺, K⁺, and Cu²⁺) and proteins perform key biological functions in many important life processes. Research into these metal ion ligands and functions is of significant biological import. In particular, the prediction of ligand binding residues is of great significance to the understanding of the biological functions of proteins and drug design. In this work, we predicted the binding residues of 10 metal ion ligands in the BioLip database, and we obtained improved results. According to the biological background of proteins, we selected hydrophobic polarized charges, predicted secondary structures, and RSA information as the basic information. From the statistical analysis of RSA information, we found that the reclassified RSA information has important effects on recognition of metal ion ligand-binding residues. Therefore, on the basis of primary sequence information, we extracted the important features of RSA by reclassifying the RSA as four different classifications (i.e., SA_2, SA_V, SA_P, and SA_4). Using the GBM algorithm and an overall classification strategy, we further improved the prediction success rate of metal ion ligand binding residues in the cross-validation and independent test. In the best performance, MCC values were higher than 0.558, the FPR values were lower than 20.8%, and the Acc values were higher than 77.9%. In comparison with previous results (Cao et al., 2017), our best accuracy of fivefold cross-validation was about 16% higher on the same dataset. In this research, we identified the specific contributions of different reclassified RSA to the identification of 10 ligand-binding residues. However, for the prediction performances of different ligands, there are different improvements that can indicate the differences in the ligand-binding residues. Our next step is to prove this specialty. To make our models available for other researchers, we provide our database in Supplementary Material 4 and full feature parameters in the additional material. In our future work, we will make efforts to provide a web server for the analysis method presented in this paper, which can be manipulated by readers according to their need.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation, to any qualified researcher.

Author Contributions

XH and ZF designed the experiments. XZ and XH performed the experiments. ZF improved the English. LL and SW organized the data. All authors read and approved the final manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61961032 and 31260203), the Natural Science Foundation of the Inner Mongolia of China (2019BS03025), and the Natural Science Foundation of Inner Mongolia University of Technology (ZY201915).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We thank the lab of Yang Zhang (https://zhanglab.ccmb.med.umich.edu/BioLiP/) and Xiaoyong Cao for sharing the data.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00214/full#supplementary-material

SUPPLEMENTARY MATERIAL 1 | Relevant statistical analysis of physicochemical properties.

SUPPLEMENTARY MATERIAL 2 | The four grouping methods of relative solvent accessibility of metal ion ligands.

SUPPLEMENTARY MATERIAL 3 | Predicted results of 10 metal ion ligand-binding residues based on different classification.

SUPPLEMENTARY MATERIAL 4 | The PDB IDs of protein chains in the datasets.

References

Babor, M., Gerzon, S., Raveh, B., Sobolev, V., and Edelman, M. (2010). Prediction of transition metal-binding sites from apo protein structures. Proteins Struct. Funct. Bioinform. 70, 208–217. doi: 10.1002/prot.21587

PubMed Abstract | CrossRef Full Text | Google Scholar

Bernstein, F. C., Koetzle, T. F., Williamans, G. J., Meyer, E. F., Brice, M. D., Rodgrts, J. R., et al. (1997). The protein data bank. a computer-base archival file for macromolecular structures. Eur. J. Biochem. 112, 535–542. doi: 10.1016/s0022-2836(77)80200-3

CrossRef Full Text | Google Scholar

Bordner, A. J. (2008). Predicting small ligand binding sites in proteins using backbone structure. Bioinformatics 24, 2865–2871. doi: 10.1093/bioinformatics/btn543

PubMed Abstract | CrossRef Full Text | Google Scholar

Cao, X. Y., Hu, X. Z., Zhang, X. J., Gao, S. J., Ding, C. J., Feng, Y. E., et al. (2017). Identification of metal ion binding sites based on amino acid sequences. PLoS One 12:e0183756. doi: 10.1371/journal.pone.0183756

PubMed Abstract | CrossRef Full Text | Google Scholar

Caspers, M. L., Kwaiser, T. M., and Grammas, P. (1990). Control of [3H] ouabain binding to cerebromicrovascular (Na+ + K+)-ATPase by metal ions and proteins. Biochem. Pharmacol. 39, 1891–1895. doi: 10.1016/0006-2952(90)90606-l

PubMed Abstract | CrossRef Full Text | Google Scholar

Chauhan, J. S., Mishra, N. K., and Raghava, G. P. (2010). Prediction of GTP interacting residues dipeptides and tripeptides in a protein from its evolutionary information. BMC Bioinform. 11:301. doi: 10.1186/1471-2105-11-301

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, H., and Zhou, H. X. (2005). Prediction of solvent accessibility and sites of deleterious mutations from protein sequence. Nucleic Acids Res. 33, 3193–3199. doi: 10.1093/nar/gki633

PubMed Abstract | CrossRef Full Text | Google Scholar

Degtyarenko, K. (2000). Bioinorganic motifs: towards functional classification of metalloproteins. Bioinformatics 16, 851–864. doi: 10.1093/bioinformatics/16.10.851

PubMed Abstract | CrossRef Full Text | Google Scholar

Ebert, J. C., and Altman, R. B. (2010). Robust recognition of zinc binding sites in proteins. Protein Sci. 17, 54–65. doi: 10.1110/ps.073138508

PubMed Abstract | CrossRef Full Text | Google Scholar

Feng, Z. X., and Li, Q. Z. (2017). Recognition of long-range enhancer-promoter interactions by adding genomic signatures of segmented regulatory regions. Genomics 109, 341–352. doi: 10.1016/j.ygeno.2017.05.009

PubMed Abstract | CrossRef Full Text | Google Scholar

Feng, Z. X., Li, Q. Z., and Meng, J. J. (2018). Recognition of the long range enhancer-promoter interactions by further adding DNA structure properties and transcription factor binding motifs in human cell lines. J. Theor. Biol. 445, 136–150. doi: 10.1016/j.jtbi.2018.02.023

PubMed Abstract | CrossRef Full Text | Google Scholar

Friedman, J. H. (2001). Greedy function approximation:a gradient boosting machine. Ann. Statist. 29, 1189–1232.

Google Scholar

Friedman, J. H. (2002). Stochastic gradient boosting. Comput. Statist. Data Analy. 38, 367–378. doi: 10.1016/s0167-9473(01)00065-2

CrossRef Full Text | Google Scholar

Hu, X. Z., Dong, Q. W., Yang, J. Y., and Zhang, Y. (2016a). Recognizing metal and acid radical ion-binding sites by integrating ab initio modeling with template-based transferals. Bioinformatics 32:3260. doi: 10.1093/bioinformatics/btw396

PubMed Abstract | CrossRef Full Text | Google Scholar

Hu, X. Z., Wang, K., and Dong, Q. W. (2016b). Protein ligand-specific binding residue predictions by an ensemble classifier. BMC Bioinform. 17:470. doi: 10.1186/s12859-016-1348-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Hubbard, T. J. R., Murzin, A. G., Brenner, S. E., and Chothla, C. (1997). SCOP: a structural classification of proteins database. Nucleic Acids Res. 25, 236–239. doi: 10.1093/nar/25.1.236

PubMed Abstract | CrossRef Full Text | Google Scholar

Jain, D. S., Gupte, S. R., and Aduri, R. (2018). A data driven model for predicting RNA-Protein interactions based on gradient boosting machine. Sci. Rep. 8, 9552–9561. doi: 10.1038/s41598-018-27814-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Jiang, Z., Hu, X. Z., Geriletu, G., Xing, H. R., and Cao, X. Y. (2015). Identification of Ca (2+)-binding residues of a protein from its primary sequence. Genet. Mol. Res. 15, 1676–1680. doi: 10.4238/gmr.15027618

PubMed Abstract | CrossRef Full Text | Google Scholar

Kawahashi, Y. (2003). In vitro protein microarrays for detecting protein-protein interactions: application of a new method for fluorescence labeling of proteins. Proteomics 3, 1236–1243. doi: 10.1002/pmic.200300444

PubMed Abstract | CrossRef Full Text | Google Scholar

Kel, A. E., Gößling, E., Reuter, I., Cheremushlin, E., Kel-Margoulis, O. V., and Wingender, E. (2003). MATCHTM: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31, 3576–3579. doi: 10.1093/nar/gkg585

PubMed Abstract | CrossRef Full Text | Google Scholar

Kursa, M. B., Jankowski, A., and Rudnicki, W. R. (2010). Boruta - a system for feature selection. IOS Press 101, 271–285. doi: 10.3233/fi-2010-288

CrossRef Full Text | Google Scholar

Kursa, M. B., and Rudnicki, W. R. (2010). Feature selection with the boruta package. J. Statist. Softw. 36, 1–13.

Google Scholar

Lin, C. T., Lin, K. L., Yang, C. H., Chung, I. F., Huang, C. D., and Yang, Y. S. (2005). Protein metal binding residue prediction based on neural networks. Int. J. Neural Syst. 15, 71–84. doi: 10.1142/s0129065705000116

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, H. H., Han, L. Y., Zhang, H. L., Zheng, C. J., Xie, B., and Chen, Y. Z. (2006). Prediction of the functional class of lipid binding proteins from sequence-derived properties irrespective of sequence similarity. J. Lipid Res. 47, 824–831. doi: 10.1194/jlr.m500530-jlr200

PubMed Abstract | CrossRef Full Text | Google Scholar

Lu, C. H., Lin, Y. F., Lin, J. J., and Yu, C. S. (2012). Prediction of metal ion–binding sites in proteins using the fragment transformation method. PLoS One 7:e39252. doi: 10.1371/journal.pone.0039252

PubMed Abstract | CrossRef Full Text | Google Scholar

Panek, J., and Eidhammer, I. R. (2010). A new method for identification of protein (sub) families in a set of proteins based on hydropathy distribution in proteins. Proteins Struct. Funct. Bioinform. 58, 923–934. doi: 10.1002/prot.20356

PubMed Abstract | CrossRef Full Text | Google Scholar

Rawi, R., Mall, R., Kunji, K., Shen, C. H., Kwong, P. D., and Chuang, G. Y. (2018). PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics 34, 1092–1098. doi: 10.1093/bioinformatics/btx662

PubMed Abstract | CrossRef Full Text | Google Scholar

Reed, G. H., and Poyner, R. R. (2000). Mn2+ as a probe of divalent metal ion binding and function in enzymes and other proteins. Met. Ions Biol. Syst. 37, 183–207.

Google Scholar

Roy, A., Yang, J., and Zhang, Y. (2012). COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res. 40, W471–W477. doi: 10.1093/nar/gks372

PubMed Abstract | CrossRef Full Text | Google Scholar

Roy, S., Siahpirani, A. F., Chasman, D., Knaack, S., Ay, F., Stewart, R., et al. (2015). A predictive modeling approach for cell line-specifific long-range regulatory interactions. Nucleic Acids Res. 43, 8694–8712. doi: 10.1093/nar/gkv865

PubMed Abstract | CrossRef Full Text | Google Scholar

Selvarengan, P., and Kolandaivel, P. (2005). Study of metal ions (Na+, K+) interaction with different conformations of glycine molecule. Inter. J. Quant. Chem. 102, 427–434. doi: 10.1002/qua.20413

CrossRef Full Text | Google Scholar

Sletten, E. (1997). “The binding of transition metal ions to DNA oligonucleotides studied by nuclear magnetic resonance spectroscopy,” in Cytotoxic, Mutagenic and Carcinogenic Potential of Heavy Metals Related to Human Environment. NATO ASI Series (Series 2: Environment), ed. N. D. Hadjiliadis (Dordrecht: Springer).

Google Scholar

Sobolev, V., and Edelman, M. (2013). Web tools for predicting metal binding sites in proteins. Israel J. Chem. 53, 166–172. doi: 10.1002/ijch.201200084

CrossRef Full Text | Google Scholar

Sobolev, V., Sorokine, A., Prilusky, J., Abola, E. E., and Edelman, M. (1999). Automated analysis of interatomic contacts in proteins. Bioinformatics 15, 327–332. doi: 10.1093/bioinformatics/15.4.327

PubMed Abstract | CrossRef Full Text | Google Scholar

Sodhi, J. S., Bryson, K., McGuffin, L. J., Ward, J. J., Wernisch, L., and Jones, D. T. (2004). Predicting metal-binding site residues in low-resolution structural models. J. Mol. Biol. 342, 307–320. doi: 10.1016/j.jmb.2004.07.019

PubMed Abstract | CrossRef Full Text | Google Scholar

Sun, L. X., Hu, X. Z., Li, S. B., Jiang, Z., and Li, K. (2016). Prediction of complex super-secondary structure βαβ motifs based on combined features. Saudi J. Biol. Sci. 23, 66–71. doi: 10.1016/j.sjbs.2015.10.005

PubMed Abstract | CrossRef Full Text | Google Scholar

Supek, F., Supekova, L., Nelson, H., and Nelson, N. (1997). Function of metal-ion homeostasis in the cell division cycle, mitochondrial protein processing, sensitivity to mycobacterial infection and brain function. J. Exp. Biol. 200, 321–330.

PubMed Abstract | Google Scholar

Tainer, J. A., Roberts, V. A., and Getzoff, E. D. (1991). Metal-binding sites in proteins. Curr. Opin. Biotechnol. 2, 582–591.

PubMed Abstract | Google Scholar

Taylor, W. R. (1986). The classification of amino acid conservation. J. Theor. Biol. 119, 205–218. doi: 10.1016/s0022-5193(86)80075-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, S., Hu, X. Z., Feng, Z. X., Zhang, X. J., Liu, L., Sun, K., et al. (2019). Recognizing ion ligand binding sites by SMO algorithm. BMC Mol. Cell Biol. 20(Suppl. 3):53. doi: 10.1186/s12860-019-0237-9

PubMed Abstract | CrossRef Full Text | Google Scholar

Wu, S., and Zhang, Y. (2008). ANGLOR: a composite machine-learning algorithm for protein backbone torsion angle prediction. PLoS One 3:e3400. doi: 10.1371/journal.pone.0003400

PubMed Abstract | CrossRef Full Text | Google Scholar

Yang, J. Y., Roy, A., and Zhang, Y. (2013a). BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 41, D1096–D1103. doi: 10.1093/nar/gks966

PubMed Abstract | CrossRef Full Text | Google Scholar

Yang, J. Y., Roy, A., and Zhang, Y. (2013b). Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 29, 2588–2595. doi: 10.1093/bioinformatics/btt447

PubMed Abstract | CrossRef Full Text | Google Scholar

Yang, J. Y., Yan, R. X., Roy, A., Xu, D., Poission, J., and Zhang, Y. (2015). The I-TASSER suite: protein structure and function prediction. Nat. Methods 12, 7–8. doi: 10.1038/nmeth.3213

PubMed Abstract | CrossRef Full Text | Google Scholar

Yen, S. J., and Lee, Y. S. (2006). Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. Lect. Notes Control Inform. Sci. 344, 731–740. doi: 10.1007/11816492_89

CrossRef Full Text | Google Scholar

Keywords: metal ion ligand, binding residues, relative solvent accessibility, secondary structure, position weight matrix

Citation: Hu X, Feng Z, Zhang X, Liu L and Wang S (2020) The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility. Front. Genet. 11:214. doi: 10.3389/fgene.2020.00214

Received: 09 October 2019; Accepted: 24 February 2020;
Published: 19 March 2020.

Edited by:

Pinyi Lu, Biotechnology HPC Software Applications Institute (BHSAI), United States

Reviewed by:

Chang Chen, Northwestern University, United States
Han Wang, Northeast Normal University, China
Yuxing Liao, Baylor College of Medicine, United States
Suresh Kumar, Management and Science University, Malaysia, Malaysia

Copyright © 2020 Hu, Feng, Zhang, Liu and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zhenxing Feng, ZmVuZ3poZW54aW5nMDcxOUAxMjYuY29t; Xiaojin Zhang, Mzk4NzMwMTYyQHFxLmNvbQ==

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility

Introduction

Materials and Methods

Benchmark Dataset

Selection and Extraction of Feature Parameters

Secondary Structure and Relative Solvent Accessibility

Physicochemical Properties of Amino Acids

Construction of Position Weight Matrix

Gradient Boosting Machine

The Validation and Evaluation Metrics

Results and Discussion

The Classification of Relative Solvent Accessibility

Predicted Results for K⁺ Ligand Binding Residues

Statistical Analysis of the Relative Solvent Accessibility

The Predicted Results of Four General RSA Classifications

The Optimal Predicted Results of Ten Metal Ion Ligand-Binding Residues

The Predicted Results (by Use of the Boruta Algorithm)

The Predicted Results of GBM by Using an Independent Test

Comparison With Other Methods

Conclusion

Data Availability Statement

Author Contributions

Funding

Conflict of Interest

Acknowledgments

Supplementary Material

References

94% of researchers rate our articles as excellent or good

94% of researchers rate our articles as excellent or good

The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility

Introduction

Materials and Methods

Benchmark Dataset

Selection and Extraction of Feature Parameters

Secondary Structure and Relative Solvent Accessibility

Physicochemical Properties of Amino Acids

Construction of Position Weight Matrix

Gradient Boosting Machine

The Validation and Evaluation Metrics

Results and Discussion

The Classification of Relative Solvent Accessibility

Predicted Results for K+ Ligand Binding Residues

Statistical Analysis of the Relative Solvent Accessibility

The Predicted Results of Four General RSA Classifications

The Optimal Predicted Results of Ten Metal Ion Ligand-Binding Residues

The Predicted Results (by Use of the Boruta Algorithm)

The Predicted Results of GBM by Using an Independent Test

Comparison With Other Methods

Conclusion

Data Availability Statement

Author Contributions

Funding

Conflict of Interest

Acknowledgments

Supplementary Material

References

94% of researchers rate our articles as excellent or good

Predicted Results for K⁺ Ligand Binding Residues