Prediction of the tetramer protein complex interaction based on CNN and SVM

Lyu, Yanfen; He, Ruonan; Hu, Jingjing; Wang, Chunxia; Gong, Xinqi

doi:10.3389/fgene.2023.1076904

ORIGINAL RESEARCH article

Front. Genet., 26 January 2023

Sec. Computational Genomics

Volume 14 - 2023 | https://doi.org/10.3389/fgene.2023.1076904

This article is part of the Research Topic Omics-based Novel Computational Methods Revealing Microbe-Disease Associations

Yanfen Lyu¹

Ruonan He²

Jingjing Hu¹

Chunxia Wang³*

Xinqi Gong⁴,⁵*

¹Department of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan, China
²School of Information, Renmin University of China, Beijing, China
³School of Landscape and Ecological Engineering, Hebei University of Engineering, Handan, China
⁴Mathematical Intelligence Application Lab, Institute for Mathematical Sciences, School of Math, Renmin University of China, Beijing, China
⁵Beijing Academy of Artificial Intelligence, Beijing, China

Protein-protein interactions play an important role in life activities. Studying them helps to clarify the mechanism of protein complex interaction, which is crucial for drug design, protein function annotation and three-dimensional structure prediction of protein complexes. In this paper, we study the tetramer protein complex interaction. The research has two parts. The first part predicts the interaction between chains of the tetramer protein complex. In this part, we propose a feature map to represent a sample generated by two chains of the tetramer protein complex, and construct a Convolutional Neural Network (CNN) model to predict the interaction between chains. The AUC value on the testing set is 0.6263, which indicates that our model can be used to predict the interaction between chains of the tetramer protein complex. The second part predicts the tetramer protein complex interface residue pairs. In this part, we propose a Support Vector Machine (SVM) ensemble method, which combines under-sampling with ensemble learning, to predict the interface residue pairs. Considering the top 10 predictions, and counting a case as correct when at least one protein-protein interaction interface is correctly predicted, the accuracy of our method is 82.14%. This result shows that our method is effective for the prediction of the tetramer protein complex interface residue pairs.

1 Introduction

Protein-protein interactions are significant in various biological activities and processes, such as signal transmission, gene expression and transcriptional regulation (Levy and Pereira-Leal, 2008; Malta et al., 2018; Li et al., 2019; Lyu et al., 2020; Zhao et al., 2022). Interactions between proteins in the body can form dimer protein complexes, trimer protein complexes, tetramer protein complexes and higher polymers. The more monomers in a polymer, the more complex its internal interactions become. Therefore, studying protein-protein interactions contributes to a better understanding of the formation mechanism of multibody protein complexes (Gao and Skolnick, 2012; Sun et al., 2020a). Under certain conditions, some protein-protein interaction interface residue pairs are functional sites of protein complexes and are associated with certain diseases (Oganesyan et al., 2004; McKinstry et al., 2009; Vidal et al., 2011; Li et al., 2021a; Baek et al., 2021). If the interface residue pairs of a protein-protein interaction can be provided, this will greatly help multibody protein complex structural design, protein complex function prediction and drug design (Yang et al., 2015; Zhang et al., 2017; Zhang et al., 2017).

With the development of technology, several experimental methods can be used to study the interactions of multibody protein complexes, such as X-ray crystallography, cryogenic electron microscopy (Cryo-EM) and Nuclear Magnetic Resonance (NMR) (Drennan et al., 1994; Sun et al., 2020b). These experimental methods have made great contributions to our understanding of the protein complex interaction mechanism. However, due to experimental conditions or technical limitations, it is impossible to use experimental methods to study all protein complex interactions. For example, X-ray crystallography can only be used to study protein complexes that form stable crystals, and NMR is limited by the size of the protein complex. Nevertheless, these experimental methods have accumulated a substantial body of protein complex data, which provides the data basis for computational methods to study protein complex interactions.

At present, researchers have developed several computational methods to predict protein complex interactions. For example, Wang et al. used different machine learning methods to predict different types of protein-protein interaction interface residue pairs (Wang et al., 2017). Ovchinnikov et al. proposed a method based on evolutionary information to predict protein-protein interaction interface residue pairs (Ovchinnikov et al., 2014). Du et al. used deep learning (a stacked autoencoder) to build a deep neural network model for the residue-residue contact prediction problem (Du et al., 2016). Liu and Gong used an attention-mechanism-enhanced Long Short Term Memory (LSTM) model to predict dimer protein complex interface residue pairs (Liu and Gong, 2019). Weigt et al. predicted residue contacts in protein-protein interaction by message passing (Weigt et al., 2009). We also developed a two-layer support vector machine ensemble classifier to predict trimer protein complex interface residue pairs (Lyu and Gong, 2020). For many other methods, see (Kamisetty et al., 2013; Fu et al., 2014; Michel et al., 2014; He et al., 2017; Li et al., 2019; Zhang et al., 2020; Li et al., 2021a; Li et al., 2021b; Humphreys et al., 2021; Jumper et al., 2021; Mylonas et al., 2021; Knutson et al., 2022). These methods have achieved good results in the study of protein complex interaction, but most of them focus on dimer and trimer protein complex interaction, and few on tetramer protein complex interaction. Sun and Gong developed a deep network based on an LSTM network with a graph to predict the tetramer protein complex interface residue pairs, but their method did not consider whether the chains of the tetramer protein complex interact with each other (Sun and Gong, 2020). Predicting protein-protein interactions and non-interactions is very important for the study of multibody protein interactions (Humphreys et al., 2021; Zhao et al., 2022). Thus, new methods are needed for studying tetramer protein interaction.

To address the shortcomings mentioned above, we carried out two parts of work on tetramer protein complex interaction. The first part predicts the interaction between chains of the tetramer protein complex. The second part predicts the tetramer protein complex interface residue pairs; that is, assuming that the interaction between two chains of the tetramer protein complex is known, we predict the interface residue pairs formed by the interaction.

In the first part, the protein sequence is mapped into five number sequences according to five geometric properties of each residue. Based on these number sequences, we define the position change sequence and the geometric feature change sequences of amino acids of the same type. Then, using four mathematical statistics, we extract a 20 × 24 feature map to represent a sample generated by two chains of the tetramer protein complex. Finally, we construct a CNN model based on the PyTorch framework to predict the interaction between chains of the tetramer protein complex.

In the second part, the influence of surrounding amino acids (residues) on the central amino acid (the central residue) is fully considered in feature extraction. We define the Amino Acid k-Average Cumulative Factor and combine it with the Amino Acid k-Interval Product Factor to extract features based on protein sequence. We also define the Residue k-Interval Product Factor, the Residue k-Average Cumulative Factor and a weight factor to extract features based on protein three-dimensional structure. Finally, we propose an SVM ensemble method to predict the tetramer protein complex interface residue pairs.

2 Materials and methods

2.1 Dataset

In this paper, we collected 111 tetramer protein complexes from the Protein Data Bank according to three requirements: the protein complex has four chains, each chain contains between 20 and 500 amino acids, and the crystal structure was obtained by X-ray crystallography. The PDB IDs of these 111 protein tetramers are listed in Supplementary Table S1. If the contact area between any two atoms from two residues on two different chains is greater than zero, we call these two residues an interface residue pair (Lyu and Gong, 2020). The contact area between two atoms is calculated with the Qcontacts software. If there is at least one interface residue pair between two chains of the tetramer protein complex, we call the two chains interacting; otherwise the two chains are non-interacting.
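The labelling rule for chain pairs can be sketched in a few lines of Python. This is only an illustration: the input format (a list of residue pairs tagged with chain names) and the function name `label_chain_pairs` are assumptions, and the contact-area computation itself is not reproduced here.

```python
from itertools import combinations

def label_chain_pairs(chains, interface_pairs):
    """Label every unordered chain pair: 1 if at least one interface
    residue pair links the two chains, otherwise 0.

    interface_pairs holds ((chain, residue_id), (chain, residue_id)) tuples;
    this format is hypothetical and stands in for contact-area output.
    """
    interacting = {frozenset((a[0], b[0]))
                   for a, b in interface_pairs if a[0] != b[0]}
    return {pair: int(frozenset(pair) in interacting)
            for pair in combinations(chains, 2)}

# Toy input, not taken from any real PDB entry.
toy_pairs = [(("A", 12), ("B", 40)), (("A", 13), ("B", 41)), (("C", 7), ("D", 9))]
labels = label_chain_pairs("ABCD", toy_pairs)
```

For a tetramer the dictionary always has six entries, one per pair of chains.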

2.2 Construct feature map and CNN model to predict the interaction between chains of the tetramer protein complex

2.2.1 Construct feature map

Let P be a protein sequence of length L (formula 1). In the three-dimensional structure of protein P, different amino acids have different geometric properties. These geometric properties, such as Accessible Surface Area (ASA), Relative solvent Accessible Surface Area (RASA), Exterior Contact Area (ECA), Interior Contact Area (ICA), and Exterior Void Area (EVA), play important roles in multibody protein complex interactions (Wang et al., 2017; Yang and Gong, 2018; Liu and Gong, 2019; Zhao and Gong, 2019; Lyu and Gong, 2020; Sun and Gong, 2020). In this paper, we use these five geometric properties to predict the tetramer protein complex interaction. The five geometric properties and the tools used to compute them are introduced in detail in (Liu and Gong, 2019; Zhao and Gong, 2019; Lyu and Gong, 2020).

P = P_{1} P_{2} \dots P_{L} (1)

Where $P_{j} \in Ω$, $Ω = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$, and $A, C, \dots, Y$ are the one-letter abbreviations of the amino acid names.

According to the 5 geometric properties of each amino acid, the protein sequence P is mapped into 5 number sequences, see formula 2. We used P¹, P², P³, P⁴ and P⁵ to represent the 5 number sequences. These 5 number sequences are the ASA number sequence, RASA number sequence, ECA number sequence, ICA number sequence and EVA number sequence.

P^{i} = φ_{1}^{i} φ_{2}^{i} \dots φ_{L}^{i} \quad (i = 1,2,3,4,5) (2)

Where $φ_{j}^{i}$ is the value of the i-th geometric property of amino acid $P_{j}$ in formula 1 ($j = 1, 2, \dots, L$): $φ_{j}^{1}$ is its ASA value, $φ_{j}^{2}$ its RASA value, $φ_{j}^{3}$ its ECA value, $φ_{j}^{4}$ its ICA value, and $φ_{j}^{5}$ its EVA value.

For any amino acid $x \in Ω$, suppose that $x$ occurs n times in protein sequence P, that its occurrence positions from left to right are $α_{1}, α_{2}, \dots, α_{n}$, and that the corresponding values in the i-th number sequence are $β_{1}^{i}, β_{2}^{i}, \dots, β_{n}^{i}$ ($i = 1,2,3,4,5$).

We define the same type of amino acid position change sequence $f_{x} (τ)$ as follows:

f_{x}(τ) = \begin{cases} α_{τ+1} - α_{τ}, & 1 \leq τ \leq n-1 \ (n > 1) \\ α_{1} - \frac{\sum_{j=1}^{L} j}{L}, & n = 1 \\ 0, & n = 0 \end{cases} (3)

We define the same type of amino acid geometry feature change sequence $f_{x}^{i} (τ)$ as follows:

f_{x}^{i}(τ) = \begin{cases} β_{τ+1}^{i} - β_{τ}^{i}, & 1 \leq τ \leq n-1 \ (n > 1) \\ β_{1}^{i} - \frac{\sum_{τ=1}^{L} β_{τ}^{i}}{L}, & n = 1 \\ 0, & n = 0 \end{cases} \quad (i = 1,2,3,4,5) (4)

The monomer protein can be represented by the same type of amino acid position change sequence $f_{x} (τ)$ and the same type of amino acid geometry feature change sequences $f_{x}^{i} (τ)$. This representation, based on change sequences of amino acid position and geometric features, preserves important information about the protein sequence and three-dimensional structure, so it is feasible to apply it to protein complex interaction prediction.

Based on the same type of amino acid position change sequence $f_{x} (τ)$ and the same type of amino acid geometry feature change sequences $f_{x}^{i} (τ)$ , we extract 24 features.

Firstly, we extract four mathematical statistics from the same type of amino acid position change sequence $f_{x} (τ)$ as follows:

(1). The frequency of amino acid x, denoted as $F_{x}$ , see formula 5. $|f_{x} (τ)|$ represents the length of the same type of amino acid position change sequence $f_{x} (τ)$ .

F_{x} = \frac{|f_{x} (τ)| + 1}{L} (5)

(2). The arithmetic mean of the same type of amino acid x position change sequence $f_{x} (τ)$ , denoted as $A_{x}$ , see formula 6.

A_{x} = \begin{cases} f_{x}(1), & n = 1 \\ \frac{\sum_{τ=1}^{n-1} f_{x}(τ)}{|f_{x}(τ)|}, & n > 1 \\ 0, & n = 0 \end{cases} (6)

(3). The minimum of the same type of amino acid x position change sequence $f_{x} (τ)$ , denoted as $B_{x}$ , see formula 7.

B_{x} = \begin{cases} \min(f_{x}(τ)), & n \geq 1 \\ 0, & n = 0 \end{cases} (7)

(4). The maximum of the same type of amino acid x position change sequence $f_{x} (τ)$ , denoted as $M_{x}$ , see formula 8.

M_{x} = \begin{cases} \max(f_{x}(τ)), & n \geq 1 \\ 0, & n = 0 \end{cases} (8)

Secondly, we extract four mathematical statistics from the same type of amino acid geometry feature change sequence $f_{x}^{i} (τ)$ , as follows:

(1). The arithmetic mean of the same type of amino acid x geometry feature change sequence $f_{x}^{i} (τ)$ , denoted as $A_{x}^{i}$ , see formula 9.

A_{x}^{i} = \begin{cases} \frac{\sum_{τ} f_{x}^{i}(τ)}{|f_{x}^{i}(τ)|}, & n \geq 1 \\ 0, & n = 0 \end{cases} \quad (i = 1,2,3,4,5) (9)

(2). The minimum of the same type of amino acid x geometry feature change sequence $f_{x}^{i} (τ)$ , denoted as $B_{x}^{i}$ , see formula 10.

B_{x}^{i} = \begin{cases} \min(f_{x}^{i}(τ)), & n \geq 1 \\ 0, & n = 0 \end{cases} \quad (i = 1,2,3,4,5) (10)

(3). The maximum of the same type of amino acid x geometry feature change sequence $f_{x}^{i} (τ)$ , denoted as $M_{x}^{i}$ , see formula 11.

M_{x}^{i} = \begin{cases} \max(f_{x}^{i}(τ)), & n \geq 1 \\ 0, & n = 0 \end{cases} \quad (i = 1,2,3,4,5) (11)

(4). The ratio of the arithmetic mean of the same type of amino acid x geometry feature change sequence to the arithmetic mean of the same type of amino acid x position change sequence, denoted as $R_{x}^{i}$ , see formula 12.

R_{x}^{i} = \frac{A_{x}^{i}}{A_{x}} (i = 1,2,3,4,5) (12)

According to the above statistics, we obtain 4+4×5 = 24 features to characterize each type of amino acid. The monomer protein is composed of 20 types of amino acids. So we use a 20 × 24 dimension matrix $Q$ to represent each monomer protein, as shown in formula 13.

Q = \begin{bmatrix} q_{1,1} & \dots & q_{1,24} \\ \vdots & \ddots & \vdots \\ q_{20,1} & \dots & q_{20,24} \end{bmatrix} (13)

Each row corresponds to one of the 20 amino acid types, and the columns hold the 24 features of that amino acid type.

To illustrate how the 24 features are calculated, we give the following example:

For protein sequence P = ACAGAHHAALKAYAW, we calculate 24 features of the A amino acid. According to the definition of amino acid position change sequence, we can get $f_{A}$ = 2 2 3 1 3 2. Then, we use Qcontacts software to calculate the ASA value of each amino acid on protein sequence P, so as to obtain the ASA number sequence $P^{1}$ = 6 4 7 5 7 8 2 8 9 3 7 11 10 14 15. According to the definition of amino acid geometry feature change sequence, we can get $f_{A}^{1}$ = 1 0 1 1 2 3.

Applying formulas 5-12, with $|f_{A}| = |f_{A}^{1}| = 6$:

F_{A} = \frac{7}{15}, A_{A} = \frac{13}{6}, B_{A} = 1, M_{A} = 3

A_{A}^{1} = \frac{8}{6} = \frac{4}{3}, B_{A}^{1} = 0, M_{A}^{1} = 3, R_{A}^{1} = \frac{A_{A}^{1}}{A_{A}} = \frac{8}{13}

The statistics of the other four geometry feature change sequences are calculated in the same way as those of the ASA feature change sequence. So we can obtain 4+4×5 = 24 features to characterize the A amino acid.
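The example can be reproduced with a short script that applies formulas 3-12 literally to one geometric property. This is a sketch under stated assumptions: the function names are illustrative, the n = 1 branch of formula 4 is read as "value minus the mean of the whole number sequence", the frequency is computed as n/L (which coincides with formula 5 when n > 1), and the ratio of formula 12 is set to 0 when the denominator is 0. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def change_sequences(seq, values, x):
    """Return (f_x, f_x^i) for amino acid x, following formulas 3 and 4."""
    L = len(seq)
    pos = [j + 1 for j, aa in enumerate(seq) if aa == x]   # 1-based positions
    beta = [values[p - 1] for p in pos]                    # property values at those positions
    if not pos:                                            # n = 0
        return [0], [0]
    if len(pos) == 1:                                      # n = 1 (interpretation, see lead-in)
        return ([pos[0] - Fraction(L + 1, 2)],
                [beta[0] - Fraction(sum(values)) / L])
    return ([b - a for a, b in zip(pos, pos[1:])],
            [b - a for a, b in zip(beta, beta[1:])])

def mean(items):
    """Arithmetic mean as an exact fraction."""
    return Fraction(sum(items)) / len(items)

seq = "ACAGAHHAALKAYAW"
asa = [6, 4, 7, 5, 7, 8, 2, 8, 9, 3, 7, 11, 10, 14, 15]   # ASA number sequence of the example

f_pos, f_asa = change_sequences(seq, asa, "A")
stats = {
    "F": Fraction(seq.count("A"), len(seq)),   # n / L, equals (|f_x| + 1) / L when n > 1
    "A": mean(f_pos), "B": min(f_pos), "M": max(f_pos),
    "A1": mean(f_asa), "B1": min(f_asa), "M1": max(f_asa),
}
# Formula 12; returning 0 for a zero denominator is an assumed convention.
stats["R1"] = stats["A1"] / stats["A"] if stats["A"] != 0 else Fraction(0)
```

Repeating the last block for the other four number sequences, and then for each of the 20 amino acid types, fills one row of the matrix Q per amino acid type.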

A tetramer protein complex is composed of four chains, and any two chains can generate a sample, so a tetramer protein complex can generate six samples. We use a 20 × 24 matrix S to represent a sample, where each entry of S is the absolute value of the difference between the corresponding entries of the two matrices generated by the two chains. Take 1REW_ABCD as an example, where 1REW is the PDB ID of the tetramer protein complex and A, B, C and D are the names of its four chains. We use $Q_{A}$ , $Q_{B}$ , $Q_{C}$ and $Q_{D}$ to represent the 20 × 24 matrices generated by the four chains respectively. A total of six samples are generated from the 1REW protein tetramer, as follows:

Sample 1 generated by A chain and B chain: $S_{1} = {(|a_{u v} - b_{u v}|)}_{20 \times 24}$

Sample 2 generated by A chain and C chain: $S_{2} = {(|a_{u v} - c_{u v}|)}_{20 \times 24}$

Sample 3 generated by A chain and D chain: $S_{3} = {(|a_{u v} - d_{u v}|)}_{20 \times 24}$

Sample 4 generated by B chain and C chain: $S_{4} = {(|b_{u v} - c_{u v}|)}_{20 \times 24}$

Sample 5 generated by B chain and D chain: $S_{5} = {(|b_{u v} - d_{u v}|)}_{20 \times 24}$

Sample 6 generated by C chain and D chain: $S_{6} = {(|c_{u v} - d_{u v}|)}_{20 \times 24}$

Where $Q_{A} = {(a_{u v})}_{20 \times 24}, Q_{B} = {(b_{u v})}_{20 \times 24}, Q_{C} = {(c_{u v})}_{20 \times 24}, Q_{D} = {(d_{u v})}_{20 \times 24}$ .

We use $Y_{r}$ to denote the sample label, $Y_{r} = 1$ denotes that there is interaction between two chains and $Y_{r} = 0$ denotes that there is no interaction between two chains.
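The construction of the six sample matrices can be sketched as follows. The helper name `pair_samples` and the toy matrices are assumptions made for illustration; real chain matrices would hold the 24 statistics of each amino acid type.

```python
from itertools import combinations

def pair_samples(chain_matrices):
    """Map each unordered chain pair to the element-wise |Q_a - Q_b| matrix."""
    samples = {}
    for a, b in combinations(sorted(chain_matrices), 2):
        qa, qb = chain_matrices[a], chain_matrices[b]
        samples[(a, b)] = [[abs(u - v) for u, v in zip(row_a, row_b)]
                           for row_a, row_b in zip(qa, qb)]
    return samples

# Toy 20 x 24 matrices filled with one constant per chain (placeholder values only).
toy = {chain: [[float(k)] * 24 for _ in range(20)] for k, chain in enumerate("ABCD")}
samples = pair_samples(toy)
```

Because the absolute difference is symmetric, the order of the two chains in a pair does not change the sample.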

The matrix S of each sample is standardized by formula 14. The standardized matrix can be regarded as a grayscale image: the larger the value, the brighter the pixel, and the smaller the value, the darker the pixel. This grayscale image is called the feature map, as shown in Figure 1.

x' = \frac{x - \mathrm{mean}}{σ} (14)

FIGURE 1. Feature map of a sample.
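A minimal sketch of the standardization in formula 14 is given below. It assumes that the mean and σ are taken over all entries of one sample matrix and that σ is the population standard deviation; neither detail is fixed by the text, and the handling of a constant matrix is likewise an assumed convention.

```python
import math

def standardize(matrix):
    """Apply x' = (x - mean) / sigma over all entries of one sample matrix."""
    flat = [x for row in matrix for x in row]
    mu = sum(flat) / len(flat)
    # Population standard deviation (assumption, see lead-in).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in flat) / len(flat))
    if sigma == 0:
        # Constant matrix: return zeros instead of dividing by zero (assumed convention).
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - mu) / sigma for x in row] for row in matrix]

toy = [[1.0, 2.0], [3.0, 4.0]]
z = standardize(toy)
```

After this step the entries have zero mean and unit variance, so they can be mapped to pixel intensities on a common scale.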

2.2.2 Construct a convolutional neural network (CNN) model

A Convolutional Neural Network (CNN) is a kind of feedforward neural network with a deep structure. Our CNN model is built on the PyTorch framework and consists of two convolution layers, a pooling layer and a fully connected layer. In the first convolution layer, 3 × 3 kernels slide over the input feature maps to perform the convolution operation (step size 1), and the output is passed through the Rectified Linear Unit (ReLU) activation function. In the second convolution layer, 2 × 2 kernels perform the convolution operation over the feature maps (step size 1), again followed by the ReLU activation function. In the pooling layer, we take the maximum value in every 2 × 2 patch of the feature maps through a sliding window to form more robust pooled feature maps. The pooled feature maps are then flattened into a vector, and the results are output through a fully connected layer. Figure 2 shows the transformations that occur after feature maps are input into the CNN model.

FIGURE 2. Schematic diagram of the transformations that occur after the feature maps of samples are input into the CNN model. A total of 378 feature maps are input into the CNN model. The first convolution layer generates 378 [18 × 22] matrices. The second convolution layer converts these into 378 [17 × 21] matrices. The maximum pooling layer then converts them into 378 [8 × 10] matrices, which are expanded into a [1 × 30240] vector. Finally, a [1 × 2] vector is output through a fully connected layer, where 0 represents that the sample is predicted to be a negative class, and 1 represents that the sample is predicted to be a positive class.
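The matrix sizes quoted in the Figure 2 caption can be checked with a short shape calculation. The sketch assumes unpadded convolutions and a pooling stride of 2, neither of which is stated explicitly in the text; the function names are illustrative.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded convolution or pooling window along one axis."""
    return (size - kernel) // stride + 1

def trace_shapes(height=20, width=24):
    """Follow a 20 x 24 feature map through conv 3x3, conv 2x2 and max-pool 2x2."""
    shapes = [(height, width)]
    for kernel, stride in [(3, 1), (2, 1), (2, 2)]:
        h, w = shapes[-1]
        shapes.append((conv_out(h, kernel, stride), conv_out(w, kernel, stride)))
    return shapes

shapes = trace_shapes()
# 378 pooled maps of the final size, flattened together as in the caption.
flattened = 378 * shapes[-1][0] * shapes[-1][1]
```

Under these assumptions the trace gives 18 × 22, 17 × 21 and 8 × 10, and 378 × 8 × 10 = 30240, in agreement with the caption. In PyTorch, `nn.MaxPool2d(2)` uses a stride equal to the kernel size by default, which corresponds to the stride assumed here.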

The CNN model contains many hyperparameters that affect its overall performance in different ways (Wardah et al., 2020). In this paper, we use Bayesian optimization to select the model hyperparameters. The batch size is set to 128, the number of epochs is set to 100, the learning rate is set between 0.00001 and 0.001, and the loss function is the cross-entropy loss. The Adam optimization algorithm is used to adjust the internal weights of the network. The flow chart of the CNN model is shown in Figure 3.

FIGURE 3. Flow chart of the CNN model.

2.3 Construct SVM ensemble method to predict the tetramer protein complex interface residue pairs

2.3.1 Feature extraction

In this paper, for a given amino acid (we call it the central amino acid, while the residue corresponding to the central amino acid in the protein three-dimensional structure is called the central residue), we consider the influence of surrounding amino acids (residues) on the central amino acid (the central residue). Firstly, we consider the influence of each surrounding amino acid (residue) on the central amino acid (residue). Secondly, we take a certain number of amino acids (residues) as a whole, and consider the influence of this whole on the central amino acid (the central residue) and the influence of each residue in the whole on the central residue.

2.3.1.1 Sequence feature extraction

The physicochemical properties of different types of amino acids are different, and these physicochemical properties play important roles in protein-protein interactions. In this paper, we consider the hydrophobicity, polarizability, polarity, secondary structure, and codon diversity of each amino acid; the values of these five physicochemical properties are shown in Supplementary Table S2 (Tanford, 1962; Grantham, 1974; Charton and Charton, 1982; Kyte and Doolittle, 1982; Lyu and Gong, 2020). Let P be a protein sequence of length L (formula 1). According to the five physicochemical properties of each amino acid, we map the protein sequence P to 5 number sequences, as shown in formula 15.

p^{i} = Φ_{1}^{i} Φ_{2}^{i} \dots Φ_{L}^{i} (i = 1,2,3,4,5) (15)

In protein-protein interactions, the individual behavior of the central amino acid is affected by the neighboring amino acids in the protein sequence. In our previous work, we defined the Amino Acid k-Interval Product Factor (AAIPF(k)ⁱ) to describe the influence of neighboring amino acids on the central amino acid (Lyu and Gong, 2020), see formulas 16-18. Similarly, in protein-protein interactions, a certain number of amino acids around the central amino acid have an overall effect on the central amino acid. We define the Amino Acid k-Average Cumulative Factor (AAACF(k)ⁱ) to describe this overall effect.

The Amino Acid k-Average Cumulative Factor (AAACF(k)ⁱ) is defined as follows: for the central amino acid P_j, in the number sequence, divide the sum of the values at the central amino acid position, its k preceding positions and its k following positions by 2k+1, as shown in formula 19.

\mathrm{AAIPF}(k)^{i} = \left\{ \begin{array}{l} \mathrm{AAFIPF}(k)^{i} \\ \mathrm{AABIPF}(k)^{i} \end{array} \right. (16)

\mathrm{AAFIPF}(k)^{i} = \frac{Φ_{j}^{i} \times Φ_{j-k}^{i}}{k} (17)

\mathrm{AABIPF}(k)^{i} = \frac{Φ_{j}^{i} \times Φ_{j+k}^{i}}{k} (18)

\mathrm{AAACF}(k)^{i} = \frac{1}{2k+1} \sum_{σ=j-k}^{j+k} Φ_{σ}^{i} (19)

In previous studies (Afreixo et al., 2009; Lyu and Gong, 2020), the protein sequence P was regarded as a cyclic alphabet sequence with head-to-tail connection to explore the individual behavior of each amino acid. This strategy gave good results, so we also use it in this paper. Considering the dimension of the descriptors and the experience of previous works (Wang and Brown, 2006; Wang et al., 2010; Lyu and Gong, 2020), we only consider the influence of the 10 amino acids before and the 10 amino acids after the central amino acid. We therefore extract AAIPF(k)ⁱ (k = 1, 2, …, 10) to describe the effect of each amino acid on the central amino acid, and AAACF(k)ⁱ (k = 1, 2, …, 10) to describe the effect of the whole formed by a certain number (3, 5, 7, …, 21) of amino acids on the central amino acid. We also use the five physicochemical properties of the central amino acid as features. Thus each amino acid is described by 5×(20+10)+5 = 155 features: for each of the five properties, 20 AAIPF values (a forward and a backward value for each k) and 10 AAACF values, plus the five property values themselves.
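The sequence features can be sketched as follows, using modular indexing to treat the sequence as cyclic. The function names, the 0-based index j and the ordering of the features inside the vector are assumptions made for illustration; only the formulas themselves come from the text.

```python
def aaipf(values, j, k):
    """Forward and backward k-interval product factors (formulas 17 and 18)."""
    n = len(values)
    forward = values[j] * values[(j - k) % n] / k    # cyclic look-back by k positions
    backward = values[j] * values[(j + k) % n] / k   # cyclic look-ahead by k positions
    return forward, backward

def aaacf(values, j, k):
    """k-average cumulative factor (formula 19): mean over the 2k+1 window centred at j."""
    n = len(values)
    return sum(values[(j + s) % n] for s in range(-k, k + 1)) / (2 * k + 1)

def sequence_features(property_sequences, j, max_k=10):
    """Per property: the raw value, 2*max_k AAIPF terms and max_k AAACF terms."""
    feats = []
    for values in property_sequences:
        feats.append(values[j])
        for k in range(1, max_k + 1):
            feats.extend(aaipf(values, j, k))
        feats.extend(aaacf(values, j, k) for k in range(1, max_k + 1))
    return feats

# Five toy number sequences of length 30 (placeholder values, not real property scales).
toy = [[float((i + 1) * (p + 1)) for i in range(30)] for p in range(5)]
features = sequence_features(toy, j=0)
```

With five property sequences and `max_k=10` the vector has 5 × (1 + 20 + 10) = 155 entries, matching the count given above.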

2.3.1.2 Structure feature extraction

In several previous research studies (Yang and Gong, 2018; Liu and Gong, 2019; Zhao and Gong, 2019; Lyu and Gong, 2020; Sun and Gong, 2020), it has been found that the five geometric properties (ASA, RASA, ECA, ICA, and EVA) can be used to distinguish interface residues and non-interface residues. According to the five geometric properties of the residue, we map the protein P to 5 number sequences, as shown in formula 2.

For a given central residue, we calculate the Euclidean distance between every other residue and the central residue from the three-dimensional coordinates of the $C_{α}$ atoms in the monomer protein PDB file, and sort the distances in ascending order. We use $d_{1}, d_{2}, \dots, d_{L - 1}$ to denote the sorted Euclidean distances, and $λ_{1}, λ_{2}, \dots, λ_{L - 1}$ to denote the corresponding positions of the amino acids on the protein sequence P.

In protein-protein interactions, the individual behavior of the central residue is affected by neighboring residues in the protein three-dimensional structure. We define the Residue k-Interval Product Factor (RIPF(k)) to describe this effect as follows: on the monomer protein three-dimensional structure, for a given central residue P_j, multiply the geometric value of the k-th residue closest to the central residue by the geometric value of the central residue, and divide the product by k (see formula 20). When we regard the central residue and some of the residues closest to it as a whole, we define the Residue k-Average Cumulative Factor (RACF(k)) and the weight factor $ρ_{ξ}^{i}$ to describe the influence of the whole on the central residue and the influence of each residue in the whole on the central residue (see formulas 21-23).

\mathrm{RIPF}(k)_{j}^{i} = \frac{φ_{j}^{i} \times φ_{λ_{k}}^{i}}{k} (20)

Where $λ_{k}$ represents the position of the k-th residue closest to the central residue in the monomer protein three-dimensional structure.

\mathrm{RACF}(k)^{i} = \frac{φ_{j}^{i} + \sum_{ii=1}^{k} φ_{λ_{ii}}^{i}}{k+1} (21)

ρ_{ξ}^{i}(k) = ω_{ξ} \times φ_{λ_{ξ}}^{i} (22)

ω_{ξ}(k) = e^{-\frac{d_{ξ}^{2} \times (k+1)}{\sum_{ξ=1}^{k+1} d_{ξ}^{2}}} (23)

Where (k+1) is the number of residues in the whole. $ω_{ξ}$ is the weight of the $ξ$ -th residue closest to the central residue.

In the protein three-dimensional structure, for the central residue, we consider the influence of the 20 residues closest to the central residue. We extract $\mathrm{RIPF}(l)_{j}^{i}$ ($l = 1,2, \dots, 20$) to represent the effect of each residue on the central residue. We extract $\mathrm{RACF}(l)^{i}$ and $ρ_{l}^{i}(k)$ ($l = 1,2, \dots, 20$) to describe the effect of the whole formed by a certain number of residues on the central residue and the effect of each residue in the whole on the central residue. We also take the five geometric values of the residue as features to describe the central residue. So we can use 5×(20+20+20)+5 = 305 features to describe each residue.
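The neighbour ranking and formulas 20 and 21 can be sketched as below. The weight factor of formulas 22 and 23 is deliberately left out of the sketch. Function names, 0-based indices and the toy coordinates are assumptions made for illustration.

```python
import math

def nearest_neighbours(coords, j):
    """Indices of all other residues sorted by C-alpha distance to residue j,
    together with the sorted distances."""
    dists = sorted((math.dist(coords[j], c), idx)
                   for idx, c in enumerate(coords) if idx != j)
    return [idx for _, idx in dists], [d for d, _ in dists]

def ripf(values, j, order, k):
    """Formula 20: central value times the k-th nearest residue's value, divided by k."""
    return values[j] * values[order[k - 1]] / k

def racf(values, j, order, k):
    """Formula 21: mean over the central residue and its k nearest residues."""
    return (values[j] + sum(values[idx] for idx in order[:k])) / (k + 1)

# Toy C-alpha coordinates on a line and toy geometric values (placeholders).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
values = [2.0, 4.0, 6.0, 8.0]

order, dists = nearest_neighbours(coords, 0)
ripf_2 = ripf(values, 0, order, 2)
racf_2 = racf(values, 0, order, 2)
```

Here `order` plays the role of $λ_{1}, λ_{2}, \dots$ and `dists` the role of $d_{1}, d_{2}, \dots$ for the chosen central residue.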

In summary, for each amino acid (residue), we can extract 155+305 = 460 features and combine them into a feature vector U. A residue pair is therefore represented by a 920-dimensional feature vector. Take the residue pairs generated by residues on the A chain and B chain of the 1DD3 tetramer protein complex as an example. We use $U_{j}^{A}$ to represent the feature vector of residue j on the A chain and $U_{k}^{B}$ to represent the feature vector of residue k on the B chain. Then we use $U_{1} = (U_{j}^{A}, U_{k}^{B})$ to represent the residue pair generated by residue j on the A chain and residue k on the B chain. We use Y to represent the sample label (Y = 1 indicates that the residue pair is an interface residue pair, Y = 0 indicates that the residue pair is a non-interface residue pair).

2.3.2 SVM ensemble method

Support Vector Machine (SVM) has been widely used in the study of protein-protein interactions and has achieved good results. In this paper, we also use SVM to predict the tetramer protein complex interface residue pairs. Compared with non-interface residue pairs, the number of interface residue pairs in a tetramer protein complex is very small. Therefore, the positive and negative classes are very imbalanced in the data set (positive class: interface residue pair, negative class: non-interface residue pair). We use under-sampling to deal with the class imbalance problem and an ensemble method to reduce the information loss caused by under-sampling.

We propose an SVM ensemble method to predict the tetramer protein complex interface residue pairs. Our method can be divided into two parts: feature extraction and generation of the SVM ensemble classifier (see Figure 4). The feature extraction is described in Section 2.3.1. The SVM ensemble classifier is generated as follows:

FIGURE 4. Flow chart of the SVM Ensemble Method.

The total number of positive samples in the training set is 29,268. We randomly sample 10 times from all negative samples to generate 10 subsets, each containing 29,268 negative samples. We then combine each subset of negative samples with all positive samples to generate a balanced sample set, which gives 10 balanced sample sets. Training an SVM model on each balanced sample set yields 10 independent SVM models. Finally, we use an integration strategy to fuse the 10 independent SVM models into an SVM ensemble classifier ISVM, see formula 24. In the SVM model, the SVM type is C-classification and the kernel function is the radial basis function.

\mathrm{ISVM}(x) = \sum_{ψ=1}^{10} \mathrm{SVM}^{ψ}(x) (24)

Where $\mathrm{SVM}^{ψ}$ represents the SVM predictor trained with the $ψ$ -th balanced sample set, and x represents a residue pair. $\mathrm{SVM}^{ψ}(x)$ represents the probability that the $ψ$ -th individual SVM model predicts that the residue pair x is an interface residue pair.
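The under-sampling and fusion steps can be sketched independently of any particular SVM library. In the sketch below the individual models are hypothetical callables that return an interface probability; training real SVMs with a radial basis function kernel is not shown. Sampling without replacement inside each subset and the fixed seed are assumptions.

```python
import random

def balanced_sets(positives, negatives, n_sets=10, seed=0):
    """Pair all positives with n_sets random negative subsets of equal size."""
    rng = random.Random(seed)
    size = len(positives)
    # Each subset is drawn without replacement from the full negative pool (assumption).
    return [(positives, rng.sample(negatives, size)) for _ in range(n_sets)]

def ensemble_score(models, x):
    """Formula 24: sum of the interface probabilities given by the individual models."""
    return sum(model(x) for model in models)

# Toy data: 5 positive and 50 negative sample identifiers (placeholders).
positives = list(range(5))
negatives = list(range(100, 150))
sets = balanced_sets(positives, negatives)

# Stand-in models returning fixed probabilities; real models would be trained SVMs.
stub_models = [lambda x, p=p: p for p in (0.2, 0.5, 0.9)]
score = ensemble_score(stub_models, x=None)
```

Residue pairs can then be ranked by this summed score, with the highest-scoring pairs taken as the predicted interface residue pairs.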

3 Results

3.1 Predictions of the interaction between chains of the tetramer protein complex

3.1.1 Evaluation criteria

We use 7 common evaluation indicators (recall, specificity, precision, F1 score, Matthews Correlation Coefficient (MCC), accuracy and AUC) to evaluate the predictions. The first six are defined as follows:

\mathrm{Recall} = \frac{TP}{TP + FN} (25)

\mathrm{Specificity} = \frac{TN}{TN + FP} (26)

\mathrm{Precision} = \frac{TP}{TP + FP} (27)

F1 = \frac{2TP}{2TP + FP + FN} (28)

\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} (29)

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} (30)

Where, TP indicates the number of positive samples predicted by the model to be positive class. FN indicates the number of positive samples predicted by the model to be negative class. FP indicates the number of negative samples predicted by the model to be positive class. TN indicates the number of negative samples predicted by the model to be negative class.
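Formulas 25-30 translate directly into code. The sketch below is illustrative: the function name is an assumption, the counts in the usage line are made-up toy values, and returning 0.0 when a denominator is zero is an assumed convention.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Formulas 25-30 computed from confusion-matrix counts."""
    def ratio(num, den):
        # Undefined ratios fall back to 0.0 (assumed convention).
        return num / den if den else 0.0
    return {
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn,
                     math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

metrics = binary_metrics(tp=8, fp=2, tn=6, fn=4)   # toy counts, not results from the paper
```

When negative samples are rare, recall and precision can be high while specificity and MCC stay low, which is why several indicators are reported together.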

We also define a new evaluation indicator ||PT||₁ to evaluate our predictions. ||PT||₁ represents the L1 norm of the vector PT, which means the sum of the number of interactions and non-interactions correctly predicted in a tetramer protein complex. PT=(a, b), where a represents the number of correctly predicted interactions in a tetramer protein complex, and b represents the number of correctly predicted non-interactions in a tetramer protein complex.

3.1.2 Results

We randomly divide the 111 tetramer protein complexes into a training set, a validation set and a testing set, containing 63, 20 and 28 tetramer protein complexes, respectively (see Supplementary Table S1). The number of positive and negative samples in each data set is shown in Table 1.

TABLE 1. Information of positive and negative samples in each data set.

We input the feature maps of the training set and testing set into the CNN model to tune the hyperparameters and verify the accuracy of the model. The hyperparameters finally selected are as follows: the learning rate is 0.000801, the number of convolution kernels is 2, and the number of epochs is 20. With these hyperparameters, the results of the CNN model on the validation set and testing set are shown in Table 2.

TABLE 2. Predictions of CNN Model on validation set and testing set.

Table 2 shows that the recall on the validation set and testing set is 0.9369 and 0.9329, respectively, which indicates that our CNN model is relatively accurate in predicting the interaction between two chains of the tetramer protein complex. The specificity on the validation set and testing set is 0.3333 and 0.2105, respectively. The precision on the validation set and testing set is 0.9455 and 0.9026, respectively. The MCC on the validation set and testing set is 0.2576 and 0.1643, respectively. Because non-interacting chain pairs in tetramer protein complexes are too sparse, the specificity and MCC values on the validation set and testing set are relatively low. The F1 value on the validation set and testing set is 0.9412 and 0.9174, respectively. The AUC value on the validation set and testing set is 0.7608 and 0.6263, respectively. These results show that our CNN model can distinguish positive from negative samples, that is, the method can be used to predict the interaction between chains of the tetramer protein complex.

The prediction for each tetramer protein complex in the testing set is shown in Table 3. In the testing set, 15 tetramer protein complexes contain only interacting chain pairs. The CNN model correctly predicts 9 of them, an accuracy of 60%. The testing set contains 149 interactions between two chains, of which the CNN model correctly predicts 139, an accuracy of 93.29%. Of the 6 samples formed by each tetramer protein complex, the CNN model correctly predicts at least 5 for 82.14% of the complexes in the testing set. The results also show that the CNN model can distinguish the positive and negative samples in the 1DD3, 1P27, 1ZXJ and 3SQO tetramer protein complexes.

TABLE 3. Prediction of each tetramer protein complex in testing set.

3.2 Predictions of the tetramer protein complex interface residue pairs

3.2.1 Evaluation criteria

The output value of SVM ensemble method is between 0 and 1, which indicates the possibility that the residue pair is an interface residue pair. The predicted values are arranged in descending order. We take the t predictions with the highest probability as the predicted t interface residue pairs.

In addition to six commonly used indicators (recall, specificity, precision, F1 score, MCC and AUC), we define three new indicators to evaluate the performance of the SVM ensemble method. Before introducing them, we define a six-dimensional vector ${NPIRP}^{4}(t) = {(n_{1}, n_{2}, n_{3}, n_{4}, n_{5}, n_{6})}_{t}$, where $n_{z}$ represents the number of positive interface residue pairs in the top t predictions of the $z$-th possible protein-protein interaction interface in the tetramer protein complex. Based on this vector, the three indicators are defined as follows.

The first index is ${‖{NPIRP}^{4}(t)‖}_{0}$, the L0 norm of the vector ${NPIRP}^{4}(t)$, that is, the number of its non-zero components. Biologically, ${‖{NPIRP}^{4}(t)‖}_{0}$ is the number of correctly predicted protein-protein interaction interfaces in each tetramer protein complex: we consider a protein-protein interaction interface correctly predicted if its top t predictions contain at least one positive interface residue pair.

The second index is ${‖{NPIRP}^{4}(t)‖}_{1}$ (see formula 31), the L1 norm of the vector ${NPIRP}^{4}(t)$. Biologically, ${‖{NPIRP}^{4}(t)‖}_{1}$ is the number of correctly predicted interface residue pairs in the top t predictions for a tetramer protein complex.

$${\|{NPIRP}^{4}(t)\|}_{1} = \sum_{z = 1}^{6} n_{z} \tag{31}$$

The third index is ${A c c u r a c y}^{4} (t)$ , see formula 32.

$${Accuracy}^{4}(t) = \frac{NCTP(t)}{NTP} \times 100\% \tag{32}$$

where $NCTP(t)$ is the Number of Correctly predicted Tetramer Protein complexes. In the top t predictions, we consider a tetramer protein complex correctly predicted when at least z of its protein-protein interaction interfaces each have at least one positive interface residue pair. NTP is the Number of Tetramer Protein complexes in the data set that contain at least z native protein-protein interaction interfaces.
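The three indicators can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the names `npirp`, `l0_norm`, `l1_norm` and `accuracy4` are hypothetical, the top t predictions are assumed to be taken separately for each of the six possible interfaces, and a complex is assumed to count as correctly predicted when at least z interfaces each contain a positive pair in their top t.

```python
from typing import List, Tuple

# One prediction: (ensemble score, True if the pair is a native interface pair).
Prediction = Tuple[float, bool]
Complex = List[List[Prediction]]   # six lists, one per possible interface


def npirp(interfaces: Complex, t: int) -> List[int]:
    """Return (n_1, ..., n_6): positives among the top t scores per interface."""
    counts = []
    for predictions in interfaces:
        top = sorted(predictions, key=lambda p: p[0], reverse=True)[:t]
        counts.append(sum(1 for _, is_positive in top if is_positive))
    return counts


def l0_norm(vector: List[int]) -> int:
    """Number of non-zero components: correctly predicted interfaces."""
    return sum(1 for n in vector if n > 0)


def l1_norm(vector: List[int]) -> int:
    """Formula 31: total number of correctly predicted interface pairs."""
    return sum(vector)


def accuracy4(complexes: List[Complex], native_interface_counts: List[int],
              t: int, z: int) -> float:
    """Formula 32: percentage of eligible complexes with >= z interfaces hit."""
    eligible = [c for c, k in zip(complexes, native_interface_counts) if k >= z]
    if not eligible:                      # NTP = 0: nothing to evaluate
        return 0.0
    correct = sum(1 for c in eligible if l0_norm(npirp(c, t)) >= z)
    return 100.0 * correct / len(eligible)
```

For example, a complex whose vector is (2, 1, 0, 0, 0, 0) has an L0 norm of 2 and an L1 norm of 3, so it counts as correctly predicted for z = 1 and z = 2 but not for z = 3.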

3.2.2 Results

We randomly divide the 111 tetramer protein complexes into a training set and a testing set at a ratio of about 3:1. The training set contains 83 tetramer protein complexes and the testing set contains 28. The PDB IDs of the tetramer protein complexes in each set are listed in Supplementary Table S1. The number of positive and negative samples in each set is shown in Table 4.

TABLE 4. Sample number information of training set and testing set.

First, the feature vector of each residue pair is calculated (see Section 2.3.1 for the procedure). Second, we use the samples generated from the training set to train the model. Then, the samples generated from the testing set are input into the trained model. Finally, we obtain the score of each residue pair in the testing set.

Table 5 shows the two evaluation indexes ${‖{NPIRP}^{4}(t)‖}_{0}$ and ${‖{NPIRP}^{4}(t)‖}_{1}$ for the testing set. From Table 5 we draw the following conclusions. In the top 10 predictions, 23 tetramer protein complexes are correctly predicted when at least one protein-protein interaction interface in each tetramer protein complex must be correctly predicted, and 20 when at least two interfaces must be correctly predicted. In the top 30 predictions, 20 tetramer protein complexes are correctly predicted when at least three protein-protein interaction interfaces in each tetramer protein complex must be correctly predicted, and 16 when at least four interfaces must be correctly predicted.

TABLE 5. Two evaluation indexes of testing set in the top t predictions.

In the top 10 predictions, the prediction of 2EPI tetramer protein complex is the best. Six protein-protein interaction interfaces are correctly predicted, and a total of 9 positive interface residue pairs are given. The prediction of 3VH5 tetramer protein complex follows closely. Five protein-protein interaction interfaces are correctly predicted, and a total of 13 positive interface residue pairs are given. On 1SWF tetramer protein complex, four protein-protein interaction interfaces are correctly predicted, and a total of 15 positive interface residue pairs are given. On 1NSW and 3STB tetramer protein complexes, four protein-protein interaction interfaces are correctly predicted, and a total of 13 positive interface residue pairs are given.

We calculate the index ${Accuracy}^{4}(t)$ from the ${‖{NPIRP}^{4}(t)‖}_{0}$ columns in Table 5 (see Table 6). As can be seen from Table 6, in the top 10 predictions, when at least one protein-protein interaction interface is correctly predicted for each tetramer protein complex, the ${Accuracy}^{4}(t)$ of the SVM ensemble method is 82.14%, that is, about 4/5 of the tetramer protein complexes in the testing set are correctly predicted. In the top 20 predictions, when at least two protein-protein interaction interfaces are correctly predicted for each tetramer protein complex, the ${Accuracy}^{4}(t)$ of the SVM ensemble method is 89.29%. In the top 30 predictions, when at least four protein-protein interaction interfaces are correctly predicted for each tetramer protein complex, the ${Accuracy}^{4}(t)$ of the SVM ensemble method is 61.43%, that is, about 3/5 of the tetramer protein complexes in the testing set are correctly predicted.

TABLE 6. ${A c c u r a c y}^{4} (t)$ of testing set prediction.

When we give the top 50 predictions and all native protein-protein interaction interfaces on each tetramer protein complex are required to be correctly predicted, SVM ensemble method can correctly predict 9 tetramer protein complexes. The predictions of these 9 tetramer protein complexes are shown in Table 7. The prediction of 1NSE tetramer protein complex is the best. A total of 53 positive interface residue pairs are given on 5 native protein-protein interaction interfaces, with an average of 10.6 positive interface residue pairs per protein-protein interaction interface. On 1QYN, 1SWF and 3MH0 tetramer protein complexes, SVM ensemble method gives at least 30 positive interface residue pairs, with an average of 5 positive interface residue pairs per protein-protein interaction interface.

TABLE 7. Predictions of 9 tetramer protein complexes in testing set.

In the top 200 predictions, the recall, precision, specificity, F1 and MCC of the SVM ensemble method are 0.255, 0.052, 0.988, 0.081 and 0.102, respectively. If 200 residue pairs per protein-protein interaction interface are taken as interface residue pairs, a total of 29,800 residue pairs are extracted as interface residue pairs in the testing set. Given the proportion of interface residue pairs among all residue pairs in the testing set, a random selection of 29,800 residue pairs would be expected to contain 45.98 interface residue pairs, which corresponds to a precision of 0.00154. The precision of the SVM ensemble method is much higher than this value. Because interface residue pairs are far sparser than non-interface residue pairs, the precision and F1 value of the SVM ensemble method are nevertheless not high.
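The comparison with the chance level can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the paragraph above (29,800 selected pairs, 45.98 expected interface pairs, precision 0.052); the function name `enrichment_over_random` is hypothetical, and the resulting ratio is our own derived illustration, not a value reported by the authors.

```python
def enrichment_over_random(precision: float, expected_random_hits: float,
                           n_selected: int) -> float:
    """Ratio of the observed precision to the precision of random selection."""
    baseline_precision = expected_random_hits / n_selected
    return precision / baseline_precision


n_selected = 29_800            # 200 pairs per interface, as stated in the text
expected_random_hits = 45.98   # expected interface pairs under random selection
baseline = expected_random_hits / n_selected
ratio = enrichment_over_random(0.052, expected_random_hits, n_selected)
print(f"baseline precision: {baseline:.5f}")
print(f"enrichment over random: {ratio:.1f}x")
```

Under these figures, the precision of the SVM ensemble method is more than thirty times the chance level, even though its absolute value is low.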

Sun and Gong (2020) predicted tetramer protein complex interface residue pairs with an LSTM network combined with graph representations. We compare the performance of our method with that of Sun et al. (using optimal hyperparameters). In the top 10 predictions, when at least one protein-protein interaction interface is correctly predicted, the accuracy of our method is 82.14% and that of Sun et al. is 83.33%; when at least two protein-protein interaction interfaces are correctly predicted, the two methods have the same accuracy; when at least three protein-protein interaction interfaces are correctly predicted, the accuracy of our method is 30.29% and that of Sun et al. is 25.84%. In the top 20 predictions, when at least one protein-protein interaction interface is correctly predicted, both methods reach an accuracy of 92.86%; when at least two protein-protein interaction interfaces are correctly predicted, the accuracy of our method is 89.29% and that of Sun et al. is 85.71%; when at least three protein-protein interaction interfaces are correctly predicted, the accuracy of our method is 64.29% and that of Sun et al. is 61.54%. Overall, the predictions of our method are better than those of Sun et al.

4 Discussion

In this paper, we carried out two parts of work to predict the tetramer protein complex interaction. In the first part, we defined the position change sequence and the geometric feature change sequences of the same type of amino acid. Based on these sequences, we proposed a 20 × 24 feature map to represent a sample generated by two chains in a tetramer protein complex, and constructed a CNN model to predict the interaction between chains of the tetramer protein complex. In the second part, we considered the influence of the surrounding amino acids (residues) on the central amino acid (the central residue) when extracting features. We defined the Amino Acid k-Average Cumulation Factor and the Amino Acid k-Interval Product Factor to extract features from the protein sequence, and the Residue k-Interval Product Factor, the Residue k-Average Cumulation Factor and a weight factor to extract features from the protein three-dimensional structure. Finally, we proposed an SVM ensemble method based on under-sampling to predict the tetramer protein complex interface residue pairs. The predictions show that our method is feasible for predicting tetramer protein complex interface residue pairs. Whereas previous studies examined only tetramer protein complex interface residue pairs, we also studied the interaction between chains of the tetramer protein complex, which provides a new perspective for the future study of multibody protein interactions. Two points still need improvement. First, the accuracy of predicting the interaction between chains of the tetramer protein complex needs to be improved. Second, the accuracy obtained when all native protein-protein interaction interfaces of each tetramer protein complex must be correctly predicted also needs to be improved.
In the future, we also hope that our predictions can be used in docking processes to predict the multibody protein complex three-dimensional structure.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

Author contributions

YL, CW, and XG: research design. YL, RH, and JH: data analyses. YL, CW, and XG: research. YL: wrote the manuscript. YL and XG: revision of results and manuscript content. All authors contributed to the article and approved the submitted version.

Acknowledgments

We are grateful to the Public Computing Cloud of Renmin University of China for providing computing resources.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2023.1076904/full#supplementary-material

References

Afreixo, V., Bastos, C. A., Pinho, A. J., Garcia, S. P., and Ferreira, P. J. (2009). Genome analysis with inter-nucleotide distances. Bioinformatics 25 (23), 3064–3070. doi:10.1093/bioinformatics/btp546

Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science 373 (6557), 871–876. doi:10.1126/science.abj8754

Charton, M., and Charton, B. I. (1982). The structural dependence of amino acid hydrophobicity parameters. J. Theor. Biol. 99 (4), 629–644. doi:10.1016/0022-5193(82)90191-6

Drennan, C. L., Huang, S., Drummond, J. T., Matthews, R. G., and Ludwig, M. L. (1994). How a protein binds B12: A 3.0 Å X-ray structure of B12-binding domains of methionine synthase. Science 266 (5191), 1669–1674. doi:10.1126/science.7992050

Du, T., Liao, L., Wu, C. H., and Sun, B. (2016). Prediction of residue-residue contact matrix for protein-protein interaction with Fisher score features and deep learning. Methods 110, 97–105. doi:10.1016/j.ymeth.2016.06.001

Fu, M., Geiss, B. J., and Ben-Hur, A. (2014). PAIRpred: Partner-specific prediction of interacting residues from sequence and structure. Proteins 82 (7), 1142–1155. doi:10.1002/prot.24479

Gao, M., and Skolnick, J. (2012). The distribution of ligand-binding pockets around protein-protein interfaces suggests a general mechanism for pocket formation. Proc. Natl. Acad. Sci. U. S. A. 109 (10), 3784–3789. doi:10.1073/pnas.1117768109

Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185 (4154), 862–864. doi:10.1126/science.185.4154.862

He, B., Mortuza, S. M., Wang, Y., Shen, H. B., and Zhang, Y. (2017). NeBcon: Protein contact map prediction using neural network training coupled with naïve bayes classifiers. Bioinformatics 33 (15), 2296–2306. doi:10.1093/bioinformatics/btx164

Humphreys, I. R., Pei, J., Baek, M., Krishnakumar, A., Anishchenko, I., Ovchinnikov, S., et al. (2021). Computed structures of core eukaryotic protein complexes. Science 374 (6573), eabm4805. doi:10.1126/science.abm4805

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596 (7873), 583–589. doi:10.1038/s41586-021-03819-2

Kamisetty, H., Ovchinnikov, S., and Baker, D. (2013). Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. U. S. A. 110 (39), 15674–15679. doi:10.1073/pnas.1314045110

Knutson, C., Bontha, M., Bilbrey, J. A., and Kumar, N. (2022). Decoding the protein-ligand interactions using parallel graph neural networks. Sci. Rep. 12 (1), 7624. doi:10.1038/s41598-022-10418-2

Kyte, J., and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157 (1), 105–132. doi:10.1016/0022-2836(82)90515-0

Levy, E. D., and Pereira-Leal, J. B. (2008). Evolution and dynamics of protein interactions and networks. Curr. Opin. Struct. Biol. 18 (3), 349–357. doi:10.1016/j.sbi.2008.03.003

Li, Y., Hu, J., Zhang, C., Yu, D. J., and Zhang, Y. (2019). ResPRE: High-accuracy protein contact prediction by coupling precision matrix with deep residual neural networks. Bioinformatics 35 (22), 4647–4655. doi:10.1093/bioinformatics/btz291

Li, Y., Zhang, C., Bell, E. W., Zheng, W., Zhou, X., Yu, D. J., et al. (2021a). Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks. PLoS Comput. Biol. 17 (3), e1008865. doi:10.1371/journal.pcbi.1008865

Li, Y., Zhang, C., Zheng, W., Zhou, X., Bell, E. W., Yu, D. J., et al. (2021b). Protein inter-residue contact and distance prediction by coupling complementary coevolution features with deep residual networks in CASP14. Proteins 89 (12), 1911–1921. doi:10.1002/prot.26211

Liu, J., and Gong, X. (2019). Attention mechanism enhanced LSTM with residual architecture and its application for protein-protein interaction residue pairs prediction. BMC Bioinforma. 20 (1), 609. doi:10.1186/s12859-019-3199-1

Lyu, Y., and Gong, X. (2020). A two-layer SVM ensemble-classifier to predict interface residue pairs of protein trimers. Molecules 25 (19), 4353. doi:10.3390/molecules25194353

Lyu, Y., Huang, H., and Gong, X. (2020). A novel index of contact frequency from noise protein-protein interaction data help for accurate interface residue pair prediction. Interdiscip. Sci. 12 (2), 204–216. doi:10.1007/s12539-020-00364-w

Malta, T. M., Sokolov, A., Gentles, A. J., Burzykowski, T., Poisson, L., Weinstein, J. N., et al. (2018). Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell 173 (2), 338–354.e15. doi:10.1016/j.cell.2018.03.034

McKinstry, W. J., Polekhina, G., Diefenbach-Jagger, H., Ho, P. W. M., Sato, K., Onuma, E., et al. (2009). Structural basis for antibody discrimination between two hormones that recognize the parathyroid hormone receptor. J. Biol. Chem. 284 (23), 15557–15563. doi:10.1074/jbc.M900044200

Michel, M., Hayat, S., Skwark, M. J., Sander, C., Marks, D. S., and Elofsson, A. (2014). PconsFold: Improved contact predictions improve protein models. Bioinformatics 30 (17), i482–i488. doi:10.1093/bioinformatics/btu458

Mylonas, S. K., Axenopoulos, A., and Daras, P. (2021). DeepSurf: A surface-based deep learning approach for the prediction of ligand binding sites on proteins. Bioinformatics 37, 1681–1690. doi:10.1093/bioinformatics/btab009

Oganesyan, V., Pufan, R., DeGiovanni, A., Yokota, H., Kim, R., and Kim, S. H. (2004). Structure of the putative DNA-binding protein SP_1288 from Streptococcus pyogenes. Acta Crystallogr. D. Biol. Crystallogr. 60 (7), 1266–1271. doi:10.1107/S0907444904009394

Ovchinnikov, S., Kamisetty, H., and Baker, D. (2014). Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 3, e02030. doi:10.7554/eLife.02030

Sun, D., and Gong, X. (2020). Tetramer protein complex interface residue pairs prediction with LSTM combined with graph representations. Biochim. Biophys. Acta Proteins Proteom 1868 (11), 140504. doi:10.1016/j.bbapap.2020.140504

Sun, D., Liu, S., and Gong, X. (2020). Review of multimer protein–protein interaction complex topology and structure prediction. Chin. Phys. B 29 (10), 108707. doi:10.1088/1674-1056/abb659

Sun, Y., Watters, K., Hill, M. G., Fang, Q., Liu, Y., Kuhn, R. J., et al. (2020). Cryo-EM structure of rhinovirus C15a bound to its cadherin-related protein 3 receptor. Proc. Natl. Acad. Sci. U. S. A. 117 (12), 6784–6791. doi:10.1073/pnas.1921640117

Tanford, C. (1962). Contribution of hydrophobic interactions to the stability of the globular conformation of proteins. J. Am. Chem. Soc. 84, 4240–4247. doi:10.1021/ja00881a009

Vidal, M., Cusick, M. E., and Barabási, A. L. (2011). Interactome networks and human disease. Cell 144 (6), 986–998. doi:10.1016/j.cell.2011.02.016

Wang, L., and Brown, S. J. (2006). BindN: A web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 34, W243–W248. doi:10.1093/nar/gkl298

Wang, L., Huang, C., Yang, M. Q., and Yang, J. Y. (2010). BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst. Biol. 4 (1), S3. doi:10.1186/1752-0509-4-S1-S3

Wang, W., Yang, Y., Yin, J., and Gong, X. (2017). Different protein-protein interface patterns predicted by different machine learning methods. Sci. Rep. 7 (1), 16023. doi:10.1038/s41598-017-16397-z

Wardah, W., Dehzangi, A., Taherzadeh, G., Rashid, M. A., Khan, M. G. M., Tsunoda, T., et al. (2020). Predicting protein-peptide binding sites with a deep convolutional neural network. J. Theor. Biol. 496, 110278. doi:10.1016/j.jtbi.2020.110278

Weigt, M., White, R. A., Szurmant, H., Hoch, J. A., and Hwa, T. (2009). Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. U. S. A. 106 (1), 67–72. doi:10.1073/pnas.0805923106

Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The I-TASSER suite: Protein structure and function prediction. Nat. Methods 12 (1), 7–8. doi:10.1038/nmeth.3213

Yang, Y., and Gong, X. (2018). A new probability method to understand protein-protein interface formation mechanism at amino acid level. J. Theor. Biol. 436, 18–25. doi:10.1016/j.jtbi.2017.09.026

Zhang, C., Freddolino, P. L., and Zhang, Y. (2017). COFACTOR: Improved protein function prediction by combining structure, sequence and protein-protein interaction information. Nucleic Acids Res. 45 (W1), W291–W299. doi:10.1093/nar/gkx366

Zhang, C., Zheng, W., Mortuza, S. M., Li, Y., and Zhang, Y. (2020). DeepMSA: Constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 36 (7), 2105–2112. doi:10.1093/bioinformatics/btz863

Zhao, N., Zhuo, M., Tian, K., and Gong, X. (2022). Protein-protein interaction and non-interaction predictions using gene sequence natural vector. Commun. Biol. 5 (1), 652. doi:10.1038/s42003-022-03617-0

Zhao, Z., and Gong, X. (2019). Protein-protein interaction interface residue pair prediction based on deep learning architecture. IEEE/ACM Trans. Comput. Biol. Bioinform. 16 (5), 1753–1759. doi:10.1109/TCBB.2017.2706682

Keywords: tetramer protein complex interaction, feature map, CNN, SVM ensemble method, under-sampling

Citation: Lyu Y, He R, Hu J, Wang C and Gong X (2023) Prediction of the tetramer protein complex interaction based on CNN and SVM. Front. Genet. 14:1076904. doi: 10.3389/fgene.2023.1076904

Received: 25 October 2022; Accepted: 16 January 2023;
Published: 26 January 2023.

Edited by:

Peng Chen, Anhui University, China

Reviewed by:

Xiaolei Zhu, Anhui Agricultural University, China
Fuyi Li, The University of Melbourne, Australia
Cheng Wang, Anhui Jianzhu University, China

Copyright © 2023 Lyu, He, Hu, Wang and Gong. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chunxia Wang, wangchunxia@hebeu.edu.cn; Xinqi Gong, xinqigong@ruc.edu.cn
