Study of PARP inhibitors for breast cancer based on enhanced multiple kernel function SVR with PSO

Xue, Haohan; Zhang, Ruixuan; Yan, Xudong; Wang, Ruihan; Zhang, Peijian

doi:10.3389/fphar.2024.1257253

ORIGINAL RESEARCH article

Front. Pharmacol. , 02 February 2024

Sec. Experimental Pharmacology and Drug Discovery

Volume 15 - 2024 | https://doi.org/10.3389/fphar.2024.1257253

This article is part of the Research Topic Artificial Intelligence and Drug Discovery View all 13 articles

Study of PARP inhibitors for breast cancer based on enhanced multiple kernel function SVR with PSO

Ruihan Wang

University, College of Computer Science and Technology, Qingdao, China

PARP1 is one of six enzymes required for the highly error-prone DNA repair pathway microhomology-mediated end joining (MMEJ) and needs to be inhibited when over-expressed. In order to study the PARP1 inhibitory effect of fused tetracyclic or pentacyclic dihydrodiazepinoindolone derivatives (FTPDDs) by quantitative structure-activity relationship technique, six models were established by four kinds of methods, heuristic method, gene expression programming, random forester, and support vector regression with single, double, and triple kernel function respectively. The single, double, and triple kernel functions were RBF kernel function, the integration of RBF and polynomial kernel functions, and the integration of RBF, polynomial, and linear kernel functions respectively. The problem of multi-parameter optimization introduced in the support vector regression model was solved by the particle swarm optimization algorithm. Among the models, the model established by support vector regression with triple kernel function, in which the optimal R² and RMSE of training set and test set were 0.9353, 0.9348 and 0.0157, 0.0288, and R²_cv of training set and test set were 0.9090 and 0.8971, shows the strongest prediction ability and robustness. The method of support vector regression with triple kernel function is a great promotion in the field of quantitative structure-activity relationship, which will contribute a lot to designing and screening new drug molecules. The information contained in the model can provide important factors that guide drug design. Based on these factors, six new FTPDDs have been designed. Using molecular docking experiments to determine the properties of new derivatives, the new drug was ultimately successfully designed.

1 Introduction

Breast cancer is a common malignant disease in breast tissue. In 2020, more than 2.26 million people were diagnosed with the disease and about 685 thousand people were killed by it (Sung et al., 2021). Because of the high incidence rate and seriousness of breast cancer, developing new drugs with good therapeutic effects has become a hot research spot. The size of the breast cancer drug development market has totaled 20.2 billion dollars in 2019 and is forecast to grow to 47.7 billion in 2029 (Wilcock and Webster, 2021), which makes breast cancer drug development become one of the biggest disease research fields.

Although new breast cancer drugs have been developed continuously, the efficiency of most drugs has been limited due to side effects and the lack of specificity. Since Bryant et al. proposed the concept of synthetic lethality effect in 2005, the anti-tumor effect of Poly (ADP-ribose) polymerase (PARP) inhibitors has been gradually revealed.

The PARP family, known as diphtheria-toxin-like ADP-ribosyltransferases (ARTDs), is a family of 17 enzymes that share a common catalytic ADP-ribosyl transferase (ART) motif (Zong et al., 2022). PARP can be activated by recognizing DNA fragments of structural damage and then perform base excision repair (Zong et al., 2022). It plays an important role in DNA single-strand break (SSB) repair, programmed cell death regulation, and DNA stability maintenance (Ohmoto and Yachida, 2017). Among the 17 kinds of PARP, PARP1 and PARP2 are the two most important subtypes of the PARP family (Huang and Yin, 2021). And at present, PARP inhibitors mainly inhibit these two subtypes.

PARP1 is one of six enzymes required for the highly error-prone DNA repair pathway microhomology-mediated end joining (MMEJ) (Sharma et al., 2015). When PARP1 is upregulated, MMEJ is increased, causing genome instability (Muvarak et al., 2015). Ordinarily, deficient expression of a DNA repair enzyme results in increased un-repaired DNA damage which leads to mutations and cancer through replication errors. However, the accuracy of MMEJ repair mediated by PARP1 is significantly low. Therefore, it seems that cancer is more likely to occur due to over-expression rather than under-expression (Sharma et al., 2015).

In breast cancer cells, the breast cancer susceptibility gene BRCA1 and BRCA2 genes are significant tumor suppressors for DNA double-strand breaks (DSBs) by homologous recombination (HR) and their mutation easily causes genetic instability and leads to the emergence of tumor cells (Luo et al., 2021).

PARP inhibitors can selectively kill tumor cells with HR function defects caused by RBCA1 and BRCA2 genes mutation by inhibiting PARP activity and leading to DNA repair failure, but have no effect on normal cells, which is synergistic lethal effect (Lord and Ashworth, 2017). PARP inhibitors show good curative effects in breast cancer therapies, which makes the research of PARP inhibitors become a significant field in breast cancer drug research.

In the experiments shown in literature (Wang H. et al., 2020), fused tetracyclic or pentacyclic dihydrodiazepinoindolone derivatives (FTPDDs) were studied as PARP inhibitors, and Pamiparib was found with great effects in inhibiting the activity of PARP1. Therefore, some FTPDDs that have a similar structure to Pamiparib should be deeply studied to find PARP1 inhibitors with better therapeutic effects and fewer side effects. The inhibitory activity on PARP can be evaluated by PARP IC₅₀. Therefore, obtaining PARP1 IC₅₀ values of FTPDDs is of great importance for screening out PARP1 inhibitors with high biological activity and low toxicity, which will contribute to screening out FTPDDs with good therapeutic effects subsequently. Since the traditional methods of measuring IC₅₀ consume a lot of manpower and material resources, quantitative structure-activity relationship (QSAR) technique is used to predict the IC₅₀ values of compounds quickly and accurately.

QSAR is a method that uses mathematical models to describe the relationship between molecular structure and certain biological activity (An et al., 2006; Chen and Si, 2021). The basic assumption of QSAR is that the molecular structure of the compound decides its physical, chemical, and biological properties which subsequently decide its biological activity (Roy and Das, 2014). And the biological and chemical properties of compounds are portrayed by molecular descriptors in QSAR model (Jin et al., 2022; LI et al., 2023). Therefore, QSAR can be used to predict the biological activity of new compounds by mathematical models based on precisely selected molecular descriptors (Vilar et al., 2008; Si et al., 2022). QSAR shows a strong ability in new drug research, which optimizes pharmacodynamic characteristics and reduces expensive experiments in the meanwhile. Therefore, QSAR can be used to predict the IC₅₀ of PARP1 inhibitors to screen out new potential drug molecules quickly and accurately.

In this study, six QSAR models based on three kinds of methods, heuristic method (HM), gene expression programming (GEP), random forest and support vector regression (SVR) with single, double, and triple kernel function, were established and compared after calculating molecular descriptors to predict the PARP1 IC₅₀ of FTPDDs. The single, double, and triple kernel functions were RBF kernel function, the integration of RBF and polynomial kernel functions, and the integration of RBF, polynomial, and linear kernel functions respectively. Among the results of the models, the predictions of the model constructed by SVR with triple kernel function were consistent with the measured values best. The method of SVR with triple kernel function is a big breakthrough and will contribute a lot to designing and screening new drug molecules. Activity prediction and molecular docking experiments were applied to newly designed multiple compounds. The experimental results indicated that the newly designed compounds showed better performance.

2 Computational details and theories

2.1 Data set

The data set of 57 FTPDDs was collected from literature (Wang H. et al., 2020) and listed in Tables 1–4 according to the main structure of compounds. It included two parts, the structures of 57 FTPDDs and their PARP1 IC₅₀ values which were measured by the same experimental method under the same experimental conditions. Using simple random sampling, the original data is randomly divided into training set and test set according to the ratio of 4:1. The data set was randomly separated into a training set of 45 compounds and a test set of 12 compounds. The training set was used to establish the models and the test set was used to evaluate the prediction ability of the constructed models. For multi model regression evaluation, small datasets are susceptible to noise and randomness, and each model may only have a small number of samples for training, resulting in unstable model results. In addition, overfitting is easy to occur if each model has more degrees of freedom to adjust. Therefore, cross validation is a good method to ensure the stability of machine learning models.

TABLE 1

TABLE 1. Measured and predicted lg (IC₅₀) of FTPDDs 1–16. *The compounds of the test set.

TABLE 2

TABLE 2. Measured and predicted lg (IC₅₀) of FTPDDs 17–34. *The compounds of the test set.

TABLE 3

TABLE 3. Measured and predicted lg (IC₅₀) of FTPDDs 35–54. *The compounds of the test set.

TABLE 4

TABLE 4. Measured and predicted lg (IC₅₀) of FTPDDs 55–57. *The compounds of the test set.

In order to test the robustness of the models, K-fold cross-validation was adopted. The procedure involves splitting the original data evenly into K parts, and sequentially using one of these parts as the validation set while using the remaining K-1 parts as training sets. This process is repeated K times, and the average of the K validation results is used to evaluate the model’s performance.

2.2 Descriptor calculation

The steps of calculating molecular descriptors of compounds are as follow.

The first three steps include drawing molecular structures of all compounds in ChemDraw software and performing preliminary optimization of the structures of the compounds by MM + molecular mechanics force field and more precise optimization by semi-empirical PM3 (Klein et al., 2006) method successively in HyperChem software (HyperChem, 1994). After the three steps, the lowest energy structures are obtained.

The remaining two steps consist of putting files obtained from HyperChem into MOPAC software (Stewart, 1989) to generate MNO files and then using MNO files as the input of the CODESSA software to calculate five classes of molecular descriptors: constitutional, topological, geometrical, electrostatic, quantumchemical (Wright et al., 1997).

2.3 Statistical parameters

The coefficient of determination (denoted R² or r²) and root mean square error (RMSE) were used to evaluate the models. R² is a measure of the goodness of fit of a model. In regression, the R² is a statistical measure of how well the regression predictions approximate the real data points. The closer R² is to 1, the more it indicates that the regression prediction fits the data (Glantz and Slinker, 1990). Furthermore, to verify the robustness of the models, the coefficient of determination of K-fold cross-validation (R²_cv) is used in the evaluation of the results, which is the average of all R² values in K cross validation experiments (Allen, 1974).

RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences (Hyndman and Koehler, 2006).

Mean Absolute Error (MAE) is a measurement used to measure the average absolute difference between predicted and true values in regression problems. MAE can measure the average error between predicted and true values. The smaller the value of MAE, the smaller the average difference between predicted and true values, indicating a higher accuracy of prediction (Tropsha et al., 2003).

The Concordance Correlation Coefficient (CCC) was developed as a measure for the correlation between two sets of data, for instance a gold standard and a second reading (Sheikhpour et al., 2017).

The external predictivity of QSAR models is commonly described by employing validation metrics (Roy and Roy, 2008), such as R² based metrics, namely, Q²_F1 and Q²_F2 (Zhou and Sun, 1999; Golbraikh and Tropsha, 2002).

2.4 Linear model by HM

After generating molecular descriptors, HM in CODESSA software was used to accomplish the pre-selection of the descriptors and build the linear model (Si et al., 2021).

HM in CODESSA software has the advantages of high efficiency and no limitation to the size of the data set. After calculating all molecular descriptors, preprocessing program eliminates 3 types of descriptors that cannot be used: 1) descriptors that not all compounds have; 2) descriptors that have a small variation in magnitude for all structures; 3) two collinear descriptors with the correlation coefficient greater than 0.8 (Wang Y. et al., 2020). The number of remaining available molecular descriptors is marked N. The steps of HM can refer to literature (Katritzky et al., 1995).

The HM performs descriptors pre-selection by the following criteria (Katritzky et al., 2023): 1) Fisher F-criteria must be greater than 1.0; 2) R² value should be higher than a specified threshold; 3) Student’s t criterion must exceed a defined value; 4) duplicate descriptors should have a squared intercorrelation coefficient below a predetermined level, and the descriptor with a higher R² value relative to the property is retained. The remaining descriptors are then arranged in descending order based on their correlation coefficients. Any significant 2 parameter correlation identified by the F-criteria is further expanded recursively into an n parameter correlation until the normalized F-criteria is no longer higher than the initial value. The top N correlations which are determined by both R² and the F-criterion are then saved.

HM attempts to build a series of linear regression models with 1 to N molecular descriptors and find the optimal one. Molecular descriptors determined by HM model serve as independent variables of the non-linear models for the next step (Gao et al., 2022)^.

2.5 Non-linear model by GEP

Considering that the relationship among the factors affecting IC₅₀ of PARP1 inhibitors is complex and usually non-linear, one non-linear model was established by GEP to predict IC₅₀ values of the compounds respectively.

As a new algorithm based on genetic algorithm (GA) and genetic planning (GP), GEP was proposed by Ferreira in 2001 (Ferreira, 2001). More details about GEP algorithm can refer to literature (Zhou and Sun, 1999). And the steps of GEP can refer to literature (Ferreira, 2001; Liu et al., 2004).

The following schematic Figure 1 depicts the essential process of gene expression programming. Initially, a certain number of individuals (known as the initial population) have their chromosomes generated randomly. Subsequently, these chromosomes are expressed into syntactically correct programs, and each individual’s fitness is evaluated against a set of fitness cases, also referred to as the selection environment (Ferreira, 2001). Based on their fitness, individuals are selected and modified for reproducing offspring with new traits. These newly individuals undergo the same developmental process, including genome expression, evaluation within the selection environment, selection, and modified reproduction (Ferreira, 2006). The process is repeated for a specific number of generations or until a satisfactory solution is obtained.

FIGURE 1

FIGURE 1. The process of gene expression programming.

GEP is widely used in science and Ferreira developed Automatic Problem Solver (APS) (Gepsoft, 2023), a commercial software integrated with GEP algorithm. The software can encode the appropriate molecular descriptors which are selected by HM and most related to inhibitor activity, and establish a non-linear model to predict IC₅₀ values of FTPDDs.

2.6 Non-linear model by RF

Random forest regression is an ensemble learning algorithm that performs regression tasks by constructing multiple decision trees and integrating their prediction results. In random forest, each decision tree is trained on randomly selected subsets of the data, reducing the risk of overfitting. The final regression result is obtained by averaging or weighted averaging the predictions of the individual trees.

The algorithm works by randomly selecting samples from the training set to form subsets, increasing the diversity of the models. At each node of each decision tree, only a portion of randomly selected features are considered, improving the robustness of the models. Decision trees are built using recursive partitioning based on the least impurity. In random forest, the optimal splitting points of each tree are determined by calculating their Gini coefficients which are the measurements of the purity of a classification model.

The value range for the Gini coefficient is from 0 to 1. A larger value indicates a higher purity of the model. The classifier for random forest is the CART tree. The Gini coefficient of the CART tree can be represented by Eq. 1.

G i n i (p) = 2 p (1 - p) (1)

When traversing each segmentation point of each feature, it is assumed that using the feature A = a. The set D is divided into two parts, namely, D1 (sample set that satisfies A = a) and D2 (sample set that does not satisfy A = a). The Gini coefficient of D under this feature is Eq. 2.

G i n i (D, A) = \frac{|D_{1}|}{|D|} G i n i (D_{1}) + \frac{|D_{2}|}{|D|} G i n i (D_{2}) (2)

Random forest regression has several advantages such as handling high-dimensional and large-scale datasets, having good generalization performance to avoid overfitting, and handling missing and abnormal values. Additionally, it has strong fitting ability for data with nonlinear relationships. The non-linear model established through random forest can effectively predict the IC₅₀ value of FTPDDs.

2.7 Non-linear models by SVR

In order to build the model with stronger prediction ability, three non-linear models were established based on SVR with single, double, and triple kernel function respectively.

SVR is an important application branch of support vector machine (SVM). SVM, proposed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1964, is a generalized linear classifier (GLC) that classifies the data into binary categories under supervised learning. When SVM is applied in the field of regression analysis, it is often called SVR (Wang Y. et al., 2020). Unlike SVM which maximizes the distance from the hyperplane to the points closest to it, SVR minimizes the total deviation from all sample points to the hyperplane (Du and Wu, 2003).

A significant advantage of SVR is its simplicity in mathematical calculations, as it transforms nonlinear problems in the input space into linear problems in a high-dimensional feature space. Another advantage is that SVR can utilize probability rules to train multiple classifiers on different types of data, thus enhancing prediction accuracy by measuring the classification confidence. Compared to other regression techniques, SVR demonstrates lower computational complexity.

Some parameters of support vector machines play an important role in model training. The penalty constant (C) is a regularization parameter that controls the degree of punishment for misclassified samples. A larger C value can lead to stricter classification, which may lead to the model overfitting the training data. A smaller C value allows for more classification errors, which may lead to better generalization ability.

Slack variable, which introduces tolerance that allows some samples to be at the boundary of classification errors or intervals. These parameters are usually used in conjunction with C to adjust the degree of cosmetic error.

For certain kernel functions, such as polynomial kernels and RBF kernels, there can also be additional parameters, such as the order of the polynomial, the bandwidth of the Gaussian kernel, etc. These parameters need to be adjusted according to the characteristics of the data. For imbalanced datasets, the importance of different categories can be balanced by adjusting their weights. This is very useful when dealing with imbalanced data.

The basic idea of SVR is to use a predetermined non-linear constructor to map the input vector to the high-dimensional feature space and regress in the mapped space. To avoid complex mapping operations in high-dimensional feature space, kernel function is used to realize inner product operation in the original space (Fang and Zhao, 2013).

Referring to literature (Deng and Tian, 2004), the primal problem of standard ε-SVR is as follows.

\begin{array}{l} \min_{ω \in R; ξ_{i}, ξ_{i}^{*}, b \in R} \frac{1}{2} {‖ω‖}^{2} + \frac{C}{l} \sum_{i = 1}^{l} (ξ_{i} + ξ_{i}^{*}), \\ s \cdot t \cdot ((ω \cdot x_{i}) + b) - y_{i} \leq ε + ξ_{i}, \\ y_{i} - ((ω \cdot x_{i}) + b) \leq ε + ξ_{i}^{*}, \\ ξ_{i}^{*}, ξ_{i} \geq 0, i = 1, 2, \dots, l . \end{array} (3)

In Eq. 3, ${ξ_{i}}^{(*)}$ means ${(ξ_{1}, {ξ_{1}}^{*}, \dots, ξ_{k}, {ξ_{k}}^{*})}^{T}$ , which is the slack variable. $b$ is the offset. $ε$ is the maximum error allowed in regression. $C$ is the penalty constant. The problem in Eq. 3 can be transformed into its dual problem:

\begin{array}{l} \min_{α^{(*)} \in R^{2 l}} \frac{1}{2} \sum_{i, j = 1}^{l} ((α_{i}^{*} - α_{i}) (α_{j}^{*} - α_{j}) K (x_{i,} x_{j})) + \\ ε \sum_{i = 1}^{l} (α_{i}^{*} + α_{i}) - \sum_{i = 1}^{l} y_{i} (α_{i}^{*} + α_{i}), \\ s \cdot t \cdot \sum_{i = 1}^{l} (α_{i} - α_{i}^{*}) = 0, \\ 0 \leq α_{i}, α_{i}^{*} \leq \frac{C}{l}, i = 1, 2 \dots, l . \end{array} (4)

In this Eq. 4, $α^{(*)}$ means ${(α_{1}, {α_{1}}^{*}, \dots, α_{l}, {α_{l}}^{*})}^{T}$ which is Lagrange multiplier and $K (x_{i}, y_{i})$ is kernel function. By solving this dual problem, the answer to the primal problem and the final regression decision function are obtained.

According to the quadratic programming method in the optimization theory, the parameters $α_{1}$ and $α_{1}^{*}$ can be obtained in solving the dual problem. The parameter $b$ can be gotten by using the Karush-Kuhn-Tucker (KTT) condition. In this way, the expression of the fitting function on the sample set $f (x)$ can be constructed. Its form is given in Eq. 5. The coefficient $(α_{i} - α_{i}^{*}) \neq 0$ is the support vector.

f (x) = \sum_{i = 1}^{l} (α_{i}, - α_{i}^{*}) * K (x_{i}, x_{j}) + b (5)

2.8 Kernel function of SVR

Kernel function has a great influence on the fitting effect of the SVR model. As mentioned in the part of establishing non-linear models by SVR, the job of the kernel function that realizes the inner product operation in the original space is to simplify calculation in high-dimensional space. Kernel function maps each sample point to an infinite-dimensional feature space to make the linearly inseparable data linearly separable (Wang and Chen, 2014). The popularly used kernel functions include linear kernel function, RBF kernel function, polynomial kernel function, and sigmoid kernel function (Mo et al., 2020). The first three kernel functions have been widely applied when establishing models by SVR and their introduction is as follows (Sheikhpour et al., 2018).

Linear kernel function is the simplest kernel function, which only calculates the inner product of two feature vectors. The accuracy of linear kernel function is not high (Mo et al., 2020) but it can be used to find which feature vectors are important with little computation. The form of linear kernel function is as follows.

K_{L} (x, x_{i}) = (x^{T} x_{i}) (6)

RBF kernel function, a certain kind of scalar function that’s radially symmetric, is the most widely used kernel function. RBF kernel function has a strong local learning ability, which means it can well characterize the local property of the sample, but its generalization ability is not good enough (Fang and Zhao, 2013; Ding et al., 2021). The form of RBF kernel function is given in Eq. 7, where the constant $σ$ is the kernel radius of RBF kernel function.

K_{R B F} (x, x_{i}) = \exp \{- \frac{{‖x - x_{i}‖}^{2}}{2 σ^{2}}\} (7)

The polynomial kernel function is a commonly used kernel function. The generalization ability of polynomial kernel function is strong, but its local learning ability is not good. The form of polynomial kernel function is given in Eq. 8, where the constant $q$ is the order of polynomial kernel function.

K_{P o l y} (x, x_{i}) = {[(x^{T} x_{i})]}^{q} (8)

The process of establishing three models of SVR with different number of kernel functions is as follows.

First, SVR with single kernel function was used to establish the regression model. RBF kernel function is usually selected as the single kernel function due to its good local learning ability. The form of single kernel function is as follows.

K \sin g l e = K_{R B F} (9)

In order to obtain the kernel function that enhances the learning and generalization ability of the model by SVR, Smits et al. proposed the multiple kernel function (Fang and Zhao, 2013). There are many combinations of kernel function, but the obtained multiple kernel function must satisfy Mercer’s Theorem (Smits and Jordaan, 2002).

According to the closure characteristic of kernel functions, the derivation principle of mixed kernel functions can be strictly proven. Take $\{X 1, X 2, \cdot \cdot \cdot, X m\} \subseteq X \subseteq R^{n}$ , the Gram matrices corresponding to $\{X 1, X 2, \cdot \cdot \cdot, X m\} K_{1}$ and $K_{2}$ are both positive definite matrices. If $c \subseteq R^{n}$ is chosen, then c′( $α$ K₁ $+$ $β$ K₂)c = c′ $α$ K₁c $+$ $c^{'} β$ K₂c ≥ 0 indicating that the Gram matrices corresponding to $α K_{1} + β K_{2}$ are all positive definite. From this, it can be concluded that $α K_{1} + β K_{2}, α, β \geq 0$ is kernel function. In addition, by limiting $α + β = 1$ , a stationary mixed kernel function can be constructed. Because the regression ability and generalization ability of SVR model are a pair of mutually balanced factors, the SVR model obtained by mixed kernel functions with different properties in a certain proportion can balance the performance of both aspects.

Considering that the generalization ability of polynomial kernel function is strong, RBF kernel function and polynomial kernel function were integrated as double kernel function to establish the SVR model with enhanced generalization ability (Yang et al., 2023). The form of double kernel function is given in Eq. 10, where the value of the variable $a$ ranges from 0 to 1.

K_{d u a l} = a * K_{R B F} + (1 - a) * K_{p o l y} (10)

Considering linear kernel function can capture key vectors at the cost of little computation and therefore help to enhance the regression effects of the model, linear kernel function was added to the previous double kernel function, which formed triple kernel function to establish the regression model with better learning and generalization ability again. The form of triple kernel function is given in Eq. 11, where variables $a$ and $b$ are positive coefficients and the sum of $a$ and $b$ is less than or equal to 1.

K_{t r i p l e} = a * K_{R B F} + b * K_{p o l y} + (1 - a - b) * K_{L} (11)

Therefore, three models based on SVR with different numbers of kernel functions were constructed to predict the IC₅₀ values of compounds.

2.9 SVR model optimized by particle swarm optimization

The values of parameters when establishing SVR model have a great influence on the learning and generalization ability of the model. In the process of constructing the model by SVR with triple kernel function, six parameters need to be optimized: the insensitive parameter $ε$ , the penalty factor $C$ , the kernel radius of RBF kernel function $σ$ , the coefficient of RBF kernel function $a$ , the order of polynomial kernel function $q$ , and the coefficient of polynomial kernel function $b$ . And in the process of establishing the models by SVR with single and double kernel function, the first three and five parameters need to be optimized respectively. Their searching ranges are as follows. $ε \in [0, 0.8]$ , $C \in [0.001, 200]$ , $σ \in [0.001, 5]$ , $a \in [0, 1]$ , $q \in \{1, 2, 3\}$ , $b \in [0, 1 - a]$ .

However, the optimization of parameters when building model by SVR with single kernel function has already become a research difficulty. As for the model by SVR with multiple kernel function, with the number of parameters increasing, it is more difficult to optimize the parameters. Considering that traditional parameter-searching methods including the grid search method and the random search method are inefficient, particle swarm optimization (PSO) algorithm was used to optimize parameters in the process of establishing three models by SVR.

As a kind of bionic optimization algorithm, PSO was proposed by social psychologist Kennedy and electrical engineer Eberhart in 1995. The idea of PSO is based on birds swarm finding the optimal destination by sharing collective information. PSO algorithm takes random values in high-dimensional space to initialize the position and velocity information of particles and update the information by self and group learning of particles. In PSO algorithm, particles only transmit the optimal information in the process of iterative process, so the PSO has the advantages of fast search speed and convergence rate (Wang, 2022).

PSO adopts the velocity-position model, in which the forms of position and velocity of particle $i$ in D-dimensional solution space are as follows.

X i = (x_{i 1}, x_{i 2}, x_{i 3}, \dots, x_{i D}) (12)

V i = (v_{i 1}, v_{i 2}, v_{i 3}, \dots, v_{i D}) (13)

To characterize which particle is in the best position among all particles, RMSE is used to evaluate the fitness of each particle. The smaller RMSE means better position and fitness. The best position of particle $i$ is recorded as $p_{i b e s t}$ and the global best position is recorded as $g_{b e s t}$ . In each iteration, the particle tracks $p_{i b e s t}$ , $g_{b e s t}$ and its previous state to adjust the position and velocity at the current time. The iterative equations are as follows.

v_{i} (k + 1) = w * v_{i} (k) + c_{1} * r a n d () * (p_{i b e s t} - x_{i} (k)) + c_{2} * r a n d () * (g_{b e s t} - x_{i} (k)) (14)

X_{i} (k + 1) = X_{i} (k) + V_{i} (k + 1) (15)

In the Eq. 15, $V_{i} (k)$ , $V_{i} (k + 1)$ , $X_{i} (k)$ , $X_{i} (k + 1)$ are the velocity and position of the particle $i$ at the current time and the next time respectively; $r a n d ()$ is a random number in range $[0, 1]$ . $c_{1}$ and $c_{2}$ are learning factors which are usually set to 2. $ω$ is the weight factor and the value of $ω$ automatically decreases with the iteration of the algorithm to accelerate the convergence rate. It is generally defined as:

ω = ω_{\min} + (i t e r_{\max} - i t e r) * (ω_{\max} - ω_{\min}) / i t e r_{\max} (16)

$ω_{\max}$ , $ω_{\min}$ are the maximal and minimal weight factors respectively, $i t e r$ is the current number of iterations, and $i t e r_{\max}$ is the maximum number of iterations.

Referring to literature (Xiong and Xu, 2006), The steps of parameter optimization by PSO are as follows.

Step 1. Initialize parameters of PSO including population size and the maximal and minimal weight factors $ω_{\max}$ and $ω_{\min}$ . Set the maximum number of iterators $i t e r_{\max}$ .

Step 2. Each particle is randomly assigned a set of position information $X_{(i, 0)}$ and velocity information $V_{(i, 0)}$ . Set the individual best position $p_{i b e s t}$ of each particle to its current position $X_{(i, 0)}$ .

Step 3. RMSE values of each particle are calculated to evaluate fitness. The initial global best position $g_{b e s t}$ is set to the position of the particle with the best fitness.

Step 4. Update the position and velocity of each particle by iterative calculation according to Eqs 14–16 and calculate the RMSE of each particle to evaluate fitness.

Step 5. Compare the fitness of each particle with the fitness of its $p_{i b e s t}$ . If the fitness is better than that of $p_{i b e s t}$ , $p_{i b e s t}$ is updated to the current position, otherwise the original value of $p_{i b e s t}$ remains unchanged.

Step 6. Compare the updated $p_{i b e s t}$ of each particle with $g_{b e s t}$ . If $p_{i b e s t}$ is better than $g_{b e s t}$ , $g_{b e s t}$ is updated to $p_{i b e s t}$ , otherwise the original value of $g_{b e s t}$ remains unchanged.

Step 7. Judge whether the termination conditions are met. If the maximum number of iterations is reached or $g_{b e s t}$ is not changed, terminate the iteration, otherwise return to step 4.

2.10 Property prediction and molecular docking

It is essential for newly designed molecules not only to exhibit efficacy against the target but also to possess favorable physical and chemical properties. Property explorer is a complimentary tool that predicts physical, chemical, and toxicological properties of molecules to aid in designing active drug compounds (Song et al., 2017). This tool enables researchers to construct new molecular structures and analyze their attribute values automatically. It provides valuable information on properties such as molecular weight, Partition coefficient (LogP), water solubility, topological polar surface area (TPSA), drug similarity, toxicity assessment, overall drug score, etc.

Macromolecular docking has become a key step in the drug development process (Chen et al., 2022a), facilitating the identification of potential therapeutic molecules and predicting ligand-target interactions at the molecular level (Chen et al., 2022b). To explore these interactions between the new PARP inhibitor and PARP1 at the binding site, the Sybyl-X 2.1 software package was employed (Li et al., 2012) This software allowed users to generate potential interactions between the molecules and proteins, as well as exploring the expected binding sites for molecular fitting.

The macromolecular docking technique was utilized to investigate the potential interaction between the new PARP inhibitor and the binding site of PARP1. Firstly, the chemical structure was imported into Sybyl-X software for calculation and optimization with parameters “max interactions” and “max display” as 1000 and 0.01, respectively. The molecules were then assigned Gasteiger-Hückel charges and minimized using the Tripos force field until convergence reached 0.05 kcal/mol/Å (Song et al., 2023).

Subsequently, the protein was imported into Sybyl-X software for hydrogenation, charging, and optimization. Unnecessary ligands and water molecules were removed to enable proper binding with the protein targets.

Finally, the docking results were imported into PyMol software for image optimization, and the amino acid residues and hydrogen bonds were labeled by same software.

3 Results and discussion

3.1 Lg (IC₅₀) of FTPDDs

3.2 Results of HM

540 molecular descriptors were calculated in CODESSA software. To select the molecular descriptors most related to PARP1 inhibitors, a series of linear models with the increasing number of molecular descriptors were established. Figure 2 shows the influences of the number of descriptors on R², R²_cv, and the standard deviation (S) of all compounds.

FIGURE 2

FIGURE 2. Influences of the number of descriptors on R2, R2cv, and S2 of all compounds.

As shown in Figure 2, the values of R² and R²_cv of all compounds were rising with the increasing number of molecular descriptors. However, when the number of descriptors reached eight, adding another descriptor did not significantly improve the statistics of the model, so the eight-parameter model can be regarded as the optimal one. The eight selected molecular descriptors and their physical-chemical meanings are shown in Table 5, and their R² matrix is shown in Table 6. The 8-parameter model is discussed in details as follows.

TABLE 5

TABLE 5. The selected descriptors and their physical-chemical meanings and coefficient.

TABLE 6

TABLE 6. R² matrix of the eight descriptors.

R² of training set and test set in HM model were 0.7550, 0.9014 and their RMSE were 0.2327, 0.2378. The optimal prediction results by HM model are shown in Figure 3. In addition, R²_cv of training set and test set in HM model are 0.7773 and 0.6798.

FIGURE 3

FIGURE 3. Plot of measured and predicted lg (IC50) by HM.

3.3 Results of GEP

The same eight descriptors were imported into APS software to search the ideal model by GEP. Then the optimal model was obtained in the 199th generation. R² of training set and test set in GEP model were 0.7395, 0.7818 and their RMSE were 0.2520 and 0.2768. R²_cv of training set and test set in GEP model were 0.7302 and 0.6998. The optimal prediction results by GEP model are shown in Figure 4. The mathematical equation of the non-linear model built by GEP is as follows.

\begin{array}{l} \lg (IC 50) = f l o o r (d_{5}) + \sec (d_{3}) + d 5 * d 2 + \tan (\log s t i c ((d 0 * d_{1}) * (- d_{1}))) \\ * arccot (d_{2}) + f l o o r (\sin ((d_{3} - \arctan (d_{4})) - (f l o o r (d_{7}) + d_{1}))) \end{array} (17)

FIGURE 4

FIGURE 4. Plot of measured and predicted lg (IC50) by GEP.

In Eq. 17, d0, d1, d2, d3, d4, d5, d6, and d7 represent MENANNB, MEEROA, MRECCB, MBCM, ANRINA, KA (O3), HDH/T (QCP), and MRECHB.

3.4 Results of RF

In order to verify the effectiveness of selecting descriptors through the HM method, RF was applied in the experiment for feature selection. Firstly, 540 unprocessed molecular descriptors belonging to 57 FTPDDs were exported from Codessa. Secondly, all descriptors containing default values were removed. Thirdly, according to the preprocessing steps of the heuristic algorithm, all descriptors with high collinearity and irrationality were also removed. Finally, RF was used to extract features from the remaining descriptors to select the most important 8 descriptors. Figure 5 shows the 8 descriptors selected by RF. The comparison between the descriptors selected by RF and HM was carried out to verify the effectiveness of the descriptors.

FIGURE 5

FIGURE 5. Eight descriptors selected by RF.

The 8 descriptors filtered out are Max resonance energy for a H-N bond (MREAH), HASA-2/TMSA [Zefirov’s PC] (HTZP), Wiener index (WI), Min e-e repulsion for a O atom (MEERAO), Max atomic state energy for a C atom (MASEAC), Final heat of formation (FHF), Min e-n attraction for a N-N bond (MENAAN), Min e-e repulsion for a C-H bond (MEERAC). Min e-e repetition for a O atom and Min e-n attraction for a N-N bond are the two most important descriptors selected through HM. Both descriptors were successfully screened by RF, demonstrating the effectiveness of the HM algorithm.

Furthermore,5-fold cross-validation was used to obtain an R²_cv of 0.87 for the training set and 0.86 for the test set. The MAE values for the training and testing sets are 0.1518 and 0.0821, respectively. In addition, the RMSE of four ring compounds and five ring compounds are 0.3078 and 0.3420, respectively, and their R² values are 0.7203 and 0.9002, respectively.

3.5 Results of SVR

The learning and generalization ability of SVR model depends on kernel function and the values of parameters. In the process of establishing models by SVR, there are always two parameters to be optimized: the insensitive parameter $ε$ and the penalty factor. The analysis of the influences of the parameters to the SVR model (Fang and Zhao, 2013) is as follows. $ε$ reflects the sensitivity of the model to the noise contained in the input vectors. The larger the $ε$ is, the lower the fitting accuracy of the model is. $C$ represents the tolerance of the model to the error. The greater the $C$ is, the higher the fitting accuracy is, which causes more computation. In addition, if the model is established by SVR with multiple kernel function, the coefficients of kernel functions also need to be optimized. Considering the low efficiency of traditional parameter searching algorithms including the grid search method and the random research method when the number of parameters is increasing, PSO was employed to find the best combination of parameters. In PSO algorithm, only the optimal information is transmitted between particles, which avoids searching all combinations of parameters and accelerates convergence rate.

SVR with single kernel function is commonly used, in which RBF kernel function is usually selected. In Eq. 7, the kernel radius $σ$ represents the mean square deviation of RBF function, that is, the width of RBF kernel function in the direction of independent variable. The smaller the $σ$ is, the better the fitting performance of RBF kernel function is, which leads to poor generalization ability (Fang and Zhao, 2013). The optimal combination of parameters by PSO is as follows.

(ε, C, σ) = (0.024, 4.132, 0.758)

The optimal prediction results of the model established by SVR with single kernel function are given in Figure 6. R² of training set and test set in the model were 0.9114 and 0.8599 and their RMSE were 0.0215 and 0.0491 respectively. In addition, R²_cv of training set and test set in the model were 0.8223 and 0.7815.

FIGURE 6

FIGURE 6. Plot of measured and predicted lg (IC50) by SVR with single kernel function.

Results showed that R² of training set was about 0.06 more than that of test set in the model established by SVR with single kernel function, which meant the model had an unsatisfactory generalization ability. Therefore, polynomial kernel function was integrated with RBF kernel function to establish the model by SVR with double kernel function. Therefore, the coefficient of RBF kernel function $a$ and the order of polynomial kernel function $q$ also need to be optimized. The optimal combination of parameters by PSO is as follows.

(ε, C, σ, a, q) = (0.013, 0.658, 0.328, 0.51, 2)

The ratio of RBF kernel function to polynomial kernel function was 0.51:0.49, which meant RBF kernel function and polynomial kernel function play almost equal roles. The optimal prediction results of the model established by SVR with double kernel function are given in Figure 7. R² of training set and test set in the model were 0.9259 and 0.9175 and their RMSE were 0.0180 and 0.0289 respectively. In addition, R²_cv of training set and test set in the model were 0.8989 and 0.8712.

FIGURE 7

FIGURE 7. Plot of measured and predicted lg (IC50) by SVR with double kernel function.

However, results showed that both R²_cv of training set and test set were less than 0.9 in the model established by SVR with double kernel function, which meant the learning and generalization ability of the model still needed to be improved. As mentioned before, linear kernel function can improve the regression ability of the model, so linear kernel function, polynomial kernel function, and RBF kernel function composed triple kernel function, which formed a novel SVR model with better learning and generalization ability. The optimal combination of parameters by PSO is as follows.

(ε, C, σ, a, q, b) = (0.024, 1.123, 0.332, 0.31, 2, 0.58)

The ratio of RBF kernel function, polynomial kernel function, and linear kernel function was 0.31:0.58:0.11, which indicated that polynomial kernel function played the most important role in realizing inner product operation of kernel function. The optimal prediction results of SVR with triple kernel function are given in Figure 8. R² of training set and test set in the model were 0.9353 and 0.9348 and their RMSE were 0.0157 and 0.0228 respectively. In addition, R²_cv of training set and test set in the model were 0.9090 and 0.8971.

FIGURE 8

FIGURE 8. Plot of measured and predicted lg (IC50) by SVR with triple kernel function.

3.6 Design of new FTPDDs

Through the analysis of the molecular descriptors adopted in models, the structural factors that influence the IC₅₀ values of the FTPDDs were identified. The non-standardized coefficients in table 2 represent the slope of the regression equation for each independent variable and indicate the magnitude of change in the dependent variable (IC₅₀) with respect to each independent variable. These coefficients are not dependent on the unit of the independent variables and can reveal the influence of various factors on IC₅₀ activity.

The descriptors included in the HM model provide valuable insights into the factors related to IC₅₀ activity:

(1) “MENANNB” refers to the interaction force between the electrons and nucleus in the bond between two nitrogen atoms. Reducing this value leads to a significant reduction in IC₅₀. (Clementi, 1980).

(2) “MEEROA” represents the minimum recurrence between electrons in an oxygen atom. As its coefficient in the HM model is negative, increasing the MEEROA value results in gradually decrease in the IC₅₀ value. (Clementi, 1980).

(3) “MRECCB” reflects the maximum response energy of a C-C bond. In the HM model, it has a positive regression coefficient, meaning that smaller MRECCB value indicates stronger inhibitory ability of PARP. (Clementi, 1980).

(4) “MBCM” quantifies the contribution of each atomic orbital in a molecular orbital to the formation of a chemical bond. In the HM model, a positive regression coefficient suggests that smaller MBCM value leads to a lower IC₅₀ value and higher activity. (Clementi, 1980).

(5) “ANRINA” describes the relative reactivity of nitrogen atoms as nuclear agents in chemical reactions. The negative coefficient in the HM model implies that increasing the ANRINA value leads to gradual decrease in the IC₅₀ value. (Franke, 1984).

(6) “KA (O3)” reflects the connectivity and arrangement of atoms in a molecule. The negative regression coefficient indicates that larger KA (O3) value corresponds to a smaller IC₅₀ value and higher activity. (Kier, 1985).

(7) “HDH/T (QCP)” describes hydrogen bond interactions on the molecular surface. A negative regression coefficient suggests that higher HDH/T (QCP) value leads to a smaller IC₅₀ value and higher activity. (Clementi, 1980).

(8) “MRECGB” refers to the context of a molecular orbital (MO) particle in bonding in a chemical molecule or specifications. In the HM model, the negative regression coefficient indicates that larger MRECGB value results in a smaller IC₅₀ value and higher activity. (Clementi, 1980).

In summary, the interpretation of the HM model and molecular descriptors has revealed several factors affecting inhibitory activity. To design new compounds, it is beneficial to reduce polar interactions between molecular atoms and alter the characteristics of different charge distributions of N atoms.

To obtain an ideal inhibitor structure, the structural composition of compound 44 which is the most effective compound in the literature can be modified based on these factors. Molecular structure adjustments can focus on the R region shown in Figure 9. Specifically, changes in the benzene ring with its 6 C atoms may be favorable for achieving the desired distribution of different charges.

FIGURE 9

FIGURE 9. The design strategy mainly focused on the R region of compound 44.

Some chemical functional groups were incorporated into positions R₁ to R₄, utilizing a random combination approach to minimize polar interactions between atoms. These functional groups include halogen, hydroxyl, carboxyl, aldehyde, hydrocarbon, as well as various forms of carbon and nitrogen atoms.

Through continuous and purposeful adjustments and combinations, a set of 126 molecules was designed based on analysis of descriptors in the HM model. Subsequently, the physical and chemical parameters of the newly designed molecules were calculated using CODESSA software. By inputting these parameters into the HM model, the IC₅₀ value of each molecule was predicted. If the predicted IC₅₀ value was lower than that of compound 44, the corresponding molecule would be selected for further analysis and macromolecular docking study in the Property explorer applet. Table 7 shows the predicted IC50 by HM and Docking total score of new FTPDDs.

TABLE 7

TABLE 7. Predicted IC50 by HM and Docking total score of new FTPDDs.

As a result of the analysis, the predicted IC₅₀ values of six new compounds indicate that these compounds have preferable properties are worth further research. The predicted IC₅₀ values of all these compounds were lower than the predicted values obtained in HM training for compound 44.

3.7 Property prediction of new FTPDDs

In order to predict properties of the new compound, the Property explorer applet (PEA) was applied in the experiment. This applet provides real-time predictions of physico-chemical properties and identification of potential toxicity risks for any chemical structure drawn. It can evaluate many properties of compounds, including partition coefficient, water solubility, topological polar surface area (TPSA), molecular weight, etc. Table 8 shows the predicted IC₅₀ by HM and properties by PEA of newly designed compounds.

TABLE 8

TABLE 8. Predicted IC₅₀ by HM and properties by PEA of newly designed compounds.

The partition coefficient, abbreviated P, is defined as a particular ratio of the concentrations of a solute between the two solvents and the logP is the logarithm of the ratio. The LogP value represents the logarithm of the partition coefficient between n-octanol and water, which is a standard measure of a compound’s hydrophilicity. It has been established that the LogP value of compounds with good potential absorption should not exceed 5.0 (Kwon, 2001).

Water solubility significantly influences the intestinal absorption and cellular distribution characteristics of compounds. Higher solubility means better absorption. The goal of drug design is to obtain compounds with higher water solubility.

TPSA is the sum of all topological polar regions on the molecular surface and is closely related to various bioavailability-related characteristics, such as permeability through the Blood–brain barrier (Ertl et al., 2000).

Molecular weight plays a role in the biological activity of compounds. Lower molecular weight compounds are more easily absorbed and distributed.

Drug similarity is utilized in new drug design to evaluate the “similarity” of the compound with factors such as bioavailability (Smith, 2011).

3.8 Molecular docking of new FTPDDs

During the lunar docking experiments, the newly designed compounds were employed as ligands to dock with PARP (pdb code 7CMW). Remarkably, compound 44a exhibited the most favorable performance in the macromolecular docking, achieving remarkable total score of 7.7065 which significantly surpassing that of compound 44. The detailed binding mode of compound 44a is presented in Figure 10, illustrating the formation of two crucial hydrogen bonds with specific residues.

FIGURE 10

FIGURE 10. Docking assay of compound 44a with PARP related target (PDB ID: 7CMW).

Based on the docking conformation of compound 44a, the N atom establishes a hydrogen bond with the residue GLY-863, which aligns with the binding pattern observed in compound 44. Additionally, the nitrogen atoms from the newly incorporated structural component also form hydrogen bonds with SER-904. The strong bond reaction observed between compound 44a and PARP suggests that it could potentially serve as a promising candidate inhibitor for this protease.

Molecular docking requires the preparation of ligands and protein structures. Maestro is required for Protein Preparation. Firstly, energy minimization is performed, and then the receptor is isolated from the protein. Minimize the energy of the generated compound in Chem3D using MM2, align the processed compound as a ligand in PyMol, and finally calculate the RMSD between the generated compound and the existing ligand (Wang and Zhang, 2023). Table 9 shows RMSD values between the selected docking pose of 7 cmw and the experimental X-ray structure.

TABLE 9

TABLE 9. RMSD values between the selected docking pose of 7 cmw and the experimental X-ray structure.

4 Comparison of different models

To verify the robustness of the models, k-fold cross-validation was used for model evaluation and the value of k was set to 5. All optimal predicted results of models based on HM, GEP, RF and SVR with single, double, and triple kernel function and their R²_cv are given in Table 10 respectively.

TABLE 10

TABLE 10. Comparison of Statistical parameters of different methods.

It is obvious that the non-linear model established by SVR with triple kernel function shows the strongest prediction ability and model robustness than that of other models. Compared with the linear model built by HM, nonlinear models can better describe complicated problems. Compared with GEP which is easily trapped in the local optimal solution of the problem, SVR is a convex quadratic optimization method, which makes its local optimal solution the global optimal one. Compared to RF, the kernel function of SVR can be selected and adjusted according to the nature of the problem to adapt to different data distributions and patterns. Compared with SVR with single kernel function, the difference between R² of training set and test set in the model by SVR with double kernel function decrease from 0.0515 to 0.0084, which demonstrated the addition of polynomial kernel function did improve the generalization ability of SVR model. Compared with SVR with double kernel function, R² of training set and test set in the model by SVR with triple kernel function increased by 0.0094 and 0.0173, which demonstrated the addition of linear kernel function was helpful to improve the learning and generalization ability of SVR model.

Among the three models established by SVR with different numbers of kernel function, the model established by SVR with triple kernel function shows the best learning and generalization ability because kernel functions are complementary. In addition, the differences between the optimal R² and R²_cv of training set and test set in the model by SVR with triple kernel function are 0.0192 and 0.0309 which are much less than that of other models. It indicates that the model established by SVR with triple kernel function has strong robustness.

Furthermore, to verify the effect of compounds with different ring numbers on the predicted results, the original dataset was divided into tetracyclic compounds and pentacyclic compounds, and the R² and RMSE of six models related to these two compounds were calculated respectively. Table 10 shows that the prediction results by six models related to pentacyclic compounds better fit the measured values. R² of three SVR models related to tetracyclic compounds improve significantly. Furthermore, the improvement of SVR kernel function can better improve the performance of models related to tetracyclic compounds.

In addition, to demonstrate the external predictive ability of the model, Table 10 presents the calculation results of the three statistics Q²_F1, Q²_F2, and CCC under six different models.

5 Conclusion

In this study, 6 eight-parameter QSAR models were established by HM, GEP, RF and SVR with single, double, and triple kernel function to predict the biological activity of 57 FTPDDs as PARP1 inhibitors respectively. Compared with other models, the model established by SVR with triple kernel function shows the strongest prediction ability and robustness, which indicates that the method of SVR with triple kernel function has good potential for constructing models to predict the biological activity of compounds and guiding drug design. In addition, the PSO algorithm shows a strong parameter-optimized ability in the process of establishing SVR model due to its characteristic of high searching speed and fast convergence rate, which means PSO has good potential for optimizing parameters when building SVR model. Furthermore, the model established by SVR with triple kernel function shows 8 important factors that have a great influence on the biological activity of PARP1 inhibitors, which will guide new drug design and screening for breast cancer. Six FTPDDs were designed using these 8 important factors and molecular docking experiments were conducted on them. The properties of new derivatives were ultimately verified, and the effectiveness of the SVR model was also demonstrated.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

HX: Data curation, Methodology, Writing–review and editing, Project administration, Writing–original draft. RZ: Data curation, Methodology, Validation, Writing–review and editing. XY: Writing–review and editing, Data curation, Methodology. RW: Data curation, Supervision, Writing–original draft. PZ: Supervision, Project administration, Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

Throughout the process of writing this article, our team received an immense amount of support and assistance. First and foremost, the authors would like to express our sincerest gratitude to Changyou Zhou, Hexiang Wang etc., from BeiGene (Beijing) Co, for providing our team with the experimental data. We are incredibly grateful for their contribution. Additionally, we would like to extend our thanks to our senior colleagues who generously offered us invaluable advice. Their guidance and support enabled us to navigate obstacles and explore new frontiers with confidence. Lastly, but not least, we would like to express our deepest appreciation to our mentor, PZ, who is the corresponding author of this article. His profound expertise has been instrumental in shaping our research questions and methodologies. His insightful feedback continuously challenged us to enhance our thinking and elevate the quality of our team’s work.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2024.1257253/full#supplementary-material

References

Allen, D. M. (1974). The relationship between variable selection and data agumentation and a method for prediction. technometrics 16 (1), 125–127. doi:10.1080/00401706.1974.10489157

CrossRef Full Text | Google Scholar

An, L. Y., Xiang, Y. H., Zhang, Z. Y., and Hu, W. X. (2006). The new advance and applications of quantitative structure-activity relationship. J. Cap. Normal Univ. 27 (3), 52–57. doi:10.19789/j.1004-9398.2006.03.014

CrossRef Full Text | Google Scholar

Chen, C., and Si, H. (2021). QSAR models of Celecoxib analogues and derivatives as COX-2 inhibitor to predict their anti-inflammatory effect. Cancer Cell 33 (2021), 827–835. doi:10.54762/ccr2022.827-835

CrossRef Full Text | Google Scholar

Chen, Y., Ma, K., Si, H., Duan, Y., and Zhai, H. (2022a). Network Pharmacology integrated molecular docking to reveal the autism and mechanism of baohewan heshiwei wen dan tang. Curr. Pharm. Des. 28 (39), 3231–3241. doi:10.2174/1381612828666220926095922

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, Y., Ma, K., Xu, P., Si, H., Duan, Y., and Zhai, H. (2022b). Design and screening of new lead compounds for autism based on QSAR model and molecular docking studies. Molecules 27 (21), 7285. doi:10.3390/molecules27217285

PubMed Abstract | CrossRef Full Text | Google Scholar

Clementi, E. (1980). Computational aspects of large chemical systems. New York: Springer Verlag.

Google Scholar

Deng, N. Y., and Tian, Y. (2004). A new method of data mining-support vector machine. Beijing City: Science Publication, 51.

Google Scholar

Ding, Y., Liu, S., and Wang, X. (2021). The SOH estimate method research on lead-acid battery based on mixture kernels function RVM. Electr. Drive 51 (22). doi:10.19457/j.1001-2095.dqcd21888

CrossRef Full Text | Google Scholar

Du, S., and Wu, T. (2003). Support vector machines for regression. J. Syst. Simul. 15 (11), 1580–1633. doi:10.3969/j.issn.1004-731X.2003.11.023

CrossRef Full Text | Google Scholar

Ertl, P., Rohde, B., and Selzer, P. (2000). Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 43 (20), 3714–3717. doi:10.1021/jm000942e

PubMed Abstract | CrossRef Full Text | Google Scholar

Fang, F., and Zhao, L. (2013). Face recognition based on multi-kernel support vector machine. Electron. Des. Eng. 21 (11), 4–6. doi:10.14022/j.cnki.dzsjgc.2013.11.042

CrossRef Full Text | Google Scholar

Ferreira, C. (2001). Gene expression programming: a new adaptive algorithm for solving problems. arXiv preprint cs/0102027. Available at: https://doi.org/10.48550/arXiv.cs/0102027.

Google Scholar

Ferreira, C. (2006). Gene expression programming: mathematical modeling by an artificial intelligence (Vol. 21). Berlin, Heidelberg: Springer.

Google Scholar

Franke, R. (1984). Theoretical drug design methods. Amsterdam: Elsevier.

Google Scholar

Gao, Z., Xia, R., and Zhang, P. (2022). Prediction of anti-proliferation effect of [1, 2, 3] triazolo [4, 5-d] pyrimidine derivatives by random forest and mix-kernel function SVM with PSO. Chem. Pharm. Bull. 70 (10), 684–693. doi:10.1248/cpb.c22-00376

CrossRef Full Text | Google Scholar

Gepsoft (2023). GeneXproTools Modeling made easy 5.0. Available at: http://www.gepsoft.com/gepsoft.

Google Scholar

Glantz, S. A., and Slinker, B. K. (1990). Primer of applied regression and analysis of variance. New York, United States: McGraw-Hill. ISBN 978-0-07-023407-9.

Google Scholar

Golbraikh, A., and Tropsha, A. (2002). Beware of q2! J. Mol. Graph. Model. 20 (4), 269–276. doi:10.1016/S1093-3263(01)00123-1

PubMed Abstract | CrossRef Full Text | Google Scholar

Huang, J., and Yin, X. (2021). Current status and prospect of PARP inhibitors in the treatment of tumors. Anti-tumor Pharm. 11 (3). doi:10.3969/j.issn.2095-1264.2021.03.09

CrossRef Full Text | Google Scholar

Hyndman, R. J., and Koehler, A. B. (2006). Another look at measures of forecast accuracy. Int. J. Forecast. 22 (4), 679–688. doi:10.1016/j.ijforecast.2006.03.001

CrossRef Full Text | Google Scholar

HyperChem (1994). Version 4.0, hypercube.

Google Scholar

Jin, Y., Liu, X., Zhou, J., Wang, L., Pan, S., and Lu, H. (2022). Inhibitory effect and mechanism of ‘Taizi Yangrong Decoction’on oral mucositis after radiotherapy for nasopharyngeal carcinoma in vivo and in vitro. Available at: https://doi.org//10.54762/ccr2022.877-885.

Google Scholar

Katritzky, A. R., Lobanov, V. S., and Karelson, M. (1995). QSPR: the correlation and quantitative prediction of chemical and physical properties from structure. Chem. Soc. Rev. 24 (4), 279–287. doi:10.1039/CS9952400279

CrossRef Full Text | Google Scholar

Katritzky, A. R., Lobanov, V. S., and Karelson, M. (2023). CODESSA, reference manual. Florida: University of Florida.

Google Scholar

Kier, L. B. (1985). A shape index from molecular graphs. Quant. Struct. -Act. Relat. 4, 109–116. doi:10.1002/qsar.19850040303

CrossRef Full Text | Google Scholar

Klein, E., Matis, M., Lukeš, V., and Cibulková, Z. (2006). The applicability of AM1 and PM3 semi-empirical methods for the study of N–H bond dissociation enthalpies and ionisation potentials of amine type antioxidants. Polym. Degrad. Stab. 91 (2), 262–270. doi:10.1016/j.polymdegradstab.2005.05.010

CrossRef Full Text | Google Scholar

Kwon, Y. (2001). “4.2.4: partition and distribution coefficients,” in Handbook of essential pharmacokinetics, pharmacodynamics and drug metabolism for industrial scientists (New York: Kluwer Academic/Plenum Publishers), 44. ISBN 978-1-4757-8693-4.

Google Scholar

Li, X., Ye, L., Wang, X., Wang, X., Liu, H., Qian, X., et al. (2012). Molecular docking, molecular dynamics simulation, and structure-based 3D-QSAR studies on estrogenic activity of hydroxylated polychlorinated biphenyls. Sci. Total. Environ. 441, 230–238. doi:10.1016/j.scitotenv.2012.08.072

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, G., Wang, X., Li, A., and Zhang, P. (2023). QSAR study on the IC50 of thiosemicarbazone derivatives as PC-3 inhibitors based on mixed kernel function support vector machine. Lat. Am. J. Pharm. 42 (3), 543–553.

Google Scholar

Liu, W., Bao, Y., Sun, L., and Wang, G. (2004). Study on regression analysis techniques based on genetic algorithm. Comput. Eng. Appl. 2004 (21), 94–97. doi:10.3321/j.issn:1002-8331.2004.21.030

CrossRef Full Text | Google Scholar

Lord, C. J., and Ashworth, A. (2017). PARP inhibitors: synthetic lethality in the clinic. Science 355 (6330), 1152–1158. doi:10.1126/science.aam7344

PubMed Abstract | CrossRef Full Text | Google Scholar

Luo, Q., Zhang, J., and Li, Z. (2021). Research advances in mechanism of action and drug resistance of PARP inhibitor in breast cancer. Med. Recapitulate 27 (24). doi:10.3969/j.issn.1006-2084.2021.24.012

CrossRef Full Text | Google Scholar

Mo, Y., Gu, M., and Zhang, H. (2020). Classification of college students with financial difficulties based on combination kernel function. J. Anhui Univ. Technol. 37 (1). doi:10.3969/j.issn.1671-7872.2020.01.010

CrossRef Full Text | Google Scholar

Muvarak, N., Kelley, S., Robert, C., Baer, M. R., Perrotti, D., Gambacorti-Passerini, C., et al. (2015). c-MYC generates repair errors via increased transcription of alternative-NHEJ factors, LIG3 and PARP1, in tyrosine kinase–activated leukemias. Mol. Cancer Res. 13 (4), 699–712. doi:10.1158/1541-7786.MCR-14-0422

PubMed Abstract | CrossRef Full Text | Google Scholar

Ohmoto, A., and Yachida, S. (2017). Current status of poly (ADP-ribose) polymerase inhibitors and future directions. OncoTargets Ther. 10, 5195–5208. doi:10.2147/OTT.S139336

PubMed Abstract | CrossRef Full Text | Google Scholar

Roy, K., and Das, R. N. (2014). A review on principles, theory and practices of 2D-QSAR. Curr. Drug Metab. 15, 346–379. doi:10.2174/1389200215666140908102230

PubMed Abstract | CrossRef Full Text | Google Scholar

Roy, P. P., and Roy, K. (2008). On some aspects of variable selection for partial least squares regression models. QSAR Comb. Sci. 27 (3), 302–313. doi:10.1002/qsar.200710043

CrossRef Full Text | Google Scholar

Sharma, S., Javadekar, S. M., Pandey, M., Srivastava, M., Kumari, R., and Raghavan, S. C. (2015). Homology and enzymatic requirements of microhomology-dependent alternative end joining. Cell death Dis. 6 (3), e1697. doi:10.1038/cddis.2015.58

PubMed Abstract | CrossRef Full Text | Google Scholar

Sheikhpour, R., Sarram, M. A., Gharaghani, S., and Chahooki, M. A. Z. (2017). Feature selection based on graph Laplacian by using compounds with known and unknown activities. J. Chemom. 31 (8), e2899. doi:10.1002/cem.2899

CrossRef Full Text | Google Scholar

Sheikhpour, R., Sarram, M. A., Rezaeian, M., and Sheikhpour, E. (2018). QSAR modelling using combined simple competitive learning networks and RBF neural networks. SAR QSAR Environ. Res. 29 (4), 257–276. doi:10.1080/1062936X.2018.1424030

PubMed Abstract | CrossRef Full Text | Google Scholar

Si, Y., Ma, K., Hu, Y., Si, H., and Zhai, H. (2022). QSAR model study of 2, 3, 4, 5-tetrahydro-1H-pyrido [4, 3-b] indole of cystic-brosis-transmembrane conductance-regulator gene potentiators. Lett. Drug Des. Discov. 19 (4), 269–278. doi:10.2174/1570180818666211022142920

CrossRef Full Text | Google Scholar

Si, Y., Xu, X., Hu, Y., Si, H., and Zhai, H. (2021). Novel quantitative structure–activity relationship model to predict activities of natural products against COVID-19. Chem. Biol. Drug Des. 97 (4), 978–983. doi:10.1111/cbdd.13822

PubMed Abstract | CrossRef Full Text | Google Scholar

Smith, G. F. (2011). Designing drugs to avoid toxicity. Prog. Med. Chem. 50, 1–47. doi:10.1016/B978-0-12-381290-2.00001-X

PubMed Abstract | CrossRef Full Text | Google Scholar

Smits, G. F., and Jordaan, E. M. (2002). “Improved SVM regression using mixtures of kernels,” in Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290), Honolulu, HI, USA, 12-17 May 2002. doi:10.1109/IJCNN.2002.1007589

CrossRef Full Text | Google Scholar

Song, F., Cui, L., Piao, J., Liang, H., Si, H., Duan, Y., et al. (2017). Quantitative structure-activity relationship and molecular docking studies on designing inhibitors of the perforin. Chem. Biol. Drug Des. 90 (4), 535–544. doi:10.1111/cbdd.12975

PubMed Abstract | CrossRef Full Text | Google Scholar

Song, F., Sun, H., Ma, X., Wang, W., Luan, M., Zhai, H., et al. (2023). QSAR and molecular docking studies on designing potent inhibitors of SARS-CoVs main protease. Front. Pharmacol. 14, 1185004. doi:10.3389/fphar.2023.1185004

PubMed Abstract | CrossRef Full Text | Google Scholar

Stewart, J. P. P. (1989). MOPAC 6.0, Quantum chemistry program exchange. QCPE, No. 455. Bloomington: Indiana University.

Google Scholar

Sung, H., Ferlay, J., Siegel, R. L., Laversanne, M., Soerjomataram, I., Jemal, A., et al. (2021). Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71 (3), 209–249. doi:10.3322/CAAC.21660

PubMed Abstract | CrossRef Full Text | Google Scholar

Tropsha, A., Gramatica, P., and Gombar, V. K. (2003). The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22 (1), 69–77. doi:10.1002/qsar.200390007

CrossRef Full Text | Google Scholar

Vilar, S., Cozza, G., and Moro, S. (2008). Medicinal chemistry and the molecular operating environment (MOE): application of QSAR and molecular docking to drug discovery. Curr. Top. Med. Chem. 8 (18), 1555–1572. doi:10.2174/156802608786786624

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, H., Ren, B., Liu, Y., Jiang, B., Guo, Y., Wei, M., et al. (2020a). Discovery of pamiparib (BGB-290), a potent and selective poly (ADP-ribose) polymerase (PARP) inhibitor in clinical development. J. Med. Chem. 63 (24), 15541–15563. doi:10.1021/acs.jmedchem.0c01346

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, X., and Chen, J. (2014). Parameter selection of SVM with Gaussian kernel. China: Computer Systems & Applications, 242–245. doi:10.3969/j.issn.1003-3254.2014.07.048

CrossRef Full Text | Google Scholar

Wang, Y., Liu, Z., Qu, A., Zhang, P., Si, H., and Zhai, H. (2020b). Study of tacrine derivatives for acetylcholinesterase inhibitors based on artificial intelligence. Lat. Am. J. Pharm. 39 (6), 1159–1170.

Google Scholar

Wang, Y., and Zhang, P. (2023). Prediction of histone deacetylase inhibition by triazole compounds based on artificial intelligence. Front. Pharmacol. 14, 1260349. doi:10.3389/fphar.2023.1260349

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, Z. (2022). Servo parameters optimization of linear motor based on hybrid-PSO algorithm. Agric. Equip. Veh. Eng. 60 (3). doi:10.3969/j.issn.1673-3142.2022.03.013

CrossRef Full Text | Google Scholar

Wilcock, P., and Webster, R. M. (2021). The breast cancer drug market. Nat. Rev. Drug Discov. 20 (5), 339–340. doi:10.1038/D41573-021-00018-6

PubMed Abstract | CrossRef Full Text | Google Scholar

Wright, J. S., Carpenter, D. J., Mckay, D. J., and Ingold, K. U. (1997). Theoretical calculation of substituent effects on the O−H bond strength of phenolic antioxidants related to vitamin E. Phys. Rev. Lett. 119 (18), 4245–4252. doi:10.1021/ja963378z

CrossRef Full Text | Google Scholar

Xiong, W. L., and Xu, B. G. (2006). Study on optimization of SVR parameters selection based on PSO. J. Syst. Simul. 18 (9), 2442–2445. doi:10.3969/j.issn.1004-731X.2006.09.017

CrossRef Full Text | Google Scholar

Yang, X., Qiu, H., Zhang, Y., and Zhang, P. (2023). Quantitative structure–activity relationship study of amide derivatives as xanthine oxidase inhibitors using machine learning. Front. Pharmacol. 14, 1227536. doi:10.3389/fphar.2023.1227536

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhou, M., and Sun, S. (1999). Principle and application of genetic algorithm. Beijing: National defense industry press.

Google Scholar

Zong, C., Zhu, T., He, J., Huang, R., Jia, R., and Shen, J. (2022). PARP mediated DNA damage response, genomic stability and immune responses. Int. J. Cancer 150 (11), 1745–1759. doi:10.1002/ijc.33918

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: breast cancer, PARP inhibitor, quantitative structure-activity relationship (QSAR), support vector regression (SVR), multiple kernel function, particle swarm optimization (PSO)

Citation: Xue H, Zhang R, Yan X, Wang R and Zhang P (2024) Study of PARP inhibitors for breast cancer based on enhanced multiple kernel function SVR with PSO. Front. Pharmacol. 15:1257253. doi: 10.3389/fphar.2024.1257253

Received: 12 July 2023; Accepted: 02 January 2024;
Published: 02 February 2024.

Edited by:

Ming Hao, Leidos Biomedical Research, Inc., United States

Reviewed by:

Md Tofiz Uddin, National Institute of Cancer Research and Hospital, Bangladesh
Elmira Nazarshodeh, University of Tehran, Iran

Copyright © 2024 Xue, Zhang, Yan, Wang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Peijian Zhang, cGVpamlhbnpoQDEyNi5jb20=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Study of PARP inhibitors for breast cancer based on enhanced multiple kernel function SVR with PSO

1 Introduction

2 Computational details and theories

2.1 Data set

2.2 Descriptor calculation

2.3 Statistical parameters

2.4 Linear model by HM

2.5 Non-linear model by GEP

2.6 Non-linear model by RF

2.7 Non-linear models by SVR

2.8 Kernel function of SVR

2.9 SVR model optimized by particle swarm optimization

2.10 Property prediction and molecular docking

3 Results and discussion

3.1 Lg (IC₅₀) of FTPDDs

3.2 Results of HM

3.3 Results of GEP

3.4 Results of RF

3.5 Results of SVR

3.6 Design of new FTPDDs

3.7 Property prediction of new FTPDDs

3.8 Molecular docking of new FTPDDs

4 Comparison of different models

5 Conclusion

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

Supplementary material

References

95% of researchers rate our articles as excellent or good

95% of researchers rate our articles as excellent or good

Study of PARP inhibitors for breast cancer based on enhanced multiple kernel function SVR with PSO

1 Introduction

2 Computational details and theories

2.1 Data set

2.2 Descriptor calculation

2.3 Statistical parameters

2.4 Linear model by HM

2.5 Non-linear model by GEP

2.6 Non-linear model by RF

2.7 Non-linear models by SVR

2.8 Kernel function of SVR

2.9 SVR model optimized by particle swarm optimization

2.10 Property prediction and molecular docking

3 Results and discussion

3.1 Lg (IC50) of FTPDDs

3.2 Results of HM

3.3 Results of GEP

3.4 Results of RF

3.5 Results of SVR

3.6 Design of new FTPDDs

3.7 Property prediction of new FTPDDs

3.8 Molecular docking of new FTPDDs

4 Comparison of different models

5 Conclusion

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

Supplementary material

References

95% of researchers rate our articles as excellent or good

3.1 Lg (IC₅₀) of FTPDDs