
ORIGINAL RESEARCH article

Front. Genet., 29 January 2025

Sec. Computational Genomics

Volume 15 - 2024 | https://doi.org/10.3389/fgene.2024.1483490

This article is part of the Research Topic "The 22nd International Conference on Bioinformatics (InCoB 2023): Translational Bioinformatics Transforming Life".

Quadratic descriptors and reduction methods in a two-layered model for compound inference

Jianshen Zhu1, Naveed Ahmed Azam2*, Shengjuan Cao1, Ryota Ido1, Kazuya Haraguchi1, Liang Zhao3, Hiroshi Nagamochi1 and Tatsuya Akutsu4
  • 1Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto, Japan
  • 2Department of Mathematics, Quaid-i-Azam University, Islamabad, Pakistan
  • 3Graduate School of Advanced Integrated Studies in Human Survivability (Shishu-Kan), Kyoto University, Kyoto, Japan
  • 4Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Japan

Compound inference models are crucial for discovering novel drugs in bioinformatics and chemo-informatics. These models rely heavily on useful descriptors of chemical compounds that effectively capture important information about the underlying compounds for constructing accurate prediction functions. In this article, we introduce quadratic descriptors, the products of two graph-theoretic descriptors, to enhance the learning performance of a novel two-layered compound inference model. A mixed-integer linear programming formulation is designed to approximate these quadratic descriptors for inferring desired compounds with the two-layered model. Furthermore, we introduce different methods to reduce descriptors, aiming to avoid computational complexity and overfitting issues during the learning process caused by the large number of quadratic descriptors. Experimental results show that for 32 chemical properties of monomers and 10 chemical properties of polymers, the prediction functions constructed by the proposed method achieved high test coefficients of determination. Furthermore, our method inferred chemical compounds in a time ranging from a few seconds to approximately 60 s. These results indicate a strong correlation between the properties of chemical graphs and their quadratic graph-theoretic descriptors.

1 Introduction

In recent years, extensive studies have been done on the design of novel molecules using various machine learning techniques (Lo et al., 2018; Tetko and Engkvist, 2020; Gartner III et al., 2024). Computational molecular design also has a long history in the field of chemo-informatics and has been studied under the names of quantitative structure-activity relationship (QSAR) (Cherkasov et al., 2014) and inverse quantitative structure-activity relationship (inverse QSAR) (Miyao et al., 2016; Ikebata et al., 2017; Rupakheti et al., 2015). This design problem has also become an important topic in both bioinformatics and machine learning.

The purpose of QSAR is to predict chemical activities from given chemical structures (Cherkasov et al., 2014). In most QSAR studies, a chemical structure is represented as a vector of real numbers called features or descriptors, and then a prediction function is applied to the vector, where a chemical structure is given as a chemical graph, which is defined by an undirected graph with an assignment of chemical elements to vertices and an assignment of bond multiplicities to edges. A prediction function is usually obtained from existing structure–activity relationship data. To this end, various regression-based methods have been utilized in traditional QSAR studies, whereas machine learning-based methods, including artificial neural network (ANN)-based methods, have recently been utilized (Ghasemi et al., 2018; Kim et al., 2021; Catacutan et al., 2024).

Conversely, the purpose of inverse QSAR is to predict chemical structures from given chemical activities (Miyao et al., 2016; Ikebata et al., 2017; Rupakheti et al., 2015; Miyao and Funatsu, 2024), where additional constraints may often be imposed to effectively restrict the possible structures. In traditional inverse QSAR, a feature vector is computed by applying some optimization or sampling method on the prediction function obtained by standard QSAR, and then chemical structures are reconstructed from the feature vector. However, the reconstruction itself is quite difficult due to the vast number of possible chemical graphs (Bohacek et al., 1996). In fact, inferring a chemical graph from a given feature vector, except for some simple cases, is an NP-hard problem (Akutsu et al., 2012). Due to this inherent difficulty, most existing methods employ heuristic methods for the reconstruction of chemical structures and thus do not guarantee optimal or exact solutions. On the other hand, one of the advantages of ANNs is that generative models, such as autoencoders and generative adversarial networks, are available. Furthermore, graph-structured data can be directly handled by using graph convolutional networks (Kipf and Welling, 2016). Therefore, it is reasonable to try to apply ANNs to inverse QSAR (Xiong et al., 2021). Various ANN models, including recurrent networks, autoencoders, generative networks, and invertible flow models, have been applied in this context (Segler et al., 2018; Yang et al., 2017; Gómez-Bombarelli et al., 2018; Kusner et al., 2017; De Cao and Kipf, 2018; Madhawa et al., 2019; Shi et al., 2020). However, the optimality or exactness of the solutions provided by these methods is not yet guaranteed.

A novel two-phase framework based on mixed-integer linear programming (MILP) and machine learning methods has been developed to infer chemical graphs (Shi et al., 2021; Zhu et al., 2022b; Ido et al., 2021; Azam et al., 2021a). The first phase constructs a prediction function η for a chemical property, and the second phase infers a chemical graph with a target value of the property based on the function η. For a chemical property π, we define Cπ as a data set of chemical graphs such that the observed value a(C) of property π is available for every chemical graph C ∈ Cπ.

In the first phase, we introduce a feature function f: G → ℝ^K for a positive integer K, where the descriptors of a chemical graph are defined based on local graph structures by using a two-layered model (Shi et al., 2021) so that the inverse of f can be modeled by an MILP in the second phase. The prediction function aims to produce an output y = η(x) ∈ ℝ based on the feature vector x = f(C) ∈ ℝ^K for each C ∈ Cπ. This output serves as a predicted value for the observed value a(C).

The task of the second phase is to infer desired chemical graphs. This phase consists of three steps. For a given set of rules called a topological specification σ and a range [y̲*, ȳ*] of the target property value, the aim of the first step is to infer chemical graphs C* that satisfy the rules σ and y̲* ≤ η(f(C*)) ≤ ȳ* (see Figure 1 for an illustration). For this, we formulate an MILP M(f, η, σ) that represents (i) the computation process of x ← f(C) from a chemical graph C in the feature function f; (ii) that of y ← η(x) from a vector x ∈ ℝ^K in the prediction function η; and (iii) the constraint C ∈ Gσ, where Gσ denotes the set of all chemical graphs that satisfy the rules in σ. Given an interval with y̲*, ȳ* ∈ ℝ, we solve the MILP M(f, η, σ) to find a feature vector x* ∈ ℝ^K and a chemical graph C ∈ Gσ such that f(C) = x* and y̲* ≤ η(x*) ≤ ȳ* (if the MILP instance is infeasible, this suggests that Gσ does not contain such a desired chemical graph). See Zhu et al. (2022a) for a full description of the framework.


Figure 1. An illustration of inferring desired chemical graphs C ∈ Gσ with y̲* ≤ η(f(C)) ≤ ȳ*.

In the second and third steps of the process, we employ different techniques to generate additional desired chemical graphs based on the initial solution C. These techniques include a dynamic programming algorithm (Azam et al., 2021b) and a neighbor search method (Azam et al., 2021a). The dynamic programming algorithm operates by decomposing C into trees and generating their isomers. These isomers are then combined to produce a set of isomers C* that belong to the desired chemical graph space Gσ; these are referred to as the recombination solutions of C. The idea of the neighbor search method is to generate new desired chemical graphs Ci ∈ Gσ whose feature vectors differ slightly from that of C. For this purpose, we solve an augmented MILP obtained from M(f, η, σ) by adding a set of linear constraints. The resulting graphs obtained through this process are referred to as the neighbor solutions of C.

Contribution: The feature vector x = f(C) of a chemical graph C in this framework consists of the descriptors x(i), 1 ≤ i ≤ K, which extract local information on the interior and exterior parts of the graph obtained from the two-layered model. These simple descriptors play a crucial role in deriving compact MILP formulations for the inference of a desired chemical graph. However, for certain chemical properties, the prediction functions constructed based on the feature function f could not achieve evaluation scores in the acceptable range. To improve the evaluation score, we introduce new quadratic descriptors x(i)·x(j) (or x(i)·(1 − x(j))). This drastically increases the number of descriptors, which would require extra running time for learning or lead to overfitting of the data set. Moreover, the computation of quadratic descriptors cannot be directly formulated as a set of linear constraints in the original MILP. For this, we introduce a method of reducing a set of descriptors into a smaller set that delivers a prediction function with a higher performance. We also design an MILP formulation to represent a quadratic term x(i)·x(j). Based on the same MILP formulation M(f, η, σ) proposed by Zhu et al. (2022b), we implemented the framework to treat the feature function with quadratic descriptors. From the results of our computational experiments on more than 40 chemical properties, we observe that our new method of utilizing quadratic descriptors has improved the performance of the prediction function for many chemical properties.

2 Quadratic descriptors and their approximation in an MILP

In the framework with the two-layered model (Shi et al., 2021), the feature vector consists of graph-theoretic descriptors that are mainly the frequencies of atoms, bonds, and edge configurations in the interior and exterior of the chemical compounds. See Zhu et al. (2022a) for the details of the interior and exterior of chemical compounds and of all the descriptors x(1), x(2), …, x(K₁), which are called linear descriptors. There are some chemical properties for which the performance of a prediction function constructed with this feature vector remains rather low. To improve the learning performance in the two-layered model, we introduce the quadratic terms x(i)·x(j) (or x(i)·(1 − x(j))), 1 ≤ i ≤ j ≤ K₁, as new descriptors, where we assume that each x(i) is normalized between 0 and 1 by using min-max normalization. Note that computing quadratic descriptors cannot be directly formulated as a set of linear constraints in the original MILP used in the two-layered model. Therefore, this section introduces an MILP formulation that approximates the product of two descriptors.
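For illustration, the following Python sketch (not the authors' implementation; the matrix X below is a hypothetical stand-in for the min-max-normalized linear descriptors of a data set) shows how the quadratic descriptors x(i)·x(j), 1 ≤ i ≤ j ≤ K₁, and x(i)·(1 − x(j)), 1 ≤ i, j ≤ K₁, can be generated and how their count matches the total K₁ + (3K₁² + K₁)/2 used later in Section 3.

import numpy as np

def with_quadratic_descriptors(X_lin: np.ndarray) -> np.ndarray:
    """Append the quadratic descriptors x(i)*x(j) (i <= j) and x(i)*(1 - x(j))
    to a matrix whose columns are min-max-normalized linear descriptors."""
    _, k1 = X_lin.shape
    products = [X_lin[:, i] * X_lin[:, j] for i in range(k1) for j in range(i, k1)]
    complements = [X_lin[:, i] * (1.0 - X_lin[:, j]) for i in range(k1) for j in range(k1)]
    return np.column_stack([X_lin] + products + complements)

# Example: K1 = 3 linear descriptors yield 3 + (3*3**2 + 3)/2 = 18 descriptors in total.
X = np.random.rand(5, 3)                      # 5 compounds, 3 linear descriptors
print(with_quadratic_descriptors(X).shape)    # (5, 18)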

Given two real values x and y with 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, the process of computing the product z = x·y can be approximately formulated as the following MILP. First, regard (2^(p+1) − 1)·x as an integer with a binary expression of p + 1 bits, where x(j) ∈ {0, 1} denotes the value of the j-th bit. Then, compute y·x(j), which becomes the j-th bit z(j) of (2^(p+1) − 1)·z.

Constants:

- x, y: reals with 0 ≤ x, y ≤ 1;

- p: a positive integer;

variables:

- z, z(j), j ∈ [0, p]: reals with 0 ≤ z, z(j) ≤ 1;

- x(j) ∈ {0, 1}, j ∈ [0, p]: binary variables;

constraints:

∑_{j ∈ [0,p]} 2^j·x(j) − 1 ≤ (2^(p+1) − 1)·x ≤ ∑_{j ∈ [0,p]} 2^j·x(j);

z(j) ≤ x(j), j ∈ [0, p];

y − (1 − x(j)) ≤ z(j) ≤ y + (1 − x(j)), j ∈ [0, p];

z = (1/(2^(p+1) − 1))·∑_{j ∈ [0,p]} 2^j·z(j).

Note that the necessary number of integer variables for computing x·y for one pair of x and y is p + 1. In this article, we set p := 6 in our computational experiment. The relative error with p = 6 in the above method is at most 1/(2^(p+1) − 1) = 1/127, which is approximately 0.8%.
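The effect of this approximation can be checked outside any solver with the short Python sketch below. It is an illustration of the same binary-expansion idea, not the MILP itself: where the MILP brackets (2^(p+1) − 1)·x between the two sums above, the sketch simply rounds it to the nearest integer.

def approx_product(x: float, y: float, p: int = 6) -> float:
    """Approximate z = x*y for 0 <= x, y <= 1 via a (p+1)-bit expansion of x,
    mirroring the MILP constraints: z(j) = y if the j-th bit x(j) is 1, else 0."""
    scale = 2 ** (p + 1) - 1                                    # 127 for p = 6
    bits = [(round(scale * x) >> j) & 1 for j in range(p + 1)]  # bits x(j), j in [0, p]
    return sum(2 ** j * y * b for j, b in enumerate(bits)) / scale

x, y = 0.37, 0.82
print(x * y, approx_product(x, y))  # exact product vs. approximation (error on the order of 1/127)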

3 Methods for reducing descriptors

The number of descriptors in the two-layered model increases drastically due to the quadratic descriptors. More precisely, if Dπ(1) = {x(k) | k ∈ [1, K₁]} denotes the set of linear descriptors and Dπ(2) = {x(i)·x(j) | 1 ≤ i ≤ j ≤ K₁} ∪ {x(i)·(1 − x(j)) | 1 ≤ i, j ≤ K₁} denotes the set of quadratic descriptors constructed over a data set for a property π, then the total number of descriptors is K₁ + (3K₁² + K₁)/2. This large number of descriptors would increase the running time needed to construct prediction functions or lead to overfitting of the data set. To address these issues, this section proposes methods for reducing descriptors.

Let C be a given set of chemical compounds, D be a descriptor set obtained using the two-layered model, and K* ∈ [1, |D|] be the size of a set of selected descriptors.

For given C, D and a real number λ > 0, we denote by Des-set-LLR(C, D, λ) the subset S of D that consists of the descriptors d with w(d) ≠ 0 in the hyperplane (w, b) constructed by the LLR procedure LLR(C, D, λ).

3.1 A method based on lasso linear regression

Because lasso linear regression finds some number of descriptors d ∈ D with w(d) = 0 in the output hyperplane (w, b), we can reduce a given set of descriptors by applying lasso linear regression repeatedly. Choose parameters cmax and dmax so that LLR(C, D, λ) can be executed in a reasonable running time when |C| ≤ cmax and |D| ≤ dmax. Let K̃ ∈ [1, |D|] be an integer for the number of descriptors that we choose from a given set D of descriptors.

LLR-Reduce(C,D):

Input: A data set C and a set D of descriptors;

Output: A subset D̃ ⊆ D with |D̃| = K̃.

Initialize D′ ← D;

while |D′| > K̃ do

 Partition D′ randomly into disjoint subsets D1, D2, …, Dp such that

  |Di| ≤ dmax for each i;

 for each i = 1, 2, …, p do

  Choose a subset Ci of C randomly with |Ci| = min{cmax, |C|};

  Di ← Des-set-LLR(Ci, Di, λ) for some λ > 0

 end for;

 D′ ← D1 ∪ D2 ∪ ⋯ ∪ Dp

end while;

Output D̃ ← D′ after adding to D′ extra K̃ − |D′| descriptors from the previous set D′ when |D′| < K̃ by using the K-best method.

In this article, we set cmax := 200, dmax := 200, and K̃ := 5,000, and regard w(d) ≠ 0 only if |w(d)| ≥ 10⁻⁶ in our computational experiment.
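A minimal Python sketch of LLR-Reduce is shown below, with Des-set-LLR realized by scikit-learn's Lasso. The data matrix X (columns indexed by descriptors) and target vector y are assumed inputs, and the final K-best padding step is omitted for brevity; this is an illustration under those assumptions, not the authors' code.

import numpy as np
from sklearn.linear_model import Lasso

def des_set_llr(X, y, cols, lam, eps=1e-6):
    """Return the descriptors (column indices) whose Lasso weight is non-zero."""
    w = Lasso(alpha=lam, max_iter=10000).fit(X[:, cols], y).coef_
    return [c for c, wc in zip(cols, w) if abs(wc) >= eps]

def llr_reduce(X, y, lam=0.01, c_max=200, d_max=200, k_tilde=5000, seed=1):
    """Repeatedly partition the descriptor set and keep only non-zero-weight descriptors."""
    rng = np.random.default_rng(seed)
    cols = list(range(X.shape[1]))
    while len(cols) > k_tilde:
        rng.shuffle(cols)
        blocks = [cols[i:i + d_max] for i in range(0, len(cols), d_max)]
        survivors = []
        for block in blocks:
            rows = rng.choice(X.shape[0], size=min(c_max, X.shape[0]), replace=False)
            survivors += des_set_llr(X[rows], y[rows], block, lam)
        if not survivors or len(survivors) == len(cols):
            break  # no further reduction is possible
        cols = survivors
    return cols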

3.2 A method based on the backward stepwise procedure

A backward stepwise procedure (Draper and Smith, 1998) reduces the number of descriptors one by one, choosing at each step the removal that maximizes the learning performance, and outputs the subset with the maximum learning performance among all subsets encountered during the reduction.

For a subset S ⊆ D and a positive integer p, let R²_{CV,MLR}(C, S, p) denote the median of the test coefficient of determination R² of a prediction function ηw,b constructed by MLR(C, S) over p rounds of 5-fold cross-validation. We define a performance evaluation function gp: 2^D → ℝ for an integer p ≥ 1 such that gp(S) = R²_{CV,MLR}(C, S, p). The backward stepwise procedure with this function gp is described as follows.

BS-Reduce(C,D,p):

Input: A data set C, a set D of descriptors, an integer p ≥ 1, and the performance evaluation function gp: 2^D → ℝ defined above;

Output: A subset D* ⊆ D.

Compute best ← gp(D); Initialize Dbest ← D and D′ ← D;

while D′ ≠ ∅ do

 Compute r(d) ← gp(D′ \ {d}) for each descriptor d ∈ D′;

 Set d* ∈ D′ to be a descriptor that maximizes r(d) over all d ∈ D′;

 Update: D′ ← D′ \ {d*};

 if r(d*) > best then update Dbest ← D′ and best ← r(d*) end if

end while;

Output D* ← Dbest.
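The backward stepwise loop can be sketched in Python as follows (again an illustration, not the authors' implementation): the evaluation function gp is realized as the median test R² of multi-linear regression over p rounds of 5-fold cross-validation with scikit-learn, and the loop stops one step before the descriptor set becomes empty so that the regression always has at least one column.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def g_p(X, y, cols, p=5):
    """Median test R^2 of MLR over p rounds of 5-fold cross-validation."""
    scores = []
    for seed in range(p):
        cv = KFold(n_splits=5, shuffle=True, random_state=seed)
        scores += list(cross_val_score(LinearRegression(), X[:, cols], y, cv=cv, scoring="r2"))
    return float(np.median(scores))

def bs_reduce(X, y, cols, p=5):
    """Backward stepwise elimination that keeps the best-scoring subset seen so far."""
    best_set, best = list(cols), g_p(X, y, cols, p)
    current = list(cols)
    while len(current) > 1:
        # score the removal of each remaining descriptor and drop the most favorable one
        score, d_star = max((g_p(X, y, [c for c in current if c != d], p), d) for d in current)
        current = [c for c in current if c != d_star]
        if score > best:
            best, best_set = score, list(current)
    return best_set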

Based on lasso linear regression and the backward stepwise procedure, we design the following method for choosing a subset D* of a given set D of descriptors. We are given a set A of 17 real numbers and a set B(a) of 16 real numbers close to each number a ∈ A. The method first chooses a best parameter λbest ∈ A to construct a prediction function by LLR and then chooses a subset Di ⊆ D for each λi ∈ B(λbest) by the backward stepwise procedure. The backward stepwise procedure takes O(|D|²) iterations, which may require a large amount of running time. We therefore introduce an upper bound smax on the size of an input descriptor set D for the backward stepwise procedure. Let p1, p2, and p3 be integer parameters that control the number of executions of cross-validations used to evaluate the learning performance in the method.

Select-Des-set(C,D):

Input: A data set C, a set D of descriptors, a set A of different values of λ, and a set B(λ) of real numbers close to each λ ∈ A;

Output: A subset D* of D.

for each λ ∈ A do

 Compute Dλ ← Des-set-LLR(C, D, λ) and r(λ) ← R²_{CV,MLR}(C, Dλ, p1)

end for;

Set λbest to be a λ ∈ A that maximizes r(λ); denote B(λbest) by {λ1, λ2, …, λ16};

for each i ∈ [1, 16] do

 Compute Di ← Des-set-LLR(C, D, λi) and let (w, b), w ∈ ℝ^|D|, b ∈ ℝ, be the hyperplane obtained by this LLR;

 if |Di| ≤ smax then

  D′i ← Di

 else

  Let D′i consist of the smax descriptors d ∈ Di that have the smax largest absolute values |w(d)| among the weights {w(d) | d ∈ Di} of the hyperplane (w, b)

 end if;

 D′i ← BS-Reduce(C, D′i, p2); r(i) ← R²_{CV,MLR}(C, D′i, p3)

end for;

Output D* to be a set D′i that maximizes r(i), i ∈ [1, 16].

In the computational experiment in this article, we set A = {0, 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 0.01, 0.05, 0.1, 0.5, 0.75, 1, 2, 5, 10, 25, 50, 100}, |B(λ)| = 16 for each λ ∈ A, p1 = p2 = p3 = 5, and smax = 150 + 10⁴/(|C| + 200).

4 Results and discussion

With our new method of choosing descriptors and formulating an MILP to treat quadratic descriptors in the two-layered model, we implemented the framework for inferring chemical graphs and conducted experiments to evaluate the computational efficiency.

4.1 Chemical properties

In the first phase, we used the following 32 chemical properties of monomers and 10 chemical properties of polymers.

For monomers, we used the following data sets: biological half-life (BHL), boiling point (Bp), critical temperature (Ct), critical pressure (Cp), dissociation constants (Dc), flash point in closed cup (Fp), heat of combustion (Hc), heat of vaporization (Hv), octanol/water partition coefficient (Kow), melting point (Mp), optical rotation (OptR), refractive index of trees (RfIdT), vapor density (Vd), and vapor pressure (Vp) provided by HSDB from PubChem (2022); electron density on the most positive atom (EDPA) and Kovats retention index (Kov) provided by Jalali-Heravi and Fatemi (2001); entropy (ET) provided by Duchowicz et al. (2002); heat of atomization (Ha) and heat of formation (Hf) provided by Roy and Saha (2003); surface tension (SfT) provided by Goussard et al. (2017); viscosity (Vis) provided by Goussard et al. (2020); isobaric heat capacities liquid (LhcL) and isobaric heat capacities solid (LhcS) provided by Naef (2019); lipophilicity (Lp) provided by Xiao (2017); flammable limits lower of organics (FlmLO) provided by Yuan et al. (2019); molar refraction at 20° (Mr) provided by Ponce (2003); solubility (Sl) provided by ESOL (MoleculeNet, 2022); and energy of highest occupied molecular orbital (Homo), energy of lowest unoccupied molecular orbital (Lumo), the energy difference between Homo and Lumo (Gap), isotropic polarizability (Alpha), heat capacity at 298.15 K (Cv), internal energy at 0 K (U0), and electric dipole moment (mu), where the properties from Homo to mu are taken from the common data set QM9 (MoleculeNet, 2022).

The data set QM9 contains more than 130,000 compounds. In our experiment, we used a set of 1,000 compounds randomly selected from the data set. For the Hv property, we removed the chemical compound with compound identifier (CID) 7947 as an extreme outlier from the original data set.

For polymers, we used the following data provided by Bicerano (2002): experimental amorphous density (AmD), characteristic ratio (ChaR), dielectric constant (DieC), dissipation factor (DisF), heat capacity in liquid (HcL), heat capacity in solid (HcS), mol volume (MlV), permittivity (Prm), refractive index of polymers (RfIdP), and glass transition (Tg), where we excluded from our test data set every polymer whose chemical formula could not be found by its name in the book (Bicerano, 2002). A summary of these properties is given in Tables 1, 2. We remark that the previous learning experiments for π ∈ {ChaR, RfIdP} based on the two-layered model proposed by Azam et al. (2021a) and Zhu et al. (2022c) excluded some number of polymers as outliers. In our experiments, we do not exclude any polymer from the original data set as an outlier for these properties.


Table 1. Results of setting data sets for monomers.


Table 2. Results of setting data sets for polymers.

4.2 Experimental setup and setting data set

We executed the experiments on a PC with a Core i7-9700 processor (3.0 GHz; up to 4.7 GHz at maximum) and 16 GB of DDR4 RAM. Prediction functions were constructed using scikit-learn version 1.0.2 with Python 3.8.12, and MLPRegressor with the rectified linear unit (ReLU) activation function was used for ANN-based learning. The prediction functions constructed by the different machine learning models were evaluated using the coefficient of determination R². We performed 10 rounds of 5-fold cross-validation, calculated the training and test R² scores for each cross-validation, and recorded the median of the test R² scores across all 50 cross-validations.
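The evaluation protocol described above can be reproduced with a few lines of scikit-learn, as in the sketch below. A synthetic data matrix stands in for the descriptors of a data set Cπ, and the hyperparameters of MLPRegressor other than the ReLU activation are chosen only for illustration, so this is a sketch of the protocol rather than the authors' script.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(120, 30)                   # placeholder descriptor matrix
y = np.random.rand(120)                       # placeholder observed property values

X = MinMaxScaler().fit_transform(X)           # min-max normalization of descriptors
scores = []
for seed in range(10):                        # 10 rounds of 5-fold cross-validation
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    model = MLPRegressor(activation="relu", max_iter=2000, random_state=seed)
    scores += list(cross_val_score(model, X, y, cv=cv, scoring="r2"))
print("median test R^2:", np.median(scores))  # median over the 50 test scores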

For each property π, we first selected a set Λ of chemical elements and then collected a data set Cπ of chemical graphs over the set Λ of chemical elements. To construct the data set Cπ, we eliminated chemical compounds that did not satisfy all of the following conditions: the graph is connected, the number of carbon atoms is at least four, and the number of non-hydrogen neighbors of each atom is at most four.

We set the branch-parameter ρ to be 2, introduce linear descriptors defined by the two-layered graph in the chemical model without suppressing hydrogen, and use the set Dπ(1) ∪ Dπ(2) of linear and quadratic descriptors (see Zhu et al. (2022a) for the details). We normalize the range of each linear descriptor and the range {t ∈ ℝ | a̲ ≤ t ≤ ā} of property values a(C), C ∈ Cπ, by using min-max normalization.

Among the above properties, we found that the median of the test coefficient of determination R² of the prediction function constructed by LLR (Zhu et al., 2022b) or ALR (Zhu et al., 2022c) exceeds 0.98 for the following nine properties of monomers (resp., three properties of polymers): EDPA, Hc, Ha, Hf, LhcL, LhcS, Mr, Vd, and U0 (resp., HcL, HcS, and MlV). We excluded these properties from further analysis because they already achieved excellent predictive performance, and further improvement would not provide additional insights. The remaining 23 chemical properties of monomers and seven chemical properties of polymers were used in the following experiments.

Tables 1, 2 show the size and range of data sets that we prepared for each chemical property to construct a prediction function, where we denote the following:

- π: the name of a chemical property used in the experiment.

- Λ: a set of selected elements used in the data set Cπ; Λ is one of the following sets:

       Λ1 = {H, C, O}; Λ2 = {H, C, O, N}; Λ3 = {H, C, O, Si(4)}; Λ4 = {H, C, O, N, S(2), F}; Λ5 = {H, C, O, N, Cl, Pb}; Λ6 = {H, C, O, N, Si(4), Cl, Br}; Λ7 = {H, C, O, N, S(2), S(6), Cl}; Λ8 = {H, C, O, N, S(2), S(4), S(6), Cl}; Λ9 = {H, C(2), C(3), C(4), C(5), O, N(1), N(2), N(3), F}; Λ10 = {H, C, O, N, P(2), P(5), Cl}; Λ11 = {H, C, O(1), O(2), N}; Λ12 = {H, C, O, N, Cl}; Λ13 = {H, C, O, N, Cl, S(2)}; Λ14 = {H, C, O(1), O(2), N, Cl, Si(4), F}; Λ15 = {H, C, O(1), O(2), N, Si(4), Cl, F, S(2), S(6), Br}; Λ16 = {H, C, O(2), N, Cl, P(3), P(5), S(2), S(4), S(6), Si(4), Br, I}, where a(i) for a chemical element a and an integer i ≥ 1 means the chemical element a with valence i.

- |Cπ|: the size of data set Cπ over Λ for the property π.

- n̲,n̄: the minimum and maximum values of the number n(C) of non-hydrogen atoms in the compounds C in Cπ.

- a̲,ā: the minimum and maximum values of a(C) for π over the compounds C in Cπ.

- |Γ|: the number of different edge-configurations of interior edges over the compounds in Cπ.

- |F|: the number of non-isomorphic chemical rooted trees in the set of all 2-fringe-trees in the compounds in Cπ.

- K₁: the size |Dπ(1)| of the set Dπ(1) of linear descriptors, where |Dπ(2)| = (3K₁² + K₁)/2 holds.

4.3 Results on the first phase of the framework

For each chemical property π, we construct a prediction function by one of the following four methods (i)–(iv) and compare their results.

(i) LLR: use lasso linear regression on the set Dπ(1) of linear descriptors (see Zhu et al. (2022b) for the details of the implementation);

(ii) ANN: use ANN on the set Dπ(1) of linear descriptors (see Zhu et al. (2022b) for the details of the implementation);

(iii) ALR: use adjustive linear regression on the set Dπ(1) of linear descriptors (see Zhu et al. (2022c) for the details of the implementation); and

(iv) R-MLR: apply our method (see Zhu et al. (2022a)) of reducing descriptors to the set Dπ(1) ∪ Dπ(2) of linear and quadratic descriptors, and use multi-linear regression for the resulting set of descriptors.

For methods (i)–(iii), we use the same implementation described in Zhu et al. (2022b) and Zhu et al. (2022c) and omit the details.

In method (iv), we apply LLR-Reduce(Cπ, Dπ(1) ∪ Dπ(2)) to compute D̃ ⊆ Dπ(1) ∪ Dπ(2) of size 5,000 if |Dπ(1) ∪ Dπ(2)| > 5,000 for a monomer property π; otherwise, we set D̃ := Dπ(1) ∪ Dπ(2). Then, we apply Select-Des-set(Cπ, D̃) to get D* ⊆ D̃, which is used to construct prediction functions based on MLR.

Results of the learning experiments are listed in Tables 3, 4, where:

- π: the property name;

- Λ: the chemical element set of Cπ;

- LLR, ANN, ALR, R-MLR: the median of the test R² scores over ten rounds of 5-fold cross-validation for the functions obtained by methods (i), (ii), (iii), and (iv), respectively;

- the best R² score for each property among LLR, ANN, ALR, and R-MLR is indicated by “*”;

- K₁*, K₂*: the number K₁* of reduced linear descriptors and the number K₂* of reduced quadratic descriptors in D* used in MLR.


Table 3. Results of constructing prediction functions for monomers.


Table 4. Results of constructing prediction functions for polymers. A negative value indicates that the respective model performs worse than simply predicting the mean of the observed values.

The computation times for finding D* and for constructing functions by our method (iv) were in the ranges of [80, 4×10⁴] seconds and [0.03, 0.46] seconds, respectively.

Tables 3, 4 show that method (iv) significantly improved the learning performance for several properties and achieved better scores than methods (i)–(iii) for 43 of the 47 data sets. We also noticed that most of the selected descriptors in D* are quadratic, which confirms the effectiveness of the proposed quadratic descriptors.

4.4 Results on the second phase of the framework

To execute the second phase, we used a set of seven instances Ia, Ibi, i ∈ [1, 4], Ic, and Id based on the seed graphs prepared by Zhu et al. (2022b). We here present their seed graphs GC (see Zhu et al. (2022a) for the details of Ia, Ibi, i ∈ [1, 4], Ic, and Id).

The seed graph GC1 of Ib1 (resp., GCi, i = 2, 3, 4, of Ibi, i = 2, 3, 4) is illustrated in Figure 2.


Figure 2. (i) Seed graph GC1 for Ib1 and Id; (ii) seed graph GC2 for Ib2; (iii) seed graph GC3 for Ib3; (iv) seed graph GC4 for Ib4.

Instance Ic is introduced to infer an intermediate graph C, which preserves

- the core part of CA: CID 24822711 given in Figure 3A; and


Figure 3. An illustration of chemical compounds for instances Ic and Id: (A) CA: CID 24822711; (B) CB: CID 59170444; (C) CA: CID 10076784; (D) CB: CID 44340250, where hydrogens are omitted.

- the frequencies of all edge-configurations of CB: CID 59170444 given in Figure 3B. The seed graph GC of this instance is depicted in gray in Figure 3A.

Instance Id has been introduced in order to infer a chemical graph C such that

- C is monocyclic (where the seed graph of Id is given by GC1 in Figure 2(i)); and

- the frequency vector of edge configurations in C is a vector obtained by merging those of chemical graphs CA: CID 10076784 and CB: CID 44340250 in Figures 3C, D, respectively.

4.4.1 Solving an MILP for the inverse problem

We executed the stage of solving an MILP to infer a chemical graph for two properties π ∈ {Bp, Dc}.

For the MILP formulation M(f, η, σ), we use, for each π ∈ {Bp, Dc}, the prediction function η constructed by method (iv), R-MLR, which attained the median test R² in Table 3. To solve an MILP with this formulation, we used CPLEX version 12.10. Tables 5, 6 show the computational results of the experiment in this stage for the two properties, where we denote the following:

- nLB: a lower bound on the number of non-hydrogen atoms in a chemical graph C to be inferred;

- y̲*, ȳ*: lower and upper bounds y̲*, ȳ* ∈ ℝ on the value a(C) of a chemical graph C to be inferred;

- #v (resp., #c): the number of variables (resp., constraints) in the MILP;

- I-time: the time (sec.) to solve the MILP;

- n: the number n(C) of non-hydrogen atoms in the chemical graph C inferred by solving the MILP;

- nint: the number nint(C) of interior vertices in the chemical graph C; and

- η: the predicted property value η(f(C)) of the chemical graph C.


Table 5. Results of inferring a chemical graph C and generating recombination solutions for Bp with Λ7.


Table 6. Results of inferring a chemical graph C and generating recombination solutions for Dc with Λ7.

Figure 4A illustrates the chemical graph C inferred from Ic with (y̲*,ȳ*)=(340,350) of Bp in Table 5.


Figure 4. (A) C with η(f(C))=344.98 inferred from Ic with (y̲*,ȳ*)=(340,350) of Bp; (B) C with η(f(C))=0.558 inferred from Ia with (y̲*,ȳ*)=(0.55,0.60) of Dc; and (C) C with η(f(C))=3.199 inferred from Id with (y̲*,ȳ*)=(3.15,3.20) of Dc.

Figure 4B (resp., Figure 4C) illustrates the chemical graph C inferred from Ia with (y̲*,ȳ*)=(0.55,0.60) (resp., Id with (y̲*,ȳ*)=(3.15,3.20)) of Dc in Table 6.

In this experiment, we prepared several different types of instances: instances Ia and Ic have restricted seed graphs, the other instances have abstract seed graphs, and instances Ic and Id have restricted sets of fringe trees. From Tables 5, 6, we observe that an instance with many variables and constraints takes more running time than those with a smaller size in general. All instances in this experiment are solved in a few seconds to approximately 60 s with our MILP formulation.

4.4.2 Generating recombination solutions

Let C be a chemical graph obtained by solving the MILP M(f, η, σ) for the inverse problem. We here execute a stage of generating recombination solutions C* ∈ Gσ of C such that f(C*) = x* = f(C).

We execute an algorithm for generating up to 100 chemical isomers of C when the number of all chemical isomers exceeds 100. For this, we use a dynamic programming algorithm (Zhu et al., 2022b). The algorithm first decomposes C into a set of acyclic chemical graphs. Next, it replaces each acyclic chemical graph T with another acyclic chemical graph T′ that admits the same feature vector as that of T, and finally, it assembles the resulting acyclic chemical graphs into a chemical isomer C* of C. The algorithm can also compute a lower bound on the total number of all chemical isomers of C without generating all of them.

Tables 5, 6 show the computational results of the experiment in this stage for the two properties π ∈ {Bp, Dc}, where we denote the following:

- D-time: the running time (s) to execute the dynamic programming algorithm to compute a lower bound on the number of all chemical isomers C* of C and generate all (or up to 100) chemical isomers C*;

- C-LB: a lower bound on the number of all chemical isomers C* of C; and

- #C: the number of all (or up to 100) chemical isomers C* of C generated in this stage.

From Tables 5, 6, we observe the running time and the number of generated recombination solutions in this stage.

The chemical graph C in Ib2, Ib3, and Id admits a large number of chemical isomers C* in some cases, where a lower bound C-LB on the number of chemical isomers is derived without generating all of them. For the other instances, the running time for generating up to 100 target chemical graphs in this stage is less than 0.03 s. For some chemical graphs C, the number of chemical isomers found by our algorithm was small. This is because some of the acyclic chemical graphs in the decomposition of C have no alternative acyclic chemical graph other than the original one.

4.4.3 Generating neighbor solutions

Let C be a chemical graph obtained by solving the MILP M(f, η, σ) for the inverse problem. We executed a stage of generating neighbor solutions of C.

We select an MILP for the inverse problem with a prediction function η such that a solution C of the MILP admits only two isomers C* in the stage of generating recombination solutions; that is, instance Ic for property Bp with Λ7 and instances Ia, Ib4 and Ic for property Dc with Λ7.

In this experiment, we add to the MILP M(f, η, σ) an additional set Θ of two linear constraints on linear and quadratic descriptors as follows. For the two constraints, we use the prediction functions ηπ constructed by R-MLR for properties π ∈ {Lp, Sl} with Λ8 in Table 3.

Let Dπ* denote the set of descriptors selected in the construction of the prediction function for properties π ∈ {Bp, Dc} with Λ7 and π ∈ {Lp, Sl} with Λ8 in Table 3, and let Dπunion, π ∈ {Bp, Dc}, denote the union Dπ* ∪ DLp* ∪ DSl*.

We regard each of ηLp and ηSl as a function from ℝ^|Dπunion| to ℝ for π ∈ {Bp, Dc}. We set pdim := 2 and let Θ consist of two linear constraints, θ1 on ηLp and θ2 on ηSl. We set δ := 0.1 or 0.05, which defines a two-dimensional grid space in which C is mapped to the origin (see Azam et al. (2021a) for the details on the neighbors). In these experiments, we check the feasibility of 48 neighbors of the origin C in the grid in increasing order of distance. The time limit of the solver is set to be 300 s. We do not check the feasibility of a neighbor z and ignore it if there exists a neighbor z′ for which the MILP formulation is infeasible and z′ is closer to C than z. The results of these experiments are listed in Table 7, where

- (inst., π): specification I and property π;

- n: the number of atoms in the hydrogen-suppressed test instance;

- δ: the size of a sub-region in the grid;

- #sol: the number of new graphs inferred from 48 neighbors;

- #infs: the number of infeasible neighbors;

- #ign: the number of ignored neighbors;

- #TO: the number of neighbors for which the running time of the solver exceeds the time limit.


Table 7. Results of generating neighbor solutions of C.

The branch-and-bound method for solving an MILP sometimes takes an extremely long execution time even for instances of the same size. We therefore introduce a time limit to bound the entire running time and skip such instances when testing the feasibility of the neighbors in N0. From Table 7, we observe that some number of neighbor solutions of the solution C to the MILP M(f, η, σ) could be generated for each of the four instances.

5 Conclusion

In the framework of inferring chemical graphs, the descriptors of a prediction function were mainly defined to be the frequencies of local graph structures in the two-layered model, and such a definition was important for deriving a compact MILP formulation for inferring a desired chemical graph. To improve the performance of prediction functions in the framework, this article introduced the product of two of these descriptors as a new descriptor and examined the effectiveness of the new set of descriptors. For this, we designed a method for reducing the size of a descriptor set without losing learning performance in constructing prediction functions and gave a compact formulation for computing the product of two values in an MILP. From the results of our computational experiments, we observe that a prediction function constructed by our new method performs considerably better than the prediction functions constructed by the previous methods for several chemical properties. Furthermore, the modified MILP, with the computation of quadratic descriptors, was able to infer desired chemical graphs with up to 50 non-hydrogen atoms.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Author contributions

JZ: data curation, funding acquisition, software, validation, and writing–review and editing. NA: data curation, formal analysis, software, validation, and writing–review and editing. SC: software, validation, and writing–review and editing. RI: software, validation, and writing–review and editing. KH: data curation, software, and writing–review and editing. LZ: data curation, software, and writing–review and editing. HN: conceptualization, data curation, formal analysis, methodology, project administration, and writing–original draft. TA: conceptualization, data curation, funding acquisition, methodology, and writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The publication cost of this article is funded by the Japan Society for the Promotion of Science, Japan, under grant numbers 22H00532, 22K19830, and 22KJ1979. The funding agency has no role in the conceptualization, design, data collection, analysis, decision to publish, or preparation of the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Akutsu, T., Fukagawa, D., Jansson, J., and Sadakane, K. (2012). Inferring a graph from path frequency. Discrete Appl. Math. 160, 1416–1428. doi:10.1016/j.dam.2012.02.002


Azam, N. A., Zhu, J., Haraguchi, K., Zhao, L., Nagamochi, H., and Akutsu, T. (2021a). “Molecular design based on artificial neural networks, integer programming and grid neighbor search,” in 2021 IEEE international Conference on Bioinformatics and biomedicine (BIBM) (IEEE), 360–363.


Azam, N. A., Zhu, J., Sun, Y., Shi, Y., Shurbevski, A., Zhao, L., et al. (2021b). A novel method for inference of acyclic chemical compounds with bounded branch-height based on artificial neural networks and integer programming. Algorithms Mol. Biol. 16, 18. doi:10.1186/s13015-021-00197-2


Bicerano, J. (2002). Prediction of polymer properties. Boca Raton: cRc Press.


Bohacek, R. S., McMartin, C., and Guida, W. C. (1996). The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50. doi:10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6


Catacutan, D. B., Alexander, J., Arnold, A., and Stokes, J. M. (2024). Machine learning in preclinical drug discovery. Nat. Chem. Biol. 20, 960–973.


Cherkasov, A., Muratov, E. N., Fourches, D., Varnek, A., Baskin, I. I., Cronin, M., et al. (2014). Qsar modeling: where have you been? where are you going to? J. Med. Chem. 57, 4977–5010. doi:10.1021/jm4004285


De Cao, N., and Kipf, T. (2018). Molgan: an implicit generative model for small molecular graphs. arXiv Prepr. arXiv:1805.


Draper, N. R., and Smith, H. (1998). Applied regression analysis, 326. John Wiley and Sons.


Duchowicz, P., Castro, E. A., and Toropov, A. A. (2002). Improved qspr analysis of standard entropy of acyclic and aromatic compounds using optimized correlation weights of linear graph invariants. Comput. and Chem. 26, 327–332. doi:10.1016/s0097-8485(01)00121-8


Gartner III, T. E., Ferguson, A. L., and Debenedetti, P. G. (2024). Data-driven molecular design and simulation in modern chemical engineering. Nat. Chem. Eng. 1, 6–9.


Ghasemi, F., Mehridehnavi, A., Pérez-Garrido, A., and Pérez-Sánchez, H. (2018). Neural network and deep-learning algorithms used in qsar studies: merits and drawbacks. Drug Discov. today 23, 1784–1790. doi:10.1016/j.drudis.2018.06.016


Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., et al. (2018). Automatic chemical design using a data-driven continuous representation of molecules. ACS central Sci. 4, 268–276. doi:10.1021/acscentsci.7b00572


Goussard, V., Duprat, F., Gerbaud, V., Ploix, J.-L., Dreyfus, G., Nardello-Rataj, V., et al. (2017). Predicting the surface tension of liquids: comparison of four modeling approaches and application to cosmetic oils. J. Chem. Inf. Model. 57, 2986–2995. doi:10.1021/acs.jcim.7b00512


Goussard, V., Duprat, F., Ploix, J.-L., Dreyfus, G., Nardello-Rataj, V., and Aubry, J.-M. (2020). A new machine-learning tool for fast estimation of liquid viscosity. application to cosmetic oils. J. Chem. Inf. Model. 60, 2012–2023. doi:10.1021/acs.jcim.0c00083


Ido, R., Cao, S., Zhu, J., Azam, N. A., Haraguchi, K., Zhao, L., et al. (2021). A method for inferring polymers based on linear regression and integer programming. arXiv preprint arXiv:2109.02628.


Ikebata, H., Hongo, K., Isomura, T., Maezono, R., and Yoshida, R. (2017). Bayesian molecular design with a chemical language model. J. computer-aided Mol. Des. 31, 379–391. doi:10.1007/s10822-016-0008-z


Jalali-Heravi, M., and Fatemi, M. H. (2001). Artificial neural network modeling of kovats retention indices for noncyclic and monocyclic terpenes. J. Chromatogr. A 915, 177–183. doi:10.1016/s0021-9673(00)01274-7


Kim, J., Park, S., Min, D., and Kim, W. (2021). Comprehensive survey of recent drug discovery using deep learning. Int. J. Mol. Sci. 22, 9983. doi:10.3390/ijms22189983


Kipf, T. N., and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv Prepr. arXiv:1609.02907.


Kusner, M. J., Paige, B., and Hernández-Lobato, J. M. (2017). “Grammar variational autoencoder,” in International conference on machine learning (Sydney, Australia: PMLR), 1945–1954.


Lo, Y.-C., Rensi, S. E., Torng, W., and Altman, R. B. (2018). Machine learning in chemoinformatics and drug discovery. Drug Discov. today 23, 1538–1546. doi:10.1016/j.drudis.2018.05.010


Madhawa, K., Ishiguro, K., Nakago, K., and Abe, M. (2019). Graphnvp: an invertible flow model for generating molecular graphs. arXiv preprint arXiv:1905.11600.


Miyao, T., and Funatsu, K. (2024). Data-driven molecular structure generation for inverse qspr/qsar problem. In Drug Development Supported by Informatics (Springer). 47–59.


Miyao, T., Kaneko, H., and Funatsu, K. (2016). Inverse qspr/qsar analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 56, 286–299. doi:10.1021/acs.jcim.5b00628


MoleculeNet (2022). Available at: https://moleculenet.org/datasets-1.


Naef, R. (2019). Calculation of the isobaric heat capacities of the liquid and solid phase of organic compounds at and around 298.15 k based on their “true” molecular volume. Molecules 24, 1626. doi:10.3390/molecules24081626


Ponce, Y. M. (2003). Total and local quadratic indices of the molecular pseudograph’s atom adjacency matrix: applications to the prediction of physical properties of organic compounds. Molecules 8, 687–726. doi:10.3390/80900687


PubChem (2022). Annotations from hsdb. Available at: https://pubchem.ncbi.nlm.nih.gov/.


Roy, K., and Saha, A. (2003). Comparative qspr studies with molecular connectivity, molecular negentropy and tau indices: Part i: molecular thermochemical properties of diverse functional acyclic compounds. J. Mol. Model. 9, 259–270. doi:10.1007/s00894-003-0135-z


Rupakheti, C., Virshup, A., Yang, W., and Beratan, D. N. (2015). Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 55, 529–537. doi:10.1021/ci500749q


Segler, M. H., Kogej, T., Tyrchan, C., and Waller, M. P. (2018). Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS central Sci. 4, 120–131. doi:10.1021/acscentsci.7b00512


Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. (2020). Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv Prepr. arXiv:2001.09382.


Shi, Y., Zhu, J., Azam, N. A., Haraguchi, K., Zhao, L., Nagamochi, H., et al. (2021). An inverse qsar method based on a two-layered model and integer programming. Int. J. Mol. Sci. 22, 2847. doi:10.3390/ijms22062847


Tetko, I. V., and Engkvist, O. (2020). From big data to artificial intelligence: chemoinformatics meets new challenges. J. Cheminformatics 12, 1–3. doi:10.1186/s13321-020-00475-y


Xiao, N. (2017). Lipophilicity dataset - logD7.4 of 1,130 compounds.


Xiong, J., Xiong, Z., Chen, K., Jiang, H., and Zheng, M. (2021). Graph neural networks for automated de novo drug design. Drug Discov. today 26, 1382–1393. doi:10.1016/j.drudis.2021.02.011


Yang, X., Zhang, J., Yoshizoe, K., Terayama, K., and Tsuda, K. (2017). Chemts: an efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 18, 972–976. doi:10.1080/14686996.2017.1401424


Yuan, S., Jiao, Z., Quddus, N., Kwon, J. S.-I., and Mashuga, C. V. (2019). Developing quantitative structure–property relationship models to predict the upper flammability limit using machine learning. Industrial and Eng. Chem. Res. 58, 3531–3537. doi:10.1021/acs.iecr.8b05938


Zhu, J., Azam, N. A., Cao, S., Ido, R., Haraguchi, K., Zhao, L., et al. (2022a). Molecular design based on integer programming and quadratic descriptors in a two-layered model. arXiv Prepr. arXiv:2209.13527.


Zhu, J., Azam, N. A., Haraguchi, K., Zhao, L., Nagamochi, H., and Akutsu, T. (2022b). “A method for molecular design based on linear regression and integer programming,” in Proceedings of the 2022 12th international conference on bioscience, biochemistry and bioinformatics, 21–28.


Zhu, J., Haraguchi, K., Nagamochi, H., and Akutsu, T. (2022c). Adjustive linear regression and its application to the inverse qsar. BIOINFORMATICS, 144–151. doi:10.5220/0010853700003123


Glossary

Alpha isotropic polarizability

ALR adjustive linear regression

AmD experimental amorphous density

ANN artificial neural network

BHL biological half life

Bp boiling point

ChaR characteristic ratio

Cp critical pressure

Ct critical temperature

Cv heat capacity at 298.15 K

Dc dissociation constants

DieC dielectric constant

DisF dissipation factor

EDPA electron density on the most positive atom

ET entropy

FlmLO flammable limits lower of organics

Fp flash point in closed cup

Gap the energy difference between HOMO and LUMO

Ha heat of atomization

Hc heat of combustion

HcL heat capacity in liquid

HcS heat capacity in solid

Hf heat of formation

HOMO energy of highest-occupied molecular orbital

Hv heat of vaporization

Kov Kovats retention index

Kow octanol/water partition coefficient

LLR lasso linear regression

Lp lipophilicity

LUMO energy of lowest-unoccupied molecular orbital

MILP mixed-integer linear programming

MLR multidimensional linear regression

MlV mol volume

Mp melting point

Mr molar refraction at 20°

mu electric dipole moment

OptR optical rotation

Prm permittivity

QSAR quantitative structure–activity relationship

RfIdP refractive index of polymers

RfIdT refractive index of trees

SfT surface tension

Sl solubility

Tg glass transition

U0 internal energy at 0 K

Vd vapor density

Vis viscosity

Vp vapor pressure

Keywords: machine learning, integer programming, chemo-informatics, materials informatics, QSAR/QSPR, molecular design

Citation: Zhu J, Azam NA, Cao S, Ido R, Haraguchi K, Zhao L, Nagamochi H and Akutsu T (2025) Quadratic descriptors and reduction methods in a two-layered model for compound inference. Front. Genet. 15:1483490. doi: 10.3389/fgene.2024.1483490

Received: 20 August 2024; Accepted: 30 December 2024;
Published: 29 January 2025.

Edited by:

Shoba Ranganathan, Macquarie University, Australia

Reviewed by:

Yuansheng Liu, Hunan University, China
Tsung Fei Khang, University of Malaya, Malaysia

Copyright © 2025 Zhu, Azam, Cao, Ido, Haraguchi, Zhao, Nagamochi and Akutsu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Naveed Ahmed Azam, naveedazam@qau.edu.pk

