Skip to main content

ORIGINAL RESEARCH article

Front. Cell Dev. Biol., 16 December 2021
Sec. Molecular and Cellular Pathology
This article is part of the Research Topic Omics Data Integration towards Mining of Phenotype Specific Biomarkers in Cancer, Volume II View all 65 articles

Gene-Based Testing of Interactions Using XGBoost in Genome-Wide Association Studies

  • 1Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  • 2School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China
  • 3Department of Mathematics, University of Wisconsin-Madison, Madison, WI, United States
  • 4Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan, China
  • 5School of Life Science, Shanxi University, Taiyuan, China
  • 6Beidahuang Industry Group General Hospital, Harbin, China

Among the myriad of statistical methods that identify gene–gene interactions in the realm of qualitative genome-wide association studies, gene-based interactions are not only powerful statistically, but also they are interpretable biologically. However, they have limited statistical detection by making assumptions on the association between traits and single nucleotide polymorphisms. Thus, a gene-based method (GGInt-XGBoost) originated from XGBoost is proposed in this article. Assuming that log odds ratio of disease traits satisfies the additive relationship if the pair of genes had no interactions, the difference in error between the XGBoost model with and without additive constraint could indicate gene–gene interaction; we then used a permutation-based statistical test to assess this difference and to provide a statistical p-value to represent the significance of the interaction. Experimental results on both simulation and real data showed that our approach had superior performance than previous experiments to detect gene–gene interactions.

1 Introduction

Genome-wide association study (GWAS) is a collection of successful methods for identifying genetic loci associated with complex traits. More than 71,000 specific single nucleotide polymorphisms (SNPs) associated with diseases or traits have been identified (Hindorff et al., 2009; Yang et al., 2015; Liu et al., 2018a; Guo et al., 2018; Buniello et al., 2019; Loos, 2020; Lyu et al., 2020; Hu et al., 2021). Previous GWAS schemes relied mainly on a single locus model that verified the independent association of individual markers to particular phenotypes. Despite the successful recognition of many regions of disease susceptibility, most SNPs captured by this kind of method may have a small effect size that does not explain the heritability of complex traits fully. It is believed that genetic interactions that are engaged significantly in the genetic basis of complex traits and diseases (Cordell, 2009; Moore et al., 2010; Liu et al., 2018b; He et al., 2020; Luo et al., 2020; Shao and Liu, 2021) may be a potential solution to the problem of “missing heritability” (Manolio et al., 2009; Fang et al., 2019; Young, 2019). The solution may be partial, but it could enlighten the construction of new topologies for gene pathways.

Genetic interaction was first studied at the SNP level, and SNP–SNP interactions (i.e., epistasis) were detected by applying several methods (Li et al., 2015a; Ritchie and Van Steen, 2018), such as statistics based on entropy (Dong et al., 2008), logistic regression (Lin et al., 2016), and odds ratio (Emily, 2012); other techniques include multifactor dimensionality reduction (MDR) (Ritchie et al., 2003), BOOST (Wan et al., 2010), RRIntCC (Zhang et al., 2019), GenEpi (Chang et al., 2020), and some accelerate method (Nobre et al., 2021). One of the general challenges encountered by these SNP-based approaches is the statistical weakness of the higher-order or pairwise tests that result from massive multiple testing corrections over all the groups or pairs of SNPs. Instead, we investigated every possible SNP from two genes in single, gene-based interaction detection.

The success of gene-based approaches in marginal association studies of GWAS could extend to the analysis of gene–gene interactions (GGIs) (Emily, 2018; Emily et al., 2020). This approach has several potential advantages. First, it typically has far fewer genes than SNPs, reducing the number of pairwise tests drastically. For example, 2×108 tests are required to detect genetic interactions in pairs of 20,000 genes. However, over 5×1012 tests are required for 3 million SNPs in a marker-based interaction. Second, because a gene contains more information than a single SNP and genes interact diversely, gene-based methods are more powerful statistically, which applies to gene-based studies on the main effects as well (Liu et al., 2010; Li et al., 2011; Jiang et al., 2017; Wang et al., 2020; Wang et al., 2021). Additionally, biological prior knowledge (e.g., information about the known association of genes within protein–protein interactions (PPIs) or pathways) can be introduced easily. Finally, gene-based results are characterized by having better interpretability and important biological consequences.

Peng et al. (Peng et al., 2010) discovered a canonical correlation of a pair of genes in a case group and in a control group by applying a canonical correlation analysis–based U statistic (CCU), which measured the difference in the correlation of the gene pair. The difference then indicated the incidence of a GGI. In the analysis, however, only linear relationships were taken into consideration. Afterward, CCU was extended to kernelized CCU (KCCU) (Yuan et al., 2012; Larson et al., 2013), where a non-linear relationship was detected under the kernel. Recently, Emily (Emily, 2016) presented a method called AGGrGATOr that combined p-values interaction tests at the marker level to gauge how a pair of genes interacted, which was a strategy used by Ma et al. (Ma et al., 2013) earlier to detect interactions under quantitative phenotypes. Li et al. (Li et al., 2015b) proposed an entropy-based and nonparametric method called GBIGM.

At present, the new approach GGInt-XGBoost is proposed for identifying gene–gene interactions of complex phenotypes at the gene level in case-control studies by leveraging the eXtreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016), which is applied in co-expressed gene detection and to explore genetic associations in the field of bioinformatics (Jiang et al., 2013; Babajide Mustapha and Saeed, 2016; Liu et al., 2016; Liu and Jiang, 2016; Mrozek et al., 2016; Wei et al., 2017a; Wei et al., 2017b; Liu et al., 2017; Chen et al., 2018; Wei et al., 2018; Jiang et al., 2019; Liu et al., 2019; Yu et al., 2020a; Yu et al., 2020b; Lv et al., 2020; Li et al., 2021; Liu et al., 2021). A built-in mechanism of XGBoost is that one can impose constraints on the trained model to make it additive, which we assume characterizes the lack of interaction between two genes. Our method exhibited an outstanding performance for detecting the underlying gene–gene interactions at the gene level under various settings based on the experiments using a semi-empirical dataset. Its application using real datasets showed accurate identification of gene–gene interactions.

2 Materials and Methods

In this section, we detailed the statistical workflow for GGInt-XGBoost. To evaluate the power to detect GGIs and type-I error, we present the different parameter settings for simulation studies based on empirical data. Then, we adopted a real-world rheumatoid arthritis dataset from the WTCCC (Wellcome Trust Case–Control Consortium) database to assess the performance of our method under a real situation.

2.1 GGInt-XGBoost

2.1.1 Preliminaries and Notation

Here, we take genes, a couple of SNPs, as the basic unit. Suppose that we have n random samples:

(G1,i,G2,i)p+q,i=1,2,,n,(1)

where

G1,i=(g1,i,1,g1,i,2,,g1,i,p),G2,i=(g2,i,1,g2,i,2,,g2,i,q), i=1,2,,n

and G1 and G2 represent two genes each with p and q SNPs, independently. In the case–control studies, yi{0,1} is a categorical label, where 0 is a control subject and 1 is a case subject. gk,i,j{0,1,2} represents the copy number of the minor alleles of SNP j in the gene k for the sample i.

In this work, we created a statistic based on the XGBoost to quantify GGI intensity in order to see if there is a statistical interaction between two genes in a qualitative phenotype. To estimate the distribution of the statistic, we used a permutation resampling strategy. Our method was based on the assumption that if there was no interaction between two genes, adding a constraint to limit interactions between SNPs to only occurring in the same gene would not have a significant negative impact on XGBoost performance. The XGBoost’s build-in mechanism for adding interaction constraints enables us to generate an additive model and use prior knowledge about the gene structure during model construction.

2.1.2 Definition of Total Additivity for Gene–Gene Interaction

We defined GGIs using the concept of additive models. Additive models were proposed by Friedman and Stuetzle (Friedman and Stuetzle, 1981) and further developed and popularized by Stone (Stone, 1985), Hastie, and Tibshirani (Hastie and Tibshirani, 1990). Consider the regression problem where the feature lies in d and the objective function has a real value. Let s1,,sl be a disjoint partition of the index set  {1,,d} and denotes the elements of si to be ji1,,jidi and πi(x)=(xji1,..,xjidi)di. Now, real-valued function F on d is said to be additive for partition {s1,,sl} if there exists Fi: di such that

F(x)= i=1lFi(πi(x)).(2)

In our setting, the samples are elements in {0,1,2}p+qp+q, and we let s1={1,2,,p} and s2={1,2,,q}. We defined the absence of interaction between the two genes as the log odds ratio being additive with respect to the partition {s1,s2}. In other words, our null hypothesis is

H0:  F1,F2 such that  P(y=1|G1,G2)= exp(F1(G1)+F2(G2))1+ exp(F1(G1)+F2(G2)).(3)

2.1.3 eXtreme Gradient Boosting (XGBoost)

eXtreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016) is a scalable machine-learning system for tree boosting, which researchers apply to bioactive molecular prediction (Babajide Mustapha and Saeed, 2016), protein submitochondrial localization prediction (Yu et al., 2020a), miRNA-disease association prediction (Chen et al., 2018), and in the bioinformatics field (Shao et al., 2021).

For a given dataset with n samples, D={(xi,yi)}, xim,yi, and the XGBoost objective function is defined as:

obj(θ)=inl(yi,yi^)+t=1TΩ(ft),(4)

where l is the loss function and Ω is the regularizer on the regression tree ft, θ=(f1,,fT), and yi^=tft(xi) . The tth tree ft was obtained iteratively by gradient boosting, that is,

ftargmini(y^(t1)l(yi,y^(t1))ft(xi)+12y^(t1)2l(yi,y^(t1))ft 2(xi))+Ω(ft).(5)

In our setting,

l(yi,yi^)={log(exp(yi^)1+exp(yi^)) y=1log(11+exp(yi^))y=0     (6)

When running XGBoost, an essential step is to optimize its general parameters, booster parameters, and learning parameters.

2.1.4 XGBoost With the Additive Constraint

The base learner ft used in XGBoost is a regression tree, and we considered features that appear in a path on the tree that starts at the root and ends at one of the leaves as features that interact with one another. XGBoost allows specification of feature interaction constraints in the form of lists of features where only the features in the same list are allowed to interact with one another. It is evident that when the lists are disjointed, and if every feature is included in one of the lists, the feature interaction constraint is equivalent to forcing each ft to include only features in a single list, which implies that the regression model ifi must be additive concerning the partition specified by the lists. With the constraint [[0,1] (Hindorff et al., 2009; Liu et al., 2018a; Loos, 2020),], for example, the tree in Figure 1A violates the first constraint [0,1], thus so would not be in the boosting tree system, but the tree in Figure 1B complies with both the first and second constraints.

FIGURE 1
www.frontiersin.org

FIGURE 1. Illustration of trees with the additive constraint. With the constraint [[0,1] (Hindorff et al., 2009; Liu et al., 2018a; Loos, 2020),], (A) a tree violates the first constraint (0,1) that would not be in the boosting tree system; (B) a tree complies with both the first and second constraints.

2.1.5 Illustration of the GGInt-XGBoost Workflow

Assume there are n samples in a case–control study for a pair of genes such that G1 has  p SNPs and G2 has q SNPs. We can then apply XGBoost using the logistic regression loss function with and without constraints on additivity, to the dataset to estimate the performance in error using 10-fold cross-validation. We denote the error in the unconstrained model as errorig0 and in the constrained model as errcons0. The improvement of the performance of the unconstrained model over the constrained model is Δerr0=errcons0errorig0errorig0, which according to our assumption should be a statistic that characterizes the strength of interaction between these two genes. A positive Δerr0 indicated that the unconstrained model performed better, and a larger positive Δerr0 means there was a stronger interaction between the two genes.

To get a p-value, we needed to estimate the distribution of Δerr0 under the null hypothesis. Here, we used a non-parametric strategy based on permutation: we shuffled the label y randomly m times, calculated Δerr using the exact same aforementioned procedure, and used the resulting empirical distribution as an estimate for the distribution of Δerr0 under the null hypothesis. Let the result of these m permutations be Δerr1,,Δerrm, then an estimated p-value for the null hypothesis is

p=|{i:ΔerriΔerr0}|m.(7)

For XGBoost, if we have n samples, K trees, a maximum depth of d per tree, and ||s||  as the number of non-missing entries in the training data, the training time complexity is O(Kd||s||logn). Prediction for a new sample takes O(kd). We employed parallel programming to minimize the execution time of permutation resampling.

We summarized the process of GGInt-XGBoost in the algorithm below (Algorithm 1) and presented the overall workflow (Figure 2).

FIGURE 2
www.frontiersin.org

FIGURE 2. Illustration of the GGInt-XGBoost workflow for gene-based gene–gene interaction detection.

www.frontiersin.org

2.2 Simulation Study

To assess the performance of GGInt-XGBoost to control type I error and to detect GGIs, we compared GGInt-XGBoost with KCCA (Larson et al., 2013), GBIGM (Li et al., 2015b), and AGGrEGATOr (Emily, 2016).

2.2.1 Simulation With Haplotype Data

gs2.0 (Li and Chen, 2008) is a semi-empirical simulation data generator that employs haplotype data as input and produces high-density SNP genotype data for qualitative samples. The generated dataset shares the same local linkage disequilibrium (LD) structure as that of human populations. We selected HapMap3 (a resident of Utah, the United States with Northern and Western European ancestry from https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html) to mimic the actual LD structure of the human population. The Central European (CEU) dataset with 90 haplotypes was used as the template haplotype data. In this research, we randomly picked one pair of gene loci (i.e., GNPDA2 from chromosome 4 and FAIM2 from chromosome 12). GNPDA2 had a much stronger LD pattern than FAIM2 did, and they were not correlated (Figure 3). By employing the genipe module (Lemieux Perreault et al., 2016), an imputation pipeline on the genome-scale with PLINK, IMPUTE 2, and SHAPEIT, chromosomes 4 and 12 were imputed. After imputing, six SNPs were obtained from GNPDA2, and seven SNPs were obtained from FAIM2 (Supplementary Table S1).

FIGURE 3
www.frontiersin.org

FIGURE 3. Illustration of LD structures within genes GNPDA2 and FAIM2. The plots are generated by Haploview. r2 measures the LD strength of each pair of SNPs in each square, 0r21, where 0 indicates no LD and 1 indicates complete LD. The GNPDA2 has a much stronger LD pattern than that within FAIM2, and they are not correlated.

2.2.2 Disease Model

Here, we generated a disease model with two loci. A disease model represents the relationship between two loci that correspond to the disease. With various combinations of odds ratios (OR), sample sizes, and population prevalence, we generated different disease models. Using the jointly recessive–dominant model (RD model) as an example, for each locus let the genotype OR be (1+θ) and the population prevalence of the disease be p (Supplementary Table S2).

Pr(D|gi) indicates the probability of a sample being a case given the genotype combination of gi and named the penetrance of gi, and Pr(D¯|gi) denotes the probability of a sample being a control given the genotype combination of gi. Then, the odds of disease are:

ODDgi=Pr(D|gi)Pr(D¯|gi)=Pr(D|gi)1Pr(D|gi).(8)

The penetrance of genotype gi can be calculated using:

Pr(gi)=ODDgi1+ODDgi.(9)

The corresponding penetrance table is shown in Supplementary Table S3.

With a specific genotype OR (1+θ) and a population prevalence p, the baseline value γ represents the disease odds with the two loci that do not have the disease alleles, and it can be calculated by applying Eq. 10 with the terms from Supplementary Table S3.

p=Pr(D)=Pr(D|gi)×Pr(gi).(10)

We used six integrated disease models in gs2.0, which included a recessive–dominant model, a dominant–dominant model, an XOR model, a threshold model, a multiplicative model, and a recessive–recessive model. We generated different datasets by various parameter settings, and we compared the performances of KCCU, GBIGM, and AGGrEGATOr with our method.

Evaluation of Type-I Error: The type-I error indicates the ability of a method to reject the null hypothesis when it is true. In this study, the significance level α  was set to 0.05. The simulation used in the model is shown in Supplementary Table S4 and run 100 times with each sample size n{1k,2k,3k,4k,5k} with the odds ratio set at 1.

Evaluation of Power of the Test: The power of a test indicates the probability that the method can reject the null hypothesis correctly when the alternative hypothesis is true. In this study, we generated 100 datasets for each parameter set under six disease models (Supplementary Table S5). The power under each parameter setting was expressed by the frequency with which the null hypothesis of the dataset was rejected correctly at the significance level of α=0.05. To evaluate the influence of sample size, we chose n{1k,2k,3k,4k,5k} with a specific population prevalence p=0.01 and OR=2. To assess the impact of the OR, we considered varying OR{1.5,2,2.5,3,3.5,4} given a sample size of n=4000 (cases and controls were both 2000, balanced dataset) and p=0.01.

For GGInt-XGBoost, KCCU, AGGrEGATOr, and GBIGM, if the number of datasets with a significance level less than α  is m1, then the power can be calculated by the following formula:

power= m1100.(11)

GBIGM and AGGrEGATOr methods are nonparametric methods so no parameters need to be specific. We only set the ratio of the trimmed jackknife to 0.05 (ω=0.05) for KCCU. For GGInt-XGBoost, we set the number of trees to 1,000 (num_round = 1,000), the maximum depth of trees to 3 (max_depth = 3), the type of objective to “binary:logistic” (objective = “binary:logistic”), the learning rate to 0.01 (eta = 0.01), and the evaluate metric to error (eval_metric = “error”). We recommend that when dealing with real-data analysis, the depth of the trees is not to be set too deep in order to avoid overfitting. For a dataset with thousands of samples, a maximum depth of 2–4 is usually sufficient.

2.3 Experiments Using Rheumatoid Arthritis Data

To evaluate GGInt-XGBoost’s ability to process real GGIs in a qualitative dataset, we analyzed the susceptibility of a series of pairs of genes in rheumatoid arthritis (RA), a chronic systemic disease with inflammatory synovitis with unknown etiology. It causes progressive bone destruction and affects bone remodeling. In this article, we chose the WTCCC (2007) dataset, which contained British population genotype data generated by the Affymetrix GeneChip 500 k. We preprocessed our dataset in the following ways:

i. To verify the GGIs in the RA, we selected the pathway hsa05323 from the KEGG pathway database. The genotyping coordinates of the WTCCC dataset can be found in UCSC hg18/NCBI Build36. There were 90 genes in this pathway. Among them, many genes belonged to MHCII and V-ATPase, which are two protein combinations. Because many GGIs occurred by themselves, we only selected representative genes from each protein combination, and then we excluded other genes. Finally, 48 genes remained, which resulted in C482=1128 pairs of genes to be evaluated.

ii. The detailed gene information was obtained from the annotation file of NCBI Build36. For each gene, we added a 10 kb buffer region both downstream and upstream of the originally defined gene position. All SNPs within the region were selected for each gene.

iii. According to the quality control of GWAS, samples that included gender that did not match the chromosome X heterozygote rates were removed. SNPs were also excluded when they met any of the following conditions: the missing rate in the sample was 10%, the minor allele frequency (MAF) was 0.05, or the frequency of the control violated the Hardy–Weinberg equilibrium (p<0.0001). Finally, 385 SNPs remained with 4,966 samples that consisted of 2,993 control subjects and 1973 case subjects.

3 Results and Discussion

The experimental environment of the following results was a workstation with an Intel Xeon CPU E5-2,620 v2 at 2.10GHz, 96 GB of DDR3, Python3.6, and RStudio programming implementation.

3.1 Simulation Study

3.1.1 Evaluation of Type-I Error

For type-I error, by setting the significance level at α=0.05, we varied the sample size from 1,000 to 5,000. All the methods tested, except for GBIGM when n=1,000, had a type-I error comparable to the significance level (Table 1), which implied that these methods well controlled type-I error for various sample sizes.

TABLE 1
www.frontiersin.org

TABLE 1. Type-I error for KCCU, GBIGM, AGGrEGATOr, and GGInt-XGBoost when varying the sample size.

3.1.2 Evaluation of the Power of the Test
Impact of Odds Ratio

We investigated the performance of the various methods in detecting GGIs under the six disease models, with a population prevalence of 0.01, the sample size of 4,000, and odds ratios that varied from 1.5 to 4 (Figure 4). For this experiment, a single pair of SNPs that belonged to different genes was chosen randomly for the disease models in the generation of the simulated dataset, and the genes that contained these two SNPs were considered to be interacting. A larger OR resulted in better performance for all methods, and, when OR=4, some methods had a power that approached 1 (Figure 4). Our method was the best among all methods tested except for when OR=1.5, which might be because the base learner of XGBoost was a regression tree that might be prone to overfitting. It would be difficult to distinguish the signal from noise when the interaction strength was too low.

FIGURE 4
www.frontiersin.org

FIGURE 4. Statistical power of simulation studies for KCCU (blue), GBIGM (yellow), AGGrEGATOr (green), and GGInt-XGBoost (red) under six disease models with OR{1.5,2,2.5,3,3.5,4}

It is worth noting that in the recessive–recessive model (RR model) (Supplementary Table S5A), the detection power was consistently 20% as the OR value changed gradually from 1.5 to 4. AGGrEGATOr and GGInt-XGBoost reached approximately 45%. According to the RR model penetrance table, the baseline γ was very small when the population prevalence was p=0.01. Therefore, of the nine genotype combinations, eight of them tended to be zero. The only causal genotype (aabb) contained two minor alleles. Typically, the MAF of a SNP ranged from 0.2 to 0.4, and few genotypes (aabb) appeared in the simulation dataset. Consequently, it was difficult to detect the GGI under the disease phenotype. This was the main reason for the poor performance of these methods under the RR model.

Impact of Sample Size

We also investigated the influence of the sample size. Let the sample size be n{1k,2k,3k,4k,5k}, p=0.01, and OR=2 (Supplementary Figure S1). As the sample size increased, the detection power of all methods increased monotonically under all disease models, except for the RR model. In all methods, a larger sample size corresponded to better performance.

In conclusion, GGInt-XGBoost performed better in simulation studies than the other methods tested in almost every setting. The reason was probably that our method, by making use of constrained and unconstrained XGBoost models, made weak assumptions on the kind of interaction because any deviation from the additivity in the prediction of log odds ratios indicated an underlying GGI, which resulted in better statistical power. Furthermore, our method was more robust with respect to the LD pattern among the SNPs within each gene because the additivity constraint did not destroy the LD structure within each gene.

3.2 Experiments Using Rheumatoid Arthritis Data

Rheumatoid arthritis (RA) is an autoimmune disease with symptoms that typically include pannus formation in the synovial joints and destruction of cartilage and bones. The genes IL-17, IL-6, TNFα, RANK, and MMP3 are related to the development of RA (Majithia and Geraci, 2007). There were 48 genes in our dataset chosen from the RA pathway hsa05323, which resulted in 1,128 pairs of genes. We set significance level to α=0.01, and for our method, the number of permutations were set to m=1000. GBIGM and KCCU resulted in 134 and 159 pairs of detected interacting genes, respectively. A total number of 65 of those pairs that were detected by GBIGM and 30 of the pairs that KCCU detected had a p-value of 0. AGGrEGATOr detected 17 pairs of interacting genes, and GGInt-XGBoost detected 58 pairs of interacting genes.

Because there were too many detected interacting gene pairs in KCCU and GBIGM with a p-value = 0, we could not analyze all of them in detail, so we focused on the 10 most significant gene pairs detected by AGGrEGATOr and, by our method, GGInt-XGBoost (Table 2). After a literature search, we found 7 of the 10 most significant gene pairs under GGInt-XGBoost and 3 of the 10 most significant gene pairs under AGGrEGATOr were supported by prior research. There was also a greater correlation between the results of GGInt-XGBoost and KCCU or GBIGM than the correlation between AGGrEGATOr and GBIGM or KCCU.

TABLE 2
www.frontiersin.org

TABLE 2. Calculated p-value for the 20 gene pairs using all four different methods. p-values in bold font indicate that they are significant. The ``Ref'' column indicates that the pair can be found as direct interaction in our literature search.

Furthermore, when using GGInt-XGBoost, after the detection of interacting gene pairs, one can also use the ensemble tree mechanism of XGBoost to investigate marker-based interactions further; this is because the regression tree, which is the base learner used in XGBoost, is a powerful tool for the discovery of interactions among features. For a regression tree model, one considers features that appear in the same traversal path from the root to leaf to be interacting. As an example, the gene pairs IL-8 and Ang-1 were found to interact using our method. Pawel et al.(Kabala et al., 2020) reported that Ang-1 induced the production of IL-8 in synovial tissue explants of RA patients. In the first tree in the unconstrained XGBoost model, it was clear that one SNP from the gene IL-8 on chromosome 4 interacted with rs121937926 in the gene Ang-1 (Supplementary Figure S2). The interaction form was flexible because our method imposed no functional form.

We explored the structure of the unconstrained XGBoost model further with R package EIX (Karbowiak and Biecek, 2021), which produced an interaction plot (Figure 5). For the convenience of display, all the SNPs in IL-8 were named “G1_X”, and all SNPs in ANG-1 were named “G2_X”, where “X” was the index. We chose the sumGain as a measure for the interaction strength. The sumGain was the sum of the gain value in all nodes in which the given SNP occurred. The intensity of the sumGain was divided into four equal parts and represented by different colored squares in the legend. The interacting SNP pairs in Supplementary Figure S2 from IL-8 and Ang-1 exhibited median strength in Figure 5 (with red star), which demonstrated that it was possible to use the results of GGInt-XGBoost for a more fine-grained analysis of GGIs at the marker level. Also, our method was robust with respect to LD because the LD structure within each gene was still expressed in the tree model and did not directly impact the performance of our method (Figure 5). Table 3 gives the information of the top 10 interacted SNP pairs by sumGain and occurrence frequency in the ensemble boosting trees.

FIGURE 5
www.frontiersin.org

FIGURE 5. Plot that shows pairs of SNPs that lie in two nodes of a regression tree connected by an edge. The color indicates the sumGain measure for the SNP pairs. The pair with the red star indicates the interacting SNP pair from IL-8 and Ang-I.

TABLE 3
www.frontiersin.org

TABLE 3. sumGain measure of the 10 most significant interacting SNP pairs from IL-8 and Ang-I. “Frequency” is the number of occurrences of the SNP pair in the trained model.

4 Conclusion

Gene–gene interactions (GGIs) are important in the study of complex diseases and traits. In this article, we developed a gene-based GGI detection algorithm called GGInt-XGBoost. We treated the GGI detection problem as a measure of how much the log odds ratio of qualitative traits deviated from the additive structure. GGInt-XGBoost benefits from the attractive built-in mechanism of XGBoost that allows for an elegant expression of the additive structure by adding feature interaction constraints. Because of the weak assumptions of the interaction form and powerful and practical ability of XGBoost to fit nonlinear relationships, our method detected more types of interpretable GGIs accurately and effectively than other methods.

Combined with logistic regression, GGInt-XGBoost can be used for the GWAS of complex qualitative traits. To test its performance, we conducted a semi-empirical simulation study and a retrospective analysis of rheumatoid arthritis. For most of the settings tested, GGInt-XGBoost outperformed prior methods in statistical power for detecting GGIs. Also, because the base learner we used in XGBoost was the regression tree, GGIng-XGBoost can detect GGIs under quantitative traits. Furthermore, the base learner of XGBoost is a tree model that has a natural way of expressing marker-based interactions, which allows further investigations of interactions at the marker level after two genes are known to interact. For example, we looked for interactions between the genes IL-8 and Ang-1 and found that it was largely accounted for by the interaction between a single pair of SNPs from these two genes. Also, through the analysis of IL-8 and Ang-1, we found that GGInt-XGBoost was robust with respect to the LD structure within genes. The workflow designed for detection of GGIs did not damage the LD structure, and the assumption of the additive structure allowed marker-based interaction within genes. Last, GGInt-XGBoost might be improved further or generalized by incorporating ideas from causal inferences that would be applied more effectively to multi-gene settings and the study of gene pathways. In conclusion, GGInt-XGBoost is a helpful addition to the existing toolbox of statistical methods for studying gene–gene interaction in genome-wide association studies.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://www.wtccc.org.uk/info/access_to_data_samples.html

Author Contributions

YZ and LX contributed to conceptualization and project administration. YG contributed to conceptualization, methodology, investigation, funding acquisition, and writing-original draft. CW contributed to methodology, formal analysis, and writing-original draft. ZY contributed to software and formal analysis. YW (4th author) contributed to data curation and visualization. ZL contributed to resources and data curation. YW (6th author) contributed to formal analysis, writing-review, and editing. All authors contributed to the article and approved the submitted version.

Funding

The work was supported by the National Natural Science Foundation of China (No. 62002243, No. 31900306), and the Research Foundation of ShenZhen Polytechnic (6021310019K).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcell.2021.801113/full#supplementary-material

References

Babajide Mustapha, I., and Saeed, F. (2016). Bioactive Molecule Prediction Using Extreme Gradient Boosting. Molecules 21 (8). doi:10.3390/molecules21080983

CrossRef Full Text | Google Scholar

Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., et al. (2019). The NHGRI-EBI GWAS Catalog of Published Genome-wide Association Studies, Targeted Arrays and Summary Statistics 2019. Nucleic Acids Res. 47 (D1), D1005–D1012. doi:10.1093/nar/gky1120

PubMed Abstract | CrossRef Full Text | Google Scholar

Chang, Y.-C., Wu, J. T., Wu, J.-T., Hong, M.-Y., Tung, Y.-A., Hsieh, P.-H., et al. (2020). GenEpi: Gene-Based Epistasis Discovery Using Machine Learning. BMC Bioinformatics 21 (1), 68. doi:10.1186/s12859-020-3368-2

PubMed Abstract | CrossRef Full Text | Google Scholar

Chen, T., and Guestrin, C. (2016). “XGBoost : A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, CA; New York, NY: Association for Computing Machinery), 785–794.

Google Scholar

Chen, X., Huang, L., Xie, D., and Zhao, Q. (2018). EGBMMDA: Extreme Gradient Boosting Machine for MiRNA-Disease Association Prediction. Cell Death Dis 9 (1), 3. doi:10.1038/s41419-017-0003-x

PubMed Abstract | CrossRef Full Text | Google Scholar

Cordell, H. J. (2009). Detecting Gene-Gene Interactions that Underlie Human Diseases. Nat. Rev. Genet. 10 (6), 392–404. doi:10.1038/nrg2579

PubMed Abstract | CrossRef Full Text | Google Scholar

Dong, C., Chu, X., Wang, Y., Wang, Y., Jin, L., Shi, T., et al. (2008). Exploration of Gene-Gene Interaction Effects Using Entropy-Based Methods. Eur. J. Hum. Genet. 16 (2), 229–235. doi:10.1038/sj.ejhg.5201921

CrossRef Full Text | Google Scholar

Emily, M.: A Survey of Statistical Methods for Gene-Gene Interaction in Case-Control Genome-wide Association Studies. 2018.

Google Scholar

Emily, M. (2016). AGGrEGATOr: A Gene-Based GEne-Gene interActTiOn Test for Case-Control Association Studies. Stat. Appl. Genet. Mol. Biol. 15 (2), 151–171. doi:10.1515/sagmb-2015-0074

PubMed Abstract | CrossRef Full Text | Google Scholar

Emily, M., Sounac, N., Kroell, F., and Houée-Bigot, M. (2020). Gene-Based Methods to Detect Gene-Gene Interaction in R: The GeneGeneInteR Package. J. Stat. Softw. 95 (12). doi:10.18637/jss.v095.i12

CrossRef Full Text | Google Scholar

Emily, M. (2012). IndOR: a New Statistical Procedure to Test for SNP-SNP Epistasis in Genome-wide Association Studies. Statist. Med. 31 (21), 2359–2373. doi:10.1002/sim.5364

PubMed Abstract | CrossRef Full Text | Google Scholar

Fang, G., Wang, W., Paunic, V., Heydari, H., Costanzo, M., Liu, X., et al. (2019). Discovering Genetic Interactions Bridging Pathways in Genome-wide Association Studies. Nat. Commun. 10 (1), 4274. doi:10.1038/s41467-019-12131-7

PubMed Abstract | CrossRef Full Text | Google Scholar

Field, M. (1995). Colony-stimulating Factors. Clin. Immunother. 3 (4), 255–261. doi:10.1007/bf03259277

CrossRef Full Text | Google Scholar

Friedman, J. H., and Stuetzle, W. (1981). Projection Pursuit Regression. J. Am. Stat. Assoc. 76 (376), 817–823. doi:10.1080/01621459.1981.10477729

CrossRef Full Text | Google Scholar

Guo, F., Wang, D., and Wang, L. (2018). Progressive Approach for SNP Calling and Haplotype Assembly Using Single Molecular Sequencing Data. Bioinformatics 34 (12), 2012–2018. doi:10.1093/bioinformatics/bty059

PubMed Abstract | CrossRef Full Text | Google Scholar

Hastie, T., and Tibshirani, R. (1990). Generalized Additive Models. London, United Kingdom: Chapman & Hall.

Google Scholar

He, B., Lang, J., Wang, B., Liu, X., Lu, Q., He, J., et al. (2020). TOOme: A Novel Computational Framework to Infer Cancer Tissue-Of-Origin by Integrating Both Gene Mutation and Expression. Front. Bioeng. Biotechnol. 8, 394. doi:10.3389/fbioe.2020.00394

PubMed Abstract | CrossRef Full Text | Google Scholar

Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S., et al. (2009). Potential Etiologic and Functional Implications of Genome-wide Association Loci for Human Diseases and Traits. Proc. Natl. Acad. Sci. 106 (23), 9362–9367. doi:10.1073/pnas.0903103106

PubMed Abstract | CrossRef Full Text | Google Scholar

Hu, Y., Sun, J. Y., Zhang, Y., Zhang, H., Gao, S., Wang, T., et al. (2021). rs1990622 Variant Associates with Alzheimer's Disease and Regulates TMEM106B Expression in Human Brain Tissues. BMC Med. 19 (1), 11. doi:10.1186/s12916-020-01883-5

PubMed Abstract | CrossRef Full Text | Google Scholar

Huber, R., Kirsten, H., Näkki, A., Pohlers, D., Thude, H., Eidner, T., et al. (2019). Association of Human FOS Promoter Variants with the Occurrence of Knee-Osteoarthritis in a Case Control Association Study. Int. J. Mol. Sci. 20 (6). doi:10.3390/ijms20061382

PubMed Abstract | CrossRef Full Text | Google Scholar

Jiang, L., Wang, C., Tang, J., and Guo, F. (2019). LightCpG: a Multi-View CpG Sites Detection on Single-Cell Whole Genome Sequence Data. Bmc Genomics 20, 306. doi:10.1186/s12864-019-5654-9

PubMed Abstract | CrossRef Full Text | Google Scholar

Jiang, Q., Jin, S., Jiang, Y., Liao, M., Feng, R., Zhang, L., et al. (2017). Alzheimer's Disease Variants with the Genome-wide Significance Are Significantly Enriched in Immune Pathways and Active in Immune Cells. Mol. Neurobiol. 54 (1), 594–600. doi:10.1007/s12035-015-9670-8

PubMed Abstract | CrossRef Full Text | Google Scholar

Jiang, Q., Wang, G., Jin, S., Li, Y., and Wang, Y. (2013). Predicting Human microRNA-Disease Associations Based on Support Vector Machine. Ijdmb 8 (3), 282–293. doi:10.1504/ijdmb.2013.056078

PubMed Abstract | CrossRef Full Text | Google Scholar

Jordan, L. A., Erlandsson, M. C., Fenner, B. F., Davies, R., Harvey, A. K., Choy, E. H., et al. (2018). Inhibition of CCL3 Abrogated Precursor Cell Fusion and Bone Erosions in Human Osteoclast Cultures and Murine Collagen-Induced Arthritis. Rheumatology (Oxford) 57 (11), 2042–2052. doi:10.1093/rheumatology/key196

PubMed Abstract | CrossRef Full Text | Google Scholar

Kabala, P. A., Malvar-Fernández, B., Lopes, A. P., Carvalheiro, T., Hartgring, S. A. Y., Tang, M. W., et al. (2020). Promotion of Macrophage Activation by Tie2 in the Context of the Inflamed Synovia of Rheumatoid Arthritis and Psoriatic Arthritis Patients. Rheumatology (Oxford) 59 (2), 426–438. doi:10.1093/rheumatology/kez315

PubMed Abstract | CrossRef Full Text | Google Scholar

Karbowiak, E., and Biecek, P. (2021). “EIX: Explain Interactions in 'XGBoost',” in R Package Version 1.1. Available at: https://github.com/ModelOriented/EIX.

Google Scholar

Karlson, E. W., Chibnik, L. B., Cui, J., Plenge, R. M., Glass, R. J., Maher, N. E., et al. (2008). Associations between Human Leukocyte Antigen, PTPN22, CTLA4 Genotypes and Rheumatoid Arthritis Phenotypes of Autoantibody Status, Age at Diagnosis and Erosions in a Large Cohort Study. Ann. Rheum. Dis. 67 (3), 358–363. doi:10.1136/ard.2007.071662

PubMed Abstract | CrossRef Full Text | Google Scholar

Larson, N. B., Jenkins, G. D., Larson, M. C., Vierkant, R. A., Sellers, T. A., Phelan, C. M., et al. (2013). Kernel Canonical Correlation Analysis for Assessing Gene-Gene Interactions and Application to Ovarian Cancer. Eur. J. Hum. Genet. 22 (1), 126–131. doi:10.1038/ejhg.2013.69

CrossRef Full Text | Google Scholar

Lemieux Perreault, L. P., Legault, M. A., Asselin, G., and Dubé, M. P. (2016). Genipe: an Automated Genome-wide Imputation Pipeline with Automatic Reporting and Statistical Tools. Bioinformatics 32 (23), 3661–3663. doi:10.1093/bioinformatics/btw487

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, H.-L., Pang, Y.-H., and Liu, B. (2021). BioSeq-BLM: a Platform for Analyzing DNA, RNA and Protein Sequences Based on Biological Language Models. Nucleic Acids Res., gkab829. doi:10.1093/nar/gkab829

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, J., and Chen, Y. (2008). Generating Samples for Association Studies Based on HapMap Data. BMC bioinformatics 9 (1), 44–13. doi:10.1186/1471-2105-9-44

PubMed Abstract | CrossRef Full Text | Google Scholar

Li, J., Huang, D., Guo, M., Liu, X., Wang, C., Teng, Z., et al. (2015). A Gene-Based Information Gain Method for Detecting Gene-Gene Interactions in Case-Control Studies. Eur. J. Hum. Genet. 23 (11), 1566–1572. doi:10.1038/ejhg.2015.16

CrossRef Full Text | Google Scholar

Li, M.-X., Gui, H.-S., Kwan, J. S. H., and Sham, P. C. (2011). GATES: A Rapid and Powerful Gene-Based Association Test Using Extended Simes Procedure. Am. J. Hum. Genet. 88 (3), 283–293. doi:10.1016/j.ajhg.2011.01.019

CrossRef Full Text | Google Scholar

Li, P., Guo, M., Wang, C., Liu, X., and Zou, Q. (2015). An Overview of SNP Interactions in Genome-wide Association Studies. Brief. Funct. Genomics 14 (2), 143–155. doi:10.1093/bfgp/elu036

PubMed Abstract | CrossRef Full Text | Google Scholar

Lin, H., Mueller-Nurasyid, M., Smith, A. V., Arking, D. E., Barnard, J., Bartz, T. M., et al. (2016). Gene-gene Interaction Analyses for Atrial Fibrillation. Sci. Rep. 6 (1), 35371–35379. doi:10.1038/srep35371

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, G., Hu, Y., Han, Z., Jin, S., and Jiang, Q. (2019). Genetic Variant Rs17185536 Regulates SIM1 Gene Expression in Human Brain Hypothalamus. Proc. Natl. Acad. Sci. USA 116 (9), 3347–3348. doi:10.1073/pnas.1821550116

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, G., Hu, Y., Jin, S., and Jiang, Q. (2017). Genetic Variant Rs763361 Regulates Multiple Sclerosis CD226 Gene Expression. Proc. Natl. Acad. Sci. USA 114 (6), E906–E907. doi:10.1073/pnas.1618520114

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, G., Hu, Y., Jin, S., Zhang, F., Jiang, Q., and Hao, J. (2016). Cis-eQTLs Regulate Reduced LST1 Gene and NCR3 Gene Expression and Contribute to Increased Autoimmune Disease Risk. Proc. Natl. Acad. Sci. USA 113 (42), E6321–E6322. doi:10.1073/pnas.1614369113

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, G., and Jiang, Q. (2016). Alzheimer's Disease CD33 Rs3865444 Variant Does Not Contribute to Cognitive Performance. Proc. Natl. Acad. Sci. USA 113 (12), E1589–E1590. doi:10.1073/pnas.1600852113

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, G., Jin, S., Hu, Y., and Jiang, Q. (2018). Disease Status Affects the Association between Rs4813620 and the Expression of Alzheimer's Disease Susceptibility geneTRIB3. Proc. Natl. Acad. Sci. USA 115 (45), E10519–E10520. doi:10.1073/pnas.1812975115

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, G., Zhang, Y., Wang, L., Xu, J., Chen, X., Bao, Y., et al. (2018). Alzheimer's Disease Rs11767557 Variant Regulates EPHA1 Gene Expression Specifically in Human Whole Blood. Jad 61 (3), 1077–1088. doi:10.3233/jad-170468

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, H., Qiu, C., Wang, B., Bing, P., Tian, G., Zhang, X., et al. (2021). Evaluating DNA Methylation, Gene Expression, Somatic Mutation, and Their Combinations in Inferring Tumor Tissue-Of-Origin. Front. Cell Dev. Biol. 9, 619330. doi:10.3389/fcell.2021.619330

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, J. Z., McRae, A. F., Nyholt, D. R., Medland, S. E., Wray, N. R., Brown, K. M., et al. (2010). A Versatile Gene-Based Test for Genome-wide Association Studies. Am. J. Hum. Genet. 87 (1), 139–145. doi:10.1016/j.ajhg.2010.06.009

CrossRef Full Text | Google Scholar

Loos, R. J. F. (2020). 15 Years of Genome-wide Association Studies and No Signs of Slowing Down. Nat. Commun. 11 (1), 5900. doi:10.1038/s41467-020-19653-5

PubMed Abstract | CrossRef Full Text | Google Scholar

Luo, J., Meng, Y., Zhai, J., Zhu, Y., Li, Y., and Wu, Y. (2020). Screening of SLE-Susceptible SNPs in One Chinese Family with Systemic Lupus Erythematosus. Cbio 15 (7), 778–787. doi:10.2174/1574893615666200120105153

CrossRef Full Text | Google Scholar

Lv, Z., Wang, D., Ding, H., Zhong, B., and Xu, L. (2020). Escherichia Coli DNA N-4-Methycytosine Site Prediction Accuracy Improved by Light Gradient Boosting Machine Feature Selection Technology. IEEE Access 8, 14851–14859. doi:10.1109/access.2020.2966576

CrossRef Full Text | Google Scholar

Lyu, P., Hou, J., Yu, H., and Shi, H. (2020). High-density Genetic Linkage Map Construction in Sunflower (Helianthus Annuus L.) Using SNP and SSR Markers. Curr. Bioinformatics 15 (8), 889–897. doi:10.2174/1574893615666200324134725

CrossRef Full Text | Google Scholar

Ma, L., Clark, A. G., and Keinan, A. (2013). Gene-based Testing of Interactions in Association Studies of Quantitative Traits. Plos Genet. 9 (2), e1003321. doi:10.1371/journal.pgen.1003321

PubMed Abstract | CrossRef Full Text | Google Scholar

Majithia, V., and Geraci, S. A. (2007). Rheumatoid Arthritis: Diagnosis and Management. Am. J. Med. 120 (11), 936–939. doi:10.1016/j.amjmed.2007.04.005

CrossRef Full Text | Google Scholar

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., et al. (2009). Finding the Missing Heritability of Complex Diseases. Nature 461 (7265), 747–753. doi:10.1038/nature08494

PubMed Abstract | CrossRef Full Text | Google Scholar

Moore, J. H., Asselbergs, F. W., and Williams, S. M. (2010). Bioinformatics Challenges for Genome-wide Association Studies. Bioinformatics 26 (4), 445–455. doi:10.1093/bioinformatics/btp713

PubMed Abstract | CrossRef Full Text | Google Scholar

Mrozek, D., Daniłowicz, P., and Małysiak-Mrozek, B. (2016). HDInsight4PSi: Boosting Performance of 3D Protein Structure Similarity Searching with HDInsight Clusters in Microsoft Azure Cloud. Inf. Sci. 349-350, 77–101. doi:10.1016/j.ins.2016.02.029

CrossRef Full Text | Google Scholar

Narasimhan, R., Coras, R., Rosenthal, S. B., Sweeney, S. R., Lodi, A., Tiziani, S., et al. (2018). Serum Metabolomic Profiling Predicts Synovial Gene Expression in Rheumatoid Arthritis. Arthritis Res. Ther. 20 (1), 164. doi:10.1186/s13075-018-1655-3

PubMed Abstract | CrossRef Full Text | Google Scholar

Navarrete Santos, A., Kehlen, A., Schütte, W., Langner, J., and Riemann, D. (1998). Regulation by Transforming Growth Factor-Beta1 of Class II mRNA and Protein Expression in Fibroblast-like Synoviocytes from Patients with Rheumatoid Arthritis. Int. Immunol. 10 (5), 601–607. doi:10.1093/intimm/10.5.601

PubMed Abstract | CrossRef Full Text | Google Scholar

Nobre, R., Ilic, A., Santander-Jimenez, S., and Sousa, L. (2021). Retargeting Tensor Accelerators for Epistasis Detection. IEEE Trans. Parallel Distrib. Syst. 32 (9), 2160–2174. doi:10.1109/tpds.2021.3060322

CrossRef Full Text | Google Scholar

Peng, Q., Zhao, J., and Xue, F. (2010). A Gene-Based Method for Detecting Gene-Gene Co-association in a Case-Control Association Study. Eur. J. Hum. Genet. 18 (5), 582–587. doi:10.1038/ejhg.2009.223

CrossRef Full Text | Google Scholar

Ritchie, M. D., Hahn, L. W., and Moore, J. H. (2003). Power of Multifactor Dimensionality Reduction for Detecting Gene-Gene Interactions in the Presence of Genotyping Error, Missing Data, Phenocopy, and Genetic Heterogeneity. Genet. Epidemiol. 24 (2), 150–157. doi:10.1002/gepi.10218

PubMed Abstract | CrossRef Full Text | Google Scholar

Ritchie, M. D., and Van Steen, K. (2018). The Search for Gene-Gene Interactions in Genome-wide Association Studies: Challenges in Abundance of Methods, Practical Considerations, and Biological Interpretation. Ann. Transl. Med. 6 (8), 157. doi:10.21037/atm.2018.04.05

PubMed Abstract | CrossRef Full Text | Google Scholar

Schneider, H., and Rudd, C. E. (2014). Diverse Mechanisms Regulate the Surface Expression of Immunotherapeutic Target Ctla-4. Front. Immunol. 5, 619. doi:10.3389/fimmu.2014.00619

PubMed Abstract | CrossRef Full Text | Google Scholar

Shao, J., and Liu, B. (2021). ProtFold-DFG: Protein Fold Recognition by Combining Directed Fusion Graph and PageRank Algorithm. Brief Bioinform 22 (3), bbaa192. doi:10.1093/bib/bbaa192

PubMed Abstract | CrossRef Full Text | Google Scholar

Shao, J., Yan, K., and Liu, B. (2021). FoldRec-C2C: Protein Fold Recognition by Combining Cluster-To-Cluster Model and Protein Similarity Network. Brief Bioinform 22 (3), bbaa144. doi:10.1093/bib/bbaa144

PubMed Abstract | CrossRef Full Text | Google Scholar

Steere, A. C., and Glickstein, L. (2004). Elucidation of Lyme Arthritis. Nat. Rev. Immunol. 4 (2), 143–152. doi:10.1038/nri1267

PubMed Abstract | CrossRef Full Text | Google Scholar

Stone, C. J. (1985). Additive Regression and Other Nonparametric Models. Ann. Stat. 13, 689–705. doi:10.1214/aos/1176349548

CrossRef Full Text | Google Scholar

Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N. L. S., et al. (2010). BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies. Am. J. Hum. Genet. 87 (3), 325–340. doi:10.1016/j.ajhg.2010.07.021

CrossRef Full Text | Google Scholar

Wang, H., Jijun, T., Ding, Y., and Guo, F. (2021). Exploring Associations of Non-coding RNAs in Human Diseases via Three-Matrix Factorization with Hypergraph-Regular Terms on center Kernel Alignment. Brief. Bioinform. 22, bbaa409. doi:10.1093/bib/bbaa409

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, Z., He, W., Tang, J., and Guo, F. (2020). Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. J. Chem. Inf. Model. 60 (3), 1876–1883. doi:10.1021/acs.jcim.9b01012

PubMed Abstract | CrossRef Full Text | Google Scholar

Wei, L., Chen, H., and Su, R. (2018). M6APred-EL: A Sequence-Based Predictor for Identifying N6-Methyladenosine Sites Using Ensemble Learning. Mol. Ther. - Nucleic Acids 12, 635–644. doi:10.1016/j.omtn.2018.07.004

PubMed Abstract | CrossRef Full Text | Google Scholar

Wei, L., Wan, S., Guo, J., and Wong, K. K. (2017). A Novel Hierarchical Selective Ensemble Classifier with Bioinformatics Application. Artif. Intelligence Med. 83, 82–90. doi:10.1016/j.artmed.2017.02.005

CrossRef Full Text | Google Scholar

Wei, L., Xing, P., Zeng, J., Chen, J., Su, R., and Guo, F. (2017). Improved Prediction of Protein-Protein Interactions Using Novel Negative Samples, Features, and an Ensemble Classifier. Artif. Intelligence Med. 83, 67–74. doi:10.1016/j.artmed.2017.03.001

CrossRef Full Text | Google Scholar

Yang, J., Huang, T., Huang, T., Petralia, F., Long, Q., Zhang, B., et al. (2015). Synchronized Age-Related Gene Expression Changes across Multiple Tissues in Human and the Link to Complex Diseases. Sci. Rep. 5, 15145. doi:10.1038/srep15145

PubMed Abstract | CrossRef Full Text | Google Scholar

Young, A. I. (2019). Solving the Missing Heritability Problem. Plos Genet. 15 (6), e1008222. doi:10.1371/journal.pgen.1008222

PubMed Abstract | CrossRef Full Text | Google Scholar

Yu, B., Qiu, W., Chen, C., Ma, A., Jiang, J., Zhou, H., et al. (2020). SubMito-XGBoost: Predicting Protein Submitochondrial Localization by Fusing Multiple Feature Information and eXtreme Gradient Boosting. Bioinformatics 36 (4), 1074–1081. doi:10.1093/bioinformatics/btz734

PubMed Abstract | CrossRef Full Text | Google Scholar

Yu, X., Zhou, J., Zhao, M., Yi, C., Duan, Q., Zhou, W., et al. (2020). Exploiting XG Boost for Predicting Enhancer-Promoter Interactions. Curr. Bioinformatics 15 (9), 1036–1045.

Google Scholar

Yuan, Z., Gao, Q., He, Y., Zhang, X., Li, F., Zhao, J., et al. (2012). Detection for Gene-Gene Co-association via Kernel Canonical Correlation Analysis. BMC Genet. 13, 83. doi:10.1186/1471-2156-13-83

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, S., Jiang, W., Ma, R. C., and Yu, W. (2019). Region-based Interaction Detection in Genome-wide Case-Control Studies. BMC Med. Genomics 12 (Suppl. 7), 133. doi:10.1186/s12920-019-0583-7

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: genome-wide association studies, gene–gene interactions, XGBoost, additive model, gene-based testing

Citation: Guo Y, Wu C, Yuan Z, Wang Y, Liang Z, Wang Y, Zhang Y and Xu L (2021) Gene-Based Testing of Interactions Using XGBoost in Genome-Wide Association Studies. Front. Cell Dev. Biol. 9:801113. doi: 10.3389/fcell.2021.801113

Received: 24 October 2021; Accepted: 23 November 2021;
Published: 16 December 2021.

Edited by:

Liang Cheng, Harbin Medical University, China

Reviewed by:

Yi Xiong, Shanghai Jiao Tong University, China
Quan Zou, University of Electronic Science and Technology of China, China

Copyright © 2021 Guo, Wu, Yuan, Wang, Liang, Wang, Zhang and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yi Zhang, y_zhang1024@126.com; Lei Xu, csleixu@szpt.edu.cn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.