Genome-Wide Association Study-Guided Exome Rare Variant Burden Analysis Identifies IL1R1 and CD3E as Potential Autoimmunity Risk Genes for Celiac Disease

Mansour, Haifa; Banaganapalli, Babajan; Nasser, Khalidah Khalid; Al-Aama, Jumana Yousuf; Shaik, Noor Ahmad; Saadah, Omar Ibrahim; Elango, Ramu

doi:10.3389/fped.2022.837957

ORIGINAL RESEARCH article

Front. Pediatr., 14 February 2022

Sec. Genetics of Common and Rare Diseases

Volume 10 - 2022 | https://doi.org/10.3389/fped.2022.837957


1. Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
2. Princess Al-Jawhara Al-Brahim Center of Excellence in Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia
3. Department of Medical Laboratory Technology, Faculty of Applied Medical Sciences, King Abdulaziz University, Jeddah, Saudi Arabia
4. Pediatric Gastroenterology Unit, Department of Pediatrics, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
5. Centre of Artificial Intelligence in Precision Medicine, King Abdulaziz University, Jeddah, Saudi Arabia

Abstract

Celiac disease (CeD) is a multifactorial autoimmune enteropathy characterized by the overactivation of the immune system in response to dietary gluten. The molecular etiology of CeD is still not well understood. Therefore, this study aims to identify potential candidate genes involved in CeD pathogenesis by applying multilayered systems biology approaches. Initially, we identified rare coding variants shared between the affected siblings in two rare Arab CeD families by whole-exome sequencing (WES). Then we used the STRING database to construct a protein network of rare variants and genome-wide association study (GWAS) loci to explore their molecular interactions in CeD. Furthermore, the hub genes identified based on network topology parameters were subjected to a series of computational validation analyses, including pathway enrichment, gene expression, knockout mouse model, and variant pathogenicity predictions. We found no rare variants showing classical Mendelian inheritance in either family. However, interactome analysis of rare WES variants and GWAS loci identified a total of 11 hub genes. The multidimensional computational analysis of hub genes prioritized IL1R1 for family A and CD3E for family B as potential candidate genes. These genes were connected to CeD pathogenesis pathways of T-cell selection, cytokine signaling, and adaptive immune response. Future multi-omics studies may uncover the roles of IL1R1 and CD3E in gluten sensitivity. The present investigation lays out a novel approach integrating next-generation sequencing (NGS) of familial cases, GWAS, and computational analysis for resolving the complex genetic architecture of CeD.

Introduction

Celiac disease (CeD) is an autoimmune gastrointestinal disorder seen in genetically susceptible individuals. The global seroprevalence of CeD (positive for CeD autoantibodies) is estimated to be ~1.4%, while biopsy-proven prevalence is ~0.7% (1, 2). It usually manifests in childhood and early adulthood, but can appear as early as infancy, necessitating early detection and intervention to prevent irreversible damage such as villous atrophy of the small intestine. Diarrhea, abdominal pain, failure to thrive, and anemia caused by intestinal villous atrophy are the most common clinical signs of CeD (3–5). CeD is triggered by abnormal activation of the immune system in response to dietary gliadin, a water-insoluble gluten protein found in wheat, rye, and barley (6, 7). The commonly practiced clinical intervention is a gluten-free diet (GFD); nevertheless, symptoms in some patients persist even after gluten elimination (8, 9). The reliable diagnostic approach for CeD is the histopathological evaluation of small bowel biopsy (SBB), accompanied by the grading of intestinal mucosal lesions based on the pattern of villous atrophy and the level of intraepithelial lymphocyte infiltration. Serological testing for tissue transglutaminase (tTG) and endomysial antibodies is a reliable screening approach, but ~5% of celiac patients are seronegative (10).

CeD is a classical multifactorial disease in which an individual's genetic background determines the susceptibility and severity of gluten sensitivity. The strong genetic component implicated in disease etiology has been highlighted in studies conducted among twins, first-degree relatives, animal models, and different ethnic populations (11). Among all the factors indicated, a family history of biopsy-defined CeD is expected to account for a greater disease risk (2–10-fold) in 20% or more of first-degree relatives (11). Also, patients with autoimmune diseases, such as type 1 diabetes (DM1) (85% are seropositive) (12, 13), primary Sjögren's syndrome, systemic sclerosis, and Graves' disease (autoimmune hyperthyroidism), have an increased chance of developing CeD (14, 15). Environmental factors such as the time of dietary gluten introduction and birth season are also thought to be involved in disease development (16). HLA (HLA-DQA1 and HLA-DQB1) genetic variants encoding the HLA-DQ2 and HLA-DQ8 antigens are known to explain up to 48% of disease etiology (17). All CeD patients have one of the two risk alleles (90 and 10%), but 30–40% of the general population also carries them (18, 19). This means that HLA risk alleles are a prerequisite for the development of CeD but are not sufficient on their own.

In recent decades, high-throughput genotyping [genome-wide association study (GWAS)] (20–25), massively parallel sequencing (26–28), and transcriptomics assays (RNA sequencing or microarrays) (29–33) have uncovered numerous genetic variations and differentially expressed genes, providing good resolution into the pathophysiology of CeD. However, these studies were largely undertaken in sporadic cases belonging to European/Mediterranean populations (34–37) and were unable to uncover any causative gene underpinning the complicated genetic architecture of CeD. A few whole-exome studies, on the other hand, were able to identify some family-specific rare variants (26, 27). This suggests that studying the molecular basis of CeD in families rather than sporadic cases is a promising strategy for uncovering novel disease genes or novel variants in known disease genes. However, due to the complicated polygenic nature of CeD, determining a specific causal gene or genetic variant is extremely difficult (27, 28). In this context, exploring the interaction between identified CeD GWAS loci and whole-exome sequencing (WES) variants not only reveals the major heritability but can also aid in uncovering new disease causal genes for many complex diseases (38). This approach may also help decipher the functional role of potential loci in a given disease.

In recent years, computational integrative annotation of data from GWAS, genome or exome sequencing [next-generation sequencing (NGS)], and genome-wide gene expression assays (microarray or RNA-seq) has proven to be a powerful approach for interpreting the development and/or progression of several complex autoimmune diseases (39). However, no such integrative genomic annotation studies have been conducted on CeD. Therefore, in the current study, we used WES data from two rare Arab celiac families to develop protein–protein interaction networks between family-specific rare coding variants and GWAS risk loci to unravel the genetic basis of CeD. Our findings demonstrate that even with data from only a few rare families, applying powerful integrated approaches can help in the identification of potential biomarkers for complex diseases.

Materials and Methods

The overall study design and experimental approaches are represented in Figure 1.

Figure 1

Recruitment of Celiac Disease Family

The study protocol was approved by the Research Ethics Committee, King Abdulaziz University Hospital, Jeddah (KAUH). We have recruited two non-consanguineous Arab families living in Saudi Arabia: family A with three affected siblings and family B with two affected siblings. Pediatric gastroenterologists diagnosed the patients by clinical, histopathological (intestinal SBB), and serological (anti-tTG antibodies) examinations. The patients were confirmed to meet the standard diagnostic guidelines of the European Society for Pediatric Gastroenterology Hepatology and Nutrition (ESPGHAN) for CeD (40). A three-generation pedigree of both families was constructed based on personal interviews. Clinical information about the celiac patients was collected from hospital electronic health records. After participant consent was obtained, peripheral blood samples (3–4 ml) were collected and stored at −80°C until genetic analysis was performed.

DNA Extraction

Genomic DNA was extracted following the lysis, binding, elution, and concentration steps outlined in the QIAamp (QIAGEN™, Valencia, CA, USA) blood extraction protocol. DNA concentration and purity were measured from absorbance at 260 and 280 nm (A260/A280 ratio for purity) using a NanoDrop 2000 spectrophotometer; the accepted ranges were 50–150 ng/μl and 1.8–2.0, respectively. The integrity of DNA samples was checked by 1% agarose gel electrophoresis, and the samples were then stored at −20°C until used for genetic analysis.

Whole-Exome Sequencing Analysis

WES was performed on the HiSeq2000 Next Generation Sequencer (Illumina, San Diego, CA, USA). Genomic DNA (average of 60 ng/μl) was used for library preparation, including DNA tagmentation (fragmentation and adapter ligation at both ends), target capturing (GKAT), and amplification using the ligated adapters. Libraries were then loaded onto a flow cell and placed on the sequencer for cluster generation and sequencing; the read depth was ~120×, with 97% of target regions covered at more than 10×. The sequencing reads were mapped to the human genome reference assembly GRCh38.p12 using the BWA algorithm, and SAMtools was then used for SAM-to-BAM file conversion and for calling single-nucleotide polymorphisms (SNPs) and indels (41, 42). The ANNOVAR tool was used for rsID identification, annotation, and pathogenicity prediction of variants (43, 44). Variants were filtered based on several quality control (QC) measures, namely depth (≥30), maximum quality read (≥60), and alternative-to-total depth ratio (>80% for homozygous variants and 40–70% for heterozygous variants), in addition to other criteria such as minor allele frequency (MAF) (<0.02), location (coding regions), and pathogenic effects (Supplementary Table 1). All the short-listed variants were analyzed by Sanger sequencing to determine their segregation pattern in the corresponding family members. For this purpose, oligonucleotide primer sequences (Supplementary Table 2) spanning the variant location were designed with the NCBI Primer-BLAST online tool (45), and then standard PCR amplification, Sanger sequencing, sequence alignment, and variant calling steps were performed as described in our recent publications (46, 47).
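The QC and frequency thresholds described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the `Variant` record, its field names, and the example gene labels are hypothetical, and the check simply applies the stated cutoffs (depth ≥30, quality ≥60, allele ratio >80% for homozygous or 40–70% for heterozygous calls, MAF <0.02, coding location).

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout, for illustration only.
    gene: str
    depth: int        # total read depth at the site
    alt_depth: int    # reads supporting the alternative allele
    quality: float    # quality value compared against the >=60 cutoff
    maf: float        # minor allele frequency
    is_coding: bool   # True if the variant lies in a coding region
    zygosity: str     # "hom" or "het"


def passes_filters(v: Variant) -> bool:
    """Apply the depth, quality, MAF, location, and allele-ratio cutoffs."""
    if v.depth < 30 or v.quality < 60:
        return False
    if v.maf >= 0.02 or not v.is_coding:
        return False
    ratio = v.alt_depth / v.depth
    if v.zygosity == "hom":
        return ratio > 0.80
    if v.zygosity == "het":
        return 0.40 <= ratio <= 0.70
    return False


# Toy variants with made-up values.
variants = [
    Variant("GENE_A", 120, 110, 60, 0.001, True, "hom"),  # passes
    Variant("GENE_B", 25, 12, 60, 0.001, True, "het"),    # fails depth
    Variant("GENE_C", 100, 55, 60, 0.001, True, "het"),   # passes
    Variant("GENE_D", 100, 55, 60, 0.05, True, "het"),    # fails MAF
    Variant("GENE_E", 100, 75, 60, 0.001, True, "het"),   # fails allele ratio
]
kept = [v.gene for v in variants if passes_filters(v)]
print(kept)
```

The pathogenicity criteria listed in Supplementary Table 1 are not modeled here, since their exact form is not given in the text.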

Protein–Protein Interaction Networks Construction of Rare Variants Genes and Genome-Wide Association Study Locus Genes

All the WES variants were initially examined to determine their mode of inheritance in the corresponding celiac families. Then, for families in which classical segregation analysis failed to identify a single disease causal variant, we constructed protein–protein interaction networks (PPINs) and examined the interactions between filtered WES genes and CeD GWAS loci [r² > 0.8] (20, 21, 48). The WES–GWAS gene list was provided as input to construct and expand the PPINs using the STRING database (https://string-db.org). Cytoscape 3.8.2 software was used to view the constructed networks and to calculate the centrality measures (49).

Network Analysis and Identification of Hub Genes

The PPINs generated from WES–GWAS data of each family were analyzed using two Cytoscape plug-ins, ClueGO (50) and CluePedia (51), for functional enrichment analysis using Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and immune system processes as key query Gene Ontology (GO) terms. Furthermore, the degree centrality (DC) parameter of network topology was analyzed using the NetworkAnalyzer Cytoscape plug-in. The DC of a node is the number of interactions it has with other nodes in the network (52), and genes with DC > 10 were selected as hub (high-centrality) genes.
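The hub-selection step can be sketched in a few lines of Python. This is an illustration rather than the Cytoscape workflow itself: it assumes an undirected network given as a list of edges, counts distinct interaction partners as the degree, and uses a toy network with hypothetical node names (`HUB1`, `HUB2`, `P0`, ...).

```python
from collections import defaultdict


def degree_centrality(edges):
    """Count distinct interaction partners per node in an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def hub_genes(edges, threshold=10):
    """Return nodes whose degree exceeds the threshold (DC > threshold)."""
    degrees = degree_centrality(edges)
    return sorted(gene for gene, dc in degrees.items() if dc > threshold)


# Toy network: HUB1 has 12 partners, HUB2 has 10, P0 has 3.
toy_edges = (
    [("HUB1", f"P{i}") for i in range(12)]
    + [("HUB2", f"P{i}") for i in range(10)]
    + [("P0", "P1")]
)
degrees = degree_centrality(toy_edges)
print(degrees["HUB1"], degrees["HUB2"], hub_genes(toy_edges))
```

With a strict `DC > 10` cutoff, a node with exactly 10 partners (`HUB2` here) is not reported as a hub; the `threshold` parameter can be adjusted if a different cutoff is intended.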

Computational Functional Validation of Selected Potential Celiac Disease Genes

The high-centrality genes from each PPIN were further explored to investigate their potential contribution to disease development. In this context, several databases and computational tools were used to perform functional enrichment annotations, examine gene expression levels in different organs, and note down the altered phenotypes of knockout (KO) mouse models.

Gene Ontology Annotations and Pathways

We used the Ensembl database (https://www.ensembl.org/index.html) to analyze the functionally enriched key GO terms including biological processes, molecular function, cellular components, and pathways for all the hub genes.

Knockout Mouse Model

To gain additional insight into the biological function of each query hub gene, we used the gene names as input to the Mouse Genome Informatics (MGI) database (http://www.informatics.jax.org) (53). This database lists the pathological phenotypes of KO models with reference to the studied mouse strain and provides an overview of the altered phenotypes in the mouse model.

Gene Expression Analysis

The gene expression data of the query hub genes were retrieved from the EBI Expression Atlas (EXA) interface available in Ensembl. This tool presents the normalized expression level of each gene in various organs and tissues in the form of a heatmap. Baseline expression levels were reported in either fragments per kilobase of exon model per million mapped reads (FPKM) or transcripts per million (TPM).
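For readers unfamiliar with the two units, the following Python sketch shows how FPKM and TPM are conventionally derived from raw fragment counts and gene lengths. In this study the values were retrieved precomputed, so the code is purely illustrative; the gene labels, counts, and lengths are made up.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of exon model per million mapped fragments."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] * 1e6 / scale for g in counts}


# Toy data: fragment counts and exon-model lengths in base pairs.
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values)
print(tpm_values)
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.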

Pathogenic Prediction of Hub Gene Variants

The rare coding variants identified in hub genes were further analyzed with the Variant Effect Predictor (VEP) tool provided by Ensembl (54). From the VEP outputs, the prediction scores of SIFT, PolyPhen, CADD, and MutationAssessor were selected. The MAF of these variants was determined by searching the Saudi Human Genome Project (SHGP) (https://shgp.kacst.edu.sa/index.en.html) and Greater Middle East (GME) Variome (http://igm.ucsd.edu/gme/) databases.

Rare Coding Variant Effect on the Protein Structure of Celiac Disease Candidates

The hub genes showing the highest interaction (gene count numbers) with GWAS genes and positive findings from computational annotations were shortlisted and further studied.

Protein Structural Feature Analysis

The amino acid sequences in FASTA format were provided as an input to the Protein Families database (http://pfam.xfam.org) for mapping the variants onto functional domains (55). Additionally, the PredictProtein database (https://predictprotein.org) was used to detect the change in solvent accessibility and flexibility of the candidate protein in both native and variant conditions.

3D Structure Stability Analysis

Homology modeling of the query proteins was performed using BLASTP (56) and the Swiss PDB viewer (https://swissmodel.expasy.org) tool by searching for experimentally solved structures (with >50% coverage) deposited in the Protein Data Bank (PDB) (https://www.rcsb.org) (57). After that, MODELLER 10.1 software was used to build the protein model using the multiple template alignment approach. A total of 100 models were initially built for each protein, and the models with the lowest DOPE scores were then selected for energy minimization of the 3D structures (58). The optimum 3D structure was validated using the Ramachandran plot in the PROCHECK program (59) and was then used as a reference to build the mutant protein version with the DUET webserver. Besides predicting the tertiary structure models, DUET also provides the consensus stability scores of the SDM (assesses the change in amino acid function and protein family) and mCSM (assesses the effect of missense mutations on the protein structure) methods (60). Finally, PyMOL software was used for visualization and alignment of all the protein structures built (61).

Results

Clinical and Family History

In family A (Figure 2A), the age at CeD diagnosis was 18 years for the proband (III.2) and 12 years for siblings III.4 and III.5; the latter two showed elevated tTG antibody levels, averaging 234.7 chemiluminescence units (CU) against a normal range of <20 CU. All three patients adopted a GFD 1 month after histological diagnosis, together with nutritional supplements such as calcium, magnesium, zinc, vitamin D, iron, and folic acid to compensate for the metabolic defects of CeD. The proband (III.2) was prescribed thyroxine tablets to manage a high level of thyroid-stimulating hormone (TSH) and hypothyroidism.

Figure 2

In family B (Figure 2B), the age of diagnosis was 5 years for III.4 and years for III.5 with a 7 CU average of tTG antibodies. Similar to family A, both patients adopted a GFD 1 month after histological diagnosis, together with nutritional supplements such as calcium and vitamin D, in addition to antihistamine and painkiller drugs. Patient III.4 was diagnosed with diabetes mellitus; therefore, he was prescribed insulin, as well as thyroxine tablets for hypothyroidism management.

Whole-Exome Sequence Analysis

An average of 75,815 and 104,377 variants (with a Phred quality score of Q30) were identified in families A and B, respectively. In family A, a total of 338 variants (27 homozygous and 311 heterozygous) spanning 322 genes were shared between III.2 and III.4. In family B, III.4 and III.5 shared 313 variants (37 homozygous and 276 heterozygous) mapped to 271 genes. The majority (11/12; 91.7%) of the coding variants identified in both families belonged to the missense category. Supplementary Table 3 shows the WES variant filtration steps followed in this study.

Segregation Analysis

Sanger sequencing validation of potential variants was performed to determine their mode of inheritance, i.e., autosomal recessive (AR), compound heterozygous (CH), or de novo (DN), in the CeD families. Overall, WES data filtration under different combinations yielded 4 variants under the AR mode of inheritance: IGFN1 c.3056T>G and LAD1 c.452G>A for family A, and SSPO c.11582dupA and PKD1L2 c.706_707delAA for family B. However, Sanger sequencing did not confirm the AR segregation of the potential variants (IGFN1 c.3056T>G and LAD1 c.452G>A) in the individuals from family A (Supplementary Figures 1, 2). Of the candidate variants short-listed for family B, SSPO was reported to be a pseudogene, and PKD1L2 has no functional correlation with CeD. Moreover, the search for CH variants in family A was not possible due to the absence of maternal WES data. In family B, two genes carried multiple variants showing CH inheritance: CYP4X1 [c.116C>T (maternal) and c.377C>T (paternal)] and FLG [c.4765C>T (maternal) and c.6001G>A (paternal)]. However, they were excluded due to the lack of functional relevance to autoimmunity and CeD. Therefore, we concluded that neither the AR nor the CH segregation model can explain the genetic basis of CeD in these two families.
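The compound heterozygous criterion used above can be expressed as a simple check: a gene qualifies when the affected child carries at least two heterozygous variants in it, with at least one inherited from each parent. The Python sketch below applies this rule to the family B variants reported in the text; the function name and the `GENE_X` counter-example are hypothetical, and phasing is assumed to be known from parental genotypes.

```python
def is_compound_het(variant_origins):
    """Return True if a gene's heterozygous variants come from both parents.

    variant_origins maps each variant (HGVS string) to the parent it was
    inherited from: "maternal" or "paternal".
    """
    origins = set(variant_origins.values())
    return len(variant_origins) >= 2 and {"maternal", "paternal"} <= origins


candidates = {
    # Parental origins as reported for family B.
    "CYP4X1": {"c.116C>T": "maternal", "c.377C>T": "paternal"},
    "FLG": {"c.4765C>T": "maternal", "c.6001G>A": "paternal"},
    # Hypothetical counter-example: both variants from the same parent.
    "GENE_X": {"c.1A>G": "maternal", "c.2C>T": "maternal"},
}
ch_genes = [gene for gene, v in candidates.items() if is_compound_het(v)]
print(ch_genes)
```

A check of this kind needs genotypes from both parents, which is why it could not be applied to family A.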

Protein–Protein Interaction Network Construction and Expansion

The segregation analysis of the rare coding variants did not provide any evidence of causal gene(s) for CeD. We therefore hypothesized that the enrichment of variants in many functionally related or interacting genes in a relevant pathway might provide a clue to the disease biology. In families A and B, we identified rare variants in 322 and 271 genes, respectively. We found 23 and 13 of these genes from families A and B, respectively, in the innate immunity database (Supplementary Table 4). We constructed PPINs with genes (322 and 271 genes) from WES data and CeD GWAS loci (50 genes, r² > 0.8).

Table 1 shows the statistical parameters of WES–GWAS PPINs before and after their expansion using the STRING database. Among the WES results from family A, only BACH2 (a GWAS gene) carried one copy of a missense variant (rs1321699864), and it was not found to interact with any other WES-identified gene, while in family B, no GWAS genes were found to have any rare coding variants. The maximum PPIN enrichment p-value was 9.99 × 10⁻¹⁶, and the minimum average local clustering coefficient was >0.423. Fine mapping of CeD GWAS loci on the Immunochip platform has identified 57 loci mapped to 50 genes with a linkage disequilibrium score (r²) of >0.8 (20, 21, 48). The PPIN mapping and expansion of 371 genes (50 GWAS and 322 WES genes) in the STRING database showed direct interactions between 42 GWAS (84%) and 65 WES (20.1%) genes in family A. For family B, 321 genes (50 GWAS and 271 WES genes) were mapped and expanded, showing direct protein–protein interactions between 44 GWAS (88%) and 56 WES (20.7%) genes.

Table 1

Statistical measure	Family A, before expansion	Family A, after expansion	Family B, before expansion	Family B, after expansion
No. of mapped nodes	359	409	317	357
No. of edges	481	972	423	824
Average node degree	2.68	4.75	2.67	4.62
Avg. local clustering coefficient	0.363	0.392	0.388	0.423
Expected number of edges	355	745	280	583
PPI enrichment p-value	1.31 × 10⁻¹⁰	9.99 × 10⁻¹⁶	11.1 × 10⁻¹⁵	<1.0 × 10⁻¹⁶

Statistical parameters of original and expanded WES–GWAS protein networks generated by STRING database.

WES, whole-exome sequencing; GWAS, genome-wide association study; PPI, protein–protein interaction.
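As a consistency check on Table 1, the average node degree of an undirected network follows directly from the node and edge counts, since every edge contributes to the degree of two nodes: average degree = 2E/N. The short Python sketch below recomputes the value for each network from the counts in Table 1; the dictionary layout and key names are only illustrative.

```python
def average_node_degree(nodes, edges):
    """Average degree of an undirected network: each edge touches two nodes."""
    return 2 * edges / nodes


# (mapped nodes, edges, reported average node degree) from Table 1.
table1 = {
    ("A", "before"): (359, 481, 2.68),
    ("A", "after"): (409, 972, 4.75),
    ("B", "before"): (317, 423, 2.67),
    ("B", "after"): (357, 824, 4.62),
}

recomputed = {
    key: round(average_node_degree(nodes, edges), 2)
    for key, (nodes, edges, _) in table1.items()
}
for key, (_, _, reported) in table1.items():
    print(key, recomputed[key], reported)
```

The recomputed values agree with the reported ones to two decimal places for all four networks.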

Functional Enrichment Analysis of Whole-Exome Sequencing–Genome-Wide Association Study PPINs

The functional enrichment analysis of WES–GWAS protein networks confirmed the predominant role of immune system-related GO terms and pathways in CeD etiology. In family A, 4.9% of WES and 20% of GWAS genes belonged to immune-related pathways, when compared with the total direct interactions. These include regulation of innate and adaptive immune responses, cytokine–cytokine receptor interaction, and regulation of production of molecular mediators of immune response pathways. On the other hand, 16.6% of the WES genes identified in family B were interacting with 48% of GWAS genes in immune pathway interactions. These genes were associated with autoimmune diseases like DM1, inflammatory bowel disease, rheumatoid arthritis, systemic lupus erythematosus, and autoimmune thyroid disease and mapped to the intestinal immune network for IgA production, regulation of innate and adaptive immune response, and B cell- and T cell-mediated immunity pathways.

Protein Interaction Centrality Measures and Hub Gene Identification

The topology parameters of both PPINs revealed a total of 11 non-HLA WES genes showing a high-centrality score (>10 nodes). HLA genes were excluded to prioritize non-HLA immune-related genes and to study their relevance to CeD. In family A, four hub genes were identified: EXOSC6 (Pro272Ser), CCNE1 (Asn260Ile), ORC1 (Met816Thr), and IL1R1 (Tyr202His and Gly398Arg). In family B, seven hub genes were identified: PPP2R1B (Arg549Cys), FBXL7 (Thr292Ile), PSMA8 (Val11Leu), POLR2A (Lys1838fs), CD3E (Ala157Val), WRN (Thr324Ala), and RANBP2 (Ile664Val). Of the 11 hub genes, CD3E and IL1R1 showed the highest number of interactions with GWAS genes, with 9 and 7, respectively. Table 2 and Figure 3 present the hub genes based on their DC and interacting gene partners from GWAS and WES data.

Table 2

Family	Gene	Degree of centrality	GWAS genes	WES genes
Family A	EXOSC6	15	ZFP36L1	NOL6, WDR3, KIAA0020, EMG1
	CCNE1	14	CSK	ORC1, CCT4
	ORC1	12	–	RIF1, CCNE1
	IL1R1	10	CCR2, CD28, CTLA4, IL2, IL21, IRAK1, IRF4	NOD1, MAP3K1, BCKDHA
Family B	PPP2R1B	25	CTLA4, IRAK1	APOB, CENPF
	FBXL7	22	–	GEMIN5, PSMA8, LRFN3, ANKRD9, MIB2
	PSMA8	20	–	FBXL7, PPP2R1B
	POLR2A	19	IRF4, UBE2L3	WDR77, ADCY10, KMT2C, NELFA, RPAP1, TERT, SRRM1
	CD3E	15	CSK, IL2, CTLA4, UBASH3A, ICOS, ETS1, RGS1, CCR2, CD28	HLA-DQA1, HLA-DRB5, HLA-B, HLA-DQB1
	WRN	11	RMI2	ASCC3, BIVM-ERCC5, BOD1L1, RIF1, TERT
	RANBP2	10	ZMIZ1, UBE2L3	ZNF44, SEC31A, SRRM1, PPP2R1B, CENPF, GEMIN5

Degree centrality between hub genes with GWAS loci and WES mapped genes with rare variants.

WES, whole-exome sequencing; GWAS, genome-wide association study.
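The ranking of hub genes by their number of GWAS interaction partners can be reproduced from the GWAS column of Table 2. The Python sketch below transcribes that column into a dictionary (the variable names are illustrative) and sorts the hub genes by partner count; it also lists the GWAS partners common to the two top-ranked genes.

```python
# GWAS interaction partners of each hub gene, transcribed from Table 2.
gwas_partners = {
    "EXOSC6": ["ZFP36L1"],
    "CCNE1": ["CSK"],
    "ORC1": [],
    "IL1R1": ["CCR2", "CD28", "CTLA4", "IL2", "IL21", "IRAK1", "IRF4"],
    "PPP2R1B": ["CTLA4", "IRAK1"],
    "FBXL7": [],
    "PSMA8": [],
    "POLR2A": ["IRF4", "UBE2L3"],
    "CD3E": ["CSK", "IL2", "CTLA4", "UBASH3A", "ICOS", "ETS1", "RGS1",
             "CCR2", "CD28"],
    "WRN": ["RMI2"],
    "RANBP2": ["ZMIZ1", "UBE2L3"],
}

# Rank hub genes by the number of GWAS partners, highest first.
ranked = sorted(gwas_partners, key=lambda g: len(gwas_partners[g]), reverse=True)
top_two = [(gene, len(gwas_partners[gene])) for gene in ranked[:2]]

# GWAS partners shared by the two top-ranked hub genes.
shared = sorted(set(gwas_partners["CD3E"]) & set(gwas_partners["IL1R1"]))
print(top_two)
print(shared)
```

The counts match the 9 and 7 GWAS partners stated for CD3E and IL1R1, and four of those partners (CCR2, CD28, CTLA4, and IL2) appear in both lists.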

Figure 3