The final, formatted version of the article will be published soon.
OPINION article
Front. Bioinform.
Sec. Genomic Analysis
Volume 4 - 2024
doi: 10.3389/fbinf.2024.1510352
Unlocking the future of complex human disease prediction: Multiomics risk score breakthrough
Provisionally accepted
- 1 Catholic University of Health and Allied Sciences (CUHAS), Bugando, Tanzania
- 2 Makerere University, Kampala, Central Region, Uganda
Precise prediction of the risk of acquiring complex human diseases using genomic data has gained considerable traction among clinicians, medical geneticists and researchers, particularly in this era of next-generation sequencing. Multi-omics methods utilize various high-throughput screening technologies, such as genomics (GWAS), DNA methylomics, metagenomics, transcriptomics, proteomics, metabolomics, and many others, which play a crucial role in advancing the understanding of human diseases (Figure 1). These diverse multi-omics indicators create a comprehensive framework, yielding significant insights into future health status predictions. Polygenic risk scores (PRS), which estimate a person's genetic predisposition to a trait or disease from their genotype and pertinent genome-wide association study (GWAS) findings (Choi et al., 2020), and methylation risk scores (MRS), which are linear combinations of CpG (5'-C-phosphate-G-3') methylation states, where methylation is the covalent attachment of a methyl group to the cytosine residue of DNA (Thompson et al., 2022), have shown promise in predicting complex human diseases accurately (Liu et al., 2024). However, their translation into clinical care is yet to be realized. Several efforts have been made to improve their accuracy in predicting complex human diseases, including increasing diversity in the genetic training databases (for example, through the All of Us Research Program) and incorporating conventional risk factors in the PRS model (Liu et al., 2024). In the realm of predictive medicine, conventional risk factors span socio-demographic elements such as age and sex, anthropometric data such as body mass index (BMI), and crucial clinical measures, including blood pressure, lipid profiles, kidney and liver function tests, and other key biomarkers such as glycated hemoglobin (HbA1c).
These conventional risk factors intertwine with lifestyle choices, behaviors, and environment.
Figure 1 shows how a multi-omics risk score is constructed from various high-throughput screening technologies, such as genomics (GWAS), DNA methylomics, metagenomics, transcriptomics, proteomics and metabolomics, to precisely predict human diseases and advance their understanding.
The advent of GWAS, methylome-wide association studies (MWAS), and transcriptome-wide association studies (TWAS) has propelled genetic research forward by leaps and bounds, enabling the genotyping, methylation typing, and transcriptome analysis of millions of human samples. Through this vast endeavor, researchers have extracted genetic variants (Wu et al., 2021) and methylation patterns linked to disease susceptibility across the human genome. These genetic variants and methylation patterns serve as the building blocks for constructing PRS (Choi et al., 2020) and MRS tailored to predict complex diseases in individuals based on their unique genetic and epigenetic architecture. The efficacy and clinical potential of these tools are considerable: they offer valuable insights into risk prediction for many common complex human diseases, including cardiovascular diseases, cancers, diabetes mellitus, Alzheimer's disease, and ankylosing spondylitis (Cappozzo et al., 2022). They represent transformative applications in the arsenal of personalized medicine, promising to revolutionize healthcare by unlocking more of the secrets hidden within our genomes.
The immense potential of genome-wide genotyping arrays lies in their ability to serve as a cost-effective approach capable of generating hundreds of PRSs. This technology is now undergoing rigorous evaluation in clinical studies across global healthcare systems.
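As a minimal illustration of the PRS idea described above, a score can be sketched as a weighted sum of effect-allele dosages, with per-variant weights taken from GWAS effect sizes. The variant identifiers, weights, and genotype below are made-up placeholders, and the function name `polygenic_risk_score` is hypothetical; real pipelines additionally handle linkage disequilibrium, allele alignment, and missing genotypes.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Weighted sum of effect-allele dosages (0, 1, or 2 copies per variant).

    Variants absent from the individual's genotype are skipped. This is a
    simplifying assumption for the sketch; real tools usually impute them.
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)


# Hypothetical GWAS effect sizes (log odds ratios) for three variants.
gwas_weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}
# Hypothetical genotype: number of effect alleles carried at each variant.
person = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

score = polygenic_risk_score(person, gwas_weights)
print(round(score, 2))
```

An MRS can be sketched in the same way by replacing allele dosages with CpG methylation levels and GWAS weights with weights from a methylation association study.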
The allure of PRSs as predictive tools resonates profoundly, offering a glimpse into a future where personalized healthcare is not just a dream but a tangible reality poised to transform medical practice. However, the full clinical potential of PRS and MRS remains largely untapped (Martin et al., 2019). This is especially pronounced in populations with high genetic diversity, shorter linkage disequilibrium blocks, and historical under-representation in genome databases, such as those of sub-Saharan African descent. The journey towards widespread clinical implementation of PRS is still in its infancy, with considerable challenges to overcome. Yet, with determination and concerted effort, bridging these gaps holds the key to unlocking the transformative power of PRS and MRS in diverse populations worldwide. Of note, there are concerted efforts, such as the All of Us Research Program (Bick et al., 2024), Human Heredity and Health in Africa (H3Africa), and others, to increase the representation of historically under-represented populations in global genome databases and so address this disparity. As a result, the quantity and quality of data available to compute PRS and MRS are increasing. The advent of multi-omics technologies, and the data accrued from them in recent years, suggests that it is feasible to measure and combine various omics data and cellular factors. This enables the creation of multi-omics risk scores (MoRS) with enhanced predictability for complex diseases (Liu et al., 2024). PRS integrated with multi-omics data analyses, including metagenomics, epigenomics, and transcriptomics, have revealed potential biomarkers and improved predictability for several prevalent age-related conditions such as heart disease, diabetes, dementia, and various cancers (Liu et al., 2024).
The human gut microbiota, which refers to the collection of microorganisms residing in a person's gastrointestinal tract, has been implicated in numerous common diseases (Chen et al., 2024; Huang et al., 2024).
Specific microbial signatures in the gut have been linked to mortality and to the development of diseases such as type 2 diabetes (T2D), liver diseases, and respiratory diseases in the general population (Liu et al., 2024). This suggests that the composition of the gut microbiome could aid in predicting disease risk. It is worth noting that, although GWAS has shed light on the genetic underpinnings of the gut microbiome, the heritability of the gut microbiome is relatively low. Furthermore, similarities in the gut microbiome across generations are primarily associated with living in the same household rather than with genetic factors. Recent studies have highlighted the association of various omics data, such as gut metagenomics, DNA methylome data from epigenome-wide association studies, and transcriptomics data, with complex human diseases (Wu et al., 2021; Liu et al., 2024).
Recent research indicates that PRS models alone demonstrate superior predictive power compared to traditional risk factors (Liu et al., 2024). Furthermore, integrating MRS and transcriptomics data into the PRS substantially improved the prediction of prostate cancer (Wu et al., 2021). When conventional risk factors are incorporated into the PRS model, performance improves further (Liu et al., 2024). Moreover, integrating both conventional risk factors and additional omics data, such as gut metagenomics, into the PRS model significantly enhances its predictive performance for complex human diseases (Liu et al., 2024).
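Comparisons of this kind are commonly expressed as the area under the ROC curve (AUC) of each model. The following is a minimal sketch of how such a comparison could be computed, assuming each model outputs one risk score per person. The labels and scores are invented purely to exercise the function and say nothing about real model performance; the function name `auc` and the variable names are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen case scores higher than a randomly
    chosen control (ties count one half). This equals the area under the
    ROC curve."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy values, invented for illustration only: 1 = developed the disease.
labels = [0, 0, 0, 1, 1, 1]
prs_only = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]   # scores from a PRS-only model
combined = [0.1, 0.3, 0.4, 0.5, 0.7, 0.9]   # scores from a combined model

print(auc(prs_only, labels), auc(combined, labels))
```

In practice the two AUCs would be estimated on held-out data, with confidence intervals, before concluding that one model outperforms the other.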
Therefore, these studies demonstrate that including other omics data from gut metagenomics, DNA methylomics and transcriptomics shows promise for improving the prediction of the incidence of age-related complex human diseases such as coronary artery disease, type 2 diabetes, Alzheimer's disease, and prostate cancer (Wu et al., 2021; Liu et al., 2024).
Recent research suggests that blood DNA methylation at various CpG sites can serve as a valuable surrogate biomarker for exposure to risk factors, aiding in the prediction of complex human diseases such as cardiovascular diseases and in identifying high-risk populations (Cappozzo et al., 2022). Methylation risk scores (MRS) are typically constructed to model the relationship between methylation at CpG sites and specific traits or diseases identified through epigenome-wide association studies (EWAS). DNA methylation scores have been used effectively to assess an individual's biological age (the epigenetic clock) and have been strongly associated with several risk factors for non-communicable diseases (NCDs), such as smoking, alcohol consumption, low physical activity, obesity, socioeconomic status, and occupational characteristics. This collinearity has made DNA methylation scores a powerful tool for predicting aging-related diseases as well as lifestyle-related diseases such as cancer and cardiovascular diseases (Cappozzo et al., 2022).
These epigenetic clocks demonstrate strong predictive capabilities for aging-related diseases and overall mortality. Research indicates that the risk of developing complex human diseases depends on the interaction between host genetic factors, environmental influences, and human behaviors or lifestyles. Incorporating conventional risk factors such as age, sex, smoking, and alcohol consumption into models accounts for human behavior (Levine et al., 2018).
Studies, including the one conducted by Liu et al., have demonstrated that including these conventional risk factors improves the predictive ability for complex human diseases (Liu et al., 2024). Moreover, epigenetic methylation analysis can act as a surrogate for environmental factors, gene-environment interaction, and host lifestyle. It is therefore essential that DNA methylomics data be integrated into PRS models to enhance predictability for complex human diseases. It has been hypothesized that epigenomics data from DNA methylation might offer better predictive ability than many of the PRS in use today (Thompson et al., 2022). Consequently, integrating these multi-omics data into the PRS model could potentially yield the most effective predictive model for complex human diseases. To the best of our knowledge, very few studies have reported integrating DNA methylomics data into the PRS. Further investigations are therefore warranted to explore the impact of integrating DNA methylomics data into PRS models on predicting the development of complex human diseases.
Epigenetic modifications are widely recognized as influential factors in the biological pathways of both communicable diseases and non-communicable complex human diseases such as hypertension and cancer, with DNA methylation being the most extensively studied. Epigenetics involves the alteration of gene expression without changing the genetic code, through processes such as DNA methylation and histone modification. DNA methylation entails the covalent attachment of a methyl group to the cytosine base within cytosine-guanine dinucleotides (CpG sites), which cluster in regions referred to as CpG islands. When a gene undergoes heavy methylation, it typically remains transcriptionally silent (Irizarry et al., 2009). Environmental factors can trigger significant changes in methylation levels.
The methylation patterns found in promoter CpG islands, which are clusters of CpG sites located in gene promoters, hold significant promise as potential biomarkers. They could play crucial roles in disease detection, disease classification, prognosis, and forecasting treatment responses (Ehrlich, 2019).
Recent research has revealed compelling links between DNA methylation and fluctuations in blood pressure, cardiovascular ailments, and various other non-communicable diseases. Han et al. highlighted the pivotal role of gene-specific DNA methylation in elevating blood pressure, notably concerning factors such as angiotensin-converting enzyme, lipid and amino acid metabolism, and impaired glucose metabolism (Han et al., 2016). Richard et al. pinpointed 13 replicated CpG sites, explaining 1.4% and 2.0% of individual differences in systolic and diastolic blood pressure, respectively (Richard et al., 2017). Intriguingly, new findings propose a strong link between DNA methylation and lifestyle choices (such as smoking, alcohol intake, and diet), aging, obesity, and gender, all of which are vital risk factors for hypertension. Kim et al. identified an association between DNA methylation in peripheral blood leukocytes and hypertension prevalence, which hints at the potential of DNA methylation as a biomarker of high blood pressure (Kim et al., 2010). This highlights the promise of integrating DNA methylomics data into models to significantly enhance PRS performance.
Constructing multi-omics risk scores for complex human diseases typically involves integrating multiple layers of biological data (genomics, methylomics, metagenomics, transcriptomics, proteomics and metabolomics) into a single predictive score that quantifies disease risk. Bioinformatics and computational tools for constructing multi-omics risk scores use various statistical, machine learning, and feature selection techniques to identify predictive markers across omics layers and combine them into a single aggregated risk score.
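The aggregation step described above can be sketched, under strong simplifying assumptions, as standardizing each layer's score across the cohort and taking a weighted sum. The layer names, scores, and weights below are hypothetical placeholders; in practice the weights would be learned from training data (for example by regression) rather than fixed by hand, and this sketch is not the method of any specific tool.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore(values: List[float]) -> List[float]:
    """Standardize one layer's scores so that layers share a common scale."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        # A layer with no variation carries no information about risk.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def multi_omics_risk_score(layers: Dict[str, List[float]],
                           weights: Dict[str, float]) -> List[float]:
    """Weighted sum of standardized per-layer scores, one value per person."""
    n = len(next(iter(layers.values())))
    total = [0.0] * n
    for name, scores in layers.items():
        if len(scores) != n:
            raise ValueError(f"layer {name!r} has {len(scores)} scores, expected {n}")
        for i, z in enumerate(zscore(scores)):
            total[i] += weights[name] * z
    return total


# Hypothetical per-layer scores for three people, and hand-picked weights.
layers = {
    "PRS": [0.1, 0.2, 0.3],
    "MRS": [1.0, 3.0, 2.0],
    "metagenomic": [5.0, 5.0, 5.0],
}
weights = {"PRS": 0.5, "MRS": 0.3, "metagenomic": 0.2}

mors = multi_omics_risk_score(layers, weights)
print([round(x, 3) for x in mors])
```

Standardizing first keeps a layer measured on a large numeric scale from dominating the combined score simply because of its units.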
Table 1 summarizes the most commonly used analytical tools for multi-omics data integration. By combining these bioinformatics and computational tools and frameworks, researchers can construct multi-omics risk scores that provide comprehensive, predictive insights into complex human disease susceptibility and progression, ultimately enhancing our understanding of the molecular basis of these complex diseases by leveraging complementary information across multi-omics data.
(Euesden et al., 2015).
PRS-CS is a Bayesian polygenic risk scoring tool that improves the accuracy of PRS by accounting for linkage disequilibrium via Bayesian regression and continuous shrinkage (CS) priors. This tool allows integrating genomics data with other omics layers, such as transcriptomics data, to build multi-omics risk scores (Ge et al., 2019).
A deep learning-based tool that integrates multi-omics data using neural networks to identify predictive biomarkers and generate risk scores. Its architecture can handle complex, non-linear relationships across omics layers, making it well suited for multi-omics risk prediction (Kang et al., 2022).
These machine learning tools use various algorithms to integrate omics data and predict disease risk scores. They support feature selection, classification, and regression to create multi-omics risk prediction models (Zoppi et al., 2021; Ballard et al., 2024).
MOFA+ (Multi-Omics Factor Analysis+) performs factor analysis to reduce dimensionality across omics datasets, identifying factors that contribute to disease risk. These factors can then be combined to build risk scores for predicting disease phenotypes and clinical outcomes (Argelaguet et al., 2020).
CustOmics is a versatile deep-learning-based framework for multi-omics integration, designed for survival and classification tasks. It leverages customizable architectures to integrate data across omics types, particularly in cancer research (Benkirane et al., 2023).
PLIER is a tool for dimensionality reduction and feature selection that leverages known pathways to combine multiple omics layers. It can generate interpretable factors used in risk score modeling, making it well suited for multi-omics risk prediction (Mao et al., 2019).
CIMLR integrates multi-omics data by learning a consensus clustering across omics layers, which is useful for stratifying patients and creating disease risk scores. This method is effective in scenarios where disease subtypes must be identified in multi-omics datasets (Ramazzotti et al., 2018).
BOI is a Bayesian framework that uses priors based on biological knowledge (e.g., pathway information) to integrate data across omics layers and predict risk. It is particularly effective in combining genomics and epigenomics data to improve disease risk prediction (Fang et al., 2018; Almutiri et al., 2024; Novoloaca et al., 2024).
NetDx combines multi-omics data to build predictive patient similarity networks, which can be used to classify patient risk. NetDx allows for the creation of risk scores based on network-derived patient profiles and has been applied in cancer and psychiatric disease risk prediction (Pai et al., 2019).
Mergeomics uses network-based integration to link multi-omics markers with pathways and disease-related gene networks. By identifying critical network modules associated with disease risk, Mergeomics aids in building risk scores that combine multi-omics biomarkers (Shu et al., 2016).
OmicsPipe is an end-to-end analysis pipeline that provides workflows for analyzing multi-omics data, including RNA-seq, DNA-seq, and epigenomics data. OmicsPipe integrates these layers to build disease prediction models and can be adapted to produce multi-omics risk scores (Fisch et al., 2015).
tranSMART is a translational research platform that integrates and analyzes multi-omics data, clinical data, and biomarker information. It includes tools for multi-omics data integration, risk score modeling, and data visualization. While tranSMART itself does not directly compute risk scores, it can help identify biomarker candidates and generate hypotheses about disease risk factors by linking omics data to clinical outcomes (Athey et al., 2013; tranSMART-Foundation/transmart, 2023).
Although primarily focused on metabolomics, MetaboAnalyst has multi-omics capabilities, including pathway analysis and functional annotation. It can identify metabolomics and gene expression biomarkers linked to disease risk. Researchers typically use MetaboAnalyst's results in conjunction with statistical or machine learning models to develop personalized risk scores based on identified pathways and biomarkers (Pang et al., 2024).
GeneMANIA is a network and pathway analysis tool that integrates multi-omics data, including genomics, transcriptomics, and proteomics information, to predict disease-related genes. Its network-based approach can support risk score creation based on pathway associations (Mostafavi et al., 2008).
An R package that provides multi-omics integration and visualization methods, including PLS-DA, DIABLO, and multivariate factor analysis. It supports building predictive models and risk scores by selecting key features from multiple omics layers (Rohart et al., 2017).
ComplexHeatmap is a visualization package in R that supports hierarchical clustering and multi-layered heatmaps for multi-omics data. It is commonly used to visualize relationships across omics layers, aiding in feature selection for risk scoring (Gu et al., 2016; Gu, 2022).
The main challenge limiting the utility of multi-omics for improving prediction of complex human diseases is the under-representation of diverse populations in genome databases, which arises because collecting multi-omics data is time-consuming and costly, particularly in large-scale human genetic studies.
As a result, multi-omics data may be available for only a subset of participants in a study, limiting statistical power, generalizability and hence transferability (Hasin et al., 2017; Chen and Han, 2023). Furthermore, a lack of expertise among clinicians and biological scientists, particularly in low-income countries, slows the realization of the clinical utility of multi-omics for all. Concerted efforts are therefore needed to establish multi-omics databases and to foster training among biological scientists in health and life sciences in underserved areas.
Furthermore, several key limitations affect the reliability and effectiveness of multi-omics technologies. Batch effects are a significant challenge in omics data analysis, particularly in large-scale studies where samples are processed in batches or over extended periods. Batch effects are systematic differences in data caused by variations in experimental conditions across different batches of samples. These differences are unrelated to the biological variables or phenomena being studied and, if not properly addressed, can greatly reduce data quality and lead to misleading conclusions. Batch effects can arise from issues related to study design, such as flawed or confounded experimental setups or treatment effects. Variations in sample preparation and storage, including differences in protocols, reagents, equipment, storage conditions, operators, or laboratories, can also contribute to batch effects. Technical variations, such as inconsistencies in laboratory equipment, reagents, operators, or protocols across batches, further exacerbate the problem.
Temporal variations, which occur when samples are processed at different times, and environmental factors, such as changes in temperature, humidity, or other conditions during sample preparation or analysis, are additional sources of variability (Yu et al., 2024).
High-throughput experiments are particularly prone to batch effects because of variability in the data generated by different sequencing machines or mass spectrometry instruments. For example, DNA sequencing platforms may differ in kits, sequencing depth, quality, or laboratory practices. Bulk and single-cell RNA sequencing protocols may vary in laboratory methods, RNA quality, or library size. Similarly, LC-MS-based proteomics and metabolomics can be influenced by differences in instruments, processing order, or laboratory-specific procedures (Yu et al., 2024). Finally, data analysis introduces its own challenges, with variability arising from differences in analysis platforms or pipelines, software tools, reference databases, and the treatment of low-detected or missing values. Addressing these sources of batch effects is critical to ensuring the reliability and reproducibility of multi-omics studies and to minimizing the risk of drawing inaccurate conclusions (Yu et al., 2024).
Batch effects are notoriously common technical variations in multi-omics data and can lead to misleading outcomes if not properly addressed or if over-corrected. These effects can obscure true biological signals, create spurious associations, reduce the reproducibility and reliability of studies, and compromise the accuracy of downstream analyses such as clustering, classification, or biomarker discovery. To address this challenge, several batch-effect correction algorithms have been developed to facilitate multi-omics data integration.
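To make the idea of a batch effect concrete, the sketch below applies per-batch mean centering to a single feature: each batch's mean is subtracted and the overall mean is added back. This is a deliberately simple illustration and not one of the published algorithms. The measurements and batch labels are invented, and the function name `center_by_batch` is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import List


def center_by_batch(values: List[float], batches: List[str]) -> List[float]:
    """Remove per-batch shifts in one feature by subtracting each batch's
    mean and adding back the overall mean."""
    if len(values) != len(batches):
        raise ValueError("values and batches must have the same length")
    grand_mean = mean(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] + grand_mean for v, b in zip(values, batches)]


# Invented measurements: batch B reads about 10 units higher than batch A.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["A", "A", "A", "B", "B", "B"]

corrected = center_by_batch(values, batches)
print(corrected)
```

If batch is confounded with biology (for example, all cases processed in batch B), centering of this kind also removes the biological signal, which is the over-correction risk noted above. It is a teaching aid rather than a substitute for the dedicated batch-effect correction algorithms.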
However, the advantages and limitations of each batch-effect correction algorithm must be thoroughly evaluated against the type of omics data, performance metrics, and specific application scenarios before an appropriate method is selected (Ugidos et al., 2022; Yu et al., 2023, 2024). Assessing and mitigating batch effects is crucial to ensuring the reliability and reproducibility of omics data while minimizing the impact of technical variation on biological interpretation. As multi-omics data continue to expand, the importance of robust experimental design, optimized pipelines, and effective batch-effect correction algorithms is expected to grow, becoming central to large-scale research and clinical applications. The review by Yu et al. provides detailed insights into the sources, diagnostics, visualization techniques, and potential solutions for addressing batch effects in large-scale omics studies, including an overview of the currently available batch-effect correction algorithms (Yu et al., 2024). Reproducibility between individual studies is another significant concern, as experimental conditions can vary across studies, potentially leading to inconsistent results (Yu et al., 2023, 2024).
Additionally, certain omics types, such as transcriptomics, are particularly sensitive to sample extraction and preservation methods, which can introduce variability. Moreover, differences in platform technologies for the same omics type can further contribute to variability. For instance, DNA methylation analysis can be performed using a variety of BeadChip microarrays for detecting human DNA methylation.
These include the HumanMethylation27 BeadChip (27K, covering approximately 27,000 CpGs), the HumanMethylation450 BeadChip (450K, measuring over 485,000 CpGs), the HumanMethylationEPIC BeadChip (EPICv1 or 850K, measuring over 850,000 CpGs), and the HumanMethylationEPIC v2.0 BeadChip (EPICv2 or 900K, measuring over 900,000 CpGs) (Lussier et al., 2024). Alongside these arrays sits the gold standard, whole-genome bisulfite sequencing (WGBS); each platform offers different levels of resolution and coverage (Graw et al., 2021). Similarly, metabolomics can vary significantly depending on whether it is targeted or untargeted, presenting challenges in comparability. Lastly, there is the issue of interpreting absolute versus relative measures of risk, which complicates the translation of findings into practical applications (Canzler et al., 2020).
Additional challenges include data complexity: integrating diverse data types with distinct scales, noise and missing values requires advanced statistical and computational techniques, including sophisticated normalization and imputation methods.
Computational requirements are also a challenge, as multi-omics data integration is computationally intensive and requires high-performance computing resources and efficient algorithms. Lastly, the complexity of multi-omics models makes them difficult to interpret, limiting their utility in clinical settings.
The opportunities and future research directions for multi-omics risk prediction hinge on the fact that integrating multi-omics data within a systems biology framework enables researchers to model the pathways and networks that drive diseases, leading to insight into causative mechanisms and potential therapeutic targets. Multi-omics risk prediction has the potential to tailor interventions to an individual's molecular profile (personalized medicine), optimizing prevention strategies and improving outcomes.
This can be used in population preventive screening to identify high-risk individuals for early intervention, particularly for diseases in which early detection is critical, such as cancer, cardiovascular diseases, and diabetes mellitus. Furthermore, multi-omics can help in understanding disease mechanisms: by identifying the key molecular drivers of disease, it can highlight new therapeutic targets and inform drug development. Multi-omics risk prediction holds promise for enhancing precision in risk assessment for complex human diseases, with the potential to move health care towards more individualized, proactive care. As methodologies and computational tools continue to advance, multi-omics integration will likely play an increasingly pivotal role in predicting, preventing and managing complex human diseases.
In conclusion, complex human diseases arise from a complex interplay between host genetics, host behaviors or lifestyle, and environmental factors. Integrating multi-omics data (such as metagenomics, epigenomics, particularly DNA methylomics, and transcriptomics data) and conventional risk factors into risk score models holds the potential to achieve higher predictive performance for age-related complex human diseases than models based solely on PRS. Further research investigating the integration of multi-omics data is warranted to enhance PRS performance in predicting complex human diseases.
Keywords: Complex human disease prediction, Multi-omics risk score, Polygenic risk score (PRS), multi-omics, Trait prediction
Received: 12 Oct 2024; Accepted: 29 Nov 2024.
Copyright: © 2024 Kidenya and Mboowa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Benson R. Kidenya, Catholic University of Health and Allied Sciences (CUHAS), Bugando, Tanzania