Gene regulation is the intricate and highly dynamic process of inducing or inhibiting the expression of individual genes in an organism’s genome. It is orchestrated by a vast array of molecules and mechanisms, including transcription factors and cofactors, chromatin regulators, and other epigenetic mechanisms, which allow an organism to control cell growth and differentiation during development and to respond to a variety of environmental stimuli. In turn, disruptions in gene regulation, e.g., those caused by mutations in regulatory sequences, have been shown to be a defining feature of many diseases, including developmental and neurological disorders as well as cancer.
Given its importance for proper cell functioning and adaptability, decoding the architecture of gene regulation has become one of the most pressing tasks in modern (computational) biology. To this end, it has been a long-held ambition to enable quantitative prediction of gene regulation, i.e., inference of gene expression levels, from genomic and epigenomic features alone. Rising computing power, recent advances in learning algorithms, high-throughput next-generation sequencing that provides large-scale quantification of gene expression at single-cell resolution, and the identification of novel genes and non-coding RNAs at unprecedented levels may together bring us one step closer to realizing this ambition.
Early attempts to apply modern machine learning approaches, in particular deep learning, to predict mRNA abundance levels directly from DNA sequence have already yielded promising results. Despite this, it remains an open question how individual factors and epigenomic features involved in the gene regulatory apparatus interact within the vast genomic landscape of an organism’s non-coding regions and, in turn, contribute to mRNA expression levels. To advance our understanding of gene expression inference, we need scalable algorithms that: i) allow for the integration of diverse genomic and epigenomic datasets as well as structural and/or biological priors; ii) are interpretable with respect to the representations they have learned; and iii) allow the learned representations to be transferred to novel and different environments and contexts.
This Research Topic welcomes both original studies and review articles assessing modern machine learning approaches that integrate genomic and/or epigenomic datasets to quantitatively predict gene regulation. The topics of interest include, but are not limited to, the following:
- Quantitative inference of gene expression levels from DNA sequence and/or epigenomic features;
- Analyses of transcription factor interactions across multiple binding elements and/or the effects of single nucleotide polymorphisms, copy number variations, etc., in causing the loss or creation of promoter binding elements and enhancers;
- Strategies for heterogeneous data integration of genomic sequences and epigenetic datasets, including chromatin accessibility, methylation, or chromatin conformation-related data;
- Evaluation and/or benchmarking of different (un)supervised learning paradigms, including Bayesian (deep) learning, deep convolutional models, graph neural networks, or attention-based approaches;
- Approaches to structural learning of efficient model architectures based on biological or structural priors, or on data-driven methods such as Neural Architecture Search or Bayesian sampling;
- Analyses of trade-offs between model scalability and model complexity with respect to structural priors, inductive biases, dimensionality reduction, and randomized or sampling-based techniques;
- Investigations into model interpretability, e.g., visualization of individual components of trained models, such as model filters representing sequence binding motifs;
- Studies into the possibilities/limitations of the transfer of trained models to novel/different contexts and environments.