REVIEW article

Front. Genet.
Sec. Statistical Genetics and Methodology
Volume 15 - 2024 | doi: 10.3389/fgene.2024.1494474
This article is part of the Research Topic Statistical Approaches, Applications, and Software for Longitudinal Microbiome Data Analysis and Microbiome Multi-Omics Data Integration

Recent advances in deep learning and language models for studying the microbiome

Provisionally accepted
  • 1 University of Pennsylvania, Philadelphia, United States
  • 2 Vanderbilt University Medical Center, Nashville, Tennessee, United States
  • 3 University of South Florida, Tampa, Florida, United States
  • 4 University of Pittsburgh, Pittsburgh, Pennsylvania, United States

The final, formatted version of the article will be published soon.

    Recent advancements in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein and genomic language models and their contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.

    Keywords: microbiome, virome, artificial intelligence, large language models, transformer, attention

    Received: 10 Sep 2024; Accepted: 13 Dec 2024.

    Copyright: © 2024 Yan, Nam, Li, Deek, Li and Ma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence:
    Hongzhe Li, University of Pennsylvania, Philadelphia, United States
    Siyuan Ma, Vanderbilt University Medical Center, Nashville, 37232, Tennessee, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.