ORIGINAL RESEARCH article

Front. Bioinform.
Sec. Genomic Analysis
Volume 4 - 2024 | doi: 10.3389/fbinf.2024.1397036
This article is part of the Research Topic: From one genome to many genomes: the evolution of computational approaches for pangenomics and metagenomics analysis

Pangenome Comparison via ED Strings

Provisionally accepted
Esteban Gabory 1, Moses N. Mwaniki 2, Nadia Pisanti 2, Solon Pissis 1*, Jakub Radoszewski 3, Michelle Sweering 1, Wiktor Zuba 1
  • 1 Centrum Wiskunde & Informatica, Amsterdam, Noord-Holland, Netherlands
  • 2 University of Pisa, Pisa, Tuscany, Italy
  • 3 University of Warsaw, Warsaw, Masovian, Poland

The final, formatted version of the article will be published soon.

    An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference. In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently, infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings [Gabory et al., CPM 2023]. We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets. As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction, showing that the intersection graph is faster by up to one order of magnitude. As for effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.
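To make the objects in the abstract concrete, the following is a minimal Python sketch, not the authors' tool or algorithm. It models an ED string as a list of sets of strings, enumerates the strings it spells by brute force, and checks whether two ED strings share a spelled string. It also shows the classic matching statistics on two plain strings; the ED-string variants defined in the article are not reproduced here. The function names `language`, `ed_intersect`, and `matching_statistics` are illustrative assumptions, and the brute-force enumeration is exponential, unlike the intersection graph method the article relies on.

```python
from itertools import product


def language(ed):
    """Return every string spelled by choosing one option from each set.

    `ed` is a list of sets of strings; "" denotes the empty string.
    Brute force and exponential in len(ed): for tiny examples only.
    """
    return {"".join(choice) for choice in product(*ed)}


def ed_intersect(ed1, ed2):
    """True if the two ED strings spell at least one common string (naive check)."""
    return not language(ed1).isdisjoint(language(ed2))


def matching_statistics(s, t):
    """Classic matching statistics on plain strings (not the ED variant).

    ms[i] is the length of the longest prefix of s[i:] that occurs in t.
    """
    ms = []
    for i in range(len(s)):
        k = 0
        # Extend the match one character at a time while it still occurs in t.
        while i + k < len(s) and s[i:i + k + 1] in t:
            k += 1
        ms.append(k)
    return ms


# Hypothetical toy data, chosen only for illustration.
ed1 = [{"A"}, {"C", "CG", ""}, {"T"}]   # spells ACT, ACGT, AT
ed2 = [{"AC", "A"}, {"GT", "G"}]        # spells ACGT, ACG, AGT, AG

print(sorted(language(ed1)))
print(ed_intersect(ed1, ed2))
print(matching_statistics("ACGT", "CGA"))
```

In this toy example both ED strings spell `ACGT`, so the intersection check succeeds, and the plain-string matching statistics of `ACGT` against `CGA` are `[1, 2, 1, 0]`.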

    Keywords: elastic-degenerate string, intersection graph, pangenome comparison, matching statistics, SARS-CoV-2

    Received: 06 Mar 2024; Accepted: 23 Aug 2024.

    Copyright: © 2024 Gabory, Mwaniki, Pisanti, Pissis, Radoszewski, Sweering and Zuba. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Solon Pissis, Centrum Wiskunde & Informatica, Amsterdam, Noord-Holland, Netherlands

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.