ORIGINAL RESEARCH article

Front. Genet.
Sec. Computational Genomics
Volume 15 - 2024 | doi: 10.3389/fgene.2024.1451730
This article is part of the Research Topic The 22nd International Conference on Bioinformatics (InCoB 2023) Translational Bioinformatics Transforming Life

A high-precision genome size estimator based on the k-mer histogram correction

Provisionally accepted
  • 1 The First Clinical Medical College of China Three Gorges University, Yichang, China
  • 2 Xi'an Mingde Institute of Technology, Northwestern Polytechnical University, Xi'an, China

The final, formatted version of the article will be published soon.

    Many characteristics of next-generation sequencing datasets can be extracted through k-mer-based analysis. Among these, genome size (GS) can be estimated with relative ease, yet achieving satisfactory accuracy, especially in the presence of heterozygosity, remains a challenge. In this study, we introduce a high-precision genome size estimator, GSET (Genome Size Estimation Tool), which is based on k-mer histogram correction. The processing model of GSET differs from the data-fitting models commonly used by similar tools: it is derived from empirical data and incorporates a correction term to mitigate the impact of sequencing errors on genome size estimation. We evaluated GSET on both simulated and real datasets. The experimental results show that GSET estimates genome size more precisely than state-of-the-art tools. Notably, GSET also performs satisfactorily on heterozygous datasets, where other tools struggle to produce usable results. GSET is freely available at https://github.com/Xingyu-Liao/GSET.
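
    For readers unfamiliar with k-mer-based genome size estimation, the sketch below shows the common textbook baseline: count k-mers, build the k-mer histogram, discard the low-depth k-mers attributed to sequencing errors, and divide the remaining k-mer occurrences by the depth of the main peak. This is not GSET's processing model or its correction term, which are not described in this abstract. The function names, the first-valley error cutoff, and the use of non-canonical k-mers are all simplifying assumptions made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that often.

    Assumption: k-mers are counted as-is (no reverse-complement merging).
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def first_valley(histogram):
    """Return the depth just before the histogram first rises.

    k-mers at or below this depth are treated as sequencing errors.
    Returns 0 (keep everything) if the histogram never rises.
    """
    if not histogram:
        raise ValueError("empty k-mer histogram")
    for depth in range(1, max(histogram)):
        if histogram.get(depth + 1, 0) > histogram.get(depth, 0):
            return depth
    return 0


def estimate_genome_size(histogram):
    """Baseline estimate: retained k-mer occurrences / main peak depth."""
    cutoff = first_valley(histogram)
    kept = {d: n for d, n in histogram.items() if d > cutoff}
    # Depth of the main (homozygous) peak among the retained k-mers.
    peak_depth = max(kept, key=kept.get)
    total_occurrences = sum(d * n for d, n in kept.items())
    return total_occurrences / peak_depth


# Hypothetical histogram: an error peak at low depth, a main peak at depth 20.
example = {1: 500, 2: 40, 3: 5, 18: 10, 19: 50, 20: 100, 21: 50, 22: 10}
print(round(estimate_genome_size(example)))
```

    A baseline of this kind depends heavily on where the error cutoff falls and assumes a single dominant peak, so it degrades on heterozygous data, where the histogram shows an additional peak near half the main depth. That is the situation the abstract identifies as difficult for existing tools.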

    Keywords: next-generation sequencing, k-mer frequency distribution, k-mer histogram correction, genome size estimation, sequencing error, sequencing bias

    Received: 19 Jun 2024; Accepted: 09 Aug 2024.

    Copyright: © 2024 Liao, Wufei and Liu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Chaoyun Liu, Xi'an Mingde Institute of Technology, Northwestern Polytechnical University, Xi'an, 710124, China

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.