METHODS article

Front. Bioinform.
Sec. Integrative Bioinformatics
Volume 4 - 2024 | doi: 10.3389/fbinf.2024.1489704

A Novel Lossless Encoding Algorithm for Data Compression - Genomics Data as an Exemplar

Provisionally accepted
Anas Al-okaily 1*, Abdelghani Tbakhi 2
  • 1 King Hussein Cancer Center, Amman, Jordan
  • 2 McMaster University, Hamilton, Ontario, Canada

The final, formatted version of the article will be published soon.

    Data compression is a challenging and increasingly important problem. As the amount of data generated daily continues to increase, efficient transmission and storage have never been more critical. In this study, a novel encoding algorithm is proposed, motivated by the compression of DNA data and its associated characteristics. The proposed algorithm follows a divide-and-conquer approach by scanning the whole genome, classifying subsequences based on similarities in their content, and binning similar subsequences together. The data in each bin is then compressed independently. This approach differs from the currently known approaches: entropy-, dictionary-, predictive-, or transform-based methods. Proof-of-concept performance was evaluated using a benchmark dataset of seventeen genomes ranging in size from kilobytes to gigabytes. The results showed a considerable improvement in the compression of each genome, saving several megabytes compared with state-of-the-art tools. Moreover, the algorithm can be applied to the compression of other data types, mainly text, numbers, images, audio, and video, which are generated daily in unprecedented and massive volumes.

    The importance of data compression, a fundamental problem in computer science, information theory, and coding theory, continues to increase as global data quantities expand rapidly. The primary goal of compression is to reduce the size of data for subsequent storage or transmission. There are two common types of compression algorithms: lossless and lossy. Lossless algorithms guarantee exact restoration of the original data, whereas lossy algorithms do not. Such losses are caused, for instance, by the exclusion of unnecessary information, such as metadata in video or audio that will not be observed by users.
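    The binning idea described above can be illustrated with a minimal sketch. The subsequence length, the content signature used to judge similarity, and the zlib back-end compressor below are all assumptions made for illustration; the paper's actual classification criterion and coder are not specified in this abstract.

```python
import zlib
from collections import defaultdict

def bin_and_compress(genome: str, window: int = 64) -> dict:
    """Split a sequence into fixed-length subsequences, bin them by a
    simple content signature, and compress each bin independently."""
    bins = defaultdict(list)
    for i in range(0, len(genome), window):
        chunk = genome[i:i + window]
        # Illustrative signature: the chunk's most frequent base.
        # The paper's actual similarity criterion may differ.
        signature = max(set(chunk), key=chunk.count)
        bins[signature].append(chunk)
    # Compress each bin on its own; zlib stands in for the real back-end coder.
    return {sig: zlib.compress("".join(chunks).encode())
            for sig, chunks in bins.items()}

if __name__ == "__main__":
    compressed = bin_and_compress("ACGT" * 1000 + "AAAA" * 500)
    for sig, blob in compressed.items():
        print(sig, len(blob), "bytes")
```

    Grouping similar subsequences before compression tends to lower the local entropy within each bin, which is what allows the independent per-bin compression to outperform compressing the whole sequence as one stream.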

    Keywords: compression, Huffman encoding, LZ, Genomics, BWT

    Received: 01 Sep 2024; Accepted: 26 Dec 2024.

    Copyright: © 2024 Al-okaily and Tbakhi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Anas Al-okaily, King Hussein Cancer Center, Amman, Jordan

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.