METHODS article

Front. Genet.
Sec. Computational Genomics
Volume 15 - 2024 | doi: 10.3389/fgene.2024.1421565
This article is part of the Research Topic Towards a More Complete and Accurate Personal Genome Sequence: Methods and Use Cases

AsmMix: an efficient haplotype-resolved hybrid de novo genome assembling pipeline

Provisionally accepted
Chao Liu, Pei Wu, Xue Wu, Xia Zhao, Fang Chen, Xiaofang Cheng, Hongmei Zhu, Ou Wang*, Mengyang Xu
  • Beijing Genomics Institute (BGI), Shenzhen, China

The final, formatted version of the article will be published soon.

    Accurate haplotyping facilitates distinguishing allele-specific expression, identifying cis-regulatory elements, and characterizing genomic variations, which enables more precise investigations into the relationship between genotype and phenotype. Recent advances in third-generation single-molecule long-read and synthetic co-barcoded read sequencing techniques have harnessed long-range information to simplify the assembly graph and improve the assembled genomic sequence. However, reconstructing complete haplotypes remains methodologically challenging because of the high sequencing error rates of long reads and the limited capture efficiency of co-barcoded reads. Here we present AsmMix, a pipeline for generating diploid genomes that are both contiguous and accurate. It first assembles co-barcoded reads to generate accurate haplotype-resolved assemblies that may contain many gaps, whereas the long-read assembly is contiguous but susceptible to errors. The two assembly sets are then integrated into haplotype-resolved assemblies with fewer misassemblies. In extensive evaluations on multiple synthetic datasets, AsmMix consistently demonstrates high precision and recall rates for haplotyping across diverse sequencing platforms, coverage depths, read lengths, and read accuracies, significantly outperforming existing tools. Furthermore, we validate the effectiveness of our pipeline on a human whole-genome dataset (HG002) and produce highly contiguous, accurate, and haplotype-resolved assemblies. These assemblies are evaluated using the GIAB benchmarks, confirming the accuracy of variant calling. Our results demonstrate that AsmMix offers a straightforward yet highly efficient approach that effectively leverages both long reads and co-barcoded reads for haplotype-resolved assembly.

    Keywords: long reads, bioinformatics, de novo, genome assembly, haplotype

    Received: 22 Apr 2024; Accepted: 05 Jul 2024.

    Copyright: © 2024 Liu, Wu, Wu, Zhao, Chen, Cheng, Zhu, Wang and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Ou Wang, Beijing Genomics Institute (BGI), Shenzhen, 518083, China

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.