AUTHOR=Wang Luotong, Qu Li, Yang Longshu, Wang Yiying, Zhu Huaiqiu

TITLE=NanoReviser: An Error-Correction Tool for Nanopore Sequencing Based on a Deep Learning Algorithm

JOURNAL=Frontiers in Genetics

VOLUME=Volume 11 - 2020

YEAR=2020

URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2020.00900

DOI=10.3389/fgene.2020.00900

ISSN=1664-8021

ABSTRACT=Nanopore sequencing is one of the most promising Third-Generation Sequencing (TGS) technologies. Since 2014, Oxford Nanopore Technologies (ONT) has developed a series of devices based on nanopore sequencing that produce very long reads, with a foreseeable impact on genomics. However, nanopore sequencing reads suffer from a fairly high error rate owing to the difficulty of determining the DNA bases from the complex electrical signals. Although a number of basecalling tools have been developed for nanopore sequencing over the past years, correcting the sequences after basecalling remains a challenge. In this study, we present an open-source DNA base reviser, NanoReviser, based on a deep learning model that is capable of correcting the basecalling errors introduced by the various default basecallers. In our module, we re-segmented the raw electrical signals based on the basecalled sequences provided by the default basecallers, and this re-segmentation process proved necessary for correcting the leak detection errors. By employing Convolutional Neural Networks (CNN) and bidirectional Long Short-Term Memory (Bi-LSTM) networks, we took advantage of the information from both the raw electrical signals and the basecalled sequences from the basecallers. Our results show that NanoReviser, as a post-basecalling reviser, significantly improves the basecalling quality. Trained and tested on the standard ONT sequencing reads from public E. coli and human NA12878 datasets, NanoReviser can reduce the sequencing error rate by over 5% on the E. coli dataset and by 7% on the human dataset. NanoReviser outperforms all current basecalling tools. Furthermore, we analyzed the modified bases of E. coli and added the methylation information to train our module. With the methylation annotation, NanoReviser could reduce the error rate by 7% on the E. coli dataset and by over 10% in the methylated areas.
To the best of our knowledge, NanoReviser is the first post-basecalling processing tool to accurately correct nanopore sequences without the time-consuming procedure of building a consensus sequence. The NanoReviser package is available at https://github.com/pkubioinformatics/NanoReviser.