GCRTcall: a transformer based basecaller for nanopore RNA sequencing enhanced by gated convolution and relative position embedding via joint loss training

Li, Qingwen; Sun, Chen; Wang, Daqian; Lou, Jizhong

doi:10.3389/fgene.2024.1443532

ORIGINAL RESEARCH article

Front. Genet., 22 November 2024

Sec. Computational Genomics

Volume 15 - 2024 | https://doi.org/10.3389/fgene.2024.1443532


Qingwen Li¹,²

Chen Sun³

Daqian Wang³*

Jizhong Lou¹,²,³*

¹Key Laboratory of Epigenetic Regulation and Intervention, Center for Excellence in Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
²College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
³Beijing Polyseq Biotech Co., Ltd., Beijing, China

Nanopore sequencing, renowned for its ability to sequence DNA and RNA directly with read lengths extending to several hundred kilobases or even megabases, holds significant promise in fields like transcriptomics and other omics studies. Despite its potential, the technology’s limited accuracy in base identification has restricted its widespread application. Although many algorithms have been developed to improve DNA decoding, advancements in RNA sequencing remain limited. Addressing this challenge, we introduce GCRTcall, a novel approach integrating Transformer architecture with gated convolutional networks and relative positional encoding for RNA sequencing signal decoding. Our evaluation demonstrates that GCRTcall achieves state-of-the-art performance in RNA basecalling.

Introduction

Nanopore sequencing technology directly sequences single strands of DNA or RNA by detecting changes in electrical current as the molecules pass through nanopores, eliminating the need for PCR amplification. This technique enables rapid single-molecule sequencing with significantly increased read lengths, reaching hundreds of kilobases or even megabases. It holds immense potential in various omics sequencing studies such as genomics, transcriptomics, epigenomics, and proteomics (Amarasinghe et al., 2020; Garalde et al., 2018; Jain et al., 2022; Sun et al., 2020; Jain et al., 2018; Davenport et al., 2021; Quick et al., 2016; Wang et al., 2015; Faria et al., 2017; Yakovleva et al., 2022; Boykin et al., 2019; Zhao et al., 2022).

Despite these advantages, basecalling accuracy has emerged as a significant bottleneck that limits the broader application of nanopore sequencing. Sequencing signals are influenced not only by individual nucleotides but also by neighboring bases. In addition, sequences translocate through the pore non-uniformly, and the signals, measured in picoamperes (pA), have low signal-to-noise ratios. These challenges make accurate basecalling in nanopore sequencing particularly difficult (Jain et al., 2022; Wick et al., 2019; Wang et al., 2021).

In recent years, several algorithms have been developed to improve the accuracy of nanopore sequencing signal decoding and methylation detection. Methods like Metrichor and Nanocall (David et al., 2017), which utilize Hidden Markov Models (HMM) (Niu et al., 2022), segment events in the current signal and calculate transition probabilities for basecalling. Other approaches, such as Chiron (Teng et al., 2018), Deepnano (Boža et al., 2017), and Guppy, leverage Recurrent Neural Network (RNN) architectures, while Causalcall (Zeng et al., 2019) and RODAN (Neumann et al., 2022) employ Convolutional Neural Network (CNN) architectures to achieve end-to-end basecalling. Additionally, SACall incorporates self-attention mechanisms into nanopore signal decoding (Huang et al., 2022). Liu Q. et al. (2019) proposed DeepMod based on bidirectional long short-term memory (LSTM) (Chen et al., 2022) architecture to detect DNA modifications. Ni et al. (2019) developed DeepSignal by combining LSTM and Inception structure for DNA methylation prediction. Yin et al. (2024) constructed NanoCon through Transformer and contrastive learning for DNA modification detection. However, with the exception of RODAN (Neumann et al., 2022), the focus of these methods is primarily on DNA basecalling and modification prediction, with limited research in RNA decoding.

Whereas DNA translocates at several hundred bases per second (bps), RNA translocates at only about 100 bps or less. Additionally, there are substantial differences in the physical and chemical properties of DNA and RNA, resulting in distinct signal patterns. Consequently, models designed for DNA basecalling are usually ineffective for RNA signal decoding. To address this gap, we propose GCRTcall, a Transformer-based basecaller for nanopore RNA sequencing, enhanced by Gated Convolution and Relative position embedding through joint loss training. This method achieves state-of-the-art decoding accuracy on multi-species transcriptome sequencing data.

Materials and methods

Benchmark dataset

The benchmark dataset used in this study was proposed by Neumann et al. (2022) and was also used in the development of RODAN.

The training set comprises five species: Arabidopsis thaliana from (Neumann et al., 2022), Epinano synthetic constructs from (Liu H. et al., 2019), Homo sapiens from (Workman et al., 2019), Caenorhabditis elegans from (Roach et al., 2020), and Escherichia coli from (Grünberger et al., 2019). Initially, all reads were basecalled using Guppy version 6.2.1 (Technologies and O.N., 2024). The decoded sequences were then mapped to the reference genome with minimap2 (Li, 2018) to obtain corrected sequences. Subsequently, Taiyaki (Technologies and O.N., 2019) was used to generate an HDF5 file containing the raw signal of each read, its corresponding corrected sequence, and their mapping relationship. The training dataset contained 116,072 reads: 24,370 from Arabidopsis, 29,728 from Epinano synthetic constructs, 30,048 from H. sapiens, 24,192 from C. elegans, and 7,734 from E. coli.

To ensure rigorous performance evaluation and avoid potential biases from overlapping datasets, we used test samples derived from entirely independent studies, distinct from those used for training. The test set also included five species: H. sapiens from (Workman et al., 2019), A. thaliana from (Parker et al., 2020), Mus musculus from (Bilska et al., 2019), Saccharomyces cerevisiae from (Jenjaroenpun et al., 2021), and Populus trichocarpa from (Gao et al., 2021), each consisting of 100,000 reads.

Model architecture

Our model architecture was inspired by Google’s Conformer (Gulati et al., 2020), a convolution-augmented Transformer known for effectively modeling both global and local dependencies, outperforming traditional Transformer (Zhang et al., 2020; Vaswani et al., 2017; Liu B. et al., 2019; Zhang et al., 2024) and CNN (Li et al., 2019; Kriman et al., 2019; Han et al., 2020; Abdel-Hamid et al., 2014; Li et al., 2021; Li and Liu, 2023; Li X. et al., 2024) models in speech recognition tasks. GCRTcall comprises three CNN layers for downsampling and feature extraction, with output channels of 4, 6, and 512, convolutional kernel sizes of 5, 5, and 19, and strides of 1, 1, and 10, respectively. These layers are followed by 8 Conformer blocks and a connectionist temporal classification (CTC) decoder (Gulati et al., 2020; Wang et al., 2023; Zhu et al., 2023), for a total of 50 million parameters. Our previous study indicated that training with a joint loss, combining CTC loss and Kullback-Leibler divergence (KLDiv) loss, yields higher basecalling accuracy than training with CTC loss alone under the same inference structure, and that keeping the decoder at inference time does not affect decoding accuracy but significantly decreases inference speed (Li Q. et al., 2024). Therefore, GCRTcall incorporates additional forward and backward Transformer decoders on top of the encoder during the training phase only, using the joint loss for improved convergence. The model architecture of GCRTcall is illustrated in Figure 1.
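The downsampling performed by the three CNN layers can be made concrete with the standard output-length formula for a 1-D convolution. The following is a minimal sketch rather than the authors' code: the kernel sizes and strides come from the text, whereas the padding of `kernel // 2` and the 4096-sample chunk length are assumptions chosen only for illustration.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


# (kernel size, stride) of the three CNN layers, as given in the text.
LAYERS = [(5, 1), (5, 1), (19, 10)]


def downsampled_len(n_samples):
    """Number of frames reaching the Conformer blocks for a raw-signal chunk."""
    length = n_samples
    for kernel, stride in LAYERS:
        # Assumption: padding = kernel // 2 (the paper does not state the padding).
        length = conv1d_out_len(length, kernel, stride, kernel // 2)
    return length


# 4096 is a hypothetical chunk length, not a value taken from the paper.
print(downsampled_len(4096))
```

Under these assumptions only the third layer changes the sequence length, so the Conformer blocks operate on roughly one frame per ten raw signal samples.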

Figure 1

Figure 1. Schematic representation of the architecture of GCRTcall. GCRTcall comprises three CNN layers for downsampling and feature extraction, followed by 8 Conformer blocks and a CTC decoder. During training, a pair of forward and reverse decoders is added on top of the base architecture for joint loss training.

Compared to traditional Transformers, the Conformer modules in GCRTcall feature two key improvements: First, they combine relative positional embedding with a multi-head self-attention mechanism to enhance the model’s robustness to inputs of varying lengths. Second, they integrate depthwise separable convolutions based on gate mechanisms to process the outputs of attention layers, thereby strengthening the model’s ability to capture local dependencies within sequences.

Relative position multi-head self-attention mechanism

Transformer-XL (Dai et al., 2019) integrates relative positional embedding with a self-attention mechanism, enhancing the model’s representational capacity for sequences of varying lengths. The relative position multi-head self-attention mechanism takes the input sequence together with its sinusoidal position encoding. It applies three linear projections to the input to generate $Q$, $K$, and $V$, and one linear projection to the positional embedding to obtain $K_p$. Two biases, $b_k$ and $b_p$, are initialized. Relative position self-attention is then computed as follows:

$$\mathrm{RelativeAttention} = \mathrm{softmax}\left(\frac{(Q + b_k) K^{T} + \mathrm{relative\_shift}\left[(Q + b_p) K_p^{T}\right]}{\sqrt{d_k}}\right) V$$

The multi-head attention then combines and projects the aforementioned attention computation results as follows:

$$\mathrm{MultiHead} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
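The relative_shift operation in the attention formula is easiest to follow on a small example. The sketch below is written in plain Python and assumes the bidirectional layout commonly used in Conformer-style encoders: for a sequence of length T, the position scores form a T × (2T − 1) matrix whose column c holds the score for relative offset T − 1 − c. This layout and the function name are illustrative assumptions; the paper does not give these implementation details.

```python
def relative_shift(scores):
    """Realign position scores so that out[i][j] is the score for offset i - j.

    scores: T rows and 2T - 1 columns; column c corresponds to the
    relative offset T - 1 - c (assumed layout, see lead-in).
    """
    t = len(scores)
    width = len(scores[0])
    if width != 2 * t - 1:
        raise ValueError("expected 2T - 1 columns of position scores")
    # 1. Prepend one zero column: shape (T, 2T).
    padded = [[0.0] + list(row) for row in scores]
    # 2. Flatten and drop the first T entries, which is equivalent to viewing
    #    the data as (2T, T) and discarding the first row.
    flat = [value for row in padded for value in row][t:]
    # 3. Reshape back to (T, 2T - 1) and keep the first T columns.
    rows = [flat[r * width:(r + 1) * width] for r in range(t)]
    return [row[:t] for row in rows]


# Fill each column with the offset it represents, then check the alignment.
T = 4
demo = [[T - 1 - c for c in range(2 * T - 1)] for _ in range(T)]
shifted = relative_shift(demo)
for row in shifted:
    print(row)
```

After the shift, entry (i, j) contains the score for the offset i − j, so the position term can be added elementwise to the content term $(Q + b_k) K^{T}$ before the softmax.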

Gated depthwise separable convolution

EfficientNet (Tan et al., 2019) uses depthwise separable convolution to reduce the number of parameters and improve computational efficiency while maintaining state-of-the-art accuracy. Similarly, Dauphin et al. (2016) proposed gated convolutional networks, which use CNNs to extract hidden states from sequences and employ gated linear units (GLU) to increase non-linear expressiveness and mitigate the vanishing gradient problem. Because this approach can be computed in parallel, it outperformed LSTMs on multiple NLP datasets. GLU is computed as follows:

$$\mathrm{GLU}(a, b) = a \otimes \sigma(b)$$

where $a$ is the first half of the input matrices and $b$ is the second half.
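As a minimal sketch of the gating formula above, the following plain-Python function splits a feature vector into its two halves and gates the first half with the sigmoid of the second. It operates on a single vector for clarity; the function names are illustrative and not taken from the GCRTcall code.

```python
import math


def sigmoid(x):
    """Logistic function used as the gate."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(features):
    """Gated linear unit: first half of the features times sigmoid(second half)."""
    if len(features) % 2 != 0:
        raise ValueError("GLU needs an even number of features")
    half = len(features) // 2
    a, b = features[:half], features[half:]
    # Elementwise product a * sigma(b); the output has half the input size.
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]


# A gate input of 0 gives sigma(0) = 0.5, so each value is halved.
print(glu([2.0, -4.0, 0.0, 0.0]))
```

Because the output has half as many features as the input, the pointwise convolution that precedes a GLU typically doubles the channel count so that the gated result returns to the original width.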

Inspired by these approaches, the structure of the gated depthwise separable convolution block in GCRTcall is illustrated in Figure 2.

Figure 2

Figure 2. Architecture of the gated depthwise separable convolution block of GCRTcall. The convolution block consists of a 1-D pointwise convolution followed by a GLU, a 1-D depthwise convolution, 1-D batch normalization, and a swish activation function.

Joint loss training

During training, an additional pair of forward and reverse Transformer decoders is added on top of the inference structure of the model. The forward decoder adopts a lower triangular matrix as a causal mask, while the reverse decoder is equipped with an anti-lower triangular causal mask.

The model is trained by optimizing a joint loss that includes CTC loss and KLDiv loss to ensure convergence.

$$L_{joint}(x, y) = \lambda L_{CTC}(x_{E}, y) + (1 - \lambda) L_{KLDiv}(x_{D}, y)$$

where $x_E$ is the output probability matrix of the encoder, $x_D$ is the output of the decoders, $y$ is the label, and $\lambda$ is a hyperparameter between 0 and 1. In this paper, $\lambda$ was set to 0.5.
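The weighting in the joint loss can be illustrated with a short sketch. The two component losses are passed in as plain numbers here, and the discrete KL divergence helper is included only to show what the second term measures; how the decoder targets are constructed is not specified in the text, so that part is left out rather than guessed.

```python
import math


def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions over the same classes."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def joint_loss(ctc_loss, kldiv_loss, lam=0.5):
    """Weighted sum of the encoder CTC loss and the decoder KLDiv loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie between 0 and 1")
    return lam * ctc_loss + (1.0 - lam) * kldiv_loss


# Hypothetical loss values, used only to show the weighting with lambda = 0.5.
kld = kl_divergence([1.0, 0.0], [0.5, 0.5])
print(joint_loss(ctc_loss=2.0, kldiv_loss=kld))
```

With $\lambda = 0.5$ both terms contribute equally; setting $\lambda = 1$ would reduce the objective to CTC-only training.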

Model training

As previously demonstrated, a joint loss that combines CTC loss and KLDiv loss helps accelerate model convergence (Li Q. et al., 2024). Therefore, during training, we added two layers of forward and backward Transformer decoders on top of GCRTcall; these decoders are not used during inference.

GCRTcall was trained on an Ubuntu server equipped with 2 × 2.10 GHz 36-core CPUs, providing 144 logical CPUs, and 512 GB of RAM. Training used 2 NVIDIA RTX 6000 Ada Generation 48 GB GPUs for 12 epochs, totaling 12.95 h. A batch size of 140 was used with the Ranger optimizer at a learning rate of 0.002 and a weight decay of 0.01. The learning rate was adjusted by the ReduceLROnPlateau scheduler, which monitored the validation loss with a patience of 1, a factor of 0.5, and a threshold of 0.1.
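To show how these scheduler settings behave, the plateau rule can be simulated in a few lines of plain Python. This is a simplified sketch and not the training code: it assumes the relative threshold mode (a loss counts as an improvement only if it falls below the best loss by more than 10%), ignores cooldown and minimum learning rate, and uses made-up validation losses.

```python
def plateau_schedule(val_losses, lr=0.002, factor=0.5, patience=1, threshold=0.1):
    """Return the learning rate in effect after each epoch's validation step."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        # Assumption: relative threshold, i.e. improvement means loss < best * (1 - threshold).
        if loss < best * (1.0 - threshold):
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce the learning rate once the number of bad epochs exceeds the patience.
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical validation losses for six epochs.
print(plateau_schedule([1.0, 0.95, 0.94, 0.5, 0.49, 0.48]))
```

With a patience of 1, two consecutive epochs without a sufficiently large improvement halve the learning rate, which makes the schedule fairly aggressive for a 12-epoch run.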

Performance evaluation

Identity, mismatch rate, insertion rate, and deletion rate were adopted as metrics to evaluate the decoding accuracy of the model. The overall medians of these metrics are commonly used in basecaller studies for performance evaluation and comparison:

$$\mathrm{Identity} = \frac{\text{Number of matched bases}}{\text{Length of alignment}} \times 100\%$$

$$\mathrm{Mismatch\ rate} = \frac{\text{Number of mismatched bases}}{\text{Length of alignment}} \times 100\%$$

$$\mathrm{Insertion\ rate} = \frac{\text{Number of inserted bases}}{\text{Length of alignment}} \times 100\%$$

$$\mathrm{Deletion\ rate} = \frac{\text{Number of deleted bases}}{\text{Length of alignment}} \times 100\%$$
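The four metrics can be computed per read from alignment counts and then summarized by their median across reads. The sketch below assumes that the alignment length is the sum of matched, mismatched, inserted, and deleted bases, which is a common convention but is not spelled out in the text; the read counts are invented for illustration.

```python
from statistics import median


def alignment_metrics(matches, mismatches, insertions, deletions):
    """Per-read identity and error rates, in percent of the alignment length."""
    # Assumption: alignment length = matches + mismatches + insertions + deletions.
    length = matches + mismatches + insertions + deletions
    if length == 0:
        raise ValueError("empty alignment")
    return {
        "identity": 100.0 * matches / length,
        "mismatch": 100.0 * mismatches / length,
        "insertion": 100.0 * insertions / length,
        "deletion": 100.0 * deletions / length,
    }


def median_metric(reads, name):
    """Median of one metric over reads given as (match, mismatch, ins, del) tuples."""
    return median(alignment_metrics(*read)[name] for read in reads)


# Hypothetical counts for three aligned reads.
reads = [(90, 4, 3, 3), (80, 10, 5, 5), (95, 2, 2, 1)]
print(median_metric(reads, "identity"))
```

By construction the four percentages of a read sum to 100, so a gain in identity must come from a reduction in at least one of the three error rates.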

Results and discussion

Comparison of decoding performance with different basecallers

We compared the basecalling accuracy of GCRTcall, Guppy 6.2.1, Dorado 0.8.1, and RODAN on a test set consisting of five species. All basecalling results were aligned to the reference genomes using minimap2, retaining only the optimal alignments. As shown in Table 1, GCRTcall achieved state-of-the-art accuracy across all five species. Notably, although Neumann et al. (2022) reported that RODAN only slightly outperforms Guppy in basecalling accuracy for mouse and yeast, GCRTcall significantly outperforms both for these two species. Additionally, all four basecallers performed worst on mouse, consistent with previous findings that suggest substantial differences in sequencing signal patterns between mouse and other species. We further compared the basecallers at different decoding lengths (Figure 3). GCRTcall performs best across all lengths and maintains the leading decoding accuracy even for extremely long reads.

Table 1

Table 1. Performance comparison between GCRTcall, RODAN, Dorado, and Guppy.

Figure 3

Figure 3. Performance comparison of different basecallers at different decoding lengths. (A) Identity rate comparison of different basecallers under different sequence lengths. (B) Insertion rate comparison of different basecallers under different sequence lengths. (C) Deletion rate comparison of different basecallers under different sequence lengths. (D) Mismatch rate comparison of different basecallers under different sequence lengths.

Inference was conducted on an Ubuntu server equipped with an Intel i9-13900K CPU, 125 GB of RAM, and one NVIDIA RTX 3090 24 GB GPU. The inference speed of the different basecallers was also evaluated and compared. Dorado achieved the fastest decoding speed, 4.86E+07 samples per second, owing to its extensive industrial optimization. Guppy reached 1.02E+07 samples per second, owing to its smaller parameter count of 2.2M. RODAN followed at 4.68E+06 samples per second, while GCRTcall, with its 50M parameters, decoded at 1.68E+06 samples per second. Several acceleration techniques for Transformer-based models, such as hardware-aware optimization, sparse attention, and model quantization, have been proposed to enhance inference speed. These techniques will be tested in the future development of GCRTcall.
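The reported throughputs are easier to compare as ratios. The short calculation below uses only the four speeds quoted in this paragraph and expresses each one relative to GCRTcall; the dictionary and function names are illustrative.

```python
# Decoding speeds in samples per second, as reported in the text.
SPEEDS = {
    "Dorado": 4.86e7,
    "Guppy": 1.02e7,
    "RODAN": 4.68e6,
    "GCRTcall": 1.68e6,
}


def relative_speed(name, baseline="GCRTcall"):
    """Throughput of one basecaller as a multiple of the baseline's throughput."""
    return SPEEDS[name] / SPEEDS[baseline]


for name in SPEEDS:
    print(f"{name}: {relative_speed(name):.1f}x GCRTcall")
```

On this hardware Dorado is therefore roughly 29 times faster than GCRTcall, Guppy about 6 times, and RODAN just under 3 times.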

In addition, we compared the decoding performance of Guppy, Dorado, and GCRTcall on an RNA004 sample of HEK293T cells from (Chen et al., 2021), as presented in Table 2. Because GCRTcall was trained on RNA002 data, and RNA004 differs significantly from RNA002 in signal characteristics and sequence composition, GCRTcall’s performance on RNA004 is understandably inferior to that of Guppy and Dorado. Handling the distinct signal features and species-specific differences in RNA004 requires stronger generalization than GCRTcall, which was not trained on these data patterns, currently provides, leading to a decline in performance.

Table 2

Table 2. Performance comparison between Dorado, Guppy, and GCRTcall on RNA004 sample.

In contrast, both Guppy and Dorado have been optimized with profiles specifically tailored for RNA004 sequencing data, enabling them to better adapt to RNA004 and its corresponding human cell line data. While GCRTcall performs well on RNA002, the performance discrepancy on RNA004 highlights limitations in the model’s generalization abilities across different RNA sequencing datasets.

To enhance GCRTcall’s performance on RNA004 and other emerging datasets, we plan to expand the model’s training set in future work to include more diverse RNA data sources, particularly sequencing data from RNA004 and other human cell lines. By incorporating broader datasets, we expect a significant improvement in the model’s generalization capabilities. Additionally, we aim to explore fine-tuning strategies specifically tailored for different RNA datasets to better address the challenges posed by varying signal patterns.

In future studies, we will focus on enhancing the model’s decoding accuracy, particularly in handling novel RNA sequencing data and more complex signal patterns. By integrating larger, more diverse datasets with continuous architectural optimizations, we expect that GCRTcall will achieve more stable and efficient performance across a wider range of transcriptomic applications.

Ablation study

To further explore the impact of model structure on the basecalling accuracy of GCRTcall, we conducted two sets of ablation experiments: first, removing the relative shift operation for position scores (GCRTcall w/o RS); and second, replacing the Conformer modules with Transformer modules (Transcall).

In Transformer-XL, absolute position representation is initially performed to reduce the computational complexity of relative positional encoding. A relative shift of position scores is then applied to obtain relative position embeddings for sequences. To investigate the impact of relative position embeddings on model performance, GCRTcall was trained without the relative shift operation for 12 epochs using the same training set. The test results (Table 3) show a decrease in decoding performance compared to GCRTcall, indicating that relative position embeddings enhance the robustness of attention mechanisms to sequence position representation.

Table 3

Table 3. Performance comparison between GCRTcall, GCRTcall w/o RS, and Transcall.

To investigate the impact of gated convolution neural networks on model performance, Transformer modules were used to replace Conformer modules, and the model was also trained for 12 epochs. The test results (Table 3) indicate that the model’s decoding performance deteriorated compared to GCRTcall and GCRTcall w/o RS. This suggests that gated convolutional networks, which enhance the representation of local dependencies, play an important role in accurate basecalling.

The training curves of GCRTcall, GCRTcall w/o RS, and Transcall are illustrated in Figure 4. The form of position encoding has little impact on convergence during training; its main effect is to enhance the model’s generalization ability when decoding sequences of varying lengths. However, Transcall, without convolutional enhancement, converges more slowly and to a higher loss than both GCRTcall and GCRTcall w/o RS.

Figure 4

Figure 4. Training curves of GCRTcall, GCRTcall w/o RS, and Transcall. GCRTcall and GCRTcall w/o RS exhibit similar training curves, whereas Transcall, without convolutional enhancement, converges more slowly and to a higher loss than both GCRTcall and GCRTcall w/o RS.

CNNs are proficient at capturing local features because they apply convolutional filters across input sequences (Xiang et al., 2023). By integrating convolutional modules within each encoder layer, the model can effectively capture local patterns and features intrinsic to sequential data, which is crucial for recognizing local dependencies. Furthermore, convolutional operations can capture information at various scales by using different kernel sizes. This allows the model to integrate multi-scale contextual information, enhancing its representational capacity across different temporal granularities. Combining self-attention and convolution lets the model capitalize on their complementary strengths: self-attention mechanisms are adept at capturing global dependencies and long-range relationships, whereas convolutional operations excel at extracting local features. This combination enables the model to handle both local and global contextual information efficiently.

Relative positional embedding captures the relative positional relationships between elements in a sequence, as opposed to absolute positional encoding which only indicates the position of each element. This approach is particularly beneficial for handling sequences of varying lengths, as it remains invariant to the length of the input sequence, thereby improving the model’s robustness. By using relative positional embedding, the model can flexibly represent positional information, which is crucial for tasks that rely heavily on the sequential nature of data, such as nanopore signal decoding. This encoding method allows the model to maintain a consistent representation of positional relationships, improving its ability to model sequences effectively.

Conclusion

This study introduces GCRTcall, a Transformer-based basecaller designed for nanopore RNA sequencing signal decoding. GCRTcall is trained using a joint loss approach and is enhanced with gated depthwise separable convolution and relative position embeddings. Our experiments demonstrate that GCRTcall achieves state-of-the-art performance in nanopore RNA sequencing signal basecalling, outperforming existing methods in terms of accuracy and robustness. These results highlight the effectiveness of integrating advanced transformer architectures with convolutional enhancements for improving RNA sequencing accuracy.

Overall, GCRTcall represents a step forward in nanopore RNA sequencing, offering a robust and precise solution that can facilitate a deeper understanding of transcriptomics and other related fields.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors.

Author contributions

QL: Writing–original draft, Writing–review and editing. CS: Writing–review and editing. DW: Writing–review and editing. JL: Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work is partly supported by grants from the Ministry of Science and Technology of China (2019YFA0707001 and 2021YFF0700201), the Major Project of Guangzhou National Laboratory (GZNL2024A01006), and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB37020102).

Conflict of interest

Authors CS, DW, and JL were employed by Beijing Polyseq Biotech Co., Ltd.

The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abdel-Hamid, O., Mohamed, A. R., Jiang, H., Deng, L., Penn, G., and Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process. 22 (10), 1533–1545. doi:10.1109/taslp.2014.2339736