AUTHOR=Salekin Sirajul, Mostavi Milad, Chiu Yu-Chiao, Chen Yidong, Zhang Jianqiu, Huang Yufei

TITLE=Predicting Sites of Epitranscriptome Modifications Using Unsupervised Representation Learning Based on Generative Adversarial Networks

JOURNAL=Frontiers in Physics

VOLUME=Volume 8 - 2020

YEAR=2020

URL=https://www.frontiersin.org/journals/physics/articles/10.3389/fphy.2020.00196

DOI=10.3389/fphy.2020.00196

ISSN=2296-424X

ABSTRACT=Epitranscriptomics is an exciting field that studies modifications in transcripts. Of significant interest is the prediction of different modification sites from transcript sequences. However, the very limited number of positive sites for most modifications poses significant challenges for training robust algorithms. To circumvent this problem, we proposed MR-GAN, a novel Adversarially Learned Inference (ALI) algorithm, which is trained in an unsupervised fashion on entire pre-mRNA sequences to learn a low-dimensional embedding of transcriptomic sequences. MR-GAN was then applied to extract embeddings of the sequences in a training dataset we created for eight epitranscriptome modifications including m6A, m1A, m1G, m2G, m5C, m5U, 2′-O-Me, Pseudouridine (Ψ) and Dihydrouridine (D), for which training samples are scarce. Prediction models were trained on the embeddings extracted by MR-GAN. We compared the prediction performance with that of one-hot encoding of the training sequences and with SRAMP, a state-of-the-art m6A site prediction algorithm, and demonstrated that the learned embeddings outperform one-hot encoding by a significant margin, with up to 15% improvement. Using MR-GAN, we also investigated the sequence motifs for each modification type and uncovered known motifs as well as new motifs that could not be found from the sequences directly. The results demonstrate that transcriptome features extracted using unsupervised learning can lead to high precision in predicting multiple types of epitranscriptome modifications, even when the dataset is small and extremely imbalanced.