AUTHOR=Ten Berk de Boer Esmee, Bilgrav Saether Kristine, Eisfeldt Jesper

TITLE=Discovery of non-reference processed pseudogenes in the Swedish population

JOURNAL=Frontiers in Genetics

VOLUME=Volume 14 - 2023

YEAR=2023

URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2023.1176626

DOI=10.3389/fgene.2023.1176626

ISSN=1664-8021

ABSTRACT=The vast majority of the human genome is non-coding. Non-coding features are diverse, and some of them are functionally important. Although non-coding regions constitute the majority of the genome, they remain understudied and were long referred to as junk DNA. Pseudogenes are one such feature. A pseudogene is a non-functional copy of a protein-coding gene. Pseudogenes may arise through a variety of genetic mechanisms. Processed pseudogenes are formed through reverse transcription of mRNA by LINE elements, after which the cDNA is integrated into the genome. Processed pseudogenes are known to vary across populations; however, the extent of this variability and its distribution remain unknown. Herein, we apply a custom-designed processed pseudogene pipeline to whole-genome sequencing data from 3500 individuals: 2500 from the 1000 Genomes dataset and 1000 Swedish individuals. Through these analyses, we discover over 3000 pseudogenes missing from the GRCh38 reference. Using our pipeline, we position 74% of the detected processed pseudogenes, which allows analysis of their formation. Notably, we find that common structural variant callers, such as Manta and Delly, classify the processed pseudogenes as deletion events, which are later predicted to be truncating variants. By compiling lists of non-reference processed pseudogenes and their frequencies, we find great variability among pseudogenes, indicating that non-reference processed pseudogenes may be useful for DNA testing and as population-specific markers. In summary, our findings highlight the great diversity of processed pseudogenes, show that processed pseudogenes are actively formed in the human genome, and indicate that our pipeline may be used to reduce false-positive structural variant calls caused by the misalignment and subsequent misclassification of non-reference processed pseudogenes.