AUTHOR=Nagpal Sargun , Pal Ridam , Ashima , Tyagi Ananya , Tripathi Sadhana , Nagori Aditya , Ahmad Saad , Mishra Hara Prasad , Malhotra Rishabh , Kutum Rintu , Sethi Tavpritesh
TITLE=Genomic Surveillance of COVID-19 Variants With Language Models and Machine Learning
JOURNAL=Frontiers in Genetics
VOLUME=13
YEAR=2022
URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2022.858252
DOI=10.3389/fgene.2022.858252
ISSN=1664-8021
ABSTRACT=
Global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape, increased transmissibility, or pathogenicity. Early prediction of the emergence of new strains with these features is critical for pandemic preparedness. We present Strainflow, a supervised, causally predictive model built on unsupervised latent-space features of SARS-CoV-2 genome sequences. Strainflow was trained and validated on 0.9 million sequences from December 2019 to June 2021, and the frozen model was prospectively validated from July 2021 to December 2021. Strainflow captured the rise in cases 2 months ahead of the Delta and Omicron surges in most countries, including predicting a surge in India as early as the beginning of November 2021. Entropy analysis of Strainflow's unsupervised embeddings reveals explore-exploit cycles in genomic feature space, adding interpretability to the deep-learning-based model. We also conducted a codon-level analysis of the model to assess the interpretability and biological validity of the unsupervised features. The Strainflow application is openly available as an interactive web application for prospective genomic surveillance of COVID-19 across the globe.