AUTHOR=Lin Ping-I, Moni Mohammad Ali, Gau Susan Shur-Fen, Eapen Valsamma

TITLE=Identifying Subgroups of Patients With Autism by Gene Expression Profiles Using Machine Learning Algorithms

JOURNAL=Frontiers in Psychiatry

VOLUME=12

YEAR=2021

URL=https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2021.637022

DOI=10.3389/fpsyt.2021.637022

ISSN=1664-0640

ABSTRACT=<p><bold>Objectives:</bold> The identification of subgroups of autism spectrum disorder (ASD) may partially remedy the problem of clinical heterogeneity and thereby facilitate the improvement of clinical management. The current study uses machine learning algorithms to analyze microarray data and identify clusters with relatively homogeneous clinical features.</p><p><bold>Methods:</bold> Whole-genome gene expression microarray data were used to predict communication quotient (SCQ) scores against all probes in order to select differential expression regions (DERs). Gene set enrichment analysis was performed for DERs with a fold-change &gt;2 to identify hub pathways that play a role in the severity of social communication deficits inherent to ASD. We then used two machine learning methods, random forest classification (RF) and support vector machine (SVM), to identify two clusters using DERs. Finally, we evaluated how accurately the clusters predicted language impairment.</p><p><bold>Results:</bold> A total of 191 DERs were initially identified, and 54 of them with a fold-change &gt;2 were selected for the pathway analysis. Cholesterol biosynthesis and metabolism pathways appear to act as hubs that connect other trait-associated pathways and thereby influence the severity of social communication deficits inherent to ASD. Both RF and SVM algorithms yielded a classification accuracy &gt;90% when all 191 DERs were analyzed. The ASD subtypes defined by the presence of language impairment, a strong indicator for prognosis, can be predicted by transcriptomic profiles associated with social communication deficits and with cholesterol biosynthesis and metabolism.</p><p><bold>Conclusion:</bold> The results suggest that both RF and SVM are acceptable machine learning algorithms for identifying ASD subgroups characterized by clinical homogeneity related to prognosis.</p>