AUTHOR=Xu Yi, Cao Liyu, Zhao Xinyi, Yao Yinghao, Liu Qiang, Zhang Bin, Wang Yan, Mao Ying, Ma Yunlong, Ma Jennie Z., Payne Thomas J., Li Ming D., Li Lanjuan TITLE=Prediction of Smoking Behavior From Single Nucleotide Polymorphisms With Machine Learning Approaches JOURNAL=Frontiers in Psychiatry VOLUME=11 YEAR=2020 URL=https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2020.00416 DOI=10.3389/fpsyt.2020.00416 ISSN=1664-0640 ABSTRACT=
Smoking is a complex behavior with a heritability as high as 50%. Such a large genetic contribution offers an opportunity to predict inherited predisposition from genomic profiles and thereby prevent individuals who are susceptible to smoking dependence from ever starting to smoke. Although previous studies have identified many susceptibility variants for smoking, these variants have limited power to predict smoking behavior. We applied support vector machine (SVM) and random forest (RF) methods to build prediction models for smoking behavior. We first built the models with 10-fold cross-validation on 1,431 smokers and 1,503 non-smokers of African origin and then tested them on an independent dataset of 213 smokers and 224 non-smokers. The SVM model with the 500 top single nucleotide polymorphisms (SNPs) selected by logistic regression (p < 0.01) achieved areas under the curve (AUCs) of 0.691, 0.721, and 0.720 for the training, test, and independent test samples, respectively. The RF model with the 500 top SNPs selected by logistic regression (p < 0.01) achieved AUCs of 0.671, 0.665, and 0.667 for the training, test, and independent test samples, respectively. Finally, we combined logistic regression (p < 0.01) and LASSO regression (λ = 10⁻³) for feature selection and used the SVM algorithm for model building. The resulting SVM model with the 500 top SNPs achieved AUCs of 0.756, 0.776, and 0.897 for the training, test, and independent test samples, respectively. We conclude that machine learning methods are a promising means of building predictive models for smoking.
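
The evaluation protocol in the abstract rests on two generic building blocks: splitting the samples into 10 cross-validation folds and scoring a classifier by its AUC. The sketch below illustrates both in plain Python and is not the authors' code. The function names `auc` and `stratified_folds` are hypothetical, and stratifying the folds by class is an assumption, since the abstract does not say how the folds were drawn. The AUC is computed from the rank-sum (Mann-Whitney) identity, with tied scores given their average rank.

```python
import random


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity; labels are 0 or 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Class sizes taken from the abstract: 1,431 smokers (1), 1,503 non-smokers (0).
labels = [1] * 1431 + [0] * 1503
folds = stratified_folds(labels, k=10)
```

In each round of cross-validation, one fold would serve as the held-out test set while a model is fitted on the other nine. Calling `auc` on the held-out labels and the model's scores then gives the per-fold AUC.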