AUTHOR=Zhu Jiadi, Yuan Ziyang, Shu Lianjie, Liao Wenhui, Zhao Mingtao, Zhou Yan

TITLE=Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data

JOURNAL=Frontiers in Genetics

VOLUME=Volume 12 - 2021

YEAR=2021

URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.642227

DOI=10.3389/fgene.2021.642227

ISSN=1664-8021

ABSTRACT=Next-generation sequencing (NGS) has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which disease type a new patient belongs to. Because RNA-seq data are discrete counts, statistical methods developed for microarray data cannot be applied to them directly. Existing methods usually model RNA-seq data with a discrete distribution, such as the Poisson, the negative binomial, or a mixture of a point mass at zero and a Poisson distribution, which allows for data with an excess of zeros. Accordingly, classifiers corresponding to these three discrete distributions have been developed: Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). However, for a new real dataset, the true underlying distribution is unknown when the classification problem is posed.

Considering that count datasets are frequently characterized by excess zeros and overdispersion, we extended the existing distributions to a mixture of a point mass at zero and a negative binomial distribution, and proposed zero-inflated negative binomial logistic discriminant analysis (ZINBLDA) for classification.
More importantly, we compared the four classification methods from the perspective of their model parameters, since understanding the parameters is necessary for selecting the optimal method for RNA-seq data. Furthermore, we found that the four methods can be transformed into one another in some cases. In simulation studies, we compared and evaluated the performance of these classification methods across a wide range of settings, and we used a decision tree model to help select the optimal classifier for a new RNA-seq dataset. The results on two real datasets agree with the theoretical and simulation results.
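
The way the four underlying count distributions relate to one another can be illustrated with a minimal sketch. The abstract does not give a parameterization, so the one below is an assumption: `mu` is the mean of the count component, `phi` is a dispersion parameter (variance `mu + phi * mu**2`), and `pi` is the probability of the extra point mass at zero. The function name `zinb_pmf` is hypothetical and is not taken from the paper or from any package.

```python
import numpy as np
from scipy.stats import nbinom, poisson


def zinb_pmf(x, mu, phi, pi):
    """P(X = x) for a zero-inflated negative binomial count (assumed parameterization).

    mu  : mean of the count component
    phi : dispersion; the count component has variance mu + phi * mu**2
    pi  : probability of the extra point mass at zero
    """
    x = np.asarray(x)
    if phi > 0:
        size = 1.0 / phi
        # scipy's nbinom(n, p) has mean n * (1 - p) / p, which equals mu here
        base = nbinom.pmf(x, size, size / (size + mu))
    else:
        # phi = 0 is treated as the Poisson limit of the negative binomial
        base = poisson.pmf(x, mu)
    # Mix the count component with a point mass at zero
    return pi * (x == 0) + (1.0 - pi) * base


counts = np.arange(0, 30)

# pi = 0 removes the zero inflation: plain negative binomial
nb_case = zinb_pmf(counts, mu=5.0, phi=0.5, pi=0.0)

# phi = 0 removes the overdispersion: zero-inflated Poisson
zip_case = zinb_pmf(counts, mu=5.0, phi=0.0, pi=0.2)

# pi = 0 and phi = 0 together: plain Poisson
pois_case = zinb_pmf(counts, mu=5.0, phi=0.0, pi=0.0)
```

Under these assumptions, setting `pi` to zero, `phi` to zero, or both recovers the negative binomial, zero-inflated Poisson, and Poisson distributions respectively, which is one sense in which models of this family can reduce to one another.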