AUTHOR=Cao Quy, Sun Xinxin, Rajesh Karun, Chalasani Naga, Gelow Kayla, Katz Barry, Shah Vijay H., Sanyal Arun J., Smirnova Ekaterina

TITLE=Effects of Rare Microbiome Taxa Filtering on Statistical Analysis

JOURNAL=Frontiers in Microbiology

VOLUME=Volume 11 - 2020

YEAR=2021

URL=https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2020.607325

DOI=10.3389/fmicb.2020.607325

ISSN=1664-302X

ABSTRACT=The accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors, which lead either to falsely identifying microbial taxa that were not in the sample or to misclassifying the taxa of DNA fragment reads. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. This approach reduces the extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured “mock” datasets, where the true taxa compositions are known. Although filtering is frequently used, a careful evaluation of its effect on data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on alpha and beta diversity estimation, as well as its impact on identifying taxa that discriminate between disease states.
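
The filtering rule defined above can be illustrated with a minimal sketch. It is not the procedure used in the study: the function name `filter_rare_taxa`, the thresholds `min_prevalence` and `min_count`, and the toy count table are all hypothetical, chosen only to show a taxon being dropped when it is both rarely observed and low in count wherever it is observed.

```python
def filter_rare_taxa(counts, min_prevalence=0.5, min_count=5):
    """Drop taxa that are BOTH rarely observed and low in count.

    counts: dict mapping taxon name -> list of per-sample read counts.
    min_prevalence: hypothetical cutoff on the fraction of samples
        in which the taxon has a nonzero count.
    min_count: hypothetical cutoff on the largest count the taxon
        reaches in any sample where it is observed.
    """
    kept = {}
    for taxon, row in counts.items():
        prevalence = sum(1 for c in row if c > 0) / len(row)
        largest = max(row)
        # A taxon is removed only when both conditions hold.
        if prevalence < min_prevalence and largest < min_count:
            continue
        kept[taxon] = row
    return kept


# Toy count table (made-up numbers): rows are taxa, columns are samples.
table = {
    "Bacteroides": [120, 98, 143, 110],
    "Prevotella": [0, 35, 0, 60],
    "rare_taxon_A": [0, 1, 0, 0],
    "rare_taxon_B": [2, 0, 0, 0],
    "bloom_taxon": [0, 0, 0, 500],
}

filtered = filter_rare_taxa(table, min_prevalence=0.5, min_count=5)
print(sorted(filtered))
```

In this toy table the two singleton-like taxa are removed, while `bloom_taxon` is retained because its count is high in the one sample where it appears. Real analyses would choose thresholds from the data or use a dedicated filtering method.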

The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets, in which the same cultured samples with known microbial composition are processed at different labs, and two disease study datasets. Results show that in the microbiome quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs, while preserving between-sample similarity (beta diversity). In the disease study datasets, DESeq2 and linear discriminant analysis Effect Size (LEfSe) methods were used to identify taxa that are differentially abundant across groups of samples, and random forest models were used to rank the features with the largest contribution to disease classification. Results reveal that filtering retains significant taxa and preserves the model classification ability measured by the area under the receiver operating characteristic curve (AUC). A comparison of filtering with a contaminant removal method shows that the two have complementary effects and are best used in conjunction.
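
To make the alpha and beta diversity comparison concrete, the sketch below computes observed richness and the Shannon index (alpha diversity) and the Bray-Curtis dissimilarity (beta diversity) for one mock sample processed at two labs, before and after rare taxa are dropped. The abstract does not state which diversity measures were used, so these metrics, the lab profiles, and the choice of which taxa count as rare are assumptions made purely for illustration.

```python
import math


def richness(sample):
    """Observed richness: number of taxa with a nonzero count."""
    return sum(1 for c in sample if c > 0)


def shannon(sample):
    """Shannon index (natural log) of one sample's count vector."""
    total = sum(sample)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in sample if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    denom = sum(a) + sum(b)
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom


# Made-up counts for the same mock sample sequenced at two labs.
# Taxa at positions 2 and 3 are spurious low-count detections in lab A.
lab_a = [100, 100, 1, 1, 0]
lab_b = [100, 100, 0, 0, 0]

# Assume a filtering step has flagged positions 2, 3, and 4 as rare.
keep = [0, 1]
lab_a_filtered = [lab_a[i] for i in keep]
lab_b_filtered = [lab_b[i] for i in keep]

print("richness before:", richness(lab_a), richness(lab_b))
print("richness after: ", richness(lab_a_filtered), richness(lab_b_filtered))
print("Bray-Curtis before:", bray_curtis(lab_a, lab_b))
print("Bray-Curtis after: ", bray_curtis(lab_a_filtered, lab_b_filtered))
```

In this toy case the richness gap between labs disappears after filtering, while the Bray-Curtis dissimilarity, already close to zero, changes very little, which mirrors the qualitative pattern the abstract describes.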

Filtering reduces the complexity of microbiome data while preserving their integrity in downstream analysis. This mitigates the sensitivity of classification methods and reduces technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.