AUTHOR=Sanders Lauren M., Chok Hamed, Samson Finsam, Acuna Ana Uriarte, Polo San-Huei Lai, Boyko Valery, Chen Yi-Chun, Dinh Marie, Gebre Samrawit, Galazka Jonathan M., Costes Sylvain V., Saravia-Butler Amanda M.

TITLE=Batch effect correction methods for NASA GeneLab transcriptomic datasets

JOURNAL=Frontiers in Astronomy and Space Sciences

VOLUME=Volume 10 - 2023

YEAR=2023

URL=https://www.frontiersin.org/journals/astronomy-and-space-sciences/articles/10.3389/fspas.2023.1200132

DOI=10.3389/fspas.2023.1200132

ISSN=2296-987X

ABSTRACT=RNA sequencing (RNA-seq) data from space biology experiments promise to yield invaluable insights into the effects of spaceflight on terrestrial biology. However, sample numbers from each study are low due to limited crew availability, hardware, and space. To increase statistical power, spaceflight RNA-seq datasets from different missions are often aggregated. Aggregation, however, can introduce technical variation or "batch effects", often due to differences in sample handling, sample processing, and sequencing platforms. Several computational methods have been developed to correct for technical batch effects, thereby reducing their impact on true biological signals.

In this study, we combined seven mouse liver RNA-seq datasets from NASA GeneLab (part of the NASA Open Science Data Repository) to evaluate several common batch effect correction methods (ComBat and ComBat-seq from the sva R package, and Median Polish, Empirical Bayes, and ANOVA from the MBatch R package). We quantitatively evaluated the ability of these methods to correct for technical batch variables in space biology RNA-seq data using the following criteria: BatchQC, principal component analysis, dispersion separability criterion, log fold change correlation, and differential gene expression analysis. Each combination of batch variable and correction method was then assessed with a custom scoring approach that geometrically probes the space of all allowable scoring functions to yield an aggregate volume-based score, which identifies the optimal correction method for the combined dataset.

Finally, we describe the GeneLab multi-study analysis and visualization portal, which will allow users to access the publicly available space biology ’omics data, select multiple studies to combine for analysis, and examine the presence or absence of batch effects using multiple metrics. If the user chooses to perform batch effect correction, the scoring approach described here can be implemented to identify the optimal correction method for their specific combined dataset prior to analysis.