AUTHOR=Sanders Lauren M. , Chok Hamed , Samson Finsam , Acuna Ana Uriarte , Polo San-Huei Lai , Boyko Valery , Chen Yi-Chun , Dinh Marie , Gebre Samrawit , Galazka Jonathan M. , Costes Sylvain V. , Saravia-Butler Amanda M. TITLE=Batch effect correction methods for NASA GeneLab transcriptomic datasets JOURNAL=Frontiers in Astronomy and Space Sciences VOLUME=10 YEAR=2023 URL=https://www.frontiersin.org/journals/astronomy-and-space-sciences/articles/10.3389/fspas.2023.1200132 DOI=10.3389/fspas.2023.1200132 ISSN=2296-987X ABSTRACT=

Introduction: RNA sequencing (RNA-seq) data from space biology experiments promise to yield invaluable insights into the effects of spaceflight on terrestrial biology. However, sample numbers from each study are low due to limited crew availability, hardware, and space. To increase statistical power, spaceflight RNA-seq datasets from different missions are often aggregated. Aggregation, though, can introduce technical variation or “batch effects”, often due to differences in sample handling, sample processing, and sequencing platforms. Several computational methods have been developed to correct for technical batch effects, thereby reducing their impact on true biological signals.

Methods: In this study, we combined seven mouse liver RNA-seq datasets from NASA GeneLab (part of the NASA Open Science Data Repository) to evaluate several common batch effect correction methods (ComBat and ComBat-seq from the sva R package, and Median Polish, Empirical Bayes, and ANOVA from the MBatch R package). Principal component analysis (PCA) identified library preparation method and mission as the primary sources of batch effect among the technical variables in the combined dataset. We then quantitatively evaluated the ability of each method to correct for each identified technical batch variable using the following criteria: BatchQC, PCA, dispersion separability criterion, log fold change correlation, and differential gene expression analysis. Each batch variable/correction method combination was then assessed using a custom scoring approach to identify the optimal correction method for the combined dataset. This approach geometrically probes the space of all allowable scoring functions to yield an aggregate volume-based scoring measure.
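
As an illustration of the PCA step described above, the following is a minimal sketch, not the pipeline used in the study. It runs PCA on a synthetic log-expression matrix and reports how much of the first principal component is explained by a technical label versus a biological label. The helper names (`pca_scores`, `variance_explained_by`), the labels `prepA`/`prepB`, and all simulated values are assumptions made for this example only.

```python
import numpy as np

def pca_scores(log_expr, n_components=2):
    """PCA via SVD. log_expr is a samples x genes matrix of log expression."""
    centered = log_expr - log_expr.mean(axis=0)          # center each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]      # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)                # variance ratio per PC
    return scores, explained[:n_components]

def variance_explained_by(labels, values):
    """One-way ANOVA R^2: share of variance in `values` explained by `labels`."""
    labels = np.asarray(labels)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = sum(
        np.sum(labels == g) * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return ss_between / ss_total

# Synthetic data (hypothetical): 24 samples, 200 genes, balanced design.
rng = np.random.default_rng(0)
n_per_group, n_genes = 6, 200
batch = np.repeat(["prepA", "prepB"], 2 * n_per_group)          # technical label
condition = np.tile(np.repeat(["flight", "ground"], n_per_group), 2)

expr = rng.normal(0.0, 1.0, size=(4 * n_per_group, n_genes))
expr[batch == "prepB"] += rng.normal(3.0, 0.5, size=n_genes)    # strong batch shift
expr[condition == "flight", :20] += 1.0                         # weaker biological effect

scores, explained = pca_scores(expr)
r2_batch = variance_explained_by(batch, scores[:, 0])
r2_condition = variance_explained_by(condition, scores[:, 0])
print(f"PC1 variance ratio: {explained[0]:.2f}")
print(f"PC1 R^2 by batch: {r2_batch:.3f}, by condition: {r2_condition:.3f}")
```

In this toy setting the technical label accounts for nearly all of the variation along the first component, which is the kind of pattern that would flag a variable as a candidate for correction.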

Results and Discussion: When this method was applied to the combined dataset in this study, the library preparation variable/ComBat correction method pair outranked the other candidate pairs, suggesting that this combined dataset should be corrected for library preparation using the ComBat correction method prior to downstream analysis. We describe the GeneLab multi-study analysis and visualization portal, which will allow users to access the publicly available space biology ‘omics data, select multiple studies to combine for analysis, and examine the presence or absence of batch effects using multiple metrics. If the user chooses to perform batch effect correction, the scoring approach described here can be implemented to identify the optimal correction method to use for their specific combined dataset prior to analysis.
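
The abstract does not spell out the scoring procedure, so the following is only one possible reading of a volume-based score, offered as a hedged sketch and not as the authors' implementation. It assumes each candidate pair has a vector of metric values where higher is better, that an allowable scoring function is a weighted sum with non-negative weights summing to one, and that a candidate's score is the fraction of that weight space on which it ranks first. The function name `winning_volume_fraction` and the toy metric values are hypothetical.

```python
import numpy as np

def winning_volume_fraction(metrics, n_samples=20000, seed=0):
    """Estimate, by Monte Carlo, the share of the weight simplex on which
    each candidate has the highest weighted-sum score.

    metrics: candidates x criteria array, higher is better (assumption).
    """
    rng = np.random.default_rng(seed)
    metrics = np.asarray(metrics, dtype=float)
    n_candidates, n_criteria = metrics.shape
    # Dirichlet(1, ..., 1) draws are uniform over the weight simplex.
    weights = rng.dirichlet(np.ones(n_criteria), size=n_samples)
    scores = weights @ metrics.T                 # n_samples x n_candidates
    winners = scores.argmax(axis=1)
    return np.bincount(winners, minlength=n_candidates) / n_samples

# Toy values for three hypothetical candidate pairs and three criteria.
candidates = ["candidate_1", "candidate_2", "candidate_3"]
toy_metrics = [
    [0.9, 0.8, 0.7],
    [0.6, 0.9, 0.5],
    [0.3, 0.2, 0.4],
]
fractions = winning_volume_fraction(toy_metrics)
for name, fraction in zip(candidates, fractions):
    print(f"{name}: {fraction:.3f}")
```

The appeal of a volume-style measure under these assumptions is that the ranking does not hinge on a single hand-picked weighting of the evaluation criteria.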