TCGADownloadHelper: Simplifying TCGA Data Extraction and Preprocessing

Baumann, Alexandra Anke; Wolfien, Markus; Wolkenhauer, Olaf

doi:10.3389/fgene.2025.1569290

TECHNOLOGY AND CODE article

Front. Genet.

Sec. Computational Genomics

Volume 16 - 2025 | doi: 10.3389/fgene.2025.1569290

This article is part of the Research Topic "Advancements in Sequencing Technologies for Epigenomic and Transcriptomic Analysis: From Bulk to Single-Cell Resolution".


Provisionally accepted

Alexandra Anke Baumann^1,2*

Markus Wolfien^2,3

Olaf Wolkenhauer^1,4,5

¹Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
²Institute for Medical Informatics and Biometry, Faculty of Medicine Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
³Center for Scalable Data Analytics and Artificial Intelligence, Dresden, Germany
⁴Leibniz Institute for Food Systems Biology, Technical University of Munich, Freising, Bavaria, Germany
⁵Stellenbosch Institute of Advanced Study, Wallenberg Research Centre, Stellenbosch University, Stellenbosch, South Africa

The final, formatted version of the article will be published soon.

The Cancer Genome Atlas (TCGA) provides comprehensive genomic data across various cancer types. However, its complex file naming conventions and the need to link disparate data types to individual case IDs can be challenging for first-time users. While other tools have been introduced to facilitate TCGA data handling, they lack a straightforward combination of all required steps. To address this, we developed a streamlined pipeline that uses the Genomic Data Commons (GDC) portal's cart system for file selection and the GDC Data Transfer Tool for data downloads. We use the Sample Sheet provided by the GDC portal to replace the default 36-character opaque file IDs and filenames with human-readable case IDs. The pipeline combines customizable Python scripts in a Jupyter Notebook with a Snakemake workflow for ID mapping and for automating data preprocessing tasks (https://github.com/alex-baumannur/TCGADownloadHelper). It simplifies the data download process by modifying manifest files to focus on specific subsets, which facilitates the handling of multimodal data sets related to single patients, and it substantially reduces the effort required to preprocess data. Overall, this pipeline enables researchers to efficiently navigate the complexities of TCGA data extraction and preprocessing. By establishing a clear step-by-step approach, we provide a streamlined methodology that minimizes errors, enhances data usability, and supports the broader utilization of TCGA data in cancer research. It is particularly beneficial for researchers new to genomic data analysis, offering them a practical framework to apply before conducting their TCGA studies.
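The two core ideas in the abstract, mapping opaque file IDs to case IDs via the Sample Sheet and trimming a manifest to a subset, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that the Sample Sheet is a tab-separated file with "File ID", "File Name", and "Case ID" columns and that the manifest is a tab-separated file with an "id" column, as in typical GDC portal exports. The function names, the `<case ID>_<file name>` naming scheme, and all sample values are hypothetical.

```python
import csv
import io


def load_id_map(sample_sheet_text):
    """Map each GDC file ID (UUID) to its case ID.

    Assumes a tab-separated Sample Sheet with "File ID" and "Case ID"
    columns (an assumption about the export format, not taken from the article).
    """
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    return {row["File ID"]: row["Case ID"] for row in reader}


def readable_name(file_id, file_name, id_map):
    """Prefix the original file name with the case ID when one is known.

    The "<case ID>_<file name>" scheme is a hypothetical choice for illustration.
    """
    case_id = id_map.get(file_id)
    if case_id is None:
        return file_name  # leave unmapped files untouched
    return f"{case_id}_{file_name}"


def filter_manifest(manifest_text, keep_ids):
    """Return a manifest that contains only the rows whose "id" is in keep_ids."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row["id"] in keep_ids:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up placeholder values; real file IDs are 36-character UUIDs.
    sheet = (
        "File ID\tFile Name\tCase ID\n"
        "11111111-aaaa-bbbb-cccc-000000000001\tcounts.tsv\tTCGA-XX-0001\n"
    )
    manifest = (
        "id\tfilename\tmd5\tsize\tstate\n"
        "11111111-aaaa-bbbb-cccc-000000000001\tcounts.tsv\t0\t10\treleased\n"
        "11111111-aaaa-bbbb-cccc-000000000002\tother.tsv\t0\t20\treleased\n"
    )
    id_map = load_id_map(sheet)
    print(readable_name("11111111-aaaa-bbbb-cccc-000000000001", "counts.tsv", id_map))
    print(filter_manifest(manifest, set(id_map)))
```

A filtered manifest of this kind could then be passed to the GDC Data Transfer Tool so that only the selected subset is downloaded. The sketch uses only the standard library so that it runs without extra dependencies.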

Keywords: The Cancer Genome Atlas (TCGA), sample preprocessing, Jupyter Notebook, lung cancer, Genomic Data Commons (GDC) portal

Received: 31 Jan 2025; Accepted: 23 Apr 2025.

Copyright: © 2025 Baumann, Wolfien and Wolkenhauer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Alexandra Anke Baumann, Department of Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
