ORIGINAL RESEARCH article

Front. High Perform. Comput.
Sec. HPC Applications
Volume 2 - 2024 | doi: 10.3389/fhpcp.2024.1458674

Addressing GPU Memory Limitations for Graph Neural Networks in High-Energy Physics Applications

Provisionally accepted
Claire Lee 1*, V Hewes 2,3, Giuseppe Cerati 2, Kewei Wang 1, Adam Aurisano 3, Ankit Agrawal 1, Alok Choudhary 1, Wei-Keng Liao 1
  • 1 Northwestern University, Evanston, United States
  • 2 Fermilab Accelerator Complex, Fermi National Accelerator Laboratory (DOE), Batavia, Illinois, United States
  • 3 University of Cincinnati, Cincinnati, Ohio, United States

The final, formatted version of the article will be published soon.

    Reconstructing low-level particle tracks in neutrino physics can address some of the most fundamental questions about the universe. However, processing petabytes of raw data with deep learning techniques poses a challenging problem in the field of High Energy Physics (HEP). In the Exa.TrkX Project, an illustrative HEP application, preprocessed simulation data is fed into a state-of-the-art Graph Neural Network (GNN) model accelerated by GPUs. However, limited GPU memory often leads to Out-of-Memory (OOM) exceptions during training, owing to the large size of the models and datasets. This problem is exacerbated when deploying models on High-Performance Computing (HPC) systems designed for large-scale applications. We observe a severe workload imbalance during GNN model training, caused by the irregular sizes of the input graph samples in HEP datasets, which contributes to OOM exceptions. We aim to scale GNNs on HPC systems by prioritizing workload balance across graph inputs while maintaining model accuracy. This paper introduces several balancing strategies aimed at decreasing the maximum GPU memory footprint and avoiding OOM exceptions across various datasets. Our experiments show memory reductions of up to 32.14% compared to the baseline, and we demonstrate that the proposed strategies avoid OOM in the application. Additionally, we provide a distributed multi-GPU implementation using these samplers to demonstrate the scalability of these techniques on the HEP dataset.
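    The core idea of balancing irregularly sized graph samples across GPUs can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes graphs are characterized by a single size metric (e.g., node count) and uses a greedy longest-processing-time (LPT) heuristic: graphs are assigned, largest first, to whichever GPU bucket currently has the smallest total, so the largest per-GPU total (a proxy for peak memory) stays low. The function name `balance_graphs` is hypothetical.

    ```python
    # Hypothetical sketch: size-balanced assignment of graph samples to GPUs.
    # Each graph is reduced to one size metric; the LPT heuristic places the
    # largest graphs first, each into the currently lightest bucket.
    import heapq

    def balance_graphs(graph_sizes, num_gpus):
        """Assign graph indices to num_gpus buckets, balancing total size."""
        # Min-heap of (current_total_size, bucket_id), so heappop yields
        # the lightest bucket.
        heap = [(0, gpu) for gpu in range(num_gpus)]
        heapq.heapify(heap)
        buckets = [[] for _ in range(num_gpus)]
        # Visit graphs in decreasing size order (LPT).
        for idx in sorted(range(len(graph_sizes)),
                          key=lambda i: -graph_sizes[i]):
            total, gpu = heapq.heappop(heap)
            buckets[gpu].append(idx)
            heapq.heappush(heap, (total + graph_sizes[idx], gpu))
        return buckets

    # Irregular graph sizes, as in HEP event samples.
    sizes = [900, 120, 340, 560, 75, 480, 210, 660]
    buckets = balance_graphs(sizes, num_gpus=2)
    loads = [sum(sizes[i] for i in b) for b in buckets]
    print(loads)  # [1665, 1680] -- near-equal per-GPU totals
    ```

    In a real training loop, such buckets would back a distributed sampler so that each rank's mini-batches stay within the memory budget; the paper's strategies are more elaborate, but this captures the balancing objective.
    
    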

    Keywords: high-performance computing, scientific workflows, Graph neural networks, Supercomputing, Graphic processing units, deep learning

    Received: 02 Jul 2024; Accepted: 12 Aug 2024.

    Copyright: © 2024 Lee, Hewes, Cerati, Wang, Aurisano, Agrawal, Choudhary and Liao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Claire Lee, Northwestern University, Evanston, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.