
ORIGINAL RESEARCH article

Front. High Perform. Comput.
Sec. Parallel and Distributed Software
Volume 2 - 2024 | doi: 10.3389/fhpcp.2024.1417040
This article is part of the Research Topic Horizons in High Performance Computing, 2024.

Runtime Support for CPU-GPU High-Performance Computing on Distributed Memory Platforms

Provisionally accepted
Polykarpos Thomadakis* and Nikos Chrisochoides
  • Old Dominion University, Norfolk, United States

The final, formatted version of the article will be published soon.

    Hardware heterogeneity is here to stay in high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and the optimizations performed, achieving up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs. In a distributed memory environment, the framework offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It outperforms MPI+CUDA by up to 20% for large messages while keeping the overheads for small messages within 10%. Furthermore, the results of our performance evaluation on a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%. This is accomplished through optimizations at the library level and by creating opportunities to leverage application-specific optimizations, such as over-decomposition.
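
    The sketch below is not from the article; it only illustrates the kind of MPI+CUDA baseline the abstract refers to: a one-dimensional Jacobi iteration whose halo cells are exchanged between ranks, assuming a CUDA-aware MPI build so that device pointers can be passed to MPI directly. The problem size, step count, kernel, and decomposition are illustrative assumptions, not the authors' code.

#include <mpi.h>
#include <cuda_runtime.h>
#include <utility>

// One Jacobi relaxation sweep over the interior cells [1, n].
__global__ void jacobi_step(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // skip the left halo cell
    if (i <= n) out[i] = 0.5 * (in[i - 1] + in[i + 1]);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;  // interior cells per rank (illustrative)
    double *d_in, *d_out;
    cudaMalloc(&d_in,  (n + 2) * sizeof(double));  // +2 halo cells
    cudaMalloc(&d_out, (n + 2) * sizeof(double));
    cudaMemset(d_in,  0, (n + 2) * sizeof(double));
    cudaMemset(d_out, 0, (n + 2) * sizeof(double));

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < 100; ++step) {
        // With CUDA-aware MPI the device pointers go straight into MPI calls;
        // without it, the halos must be staged through host buffers with cudaMemcpy.
        MPI_Sendrecv(d_in + 1,     1, MPI_DOUBLE, left,  0,
                     d_in + n + 1, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(d_in + n,     1, MPI_DOUBLE, right, 1,
                     d_in,         1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        jacobi_step<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        std::swap(d_in, d_out);  // the freshly computed array becomes the next input
    }

    cudaFree(d_in);
    cudaFree(d_out);
    MPI_Finalize();
    return 0;
}

    In an over-decomposed variant, each rank would own several such blocks instead of one, which gives a runtime system more independent units of work to schedule and more chances to overlap halo communication with computation.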

    Keywords: Parallel Computing, Distributed Computing, GPGPU programming, Runtime systems, Heterogeneous systems, high-performance computing

    Received: 13 Apr 2024; Accepted: 02 Jul 2024.

    Copyright: © 2024 Thomadakis and Chrisochoides. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Polykarpos Thomadakis, Old Dominion University, Norfolk, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.