ORIGINAL RESEARCH article

Front. High Perform. Comput.
Sec. High Performance Big Data Systems
Volume 2 - 2024 | doi: 10.3389/fhpcp.2024.1384619
This article is part of the Research Topic Horizons in High Performance Computing, 2024

Supercharging Distributed Computing Environments For High-Performance Data Engineering

Provisionally accepted
  • 1 Indiana University, Bloomington, Indiana, United States
  • 2 University of Virginia, Charlottesville, United States
  • 3 Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, Virginia, United States

The final, formatted version of the article will be published soon.

    The data engineering and data science community has embraced the idea of using Python and R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these frameworks are now ever more important for processing terabytes of data. Such workloads can easily exceed the capabilities of a single machine, and scaling them demands significant developer time and effort, even though dataframes are valued for their convenience and for high-level abstractions that can be optimized. It is therefore essential to design scalable dataframe solutions. There have been multiple efforts to tackle this problem efficiently, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask and Ray's distributed computing features look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (supercharging them!). To achieve this, we integrate Cylon, a high-performance dataframe system originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30× better distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance by leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate high-performance computing and distributed computing ecosystems.

    Keywords: data engineering, data science, high-performance computing, distributed computing, dataframes

    Received: 10 Feb 2024; Accepted: 20 Jun 2024.

    Copyright: © 2024 Perera, Sarker, Shan, Kamburugamuve, Kanewala, Widanage, Zhong, Fetea, Abeykoon, Von Laszewski and Fox. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Arup K. Sarker, University of Virginia, Charlottesville, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.