TECHNOLOGY AND CODE article

Front. Big Data
Sec. Data Mining and Management
Volume 7 - 2024 | doi: 10.3389/fdata.2024.1446071

SparkDWM: A scalable design of a Data Washing Machine using Apache Spark

Provisionally accepted
  • University of Arkansas at Little Rock, Little Rock, United States

The final, formatted version of the article will be published soon.

    Data volume is one of the fastest-growing assets of most real-world applications. This growth increases the rate of human errors such as duplicated records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution (ER) is an ETL process that aims to resolve data inconsistencies by determining which references refer to the same real-world object. One of the main challenges of most traditional Entity Resolution systems is scaling to meet rising data needs. This research refactors a working proof-of-concept entity resolution system called the Data Washing Machine (DWM) to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset, and we improve the Data Washing Machine design to use intrinsic metadata information from references. We show that our system achieves the same results as the legacy Data Washing Machine on 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging in size from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.

    Keywords: data washing machine, entity resolution, data curation, PySpark, distributed DWM, Spark DWM

    Received: 08 Jun 2024; Accepted: 23 Aug 2024.

    Copyright: © 2024 Hagan and Talburt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence:
    Nicholas K. Hagan, University of Arkansas at Little Rock, Little Rock, United States
    John R. Talburt, University of Arkansas at Little Rock, Little Rock, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.