METHODS article

Front. Big Data
Sec. Machine Learning and Artificial Intelligence
Volume 7 - 2024 | doi: 10.3389/fdata.2024.1476506

Constructing a Metadata Knowledge Graph as an atlas for demystifying AI Pipeline optimization

Provisionally accepted
Revathy Venkataramanan 1,2*, Aalap Tripathy 2, Tarun Kumar 2, Sergey Serebryakov 2, Annmary Justine 2, Arpit Shah 2, Suparna Bhattacharya 2, Martin Foltin 2, Paolo Faraboschi 2, Kaushik Roy 1, Amit Sheth 1
  • 1 University of South Carolina, Columbia, United States
  • 2 Hewlett Packard Labs, Houston, United States

The final, formatted version of the article will be published soon.

    The emergence of advanced Artificial Intelligence (AI) models has driven the development of frameworks and approaches that automate model training and hyperparameter tuning in end-to-end AI pipelines. However, other crucial stages of these pipelines, such as dataset selection, feature engineering, and model optimization for deployment, have received less attention. Improving the efficiency of end-to-end AI pipelines requires metadata from past executions of AI pipelines and all their stages. Regenerating this metadata history by re-executing existing AI pipelines is computationally expensive and impractical. To address this issue, we propose sourcing AI pipeline metadata from open-source platforms such as Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats of these diverse sources is a challenge. In this paper, we present a solution by introducing the Common Metadata Ontology (CMO), which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations of AIMKG on searching for and recommending pipelines relevant to a user query. For the quantitative evaluation, we propose a custom aggregation model that outperforms the other baselines, achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that the AIMKG-based recommender retrieved relevant pipelines in 78% of test cases, compared to the state-of-the-art MLSchema-based recommender, which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.
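
    The retrieval accuracy (R@1) reported in the abstract can be illustrated with a minimal sketch. It assumes that each query has exactly one ground-truth pipeline and that R@k counts the fraction of queries whose ground-truth pipeline appears among the top k results; the abstract does not spell out the evaluation protocol, so this may differ from the authors' setup. The function name `recall_at_k`, the pipeline identifiers, and the toy rankings are all hypothetical.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    relevant_ids: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of queries whose ground-truth item is in the top-k results.

    Assumption: one relevant item per query (not confirmed by the source).
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one ground-truth id per query")
    if not relevant_ids:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_ids, relevant_ids)
        if truth in ranked[:k]
    )
    return hits / len(relevant_ids)


# Hypothetical toy data: four queries, each with a ranked list of pipeline ids.
rankings = [
    ["p1", "p7", "p3"],
    ["p4", "p2", "p9"],
    ["p5", "p6", "p8"],
    ["p2", "p1", "p4"],
]
truth = ["p1", "p2", "p8", "p2"]

r1 = recall_at_k(rankings, truth, k=1)  # top-ranked result is correct for 2 of 4 queries
r3 = recall_at_k(rankings, truth, k=3)  # ground truth is in the top 3 for all 4 queries
print(f"R@1 = {r1:.2f}, R@3 = {r3:.2f}")
```

    Under this reading, an R@1 of 76.3% would mean that the top-ranked pipeline matched the ground truth for roughly three out of four queries.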

    Keywords: AI pipeline metadata, graph learning, graph recommendation, AIMKG, metadata knowledge graphs, AI pipeline optimization

    Received: 05 Aug 2024; Accepted: 27 Nov 2024.

    Copyright: © 2024 Venkataramanan, Tripathy, Kumar, Serebryakov, Justine, Shah, Bhattacharya, Foltin, Faraboschi, Roy and Sheth. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Revathy Venkataramanan, University of South Carolina, Columbia, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.