TECHNOLOGY AND CODE article

Front. Mar. Sci.
Sec. Ocean Observation
Volume 11 - 2024 | doi: 10.3389/fmars.2024.1403175

DC_OCEAN: An open-source algorithm for identification of duplicates in ocean databases

Provisionally accepted
  • 1 International Center for Climate and Environmental Sciences, Institute of Atmospheric Physics, Chinese Academy of Sciences (CAS), Beijing, Beijing Municipality, China
  • 2 University of Chinese Academy of Sciences, Beijing, Beijing, China
  • 3 National Centers for Environmental Information, National Oceanic and Atmospheric Administration (NOAA), Asheville, United States
  • 4 Istituto Nazionale di Geofisica e Vulcanologia (INGV), Bologna, Italy
  • 5 Climate Science Centre, Commonwealth Scientific and Industrial Research Organization, Hobart, Australia
  • 6 Centre for Southern Hemisphere Oceans Research (CSIRO), Hobart, Australia
  • 7 Physical Oceanography Laboratory, Department of Geophysics, Tohoku University, Sendai, Japan
  • 8 Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), Santa Teresa Research Centre, Pozzuolo di Lerici, Italy
  • 9 Scripps Institution of Oceanography, University of California, San Diego, La Jolla, United States

The final, formatted version of the article will be published soon.

    A high-quality hydrographic observational database is essential for ocean and climate studies and for operational applications. Because numerous global and regional ocean databases exist, duplicate data remain a persistent problem in data management, data processing, and database merging, making it harder to use oceanographic data effectively and accurately to derive robust statistics and reliable data products. This study aims to provide algorithms that identify duplicates and assign labels to them. We propose, first, a set of criteria to define duplicate data and, second, an open-source, semi-automatic system to detect duplicate data and erroneous metadata. The system combines several algorithms for automatic checks based on statistical methods (such as Principal Component Analysis and entropy weighting) with an additional expert (manual) check. The robustness of the system is then evaluated on a subset of the World Ocean Database (WOD18) containing over 600,000 in-situ temperature and salinity profiles. The system is released as an open-source Python package (named DC_OCEAN) with user-customizable settings. The result of applying it to the WOD18 subset also forms a benchmark dataset, which is available to support future studies on duplicate checks, metadata error identification, and machine learning applications. This duplicate checking system will be incorporated into the International Quality-controlled Ocean Database (IQuOD) data quality control system to guarantee the uniqueness of ocean observation data in this product.
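
    As an illustration of the entropy weighting mentioned in the abstract, the sketch below implements the generic, textbook entropy weight method with NumPy. It is not taken from DC_OCEAN: the function names, the four metadata columns, and the sample values are all hypothetical, and the package itself may compute and use its weights differently. The sketch assumes at least two profiles.

```python
import numpy as np

def _min_max_scale(features):
    """Scale each column to [0, 1]; constant columns become all zeros."""
    x = np.asarray(features, dtype=float)
    span = x.max(axis=0) - x.min(axis=0)
    scaled = (x - x.min(axis=0)) / np.where(span == 0, 1.0, span)
    return scaled, span

def entropy_weights(features):
    """Return one weight per column using the textbook entropy weight method."""
    scaled, span = _min_max_scale(features)
    n = scaled.shape[0]  # number of profiles, assumed >= 2
    # Turn each column into a distribution over profiles.
    col_sum = scaled.sum(axis=0)
    p = scaled / np.where(col_sum == 0, 1.0, col_sum)
    # Normalised Shannon entropy per column; 0 * log(0) is treated as 0.
    safe_p = np.where(p > 0, p, 1.0)
    entropy = -(p * np.log(safe_p)).sum(axis=0) / np.log(n)
    # A constant column carries no information, so force its entropy to 1.
    entropy = np.where(span == 0, 1.0, entropy)
    divergence = 1.0 - entropy
    total = divergence.sum()
    if total == 0:
        # Every column is uninformative: fall back to equal weights.
        return np.full(scaled.shape[1], 1.0 / scaled.shape[1])
    return divergence / total

def weighted_scores(features, weights):
    """Collapse each profile's scaled metadata into one weighted score."""
    scaled, _ = _min_max_scale(features)
    return scaled @ weights

# Hypothetical metadata: latitude, longitude, maximum depth (m), flag column.
profiles = np.array([
    [10.0, 120.0, 500.0, 5.0],
    [10.0, 120.0, 500.0, 5.0],   # same metadata as the first row
    [-35.2, 150.1, 1000.0, 5.0],
    [42.0, -30.5, 750.0, 5.0],
])
weights = entropy_weights(profiles)
scores = weighted_scores(profiles, weights)
```

    In this sketch, rows with identical metadata receive the same score, so near-equal scores could serve as a cheap first screen for candidate duplicate pairs before a more careful comparison or a manual check. The constant fourth column receives zero weight because it cannot distinguish one profile from another.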

    Keywords: duplicate checking, ocean data infrastructure, ocean in-situ observations, ocean data quality improvement, temperature and salinity

    Received: 19 Mar 2024; Accepted: 10 Sep 2024.

    Copyright: © 2024 Song, Tan, Locarnini, Simoncelli, Cowley, Kizu, Boyer, Reseghetti, Castelao, Gouretski and Cheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Lijing Cheng, International Center for Climate and Environmental Sciences, Institute of Atmospheric Physics, Chinese Academy of Sciences (CAS), Beijing, 100029, Beijing Municipality, China

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.