AUTHOR=Song Xinyi, Tan Zhetao, Locarnini Ricardo, Simoncelli Simona, Cowley Rebecca, Kizu Shoichi, Boyer Tim, Reseghetti Franco, Castelao Guilherme, Gouretski Viktor, Cheng Lijing TITLE=DC_OCEAN: an open-source algorithm for identification of duplicates in ocean databases JOURNAL=Frontiers in Marine Science VOLUME=11 YEAR=2024 URL=https://www.frontiersin.org/journals/marine-science/articles/10.3389/fmars.2024.1403175 DOI=10.3389/fmars.2024.1403175 ISSN=2296-7745 ABSTRACT=
A high-quality hydrographic observational database is essential for ocean and climate studies and for operational applications. Because numerous global and regional ocean databases exist, duplicate data remain a persistent issue in data management, data processing and database merging, and they hinder the effective and accurate use of oceanographic data to derive robust statistics and reliable data products. This study aims to provide algorithms to identify duplicates and assign labels to them. We propose, first, a set of criteria that define duplicate data and, second, an open-source, semi-automatic system to detect duplicate data and erroneous metadata. This system combines several algorithms for automatic checks based on statistical methods (such as Principal Component Analysis and entropy weighting) with an additional expert (manual) check. The robustness of the system is then evaluated with a subset of the World Ocean Database (WOD18) with over 600,000