TECHNOLOGY AND CODE article

Front. Mar. Sci., 22 July 2022
Sec. Marine Ecosystem Ecology

Automating the Curation Process of Historical Literature on Marine Biodiversity Using Text Mining: The DECO Workflow

Savvas Paragkamian1,2*†, Georgia Sarafidou2†, Dimitra Mavraki2, Christina Pavloudi2,3, Joana Beja4, Menashè Eliezer5, Marina Lipizer5, Laura Boicenco6, Leen Vandepitte4, Ruben Perez-Perez4, Haris Zafeiropoulos1,2, Christos Arvanitidis2,7, Evangelos Pafilis2, Vasilis Gerovasileiou2,8
  • 1Department of Biology, University of Crete, Heraklion, Greece
  • 2Hellenic Centre for Marine Research (HCMR), Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Heraklion, Greece
  • 3Department of Biological Sciences, The George Washington University, Washington, DC, United States
  • 4Flanders Marine Institute (VLIZ), Oostende, Belgium
  • 5National Institute of Oceanography and Applied Geophysics (OGS), Trieste, Italy
  • 6National Institute for Marine Research and Development “Grigore Antipa” (NIMRD), Constanta, Romania
  • 7LifeWatch ERIC, Seville, Spain
  • 8Department of Environment, Faculty of Environment, Ionian University, Zakynthos, Greece

Historical biodiversity documents comprise an important link to the long-term data life cycle and provide useful insights on several aspects of biodiversity research and management. However, because of their historical context, they present specific challenges, primarily the time and effort required for data curation. The data rescue process requires a multidisciplinary effort involving four tasks: (a) Document digitisation, (b) Transcription, which involves text recognition and correction, and (c) Information Extraction, which is performed using text mining tools and involves entity identification, normalisation and the detection of entity co-mentions in text. Finally, the extracted data go through (d) Publication to a data repository in a standardised format. Each of these tasks requires a dedicated multistep methodology with standards and procedures. During the past 8 years, Information Extraction (IE) tools have undergone remarkable advances, which created a landscape of various tools with distinct capabilities specific to biodiversity data. These tools recognise entities in text, such as taxon names, localities and phenotypic traits, and thus automate, accelerate and facilitate the curation process. Furthermore, they assist the normalisation and mapping of entities to specific identifiers. This work focuses on the IE step (c) from the marine historical biodiversity data perspective. It orchestrates IE tools and provides the curators with a unified view of the methodology; as a result, the strengths, limitations and dependencies of several tools are documented. Additionally, the classification of tools into Graphical User Interface (web and standalone) applications and Command Line Interface ones enables data curators to select the most suitable tool for their needs, according to their specific features. In addition, the high volume of already digitised marine documents awaiting curation is estimated, and the methodology is demonstrated with a new scalable, extendable and containerised tool, “DECO” (bioDivErsity data Curation programming wOrkflow). DECO’s usage will provide a solid basis for future curation initiatives and an augmented degree of reliability towards high-value data products that allow for the connection between the past and the present in marine biodiversity research.

Introduction

Species’ occurrence patterns across spatial and temporal scales are the cornerstone of ecological research (Levin, 1992). The compilation of both past and present marine data into a unified census is crucial to predict the future of ocean life (Ausubel, 1999; Anderson, 2006; Lo Brutto, 2021). This compilation has been attempted by big collaborative projects, like the Census of Marine Life1 (Vermeulen et al., 2013), that follow metadata standards and guidelines (Michener et al., 1997; Wilkinson et al., 2016) and modern web technologies (Michener, 2015). The project has resulted in the incorporation of census data from the past, i.e. historical data, into modern data platforms, such as the Ocean Biodiversity Information System (OBIS) (Klein et al., 2019), which feeds the Global Biodiversity Information Facility (GBIF) (GBIF, 2022). The transformation of historical data to modern standards is necessary for their rescue (data archaeology) from decay and inevitable loss (Bowker, 2000).

Historical data are usually found in the form of (a) historical literature and (b) specimens stored in biodiversity museum collections (Rainbow, 2009) (the digital transformation process and progress for specimens is reviewed by Nelson and Ellis, 2019). Historical biodiversity documents (also known as legacy, ancient or simply old documents) comprise literature from 1000 AD until 1960 and are therefore stored in an analogue and/or obsolete format (Lotze and Worm, 2009; Beja et al., 2022). These old documents can be found in institutional libraries, publications, books, expedition logbooks, project reports and newspapers (Faulwetter et al., 2016; Mavraki et al., 2016; Kwok, 2017), or in other legacy formats (e.g. stored on floppy disks, microfilms or CDs).

From the scientific point of view, historical biodiversity data are as relevant as modern data (Griffin, 2019; Beja et al., 2022). They are valuable for studies on biodiversity loss (Stuart-Smith et al., 2015; Goethem and Zanden, 2021), as baselines for the design of future samplings (Rivera-Quiroz et al., 2020) and for predictions of future trends (Mouquet et al., 2015). Furthermore, historical data offer the kind of evidence needed for conservation policy and marine resource management, allowing past patterns and processes to be compared with current ones (Fortibuoni et al., 2010; McClenachan et al., 2012; Costello et al., 2013; Engelhard et al., 2016). Hundreds of historical marine datasets held in documents have already been uploaded to OBIS, yet a Herculean effort is required to curate the thousands of available documents of the Biodiversity Heritage Library (BHL) (Gwinn and Rinaldo, 2009) and other repositories.

Adequate and interoperable metadata are equally necessary and have to be curated alongside the data (Heidorn, 2008; Mouquet et al., 2015). In this context, standards and guidelines have recently been formulated in policies such as the Findable, Accessible, Interoperable and Reusable (FAIR) (meta)data principles (Wilkinson et al., 2016; Reiser et al., 2018). Identifiers and semantics are used to accomplish the interoperability and reusability of biodiversity data, as well as the monitoring of their use (Mouquet et al., 2015). Indispensable to the curation process of marine data have been the standards of Biodiversity Information Standards2 (TDWG), more specifically Darwin Core (Wieczorek et al., 2012), and vocabularies such as those included in the International Commission on Zoological Nomenclature3, the World Register of Marine Species4 (WoRMS) (WoRMS Editorial Board, 2022), the Environment Ontology5 (ENVO) (Buttigieg et al., 2016) and Marine Regions6 (Claus et al., 2014). These standards and vocabularies, and their adoption by biodiversity initiatives like GBIF and OBIS, align with the goal of marine biodiversity Linked Open Data and support interoperability and reusability (Page, 2016; Penev et al., 2019; Zárate and Buckle, 2021).

The rescue process of historical biodiversity documents can be summarised in four tasks (Figure 1). The first task is the digitisation of the document, which involves locating and cataloguing the original data sources, imaging/scanning with specific equipment and standards, and uploading the scans to digital libraries (Lin, 2006; Thompson and Richard, 2013). In the second task, the images are analysed with text recognition software, mainly through Optical Character Recognition (OCR) (for standards see Groom et al., 2019; for reviews see Lyal, 2016 and Owen et al., 2020). Text recognition errors are then corrected manually by professionals or citizen scientists (Herrmann, 2020). The third task is named Information Extraction (IE), as it involves the steps of named entity recognition, mapping and normalisation of biodiversity information (Thessen et al., 2012). Here, the curators may compile a species occurrence census enriched with metadata on the study, geolocation, environment, sampling methods and traits, among others (Faulwetter et al., 2016). Lastly, the fourth task is data publishing to online biodiversity databases/repositories (Costello et al., 2013; Penev et al., 2017). Expert manual curation is a cross-cutting action through all the aforementioned tasks, for quality control and stewardship (Vandepitte et al., 2015). This article focuses on the tools and curation procedures encompassed in the third and fourth tasks described above.

Figure 1 Summarised process of historical document rescue. Four tasks are required to complete the data rescue process of biodiversity documents. Each of these has several steps, methodology, tools and standards. Curation is needed in every task, for tool handling and error correction. The stars represent the 5-star ranking system of Linked Data as introduced by W3C7 (Heath and Bizer, 2011). Availability of information from historical data increases as the curation tasks are completed (as exemplified by the fan on the right). Icons used from the Noun Project released under CC BY: book by Oleksandr Panasovskyi, scanning by LAFS, Book info by Xinh Studio, Library by ibrandify, Scanner Text by Wolf Böse, Check form by allex, Whale by Alina Oleynik, Fish by Asmuh, tag code by vigorn, pivot layout by paisan, Certificate by P Thanga Vignesh, web service by mynamepong.

Several factors may turn the curation of historical documents into a serious challenge (Faulwetter et al., 2016; Beja et al., 2022). Errors from the first and second tasks presented in Figure 1 (e.g. bad-quality imaging, mis-recognised characters) are propagated through the whole process. In terms of georeferencing constraints, location names or sampling points on an old map may be provided instead of actual coordinates. Additionally, taxonomic constraints (e.g. old, currently unaccepted synonyms, or lack of the authority associated with the taxon names), combined with the absence of taxonomic literature or voucher specimens (e.g. identifier numbers for samples of natural history/expedition collections), require the taxonomists’ assistance. Numerical measurement units often need to be converted to the International System of Units (SI) (e.g. fathoms to metres) (Calder, 1982; Wieczorek et al., 2012). Old toponyms and political boundaries that have since changed should also be taken into consideration, as well as coordinates that now fall on land instead of in the sea, due to changes in the coastline. Lastly, the use of languages other than English is quite common in old scientific publications, so multilingual curators are required. Some of the aforementioned issues are presented in Figure 2. Because of these limitations, the manual curation of data and metadata is mandatory when it comes to historical data (Faulwetter et al., 2016).

Figure 2 Common problems encountered in historical data, such as old ligatures, absence of taxon names, ambiguous symbols, shortened words and descriptive information instead of numerical (page 185 from Forbes, 1844).

Manual curation, a tedious and multistep process, requires substantial effort for the correct interpretation of valuable historical information; however, text mining tools appear to be promising in assisting and accelerating this part of the curation process (Alex et al., 2008). Text mining is the automatic extraction of information from unstructured data (Hearst, 1999; Ananiadou and Mcnaught, 2005). These mining tools build upon standardised knowledge, vocabularies and dictionaries, and perform multistep Natural Language Processing. Named Entity Recognition (NER) is a key step in this process for locating terms of interest in text (Perera et al., 2020). The entities of interest for biodiversity documents include: (1) taxon names, (2) people’s names (Page, 2019a; Groom et al., 2020), (3) environments/habitats (Pafilis et al., 2015; Pafilis et al., 2017), (4) geolocations/localities (Alex et al., 2015; Stahlman and Sheffield, 2019), (5) phenotypic traits/morphological characteristics (Thessen et al., 2018), (6) physico-chemical variables, and (7) quantities, measurement units and/or values. Subsequent steps include relation extraction between entities. Multiple tools have emerged in the past few years to retrieve one or more of these entity types (Batista-Navarro et al., 2017; Muñoz et al., 2019; Dimitrova et al., 2020; Le Guillarme and Thuiller, 2022).

The work described in this document has a threefold structure: (a) an estimate is made of the volume of marine historical literature that has been digitised and is available for curation; (b) bioinformatics tools focused on automating and assisting the curation process of these documents are compiled and reviewed; two categories of such curation software are described, (i) web and standalone applications with a Graphical User Interface (GUI) and (ii) Command Line Interface (CLI) programming libraries and software packages; lastly, (c) a demonstrator biodiversity data curation workflow, named DECO (bioDivErsity data Curation programming wOrkflow8), developed using programming tools, is presented.

Method

Historical Literature Discovery

A search was conducted on BHL to amass its historical literature regarding marine biodiversity. Using the keywords “marine”, “ocean”, “fishery”, “fisheries” and “sea” on the items’ titles and their subjects (the scripts, results and documentation are available in this repository9), the documents available for information extraction were estimated. Subjects are categories provided for each title, and multiple subjects can be assigned to each title. Items originally published before 1960 were selected, in order to include only historical documents, according to the definition given in the Introduction section. Furthermore, the taxon names on each page, which were identified by BHL using the Global Names parser tool (Mozzherin et al., 2022), were summarised for every document. Hence, summaries of the number of automatically identified taxon names were calculated along with the page count of each item. Additionally, OBIS’ historical datasets originally published before 1960 were downloaded and analysed. This analysis provides an approximation of the size of the available marine historical literature compared to the already rescued documents. All analysis scripts were written in the GNU AWK programming language and the visualisation scripts were written in R using the ggplot2 library (Wickham, 2016).
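
The sketch below illustrates this filtering and summary step in R; it is not the published analysis script, and the input file and its column names (a hypothetical tab-separated export of BHL item metadata) are assumptions used only for illustration.

```r
# Minimal sketch of the filtering/summary step, assuming a hypothetical
# tab-separated export of BHL item metadata with columns:
# item_id, year, pages, taxa_found.
library(ggplot2)

items <- read.delim("bhl_marine_items.tsv", stringsAsFactors = FALSE)

historical <- subset(items, year < 1960)    # keep items published before 1960
summary(historical$taxa_found)              # distribution of auto-detected taxa per item

ggplot(historical, aes(x = year)) +
  geom_histogram(binwidth = 10) +           # number of items per decade
  labs(x = "Year of original publication", y = "Number of BHL items")
```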

Historical Document Rescue Methodology

Data curators thoroughly read each page of a document and insert the data into spreadsheets, mapping them to Darwin Core terms, adding metadata and creating a standard Darwin Core Archive10. This process, which is mostly manual, means reading the information (e.g. the occurrence of a specific taxon and its locality) and typing it into the corresponding cell of the data file. It is, as expected, a time- and resource-consuming procedure. Taxon names, traits, environments and localities can also be identified, and the transformation of these results to database identifiers (IDs), such as Life Science Identifiers11 (LSIDs) based on Aphia IDs12, Encyclopedia of Life13 (EOL) IDs (Parr et al., 2014), Marine Regions Gazetteer IDs and marine species traits14, among others, can be facilitated through web applications and programming software. The Natural Environment Research Council15 Vocabulary Server, developed and hosted by the British Oceanographic Data Centre16, was used for mapping facts and additional measurements included in documents.

Tools assist curators in this process during the NER, entity mapping, data structure manipulation and, finally, data upload steps. Curation tools can be categorised as GUI applications (computer programs and web applications) and CLI applications (interconnected programming tools, libraries and packages) (Figure 3). As an example, multi-page documents can be searched for taxon names in seconds, with technologies that find synonyms and perform fuzzy searches to catch OCR misspellings (see the sketch below). The interconnection and guidance of these steps still requires human interaction and correction. GUI applications are standalone programs or web applications; the latter support document upload, process the document on a server and deliver the results back to the user (Lamurias and Couto, 2019). CLI tools include programming packages and libraries of any programming language in UNIX (Linux and Mac operating systems - OS) and Windows OS. Even though programming packages and libraries are fast and scalable, they require familiarity and expertise in the CLI and programming, which take effort and time because of the initial learning curve. The CLI tools, Application Programming Interfaces (APIs) and programming packages chosen during this study are open source, are in active development, can process many documents and can be combined with other tools in some of the considered steps.
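
A minimal sketch of such fuzzy matching with base R, used here only to illustrate how OCR misspellings of a taxon name can still be caught; the example strings are invented.

```r
# Simulated OCR output containing a clean and a corrupted taxon name
page_text <- c("Pecten opercularis, living, 20 fathoms",
               "Peoten opercuIaris, dead shells")

# agrep() performs approximate (fuzzy) matching; max.distance sets
# how many character edits are tolerated between pattern and text.
agrep("Pecten opercularis", page_text, max.distance = 2, value = TRUE)
```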

Figure 3 The curation process of marine historical biodiversity documents: on the left column are the required steps starting from the scanned document (usually a PDF file) and ending with the data publishing step. Two approaches are presented: in the middle column are the GUI tools whereas on the right column are the CLI and/or the executable programming tools. Note that the list of given examples is non-exhaustive. Icons used from the Noun Project released under CC BY: Whale by Alina Oleynik, Fish by Asmuh, tag code by vigorn, pivot layout by paisan, Certificate by P Thanga Vignesh, web service by mynamepong.

Case Study

The historical document “Report on the Mollusca and Radiata of the Aegean Sea: and on their Distribution, Considered as Bearing on Geology” by Forbes (1844) and its curated dataset were used as a case study for the tool usage description and evaluation (where applicable). More specifically, the six-page-long Appendix No. 1 (pages 180-185) has been manually curated and published, thus serving as a gold standard (Figure 4). The document was digitised and transcribed on 2009-04-22 by the Internet Archive17, and on 2021-09-30 it was manually curated (Mavraki et al., 2021) and published in MedOBIS18 (Arvanitidis et al., 2006). The rescue process resulted in a Darwin Core Archive file with 530 occurrence records, 17 unique sampling stations and 260 taxa, covering 217 species. The effort required from the information extraction task to data publishing was roughly 50 working days (8 hours per day) by a single data curator.

Figure 4 A screenshot of the dataset used where the structure of the data and metadata provided can be seen (page 180 from Forbes, 1844).

Tool Usability Evaluation

The web applications mentioned in this work were tested in November 2020 in two web browsers, Mozilla Firefox version 83 and Google Chrome version 87, both on Microsoft Windows 10 and MacOS 10.14.

Demonstrator

DECO was developed to automate the curation of historical biodiversity data. Its workflow combines image processing and OCR tools for scanned historical documents with text mining technologies. It extracts biodiversity entities such as taxon names, environments as described in ENVO, and tissue mentions. The extracted entities are further enriched with marine data identifiers from public APIs (e.g. WoRMS) and presented both in a structured format and as a report with automated visualisation components. Furthermore, the workflow was implemented as a Docker container to ease its installation and its scalable application to large documents. DECO is released under the GNU GPLv3 licence (separate licences apply to 3rd party components) and is available via its GitHub repository (https://github.com/lab42open-team/deco).

Results

Historical Literature Discovery

Analysis of the marine literature in BHL holdings revealed 1,627 distinct digital items containing between 100 and 10,000 distinct taxa, as identified automatically by the Global Names GNfinder tool. These items cover the period from 1558 to 1960, contain 648,927 pages and are written in 10 different languages, with English accounting for 80%. An absolute estimate of the volume of historical marine data is difficult to make, as many more documents are stored locally in legacy formats.

The rescued historical marine data uploaded to OBIS amount to 223 datasets, originally published from 1753 to 1960. Hence, the manually curated literature is far smaller than the pool of available digitised documents. These rescued biogeographical datasets cover 46,000 species across 38 phyla and contain about 1.5 million occurrence records at the species level.

Bioinformatics Tools Compilation and Review

This section describes the tools used in the curation workflow (Figure 3). For each step, the main up-to-date programming tools, web services and applications used for the extraction of biodiversity data are presented. These curation tools are listed in Table 1, together with features such as the information they extract, their input format and their interface.

Table 1 Functions, interface and curation step of the tools tested in this work.

Named Entity Recognition

The Global Names Recognition and Discovery19 (GNRD) tool, part of the Global Names Architecture20 (GNA), is a web application used for the recognition of scientific names. It accepts files such as PDFs, images or Microsoft Office documents, as well as URLs or free-form text. It supports OCR transformation of PDF files using the Tesseract21 tool and uses the GNfinder22 discovery engine to provide the list of names. It offers an API and can be installed locally. GNA is also used by the BHL platform to locate taxonomic names within the pages of its collections (Richard, 2020).

The test performed on the six-page PDF excerpt of Forbes (1844) provided 128 unique scientific names at species level, out of the 218 identified through manual curation (Figure S1). WoRMS Aphia IDs (Vandepitte et al., 2015; Martín Míguez et al., 2019) are widely used and are included in GNRD.

The Biodiversity Observation Miner23 (BOM) is a web application based on R Shiny24, also available on GitHub25, that allows for the semi-automated discovery of biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with species scientific names (Muñoz et al., 2019). It uses the GNfinder discovery engine through the R package taxize26 (Chamberlain and Szöcs, 2013). BOM is still under development (April 2022) and requires an OCR-processed PDF file as input. The novelty of this tool is the provision of text snippets (Figure S2) and of word co-occurrences, accompanied by their counts, to inform curators about terms that appear together in the document.

TextAnnotator27, provided by the specialised information service BIOfid28, focuses on the extraction of taxon names of vascular plants, birds, moths and butterflies, as well as locations and time expressions mentioned in German texts (Driller et al., 2018; Driller et al., 2020). It could be extended to other environments, languages and taxonomic groups, with the BIOfid GitHub page29 serving as the starting point. TextAnnotator, currently in beta version, accepts web pages or free text. Evidence of its recent use can be found in Driller et al. (2020).

The Pensoft Annotator30 is another beta web application, one that annotates text with ontology terms (Dimitrova et al., 2020) (Figure S3). It has the Relation Ontology31 (RO) and ENVO built in, but it is extendable to any ontology after curation adjustments such as stopword handling. Input length is subject to a character limit, which can, however, be expanded upon communication with the tool’s administrators.

Taxonfinder32 is a web application for the extraction of scientific names mentioned in web pages. It features an API that was used in BHL for large scale annotations of taxonomic names until 2019, when it was replaced by GNfinder (Richard, 2020).

The most notable CLI-based NER tool for taxon names is the Global Names Finder (GNfinder) (Pyle, 2016; Mozzherin et al., 2022), which provides fuzzy search and is the underlying engine of most biodiversity text mining tools. It is in active development, making it a reliable tool for this work. The main command line tool is gnfinder find, which returns two arrays (metadata and names). The metadata comprise the language, the date of execution of the command and the total number of words. The names array has one entry per identified string, containing the matched string, the returned name and its positional boundaries in the character sequence.
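
As a minimal sketch, GNfinder can also be driven from R and its JSON output tabulated with jsonlite; this assumes a local GNfinder installation invoked as gnfinder find, as described above, and an OCRed text file named page_180.txt. Field names may differ between GNfinder versions.

```r
# Call the local gnfinder binary on an OCRed text file and parse the JSON output.
library(jsonlite)

raw <- system2("gnfinder", args = c("find", "page_180.txt"), stdout = TRUE)
res <- fromJSON(paste(raw, collapse = "\n"))

# The result holds a metadata block and a names array (one entry per detected
# string); inspect the structure before relying on specific field names.
str(res$metadata)
head(res$names)
```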

In order to simultaneously extract taxon, environment and tissue mentions, the EXTRACT33 tool (Pafilis et al., 2017) builds on the JensenLab tagger API (Jensen, 2016) with the dictionaries SPECIES-ORGANISMS34 (Pafilis et al., 2013), ENVIRONMENTS35 (Pafilis et al., 2015) and TISSUES36 (Palasca et al., 2018). It returns NCBI Taxonomy IDs (Schoch et al., 2020), ENVO terms and BRENDA IDs37, respectively, in a file with three columns: tagged text, entity type and term ID. TaxoNERD (Le Guillarme and Thuiller, 2022), which uses deep neural networks, scores higher than other NER tools on taxon name recognition against gold standard corpora.
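
A minimal sketch of post-processing such output in R, assuming it has been saved as a tab-separated file with the three columns described above; the file name and column order are assumptions.

```r
# Read the three-column EXTRACT output: tagged text, entity type, term ID.
extract_hits <- read.delim("extract_output.tsv", header = FALSE,
                           col.names = c("text", "entity_type", "term_id"),
                           stringsAsFactors = FALSE)

# Keep only mentions mapped to Environment Ontology identifiers (ENVO: prefix)
envo_hits <- subset(extract_hits, grepl("^ENVO:", term_id))
sort(table(envo_hits$text), decreasing = TRUE)   # most frequent environment terms
```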

An important NER system is the Stanford NER38 (Finkel et al., 2005), which recognises locations, persons and organisations in text. It has a generic scope, but it can also assist in the curation of biodiversity data. The general tokenisation and normalisation procedures developed by the Stanford NLP team are the basis of many text mining tools. Additionally, the ClearEarth39 project (Thessen et al., 2018) can tag biotic and abiotic entities, localities, units and values in text and is built using the ClearTK NLP toolkit40 (Bethard et al., 2014). Upon installation it downloads multiple dictionaries and takes up to six gigabytes of space. It relies on Stanford NLP and other dependencies, and provides a Python wrapper and a CLI.

A common constraint in historical documents is the lack of coordinates for the sampling areas, so the data curator has to derive coordinates from the toponyms given. Several tools enable this procedure, such as the Marine Regions Gazetteer. The BioStor-Lite map41, which contains automated geolocation annotations of BHL documents (Page, 2019b), displays the annotated points on a global map and allows the user to search for additional documents by selecting points or drawing rectangles on the map. The Edinburgh Geoparser (Alex et al., 2015), a command line tool, recognises place names in text and is one of very few tools with this functionality. The Stanford NER system, after appropriate training, has also been used for geolocation recognition (Stahlman and Sheffield, 2019).

Entity Normalisation and Mapping

Mapping the information retrieved from the NER tools to different IDs is crucial for cross-platform interoperability; ensuring good output requires the mapping services to be up to date.

Taxon names can have multiple IDs depending on the platform; common taxonomic identifiers, apart from the Linnaean names themselves, are LSIDs, NCBI Taxonomy identifiers and EOL identifiers, among others. For marine species, LSIDs based on Aphia IDs are the most widely adopted.

Ontobee42, a web server that links ontologies, is useful for the annotation of text with ontology IDs (Xiang et al., 2011). Curators can provide text snippets to Ontobee in order to retrieve ontology terms regarding environmental features (e.g. ENVO IDs), functional traits (e.g. PATO IDs43; Tan et al., 2022) or other ontology terms of interest. Currently, the use of entire documents is not recommended.

The WoRMS Taxon Match44 tool matches the list of taxa found against the World Register of Marine Species (WoRMS) and returns their taxon LSIDs. Geographic regions are confirmed with the georeference tool developed for the Marine Regions Gazetteer: users can enter a location name in the gazetteer search field of the web interface, and the output includes the region’s boundaries and the corresponding MRGID.

Most vocabulary servers provide APIs that map between the different IDs. EMODnet Biology has adopted LSIDs for marine species based on Aphia IDs from the WoRMS vocabulary, which provides a dedicated API and the R package worrms (Chamberlain, 2020). Additionally, the R package taxize (Chamberlain and Szöcs, 2013) provides taxon mapping capabilities across many data sources (i.e. NCBI Taxonomy, the Integrated Taxonomic Information System, Encyclopedia of Life and WoRMS). Functions like get_eolid, get_nbnid and get_wormsid can map the taxon names of the case study, row by row, to the corresponding identifiers. In addition, the GloBI45 (Global Biotic Interactions) nomer tool46 (Poelen and Salim, 2022) can also be used, as it provides entity mapping functionality via the CLI (Poelen et al., 2014).
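
A minimal sketch of this row-by-row mapping with taxize; the taxon names below are illustrative marine molluscs rather than verbatim entries of the case-study dataset, and interactive prompts are disabled.

```r
# Map a small vector of taxon names to WoRMS (Aphia) and EOL identifiers.
library(taxize)

taxa <- c("Ostrea edulis", "Arca tetragona", "Pecten opercularis")

aphia <- get_wormsid(taxa, ask = FALSE)   # WoRMS Aphia IDs (the basis of the LSIDs)
eol   <- get_eolid(taxa, ask = FALSE)     # Encyclopedia of Life page IDs

data.frame(taxon = taxa,
           aphia_id = as.character(aphia),
           eol_id = as.character(eol))
```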

Data Transformations

In this step, curators organise the data according to the Darwin Core47 standard and its extensions, such as the Extended Measurement or Fact extension48, resulting in the creation of a Darwin Core Archive (see guidelines49) with detailed sampling descriptors and terms based on controlled vocabularies.

For data transformations, curators tend to use GUI spreadsheet applications like Microsoft Excel, Google Sheets and LibreOffice Calc. OpenRefine50 is a free, open source software package that handles messy data and supports their transformation in various ways (Ham, 2013). The software’s main goal is to provide data cleaning, fixing and analysis, while also enhancing the interconnection between different datasets (Verborgh and De Wilde, 2013).

This transformation can be automated with CLI tools such as the R tidyverse51 package suite, the Python pandas52 library and the AWK programming language53, among others. These tools support fast and scalable handling, manipulation, merging and filtering of tabular and text data. The choice of tool depends on the user’s familiarity, expertise and operating system.
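
A minimal sketch of such a transformation with the tidyverse; the input file and its original column names (taxon, locality, lat, lon, date) are assumptions, while the target column names are Darwin Core terms.

```r
# Rename extracted columns to Darwin Core terms and add constant fields.
library(dplyr)
library(readr)

occurrences <- read_tsv("extracted_occurrences.tsv")   # hypothetical extraction output

dwc <- occurrences %>%
  rename(scientificName   = taxon,
         verbatimLocality = locality,
         decimalLatitude  = lat,
         decimalLongitude = lon,
         eventDate        = date) %>%
  mutate(basisOfRecord    = "HumanObservation",
         occurrenceStatus = "present")

write_tsv(dwc, "occurrence.txt")   # occurrence core of a Darwin Core Archive
```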

Quality Control

Prior to publishing the dataset, it is important to perform sanity and quality checks to ensure that the data comply with the Darwin Core standards (Vandepitte et al., 2015). The LifeWatch-EMODnetBiology QC tool54 accepts an IPT URL or the dataset’s DwC-A files and outputs a list of the quality issues encountered, according to the EMODnet Biology criteria. It is available as a web application based on R Shiny and as an R package55 (De Pooter and Perez-Perez, 2019). The LifeWatch Belgium Data Services56 offer similar functionalities, providing a compilation of data services that accept plain text and spreadsheet files as input. The GBIF Data Validator57 combines all the above-mentioned options in terms of input and provides a detailed summary of the issues encountered in data and metadata. OpenRefine is equipped with a few extensions that can also check taxon names and reconcile them.

The obistools58 R package (Provoost et al., 2019), the basis of the LifeWatch-EMODnetBiology QC tool, can be used to check coordinate boundaries and to calculate centroids in cases where the exact location is unknown. It also checks date formats and event structure. It has comprehensive documentation and is in active development.
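
A minimal sketch of such checks with obistools, run on an occurrence table like the one built above; function availability and arguments should be verified against the package documentation.

```r
# Run basic OBIS/Darwin Core quality checks on an occurrence table.
library(obistools)

occ <- read.delim("occurrence.txt", stringsAsFactors = FALSE)

check_fields(occ)      # flags missing required Darwin Core fields
check_eventdate(occ)   # flags eventDate values that are not valid ISO 8601
check_onland(occ)      # flags marine records whose coordinates fall on land
# calculate_centroid() can derive a midpoint and uncertainty from a WKT geometry
# when only a named area, rather than exact coordinates, is known.
```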

Upload to Database

The last step of the curation process is the publication of the standards-compliant data, which is facilitated by the Integrated Publishing Toolkit59 (IPT) software platform (Robertson et al., 2014).

Curators create an IPT resource entry with the aforementioned data and associated metadata, which is then uploaded to an IPT instance, e.g. the MedOBIS60 Repository (Arvanitidis et al., 2006). In the case of MedOBIS, the IPT is subsequently harvested and made available by the central OBIS61 system, thus being a strong example and supporter of the ‘collect once, use many times’ concept.

One-Stop-Shop Tools

The main all-in-one GUI computer program is GoldenGATE-Imagine62, an updated version of the GoldenGATE editor (Sautter et al., 2007). This tool supports OCR, NER and entity mapping, as described in the various steps of the curator’s workflow, by providing ontology-backed annotations on PDFs. It was developed by Plazi in 2015 and was last updated in 2016. Several recent biodiversity data publications still report its use, although it has not been updated since then (Miller et al., 2019; Rivera-Quiroz and Miller, 2019; Agosti et al., 2020). Due to its open source nature, GoldenGATE-Imagine can be further developed by any interested parties, as exemplified by GNfinder.

DECO: A Biodiversity Data Curation Programming Workflow

A CLI workflow named DECO, developed to demonstrate the advantages of the CLI approach, is available via its GitHub repository63. DECO connects different tools across the programming curation steps (Figure 3). Execution is via a single command on a user-provided PDF file, and the output comprises the taxon names with records from the WoRMS API, NCBI Taxonomy IDs and ENVO terms from the Environment Ontology. Complementary tools (i.e. Ghostscript64, jq65 and ImageMagick66) and UNIX commands are also called in a single Bash script which unifies the workflow. In order to simplify the setup of the workflow, a Docker container and a Singularity container that include all the dependencies and the code were developed. The code and both containers have been tested on Ubuntu, Mac and a Debian server (Table 2). For a larger corpus of historical biodiversity data, the recommendation is to use the Singularity container on a remote server or a High Performance Computing (HPC) cluster.

Table 2 The platforms where the CLI workflow was tested.

Discussion

Data Rescue Landscape

The huge difference between the rescued historical marine datasets uploaded to OBIS and the digital items available in BHL holdings reflects the challenges faced by curators and the minimal attention paid by the wider community, when compared to other data rescue activities (e.g. specimens, oceanographic data). Many publications lack basic metadata such as the location, date, purpose or method of sampling. Tracing this information is limited, as the data providers may (a) have forgotten these details, (b) be retired or (c) be deceased (Michener et al., 1997).

The project ‘Census of Marine Life’ included, among its initial objectives, the rescue of historical marine data. Since then, there have been ongoing efforts within the EMODnet Biology project and the LifeWatch Research Infrastructure, among others. Similarly, initiatives like the Global Oceanographic Data Archaeology and Rescue67 (GODAR) project, the Oceans Past Initiative68 (OPI) and RECovery of Logbooks And International Marine data69 (RECLAIM) (Wilkinson et al., 2011) rescue oceanographic, climate and biodiversity data from ship logs. More effort is needed, however, as exemplified by the digitisation of museum specimen collections and herbaria (Mora et al., 2011; Wheeler et al., 2012), which is supported by multiple projects and infrastructures such as the Distributed System of Scientific Collections70 (DiSSCo), Innovation and consolidation for large scale digitisation of natural heritage71 (ICEDIG), Integrated Digitized Biocollections72 (iDigBio) and the Biodiversity Community Integrated Knowledge Library (BiCIKL) (Penev et al., 2022). Similar attention is required to rescue marine biodiversity data from historical documents, which can contribute to a more complete global biodiversity synthesis (Heberling et al., 2021).

In the last few years, an upsurge in the development of web applications enhancing biodiversity data digitisation has been observed, an indication of the need for such initiatives. Advancements in OCR, text mining and information technology promise the semi-automation and acceleration of the curator’s work, which could transform biodiversity curation into an -omics-like, interdisciplinary field requiring complementary skills in document handling, web technologies and text mining, to name but a few.

Interface Remarks

Web applications provide the advantage of visual aids (e.g. highlighting of NER terms), which make evaluation easier and more intuitive through their graphical interfaces. Emerging web development technologies like R Shiny, Flask73 and Django74, among others, have simplified web application development. These applications are powerful and effective in most cases, but are siloed in functionality and extendability; they also have many software dependencies, which increase instability when the applications are not maintained in the long term.

CLI tools are a powerful way to implement scalable, reproducible and replicable workflows: scalable because the same code can be applied to multiple files (in this case, the various documents); reproducible and replicable because the code can be executed multiple times and with different types of documents, respectively. Furthermore, they usually have additional functionalities that have not been implemented in their web application counterparts. The difficulties regarding CLI tools’ dependencies and portability are being resolved with the rise of containerised applications, which include all system requirements and are distributed through web repositories like Docker Hub75. The downside is that, lacking interactivity, they can be cumbersome when assisting the curation process.

Sustainability

Tool usability relies on active development, continuous support and debugging. Sustainability is considered the main issue regarding the tools’ functionality. An example is EnvMine (Tamames and de Lorenzo, 2010), a promising cutting-edge tool from 2010 which is no longer available. One-stop-shop software applications for domain-specific usage, like GoldenGATE, are very helpful but require more effort to keep their integrated tools up to date. Other tools often fall out of date because active development and issue reporting in open-source repositories, such as GitHub, are lacking; they thus become obsolete and unsupported within only a few years of their first release.

Curation Step-Wise Remarks

The curators’ role is invaluable in the data rescue process, as their domain-specific expertise is far from being entirely automated. There are plenty of digitised historical documents that have not been curated in web libraries, such as BHL, the Belgian Marine Bibliography76, Web of Science77, Wiley Online Library78 and Taylor & Francis Online79, among others (Kearney, 2019). BHL provides “OCRed” documents, and there are plenty of other tools that can tackle this process, reviewed elsewhere (Owen et al., 2020). However, OCR is a crucial limiting step in these workflows, in terms of the information transformed from image to text, because many cases lead to misspelled or lost text, especially with handwritten text and poor-quality images (Lyal, 2016).

Information extraction can be performed both on a small and a large scale. Named entities are what most text mining tools extract. Taxon name recognition is the main function of the majority of current tools and has matured significantly over the past decade, especially through the integration of multiple platforms with the GNA (Pyle, 2016). Environments and geolocations have strong background data: Environment Ontology terms (retrieved with the EXTRACT tool) and the GeoNames80/Marine Regions gazetteers, respectively. Geolocation mining, in particular, has not yet been adopted in biodiversity curation, but there are generic tools (e.g. mordecai81 - Halterman, 2017) that can be trained with gazetteers to extract approximate localities from text. The extraction of sampling locations from maps is also possible, by first georeferencing the historical map in a Geographic Information System (Jenny and Hurni, 2011) and then using computer vision to find the locations’ coordinates (Chiang et al., 2014). Characteristics of taxa, i.e. phenotypic traits, associated physico-chemical variables, units and the use of semantics to describe relations, are still under standardisation (Thessen et al., 2020), and NER prototypes have been built with ClearEarth and the Pensoft Annotator, for example.

Entity mapping has also seen important development, because there are many open public APIs for vocabularies like those used in WoRMS and Marine Regions and for aggregators such as GBIF and OBIS, among others, and in some cases software packages (mostly in the R programming language). The Publication task has its dedicated applications and tools, with the CLI tools being able to perform quality control and deliver a preferred format on the fly.

DECO

The CLI scientific workflow assembled in this paper, DECO, is a demonstration of EMODnet Biology’s vision for biodiversity data rescue using programming tools. To the best of our knowledge, this is the first task-driven CLI workflow that brings together state-of-the-art image processing and OCR tools, text mining technologies and web APIs in order to assist curators. By using a programming interface and command line tools, the workflow is scalable, customisable and modular, meaning that more tools can be incorporated, e.g. to cover the entities mentioned in the previous section. It is fast, may be used on a personal computer, and is available as a Docker and a Singularity container. The containerised versions of the workflow simplify the installation procedure and increase its stability, scalability and portability because they include all the necessary dependencies. This CLI scientific workflow promises faster and high-throughput processing that could be applied to any type of data, not only historical, thus contributing to the overall digitisation of biodiversity knowledge.

Future Outlook

Progress has been made in the historical data rescue process, from digitisation platforms to standards, services and publication (Beja et al., 2022). Bridging the gap between tools and curators requires effort on both ends, namely from the data curators and the tool developers. It is recommended that curators be trained in basic programming skills, from which they, and the historical data rescue process in general, would benefit in the long term (Holinski et al., 2020). Regarding software development, important features are highlighted, like the use of multiple ontologies in the Pensoft Annotator; this is a direction which should be further expanded to all entities of interest. Multidisciplinary cooperation between scientific communities and the partners behind tools, ontologies and databases is needed to accomplish this task (Bowker, 2000). An important example was set by GNA, which advanced scientific name recognition significantly. In addition, the co-occurrence feature present in the Biodiversity Observation Miner, once expanded to other entities and coupled with a scoring scheme, would yield a state-of-the-art text mining application that goes beyond NER to actually infer relations. The rise of deep neural networks is also promising for all tasks of Information Extraction, as seen in TaxoNERD (Le Guillarme and Thuiller, 2022). Lastly, the community is pushing towards Semantic Publishing, FAIR completeness of new data and new taxonomic publishing guidelines, in order to eliminate the need for text mining and curation of current publications (Penev et al., 2019; Fawcett et al., 2022).

The implementation of crowdsourced curation within citizen science projects for historical biodiversity data is encouraged (Clavero and Revilla, 2014; Arnaboldi et al., 2020; Holinski et al., 2020). Practices like this are already in place in the digitisation of natural history collections and have proved fruitful (Ellwood et al., 2015). EMODnet Biology’s Phase IV will launch such a citizen science project for historical document curation during the second half of 2022. Approaches from other fields of science that handle historical and old data, such as history, linguistics and archaeology, would provide useful insights for the text mining of historical biodiversity data.

Concluding Remarks

Historical marine biodiversity data provide important snapshots of the past that can help us understand the current status of ocean ecosystems and predict future trends in the face of the climate crisis. There is a wealth of historical documents that have been digitised, yet most of their data have not been rescued or published in online systems. To accelerate the tedious data rescue process, it is essential that more curators become engaged and that tools for Information Extraction and Publication are improved to satisfy their needs. Tools like DECO and GoldenGATE demonstrate possible future directions for one-stop-shop applications with command line and graphical user interfaces, respectively. Research Infrastructures can play a pivotal role towards this goal. Last but not least, the community and funding bodies should prioritise the data rescue of these invaluable documents before their decay and inevitable loss.

Data Availability Statement

DECO is available here: https://github.com/lab42open-team/deco. The historical marine literature analysis is here: https://github.com/savvas-paragkamian/historical-marine-literature. BHL, EMODnet Biology and OBIS data are available for download at https://about.biodiversitylibrary.org/tools-and-services/developer-and-data-tools/, https://www.emodnetbiology.eu/toolbox/en/download/occurrence/explore and https://obis.org/manual/access/, respectively. The digitised document of the “Report on the Mollusca and Radiata of the Aegean Sea, and on their distribution, considered as bearing on Geology. 13th Meeting of the British Association for the Advancement of Science, London, 1844” is available here: https://www.biodiversitylibrary.org/page/12920789. The curated dataset of the case study is available here (version 1.9 and above): http://ipt.medobis.eu/resource?r=mollusca_forbes.

Author Contributions

Conceptualisation: CA, EP, and VG. Wrote first draft of the manuscript: SP, GS, DM, CP, CA, EP, and VG. Revised the manuscript: all. Web applications testing: SP, GS, ME, RP. Programming tools testing: SP, HZ, and EP. Code development and containerisation: SP and HZ. Work Package Leaders: VG and DM. Project coordinator: JB. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by EMODnet Biology Phase III (EASME/EMFF/2016/-1.3.1.2/Lot 5/SI2.750022 and EASME/EMFF/2017/1.3.1.2/02/SI2.789013) and Phase IV (EMFF/2019/1.3.1.9/Lot 6/SI2.837974). The European Marine Observation and Data Network (EMODnet) is financed by the European Union under Regulation (EU) No 508/2014 of the European Parliament and of the Council of 15 May 2014 on the European Maritime and Fisheries Fund. SP was also supported by EMODnet Biology Phase IV, and for different parts of his work he was supported by the project “Centre for the study and sustainable exploitation of Marine Biological Resources (CMBR)” (MIS 5002670), which is implemented under the Action “Reinforcement of the Research and Innovation Infrastructure,” funded by the Operational Programme “Competitiveness, Entrepreneurship and Innovation” (NSRF 2014-2020) and co-financed by Greece and the EU (European Regional Development Fund). GS received support from EMODnet Biology Phase III and Phase IV. SP and HZ received support from the Hellenic Foundation for Research and Innovation (HFRI) and the General Secretariat for Research and Innovation (GSRI), under grant agreement No. 241 (PREGO project). DM and VG have received support from the LifeWatchGreece Research Infrastructure (Arvanitidis et al., 2016) and the “Centre for the study and sustainable exploitation of Marine Biological Resources (CMBR)” (MIS 5002670), which is implemented under the Action “Reinforcement of the Research and Innovation Infrastructure,” funded by the Operational Programme “Competitiveness, Entrepreneurship and Innovation” (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). LB received support from EMODnet Biology Phase IV. The work of LV is funded by the Research Foundation - Flanders (FWO) as part of the Belgian contribution to LifeWatch. For different aspects of his work, HZ received support from ELIXIR-GR: Managing and Analysing Life Sciences Data (MIS: 5002780), which is co-financed by Greece and the European Union - European Regional Development Fund. CA received support from LifeWatch ERIC. LifeWatch ERIC funded the publication fees. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmars.2022.940844/full#supplementary-material

Footnotes

  1. ^ http://www.coml.org/
  2. ^ https://www.tdwg.org/
  3. ^ https://www.iczn.org/
  4. ^ http://www.marinespecies.org/
  5. ^ https://sites.google.com/site/environmentontology/home
  6. ^ https://www.marineregions.org/
  7. ^ https://dvcs.w3.org/hg/gld/raw-file/default/glossary/index.html#x5-star-linked-open-data
  8. ^ https://github.com/lab42open-team/deco
  9. ^ https://github.com/savvas-paragkamian/historical-marine-literature
  10. ^ https://manual.obis.org
  11. ^ http://www.lsid.info/
  12. ^ https://www.marinespecies.org/aphia.php?p=webservice
  13. ^ https://eol.org/
  14. ^ https://www.marinespecies.org/traits/
  15. ^ https://vocab.nerc.ac.uk/
  16. ^ https://www.bodc.ac.uk/
  17. ^ https://archive.org/details/reportofbritisha43cor
  18. ^ https://www.lifewatchgreece.eu/?q=content/medobis
  19. ^ https://gnrd.globalnames.org/
  20. ^ http://globalnames.org/
  21. ^ https://github.com/tesseract-ocr/tesseract
  22. ^ https://github.com/gnames/gnfinder
  23. ^ https://fgabriel1891.shinyapps.io/biodiversityobservationsminer/
  24. ^ https://shiny.rstudio.com/
  25. ^ https://github.com/fgabriel1891/BiodiversityObservationsMiner
  26. ^ https://github.com/ropensci/taxize
  27. ^ http://www.textannotator.texttechnologylab.org/
  28. ^ https://biofid.de/en/
  29. ^ https://github.com/FID-Biodiversity/BIOfid/tree/master/BIOfid-Dataset-NER
  30. ^ https://annotator.pensoft.net/
  31. ^ https://github.com/oborel/obo-relations
  32. ^ http://taxonfinder.org/
  33. ^ https://extract.jensenlab.org/
  34. ^ https://species.jensenlab.org
  35. ^ https://environments.jensenlab.org
  36. ^ https://tissues.jensenlab.org/About
  37. ^ https://www.brenda-enzymes.org/
  38. ^ https://nlp.stanford.edu/software/CRF-NER.html
  39. ^ http://github.com/ClearEarthProject/ClearEarthNLP
  40. ^ http://cleartk.github.io/cleartk/
  41. ^ https://biostor.org/map.php
  42. ^ http://www.ontobee.org/
  43. ^ https://github.com/pato-ontology
  44. ^ http://www.marinespecies.org/aphia.php?p=match
  45. ^ https://www.globalbioticinteractions.org/
  46. ^ https://github.com/globalbioticinteractions/nomer
  47. ^ https://dwc.tdwg.org/
  48. ^ https://manual.obis.org
  49. ^ https://www.gbif.org/tool/81282/darwin-core-archive-assistant
  50. ^ https://openrefine.org/
  51. ^ https://www.tidyverse.org
  52. ^ https://pandas.pydata.org
  53. ^ https://en.wikipedia.org/wiki/AWK
  54. ^ https://rshiny.lifewatch.be/BioCheck/
  55. ^ https://github.com/EMODnet/EMODnetBiocheck
  56. ^ https://www.lifewatch.be/data-services/
  57. ^ http://gbif.org/tools/data-validator
  58. ^ https://github.com/iobis/obistools
  59. ^ https://www.gbif.org/ipt
  60. ^ https://www.lifewatchgreece.eu/?q=content/medobis
  61. ^ https://manual.obis.org
  62. ^ https://github.com/plazi/GoldenGATE-Imagine
  63. ^ https://github.com/lab42open-team/deco
  64. ^ https://www.ghostscript.com/index.html
  65. ^ https://stedolan.github.io/jq/
  66. ^ https://imagemagick.org/index.php
  67. ^ https://www.ncei.noaa.gov/products/ocean-climate-laboratory/global-oceanographic-data-archaeology-and-rescue
  68. ^ https://oceanspast.org
  69. ^ https://icoads.noaa.gov/reclaim/
  70. ^ https://www.dissco.eu
  71. ^ https://icedig.eu/
  72. ^ https://www.idigbio.org/
  73. ^ https://flask.palletsprojects.com/
  74. ^ https://www.djangoproject.com/
  75. ^ https://hub.docker.com/
  76. ^ https://www.vliz.be/en/belgian-marine-bibliography
  77. ^ https://www.webofknowledge.com
  78. ^ https://onlinelibrary.wiley.com
  79. ^ https://www.tandfonline.com/
  80. ^ http://www.geonames.org
  81. ^ https://github.com/openeventdata/mordecai

References

Abrami G., Henlein A., Lücking A., Kett A., Adeberg P., Mehler A. (2021). Unleashing Annotations With TextAnnotator: Multimedia, Multi-Perspective Document Views for Ubiquitous Annotation. in Proceedings of the 17th Joint ACL - ISO Workshop on Interoperable Semantic Annotation (Groningen, The Netherlands (online): Association for Computational Linguistics), 65–75. Available at: https://aclanthology.org/2021.isa-1.7

Agosti D., Guidoti M., Catapano T., Ioannidis-Pantopikos A., Sautter G. (2020). The Standards Behind the Scenes: Explaining Data From the Plazi Workflow. Biodiversity. Inf. Sci. Standards. 4, e59178. doi: 10.3897/biss.4.59178

Alex B., Byrne K., Grover C., Tobin R. (2015). Adapting the Edinburgh Geoparser for Historical Georeferencing. IJHAC 9, 15–35. doi: 10.3366/ijhac.2015.0136

Alex B., Grover C., Haddow B., Kabadjov M., Klein E., Matthews M., et al. (2008). Assisted Curation: Does Text Mining Really Help? Pac. Symp. Biocomput., 556–567. doi: 10.1142/9789812776136_0054

Ananiadou S., Mcnaught J. (2005). Text Mining for Biology and Biomedicine (USA: Artech House, Inc).

Anderson K. (2006). Does History Count? Endeavour 30, 150–155. doi: 10.1016/j.endeavour.2006.11.002

Arnaboldi V., Raciti D., Van Auken K., Chan J. N., Müller H.-M., Sternberg P. W. (2020). Text Mining Meets Community Curation: A Newly Designed Curation Platform to Improve Author Experience and Participation at WormBase. Database 2020. doi: 10.1093/database/baaa006

Arvanitidis C., Chatzinikolaou E., Gerovasileiou V., Panteri E., Bailly N., Minadakis N., et al. (2016). LifeWatchGreece: Construction and Operation of the National Research Infrastructure (ESFRI). BDJ 4, e10791. doi: 10.3897/BDJ.4.e10791

Arvanitidis C., Valavanis V. D., Eleftheriou A., Costello M. J., Faulwetter S., Gotsis P., et al. (2006). MedOBIS: Biogeographic Information System for the Eastern Mediterranean and Black Sea. Mar. Ecol. Prog. Ser. 316, 225–230. doi: 10.3354/meps316225

Ausubel J. H. (1999). GUEST EDITORIAL: Toward a Census of Marine Life. Oceanography 12, 4–5. doi: 10.5670/oceanog.1999.17

Batista-Navarro R., Zerva C., Nguyen N. T. H., Ananiadou S. (2017). “A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository,” in Information Management and Big Data. Eds. Lossio-Ventura J. A., Alatrista-Salas H. (Cham: Springer International Publishing), 30–42.

Beja J., Vandepitte L., Benson A., Van de Putte A., Lear D., De Pooter D., et al. (2022). “Chapter Two - Data services in ocean science with a focus on the biology”, in Ocean Science Data, eds. G. Manzella and A. Novellino (Amsterdam, Netherlands: Elsevier), 67–129. doi: 10.1016/B978-0-12-823427-3.00006-2


Bethard S., Ogren P., Becker L. (2014). “ClearTK 2.0: Design Patterns for Machine Learning in UIMA,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (Reykjavik, Iceland: European Language Resources Association (ELRA)), 3289–3293. Available at: http://www.lrec-conf.org/proceedings/lrec2014/pdf/218_Paper.pdf.


Bowker G. C. (2000). Biodiversity Datadiversity. Soc. Stud. Sci. 30, 643–683. doi: 10.1177/030631200030005001


Buttigieg P. L., Pafilis E., Lewis S. E., Schildhauer M. P., Walls R. L., Mungall C. J. (2016). The Environment Ontology in 2016: Bridging Domains With Increased Scope, Semantic Density, and Interoperation. J. Biomed. Semantics. 7, 57. doi: 10.1186/s13326-016-0097-6


Calder W. A. (1982). A Proposal for the Standardization of Units and Symbols in Ecology. Bull. Ecol. Soc. America 63, 7–10.


Chamberlain S. (2020). Worrms: World Register of Marine Species (WoRMS) Client. Available at: https://cran.r-project.org/package=worrms.


Chamberlain S. A., Szöcs E. (2013). Taxize: Taxonomic Search and Retrieval in R. F1000Res 2. doi: 10.12688/f1000research.2-191.v2


Chiang Y.-Y., Leyk S., Knoblock C. A. (2014). A Survey of Digital Map Processing Techniques. ACM Comput. Surv. 47, 1–44. doi: 10.1145/2557423


Claus S., De Hauwere N., Vanhoorne B., Deckers P., Souza Dias F., Hernandez F., et al. (2014). Marine Regions: Towards a Global Standard for Georeferenced Marine Names and Boundaries. Mar. Geod. 37, 99–125. doi: 10.1080/01490419.2014.902881


Clavero M., Revilla E. (2014). Mine Centuries-Old Citizen Science. Nature 510, 35. doi: 10.1038/510035c


Costello M. J., Michener W. K., Gahegan M., Zhang Z.-Q., Bourne P. E. (2013). Biodiversity Data Should be Published, Cited, and Peer Reviewed. Trends Ecol. Evol. 28, 454–461. doi: 10.1016/j.tree.2013.05.002


De Pooter D., Perez-Perez R. (2019). EMODnetBiocheck: LifeWatch & EMODnet Biology QC Tool. Available at: https://github.com/EMODnet/EMODnetBiocheck.


Dimitrova M., Zhelezov G., Georgiev T., Penev L. (2020). The Pensoft Annotator: A New Tool for Text Annotation With Ontology Terms. BISS 4, e59042. doi: 10.3897/biss.4.59042


Driller C., Koch M., Abrami G., Hemati W., Lücking A., Mehler A., et al. (2020). Fast and Easy Access to Central European Biodiversity Data With BIOfid. BISS 4, e59157. doi: 10.3897/biss.4.59157


Driller C., Koch M., Schmidt M., Weiland C., Hörnschemeyer T., Hickler T., et al. (2018). Workflow and Current Achievements of BIOfid, an Information Service Mobilizing Biodiversity Data From Literature Sources. Biodiversity. Inf. Sci. Standards. 2, e25876. doi: 10.3897/biss.2.25876


Ellwood E. R., Dunckel B. A., Flemons P., Guralnick R., Nelson G., Newman G., et al. (2015). Accelerating the Digitization of Biodiversity Research Specimens Through Online Public Participation. BioScience 65, 383–396. doi: 10.1093/biosci/biv005


Engelhard G. H., Thurstan R. H., MacKenzie B. R., Alleway H. K., Bannister R. C. A., Cardinale M., et al. (2016). ICES Meets Marine Historical Ecology: Placing the History of Fish and Fisheries in Current Policy Context. ICES J. Mar. Sci. 73, 1386–1403. doi: 10.1093/icesjms/fsv219


Faulwetter S., Pafilis E., Fanini L., Bailly N., Agosti D., Arvanitidis C., et al. (2016). EMODnet Workshop on Mechanisms and Guidelines to Mobilise Historical Data Into Biogeographic Databases. RIO 2, e9774. doi: 10.3897/rio.2.e9774


Fawcett S., Agosti D., Cole S. R., Wright D. F. (2022). Digital Accessible Knowledge: Mobilizing Legacy Data and the Future of Taxonomic Publishing. Bull. Soc. Systematic Biologists 1 (1). doi: 10.18061/bssb.v1i1.8296


Finkel J. R., Grenager T., Manning C. (2005). “Incorporating Non-Local Information Into Information Extraction Systems by Gibbs Sampling,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, Michigan. 363–370. doi: 10.3115/1219840.1219885


Forbes E. (1844). Report on the Mollusca and Radiata of the Aegean Sea, and on Their Distribution, Considered as Bearing on Geology. Rep. Br. Assoc. Advancement. Sci. 1843, 130–193.


Fortibuoni T., Libralato S., Raicevich S., Giovanardi O., Solidoro C. (2010). Coding Early Naturalists’ Accounts Into Long-Term Fish Community Changes in the Adriatic Sea (1800–2000). PLos One 5, e15502. doi: 10.1371/journal.pone.0015502


GBIF: The Global Biodiversity Information Facility. Available at: https://www.gbif.org/citation-guidelines (Accessed April 6, 2022).


van Goethem T., van Zanden J. L. (2021). Biodiversity Trends in a Historical Perspective. Available at: https://www.oecd-ilibrary.org/content/component/2c94883d-en.


Griffin E. (2019). Getting Necessary Historical Data Out of Deep Freeze. Polar. Sci. 21, 238–239. doi: 10.1016/j.polar.2019.05.008


Groom Q., Dillen M., Hardy H., Phillips S., Willemse L., Wu Z. (2019). Improved Standardization of Transcribed Digital Specimen Data. Database 2019, baz129. doi: 10.1093/database/baz129


Groom Q., Güntsch A., Huybrechts P., Kearney N., Leachman S., Nicolson N., et al. (2020). People are Essential to Linking Biodiversity Data. Database 2020, baaa072. doi: 10.1093/database/baaa072


Gwinn N. E., Rinaldo C. (2009). The Biodiversity Heritage Library: Sharing Biodiversity Literature With the World. IFLA. J. 35, 25–34. doi: 10.1177/0340035208102032


Halterman A. (2017). Mordecai: Full Text Geoparsing and Event Geocoding. J. Open Source Software. 2, 91. doi: 10.21105/joss.00091


Ham K. (2013). OpenRefine (Version 2.5). http://openrefine.org. Free, Open-Source Tool for Cleaning and Transforming Data. J. Med. Libr. Assoc. 101, 233–234. doi: 10.3163/1536-5050.101.3.020


Hearst M. A. (1999). “Untangling Text Data Mining,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (College Park, Maryland: Association for Computational Linguistics), 3–10. doi: 10.3115/1034678.1034679


Heath T., Bizer C. (2011). Linked Data: Evolving the Web Into a Global Data Space. 1st ed. (San Rafael, California: Morgan & Claypool). doi: 10.2200/S00334ED1V01Y201102WBE001 (Accessed March 22, 2022).


Heberling J.M., Miller J. T., Noesgaard D., Weingart S. B., Schigel D. (2021). Data Integration Enables Global Biodiversity Synthesis. Proc. Natl. Acad. Sci. 118, e2018093118. doi: 10.1073/pnas.2018093118


Heidorn P. B. (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library. Trends 57, 280–299. doi: 10.1353/lib.0.0036


Herrmann E. (2020). Building the Biodiversity Heritage Library’s Technical Strategy. BISS 4, e59084. doi: 10.3897/biss.4.59084


Holinski A., Burke M., Morgan S., McQuilton P., Palagi P. (2020). Biocuration - Mapping Resources and Needs [Version 2; Peer Review: 2 Approved]. F1000Research 9. doi: 10.12688/f1000research.25413.2


Jenny B., Hurni L. (2011). Studying Cartographic Heritage: Analysis and Visualization of Geometric Distortions. Comput. Graphics 35, 402–411. doi: 10.1016/j.cag.2011.01.005


Jensen L. J. (2016). One Tagger, Many Uses: Illustrating the Power of Ontologies in Dictionary-Based Named Entity Recognition. bioRxiv, 067132. doi: 10.1101/067132


Kearney N. (2019). It’s Not Always FAIR: Choosing the Best Platform for Your Biodiversity Heritage Literature. BISS 3, e35493. doi: 10.3897/biss.3.35493


Klein E., Appeltans W., Provoost P., Saeedi H., Benson A., Bajona L., et al. (2019). OBIS Infrastructure, Lessons Learned, and Vision for the Future. Front. Mar. Sci. 6. doi: 10.3389/fmars.2019.00588


Kwok R. (2017). Historical Data: Hidden in the Past. Nature 549, 419–421. doi: 10.1038/nj7672-419


Lamurias A., Couto F. M. (2019). “Text Mining for Bioinformatics Using Biomedical Literature,” in Encyclopedia of Bioinformatics and Computational Biology. Eds. Ranganathan S., Gribskov M., Nakai K., Schönbach C. (Oxford: Academic Press), 602–611. doi: 10.1016/B978-0-12-809633-8.20409-3


Le Guillarme N., Thuiller W. (2022). TaxoNERD: Deep Neural Models for the Recognition of Taxonomic Entities in the Ecological and Evolutionary Literature. Methods Ecol. Evol. 13, 625–641. doi: 10.1111/2041-210X.13778


Levin S. A. (1992). The Problem of Pattern and Scale in Ecology: The Robert H. MacArthur Award Lecture. Ecology 73, 1943–1967. doi: 10.2307/1941447


Lin X. F. (2006). “Quality Assurance in High Volume Document Digitization: A Survey,” in Second International Conference on Document Image Analysis for Libraries (DIAL’06) (Lyon, France: IEEE), 312–319. doi: 10.1109/DIAL.2006.33


Lo Brutto S. (2021). Historical and Current Diversity Patterns of Mediterranean Marine Species. Diversity 13. doi: 10.3390/d13040156


Lotze H. K., Worm B. (2009). Historical Baselines for Large Marine Animals. Trends Ecol. Evol. 24, 254–262. doi: 10.1016/j.tree.2008.12.004


Lyal C. H. C. (2016). Digitising Legacy Zoological Taxonomic Literature: Processes, Products and Using the Output. ZooKeys 550, 189–206. doi: 10.3897/zookeys.550.9702


Martín Míguez B., Novellino A., Vinci M., Claus S., Calewaert J.-B., Vallius H., et al. (2019). The European Marine Observation and Data Network (EMODnet): Visions and Roles of the Gateway to Marine Data in Europe. Front. Mar. Sci. 6. doi: 10.3389/fmars.2019.00313


Mavraki D., Fanini L., Tsompanou M., Gerovasileiou V., Nikolopoulou S., Chatzinikolaou E., et al. (2016). Rescuing Biogeographic Legacy Data: The “Thor” Expedition, a Historical Oceanographic Expedition to the Mediterranean Sea. Biodiversity. Data J. 4, e11054. doi: 10.3897/BDJ.4.e11054


Mavraki D., Sarafidou G., Legaki A., Nikolopoulou S., Gerovasileiou V. (2021). Digitization of the dredging papers included in the Report on the Mollusca and Radiata of the Aegean Sea, and on their distribution, considered as bearing on Geology by Edward Forbes, 13th Meeting of the British Association for the Advancement of Science, London, 1844 (Heraklion: Hellenic Centre for Marine Research). Available at: http://ipt.medobis.eu/resource?r=mollusca_forbes.


McClenachan L., Ferretti F., Baum J. K. (2012). From Archives to Conservation: Why Historical Data are Needed to Set Baselines for Marine Animals and Ecosystems. Conserv. Lett. 5, 349–359. doi: 10.1111/j.1755-263X.2012.00253.x


Michener W. K. (2015). Ecological Data Sharing. Ecol. Inf. 29, 33–44. doi: 10.1016/j.ecoinf.2015.06.010


Michener W. K., Brunt J. W., Helly J. J., Kirchner T. B., Stafford S. G. (1997). Nongeospatial Metadata for the Ecological Sciences. Ecol. Appl. 7, 330–342. doi: 10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2


Miller J. A., Braumuller Y., Kishor P., Shorthouse D. P., Dimitrova M., Sautter G., et al. (2019). Mobilizing Data From Taxonomic Literature for an Iconic Species (Dinosauria, Theropoda, Tyrannosaurus Rex). Biodiversity. Inf. Sci. Standards. 3, e37078. doi: 10.3897/biss.3.37078


Mora C., Tittensor D. P., Adl S., Simpson A. G. B., Worm B. (2011). How Many Species Are There on Earth and in the Ocean? PLos Biol. 9, e1001127. doi: 10.1371/journal.pbio.1001127


Mouquet N., Lagadeuc Y., Devictor V., Doyen L., Duputié A., Eveillard D., et al. (2015). REVIEW: Predictive Ecology in a Changing World. J. Appl. Ecol. 52, 1293–1310. doi: 10.1111/1365-2664.12482


Mozzherin D., Myltsev A., Zalavadiya H. (2022). Gnames/Gnfinder: V0.18.3. Zenodo. doi: 10.5281/zenodo.6378012


Muñoz G., Kissling W. D., van Loon E. E. (2019). Biodiversity Observations Miner: A Web Application to Unlock Primary Biodiversity Data From Published Literature. Biodiversity. Data J. 7, e28737. doi: 10.3897/BDJ.7.e28737


Nelson G., Ellis S. (2019). The History and Impact of Digitization and Digital Data Mobilization on Biodiversity Research. Philos. Trans. R. Soc. B.: Biol. Sci. 374, 20170391. doi: 10.1098/rstb.2017.0391


Owen D., Groom Q., Hardisty A., Leegwater T., Livermore L., van Walsum M., et al. (2020). Towards a Scientific Workflow Featuring Natural Language Processing for the Digitisation of Natural History Collections. Res. Ideas. Outcomes. 6, e58030. doi: 10.3897/rio.6.e58030


Pafilis E., Bērzinš R., Arvanitidis C., Jensen L. (2017). EXTRACT 2.0: Interactive Identification of Biological Entities Mentioned in Text to Assist Database Curation and Knowledge Extraction. Biodiversity. Inf. Sci. Standards. 1, e20152. doi: 10.3897/tdwgproceedings.1.20152


Pafilis E., Frankild S. P., Fanini L., Faulwetter S., Pavloudi C., Vasileiadou A., et al. (2013). The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLos One 8, e65390. doi: 10.1371/journal.pone.0065390


Pafilis E., Frankild S. P., Schnetzer J., Fanini L., Faulwetter S., Pavloudi C., et al. (2015). ENVIRONMENTS and EOL: Identification of Environment Ontology Terms in Text and the Annotation of the Encyclopedia of Life. Bioinformatics 31, 1872–1874. doi: 10.1093/bioinformatics/btv045


Page R. D. (2011). Extracting Scientific Articles From a Large Digital Archive: BioStor and the Biodiversity Heritage Library. BMC Bioinf. 12, 187. doi: 10.1186/1471-2105-12-187


Page R. (2016). Towards a Biodiversity Knowledge Graph. RIO 2, e8767. doi: 10.3897/rio.2.e8767


Page R. D. M. (2019a). Reconciling Author Names in Taxonomic and Publication Databases. bioRxiv, 870170. doi: 10.1101/870170


Page R. (2019b). “Text-Mining BHL: Towards New Interfaces to the Biodiversity Literature,” in Biodiversity_Next: SI33 – Improving Access to Hidden Scientific Data in the Biodiversity Heritage Library (Leiden, The Netherlands: Pensoft Publishers), e35013. doi: 10.3897/biss.3.35013


Palasca O., Santos A., Stolte C., Gorodkin J., Jensen L. J. (2018). TISSUES 2.0: An Integrative Web Resource on Mammalian Tissue Expression. Database 2018, bay003. doi: 10.1093/database/bay003


Parr C. S., Wilson N., Leary P., Schulz K. S., Lans K., Walley L., et al. (2014). The Encyclopedia of Life V2: Providing Global Access to Knowledge About Life on Earth. BDJ 2, e1079. doi: 10.3897/BDJ.2.e1079


Penev L., Dimitrova M., Senderov V., Zhelezov G., Georgiev T., Stoev P., et al. (2019). OpenBiodiv: A Knowledge Graph for Literature-Extracted Linked Open Data in Biodiversity Science. Publications 7. doi: 10.3390/publications7020038


Penev L., Koureas D., Groom Q., Lanfear J., Agosti D., Casino A., et al. (2022). Biodiversity Community Integrated Knowledge Library (BiCIKL). RIO 8, e81136. doi: 10.3897/rio.8.e81136


Penev L., Mietchen D., Chavan V. S., Hagedorn G., Smith V. S., Shotton D., et al. (2017). Strategies and Guidelines for Scholarly Publishing of Biodiversity Data. RIO 3, e12431. doi: 10.3897/rio.3.e12431


Perera N., Dehmer M., Emmert-Streib F. (2020). Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front. Cell Dev. Biol. 8. doi: 10.3389/fcell.2020.00673


Poelen J., Salim J. A. (2022). Globalbioticinteractions/Nomer. Zenodo. doi: 10.5281/zenodo.6478468


Poelen J. H., Simons J. D., Mungall C. J. (2014). Global Biotic Interactions: An Open Infrastructure to Share and Analyze Species-Interaction Datasets. Ecol. Inf. 24, 148–159. doi: 10.1016/j.ecoinf.2014.08.005


Provoost S., Provoost P., Appeltans W. (2019). Iobis/Obistools: Version 0.0.9. Zenodo. doi: 10.5281/zenodo.3338213


Pyle R. L. (2016). Towards a Global Names Architecture: The Future of Indexing Scientific Names. ZooKeys 550, 261–281. doi: 10.3897/zookeys.550.10009


Rainbow P. S. (2009). Marine Biological Collections in the 21st Century. Zoologica. Scripta. 38, 33–40. doi: 10.1111/j.1463-6409.2007.00313.x


Reiser L., Harper L., Freeling M., Han B., Luan S. (2018). FAIR: A Call to Make Published Data More Findable, Accessible, Interoperable, and Reusable. Mol. Plant 11, 1105–1108. doi: 10.1016/j.molp.2018.07.005


Richard J. (2020). Improving Taxonomic Name Finding in the Biodiversity Heritage Library. Biodiversity. Inf. Sci. Standards. 4, e58482. doi: 10.3897/biss.4.58482


Rivera-Quiroz F. A., Miller J. (2019). Extracting Data From Legacy Taxonomic Literature: Applications for Planning Field Work. Biodiversity. Inf. Sci. Standards. 3, e37082. doi: 10.3897/biss.3.37082


Rivera-Quiroz F. A., Petcharad B., Miller J. A. (2020). Mining Data From Legacy Taxonomic Literature and Application for Sampling Spiders of the Teutamus Group (Araneae; Liocranidae) in Southeast Asia. Sci. Rep. 10, 15787. doi: 10.1038/s41598-020-72549-8


Robertson T., Döring M., Guralnick R., Bloom D., Wieczorek J., Braak K., et al. (2014). The GBIF Integrated Publishing Toolkit: Facilitating the Efficient Publishing of Biodiversity Data on the Internet. PLos One 9, e102623. doi: 10.1371/journal.pone.0102623


Sautter G., Böhm K., Agosti D. (2007). Semi-Automated XML Markup of Biosystematic Legacy Literature With the GoldenGATE Editor. Pac. Symp. Biocomput. 12, 391–402.


Schoch C. L., Ciufo S., Domrachev M., Hotton C. L., Kannan S., Khovanskaya R., et al. (2020). NCBI Taxonomy: A Comprehensive Update on Curation, Resources and Tools. Database (Oxford) 2020. doi: 10.1093/database/baaa062


Stahlman G. R., Sheffield C. (2019). “Geoparsing Biodiversity Heritage Library Collections: A Preliminary Exploration,” in iConference 2019 Proceedings (University of Maryland, College Park (UMD): iSchools). doi: 10.21900/iconf.2019.103357


Stuart-Smith R. D., Edgar G. J., Barrett N. S., Kininmonth S. J., Bates A. E. (2015). Thermal Biases and Vulnerability to Warming in the World’s Marine Fauna. Nature 528, 88–92. doi: 10.1038/nature16144


Tamames J., de Lorenzo V. (2010). EnvMine: A Text-Mining System for the Automatic Extraction of Contextual Information. BMC Bioinf. 11, 294. doi: 10.1186/1471-2105-11-294


Tan S., Mungall C., Vasilevsky N., Matentzoglu N., Osumi-Sutherland D., Caron A., et al. (2022). Pato-Ontology/Pato: 2022-02-20 Release. Zenodo. doi: 10.5281/zenodo.6190780


Thessen A. E., Cui H., Mozzherin D. (2012). Applications of Natural Language Processing in Biodiversity Science. Adv. Bioinf. 2012, 391574. doi: 10.1155/2012/391574


Thessen A., Preciado J., Jain P., Martin J., Palmer M., Bhat R. (2018). Automated Trait Extraction Using ClearEarth, a Natural Language Processing System for Text Mining in Natural Sciences. Biodiversity. Inf. Sci. Standards. 2, e26080. doi: 10.3897/biss.2.26080


Thessen A. E., Walls R. L., Vogt L., Singer J., Warren R., Buttigieg P. L., et al. (2020). Transforming the Study of Organisms: Phenomic Data Models and Knowledge Bases. PLos Comput. Biol. 16, e1008376. doi: 10.1371/journal.pcbi.1008376


Thompson K., Richard J. (2013). Moving Our Data to the Semantic Web: Leveraging a Content Management System to Create the Linked Open Library. J. Libr. Metadata 13, 290–309. doi: 10.1080/19386389.2013.828551


Vandepitte L., Bosch S., Tyberghein L., Waumans F., Vanhoorne B., Hernandez F., et al. (2015). Fishing for Data and Sorting the Catch: Assessing the Data Quality, Completeness and Fitness for Use of Data in Marine Biogeographic Databases. Database 2015, bau125. doi: 10.1093/database/bau125


Verborgh R., De Wilde M. (2013). Using OpenRefine. 1st ed. Eds. Birch S., Gupta S., Nayak A., Vairat H. B. (Packt Publishing).


Vermeulen N., Parker J. N., Penders B. (2013). Understanding Life Together: A Brief History of Collaboration in Biology. Endeavour 37, 162–171. doi: 10.1016/j.endeavour.2013.03.001


Wheeler Q. D., Knapp S., Stevenson D. W., Stevenson J., Blum S. D., Boom B. M., et al. (2012). Mapping the Biosphere: Exploring Species to Understand the Origin, Organization and Sustainability of Biodiversity. Syst. Biodivers. 10, 1–20. doi: 10.1080/14772000.2012.665095


Wickham H. (2016). Ggplot2: Elegant Graphics for Data Analysis (New York: Springer-Verlag). Available at: https://ggplot2.tidyverse.org.


Wieczorek J., Bloom D., Guralnick R., Blum S., Döring M., Giovanni R., et al. (2012). Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLos One 7, e29715. doi: 10.1371/journal.pone.0029715


Wilkinson M. D., Dumontier M., Aalbersberg I. J. J., Appleton G., Axton M., Baak A., et al. (2016). The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 3, 160018. doi: 10.1038/sdata.2016.18


Wilkinson C., Woodruff S. D., Brohan P., Claesson S., Freeman E., Koek F., et al. (2011). Recovery of Logbooks and International Marine Data: The RECLAIM Project. Int. J. Climatology. 31, 968–979. doi: 10.1002/joc.2102


WoRMS Editorial Board (2022). World Register of Marine Species (VLIZ). Available at: https://www.marinespecies.org (Accessed April 13, 2022).


Xiang Z., Mungall C., Ruttenberg A., He Y. (2011). “Ontobee: A Linked Data Server and Browser for Ontology Terms,” in Proceedings of the 2nd International Conference on Biomedical Ontologies (ICBO) (Buffalo, NY, USA), 279–281. Available at: http://ceur-ws.org/Vol-833/paper48.pdf.


Zárate M., Buckle C. (2021). “LOBD: Linked Data Dashboard for Marine Biodiversity,” in Cloud Computing, Big Data & Emerging Topics. Eds. Naiouf M., Rucci E., Chichizola F., De Giusti L. (Cham: Springer International Publishing), 151–164.


Keywords: marine historical ecology, marine biodiversity data rescue, data archaeology, data curation, text mining, information extraction, scientific workflow, software containers

Citation: Paragkamian S, Sarafidou G, Mavraki D, Pavloudi C, Beja J, Eliezer M, Lipizer M, Boicenco L, Vandepitte L, Perez-Perez R, Zafeiropoulos H, Arvanitidis C, Pafilis E and Gerovasileiou V (2022) Automating the Curation Process of Historical Literature on Marine Biodiversity Using Text Mining: The DECO Workflow. Front. Mar. Sci. 9:940844. doi: 10.3389/fmars.2022.940844

Received: 10 May 2022; Accepted: 16 June 2022;
Published: 22 July 2022.

Edited by:

Anne Chenuil, Centre National de la Recherche Scientifique (CNRS), France

Reviewed by:

Halina Falfushynska, Ternopil Volodymyr Hnatyuk National Pedagogical University, Ukraine
Javier Lloret, Marine Biological Laboratory (MBL), United States

Copyright © 2022 Paragkamian, Sarafidou, Mavraki, Pavloudi, Beja, Eliezer, Lipizer, Boicenco, Vandepitte, Perez-Perez, Zafeiropoulos, Arvanitidis, Pafilis and Gerovasileiou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Savvas Paragkamian, s.paragkamian@hcmr.gr

†These authors have contributed equally to this work and share first authorship

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.