- 1Coastal and Marine Research Centre, School of Environment and Science, Australian Rivers Institute, Griffith University, Gold Coast, QLD, Australia
- 2Quantitative Imaging Research Team, Data61, CSIRO, Marsfield, NSW, Australia
Introduction
Technological advances are improving the collection, processing, and analysis of ecological data. One such technology adopted in recent ecological studies is computer vision (CV). CV is a rapidly developing area of machine learning that aims to infer image content at the same level humans can by extracting information from pixels (LeCun et al., 2015; Weinstein, 2018). CV has gained much attention in ecology because it can quickly and accurately process imagery from remote video, allowing scientists to monitor both individuals and populations at unprecedented spatial and temporal scales. Automated analysis of imagery through CV has also become more accurate and streamlined with the implementation of deep learning (a subset of machine learning) models, which have improved the capacity to process raw images compared with traditional machine learning methods (LeCun et al., 2015; Villon et al., 2016). As the use of camera systems for monitoring fish abundances is common practice in conservation ecology (Gilby et al., 2017; Whitmarsh et al., 2017; Langlois et al., 2020), deep learning allows for automated processing of large volumes of video or image data, a step that usually creates a bottleneck when these data must be analyzed manually.
Monitoring fish in their natural habitat is a key step in the effective management of fish and fisheries. Deep learning enables researchers to process raw image-based data at high volumes and at a much faster rate than manual methods (González-Rivero et al., 2020). By enabling more data to be analyzed, sampling can be expanded to increase sample sizes across greater spatial and temporal scales. Automated monitoring using deep learning can reduce labor costs and increase efficiency, and has been shown to be equally or more accurate than humans at processing data (Torney et al., 2019; Ditria et al., 2020a). To develop automated monitoring tools for fish in aquatic ecosystems, comprehensive training and test datasets are required. To obtain ecologically relevant information with these methods, models must be trained on enough labeled data (annotated appropriately for the chosen algorithm) to perform at a high enough standard to be considered a replacement for manual data processing. Additionally, datasets should capture the complexity and variability of aquatic ecosystems so that the resulting models are robust and flexible enough for use in monitoring.
While applications of deep learning techniques in marine environments have shown promising results as a viable alternative to manual analysis, there are significant challenges in terms of acquiring data and preparing training data to create useful models. Additionally, the reluctance or inability of some parties to share annotated datasets may slow progress in the development and application of computer vision techniques for monitoring. Initiatives such as the FishCLEF 2015 challenge have expanded the body of literature on computer vision techniques by challenging participants to create a high-performing model for fish identification, importantly in unconstrained underwater habitats, which is critical for applying this technology to real-world scenarios (Joly et al., 2015). However, the expansion of deep learning techniques in marine science over the last few years calls for higher volumes of training data than traditional machine learning methods required. There is therefore a need for accessible, high-quality annotated datasets for deep learning models to further the progress of applying these techniques in ecology.
Although there are currently few datasets available, several research groups have begun adding much-needed annotated datasets to support further research into the application of CV to fish identification (Table 1). These labeled datasets can contribute to the understanding of CV techniques in several ways. The datasets currently available are labeled to achieve one or more of the following: object recognition/classification, to determine what is in a frame; localization, to pinpoint where it is in a frame; and object detection, which is a combination of object classification and object localization (Wang, 2016).
Table 1. Summary of some currently available datasets for training and testing computer vision models for fish.
The contributions of this dataset include: (1) a comprehensive dataset of ecologically important fish species that captures the complexity of backgrounds observed in unconstrained seagrass ecosystems, to support robust and flexible models; (2) a variety of modalities for rapid and flexible testing or comparison of different frameworks; and (3) unaltered imagery for investigating possible data augmentation and performance enhancement using pre- and post-processing techniques.
Dataset
To continue the development of automated tools for fish monitoring, we report a dataset, “Annotated videos of luderick from estuaries in southeast Queensland, Australia,” which was used to train a deep learning algorithm for automated species identification and abundance counts presented in Ditria et al. (2020a). The dataset contains footage from remote underwater video (RUV) recordings of two common fish species, luderick (Girella tricuspidata) and Australian bream (Acanthopagrus australis), from seagrass habitats in the estuaries of two river systems in southeast Queensland, Australia. The raw data were obtained using submerged action cameras (Haldex Sports Action Cam HD 1080p) to collect video footage in the Tweed River estuary in southeast Queensland (−28.169438, 153.547594) between February and July 2019. Additional footage was collected from seagrass meadows in a separate estuary system in Tallebudgera Creek (−28.109721, 153.448975). On each sampling day, six cameras were deployed for 1 h over a variety of seagrass patches; the angle and placement of cameras were varied among deployments. Videos were manually trimmed for training to contain only footage of luderick (the target species for the study) and split into frames at 5 frames per second for manual annotation to create a labeled dataset. The dataset consists of 4,281 images and 9,429 annotations (9,304 luderick, 125 bream) at standard high resolution (1,920 × 1,080 pixels). The backgrounds vary in complexity across the 18 individual camera deployments, which span a range of camera angles and depths and showcase varied backgrounds and lighting as well as fish positions and angles (Figure 1).
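As an illustration of the frame-splitting step, the short Python/OpenCV sketch below extracts frames from a trimmed video at approximately 5 frames per second. It is a minimal example rather than the exact pipeline used in the study; file paths and the output naming scheme are hypothetical.

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, target_fps=5):
    """Save frames from `video_path` into `out_dir` at roughly `target_fps` frames per second."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(str(video_path))
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0       # fall back if FPS is unreported
    step = max(int(round(native_fps / target_fps)), 1)   # keep every `step`-th frame

    frame_idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            cv2.imwrite(str(out_dir / f"{Path(video_path).stem}_{saved:06d}.jpg"), frame)
            saved += 1
        frame_idx += 1
    cap.release()
    return saved

# Example usage (hypothetical paths):
# extract_frames("deployment_01_trimmed.mp4", "frames/deployment_01", target_fps=5)
```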
Figure 1. (A) Example images of free-swimming fish (luderick) in unconstrained seagrass habitats. This dataset provides footage from different angles, from cameras exposed to different environmental conditions such as tide, lighting, water clarity, and color hue experienced under real-world conditions. (B) Segmentation masks around luderick allow this dataset to be used for a number of computer vision techniques.
The presented data were used to train a deep learning algorithm via transfer learning (Torrey and Shavlik, 2010). The deep learning framework implemented in this study was Mask R-CNN (Massa and Girshick, 2018). To develop this model, we used a ResNet50 configuration pre-trained on the ImageNet-1k dataset. We conducted the model training, testing, and prediction tasks on a Microsoft Azure Data Science Virtual Machine powered by an NVIDIA V100 GPU. Data preparation (uploading raw videos and splitting them into frames at 5 frames per second) and annotation tasks were carried out using software developed at Griffith University (https://globalwetlandsproject.org/tools/fishid/).
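As a rough illustration of this transfer-learning setup, the sketch below builds a Mask R-CNN model with an ImageNet-pre-trained ResNet-50 backbone and prediction heads sized for the target classes. It uses torchvision rather than the maskrcnn-benchmark implementation cited above, and the class count and optimizer hyperparameters are illustrative assumptions, not the study's exact configuration.

```python
import torch
import torchvision

# Mask R-CNN with a ResNet-50 FPN backbone pre-trained on ImageNet-1k.
# num_classes = 3 assumes two fish classes (luderick, bream) plus background.
# Newer torchvision releases use weights_backbone="DEFAULT" instead of pretrained_backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    pretrained=False,            # do not load COCO-pretrained detection weights
    pretrained_backbone=True,    # ImageNet-1k pre-trained ResNet-50 backbone
    num_classes=3,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# A typical fine-tuning optimizer setup (hyperparameters are illustrative only).
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.005, momentum=0.9, weight_decay=0.0005,
)
```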
The performance metrics used for this dataset were based on precision (P) and recall (R), which were calculated from the false positives, false negatives, and true positives (Ditria et al., 2020a).
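Following the standard definitions of these metrics (assumed here, consistent with the description above), where TP, FP, and FN denote true positives, false positives, and false negatives:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \]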
An F1 score was used to assess how well the model was able to determine the maximum number of fish present in a video at any one time (MaxN), and was calculated as follows:
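Assuming the conventional formulation of the F1 score as the harmonic mean of precision and recall:

\[ F1 = \frac{2 \times P \times R}{P + R} \]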
A mean average precision value (mAP) was used to determine the trained model's ability to detect and classify ground-truth segmentation mask annotations. Specifically, Ditria et al. (2020a) used the mAP50 metric to evaluate model performance; this metric assigns a true positive when a predicted segmentation mask (Figure 1) overlaps the ground-truth annotated segmentation mask by at least 50% (https://github.com/cocodataset/cocoapi). mAP50 was calculated as follows:
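In the standard COCO-style formulation (assumed here, following the cocoapi reference above), the average precision for a class is the area under its precision–recall curve, with predictions counted as true positives at an intersection-over-union (IoU) of at least 0.5, and mAP50 is the mean over the C classes:

\[ \mathrm{AP_{50}}(c) = \int_{0}^{1} P_c(R)\, dR, \qquad \mathrm{mAP_{50}} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP_{50}}(c) \]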
The results using these metrics and the proposed dataset can be found in Ditria et al. (2020a), where the mAP50 and F1 are both >92% for detecting the target species and counting abundance.
By providing the dataset in different modalities, we propose that it can be used to understand fish dynamics in seagrass ecosystems, to develop novel fish-counting methods (beyond only MaxN), and to explore different methods for improving accuracy for implementation in an ecological context.
Data Presentation
This dataset is organized into three sub-datasets allocated for suggested training, test, and novel-test purposes, as used in Ditria et al. (2020a). Each sub-dataset consists of object instance annotations, and each annotation comprises the following key fields: a label, provided as a common name, either “luderick” for Girella tricuspidata or “bream” for Acanthopagrus australis; a bounding box enclosing the fish in each frame, provided in “[x, y, width, height]” format in pixel units; a segmentation mask outlining the fish as a polygon, provided as a list of pixel coordinates in the format “[x, y, x, y, …]”; and the corresponding image, provided as an image filename. All image coordinates (bounding boxes and segmentation masks) are measured from the top-left image corner and are 0-indexed.
Annotations are provided in both CSV format and COCO JSON format (https://cocodataset.org/#format-data), a commonly used format supported by object detection frameworks built on PyTorch and TensorFlow. Images are included in a ZIP archive. Additional metadata for each image, including deployment dates, geo-coordinates, and habitat type, can be found in dataset_images.csv.
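As an example of how the COCO JSON annotations might be read, the sketch below uses pycocotools to load one sub-dataset and iterate over the annotations for a single image. The annotation filename is hypothetical and should be replaced with the file supplied in the data repository.

```python
from pycocotools.coco import COCO

# Hypothetical filename -- substitute the annotation file from the data repository.
coco = COCO("training_annotations.json")

# Category ids mapped to common names (e.g., "luderick", "bream").
categories = {c["id"]: c["name"] for c in coco.loadCats(coco.getCatIds())}

# Inspect the first image and its annotations.
img_id = coco.getImgIds()[0]
image_info = coco.loadImgs(img_id)[0]
print(image_info["file_name"], image_info["width"], image_info["height"])

for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    x, y, w, h = ann["bbox"]            # [x, y, width, height] in pixels, 0-indexed
    polygon = ann["segmentation"][0]    # [x, y, x, y, ...] polygon vertices
    print(categories[ann["category_id"]], (x, y, w, h), len(polygon) // 2, "vertices")
```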
Recommendations for Additional Uses of the Dataset in Future Publications
Author recommendations for future work using this dataset include the potential to answer both ecological and technological questions. There are multiple ways of using this dataset to develop automation for application in fish ecology. Different deep learning architectures (Liu et al., 2017), types of augmentation (Shorten and Khoshgoftaar, 2019), and pre- and post-processing techniques can be tested for different objectives. New types of analyses also continue to be developed as the body of literature on applied deep learning in ecology grows. There are also several phenomena described in deep learning studies that are yet to be quantified in unconstrained underwater habitats. These include how models perform through time when the data being predicted begin to differ from the training dataset, for example due to changed environmental conditions. The dataset allows analyses of automation effectiveness from videos recorded at different times. The 6-month spread of sampling dates incorporates summer conditions (austral summer: February/March), with warmer water temperatures and poorer water visibility due to increased rainfall in river catchments, as well as winter conditions (austral winter: June/July), typically with cooler, clearer water.
The results in Ditria et al. (2020b) suggest that these data may be suitable for training models for monitoring purposes across habitats; however, further research is needed to account for error rates arising from manual annotation bias and model prediction. This study produced reliable automated predictions, but there is room for further improvement. Pre- and post-processing steps can also be applied to further investigate their effects on performance because, unlike most available datasets, this dataset contains unaltered images extracted directly from the video data, including the background habitat.
This dataset may also be used in conjunction with other datasets to train composite models for multiple species, since the segmentation masks and bounding boxes are labeled at the species level, unlike in most available datasets. The data have been annotated for use in object detection or classification and are also suitable for testing a range of movement-tracking algorithms. Ongoing testing of this standardized dataset will be valuable, especially as new deep learning architectures become available.
A full analysis and interpretation of the data are reported in Ditria et al. (2020a).
Data Availability Statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://doi.org/10.1594/PANGAEA.926930.
Author Contributions
ED and SL-M conducted the fieldwork to obtain raw data. ED (& volunteers) annotated the dataset. EJ developed the deep learning architecture and user interface on which data annotation was conducted. ED drafted the manuscript. All authors commented on the manuscript.
Funding
RC was supported by a Discovery Project from the Australian Research Council (DP180103124).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
We thank the volunteers who assisted in annotating the dataset. This work benefitted from the support of the Global Wetlands Project.
References
Anantharajah, K., Ge, Z., McCool, C., Denman, S., Fookes, C., Corke, P., et al. (2014). “Local inter-session variability modelling for object classification,” in 2014 IEEE Winter Conference on Applications of Computer Vision (Steamboat Springs, CO: IEEE), 309–316. doi: 10.1109/WACV.2014.6836084
Australian Institute of Marine Science (AIMS), University of Western Australia (UWA), and Curtin University (2019). OzFish Dataset - Machine Learning Dataset for Baited Remote Underwater Video Stations.
Cutter, G., Stierhoff, K., and Zeng, J. (2015). “Automated detection of rockfish in unconstrained underwater videos using Haar cascades and a new image dataset: labeled fishes in the wild,” in 2015 IEEE Winter Applications and Computer Vision Workshops (Waikoloa, HI: IEEE), 57–62. doi: 10.1109/WACVW.2015.11
Ditria, E. M., Lopez-Marcano, S., Sievers, M., Jinks, E. L., Brown, C. J., and Connolly, R. M. (2020a). Automating the analysis of fish abundance using object detection: optimizing animal ecology with deep learning. Front. Marine Sci. 7:429. doi: 10.3389/fmars.2020.00429
Ditria, E. M., Sievers, M., Lopez-Marcano, S., Jinks, E. L., and Connolly, R. M. (2020b). Deep learning for automated analysis of fish abundance: the benefits of training across multiple habitats. Environ. Monit. Assess. 192:698. doi: 10.1007/s10661-020-08653-z
Gilby, B. L., Olds, A. D., Connolly, R. M., Yabsley, N. A., Maxwell, P. S., Tibbetts, I. R., et al. (2017). Umbrellas can work under water: using threatened species as indicator and management surrogates can improve coastal conservation. Estuarine Coastal Shelf Sci. 199, 132–140. doi: 10.1016/j.ecss.2017.10.003
González-Rivero, M., Beijbom, O., Rodriguez-Ramirez, A., Bryant, D. E., Ganase, A., Gonzalez-Marrero, Y., et al. (2020). Monitoring of coral reefs using artificial intelligence: a feasible and cost-effective approach. Remote Sens. 12:489. doi: 10.3390/rs12030489
Holmberg, J., Norman, B., and Arzoumanian, Z. (2009). Estimating population size, structure, and residency time for whale sharks Rhincodon typus through collaborative photo-identification. Endangered Species Res. 7, 39–53. doi: 10.3354/esr00186
Jäger, J., Simon, M., Denzler, J., Wolff, V., Fricke-Neuderth, K., and Kruschel, C. (2015). “Croatian fish dataset: fine-grained classification of fish species in their natural habitat,” in Proceedings of the Machine Vision of Animals and their Behaviour (MVAB), eds T. Amaral, S. Matthews, T. Plötz, S. McKenna, and R. Fisher (London: BMVA Press), 6.1–6.7. doi: 10.5244/C.29.MVAB.6
Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.-P., et al. (2015). “LifeCLEF 2015: multimedia life species identification challenges,” in International Conference of the Cross-Language Evaluation Forum for European Languages (Cham: Springer), 462–483. doi: 10.1007/978-3-319-24027-5_46
Kavasidis, I., Palazzo, S., Di Salvo, R., Giordano, D., and Spampinato, C. (2014). An innovative web-based collaborative platform for video annotation. Multimedia Tools Appl. 70, 413–432. doi: 10.1007/s11042-013-1419-7
Langlois, T., Goetze, J., Bond, T., Monk, J., Abesamis, R. A., Asher, J., et al. (2020). A field and video annotation guide for baited remote underwater stereo-video surveys of demersal fish assemblages. Methods Ecol. Evol. 11, 1401–1409. doi: 10.1111/2041-210X.13470
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444. doi: 10.1038/nature14539
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F. E. (2017). A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26. doi: 10.1016/j.neucom.2016.12.038
Massa, F., and Girshick, R. (2018). maskrcnn-benchmark: Fast, Modular Reference Implementation of Instance Segmentation and Object Detection Algorithms in PyTorch. Available online at: https://github.com/facebookresearch/maskrcnn-benchmark (accessed April 29, 2019).
Saleh, A., Laradji, I. H., Konovalov, D. A., Bradley, M., Vazquez, D., and Sheaves, M. (2020). A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Sci. Rep. 10, 1–10. doi: 10.1038/s41598-020-71639-x
Shah, S. Z. H., Rauf, H. T., IkramUllah, M., Khalid, M. S., Farooq, M., Fatima, M., et al. (2019). Fish-Pak: fish species dataset from Pakistan for visual features based classification. Data Brief 27:104565. doi: 10.1016/j.dib.2019.104565
Shorten, C., and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. J. Big Data 6:60. doi: 10.1186/s40537-019-0197-0
Torney, C. J., Lloyd-Jones, D. J., Chevallier, M., Moyer, D. C., Maliti, H. T., Mwita, M., et al. (2019). A comparison of deep learning and citizen science techniques for counting wildlife in aerial survey images. Methods Ecol. Evol. 10, 779–787. doi: 10.1111/2041-210X.13165
Torrey, L., and Shavlik, J. (2010). “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, eds E. S. Olivas, J. D. Guerrero, M. Martinez-Sober, J. R. Magdalena-Benedito, and A. J Serrano López (Hershey, PA: IGI Global), 242–264. doi: 10.4018/978-1-60566-766-9.ch011
Villon, S., Chaumont, M., Subsol, G., Villéger, S., Claverie, T., and Mouillot, D. (2016). “Coral reef fish detection and recognition in underwater videos by supervised machine learning: Comparison between Deep Learning and HOG+ SVM methods,” in 2016 International Conference on Advanced Concepts for Intelligent Vision Systems (Cham: Springer), 160–171. doi: 10.1007/978-3-319-48680-2_15
Wang, X. (2016). Deep learning in object recognition, detection, and segmentation. Found. Trends Signal Process. 8, 217–382. doi: 10.1561/2000000071
Weinstein, B. G. (2018). A computer vision for animal ecology. J. Anim. Ecol. 87, 533–545. doi: 10.1111/1365-2656.12780
Keywords: annotated dataset, automation, machine learning, seagrass, automated monitoring, deep learning—CNN
Citation: Ditria EM, Connolly RM, Jinks EL and Lopez-Marcano S (2021) Annotated Video Footage for Automated Identification and Counting of Fish in Unconstrained Seagrass Habitats. Front. Mar. Sci. 8:629485. doi: 10.3389/fmars.2021.629485
Received: 15 November 2020; Accepted: 08 February 2021;
Published: 03 March 2021.
Edited by:
Gretta Pecl, Centre for Marine Socioecology, Australia
Reviewed by:
David John Booth, University of Technology Sydney, Australia
Euan S. Harvey, Curtin University, Australia
Copyright © 2021 Ditria, Connolly, Jinks and Lopez-Marcano. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Ellen M. Ditria, ellen.ditria@griffithuni.edu.au