
ORIGINAL RESEARCH article
Front. Mar. Sci., 06 February 2025
Sec. Marine Fisheries, Aquaculture and Living Resources
Volume 12 - 2025 | https://doi.org/10.3389/fmars.2025.1476616
This article is part of the Research Topic Challenges in Fishery Assessment Methodologies.
Deep-sea demersal fisheries in the Pacific have strong commercial, cultural, and recreational value, especially snappers (Lutjanidae), which make up the bulk of catches. Yet, managing these fisheries is challenging due to the scarcity of data. Stereo Baited Remote Underwater Video Stations (BRUVS) can provide valuable quantitative information on fish stocks, but manually processing large amounts of video is time-consuming and sometimes unrealistic. To address this issue, we used a Faster Region-based Convolutional Neural Network (Faster R-CNN), a deep learning architecture, to automatically detect, identify and count deep-water snappers in BRUVS. Videos were collected in New Caledonia (South Pacific) at depths ranging from 47 to 552 m. Using a dataset of 12,100 annotations of 11 deep-water snapper species observed in 6,364 images, we obtained good model performance for the six species with sufficient annotations (F-measures >0.7, up to 0.87). The correlation between automatic and manual estimates of fish MaxN abundance in videos was high (0.72–0.9), but the Faster R-CNN showed an underestimation bias at higher abundances. A semi-automatic protocol, in which our model supported manual observers in processing BRUVS footage, improved performance, with a correlation of 0.96 with manual counts and a perfect match (R=1) for some key species. This model can already assist manual observers in semi-automatically processing BRUVS footage and will certainly improve when more training data become available to decrease the rate of false negatives. This study further shows that the adoption of artificial intelligence in marine science is gradual but warranted for the future.
In order to assess the stock of a target fisheries species, it is necessary to estimate its abundance and biomass in space and time, but also along the species' length structure (Gulland, 1983). Such information may be insufficient or biased when acquired from landings of data-poor fisheries, thus calling for independent methods to complement traditional fisheries stock assessments (Moore et al., 2013). The emergence of video-assisted methods like BRUVS (Baited Remote Underwater Video Stations) (Whitmarsh et al., 2017) using low-cost small action cameras may provide such valuable complementary information (Moore et al., 2013; Letessier et al., 2015). However, video-based assessments require considerable processing time to manually count fish on images, limiting their broad-scale application (Sheaves et al., 2020). Modern automated video analyses using deep learning algorithms are becoming more accurate (Villon et al., 2018; Marrable et al., 2022; Bhalla et al., 2024) and may reduce these costly video-processing constraints (Tseng and Kuo, 2020; Connolly et al., 2021; Lopez-Marcano et al., 2021). Yet the lack of labelled, species-rich datasets for fish classification and identification still constrains automation. Furthermore, the performance of deep learning algorithms on deep-water video surveys, which involve darkness, artificial lighting and generally variable image conditions and backgrounds, remains poorly known (Saleh et al., 2024; Jian et al., 2024).
In the Pacific, deep-water demersal fisheries are of high significance not only for local consumption but also for their commercial, cultural, and recreational value (Dalzell and Preston, 1992). Their commercial development began in the 1970s to alleviate fishing pressure on coral reefs, but these fisheries generally collapsed in the 1990s (Williams et al., 2012). Over time, they have transitioned primarily to subsistence while retaining commercial significance in more developed and isolated regions like New Caledonia and Hawaii (Newman et al., 2016). While deep demersal fisheries include around 200 species in the western Pacific Ocean, landings are mainly composed of snappers, a group in the Lutjanidae family associated with the genera Etelis, Pristipomoides, Aphareus, and Aprion. Deep-water snappers are characterized by relatively slow metabolic rates and long lifespans, making them highly vulnerable to overfishing (Newman et al., 2016; Wakefield et al., 2020). Usually found at depths from 100 m to 500 m and beyond, these fish aggregate over structured topographies like steep slopes, seamounts, or other topographic anomalies such as sand banks or pinnacles (Gomez et al., 2015). Yet, deep-water snapper fisheries lack core management measures based on stock assessments, which remain challenging to perform in such hardly accessible marine habitats (Newman et al., 2016).
Baited Remote Underwater Video Stations are among the most used, standardized video techniques to study underwater fish ecology (Whitmarsh et al., 2017; Langlois et al., 2020). BRUVS can assess spatial and temporal variation in fish assemblages through visual identification and quantification of species abundance (Letessier et al., 2015; Wellington et al., 2018). They are a low-cost method able to generate large amounts of data (Cappo et al., 2007; Osgood et al., 2019; MacNeil et al., 2020). BRUVS can be deployed in a variety of habitats, including coral reefs, but also soft sediments, freshwater, the deep sea, or the pelagic environment (Ellender et al., 2012; Gladstone et al., 2012; Zintzen et al., 2012; Henderson et al., 2017; Schmid et al., 2017; Letessier et al., 2019; Reis-Filho et al., 2019). Their use in environmental monitoring is increasing, with more studies focusing on industrial settings like underwater pipelines (Bond et al., 2018; Schramm et al., 2020, 2021) or windfarms (Griffin et al., 2016). BRUVS are also emerging as independent and complementary methods for fisheries stock assessments (Cappo et al., 2004; Ault et al., 2018; Boldt et al., 2018). Clearly, BRUVS show great potential for monitoring deep-sea fisheries.
When manually processing BRUVS footage by identifying, counting, and measuring fish, the fastest and most commonly used metric is the MaxN (Whitmarsh et al., 2017; Langlois et al., 2020). MaxN corresponds to the maximum number of individuals per species counted in a single image of a video. While conservative, this measure prevents counting the same individual twice. It has been shown that obtaining accurate fish abundance measures on each image from a video station, or within short video periods, and averaging these measures across the whole video may be more representative, but this would multiply processing costs (Schobernd et al., 2014). This cost could effectively be reduced using deep learning algorithms.
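As a minimal illustration of how this metric is derived (not the authors' code; the data structure and names are hypothetical), MaxN can be computed from per-frame counts of each species as follows:

```python
# Minimal sketch (hypothetical data structure): MaxN per species for one BRUVS
# deployment, i.e. the maximum count of that species observed in any single frame.
from collections import defaultdict

def compute_maxn(frame_counts):
    """frame_counts: iterable of (frame_id, species, count) tuples."""
    maxn = defaultdict(int)
    for _frame_id, species, count in frame_counts:
        maxn[species] = max(maxn[species], count)
    return dict(maxn)

# Example: three frames from one deployment
records = [
    (1, "Etelis coruscans", 2),
    (2, "Etelis coruscans", 4),                 # peak count -> MaxN = 4
    (2, "Pristipomoides filamentosus", 1),
    (3, "Etelis coruscans", 3),
]
print(compute_maxn(records))
# {'Etelis coruscans': 4, 'Pristipomoides filamentosus': 1}
```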
Deep learning, and specifically Convolutional Neural Networks (CNNs), are artificial intelligence algorithms that generate classifications by autonomously identifying features in images (LeCun et al., 2015). The rapid progress in the automatic processing of underwater images has already permeated ecology, with the accurate detection of several marine species (Christin et al., 2019; Mannocci et al., 2021; Saleh et al., 2022; Xu et al., 2023). The ability to detect and identify fish on images in their natural environment has also been explored, but studies have mainly targeted coral reef fish, which can be highly differentiated due to their diversity of shapes and colors (Mandal et al., 2018; Villon et al., 2018, 2022; Saleh et al., 2024). Publicly available images follow the same trend and are diversifying across shallow habitats, while images of fish from deeper strata are still lacking (Saleh et al., 2024; Bhalla et al., 2024). To our knowledge, few studies have used deep-water images, with their own singular constraints like variable light levels (Saleh et al., 2022; Jian et al., 2024; but see Liu et al., 2023), and none for deep-water snappers. Given the diversity of habitats and conditions in which fish can be detected, incorporating more diverse species and backgrounds is crucial for improving general fish detection and identification techniques (Saleh et al., 2022; Bhalla et al., 2024; Jian et al., 2024).
The state of the art in object detection and classification features three primary algorithms: Single Shot Detection (SSD), Faster Region-based Convolutional Neural Network (Faster R-CNN), and You Only Look Once (YOLO) (Bhalla et al., 2024). While YOLO and SSD have demonstrated notable speed advantages over Faster R-CNN, the latter has shown superior accuracy in object detection and classification (Kim et al., 2018; Bose and Kumar, 2020; Kaarmukilan et al., 2020; Lee and Kim, 2020; Lee et al., 2021; Mahendrakar et al., 2022; Sarma et al., 2024). This difference is due to Neural Architecture Search (NAS), which automatically searches for and builds the most efficient architecture (Elsken et al., 2018). Furthermore, while some recent versions of YOLO do outperform older Faster R-CNN implementations, one of YOLO's weaknesses is its inability to handle large variations in object size as well as Faster R-CNN does (Ammar et al., 2019). Such variation is commonplace in underwater videos, where individuals can appear either close to or very far from the camera. One of the main advantages of YOLO is its speed in real-time detection operations, where Faster R-CNN takes more processing time. However, BRUVS are usually deployed and retrieved over a short period of time, so video processing inevitably occurs separately from deployment and real-time speed is not required. For this reason, Faster R-CNN seems to represent the best option for this context, being the most precise although slightly slower (Sarma et al., 2024).
Here, we chose the Faster R-CNN architecture and assessed its ability to automatically detect deep-water snapper species in BRUVS images from deep slopes and seamounts of a South Pacific island, New Caledonia. We then discuss constraints and solutions regarding how this algorithm may help accelerate video processing for fisheries stock assessments, considering both fully automatic and semi-automatic approaches. To our knowledge, this study is the first to train an artificial intelligence algorithm for the detection, identification and counting of deep-water snappers in the wild on baited videos.
The main contributions of this article are as follows:
1. To address the high processing costs associated with manual data extraction by experts from BRUVS footage of commercial species, we propose the use of artificial intelligence, specifically the Faster R-CNN deep learning algorithm, to automate the detection, identification and counting of deep-water snappers (Lutjanidae family) observed in New Caledonia.
2. To address the choice of deep learning algorithm for non-specialists, we propose the use of the Faster R-CNN architecture. It has proven effective at processing objects (species) of varying sizes with higher accuracy than other model architectures.
3. To address the problem of a small training dataset, we propose a semi-automatic method that combines manual and automatic processes to improve the accuracy of fish abundance estimates. This semi-automatic process achieved results much closer to manual counts while restricting the images checked by the expert to those containing detections by the algorithm.
New Caledonia is a sanctuary and hotspot for marine biodiversity (Payri et al., 2019). Anthropogenic pressure is low, with around 271,400 inhabitants over 16,372 km² (isee.nc), disproportionately concentrated around the capital, Noumea. The 400 km long main island is surrounded by a 1,600 km long coral reef barrier, with wilderness atolls, reefs, and small islands scattered across the 1,450,000 km² of the New Caledonian Exclusive Economic Zone (EEZ). Mainly composed of deep sea, 40% of the EEZ surface is potential habitat for deep-sea snappers (Gomez et al., 2015). A total of 15 sites were sampled with BRUVS, including 11 seamount summits and 4 deep island slopes, during four oceanographic campaigns conducted aboard the RV ALIS in 2019 and 2020. Sampling depths varied between 47 and 552 m (Baletaud et al., 2023).
On each seamount or deep slope, five to ten video samples were collected, for a total of 121 deep-water BRUVS deployments using GoPro Hero 4 cameras. Cameras were set to a medium field of view at 1920x1080 resolution and 30 frames per second, with a 1,200-lumen, 120-degree LED light (Groupbinc). BRUVS were baited with 1 kg of crushed sardines in a perforated PVC canister and provided 2 hours of usable seafloor footage. Videos were then manually processed, and MaxN (maximum abundance per species in a single frame, Langlois et al., 2020) was estimated for each species using the EventMeasure software (SeaGIS, version 5.42). Eleven species of deep-water snappers were observed throughout this 121-BRUVS dataset. Snappers were observed at variable abundances on 98 BRUVS and were absent from the remaining 23 video stations. We then extracted a total of 410 video clips of 15 seconds centered around each MaxN observation. Overlapping sequences between different species' MaxN in each video clip were filtered to avoid duplicated annotations of identical images. These video clips were then sliced at two or five frames per second for manual annotation. The annotation procedure was identical to a previous study (Villon et al., 2018). Briefly, for each image, the coordinates of the box enclosing each observed snapper were registered using the Computer Vision Annotation Tool (CVAT) (Sekachev et al., 2020). This procedure yielded 12,100 individual deep-water snapper annotations identified at the species level on 6,364 images extracted from the video sequences. The image dataset was then split into a training and a testing dataset. Splitting was performed at the level of individual BRUVS deployments, so that images of the same species from the same BRUVS did not occur in both the training and testing datasets, thus avoiding artificially low false-negative rates (Villon et al., 2020). The training dataset included 80% of annotations (5,031 images, 9,782 annotations), and the remaining 20% were used as the testing dataset (1,333 images, 2,318 annotations). Species-wise annotations were highly unbalanced, as some species occurred more often than others (Table 1). Randallichthys filamentosus was represented by only three annotations, resulting in no images in the testing dataset; this species was therefore kept only in model training to add diversity to the training data.
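The deployment-level split can be sketched as follows. This is a minimal illustration with hypothetical variable names, not the authors' code; in the study the 80% figure refers to annotations, whereas here the fraction is applied to deployments for simplicity.

```python
# Minimal sketch: split annotated images into training/testing sets at the level
# of BRUVS deployments (hypothetical 'bruvs_id' key), so that images from the
# same deployment never appear in both sets.
import random

def split_by_deployment(images, train_fraction=0.8, seed=0):
    deployments = sorted({img["bruvs_id"] for img in images})
    random.Random(seed).shuffle(deployments)
    n_train = int(round(train_fraction * len(deployments)))
    train_ids = set(deployments[:n_train])
    train = [img for img in images if img["bruvs_id"] in train_ids]
    test = [img for img in images if img["bruvs_id"] not in train_ids]
    return train, test
```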
Table 1. Annotation summary for the training and testing datasets per species used with the R-CNN algorithm.
CNNs are algorithms designed for object detection and image classification. They apply learned filters (weights) that compute localized sums of pixels throughout the image, extracting sets of pixels that represent potential features. Training these algorithms involves supplying raw images along with manually annotated features, enabling the recognition of specified objects. The output generated by the CNN is the list of identified objects and their respective probability scores.
We used the Faster Region-Based Convolutional Neural Network (Faster R-CNN) dedicated to object detection (Ren et al., 2017). Faster R-CNN has proven to be the best type of architecture for processing objects within a large range of sizes and to provide higher accuracy than other models (Ammar et al., 2019; Bose and Kumar, 2020; Kaarmukilan et al., 2020; Lee et al., 2021). For these reasons, the architecture is particularly suited to applications in the field of marine biodiversity and is indeed commonly used for fish detection and classification (Blowers et al., 2020; Chen et al., 2023). The model was used with a hybrid inception module coupled to a residual network (ResNet) configuration (Inception-ResNet V2), with images processed in 1024x1024 format. The architecture was pre-trained on the COCO (Common Objects in Context) dataset (Lin et al., 2014) and is built as follows: 1) a feature extractor relying on inception (Szegedy et al., 2015) and residual connections (He et al., 2016) to embed the image, 2) a region proposal network composed of convolutional layers predicting the likelihood of object presence (Zhong et al., 2020), 3) a region of interest pooling layer deleting redundant bounding boxes, 4) fully connected layers refining the features of each object, and 5) a classification layer with a softmax function that outputs classification scores for each region proposal. Such a two-stage architecture is particularly efficient at processing images with objects of different sizes, fitting the context of fish detection and classification. All further details and the model architecture can be found in the TensorFlow 2 GitHub model directory. The training and testing BRUVS images annotated with deep-water snapper species were converted into the TensorFlow record format and supplied to the architecture. Model training and testing were carried out through the open-source TensorFlow API in Python 3. The hardware comprised four parallelized NVIDIA Quadro RTX 8000 cards with 196 GB of CPU memory and 42 GB of GPU memory, running on an Ubuntu operating system. The model was run for 200,000 iterations with a batch size of 8.
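For readers wishing to apply such a detector, models exported with the TensorFlow 2 Object Detection API expose a standard serving signature returning detection boxes, scores and classes. The sketch below illustrates inference on a single BRUVS frame; the model path and score threshold are illustrative assumptions, not the authors' settings.

```python
# Minimal inference sketch for a model exported with the TensorFlow 2 Object
# Detection API (e.g. Faster R-CNN Inception-ResNet V2, 1024x1024).
import tensorflow as tf

detect_fn = tf.saved_model.load("exported_model/saved_model")  # hypothetical path

def detect_snappers(image_path, score_threshold=0.5):
    image = tf.io.decode_jpeg(tf.io.read_file(image_path), channels=3)
    input_tensor = tf.expand_dims(image, axis=0)        # shape [1, H, W, 3], uint8
    outputs = detect_fn(input_tensor)                    # standard OD API output dict
    scores = outputs["detection_scores"][0].numpy()
    boxes = outputs["detection_boxes"][0].numpy()        # normalized [ymin, xmin, ymax, xmax]
    classes = outputs["detection_classes"][0].numpy().astype(int)
    keep = scores >= score_threshold
    return boxes[keep], classes[keep], scores[keep]
```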
The test dataset provided the number of true positives (a detection of the correct species where it has been manually annotated), false negatives (no detection in images with manually annotated species), and false positives (detection in an image where no individual was present, or the incorrect species detected). From these parameters, the common assessment metrics used in deep learning were computed: recall (1), precision (2), and F-measure (3) (Zhang and Zhang, 2009). Each metric’s value ranges from 0 to 1, with values closer to 1 indicating better performance.
Recall reflects the algorithm's essential ability to accurately detect and identify the desired features; it penalizes instances where detections should have taken place in the test dataset but were missed. It is calculated by dividing the number of true positives by the sum of true positives and false negatives:
recall = true positives / (true positives + false negatives)   (1)
Precision reflects the algorithm's detection error rate (false positives). It is calculated by dividing the number of true positives by the sum of true positives and false positives:
precision = true positives / (true positives + false positives)   (2)
The F-measure is a general indicator of model quality and is equal to the harmonic mean of recall and precision:
F-measure = 2 × (precision × recall) / (precision + recall)   (3)
These evaluation metrics were calculated for each of the eleven species seen across all frames of the test dataset.
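As a minimal illustration (not the authors' code), these per-species metrics can be computed directly from the counts of true positives, false positives and false negatives:

```python
# Minimal sketch: recall, precision and F-measure from counts of true positives
# (tp), false positives (fp) and false negatives (fn) for one species.
def detection_metrics(tp, fp, fn):
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision, "F-measure": f_measure}

# Example with hypothetical counts for one species
print(detection_metrics(tp=91, fp=17, fn=9))
```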
In order to evaluate the ability of the algorithm to estimate MaxN, the number of automatic detections per frame (MaxNAuto) in the test dataset was compared to the number of manual annotations (MaxNMan). First, the Pearson correlation coefficient was used for its simplicity in quantifying the strength and direction of the linear relationship between MaxNAuto and MaxNMan; a high correlation (close to 1) indicates a strong positive linear relationship between both indices. Then, using a standard linear regression, the intercept of the linear relationship between MaxNAuto and MaxNMan was tested against zero. The slope was also tested against 1 to evaluate whether the algorithm underestimated or overestimated the number of detections, and hence fish abundance.
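A possible implementation of this comparison is sketched below with hypothetical inputs, using SciPy rather than whatever statistical software the authors used: it computes the Pearson correlation, fits an ordinary least-squares regression, and tests the intercept against 0 and the slope against 1 (SciPy's linregress reports the slope test against 0, so the test against 1 is recomputed from the slope standard error).

```python
# Minimal sketch (hypothetical inputs): Pearson correlation plus regression tests
# of intercept = 0 and slope = 1 between manual and automatic MaxN.
import numpy as np
from scipy import stats

def compare_maxn(maxn_man, maxn_auto):
    maxn_man = np.asarray(maxn_man, dtype=float)
    maxn_auto = np.asarray(maxn_auto, dtype=float)
    n = len(maxn_man)

    r, _ = stats.pearsonr(maxn_man, maxn_auto)

    # Ordinary least squares: MaxNAuto = intercept + slope * MaxNMan
    fit = stats.linregress(maxn_man, maxn_auto)
    t_slope_vs_1 = (fit.slope - 1.0) / fit.stderr
    p_slope_vs_1 = 2 * stats.t.sf(abs(t_slope_vs_1), df=n - 2)
    t_intercept = fit.intercept / fit.intercept_stderr
    p_intercept = 2 * stats.t.sf(abs(t_intercept), df=n - 2)

    return {"pearson_r": r, "slope": fit.slope, "p_slope_vs_1": p_slope_vs_1,
            "intercept": fit.intercept, "p_intercept_vs_0": p_intercept}
```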
Next, we proposed a semi-automatic approach that combines the trained algorithm with manual intervention on images containing detections. This method aimed to evaluate the potential of deep learning-assisted video processing. All images where the Faster R-CNN detected deep-water snappers were reviewed and manually corrected by an expert biologist. This process eliminated false positives, leaving only errors due to false negatives. Using this protocol, we recalculated the model metrics based on the corrected misclassifications. With no remaining false positives, the precision metrics consistently reached 1. The semi-automatic MaxN (MaxNSemi) was then compared to MaxNMan using Pearson correlation and linear regression, as was done for MaxNAuto.
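The bookkeeping of this semi-automatic protocol can be sketched as follows (illustrative data structures only, not the authors' code): only frames with at least one automatic detection are shown to the expert, whose corrections are then used to recompute MaxN.

```python
# Minimal sketch of the semi-automatic protocol: frames with at least one
# automatic detection are reviewed by an expert, who removes false positives and
# adds missed individuals; MaxN is recomputed from the corrected frames.
def semi_automatic_maxn(auto_detections, expert_review):
    """auto_detections: {frame_id: [species, ...]} output by the Faster R-CNN.
    expert_review: callable (frame_id, species_list) -> corrected species list."""
    corrected = {}
    for frame_id, species_list in auto_detections.items():
        if species_list:                    # frames with no detection are never reviewed
            corrected[frame_id] = expert_review(frame_id, species_list)
    maxn = {}
    for species_list in corrected.values():
        for sp in set(species_list):
            maxn[sp] = max(maxn.get(sp, 0), species_list.count(sp))
    return maxn
```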
Faster R-CNN training lasted four days to execute 200,000 iterations on the multi-GPU machine. Out of the 1,333 testing images comprising 2,318 annotated fish, the trained Faster R-CNN automatically detected 2,351 fish, of which 1,786 were true positives (76%) and 565 were false positives.
The F-measure of automatic detections ranged from 0.15 to 0.87, indicating considerable variation in the evaluation measures per species. The largest values were obtained for Etelis coruscans (F-measure: 0.87, recall: 0.91, precision: 0.84), closely followed by Pristipomoides filamentosus (F-measure: 0.79, recall: 0.86, precision: 0.73, Table 2). Pristipomoides multidens was not detected in any of the 13 testing observations, hence values of 0 for recall and precision. Pristipomoides zonatus was hardly detected in its equally few testing observations (recall of 0.08 on 12 annotations); however, the model never classified another deep-water snapper as this species (precision of 1.0). These two species, along with Etelis carbunculus and Parapristipomoides squamimaxillaris, were those with fewer than 89 annotations to train the model. Species with comparatively higher annotation numbers (>186, from Aprion virescens up to 5,729 for P. filamentosus) showed F-measures of at least 0.71 (Pristipomoides flavipinnis). A sample of the testing dataset is illustrated in Figure 1.
Table 2. Evaluation metrics (recall, precision and F-measure) generated from the testing dataset for 10 deep-water snapper species on the trained Faster R-CNN (automatic) and the corrected detections from the Faster R-CNN (semi-automatic).
Figure 1. Examples of correct (A–D) and incorrect (E–H) detections on the test dataset. (A) Six correct detections of Pristipomoides filamentosus. (B) Correct detection of Aphareus rutilans and P. filamentosus while correctly leaving two emperors (Lethrinus miniatus) and a grouper (Epinephelus maculatus) undetected. (C) Correct detection of Etelis coruscans. (D) Correct detection of three A. rutilans and a single Pristipomoides flavipinnis. (E) Correct detection of the single P. filamentosus with incorrect detection of an emperor (Gymnocranius euanus) as P. filamentosus and a surgeonfish (Naso hexacanthus) as Aprion virescens. (F) Incorrect classification of A. rutilans and an emperor (L. miniatus) as P. filamentosus. (G) Incorrect classification of A. rutilans and a grouper (Epinephelus chlorostigma) as P. filamentosus. (H) Incorrect classification of two P. flavipinnis as P. filamentosus and a grouper (Variola louti) as Etelis carbunculus.
The semi-automatic approach, in which the expert corrected classification errors, yielded F-measures ranging from 0.15 to 1 (Table 2). A drastic increase in performance metrics was observed for species with a higher number of annotations (>186), with semi-automatic F-measures ranging from 0.86 for A. virescens to 1 for E. coruscans, which showed no remaining false negatives in the testing dataset. P. filamentosus, with the highest number of images tested (1,303), returned an F-measure of 0.97 compared to 0.86 without correction. The largest increases in F-measure were for A. rutilans and P. flavipinnis, from 0.66 and 0.65 to 0.86 and 0.89, respectively.
For the analysis of fish abundance on whole BRUVS (MaxN), we focused on species with F-measures greater than or equal to 0.71 (i.e., those with more than 100 annotations, Table 2), as models with lower F-measures provided poor abundance estimates. High correlation coefficients were observed between manually and automatically estimated fish abundance (Figure 2; Table 3). The Pearson correlation coefficient ranged from 0.72 to 0.90 among species, with the highest value observed for Etelis coruscans and an overall value of 0.85 when combining data from all species.
Figure 2. Comparison of manual (MaxNMan), automatic (MaxNAuto) and semi-automatic (MaxNSemi) fish abundance on baited remote underwater video stations (BRUVS) using the R-CNN trained on the deep-water snapper species. Only species with more than 100 annotations were considered relevant for this analysis. Point size is proportional to the number of detections against annotated fish. Automatic and semi-automatic linear fits are also shown with a dotted reference line of slope 1 and intercept 0.
Table 3. R squared (R²), Pearson correlation coefficient (correlation), test of intercept against zero (intercept), test of slope against zero (slope) and test of slope against one (slope = 1 p-value) for automatic (MaxNAuto) and semi-automatic (MaxNsemi) counts.
The slope coefficients for each individual species and for all species combined were significantly different from zero (p < 0.001). However, while automatic fish abundances appeared comparable to manual abundances for up to three to four individuals in the same frame, the Faster R-CNN model tended to underestimate higher abundances, with slope coefficients significantly smaller than 1 for each species and for all species combined. Slope coefficients nonetheless ranged from 0.65 to 0.88, with the highest value found when considering all species together (Figure 2). Except for P. argyrogrammicus, all intercepts were significantly different from zero, but with marginal deviations (range: 0.01 to 0.05, except for P. filamentosus: 0.31).
The semi-automatic protocol yielded fish abundance estimates much closer to manual counts, with a Pearson correlation coefficient of 0.96 for all species combined (Figure 2; Table 3). Correlations ranged from 0.86 to 1 depending on the species. E. coruscans showed a perfect fit (slope = 1, intercept = 0) with semi-automatic MaxN identical to manual MaxN. The slope between MaxNMan and MaxNSemi was not significantly different from one for P. filamentosus, revealing extremely good semi-automatic model performance.
The use of the Faster R-CNN algorithm to automatically detect, identify and count deep-water snappers proved successful and highly promising considering the challenge this group of fish presents and the variable background habitat. The algorithm effectively differentiated between species that were very similar and hard to distinguish, even for an experienced taxonomist. While the detection and identification will probably need post-verification until enough annotations are gathered to achieve automatic F-measures above 0.9 for all species, the abundance estimations were still consistent with manual counts. This procedure can already be employed for automatic deep-sea snapper monitoring, or semi-automatic monitoring, where observers would save substantial processing time by simply verifying and adjusting detections rather than processing entire BRUVS videos.
It is crucial for fisheries stock management to be able to work at the species level. This deep-water snapper dataset represents a fine addition to existing collections, covering varying habitat constraints such as the presence or absence of natural light and both hard and soft substrates. This species group is especially challenging because of the similar appearance of its members. Deep-water snappers are mostly "greyish", "fish-looking" species, posing an identification challenge, particularly for P. filamentosus and P. flavipinnis, which share almost identical characteristics (Figure 3). E. coruscans stands out with its reddish color and long, elongated tail tips, allowing the algorithm to distinguish it from other species and leading to the highest recall, precision, and F-measure metrics. Furthermore, the semi-automatic treatment of E. coruscans yielded individual detections and abundance values that matched the manual estimates perfectly. This is highly encouraging, considering E. coruscans is a highly targeted species of this fishery (Newman et al., 2016). P. filamentosus had the highest number of annotations and images, which likely explains its high identification success rate. The larger the training database per feature, the better the identification by the Faster R-CNN algorithm, which typically requires at least 1,300 training images per feature to achieve over 95% certainty in fish identification (Villon et al., 2018). In our study, only two of the 11 species studied (P. filamentosus and P. flavipinnis) met this training size requirement.
Figure 3. Correct automatic detection of the closely related and similar-looking deep-water species Pristipomoides filamentosus (yellow) and P. flavipinnis (purple). An expert would rely on the accentuated yellow eye color and the slight vertical band pattern presented only by P. flavipinnis, and would require frames before and after this image to confirm the identification.
While a human observer may browse through the video sequence to observe color, behavior, movements, and other clues to identify species and count individuals, the algorithm is restricted to each single image to make its decision. That the algorithm was able to effectively differentiate between snapper species with so little information at hand is therefore very encouraging. However, errors were still observed, with many false positives caused by rarer species (e.g., A. rutilans) being confused with the most common ones (P. filamentosus, Figure 1). This type of confusion was easily corrected by the intervention of an expert during the semi-automatic counting protocol, as the fish was still detected. The expert corrected every false positive, and precision became equal to one. Additionally, semi-automatic recall also increased compared to its value under the automatic protocol. This is because some fish were not detected in frames where other individuals were detected: since the expert corrected entire frames, undetected individuals were also annotated, thereby reducing the number of false negatives and increasing recall. For example, the recall of A. rutilans increased from 0.66 to 0.80, indicating that this species was present in many frames with other detected species. However, the recall of A. virescens remained the same, indicating that no further detections of this species occurred in frames where other snappers were detected by the algorithm. The confusion between species could be partly due to the disparity in available images between similar species, with those having fewer images being misclassified more often than those with more images. While a semi-automatic protocol can partly address the issue, an alternative solution might involve adding temporal information through motion analysis or a tracking algorithm that would isolate the background or follow the same individuals, thereby carrying detection and identification information from previous frames to subsequent ones (Shin, 2016; Jalal et al., 2020). The other major constraint highlighted in this study is the underestimation bias at higher abundances. We observed that frames involving many fish can easily become saturated (notably for P. filamentosus, cf. Figure 1A), with a few individuals blocking the camera's field of view. This bias seems rather inevitable given the algorithm's dependence on the video sampling system, with a single sensor and angle of view. The MaxN abundance index, based on the maximum number of fish present in the same frame, is known to be sensitive to image saturation (MacNeil et al., 2020). The bias was also reported in another study on a different snapper species in a different configuration (a daylight reef) (Connolly et al., 2021). Our semi-automatic protocol could correct this bias for the two species that presented the highest MaxN, E. coruscans and P. filamentosus, yielding F-measures > 0.96 after correction by an expert taxonomist. The tracking of individuals across successive frames might also permit better differentiation of individuals saturating images, hence reducing or removing the bias in MaxN at high abundance, much as a human expert does.
We are confident that our trained Faster R-CNN algorithm is already operational for fisheries assessment using our semi-automatic detection procedure. The whole process of using BRUVS to assess fish abundance is nondestructive, independent from fisheries data, and may now become cost-effective with the support of artificial intelligence. Our model, as it is, can provide a matrix of detections per species for each frame of the video stations. The frames with the greatest number of detections per species can then be identified and used as references to define video intervals of a few seconds including the MaxN of the different species. These short video sequences could then be processed by biologists using programs like EventMeasure, reducing hours of video processing to minutes. Furthermore, the manual processing of these short video sequences would be limited to simply correcting algorithmic detections, which would further speed up the process. Additionally, new annotations should be used to retrain the algorithm and further improve its performance. If stereo cameras are used, fish size could be measured in addition to abundance. Although size measurements are so far performed manually using programs like EventMeasure (Letessier et al., 2015), algorithms exist to automatically measure object dimensions on videos, for example through instance segmentation (Othman et al., 2018; Garcia-d'Urso et al., 2022). Their ongoing development represents the next stage, and their application to BRUVS and fisheries management is warranted.
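As a sketch of the workflow described above (the frame rate and the review-window length are assumptions, not prescriptions from the study), the per-frame detection matrix can be reduced to one short candidate interval per species, centered on that species' peak frame, for subsequent review in EventMeasure:

```python
# Minimal sketch (assumed 30 fps and a 15 s review window): turn a per-frame
# detection matrix into one candidate interval per species, centered on the
# frame with the most detections of that species.
def candidate_intervals(detection_matrix, fps=30, window_s=15):
    """detection_matrix: {frame_index: {species: count}} for one BRUVS deployment.
    Returns {species: (start_s, end_s)} video intervals to review."""
    best_frame = {}
    for frame, counts in detection_matrix.items():
        for sp, n in counts.items():
            if n > best_frame.get(sp, (0, -1))[0]:
                best_frame[sp] = (n, frame)
    half = window_s / 2
    intervals = {}
    for sp, (_n, frame) in best_frame.items():
        t = frame / fps
        intervals[sp] = (max(0.0, t - half), t + half)
    return intervals
```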
Some caveats can still be discussed for further improvement. The uneven distribution of training images among species calls for increased sampling to complete the dataset and improve identification accuracy (Villon et al., 2018). Our current algorithm may still drastically reduce annotation times for rarer species, as they are detected but mostly confused with more common species. However, rarity is a key characteristic of biodiversity, and a large number of annotations can remain difficult to gather for the rarest species (Villon et al., 2022). In this case, methods like few-shot deep learning could be coupled with the Faster R-CNN to compensate for the lack of annotations (Villon et al., 2021). Coupling with other BRUVS datasets from other regions may also improve the algorithm's performance but may then face issues related to changes in environmental conditions across regions (Kalogeiton et al., 2016). However, while our study relied on a dataset restricted to New Caledonia, the sampling occurred across the spatially immense EEZ and across depths ranging from shallow photic seamounts (50-60 m deep) to deep aphotic seamounts and continental deep slopes (150-500 m deep), exploring diverse environmental backgrounds and light intensities (Baletaud et al., 2023).
While this case study involved a particularly constraining group of species (similar-looking deep-water snappers) in variable background conditions of light and habitat, it further shows that Faster R-CNN is a worthy architecture that may be used in many scenarios involving fish species detection. The methodology is applicable to any visually identifiable fish species provided sufficient training images are available for the model, which is the main constraint for any deep learning development (Ahmad et al., 2023). New CNN architectures are released ever more frequently, improving classification speed and accuracy, and reviewing them on this new dataset will prove interesting, although it is beyond the scope of this study. The potential for deep learning to improve the day-to-day work of marine scientists in monitoring fisheries seems assured for the future (Zhang et al., 2021). The transition is progressive, and a semi-automatic approach may be the closest to being adopted by operational monitoring organizations or consultancy firms using this work.
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to laurent.vigliola@ird.fr.
Ethical approval was not required for this study involving animals captured by video in accordance with the local legislation and institutional requirements because the data analyzed was from a previous study.
FB: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. SV: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – review & editing. AG: Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing. J-MC: Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing. SF: Data curation, Investigation, Resources, Writing – review & editing. CI: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Writing – review & editing. LV: Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – review & editing.
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The study was funded by grant ANR “SEAMOUNTS” #ANR-18-CE02-0016, the French Oceanographic Fleet, and IRD core funding. FB was supported by grant ANRT CIFRE #2019/0105.
We would like to thank the many undergraduate students and technical staff for their help with the annotation process at the lab. Data was collected under permits 2019-733/GNC, 2020-503/GNC and 2020-1077/GNC delivered by the Government of New-Caledonia, 898-2019/ARR/DENV, 3066-2019/ARR/DENV, 844-2020/ARR/DDDT and 1955-2020/ARR/DDDT delivered by the Southern Province of New-Caledonia, and 609011/2019/DEPART/JJC, 609011-18/2019/DEPART/JJC and 609011-39/2020/DEPART/JJC delivered by the Northern Province of New-Caledonia.
FB, AG and J-MC were employed by Groupe GINGER.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Ahmad U., Junaid Ali M., Ahmed Khan F., Ahmad Khan A., Ur Rehman A., Muhammad Ali Shahid M., et al. (2023). Large scale fish images classification and localization using transfer learning and localization aware CNN architecture. Comput. Syst. Sci. Eng. 45, 2125–2140. doi: 10.32604/csse.2023.031008
Ammar A., Koubaa A., Ahmed M., Saad A., Benjdira B. (2019). Aerial images processing for car detection using convolutional neural networks: comparison between faster R-CNN and YoloV3. Electronics (Basel) 10, 820. doi: 10.3390/electronics10070820
Ault J. S., Smith S. G., Richards B. L., Yau A. J., Langseth B. J., O’Malley J. M., et al. (2018). Towards fishery-independent biomass estimation for Hawaiian Islands deepwater snappers. Fish Res. 208, 321–328. doi: 10.1016/j.fishres.2018.08.012
Baletaud F., Lecellier G., Gilbert A., Mathon L., Côme J.-M., Dejean T., et al. (2023). Comparing seamounts and coral reefs with eDNA and BRUVS reveals oases and refuges on shallow seamounts. Biol. (Basel) 12, 1446. doi: 10.3390/biology12111446
Bhalla S., Kumar A., Kushwaha R. (2024). Analysis of recent techniques in marine object detection: a review. Multimed Tools Appl. doi: 10.1007/s11042-024-19782-9
Blowers S., Evans J., Mcnally K. (2020). Automated identification of fish and other aquatic life in underwater video. Scottish Mar. Freshw. Sci. 11, 1–62. doi: 10.7489/12333-1
Boldt J. L., Williams K., Rooper C. N., Towler R. H., Gauthier S. (2018). Development of stereo camera methodologies to improve pelagic fish biomass estimates and inform ecosystem management in marine waters. Fish Res. 198, 66–77. doi: 10.1016/j.fishres.2017.10.013
Bond T., Partridge J. C., Taylor M. D., Cooper T. F., McLean D. L. (2018). The influence of depth and a subsea pipeline on fish assemblages and commercially fished species. PloS One 13, e0207703. doi: 10.1371/journal.pone.0207703
Bose S. R., Kumar V. S. (2020). Efficient inception V2 based deep convolutional neural network for real-time hand action recognition. IET Image Process 14, 688–696. doi: 10.1049/iet-ipr.2019.0985
Cappo M., De’ath G., Speare P. (2007). Inter-reef vertebrate communities of the Great Barrier Reef Marine Park determined by baited remote underwater video stations. Mar. Ecol. Prog. Ser. 350, 209–221. doi: 10.3354/meps07189
Cappo M., Speare P., De’ath G. (2004). Comparison of baited remote underwater video stations (BRUVS) and prawn (shrimp) trawls for assessments of fish biodiversity in inter-reefal areas of the Great Barrier Reef Marine Park. J. Exp. Mar. Biol. Ecol. 302, 123–152. doi: 10.1016/j.jembe.2003.10.006
Chen M.-H., Lai T.-H., Chen Y.-C., Chou T.-Y. (2023). A robust fish species classification framework: FRCNN-VGG16-SPPNet. doi: 10.21203/rs.3.rs-2825927/v1
Christin S., Hervet É., Lecomte N. (2019). Applications for deep learning in ecology. Methods Ecol. Evol. 10, 1632–1644. doi: 10.1111/2041-210X.13256
Connolly R. M., Fairclough D. V., Jinks E. L., Ditria E. M., Jackson G., Lopez-Marcano S., et al. (2021). Improved accuracy for automated counting of a fish in baited underwater videos for stock assessment. Front. Mar. Sci. 8. doi: 10.3389/fmars.2021.658135
Dalzell P., Preston G. L. (1992). Deep reef slope fishery resources of the South Pacific (Noumea (New Caledonia: South Pacific Commission).
Ellender B. R., Becker A., Weyl O. L. F., Swartz E. R. (2012). Underwater video analysis as a non-destructive alternative to electrofishing for sampling imperiled headwater stream fishes. Aquat Conserv. 22, 58–65. doi: 10.1002/aqc.1236
Elsken T., Metzen J. H., Hutter F. (2018). Neural architecture search: A survey. J. Mach. Learn. Res. 20, 1–21. doi: 10.48550/arXiv.1808.05377
Garcia-d’Urso N., Galan-Cuenca A., Pérez-Sánchez P., Climent-Pérez P., Fuster-Guillo A., Azorin-Lopez J., et al. (2022). The DeepFish computer vision dataset for fish instance segmentation, classification, and size estimation. Sci. Data 9, 287. doi: 10.1038/s41597-022-01416-0
Gladstone W., Lindfield S., Coleman M., Kelaher B. (2012). Optimization of baited remote underwater video sampling designs for estuarine fish assemblages. J. Exp. Mar. Biol. Ecol. 429, 28–35. doi: 10.1016/j.jembe.2012.06.013
Gomez C., Williams A. J., Nicol S. J., Mellin C., Loeun K. L., Bradshaw C. J. A. (2015). Species distribution models of tropical deep-sea snappers. PloS One 10, 1–17. doi: 10.1371/journal.pone.0127395
Griffin R. A., Robinson G. J., West A., Gloyne-Phillips I. T., Unsworth R. K. F. (2016). Assessing fish and motile fauna around offshore windfarms using stereo baited video. PloS One 11, 1–15. doi: 10.1371/journal.pone.0149701
He K., Zhang X., Ren S., Sun J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (Las Vegas: CVPR), 770–778. Available at: http://image-net.org/challenges/LSVRC/2015/.
Henderson C., Olds A., Lee S., Gilby B., Maxwell P., Connolly R., et al. (2017). Marine reserves and seascape context shape fish assemblages in seagrass ecosystems. Mar. Ecol. Prog. Ser. 566, 135–144. doi: 10.3354/meps12048
Jalal A., Salman A., Mian A., Shortis M., Shafait F. (2020). Fish detection and species classification in underwater environments using deep learning with temporal information. Ecol. Inform 57, 101088. doi: 10.1016/j.ecoinf.2020.101088
Jian M., Yang N., Tao C., Zhi H., Luo H. (2024). Underwater object detection and datasets: a survey. Intelligent Mar. Technol. Syst. 2, 9. doi: 10.1007/s44295-024-00023-6
Kaarmukilan S. P., Poddar S., Thomas A. K. (2020). “FPGA based Deep Learning Models for Object Detection and Recognition Comparison of Object Detection: Comparison of object detection models using FPGA,” in 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC) (Erode, India: IEEE), 471–474. doi: 10.1109/ICCMC48092.2020.ICCMC-00088
Kalogeiton V., Ferrari V., Schmid C. (2016). Analyzing domain shift factors between videos and images for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 38, 2327–2334. doi: 10.1109/TPAMI.2016.2551239
Kim C. E., Dar Oghaz M. M., Fajtl J., Argyriou V., Remagnino P. (2018). “A comparison of embedded deep learning methods for person detection,” in VISIGRAPP 2019 - Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, vol. 5. , 459–465. doi: 10.5220/0007386304590465
Langlois T., Goetze J., Bond T., Monk J., Abesamis R. A., Asher J., et al. (2020). A field and video annotation guide for baited remote underwater stereo-video surveys of demersal fish assemblages. Methods Ecol. Evol. 11, 1401–1409. doi: 10.1111/2041-210X.13470
Lee Y.-H., Kim Y. (2020). Comparison of CNN and YOLO for object detection. J. Semiconductor Display Technol. 19, 85–92.
Lee J., Wang P., Xu R., Dasari V., Weston N., Li Y., et al. (2021). “Benchmarking video object detection systems on embedded devices under resource contention,” in Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning (ACM, New York, NY, USA), 19–24. doi: 10.1145/3469116.3470010
Letessier T. B., Juhel J. B., Vigliola L., Meeuwig J. J. (2015). Low-cost small action cameras in stereo generates accurate underwater measurements of fish. J. Exp. Mar. Biol. Ecol. 466, 120–126. doi: 10.1016/j.jembe.2015.02.013
Letessier T. B., Mouillot D., Bouchet P. J., Vigliola L., Fernandes M. C., Thompson C., et al. (2019). Remote reefs and seamounts are the last refuges for marine predators across the Indo-Pacific. PloS Biol. 17, e3000366. doi: 10.1371/journal.pbio.3000366
Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., et al. (2014). “Microsoft COCO: Common Objects in Context,” (Cham: Springer) 740–755. doi: 10.1007/978-3-319-10602-1_48
Liu M., Jiang W., Hou M., Qi Z., Li R., Zhang C. (2023). A deep learning approach for object detection of rockfish in challenging underwater environments. Front. Mar. Sci. 10. doi: 10.3389/fmars.2023.1242041
Lopez-Marcano S., Brown C. J., Sievers M., Connolly R. M. (2021). The slow rise of technology: Computer vision techniques in fish population connectivity. Aquat Conserv. 31, 210–217. doi: 10.1002/aqc.3432
MacNeil M. A., Chapman D. D., Heupel M., Simpfendorfer C. A., Heithaus M., Meekan M., et al. (2020). Global status and conservation potential of reef sharks. Nature 583, 801–806. doi: 10.1038/s41586-020-2519-y
Mahendrakar T., Ekblad A., Fischer N., White R., Wilde M., Kish B., et al. (2022). “Performance study of YOLOv5 and faster R-CNN for autonomous navigation around non-cooperative targets,” in 2022 IEEE Aerospace Conference (AERO) (Big Sky, MT, USA: IEEE), 1–12. doi: 10.1109/AERO53065.2022.9843537
Mandal R., Connolly R. M., Schlacher T. A., Stantic B. (2018). “Assessing fish abundance from underwater video using deep neural networks,” in 2018 International Joint Conference on Neural Networks (IJCNN) (Rio de Janeiro, Brazil: IEEE), 1–6. doi: 10.1109/IJCNN.2018.8489482
Mannocci L., Villon S., Chaumont M., Guellati N., Mouquet N., Iovan C., et al. (2021). Leveraging social media and deep learning to detect rare megafauna in video surveys. Conserv. Biol. 36, 1–11. doi: 10.1111/cobi.13798
Marrable D., Barker K., Tippaya S., Wyatt M., Bainbridge S., Stowar M., et al. (2022). Accelerating species recognition and labelling of fish from underwater video with machine-assisted deep learning. Front. Mar. Sci. 9. doi: 10.3389/fmars.2022.944582
Moore C. H., Drazen J. C., Kelley C. D., Misa W. F. X. E. (2013). Deepwater marine protected areas of the main Hawaiian Islands: Establishing baselines for commercially valuable bottom fish populations. Mar. Ecol. Prog. Ser. 476, 167–183. doi: 10.3354/meps10132
Newman S. J., Williams A. J., Wakefield C. B., Nicol S. J., Taylor B. M., O’Malley J. M. (2016). Review of the life history characteristics, ecology and fisheries for deep-water tropical demersal fish in the Indo-Pacific region. Rev. Fish Biol. Fish 26, 537–562. doi: 10.1007/s11160-016-9442-1
Osgood G. J., McCord M. E., Baum J. K. (2019). Using baited remote underwater videos (BRUVs) to characterize chondrichthyan communities in a global biodiversity hotspot. PloS One 14, e0225859. doi: 10.1371/journal.pone.0225859
Othman N. A., Salur M. U., Karakose M., Aydin I. (2018). “An embedded real-time object detection and measurement of its size,” in 2018 International Conference on Artificial Intelligence and Data Processing (IDAP) (Malatya, Turkey: IEEE), 1–4. doi: 10.1109/IDAP.2018.8620812
Payri C. E., Allain V., Aucan J., David C., David V., Dutheil C., et al. (2019). “New Caledonia,” in World Seas: An Environmental Evaluation (Elsevier), 593–618.
Reis-Filho J. A., Schmid K., Harvey E. S., Giarrizzo T. (2019). Coastal fish assemblages reflect marine habitat connectivity and ontogenetic shifts in an estuary-bay-continental shelf gradient. Mar. Environ. Res. 148, 57–66. doi: 10.1016/j.marenvres.2019.05.004
Ren S., He K., Girshick R., Sun J. (2017). “Faster R-CNN: Towards real-time object detection with region proposal networks,” in IEEE Trans Pattern Anal Mach Intell. 39, 1137–1149. doi: 10.1109/TPAMI.2016.2577031
Saleh A., Sheaves M., Jerry D., Azghadi M. R. (2024). Applications of Deep Learning in Fish Habitat Monitoring: A Tutorial and Survey. Available online at: http://arxiv.org/abs/2206.05394.
Saleh A., Sheaves M., Rahimi Azghadi M. (2022). Computer vision and deep learning for fish classification in underwater habitats: A survey. Fish Fisheries 23, 977–999. doi: 10.1111/faf.12666
Sarma K. S. R. K., Sasikala C., Surendra K., Erukala S., Aruna S. L. (2024). A comparative study on faster R-CNN, YOLO and SSD object detection algorithms on HIDS system, in AIP Conference Proceedings, 060044. doi: 10.1063/5.0195857.
Schmid K., Reis-Filho J. A., Harvey E., Giarrizzo T. (2017). Baited remote underwater video as a promising nondestructive tool to assess fish assemblages in clearwater Amazonian rivers: testing the effect of bait and habitat type. Hydrobiologia 784, 93–109. doi: 10.1007/s10750-016-2860-1
Schobernd Z. H., Bacheler N. M., Conn P. B. (2014). Examining the utility of alternative video monitoring metrics for indexing reef fish abundance. Can. J. Fisheries Aquat. Sci. 71, 464–471. doi: 10.1139/cjfas-2013-0086
Schramm K. D., Marnane M. J., Elsdon T. S., Jones C., Saunders B. J., Goetze J. S., et al. (2020). A comparison of stereo-BRUVs and stereo-ROV techniques for sampling shallow water fish communities on and off pipelines. Mar. Environ. Res. 162, 105198. doi: 10.1016/j.marenvres.2020.105198
Schramm K. D., Marnane M. J., Elsdon T. S., Jones C. M., Saunders B. J., Newman S. J., et al. (2021). Fish associations with shallow water subsea pipelines compared to surrounding reef and soft sediment habitats. Sci. Rep. 11, 1–15. doi: 10.1038/s41598-021-85396-y
Sekachev B., Manovich N., Zhiltsov M., Zhavoronkov A., Kalinin D., Hoff B., et al. (2020). Computer Vision Annotation Tool (CVAT) (Zenodo). doi: 10.5281/zenodo.4009388
Sheaves M., Bradley M., Herrera C., Mattone C., Lennard C., Sheaves J., et al. (2020). Optimizing video sampling for juvenile fish surveys: Using deep learning and evaluation of assumptions to produce critical fisheries parameters. Fish Fisheries 21, 1259–1276. doi: 10.1111/faf.12501
Shin K. J. (2016). Robot fish tracking control using an optical flow object-detecting algorithm. IEIE Trans. Smart Process. Computing 5, 375–382. doi: 10.5573/IEIESPC.2016.5.6.375
Szegedy C., Liu W., Jia Y., Sermanet P., Reed S., Anguelov D., et al. (2015). “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition. (Boston: CVPR), 1–9.
Tseng C.-H., Kuo Y.-F. (2020). Detecting and counting harvested fish and identifying fish types in electronic monitoring system videos using deep convolutional neural networks. ICES J. Mar. Sci. 77, 1367–1378. doi: 10.1093/icesjms/fsaa076
Villon S., Iovan C., Mangeas M., Claverie T., Mouillot D., Villéger S., et al. (2021). Automatic underwater fish species classification with limited data using few-shot learning. Ecol. Inform 63, 101320. doi: 10.1016/j.ecoinf.2021.101320
Villon S., Iovan C., Mangeas M., Vigliola L. (2022). Confronting deep-learning and biodiversity challenges for automatic video-monitoring of marine ecosystems. Sensors 22, 497. doi: 10.3390/s22020497
Villon S., Mouillot D., Chaumont M., Darling E. S., Subsol G., Claverie T., et al. (2018). A Deep learning method for accurate and fast identification of coral reef fishes in underwater images. Ecol. Inform 48, 238–244. doi: 10.1016/j.ecoinf.2018.09.007
Villon S., Mouillot D., Chaumont M., Subsol G., Claverie T., Villéger S. (2020). A new method to control error rates in automated species identification with deep learning algorithms. Sci. Rep. 10, 10972. doi: 10.1038/s41598-020-67573-7
Wakefield C. B., Williams A. J., Fisher E. A., Hall N. G., Hesp S. A., Halafihi T., et al. (2020). Variations in life history characteristics of the deep-water giant ruby snapper (Etelis sp.) between the Indian and Pacific Oceans and application of a data-poor assessment. Fish Res. 230, 105651. doi: 10.1016/j.fishres.2020.105651
Wellington C. M., Harvey E. S., Wakefield C. B., Langlois T. J., Williams A., White W. T., et al. (2018). Peak in biomass driven by larger-bodied meso-predators in demersal fish communities between shelf and slope habitats at the head of a submarine canyon in the south-eastern Indian Ocean. Cont Shelf Res. 167, 55–64. doi: 10.1016/j.csr.2018.08.005
Whitmarsh S. K., Fairweather P. G., Huveneers C. (2017). What is Big BRUVver up to? Methods and uses of baited underwater video. Rev. Fish Biol. Fish 27, 53–73. doi: 10.1007/s11160-016-9450-1
Williams A. J., Nicol S. J., Bentley N., Starr P. J., Newman S. J., McCoy M. A., et al. (2012). International workshop on developing strategies for monitoring data-limited deepwater demersal line fisheries in the Pacific Ocean. Rev. Fish Biol. Fish 22, 527–531. doi: 10.1007/s11160-011-9234-6
Xu S., Zhang M., Song W., Mei H., He Q., Liotta A. (2023). A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 527, 204–232. doi: 10.1016/j.neucom.2023.01.056
Zhang R., Li S., Ji G., Zhao X., Li J., Pan M. (2021). Survey on deep learning-based marine object detection. J. Adv. Transp 2021, 1–18. doi: 10.1155/2021/5808206
Zhang E., Zhang Y. (2009). “F-measure,” in Encyclopedia of Database Systems (Springer US, Boston, MA), 1147–1147. doi: 10.1007/978-0-387-39940-9_483
Zhong Y., Wang J., Peng J., Zhang L. (2020). “Anchor box optimization for object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (Snowmass, CO, USA: WACV), 1286–1294.
Keywords: deep-water snapper fisheries, artificial intelligence, semi-automatic, BRUVS, faster R-CNN
Citation: Baletaud F, Villon S, Gilbert A, Côme J-M, Fiat S, Iovan C and Vigliola L (2025) Automatic detection, identification and counting of deep-water snappers on underwater baited video using deep learning. Front. Mar. Sci. 12:1476616. doi: 10.3389/fmars.2025.1476616
Received: 06 August 2024; Accepted: 20 January 2025;
Published: 06 February 2025.
Edited by: Maria Grazia Pennino, Spanish Institute of Oceanography (IEO), Spain
Reviewed by: Andrés Fuster-Guilló, University of Alicante, Spain
Copyright © 2025 Baletaud, Villon, Gilbert, Côme, Fiat, Iovan and Vigliola. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Florian Baletaud, florianbaletaud@hotmail.com