
ORIGINAL RESEARCH article

Front. Mar. Sci., 22 July 2024
Sec. Ocean Observation

Fish Acoustic Detection Algorithm Research: a deep learning app for Caribbean grouper calls detection and call types classification

  • 1Harbor Branch Oceanographic Institute, Florida Atlantic University, Fort Pierce, FL, United States
  • 2Department of Electrical Engineering and Computer Science (EECS), Florida Atlantic University, Boca Raton, FL, United States
  • 3HJR Reefscaping, Boquerón, Puerto Rico

In this paper, we present the first machine learning package developed specifically for fish call identification within a specific frequency range (0–500 Hz) that encompasses four Caribbean grouper species: red hind (E. guttatus), Nassau (E. striatus), yellowfin (M. venenosa), and black (M. bonaci). Because of their ubiquity in the soundscape of the groupers’ habitat, squirrelfish (Holocentrus spp.) sounds and vessel noise are also detected. In addition, the model can separate grouper call types within species. This package, called FADAR (Fish Acoustic Detection Algorithm Research), is a standalone, user-friendly application developed in Matlab™. The concept of FADAR is the product of the evaluation of various deep learning architectures that have been presented in a series of published articles. FADAR is composed of a main algorithm that can detect the calls of all species, including their call types. The architecture of this model is based on an ensemble approach in which a bank of five CNNs with randomly assigned hyperparameters forms an ensemble of classifiers. The outputs of all five CNNs are combined by a fusion process for decision making. At the species level, the output of the multimodel is then used to classify the calls in terms of their types. This is done by species-specific deep learning models that have been thoroughly evaluated in the literature on the species concerned here, including transfer learning for red hind and yellowfin groupers and a custom-designed CNN for Nassau grouper, which has a greater number of known call types than the other species. FADAR was trained on manually labeled data that span various regions of the Caribbean Sea as well as two recorder brands and multiple hydrophone sensitivities, calibrations and sampling rates, including a mobile platform. This strategy conferred on FADAR substantial robustness to the diversity of noise levels and sources found in the grouper call frequency band, such as vessels and marine mammals. Performance metrics based on sensitivity (recall) and specificity showed the same performance level for both balanced and unbalanced datasets and at locations not used in the training set.

1 Introduction

Many fish species undergo long distance migrations, where mature adults gather in high densities to form large spawning aggregations that recur in time and space (Domeier and Colin, 1997). Many known fish spawning aggregation (FSA) sites are also multi-species breeding hot spots (Heyman and Kjerfve, 2008), which increases the vulnerability of these spawning populations to harvest and environmental changes (Erisman and Rowell, 2017). Site fidelity, temporal predictability, bathymetric features (i.e., shelf-break, capes) and circulation anomalies (eddies, flow reversals) are some of the major characteristics of the spawning habitat (Claro and Lindeman, 2003; Kobara and Heyman, 2008; Chérubin et al., 2011; Kobara et al., 2013; Reglero et al., 2018). While these characteristics ensure reproductive success, their predictability is the cause of over-exploitation and depletion of aggregating populations (Sadovy, 1997; Sala et al., 2001). Of numerous historical Caribbean-wide FSAs (Smith, 1972; Eklund et al., 2000), only a few are protected and remain to date, while many are in need of protection (Sadovy et al., 2008). FSAs play a critical role in the persistence of marine populations, and their disappearance through the extirpation of large predatory fishes contributes to top-down changes in coral reef ecosystems and biodiversity loss (Mumby et al., 2006).

While many FSAs are known to fishers, who specifically target them during the spawning season, not all of them have been documented. There may be unreported FSAs, which, if discovered, would contribute to the assessment of the grouper population, for example, in the greater Caribbean region and elsewhere in the world. Characterization of FSAs in terms of the timing, duration, sex ratio and size of the aggregation is crucial for stock assessment, the design and evaluation of management measures, and conservation. FSAs, being the sole reproductive events of these species, are critical to the marine ecosystem. They are globally under threat because of their small numbers and size, which negatively impacts the fish population and the ecosystem, along with the livelihoods and socioeconomic well-being of fishing communities.

More than eight hundred soniferous fish species have been identified (Looby et al., 2022; Rice et al., 2022). Among them, codfishes, drum fishes, grunts, groupers, snappers, jacks, and catfishes are among the most abundant and commercially important species (Rountree et al., 2006). Invertebrates also produce sounds. Among those important to fisheries are white shrimp (Penaeus setiferus) (Berk, 1998), spiny lobsters (Palinuridae) (Moulton, 1957; Fish, 1964; Patek, 2002), American lobster (Homarus americanus) (Fish, 1966; Henninger et al., 2005), mussels (Mytilus edulis), sea urchins (Fish, 1964), and perhaps squid (Teuthida) (Iversen et al., 1963). Most soniferous fish species produce low frequency sounds, usually below 1000 Hz (Ladich, 2004), that are typically broadband, short-duration signals. Some fish species can produce sounds with frequencies that reach 8 kHz (Zelick et al., 1999; Tavolga et al., 2012) or with more complex acoustic features (Vasconcelos et al., 2011). Sound producing mechanisms are species dependent and sound characteristics vary with circumstances, such as courtship, threats or territorial defense (Kasumyan, 2008). Therefore, fish sounds can be used to monitor fish activity, and in particular courtship, to identify the location and delineate FSAs (Chérubin et al., 2020), to determine temporal and seasonal patterns of the spawning activity (Locascio and Mann, 2008; Mann et al., 2009, 2010; Nelson et al., 2011; Schärer et al., 2012b), and to study the behavior of fishes, including population structure and its changes (Hawkins, 1986; Luczkovich et al., 1999; Rountree et al., 2006, 2008; Walters et al., 2009; Rowell et al., 2011). Some calls produced by fish aggregated to spawn are known as courtship associated sounds (CASs) in their behavioral context (Mann et al., 2010), whereas others are agonistic or territorial but also part of the FSA (Rowell et al., 2018).

Passive acoustic monitoring (PAM) has been used for more than sixty years in fish biology and fishery surveys [see Fish et al. (1952); Fish and Mowbray (1970) for review]. PAM is a fishery-independent, non-intrusive method that can provide in-situ information critical for understanding the efficacy of management measures and for the discovery of new or previously extirpated aggregations recovering from overfishing (Woodward et al., 2023). PAM data can also provide a window into the number of species using the FSA site, its biodiversity and the fishing pressure through the monitoring of vessel noise (Mahale et al., 2023), establishing the significance of the site to multi-species spawning aggregations and its fishery management. Where the recovery of threatened and endangered species, such as the Nassau grouper (Epinephelus striatus), is difficult to monitor by more traditional means, PAM offers a solution to this type of population assessment. A high signal-to-noise ratio is paramount to the detection of sound sources in PAM surveys of FSAs, which is best achieved when the recording station is fixed. However, assessing the spatial extent of the FSA is limited by the number of recorders and their locations. This constraint can be mitigated with the use of mobile autonomous platforms, which have provided new insights into fish distribution in general (Wall et al., 2017) and at FSAs (Chérubin et al., 2020; Woodward et al., 2023). While substantially beneficial to advancing science and management, long-term PAM generates large volumes of high-resolution acoustic data that are extremely labor intensive to analyze by listening and visualizing spectrograms. Challenges primarily stem from the identification of the sound sources and the enumeration of species-specific sounds, from differences in human perception, and from the signal-to-noise ratio in the recordings.

In recent years, automatic fish sound signal detection methods have been developed. These traditional machine learning (ML) techniques, inspired by automatic speech recognition (Vieira et al., 2015), require a pre-processing step to convert raw audio data into features that are used as input to an ML model to identify a signal of interest (Pace, 2008; Bahoura and Simard, 2010; Kottege et al., 2015; Urazghildiiev and Van Parijs, 2016; Choi et al., 2019). For example, Noda et al. (2016) successfully classified one hundred and two different species of fish sounds. They used Linear Frequency Cepstral Coefficients, Mel-Frequency Cepstral Coefficients (MFCC), Shannon Entropy and Syllable Length for feature extraction. For the classification, they evaluated three conventional machine-learning algorithms: K-Nearest Neighbors, Random Forest (RF), and Support Vector Machines (SVMs). They applied their method to two public databases, FishBase and Discovery of Sound In The Sea (DOSITS), and obtained classification accuracies of 95.24%, 93.56%, and 95.58% for the three classifiers, respectively. Sattar et al. (2016a, b) also used similar techniques based on fish call feature analysis to identify grunts, growls and groans from the plainfin midshipman (Porichthys notatus) in large acoustic datasets. Handcrafted acoustic cepstral features were used for classification and detection of four Caribbean grouper species’ CAS by Ibrahim et al. (2018a) with 82.7% accuracy. The main disadvantage of this kind of approach is that the chosen features must be uniquely designed for a specific application, and may involve nontrivial steps that require expertise in multiple disciplines (Baumgartner and Mussoline, 2011). Nontrivial steps include, for example, feature dimension reduction using PCA (Binder and Hines, 2012), acoustic index calculation and complex entropy-based detectors (Siddagangaiah et al., 2019), and image correlation methods (Matthews and Beaujean, 2016; Ricci et al., 2017).

Deep Learning (DL) methods have emerged as an effective tool in the field of bioacoustics due to their huge success and widespread adoption in other pattern recognition fields such as image classification (He et al., 2016), object detection (Zhao et al., 2019), speech recognition (Meng et al., 2019) and music processing (Nam et al., 2018). DL improves the process by acting as a feature extractor that is an integral part of the architecture of a Deep Neural Network (DNN) (Bohnenstiehl, 2023), learning non-linear representations of the data through a multi-layer neural network approach. This relative simplicity, inherent to DNNs, makes them highly versatile for various classification tasks (O’Mahony et al., 2019), outperforming conventional ML techniques since they are capable of learning more discriminative representations than traditional feature extraction (Shorten and Khoshgoftaar, 2019). In practice, however, some DL-based detectors and classifiers for acoustic signals still employ a pre-processing step, such as computing spectrograms (Shiu et al., 2020; Vickers et al., 2021).

DL methods have been successfully applied in terrestrial environments for the sound classification of animals such as insects (Silva et al., 2013), frogs (Huang et al., 2009), birds (Bravo Sanchez et al., 2021; Mehyadin et al., 2021), bats (Parsons and Jones, 2000), and other mammals (Pandeya and Lee, 2018; Clink and Klinck, 2021), including the monitoring of farm livestock welfare through their sounds (Mcloughlin et al., 2019). Automated approaches to identify bird vocalizations are also based on classifiers trained on spectrograms and are becoming increasingly popular for conducting avian PAM within broad-scale monitoring programs. BirdNET, for example, is a user-friendly, freely available multispecies classifier that uses a convolutional neural network (CNN) to efficiently process large quantities of audio data and quickly identify more than nine hundred bird species (Kahl et al., 2021).

More recently, DL techniques have been applied to automated detection and classification of marine mammal and fish sounds. Their success has been demonstrated by many studies for binary marine mammal species detection and multi-class species classification (Belghith et al., 2018; Liu et al., 2018; Bergler et al., 2019; Bermant et al., 2019; Shiu et al., 2020; Yang et al., 2020; Zhong et al., 2020; Allen et al., 2021; Ibrahim et al., 2021; White et al., 2022), advancing the capabilities of mining large PAM datasets for detecting species of interest. However, these methods generally require large amounts of validated training data, and progress has been limited by challenges related to the lack of labeled datasets adequate for training and testing. Large quantities of known and as yet unidentified broadband signal types mingle in marine recordings, with variability introduced by acoustic propagation, source depths and orientations, and interacting signals (Frasier, 2021; Laplante et al., 2021). Manual classification of these datasets is unmanageable without an in-depth knowledge of the acoustic context and biodiversity data of each recording location. A signal classification pipeline that combines unsupervised and supervised learning phases with opportunities for expert oversight to label signals of interest was presented in Frasier (2021); that workflow was implemented with user interfaces within the publicly available acoustic data processing software package Triton (Wiggins et al., 2010). White et al. (2022) trained a DL model for multi-class marine sound source detection to explore its utility for extracting sound sources for use in marine mammal conservation and ecosystem monitoring. A training set was developed comprising existing datasets amalgamated across geographic, temporal and spatial scales, collected across a range of acoustic platforms. Transfer learning was used to fine-tune an open-source state-of-the-art CNN to detect odontocete tonal and broadband call types and vessel noise (from 0 to 48 kHz). The input to the CNN algorithm consists of spectrogram images, chosen to exploit the differences in this time-frequency representation between sound sources.

Here, we present a DL-based workflow called Fish Acoustic Detection Algorithm Research (FADAR), initially designed to identify and classify the CAS of four Caribbean grouper species, namely Nassau grouper [E. striatus - Schärer et al. (2012b)], red hind [E. guttatus - Mann et al. (2010)], black grouper [Mycteroperca bonaci - Schärer et al. (2014)], and yellowfin grouper [M. venenosa - Schärer et al. (2012a)]. FADAR also identifies squirrelfish [Holocentrus spp. - Luczkovich and Keusenkothen (2007)] and vessel noise as a background noise class. This workflow is the outcome of the development of several DL models specifically applied to fish sound detection and classification.

In Ibrahim et al. (2018b), CNN and Long Short-Term Memory (LSTM) networks were used to classify the four grouper species listed above. CNNs were designed to effectively identify spatial patterns in images (Yamashita et al., 2018). LSTMs are a special type of Recurrent Neural Network (RNN) designed to solve the vanishing gradient problem stemming from long-term dependencies contained in a time series (Bengio et al., 1994); RNNs are also known for their pattern discrimination capabilities in time signals. Denoised spectrograms of CAS were used as input to both DL models. The CNN classifier was better than the LSTM at discriminating the fish calls, with over 90% accuracy. It also outperformed the handcrafted MFCC classifier built for the same species (Ibrahim et al., 2018a). Not only could grouper species be successfully identified through their calls, but so could the various call types within species (Ibrahim et al., 2019; Wilson et al., 2020).

Call types among CAS exhibit significant acoustic feature differences. They can be used to understand the evolution of fish behavior during the spawning season, and the change in their relative numbers can be observed during the days leading up to and following peak call production (Wilson et al., 2020; Zayas et al., 2020). Using calls recorded during three consecutive spawning seasons at a Nassau grouper FSA in the Cayman Islands, Wilson et al. (2020) described the spectral and temporal characteristics of nine call types known or presumed to be produced by the four epinephelid species of interest in this study. For example, red hind grouper produce at least four distinct types of sounds that are most commonly heard during FSAs (Zayas et al., 2020). Unsupervised classification methods can be used to determine the underlying representation of the input data without labeled data. One such method, known as the stacked autoencoder (SAE), was successfully applied to the specific task of identifying red hind call types by learning the latent representation of the main call types (Ibrahim et al., 2019).

The concept of Transfer Learning (TL) is also built into the foundation of FADAR (Ibrahim et al., 2020). Transfer learning relies on pre-trained DNNs that have been trained for specific image recognition tasks on large numbers of images. Their ability to identify specific patterns can be reused for other pattern recognition tasks. Additional layers of neurons are added to the pre-trained model, which is then trained on the specific dataset. In essence, the pre-trained DNN acts as a feature extractor that is generic enough to apply to multiple tasks (Laplante et al., 2022).

The remainder of the paper is organized as follows. Section 2 presents the acoustic characteristics of the call types of all the fish species currently identified by FADAR. The architecture of the proposed FADAR tool, composed of multiple classifiers, the datasets used for training and testing, and the metrics used to evaluate FADAR’s performance are presented in Section 3. Section 4 presents the classification results of the grouper sounds and the evaluation of FADAR on datasets from various Caribbean regions, depths and instruments. In Section 5 the FADAR App is presented, followed by a discussion in Section 6. Concluding remarks are given in Section 7.

2 Grouper sounds

The four epinephelid species of interest in this study are found in the greater Caribbean, including the Gulf of Mexico and the Bahamas. They all spawn during the winter and spring months (December to May) in the Northern Hemisphere (Nemeth, 2012) and their spawning aggregations are cued to the moon and the winter solstice (Nemeth et al., 2007). FSAs often occur at remote locations and in water depths between 30 and 80 m, near the shelf break, where spawning activities usually peak at dusk but are contingent upon water temperatures and local current conditions (Nemeth, 2009). Spawning grouper populations are thus challenging to observe and monitor (Kobara et al., 2013), and their CAS production constitutes the only way to monitor their presence and spawning activity remotely across the entire spawning season.

Red hind produce at least four distinct types of stereotyped sounds that are heard during FSAs and in captivity during the spawning week (Wilson et al., 2020; Zayas et al., 2020). The first red hind sound (RH1) is a combination of a short pulse followed by a short tone (Figure 1A) with a frequency range of 20–360 Hz. The short tone is a short pulse-period pulse train (Ibrahim et al., 2019), whose period can vary and which can be extended by various longer pulse-period, short duration pulse trains. The second red hind sound (RH2) is composed of a series of pulses followed by an extended tone (Figure 1B). It can be combined with a pulse train before or after the call, and the tone can be modulated like RH1 and also be extended or shortened. The third type of red hind sound is a grunt (RH3). It can be produced as a single grunt or in a train consisting of two or three successive grunts, as shown in Figure 1C. The fourth type of red hind call is the pulse (RH4), which can be produced alone, as a train, or as part of other call types, usually before or after RH1 or RH2 calls. Figure 1D shows consecutive short pulses, resulting in a pulse train without any other type of call. Figure 1E shows a combination of RH4 and RH2, where several pulses of increasing frequency precede the long tone. The fifth type of sound associated with the red hind FSA is called a chorus. It consists of the continuous overlap of call types RH1 and RH2. The individual call types are not immediately distinguishable in the spectrogram but can be heard (Appeldoorn-Sanders et al., 2023).


Figure 1 Spectrograms of four red hind grouper call types and a combination of call types. Note that the time axis differs among images. (A) Red hind call RH1; (B) Red hind call RH2; (C) Red hind call RH3; (D) Red hind call RH4; (E) Combination of call types RH4 and RH2. The spectrograms were calculated with a FFT size of 4096 points and show the relative intensity in dB. The calls were recorded off the west coast of Puerto Rico, at the Abrir la Sierra fish spawning aggregation site during the spawning season of 2015.

Nassau grouper calls have also been categorized into four distinct types, as shown by Schärer et al. (2012b); Wilson et al. (2020), and Rowell et al. (2018). The Nassau grouper call N1 is an alarm call, composed of a variable number of low frequency pulses (Figure 2A). The CAS call N2 consists of a modulated tone that may be preceded by a variable number of pulses or a shorter tonal call (Figure 2B). N3 is an agonistic call made of pulses and double pulse segments (Figure 2C) that occurs along with competitive displays of males, as described in Rowell et al. (2018). The fourth type of call, labeled N4, includes a variable number of grunt pairs in sequence (Figure 2D). The peak frequencies of all Nassau grouper calls range between 90 and 300 Hz, as shown in Wilson et al. (2020). Further descriptions of the frequency and duration characteristics of the four call types can be found in Wilson et al. (2020).


Figure 2 Spectrograms of four Nassau grouper call types as identified in the literature. (A) Nassau grouper call N1; (B) Nassau grouper call N2; (C) Nassau grouper call N3; (D) Nassau grouper call N4. The spectrograms were calculated with a FFT size of 4096 points and show the relative intensity in dB. The calls were recorded off the west coast of Puerto Rico, at Bajo de Sico fish spawning aggregation site during the spawning season of 2014.

Yellowfin grouper calls have been categorized into two types that may occur in succession or separately (Schärer et al., 2012a). They consist of a variable series of fast pulses or tonal calls, labeled YF1 (Figure 3A), and of a rather uniform pulse train, labeled YF2 (Figure 3B). The average duration of the tonal call is about 3 s, ranging between 1.29 s and 5.69 s, with a peak frequency in the range 88.9 to 141.7 Hz. The pulse call duration is in the same range as the tonal call, with a peak frequency range of 101.4 Hz to 132.4 Hz. Both calls are known as CAS and were recorded at FSAs.


Figure 3 Spectrograms of yellowfin grouper call types. (A) Yellowfin call YF1. (B) Yellowfin call YF2. The spectrograms were calculated with a FFT size of 4096 points and show the relative intensity in dB. The calls were recorded off the west coast of Puerto Rico, at the Mona Island fish spawning aggregation site during the spawning season of 2013.

Black grouper produce at least two variations of a lower frequency, modulated tonal call, which ranges between 60 Hz and 120 Hz but generally has a longer duration than the Nassau grouper call N2 (Figure 4A). This call is sometimes preceded by a set of pulses and is associated with courtship displays, as shown by Schärer et al. (2014) and Wilson et al. (2020).


Figure 4 Spectrogram of (A) black grouper courtship call and (B) of holocentrids (squirrelfish). The spectrograms were calculated with a FFT size of 4096 points and show the relative intensity in dB. The calls were recorded off the west coast of Puerto Rico, at Abrir la Sierra fish spawning aggregation site during the spawning season of 2014.

Although our focus was mostly on epinephelids, we noticed in a significant number of recordings the presence of another sound in a frequency range overlapping that of the groupers. This sound was identified as coming from squirrelfishes, mainly Holocentrus rufus and H. adscensionis, which are a primary component of Caribbean coral reef soundscapes (Moulton, 1958). Their grunts form an acoustic signature known as the ‘staccato’. It consists of a pulse train in the frequency range 120–400 Hz with a duration of 1 to 2 s (Figure 4B), which is well described in Winn and Marshall (1963), Winn et al. (1964), and Parmentier et al. (2011). In Puerto Rico both species have been documented in reef habitats in similar abundances (National Centers for Coastal Ocean Science (NCCOS) and Southeast Fisheries Science Center (SEFSC), 2020). This call was included as one of the classes to improve FADAR’s accuracy, as it would otherwise be mislabeled as a yellowfin grouper call, which is structurally very similar although in a different frequency range, as shown by Figures 3B, 4B.

3 Methods

3.1 FADAR workflow

FADAR was developed in Matlab™, with the Deep Learning, Signal Processing and Audio toolboxes. The general workflow of the algorithm consists of a preprocessing stage where the sound signal is converted into an RGB image of the spectrogram. The images are then analyzed by the DL models that constitute FADAR, which produce as output the classification of the sound sources according to the classes defined for the models (Figure 5). The classification process involves two stages. For the general identification of grouper species, not including their call types, FADAR consists of an ensemble of deep learning models with randomly selected hyperparameters. The classification of call types for the different species is then conducted by species-specific deep learning sub-models, as shown hereafter.


Figure 5 Fish Acoustic Detection Algorithm Research (FADAR) workflow.

In the pre-processing stage, the input audio files are first re-sampled to 10 kHz and then divided into 2 s audio segments without predetermined filtering. A spectrogram is created for each segment in the 0–500 Hz frequency range and then converted into an intensity image using Matlab™ RGB conversion tools. Spectrograms were generated by applying a Hanning window with a frame length of 0.1 s (1000 samples), 80% overlap and an NFFT size of 4096 points. These images were then used to train and test FADAR for all the classes chosen for this model. No calibration was applied to the input. During the training stage, the diversity of data resulting from different gain setups was accounted for as shown below, which increases the robustness of the classification to the data source.
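To make this stage concrete, the following Matlab™ sketch reproduces the pre-processing steps described above. It is an illustration under stated assumptions, not FADAR’s exact code: the file names, the output image size, and the colormap are hypothetical choices not specified in the text.

```matlab
% Minimal sketch of the pre-processing stage, assuming a mono input file;
% the file names, 224x224 output size, and jet colormap are illustrative
% assumptions.
[x, fsIn] = audioread('recording.wav');
x = resample(x(:, 1), 10000, fsIn);          % re-sample to 10 kHz
segLen = 2 * 10000;                          % 2-s segments
win = hann(1000);                            % 0.1-s Hanning frame
noverlap = 800;                              % 80% overlap
nfft = 4096;                                 % NFFT size of 4096 points
for k = 1:floor(numel(x) / segLen)
    seg = x((k - 1) * segLen + 1 : k * segLen);
    [~, f, ~, p] = spectrogram(seg, win, noverlap, nfft, 10000);
    p = 10 * log10(p(f <= 500, :));          % keep the 0-500 Hz band, in dB
    img = ind2rgb(im2uint8(rescale(p)), jet(256));   % RGB intensity image
    imwrite(imresize(img, [224 224]), sprintf('seg_%05d.png', k));
end
```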

FADAR training took place with a total of 73466 spectrograms (or images) encompassing all four grouper species’ call types, squirrelfish sounds, and vessel and background noise. The data split was 80% for training and 20% for validation. For testing, 180162 images of fish CAS and 632850 images of vessel and ambient noise, not part of the training/validation set, were manually labeled and used. The numbers per class are presented in Sections 3.3 and 3.4.

3.2 Main FADAR model

Classifying data with an automated approach is an iterative process that hinges on the formulation of the problem, the data analysis, feature extraction and selection, classifier selection, and the validation of the model. Classification models fail for several reasons, including insufficient data preprocessing, overfitting during the training stage, unsuitability of the model for the task, and lack of independent data for model validation. Ibrahim (2019) proposed an ensemble-based approach for DL data classification and event forecasting, called multimodel deep learning (MMDL). In this method, the architecture construction process is automated by diversifying the structure of the CNN classifiers. The selection of hyperparameters for each classifier of the ensemble is randomized to allow the system to devise the most suitable network architecture for a given dataset. The MMDL algorithm then fuses the results from the different architecture classifiers, which collectively improves the performance of the individual classifiers. Such an approach was proposed and implemented for the detection of North Atlantic right whale up-calls by Ibrahim et al. (2021). The general network architecture of the CNNs consists of a randomized number of convolutional blocks that each contain convolutional layers, a ReLU activation layer, a batch normalization layer, and a max-pooling layer, as shown in Figure 6. The proposed MMDL algorithm for grouper CAS detection and classification therefore consists of a bank of five CNNs where, to reduce the design complexity, a randomized generation process assigns values to hyperparameters by setting up a range for the number of convolutional layers [3 5], the number of filters [8 32], the neurons in fully connected layers [300 790], and the batch size [16 128]. These randomly generated DL models form an ensemble of classifiers. The outputs of each model are combined by using a fusion strategy for decision making, as shown in Figure 7. The fusion block analyzes the outputs of individual models to identify locally consistent, discriminative, and representative patterns. The types of metrics used in this process were selected according to the results of an early study by Moreno-Seco et al. (2006) that tested the efficacy of fusion methods such as Majority Voting, Unweighted Average, and PatternNet. The latter consistently outperformed the other methods and was used in the MMDL. The fusion process and its implementation are further described in Ibrahim et al. (2021).
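The following sketch illustrates how the randomized generation of the five CNNs could be set up in Matlab™ using the hyperparameter ranges quoted above. The input image size, one convolutional layer per block, and the block ordering are assumptions for illustration, and the fusion stage (PatternNet) is omitted here.

```matlab
% Minimal sketch of the randomized CNN bank, assuming 224x224x3 inputs and
% six output classes (4 groupers, squirrelfish, boat/ambient noise).
nModels = 5;
nets = cell(1, nModels);
for m = 1:nModels
    nConv    = randi([3 5]);        % number of convolutional blocks
    nFilters = randi([8 32]);       % filters per convolutional layer
    nNeurons = randi([300 790]);    % neurons in the fully connected layer
    batchSz  = randi([16 128]);     % mini-batch size, used at training time
    layers = imageInputLayer([224 224 3]);
    for c = 1:nConv                 % each block: conv, ReLU, BN, max-pool
        layers = [layers
            convolution2dLayer(3, nFilters, 'Padding', 'same')
            reluLayer
            batchNormalizationLayer
            maxPooling2dLayer(2, 'Stride', 2)];
    end
    layers = [layers
        fullyConnectedLayer(nNeurons)
        fullyConnectedLayer(6)      % six output classes
        softmaxLayer
        classificationLayer];
    nets{m} = layers;               % each member is trained with batchSz
end
```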


Figure 6 General structure of an individual CNN network with α convolution blocks.


Figure 7 Multi-model deep learning (MMDL) CNN ensemble flow chart used in the Fish Acoustic Detection Algorithm Research (FADAR).

The MMDL is thus the main FADAR model used to detect and classify the four grouper species’ CAS, regardless of the call type, as well as squirrelfish and boat sounds. In order to classify the call types, species-specific classifiers, or sub-models, were designed for red hind, Nassau and yellowfin groupers, which are presented in the following sub-section. The sub-models are applied to the outputs of the MMDL as shown in Figure 5.

3.3 Species specific call types classifiers

3.3.1 Red hind call types classifier

In Ibrahim et al. (2019), a random ensemble of stacked auto-encoders was designed to classify the RH1 and RH2 red hind grouper call types. An accuracy of 95% was obtained with 15 random SAEs, which requires extensive processing time and large numbers of parameters. In order to make the model more efficient, an approach based on Transfer Learning was preferred. Transfer Learning involves leveraging a DL model that has been trained for one task (source domain) as the starting point for a new task (target domain). Typically, the source model is trained on vast quantities of data, and the layers of the network have learned to extract features that are useful for the given task. Ideally, some of these features are generic enough to be shared across tasks. Transfer Learning occurs when the knowledge learned in the source domain is transferred at the start of the training process of a new model to make predictions on the target domain. Specifically, the knowledge learned is contained within the weights of the trained network and can easily be loaded as the starting point of the new training process. Usually, the target domain has limited access to data and therefore greatly benefits from the inherited knowledge. Transfer Learning can be applied in many different ways, depending on the problem. It can involve techniques such as freezing certain layers of the network, thereby preserving the weights of those layers that were learned previously, replacing the classification layers, training for only a few epochs to fine-tune the network, or a combination of these techniques. Pre-trained models were selected for extracting deep informative features from the spectrogram images generated from red hind sounds. Different types of pre-trained models were used, such as ResNet, MobileNetV2 (Howard et al., 2017), EfficientNet (Tan and Le, 2020), and ShuffleNet (Zhang et al., 2017), as shown in Ibrahim et al. (2019). White et al. (2022) used a similar transfer learning approach based on EfficientNet B0 to classify marine sound sources, which provides a computationally efficient architecture for rapid classification. EfficientNet introduces a compound scaling method that uniformly scales the depth, width, and resolution of the network. This scaling approach ensures that the model becomes more powerful as it gets larger while maintaining efficiency. By carefully balancing these scaling factors, EfficientNet achieves improved accuracy compared to other models while using fewer resources. The combination of accuracy and efficiency of EfficientNet, along with its scalability and transfer learning capabilities, as well as its higher accuracy than the other pre-trained models tested, made it the choice for the red hind call type classifier. More than 17000 samples were used for training and validation, following the same 80/20 ratio as in the other DL models (Table 1). More than 94000 samples, separate from the training/validation set, were used for testing (Table 2). The structure of the red hind call type sub-model is shown in Figure 8.
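As an illustration of this transfer learning setup, the Matlab™ sketch below fine-tunes a pre-trained EfficientNet-b0 for a two-class call type problem. It assumes the Deep Learning Toolbox EfficientNet-b0 support package is installed; the layer indices, training options, and the datastore name `imdsTrain` are hypothetical and not taken from the paper.

```matlab
% Minimal sketch of the transfer-learning setup for a call type sub-model.
% The layers to replace depend on the pre-trained model; inspect them with
% analyzeNetwork(net) or lgraph.Layers before replacing.
net = efficientnetb0;                         % pre-trained source model
lgraph = layerGraph(net);
newFC  = fullyConnectedLayer(2, 'Name', 'fc_calltypes');   % two call types
newOut = classificationLayer('Name', 'out_calltypes');
lgraph = replaceLayer(lgraph, lgraph.Layers(end-2).Name, newFC);
lgraph = replaceLayer(lgraph, lgraph.Layers(end).Name, newOut);
opts = trainingOptions('adam', 'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-4, 'MiniBatchSize', 32);   % assumed options
% imdsTrain is a hypothetical imageDatastore of call type spectrograms.
trainedNet = trainNetwork(imdsTrain, lgraph, opts);
```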


Table 1 Numbers of 2-second sound samples used for training and validation (80/20 split) per location.


Table 2 Numbers of 2-second sound samples used for testing per location.


Figure 8 General structure of the red hind grouper call type model. BN indicates Batch Normalization. FC indicates fully connected layers with 1000 neurons; the last FC layer has only two neurons.

3.3.2 Nassau grouper call types classifier

While at least four Nassau call types have been identified, only two classes were used for the call type classification. Class 1, labeled FN1, encompasses pulse-like calls such as N1, N3, and N4, and class 2, labeled FN2, encompasses tonal calls such as N2. Due to the added complexity within the FN1 class, a transfer learning approach did not provide satisfactory results. Therefore, a CNN model was specifically designed for the Nassau grouper call type classification. The superior performance of the shallow CNN model over transfer learning can be attributed to its reduced complexity. The simplified architecture of the shallow CNN, with fewer layers and parameters, allows it to focus more effectively on the specific features relevant to the Nassau grouper call type classification task. While the intricate design of a pre-trained network is tailored to a broader range of challenges, its complexity might lead to overfitting or less adaptability to the target dataset. In contrast, the shallow CNN’s simplicity enables it to efficiently extract and learn the key characteristics of Nassau grouper call type images, ultimately resulting in enhanced classification performance. The Nassau CNN model utilizes a 16-layer structure with seven convolutional layers, three fully connected layers and a softmax layer (Figure 9). The first two convolutional layers contain 8 filters of size 3x3 with stride 1 and are followed by a max-pooling layer with a 2x2 filter of stride 2. The next two convolutional layers contain 16 filters of size 3x3 with stride 1 and are followed by a max-pooling layer with a 2x2 filter of stride 2. Each of the remaining three convolutional layers, with 32, 48, and 64 filters, respectively, is followed by a max-pooling layer. The last max-pooling layer is followed by a fully connected layer with 1024 nodes, a dropout layer with a probability of 0.25, another fully connected layer with 512 nodes, a dropout layer with a probability of 0.5, and a final fully connected layer with 2 nodes with a SoftMax activation layer that normalizes the output predictions across the classes. More than 10000 samples were used for training and validation, following the same 80/20 ratio (Table 1). About 19678 samples, separate from the training/validation set, were used for testing (Table 2).
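The layer stack just described translates directly into a Matlab™ layer array, as in the sketch below. The input size, padding, and the ReLU activations interleaved between layers are assumptions: the text enumerates only the convolution, max-pooling, fully connected, dropout and softmax layers.

```matlab
% Minimal sketch of the Nassau grouper call type CNN described above
% (input size, padding, and ReLU placement are assumptions).
layers = [
    imageInputLayer([224 224 3])
    convolution2dLayer(3, 8, 'Stride', 1, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 8, 'Stride', 1, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 16, 'Stride', 1, 'Padding', 'same')
    reluLayer
    convolution2dLayer(3, 16, 'Stride', 1, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 32, 'Stride', 1, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 48, 'Stride', 1, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    convolution2dLayer(3, 64, 'Stride', 1, 'Padding', 'same')
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(1024)
    dropoutLayer(0.25)
    fullyConnectedLayer(512)
    dropoutLayer(0.5)
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer];
```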


Figure 9 General structure of the Nassau grouper call types Convolutional Neural Network (CNN) model. FC stands for fully connected.

3.3.3 Yellowfin grouper call types classifier

The design of the yellowfin grouper call type classification model also utilizes the Transfer Learning concept, as done in Ibrahim et al. (2020), because the dissimilarity between call types is comparable to that between the red hind call types. EfficientNet was again selected as the pre-trained model to classify yellowfin grouper call types, and using Transfer Learning in this case improved the accuracy of the classification. This model is the same as the one used for the red hind grouper but trained with yellowfin calls. The number of calls used in this model for training and validation was about 11000 (Table 1). About 11864 calls, separate from the training/validation set, were used for testing (Table 2).

3.3.4 Data annotation, classes and call counting

The same pre-processing conversion steps were applied to all the acoustic data used to create the training library. An important consideration was ensuring that one or more calls or call types of the species concerned, including a part of a call if located at the beginning or end of the window, was comprised within the 2 s window used for each sample, as described in Section 3.1. The classes of the main FADAR model comprised one class for red hind that included the two main call types RH1 and RH2, one class for Nassau grouper that included call types FN1 and FN2 (see Section 3.3.2), one class for both call types of yellowfin grouper, one class for black grouper, one class for squirrelfish, and one class for boat and background noise, all in the same frequency range. For the call type classification, new classes were created, corresponding to each of the call types for the species concerned. Classes were labeled by a team trained by the authors. Each file was analyzed aurally with noise-canceling headphones and/or visualized with acoustic analysis software. Grouper sounds were quantified per file by visual inspection of spectrograms using Ishmael Bioacoustics (version 3.0), Audacity (version 3.3.3) or Matlab™ software. For each file an image of the sound was created and classified by an observer depending on the pattern, duration, and frequency of each signal. Classes were labeled from a wide range of data sources, as shown in Table 1, which included different recorder types, mobile and fixed platforms, depths, geographical regions and sampling rates. No calibrations were applied to the input data in order to improve the code’s flexibility at handling a diverse set of data types without significant pre-processing.

While the 2 s window length was selected to mostly capture one call at a time, input files were in general longer than 20 s. Therefore, splitting a file into 2 s windows can lead to the same call being counted twice. To remedy this issue when identifying the number of calls in each file, we implemented the following algorithm. The amplitude of the signal in the first and last 0.5 s of each 2 s window is calculated and compared to a carefully chosen threshold. If the amplitude at the start of the successive window is above the threshold, then the signal is part of the same call; otherwise it signifies the end of the call. This operation ultimately provides the number of calls in each input file.
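One plausible rendering of this call-merging heuristic is sketched below. The signal vector `x` (assumed already resampled to 10 kHz) and the numerical threshold are hypothetical placeholders for the "carefully chosen" values, and details such as the amplitude statistic are assumptions.

```matlab
% Sketch of the call-counting heuristic: a call that spills over a window
% boundary is not counted a second time in the next window.
fs = 10000; win = 2 * fs; edge = 0.5 * fs;
thr = 0.05;                                  % amplitude threshold (assumed)
nCalls = 0; spillover = false;
for k = 1:floor(numel(x) / win)
    seg = x((k - 1) * win + 1 : k * win);
    headActive = max(abs(seg(1:edge))) > thr;             % first 0.5 s
    tailActive = max(abs(seg(end - edge + 1:end))) > thr; % last 0.5 s
    if headActive && spillover
        % Continuation of the call counted in the previous window: skip.
    elseif headActive || tailActive
        nCalls = nCalls + 1;                 % a new call begins here
    end
    spillover = tailActive;                  % call may extend to next window
end
```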

3.4 Datasets

To obtain a robust classifier, the training data should encompass the full diversity of each class. To increase the diversity of our training set we utilized data collected by a variety of institutions, under differing survey protocols and across a range of geographic locations and temporal scales. However, all the acoustic data used in this study were collected at FSA sites in the Caribbean Sea. The dataset spans the period 2014–2020, encompasses three geographic regions, and spans multiple years at the same locations. The first region includes the western shelf of the island of Puerto Rico, in the Greater Antilles, and the neighboring islands of St. Thomas and St. Croix in the U.S. Virgin Islands. These data were collected under the auspices of three different organizations, namely the Caribbean Fisheries Management Council (CFMC), the Caribbean Coral Reef Institute (CCRI) of the University of Puerto Rico, and the National Oceanic and Atmospheric Administration (NOAA) Southeast Area Monitoring and Assessment Program (SEAMAP-C). The second geographic region that contributed acoustic data is the Cayman Islands, more specifically the western shelf edge of Little Cayman. The third geographic region is the western tip of the Florida Keys in the Gulf of Mexico, namely Riley’s Hump and Western Dry Rocks.

Because all the deployments targeted grouper spawning aggregation sites, all four grouper species can be heard but may not be present at every site. On the Puerto Rican shelf and in the U.S. Virgin Islands, red hind is the most abundant species; however, spawning sites for yellowfin and Nassau groupers were also monitored, and at some locations black grouper CAS were also heard. In the Cayman Islands, the recorders were deployed at Nassau grouper spawning sites with the incidental presence of red hind, yellowfin and black grouper. In the Florida Keys, recorders were deployed at spawning aggregation sites of multiple grouper species. All recordings also include the sounds of squirrelfish and surface vessels; the other sound sources vary with location and were not specifically identified for this study. While most acoustic data were obtained from bottom-mounted fixed recorders, we also added to this dataset recordings from a mobile surface platform that surveyed the insular shelf edge of St. Thomas and Puerto Rico [see Chérubin et al. (2020); Woodward et al. (2023) for more details]. The relative proportion of data per class between locations used for training and validation is given in Table 1, and that used for testing is given in Table 2. The testing data were not used in the training/validation stage, although they came from the same overall dataset, as shown in Tables 1, 2.

Along with the different locations, research groups, and methods, the type of recorder also varied, but consisted mainly of Loggerhead Instruments recording units. Each unit was programmed to record ambient sounds either continuously for short-term, week-long deployments or through a duty cycle for long-term, months-long deployments. In Puerto Rico and the U.S. Virgin Islands most units were Loggerhead Instruments digital spectrogram recorders (DSG), using a sampling rate of 80 kHz. In the Cayman Islands, acoustic recordings were obtained from DSGs between 2013 and 2020 (Wilson et al., 2020) and from an Ocean Instruments SoundTrap Model 300HF in 2020. The DSG instruments recorded at a sample rate of 50 kHz and the SoundTrap at 48 kHz. In the Florida Keys, SoundTrap recorders were used with a sampling rate of 44.1 kHz. All these recorders were fixed on the ocean floor. Acoustic recordings from a mobile platform were also included in our training dataset. They were collected by an embedded system on a wave glider and recorded ocean sounds between 10 and 20 m below the surface. More details on the glider operations and the payload system can be found in Chérubin et al. (2020); Woodward et al. (2023). The sampling rate for this PAM system was 10 kHz. The training dataset therefore encompasses several types of recorders, hydrophone sensitivities, gain setups, and sampling rates, which contribute to the robustness of FADAR.

3.5 FADAR testing procedure and performance metrics

In this section, we present the model testing procedure, the evaluation metrics, and the explainability tools that were applied to identify the spectral features used in the classification process. The experiments were implemented in Matlab™ following a standard validation procedure, which is explained next. The dataset used for training is shown in Table 1, and the testing dataset is shown in Table 2. Among the training data, 80% of the data were randomly chosen for training, while the remaining 20% were reserved for validation. This process was repeated five times until all data points in the training set had been used for validation once. Finally, the trained model was tested using the reserved test data.
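This repeated 80/20 rotation corresponds to a five-fold validation scheme, which can be expressed in Matlab™ as sketched below; the label vector `labels` is a hypothetical placeholder.

```matlab
% Minimal sketch of the five-fold validation scheme described above,
% assuming class labels in a categorical vector `labels` (hypothetical).
cv = cvpartition(labels, 'KFold', 5);     % stratified 5-fold partition
for k = 1:cv.NumTestSets
    idxTrain = training(cv, k);           % 80% used for training
    idxVal   = test(cv, k);               % 20% reserved for validation
    % ... train the model on idxTrain, validate on idxVal ...
end
```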

Sensitivity, specificity, and accuracy, defined in Equations 1–3:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

where TP stands for true positive, TN for true negative, FP for false positive and FN for false negative, were calculated for each class to measure the performance of the various classification models. Here, TN corresponds to the vessel/ambient noise class. The ability of a model to identify true positives is referred to as sensitivity, also called recall, while the ability of a model to identify true negatives, which here form one of our classes, is referred to as specificity. More specifically, the sensitivity score shows the ability of the model to correctly identify the candidate sample’s call type, while the specificity score shows the ability of the model to correctly state that the candidate sample does not belong to that particular call type. While accuracy may be an inadequate measure of the performance of a classifier for imbalanced datasets, where TN ≫ TP, Hildebrand et al. (2022) recommend measuring the classifier performance using metrics that are expressed in terms of ratios, namely the true positive and true negative rates. Here we also calculated the receiver operating characteristic (ROC) curve for each of the six classes.
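In practice, these per-class metrics follow directly from a confusion matrix, as in the Matlab™ sketch below; the variable names `trueLabels` and `predLabels` are hypothetical.

```matlab
% Minimal sketch of the per-class metric computation (Equations 1-3),
% assuming categorical vectors of true and predicted labels.
C = confusionmat(trueLabels, predLabels);    % rows: true, columns: predicted
n = size(C, 1);
sensitivity = zeros(n, 1); specificity = zeros(n, 1); accuracy = zeros(n, 1);
for k = 1:n
    TP = C(k, k);
    FN = sum(C(k, :)) - TP;
    FP = sum(C(:, k)) - TP;
    TN = sum(C(:)) - TP - FN - FP;
    sensitivity(k) = TP / (TP + FN);                  % Equation 1 (recall)
    specificity(k) = TN / (TN + FP);                  % Equation 2
    accuracy(k) = (TP + TN) / (TP + TN + FP + FN);    % Equation 3
end
```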

3.6 Enhancing model transparency with interpretability techniques

In our pursuit of a more comprehensible and transparent classification model, we integrated three pivotal interpretability techniques: Gradient-weighted Class Activation Mapping (Grad-CAM), Local Interpretable Model-agnostic Explanations (LIME), and Occlusion Sensitivity. These methodologies collectively shed light on the decision-making processes within complex CNN architectures. In scientific and engineering contexts, the notion of a “black box” underscores the challenge of understanding processes devoid of explanatory insights, meaning that humans, even those who design such models, cannot understand how variables are being combined to make predictions (Rudin and Radin, 2019).

Grad-CAM operates as a post-hoc explanation approach, facilitating model interpretability without necessitating structural alterations to the examined CNN architecture (Selvaraju et al., 2019). By producing heat maps that vividly highlight regions of an input image contributing positively to the network’s classification decision for a specific class, Grad-CAM provides a visual representation of activation intensity, ranging from vibrant orange (high activity) to cooler blue (lower activity). This technique offers transparent insights into the decision rationale of diverse CNN-based models (Selvaraju et al., 2019).

LIME constitutes an approach that furnishes explanations for classifiers of all kinds, striving to construct a locally interpretable model closely approximating the actual model’s behavior for a given input and prediction (Ribeiro et al., 2016). By highlighting superpixels in orange to denote positive contributions and in blue for negative ones, LIME effectively segments an image, pinpointing regions with substantial influence on the classification outcome. This approach quantifies contributions, thus illuminating areas that significantly affect classification results.

Occlusion Sensitivity enhances model interpretability by systematically occluding portions of an input image and measuring the corresponding impact on the classification outcome. By iteratively masking different sections of an image, the technique discerns the image regions that exert the most influence on the model’s predictions. Occlusion Sensitivity thus further contributes to the transparency and insights into how the CNN arrives at its decisions, enhancing the model’s overall interpretability.

Collectively, these techniques offer a deeper understanding of the intricate decision-making processes within our model.
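All three maps are available in the Matlab™ Deep Learning Toolbox and can be produced for a trained network with one call each, as sketched below; `net`, `img`, and `label` are hypothetical variable names for a trained model, an input spectrogram image, and its class.

```matlab
% Minimal sketch of the three interpretability maps for one spectrogram.
scoreMapGC = gradCAM(net, img, label);                % Grad-CAM heat map
scoreMapLI = imageLIME(net, img, label);              % LIME superpixel map
scoreMapOS = occlusionSensitivity(net, img, label);   % occlusion map
imshow(img); hold on
imagesc(scoreMapGC, 'AlphaData', 0.5);                % overlay Grad-CAM
colormap jet
```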

4 Results

4.1 Features of interest in the call type classification

Because the classification is based on RGB spectrogram images, FADAR’s classification of the calls relies on the identification of spectral features, as shown by Figure 10. The identification of tonal calls is mostly based on the slope and energy of the tonal bands of the calls. The highest energy bands are the greatest contributors to the call identification for red hind call RH1, Nassau grouper call FN1 and yellowfin grouper call YF1 according to Grad-CAM. However, this metric shows significant overlap between the three species’ frequency bands. Instead, the LIME and Occlusion Sensitivity metrics indicate different, non-overlapping parts of each of the calls, suggesting that key time-frequency features exist and enable the distinction between calls. The time-frequency features of the black grouper tonal call are distinct from those of the other three species according to the Grad-CAM, LIME and Occlusion Sensitivity metrics.


Figure 10 Spectral regions and features of interest identified by Gradient-weighted Class Activation Mapping (Grad-CAM - second column), Local Interpretable Model-agnostic Explanations (LIME- third column), and occlusion sensitivity (fourth column) interpretability measures in the call spectrograms (first column; the spectrograms were calculated with a FFT size of 4096 points) of red hind (first row, EGUT), Nassau (second row, ESTRI), yellowfin (third row, MVEN) and black grouper (fourth row, MBON).

The same interpretability measures were also applied to the distinction between call types for red hind grouper (Figure 11) and yellowfin grouper (Figure 12). The Grad-CAM metric for red hind grouper suggests that the spectral features selected by the sub-model are the maximum energy of the spectral band in the RH1 calls and the time-varying pulses in the time-frequency representation of the RH2 calls. For the yellowfin grouper sub-model, the spectral features of interest are a specific frequency range in the maximum slope region of the tonal call for call type YF1 and the extent of the peak energy band of the pulse train sound, YF2.


Figure 11 Spectral regions and features of interest identified by Gradient-weighted Class Activation Mapping (Grad-CAM - second column) interpretability measure in the red hind grouper (EGUT) call type RH1 (first row) and call type RH2 (second row) spectrograms. The spectrograms were calculated with a FFT size of 4096 points.


Figure 12 Spectral regions and features of interest identified by the Gradient-weighted Class Activation Mapping (Grad-CAM - second column) interpretability measure in the yellowfin grouper (MVEN) call type YF1 (first row) and call type YF2 (second row) spectrograms. The spectrograms were calculated with a FFT size of 4096 points.

4.2 Evaluation of FADAR

We first present the evaluation of the main FADAR model, the MMDL algorithm for grouper CAS detection and classification (see Section 3.2). The results are shown in Table 3 and consist of the average over all the test data for each class. FADAR’s sensitivity is greater than 0.91 for all classes, the lowest score being for yellowfin grouper. Specificity is greater than 0.98 for all classes, the lowest score being for black grouper. Accuracy is greater than 0.94, the lowest score being for yellowfin grouper. Considering all classes together, the sensitivity, specificity and accuracy of the main FADAR model are above 0.97 for the testing dataset used in this study, which includes a diversity of recording platforms, gains, sampling frequencies, locations and depths. All three measures are highest for boat sounds and squirrelfish.


Table 3 Evaluation metric values of the main FADAR model.

The ROC curves for each of the classes, corresponding to the results of FADAR classification expressed in terms of accuracy in Table 3, are shown in Figure 13. The ROC curves suggest a high probability of TP and a low probability of FP for all six classes, which also confirms that using accuracy as a performance measure of FADAR is appropriate.


Figure 13 Performance of FADAR’s multi-model deep learning (MMDL) is illustrated as ROC curves for each class.

FADAR’s main model was also evaluated to account for the effect of region (Table 4) and depth (Table 5). For the effect of region we chose to compare three geographically distinct regions: the western Caribbean FSA in the Cayman Islands; the western shelf of Puerto Rico and the southern shelf of St. Thomas in the central Caribbean region; and the Florida Keys in the Gulf of Mexico. The regional comparison reveals few differences across regions for all classes. Squirrelfish sound classification shows a slight decrease in accuracy (<0.9) due to a decrease in sensitivity at ALS. Different instruments were also used, namely SoundTrap recorders in the Cayman Islands and the Florida Keys, and Loggerhead DSG recorders at ALS and RHB. The results show no significant differences according to the instrument. Because most grouper spawning aggregation sites are located near the shelf edge, the depth variation between sites is minimal regardless of the species, although on the western shelf of Puerto Rico the FSA at BDS is located at 50 m depth. We therefore compared the results of FADAR on recordings at BDS, at ALS, at ALS Deep, which is 4 m deeper than ALS, and from the wave glider, made between 10 and 20 m below the surface. Table 5 shows no significant differences between depths.


Table 4 Performance metrics for different acoustic datasets across three distinct regions: the Florida Keys (Gulf of Mexico), the Cayman Islands (Western Caribbean), Abrir la Sierra (ALS) and Red Hind Bank (RHB) in the northern central Caribbean.


Table 5 Performance metrics for different acoustic datasets across depth: Bajo de Sico (BDS) FSA is located at 50 m depth, Abrir la Sierra (ALS) Deep at 28 m, ALS at 24 m, and the Wave Glider PAM is towed between 10 and 20 m below the surface.

The same analysis was conducted to evaluate the skill of the sub-models at call type classification for each species concerned. The results are shown in Table 6 for red hind grouper calls, Table 7 for Nassau grouper calls and Table 8 for yellowfin grouper calls. The red hind grouper sub-model’s sensitivity and accuracy are slightly improved over the main FADAR model, and all three measures are above 0.93. The Nassau grouper sub-model exhibits lower skill at classifying the call types than the main model does at identifying any Nassau calls; however, all three measures remain above 0.9. Finally, the yellowfin grouper sub-model showed improved sensitivity over the main model. Classification skill was much improved in terms of all three metrics for the YF1 call especially. Specificity and accuracy of the classification of YF2 calls were lower than in the main FADAR model. Overall, the sub-models exhibit higher sensitivity for tonal-like calls such as RH1, N1, and YF1.

Table 6 Evaluation metric values of the red hind grouper sub-model.

Table 7 Evaluation metric values of the Nassau grouper sub-model.

Table 8 Evaluation metric values of the yellowfin grouper sub-model.

A second evaluation of FADAR was conducted on datasets and data not used in the previous testing steps. It consisted of an unbiased evaluation of the algorithm by testing FADAR on an equal number of samples (600) for each species, ensuring balanced representation across locations including the Cayman Islands, St. Thomas (Grammanik Bank), St. Croix (Lang Bank), and the Florida Keys. Particular attention was directed toward mitigating potential sources of acoustic interference, especially at locations such as the Cayman Islands, Puerto Rico and the Virgin Islands, where background noise originating from other fish sounds and vessel activity, as well as the coexistence of marine mammals with overlapping frequency ranges, presents a significant challenge to call type and species classification. These systematic evaluations, encapsulated in the sensitivity, specificity, and accuracy measures, allow for a rigorous examination of FADAR’s performance in species detection while considering the influence of environmental factors across diverse marine ecosystems. The results are shown in Table 9 and confirm the robustness of FADAR’s classification skill across locations for all classes. The lowest sensitivity score (0.897) is obtained for squirrelfish sound in the Cayman Islands, although the accuracy remains above 0.9. For all other species and locations, including the Cayman Islands, the sensitivity, specificity, and accuracy scores are above 0.9. Again, this unbiased analysis suggests that location, which here spans three Caribbean regions (the Florida Keys in the Gulf of Mexico, Grammanik and Lang Banks in the northern central Caribbean, and the Cayman Islands in the western Caribbean), is not a factor in the accuracy of FADAR, nor is the instrument type: Loggerhead Instruments DSGs were used in the northern central Caribbean and SoundTrap recorders in the Cayman Islands and the Florida Keys.
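
The balanced design can be reproduced by drawing an equal number of labeled samples per class before scoring the model. A minimal sketch, assuming an integer label vector; the function name and toy counts are illustrative.

```python
import numpy as np

def balanced_subset(labels, n_per_class=600, seed=0):
    """Return indices with an equal number of samples per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picks = [rng.choice(np.flatnonzero(labels == c), n_per_class,
                        replace=False)
             for c in np.unique(labels)]
    return np.concatenate(picks)

labels = np.repeat(np.arange(3), [800, 700, 650])  # toy label vector
idx = balanced_subset(labels)
print(np.bincount(labels[idx]))                    # -> [600 600 600]
```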

Table 9 Performance metrics for different acoustic datasets across three distinct regions: the Florida Keys (Gulf of Mexico), the Cayman Islands (Western Caribbean), Grammanik Bank (St. Thomas, USVI) and Lang Bank (St. Croix, USVI) in the northern central Caribbean.

5 FADAR application input and output structure

A Windows app, known as FADAR, was created to provide an efficient solution for classifying and categorizing different types of grouper sounds using the proposed algorithms. The FADAR Windows app is a Matlab™ Runtime app that runs without installing Matlab™. Two installation choices are available: a standalone executable that does not require the Matlab™ Runtime libraries, or an app installer that includes the Matlab™ Runtime libraries and creates the executable. With this user-friendly application, individuals can easily upload either a single sound file or an entire directory containing multiple sound files. The app offers four primary classification options: ‘Classify Grouper,’ which provides a broad identification of the input sounds as grouper species, vessel, or squirrelfish, and three species-specific options, ‘Red Hind Call Types,’ ‘Nassau Call Types,’ and ‘Yellowfin Call Types,’ which classify the exact grouper sound categories. To further improve the user experience, FADAR provides versatile output formats. Users have the option to view classification results as in-app tables, allowing for quick and easy reference. For those who require external analysis or wish to share the data, the app can export classification outcomes as CSV (comma-separated values) files, compatible with spreadsheet software like Microsoft Excel. Additionally, the application can generate video files displaying spectrograms annotated with the species name, offering a visual representation of the sound classifications for verification purposes. Finally, the app can also create folders of the 2-s spectrograms in the 0–500 Hz range for each class, for all the calls detected, for further verification. FADAR is also efficient: it can classify ten thousand 20-s files in 2.5 hours on a GPU-powered laptop.
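
Because the CSV export is plain tabular data, downstream summaries are straightforward in any analysis environment. The sketch below assumes hypothetical column names (“file”, “start_s”, “class”); the actual headers written by FADAR may differ.

```python
import pandas as pd

# "fadar_results.csv" is a placeholder path to an exported results file.
df = pd.read_csv("fadar_results.csv")

print(df["class"].value_counts())           # detections per class
print(df.groupby("file")["class"].count())  # detections per sound file
```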

The FADAR app can be downloaded at the following link: https://github.com/Aliklawat/-Fish-Acoustic-Detection-Algorithm-Research. Two files are present in the folder, namely the FADAR executable and a zipped app installer. Requirements to operate FADAR are a 64-bit processor and a minimum sampling rate of 10 kHz for the data. It is recommended to download the app installer, unzip it and run it; it will provide instructions during the installation of the Matlab™ Runtime and FADAR. Once installed, FADAR is ready to operate with a simple click on the FADAR icon. All of the installation steps and operation guidance are provided on the GitHub page. In summary, our Windows app FADAR streamlines the process of identifying and categorizing grouper sounds, catering to both general classification and specific call type recognition needs. Whether the user is a marine biologist, researcher, or enthusiast seeking to understand and differentiate various grouper sounds, the FADAR application serves as a valuable and user-friendly tool. FADAR has been shared with multiple groups, including the Florida Fish and Wildlife Commission, NOAA (Taylor et al., 2020), the NGO COBI in Mexico, and the Grouper Moon Project at Scripps Institution of Oceanography (van Horn et al., 2024). Training sessions were also set up for our potential users; COBI in Mexico has trained fishermen and fisheries managers to use FADAR for Nassau grouper spawning aggregation monitoring.

6 Discussion

Machine learning methods have become effective tools for classifying often extremely large passive acoustic datasets, with a focus on marine mammals (Frasier, 2021; White et al., 2022). Existing work increasingly employs spectrogram representations of sound across a limited frequency range, selected according to the species (or signal) of interest. Using the full frequency band would hinder classification performance because the ratio of pixels containing the signal to be classified to the total image would be quite low compared with a band-limited image, as suggested by White et al. (2022). FADAR is the first machine learning package developed specifically for fish call identification within a specific frequency range (0–500 Hz) that encompasses all of the known calls of the species targeted in this study. We proposed an approach for classifying grouper calls at FSA sites using the concept of DL. We used ensemble DL for the main model to classify six different sounds heard at FSAs, five from fishes and one anthropogenic. Additionally, we proposed three sub-models to classify each species’ call types. Furthermore, FADAR provides an integrated system consisting of DL models for both feature extraction and classification of fish species from their sounds.
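
To illustrate the band-limiting step, the sketch below crops a spectrogram to the 0–500 Hz band before it would be rendered as an image for a classifier. The file name is a placeholder; the 4096-point FFT echoes Figure 14, and the 50% overlap is our assumption rather than FADAR’s exact preprocessing.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("recording.wav")       # placeholder hydrophone file
x = x.astype(float)

f, t, Sxx = spectrogram(x, fs=fs, nperseg=4096, noverlap=2048)
band = f <= 500.0                           # keep only the 0-500 Hz rows
S_db = 10 * np.log10(Sxx[band] + 1e-12)     # dB image handed to the CNN
```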

For the detection and classification of marine mammal signals, CNNs have become the most common architectures (Belghith et al., 2018; White et al., 2022). This work builds upon the evaluation of various architectures for each species presented in a series of studies, such as CNN and LSTM (Ibrahim et al., 2018b), transfer learning with CNN (Ibrahim et al., 2020), stacked autoencoders (Ibrahim et al., 2019), and multimodel DL (Ibrahim, 2019; Ibrahim et al., 2021), which paved the way for the FADAR algorithm presented herein. These studies provided the baseline to identify the best methods for the different applications, especially call type classification. They also showed that the performance of the different architectures was species dependent. FADAR is thus built from the architectures that gave the best results for all species calls overall and for species-specific call types; another selection of architectures could certainly work as well.
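
The fusion step of a multimodel ensemble can be as simple as soft voting over the member networks’ class probabilities. The sketch below illustrates that generic rule; FADAR’s actual decision fusion may differ, and the five random “model outputs” are placeholders.

```python
import numpy as np

def fuse(prob_list):
    """Average the class probabilities of several models (soft voting)
    and return the winning class per sample."""
    fused = np.stack(prob_list).mean(axis=0)  # (n_samples, n_classes)
    return fused.argmax(axis=1), fused

# Five hypothetical CNNs scoring 4 samples over 6 classes:
rng = np.random.default_rng(1)
outputs = [rng.dirichlet(np.ones(6), size=4) for _ in range(5)]
labels, fused = fuse(outputs)
print(labels)
```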

Our results demonstrate the ability of a CNN to extract higher-level components of the soundscape for an assessment of the species present, which is beneficial to marine management, policy and stakeholders. However, as shown in White et al. (2022), a single CNN model appears to be unsuitable for all bioacoustic research needs. At the species level, understanding acoustic behavior in the spawning aggregation context requires not only models that incorporate soundscape elements but also, in tandem, complex call type-level classifiers to meet the desired research needs. In Woodward et al. (2023), the authors show that the acoustic propagation characteristics of RH1 and RH2 differ: mostly the RH2 call type was recorded near the surface, despite both call types being produced at depth at the same source level. Using both tonal and impulsive grouper call types allows a more refined determination of spawning behavior (territory defense, mate attraction, call to migrate to the aggregation site) than single call type approaches. Male red hind form territories with harems of one or more females during spawning aggregations (Shapiro et al., 1993) that may be associated with variations in the combinations or structure of call types associated with sex-specific interactions. The methodology described here could be applied to other soniferous species with repertoires as complex as those of the grouper species analyzed in this study.

Many FSAs are multi-species hot spots where several grouper species can be found (Wilson et al., 2020; Woodward et al., 2023). They are noisy environments where geophony, anthropophony, and biophony overlap and hinder the transmission, detection, and discrimination of species-specific sounds. Understanding how fish and other marine animals adapt their communication strategy while sharing the acoustic space (both in frequency and in space) can offer insight into the differences among species’ sounds (Wilson et al., 2020). To assess the possibility of acoustic partitioning between the four grouper species that FADAR can detect and classify, Wilson et al. (2020) analyzed their spectral and temporal features and their individual sound segments. They measured the following acoustic features for a subset of high signal-to-noise ratio calls and their segments: duration, peak frequency, 3 dB bandwidth, received level, and (if applicable) inter-pulse period (IPP). The spectral and temporal characteristics of the calls themselves were partitioned, and they investigated the use of the acoustic features of calls and segments for discriminating between species and call types using a random forest of multiple classification and regression trees (Breiman et al., 1984; Breiman, 2001). Their analysis revealed that IPP and duration were the most important predictors for the random forests, influencing both call and segment classification more strongly than spectral features. In the study herein, grouper calls are detected and classified from spectrogram images; hence, the detection and discrimination of the calls by FADAR relies on the time-frequency features contained in the images. While the Grad-CAM measure showed overlapping features among the tonal calls of red hind, Nassau and yellowfin groupers, more discriminating features were revealed by the LIME and Occlusion Sensitivity measures, such as the slope of the lower-energy bands and the proximity of the bands to each other. This type of spectrogram feature separation also extends to the classification of call types.
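
For contrast with the image-based approach, a feature-based classifier of the kind used by Wilson et al. (2020) can be sketched as follows. Only the feature list (duration, peak frequency, 3 dB bandwidth, received level, IPP) follows their study; the values and labels below are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))        # synthetic stand-in feature vectors
y = rng.integers(0, 4, size=200)     # four grouper species

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
features = ["duration", "peak_f", "bw_3dB", "rec_level", "ipp"]
print(dict(zip(features, np.round(clf.feature_importances_, 3))))
```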

Work that incorporates multiple sound sources has become important for investigating variations in ambient sound characteristics and monitoring biodiversity to infer ecological information. There is a variety of tools and studies on multi-sound classification in terrestrial systems, such as BirdNET (Kahl et al., 2021) and others (Potamitis, 2014; Denton et al., 2021; Ghani et al., 2023). In the marine environment, Belghith et al. (2018) demonstrated how custom CNNs can discriminate between baleen whale calls, odontocete echolocation clicks and anthropogenic noise sources, achieving an overall accuracy of 66.4% with a site-specific training set. White et al. (2022) implemented, on small training sets, transfer learning of a high-performing architecture combined with multi-channel spectrograms as input for the detection of multiple sound sources, and reported a higher accuracy with an overall macro-average of 94% on the test set. Using ROC curves alongside confusion matrices to measure per-class performance, they assessed the effect of regional soundscape variation on performance metrics for specific sound types. A signal classification pipeline involving supervised and unsupervised learning was used by Frasier (2021) to identify and classify seven classes, including five distinct cetacean sounds. This framework enables expert oversight to label signals of interest, some of which are known and others not yet well characterized or matched to a known source. The intent of this framework is, however, to provide a viable solution for efficiently generating the large, representative training sets needed for applications of DL in passive acoustics. Accuracy scores were also high, above 98% on their balanced evaluation dataset of one thousand samples, but much lower on their unbalanced, manually labeled independent dataset. Mahale et al. (2023) employed unsupervised classification through a hybrid technique comprising principal component analysis and K-means clustering of data features for four fish sound types. They were able to classify the choruses of four fish species with accuracies varying between 76.81% and 100.00%, and they classified vessel sound with 100% accuracy. However, comparison between studies is not straightforward because their test metrics and training sets differ (Hildebrand et al., 2022).
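
As a schematic of the unsupervised route taken by Mahale et al. (2023), dimensionality reduction followed by clustering takes only a few lines. The feature matrix is synthetic, and the component and cluster counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))       # synthetic acoustic feature vectors

Z = PCA(n_components=5).fit_transform(X)          # reduce dimensions
clusters = KMeans(n_clusters=5, n_init=10,
                  random_state=0).fit_predict(Z)  # candidate sound classes
print(np.bincount(clusters))
```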

In the study herein, sensitivity and specificity scores remain consistent between the unbalanced and balanced testing datasets for each class. An approach similar to Frasier (2021) was used by Kim et al. (2023) to detect fish choruses in a large acoustic database of 5.3 years of raw acoustic data. First, a clustering method was used to create distinct chorus and noise classes. Then a deep neural network was trained to distinguish between the noise and chorus classes aggregated by the unsupervised clustering process. The neural network classified chorus and noise in testing data with an overall accuracy of 94.6%, with signal intensity impacting classification accuracy. This type of approach alleviates the tedious task of labeling the dataset and could be considered as a future development for FADAR, in order to increase the number of sound sources that can be detected. However, it would not allow for the refinement of the detection algorithm that might be necessary, as previously shown for call types of the same species that present distinct spectral features, for example the red hind grouper calls. Supervised clustering, though, could be used to account for slight changes in fish sounds due to environmental changes such as temperature increase, by manually adjusting the labeling of the clusters. A review of the most recent fish sound detection and classification methods is provided in Barroso et al. (2023). The main challenge remains the identification of the sound sources, which can be achieved through supervised or unsupervised clustering methods (Huang et al., 2023; Mahale et al., 2023). Then a conventional machine learning algorithm based on feature extraction (Mcloughlin et al., 2019) or a DL model (Mishachandar and Vairamuthu, 2021) is applied to the classes of the clustering analysis.
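
The cluster-then-train idea of Kim et al. (2023) can be outlined generically: labels produced by unsupervised clustering become the training targets of a supervised network. The toy sketch below makes that assumption explicit; the synthetic features, two-cluster setting and small MLP are placeholders, not their architecture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 16))                  # synthetic features
pseudo = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(X)  # chorus vs. noise labels

Xtr, Xte, ytr, yte = train_test_split(X, pseudo, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(Xtr, ytr)
print(f"held-out accuracy: {net.score(Xte, yte):.2f}")
```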

One could assume that the datasets used in this study, and hence the performance of FADAR, are biased toward the seasonal timing of the FSAs, so that the ambient noise and the sound sources other than fish calls remain relatively similar across recording periods, notwithstanding some environmental variability associated with interannual variability. However, the soundscape can differ significantly between locations, and hence so can the masking of fish calls by ambient noise or other biophony. In their study of marine soundscapes in the Lesser Antilles, Heenehan et al. (2019) found that soundscapes differed significantly between the northern Antilles, the Windward Islands and the Leeward Islands. The northern and Windward Islands soundscape was dominated by ship traffic and humpback whale song, which occurred on 49–93% of recording days. In the Leeward Islands, diurnal vessel patterns were observed, with few to no vessels present during nighttime hours, possibly reflecting the activity of recreational craft and fishing vessels. Indeed, the recordings from Puerto Rico and the Virgin Islands contain a significant proportion of humpback whale calls (Figure 14) that are in the same frequency band as the grouper calls (0–500 Hz), which can affect the detection of the calls (Mooney et al., 2020). As shown in Wilson et al. (2020), Little Cayman is not affected by vessel noise or marine mammal sounds like its eastern counterparts. Thus, our training data, through its diverse locations with unique soundscapes and its diversity of recorders, gain settings and hydrophone sensitivities, encompasses the diversity of ambient conditions in the Caribbean region, hence of data quality, including low and high signal detectability. The results of the assessment of the model’s ability to generalize to unseen data across temporal and spatial scales within the region illustrate the reliability of the model output for other Caribbean regions and the resilience of the FADAR DL architecture to noisy conditions, as shown by Mooney et al. (2020).

Figure 14 Example spectrogram of overlapping humpback whale song (thick brush-like strokes) and red hind calls (at 0–3 s, 8.5–10.5 s, 14 s, 16–19 s) recorded in February 2015 near Abrir La Sierra off the west coast of Puerto Rico. The spectrogram was calculated with an FFT size of 4096 points.

Despite the great potential shown by FADAR on the curated and diverse dataset used in this study, there remain some challenges that FADAR is not designed to cope with. Recording errors, associated for example with electrical noise in the hydrophone, can significantly modify the image of the sound in the spectrogram, making it unrecognizable by the classifier. This type of error is not uncommon and was present in two of the data records from the Cayman Islands; its impact on FADAR’s accuracy is discussed in van Horn et al. (2024). Likewise, significant changes in the characteristics of the calls could significantly affect the performance of FADAR. Of the four grouper species classified by FADAR, red hind is the only species known to form choruses, which occur when multiple calls overlap to the point that they become indistinguishable in a spectrogram. FADAR failed to detect red hind choruses at the peak of their spawning activity in the dataset used in this study (Appeldoorn-Sanders et al., 2023). To accurately assess spawning dynamics based on call type production, and to take better advantage of FADAR, it is necessary to also understand the phenomenology of call type sound production, their relative evolution, and their role in the mating dynamics. The next improvement to FADAR would thus be the identification of fish choruses, as done in Kim et al. (2023).

7 Conclusion

This conservation informatics approach, combined with large datasets, will allow researchers and managers throughout the tropical western Atlantic to generate high-resolution time series of each species’ reproductive activity at multiple aggregation sites simultaneously, and thus produce metrics for species presence/absence and relative abundance in a fishery-independent monitoring context. Collectively, these data and machine learning performance metrics allow for studies of reproductive phenology throughout the region, and it is now possible to analyze recordings to characterize region-wide patterns in reproduction, as well as to identify location-specific call patterns (i.e., dialects) in support of defining FSA-specific protection measures that suit each site (Sadovy de Mitcheson et al., 2020). Creating a region-wide professional network, coupled with our recent machine learning advances, will allow us to collaborate across borders and leverage the strength of a multitude of individual datasets. FADAR has been made available to fisheries and MPA managers as well as conservation practitioners throughout the region.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

AI: Conceptualization, Data curation, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing. HZ: Conceptualization, Investigation, Methodology, Resources, Supervision, Visualization, Writing – review & editing. MS: Data curation, Formal analysis, Funding acquisition, Investigation, Resources, Writing – review & editing. CW: Data curation, Investigation, Validation, Writing – review & editing. NE: Writing – review & editing. LC: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Supervision, Validation, Visualization, Writing – review & editing.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported by the Harbor Branch Oceanographic Institute Foundation; the National Science Foundation’s Virgin Islands Established Program to Stimulate Competitive Research (VI-EPSCoR, #1355437). LC, AI, and MS were also supported in part by NOAA Saltonstall-Kennedy grant (NA15NMF4270329).

Acknowledgments

The authors are thankful to all the collaborators who provided data from the different Caribbean regions, including R. Nemeth at the University of the Virgin Islands; B. Semmens, S. Heppel and C. van Horn at the University of California San Diego Scripps Institution of Oceanography; and J. Keller and A. Acosta at the Florida Fish and Wildlife Conservation Commission in the Florida Keys. Datasets from Puerto Rico and the USVI were made available by the Southeast Area Monitoring and Assessment Program-Caribbean, the Caribbean Fishery Management Council and the Caribbean Coral Reef Institute of the University of Puerto Rico under research permits provided by the Puerto Rico Department of Natural and Environmental Resources. Data collection in the Cayman Islands was in part supported by the NOAA International Fisheries Science Research Program.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Allen A. N., Harvey M., Harrell L., Jansen A., Merkens K. P., Wall C. C., et al. (2021). A convolutional neural network for automated detection of humpback whale song in a diverse, long-term passive acoustic dataset. Front. Mar. Sci. 8, 607321. doi: 10.3389/fmars.2021.607321

Appeldoorn-Sanders E., Zayas-Santiago C., Schärer-Umpierre M. (2023). Characterization and temporal patterns of red hind grouper, epinephelus guttatus, choruses at a single aggregation site over a 10-year period. Environ. Biol. Fishes 106, 1953–1969. doi: 10.1007/s10641-023-01476-0

Bahoura M., Simard Y. (2010). Blue whale calls classification using short-time fourier and wavelet packet transforms and artificial neural network. Digital Signal Process. 20, 1256–1263. doi: 10.1016/j.dsp.2009.10.024

Barroso V. R., Xavier F. C., Ferreira C. E. L. (2023). Applications of machine learning to identify and characterize the sounds produced by fish. ICES J. Mar. Sci. 80, 1854–1867. doi: 10.1093/icesjms/fsad126

Baumgartner M. F., Mussoline S. E. (2011). A generalized baleen whale call detection and classification system. J. Acoustical Soc. America 129, 2889–2902. doi: 10.1121/1.3562166

Belghith H. E., Rioult F., Bouzidi M. (2018). “Acoustic diversity classifier for automated marine big data analysis,” in 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), Volos, Greece. 130–136. doi: 10.1109/ICTAI.2018.00029

Bengio Y., Simard P., Frasconi P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166. doi: 10.1109/TNN.72

Bergler C., Schröter H., Cheng R. X., Barth V., Weber M., Nöth E., et al. (2019). Orca-spot: An automatic killer whale sound detection toolkit using deep learning. Sci. Rep. 9, 10997. doi: 10.1038/s41598-019-47335-w

Berk I. M. (1998). Sound production by white shrimp (Panaeus Setiferus), analysis of another crustacean-like sound from the gulf of Mexico, and applications for passive sonar in the shrimp industry. J. Shellfish Res. 17, 1497–1500.

Bermant P. C., Bronstein M. M., Wood R. J., Gero S., Gruber D. F. (2019). Deep machine learning techniques for the detection and classification of sperm whale bioacoustics. Sci. Rep. 9, 12588. doi: 10.1038/s41598-019-48909-4

Binder C. M., Hines P. (2012). “Applying automatic aural classification to cetacean vocalizations,” in Proceedings of Meetings on Acoustics ECUA2012 (Acoustical Society of America), Edinburgh, Scotland, Vol. 17. 070029.

Bohnenstiehl D. R. (2023). Automated cataloguing of american silver perch (bairdiella chrysoura) calls using machine learning. Bioacoustics 32, 453–473. doi: 10.1080/09524622.2023.2197863

Bravo Sanchez F. J., Hossain M. R., English N. B., Moore S. T. (2021). Bioacoustic classification of avian calls from raw sound waveforms with an open-source deep learning architecture. Sci. Rep. 11, 15733. doi: 10.1038/s41598-021-95076-6

Breiman L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

Breiman L., Friedman J. H., Olshen R. A., Stone C. (1984). Classification and regression trees (New York: John Hopkins Press).

Chérubin L., Nemeth R., Idrisi N. (2011). Flow and transport characteristics at an Epinephelus Guttatus (red hind grouper) spawning aggregation site in st. thomas (US Virgin Islands). Ecol. Model. 222, 3132–3148. doi: 10.1016/j.ecolmodel.2011.05.031

Chérubin L. M., Dalgleish F., Ibrahim A. K., Schärer-Umpierre M., Nemeth R. S., Matthews A., et al. (2020). Fish spawning aggregations dynamics as inferred from a novel, persistent presence robotic approach. Front. Mar. Sci. 6, 779. doi: 10.3389/fmars.2019.00779

Choi J., Choo Y., Lee K. (2019). Acoustic classification of surface and underwater vessels in the ocean using supervised machine learning. Sensors 19, 3492. doi: 10.3390/s19163492

Claro R., Lindeman K. C. (2003). Spawning aggregation sites of snapper and grouper species (Lutjanidae and Serranidae) on the insular shelf of Cuba. Gulf Caribbean Res. 14, 91–106. doi: 10.18785/gcr.1402.07

Clink D. J., Klinck H. (2021). Unsupervised acoustic classification of individual gibbon females and the implications for passive acoustic monitoring. Methods Ecol. Evol. 12, 328–341. doi: 10.1111/2041-210X.13520

Denton T., Wisdom S., Hershey J. R. (2021). Improving bird classification with unsupervised sound separation. arXiv eess.AS 2110.03209. doi: 10.48550/arXiv.2110.03209

Domeier M. L., Colin P. L. (1997). Tropical reef fish spawning aggregations: defined and reviewed. Bull. Mar. Sci. 60, 698–726.

Eklund A.-M., McClellan D. B., Harper D. E. (2000). Black grouper aggregations in relation to protected areas within the Florida Keys National Marine Sanctuary. Bull. Mar. Sci. 66, 721–728.

Erisman B. E., Rowell T. J. (2017). A sound worth saving: acoustic characteristics of a massive fish spawning aggregation. Biol. Lett. 13, 20170656. doi: 10.1098/rsbl.2017.0656

Fish J. F. (1966). Sound production in the american lobster, Homarus Americanus H. Milne Edwards (Decapoda Reptantia). Crustaceana 11, 105–106. doi: 10.1163/156854066X00504

Fish M., Kelsey A. S. Jr., Mowbray W. H. (1952). Studies on the production of underwater sound by North Atlantic coastal fishes. J. Mar. Res. 52, 180–193.

Fish M., Mowbray W. H. (1970). Sounds of western North Atlantic fishes (Baltimore, Maryland, USA: Johns Hopkins Press). doi: 10.2307/1441636

Fish M. P. (1964). “Biological sources of sustained ambient sea noise,” in Marine bioacoustics, vol. 1 . Ed. Tavolga W. N. (New York: Pergamon Press), 175–194.

Frasier K. E. (2021). A machine learning pipeline for classification of cetacean echolocation clicks in large underwater acoustic datasets. PloS Comput. Biol. 17, 1–26. doi: 10.1371/journal.pcbi.1009613

Ghani B., Denton T., Kahl S., Klinck H. (2023). Global birdsong embeddings enable superior transfer learning for bioacoustic classification. Sci. Rep. 13, 22876. doi: 10.1038/s41598-023-49989-z

Hawkins A. D. (1986). Underwater Sound and Fish Behaviour (Boston, MA: Springer US), 114–151.

He K., Zhang X., Ren S., Sun J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA. 770–778.

Heenehan H., Stanistreet J. E., Corkeron P. J., Bouveret L., Chalifour J., Davis G. E., et al. (2019). Caribbean sea soundscapes: Monitoring humpback whales, biological sounds, geological events, and anthropogenic impacts of vessel noise. Front. Mar. Sci. 6, 347. doi: 10.3389/fmars.2019.00347

Henninger H. P., Watson I., Winsor H. (2005). Mechanisms underlying the production of carapace vibrations and associated waterborne sounds in the American lobster, Homarus Americanus. J. Exp. Biol. 208, 3421–3429. doi: 10.1242/jeb.01771

Heyman W. D., Kjerfve B. (2008). Characterization of transient multi-species reef fish spawning aggregations at gladden spit, Belize. Bull. Mar. Sci. 83, 531–551.

Hildebrand J. A., Frasier K. E., Helble T. A., Roch M. A. (2022). Performance metrics for marine mammal signal detection and classification. J. Acoustical Soc. America 151, 414–427. doi: 10.1121/10.0009270

Howard A. G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., et al. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv cs.CV 1704.04861. doi: 10.48550/arXiv.1704.04861

Huang C.-J., Yang Y.-J., Yang D.-X., Chen Y.-J. (2009). Frog classification using machine learning techniques. Expert Syst. Appl. 36, 3737–3743. doi: 10.1016/j.eswa.2008.02.059

Huang X., Hu Z., Lin L. (2023). Deep clustering based on embedded auto-encoder. Soft Computing 27, 1075–1090. doi: 10.1007/s00500-021-05934-8

Ibrahim A. K. (2019). Multi-model deep learning for grouper sound classification and seizure prediction. Florida Atlantic University, Boca Raton, FL.

Ibrahim A. K., Chérubin L. M., Zhuang H., Schärer Umpierre M. T., Dalgleish F., Erdol N., et al. (2018a). An approach for automatic classification of grouper vocalizations with passive acoustic monitoring. J. Acoustical Soc. America 143, 666–676. doi: 10.1121/1.5022281

Ibrahim A. K., Zhuang H., Chérubin L. M., Erdol N., O’Corry-Crowe G., Ali A. M. (2021). A multimodel deep learning algorithm to detect north atlantic right whale up-calls. J. Acoustical Soc. America 150, 1264–1272. doi: 10.1121/10.0005898

Ibrahim A. K., Zhuang H., Chérubin L. M., Schärer Umpierre M. T., Ali A. M., Nemeth R. S., et al. (2019). Classification of red hind grouper call types using random ensemble of stacked autoencoders. J. Acoustical Soc. America 146, 2155–2162. doi: 10.1121/1.5126861

Ibrahim A. K., Zhuang H., Chérubin L. M., Schärer-Umpierre M. T., Erdol N. (2018b). Automatic classification of grouper species by their sounds using deep neural networks. J. Acoustical Soc. America 144, EL196–EL202. doi: 10.1121/1.5054911

Ibrahim A. K., Zhuang H., Chérubin L. M., Schärer-Umpierre M. T., Nemeth R. S., Erdol N., et al. (2020). Transfer learning for efficient classification of grouper sound. J. Acoustical Soc. America 148, EL260–EL266. doi: 10.1121/10.0001943

Iversen R. T., Perkins P. J., Dionne R. D. (1963). An indication of underwater sound production by squid. Nature 199, 250–251. doi: 10.1038/199250a0

Kahl S., Wood C. M., Eibl M., Klinck H. (2021). Birdnet: A deep learning solution for avian diversity monitoring. Ecol. Inf. 61, 101236. doi: 10.1016/j.ecoinf.2021.101236

Kasumyan A. (2008). Sounds and sound production in fishes. J. Ichthyol. 48, 981–1030. doi: 10.1134/S0032945208110039

Kim E. B., Frasier K. E., McKenna M. F., Kok A. C. M., Peavey Reeves L. E., Oestreich W. K., et al. (2023). SoundScape learning: An automatic method for separating fish chorus in marine soundscapes. J. Acoustical Soc. America 153, 1710–1722. doi: 10.1121/10.0017432

Kobara S., Heyman W. D. (2008). Geomorphometric patterns of nassau grouper (Epinephelus Striatus) spawning aggregation sites in the Cayman Islands. Mar. Geodesy 31, 231–245. doi: 10.1080/01490410802466397

Kobara S., Heyman W., Pittman S., Nemeth R. (2013). “Biogeography of transient reef-fish spawning aggregations in the Caribbean: a synthesis for future research and management,” in Oceanography and Marine Biology, An Annual Review, vol. 51. Eds. Hughes R. N., Hughes D. J., Smith I. P. (Boca Raton: CRC Press), 281–325.

Kottege N., Jurdak R., Kroon F., Jones D. (2015). Automated detection of broadband clicks of freshwater fish using spectro-temporal features. J. Acoustical Soc. America 137, 2502–2511. doi: 10.1121/1.4919298

Ladich F. (2004). “Sound production and acoustic communication,” in The senses of fish (Dordrecht: Springer), 210–230.

Laplante J.-F., Akhloufi M. A., Gervaise C. (2021). “Fish recognition in underwater environments using deep learning and audio data,” in Ocean Sensing and Monitoring XIII, vol. 11752 . Ed. Hou W. W. (Florida, USA: International Society for Optics and Photonics (SPIE), 117520G.

Laplante J.-F., Akhloufi M. A., Gervaise C. (2022). “Deep learning for marine bioacoustics and fish classification using underwater sounds,” in 2022 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Halifax, NS, Canada. 288–293. doi: 10.1109/CCECE49351.2022.9918242

Liu J., Yang X., Wang C., Tao Y. (2018). “A convolution neural network for dolphin species identification using echolocation clicks signal,” in 2018 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Qingdao, China. 1–4. doi: 10.1109/ICSPCC.2018.8567796

Locascio J. V., Mann D. A. (2008). Diel periodicity of fish sound production in charlotte harbor, florida. Trans. Am. Fisheries Soc. 137, 606–615. doi: 10.1577/T06-069.1

Looby A., Cox K., Bravo S., Rountree R., Juanes F., Reynolds L. K., et al. (2022). A quantitative inventory of global soniferous fish diversity. Rev. Fish Biol. Fisheries 32, 581–595. doi: 10.1007/s11160-022-09702-1

Luczkovich J. J., Keusenkothen M. (2007). “Behavior and sound production by longspine squirrelfish Holocentrus rufus during playback of predator and conspecific sounds,” in Diving for Science 2007. Proceedings of the American Academy of Underwater Sciences 26th Symposium. Eds. Pollock N. W., Godfrey J. M. (Dauphin Island, AL: AAUS).

Luczkovich J. J., Mann D. A., Rountree R. A. (2008). Passive acoustics as a tool in fisheries science. Trans. Am. Fisheries Soc. 137, 533–541. doi: 10.1577/T06-258.1

Luczkovich J. J., Sprague M. W., Johnson S. E., Pullinger R. C. (1999). Delimiting spawning areas of weakfish Cynoscion Regalis (family Sciaenidae) in Pamlico Sound, North Carolina using passive hydroacoustic surveys. Bioacoustics 10, 143–160. doi: 10.1080/09524622.1999.9753427

Mahale V. P., Chanda K., Chakraborty B., Salkar T., Sreekanth G. B. (2023). Biodiversity assessment using passive acoustic recordings from off-reef location—Unsupervised learning to classify fish vocalization. J. Acoustical Soc. America 153, 1534–1553. doi: 10.1121/10.0017248

Mann D., Locascio J., Coleman F., Koenig C. (2009). Goliath grouper Epinephelus Itajara sound production and movement patterns on aggregation sites. Endangered Species Res. 7, 229–236. doi: 10.3354/esr00109

Mann D., Locascio J., Schärer M., Nemeth M., Appeldoorn R. (2010). Sound production by red hind epinephelus guttatus in spatially segregated spawning aggregations. Aquat. Biol. 10, 149–154. doi: 10.3354/ab00272

Matthews C. A., Beaujean P.-P. (2016). “Edge detection of red hind grouper vocalizations in the littorals,” in Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXI, vol. 9823. Eds. Bishop S. S., Isaacs J. C. (Baltimore, Maryland, USA: International Society for Optics and Photonics (SPIE)), 98231W.

Mcloughlin M. P., Stewart R., McElligott A. G. (2019). Automated bioacoustics: methods in ecology and conservation and their potential for animal welfare monitoring. J. R. Soc. Interface 16, 20190225. doi: 10.1098/rsif.2019.0225

Mehyadin A. E., Abdulazeez A. M., Hasan D. A., Saeed J. N. (2021). Birds sound classification based on machine learning algorithms. Asian J. Res. Comput. Sci. 9, 1–11. doi: 10.9734/ajrcos/2021/v9i430227

Meng Z., Zhao Y., Li J., Gong Y. (2019). “Adversarial speaker verification,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6216–6220 (Brighton, UK: IEEE).

Mishachandar B., Vairamuthu S. (2021). Diverse ocean noise classification using deep learning. Appl. Acoustics 181, 108141. doi: 10.1016/j.apacoust.2021.108141

Mooney T. A., Di Iorio L., Lammers M., Lin T.-H., Nedelec S. L., Parsons M., et al. (2020). Listening forward: approaching marine biodiversity assessments using acoustic methods. R. Soc. Open Sci. 7, 201287. doi: 10.1098/rsos.201287

Moreno-Seco F., Iñesta J. M., de León P. J. P., Micó L. (2006). “Comparison of classifier fusion methods for classification in pattern recognition tasks,” in Structural, Syntactic, and Statistical Pattern Recognition. Eds. Yeung D.-Y., Kwok J. T., Fred A., Roli F., de Ridder D. (Springer Berlin Heidelberg, Berlin, Heidelberg), 705–713.

Moulton J. M. (1957). Sound production in the spiny lobster Panulirus Argus (latreille). Biol. Bull. 113, 286–295. doi: 10.2307/1539086

Moulton J. M. (1958). The acoustical behavior of some fishes in the bimini area. Biol. Bull. 114, 357–374. doi: 10.2307/1538991

Mumby P. J., Dahlgren C. P., Harborne A. R., Kappel C. V., Micheli F., Brumbaugh D. R., et al. (2006). Fishing, trophic cascades, and the process of grazing on coral reefs. Science 311, 98–101. doi: 10.1126/science.1121129

Nam J., Choi K., Lee J., Chou S.-Y., Yang Y.-H. (2018). Deep learning for audio-based music classification and tagging: Teaching computers to distinguish rock from bach. IEEE Signal Process. magazine 36, 41–51. doi: 10.1109/MSP.79

National Centers for Coastal Ocean Science (NCCOS); Southeast Fisheries Science Center (SEFSC). (2020). Data from: National Coral Reef Monitoring Program: Assessment of coral reef fish communities in Puerto Rico from 2019-07-18 to 2019-12-29 (NCEI Accession 0218548). Puerto Rico: NOAA National Centers for Environmental Information. Available at: https://www.ncei.noaa.gov/archive/accession/0218548 (Accessed January 5, 2024).

Nelson M., Koenig C., Coleman F., Mann D. (2011). Sound production of red grouper Epinephelus Morio on the west florida shelf. Aquat. Biol. 12, 97–108. doi: 10.3354/ab00325

Nemeth R. S. (2009). Dynamics of Reef Fish and Decapod Crustacean Spawning Aggregations: Underlying Mechanisms, Habitat Linkages, and Trophic Interactions (Dordrecht: Springer Netherlands), 73–134.

Nemeth R. (2012). “Ecosystem aspects of spawning aggregations,” in Reef fish spawning aggregations: biology, research and management (Springer, New York, NY), 21–55.

Nemeth R. S., Blondeau J., Herzlieb S., Kadison E. (2007). Spatial and temporal patterns of movement and migration at spawning aggregations of red hind, epinephelus guttatus, in the u.s. virgin islands. Environ. Biol. Fishes 78, 365–381. doi: 10.1007/s10641-006-9161-x

Noda J., Travieso C., Sánchez-Rodríguez D. (2016). Automatic taxonomic classification of fish based on their acoustic signals. Appl. Sci. 6, 443. doi: 10.3390/app6120443

O’Mahony N., Campbell S., Carvalho A., Harapanahalli S., Hernandez G. V., Krpalkova L., et al. (2019). “Deep learning vs. traditional computer vision,” in Advances in Computer Vision Proceedings of the 2019 Computer Vision Conference (CVC). (Switzerland AG: Springer Nature), 128–144.

Pace F. (2008). Comparison of feature sets for humpback whale song classification. Southampton, United Kingdom: University of Southampton.

Pandeya Y. R., Lee J. (2018). Domestic cat sound classification using transfer learning. Int. J. Fuzzy Logic Intelligent Syst. 18, 154–160. doi: 10.5391/IJFIS.2018.18.2.154

Parmentier E., Vandewalle P., Brié C., Dinraths L., Lecchini D. (2011). Comparative study on sound production in different holocentridae species. Front. Zool. 8, 12. doi: 10.1186/1742-9994-8-12

Parsons S., Jones G. (2000). Acoustic identification of twelve species of echolocating bat by discriminant function analysis and artificial neural networks. J. Exp. Biol. 203, 2641–2656. doi: 10.1242/jeb.203.17.2641

Patek S. N. (2002). Squeaking with a sliding joint: mechanics and motor control of sound production in palinurid lobsters. J. Exp. Biol. 205, 2375–2385. doi: 10.1242/jeb.205.16.2375

Potamitis I. (2014). Automatic classification of a taxon-rich community recorded in the wild. PloS One 9, 1–11. doi: 10.1371/journal.pone.0096936

Reglero P., Balbín R., Abascal F. J., Medina A., Alvarez-Berastegui D., Rasmuson L., et al. (2018). Pelagic habitat and offspring survival in the eastern stock of Atlantic bluefin tuna. ICES J. Mar. Sci. 76, 549–558. doi: 10.1093/icesjms/fsy135

Ribeiro M. T., Singh S., Guestrin C. (2016). ““Why should I trust you?”: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA. 1135–1144.

Ricci S. W., Bohnenstiehl D. R., Eggleston D. B., Kellogg M. L., Lyon R. P. (2017). Oyster toadfish (opsanus tau) boatwhistle call detection and patterns within a large-scale oyster restoration site. PloS One 12, 1–18. doi: 10.1371/journal.pone.0182757

Rice A. N., Farina S. C., Makowski A. J., Kaatz I. M., Lobel P. S., Bemis W. E., et al. (2022). Evolutionary patterns in sound production across fishes. Ichthyol. Herpetol. 110, 1–12. doi: 10.1643/i2020172

Rountree R. A., Gilmore R. G., Goudey C. A., Hawkins A. D., Luczkovich J. J., Mann D. A. (2006). Listening to fish: applications of passive acoustics to fisheries science. Fisheries 31, 433–446. doi: 10.1577/1548-8446(2006)31[433:LTF]2.0.CO;2

Rowell T. J., Appelldoorn R., Rivera J. A., Mann D. A., Kellison T., Nemeth M., et al. (2011). Use of passive acoustics to map grouper spawning aggregations, with emphasis on red hind, Epinephelus Guttatus, off western Puerto Rico. Proc. Gulf. Caribb. Fish Inst. 63, 139–142.

Rowell T. J., Schärer M. T., Appeldoorn R. S. (2018). Description of a new sound produced by nassau grouper at spawning aggregation sites. Gulf Caribbean Res. 29, GCFI22–GCFI26. doi: 10.18785/gcr.2901.12

Rudin C., Radin J. (2019). Why are we using black box models in AI when we don’t need to? A lesson from an explainable AI competition. Harvard Data Sci Rev. 1 (2). doi: 10.1162/99608f92.5a8a3a3d

Sadovy Y. (1997). The case of the disappearing grouper: Epinephelus striatus (pisces: Serranidae). J. Fish Biol. 46, 961–976.

Sadovy de Mitcheson Y., Cornish A., Domeier M., Colin P. L., Russell M., Lindeman K. C. (2008). A global baseline for spawning aggregations of reef fishes. Conserv. Biol. 22, 1233–1244. doi: 10.1111/j.1523-1739.2008.01020.x

Sadovy de Mitcheson Y., Prada M., Azueta J., Lindeman K. (2020). Regional fish spawning aggregation fishery management plan Vol. 96 (Hato Rey, Puerto Rico: Report to the Caribbean Fishery Management Council).

Sala E., Ballesteros E., Starr R. M. (2001). Rapid decline of nassau grouper spawning aggregations in Belize: fishery management and conservation needs. Fisheries 26, 23–30. doi: 10.1577/1548-8446(2001)026<0023:RDONGS>2.0.CO;2

Sattar F., Cullis-Suzuki S., Jin F. (2016a). Acoustic analysis of big ocean data to monitor fish sounds. Ecol. Inf. 34, 102–107. doi: 10.1016/j.ecoinf.2016.05.002

Sattar F., Cullis-Suzuki S., Jin F. (2016b). Identification of fish vocalizations from ocean acoustic data. Appl. Acoustics 110, 248–255. doi: 10.1016/j.apacoust.2016.03.025

Schärer M. T., Nemeth M. I., Mann D., Locascio J., Appeldoorn R. S., Rowell T. J. (2012a). Sound production and reproductive behavior of yellowfin grouper, Mycteroperca Venenosa (Serranidae) at a spawning aggregation. Copeia, 1, 135–144. doi: 10.1643/CE-10-151

Schärer M. T., Nemeth M. I., Rowell T. J., Appeldoorn R. S. (2014). Sounds associated with the reproductive behavior of the black grouper (Mycteroperca Bonaci). Mar. Biol. 161, 141–147. doi: 10.1007/s00227-013-2324-3

Schärer M. T., Rowell T. J., Nemeth M. I., Appeldoorn R. S. (2012b). Sound production associated with reproductive behavior of nassau grouper Epinephelus Striatus at spawning aggregations. Endangered Species Res. 19, 29–38. doi: 10.3354/esr00457

Selvaraju R. R., Cogswell M., Das A., Vedantam R., Parikh D., Batra D. (2019). Grad-cam: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vision 128, 336–359. doi: 10.1007/s11263-019-01228-7

Shapiro D. Y., Sadovy Y., McGehee M. A. (1993). Size, composition, and spatial structure of the annual spawning aggregation of the red hind, Epinephelus Guttatus (Pisces: Serranidae). Copeia 1993, 399–406. doi: 10.2307/1447138

Shiu Y., Palmer K., Roch M. A., Fleishman E., Liu X., Nosal E.-M., et al. (2020). Deep neural networks for automated detection of marine mammal species. Sci. Rep. 10, 1–12. doi: 10.1038/s41598-020-57549-y

Shorten C., Khoshgoftaar T. M. (2019). A survey on image data augmentation for deep learning. J. big Data 6, 1–48. doi: 10.1186/s40537-019-0197-0

Siddagangaiah S., Chen C.-F., Hu W.-C., Pieretti N. (2019). A complexity-entropy based approach for the detection of fish choruses. Entropy 21, 97. doi: 10.3390/e21100977

Silva D. F., De Souza V. M., Batista G. E., Keogh E., Ellis D. P. (2013). “Applying machine learning and audio analysis techniques to insect recognition in intelligent traps,” in 2013 12th International conference on machine learning and applications, Vol. 1. 99–104 (Miami, FL, USA: IEEE).

Smith C. L. (1972). A spawning aggregation of nassau grouper, epinephelus striatus (bloch). Trans. Am. Fisheries Soc. 101, 257–261. doi: 10.1577/1548-8659(1972)101<257:ASAONG>2.0.CO;2

Tan M., Le Q. V. (2020). Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv cs.LG 1905.11946. doi: 10.48550/arXiv.1905.11946

Tavolga W. N., Popper A. N., Fay R. R. (2012). Hearing and sound communication in fishes (New York: Springer Science & Business Media).

Taylor J., Karnauskas M., Chérubin L. M., Schärer-Umpierre M., Michaels W. L., Caillouet R., et al. (2020). Emerging science and technology to improve monitoring and assessments of fish spawning aggregations. Report from the 2019 Gulf and Caribbean Fisheries Institute Workshop (NOAA Tech. Memo. NMFS-F/SPO-207), 74.

Urazghildiiev I. R., Van Parijs S. M. (2016). Automatic grunt detector and recognizer for Atlantic cod (Gadus morhua). J. Acoustical Soc. America 139, 2532–2540. doi: 10.1121/1.4948569

van Horn C. J., Ibrahim A. K., Candelmo A., Heppell S. A., McCoy C. R. M., Pattengill-Semmens C. V., Waterhouse L., et al. (2024). Hydrophone placement yields high variability in detection of Epinephelus striatus calls at a spawning site. Ecol. Appl., in revision.

Vasconcelos R. O., Fonseca P. J., Amorim M. C. P., Ladich F. (2011). Representation of complex vocalizations in the lusitanian toadfish auditory system: evidence of fine temporal, frequency and amplitude discrimination. Proc. R. Soc. B: Biol. Sci. 278, 826–834. doi: 10.1098/rspb.2010.1376

Vickers W., Milner B., Risch D., Lee R. (2021). Robust north atlantic right whale detection using deep learning models for denoising. J. Acoustical Soc. America 149, 3797–3812. doi: 10.1121/10.0005128

Vieira M., Fonseca P. J., Amorim M. C. P., Teixeira C. J. C. (2015). Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the lusitanian toadfish. J. Acoustical Soc. America 138, 3941–3950. doi: 10.1121/1.4936858

Wall C. C., Mann D. A., Lembke C., Taylor C., He R., Kellison T. (2017). Mapping the soundscape off the southeastern usa by using passive acoustic glider technology. Mar. Coast. Fisheries 9, 23–37. doi: 10.1080/19425120.2016.1255685

Walters S., Lowerre-Barbieri S., Bickford J., Mann D. (2009). Using a passive acoustic survey to identify spotted seatrout spawning sites and associated habitat in tampa bay, florida. Trans. Am. Fisheries Soc. 138, 88–98. doi: 10.1577/T07-106.1

White E. L., White P. R., Bull J. M., Risch D., Beck S., Edwards E. W. J. (2022). More than a whistle: Automated detection of marine sound sources with a convolutional neural network. Front. Mar. Sci. 9. doi: 10.3389/fmars.2022.879145

Wiggins S. M., Roch M. A., Hildebrand J. A. (2010). Triton software package: Analyzing large passive acoustic monitoring data sets using matlab. J. Acoustical Soc. America 128, 2299–2299. doi: 10.1121/1.3508074

Wilson K., Semmens B., Pattengill-Semmens C., McCoy C., Sirović A. (2020). Potential for grouper acoustic competition and partitioning at a multispecies spawning site off little cayman, Cayman Islands. Mar. Ecol. Prog. Ser. 634, 127–146. doi: 10.3354/meps13181

Winn H. E., Marshall J. A. (1963). The acoustical behavior of some fishes in the bimini area. Physiol. Zool. 36, 34–44. doi: 10.1086/physzool.36.1.30152736

Winn H. E., Marshall J. A., Hazlett B. (1964). Behavior, diel activities, and stimuli that elicit sound production and reactions to sounds in the longspine squirrelfish. Copeia 1964, 1–12. doi: 10.2307/1441036

Woodward C., Schärer-Umpierre M., Nemeth R. S., Appeldoorn R., Chérubin L. M. (2023). Spatial distribution of spawning groupers on a caribbean reef from an autonomous surface platform. Fisheries Res. 266, 106794. doi: 10.1016/j.fishres.2023.106794

Yamashita R., Nishio M., Do R. K. G., Togashi K. (2018). Convolutional neural networks: an overview and application in radiology. Insights into Imaging 9, 611–629. doi: 10.1007/s13244-018-0639-9

Yang W., Luo W., Zhang Y. (2020). Classification of odontocete echolocation clicks using convolutional neural network. J. Acoustical Soc. America 147, 49–55. doi: 10.1121/10.0000514

Zayas-Santiago C. M., Appeldoorn R. S., Schärer-Umpierre M. T., Cruz-Motta J. J. (2020). Red hind Epinephelus guttatus vocal repertoire characterization, behavior and temporal patterns. Gulf Caribbean Res. 31, GCFI31–GCFI41. doi: 10.18785/gcr.3101.17

Zelick R., Mann D. A., Popper A. N. (1999). “Acoustic communication in fishes and frogs,” in Comparative hearing: fish and amphibians (New York: Springer), 363–411.

Zhang X., Zhou X., Lin M., Sun J. (2017). Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv cs.CV 1707.01083. doi: 10.48550/arXiv.1707.01083

Zhao Z.-Q., Zheng P., Xu S.-t., Wu X. (2019). Object detection with deep learning: A review. IEEE Trans. Neural Networks Learn. Syst. 30, 3212–3232. doi: 10.1109/TNNLS.5962385

Zhong M., Castellote M., Dodhia R., Lavista Ferres J., Keogh M., Brewer A. (2020). Beluga whale acoustic signal classification using deep learning neural network models. J. Acoustical Soc. America 147, 1834–1841. doi: 10.1121/10.0000921

Keywords: grouper sounds, FADAR app, deep learning, CNN ensemble, spawning aggregation

Citation: Ibrahim AK, Zhuang H, Schärer-Umpierre M, Woodward C, Erdol N and Chérubin LM (2024) Fish Acoustic Detection Algorithm Research: a deep learning app for Caribbean grouper calls detection and call types classification. Front. Mar. Sci. 11:1378159. doi: 10.3389/fmars.2024.1378159

Received: 29 January 2024; Accepted: 13 June 2024;
Published: 22 July 2024.

Edited by:

DelWayne Roger Bohnenstiehl, North Carolina State University, United States

Reviewed by:

Aaron N. Rice, Cornell University, United States
Xuebo Zhang, Northwest Normal University, China

Copyright © 2024 Ibrahim, Zhuang, Schärer-Umpierre, Woodward, Erdol and Chérubin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ali K. Ibrahim, aibrahim2014@fau.edu
