
ORIGINAL RESEARCH article

Front. Mar. Sci., 15 July 2024
Sec. Ocean Observation
This article is part of the Research Topic Deep Learning for Marine Science, Volume II

Deep underwater image compression for enhanced machine vision applications

Hanshu Zhang1, Suzhen Fan1, Shuo Zou1, Zhibin Yu1,2*, Bing Zheng1,2*
  • 1Sanya Oceanographic Institution, Ocean University of China, Sanya, China
  • 2Faculty of Information Science and Engineering, Ocean University of China, Qingdao, China

Underwater image compression is fundamental in underwater visual applications. The storage resources of autonomous underwater vehicles (AUVs) and underwater cameras are limited. By employing effective image compression methods, it is possible to optimize the resource utilization of these devices, thereby extending the operational time underwater. Current image compression methods neglect the unique characteristics of the underwater environment, thus failing to support downstream underwater visual tasks efficiently. We propose a novel underwater image compression framework that integrates frequency priors and feature decomposition fusion in response to these challenges. Our framework incorporates a task-driven feature decomposition fusion module (FDFM). This module enables the network to understand and preserve machine-friendly information during the compression process, prioritizing task relevance over human visual perception. Additionally, we propose a frequency-guided underwater image correction module (UICM) to address noise issues and accurately identify redundant information, enhancing the overall compression process. Our framework effectively preserves machine-friendly features at a low bit rate. Extensive experiments across various downstream visual tasks, including object detection, semantic segmentation, and saliency detection, consistently demonstrated significant improvements achieved by our approach.

1 Introduction

The development of computer vision has greatly boosted the advancement of underwater vision-based marine research, including biological monitoring Gudimov (2020); Huo et al. (2021); Zhou et al. (2023), terrain mapping Rowley (2018); Nadai (2019); Jeyaraj et al. (2022), environmental surveillance Guo et al. (2020); Babić et al. (2023); Xue (2023), and fisheries management Hsu et al. (2019); Madia et al. (2023); Wang et al. (2023a). In these research domains, underwater imagery is pivotal in acquiring marine visual information. Since underwater photography and image acquisition usually rely on portable devices, underwater image compression is always required.

Learning-free techniques like JPEG Wallace (1991), JPEG2000 Rabbani and Joshi (2002), BPG Sullivan et al. (2012), and VVC Bross et al. (2021) reduce intra-frame information redundancy through encoding, quantization, and intra-frame prediction. Recent advancements in image compression methods based on deep learning networks have revealed their superior potential compared to conventional approaches Ballé et al. (2016, 2018); Sullivan et al. (2012); Minnen et al. (2018); He et al. (2021, 2022); Bross et al. (2021). These deep learning-based image compression methodologies leverage deep neural networks to learn the intrinsic features of image data and the corresponding compression strategies, aiming for higher compression rates and improved image quality. Unfortunately, current image compression methods are typically designed for terrestrial images. Applying them to underwater images can easily trigger the loss of image information that is crucial to downstream visual tasks (e.g., image classification Deng et al. (2009); He et al. (2016); Sandler et al. (2018), object detection Redmon et al. (2016); He et al. (2017); Ren et al. (2017), and semantic segmentation Long et al. (2015); Badrinarayanan et al. (2017); Chen et al. (2018a)), as depicted in Figure 1.


Figure 1 (A) source image, (B) the proposed method, (C) JPEG, and (D) BPG. We provide three groups of downstream visual tasks including object detection, semantic segmentation, and saliency detection. The initial subset of results pertains to the object detection task, where the first column exhibits the original image. Notably, our approach achieves superior accuracy and confidence in three tasks.

Due to the distinctive characteristics of the underwater environment, existing image compression methods suffer from two primary drawbacks in underwater image compression tasks. On the one hand, while these methods enhance the quality of reconstructed images to some extent, their primary focus is preserving pixel-level fidelity as perceived by the human visual system rather than facilitating feature recognition in machine vision applications Fang et al. (2023). Without considering the requirements of underwater downstream visual tasks, the preserved information can be useless or even adverse to these tasks.

On the other hand, current learning-based and learning-free compression methods are mainly designed to remove redundant information in terrestrial environments, which typically exhibit uniform color distribution and high clarity Ancuti et al. (2012). However, underwater photos are highly susceptible to color bias, scattering, motion blur, and other distortions, which are quite different from those in terrestrial environments Pei et al. (2018). The noise caused by the underwater environment can affect image compression and downstream visual tasks Jiang et al. (2020); Brummer and De Vleeschouwer (2023). Due to the enormous gap between the terrestrial and underwater domains, experience in defining redundant information in terrestrial environments does not transfer to underwater environments. In other words, the 'redundant' information removed by these conventional compression methods may be useful for underwater downstream visual tasks.

Learning-based visual tasks are fundamental for underwater automation. Given high-quality images, advanced visual models, such as image classification Deng et al. (2009); He et al. (2016); Sandler et al. (2018), object detection Redmon et al. (2016); He et al. (2017); Ren et al. (2017), and semantic segmentation models Long et al. (2015); Badrinarayanan et al. (2017); Chen et al. (2018a), can efficiently accomplish machine vision tasks by learning discriminative features. However, when these tasks are combined with underwater image compression, the accumulated loss of information due to underwater degradation and image compression can significantly impact the performance of reconstructed images in downstream machine vision tasks. Therefore, our primary concerns are effectively introducing underwater image transformation into the compression framework and obtaining more machine-friendly feature representations. Following the learning-based compression framework, we introduce a task-driven feature decomposition fusion module (FDFM) to help the network understand and preserve machine-friendly information during the compression process. This allows the network to concentrate on information pertinent to the task, prioritizing task relevance over human visual perception. Furthermore, we propose a frequency-guided underwater image correction module (UICM) to reduce the impact of noise caused by the underwater environment and to accurately locate the redundant information that can be eliminated. To this end, we propose a novel underwater image compression framework that facilitates downstream visual tasks in underwater scenarios. The overall framework is illustrated in Figure 2. The primary contributions of this work are summarized as follows:

● We propose a novel machine-oriented underwater image compression framework that achieves high compression rates while ensuring the performance of downstream underwater visual tasks. Extensive experiments on three different downstream visual tasks further demonstrate the consistent and significant improvements achieved by our method.

● To alleviate the impact of information loss caused by underwater degradation during the image compression process, we propose a frequency-guided underwater image correction module (UICM) that leverages frequency priors to correctly identify and remove redundant information.

● We introduce a task-driven feature decomposition fusion module (FDFM). Under the guidance of downstream visual tasks, this module can effectively capture and retain machine-friendly information during the image compression process.


Figure 2 The details of the proposed method. UICM, FDFM, and AFAM represent the underwater image correction module, the feature decomposition fusion module, and the attention feature aggregation module, respectively. Q, AE, and AD indicate quantization, arithmetic encoding, and arithmetic decoding, respectively. $I_Y$ and $I_{UV}$ represent the luminance and chrominance components of the image in the YCbCr color space.

2 Related works

2.1 Image compression

Image compression uses reversible function mapping and encoding techniques to represent the original image data losslessly or lossily using fewer bits.

2.1.1 Learning-free image compression

In the early years, learning-free image compression algorithms, including JPEG Wallace (1991), JPEG2000 Rabbani and Joshi (2002), BPG Sullivan et al. (2012), and VVC Bross et al. (2021), gained widespread practical adoption due to their extensive development. These algorithms employ lossy compression techniques, such as transforms Khayam (2003); Al-Haj (2007), quantization, entropy coding Di et al. (2003); Sze and Budagavi (2012), intra-frame prediction Brand et al. (2019), and deep hierarchical structures Motl and Schulte (2015), to process images. However, the individual components of these standards are manually designed in advance, with rate-distortion optimization applied to determine pixel signal fidelity. The rigid, hand-crafted nature of traditional codecs limits their adaptability and efficiency in catering to diverse targets. Since they lack end-to-end optimization, they cannot dynamically adjust to image content characteristics. Consequently, the varying compression requirements of different image types, scenarios, and complexities pose challenges to learning-free image compression methods.

2.1.2 Learning-based image compression

The rapid development of deep learning networks has significantly boosted learning-based image compression methods. Notably, methods based on variational autoencoders (VAEs) Ballé et al. (2016, 2018); Minnen et al. (2018); Cheng et al. (2020); Li et al. (2020); He et al. (2021, 2022); Chen et al. (2021); Zhu et al. (2022); Zou et al. (2022) employ encoders and decoders to compress images, focusing on compressing latent features. These approaches optimize the network in an end-to-end fashion, resulting in high-performance compression frameworks. For instance, Ballé et al. (2016) introduced an image compression method incorporating a nonlinear analysis transform, a uniform quantizer, and a nonlinear synthesis transform. This method laid the foundation for VAE-based image compression models. Similarly, Ballé et al. (2018) proposed an image compression model based on variational autoencoders, combining priors to capture spatial dependencies in latent representations and training the model in an end-to-end manner. Even when trained with appropriate losses, however, the model does not fully match the performance of highly optimized traditional methods (such as PSNR-tuned BPG), suggesting that it has not yet reached the expressive power of these methods. In another approach, Minnen et al. (2018) enhanced an image compression method by refining the entropy model with an autoregressive model. The synergy between the autoregressive model and the prior model leads to improved image metrics, such as PSNR and MS-SSIM, outperforming the BPG Sullivan et al. (2012) method. However, the sequential computation of autoregressive models results in low operational efficiency. Moreover, Zhu et al. (2022) presented an image compression method using a multivariate Gaussian mixture, employing vector quantization to approximate the mean and solving it through cascaded estimation, avoiding the need for a context model and reducing complexity. However, this method is trained in an unsupervised manner, and the generated results may be biased. Li et al. (2020) introduced a content-weighted codec model, which generates an importance mask for local adaptive bit allocation through an importance mapping subnet, offering an alternative to entropy estimation. This method improves image compression efficiency while reducing the computational complexity of the context model. Chen et al. (2021) introduced an image compression method that combines non-local attention optimization with improved context modeling. This method utilizes local network operations as nonlinear transformations, estimating the corresponding latent features and priors by calculating local and global correlation information. It also leverages joint 3D convolution to enhance both the autoregressive model and the hyperprior model, improving the efficiency of the entropy model. Experimental results have demonstrated that this method outperforms JPEG, JPEG2000, and BPG in terms of image compression efficiency. Cheng et al. (2020) proposed an entropy model with enhanced flexibility in latent representation distribution estimation through a discretized Gaussian mixture model. Additionally, the performance was improved by incorporating an attention module to focus on complex regions. This method pays more attention to information-rich regions during the training process, thus improving the encoding performance. He et al.
(2021, 2022) surpassed the compression efficiency of VVC by employing a checkerboard context model and unevenly grouped space channels. These two methods increase the decoding speed of the autoregressive model by more than 40 times, improving the parallelism and computational efficiency of the autoregressive model. Zou et al. (2022) presented a plug-and-play non-overlapping window local attention block, which calculates the attention map for each window using an embedded Gaussian function and normalization factors to focus on high-contrast regions. Tolstonogov and Shiryaev (2021) present an underwater image compression method based on camera frames, involving semantic segmentation, semantic shape simplification, and binary data compression. Compared to the JPEG algorithm, this method achieves a threefold increase in frame rate. Anjum et al. (2022) introduces a data-driven underwater image compression method for transmitting images through water. This method effectively utilizes limited bandwidth to transmit images and exhibits robustness against disturbances caused by channel transmission. Burguera and Bonin-Font (2022) proposes a progressive underwater image compression method that divides images into small blocks that can be transmitted separately. Experimental results have shown that this method performs well in low bandwidth or unreliable communication channel environments. Liu et al. (2023) introduces an autoencoder-based underwater image compression technique. This method enhances the reliability of encoding through a multi-step training strategy and multi-description encoding policy. Despite the remarkable performance of VAE-based methods, they are primarily designed to preserve pixel-wise signal fidelity rather than high-level semantic features, which are required in downstream visual tasks.

In parallel to the approaches above, certain studies Agustsson et al. (2019); Wu et al. (2020); Liu et al. (2021a) have explored generative adversarial networks (GANs) to generate visually pleasing textures at low bit rates. GAN-based image compression offers several notable advantages. Firstly, GANs can compress full-resolution images, showcasing the versatility of this approach. Secondly, GANs are capable of achieving extreme bit-rate image compression. However, it is essential to note that the generated images may exhibit significant differences from the original ones, resulting in a potentially deceptive perception of clarity and high resolution.

2.2 Underwater downstream visual tasks

2.2.1 Object detection

The authors of Ellen et al. (2023) utilized underwater drones with YOLOv5 to detect submerged objects, achieving considerable accuracy. In Zhang and Zhu (2023), the authors improved YOLOv5 by implementing coordinate attention mechanisms and bidirectional feature pyramids, resulting in enhanced precision in ship detection. The work in Ranolo et al. (2023) compared the detection results of seaweed using YOLOv5 and YOLOv3, with YOLOv3 exhibiting higher accuracy. The method proposed in Gao et al. (2023b) significantly increased the detection accuracy in sonar imagery by denoising sonar images and enhancing YOLOv5. The approach in Ercan et al. (2022) involves detecting targets in swimming pools through cloud-based computing. In Fu et al. (2022), the authors utilized K-means to recluster target anchor frames, improving YOLOv5's accuracy in detecting small objects in side-scan sonar images. The authors of Hu and Xu (2022) reduced the backbone size of YOLOv5 and restructured the feature pyramid, introducing a novel method for underwater debris detection. The method presented in Xu and Matzner (2018) conducts a comparative analysis of fish detection across multiple datasets and suggests using different datasets during the detector training process. Sonar is an essential tool for underwater target detection. Zhang et al. (2024) developed a chirp scaling algorithm based on the reformulated Loffeld's bistatic formula. Compared with the traditional method, the proposed method is much more efficient and can be directly applied to multichannel and tandem synthetic aperture radar. Yang (2024) proposed an imaging algorithm based on Loffeld's bistatic formula for a multireceiver synthetic aperture sonar system. The presented method can produce high-resolution images.

2.2.2 Semantic segmentation

The authors of Nezla et al. (2021) used a deep convolutional encoder-decoder model based on the UNet architecture to segment the Fish4Knowledge image dataset, achieving commendable scores. Using a self-supervised approach, the method proposed in Singh et al. (2023) addresses the lack of large labeled datasets in underwater scenarios. This approach allows pretraining on extensive terrestrial datasets and fine-tuning on smaller underwater datasets. Kabir et al. (2023) introduced a novel underwater dataset centered around animals, with pixel-level annotations for various fine-grained animal categories. In Pergeorelis et al. (2022), the authors tackled the issue of class instance imbalance in underwater datasets by employing a scheme that involves cutting and pasting objects from one image to another. Chicchon et al. (2023) presented a combination loss function based on active contour theory and level-set methods to enhance underwater object segmentation accuracy. Wang et al. (2023b) employed a semi-supervised K-means clustering algorithm to train and validate objects like coral, sea urchins, starfish, and seagrass. Islam et al. (2020) proposed the first underwater semantic segmentation dataset, containing pixel annotations for eight object categories, and suggested that deep residual models can accurately segment underwater objects. Thampi et al. (2021) analyzed the impact of different thresholds on predicted masks for the underwater semantic segmentation of five different fish species in the Fish4Knowledge image dataset.

Despite the widespread application of advanced visual tasks in underwater environments, most require clear input images. Information loss caused by underwater degradation and image compression can affect the performance of these methods.

3 The proposed method

3.1 The overall architecture

The details of the proposed methodology are illustrated in Figure 2. To mitigate the impact of information loss caused by underwater degradation during the image compression process, we introduce the frequency-guided underwater image correction module (UICM). This module aims to reduce the impact of noise caused by the underwater environment and to accurately remove redundant information. To further enhance encoding efficiency at low bit rates, the task-driven feature decomposition fusion module (FDFM) decomposes features according to their relevance to downstream underwater visual tasks. This procedure preserves machine-friendly information while eliminating redundancy, yielding a concise, machine-friendly feature representation and a reduced bit rate while retaining key features. Finally, a machine-friendly image is reconstructed in the decoder stage to facilitate diverse downstream visual tasks.
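To make the data flow concrete, a minimal PyTorch-style sketch of this pipeline is given below. The module interfaces (UICM, FDFM, the analysis/synthesis transforms, and the entropy model) are placeholders assumed for illustration; they are not the released implementation.

```python
import torch.nn as nn

class MachineOrientedCodec(nn.Module):
    """Illustrative sketch of the compression pipeline in Figure 2 (assumed interfaces)."""

    def __init__(self, uicm, fdfm, analysis, synthesis, entropy_model):
        super().__init__()
        self.uicm = uicm                    # frequency-guided correction (Section 3.2)
        self.fdfm = fdfm                    # task-driven decomposition/fusion (Section 3.3)
        self.analysis = analysis            # encoder transform producing latent features
        self.synthesis = synthesis          # decoder transform reconstructing the image
        self.entropy_model = entropy_model  # quantization + likelihoods for the rate term

    def forward(self, x):
        x_restored = self.uicm(x)                   # remove noise, expose true redundancy
        feats = self.fdfm(x, x_restored)            # retain machine-friendly features
        y = self.analysis(feats)                    # latent representation
        y_hat, likelihoods = self.entropy_model(y)  # quantize and estimate the bit rate
        x_hat = self.synthesis(y_hat)               # machine-friendly reconstruction
        return x_hat, likelihoods
```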

3.2 Frequency-guided underwater image correction module

Due to the complexity of optical imaging in underwater environments compared to terrestrial environments, underwater images are often subject to noise interference. Since image noise is incompressible and irrelevant to downstream visual tasks, removing it allows the compressed image to be represented at a lower bit rate Brummer and De Vleeschouwer (2023). The work of Xu et al. (2020) suggests that different frequency channels vary in their significance for visual tasks. We have therefore designed a frequency-guided underwater image correction module (UICM) to eliminate noise and pinpoint removable redundant information. The structure of UICM is illustrated in Figure 2.

Firstly, we revisit the operations and properties of the discrete cosine transform (DCT). The DCT is an orthogonal transformation. Compared with the fast Fourier transform (FFT) and the discrete wavelet transform (DWT), the DCT saves computation while maintaining good performance Wen et al. (2022). Given a single-channel image $f$ of size $N \times N$, the discrete cosine transform $D$ transforms it into the discrete cosine space as $X$, which is expressed as Equation 1:

$$X(u,v)=Df(i,j)=c(u)c(v)\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} f(i,j)\cos\!\left[\frac{(i+0.5)\pi}{N}u\right]\cos\!\left[\frac{(j+0.5)\pi}{N}v\right],\qquad c(u)=\begin{cases}\sqrt{\frac{1}{N}}, & u=0\\ \sqrt{\frac{2}{N}}, & u\neq 0\end{cases},\qquad c(v)=\begin{cases}\sqrt{\frac{1}{N}}, & v=0\\ \sqrt{\frac{2}{N}}, & v\neq 0\end{cases} \tag{1}$$

where $i$ and $j$ are the coordinate bases in the spatial space; $u$ and $v$ are the coordinate bases in the discrete cosine transform space; and $D^{-1}$ denotes the inverse discrete cosine transform.
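Equation 1 is the standard orthonormal 2D DCT-II. For reference, a minimal NumPy sketch of $D$ and $D^{-1}$ (our illustration, not the authors' code) is:

```python
import numpy as np

def _dct_basis(N):
    """Orthonormal DCT-II basis matrix B with B[u, i] = c(u) * cos[(i + 0.5) * pi * u / N]."""
    i = np.arange(N)
    u = np.arange(N)
    c = np.full(N, np.sqrt(2.0 / N))
    c[0] = np.sqrt(1.0 / N)
    return np.cos(np.outer(u, (i + 0.5)) * np.pi / N) * c[:, None]

def dct2(f):
    """2D DCT-II of an N x N single-channel image, per Equation 1: X = B f B^T."""
    B = _dct_basis(f.shape[0])
    return B @ f @ B.T

def idct2(X):
    """Inverse transform D^{-1}; the basis is orthonormal, so its transpose inverts it."""
    B = _dct_basis(X.shape[0])
    return B.T @ X @ B
```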

The image features affected by the underwater environment $I_{degradation}$ can be expressed as:

$$I_{degradation}=\sum_{s_i} I_{degradation}^{s_i} \tag{2}$$

where $I_{degradation}^{s_i}$ represents image features affected by the underwater environment at different scales and $s_i$ represents different scale ranges. DCT can effectively model noise signals and redundant signals. Let $s_{uv}^{s_i}$ represent the component of $I_{degradation}^{s_i}$ in the DCT space. Equation 2 can be reexpressed as Equation 3:

$$I_{degradation}=D^{-1}\!\left(\sum_{s_i}\sum_{u,v} s_{uv}^{s_i}\right) \tag{3}$$

where $s_i$ represents different scale ranges; $u$ and $v$ are the coordinate bases in the discrete cosine transform space.

Let $Q$ represent the expected image features with low noise and low redundancy, and its component in the DCT space is denoted as $q_{uv}^{s_i}$. $Q$ can be formulated as follows:

$$Q=D^{-1}\!\left(\sum_{s_i}\sum_{u,v} q_{uv}^{s_i}\right) \tag{4}$$

where $s_i$ represents different scale ranges; $u$ and $v$ are the coordinate bases in the discrete cosine transform space.

The difference between $I_{degradation}$ and $Q$ in the DCT space, namely the spectral loss $z_{uv}^{s_i}$, can be represented as Equation 5:

$$z_{uv}^{s_i}=\sum_{s_i}\sum_{u,v}\left(q_{uv}^{s_i}-s_{uv}^{s_i}\right) \tag{5}$$

where $s_i$ represents different scale ranges; $u$ and $v$ are the coordinate bases in the discrete cosine transform space.

Equation 4 can be reexpressed as Equation 6:

$$Q=D^{-1}\!\left(\sum_{s_i}\sum_{u,v}\left(s_{uv}^{s_i}+z_{uv}^{s_i}\right)\right) \tag{6}$$

where $s_i$ represents different scale ranges; $u$ and $v$ are the coordinate bases in the discrete cosine transform space.

Conventional approaches reliant on DCT space aim to directly adjust DCT coefficients, posing significant challenges for practical implementation. Drawing inspiration from Zheng et al. (2019), we leverage a convolutional neural network (CNN) to estimate $z_{uv}^{s_i}$. Acknowledging the influence of diverse-scale features and frequencies on images, our approach entails image adjustment across multiple scales.

UICM employs frequency-spatial interaction (FSI) blocks, illustrated in Figure 3, as its fundamental units. The FSI block consists of a frequency branch and a spatial branch that learn global and local information, respectively. The frequency-domain representation emphasizes global attributes, while local attributes are learned in the spatial branch. These two branches interact to obtain complementary information. The frequency branch estimates the spectral loss $z_{uv}^{s_i}$ in the DCT space via a CNN block and then converts it to the color space through block-IDCT, a predefined convolutional layer whose weights are fixed to the $D^{-1}$ coefficients. The spatial branch processes information in the spatial domain through convolutional blocks. We then interweave features from the spatial and frequency branches, allowing each branch to acquire complementary information from the other. The FSI block then repeats the same calculation once more. Finally, we merge the outputs of the two branches using a 1×1 convolution to obtain the output of the FSI block.
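A simplified PyTorch sketch of one FSI block, reflecting our reading of the description above, is shown below. The channel widths, number of convolutions, the interaction rule between branches, and the fixed-weight DCT/IDCT layers are assumptions for illustration; Figure 3 should be consulted for the exact design.

```python
import torch
import torch.nn as nn

class FSIBlock(nn.Module):
    """Simplified frequency-spatial interaction block (illustrative sketch)."""

    def __init__(self, channels, dct, idct):
        super().__init__()
        self.dct, self.idct = dct, idct           # fixed-weight block-DCT / block-IDCT layers
        self.freq_cnn = nn.Sequential(            # estimates the spectral loss z in DCT space
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.spatial_cnn = nn.Sequential(         # learns local attributes in the spatial domain
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        f_freq, f_spa = x, x
        for _ in range(2):                        # two interaction rounds, per the description
            z = self.freq_cnn(self.dct(f_freq))   # spectral correction estimated in DCT space
            f_freq_new = self.idct(z) + f_spa     # back to color space, exchange with spatial branch
            f_spa = self.spatial_cnn(f_spa) + f_freq
            f_freq = f_freq_new
        return self.fuse(torch.cat([f_freq, f_spa], dim=1))  # 1x1 convolution merges both branches
```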


Figure 3 The illustration of the amplitude format of the FSI block. The FSI block consists of frequency and spatial branches to learn global and local information. The frequency domain representation emphasizes global attributes, while the local attributes are learned in the spatial branch. These two branches interact to obtain complementary information.

3.3 Task-driven feature decomposition fusion module

To ensure the image compression network prioritizes machine-friendly features over preserving pixel-level fidelity as perceived by the human visual system, we employ the task-driven feature decomposition fusion module (FDFM). This module facilitates the preservation of machine-friendly information while eliminating redundancies. Guided by downstream visual tasks, FDFM extracts machine-friendly details from both the original image and the image processed by UICM, effectively removing redundant information. Additionally, attention mechanisms are applied to discern the significance of pixels at various spatial positions. In alignment with downstream underwater visual tasks, distinct weights are assigned to individual pixels to mitigate information redundancy.

The detailed workflow is illustrated in Figure 2. The FDFM comprises three essential modules: a shared encoder $\Phi_l$ dedicated to extracting low-frequency features, a detailed encoder $\Phi_h$ specialized in capturing high-frequency features, and a decoder $\Psi$ employed for the reconstruction of features with enhanced semantic information.

More specifically, the FDFM first utilizes the shared encoder $\Phi_l$ and the detailed encoder $\Phi_h$ to dissect the low-frequency and high-frequency components of both the original source image $I_{origin}$ and the image $I_{restored}$ processed through UICM. This results in the extraction of the low-frequency information $F_{lo}$ and $F_{le}$ and the high-frequency information $F_h$, which is formulated as Equation 7. Drawing inspiration from recent advancements in backbone networks Ding et al. (2023, 2022, 2021); Liu et al. (2021b), we adopt the ConvNeXt Woo et al. (2023) structure for the detailed encoder.

$$F_{lo}=\Phi_l(I_{origin}),\qquad F_{le}=\Phi_l(I_{restored}),\qquad F_h=\Phi_h(I_{origin}) \tag{7}$$

We now possess low-frequency information, denoted $F_{lo}$ and $F_{le}$, extracted from the source image $I_{origin}$ and the restored image $I_{restored}$, respectively, and need an efficient way to integrate these two sets of information. Motivated by the positional attention mechanism discussed in Hou et al. (2021), which simultaneously empowers the neural network to assimilate information from diverse channels, we formulate an attention feature aggregation module (AFAM). This module is specifically designed to handle features originating from various channels collaboratively. Moreover, it analyzes pixel significance across various positions, utilizing coordinates to mitigate information redundancy. We first concatenate the low-frequency information $F_{lo}$ and $F_{le}$ along the channel dimension, then apply the positional attention operation $Coo$, and finally apply a 1×1 convolution to produce the output, as formulated in Equation 8:

$$F_{lc}=P_w\!\left(Coo\!\left(Cat\left(F_{lo},F_{le}\right)\right)\right) \tag{8}$$

where $F_{lc}$ denotes the integrated low-frequency information; $P_w$ denotes the 1×1 convolution operation; $Coo$ represents positional attention; and $Cat$ stands for channel concatenation.
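Equation 8 amounts to a concatenate-attend-project pipeline. A hedged PyTorch sketch is given below, where the coordinate/positional attention of Hou et al. (2021) is passed in as an external module rather than re-implemented; it is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class AFAM(nn.Module):
    """Attention feature aggregation for Equation 8 (illustrative sketch)."""

    def __init__(self, channels, coordinate_attention):
        super().__init__()
        self.coo = coordinate_attention                 # positional attention (Hou et al., 2021)
        self.pw = nn.Conv2d(2 * channels, channels, 1)  # P_w: 1x1 convolution

    def forward(self, f_lo, f_le):
        f = torch.cat([f_lo, f_le], dim=1)  # Cat: channel concatenation
        f = self.coo(f)                     # Coo: weight pixels by their spatial position
        return self.pw(f)                   # F_lc: integrated low-frequency information
```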

Multiscale learning enables the network to autonomously acquire global and local information from features at higher and lower resolutions. Consequently, we conduct scale decomposition on the acquired $F_{lc}$. We adopt a streamlined version of Chen et al. (2022) as the feature extraction network. Subsequently, through AFAM fusion, we derive representations imbued with more profound semantic information. The process above can be summarized as Equation 9:

$$F_{lc}^{S_i}=\phi(F_{lc}),\qquad F_{lce}=\Delta_{AFAM}\!\left(\sum_i F_{lc}^{S_i}\right) \tag{9}$$

where $\phi$ signifies scale decomposition; $F_{lc}^{S_i}$ denotes the features subsequent to scale decomposition; and $F_{lce}$ represents the augmented representation with enriched information after scale fusion.

Finally, we integrate $F_{lce}$ with $F_h$. Drawing inspiration from He et al. (2016), we utilize skip connections to seamlessly amalgamate $F_{lce}$ and $F_h$. Subsequently, the acquired features are reconstructed by the decoder $\Psi$ into features $I_r$ endowed with more profound semantic information and less redundant information. The process above can be summarized as Equation 10.

$$I_r=\Psi\left(F_{lce}+F_h\right) \tag{10}$$

3.4 Training

3.4.1 Loss function

Since our approach is designed for downstream visual tasks, we employ four distinct loss functions to facilitate the training of our network.

$L_{mse}$ is the reconstruction loss between the input image $I$ and the reconstructed image $I'$, used to constrain the pixel-level fidelity of the reconstructed image $I'$ to the input image $I$. It is formulated as Equation 11:

$$L_{mse}=MSE(I,I') \tag{11}$$

Inspired by the work of Johnson et al. (2016), we integrate a perceptual loss, denoted as $L_{fea}$, to accentuate the perceptual quality of the reconstructed image. Employing the initial three layers of a pre-trained VGG-19 Simonyan and Zisserman (2014) as feature extractors, we input both the original image $I$ and the reconstructed image $I'$ to derive the corresponding output features. The loss is formulated by leveraging these features, expressed mathematically as Equation 12:

$$L_{fea}=\sum_{i}^{N}\left\|F_i(I)-F_i(I')\right\| \tag{12}$$

where $F_i(I)$ and $F_i(I')$ denote the feature representations at the $i$-th layer of the pre-trained network, and $N$ represents the total number of layers used.

In order to enhance the performance of the reconstructed image in sophisticated visual tasks, we incorporate diverse downstream task losses, collectively denoted as the task loss $L_{task}$. The application of multiple loss constraints ensures that the reconstructed image aligns with the specific demands of a variety of downstream visual tasks. Throughout the training phase, the cumulative loss is denoted as Equation 13:

$$L_{total}=\lambda_1 L_{mse}+\lambda_2 L_{fea}+\lambda_3 L_{task}+L_{bit} \tag{13}$$

where $L_{bit}$ represents the bit rate of the latent code, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters that mediate the compression ratio of the network. All three hyperparameters affect the results of the method and are set empirically; please refer to Section 4.1 for details.
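As a concrete reading of Equations 11-13, the following sketch assembles the four loss terms. The VGG-based feature extractor, the task loss supplied by a downstream head, and the derivation of $L_{bit}$ from entropy-model likelihoods are assumptions made for illustration; the default weights correspond to the 0.3 bpp setting reported in Section 4.1.

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_hat, likelihoods, task_loss, feat_extractor,
               lam1=0.051, lam2=0.5, lam3=1.0):
    """Combined objective of Equation 13 (illustrative sketch)."""
    l_mse = F.mse_loss(x_hat, x)                                          # Equation 11
    l_fea = sum(torch.norm(fi - fj)                                       # Equation 12 over VGG-19 features
                for fi, fj in zip(feat_extractor(x), feat_extractor(x_hat)))
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    l_bit = sum(-torch.log2(p).sum() for p in likelihoods) / num_pixels   # rate term (assumed form)
    return lam1 * l_mse + lam2 * l_fea + lam3 * task_loss + l_bit
```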

3.4.2 Adaptive training strategy

The single-stage training strategy struggles to strike a harmonious balance between low-level and high-level visual tasks. Current approaches to low-level visual tasks driven by their high-level counterparts frequently employ pre-trained high-level visual models to direct the training of models dedicated to low-level visual tasks. Alternatively, some methodologies opt to train low-level and high-level visual tasks concurrently within a unified stage. Our strategy upholds the performance synergy between image compression and semantic segmentation by subjecting the compression network and the semantic segmentation network to alternating training. This method mitigates potential issues, such as mode collapse, commonly observed during generative adversarial network (GAN) training Tang et al. (2022).
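A minimal sketch of this alternating schedule is given below, reusing the total_loss sketch above. The data loader, optimizers, and segmentation criterion are placeholders, and the exact update schedule used in the paper is not specified, so this should be read as one plausible instantiation rather than the authors' procedure.

```python
def train_alternating(codec, seg_net, loader, opt_codec, opt_seg,
                      seg_criterion, feat_extractor, epochs=1):
    """Alternate updates of the compression network and the segmentation network."""
    for _ in range(epochs):
        for images, masks in loader:
            # Step 1: update the compression network; the segmentation network only
            # supplies the task loss L_task here and is not updated.
            x_hat, likelihoods = codec(images)
            task = seg_criterion(seg_net(x_hat), masks)
            loss_c = total_loss(images, x_hat, likelihoods, task, feat_extractor)
            opt_codec.zero_grad()
            loss_c.backward()
            opt_codec.step()

            # Step 2: update the segmentation network on (detached) reconstructed images.
            x_hat, _ = codec(images)
            loss_s = seg_criterion(seg_net(x_hat.detach()), masks)
            opt_seg.zero_grad()
            loss_s.backward()
            opt_seg.step()
```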

4 Experiments

4.1 Experimental setup

4.1.1 Datasets

SUIM Islam et al. (2020) is a dataset for semantic segmentation of underwater imagery. It comprises over 1500 images with pixel-level annotations for eight distinct object categories: vertebrate fish, invertebrate coral reefs, aquatic plants, sunken ships/ruins, human divers, robots, and the seabed. Following the predefined partitioning scheme, the dataset is divided into 1525 images for training and 110 for testing. The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ all affect the results of the method and are set empirically: $\lambda_1/\lambda_2/\lambda_3$ are set to 0.051/0.15/1, 0.051/0.5/1, and 0.051/2/1 under 0.1, 0.3, and 0.5 bpp, respectively.

URPC2018 is a dataset for underwater object detection. It encompasses four distinct categories: sea cucumber, sea urchin, starfish, and scallop, comprising 2901 training images and 800 testing images. Our approach adheres to the pre-established partitioning scheme. The hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ all affect the results of the method and are set empirically: $\lambda_1/\lambda_2/\lambda_3$ are set to 0.051/0.17/1, 0.051/0.5/1, and 0.051/2/1 under 0.028, 0.86, and 0.237 bpp, respectively.

4.1.2 Compared methods

We assessed the efficacy of our proposed method through a comparative analysis with traditional and CNN-based compression methods. The entropy model is based on Zou et al. (2022). The traditional methods encompass JPEG Wallace (1991), JPEG2000 Rabbani and Joshi (2002), BPG (intra-frame, 4:4:4 chroma format) Sullivan et al. (2012), and VVC intra-frame (4:4:4 chroma format) Bross et al. (2021). Additionally, CNN-based methods such as Hyperprior (ICLR2018) Ballé et al. (2018), Devil (CVPR2022) Zou et al. (2022), and Gao et al. (2023a) were included in the comparison.

We conducted an extensive series of experiments to assess the performance of the proposed underwater image compression model in downstream visual tasks, encompassing object detection, semantic segmentation, and saliency detection.

4.2 Downstream visual tasks performance comparison

4.2.1 Object detection

We employed the YOLOv8s framework for the downstream object detection task. We fine-tuned the detector from a model pre-trained on the COCO dataset Lin et al. (2014) to identify targets such as humans, robots, invertebrates, vertebrates, and fish. The image dimensions were standardized to 640×640, and the detector was trained using the Adam Kingma and Ba (2014) optimizer for 100 epochs, initialized with a learning rate of 0.00001. Notably, consistent settings were applied across the various image compression methods. Detection performance was evaluated using the recall rate (RA) and the mean average precision (mAP50).
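Under the stated settings, the fine-tuning run could be configured roughly as follows with the Ultralytics API. The dataset configuration file is a hypothetical placeholder; only the image size, optimizer, epoch count, and learning rate are taken from the text.

```python
from ultralytics import YOLO

# Fine-tune a COCO-pretrained YOLOv8s detector on the reconstructed underwater images.
model = YOLO("yolov8s.pt")          # COCO-pretrained weights
model.train(
    data="suim_detection.yaml",     # hypothetical dataset config (image paths + class names)
    imgsz=640,                      # images standardized to 640x640
    epochs=100,
    optimizer="Adam",
    lr0=1e-5,                       # initial learning rate reported in the paper
)
metrics = model.val()               # reports recall and mAP50, among other metrics
```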

Table 1 illustrates that our proposed method achieves high accuracy in object detection even at low bit rates. Specifically, under 0.1 bpp on the SUIM dataset, our method surpasses JPEG, JPEG2000, BPG, and VVC in RA/mAP50 by 0.237/0.03267 points, 0.162/0.105 points, 0.102/0.065 points, and 0.226/0.165 points, respectively. In comparison to Hyperprior Ballé et al. (2018), Devil Zou et al. (2022), and Gao et al. (2023a) under 0.01 bpp, our method demonstrates notable improvements of 0.091/0.059 points, 0.189/0.198 points, and 0.028/0.053 points in RA/mAP50. Notably, under 0.3 and 0.5 bpp, our method maintains a comfortable lead over the alternatives.


Table 1 Comparison on object detection tasks on SUIM dataset.

As shown in Table 2, our method demonstrates outstanding performance on the URPC2018 dataset. Specifically, under 0.028 bpp, our approach yields RA/mAP50 scores 0.066/0.09 points higher than those of VVC. In contrast, the reconstructed images produced by the JPEG2000 method suffer severe degradation, leading to a loss in analytical efficacy. Across different bit rates, our proposed method stays ahead of the other compression methods in underwater object detection tasks. Due to the influence of the underwater environment, the images in the URPC2018 dataset are blurry. A qualitative analysis is shown in Figure 4. Experimental results demonstrate that our method performs well even on blurry images.


Table 2 Comparison on object detection tasks on URPC2018 dataset.


Figure 4 Qualitative analysis of object detection conducted on the URPC2018 dataset. (A) shows original images, and (B) shows reconstructed images produced by the proposed method under 0.237 bpp.

4.2.2 Semantic segmentation

We employed DeepLabV3+ Chen et al. (2018b) as the semantic segmentation framework. The segmentation framework underwent fine-tuning, utilizing a model pre-trained on the ImageNet dataset Deng et al. (2009), for the precise segmentation of targets including vertebrate fish, invertebrate coral reefs, aquatic plants, sunken ships/ruins, human divers, robots, and the seabed. With the image size standardized to 256×256, the segmentation framework was trained with the Adam Kingma and Ba (2014) optimizer for 100 epochs, commencing with an initial learning rate of 0.0001. Notably, consistent settings were applied across the different image compression methods. Segmentation performance was evaluated using mean Intersection over Union (mIOU), mean Pixel Accuracy (mPA), and Pixel Accuracy (PA).
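For completeness, the three reported metrics can be computed from a class-wise confusion matrix accumulated over the test set; a small NumPy sketch (ours, not the paper's evaluation code) is shown below.

```python
import numpy as np

def segmentation_metrics(conf):
    """mIOU, mPA and PA from a (num_classes x num_classes) confusion matrix.

    conf[i, j] counts pixels whose ground-truth class is i and predicted class is j.
    """
    tp = np.diag(conf).astype(float)
    gt = conf.sum(axis=1).astype(float)     # pixels per ground-truth class
    pred = conf.sum(axis=0).astype(float)   # pixels per predicted class
    iou = tp / np.maximum(gt + pred - tp, 1e-12)
    pa_per_class = tp / np.maximum(gt, 1e-12)
    return {
        "mIOU": iou.mean(),                 # mean Intersection over Union
        "mPA": pa_per_class.mean(),         # mean per-class Pixel Accuracy
        "PA": tp.sum() / conf.sum(),        # overall Pixel Accuracy
    }
```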

Table 3 shows the comparisons among different methods on semantic segmentation tasks. It is evident that our approach outperforms the other methods. As we discussed in Section 1, the high-level features in underwater images can be easily affected, posing challenges for downstream visual tasks. Unlike our method, current CNN-based compression methods still focus on mitigating pixel distortion without considering the key features required by semantic segmentation and other downstream visual tasks. For example, under approximately 0.1 bpp, our proposed method attains higher mIOU/mPA/PA scores than JPEG, JPEG2000, BPG, and VVC by 12.84/13.22/6.8 points, 5.09/3.98/1.78 points, 3.96/4.63/1.34 points, and 0.44/1.21/1.08 points, respectively. In comparison to Hyperprior Ballé et al. (2018), Devil Zou et al. (2022), and Gao et al. (2023a) under 0.01 bpp, our method remains ahead by 0.42/1.63/1.38 points, 10.56/10.36/6.3 points, and 0.46/1.83/1.26 points in mIOU/mPA/PA.


Table 3 Comparison on semantic segmentation tasks on SUIM dataset.

4.2.3 Saliency detection

We employed the U2net Qin et al. (2020) framework for underwater saliency detection with the different compression frameworks. The saliency detection framework underwent fine-tuning utilizing a model pre-trained on the DUTS dataset Piao et al. (2020), specifically targeting human divers, robots, fish, and vertebrates. The images in the detection framework were standardized to 320×320, and the training process used the AdamW Loshchilov and Hutter (2017) optimizer for 360 epochs, initialized with a learning rate of 0.001. Consistency was maintained across the various image compression methods, as we adhered to the same settings. Our evaluation of detection performance relies on the mean absolute error (MAE) and the maximal F-measure (maxFβ) Achanta et al. (2009).
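Both saliency metrics can be reproduced directly from predicted saliency maps and binary ground-truth masks. The sketch below follows the usual definitions, assuming β² = 0.3 as in Achanta et al. (2009) and a sweep over binarization thresholds for the maximal F-measure; it is our illustration rather than the paper's evaluation code.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximal F-measure over a sweep of binarization thresholds."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / max(binary.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        f = (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)
        best = max(best, f)
    return best
```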

Table 4 reveals that our proposed method achieves the best performance in saliency detection even at low bit rates. Specifically, under 0.1 bpp, our method's MAE/maxFβ outperforms JPEG, JPEG2000, BPG, and VVC by 0.025/0.115 points, 0.012/0.058 points, 0.014/0.058 points, and 0.004/0.023 points, respectively. In comparison to Hyperprior Ballé et al. (2018), Devil Zou et al. (2022), and Gao et al. (2023a) under 0.01 bpp, our method showcases improvements of 0.002/0.031 points, 0.022/0.107 points, and 0.002/0.018 points in MAE/maxFβ. Under 0.3 and 0.5 bpp, our method consistently maintains superior performance. It is evident that our method can efficiently support underwater saliency detection tasks.


Table 4 Comparison on saliency detection tasks on SUIM dataset.

Figure 5 illustrates a qualitative analysis of the outcomes obtained from various methods across three tasks: object detection, semantic segmentation, and saliency detection. In the initial row of each set, we display the bounding boxes and confidence levels associated with the object detection results, with the initial image serving as a representation of the original image. Evidently, underwater images compressed with our approach retain high detection accuracy and confidence scores. The effectiveness of the object detector can easily be constrained by the combined impact of underwater degradation and compression. Nevertheless, our UICM systematically removes noise and redundant information, resulting in specific detection outcomes surpassing those of the original images. In the subsequent row of each set, the initial image serves as the ground truth for the semantic segmentation task, with varied colors denoting distinct categories. For the semantic segmentation task, our approach obtains higher segmentation accuracy than the alternative methods, producing contours that align more closely with the ground truth. Additionally, our approach demonstrates comparable efficacy in salient object detection, as depicted in the third-row results, where the initial image serves as the ground truth.


Figure 5 Example results of (A) original image, (B) our method, (C) JPEG, (D) JPEG2000, (E) BPG, (F) VVC, (G) Hyperprior, (H) Devil. For each group, the results of object detection, semantic segmentation, and saliency detection are shown respectively. Our method performs better than the other methods.

Through a comprehensive examination of both qualitative and quantitative outcomes in the three tasks above, our proposed method has exhibited superior performance on various downstream visual tasks.

4.3 Ablation study

We conducted some ablation experiments to validate the contributions of the proposed UICM and FDFM. We compared the results of object detection, semantic segmentation, and saliency detection for three network structures: (a) without UICM, (b) without FDFM, and (c) without both UICM and FDFM.

The outcomes of the ablation experiments for the object detection task are presented in Table 5. Among the experimental setups, (b) demonstrates superior performance in RA and mAP50. In comparison to (c), (b) exhibits a notable enhancement of 0.081/0.037 points in RA and mAP50. Similarly, when contrasted with (a), (b) manifests an improvement of 0.009/0.005 points in RA and mAP50. Furthermore, in contrast to (c), (a) displays an increase of 0.072/0.032 points in RA and mAP50.


Table 5 Ablation study on object detection tasks on SUIM dataset under 0.3 bpp.

We obtained similar performance on semantic segmentation tasks, as depicted in Table 6. Compared to (c), (b) demonstrates a notable improvement of 0.93/0.63/1.28 points in mIOU, mPA, and PA, respectively. Compared with (a), (b) displays a modest enhancement of 0.06/0.03/0.13 points in mIOU, mPA, and PA. Similarly, compared to (c), (a) exhibits an increase of 0.87/0.6/1.15 points in mIOU, mPA, and PA.


Table 6 Ablation study on semantic segmentation tasks on SUIM dataset under 0.3 bpp.

The outcomes of the ablation experiments conducted for the saliency detection task are presented in Table 7, unveiling consistent patterns in the results of the saliency detection task. In comparison to (c), (b) demonstrates a noteworthy improvement of 0.004/0.015 points in MAE/maxFβ. Similarly, contrasted with (a), (b) exhibits a modest enhancement of 0.001/0.012 points in MAE/maxFβ. Furthermore, when compared to (c), (a) shows an increase of 0.003/0.003 points in MAE/maxFβ.


Table 7 Ablation study on saliency detection tasks on SUIM dataset under 0.3 bpp.

The aforementioned experimental results validate the effectiveness of UICM and FDFM. UICM incorporates underwater prior knowledge into the image compression framework by leveraging frequency information, which is beneficial for noise and redundant information removal. Meanwhile, FDFM employs a task-driven approach to decompose image features, effectively assisting the network in understanding and preserving machine-friendly information during the compression process.

4.4 Human perception performance

To evaluate whether the proposed methodology also serves the human visual system, we prepared comprehensive evaluations of human perceptual performance. Beyond the machine-oriented analysis, this involved metrics such as PSNR and MS-SSIM for natural images, along with the UIQM Panetta et al. (2015) metric tailored for underwater imagery. The outcomes are delineated in Tables 8–10. In the PSNR metric, our proposed method demonstrates comparable performance to the JPEG2000 approach. Within the MS-SSIM metric, the effectiveness of our proposed method aligns with that of the BPG method. Moreover, in UIQM, our proposed method outperforms the alternative approaches.


Table 8 Comparison on human perception performance tasks on SUIM dataset in terms of PSNR metric.


Table 9 Comparison on human perception performance tasks on SUIM dataset in terms of MS-SSIM metric.


Table 10 Comparison on human perception performance tasks on SUIM dataset in terms of UIQM metric.

From the qualitative analysis examples presented in Figure 6, it is evident that, at low bit rates, the reconstructed images generated by our method exhibit enhanced clarity in fulfilling the task objectives. Specifically, in the first row, the human targets in our approach are markedly more distinct than alternative methods. In the second row, the small fish in the lower left corner of the reconstructed images from other methodologies appear more indistinct, whereas, in our proposed method, the small fish in the corresponding position is relatively well-defined. Progressing to the third row, our proposed method’s reconstructed image of the sea urchin object displays more defined boundaries compared with alternative methods. In summary, despite its primary design for machine analysis tasks, our method preserves fundamental functionality for human recognition.


Figure 6 The visual comparison of (A) original image, (B) the proposed method, (C) BPG, (D) VVC, and (E) Hyperprior. (A) is designated as the original image, while the remaining columns depict reconstructed images using various methods at different bpp. The reconstructed images generated by the method proposed in this paper exhibit relatively high clarity.

5 Conclusion

This paper proposes a new machine-oriented underwater image compression framework, introducing a frequency-guided underwater image correction module (UICM) and a task-driven feature decomposition fusion module (FDFM). The UICM progressively removes noise and redundant information, using frequency-spatial interaction (FSI) blocks to learn complementary global attributes in the frequency domain and local attributes in the spatial domain. Additionally, the FDFM can effectively locate and retain features that are useful for downstream visual tasks through task-driven decomposition of image features. Extensive experiments on downstream visual tasks demonstrate that the proposed framework can effectively reduce the performance loss that low-bit-rate compression causes in downstream visual tasks.

In our future endeavors, we are committed to advancing the study of image compression techniques within more visual tasks. Moreover, we aim to investigate strategies for harnessing the potential advantages derived from large-scale models.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

Author contributions

HZ: Writing – original draft, Conceptualization, Methodology, Software. SF: Writing – original draft, Data curation, Validation. SZ: Writing – original draft, Data curation, Visualization. ZY: Writing – review & editing, Conceptualization, Funding acquisition, Resources, Supervision. BZ: Writing – review & editing, Funding acquisition, Resources, Supervision.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was supported by Hainan Province Science and Technology Special Fund, China (ZDYF2022SHFZ318); the National Natural Science Foundation of China under grant number 62171419; and Natural Science Foundation of Shandong Province of China under grant number ZR2021LZH005.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Achanta R., Hemami S., Estrada F., Süsstrunk S. (2009). “Frequency-tuned salient region detection,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. (Miami, FL, USA: IEEE). doi: 10.1109/CVPR.2009.5206596

Agustsson E., Tschannen M., Mentzer F., Timofte R., Gool L. V. (2019). “Generative adversarial networks for extreme learned image compression,” in Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. (Seoul, Korea (South): IEEE), 221–231.

Al-Haj A. (2007). Combined dwt-dct digital image watermarking. J. Comput. Sci. 3, 740–746. doi: 10.3844/jcssp.2007.740.746

Ancuti C., Ancuti C. O., Haber T., Bekaert P. (2012). “Enhancing underwater images and videos by fusion,” in 2012 IEEE conference on computer vision and pattern recognition. (Providence, RI, United States: IEEE), 81–88.

Anjum K., Li Z., Pompili D. (2022). “Acoustic channel-aware autoencoder-based compression for underwater image transmission,” in 2022 Sixth Underwater Communications and Networking Conference (UComms). (Lerici, Italy: IEEE), 1–5. doi: 10.1109/UComms56954.2022.9905691

Babić A., Ferreira F., Kapetanović N., Mišković N., Bibuli M., Bruzzone G., et al. (2023). “Cooperative marine litter detection and environmental monitoring using heterogeneous robotic agents,” in OCEANS 2023-Limerick. (Limerick, Ireland: IEEE), 1–6.

Badrinarayanan V., Kendall A., Cipolla R. (2017). Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 2481–2495. doi: 10.1109/TPAMI.34

Ballé J., Laparra V., Simoncelli E. P. (2016). End-to-end optimized image compression. arXiv preprint arXiv:1611.01704. doi: 10.48550/arXiv.1611.01704

Ballé J., Minnen D., Singh S., Hwang S. J., Johnston N. (2018). Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436. doi: 10.48550/arXiv.1802.01436

Brand F., Seiler J., Kaup A. (2019). “Intra frame prediction for video coding using a conditional autoencoder approach,” in 2019 Picture Coding Symposium (PCS). (Ningbo, China: IEEE), 1–5.

Bross B., Wang Y.-K., Ye Y., Liu S., Chen J., Sullivan G. J., et al. (2021). Overview of the versatile video coding (vvc) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 31, 3736–3764. doi: 10.1109/TCSVT.2021.3101953

Brummer B., De Vleeschouwer C. (2023). “On the importance of denoising when learning to compress images,” in 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). (Waikoloa, HI, United States: IEEE), 2439–2447. doi: 10.1109/WACV56688.2023.00247

Burguera A., Bonin-Font F. (2022). “Progressive hierarchical encoding for image transmission in underwater environments,” in OCEANS 2022, Hampton Roads. (Hampton Roads, VA, United States: IEEE), 1–6. doi: 10.1109/OCEANS47191.2022.9976987

Chen L., Chu X., Zhang X., Sun J. (2022). Simple baselines for image restoration. arXiv preprint arXiv:2204.04676. doi: 10.1007/978-3-031-20071-7_2

Chen L.-C., Papandreou G., Kokkinos I., Murphy K., Yuille A. L. (2018a). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40, 834–848. doi: 10.1109/TPAMI.2017.2699184

Chen L.-C., Zhu Y., Papandreou G., Schroff F., Adam H. (2018b). “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV). (Munich, Germany: Springer), 801–818.

Chen T., Liu H., Ma Z., Shen Q., Cao X., Wang Y. (2021). End-to-end learnt image compression via non-local attention optimization and improved context modeling. IEEE Trans. Image Process. 30, 3179–3191. doi: 10.1109/TIP.83

Cheng Z., Sun H., Takeuchi M., Katto J. (2020). “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” in Proceedings of the 2020 IEEE/CVF conference on computer vision and pattern recognition. (Seattle, WA, United States: IEEE), 7939–7948.

Chicchon M., Bedon H., Del-Blanco C. R., Sipiran I. (2023). Semantic segmentation of fish and underwater environments using deep convolutional neural networks and learned active contours. IEEE Access 11, 33652–33665. doi: 10.1109/ACCESS.2023.3262649

Deng J., Dong W., Socher R., Li L.-J., Li K., Fei-Fei L. (2009). “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. (Miami, FL, United States: IEEE), 248–255. doi: 10.1109/CVPR.2009.5206848

Di W. D. W., Wen G. W. G., Mingzeng H. M. H., Zhenzhou J. Z. J. (2003). “A vlsi architecture design of cavlc decoder,” in ASIC, 2003. Proceedings. 5th International Conference on (IEEE). (Beijing, China: IEEE), Vol. 2. 962–965.

Ding X., Zhang Y., Ge Y., Zhao S., Song L., Yue X., et al. (2023). Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. arXiv preprint arXiv:2311.15599. doi: 10.48550/arXiv.2311.15599

Ding X., Zhang X., Han J., Ding G. (2022). “Scaling up your kernels to 31x31: Revisiting large kernel design in cnns,” in Proceedings of the 2022 IEEE/CVF conference on computer vision and pattern recognition. (New Orleans, United States: IEEE), 11963–11975.

Ding X., Zhang X., Ma N., Han J., Ding G., Sun J. (2021). “Repvgg: Making vgg-style convnets great again,” in Proceedings of the 2021 IEEE/CVF conference on computer vision and pattern recognition. (Nashville, TN, United States: IEEE), 13733–13742.

Ellen D. A. R., Kristalina P., Hadi M. Z. S., Patriarso A. (2023). “Effective searching of drowning victims in the river using deep learning method and underwater drone,” in 2023 International Electronics Symposium (IES). (Denpasar, Indonesia: IEEE), 569–574. doi: 10.1109/IES59143.2023.10242589

Ercan M. F., Muhammad N. I., Bin Sirhan M. R. N. (2022). “Underwater target detection using deep learning,” in TENCON 2022 - 2022 IEEE Region 10 Conference (TENCON). (Hong Kong, Hong Kong: IEEE), 1–5. doi: 10.1109/TENCON55691.2022.9977994

Fang Z., Shen L., Li M., Wang Z., Jin Y. (2023). Prior-guided contrastive image compression for underwater machine vision. IEEE Trans. Circuits Syst. Video Technol. 33, 2950–2961. doi: 10.1109/TCSVT.2022.3229296

Fu S., Xu F., Liu J., Pang Y., Yang J. (2022). “Underwater small object detection in side-scan sonar images based on improved yolov5,” in 2022 3rd International Conference on Geology, Mapping and Remote Sensing (ICGMRS). (Zhoushan, China: IEEE), 446–453. doi: 10.1109/ICGMRS55602.2022.9849382

Gao C., Liu D., Li L., Wu F. (2023a). Towards task-generic image compression: A study of semantics-oriented metrics. IEEE Trans. Multimedia 25, 721–735. doi: 10.1109/TMM.2021.3130754

Gao R., Yan Y., Liu X. (2023b). “Target recognition method of sonar image based on deep learning,” in 2023 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). (Zhengzhou, China: IEEE), 1–6. doi: 10.1109/ICSPCC59353.2023.10400377

Gudimov A. (2020). “The first it systems for ecological online monitoring in water environment,” in 2020 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon). (Vladivostok, Russia: IEEE), 1–5.

Guo Y., Li F., Du Q. (2020). Research on key technologies of spatio-temporal analysis and prediction of marine ecological environment based on association rule mining analysis. J. Coast. Res. 115, 302–307. doi: 10.2112/JCR-SI115-095.1

He K., Gkioxari G., Dollár P., Girshick R. (2017). “Mask r-cnn,” in 2017 IEEE International Conference on Computer Vision (ICCV). (Venice, Italy: IEEE), 2980–2988. doi: 10.1109/ICCV.2017.322

He D., Yang Z., Peng W., Ma R., Qin H., Wang Y. (2022). “Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding,” in Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. (New Orleans, LA, United States: IEEE), 5718–5727.

He K., Zhang X., Ren S., Sun J. (2016). “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Las Vegas, NV, United States: IEEE), 770–778. doi: 10.1109/CVPR.2016.90

He D., Zheng Y., Sun B., Wang Y., Qin H. (2021). “Checkerboard context model for efficient learned image compression,” in Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. (Nashville, TN, United States: IEEE), 14771–14780.

Hou Q., Zhou D., Feng J. (2021). “Coordinate attention for efficient mobile network design,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (Nashville, TN, United States: IEEE). doi: 10.1109/CVPR46437.2021.01350

Hsu W. W., Wang S.-Y., Hong W.-S., Hu R.-H., Yu C.-J., Tasi H.-Y. (2019). “Portable fisheries assistant systems for small scale fisheries management,” in 2019 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE). (Yunlin, Taiwan: IEEE), 10–13.

Hu Z., Xu C. (2022). “Detection of underwater plastic waste based on improved yolov5n,” in 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC). (Qingdao, China: IEEE), 404–408. doi: 10.1109/ICFTIC57696.2022.10075134

Huo J., Liu S., Sun L., Yang L., Song Y., Li C. (2021). “Research on biological disaster early warning and decision support system of nuclear power plant,” in 2021 China Automation Congress (CAC). (Beijing, China: IEEE), 8120–8124.

Islam M. J., Edge C., Xiao Y., Luo P., Mehtaz M., Morse C., et al. (2020). “Semantic segmentation of underwater imagery: Dataset and benchmark,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). (Las Vegas, NV, United States: IEEE), 1769–1776. doi: 10.1109/IROS45743.2020.9340821

Jeyaraj S., Ramakrishnan B., Ramsankaran R. (2022). “Application of unmanned aerial vehicle (uav) in the assessment of beach volume change–a case study of malgund beach,” in OCEANS 2022-Chennai. (Chennai, India: IEEE), 1–4.

Jiang Q., Chen Y., Wang G., Ji T. (2020). A novel deep neural network for noise removal from underwater image. Signal Processing: Image Communication 87, 115921. doi: 10.1016/j.image.2020.115921

Johnson J., Alahi A., Fei-Fei L. (2016). “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14. (Amsterdam, Netherlands: Springer), 694–711.

Kabir I., Shaurya S., Maigur V., Thakurdesai N., Latnekar M., Raunak M., et al. (2023). “Few-shot segmentation and semantic segmentation for underwater imagery,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). (Detroit, MI, United States: IEEE), 11451–11457. doi: 10.1109/IROS55552.2023.10342227

Khayam S. A. (2003). The discrete cosine transform (dct): theory and application. Michigan State Univ. 114, 31.

Kingma D. P., Ba J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. doi: 10.48550/arXiv.1412.6980

Li M., Zuo W., Gu S., You J., Zhang D. (2020). Learning content-weighted deep image compression. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3446–3461. doi: 10.1109/TPAMI.2020.2983926

Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., et al. (2014). “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. (Zurich, Switzerland: Springer), 740–755.

Liu Z., Lin Y., Cao Y., Hu H., Wei Y., Zhang Z., et al. (2021b). “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV). (Montreal, QC, Canada: IEEE), 9992–10002. doi: 10.1109/ICCV48922.2021.00986

Liu Y., Shu Z., Li Y., Lin Z., Perazzi F., Kung S.-Y. (2021a). “Content-aware gan compression,” in Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. (Nashville, TN, United States: IEEE), 12156–12166. doi: 10.1109/CVPR46437.2021.01198

Liu J., Yuan F., Xue C., Jia Z., Cheng E. (2023). An efficient and robust underwater image compression scheme based on autoencoder. IEEE J. Oceanic Eng. 48, 925–945. doi: 10.1109/JOE.2023.3249243

Long J., Shelhamer E., Darrell T. (2015). “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Boston, MA, United States: IEEE), 3431–3440. doi: 10.1109/CVPR.2015.7298965

Loshchilov I., Hutter F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. doi: 10.48550/arXiv.1711.05101

Madia M., Bottaro M., Vorsi A. L., Amico M. R., Gristina M., Bizzarri S., et al. (2023). “Reducing fishery impact on benthic community: new data by the use of guarding nets from the marine protected area of egadi islands,” in 2023 IEEE International Workshop on Metrology for the Sea; Learning to Measure Sea Health Parameters (MetroSea). (La Valletta, Malta: IEEE), 94–98.

Minnen D., Ballé J., Toderici G. D. (2018). Joint autoregressive and hierarchical priors for learned image compression. Adv. Neural Inf. Process. Syst. 31. doi: 10.48550/arXiv.1809.02736

Motl J., Schulte O. (2015). The ctu prague relational learning repository. arXiv preprint arXiv:1511.03086. doi: 10.48550/arXiv.1511.03086

Nadai A. (2019). “Ocean wave measurement using sar cross-track interferometry,” in IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium. (Yokohama, Japan: IEEE), 7965–7967.

Nezla N. A., Mithun Haridas T., Supriya M. (2021). “Semantic segmentation of underwater images using unet architecture based deep convolutional encoder decoder model,” in 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS). (Coimbatore, India: IEEE), Vol. 1. 28–33. doi: 10.1109/ICACCS51430.2021.9441804

Panetta K., Gao C., Agaian S. (2015). Human-visual-system-inspired underwater image quality measures. IEEE J. Oceanic Eng. 41, 541–551. doi: 10.1109/JOE.2015.2469915

Pei Y., Huang Y., Zou Q., Zang H., Zhang X., Wang S. (2018). Effects of image degradations to cnn-based image classification. arXiv preprint arXiv:1810.05552. doi: 10.48550/arXiv.1810.05552

Pergeorelis M., Bazik M., Saponaro P., Kim J., Kambhamettu C. (2022). “Synthetic data for semantic segmentation in underwater imagery,” in OCEANS 2022, Hampton Roads. (Hampton Roads, VA, United States: IEEE), 1–6. doi: 10.1109/OCEANS47191.2022.9976962

Piao Y., Rong Z., Xu S., Zhang M., Lu H. (2020). Dut-lfsaliency: Versatile dataset and light field-to-rgb saliency detection. arXiv preprint arXiv:2012.15124. doi: 10.48550/arXiv.2012.15124

Qin X., Zhang Z., Huang C., Dehghan M., Zaiane O. R., Jagersand M. (2020). U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognit. 106, 107404. doi: 10.1016/j.patcog.2020.107404

Rabbani M., Joshi R. (2002). An overview of the jpeg 2000 still image compression standard. Signal Processing: Image Communication 17, 3–48. doi: 10.1016/S0923-5965(01)00024-8

Ranolo E., Gorro K., Ilano A., Pineda H., Sintos C., Gorro A. J. (2023). “Underwater and coastal seaweeds detection for fluorescence seaweed photos and videos using yolov3 and yolov5,” in 2023 2nd International Conference for Innovation in Technology (INOCON). (Bangalore, India: IEEE), 1–5. doi: 10.1109/INOCON57975.2023.10101342

Redmon J., Divvala S., Girshick R., Farhadi A. (2016). “You only look once: Unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (Las Vegas, NV, United States: IEEE), 779–788. doi: 10.1109/CVPR.2016.91

Ren S., He K., Girshick R., Sun J. (2017). Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149. doi: 10.1109/TPAMI.2016.2577031

Rowley J. (2018). “Autonomous unmanned surface vehicles (usv): A paradigm shift for harbor security and underwater bathymetric imaging,” in OCEANS 2018 MTS/IEEE Charleston. (Charleston, SC, United States: IEEE), 1–6.

Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L.-C. (2018). “Mobilenetv2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. (Salt Lake City, UT, United States: IEEE), 4510–4520. doi: 10.1109/CVPR.2018.00474

Simonyan K., Zisserman A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. doi: 10.48550/arXiv.1409.1556

Singh K., Rypkema N., Leonard J. (2023). “Attention-based self-supervised hierarchical semantic segmentation for underwater imagery,” in OCEANS 2023 - Limerick. (Limerick, Ireland: IEEE), 1–6. doi: 10.1109/OCEANSLimerick52467.2023.10244736

Sullivan G. J., Ohm J.-R., Han W.-J., Wiegand T. (2012). Overview of the high efficiency video coding (hevc) standard. IEEE Trans. Circuits Syst. Video Technol. 22, 1649–1668. doi: 10.1109/TCSVT.2012.2221191

Sze V., Budagavi M. (2012). High throughput cabac entropy coding in hevc. IEEE Trans. Circuits Syst. Video Technol. 22, 1778–1791. doi: 10.1109/TCSVT.2012.2221526

Tang L., Yuan J., Ma J. (2022). Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 82, 28–42. doi: 10.1016/j.inffus.2021.12.004

Thampi L., Thomas R., Kamal S., Balakrishnan A. A., Mithun Haridas T. P., Supriya M. H. (2021). “Analysis of u-net based image segmentation model on underwater images of different species of fishes,” in 2021 International Symposium on Ocean Technology (SYMPOL). (Kochi, India: IEEE), 1–5. doi: 10.1109/SYMPOL53555.2021.9689415

Tolstonogov A. Y., Shiryaev A. D. (2021). “The image semantic compression method for underwater robotic applications,” in OCEANS 2021: San Diego – Porto. (San Diego, CA, United States: IEEE), 1–9. doi: 10.23919/OCEANS44145.2021

Wallace G. K. (1991). The jpeg still picture compression standard. Commun. ACM 34, 30–44. doi: 10.1145/103085.103089

Wang N., Hou X., Ma L., Zhang H., Zhang Z. (2023a). “Research on design of advanced marine scientific survey vessel based on computer data engineering and intelligent information system,” in 2023 IEEE 3rd International Conference on Power, Electronics and Computer Applications (ICPECA). (Tokyo, Japan: IEEE), 1604–1608. doi: 10.1109/ICPECA56706.2023.10075824

Wang S., Mizuno K., Tabeta S., Kei T. (2023b). “Semantic segmentation of seafloor images in Philippines based on semi-supervised learning,” in 2023 IEEE Underwater Technology (UT). (Tokyo, Japan: IEEE), 1–4. doi: 10.1109/UT49729.2023.10103432

Wen H., Ma L., Liu L., Huang Y., Chen Z., Li R., et al. (2022). High-quality restoration image encryption using dct frequency-domain compression coding and chaos. Sci. Rep. 12, 16523. doi: 10.1038/s41598-022-20145-3

Woo S., Debnath S., Hu R., Chen X., Liu Z., Kweon I. S., et al. (2023). “Convnext v2: Co-designing and scaling convnets with masked autoencoders,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (Vancouver, BC, Canada: IEEE), 16133–16142. doi: 10.1109/CVPR52729.2023.01548

Wu L., Huang K., Shen H. (2020). “A gan-based tunable image compression system,” in Proceedings of the 2020 IEEE/CVF winter conference on applications of computer vision. (Snowmass, CO, United States: IEEE), 2334–2342.

Xu W., Matzner S. (2018). “Underwater fish detection using deep learning for water power applications,” in 2018 International Conference on Computational Science and Computational Intelligence (CSCI). (Las Vegas, NV, United States: IEEE), 313–318. doi: 10.1109/CSCI46756.2018.00067

Xu K., Qin M., Sun F., Wang Y., Chen Y.-K., Ren F. (2020). “Learning in the frequency domain,” in Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (Seattle, WA, United States: IEEE). doi: 10.1109/CVPR42600.2020

Xue H. (2023). “Dynamic integration and analysis of marine environmental monitoring data based on support vector machine,” in 2023 Asia-Europe Conference on Electronics, Data Processing and Informatics (ACEDPI). (Prague, Czechia: IEEE), 54–57.

Yang P. (2024). An imaging algorithm for high-resolution imaging sonar system. Multimedia Tools Appl. 83, 31957–31973. doi: 10.1007/s11042-023-16757-0

Zhang X., Yang P., Wang Y., Shen W., Yang J., Ye K., et al. (2024). Lbf-based cs algorithm for multireceiver sas. IEEE Geosci. Remote Sens. Lett. 21, 1–5. doi: 10.1109/LGRS.2024.3379423

Zhang A., Zhu X. (2023). “Research on ship target detection based on improved yolov5 algorithm,” in 2023 5th International Conference on Communications, Information System and Computer Engineering (CISCE). (Guangzhou, China: IEEE), 459–463. doi: 10.1109/CISCE58541.2023.10142528

Zheng B., Chen Y., Tian X., Zhou F., Liu X. (2019). Implicit dual-domain convolutional network for robust color image compression artifact reduction. IEEE Trans. Circuits Syst. Video Technol. 30, 3982–3994.

Zhou H., Men Y., Yang L., Wang J. (2023). “Design of control module for marine biogenic monitoring system in nuclear power plants,” in 2023 IEEE 16th International Conference on Electronic Measurement & Instruments (ICEMI). (Harbin, China: IEEE), 304–308.

Zhu X., Song J., Gao L., Zheng F., Shen H. T. (2022). “Unified multivariate gaussian mixture for efficient neural image compression,” in Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (New Orleans, LA, United States: IEEE), 17612–17621.

Zou R., Song C., Zhang Z. (2022). “The devil is in the details: Window-based attention for image compression,” in Proceedings of the 2022 IEEE/CVF conference on computer vision and pattern recognition. (New Orleans, LA, United States: IEEE), 17492–17501.

Keywords: underwater image compression, machine vision, frequency priors, feature fusion, deep learning

Citation: Zhang H, Fan S, Zou S, Yu Z and Zheng B (2024) Deep underwater image compression for enhanced machine vision applications. Front. Mar. Sci. 11:1411527. doi: 10.3389/fmars.2024.1411527

Received: 03 April 2024; Accepted: 24 June 2024;
Published: 15 July 2024.

Edited by:

Huiyu Zhou, University of Leicester, United Kingdom

Reviewed by:

Xuebo Zhang, Northwest Normal University, China
Bangli Liu, De Montfort University, United Kingdom

Copyright © 2024 Zhang, Fan, Zou, Yu and Zheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Zhibin Yu, yuzhibin@ouc.edu.cn; Bing Zheng, bingzh@ouc.edu.cn
