ORIGINAL RESEARCH article

Front. Artif. Intell., 27 March 2024
Sec. Machine Learning and Artificial Intelligence

Discriminative context-aware network for camouflaged object detection

  • 1Department of Computing, Atlantic Technological University, Letterkenny, Ireland
  • 2School of Computing, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Haripur, Pakistan
  • 3Department of Computer Science, Fatima Jinnah Women University, Rawalpindi, Pakistan
  • 4Computer Science Department, College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia

Introduction: Animals use camouflage (background matching, disruptive coloration, etc.) for protection, confusing predators and making detection difficult. Camouflage Object Detection (COD) tackles this challenge by identifying objects seamlessly blended into their surroundings. Existing COD techniques struggle with hidden objects due to noisy inferences inherent in natural environments. To address this, we propose the Discriminative Context-aware Network (DiCANet) for improved COD performance.

Methods: DiCANet addresses camouflage challenges through a two-stage approach. First, an adaptive restoration block intelligently learns feature weights, prioritizing informative channels and pixels. This enhances convolutional neural networks’ ability to represent diverse data and handle complex camouflage. Second, a cascaded detection module with an enlarged receptive field refines the object prediction map, achieving clear boundaries without post-processing.

Results: Without post-processing, DiCANet achieves state-of-the-art performance on challenging COD datasets (CAMO, CHAMELEON, COD10K) by generating accurate saliency maps with rich contextual details and precise boundaries.

Discussion: DiCANet tackles the challenge of identifying camouflaged objects in noisy environments with its two-stage restoration and cascaded detection approach. This innovative architecture surpasses existing methods in COD tasks, as proven by benchmark dataset experiments.

1 Introduction

Charles Darwin’s theory of evolution and natural selection underpins the study of prey camouflage patterns and of animal cognition in an ecological context. The earliest research on camouflage dates to the last century (Cott, 1940); Thayer (1918) and Cott (1940) studied the phenomenon comprehensively. Camouflage is an evolutionary concealment technique that masks objects’ location, identity, and movement in their surrounding environment. To adapt to their environment, living organisms must exhibit adaptive traits or behavioral strategies suited to it. The combination of physiological characteristics such as color, pattern, morphology, and behavior (Gleeson et al., 2018; Stevens and Ruxton, 2019) provides some survival advantage by disrupting the visual silhouette of animals or potential predators. Inspired by this important natural phenomenon, humans have attempted to replicate these patterns in many fields.

As a multidisciplinary study spanning computer science and evolutionary biology, COD has a wide range of practical applications, including wildlife preservation and animal monitoring; the arts (e.g., recreational art) (Chu et al., 2010; Ge et al., 2018); agriculture (e.g., locust detection to prevent invasion); computer vision and other vision-related areas (e.g., search-and-rescue missions in natural disasters, military target detection and surveillance systems, rare species discovery); and medical image analysis [e.g., polyp segmentation (Fan et al., 2020b) and lung infection segmentation (Fan et al., 2020c; Wu et al., 2021)], to mention a few.

There are two types of camouflaged objects: naturally camouflaged objects and artificially camouflaged objects (Stevens and Merilaita, 2009). Natural camouflage results from the coevolution of predators and prey. Figures 1A,B show disruptive coloration and background pattern matching in animals attempting to exploit predators’ visual processing and cognition. Other camouflage strategies include countershading, transparency, masquerade, and distractive markings (Galloway et al., 1802). Artificial camouflage is a predatory concealment strategy typically employed by humans, for example by military troops, vehicles, weapons, and positions in war zones (Zheng et al., 2018). In such cases, the camouflaged party first observes the environment and then blends its texture patterns to create a scene resembling that environment, deceiving the visual perception systems of potential observers, as shown in Figure 1C.

Figure 1. Natural and artificial camouflaged objects. (A,B) show natural camouflage and (C) shows artificial camouflage.

COD has gained increased attention in the computer vision community but remains under-studied due to the lack of large training datasets and of a standard benchmark such as Pascal-VOC (Everingham et al., 2015), ImageNet (Deng et al., 2009), and MS-COCO (Lin T. Y. et al., 2014).

The majority of computer vision literature is largely concerned with the detection/segmentation of non-camouflaged objects (Ren et al., 2017). Based on the detecting and segmenting viewpoint (Zhao Z. Q. et al., 2019), the objects can be divided into three categories: generic objects, salient objects, and camouflage objects. Generic object detection (GOD) is a popular direction in cognitive computer vision which aims to find common objects. They can either be salient or camouflaged. Salient object detection (SOD) aims to find attention-grabbing objects in an image, i.e., objects with pre-defined classes. There exists a vast amount of research works for both generic (Shotton et al., 2006; Liu et al., 2010; Girshick et al., 2014; Everingham et al., 2015; Girshick, 2015; Ren et al., 2015; Kirillov et al., 2019; Le et al., 2020), and salient object detection (Wang et al., 2017; Wu et al., 2019; Zhao J. X. et al., 2019; Zhao and Wu, 2019; Fan et al., 2020a; Qin et al., 2020; Waqas Zamir et al., 2021). COD aims to identify objects whose shape and outline are not easily recognizable in images, as shown in Figure 2. The high intrinsic similarities between the camouflaged objects and the background require a significant amount of visual perception knowledge, hence making COD far more challenging than the conventional salient object detection or generic object detection (Ge et al., 2018; Zhao Z. Q. et al., 2019; Zhao J. X. et al., 2019).

Figure 2. Object segmentation exemplars: (A) Given input image, (B) GOD, (C) SOD, (D) COD.

In this paper, we present a review of deep-learning object detection from a camouflaged perspective and propose a discriminative context-aware network called “DiCANet.” Owing to noisy interference in natural systems, the low-frequency distribution contains smoothly disordered data while the high-frequency details receive an unwanted approximation. These channel-wise and pixel-wise features are unevenly distributed across the camouflaged image and should be differentiated using weighted information to obtain an appropriate representation of the salient features of objects. Therefore, rather than directly assigning equal weights to the channel-wise and pixel-wise features (Woo et al., 2018), and inspired by Qin et al. (2020), we introduce an adaptive restoration block (ARB) that adaptively learns the weights of the image features and assigns different weights to them. This not only contributes to the representative ability of convolutional neural networks (CNNs) but also provides robustness against various types of information preservation. After the ARB, these features are fused in a complementary-aware manner through the fusion pipeline to generate restored camouflage images. Next, a cascaded detection module (Fan et al., 2020a) fortified with a modified receptive field block (Liu and Huang, 2018) is adopted to segment ecological signals and drive the segmentation performance on the target objects during the detection stage. Furthermore, a more refined camouflaged object prediction map is attained, with clear boundaries and an accurate saliency map in terms of contextual details.

With the above considerations, the proposed DiCANet is used to develop a good camouflaged prediction map. Our contributions can be summarized as follows: (1) we propose a discriminative context-aware network (“DiCANet”) for camouflaged object segmentation; (2) we infuse an adaptive restoration block into a bio-inspired cascaded detection block to effectively guide detection and segmentation performance. The ARB comprises three key components: (a) a feature attention block (FAB), (b) a group architecture, and (c) an attention-based feature fusion network; details of these components are discussed in subsequent sections; (3) the proposed COD model boosts performance to a new state of the art (SOTA), and experiments verify the effectiveness of our proposed method.

2 Related work

This section reviews related works in two folds: image restoration approaches and deep learning-based COD approaches.

2.1 Image restoration

Visual information captured in the real world contains undesired image content, and restoration is a position-sensitive problem that requires pixel-to-pixel correspondence between the input and the output image. To recover image content from natural images, traditional approaches showed promising reconstruction performance but suffered from computational drawbacks (Ulyanov et al., 2018). Recently, deep-learning-based restoration models have surpassed conventional approaches and achieved state-of-the-art results (Waqas Zamir et al., 2021; Zamir et al., 2022). Designing algorithms robust enough to maintain a spatially precise, high-resolution representation with strong semantic information throughout the entire network has been a challenge. Zamir et al. (2020) proposed a novel multi-scale residual block to learn enriched features for effective real-image restoration and enhancement. Despite recent major advancements, state-of-the-art methods suffer from high system complexity, making them computationally inefficient (Nah et al., 2017; Abdelhamed et al., 2018; Chu et al., 2021). To reduce the inter-block complexity of other SOTA methods, Chen et al. (2022) stacked neural networks in a UNet architecture with skip connections (Ronneberger et al., 2015), following Wang et al. (2022) and Zamir et al. (2022), to design a nonlinear activation-free framework based on CNNs rather than transformers, given the performance drawbacks reported by Liu et al. (2022) and Han et al. (2021). Qin et al. (2020) proposed a feature fusion attention network that fuses a feature attention block with an attention-based multipath local residual structure to focus on learning the weights of important spatial information and generate accurate results.

2.2 COD

Research into COD has deep roots in biology and art (Thayer, 1918; Cott, 1940), and these studies remain relevant in widening our knowledge of visual perception. The recognition of camouflaged objects has not been well explored in the literature. Early camouflage research focused on detecting the foreground region even when the foreground texture resembled that of the background (Galun et al., 2003; Song and Geng, 2010; Xue et al., 2016). Based on cues such as color, shape, intensity, edge, and orientation, these works distinguished the foreground from the background. To address camouflage detection, a few techniques based on hand-crafted features such as texture (Sengottuvelan et al., 2008; Pan et al., 2011; Liu et al., 2012) and motion (Hou, 2011; Le et al., 2019) were put forth. However, due to the high similarity between foreground and background, none of these approaches performs well in real application scenarios for segmenting camouflaged objects; they are only effective for simple, non-uniform backgrounds. Despite the numerous CNN-based object detection models available, unique designs are required to build models for COD. In contrast to pixel-level segmentation, GOD detects objects with bounding boxes. Furthermore, segmentation in COD is based on saliency from a human perspective, not semantics, which is not modeled in GOD models. On the other hand, models designed for SOD are unable to effectively detect concealed objects: SOD models perform non-semantic segmentation and model saliency, yet they do not specialize in finding the indefinite boundaries of camouflaged objects, as salient objects tend to be of obvious human interest. Researchers have proposed several feasible methods for COD.

Recently, Le et al. (2019) proposed an end-to-end network for segmenting camouflaged objects by integrating classification into the segmentation framework. Lamdouar et al. (2020) and Zhu et al. (2021) proposed novel approaches based on the assumption that camouflaged objects exist in an image, which is not always practical in the real world. To better simulate the real world, Le et al. (2021) proposed camouflaged instance segmentation without assuming that camouflaged objects exist in an image. Following the same motivation, Fan et al. (2020a) proposed the Search Identification Network (SINet), comprising two modules, namely a search module and an identification module, where the former searches whether potential prey exists while the latter identifies the target animal. The SINet framework leverages a modified receptive field block (Liu and Huang, 2018) to search for camouflaged object regions. Aside from their COD model, Fan et al. (2020a) also presented a large COD dataset, called COD10K, which progressed COD research to a new level in the field of computer vision. Similarly, Dong et al. (2021) proposed the MCIF-Net framework, which integrates a large receptive field and an effective feature aggregation strategy into a unified framework to extract rich context features for accurate COD. In addition, recent studies such as the notable works of Hussain et al. (2021), Qadeer et al. (2022), and Naqvi et al. (2023) contribute to the understanding of object detection, tracking, and recognition in various contexts, enhancing the breadth and depth of the related literature. Despite research devoted to the challenges of COD and its outstanding accuracy gains, existing deep learning-based COD methods suffer major limitations, such as weak boundaries (i.e., edges), low boundary contrast, and variations in object appearance (e.g., object size and shape), leading to unsatisfactory segmentation performance (Fan et al., 2020a; Mei et al., 2021; Ji et al., 2022) and raising the demand for more advanced feature fusion strategies.

Biological studies (Stevens and Merilaita, 2009; Merilaita et al., 2017; Rida et al., 2020) have shown that deliberately hidden targets cause noisier inferences in the visual perception system, which contributes to object concealment. This is a common phenomenon in nature, and finding ecologically relevant signals hidden in extreme situations becomes a challenge. Moreover, without precise control of the feature fusion process, detectors are vulnerable to significant interference from low-frequency details, which causes vague object boundaries and misjudgment in extreme situations. Inspired by this real-world phenomenon, this paper aims to design a novel baseline model that balances the accuracy and efficiency of COD by adaptively exploiting semantic and spatial information to obtain plausible final context-aware camouflage prediction maps with refined edge boundaries.

3 Materials and methods

3.1 Motivation and proposed framework

The term “survival of the fittest” was conceptualized in Charles Darwin’s theory of evolution (Flannelly, 2017). The survival of numerous species in the wild depends on adaptation; hunting across a wide variety of ecosystems is essential for organisms to thrive in their environment. Motivated by the first two stages of predation in nature, i.e., search (a sensory mechanism) and identification, the DiCANet framework is proposed. A simplified version of the proposed framework is shown in Figure 3, and details of each component are discussed in subsequent sections.

Figure 3. Proposed DiCANet architecture.

3.2 Camouflaged image

The art of camouflage hinges on manipulating an object’s visual appearance to blend into its surroundings. At the heart of this strategy is the concept of pixel similarity. Digital images, including those used in camouflage analysis, are represented by pixels, small elements of varying values that collectively form the image. In the context of input camouflaged images, pixel similarity measures how closely the pixels of objects in the camouflaged image match their surroundings in terms of color, visual pattern, surface variation, and intensity (Talas et al., 2017). The more similar the pixels of the camouflaged object are to those of its intended background (Figure 3), the more effective the camouflage and the harder it is for observers to spot detectable features of the concealed object. Conversely, any detectable discrepancies in pixel similarity will reveal the presence of the hidden object, undermining the effectiveness of the camouflage. By analyzing these features and strategically manipulating the pixel attributes of a camouflaged object, we propose an effective context-aware network for camouflaged object detection.

3.3 Adaptive restoration block (ARB)

To restore concealed images, redundant information unevenly distributed across a real-world image should be adaptively bypassed while robustly allowing the network architecture to focus on more effective information. The ARB framework’s internal block contains several key elements, including (a) the feature attention block (FAB), (b) the attention-based basic block structure, and (c) the feature fusion framework. A detailed framework is shown in Figure 4.

Figure 4. Adaptive restoration block architecture.

Given a 3D real-world camouflaged input image $I_c \in \mathbb{R}^{H \times W \times C_{in}}$, where $H$, $W$, and $C_{in}$ denote the spatial dimensions and the number of input channels, respectively, a $3 \times 3$ convolution $H_{SF}$ is applied to map the input camouflaged image into a higher-dimensional feature space and extract shallow features with edge information $F_{sf} \in \mathbb{R}^{H \times W \times C}$, formulated as:

$F_{sf} = H_{SF}(I_c)$    (1)

Deep features $F_{df} \in \mathbb{R}^{H \times W \times C}$ are then extracted from $F_{sf}$ as:

$F_{df} = H_{DF}(F_{sf})$    (2)

where $H_{DF}$ is the deep feature extraction module, which contains $K$ residual group architecture blocks (RGABs) with multiple skip connections. More specifically, intermediate features $F_1, F_2, \ldots, F_K$ and the output deep features $F_{DF}$ are extracted block by block as:

$F_i = H_{RGAB_i}(F_{i-1}), \; i = 1, 2, \ldots, K; \qquad F_{DF} = H_{CONV}(F_K)$    (3)

where $H_{RGAB_i}$ denotes the $i$-th RGAB and $H_{CONV}$ is the last convolutional layer, which introduces the inductive bias of the convolution operation into the network and sets the stage for shallow and deep feature aggregation.
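To make the data flow of Equations 1–3 concrete, the following PyTorch sketch chains a shallow $3 \times 3$ convolution, $K$ residual groups, and a final convolution. The `ResidualGroup` class is only a hypothetical stand-in for the RGAB (whose internal attention blocks are described in the following sections); the channel width of 64 follows the filter count reported in Section 4, and everything else is illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResidualGroup(nn.Module):
    """Hypothetical stand-in for one RGAB: a small residual stack of convolutions."""
    def __init__(self, channels: int, num_convs: int = 2):
        super().__init__()
        layers = []
        for _ in range(num_convs):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)  # skip connection around the group

class ARBFeatureExtractor(nn.Module):
    """Equations (1)-(3): shallow 3x3 conv H_SF, K RGABs, and a last conv H_CONV."""
    def __init__(self, in_channels: int = 3, channels: int = 64, num_groups: int = 3):
        super().__init__()
        self.h_sf = nn.Conv2d(in_channels, channels, 3, padding=1)                        # H_SF
        self.groups = nn.ModuleList([ResidualGroup(channels) for _ in range(num_groups)])  # H_RGAB_i
        self.h_conv = nn.Conv2d(channels, channels, 3, padding=1)                         # H_CONV

    def forward(self, image):
        f_sf = self.h_sf(image)       # Eq. (1): shallow features with edge information
        f = f_sf
        for group in self.groups:     # Eq. (3): intermediate features F_1 ... F_K
            f = group(f)
        f_df = self.h_conv(f)         # Eq. (3): deep features F_DF
        return f_sf, f_df
```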

3.4 Feature attention block (FAB)

To improve model representation, attention mechanisms have been introduced inside CNNs (Zhang et al., 2018; Dai et al., 2019; Niu et al., 2020). Many image restoration networks treat channel-level and pixel-level features equally, making them incapable of efficiently handling images with uneven low- and high-frequency distributions. Realistically, redundant information is unevenly distributed across images, and the weight assigned to unwanted pixels should differ for each channel-wise and pixel-wise feature. In the attention block, features are learned via a dynamic mechanism that enables the model to concentrate on different segments of the input data, highlighting pertinent features and attenuating or suppressing irrelevant ones. This is typically realized by computing attention weights, which signify the significance or relevance of the various input features. This adaptive learning approach gives the network hierarchy additional flexibility in dealing with different types of information. The feature attention block consists of a residual block with channel attention (RB-CA) and a residual block with pixel attention (RB-PA), as shown in Figure 5. The former ensures that different channel features carry different weighted information (He et al., 2010), while the latter attentively focuses on informative features in high-frequency pixel regions.

Figure 5. Feature attention block. (A) Channel attention (CA). (B) Pixel attention (PA).

3.4.1 Channel attention (CA)

To achieve channel-wise weighting for each channel in feature maps, global average pooling (GAP) was employed before feeding the data into fully connected layers for classification tasks. The concept of GAP in CNNs focuses on each feature map (channel) and aggregates information across the entire spatial extent of the feature maps, resulting in a single value per channel (Lin M. et al., 2014; Forrest, 2016; Hu et al., 2018; Machine Learning Mastery, 2019). The 1D vector (channel descriptors) obtained from GAP can then be used in subsequent calculations to extract meaningful features from the image. The mathematical expression detailing how channel descriptors achieve weighted information is as follows:

$g_c = H_p(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$    (4)

where $H_p$ represents the global pooling function, $F_c$ the input, and $X_c(i, j)$ the value of the $c$-th channel $X_c$ at spatial position $(i, j)$. The shape of the feature map changes from $C \times H \times W$ to $C \times 1 \times 1$, i.e., the spatial extent $H \times W$ is collapsed. These channel descriptors are then fed through two convolution layers, with a ReLU activation after the first and a computationally efficient sigmoid after the second (Figure 5A), to produce the weights of the different channels, formulated as follows:

$CA_c = \sigma(Conv(\delta(Conv(g_c))))$    (5)

where $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively. By elementwise multiplication of the input $F_c$ and the channel weights $CA_c$, the output of the channel attention $F_c^{*}$ is obtained as follows:

$F_c^{*} = CA_c \otimes F_c$    (6)
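A minimal PyTorch sketch of Equations 4–6 is given below. The two $1 \times 1$ convolutions and the channel-reduction ratio of 8 are assumptions carried over from typical channel-attention designs such as FFA-Net (Qin et al., 2020); the paper itself only fixes the $1 \times 1$ kernel size (Section 4).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Equations (4)-(6): GAP -> conv -> ReLU -> conv -> sigmoid -> channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)              # Eq. (4): C x H x W -> C x 1 x 1
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                               # Eq. (5): channel weights CA_c in (0, 1)
        )

    def forward(self, f_c):
        ca = self.weights(self.gap(f_c))                # per-channel weights, N x C x 1 x 1
        return f_c * ca                                 # Eq. (6): elementwise re-weighting
```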

3.4.2 Pixel attention (PA)

To capture fine-grained details of spatial context, pixel attention (PA) mechanisms actively focus on specific pixels within the entire spatial extent of the feature maps. The concept of attention mechanisms in CNNs, including those that focus on pixel-level details, has been explored in various studies (e.g., Ismail Fawaz et al., 2019; Dosovitskiy et al., 2020). Inspired by CA (Hu et al., 2018) and spatial attention (SA) (Woo et al., 2018), PA is used to improve the feature representation capacity and obtain images with clear object boundaries. Comparable to CA, as shown in Figure 5B, the input $F_c^{*}$ (i.e., the output of the channel attention block) is fed through two convolution layers with ReLU and sigmoid activation functions. The shape of the feature map changes from $C \times H \times W$ to $1 \times H \times W$:

$PA = \sigma(Conv(\delta(Conv(F_c^{*}))))$    (7)

Recall that activation maps are often passed elementwise through an activation function such as ReLU. By elementwise multiplication of $F_c^{*}$ and $PA$, the feature attention block (FAB) output $\tilde{F}$ is given by:

$\tilde{F} = F_c^{*} \otimes PA$    (8)
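Continuing the sketch above (and reusing its `ChannelAttention` class), pixel attention and the full FAB can be written as follows. The internal reduction ratio is again an assumption, not a value taken from the paper.

```python
import torch.nn as nn

class PixelAttention(nn.Module):
    """Equation (7): two convolutions producing a 1 x H x W spatial weight map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_c_star):
        pa = self.weights(f_c_star)        # spatial weights, N x 1 x H x W
        return f_c_star * pa               # Eq. (8): elementwise spatial re-weighting

class FeatureAttentionBlock(nn.Module):
    """FAB (Figure 5): channel attention followed by pixel attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)   # from the previous sketch
        self.pa = PixelAttention(channels)

    def forward(self, f):
        return self.pa(self.ca(f))             # F~ = PA(CA(F))
```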

Integrating Channel Attention and Pixel Attention within CNNs empowers the network to learn both the overall image context and the finer details of specific regions simultaneously. This leads to stronger and more informative feature representations, improving the network’s ability to distinguish objects. Recent research (e.g., Hu et al., 2018; Ismail Fawaz et al., 2019; Dosovitskiy et al., 2020) has explored this combined approach to enhance CNN performance in various computer vision tasks like image classification, object detection, and semantic segmentation.

3.5 Basic block structure (BBS)

The performance of neural networks has improved significantly since attention mechanisms (Xu et al., 2015; Vaswani et al., 2017; Wang et al., 2018) and residual connections (He et al., 2016) were introduced to train deep networks. The design of the BBS $B_i$ builds on the combination of these concepts. As shown in Figure 6, a BBS consists of a multiple local residual learning (LRL) skip-connection block and a FAB. Local residual learning permits low-frequency details to be bypassed through multiple local skip connections, allowing the main network to learn discriminatively useful information. The combination of several basic block structures with skip connections increases the depth and capability of the ARB and helps overcome training challenges.
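A sketch of one basic block structure follows, reusing the `FeatureAttentionBlock` defined above. The exact number of convolutions inside the local residual branch is not specified in the text, so the two-convolution layout here is an assumption.

```python
import torch.nn as nn

class BasicBlockStructure(nn.Module):
    """One BBS B_i (Figure 6): local residual learning followed by a FAB."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.fab = FeatureAttentionBlock(channels)   # from the previous sketch

    def forward(self, x):
        res = self.conv2(self.relu(self.conv1(x)))
        res = res + x                    # local residual learning: bypass low-frequency details
        return self.fab(res)             # re-weight channels and pixels
```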

Figure 6. Basic block structure.

A two-layer convolutional network is implemented at the end of the ARB network (as shown in Figure 4), and a long-skip-connection global residual learning module is employed as a recovery strategy to restore the input camouflaged image.

3.6 Feature fusion attention strategy

Shallow feature information can be difficult to retain as the network gets deeper. U-Net (Ronneberger et al., 2015) and other networks strive to fuse shallow and deep features from different levels. As depicted in Figure 4, the feature maps produced by the $G$ group architectures are concatenated in the channel direction. Following the FAB weighting strategy, the retained low-level features with edge information in the shallow layers, which preserve the spatial details needed to establish object boundaries, are fed into the deep layers, allowing the ARB network (ARB-Net) to focus more on semantic information such as high-frequency textures and improve hidden-object scene visibility in real-world scenarios.
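The fusion step can be sketched as below, again reusing `FeatureAttentionBlock`. Concatenating the group outputs and fusing them back with a $1 \times 1$ convolution is one plausible reading of Figure 4; the exact fusion layer used by the authors is not specified here.

```python
import torch
import torch.nn as nn

class FeatureFusionAttention(nn.Module):
    """Concatenate the G group outputs channel-wise, weight them with a FAB, and fuse."""
    def __init__(self, channels: int, num_groups: int = 3):
        super().__init__()
        self.fab = FeatureAttentionBlock(channels * num_groups)
        self.fuse = nn.Conv2d(channels * num_groups, channels, kernel_size=1)

    def forward(self, group_features):               # list of G tensors, each N x C x H x W
        fused = torch.cat(group_features, dim=1)     # concatenation in the channel direction
        fused = self.fab(fused)                      # attention-based weighting of all levels
        return self.fuse(fused)                      # back to the working channel width
```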

3.7 Loss function

According to Lim et al. (2017), training with L1 loss often outperformed training with L2 loss for image restoration tasks. Following the same strategy, we adopted L1 loss as our default loss function for training the ARB-Net. The total loss function L is:

$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\lVert I_{gt}^{i} - ARB(I_c^{i}) \right\rVert_1$    (9)

where $\Theta$ represents the ARB-Net parameters, $I_{gt}^{i}$ the ground truth, and $I_c^{i}$ the real-world camouflaged input image. The proposed ARB-Net extends the hyperparameters detailed in Qin et al. (2020), encompassing vital parameters such as image size, learning rate, optimizer, batch size, and loss function. The configuration of the adaptive restoration block (ARB) was selected through a systematic approach combining experimentation, domain knowledge, and optimization techniques. Leveraging our understanding of camouflaged object detection and image restoration, we fine-tuned the hyperparameters to meet the demands of the task and, through iterative adjustments and rigorous validation on test data, identified the most effective configuration for the ARB. This ensures that the ARB-Net is finely tuned for camouflaged object detection, enhancing its performance and applicability in real-world scenarios.
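A minimal training step with the Equation 9 objective might look as follows; PyTorch's `nn.L1Loss` averages over pixels as well as images, which only rescales the objective by a constant. The function and variable names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn

l1_criterion = nn.L1Loss()   # mean absolute error over the batch and all pixels

def arb_training_step(arb_net, optimizer, camouflaged_imgs, ground_truth_imgs):
    """One optimization step of ARB-Net with the L1 restoration loss of Eq. (9)."""
    optimizer.zero_grad()
    restored = arb_net(camouflaged_imgs)                 # ARB(I_c)
    loss = l1_criterion(restored, ground_truth_imgs)     # || I_gt - ARB(I_c) ||_1, averaged
    loss.backward()
    optimizer.step()
    return loss.item()
```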

3.8 Cascaded detection block

3.8.1 Sensory module (SM)

According to a neuroscience study by Langley et al. (1996), when prey indiscriminately hides in the background, selective search attention (Riley and Roitblat, 2018) plays a significant role in the predatory sensory mechanism by reducing non-prey details, thus saving computational time. To take advantage of this sensory mechanism, search attention is used in the initial feature learning to select and aggregate semantic features from the restored camouflaged image $I_{ARB}$ produced in the previous section.

Given an input image $I_{ARB} \in \mathbb{R}^{W \times H \times 3}$ (the output of the ARB), a set of features $\{f_k\}$, $k \in \{1, 2, 3, 4, 5\}$, is extracted from the ResNet-50 (He et al., 2016) backbone architecture. The resolution of each feature $f_k$ is $\frac{H}{c_k} \times \frac{W}{c_k}$ with $c_k \in \{4, 4, 8, 16, 32\}$. Lin et al. (2017) demonstrated that high-level features in deep layers keep the semantic information needed for finding objects, whereas low-level features in shallow layers preserve the spatial details needed for establishing object boundaries. Based on this property of neural networks, the extracted features are categorized as low-level $\{X_0, X_1\}$, intermediate-level $X_2$, and high-level $\{X_3, X_4\}$ features, which are then fused through concatenation, up-sampling, and down-sampling operations. A dense convolutional network strategy (Huang et al., 2017) is leveraged to preserve more information from the different layers, and a modified receptive field block (Liu and Huang, 2018) is then used to enlarge the receptive field and output a set of enhanced features.
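The five-level feature pyramid can be obtained from a standard torchvision ResNet-50 as sketched below; splitting the backbone into stem, layer1, ..., layer4 yields strides of {4, 4, 8, 16, 32}, matching the resolutions stated above. Loading pretrained ImageNet weights and the subsequent dense aggregation and receptive field block are omitted here.

```python
import torch.nn as nn
import torchvision

class SensoryBackbone(nn.Module):
    """ResNet-50 feature levels X_0..X_4 at strides {4, 4, 8, 16, 32}."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)   # load ImageNet weights in practice
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)   # -> H/4, 64 channels
        self.layer1, self.layer2 = r.layer1, r.layer2                  # -> H/4 (256), H/8 (512)
        self.layer3, self.layer4 = r.layer3, r.layer4                  # -> H/16 (1024), H/32 (2048)

    def forward(self, i_arb):          # restored camouflaged image, N x 3 x H x W
        x0 = self.stem(i_arb)          # low-level
        x1 = self.layer1(x0)           # low-level
        x2 = self.layer2(x1)           # intermediate-level
        x3 = self.layer3(x2)           # high-level
        x4 = self.layer4(x3)           # high-level
        return [x0, x1, x2, x3, x4]
```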

3.8.2 Identification module (IM)

In the identification module, disguised objects are precisely identified using the output features obtained from the sensory module. Following the identification network of Fan et al. (2020a), our final context-aware camouflaged object prediction maps with refined boundaries are generated.

4 Results

To demonstrate the generality of the newly proposed DiCANet COD model, the ARB-Net goes through a fine-tuning stage with different key network parameters and is trained on local image patches to perform restoration in more complex background scenarios. For optimal results that preserve the camouflaged object’s latent spectral content and structural details, the number of group structures $G$ and the number of basic block structures $B$ per group are set to 3 and 19, respectively, in the ARB. The filter size of all convolution layers is set to $3 \times 3$, except in channel attention, whose kernel size is $1 \times 1$. Additionally, all feature maps maintain a fixed size except within the channel attention module. Each group structure outputs 64 filters.

5 Discussion

5.1 Experimental settings

5.1.1 Training/Testing details

ARB-Net builds on the same training settings as Qin et al. (2020), and the CDB follows the same hyperparameter configurations as Fan et al. (2020a). We evaluate the DiCANet model on the whole CHAMELEON dataset (Skurowski et al., 2018) and on the test sets of CAMO (Le et al., 2019) and COD10K (Fan et al., 2020a). The entire experiment was executed on a 2.2 GHz dual-core Intel Core i7 CPU with 8 GB of RAM, using Google Colab as the working interface. Evaluation metrics: we adopt four benchmark metrics to evaluate the performance of the DiCANet model, namely the S-measure (Fan et al., 2017), mean E-measure (Fan et al., 2018), weighted F-measure (Margolin et al., 2014), and mean absolute error (MAE).
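Of the four metrics, the mean absolute error is the simplest to state explicitly: it is the average per-pixel difference between the predicted map and the binary ground truth, both scaled to [0, 1]. The other three measures follow the cited papers. A small sketch:

```python
import numpy as np

def mean_absolute_error(pred_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """MAE between a predicted camouflage map and its ground-truth mask, both in [0, 1]."""
    pred = pred_map.astype(np.float64)
    gt = gt_mask.astype(np.float64)
    return float(np.abs(pred - gt).mean())
```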

5.2 Baseline models

To demonstrate the robustness of DiCANet, this research selected 13 strong baseline methods that adopt ResNet-50 (He et al., 2016) as the backbone network for feature extraction and achieved SOTA performance in related fields, namely GOD and SOD: object detection, FPN (Lin et al., 2017); semantic segmentation, PSPNet (Zhao et al., 2017); instance segmentation, Mask R-CNN (He et al., 2017), HTC (Chen et al., 2019), and MSRCNN (Huang et al., 2019); medical image segmentation, UNet++ (Zhou et al., 2018) and PraNet (Fan et al., 2020b); salient object detection, PiCANet (Liu et al., 2018), BASNet (Qin et al., 2019), CPD (Wu et al., 2019), PFANet (Zhao and Wu, 2019), and EGNet (Zhao J. X. et al., 2019); and camouflaged object segmentation, SINet (Fan et al., 2020a).

5.3 Quantitative comparison

Table 1 summarizes the quantitative results of different baselines on three standard COD datasets. The proposed model achieved the highest values for the evaluation metrics, which indicates superior performance.

Table 1. Quantitative comparison in terms of $S_\alpha$, $E_\phi$, $F_\beta^{\omega}$, and $M$ on three benchmark COD datasets (Fan et al., 2020a).

For the CAMO dataset, comparing the DiCANet model with the top two performing baselines, PraNet and SINet, the proposed method improves by 0.003 and 0.009, respectively, in terms of $M$, and by 0.057 and 0.041, respectively, in terms of $E_\phi$ and $F_\beta^{\omega}$. Although DiCANet achieves a lower structural similarity score $S_\alpha$, accurate predictions with well-preserved edge details and clear boundaries are still achieved. Similarly, when compared with the edge-boundary models, e.g., EGNet and PFANet, our DiCANet improves $E_\phi$ and $F_\beta^{\omega}$ by (0.08 and 0.103) and (0.302 and 0.427), respectively, while drastically reducing the MAE by 0.016 and 0.110 on the CHAMELEON dataset, and it achieves a significant improvement in $S_\alpha$ of 0.011 over the best model, PraNet. Interestingly, on the most challenging dataset, COD10K, DiCANet outperforms the competition in prediction accuracy for all metrics and boosts performance to a new SOTA.

5.4 Qualitative comparison

Figure 7 shows a qualitative comparison of the camouflaged prediction maps of DiCANet against the top four cutting-edge models. Rows 1 and 2 (top to bottom) are examples from the CHAMELEON dataset; row 3 shows examples from the CAMO dataset; row 4 is an example from the COD10K super-class amphibious. It is evident that DiCANet outperforms all competing models and provides the prediction closest to the ground truth (best viewed when zoomed in).

Figure 7. Camouflaged objects segmentation results. (A) Image, (B) GT, (C) DiCANet, (D) SINet, (E) PraNet, (F) EGNet, (G) CPD.

The compared methods consistently include non-camouflaged regions in their results while neglecting some details of the camouflaged objects; they detect disguised objects inaccurately and provide unreliable visual results. In contrast, the proposed model demonstrates excellent performance in locating concealed objects accurately, with rich, fine details in the predictions and clear boundaries. Additionally, our method captures object boundaries well thanks to the ARB’s adaptive weighting mechanism and feature fusion strategy.

5.4.1 Failure case

Despite achieving satisfactory quantitative performance and setting a record in the COD task, the proposed DiCANet framework exhibits limitations in specific scenarios as shown in Figure 8. When dealing with multiple camouflaged objects grouped closely together (row 1), DiCANet might struggle to accurately predict the number of objects. This limitation can be attributed to the network’s limited prior knowledge in handling scenes with a specific number of objects. The complicated topological structures (row 2) with dense details can also pose challenges for DiCANet due to background complexity distraction. This complexity overwhelms the attention mechanisms, diverting focus from the camouflaged objects. Additionally, the intricate details in the background could share similar features with the camouflage patterns, making it difficult to distinguish the camouflaged object from its surroundings. These limitations provide valuable insights and potential areas for future investigation. By tackling these challenges and exploring novel approaches, researchers can create more resilient COD systems capable of managing even the most intricate and challenging scenarios.

Figure 8. Failure cases of our DiCANet. (A) Images, (B) GT, (C) Ours, (D) SINet.

5.4.2 Ablation study

To further demonstrate the superiority of the DiCANet architecture over previous state-of-the-art methods, we conducted an ablation study on challenging camouflage scenarios (Figure 9). We observe that DiCANet consistently shows distinctive detection and segmentation of concealed objects in challenging natural scenarios, such as partial occlusion (1st row), weak object/background contrast (2nd row), and strong background descriptors (3rd row). Meanwhile, the structural similarity $S_\alpha$ scores (in red) of DiCANet are much higher, with minimal error, than those of the competitors, which further demonstrates the superiority of our method. We can also clearly see that the combination of the proposed adaptive ARB-Net and the feature fusion attention strategy has significantly elevated our results.

Figure 9. Visual comparison with top three baselines on COD10K S/M.

6 Conclusion

This paper presents the Discriminative Context-Aware Network (DiCANet), a novel joint learning framework for detecting concealed objects with refined edges. The proposed model leverages two key components: the ARB-Net and the CDB. To improve camouflage scene visibility, the ARB-Net adaptively generates different attention weights for each channel-wise and pixel-wise feature and strategically fuses the feature maps to expand the discriminative power and representative ability of the convolutional networks. To drive camouflaged object localization and segmentation performance, we employ the CDB module. Based on the ARB and CDB modules, a context-aware network is proposed that pays more attention to local contextual information to evaluate the objectness of the camouflage prediction map. Extensive experiments show that mining distinctive information can overcome the difficulties of both SOD and COD tasks with superior performance; thus, DiCANet outperforms SOTA methods under the commonly used evaluation metrics and deserves further exploration in other related computer vision tasks.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding authors.

Author contributions

CI: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. NM: Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. NB: Conceptualization, Data curation, Formal analysis, Investigation, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing. SA: Data curation, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing. FE: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

The authors would like to thank Atlantic Technological University, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Fatima Jinnah Women University, and Saudi Electronic University.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abdelhamed, A., Lin, S., and Brown, M. S. (2018). A high-quality denoising dataset for smartphone cameras, in “Proceedings of the IEEE conference on computer vision and pattern recognition”, pp. 1692–1700.

Chen, L., Chu, X., Zhang, X., and Sun, J. (2022). Simple baselines for image restoration. arXiv. doi: 10.48550/arXiv.2204.04676

Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., et al. (2019). Hybrid task cascade for instance segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4974–4983.

Chu, X., Chen, L., Chen, C., and Lu, X. (2021). Revisiting global statistics aggregation for improving image restoration. arXiv

Chu, H. K., Hsu, W. H., Mitra, N. J., Cohen-Or, D., Wong, T. T., and Lee, T. Y. (2010). Camouflage images. ACM Trans. Graph. 29:1. doi: 10.1145/1833351.1778788

Cott, H. B. (1940). Adaptive coloration in animals. Methuen, London.

Dai, T., Cai, J., Zhang, Y., Xia, S. T., and Zhang, L. (2019). Second-order attention network for single image super-resolution. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11065–11074.

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei, L. (2009). Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp. 248–255.

Dong, B., Zhuge, M., Wang, Y., Bi, H., and Chen, G. (2021). Accurate camouflaged object detection via mixture convolution and interactive fusion. arXiv. doi: 10.48550/arXiv.2101.05687

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2020). An image is worth 16x16 words:Transformers for image recognition at scale. arXiv. doi: 10.48550/arXiv.2010.11929

Everingham, M., Eslami, S. M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2015). The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136. doi: 10.1007/s11263-014-0733-5

Fan, D. P., Cheng, M. M., Liu, Y., Li, T., and Borji, A. Structure-measure: a new way to evaluate foreground maps. Proceedings of the IEEE international conference on computer vision. (2017), pp. 4548–4557.

Fan, D. P., Gong, C., Cao, Y., Ren, B., Cheng, M. M., and Borji, A. (2018). Enhanced-alignment measure for binary foreground map evaluation. arXiv. doi: 10.48550/arXiv.1805.10421

Fan, D. P., Ji, G. P., Sun, G., Cheng, M. M., Shen, J., and Shao, L. (2020a). Camouflaged object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2777–2787.

Fan, D. P., Ji, G. P., Zhou, T., Chen, G., Fu, H., Shen, J., et al. (2020b). “Pranet: Parallel reverse attention network for polyp segmentation” in International conference on medical image computing and computer-assisted intervention (Cham: Springer), 263–273.

Fan, D. P., Zhou, T., Ji, G. P., Zhou, Y., Chen, G., Fu, H., et al. (2020c). INF-net: automatic COVID-19 lung infection segmentation from ct images. IEEE Trans. Med. Imaging 39, 2626–2637. doi: 10.1109/TMI.2020.2996645

Flannelly, K. J. (2017). Religious beliefs, evolutionary psychiatry, and mental health in America. New York, NY: Springer.

Forrest, N. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters. arXiv. doi: 10.48550/arXiv.1602.07360

Galloway, J. A., Green, S. D., Stevens, M., and Kelley, L. A. (1802). Finding a signal hidden among noise: how can predators overcome camouflage strategies? Philos. Trans. R. Soc. B 2020:20190478.

Galun, M., Sharon, E., Basri, R., and Brandt, A. (2003). Texture segmentation by multiscale aggregation of filter responses and shape elements. ICCV 3:716. doi: 10.1109/ICCV.2003.1238418

Ge, S., Jin, X., Ye, Q., Luo, Z., and Li, Q. (2018). Image editing by object-aware optimal boundary searching and mixed-domain composition. Comput. Vis. Media 4, 71–82. doi: 10.1007/s41095-017-0102-8

Girshick, R. (2015). “Fast r-cnn” in Proceedings of the IEEE international conference on computer vision, 1440–1448.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. (2014), pp. 580–587.

Gleeson, P., Lung, D., Grosu, R., Hasani, R., and Larson, S. D. (2018). c302: a multiscale framework for modelling the nervous system of Caenorhabditis elegans. Philos. Trans. Royal Soc. B 373:20170379. doi: 10.1098/rstb.2017.0379

Han, Q., Fan, Z., Dai, Q., Sun, L., Cheng, M. M., Liu, J., et al. (2021). Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv 2.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.

He, K., Sun, J., and Tang, X. (2010). Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2341–2353. doi: 10.1109/TPAMI.2010.168

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

Hou, J. Y. Y. H. W. (2011). Detection of the mobile object with camouflage color under dynamic background based on optical flow. Procedia Eng. 15, 2201–2205. doi: 10.1016/j.proeng.2011.08.412

Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141.

Huang, Z., Huang, C., and Wang, X. (2019). Mask scoring R-CNN. CVPR, 6409–6418. doi: 10.48550/arXiv.1903.00241

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.

Hussain, N., Khan, M. A., Kadry, S., Tariq, U., Mostafa, R. R., Choi, J. I., et al. (2021). Intelligent deep learning and improved whale optimization algorithm based framework for object recognition. Hum. Cent. Comput. Inf. Sci 11:2021. doi: 10.22967/HCIS.2021.11.034

Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., and Muller, P. A. (2019). Deep learning for time series classification: a review. Data Min. Knowl. Disc. 33, 917–963. doi: 10.1007/s10618-019-00619-1

Ji, G. P., Zhu, L., Zhuge, M., and Fu, K. (2022). Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recogn. 123:108414. doi: 10.1016/j.patcog.2021.108414

Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. (2019). Panoptic segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9404–9413.

Lamdouar, H., Yang, C., Xie, W., and Zisserman, A. (2020). Betrayed by motion: Camouflaged object discovery via motion segmentation. Proceedings of the Asian Conference on Computer Vision.

Langley, C. M., Riley, D. A., Bond, A. B., and Goel, N. (1996). Visual search for natural grains in pigeons (Columba livia): search images and selective attention. J. Exp. Psychol. Anim. Behav. Process. 22, 139–151.

Le, T. N., Cao, Y., Nguyen, T. C., Le, M. Q., Nguyen, K. D., Do, T. T., et al. (2021). Camouflaged instance segmentation in-the-wild: dataset, method, and benchmark suite. IEEE Trans. Image Process. 31, 287–300. doi: 10.1109/TIP.2021.3130490

Le, T. N., Nguyen, T. V., Nie, Z., Tran, M. T., and Sugimoto, A. (2019). Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 184, 45–56. doi: 10.1016/j.cviu.2019.04.006

Le, T. N., Ono, S., Sugimoto, A., and Kawasaki, H. (2020). Attention R-CNN for accident detection. In 2020 IEEE intelligent vehicles symposium (IV), pp. 313–320.

Lim, B., Son, S., Kim, H., Nah, S., and Lee, M., (2017). Enhanced deep residual networks for single image super-resolution. Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144.

Lin, M., Chen, Q., and Yan, S. (2014). Network in Network. arXiv. doi: 10.48550/arXiv.1312.4400

Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.

Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). “Microsoft coco: Common objects in context” in European conference on computer vision (Cham: Springer), 740–755.

Liu, N., Han, J., and Yang, M. H. (2018). Picanet: learning pixel-wise contextual attention for saliency detection. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3089–3098.

Liu, S., and Huang, D. (2018). Receptive field block net for accurate and fast object detection. In Proceedings of the European conference on computer vision, pp. 385–400.

Liu, Z., Huang, K., and Tan, T. (2012). Foreground object detection using top-down information based on EM framework. IEEE Trans. Image Process. 21, 4204–4217. doi: 10.1109/TIP.2012.2200492

Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11976–11986.

Liu, C., Yuen, J., and Torralba, A. (2010). Sift flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33, 978–994. doi: 10.1109/TPAMI.2010.147

Machine Learning Mastery. (2019) A Gentle Introduction to Pooling Layers for Convolutional Neural Networks. Available at: https://machinelearningmastery.com/crash-course-convolutional-neural-networks/

Margolin, R., Zelnik-Manor, L., and Tal, A. (2014). How to evaluate foreground maps?. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 248–255.

Mei, H., Ji, G. P., Wei, Z., Yang, X., Wei, X., and Fan, D. P. (2021). Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8772–8781.

Merilaita, S., Scott-Samuel, N. E., and Cuthill, I. C. (2017). How camouflage works. Philos. Trans. Royal Soc. B 372:20160341. doi: 10.1098/rstb.2016.0341

Nah, S., Hyun Kim, T., and Lee, M.. (2017). Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3883–3891.

Naqvi, S. M. A., Shabaz, M., Khan, M. A., and Hassan, S. I. (2023). Adversarial attacks on visual objects using the fast gradient sign method. J Grid Comput 21:52. doi: 10.1007/s10723-023-09684-9

Niu, B., Wen, W., Ren, W., Zhang, X., Yang, L., Wang, S., et al. (2020). “Single image super-resolution via a holistic attention network” in European conference on computer vision (Cham: Springer), 191–207.

Pan, Y., Chen, Y., Fu, Q., Zhang, P., and Xu, X. (2011). Study on the camouflaged target detection method based on 3D convexity. Mod. Appl. Sci. 5:152. doi: 10.5539/mas.v5n4p152

Qadeer, N., Shah, J. H., Sharif, M., Khan, M. A., Muhammad, G., and Zhang, Y. D. (2022). Intelligent tracking of mechanically thrown objects by industrial catching robot for automated in-plant logistics 4.0. Sensors 22:2113. doi: 10.3390/s22062113

Qin, X., Wang, Z., Bai, Y., Xie, X., and Jia, H. (2020). FFA-net: feature fusion attention network for single image dehazing. Proc. AAAI Conf. Artif. Intel. 34, 11908–11915. doi: 10.1609/aaai.v34i07.6865

Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., and Jagersand, M. (2019). Basnet: boundary-aware salient object detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7479–7489.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Proces. Syst. 28, 1137–1149. doi: 10.48550/arXiv.1506.01497

Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149. doi: 10.1109/TPAMI.2016.2577031

Rida, I., Al-Maadeed, N., Al-Maadeed, S., and Bakshi, S. (2020). A comprehensive overview of feature representation for biometric recognition. Multimed. Tools Appl. 79, 4867–4890. doi: 10.1007/s11042-018-6808-5

Riley, D. A., and Roitblat, H. L. (2018). Selective attention and related cognitive processes in pigeons. Cogn. Proces. Anim. Behav., 249–276. doi: 10.4324/9780203710029-9

Ronneberger, O., Fischer, P., and Brox, T. (2015). “U-net: convolutional networks for biomedical image segmentation” in International conference on medical image computing and computer-assisted intervention (Cham: Springer), 234–241.

Sengottuvelan, P., Wahi, A., and Shanmugam, A. (2008). Performance of decamouflaging through exploratory image analysis. 2008 first international conference on emerging trends in engineering and technology, pp. 6–10.

Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). “Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation” in European conference on computer vision (Berlin: Springer), 1–15.

Skurowski, P., Abdulameer, H., Błaszczyk, J., Depta, T., Kornacki, A., and Kozieł, P. (2018), Animal camouflage analysis: CHAMELEON database. Unpublished manuscript, 2, p. 7.

Song, L., and Geng, W. (2010). A new camouflage texture evaluation method based on WSSIM and nature image features. 2010 international conference on multimedia technology, pp. 1–4.

Stevens, M., and Merilaita, S. (2009). Animal camouflage: current issues and new perspectives. Philos. Trans. Royal Soc. B 364, 423–427. doi: 10.1098/rstb.2008.0217

Stevens, M., and Ruxton, G. D. (2019). The key role of behaviour in animal camouflage. Biol. Rev. 94, 116–134. doi: 10.1111/brv.12438

Talas, L., Baddeley, R. J., and Cuthill, I. C. (2017). Cultural evolution of military camouflage. Philos. Trans. Royal Soc. B 372:20160351. doi: 10.1177/10482911211032971

Thayer, G. H. (1918). Concealing-coloration in the animal kingdom: An exposition of the laws of disguise through color and pattern. New York: Macmillan Company.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep image prior. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9446–9454).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Proces. Syst. 30, 5998–6008. doi: 10.48550/arXiv.1706.03762

Wang, T., Borji, A., Zhang, L., Zhang, P., and Lu, H. (2017). A stagewise refinement model for detecting salient objects in images. In Proceedings of the IEEE international conference on computer vision, pp. 4019–4028.

Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., and Li, H. (2022). Uformer: a general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17683–17693.

Wang, X., Girshick, R., Gupta, A., and He, K. (2018). Non-local neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803.

Waqas Zamir, S., Arora, A., Khan, S., Hayat, M., Shahbaz Khan, F., and Yang, M. H. (2021). Restormer: efficient transformer for high-resolution image restoration. arXiv :2111. doi: 10.48550/arXiv.2111.09881

Woo, S., Park, J., Lee, J. Y., and Kweon, I. S. (2018). CBAM: convolutional block attention module. Proceedings of the European conference on computer vision (ECCV), pp. 3–19.

Wu, Y. H., Gao, S. H., Mei, J., Xu, J., Fan, D. P., Zhang, R. G., et al. (2021). Jcs: an explainable COVID-19 diagnosis system by joint classification and segmentation. IEEE Trans. Image Process. 30, 3113–3126. doi: 10.1109/TIP.2021.3058783

Wu, Z., Su, L., and Huang, Q. (2019). Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3907–3916.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. International conference on machine learning, pp. 2048–2057

Xue, F., Yong, C., Xu, S., Dong, H., Luo, Y., and Jia, W. (2016). Camouflage performance analysis and evaluation framework based on features fusion. Multimed. Tools Appl. 75, 4065–4082. doi: 10.1007/s11042-015-2946-1

Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., and Yang, M. H. (2022). Restormer: efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5728–5739.

Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M. H., et al. (2020). “Learning enriched features for real image restoration and enhancement” in European conference on computer vision (Cham: Springer), 492–511.

Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y. (2018). Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pp. 286–301.

Zhao, J. X., Liu, J. J., Fan, D. P., Cao, Y., Yang, J., and Cheng, M. M. (2019). EGNet: edge guidance network for salient object detection. Proceedings of the IEEE/CVF international conference on computer vision, pp. 8779–8788.

Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyramid scene parsing network. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.

Zhao, T., and Wu, X. (2019). Pyramid feature attention network for saliency detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3085–3094.

Zhao, Z. Q., Zheng, P., Xu, S. T., and Wu, X. (2019). Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30, 3212–3232. doi: 10.1109/TNNLS.2018.2876865

Zheng, Y., Zhang, X., Wang, F., Cao, T., Sun, M., and Wang, X. (2018). Detection of people with camouflage pattern via dense deconvolution network. IEEE Signal Proces. Lett. 26, 29–33. doi: 10.1109/LSP.2018.2825959

Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N., and Liang, J. (2018). “UNET++: A nested U-Net architecture for medical image segmentation” in Deep learning in medical image analysis and multimodal learning for clinical decision support (Cham: Springer), 3–11.

Zhu, J., Zhang, X., Zhang, S., and Liu, J. (2021). Inferring camouflaged objects by texture-aware interactive guidance network. Proceedings of the AAAI Conference on Artificial Intelligence, 35, pp. 3599–3607.

Keywords: camouflage object detection, COD, dataset, feature extraction, benchmark, deep learning, convolutional neural network, artificial intelligence

Citation: Ike CS, Muhammad N, Bibi N, Alhazmi S and Eoghan F (2024) Discriminative context-aware network for camouflaged object detection. Front. Artif. Intell. 7:1347898. doi: 10.3389/frai.2024.1347898

Received: 01 December 2023; Accepted: 12 March 2024;
Published: 27 March 2024.

Edited by:

Hanqi Zhuang, Florida Atlantic University, United States

Reviewed by:

Khalil Khan, Nazarbayev University, Kazakhstan
Anum Masood, NTNU, Norway

Copyright © 2024 Ike, Muhammad, Bibi, Alhazmi and Eoghan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Samah Alhazmi, s.alhazmi@seu.edu.sa; Furey Eoghan, eoghan.furey@atu.ie
