ORIGINAL RESEARCH article

Front. Artif. Intell., 27 March 2024
Sec. Machine Learning and Artificial Intelligence

Discriminative context-aware network for camouflaged object detection

  • 1Department of Computing, Atlantic Technological University, Letterkenny, Ireland
  • 2School of Computing, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Haripur, Pakistan
  • 3Department of Computer Science, Fatima Jinnah Women University, Rawalpindi, Pakistan
  • 4Computer Science Department, College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia

Introduction: Animals use camouflage (background matching, disruptive coloration, etc.) for protection, confusing predators and making detection difficult. Camouflage Object Detection (COD) tackles this challenge by identifying objects seamlessly blended into their surroundings. Existing COD techniques struggle with hidden objects due to noisy inferences inherent in natural environments. To address this, we propose the Discriminative Context-aware Network (DiCANet) for improved COD performance.

Methods: DiCANet addresses camouflage challenges through a two-stage approach. First, an adaptive restoration block intelligently learns feature weights, prioritizing informative channels and pixels. This enhances convolutional neural networks’ ability to represent diverse data and handle complex camouflage. Second, a cascaded detection module with an enlarged receptive field refines the object prediction map, achieving clear boundaries without post-processing.

Results: Without post-processing, DiCANet achieves state-of-the-art performance on challenging COD datasets (CAMO, CHAMELEON, COD10K) by generating accurate saliency maps with rich contextual details and precise boundaries.

Discussion: DiCANet tackles the challenge of identifying camouflaged objects in noisy environments with its two-stage restoration and cascaded detection approach. This innovative architecture surpasses existing methods in COD tasks, as proven by benchmark dataset experiments.

1 Introduction

Charles Darwin’s theory of evolution and natural selection underpins the study of prey camouflage patterns and of animal cognition in an ecological context. The earliest research on camouflage dates to the last century (Cott, 1940); Thayer (1918) and Cott (1940) studied the phenomenon comprehensively. Camouflage is an evolutionary concealment technique that masks objects’ location, identity, and movement in their surrounding environment. To adapt to their environment, living organisms must exhibit adaptive traits or behavioral strategies suited to it. The combination of physiological characteristics such as color, pattern, morphology, and behavior (Gleeson et al., 2018; Stevens and Ruxton, 2019) provides some survival advantage by disrupting the visual silhouette of animals or potential predators. Inspired by this important natural phenomenon, humans have attempted to replicate these patterns in many fields.

As a multidisciplinary study spanning computer science and evolutionary biology, COD has a wide range of practical applications, including wildlife preservation and animal monitoring; the arts (e.g., recreational art) (Chu et al., 2010; Ge et al., 2018); agriculture (e.g., locust detection to prevent invasion); computer vision and other vision-related areas (e.g., search-and-rescue missions in natural disasters, military target detection and surveillance systems, rare species discovery); and medical image analysis [e.g., polyp segmentation (Fan et al., 2020b) and lung infection segmentation (Fan et al., 2020c; Wu et al., 2021)], to mention a few.

There are two types of camouflaged objects: naturally camouflaged objects and artificially camouflaged objects (Stevens and Merilaita, 2009). Natural camouflage results from the coevolution of predators and prey. Figures 1A,B show disruptive coloration and background pattern matching in animals attempting to exploit predators’ visual processing and cognition. Other camouflage strategies include countershading, transparency, masquerade, and distractive markings (Galloway et al., 1802). Artificial camouflage is a predatory concealment strategy typically employed by humans, for example by military troops, vehicles, weapons, and positions in war zones (Zheng et al., 2018). In such cases, the camouflaged party first observes the environment and then blends its texture patterns to create a scene resembling that environment, deceiving the visual perception systems of potential observers, as shown in Figure 1C.

Figure 1. Natural and artificial camouflaged objects. (A,B) show natural camouflage and (C) shows artificial camouflage.

COD has gained increased attention in the computer vision community but remains under-studied due to the lack of large training datasets and of a standard benchmark such as Pascal-VOC (Everingham et al., 2015), ImageNet (Deng et al., 2009), and MS-COCO (Lin T. Y. et al., 2014).

The majority of computer vision literature is largely concerned with the detection/segmentation of non-camouflaged objects (Ren et al., 2017). Based on the detecting and segmenting viewpoint (Zhao Z. Q. et al., 2019), the objects can be divided into three categories: generic objects, salient objects, and camouflage objects. Generic object detection (GOD) is a popular direction in cognitive computer vision which aims to find common objects. They can either be salient or camouflaged. Salient object detection (SOD) aims to find attention-grabbing objects in an image, i.e., objects with pre-defined classes. There exists a vast amount of research works for both generic (Shotton et al., 2006; Liu et al., 2010; Girshick et al., 2014; Everingham et al., 2015; Girshick, 2015; Ren et al., 2015; Kirillov et al., 2019; Le et al., 2020), and salient object detection (Wang et al., 2017; Wu et al., 2019; Zhao J. X. et al., 2019; Zhao and Wu, 2019; Fan et al., 2020a; Qin et al., 2020; Waqas Zamir et al., 2021). COD aims to identify objects whose shape and outline are not easily recognizable in images, as shown in Figure 2. The high intrinsic similarities between the camouflaged objects and the background require a significant amount of visual perception knowledge, hence making COD far more challenging than the conventional salient object detection or generic object detection (Ge et al., 2018; Zhao Z. Q. et al., 2019; Zhao J. X. et al., 2019).

Figure 2. Object segmentation exemplars: (A) Given input image, (B) GOD, (C) SOD, (D) COD.

In this paper, we present a review of deep-learning object detection from a camouflaged perspective and propose a discriminative context-aware network called “DiCANet.” Owing to noisy interference in natural systems, the low-frequency distribution contains smoothly disordered data while the high-frequency details receive an unwanted approximation. These channel-wise and pixel-wise features are unevenly distributed across the camouflaged image and should be differentiated using weighted information to obtain an appropriate representation of the salient features of objects. Therefore, rather than directly assigning equal weights to the channel-wise and pixel-wise features (Woo et al., 2018), and inspired by Qin et al. (2020), we introduce an adaptive restoration block (ARB) that adaptively learns the weights of the image features and assigns different weights to them. This not only contributes to the representative ability of convolutional neural networks (CNNs) but also provides robustness against various types of information preservation. After the ARB, these features are fused in a complementary-aware manner through the fusion pipeline to generate restored camouflage images. Next, a cascaded detection module (Fan et al., 2020a) fortified with a modified receptive field block (Liu and Huang, 2018) is adopted to segment ecological signals and drive the segmentation performance on the target objects during the detection stage. Furthermore, a more refined camouflaged object prediction map is attained, with clear boundaries and an accurate saliency map in terms of contextual details.

With the above considerations, the proposed DiCANet is used to develop a good camouflaged prediction map. Our contributions can be summarized as follows: (1) we propose a discriminative context-aware network (“DiCANet”) for camouflaged object segmentation; (2) we infuse an adaptive restoration block into a bio-inspired cascaded detection block to effectively guide detection and segmentation performance. The ARB comprises three key components: (a) a feature attention block (FAB), (b) a group architecture, and (c) an attention-based feature fusion network; details of these components are discussed in subsequent sections; (3) the proposed COD model boosts performance to a new state of the art (SOTA), and experiments verify the effectiveness of our proposed method.

2 Related work

This section reviews related works in two folds: image restoration approaches and deep learning-based COD approaches.

2.1 Image restoration

Visual information captured in the real world contains undesired image content, and restoration is a position-sensitive problem that requires pixel-to-pixel correspondence between the input and the output image. To recover image content from natural images, traditional approaches showed promising reconstruction performance but suffered from computational drawbacks (Ulyanov et al., 2018). Recently, deep-learning-based restoration models have surpassed conventional approaches and achieved state-of-the-art results (Waqas Zamir et al., 2021; Zamir et al., 2022). Designing algorithms robust enough to maintain a spatially precise, high-resolution representation with strong semantic information throughout the entire network has been a challenge. Zamir et al. (2020) proposed a novel multi-scale residual block to learn enriched features for effective real-image restoration and enhancement. Despite recent major advancements, state-of-the-art methods suffer from high system complexity, making them computationally inefficient (Nah et al., 2017; Abdelhamed et al., 2018; Chu et al., 2021). To reduce the inter-block complexity of other SOTA methods, Chen et al. (2022) stacked neural networks in a UNet architecture with skip connections (Ronneberger et al., 2015), following Wang et al. (2022) and Zamir et al. (2022), to design a nonlinear activation-free framework based on CNNs rather than transformers, given the performance drawbacks reported by Liu et al. (2022) and Han et al. (2021). Qin et al. (2020) proposed a feature fusion attention network that fuses a feature attention block with an attention-based multipath local residual structure to focus on learning the weights of important spatial information and generate accurate results.

2.2 COD

Research into COD has deep roots in biology and art (Thayer, 1918; Cott, 1940), and these studies remain relevant in widening our knowledge of visual perception. The recognition of camouflaged objects has not been well explored in the literature. Early camouflage research focused on detecting the foreground region even when the foreground texture resembled that of the background (Galun et al., 2003; Song and Geng, 2010; Xue et al., 2016). Based on cues such as color, shape, intensity, edge, and orientation, these works distinguished the foreground from the background. To address camouflage detection, a few techniques based on hand-crafted features such as texture (Sengottuvelan et al., 2008; Pan et al., 2011; Liu et al., 2012) and motion (Hou, 2011; Le et al., 2019) were put forth. However, due to the high similarity between foreground and background, none of these approaches performs well in real application scenarios for segmenting camouflaged objects; they are only effective for simple, non-uniform backgrounds. Despite the numerous CNN-based object detection models available, unique designs are required to build models for COD. In contrast to pixel-level segmentation, GOD detects objects with bounding boxes. Furthermore, segmentation in COD is based on saliency from a human perspective, not semantics, which is not modeled in GOD models. On the other hand, models designed for SOD are unable to effectively detect concealed objects: SOD models perform non-semantic segmentation and model saliency, yet they do not specialize in finding the indefinite boundaries of camouflaged objects, as salient objects tend to be of obvious human interest. Researchers have proposed several feasible methods for COD.

Recently, Le et al. (2019) proposed an end-to-end network for segmenting camouflaged objects by integrating classification into the segmentation framework. Lamdouar et al. (2020) and Zhu et al. (2021) proposed novel approaches based on the assumption that camouflaged objects exist in an image, which is not always practical in the real world. To better simulate the real world, Le et al. (2021) proposed camouflaged instance segmentation without assuming that camouflaged objects exist in an image. Following the same motivation, Fan et al. (2020a) proposed the Search Identification Network (SINet), comprising two modules, namely a search module and an identification module, where the former searches whether potential prey exists while the latter identifies the target animal. The SINet framework leverages a modified receptive field block (Liu and Huang, 2018) to search for camouflaged object regions. Aside from their COD model, Fan et al. (2020a) also presented a large COD dataset, called COD10K, which progressed COD research to a new level in the field of computer vision. Similarly, Dong et al. (2021) proposed the MCIF-Net framework, which integrates a large receptive field and an effective feature aggregation strategy into a unified framework to extract rich context features for accurate COD. In addition, recent studies such as the notable works of Hussain et al. (2021), Qadeer et al. (2022), and Naqvi et al. (2023) contribute to the understanding of object detection, tracking, and recognition in various contexts, enhancing the breadth and depth of the related literature. Despite research devoted to the challenges of COD and its outstanding accuracy gains, existing deep learning-based COD methods suffer major limitations, such as weak boundaries (i.e., edges), low boundary contrast, and variations in object appearance (e.g., object size and shape), leading to unsatisfactory segmentation performance (Fan et al., 2020a; Mei et al., 2021; Ji et al., 2022) and raising the demand for more advanced feature fusion strategies.

Biological studies (Stevens and Merilaita, 2009; Merilaita et al., 2017; Rida et al., 2020) have shown that deliberately hidden targets cause noisier inferences in the visual perception system, which contributes to object concealment. This is a common phenomenon in nature, and finding ecologically relevant signals hidden in extreme situations becomes a challenge. Moreover, without precise control of the feature fusion process, detectors are vulnerable to significant interference from low-frequency details, which causes vague object boundaries and misjudgment in extreme situations. Inspired by this real-world phenomenon, this paper aims to design a novel baseline model that balances the accuracy and efficiency of COD by adaptively exploiting semantic and spatial information to obtain plausible final context-aware camouflage prediction maps with refined edge boundaries.

3 Materials and methods

3.1 Motivation and proposed framework

The term “survival of the fittest” was conceptualized in Charles Darwin’s theory of evolution (Flannelly, 2017). The survival of numerous species in the wild depends on adaptation; hunting across a wide variety of ecosystems is essential for organisms to thrive in their environment. Motivated by the first two stages of predation in nature, i.e., search (a sensory mechanism) and identification, the DiCANet framework is proposed. A simplified version of the proposed framework is shown in Figure 3, and details of each component are discussed in subsequent sections.

Figure 3. Proposed DiCANet architecture.

3.2 Camouflaged image

The art of camouflage hinges on manipulating an object’s visual appearance to blend into its surroundings. At the heart of this strategy is the concept of pixel similarity. Digital images, including those used in camouflage analysis, are represented by pixels, small elements of varying values that collectively form the image. In the context of input camouflaged images, pixel similarity measures how closely the pixels of objects in the camouflaged image match their surroundings in terms of color, visual pattern, surface variation, and intensity (Talas et al., 2017). The more similar the pixels of the camouflaged object are to those of its intended background (Figure 3), the more effective the camouflage and the harder it is for observers to spot detectable features of the concealed object. Conversely, any detectable discrepancies in pixel similarity will reveal the presence of the hidden object, undermining the effectiveness of the camouflage. By analyzing these features and strategically manipulating the pixel attributes of a camouflaged object, we propose an effective context-aware network for camouflaged object detection.

3.3 Adaptive restoration block (ARB)

To restore concealed images, redundant information unevenly distributed across a real-world image should be adaptively bypassed while robustly allowing the network architecture to focus on more effective information. The ARB framework’s internal block contains several key elements, including (a) the feature attention block (FAB), (b) the attention-based basic block structure, and (c) the feature fusion framework. A detailed framework is shown in Figure 4.

Figure 4. Adaptive restoration block architecture.

Given a 3D real-world camouflaged input image $I_c \in \mathbb{R}^{H \times W \times C_{in}}$, where $H$, $W$, and $C_{in}$ denote the spatial dimensions and the number of input channels, respectively, a $3 \times 3$ convolution $H_{SF}$ is applied to map the input camouflaged image into a higher-dimensional feature space and extract shallow features with edge information $F_{sf} \in \mathbb{R}^{H \times W \times C}$, formulated as:

$F_{sf} = H_{SF}(I_c)$    (1)

Deep features $F_{df} \in \mathbb{R}^{H \times W \times C}$ are then extracted from $F_{sf}$ as:

$F_{df} = H_{DF}(F_{sf})$    (2)

where $H_{DF}$ is the deep feature extraction module, which contains $K$ residual group architecture blocks (RGABs) with multiple skip connections. More specifically, intermediate features $F_1, F_2, \ldots, F_K$ and the output deep features $F_{DF}$ are extracted block by block as:

$F_i = H_{RGAB_i}(F_{i-1}), \; i = 1, 2, \ldots, K; \qquad F_{DF} = H_{CONV}(F_K)$    (3)

where $H_{RGAB_i}$ denotes the $i$-th RGAB and $H_{CONV}$ is the last convolutional layer, which introduces the inductive bias of the convolution operation into the network and sets the stage for shallow and deep feature aggregation.
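To make the data flow of Equations 1–3 concrete, the following PyTorch sketch chains a shallow $3 \times 3$ convolution, $K$ residual groups, and a final convolution. The `ResidualGroup` class is only a hypothetical stand-in for the RGAB (whose internal attention blocks are described in the following sections); the channel width of 64 follows the filter count reported in Section 4, and everything else is illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResidualGroup(nn.Module):
    """Hypothetical stand-in for one RGAB: a small residual stack of convolutions."""
    def __init__(self, channels: int, num_convs: int = 2):
        super().__init__()
        layers = []
        for _ in range(num_convs):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)  # skip connection around the group

class ARBFeatureExtractor(nn.Module):
    """Equations (1)-(3): shallow 3x3 conv H_SF, K RGABs, and a last conv H_CONV."""
    def __init__(self, in_channels: int = 3, channels: int = 64, num_groups: int = 3):
        super().__init__()
        self.h_sf = nn.Conv2d(in_channels, channels, 3, padding=1)                        # H_SF
        self.groups = nn.ModuleList([ResidualGroup(channels) for _ in range(num_groups)])  # H_RGAB_i
        self.h_conv = nn.Conv2d(channels, channels, 3, padding=1)                         # H_CONV

    def forward(self, image):
        f_sf = self.h_sf(image)       # Eq. (1): shallow features with edge information
        f = f_sf
        for group in self.groups:     # Eq. (3): intermediate features F_1 ... F_K
            f = group(f)
        f_df = self.h_conv(f)         # Eq. (3): deep features F_DF
        return f_sf, f_df
```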

3.4 Feature attention block (FAB)

To improve model representation, attention mechanisms have been introduced inside CNNs (Zhang et al., 2018; Dai et al., 2019; Niu et al., 2020). Many image restoration networks treat channel-level and pixel-level features equally, making them incapable of efficiently handling images with uneven low- and high-frequency distributions. Realistically, redundant information is unevenly distributed across images, and the weight assigned to unwanted pixels should differ for each channel-wise and pixel-wise feature. In the attention block, features are learned via a dynamic mechanism that enables the model to concentrate on different segments of the input data, highlighting pertinent features and attenuating or suppressing irrelevant ones. This is typically realized by computing attention weights, which signify the significance or relevance of the various input features. This adaptive learning approach gives the network hierarchy additional flexibility in dealing with different types of information. The feature attention block consists of a residual block with channel attention (RB-CA) and a residual block with pixel attention (RB-PA), as shown in Figure 5. The former ensures that different channel features carry different weighted information (He et al., 2010), while the latter attentively focuses on informative features in high-frequency pixel regions.

Figure 5. Feature attention block. (A) Channel attention (CA). (B) Pixel attention (PA).

3.4.1 Channel attention (CA)

To achieve channel-wise weighting for each channel in feature maps, global average pooling (GAP) was employed before feeding the data into fully connected layers for classification tasks. The concept of GAP in CNNs focuses on each feature map (channel) and aggregates information across the entire spatial extent of the feature maps, resulting in a single value per channel (Lin M. et al., 2014; Forrest, 2016; Hu et al., 2018; Machine Learning Mastery, 2019). The 1D vector (channel descriptors) obtained from GAP can then be used in subsequent calculations to extract meaningful features from the image. The mathematical expression detailing how channel descriptors achieve weighted information is as follows:

$g_c = H_p(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$    (4)

where $H_p$ represents the global pooling function, $F_c$ the input, and $X_c(i, j)$ the value of the $c$-th channel $X_c$ at spatial position $(i, j)$. The shape of the feature map changes from $C \times H \times W$ to $C \times 1 \times 1$, i.e., the spatial extent $H \times W$ is collapsed. These channel descriptors are then fed through two convolution layers, with a ReLU activation after the first and a computationally efficient sigmoid after the second (Figure 5A), to produce the weights of the different channels, formulated as follows:

$CA_c = \sigma(Conv(\delta(Conv(g_c))))$    (5)

where $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively. By elementwise multiplication of the input $F_c$ and the channel weights $CA_c$, the output of the channel attention $F_c^{*}$ is obtained as follows:

$F_c^{*} = CA_c \otimes F_c$    (6)
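A minimal PyTorch sketch of Equations 4–6 is given below. The two $1 \times 1$ convolutions and the channel-reduction ratio of 8 are assumptions carried over from typical channel-attention designs such as FFA-Net (Qin et al., 2020); the paper itself only fixes the $1 \times 1$ kernel size (Section 4).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Equations (4)-(6): GAP -> conv -> ReLU -> conv -> sigmoid -> channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)              # Eq. (4): C x H x W -> C x 1 x 1
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                               # Eq. (5): channel weights CA_c in (0, 1)
        )

    def forward(self, f_c):
        ca = self.weights(self.gap(f_c))                # per-channel weights, N x C x 1 x 1
        return f_c * ca                                 # Eq. (6): elementwise re-weighting
```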

3.4.2 Pixel attention (PA)

To capture fine-grained details of spatial context, pixel attention (PA) mechanisms actively focus on specific pixels within the entire spatial extent of the feature maps. The concept of attention mechanisms in CNNs, including those that focus on pixel-level details, has been explored in various studies (e.g., Ismail Fawaz et al., 2019; Dosovitskiy et al., 2020). Inspired by CA (Hu et al., 2018) and spatial attention (SA) (Woo et al., 2018), PA is used to improve the feature representation capacity and obtain images with clear object boundaries. Comparable to CA, as shown in Figure 5B, the input $F_c^{*}$ (i.e., the output of the channel attention block) is fed through two convolution layers with ReLU and sigmoid activation functions. The shape of the feature map changes from $C \times H \times W$ to $1 \times H \times W$:

$PA = \sigma(Conv(\delta(Conv(F_c^{*}))))$    (7)

Recall that activation maps are often passed elementwise through an activation function such as ReLU. By elementwise multiplication of $F_c^{*}$ and $PA$, the feature attention block (FAB) output $\tilde{F}$ is given by:

$\tilde{F} = F_c^{*} \otimes PA$    (8)
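Continuing the sketch above (and reusing its `ChannelAttention` class), pixel attention and the full FAB can be written as follows. The internal reduction ratio is again an assumption, not a value taken from the paper.

```python
import torch.nn as nn

class PixelAttention(nn.Module):
    """Equation (7): two convolutions producing a 1 x H x W spatial weight map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.weights = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_c_star):
        pa = self.weights(f_c_star)        # spatial weights, N x 1 x H x W
        return f_c_star * pa               # Eq. (8): elementwise spatial re-weighting

class FeatureAttentionBlock(nn.Module):
    """FAB (Figure 5): channel attention followed by pixel attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)   # from the previous sketch
        self.pa = PixelAttention(channels)

    def forward(self, f):
        return self.pa(self.ca(f))             # F~ = PA(CA(F))
```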

Integrating Channel Attention and Pixel Attention within CNNs empowers the network to learn both the overall image context and the finer details of specific regions simultaneously. This leads to stronger and more informative feature representations, improving the network’s ability to distinguish objects. Recent research (e.g., Hu et al., 2018; Ismail Fawaz et al., 2019; Dosovitskiy et al., 2020) has explored this combined approach to enhance CNN performance in various computer vision tasks like image classification, object detection, and semantic segmentation.

3.5 Basic block structure (BBS)

The performance of neural networks has improved significantly since attention mechanisms (Xu et al., 2015; Vaswani et al., 2017; Wang et al., 2018) and residual connections (He et al., 2016) were introduced to train deep networks. The design of the BBS $B_i$ builds on the combination of these concepts. As shown in Figure 6, a BBS consists of a multiple local residual learning (LRL) skip-connection block and a FAB. Local residual learning permits low-frequency details to be bypassed through multiple local skip connections, allowing the main network to learn discriminatively useful information. The combination of several basic block structures with skip connections increases the depth and capability of the ARB and helps overcome training challenges.
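A sketch of one basic block structure follows, reusing the `FeatureAttentionBlock` defined above. The exact number of convolutions inside the local residual branch is not specified in the text, so the two-convolution layout here is an assumption.

```python
import torch.nn as nn

class BasicBlockStructure(nn.Module):
    """One BBS B_i (Figure 6): local residual learning followed by a FAB."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.fab = FeatureAttentionBlock(channels)   # from the previous sketch

    def forward(self, x):
        res = self.conv2(self.relu(self.conv1(x)))
        res = res + x                    # local residual learning: bypass low-frequency details
        return self.fab(res)             # re-weight channels and pixels
```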

Figure 6. Basic block structure.

A two-layer convolutional network is implemented at the end of the ARB network (as shown in Figure 4), and a long-skip-connection global residual learning module is employed as a recovery strategy to restore the input camouflaged image.

3.6 Feature fusion attention strategy

Shallow feature information can be difficult to retain as the network gets deeper. U-Net (Ronneberger et al., 2015) and other networks strive to fuse shallow and deep features from different levels. As depicted in Figure 4, the feature maps produced by the $G$ group architectures are concatenated in the channel direction. Following the FAB weighting strategy, the retained low-level features with edge information in the shallow layers, which preserve the spatial details needed to establish object boundaries, are fed into the deep layers, allowing the ARB network (ARB-Net) to focus more on semantic information such as high-frequency textures and improve hidden-object scene visibility in real-world scenarios.
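The fusion step can be sketched as below, again reusing `FeatureAttentionBlock`. Concatenating the group outputs and fusing them back with a $1 \times 1$ convolution is one plausible reading of Figure 4; the exact fusion layer used by the authors is not specified here.

```python
import torch
import torch.nn as nn

class FeatureFusionAttention(nn.Module):
    """Concatenate the G group outputs channel-wise, weight them with a FAB, and fuse."""
    def __init__(self, channels: int, num_groups: int = 3):
        super().__init__()
        self.fab = FeatureAttentionBlock(channels * num_groups)
        self.fuse = nn.Conv2d(channels * num_groups, channels, kernel_size=1)

    def forward(self, group_features):               # list of G tensors, each N x C x H x W
        fused = torch.cat(group_features, dim=1)     # concatenation in the channel direction
        fused = self.fab(fused)                      # attention-based weighting of all levels
        return self.fuse(fused)                      # back to the working channel width
```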

3.7 Loss function

According to Lim et al. (2017), training with L1 loss often outperformed training with L2 loss for image restoration tasks. Following the same strategy, we adopted L1 loss as our default loss function for training the ARB-Net. The total loss function L is:

$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\lVert I_{gt}^{i} - ARB(I_c^{i}) \right\rVert_1$    (9)

where $\Theta$ represents the ARB-Net parameters, $I_{gt}^{i}$ the ground truth, and $I_c^{i}$ the real-world camouflaged input image. The proposed ARB-Net extends the hyperparameters detailed in Qin et al. (2020), encompassing vital parameters such as image size, learning rate, optimizer, batch size, and loss function. The configuration of the adaptive restoration block (ARB) was selected through a systematic approach combining experimentation, domain knowledge, and optimization techniques. Leveraging our understanding of camouflaged object detection and image restoration, we fine-tuned the hyperparameters to meet the demands of the task and, through iterative adjustments and rigorous validation on test data, identified the most effective configuration for the ARB. This ensures that the ARB-Net is finely tuned for camouflaged object detection, enhancing its performance and applicability in real-world scenarios.
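A minimal training step with the Equation 9 objective might look as follows; PyTorch's `nn.L1Loss` averages over pixels as well as images, which only rescales the objective by a constant. The function and variable names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn

l1_criterion = nn.L1Loss()   # mean absolute error over the batch and all pixels

def arb_training_step(arb_net, optimizer, camouflaged_imgs, ground_truth_imgs):
    """One optimization step of ARB-Net with the L1 restoration loss of Eq. (9)."""
    optimizer.zero_grad()
    restored = arb_net(camouflaged_imgs)                 # ARB(I_c)
    loss = l1_criterion(restored, ground_truth_imgs)     # || I_gt - ARB(I_c) ||_1, averaged
    loss.backward()
    optimizer.step()
    return loss.item()
```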

3.8 Cascaded detection block

3.8.1 Sensory module (SM)

According to a neuroscience study by Langley et al. (1996), when prey indiscriminately hides in the background, selective search attention (Riley and Roitblat, 2018) plays a significant role in the predatory sensory mechanism by reducing non-prey details, thus saving computational time. To take advantage of this sensory mechanism, search attention is used in the initial feature learning to select and aggregate semantic features from the restored camouflaged image $I_{ARB}$ produced in the previous section.

Given an input image $I_{ARB} \in \mathbb{R}^{W \times H \times 3}$ (the output of the ARB), a set of features $\{f_k\}$, $k \in \{1, 2, 3, 4, 5\}$, is extracted from the ResNet-50 (He et al., 2016) backbone architecture. The resolution of each feature $f_k$ is $\frac{H}{c_k} \times \frac{W}{c_k}$ with $c_k \in \{4, 4, 8, 16, 32\}$. Lin et al. (2017) demonstrated that high-level features in deep layers keep the semantic information needed for finding objects, whereas low-level features in shallow layers preserve the spatial details needed for establishing object boundaries. Based on this property of neural networks, the extracted features are categorized as low-level $\{X_0, X_1\}$, intermediate-level $X_2$, and high-level $\{X_3, X_4\}$ features, which are then fused through concatenation, up-sampling, and down-sampling operations. A dense convolutional network strategy (Huang et al., 2017) is leveraged to preserve more information from the different layers, and a modified receptive field block (Liu and Huang, 2018) is then used to enlarge the receptive field and output a set of enhanced features.
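The five-level feature pyramid can be obtained from a standard torchvision ResNet-50 as sketched below; splitting the backbone into stem, layer1, ..., layer4 yields strides of {4, 4, 8, 16, 32}, matching the resolutions stated above. Loading pretrained ImageNet weights and the subsequent dense aggregation and receptive field block are omitted here.

```python
import torch.nn as nn
import torchvision

class SensoryBackbone(nn.Module):
    """ResNet-50 feature levels X_0..X_4 at strides {4, 4, 8, 16, 32}."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)   # load ImageNet weights in practice
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)   # -> H/4, 64 channels
        self.layer1, self.layer2 = r.layer1, r.layer2                  # -> H/4 (256), H/8 (512)
        self.layer3, self.layer4 = r.layer3, r.layer4                  # -> H/16 (1024), H/32 (2048)

    def forward(self, i_arb):          # restored camouflaged image, N x 3 x H x W
        x0 = self.stem(i_arb)          # low-level
        x1 = self.layer1(x0)           # low-level
        x2 = self.layer2(x1)           # intermediate-level
        x3 = self.layer3(x2)           # high-level
        x4 = self.layer4(x3)           # high-level
        return [x0, x1, x2, x3, x4]
```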

3.8.2 Identification module (IM)

In the identification module, disguised objects are precisely identified using the output features obtained from the sensory module. Following the identification network of Fan et al. (2020a), our final context-aware camouflaged object prediction maps with refined boundaries are generated.

4 Results

To demonstrate the generality of the newly proposed DiCANet COD model, the ARB-Net goes through a fine-tuning stage with different key network parameters and is trained on local image patches to perform restoration in more complex background scenarios. For optimal results that preserve the camouflaged object’s latent spectral content and structural details, the number of group structures $G$ and the number of basic block structures $B$ per group are set to 3 and 19, respectively, in the ARB. The filter size of all convolution layers is set to $3 \times 3$, except in channel attention, whose kernel size is $1 \times 1$. Additionally, all feature maps maintain a fixed size except within the channel attention module. Each group structure outputs 64 filters.

5 Discussion

5.1 Experimental settings

5.1.1 Training/Testing details

ARB-Net builds on the same training settings as Qin et al. (2020), and the CDB follows the same hyperparameter configurations as Fan et al. (2020a). We evaluate the DiCANet model on the whole CHAMELEON dataset (Skurowski et al., 2018) and on the test sets of CAMO (Le et al., 2019) and COD10K (Fan et al., 2020a). The entire experiment was executed on a 2.2 GHz dual-core Intel Core i7 CPU with 8 GB of RAM, using Google Colab as the working interface. Evaluation metrics: we adopt four benchmark metrics to evaluate the performance of the DiCANet model, namely the S-measure (Fan et al., 2017), mean E-measure (Fan et al., 2018), weighted F-measure (Margolin et al., 2014), and mean absolute error (MAE).
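Of the four metrics, the mean absolute error is the simplest to state explicitly: it is the average per-pixel difference between the predicted map and the binary ground truth, both scaled to [0, 1]. The other three measures follow the cited papers. A small sketch:

```python
import numpy as np

def mean_absolute_error(pred_map: np.ndarray, gt_mask: np.ndarray) -> float:
    """MAE between a predicted camouflage map and its ground-truth mask, both in [0, 1]."""
    pred = pred_map.astype(np.float64)
    gt = gt_mask.astype(np.float64)
    return float(np.abs(pred - gt).mean())
```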

5.2 Baseline models

To demonstrate the robustness of DiCANet, this research selected 13 strong baseline methods that adopt ResNet-50 (He et al., 2016) as the backbone network for feature extraction and achieved SOTA performance in related fields, namely GOD and SOD: object detection, FPN (Lin et al., 2017); semantic segmentation, PSPNet (Zhao et al., 2017); instance segmentation, Mask R-CNN (He et al., 2017), HTC (Chen et al., 2019), and MSRCNN (Huang et al., 2019); medical image segmentation, UNet++ (Zhou et al., 2018) and PraNet (Fan et al., 2020b); salient object detection, PiCANet (Liu et al., 2018), BASNet (Qin et al., 2019), CPD (Wu et al., 2019), PFANet (Zhao and Wu, 2019), and EGNet (Zhao J. X. et al., 2019); and camouflaged object segmentation, SINet (Fan et al., 2020a).

5.3 Quantitative comparison

Table 1 summarizes the quantitative results of different baselines on three standard COD datasets. The proposed model achieved the highest values for the evaluation metrics, which indicates superior performance.

Table 1. Quantitative comparison in terms of $S_\alpha$, $E_\phi$, $F_\beta^{\omega}$, and $M$ on three benchmark COD datasets (Fan et al., 2020a).

For the CAMO dataset, comparing the DiCANet model with the top two performing baselines, PraNet and SINet, the proposed method improves by 0.003 and 0.009, respectively, in terms of $M$, and by 0.057 and 0.041, respectively, in terms of $E_\phi$ and $F_\beta^{\omega}$. Although DiCANet achieves a lower structural similarity score $S_\alpha$, accurate predictions with well-preserved edge details and clear boundaries are still achieved. Similarly, when compared with the edge-boundary models, e.g., EGNet and PFANet, our DiCANet improves $E_\phi$ and $F_\beta^{\omega}$ by (0.08 and 0.103) and (0.302 and 0.427), respectively, while drastically reducing the MAE by 0.016 and 0.110 on the CHAMELEON dataset, and it achieves a significant improvement in $S_\alpha$ of 0.011 over the best model, PraNet. Interestingly, on the most challenging dataset, COD10K, DiCANet outperforms the competition in prediction accuracy for all metrics and boosts performance to a new SOTA.

5.4 Qualitative comparison

Figure 7 shows a qualitative comparison of the camouflaged prediction maps of DiCANet against the top four cutting-edge models. Rows 1 and 2 (top to bottom) are examples from the CHAMELEON dataset; row 3 shows examples from the CAMO dataset; row 4 is an example from the COD10K super-class amphibious. It is evident that DiCANet outperforms all competing models and provides the prediction closest to the ground truth (best viewed when zoomed in).

Figure 7. Camouflaged objects segmentation results. (A) Image, (B) GT, (C) DiCANet, (D) SINet, (E) PraNet, (F) EGNet, (G) CPD.

The compared methods consistently include non-camouflaged regions in their results while neglecting some details of the camouflaged objects; they detect disguised objects inaccurately and provide unreliable visual results. In contrast, the proposed model demonstrates excellent performance in locating concealed objects accurately, with rich, fine details in the predictions and clear boundaries. Additionally, our method captures object boundaries well thanks to the ARB’s adaptive weighting mechanism and feature fusion strategy.

5.4.1 Failure case

Despite achieving satisfactory quantitative performance and setting a record in the COD task, the proposed DiCANet framework exhibits limitations in specific scenarios as shown in Figure 8. When dealing with multiple camouflaged objects grouped closely together (row 1), DiCANet might struggle to accurately predict the number of objects. This limitation can be attributed to the network’s limited prior knowledge in handling scenes with a specific number of objects. The complicated topological structures (row 2) with dense details can also pose challenges for DiCANet due to background complexity distraction. This complexity overwhelms the attention mechanisms, diverting focus from the camouflaged objects. Additionally, the intricate details in the background could share similar features with the camouflage patterns, making it difficult to distinguish the camouflaged object from its surroundings. These limitations provide valuable insights and potential areas for future investigation. By tackling these challenges and exploring novel approaches, researchers can create more resilient COD systems capable of managing even the most intricate and challenging scenarios.

Figure 8. Failure cases of our DiCANet. (A) Images, (B) GT, (C) Ours, (D) SINet.

5.4.2 Ablation study

To further demonstrate the superiority of the DiCANet architecture over previous state-of-the-art methods, we conducted an ablation study on challenging camouflage scenarios (Figure 9). We observe that DiCANet consistently shows distinctive detection and segmentation of concealed objects in challenging natural scenarios, such as partial occlusion (1st row), weak object/background contrast (2nd row), and strong background descriptors (3rd row). Meanwhile, the structural similarity $S_\alpha$ scores (in red) of DiCANet are much higher, with minimal error, than those of the competitors, which further demonstrates the superiority of our method. We can also clearly see that the combination of the proposed adaptive ARB-Net and the feature fusion attention strategy has significantly elevated our results.

Figure 9. Visual comparison with top three baselines on COD10K S/M.

6 Conclusion

This paper presents the Discriminative Context-Aware Network (DiCANet), a novel joint learning framework for detecting concealed objects with refined edges. The proposed model leverages two key components: the ARB-Net and the CDB. To improve camouflage scene visibility, the ARB-Net adaptively generates different attention weights for each channel-wise and pixel-wise feature and strategically fuses the feature maps to expand the discriminative power and representative ability of the convolutional networks. To drive camouflaged object localization and segmentation performance, we employ the CDB module. Based on the ARB and CDB modules, a context-aware network is proposed that pays more attention to local contextual information to evaluate the objectness of the camouflage prediction map. Extensive experiments show that mining distinctive information can overcome the difficulties of both SOD and COD tasks with superior performance; thus, DiCANet outperforms SOTA methods under the commonly used evaluation metrics and deserves further exploration in other related computer vision tasks.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding authors.

Author contributions

CI: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. NM: Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. NB: Conceptualization, Data curation, Formal analysis, Investigation, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing. SA: Data curation, Formal analysis, Funding acquisition, Investigation, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing. FE: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Acknowledgments

The authors would like to thank Atlantic Technological University, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Fatima Jinnah Women University, and Saudi Electronic University.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abdelhamed, A., Lin, S., and Brown, M. S. (2018). A high-quality denoising dataset for smartphone cameras, in “Proceedings of the IEEE conference on computer vision and pattern recognition”, pp. 1692–1700.

Chen, L., Chu, X., Zhang, X., and Sun, J. (2022). Simple baselines for image restoration. arXiv. doi: 10.48550/arXiv.2204.04676

Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., et al. (2019). Hybrid task cascade for instance segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4974–4983.

Chu, X., Chen, L., Chen, C., and Lu, X. (2021). Revisiting global statistics aggregation for improving image restoration. arXiv

Chu, H. K., Hsu, W. H., Mitra, N. J., Cohen-Or, D., Wong, T. T., and Lee, T. Y. (2010). Camouflage images. ACM Trans. Graph. 29:1. doi: 10.1145/1833351.1778788

Cott, H. B. (1940). Adaptive coloration in animals. Methuen, London.

Dai, T., Cai, J., Zhang, Y., Xia, S. T., and Zhang, L. (2019). Second-order attention network for single image super-resolution. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11065–11074.

Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei, L. (2009). Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp. 248–255.

Dong, B., Zhuge, M., Wang, Y., Bi, H., and Chen, G. (2021). Accurate camouflaged object detection via mixture convolution and interactive fusion. arXiv. doi: 10.48550/arXiv.2101.05687

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2020). An image is worth 16x16 words:Transformers for image recognition at scale. arXiv. doi: 10.48550/arXiv.2010.11929

Everingham, M., Eslami, S. M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2015). The pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136. doi: 10.1007/s11263-014-0733-5

Fan, D. P., Cheng, M. M., Liu, Y., Li, T., and Borji, A. Structure-measure: a new way to evaluate foreground maps. Proceedings of the IEEE international conference on computer vision. (2017), pp. 4548–4557.

Fan, D. P., Gong, C., Cao, Y., Ren, B., Cheng, M. M., and Borji, A. (2018). Enhanced-alignment measure for binary foreground map evaluation. arXiv. doi: 10.48550/arXiv.1805.10421

Fan, D. P., Ji, G. P., Sun, G., Cheng, M. M., Shen, J., and Shao, L. (2020a). Camouflaged object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2777–2787.

Fan, D. P., Ji, G. P., Zhou, T., Chen, G., Fu, H., Shen, J., et al. (2020b). “Pranet: Parallel reverse attention network for polyp segmentation” in International conference on medical image computing and computer-assisted intervention (Cham: Springer), 263–273.

Fan, D. P., Zhou, T., Ji, G. P., Zhou, Y., Chen, G., Fu, H., et al. (2020c). INF-net: automatic COVID-19 lung infection segmentation from ct images. IEEE Trans. Med. Imaging 39, 2626–2637. doi: 10.1109/TMI.2020.2996645

Flannelly, K. J. (2017). Religious beliefs, evolutionary psychiatry, and mental health in America. New York, NY: Springer.

Forrest, N. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters. arXiv. doi: 10.48550/arXiv.1602.07360

Galloway, J. A., Green, S. D., Stevens, M., and Kelley, L. A. (1802). Finding a signal hidden among noise: how can predators overcome camouflage strategies? Philos. Trans. R. Soc. B 2020:20190478.

Galun, M., Sharon, E., Basri, R., and Brandt, A. (2003). Texture segmentation by multiscale aggregation of filter responses and shape elements. ICCV 3:716. doi: 10.1109/ICCV.2003.1238418

Ge, S., Jin, X., Ye, Q., Luo, Z., and Li, Q. (2018). Image editing by object-aware optimal boundary searching and mixed-domain composition. Comput. Vis. Media 4, 71–82. doi: 10.1007/s41095-017-0102-8

Girshick, R. (2015). “Fast r-cnn” in Proceedings of the IEEE international conference on computer vision, 1440–1448.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. (2014), pp. 580–587.

Gleeson, P., Lung, D., Grosu, R., Hasani, R., and Larson, S. D. (2018). c302: a multiscale framework for modelling the nervous system of Caenorhabditis elegans. Philos. Trans. Royal Soc. B 373:20170379. doi: 10.1098/rstb.2017.0379

Han, Q., Fan, Z., Dai, Q., Sun, L., Cheng, M. M., Liu, J., et al. (2021). Demystifying local vision transformer: Sparse connectivity, weight sharing, and dynamic weight. arXiv 2.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.

He, K., Sun, J., and Tang, X. (2010). Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2341–2353. doi: 10.1109/TPAMI.2010.168

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

Hou, J. Y. Y. H. W. (2011). Detection of the mobile object with camouflage color under dynamic background based on optical flow. Procedia Eng. 15, 2201–2205. doi: 10.1016/j.proeng.2011.08.412

Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141.

Huang, Z., Huang, C., and Wang, X. (2019). Mask scoring R-CNN. CVPR, 6409–6418. doi: 10.48550/arXiv.1903.00241

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.

Hussain, N., Khan, M. A., Kadry, S., Tariq, U., Mostafa, R. R., Choi, J. I., et al. (2021). Intelligent deep learning and improved whale optimization algorithm based framework for object recognition. Hum. Cent. Comput. Inf. Sci 11:2021. doi: 10.22967/HCIS.2021.11.034

Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., and Muller, P. A. (2019). Deep learning for time series classification: a review. Data Min. Knowl. Disc. 33, 917–963. doi: 10.1007/s10618-019-00619-1

Ji, G. P., Zhu, L., Zhuge, M., and Fu, K. (2022). Fast camouflaged object detection via edge-based reversible re-calibration network. Pattern Recogn. 123:108414. doi: 10.1016/j.patcog.2021.108414

Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. (2019). Panoptic segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9404–9413.

Lamdouar, H., Yang, C., Xie, W., and Zisserman, A. (2020). Betrayed by motion: Camouflaged object discovery via motion segmentation. Proceedings of the Asian Conference on Computer Vision.

Langley, C. M., Riley, D. A., Bond, A. B., and Goel, N. (1996). Visual search for natural grains in pigeons (Columba livia): search images and selective attention. J. Exp. Psychol. Anim. Behav. Process. 22, 139–151.

Le, T. N., Cao, Y., Nguyen, T. C., Le, M. Q., Nguyen, K. D., Do, T. T., et al. (2021). Camouflaged instance segmentation in-the-wild: dataset, method, and benchmark suite. IEEE Trans. Image Process. 31, 287–300. doi: 10.1109/TIP.2021.3130490

Le, T. N., Nguyen, T. V., Nie, Z., Tran, M. T., and Sugimoto, A. (2019). Anabranch network for camouflaged object segmentation. Comput. Vis. Image Underst. 184, 45–56. doi: 10.1016/j.cviu.2019.04.006

Le, T. N., Ono, S., Sugimoto, A., and Kawasaki, H. (2020). Attention R-CNN for accident detection. In 2020 IEEE intelligent vehicles symposium (IV), pp. 313–320.

Lim, B., Son, S., Kim, H., Nah, S., and Lee, M., (2017). Enhanced deep residual networks for single image super-resolution. Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144.

Lin, M., Chen, Q., and Yan, S. (2014). Network in Network. arXiv. doi: 10.48550/arXiv.1312.4400

Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.

Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). “Microsoft coco: Common objects in context” in European conference on computer vision (Cham: Springer), 740–755.

Liu, N., Han, J., and Yang, M. H. (2018). Picanet: learning pixel-wise contextual attention for saliency detection. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3089–3098.

Liu, S., and Huang, D. (2018). Receptive field block net for accurate and fast object detection. In Proceedings of the European conference on computer vision, pp. 385–400.

Liu, Z., Huang, K., and Tan, T. (2012). Foreground object detection using top-down information based on EM framework. IEEE Trans. Image Process. 21, 4204–4217. doi: 10.1109/TIP.2012.2200492

Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11976–11986.

Liu, C., Yuen, J., and Torralba, A. (2010). Sift flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33, 978–994. doi: 10.1109/TPAMI.2010.147

Machine Learning Mastery. (2019) A Gentle Introduction to Pooling Layers for Convolutional Neural Networks. Available at: https://machinelearningmastery.com/crash-course-convolutional-neural-networks/

Margolin, R., Zelnik-Manor, L., and Tal, A. (2014). How to evaluate foreground maps?. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 248–255.

Mei, H., Ji, G. P., Wei, Z., Yang, X., Wei, X., and Fan, D. P. (2021). Camouflaged object segmentation with distraction mining. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8772–8781.

Merilaita, S., Scott-Samuel, N. E., and Cuthill, I. C. (2017). How camouflage works. Philos. Trans. Royal Soc. B 372:20160341. doi: 10.1098/rstb.2016.0341

Nah, S., Hyun Kim, T., and Lee, M.. (2017). Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3883–3891.

Naqvi, S. M. A., Shabaz, M., Khan, M. A., and Hassan, S. I. (2023). Adversarial attacks on visual objects using the fast gradient sign method. J Grid Comput 21:52. doi: 10.1007/s10723-023-09684-9

Niu, B., Wen, W., Ren, W., Zhang, X., Yang, L., Wang, S., et al. (2020). “Single image super-resolution via a holistic attention network” in European conference on computer vision (Cham: Springer), 191–207.

Pan, Y., Chen, Y., Fu, Q., Zhang, P., and Xu, X. (2011). Study on the camouflaged target detection method based on 3D convexity. Mod. Appl. Sci. 5:152. doi: 10.5539/mas.v5n4p152

Qadeer, N., Shah, J. H., Sharif, M., Khan, M. A., Muhammad, G., and Zhang, Y. D. (2022). Intelligent tracking of mechanically thrown objects by industrial catching robot for automated in-plant logistics 4.0. Sensors 22:2113. doi: 10.3390/s22062113

Qin, X., Wang, Z., Bai, Y., Xie, X., and Jia, H. (2020). FFA-net: feature fusion attention network for single image dehazing. Proc. AAAI Conf. Artif. Intel. 34, 11908–11915. doi: 10.1609/aaai.v34i07.6865

Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., and Jagersand, M. (2019). Basnet: boundary-aware salient object detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7479–7489.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Proces. Syst. 28, 1137–1149. doi: 10.48550/arXiv.1506.01497

Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149. doi: 10.1109/TPAMI.2016.2577031

Rida, I., Al-Maadeed, N., Al-Maadeed, S., and Bakshi, S. (2020). A comprehensive overview of feature representation for biometric recognition. Multimed. Tools Appl. 79, 4867–4890. doi: 10.1007/s11042-018-6808-5

Riley, D. A., and Roitblat, H. L. (2018). Selective attention and related cognitive processes in pigeons. Cogn. Proces. Anim. Behav., 249–276. doi: 10.4324/9780203710029-9

Ronneberger, O., Fischer, P., and Brox, T. (2015). “U-net: convolutional networks for biomedical image segmentation” in International conference on medical image computing and computer-assisted intervention (Cham: Springer), 234–241.

Sengottuvelan, P., Wahi, A., and Shanmugam, A. (2008). Performance of decamouflaging through exploratory image analysis. 2008 first international conference on emerging trends in engineering and technology, pp. 6–10.

Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). “Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation” in European conference on computer vision (Berlin: Springer), 1–15.

Skurowski, P., Abdulameer, H., Błaszczyk, J., Depta, T., Kornacki, A., and Kozieł, P. (2018), Animal camouflage analysis: CHAMELEON database. Unpublished manuscript, 2, p. 7.

Song, L., and Geng, W. (2010). A new camouflage texture evaluation method based on WSSIM and nature image features. 2010 international conference on multimedia technology, pp. 1–4.

Stevens, M., and Merilaita, S. (2009). Animal camouflage: current issues and new perspectives. Philos. Trans. Royal Soc. B 364, 423–427. doi: 10.1098/rstb.2008.0217

Stevens, M., and Ruxton, G. D. (2019). The key role of behaviour in animal camouflage. Biol. Rev. 94, 116–134. doi: 10.1111/brv.12438

Talas, L., Baddeley, R. J., and Cuthill, I. C. (2017). Cultural evolution of military camouflage. Philos. Trans. Royal Soc. B 372:20160351. doi: 10.1177/10482911211032971

Thayer, G. H. (1918). Concealing-coloration in the animal kingdom: An exposition of the laws of disguise through color and pattern. New York: Macmillan Company.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep image prior. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 9446–9454).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need. Adv. Neural Inf. Proces. Syst. 30, 5998–6008. doi: 10.48550/arXiv.1706.03762

Wang, T., Borji, A., Zhang, L., Zhang, P., and Lu, H. (2017). A stagewise refinement model for detecting salient objects in images. In Proceedings of the IEEE international conference on computer vision, pp. 4019–4028.

Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., and Li, H. (2022). Uformer: a general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17683–17693.

Wang, X., Girshick, R., Gupta, A., and He, K. (2018). Non-local neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7794–7803.

Waqas Zamir, S., Arora, A., Khan, S., Hayat, M., Shahbaz Khan, F., and Yang, M. H. (2021). Restormer: efficient transformer for high-resolution image restoration. arXiv :2111. doi: 10.48550/arXiv.2111.09881

Woo, S., Park, J., Lee, J. Y., and Kweon, I. S. (2018). CBAM: convolutional block attention module. Proceedings of the European conference on computer vision (ECCV), pp. 3–19.

Wu, Y. H., Gao, S. H., Mei, J., Xu, J., Fan, D. P., Zhang, R. G., et al. (2021). Jcs: an explainable COVID-19 diagnosis system by joint classification and segmentation. IEEE Trans. Image Process. 30, 3113–3126. doi: 10.1109/TIP.2021.3058783

Wu, Z., Su, L., and Huang, Q. (2019). Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3907–3916.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. International conference on machine learning, pp. 2048–2057

Xue, F., Yong, C., Xu, S., Dong, H., Luo, Y., and Jia, W. (2016). Camouflage performance analysis and evaluation framework based on features fusion. Multimed. Tools Appl. 75, 4065–4082. doi: 10.1007/s11042-015-2946-1

Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., and Yang, M. H. (2022). Restormer: efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5728–5739.

Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M. H., et al. (2020). “Learning enriched features for real image restoration and enhancement” in European conference on computer vision (Cham: Springer), 492–511.

Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., and Fu, Y. (2018). Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pp. 286–301.

Zhao, J. X., Liu, J. J., Fan, D. P., Cao, Y., Yang, J., and Cheng, M. M. (2019). EGNet: edge guidance network for salient object detection. Proceedings of the IEEE/CVF international conference on computer vision, pp. 8779–8788.

Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyramid scene parsing network. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.

Zhao, T., and Wu, X. (2019). Pyramid feature attention network for saliency detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3085–3094.

Zhao, Z. Q., Zheng, P., Xu, S. T., and Wu, X. (2019). Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30, 3212–3232. doi: 10.1109/TNNLS.2018.2876865

Zheng, Y., Zhang, X., Wang, F., Cao, T., Sun, M., and Wang, X. (2018). Detection of people with camouflage pattern via dense deconvolution network. IEEE Signal Proces. Lett. 26, 29–33. doi: 10.1109/LSP.2018.2825959

Zhou, Z., Rahman Siddiquee, M. M., Tajbakhsh, N., and Liang, J. (2018). “UNET++: A nested U-Net architecture for medical image segmentation” in Deep learning in medical image analysis and multimodal learning for clinical decision support (Cham: Springer), 3–11.

Zhu, J., Zhang, X., Zhang, S., and Liu, J. (2021). Inferring camouflaged objects by texture-aware interactive guidance network. Proceedings of the AAAI Conference on Artificial Intelligence, 35, pp. 3599–3607.

Keywords: camouflage object detection, COD, dataset, feature extraction, benchmark, deep learning, convolutional neural network, artificial intelligence

Citation: Ike CS, Muhammad N, Bibi N, Alhazmi S and Eoghan F (2024) Discriminative context-aware network for camouflaged object detection. Front. Artif. Intell. 7:1347898. doi: 10.3389/frai.2024.1347898

Received: 01 December 2023; Accepted: 12 March 2024;
Published: 27 March 2024.

Edited by:

Hanqi Zhuang, Florida Atlantic University, United States

Reviewed by:

Khalil Khan, Nazarbayev University, Kazakhstan
Anum Masood, NTNU, Norway

Copyright © 2024 Ike, Muhammad, Bibi, Alhazmi and Eoghan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Samah Alhazmi, s.alhazmi@seu.edu.sa; Furey Eoghan, eoghan.furey@atu.ie
