Cascaded information enhancement and cross-modal attention feature fusion for multispectral pedestrian detection

Yang, Yang; Xu, Kaixiong; Wang, Kaizheng

doi:10.3389/fphy.2023.1121311

ORIGINAL RESEARCH article

Front. Phys., 18 January 2023

Sec. Radiation Detectors and Imaging

Volume 11 - 2023 | https://doi.org/10.3389/fphy.2023.1121311

This article is part of the Research TopicMulti-Sensor Imaging and Fusion: Methods, Evaluations, and ApplicationsView all 18 articles

Cascaded information enhancement and cross-modal attention feature fusion for multispectral pedestrian detection

Yang Yang¹

Kaixiong Xu¹

Kaizheng Wang²*

¹Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
²Faculty of Electrical Engineering, Kunming University of Science and Technology, Kunming, China

Multispectral pedestrian detection is a technology designed to detect and locate pedestrians in Color and Thermal images, which has been widely used in automatic driving, video surveillance, etc. So far most available multispectral pedestrian detection algorithms only achieved limited success in pedestrian detection because of the lacking take into account the confusion of pedestrian information and background noise in Color and Thermal images. Here we propose a multispectral pedestrian detection algorithm, which mainly consists of a cascaded information enhancement module and a cross-modal attention feature fusion module. On the one hand, the cascaded information enhancement module adopts the channel and spatial attention mechanism to perform attention weighting on the features fused by the cascaded feature fusion block. Moreover, it multiplies the single-modal features with the attention weight element by element to enhance the pedestrian features in the single-modal and thus suppress the interference from the background. On the other hand, the cross-modal attention feature fusion module mines the features of both Color and Thermal modalities to complement each other, then the global features are constructed by adding the cross-modal complemented features element by element, which are attentionally weighted to achieve the effective fusion of the two modal features. Finally, the fused features are input into the detection head to detect and locate pedestrians. Extensive experiments have been performed on two improved versions of annotations (sanitized annotations and paired annotations) of the public dataset KAIST. The experimental results show that our method demonstrates a lower pedestrian miss rate and more accurate pedestrian detection boxes compared to the comparison method. Additionally, the ablation experiment also proved the effectiveness of each module designed in this paper.

1 Introduction

Pedestrian detection, parsing visual content to identify and locate pedestrians on an image/video, has been viewed as an essential and central task within the computer vision field and widely employed in various applications, e.g. autonomous driving, video surveillance and person re-identification [1–7]. The performance of such technology has greatly advanced through the facilitation of convolutional neural networks (CNN). Typically, pedestrian detectors take Color images as input and try to retrieve the pedestrian information from them. However, the quality of Color images highly depends on the light condition. Missing recognition of pedestrians occurs frequently when pedestrian detectors process Color images with poor resolution and contrast caused by unfavorable lighting. Consequently, the use of such models has been limited for the application of all-weather devices.

Thermal imaging is related to the infrared radiation of pedestrians, barely affected by changes in ambient light. The technique of combining Color and Thermal images has been explored in recent years [8–16]. These methods has been shown to exhibit positive effects on pedestrian detection performance in complex environments as it could retrieve more pedestrian information. However, despite important initial success, there remain two major challenges. First, as shown in Figure 1, the image of pedestrians tends to blend with the background for nighttime Color images resulting from insufficient light [17], and for daytime Thermal images as well due to similar temperatures between the human body and the ambient environment [18]. Second, there is an essential difference between Color images and Thermal images the former displays the color and texture detail information of pedestrians while the latter shows the temperature information. Therefore, solutions needed to be taken to augment the pedestrian features in Color and Thermal modalities in order to suppress background interference, and enable better integration and understanding of both Color and Thermal images to improve the accuracy of pedestrian detection in complex environments.

FIGURE 1

FIGURE 1. Example of color and thermal images of pedestrians in daytime and nighttime scenes.

To address the challenges above, the researches [19,20] designed illumination-aware networks to obtain illumination-measured parameters of Color and Thermal images respectively, which were used as fusion weights for Color and Thermal features in order to realize a self-adaptively fuse of two modal features. However, the acquisition of illumination-measured parameters relied heavily on the classification scores, the accuracy of which was limited by the performance of the classifier. [21] reported confidence-aware networks to predict the confidence of detection boxes for each modal, and then Dempster-Sheffer theory combination rules were employed to fuse the results of different branches based on uncertainty. Nevertheless, the accuracy of predicting the detection boxes’ confidence is also affected by the performance of the confidence-aware network. A cyclic fusion and refinement scheme was introduced by [22] for the sake of gradually improving the quality of Color and Thermal features and automatically adjusting the complementary and consistent information balance of the two modalities to effectively utilize the information of both modalities. However, this method only used a simple feature cascade operation to fuse Color and Thermal features and failed to fully exploit the complementary features of these two modalities.

To tackle the problems aforementioned, we propose a multispectral pedestrian detection algorithm with cascaded information enhancement and cross-modal attention feature fusion. The cascaded information enhancement module (CIEM) is designed to enhance the pedestrian information suppressed by the background in the Color and Thermal images. CIEM uses a cascaded feature fusion block to fuse Color and Thermal features to obtain fused features of both modalities. Since the fused features contain the consistency and complementary information of Color and Thermal modalities, the fused features can be used to enhance Color and Thermal features respectively to reduce the interference of background on pedestrian information. Inspired by the attention mechanism, the attention weights of the fused features are sequentially obtained by channel and spatial attention learning, and the Color and Thermal features are multiplied with the attention weights element by element, respectively. In this way, the single-modal features have the combined information of the two modalities, and the single-modal information is enhanced from the perspective of the fused features. Although CIEM enriches single-modal pedestrian features, simple feature fusion of the enhanced single-modal features is still insufficient for robust multispectral pedestrian detection. Thus, we design the cross-modal attention feature fusion module (CAFFM) to efficiently fuse Color and Thermal features. Cross-modal attention is used in this module to implement the differentiation of pedestrian features between different modalities. In order to supplement the pedestrian information of the other modality to the local modality, the attention of the other modality is adopted to augment the pedestrian features of the local modality. A global feature is constructed by adding the Color and Thermal features after performing cross-modal feature enhancement, and the global feature is used to guide the fusion of the Color and Thermal features. Overall, the method presented in this paper enables more comprehensive pedestrian features acquisition through cascaded information enhancement and cross-modal attention feature fusion, which effectively enhances the accuracy of multispectral image pedestrian detection. The main contributions of this paper are summarized as follows.

(1) A cascaded information enhancement module is proposed. From the perspective of fused features, it reduces the interference from the background of Color and Thermal modalities on pedestrian detection and augments the pedestrian features of Color and Thermal modalities separately through an attention mechanism.

(2) The designed cross-modal attention feature fusion module first mines the features of both Color and Thermal modalities separately through a cross-modal attention network and adds them to the other modality for cross-modal feature enhancement. Meanwhile, the cross-modal enhanced Color and Thermal features are used to construct global features to guide the feature fusion of the two modalities.

(3) Numerous experiments are conducted on the public dataset KAIST to demonstrate the effectiveness and superiority of the proposed method. In addition, the ablation experiments also demonstrate the effectiveness of the proposed modules.

2 Related works

2.1 Multispectral pedestrian detection

Multispectral sensors can obtain paired Color-Thermal images to provide complementary information about pedestrian targets. A large multispectral pedestrian detection (KAIST) dataset was constructed by [8]. Meanwhile, by combining the traditional aggregated channel feature (ACF) pedestrian detector [23] with the HOG algorithm [24], an extended ACF (ACF + T + THOG) method was proposed to fuse Color and Thermal features. In 2016, [9] proposed four fusion modalities of low-layer feature, middle-layer feature, high-layer feature, and confidence fraction fusion with VGG16 as the backbone network, and the middle-layer feature fusion was proved to offer the maximum integration capability of Color and Thermal features. Inspired by this, [25] developed a multispectral region candidate network with Faster RCNN (Region with CNN features, RCNN) [26] as the architecture and replaced the original classifier in Faster RCNN with an enhanced decision tree classifier to reduce the missed and false detection of pedestrians. Recently,[27] deployed the EfficientDet as the backbone network and proposed an EfficientDet-based fusion framework for multispectral pedestrian detection to improve the detection accuracy of pedestrians in Color and Thermal images by adding and cascading the Color and Thermal features. Although the studies [8,9,25,27] fused Color and Thermal features for pedestrian detection, they mainly focused on exploring the impact of different stages of fusion on pedestrian detection, and only adopted simple feature fusion and not focusing on the case of pedestrian and background confusion.

In 2019, [28] observed a weak alignment problem of pedestrian position between Color and Thermal images, for which the KAIST dataset was re-annotated and Aligned Region CNN (AR-CNN) was proposed to handle weakly aligned multispectral pedestrian detection data in an end-to-end manner. But the deployment of this algorithm requires pairs of annotations, and the annotation of the dataset is a time-consuming and labor-intensive task, which makes the algorithm difficult to be applied in realistic scenes. [29] proposed a new single-stage multispectral pedestrian detection framework. This framework used multi-label learning to learn input state-aware features based on the state of the input image pair by assigning an individual label (if the pedestrian is visible in only one image of the image pair, the label vector is assigned as y₁ ∈ [0, 1] or y₂ ∈ [1, 0]; if the pedestrian is visible in both images of the image pair, the label vector is assigned as y₃ ∈ [1, 1]) to solve the problem of weak alignment of pedestrian locations between Color and Thermal images, but the model still requires pairs of annotations during training. [19] designed illumination-aware networks to obtain illumination-measured parameters for Color and Thermal images separately and used them as the fusion weights for Color and Thermal features. [20] designed a differential modality perception fusion module to guide the features of the two modalities to become similar, and then used the illumination perception network to assign fusion weights to the Color and Thermal features. [30] reported an uncertainty-aware cross-modal guidance (UCG) module to guide the distribution of modal features with high prediction uncertainty to align with the distribution of modal features with low prediction uncertainty. The researches [19,20] noticed that the pedestrians in Color and Thermal images are easily confused with the background and used illumination-aware networks to assign fusion weights to Color and Thermal features. However, the acquisition of illumination-measured parameters relied heavily on the classification scores, whose accuracy was limited by the performance of the classifier. In contrast, the method proposed in this paper not only considers the confusion of pedestrians and background in Color and Thermal images but also effectively fuses the two modal features.

2.2 Attention mechanisms

Attention mechanisms [31] utilized in computer vision are aimed to perform the processing of visual information. Currently, attention mechanisms have been widely used in semantic segmentation [32], image captioning [33], image fusion [34,35], image dehazing [36], saliency target detection [37], person re-identification [38–40], etc. [41] introduced the idea of a squeeze and excitation network (SENet) to simulate the interdependence between feature channels in order to generate channel attention to recalibrate the feature mapping of channel directions. [42] employed the use of a selective kernel unit (SKNet) to adaptively fuse branches with different kernel sizes based on input information. A work inspired by this was from [43]. They designed a multi-scale channel attention feature fusion network that used channel attention mechanisms to replace simple fusion operations such as feature cascades or summations in feature fusion to produce richer feature representations. However, this recent progress in multispectral pedestrian detection has also been limited to two main challenges the interference caused by background and the difference of fundamental characteristics in Color and Thermal images. Therefore, we propose a multispectral pedestrian detection algorithm with cascaded information enhancement and cross-modal attention feature fusion based on the attention mechanism.

3 Methods

The overall network framework of the proposed algorithm is shown in Figure 2. The network consists of an encoder, a cascaded information enhancement module (CIEM), a cross-modal attentional feature fusion module (CAFFM) and a detection head. Specifically, ResNet-101 [44] is used as the backbone network of the encoder to encode the features of the input Color images X_c and Thermal images X_t to obtain the corresponding feature maps F_c ∈ R^W×H×C and F_t ∈ R^W×H×C (W, H, C represent the width, height and the number of channels of the feature maps, respectively). CIEM enhances single-modal information from the perspective of fused features by cascading feature fusion blocks to fuse F_c and F_t, and attention weighting the fused features to enrich pedestrian features. CAFFM complements the features of different modalities by mining the complementary features between the two modalities and constructs global features to guide the effective fusion of the two modal features. The detection head is employed for pedestrian recognition and localization of the final fused features.

FIGURE 2

FIGURE 2. Overall framework of the proposed algorithm.

3.1 Cascaded information enhancement module

Considering the confusion of pedestrians with the backgrounds in Color and Thermal images, we design a cascaded information enhancement module (CIEM) to augment the pedestrian features of both modalities to mitigate the effect of background interference on pedestrian detection. Specifically, a cascaded feature fusion block is used to fuse the Color features F_c and Thermal features F_t. The cascaded feature fusion block consists of feature cascade, 1 × 1 convolution, 3 × 3 convolution, BN layer, and ReLu activation function. The feature cascade operation splice F_c and F_t along the direction of channels. 1 × 1 convolution is conducive to cross-channel feature interaction in the channel dimension and reducing the number of channels in the splice feature map, while 3 × 3 convolution expands the field of perception and makes a more comprehensive fusion of features for generating fusion features F_ct:

F_{c t} = R e L u (B N ({C o n v}_{3} ({C o n v}_{1} [F_{c}, F_{t}]))) (1)

where BN denotes batch normalization, $C o n v_{n} (\cdot)$ denotes a convolution kernel with kernel size n × n, [⋅, ⋅] denotes the cascade of features along the channel direction, ReLu(⋅) represents ReLu activation function. Fusion feature F_ct is used to enhance the single-modal information because F_ct combines the consistency and complementarity of the Color features F_c and Thermal features F_t. The use of F_ct for enhancing the single-modal feature can reduce the interference of the noise in the single-modal features (for example, it is difficult to distinguish between the pedestrian information and the background noise).

In order to effectively enhance pedestrian features, the fusion feature F_ct is sent into the channel attention module (CAM) and spatial attention module (PAM) [45] to make the network pay attention to pedestrian features. The network structure of CAM and PAM is shown in Figure 3. F_ct first learns the channel attention weight w_ca ∈ R^1×1×C by CAM, then uses w_ca to weight F_ct, and the spatial attention weight w_pa ∈ R^W×H×1 is obtained from the weighted features by PAM.

FIGURE 3

FIGURE 3. Network structure of channel attention and spatial attention.

The single-modal Color features F_c and Thermal features F_t are multiplied element by element with the attention weights w_ca and w_pa to enhance the single-modal features from the perspective of fused features. The whole process can be described as follows:

F_{t}^{'} = (F_{t} \otimes w_{c a}) \otimes w_{p a} (2)

F_{c}^{'} = (F_{c} \otimes w_{c a}) \otimes w_{p a} (3)

where $F_{t}^{'}$ and $F_{c}^{'}$ denote the Color features and Thermal features obtained by the cascaded information enhancement module, respectively. ⊗ represents the element by element multiplication.

3.2 Cross-modal attention feature fusion module

There is an essential difference between Color and Thermal images, Color images reflect the color and texture detail information of pedestrians while Thermal images contain the temperature information of pedestrians, however, they also have some complementary information. In order to explore the complementary features of different image modalities and fuse them effectively, we design a cross-modal attention feature fusion module.

Specifically, the Color features $F_{c}^{'}$ and Thermal features $F_{t}^{'}$ enhanced by CIEM are first mapped into feature vectors v_c ∈ R^1×1×C and v_t ∈ R^1×1×C, respectively, by using global average pooling operation. The cross-modal attention network consists of a set of symmetric 1 × 1 convolutions, ReLu activation functions, and Sigmoid activation functions. In order to obtain the complementary features of the two modalities, more pedestrian features need to be mined from the single-modal. The feature vectors v_t and v_c are learned to the respective modal attention weights w_t ∈ R^1×1×C and w_c ∈ R^1×1×C by a cross-modal attention network, and then the Color features $F_{c}^{'}$ are multiplied element by element with the attention weights w_t of the Thermal modality, and the Thermal features $F_{t}^{'}$ are multiplied element by element with the attention weights w_c of the Color modality to complement the features of the other modality into the present modality. The specific process can be expressed as follows.

w_{t} = S i g m o i d (R e L u ({C o n v}_{1} (G A P (F_{t}^{'})))) (4)

F_{c t}^{'} = w_{t} \otimes G A P (F_{c}^{'}) (5)

w_{c} = S i g m o i d (R e L u ({C o n v}_{1} (G A P (F_{c}^{'})))) (6)

F_{t c}^{'} = w_{c} \otimes G A P (F_{t}^{'}) (7)

where $F_{c t}^{'}$ denotes Color features after supplementation with Thermal features, $F_{t c}^{'}$ denotes Thermal features after supplementation with Color features, GAP(⋅) denotes global average pooling operation, Conv₁(⋅) denotes convolution with convolution kernel size 1 × 1, ReLu(⋅) denotes ReLu activation operation, and Sigmoid (⋅) denotes Sigmoid activation operation.

In order to efficiently fuse the two modal features, the features $F_{c t}^{'}$ and $F_{t c}^{'}$ are subjected to an element by element addition operation to obtain a global feature vector containing Color and Thermal features. Then, the features $F_{t}^{'}$ and $F_{c}^{'}$ are added element by element and multiplied with the attention weight w_ct of the global feature vector element by element to guide the fusion of Color and Thermal features from the perspective of global features to obtain the final fused feature F. The fused feature F is input to the detection head to obtain the pedestrian detection results. The feature fusion process can be expressed as follows:

w_{c t} = S i g m o i d (R e L u ({C o n v}_{1} (F_{c t}^{'} \oplus F_{t c}^{'}))) (8)

F = w_{c t} \otimes (F_{t}^{'} \oplus F_{c}^{'}) (9)

where ⊕ denotes element by element addition.

3.3 Loss function

The loss function in this paper is consistent with the literature [26] and uses the Region Proposal Network (RPN) loss function L_RPN and Fast RCNN [46] loss function L_FR to jointly optimize the network:

L = L_{R P N} + L_{F R} (10)

Both L_RPN and L_FR consist of classification loss L_cls and bounding box regression loss L_reg:

L (\{p_{i}\}, \{t_{i}\}) = \frac{1}{N_{c l s}} \sum_{i} L_{c l s} (p_{i}, p_{i}^{*}) + λ \frac{1}{N_{reg}} \sum_{i} p_{i}^{*} L_{r e g} (t_{i}, t_{i}^{*}) (11)

Where, N_cls is the number of anchors, N_reg is the sum of positive and negative sample number, p_i is the probability that the i-th anchor is predicted to be the target, $p_{i}^{*}$ is 1 when the anchor is a positive sample, otherwise it is 0, t_i denotes the bounding box regression parameter predicting the i-th anchor, and $t_{i}^{*}$ denotes the GT bounding box parameter of the i-th anchor, λ = 1.

The difference between the classification loss of RPN network and Fast RCNN network is that the RPN network focuses only on the foreground and background when classifying, so its loss is a binary cross-entropy loss, while the Fast RCNN classification is focused to the target category and is a multi-category cross-entropy loss:

L_{c l s} (p_{i}, p_{i}^{*}) = - \log [p_{i}^{*} p_{i} + (1 - p_{i}^{*}) (1 - p_{i})] (12)

The bounding box regression loss of RPN network and Fast RCNN network uses ${Smooth}_{L_{1}}$ loss:

L_{reg} (t_{i}, t_{i}^{*}) = R (t_{i} - t_{i}^{*}) (13)

Where, R denotes ${Smooth}_{L_{1}}$ function,

{Smooth}_{L_{1}} (x) = \{\begin{matrix} \frac{x^{2}}{2 σ^{2}} & if | x | < \frac{1}{σ^{2}} \\ | x | - 0.5 & otherwise \end{matrix} (14)

The difference between the bounding box regression loss of RPN loss and the regression loss of Fast RCNN loss is that the RPN network is trained when σ = 3 and the Fast RCNN network is trained when σ = 1.

4 Experimental results and analysis

4.1 Datasets

This paper evaluates the algorithm performance on the KAIST pedestrian dataset [8], which is composed of 95,328 pairs of Color and Thermal images captured during daytime and nighttime. It is the most widely used multispectral pedestrian detection dataset at present. The dataset is labeled with four categories including person, people, person?, and cyclist. Considering the application areas of multispectral pedestrian detection (e.g., automatic driving), all four categories are treated as positive examples for detection in this paper. To address the problem of the annotation errors and missing annotations in the original annotation of the KAIST dataset, studies [9,28,47] performed data cleaning and re-annotation of the original data. Given that the annotations used in various studies are not consistent, we use 7601 pairs of Color and Thermal images from synthetic annotation (SA) [47] and 8892 pairs of Color and Thermal images from paired annotation (PA) [28] for model training. The test set consists of 2252 pairs of Color and Thermal images, of which 1455 pairs are from the daytime and 797 pairs are from the nighttime. For a fair comparison with other methods, the test experiments were performed according to the reasonable settings proposed in the literature [8].

4.2 Evaluation indexes

In this paper, Log-average Miss Rate (MR) proposed by [48] is employed as an evaluation index and combined with the plotting of the Miss Rate-FPPI curve to assess the effectiveness of the algorithm. The horizontal coordinate of the Miss Rate-FPPI curve indicates the average number of False Positives Per Image (FPPI), and the vertical coordinate represents the Miss Rate (MR), which is expressed as:

MissRate = \frac{F N}{T P + F N} (15)

F P P I = \frac{F P}{Total (images)} (16)

where FN denotes False Negative, TP denotes True Positive, FP denotes False Positive, the sum of TP and FN is the number of all positive samples, and Total (images) denotes the total number of predicted images. It is worth noting that the lower the Miss Rate-FPPI curve trend, the better the detection performance; the smaller the MR value, the better the detection performance. In order to calculate MR, in logarithmic space, nine points are taken from the horizontal coordinate (limited value range is $[1 0^{- 2}, 1 0^{0}]$ ) of Miss Rate-FPPI curve, and then there are nine corresponding vertical coordinates m₁, m₂,…m₉. By averaging these values, MR can be obtained as follows:

M R = \exp [\frac{1}{n} \sum_{i = 1}^{n} \ln (m_{i})] (17)

where n is 9.

4.3 Implementation details

In this paper, the deep learning framework pytorch1.7 is adopted. The experimental platform is the ubuntu18.04 operating system and a single NVIDIA GeForce RTX 2080Ti GPU. Stochastic Gradient Descent (SGD) algorithm is used to optimize the network during model training, with momentum value of 0.9, weight attenuation value 5 × 10^–4, and initial learning rate is 1 × 10^–3. The model is iterated for five epochs with the batch size of 4, and the learning rate decay to 1 × 10^–4 after the 3rd epoch.

4.4 Experimental results and analysis

4.4.1 Construction of the baseline

This work constructs a baseline algorithm architecture based on ResNet-101 backbone network and Faster RCNN detection head. Simple characteristic fusion (feature cascade, element by element addition and element by element multiplication) of the Color and Thermal features output by the backbone network is carried out in three sets of experiments. The fused feature is used as the input of the detection head. In order to ensure the high efficiency of the build baseline algorithm, synthesis annotation is employed to train and test the baseline. The test results are shown in Table 1. The MR values using feature cascade, element by element addition and element by element multiplication in the all-weather scene are 14.62%, 13.84% and 14.26%, respectively. By comparing these three results, it can be seen that the feature element by element addition demonstrates the best performance. Therefore, we adopt the method of adding features element by element as the baseline integration method.

TABLE 1

TABLE 1. Experimental results of baseline under different fusion modes.

4.4.2 Performance comparison of different methods

The performance of this method is compared with several other state-of-the-art methods. The compared methods include hand-represented methods, e.g., ACT + T + THOG [8] and deep learning-based methods, e.g., Halfway Fusion [9], CMT_CNN[49], CIAN[50], IAF R-CNN[51], IATDNN + IAMSS[19], CS-RCNN [52], IT-MN [53], and DCRD [54]. Here, the model is trained using 7601 pairs of Color and Thermal images from SA and 8892 pairs of Color and Thermal images from PA, respectively. Besides, 2252 pairs of Color and Thermal images from the test set are used for model testing. Table 2 lists the experimental results.

TABLE 2

TABLE 2. MRs of different methods on KAIST datasets.

Table 2 shows that when the model is trained with SA, the MRs of the method proposed in this paper are 10.71%, 13.09% and 8.45% for all-weather, daytime and nighttime scenes, respectively, which are 0.72%, −1.23% and 0.37% lower than the compared method CS-RCNN with the best performance, respectively. The PA (Color) and PA (Thermal) in Table 2 represent the Color annotation and Thermal annotation in the pairwise annotation PA, respectively, for the purpose of training the model. It can be seen from two that the MRs of the method in this paper are 11.11% and 10.98% when using Color annotation and Thermal annotation in the all-weather scene, which are 2.53% and 3.70%, respectively, lower than those of compared method with the best performance. In addition, by analyzing the experimental results of two improved versions of annotations, it can be found that pedestrian detection results are different when using different annotations, indicating the importance of annotations.

4.4.3 Analysis of ablation experiments

4.4.3.1 Complementarity and importance of color and thermal features

This section compares the effect of different input sources on pedestrian detection performance. In order to eliminate the impact of the proposed module on detection performance, three sets of experiments are conducted on baseline: 1) the combination of Color and Thermal images as the input source (the input of the two branches of the backbone network are respectively Color and Thermal images); 2) dual-stream Color image as the input source (use Color images to replace Thermal images, that is, the backbone network input source is Color images); 3) dual-stream Thermal images as the input source (use Thermal images to replace Color images, that is, the backbone network input source is Thermal images).The training set of the model here is 7061 pairs of images of SA, and the test set is 2252 pairs of Color and Thermal images. Table 3 shows the MRs of these three input sources for the all-weather, daytime, and nighttime scenes. It can be seen from Table 3 that the MRs obtained using Color and Thermal images as input to the network are 13.84%, 15.35% and 12.48% for the all-weather, daytime and nighttime scenes, respectively, which are 11.53%, 3.96%, 18.70% and 3.71%, 7.46%, 0.13% lower than using Color images and Thermal images as input alone. The experimental results prove that the detection network combining Color and Thermal features delivers better performance, indicating that Color and Thermal features are important for pedestrian detection.

TABLE 3

TABLE 3. MRs of different modal inputs.

Figure 4 shows the Miss Rate-FPPI curves of the detection results for these three input sources in the all-weather, daytime, and nighttime scenes (blue, red and green curves indicate dual-stream Thermal images, dual-stream Color images, and Color and Thermal images, respectively). By analyzing the Miss Rate-FPPI curve trend and combining with the experimental data in Table 3, it can be seen that the detection effect of Color images as the input source is better than that of Thermal images in the daytime scene while the result is the opposite for the night scene, and the detection effect of Color and Thermal images combined as the input source is better than that of single-modal input in both daytime and nighttime. It shows that there are complementary features between Color and Thermal modalities, and the fusion of the two modal features can improve the pedestrian detection performance.

FIGURE 4

FIGURE 4. The Miss Rate-FPPI curves of the detection results of the three groups of input sources in the All-weather, Daytime and Nighttime scenes (From left to right, All-weather, Daytime and Nighttime Miss Rate-FPPI curves are shown in the figure).

4.4.3.2 Ablation experiments

In this section, ablation experiments are conducted to demonstrate the effectiveness of the proposed cascaded information enhancement module (CIEM) and cross-modal attentional feature fusion module (CAFFM). Here, 7061 pairs of SA images are used to train the model, and 2252 pairs of Color and Thermal images in the test set are used to test the model.

Effectiveness of CIEM: CIEM is used to enhance the pedestrian features in Color and Thermal images to reduce the interference from the background. The experimental results are shown in Table 4. The MRs of baseline on SA are 13.84%, 15.35% and 12.48% for all-weather, daytime and nighttime scenes, respectively. When CIEM is additionally employed, the MRs are 11.21%, 13.15% and 9.07% for all-weather, daytime and nighttime scenes, respectively, which are reduced by 2.63%, 2.20% and 3.41% compared to the baseline, respectively. It is shown that the proposed CIEM effectively enhances the pedestrian features in both modalities, reduces the interference of background, and improves the pedestrian detection performance.

TABLE 4

TABLE 4. MRs for ablation studies of the proposed method on SA.

Validity of CAFFM: CAFFM is used to effectively fuse Color and Thermal features. The experimental results are shown in Table 4. On the SA, when the baseline is used with CAFFM, the MRs are 11.68%, 13.81% and 9.50% in all-weather, daytime and nighttime scenes, respectively, which are reduced by 2.16%, 1.54% and 2.98% compared baseline, respectively. It shows that the proposed CAFFM effectively fuses the two modal features to achieve robust multispectral pedestrian detection.

Overall effectiveness: The proposed CIEM and CAFFM are additionally used on the basis of baseline. Experimental results show a reduction of 3.13%, 2.26% and 4.03% in MRs for all-weather, daytime and nighttime scenes, respectively, compared to the baseline, indicating the overall effectiveness of the proposed method. A closer look reveals that with additional employment of CIEM and CAFFM alone, MRs are decreased by 2.63% and 2.16%, respectively, in the all-weather scene, but the MR of the overall model is reduced by 3.13%. It demonstrates that there is some orthogonal complementarity in the role of the proposed two modules.

Figure 5 shows the Miss Rate-FPPI curves for CIEM and CAFFM ablation studies in all-weather, daytime and nighttime scenes (blue, red, orange and green curves represent baseline, baseline + CIEM, baseline + CAFFM and overall model, respectively). It is clear that the curve trends of each module and the overall model are both lower than that of the baseline, which further proves the effectiveness of the method presented in this work.

FIGURE 5

FIGURE 5. The Miss Rate-FPPI curves of CIEM and CAFFM ablation studies in All-weather, Daytime and Nighttime scenes (From left to right, All-weather, Daytime and Nighttime Miss Rate-FPPI curves are shown in the figure).

Furthermore, in order to qualitatively analyze the effectiveness of the proposed CIEM and CAFFM, four pairs of Color and Thermal images (two pairs of images are taken from daytime and two pairs of images are taken from nighttime) are selected from the test set for testing. The pedestrian detection results of the baseline and each proposed module are shown in Figure 6. The first row is the visualization results of labeled boxes for Color and Thermal images, and the second to the fifth rows are the visualization results of the labeled and prediction boxes for baseline, baseline + CIEM, baseline + CAFFM, and the overall model pedestrian detection with the green and red boxes representing the labeled and prediction boxes, respectively. It can be seen that the proposed method successfully addresses the problem of pedestrian missing detection in complex environments and achieves more accurate detection boxes. For example, the second row, pedestrian detection missing happens in the first, third, and fourth pairs of images in the baseline detection result, however, the pedestrian miss detection problem is properly solved with CIEM and CAFFM added to the baseline and the overall model produces more accurate pedestrian detection boxes.

FIGURE 6

FIGURE 6. In this paper, each module and baseline pedestrian detection results (The first row is the visualization results of labeled boxes for Color and Thermal images, and the second to the fifth rows are the visualization results of the labeled and prediction boxes for baseline, baseline + CIEM, baseline + CAFFM and the overall model pedestrian detection with the green and red boxes representing the labeled and prediction boxes, respectively).

5 Conclusion

In this paper, we propose a multispectral pedestrian detection algorithm including cascaded information enhancement module and cross-modal attention feature fusion module. The proposed method improves the accuracy of pedestrian detection in multispectral images (Color and Thermal images) by effectively fusing the features from the two modules and augmenting the pedestrian features. Specifically, on the one hand, a cascaded information enhancement module (CIEM) is designed to enhance single-modal features to enrich the pedestrian features and suppress interference from the background noise. On the other hand, unlike previous methods that simply splice Color and Thermal features directly, a cross-modal attention feature fusion module (CAFFM) is introduced to mine the features of both Color and Thermal modalities and to complement each other, then complementary enhanced modal features are used to construct global features. Extensive experiments have been conducted on two improved annotations of the public dataset KAIST. The experimental results show that the proposed method is conducive to obtain more comprehensive pedestrian features and improve the accuracy of multispectral image pedestrian detection.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://gitcode.net/mirrors/soonminhwang/rgbt-ped-detection?utm_source=csdn_github_accelerator.

Author contributions

YY responsible for scheme design, experiment and writing of the paper. KX guide the scheme design and experiment of the paper. KW guide experimental data analysis, paper writing and modification.

Funding

This work was supported by the National Natural Science Foundation of China (No. 52107017) and Fundamental Research Fund of Science and Technology Department of Yunnan Province(No.202201AU070172).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Jeong M, Ko BC, Nam J-Y. Early detection of sudden pedestrian crossing for safe driving during summer nights. IEEE Trans Circuits Syst Video Technol (2017) 27:1368–80. doi:10.1109/TCSVT.2016.2539684

Cascaded information enhancement and cross-modal attention feature fusion for multispectral pedestrian detection

1 Introduction

2 Related works

2.1 Multispectral pedestrian detection

2.2 Attention mechanisms

3 Methods

3.1 Cascaded information enhancement module

3.2 Cross-modal attention feature fusion module

3.3 Loss function

4 Experimental results and analysis

4.1 Datasets

4.2 Evaluation indexes

4.3 Implementation details

4.4 Experimental results and analysis

4.4.1 Construction of the baseline

4.4.2 Performance comparison of different methods

4.4.3 Analysis of ablation experiments

4.4.3.1 Complementarity and importance of color and thermal features

4.4.3.2 Ablation experiments

5 Conclusion

Data availability statement

Author contributions

Funding

Conflict of interest

Publisher’s note

References

94% of researchers rate our articles as excellent or good