A novel Dynahead-Yolo neural network for the detection of landslides with variable proportions using remote sensing images

Han, Zheng; Fang, Zhenxiong; Li, Yange; Fu, Bangjie

doi:10.3389/feart.2022.1077153

ORIGINAL RESEARCH article

Front. Earth Sci., 04 January 2023

Sec. Geohazards and Georisks

Volume 10 - 2022 | https://doi.org/10.3389/feart.2022.1077153

This article is part of the Research Topic: Artificial intelligence in Remote Sensing for Landslide Mapping

Zheng Han¹,²

Zhenxiong Fang¹

Yange Li¹*

Bangjie Fu¹

¹School of Civil Engineering, Central South University, Changsha, China
²Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures, Changsha, China

Efficient, automatic landslide detection benefits regional hazard mitigation. Deep learning has been widely applied to landslide detection in recent years. However, improving detection accuracy through better extraction of landslide features remains an open issue; in particular, landslides that occupy only a small proportion of a remote sensing image are still difficult to identify. To address this issue, we propose a detection model, Dynahead-Yolo, which integrates unified scale-aware, space-aware, and task-aware attention mechanisms into the YOLOv3 framework. The proposed method focuses on the detailed features of landslide images with variable proportions and improves the ability to decode landslides against complex backgrounds. We determine the most effective cascade order of the three attention modules and compare the model with previous detection networks on randomly generated prediction sets from three study areas. Compared with the original YOLOv3, the detection rate of Dynahead-Yolo for small-proportion landslides and complex-background landslides increases by 13.67% and 14.12%, respectively.

1 Introduction

Landslides are a common and extremely hazardous natural phenomenon that poses serious threats to human lives and causes huge economic losses (Clague et al., 2012; Alam, 2020; Valdés Carrera et al., 2021). After a landslide event, it is essential to determine the magnitude and spatial distribution of the landslides quickly and accurately to support subsequent rescue. Such information also helps to update existing landslide databases and provides data support for landslide research (Hungr et al., 2014; Ghorbanzadeh et al., 2019; Guzzetti et al., 2012).

Field investigation is the traditional approach to landslide surveying, but it requires surveyors to spend considerable time and effort on site. With the rapid development of remote sensing technology, the resolution of remote sensing images has improved significantly, which makes it possible to obtain large-scale feature information, and such images have been widely used in geological hazard interpretation (Meena et al., 2022; Liu and Wu, 2016). Currently, there are four major types of methods for landslide identification from remote sensing images: visual interpretation, pixel-based, object-oriented, and artificial intelligence methods (Ju et al., 2020; Ju et al., 2022). Visual interpretation is the conventional way of extracting landslide information from remote sensing images (Xu et al., 2014; Petschko et al., 2016; Fiorucci et al., 2019). It requires experts to make comprehensive use of image features such as object shape, texture, and spectrum, and to combine them with non-remote-sensing data for analysis and reasoning. This method is time-consuming and labor-intensive, and it suffers from large errors and low efficiency (Hölbling et al., 2014; Moosavi et al., 2014; Wang et al., 2017; Zhao et al., 2017; Singh et al., 2021). Pixel-based methods usually use a binarization algorithm to determine whether a pixel of the image belongs to a landslide (Li et al., 2014; Han et al., 2019). A recent development of this kind of method is Han et al. (2022), in which we proposed a pixel-based landslide interpretation method with a multi-strategy feature fusion scheme that screens candidates using the terrain slope of the detected object, its main-axis features, and the Normalized Difference Vegetation Index (NDVI) to reduce the false detection rate. However, pixel-based methods have difficulty distinguishing landslides correctly when the image contains objects with spectral characteristics similar to those of landslides. The object-oriented recognition method segments remote sensing images by setting thresholds on spectral, shape, and texture information (Sandric et al., 2010; Eeckhaut et al., 2011; Lu et al., 2011; Stumpf and Kerle, 2011). However, the method is less widely applicable because the predefined feature thresholds vary from case to case and therefore need empirical adjustment.

The methods above all attempt to detect landslides in remote sensing images and have advanced landslide detection research to varying degrees. However, their inherent limitations should be noted, such as long detection times and low accuracy when the background of the remote sensing image is complex. With the continuous development of artificial intelligence (AI), big data, and related technologies, recent studies have applied AI-based methods to landslide detection. AI-based methods can generally be divided into machine learning and deep learning methods. Many studies have developed machine learning algorithms for landslide detection, such as logistic regression, support vector machines, Bayesian methods, and decision trees (Parker, 2013; Korup and Stolle, 2014; Hu et al., 2019; Piralilou et al., 2019). These methods generally require manual construction and selection of features, followed by classification with a classifier, which complicates the algorithm and limits real-time application. The main deep learning algorithms currently include Convolutional Neural Networks (CNNs) (Chumerin, 2017), Recurrent Neural Networks (RNNs) (Zaremba et al., 2014), and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). CNN-based methods have excellent non-linear mapping capabilities and can automatically learn features from landslide data (Hao et al., 2016; Vargas et al., 2017), which allows them to identify landslides quickly and accurately. Ghorbanzadeh et al. (2022a) established a benchmark dataset by manually annotating landslide images and evaluated the performance of 11 deep learning models for landslide boundary detection. Furthermore, they applied the U-Net and ResU-Net models to landslide detection from free satellite data for the first time. They trained the model on three case study regions and evaluated its transferability through different training-test scenarios (Ghorbanzadeh et al., 2021). For a recent development of the U-Net model for landslide detection, see Fu et al. (2022). Cai et al. (2021) introduced dense convolutional networks into landslide detection and significantly improved the detection ability of the model through measures such as feature reuse and feature enhancement. Detecting landslides against complex backgrounds with CNNs is challenging, but several studies have achieved this goal. Ju et al. (2020) combined deep learning and Google Earth data to detect historical landslides in typical loess regions of China. They established a database of historical loess landslides and used mask region-based convolutional networks to identify loess landslides automatically. Yu et al. (2022) detected loess landslides in complex backgrounds by improving the You Only Look Once X (YOLO X) model. Ghorbanzadeh et al. (2020) proposed the Dempster–Shafer model based on convolutional neural networks and combined it with the analysis of terrain factors to reduce the false detection rate for landslides with complex backgrounds.

According to the form of the detection results, these CNN-based detection models fall into two main categories. The first is the segmentation model, represented by U-Net and ResNet. Such models classify foreground and background pixel by pixel and predict the boundaries of landslides (Jiang et al., 2021; Liu et al., 2022). The second is the bounding box detection model, represented by YOLO and Faster R-CNN. Such models first divide the input image into patches of different sizes and then classify each patch to decide whether it contains a landslide (Hou et al., 2022; Liu et al., 2022). The final predictions are bounding boxes of landslides.

The attention mechanism originates from the study of the human visual system, which can automatically locate useful information and suppress useless information (Mnih et al., 2014). Dai et al. (2021) experimentally compared the effects of introducing multiple attention channels, a single attention channel, or no attention channel on detection ability, and found that models with multiple attention channels perform best. To improve the detection of landslides against complex backgrounds, some studies have introduced attention mechanisms into the model. Ji et al. (2020) designed a novel 3D attention module to emphasize the distinctive features of landslide images. Cheng et al. (2021) incorporated an attention module inspired by the visual system into the YOLOv4 model, which increased the attention paid to landslide features and reduced background noise. Amankwah et al. (2022) introduced an attention module into a deep network structure to improve the suppression of background noise. These results show that introducing an attention mechanism into the model is an effective way to improve landslide detection performance.

Although the above studies have employed deep learning algorithms for landslide detection and achieved satisfactory results, some limitations remain. First, most of these models focus on the impact of a single attention mechanism on detection performance and ignore the composite effect of multiple attention mechanisms. Dai et al. (2021) compared single and compound attention mechanisms and showed that the compound mechanism achieves better detection performance. Second, some landslide detection models focus mainly on large-proportion landslide images. For landslide images with complex backgrounds and landslides of different proportions, especially small-proportion landslide images, the resulting false detection rate is high. The low detection rate for small-proportion landslide images can be explained by two main reasons. First, small-proportion landslides occupy fewer pixels in the image, so their features are more likely to be lost during encoding, especially after pooling (Wang et al., 2018; Ghorbanzadeh et al., 2022b). For example, an image of size $256 \times 256$ is down-sampled to $128 \times 128$ by a pooling layer, and some pixel information is lost. Second, as the network deepens and the receptive field grows, the image features of small-proportion landslides are harder to retain than those of large-proportion landslides (Luo et al., 2016; Krishna and Jawahar, 2018; Ajaz et al., 2022). As a result, the model is more sensitive to large-proportion landslides and may miss small-proportion ones. For landslide detection in complex backgrounds, false detections often occur in regions whose spectral characteristics are similar to those of landslide regions (Han et al., 2019; Han et al., 2022).

To address these problems, we propose Dynahead-Yolo, an object detection model based on attention mechanisms. We choose YOLOv3 as the basic detection framework; it treats landslide detection as a regression problem and directly predicts the bounding box coordinates of the landslide area (Ju et al., 2022; Pang et al., 2022). The model has three detection branches at different scales, which makes it more effective for detecting small-proportion landslides. Based on the YOLOv3 detection framework, we redesign the detection head module. We employ scale-aware, space-aware, and task-aware attention modules in the detection head so that it can learn rich detailed features and achieve high-precision detection of landslides with variable proportions and complex backgrounds. We conduct multiple comparison experiments on a dataset covering three study areas. The experimental results demonstrate the feasibility and effectiveness of Dynahead-Yolo in landslide detection.

2 Materials and methods

2.1 Study areas

We select three study areas in this paper: Ludian County in Yunnan Province, Bijie City in Guizhou Province, and Beichuan County in Sichuan Province, where earthquakes and co-seismic landslides are often reported (Chang and Zhang, 2017; Ji et al., 2020). The three areas share common characteristics. They are all located in southwestern China, with extensive mountainous terrain, high mountains, steep valleys, and interlaced rivers and gullies. Landslides occur almost every year in these areas, causing serious damage to human settlements, tunnels, roads, bridges, farmland, and reservoirs, and greatly disrupting human life (Chen et al., 2016). The occurrence of landslides is also exacerbated by human activities such as logging, mining, and agricultural production. At present, there are two main ways to obtain the location and boundaries of landslides in the study areas. In the first, experts interpret landslides from drone images and then conduct on-site surveys to determine the landslide area and boundary. In the second, local residents report landslides to the government, which then assigns personnel to conduct landslide surveys.

2.2 Landslide datasets

In this study, we obtain remote sensing images of landslides from the 91 wemap software and the Bijie open-source database (Ji et al., 2020). These images are combined and pre-processed to generate the landslide dataset. We manually select a total of 950 valid and clear landslide images from the three study areas, and the selected images are additionally confirmed by experts. Part of the data comes from the Bijie open-source dataset (Ji et al., 2020), which was created for landslide segmentation. Because of the low resolution of its images, we select approximately 200 images with higher resolution. The rest of the data covers the Ludian and Beichuan areas and is obtained with the 91 wemap software. All images are three-channel (RGB) data. We randomly select 200 images as the test set. Then, we perform data augmentation to improve the robustness of the model. We randomly select 250 landslide images from the remaining images and augment them by geometric transformation and noise processing. The geometric transformations mainly include rotation, cropping, vertical flipping, and horizontal flipping. The noise processing includes adding salt-and-pepper noise, Gaussian noise, and random noise. The final dataset therefore includes a total of 1,200 images (950 original and 250 augmented), of which 900 are used for training, 100 for validation, and 200 for testing. The proposed landslide detection model is a supervised model, so we annotate the dataset with the LabelImg tool in PASCAL VOC2007 format. The label files mainly record the category of the object and the coordinates of the upper-left and lower-right corners of the label box. Figure 1 shows some of the labelled images.
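As an illustration of what such a label file records, the following minimal sketch reads one PASCAL VOC style annotation with the Python standard library. The file name, image size, and coordinates in `SAMPLE_XML` are invented for illustration and do not come from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC style. The file name, image size and
# box coordinates are invented for illustration only.
SAMPLE_XML = """
<annotation>
  <filename>landslide_0001.jpg</filename>
  <size><width>416</width><height>416</height><depth>3</depth></size>
  <object>
    <name>landslide</name>
    <bndbox>
      <xmin>120</xmin><ymin>85</ymin><xmax>260</xmax><ymax>310</ymax>
    </bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return a list of (category, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # (xmin, ymin) is the upper-left corner, (xmax, ymax) the lower-right.
        corners = [int(bndbox.findtext(tag))
                   for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, *corners))
    return boxes


boxes = read_boxes(SAMPLE_XML)
print(boxes)
```

Each tuple carries exactly the information described above: the object category followed by the two corner coordinates of the label box.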

FIGURE 1. Examples of labelled images.

2.3 Model architecture

The schematic architecture of Dynahead-Yolo is shown in Figure 2. It consists of three parts: a backbone for extracting features, a neck for feature fusion, and a detection head for object classification and localization. The three critical components are described in detail below.

FIGURE 2. Schematic architecture of Dynahead-Yolo. (A) Details of the backbone for feature extraction. (B) Structure of the neck for feature fusion. (C) Overview of the detection head for object classification and localization.

2.3.1 Backbone

Our proposed model is an improved landslide detection network based on YOLOv3. Its backbone is Darknet-53, as in YOLOv3; the difference lies mainly in the detection head of the model. The structure of Darknet-53 is shown in Figure 2A. It consists of one convolutional layer and five residual structures, and the number of basic residual blocks varies between structures. The first and second residual structures include 1 and 2 basic residual blocks, respectively. The third and fourth residual structures each contain 8 basic residual blocks, and the last residual structure contains 4. When a feature map tensor $f$ is sent into a basic residual block, a $1 \times 1$ and a $3 \times 3$ convolutional layer extract a feature $f^{'}$. Then $f$ and $f^{'}$ are summed through a shortcut connection to give the output of the residual block. In addition, each residual structure includes a $3 \times 3$ convolutional layer that reduces the height and width of the input features. Each of these convolutional layers is followed by a batch normalization (BN) layer and a leaky rectified linear unit (Leaky ReLU). In the model structure diagram, we use a CBL module to represent the combination of these three layers, as shown in the CBL block of the legend in Figure 2.
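The layer structure described above can be checked with a short count. This is a sketch under two assumptions that the text implies but does not state outright: every basic residual block contributes exactly two convolutions, and the $3 \times 3$ convolution at the start of each residual structure halves the height and width.

```python
# Number of basic residual blocks in the five residual structures (from the text).
BLOCKS_PER_STAGE = [1, 2, 8, 8, 4]


def count_conv_layers(blocks_per_stage):
    """Count the convolutional layers of the backbone described in the text."""
    total = 1  # the initial convolutional layer
    for n_blocks in blocks_per_stage:
        total += 1             # 3x3 convolution that reduces height and width
        total += 2 * n_blocks  # each basic block: one 1x1 and one 3x3 convolution
    return total


def total_stride(blocks_per_stage, stride_per_stage=2):
    """Overall down-sampling factor, assuming each stage halves the resolution."""
    return stride_per_stage ** len(blocks_per_stage)


n_conv = count_conv_layers(BLOCKS_PER_STAGE)
stride = total_stride(BLOCKS_PER_STAGE)
print(n_conv, stride)  # 52 32
```

Under these assumptions the backbone has 52 convolutional layers and an overall stride of 32, which matches the $\frac{H}{32} \times \frac{W}{32}$ size of the deepest feature map.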

Given an image with height H and width W, through the backbone network, we can acquire three different sizes of feature maps at the last three residual structures, denoted as $f_{1} \in R^{C_{1} \times \frac{H}{8} \times \frac{W}{8}} (C_{1} = 256)$ , $f_{2} \in R^{C_{2} \times \frac{H}{16} \times \frac{W}{16}} (C_{2} = 512)$ , and $f_{3} \in R^{C_{3} \times \frac{H}{32} \times \frac{W}{32}} (C_{3} = 1024)$ .
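As a worked example of these shapes, the sketch below evaluates them for a $416 \times 416$ input. The input size is an assumption; it is chosen because it reproduces the 52, 26 and 13 grid sizes reported for the detection head.

```python
def backbone_feature_shapes(height, width):
    """Return (channels, H/stride, W/stride) for f1, f2 and f3."""
    specs = [("f1", 256, 8), ("f2", 512, 16), ("f3", 1024, 32)]
    shapes = {}
    for name, channels, stride in specs:
        if height % stride or width % stride:
            raise ValueError("input size must be divisible by the stride")
        shapes[name] = (channels, height // stride, width // stride)
    return shapes


# Assumed input size of 416 x 416 pixels.
shapes = backbone_feature_shapes(416, 416)
for name, shape in shapes.items():
    print(name, shape)
```

For this input the three feature maps have spatial sizes $52 \times 52$, $26 \times 26$ and $13 \times 13$.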

2.3.2 Neck

The main function of the neck network is to fuse the feature maps obtained from the backbone for feature enhancement. The neck network of Dynahead-Yolo includes the three branches depicted in Figure 2B. The first branch produces a fusion feature map $h_{3}$, which is obtained from $f_{3}$ through five convolutional layers. The second branch merges feature maps of different sizes: $h_{3}$ is first processed by a convolutional layer and an up-sampling layer, then concatenated with $f_{2}$, and the result passes through five convolutional layers to give $h_{2}$. The third branch works similarly: $h_{2}$ is processed by a convolutional layer and an up-sampling layer, concatenated with $f_{1}$, and passed through five convolutional layers to give $h_{1}$. The fusion feature maps from the three branches are sent to the head module for object classification and localization. By fusing features of different scales, the neck module makes full use of the extracted information and improves the performance of the detection network.
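The data flow through the three branches can be traced at the level of tensor shapes. The sketch below is illustrative only: the channel widths after each convolution (512, 256 and 128) follow the commonly used YOLOv3 layout and are assumptions, since the text does not state them, and each group of five convolutional layers is collapsed into a single step.

```python
def conv(shape, out_channels):
    """Stride-1 convolution with 'same' padding: only the channel count changes."""
    return (out_channels, shape[1], shape[2])


def upsample2x(shape):
    """Up-sampling by a factor of two in height and width."""
    return (shape[0], 2 * shape[1], 2 * shape[2])


def concat(a, b):
    """Channel-wise concatenation; spatial sizes must agree."""
    if a[1:] != b[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return (a[0] + b[0], a[1], a[2])


def neck(f1, f2, f3):
    """Return the shapes of h1, h2, h3 (assumed channel widths)."""
    h3 = conv(f3, 512)                               # first branch
    h2 = conv(concat(upsample2x(conv(h3, 256)), f2), 256)  # second branch
    h1 = conv(concat(upsample2x(conv(h2, 128)), f1), 128)  # third branch
    return h1, h2, h3


h1, h2, h3 = neck((256, 52, 52), (512, 26, 26), (1024, 13, 13))
print(h1, h2, h3)
```

The point of the trace is that up-sampling doubles the spatial size at each step, so the deeper map can be concatenated with the shallower backbone feature map of the same resolution.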

2.3.3 Detection head

The detection head module is employed to predict the class and location of objects. In this research, we utilize the dynamic head module to improve the original head of YOLOv3, as shown in Figure 2C. The detection head module consists of three detection branches, each of which includes a convolutional layer and a dynamic head module. The dynamic head module (Dai et al., 2021) combines three attention mechanisms: spatial-aware, scale-aware and task-aware. In our Dynahead-Yolo model, we explore the effect of the connection order of three perception modules on the model performance and design the cascade order that is most useful for landslide image detection, as depicted in Figure 3. Details of the specific cascade sequence are provided in the discussion section.

FIGURE 3. The overall structure of Dynamic Head.

The task-aware block adapts to the detection task by activating the channels of the feature maps, which improves detection performance. Given an input feature map $x$, it first passes through an average pooling layer to reduce the feature dimension. Then, two fully connected layers and a normalization layer map the feature to the range of −1 to 1. The normalization layer is obtained by scaling and shifting the sigmoid function, and the normalized result $x_{n}$ is sent into a hyperfunction $θ (x_{n})$ to generate four learnable parameters $α^{1}$, $β^{1}$, $α^{2}$ and $β^{2}$ for subsequent computations. Finally, an activation function $f_{θ} (x)$ dynamically activates different channels of the input feature $x$ to give the final output of the task-aware block. A more detailed introduction to the hyperfunction $θ (x_{n})$ and the activation function $f_{θ} (x)$ can be found in Chen et al. (2020).

The spatial-aware attention block operates in the spatial dimension and uses deformable convolution to adapt to the shape and scale variations of objects. A convolutional layer first produces the offset and mask to be applied at each position of the convolution kernel. Then, the offset, mask, and input features are sent into the deformable convolution layer (Dai et al., 2017) to capture the geometric transformation of the landslide.

The scale-aware block fuses features of different scales based on the weights of the input features. The input feature first goes through an average pooling layer to remove redundant information and reduce the number of parameters. It is then sent into a convolutional layer with a kernel size of 1 and a ReLU activation layer. The final hard sigmoid activation layer is used to speed up training. The hard sigmoid activation function can be regarded as a classifier that approximates the sigmoid with a piecewise linear function, which is calculated by:

$$σ(z) = \max\left(0, \min\left(1, \frac{z + 1}{2}\right)\right) \tag{1}$$

The output of the activation function is multiplied by the corresponding elements of the input map to obtain the final result of the scale-aware block.
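A minimal sketch of Eq. 1 and of the final multiplication step is given below. It treats the gate as a single scalar per feature level and represents the feature map as a flat list of numbers; both are simplifications made here for illustration, not details taken from the model.

```python
def hard_sigmoid(z):
    """Piecewise linear approximation of the sigmoid from Eq. 1."""
    return max(0.0, min(1.0, (z + 1.0) / 2.0))


def scale_aware_gate(feature, pre_activation):
    """Scale a feature level by the hard sigmoid of its pre-activation weight.

    `feature` is a flat list of values and `pre_activation` a single scalar;
    both are simplifying assumptions for this sketch.
    """
    gate = hard_sigmoid(pre_activation)
    return [gate * value for value in feature]


print(hard_sigmoid(-1.0), hard_sigmoid(0.0), hard_sigmoid(1.0))
print(scale_aware_gate([2.0, 4.0], 0.0))
```

The function saturates at 0 for $z \leq -1$ and at 1 for $z \geq 1$, and is linear in between, so a feature level is either suppressed, passed through unchanged, or attenuated proportionally.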

When the fusion features of three sizes produced by the neck module are sent to the detection head module, they pass through two convolutional layers and the dynamic head module for classification and localization. The first convolutional layer is followed by a BN layer and a Leaky ReLU layer. The detection head branches generate three detection result maps of different sizes, namely $52 \times 52$, $26 \times 26$ and $13 \times 13$. Each result map has 18 channels. Each grid cell of the three result maps predicts 3 bounding boxes, giving a total of $(52 \times 52 + 26 \times 26 + 13 \times 13) \times 3 = 10{,}647$ bounding boxes. The 18 channels therefore split into three groups of six, and each group forms the vector $P$ of one predicted bounding box, composed as follows:

$$P = (t_{x}, t_{y}, t_{w}, t_{h}, P_{0}, P_{1}) \tag{2}$$

The first four elements represent the coordinate information of the predicted box. $P_{0}$ is the confidence that the predicted box contains an object, and $P_{1}$ is the probability that the object is a landslide. Finally, non-maximum suppression is applied to the generated prediction boxes to obtain the final prediction result.
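The box count and the grouping of channels described above can be reproduced in a few lines. The sketch assumes that the 18 channel values of a grid cell are stored box by box, that is, six consecutive values per box; the actual memory layout is not given in the text.

```python
GRID_SIZES = (52, 26, 13)
BOXES_PER_CELL = 3
VALUES_PER_BOX = 6  # t_x, t_y, t_w, t_h, P_0, P_1


def total_boxes(grid_sizes=GRID_SIZES, boxes_per_cell=BOXES_PER_CELL):
    """Total number of predicted boxes over all three result maps."""
    return sum(size * size for size in grid_sizes) * boxes_per_cell


def split_cell(channels):
    """Split the channel values of one grid cell into per-box vectors.

    Assumes a box-by-box layout of six consecutive values per box.
    """
    if len(channels) != BOXES_PER_CELL * VALUES_PER_BOX:
        raise ValueError("expected 18 channel values per grid cell")
    return [tuple(channels[i:i + VALUES_PER_BOX])
            for i in range(0, len(channels), VALUES_PER_BOX)]


n_boxes = total_boxes()
vectors = split_cell(list(range(18)))
print(n_boxes)     # 10647
print(vectors[1])  # (6, 7, 8, 9, 10, 11)
```

The $52 \times 52$ map alone contributes 8,112 of the 10,647 boxes, which is why the finest branch matters most for small-proportion landslides.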

2.4 Model evaluation metrics

To evaluate the detection accuracy of the model, we employ precision, recall, F1 score, and average precision (AP) as evaluation metrics. Precision is the proportion of correct predictions among the samples predicted to be landslides. Recall is the proportion of actual landslide samples that are correctly predicted as landslides. Precision and recall are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The F1 score is the harmonic mean of precision and recall and is a common object detection metric. The F1 score is defined as follows:

$$F_{1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$
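Eqs 3–5 translate directly into code. The counts in the usage example are hypothetical and serve only to show the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 false alarms, 10 missed landslides.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With these counts the precision is $80/100$, the recall is $80/90$, and the F1 score equals $2TP/(2TP + FP + FN) = 160/190$.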

In addition, we use the AP to evaluate the performance of the model. First, the detection results are sorted in descending order of prediction confidence; then the cumulative precision and recall are calculated for each sample; finally, the precision-recall curve is drawn. The AP value is the area under this curve, calculated as follows:

$$AP = \sum_{i = 1}^{n - 1} (r_{i + 1} - r_{i}) \times P_{\mathrm{inter}} \tag{6}$$

where $r_{1}, r_{2}, \dots, r_{n}$ are the recall values in ascending order, $r_{i + 1} - r_{i}$ is the difference between adjacent recall values, and $P_{\mathrm{inter}}$ is the maximum of the precision values corresponding to the recall values $r_{i + 1}$ and $r_{i}$. In addition, we adopt the intersection over union (IoU) to measure the overlap between the detection and label boxes. The larger the overlap, the more credible the detection box. IoU is defined as follows:

$$\mathrm{IoU} = \frac{P \cap G}{P \cup G} \tag{7}$$

where $P$ and $G$ represent the prediction box and the label box, respectively, and $\cap$ and $\cup$ denote the intersection and union of the two bounding boxes.
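The AP and IoU definitions above can be sketched as follows. The AP function uses the all-point interpolation common in PASCAL VOC style evaluation, in which the precision at each recall level is replaced by the maximum precision at any higher recall, and it starts the sum from a recall of zero; this interpretation of $P_{\mathrm{inter}}$ is an assumption, and the detections in the usage example are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(scored_hits, n_labels):
    """AP from (confidence, is_true_positive) pairs and the number of label boxes."""
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    recalls, precisions = [], []
    true_positives = 0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        true_positives += 1 if is_hit else 0
        recalls.append(true_positives / n_labels)
        precisions.append(true_positives / rank)
    # Interpolation: make the precision non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
# Hypothetical detections: two hits and one false alarm for two label boxes.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_labels=2)
print(round(overlap, 4), round(ap, 4))
```

In the example the two boxes share an area of 1 out of a union of 7, and the AP is $0.5 \times 1 + 0.5 \times \frac{2}{3} = \frac{5}{6}$.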

3 Model training and results

3.1 Model training

To evaluate the effectiveness of the proposed model, we conduct a series of comparative and ablation experiments. To ensure identical conditions for each experiment, all object detection networks are trained on the same dataset, and all experiments are performed on a server with an NVIDIA RTX 2080 Ti GPU, 12 GB of GPU memory, and a 6× Xeon E5-2678 v3 CPU. The model weights are initialized from a model pretrained on the VOC2007 dataset. Each experiment is trained for at most 200 epochs, comprising 100 freeze-training epochs and 100 non-freeze training epochs. During freeze training, the parameters of the backbone feature extraction network are not updated; during non-freeze training, all parameters of the model are updated. This training scheme speeds up training and helps the model converge quickly. The Adam optimizer is used with an initial learning rate of $3 \times 10^{- 4}$ and a weight decay of $5 \times 10^{- 4}$. The learning rate decays with a fixed step size of 1 and a decay coefficient of 0.94.
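The schedule described above can be written out as a small sketch. It assumes that the step size of 1 refers to one epoch, that epochs are counted from zero, that the freeze phase comes first, and that the learning rate is not reset when the backbone is unfrozen; none of these details is stated explicitly in the text.

```python
INITIAL_LR = 3e-4
DECAY = 0.94
FREEZE_EPOCHS = 100
MAX_EPOCHS = 200


def learning_rate(epoch, step_size=1):
    """Step decay: multiply the rate by DECAY every `step_size` epochs (assumed unit)."""
    return INITIAL_LR * DECAY ** (epoch // step_size)


def backbone_frozen(epoch):
    """True while the backbone parameters are kept fixed (assumed to be the first phase)."""
    return epoch < FREEZE_EPOCHS


for epoch in (0, 1, 99, 100, MAX_EPOCHS - 1):
    print(epoch, backbone_frozen(epoch), learning_rate(epoch))
```

Under these assumptions the learning rate falls from $3 \times 10^{-4}$ to $2.82 \times 10^{-4}$ after the first epoch and keeps shrinking geometrically over the 200 epochs.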

We plot the loss curves for the training and validation sets in Figure 4A. The model loss gradually decreases during training and finally oscillates around a stable value, indicating that the model has been fully trained. In addition, we calculate the mean average precision (mAP) of the training weights on the training and validation datasets and plot the curves in Figure 4B. The mAP values on the training and validation datasets stabilize at approximately 97.8% and 83.2%, respectively, suggesting that the proposed Dynahead-Yolo performs well in landslide detection. We select the model weights of the 185th epoch, which has the smallest validation loss, for testing, and compare them with other models to verify the feasibility of the study.

FIGURE 4. (A) Loss curve during training. (B) Model mAP curve.

3.2 Results

To illustrate the ability of Dynahead-Yolo to recognize landslides in complex backgrounds and to detect landslides of variable proportions, especially small-proportion landslides, we randomly select 200 test images and count the number of landslides of each proportion. The object detection model extracts features from pixels in a local area through a sliding window. We therefore classify landslides into large-, medium-, and small-proportion landslides according to the proportion of pixels they occupy in the image. This proportion equals the ratio of the landslide area to the image area. When the landslide area accounts for more than 50% of the total image area, it is considered a large-proportion landslide; when the ratio is between 10% and 50%, it is a medium-proportion landslide; and when the ratio does not exceed 10%, it is a small-proportion landslide. The criteria are as follows:

$$α = \frac{S_{L}}{S_{I}} \tag{8}$$

$$\begin{cases} α \leq 10\% & \text{small-proportion landslides} \\ 10\% < α \leq 50\% & \text{medium-proportion landslides} \\ α > 50\% & \text{large-proportion landslides} \end{cases} \tag{9}$$

where $S_{L}$ is the area of the landslides; $S_{I}$ represents the area of the entire image; and $α$ represents the area ratio, which is the proportion of landslides in the image.
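The classification rule of Eqs 8 and 9 is easy to express in code. The areas in the usage example are hypothetical, and the sketch does not assume any particular way of measuring the landslide area.

```python
def area_ratio(landslide_area, image_area):
    """Area ratio alpha from Eq. 8."""
    if image_area <= 0:
        raise ValueError("image_area must be positive")
    return landslide_area / image_area


def proportion_class(alpha):
    """Proportion class from Eq. 9, with alpha given as a fraction in [0, 1]."""
    if alpha <= 0.10:
        return "small"
    if alpha <= 0.50:
        return "medium"
    return "large"


# Hypothetical example: a 50 x 50 pixel landslide in a 416 x 416 pixel image.
alpha = area_ratio(50 * 50, 416 * 416)
print(round(alpha, 4), proportion_class(alpha))
```

In this example the landslide covers about 1.4% of the image and therefore falls into the small-proportion class.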

The complexity of the background mainly comes from four sources: 1) interference from clouds and fog during imaging, 2) interference from houses near the landslide, 3) interference from bare sand with characteristics similar to those of landslides, and 4) interference from terraces in mountainous areas. Based on these criteria, we count 80 landslide images with complex backgrounds in the test set. We apply Dynahead-Yolo to these complex-background images and compare the results with the predictions of YOLOv3. Figure 5 shows the ground truth and the detection results of Dynahead-Yolo and YOLOv3. The blue bounding box is the manually annotated label. The green bounding box is a correctly predicted detection box, which should ideally overlap with the label, and the red box is an incorrect prediction. YOLOv3 mistakenly detects areas adjacent to landslides, terraced fields, and houses as landslides, as shown in Figures 5A′–C′. In addition, in Figure 5E′, YOLOv3 fails to detect the landslide because the spectral radiance of the target region is similar to that of the background region. Overall, the proposed model identifies landslides well in such complex backgrounds and achieves higher detection accuracy.

FIGURE 5. Complex background landslide detection. A–E show the landslide detection results of Dynahead-Yolo with different complex backgrounds. A′–E′ show the detection results of YOLOv3 for the corresponding complex backgrounds.

The detection results for landslides with variable proportions are shown in Figure 6. We compare the results of the two models on small-, medium-, and large-proportion landslides. The number above each image is the percentage of the image occupied by the landslide, i.e., the ratio of the area of the detected landslide to the image area. In Figures 6A′, J′, L′, we find that whether the proportion of the landslide is very small or very large, Dynahead-Yolo can locate the landslide and predict an accurate bounding box. As shown in Figures 6C′, D′, the YOLOv3 model may mistake exposed soil for a landslide. In addition, Dynahead-Yolo achieves a higher IoU between the label and the detected box in Figures 6I, I′. A more detailed comparison of the IoU distribution is provided in the Discussion section.

FIGURE 6. Detection of landslides with variable proportions. A–D and A′–D′ show the detection results of Dynahead-Yolo and YOLOv3 for small-proportion landslides, respectively. E–H and E′–H′ compare Dynahead-Yolo and YOLOv3 on medium-proportion landslides. I–L and I′–L′ show the detection results of Dynahead-Yolo and YOLOv3 for large-proportion landslides, respectively.

To further verify the ability to detect landslide images with complex environments and variable proportions, we counted the percentage of correct predictions made by the two models in different scenarios, as summarized in Table 1. According to the comparison, the correct detection rate of our method for complex background landslides reaches 82.3%, which is 14.1% higher than that of YOLOv3. Moreover, our method improved the detection rate of small-proportion and large-proportion landslides by 13.7% and 7.8%, respectively. These results show the advantages of the Dynahead-Yolo model in detecting landslides with complex backgrounds and variable proportions.

TABLE 1. Percentage of correct landslide detections with variable proportions and complex backgrounds.
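
The per-scenario percentages of the kind reported in Table 1 can be obtained by simple counting. The sketch below is only an illustration under the assumption that each test sample has already been assigned a scenario label and marked as correctly or incorrectly detected; the function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def detection_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Percentage of correct detections for each scenario.

    Each record is (scenario_name, is_correct), for example
    ("complex background", True).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for scenario, is_correct in records:
        totals[scenario] += 1
        correct[scenario] += int(is_correct)
    return {name: 100.0 * correct[name] / totals[name] for name in totals}
```

Running the function once per model and subtracting the two percentages for a scenario gives the difference in detection rate between the models.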

4 Discussion

4.1 Model results for different concatenation sequences

We first investigate the effect of the cascade order of the three attention modules on the model performance. As shown in Table 2, we evaluate the performance of six concatenation sequences for landslide detection from four aspects: precision, recall, F1 value, and AP value. The comparison shows that the concatenation sequence has a significant impact on the performance of the model. The order of task-aware, space-aware, and scale-aware attention performs best, reaching an AP value of 85.53%. This cascade sequence first activates different channels according to the detection task, then enhances the spatial location features of foreground objects through the space-aware attention module, and finally improves the detection ability for landslide areas with different proportions through the scale-aware attention module.

TABLE 2. Performance of the model with different concatenation sequences.
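
The six concatenation sequences are simply the permutations of the three attention modules. As a minimal sketch of this experimental setup, the code below enumerates the orders, applies a set of blocks in a chosen order, and computes precision, recall, and F1 from detection counts. The attention blocks are represented by arbitrary callables supplied by the caller; all names are hypothetical and no actual attention computation is implemented here.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

MODULES = ("scale-aware", "space-aware", "task-aware")


def cascade_orders() -> List[Tuple[str, ...]]:
    """All orders in which the three attention modules can be cascaded."""
    return list(permutations(MODULES))


def apply_cascade(features, order: Sequence[str], blocks: Dict[str, Callable]):
    """Pass the features through the given blocks in the given order."""
    for name in order:
        features = blocks[name](features)
    return features


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Each order returned by cascade_orders would be trained and evaluated separately; the order ("task-aware", "space-aware", "scale-aware") corresponds to the best-performing sequence reported above.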

4.2 Comparison of different detection models

To further demonstrate the performance of Dynahead-Yolo, we compare it with the detection results of previous object detection models such as Faster R-CNN, YOLOv3, SSD, and CenterNet. All models were trained on an RTX 2080 Ti GPU for 200 epochs each and tested on the same dataset. Table 3 summarizes the comparison of the four evaluation indexes for the different models. The Faster R-CNN with a ResNet50 backbone network has the highest recall value, but its precision is only 45.56%, which shows that the model produces many false positives in landslide detection. Faster R-CNN is a two-stage detection model. The first stage generates many proposals, and the second stage adjusts the coordinates of the proposals. In the first stage, all areas suspected of being landslides are detected as proposals, which leads to a high false detection rate. Dynahead-Yolo directly extracts features from the network and predicts the location of landslides without generating proposals. The addition of the compound attention module enhances the ability of the model to acquire features of different scales and spatial features, enabling it to obtain better detection results. Precision and recall evaluate the performance from different perspectives, so the F1 score and mAP, which combine the two indexes, are usually employed to evaluate the model comprehensively. The F1 score and mAP of Dynahead-Yolo are 0.87 and 85.53%, respectively, which are 0.02 and 6.95% higher than those of YOLOv3. The results show that the proposed method is more suitable for the automatic detection of landslides in remote sensing images.

TABLE 3. Model performance.
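
For reference, the AP of a single class can be computed as the area under the precision-recall curve built from detections ranked by confidence. The sketch below uses all-point interpolation, which is one common convention; the interpolation scheme used in this study is not stated here, so this choice, the function name, and the input layout are assumptions.

```python
from typing import Sequence, Tuple


def average_precision(
    detections: Sequence[Tuple[float, bool]], num_ground_truth: int
) -> float:
    """All-point interpolated AP for one class.

    detections: (confidence, is_true_positive) for every predicted box.
    num_ground_truth: number of labeled landslides in the test set.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the interpolated precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

With a single landslide class, the mAP equals this AP value.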

We calculated the IoUs of the detection results for each model and drew a violin plot, in which the horizontal axis shows the individual models and the vertical axis shows the IoU values, as presented in Figure 7. The blue and red areas represent the number of true positive and false positive samples, respectively. Figure 7 shows that the IoUs of the correct predictions of Dynahead-Yolo are mainly concentrated in the range of 0.8–0.87, which is clearly higher than that of the other models. The results indicate that the predicted bounding boxes closely match the labeled bounding boxes.

FIGURE 7. IoU distribution of different models.
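
The data behind such a plot can be prepared by separating the IoU values of each model into true positives and false positives and then checking how many values fall in a given interval. The sketch below assumes that a detection counts as a true positive when its IoU with the label is at least 0.5; this threshold and the function names are assumptions for illustration only.

```python
from typing import Dict, List, Sequence


def split_by_iou(ious: Sequence[float], threshold: float = 0.5) -> Dict[str, List[float]]:
    """Group IoU values into true positives and false positives.

    The 0.5 default is an assumed threshold, not a value taken from the study.
    """
    groups: Dict[str, List[float]] = {"true_positive": [], "false_positive": []}
    for value in ious:
        key = "true_positive" if value >= threshold else "false_positive"
        groups[key].append(value)
    return groups


def share_in_range(values: Sequence[float], low: float, high: float) -> float:
    """Fraction of values lying in the closed interval [low, high]."""
    if not values:
        return 0.0
    return sum(low <= v <= high for v in values) / len(values)
```

For example, share_in_range(groups["true_positive"], 0.8, 0.87) gives the fraction of correct predictions whose IoU lies in the range highlighted above.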

To illustrate the performance of Dynahead-Yolo for landslides with variable proportions, we compare it with the results of other studies. We select three published papers (Ye et al., 2019; Ju et al., 2022; Li et al., 2022) that use deep learning methods for landslide detection, and calculate the percentage of the image occupied by landslides in the detected images shown in each paper. The comparison is presented as a box plot in Figure 8. We assume that the images shown in each paper are representative of the detection performance of its model, so the box plot indicates the range of landslide proportions for which each model is best suited. According to Figure 8, the models of Ye et al. (2019) and Ju et al. (2022) are more useful for medium-proportion and large-proportion landslides. The landslide proportions in Li et al. (2022) are mainly distributed above 50%, so that model is more effective for large-proportion landslides. Dynahead-Yolo shows good detection performance across different proportions of landslides.

FIGURE 8. Comparison of landslide detection results with variable sizes.
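
A box plot of this kind is determined by the five-number summary of the landslide proportions collected for each model. As a minimal sketch, the summary can be computed with the Python standard library; the function name is hypothetical, and the "inclusive" quartile method is an assumption, since plotting tools differ in how they interpolate quartiles.

```python
from statistics import quantiles
from typing import Dict, Sequence


def five_number_summary(proportions: Sequence[float]) -> Dict[str, float]:
    """Minimum, quartiles, and maximum of landslide proportions (in percent)."""
    if len(proportions) < 2:
        raise ValueError("at least two values are required")
    # method="inclusive" treats the data as the full population (assumed choice).
    q1, median, q3 = quantiles(proportions, n=4, method="inclusive")
    return {
        "min": min(proportions),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(proportions),
    }
```

Comparing the summaries of two models shows at a glance whether their detected landslides are concentrated at small, medium, or large proportions.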

4.3 Limitations and future research

In this study, we propose an object detection model named Dynahead-Yolo for high-accuracy detection of landslides in complex environments and with variable proportions. We discuss the effects of various concatenation sequences of the three attention modules on the detection capability of the model, and compare the model with classical object detection networks. Nevertheless, this study has some limitations owing to the available data resources and experimental equipment. First, there are very few public datasets related to landslides, and it is difficult to obtain effective and clear landslide data. Therefore, the study uses only 1,200 landslide images, which is a small amount of data. In addition, the training labels are annotated manually. Although the labels have been confirmed and evaluated by experts, errors will remain, which affect the accuracy and precision of the model. In future research, we will acquire more landslide data containing various types of remote sensing images to improve the generalization and detection capability of the model.

In subsequent studies, we believe that attention mechanisms will remain an important means of improving the performance of the model. We will also try to combine the Dynamic head with other object detection models to improve detection capabilities.

5 Conclusion

In this paper, we proposed a novel Dynahead-Yolo neural network for the detection of landslides using remote sensing images. This neural network was specifically designed to address the poor detection of small-proportion landslides in remote sensing images. The combination of the three attention modules, namely scale-aware, space-aware, and task-aware, enhanced the feature extraction ability and improved the adaptability to landslide images with different proportions and complex backgrounds. Landslide images from Bijie city, Ludian County and Beichuan County were collected to generate datasets, and a randomly separated subset was used as the prediction set to verify the performance of the model. Results show that, compared with the conventional YOLOv3, the accuracy of the proposed Dynahead-Yolo for complex backgrounds and small-proportion landslides was improved by 14.12% and 13.67%, respectively, while the F1 score and AP of the model were 0.87 and 85.53%, respectively. The results indicate that the proposed Dynahead-Yolo outperforms the conventional YOLOv3 in detecting small-proportion landslides. However, the performance of the model may be limited by the size of the dataset and the reliability of the labels. Therefore, enlarging the dataset or designing a model suited to small-sample landslide detection is the main work for the future.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/Abbott-max/dataset; doi:10.4121/20401740.

Author contributions

ZH contributed to the conception of the study; ZF performed the experiment; BF contributed significantly to analysis and manuscript preparation; ZF performed the data analyses and wrote the manuscript; YL helped perform the analysis with constructive discussions.

Funding

This study was financially supported by the National Key R&D Program of China (Grant No. 2018YFD1100401); the National Natural Science Foundation of China (Grant No. 52078493); the Natural Science Foundation for Excellent Young Scholars of Hunan (Grant No. 2021JJ20057); the Innovation Provincial Program of Hunan Province (Grant No. 2020RC3002); the Natural Science Foundation of Hunan Province (Grant No. 2022JJ30700); the Science and Technology Plan Project of Changsha (No. kq2206014); and the Fundamental Research Funds for the Central Universities of Central South University (Grant No. 2022ZZTS0669).

Acknowledgments

Financial support is gratefully acknowledged.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ajaz, A., Salar, A., Jamal, T., and Khan, A. U. (2022). Small object detection using deep learning. arXiv e-prints. doi:10.48550/arXiv.2201.03243