A comparative study for landslide susceptibility assessment using machine learning algorithms based on grid unit and slope unit

Deng, Niandong; Li, Yuxin; Ma, Jianquan; Shahabi, Himan; Hashim, Mazlan; de Oliveira, Gabriel; Chaeikar, Saman Shojae

doi:10.3389/fenvs.2022.1009433

ORIGINAL RESEARCH article

Front. Environ. Sci., 01 November 2022

Sec. Environmental Informatics and Remote Sensing

Volume 10 - 2022 | https://doi.org/10.3389/fenvs.2022.1009433

This article is part of the Research Topic Advanced Application of Deep Learning, Statistical Modelling, and Numerical Simulation on Geo-Environmental Hazards

Niandong Deng¹,²

Yuxin Li³

Jianquan Ma¹,²

Himan Shahabi⁴,⁵*

Mazlan Hashim⁵

Gabriel de Oliveira⁶

Saman Shojae Chaeikar⁷

¹College of Geology and Environment, Xi’an University of Science and Technology, Xi’an, China
²Key Laboratory of Geological Processes and Mineral Resources, Northern Qinghai-Tibet Plateau, Xining, China
³Xi’an Meihang Remote Sensing Information Co., Ltd., Xi’an, China
⁴Department of Geomorphology, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran
⁵Geoscience and Digital Earth Centre (INSTeG), Research Institute for Sustainability and Environment (RISE), Universiti Teknologi Malaysia, Johor Bahru, Malaysia
⁶Department of Earth Sciences, University of South Alabama, Mobile, AL, United States
⁷Faculty of Information Technology, Monash University, Melbourne, VIC, Australia

Landslide susceptibility assessment is an important support for disaster identification and risk management. This study aims to analyze the applicability of hybrid machine learning models in different evaluation units. Three typical machine learning models, random forest (RF), forest by penalizing attributes (FPA), and rotation forest (ROF), were combined with the random subspace (RS) algorithm. Twelve evaluation factors, including elevation, slope angle, slope aspect, roughness, rainfall, lithology, distance to rivers, distance to roads, normalized difference vegetation index, topographic wetness index, plan curvature, and profile curvature, were extracted for 115 landslides in Yaozhou District, Tongchuan City, China. Six landslide susceptibility maps were generated based on slope units divided by curvature and 30 m resolution grid units. Multiple performance metrics showed that the RS-RF model based on slope units has excellent spatial prediction ability. At the same time, the curvature-based method of slope unit division proved more suitable for typical Loess tableland regions, which provides a basis for the selection of evaluation units in landslide susceptibility assessment.

1 Introduction

A landslide is a phenomenon in which the soil or rock mass on a slope moves downslope under the action of gravity (Varnes 1978). With extensive human activities and increasingly serious surface environmental problems, landslides have become one of the most severe geological disasters threatening human life and property (Kannan et al., 2015). On average, landslides cause more than 300 million dollars in economic losses each year, especially in mountainous regions of China (Wang Q. et al., 2016). According to China’s geological disaster bulletin, a total of 4,220 landslides occurred in China in 2019, accounting for 68.27% of all geological disasters nationally and causing more than 200 deaths. In recent years, policymakers have concentrated their efforts on formulating a series of measures intended to reduce the risk that landslides pose to people.

The landslide susceptibility map (LSM) has been considered an effective tool for landslide control and land use (Nicu and Asăndulesei 2018). In recent years, studies have been conducted on landslide susceptibility worldwide, and their results have provided an essential reference tool for local governments in disaster management and urban planning (Feizizadeh et al., 2014). The core goal of LSM is obtaining a high-precision susceptibility map. The overall workflow is well established, and its most important step is selecting an appropriate evaluation model. Previously, several methods and models were applied to landslide susceptibility maps, but there was no consistent standard for selecting models (Chen et al., 2017a). From previous studies, landslide susceptibility assessment models can be divided into qualitative, semi-quantitative, and quantitative methods (Lee et al., 2018). The qualitative method mainly relies on experts to score the topographical features and related parameters of a particular slope. The results depend on expert ability but can have high accuracy (Pham et al., 2020). However, this method is by nature highly subjective, and results obtained for one locale cannot be transferred to other locations. The semi-quantitative method combines the qualitative method with statistical analysis of relevant factors; accordingly, its results are also subjective and tend to be one-sided (Tien Bui et al., 2019b).

The quantitative method focuses on analyzing the relationship between influencing factors and landslides and quantitatively forecasting the possibility of landslides in a certain area (Reichenbach et al., 2018). It comprises two groups: probability statistic methods and machine learning methods. The probability statistic methods can be divided into binary and multivariate statistical models (Pourghasemi et al., 2012b). Classical binary statistical models include the frequency ratio, weight of evidence, index of entropy, and information value (Che et al., 2012). These methods are simple to construct but have a low predictive ability for landslides in more complex areas (Abedini et al., 2019a). Multivariate statistical models include logistic regression (Shahabi et al., 2015; Sangchini et al., 2016) and linear discriminant analysis (Nicu and Asăndulesei 2018). These models are often superior to binary statistical models but still have the disadvantage of low accuracy when faced with complex nonlinear data (Akgun 2012). With the development of computers, machine learning models have been widely used in landslide susceptibility mapping and have achieved good prediction results (Tien Bui et al., 2019a; Fang et al., 2020). Numerous machine learning models have been applied, including the artificial neural network (Bragagnolo et al., 2020), decision tree (Wang L.-J. et al., 2016; Wu et al., 2020), support vector machine (Chen et al., 2016; Yu et al., 2016), and naive Bayes (Tsangaratos and Ilia 2016), and they have performed well overall. These models have better-designed strategies for dealing with nonlinear problems. However, overfitting and parameter optimization are common problems in machine learning models (Peng and Bai 2019).

With the increased promotion of the decision tree model, some improved models based on the underlying theory have been applied to landslide susceptibility, such as the J48 tree (Hong et al., 2018a), reduced error pruning tree (Pham et al., 2019), alternating decision tree (Shirzadi et al., 2018), logistic model tree (Chen et al., 2018a; Chen et al., 2018b), naive Bayes tree (Chen et al., 2017c), random forest (RF) (Pourghasemi and Kerle 2016; Chen et al., 2019a; Hong et al., 2019), forest by penalizing attributes (FPA) (Hong et al., 2020), and rotation forest (ROF) (Chen et al., 2017b; He et al., 2019), among others. The core of the hybrid model is to combine individual learners through a range of strategies in order to enhance the diversity of learners and achieve complementary effects. Hybrid models such as fuzzy weight of evidence integration (Hong et al., 2017a), Bayesian logistic regression ensemble (Abedini et al., 2019b), and data mining combined with multi-criteria decision-making methods (Rafiei Sardooi et al., 2021), among others, have often been found to have a higher predictive capability than individual models (Umar et al., 2014; Pham et al., 2017a). In the present work, a variety of improved models were used to replace the decision tree, the original base classifier of the random subspace algorithm (RS), in order to compare the generalization ability of different hybrid models.

The selection of appropriate mapping units is one of the prerequisites for generating high-precision landslide susceptibility maps (Reichenbach et al., 2018). Different types or sizes of units present different landslide attribute information, which affects the model’s effectiveness. In previous studies, grid and slope units were mainly selected to evaluate landslide susceptibility. The grid unit divides the region into regular squares of a certain size, which has obvious advantages and disadvantages. Because its boundaries are regular, calculation is convenient and efficient, which favors attribute extraction and sample training for machine learning models. It ignores, however, the unique factors of topography and geomorphology, and the appropriate size of individual grid units is open to question (Trigila et al., 2015). Previous results showed that the resolution of grid units affects evaluation accuracy, and accuracy does not always improve as the grid becomes finer (Chen et al., 2020b). Slope units are commonly delineated in GIS software with a hydrological analysis model, which divides a region by valley lines and ridgelines and therefore considers the morphological elements of hills and mountainous areas. However, slope units divided in this way do not match the geomorphic background in wide, flat areas such as an intermountain basin (Reichenbach et al., 2018).

In this study, three RS-based hybrid models, using ROF, FPA, and RF as base classifiers, were built and applied to both curvature-based slope units and grid units. Their applicability was compared in order to obtain a more accurate landslide susceptibility map for Yaozhou District, Tongchuan City, China.

2 Study area

Yaozhou District is located south of Tongchuan City, Shaanxi Province, China. It lies between 108°34′–109°06′ east longitude and 34°50′–35°20′ north latitude, with a total area of about 1622 km² (Figure 1). It belongs to the southern Loess Plateau; the terrain is high in the north and low in the south, with a relative elevation difference of 1156 m. The north and west are mainly medium and low elevation mountainous areas, the central part is a ruined highland gully area, and the south is primarily a plains and river valley area. The study area has a warm temperate continental monsoon climate that is semi-arid to semi-humid. Rainfall is concentrated primarily between July and September, which accounts for more than half of the annual total. Spatially, rainfall increases from southeast to northwest. According to data from the main regional observation station, the maximum annual rainfall is 830.5 mm (1983), the minimum is 344.1 mm (1977), and the annual average is 616.3 mm. The highest temperature throughout the year is 39.7°C, and the lowest temperature is -16°C. The vegetation coverage rate of the whole area is 41%; it reaches 85% in the northern mountainous area but is only 10% in the southern plateau area. The area lies in a seismic intensity zone of degree VII, and there is no record of strong earthquakes.

FIGURE 1. Location of the study area and landslide inventory map.

3 Material and methods

3.1 Preparation of landslide inventory and datasets

The preparation of a landslide inventory map is the first step in landslide susceptibility assessments (Rosi et al., 2018). In this study, 115 landslides were delineated by referring to the detailed survey report of geological disasters. The basic landslide data for this area were obtained from field surveys, interpretation of remote sensing satellite images, and a 1:50000 scale geological map. In line with the 1:50000 scale of the geological map, a grid of 30 × 30 m was selected as the base evaluation unit (Petschko et al., 2014). The landslide inventory map of the study area was generated in ArcGIS software (Figure 1) (Tien Bui et al., 2016b). To construct positive and negative sample data, an equal number (115) of non-landslide points were randomly selected (Pham et al., 2016). The elevation data were obtained from the geospatial data cloud (http://www.gscloud.cn/). The sample data were randomly divided into a training set (170 locations) and a validation set (60 locations) at a ratio of approximately 7:3 (Dou et al., 2019).

3.2 Selection of landslide influencing factors

Selecting relevant influencing factors is particularly important for generating a high-precision landslide susceptibility map. Because geological environment conditions and landslide formation mechanisms differ between study areas, previous studies offer no clearly defined standard for factor selection (Chen et al., 2018c; Abuzied and Alrefaee 2019). In this paper, with reference to the factors used in previous studies and the geological environment conditions of the study area, a total of 12 influencing factors were selected for the model, including elevation, slope angle, slope aspect, roughness, rainfall, lithology, distance to rivers, distance to roads, normalized difference vegetation index (NDVI), topographic wetness index (TWI), plan curvature and profile curvature. ASTER GDEM 30 m resolution digital elevation data were used to generate elevation, slope angle, slope aspect, roughness, TWI, plan curvature, and profile curvature. NDVI, distance to roads, and distance to rivers were derived from Landsat 8 OLI images. Rainfall observation data for multiple years were obtained from the Meteorological Bureau of Tongchuan City, Shaanxi Province, China (http://sn.cma.gov.cn/).

3.2.1 Elevation

As one of the most critical factors affecting landslides, elevation is widely used in landslide susceptibility modeling (Pradhan 2013). It mainly affects the stress distribution of the slope, which is crucial to the stability of a landslide. In this study, elevation was classified by the natural break model as 0–815 m, 815–991 m, 991–1170 m, 1170–1358 m, and 1358–1704 m, totaling five categories (Figure 2A).

FIGURE 2. Thematic maps of the study area: (A) elevation; (B) slope angle; (C) slope aspect; (D) roughness; (E) rainfall; (F) lithology; (G) distance to rivers; (H) distance to roads; (I) NDVI; (J) TWI; (K) plan curvature; (L) profile curvature.

3.2.2 Slope angle

The slope angle is also closely related to the stability of a landslide (Saha et al., 2005). When the slope is larger than the dip angle of the rock-soil structural surface, the overlying rock-soil mass will slide along the crack surface. The slope angle was divided into five classes: 0–8.44°, 8.44–15.83°, 15.83–24.30°, 24.30–36.19°, 36.19–72.56° by the natural break method (Figure 2B).

3.2.3 Slope aspect

The slope aspect is the orientation of the slope, and different orientations receive different amounts of sunlight, which leads to differences in vegetation growth and weathering (Galli et al., 2008; Trigila et al., 2015). Studies have shown that the soil moisture of a shady slope is 1.09–1.52 times that of a sunny slope and the vegetation coverage rate is 4–5 times that of a sunny slope (Tien Bui et al., 2016a). The temperature of a sunny slope is higher than that of a shady slope, which may cause weathering and fragmentation of carbonate rocks and induce landslide instability. This paper divided the slope aspect into nine categories: flat, north, northeast, east, southeast, south, southwest, west, and northwest, based on equal intervals (Figure 2C).

3.2.4 Roughness

Roughness refers to the ratio of the surface area of a particular area to its projected area, which is a dimensionless parameter. It was divided into 1–1.05, 1.05–1.14, 1.14–1.30, 1.30–1.59, and 1.59–3.34 through the natural break model (Figure 2D).

r = \frac{1}{\cos α} (1)

Where r is the surface roughness and $α$ is the slope angle.
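
As a quick check of Formula 1, the short Python sketch below evaluates the roughness at the upper bounds of the slope-angle classes from Section 3.2.2. The function name is illustrative and not part of the study's workflow. Under this formula a flat cell gives r = 1, and the steepest reported slope (72.56°) gives a value close to 3.34, in line with the upper end of the roughness range given above.

```python
import math

def surface_roughness(slope_deg: float) -> float:
    """Formula 1: r = 1 / cos(alpha), with the slope angle alpha in degrees."""
    return 1.0 / math.cos(math.radians(slope_deg))

# Upper bounds of the five slope-angle classes reported in Section 3.2.2
for slope in (8.44, 15.83, 24.30, 36.19, 72.56):
    print(f"slope {slope:5.2f} deg -> r = {surface_roughness(slope):.2f}")
```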

3.2.5 Rainfall

Rainfall is one of the factors that cause landslide instability and sliding and is closely related to subsequent landslides (Bai et al., 2013). It causes the soil to soften with water and lose its original strength. Rainfall was divided at equal intervals into seven classes, as follows: less than 560 mm, 560–580 mm, 580–600 mm, 600–620 mm, 620–640 mm, 640–660 mm, and more than 660 mm (Figure 2E).

3.2.6 Lithology

The weathering resistance and cohesion of different strata are different, so lithology is one of the important factors affecting landslide stability (Chen et al., 2019b). In this paper, lithology was divided into five groups according to age and composition: Group A: Quaternary Holocene (Q₄) silty sand, fine sand, medium sand, and gravel sand; Group B: Upper Triassic (T) sandstone, fine sandstone, mudstone, and sandy mudstone interbedded; Group C: Malan loess, fine sand, and silt from the Upper Pleistocene Quaternary system ( $Q_{3}^{eol}$ ); Group D: Ordovician (O) limestone, dolomite, and dolomite limestone; and Group E: Lower Cretaceous (K₁) sandstone, argillaceous sandstone and mudstone (Figure 2F).

3.2.7 Distance to rivers

The lateral erosion of a river will weaken the rock and soil stability on both banks’ slopes (Pourghasemi et al., 2012a). The distance to rivers was generated by Euclidean distance and divided into five categories according to the natural break method: 0–876.18 m, 876.18–1833.93 m, 1833.93–2920.50 m, 2920.50–4274.03 m, and 4274.03–8034.23 m (Figure 2G).

3.2.8 Distance to roads

Road construction is usually accompanied by excavating the slope toe, which changes the slope’s stress distribution and significantly impacts landslides (Pham et al., 2017c). The distance to roads was divided into five categories by the natural break method: 0–819.39 m, 819.39–1914.60 m, 1914.60–3291.94 m, 3291.94–5210.08 m, and 5210.08–9406.85 m (Figure 2H).

3.2.9 NDVI

NDVI is the difference between the near-infrared band’s reflection value and the red band’s reflection value divided by their sum, which is a dimensionless parameter (Ada and San 2018). It was divided into five categories by the natural break method: 0.01–0.04, 0.04–0.05, 0.05–0.07, 0.07–0.12, and 0.12–0.31 (Figure 2I).

NDVI = \frac{N I R - R}{N I R + R} (2)

Where NIR is the reflection value of the near-infrared band and R is the reflection value of the red band.
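
The following Python sketch illustrates Formula 2 and how a computed value could be assigned to one of the five NDVI classes listed above. The reflectance values are hypothetical, the function names are illustrative, and the treatment of values that fall exactly on a class boundary (assigned to the higher class here) is an assumption, since the text does not specify it.

```python
from bisect import bisect_right

# Upper bounds of the first four NDVI classes listed in this section
NDVI_BREAKS = [0.04, 0.05, 0.07, 0.12]

def ndvi(nir: float, red: float) -> float:
    """Formula 2: NDVI = (NIR - R) / (NIR + R)."""
    total = nir + red
    if total == 0:
        raise ValueError("NIR + R must be non-zero")
    return (nir - red) / total

def ndvi_class(value: float) -> int:
    """Return the 1-based class index (1 to 5) for an NDVI value."""
    return bisect_right(NDVI_BREAKS, value) + 1

# Hypothetical reflectance values, not taken from the study's imagery
value = ndvi(nir=0.30, red=0.25)
print(round(value, 3), ndvi_class(value))
```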

3.2.10 TWI

The TWI reflects the dry and wet conditions of the soil under the ideal condition, which is a dimensionless parameter (Hong et al., 2018b). It was divided into 2.14–5.53, 5.53–7.86, 7.86–15.17, 15.17–22.79, and 22.79–29.25 by the natural break method (Figure 2J).

TWI = \ln (\frac{A_{s}}{\tan β}) (3)

Where $A_{s}$ is the specific catchment area and $β$ is the slope angle.
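
A minimal Python sketch of Formula 3 is given below. It assumes the slope angle is supplied in degrees and rejects flat cells, where tan β = 0 and the index is undefined; how flat cells were handled in the study is not stated. The input values are hypothetical.

```python
import math

def twi(specific_catchment_area: float, slope_deg: float) -> float:
    """Formula 3: TWI = ln(A_s / tan(beta)), with beta in degrees."""
    tan_beta = math.tan(math.radians(slope_deg))
    if specific_catchment_area <= 0 or tan_beta <= 0:
        raise ValueError("A_s and tan(beta) must both be positive")
    return math.log(specific_catchment_area / tan_beta)

# Hypothetical cell: the same catchment area on a gentle and on a steep slope
print(round(twi(100.0, 5.0), 2), round(twi(100.0, 45.0), 2))
```

For a fixed catchment area, the gentler slope yields the larger index, which matches the interpretation of TWI as an indicator of wetter conditions.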

3.2.11 Plan curvature

The plan curvature is the rate of change in the direction perpendicular to the maximum slope, which is a dimensionless parameter (Ohlmacher 2007). It was divided into five categories by the natural break method: −8.01–−1.12, −1.12–−0.35, −0.35–0.06, 0.06–0.75, and 0.75–9.65 (Figure 2K).

3.2.12 Profile curvature

The profile curvature is the variability of the slope along the maximum slope, which is a dimensionless parameter. It was divided into five categories by the natural break method: −9.48–−1.10, −1.10–−0.34, −0.34–0.14, 0.14–0.89 and 0.89–7.97 (Figure 2L).

3.3 The division of evaluation units

The study area was divided into grid units and slope units. For grid units, the key choice is the cell size, which was determined with an empirical formula (Formula 4). The slope unit division method based on curvature was proposed by Yan Ge and has been applied to the evaluation of landslide susceptibility in other studies (Yan G 2016). Those results showed that this method gives a better evaluation result than the traditional hydrological analysis model. In line with previous studies, 30 m resolution grid units and curvature-based slope units were selected to evaluate the applicability of different units in this paper. The study area was thus divided into 1,799,237 grid units and 42,981 slope units.

G_{s} = 7.5 + 0.0006 S - 2.01 \times 10^{-9} S^{2} + 2.91 \times 10^{-15} S^{3} (4)

Where $G_{s}$ is the grid size, and S is the denominator of the geological map scale.
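
As a worked example of Formula 4, the Python sketch below evaluates the grid size for the 1:50000 geological map used in this study. It assumes that the quadratic and cubic coefficients carry negative powers of ten (10⁻⁹ and 10⁻¹⁵), which is the reading that yields a cell size near the 30 m adopted here; the function name is illustrative.

```python
def grid_size(scale_denominator: float) -> float:
    """Formula 4, assuming coefficients of 2.01e-9 and 2.91e-15."""
    s = scale_denominator
    return 7.5 + 0.0006 * s - 2.01e-9 * s**2 + 2.91e-15 * s**3

# S = 50000 for a 1:50000 map:
# 7.5 + 30 - 5.025 + 0.36375 = 32.83875
print(round(grid_size(50_000), 2))  # 32.84
```

The result of roughly 33 m is consistent with choosing a 30 m grid that matches the resolution of the elevation data.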

3.4 Random subspace

The random subspace algorithm (RS) was first proposed in 1998 (Tin Kam Ho 1998). This method constructs a decision-tree-based classifier that maintains the highest accuracy on the training data while its generalization accuracy improves as complexity increases. The classifier comprises several trees constructed by a pseudo-random selection of feature vectors, and one classifier is trained on each subspace (Kuncheva and Plumpton 2010). Because each classifier sees only part of the attributes, overfitting can be avoided to a certain extent. First, k attributes (1 < k < n) are selected randomly from the attribute set of the training set (a₁, a₂, ..., a_n). Each sample of the initial training set is then described by these attributes to obtain a new training set. Next, random sampling is repeated until the bootstrap sample set has the same number of samples as the training set. Finally, the reduced-error pruning tree model is used as the base classifier. The classification results are combined with the simple majority voting rule to obtain the final classification result.
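
The attribute sub-sampling and majority voting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the WEKA implementation used in the study: the base classifier is a single-threshold decision stump standing in for the reduced-error pruning tree, the bootstrap step is omitted, and the data and function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Stand-in base classifier: best single-threshold rule on the given features."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            for above in (0, 1):  # class predicted when the value exceeds the threshold
                preds = [above if row[f] > threshold else 1 - above for row in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, above)
    _, f, threshold, above = best
    return lambda row: above if row[f] > threshold else 1 - above

def train_random_subspace(rows, labels, k, n_learners, seed=0):
    """Train one base classifier per random subset of k attributes."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    learners = []
    for _ in range(n_learners):
        subspace = rng.sample(range(n_features), k)  # k attributes, 1 < k < n
        learners.append(train_stump(rows, labels, subspace))
    return learners

def predict(learners, row):
    """Combine the base classifiers by simple majority vote."""
    votes = Counter(learner(row) for learner in learners)
    return votes.most_common(1)[0][0]

# Hypothetical, normalized factor values; 1 = landslide, 0 = non-landslide
rows = [(0.1, 0.2, 0.1, 0.3), (0.2, 0.1, 0.3, 0.2),
        (0.8, 0.9, 0.7, 0.9), (0.9, 0.8, 0.9, 0.7)]
labels = [0, 0, 1, 1]
ensemble = train_random_subspace(rows, labels, k=2, n_learners=5)
print(predict(ensemble, (0.85, 0.75, 0.8, 0.95)))
```

An odd number of learners is used so that a two-class vote cannot tie.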

3.5 Random forest

The random forest (RF) was first proposed in 2001 (Breiman 2001). First, the bootstrap resampling method is used to draw multiple samples from the original data with replacement, so that each bootstrap sample contains about 2/3 of the original samples. For each bootstrap sample, a decision tree is established through feature sampling and optimal splitting. The tree is generated with the classification and regression tree algorithm: a random subset of factors is considered at each internal node, and the tree grows without restriction and without pruning (Trigila et al., 2015). The out-of-bag error is then calculated from the remaining 1/3 of the data. Finally, the decision trees are combined, and the final classification result is obtained by voting.
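
The in-bag and out-of-bag split behind this description can be illustrated with a short Python sketch. The sample size of 170 mirrors the training set of this study, while the seed and function name are arbitrary. When n samples are drawn with replacement, the expected share of distinct samples in the bag is about 63%, close to the 2/3 quoted above, with the remainder left out of the bag.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n indices with replacement; unused indices form the out-of-bag set."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(42)
in_bag, oob = bootstrap_split(170, rng)
print(f"distinct in-bag share: {len(set(in_bag)) / 170:.2f}")
print(f"out-of-bag share:      {len(oob) / 170:.2f}")
```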

3.6 Forest by penalizing attributes

The forest by penalizing attributes (FPA) was developed in 2017 (Adnan and Islam 2017). It comes from the forest by continuously excluding the root node and has become a more balanced and accurate decision forest algorithm. It systematizes weights by penalizing attributes, effectively avoiding the attributes of low-level trees. It generates a bootstrap data set from the training data samples and uses attribute weights to create a decision tree, similar to a classification and regression tree. The difference is that attributes are split based on merit values rather than classification ability, and the attribute weights are constantly updated from tree to tree. It is worth mentioning that FPA avoids switching between similar trees by preserving the weights of the previous tree, so the attributes tested in the previous tree are not penalized again. The weight of an attribute mainly depends on the level (λ) at which it was tested in the most recent tree, and it is generated randomly within a weight range (WR) (Formula 5). $ρ$ is used to ensure that the WRs for different levels do not overlap. Finally, the weight of each applicable attribute is updated, and its increment value $σ_{i}$ is calculated as shown in Formula 6. This increment prevents an attribute from being excluded from subsequent trees merely because it received a low weight after being tested in a previous tree.

WR^{λ} = \left\{ \begin{array}{ll} [0.000, e^{- \frac{1}{λ}}], & \text{if } λ = 1 \\ [e^{- \frac{1}{λ - 1}} + ρ, e^{- \frac{1}{λ}}], & \text{if } λ > 1 \end{array} \right. (5)

σ_{i} = \frac{1.0 - ω_{i}}{(η + 1) - λ} (6)

Where $ω_{i}$ is the weight of the attribute, and $η$ is the height of the tallest tree.
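
Formulas 5 and 6 can be evaluated directly, as in the Python sketch below. The value of ρ, the tree height, and the function names are hypothetical, and drawing the weight uniformly within the range is an assumption about how the random generation is carried out.

```python
import math
import random

def weight_range(level, rho):
    """Formula 5: weight range for an attribute tested at tree level lambda."""
    upper = math.exp(-1.0 / level)
    if level == 1:
        return (0.0, upper)
    return (math.exp(-1.0 / (level - 1)) + rho, upper)

def draw_weight(level, rho, rng):
    """Assumed sampling rule: a uniform draw within the weight range."""
    low, high = weight_range(level, rho)
    return rng.uniform(low, high)

def weight_increment(weight, tree_height, level):
    """Formula 6: sigma_i = (1 - omega_i) / ((eta + 1) - lambda)."""
    return (1.0 - weight) / ((tree_height + 1) - level)

rho = 0.0001  # hypothetical separation between adjacent ranges
for level in (1, 2, 3):
    low, high = weight_range(level, rho)
    print(f"level {level}: WR = [{low:.4f}, {high:.4f}]")
print(weight_increment(weight=0.5, tree_height=5, level=2))
```

Attributes tested near the root (λ = 1) receive the lowest weights, so they are the most heavily penalized in the next tree.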

3.7 Rotation forest

The rotation forest model (ROF) has been widely used to evaluate landslide susceptibility and has achieved high prediction accuracy (Pham et al., 2018; Pham et al., 2020). Assume that N is the training sample data, an A × B matrix (A is the number of training instances and B is the number of landslide influencing factors), and that the decision tree classifiers and the feature set in the ensemble model are represented by D_i (i = 1, 2, …, L) and F. First, F is randomly divided into K subsets, each containing M features, where M = B/K. The jth feature subset used for the ith decision tree classifier is denoted F_ij. Then 75% of the training data is selected to generate a random non-empty subset, and principal component analysis (PCA) is run on this subset restricted to the M features (J. J. Rodriguez et al., 2006). Finally, the PCA coefficients are stored as vectors $a_{i, j}^{(1)}$ ,…, $a_{i, j}^{(M_{j})}$ , each of size M × 1. The “rotation” matrix R_i is then constructed (Formula 7). The training set for D_i, ( $X R_{i}^{a}, B$ ), is constructed accordingly, and for a sample x the confidence that it belongs to class $ω_{j}$ is calculated by the average combination method (Formula 8):

R_{i} = \left[ \begin{array}{cccc} a_{i, 1}^{(1)}, a_{i, 1}^{(2)}, \dots, a_{i, 1}^{(M_{1})} & [0] & \dots & [0] \\ [0] & a_{i, 2}^{(1)}, a_{i, 2}^{(2)}, \dots, a_{i, 2}^{(M_{2})} & \dots & [0] \\ \vdots & \vdots & \ddots & \vdots \\ [0] & [0] & \dots & a_{i, K}^{(1)}, a_{i, K}^{(2)}, \dots, a_{i, K}^{(M_{K})} \end{array} \right] (7)

μ_{j} (x) = \frac{1}{L} \sum_{i = 1}^{L} d_{i, j} (x R_{i}^{a}) (8)
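
Two steps of this procedure, the random partition of the feature set and the averaging in Formula 8, are simple enough to sketch in plain Python. The PCA rotation itself is omitted, the number of subsets (K = 4) and the classifier outputs are hypothetical, and the sketch assumes that the number of factors is divisible by K.

```python
import random

def split_features(features, k, seed=0):
    """Randomly partition the feature set F into K subsets of M = B / K features."""
    shuffled = list(features)
    random.Random(seed).shuffle(shuffled)
    m = len(shuffled) // k  # assumes B is divisible by K
    return [shuffled[i * m:(i + 1) * m] for i in range(k)]

def average_confidence(confidences):
    """Formula 8: mean over the L classifiers of the confidence d_ij for each class j."""
    n_classifiers = len(confidences)
    n_classes = len(confidences[0])
    return [sum(c[j] for c in confidences) / n_classifiers for j in range(n_classes)]

factors = ["elevation", "slope angle", "slope aspect", "roughness", "rainfall",
           "lithology", "distance to rivers", "distance to roads", "NDVI", "TWI",
           "plan curvature", "profile curvature"]
subsets = split_features(factors, k=4)

# Hypothetical outputs of L = 3 classifiers for (non-landslide, landslide)
d = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
mu = average_confidence(d)
predicted_class = max(range(len(mu)), key=lambda j: mu[j])
print(subsets, [round(v, 2) for v in mu], predicted_class)
```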

3.8 Model performance evaluation

Evaluating the generalization ability of different models requires effective experimental methods and evaluation criteria (Abedini et al., 2019b). When landslide susceptibility evaluation is the task, it is important to verify the relative performance of different models in order to select an evaluation model (Nguyen et al., 2019). In this study, commonly used statistical measures were selected, namely sensitivity, specificity, accuracy (ACC), mean absolute error (MAE), precision, kappa statistics, F-score, and Matthews correlation coefficient (MCC).

\mathrm{Sensitivity} = \frac{TP}{TP + FN} (9)

\mathrm{Specificity} = \frac{TN}{TN + FP} (10)

\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} (11)

\mathrm{MAE} = \frac{| p_{1} - a_{1} | + | p_{2} - a_{2} | + \dots + | p_{n} - a_{n} |}{n} (12)

\mathrm{Precision} = \frac{TP}{TP + FP} (13)

k = \frac{k_{0} - k_{e}}{1 - k_{e}} (14)

\mathrm{F\text{-}score} = \frac{2 \times TP}{2 \times TP + FP + FN} (15)

\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} (16)

True positive (TP) and true negative (TN) are the numbers of samples correctly classified as landslides and non-landslides, respectively. False positive (FP) and false negative (FN) are the numbers of samples misclassified as landslides and non-landslides, respectively. p_i is the predicted value of landslide susceptibility, while a_i is the actual value (i = 1, 2, 3, … , n, where n is the number of sample instances). k₀ is the number of correctly classified samples summed over all categories and divided by the total number of samples, and k_e is the product of the numbers of true and predicted samples in each category, summed over all categories and divided by the square of the total number of samples.
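
The Python sketch below computes Formulas 9 to 16 from a confusion matrix. The counts and probabilities are hypothetical (a 60-sample validation set is assumed only for illustration) and the function names are not taken from the study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Formulas 9-11 and 13-16 from the four confusion-matrix counts."""
    n = tp + tn + fp + fn
    k0 = (tp + tn) / n                                            # observed agreement
    ke = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2   # chance agreement
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "acc": k0,
        "precision": tp / (tp + fp),
        "kappa": (k0 - ke) / (1 - ke),
        "f_score": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

def mean_absolute_error(predicted, actual):
    """Formula 12: mean of |p_i - a_i| over all samples."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical confusion matrix for 60 validation samples
metrics = classification_metrics(tp=25, tn=24, fp=6, fn=5)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
print(f"{'mae':12s} {mean_absolute_error([0.9, 0.2, 0.6], [1, 0, 1]):.4f}")
```

For these counts the chance agreement k_e is 0.5, so kappa equals (49/60 − 0.5)/0.5 ≈ 0.633, which lies close to the MCC of about 0.634.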

3.9 Data processing

The main steps of this study are shown in Figure 3. The first step is to determine the distribution of landslides in the study area by interpreting remote sensing images, followed by field investigation and verification. In the second step, 12 influencing factors, such as elevation, slope angle, and slope aspect, were selected to establish a database for both grid units and slope units, and the data were then screened through VIF and CAE. The third step is to build the RS model in WEKA software, combine it with the ROF, FPA, and RF models, respectively, and adjust the number of iterations to determine the optimal parameters. The fourth step is to generate the landslide susceptibility maps, divided into five grades from very low to very high. The fifth step uses the ROC curve, AUC, MAE, and SE to verify the models and identify the best-performing model through analysis.

FIGURE 3. Flow chart of the study.

4 Results and analysis

4.1 Landslide influencing factor analysis

In the landslide susceptibility assessment, the analysis of influencing factors is mainly divided into two aspects: one is the multicollinearity between factors, and the other is the importance of the influencing factors on landslides in the study area. Tolerance (TOL) and variance inflation factors (VIF) are measures of collinearity severity of multiple linear regression models. It is generally believed that when TOL is less than 0.1 or VIF is greater than 10, there is more severe collinearity among data (Toebe and Cargnelutti Filho 2013). The results of the collinearity analysis are shown in Table 1.

TABLE 1. Multicollinearity analysis.
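
The relationship between the two collinearity measures can be made explicit with a small Python sketch. It uses the standard definitions TOL = 1 − R² and VIF = 1/TOL, where R² comes from regressing one factor on all the others; the R² values below are hypothetical and the regression itself is not shown.

```python
def tolerance_and_vif(r_squared):
    """TOL = 1 - R^2 and VIF = 1 / TOL for one factor regressed on the others."""
    tol = 1.0 - r_squared
    return tol, 1.0 / tol

def has_severe_collinearity(r_squared):
    """Apply the thresholds quoted above: TOL below 0.1 or VIF above 10."""
    tol, vif = tolerance_and_vif(r_squared)
    return tol < 0.1 or vif > 10

# Hypothetical R^2 values for two factors
for r2 in (0.50, 0.95):
    tol, vif = tolerance_and_vif(r2)
    print(f"R2 = {r2:.2f}: TOL = {tol:.2f}, VIF = {vif:.1f}, "
          f"severe = {has_severe_collinearity(r2)}")
```

Because VIF is the reciprocal of TOL, the two thresholds express the same condition.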

Correlation attribute evaluation (CAE) is used to evaluate the worth of an attribute by measuring the correlation (Pearson’s) between it and the class. In previous work, it was also used to select relevant factors for landslide susceptibility. CAE was used to calculate the importance of 12 types of factors in this paper, and the results are shown in Table 2. If average merit (AM), the weight of a factor, is greater than 0, it indicates that it is beneficial to evaluating landslide susceptibility. It can be clearly seen from the results that all the 12 types of factors selected in this paper are suitable for landslide susceptibility evaluation in the study area.

TABLE 2. Importance of influencing factors based on correlation attribute evaluation (CAE).
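
The idea behind CAE can be illustrated by computing Pearson's correlation between one factor and the landslide class label. The Python sketch below is a plain implementation of the correlation coefficient with hypothetical values; it does not reproduce WEKA's exact evaluator, whose handling of nominal attributes and cross-validation averaging may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical slope angles (degrees) and class labels (1 = landslide)
slope = [5, 12, 20, 30, 40, 8]
label = [0, 0, 1, 1, 1, 0]
merit = pearson(slope, label)
print(f"merit = {merit:.3f}, useful = {merit > 0}")
```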

4.2 Selection of model parameters

Parameter adjustment of a machine learning model is a complex task. To reduce the complexity of parameter optimization and obtain more reliable parameters, we compared the AUC and MAE values of each model on the training set and the validation set as the number of iterations varied from 10 to 200 (Figure 4). To select the number of iterations that best fits the overall sample, the difference between AUC and MAE was calculated for each set, and the mean of these differences was taken as the average statistical index (ASI), which represents the average level of model fitting at a given number of iterations (Figure 5). The results show that the RS-ROF, RS-FPA, and RS-RF models have the best fit at 190, 50, and 200 iterations, respectively, and these values were used to set the models’ parameters.
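
One possible reading of the ASI is sketched below in Python: the difference AUC − MAE is computed for the training set and for the validation set, and the two differences are averaged. This interpretation is an assumption, and the AUC and MAE values are hypothetical rather than those plotted in Figure 4.

```python
def average_statistical_index(auc_train, mae_train, auc_valid, mae_valid):
    """Assumed ASI: mean of (AUC - MAE) over the training and validation sets."""
    return ((auc_train - mae_train) + (auc_valid - mae_valid)) / 2

def best_iteration(results):
    """Return the iteration count with the largest ASI."""
    return max(results, key=lambda it: average_statistical_index(*results[it]))

# Hypothetical (AUC_train, MAE_train, AUC_valid, MAE_valid) per iteration count
results = {
    10: (0.90, 0.30, 0.80, 0.35),
    50: (0.95, 0.25, 0.85, 0.30),
    200: (0.97, 0.20, 0.84, 0.33),
}
for iterations, values in results.items():
    print(iterations, round(average_statistical_index(*values), 3))
print("selected:", best_iteration(results))
```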

FIGURE 4. AUC and MAE of training set and validation set under different iteration times; (A) AUC value of the training set; (B) AUC value of the validation set; (C) MAE value of the training set; (D) MAE value of the validation set.

FIGURE 5. ASI of each model under different iteration times.

4.3 Generation of landslide susceptibility maps

As mentioned previously, the datasets for the different units were applied to the RS model, and the ROF, FPA, and RF models were selected as base classifiers to construct the RS-ROF, RS-FPA, and RS-RF models. Then, the success rate and prediction rate curves of the three models were obtained. After the training of models was completed, the data in the study area {A_i, B_i, … , L_i} (0 < i $\leq$ n, where n is the number of instances in the study area) were substituted into each model. The probability of landslide occurrence for each unit, also known as the landslide susceptibility index (LSI), was calculated (Figure 6). The natural break method was used to classify LSI in this paper to eliminate the differences between the three models. There are five categories: very low, low, moderate, high, and very high. In this way, six landslide susceptibility maps were obtained for the three models and two unit types (Figure 7).

FIGURE 6. Generation of landslide susceptibility.

FIGURE 7. Landslide susceptibility maps using three models of different units: (A) LSI of RS-ROF model under grid units; (B) LSI of RS-ROF model under slope units; (C) LSI of RS-FPA model under grid units; (D) LSI of RS-FPA model under slope units; (E) LSI of RS-RF model under grid units; (F) LSI of RS-RF model under slope units.
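The five-class LSI classification can be illustrated with a small sketch. It assumes that the "natural break method" refers to Jenks (Fisher-Jenks) natural breaks, which minimizes the within-class sum of squared deviations; the paper does not state the implementation, and GIS software is normally used for this step. The function names and the sample LSI values below are hypothetical.

```python
from bisect import bisect_left

LABELS = ["very low", "low", "moderate", "high", "very high"]


def jenks_upper_bounds(values, n_classes):
    """Fisher-Jenks natural breaks by dynamic programming (O(k * n^2)).

    Returns the upper bound of each class in ascending order.
    """
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")
    # Prefix sums give the within-class sum of squared deviations in O(1).
    s1, s2 = [0.0], [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):  # squared deviations of data[i:j] around its mean, j > i
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: minimal total SSD when the first j values form k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    split[k][j] = i
    # Backtrack from the last class to recover each class's largest value.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]


def classify(lsi, bounds):
    """Map one LSI value to a susceptibility label using class upper bounds."""
    idx = min(bisect_left(bounds, lsi), len(bounds) - 1)
    return LABELS[idx]


# Hypothetical LSI values forming five obvious clusters.
sample_lsi = [0.05, 0.06, 0.07, 0.30, 0.31, 0.50, 0.52, 0.70, 0.72, 0.95, 0.96]
bounds = jenks_upper_bounds(sample_lsi, 5)
print(bounds, [classify(v, bounds) for v in sample_lsi])
```

Because the breaks are derived from each model's own LSI distribution, the same five labels can be compared across models even when their raw probability ranges differ.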

4.4 Model performance and comparison

Model performance evaluation is a critical step in assessing the predictive ability of a model and the last stage of a landslide susceptibility evaluation (Ozdemir and Altural 2013; Hussin et al., 2016). The test indicators were calculated according to Formulas 9 to 15 (Table 3 and Table 4). For every indicator except MAE, a larger value indicates a higher prediction ability; for MAE, a smaller value is better. In the case of the grid unit, the training set showed that the RS-ROF model had the maximum specificity, ACC, precision, Kappa, and MCC, followed by the RS-FPA and RS-RF models. The RS-RF model had the largest sensitivity and F-score and the smallest MAE. The validation set gave a different ranking from the training set. The RS-RF model obtained the highest sensitivity, ACC, F-score, Kappa, and MCC, while its MAE remained the minimum, followed by the RS-FPA and RS-ROF models. Only specificity and precision showed the RS-ROF model to be superior to the other two models. These differences may be caused by the size of the samples and the different testing principles of each indicator.

TABLE 3. The performance of the training set and the validation set of models using grid unit.

TABLE 4. The performance of the training set and the validation set of models using slope unit.

When the slope unit was used, the training set showed that the RS-RF model reached the maximum sensitivity, ACC, F-score, Kappa, and MCC and the minimum MAE, followed by the RS-ROF and RS-FPA models. The RS-ROF model had the maximum specificity and precision. The validation set showed that the RS-FPA model achieved the maximum sensitivity, ACC, precision, F-score, Kappa, and MCC, followed by the RS-RF and RS-ROF models. All three models exhibited the same specificity, but the MAE of the RS-RF model remained the minimum. The different test indicators therefore do not give the same ranking, and it is difficult to judge the models by a single standard.
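To make the compared indicators concrete, the sketch below computes them from a binary confusion matrix using their standard textbook definitions. It is assumed, not confirmed here, that Formulas 9 to 15 of the paper match these definitions. The function names and the example counts are hypothetical.

```python
import math


def confusion_metrics(tp, fn, fp, tn):
    """Standard binary classification indicators from confusion-matrix counts.

    tp: landslide units predicted as landslide
    fn: landslide units predicted as non-landslide
    fp: non-landslide units predicted as landslide
    tn: non-landslide units predicted as non-landslide
    """
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ACC": accuracy,
        "precision": precision,
        "F-score": f_score,
        "kappa": kappa,
        "MCC": mcc,
    }


def mean_absolute_error(labels, probabilities):
    """MAE between 0/1 labels and predicted landslide probabilities."""
    return sum(abs(y - p) for y, p in zip(labels, probabilities)) / len(labels)


# Hypothetical counts: 50 landslide and 50 non-landslide units.
metrics = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
mae = mean_absolute_error([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
print(metrics, mae)
```

In this example the sensitivity (0.80) is lower than the specificity (0.90), which illustrates how two models can swap ranks depending on which indicator is read.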

The receiver operating characteristic (ROC) curve originated in signal detection theory in the mid-twentieth century and has been widely used in data mining and in the evaluation of machine learning classification models (Chen et al., 2017d). ROC curves were generated by plotting the sensitivity (the proportion of landslide samples predicted as landslides) against 1-specificity (the proportion of non-landslide samples predicted as landslides) for each model (Figure 8). The area under the curve (AUC), the main statistical indicator of the ROC curve, was used to compare the models (Pham et al., 2021). In both the training and validation sets, the AUC value of the RS-RF model was the highest among the three models, and its SE was the smallest (Table 5 and Table 6). In the training set, the AUC value of the RS-FPA model was higher than that of the RS-ROF model for both types of units. In the validation set, the AUC value of the RS-ROF model was higher than that of the RS-FPA model for the grid unit, while the ranking for the slope unit was consistent with that of the training set. Overall, the RS-RF model shows the highest generalization ability among the three models, followed by the RS-FPA and RS-ROF models. Comparing the AUC results of the grid and slope units, all three models indicate that the prediction accuracy of slope units is higher than that of grid units, which suggests that slope units are the more appropriate choice for evaluating landslide susceptibility in Yaozhou District.

FIGURE 8. ROC curves of the models using: (A) training set of grid units; (B) validation set of grid units; (C) training set of slope units; (D) validation set of slope units.

TABLE 5. Parameters of ROC curves using the training dataset.

TABLE 6. Parameters of ROC curves using the validation dataset.
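As a hedged illustration of the quantities reported in Tables 5 and 6, the sketch below computes the AUC as the probability that a randomly chosen landslide sample receives a higher score than a randomly chosen non-landslide sample (the Mann-Whitney interpretation). The standard error uses the Hanley and McNeil (1982) approximation; the paper does not say how its SE was obtained, so this choice is an assumption. Function names and scores are hypothetical.

```python
import math


def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the fraction of (landslide, non-landslide) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of the AUC (Hanley and McNeil, 1982).

    Assumed here for illustration; the paper's SE method is not stated.
    """
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# Hypothetical predicted probabilities for landslide and non-landslide samples.
landslide_scores = [0.9, 0.8, 0.4]
non_landslide_scores = [0.7, 0.3, 0.2]

auc = auc_mann_whitney(landslide_scores, non_landslide_scores)
se = hanley_mcneil_se(auc, len(landslide_scores), len(non_landslide_scores))
print(auc, se)
```

Here eight of the nine sample pairs are ranked correctly, so the AUC is 8/9. A smaller SE, as reported for the RS-RF model, means that the AUC estimate is more stable for the given sample sizes.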

The frequency ratio (FR) was used to measure the consistency of the predicted results with the actual results (Formula 17). A higher FR means that relatively more landslides are concentrated in a smaller area. FR should therefore increase with susceptibility, and the model with the higher FR in the very high susceptibility class is closer to reality. In Figure 9, the FR of all three models increased with increasing susceptibility, and the FR of slope units was higher than that of grid units in the very high susceptibility class. Among the models, the RS-RF model had the highest FR value, followed by the RS-FPA and RS-ROF models.

$$FR = \frac{υ}{γ} \qquad (17)$$

where $υ$ is the proportion of landslides that fall within a susceptibility class, and $γ$ is the proportion of the study area occupied by that class.

FIGURE 9. Comparison of the frequency ratios of three models.
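Formula 17 can be worked through with a short sketch. It assumes that the proportions are taken per susceptibility class: the share of all landslides in a class divided by the share of the total area in that class. The function name, landslide counts, and areas below are hypothetical and are not taken from Figure 9.

```python
CLASSES = ["very low", "low", "moderate", "high", "very high"]


def frequency_ratios(landslide_counts, class_areas):
    """FR per class = (share of landslides in class) / (share of area in class)."""
    total_landslides = sum(landslide_counts)
    total_area = sum(class_areas)
    ratios = []
    for count, area in zip(landslide_counts, class_areas):
        landslide_share = count / total_landslides
        area_share = area / total_area
        ratios.append(landslide_share / area_share)
    return ratios


# Hypothetical landslide counts and areas (arbitrary units) for the five classes.
counts = [2, 5, 13, 30, 50]
areas = [400, 250, 150, 120, 80]

fr = frequency_ratios(counts, areas)
for name, value in zip(CLASSES, fr):
    print(f"{name}: FR = {value:.3f}")
```

In this made-up example the very high class holds half of the landslides in 8% of the area, giving FR = 6.25. An FR above 1 means that a class contains more landslides than its area share alone would suggest.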

5 Discussion

The evaluation of landslide susceptibility involves the sampling strategy, the determination of the evaluation unit, the selection of influencing factors, the evaluation model, and model verification. Owing to the complex mechanism of landslide occurrence and the uncertainty of the evaluation process, no unified standard for landslide susceptibility evaluation has been reached despite decades of research (Nhu et al., 2020). In this study, slope units and grid units were selected for comparison. In a previous study, slope units based on curvature watersheds were used to evaluate the landslide susceptibility of the Baxie River Basin, and the results showed that slope division based on curvature was superior to the traditional hydrological analysis method (Chen et al., 2020a). Slope unit division by hydrological analysis has been deemed inappropriate over wide areas and requires extensive manual correction, which increases subjectivity and uncertainty (Yan G 2016). A 70 m grid resolution was found to be the best size in the Baxie River Basin, but the optimal size is not fixed, which causes problems when evaluating different areas and different types of landslides (Chen et al., 2020b). In this study, a 30 m resolution was selected for the grid units according to the empirical formula and compared with the slope units. The results showed that the model predictions based on slope units were better than those based on grid units, highlighting the importance of the slope as the boundary for evaluating landslide susceptibility (Table 5 and Table 6) (Ba et al., 2018). The curvature-based slope unit division method used in this study can effectively overcome the shortcomings of fuzzy boundaries and complicated operations (Chen et al., 2020a). The geomorphology of the study area is mainly Loess tableland. The results showed that this slope unit division is well suited to relatively flat areas, while its adaptability to hilly and mountainous areas needs further analysis and verification.

The previous literature offers no fixed standard for selecting influencing factors, so the choice remains somewhat subjective (Pham et al., 2017b). In this paper, 12 factors were evaluated based on detailed investigation reports of geological hazards in the study area, and their importance to landslides in the study area was analyzed through the CAE method. The results (Table 2) showed that, among the 12 factors, elevation is the one most closely related to the occurrence of landslides in the area (AM = 0.506), which is consistent with related research results (Tien Bui et al., 2016b; Chen et al., 2017d). The second is lithology (AM = 0.425), which determines the deformation and failure mode of the slope as well as the location of weak structural planes, and therefore has a noticeable control on the location of the sliding plane. Slope angle (AM = 0.275) and roughness (AM = 0.267) are also important influencing factors in this study; they significantly change the stress distribution and make the slope more unstable, thereby inducing landslides. NDVI (AM = 0.228) reflects vegetation, which plays a role in slope protection and soil erosion prevention (Shou and Lin 2016). As Figure 2I shows, vegetation is lush and landslides are relatively less developed in the northwest of the region, whereas vegetation coverage is sparser and landslides are more densely developed in the southeast. The distance to roads (AM = 0.236) is also an important landslide-inducing factor, related to the rapid development of highway construction in the study area, in which many slopes have been excavated to form free faces (Xu et al., 2012). The remaining factors are relatively less important but cannot be ignored, because their AM values show that they are still associated with landslides.

In a previous study, the random subspace algorithm was used to optimize a support vector machine model for evaluating landslide susceptibility in the Wuning area, China (Hong et al., 2017b). The results showed that the hybrid model constructed by the random subspace algorithm improved the prediction ability of the base classifier. Previous studies applied the ROF, FPA, and RF models to predict landslide susceptibility. In this study, these models were combined with the random subspace algorithm to construct hybrid models, which were compared to verify their applicability in predicting landslide susceptibility. A variety of performance metrics were adopted to evaluate the three models for the two types of units. Because each metric emphasizes a different aspect of performance, the evaluation results also differed. For example, precision measures the proportion of units predicted as landslides that are actual landslides, while sensitivity measures the proportion of actual landslides that are correctly predicted. As a result, the two indexes often show opposite trends (Table 3 and Table 4). Although there are slight differences, the overall comparison by ROC curve showed that the RS-RF model had the highest AUC value, followed by the RS-FPA and RS-ROF models. This indicates that the RF model has stronger generalization ability and prediction performance than the other single models. In the very high susceptibility class, the RS-RF model yielded the highest landslide frequency ratio under the slope unit, which is consistent with the AUC values and indicates that the results are relatively reliable. It should be pointed out that, in this study, the model training process adopted only the grid search method to determine the number of iterations, whereas a machine learning hybrid model has many hyperparameters. We recommend that future studies analyze optimization methods for the other parameters to improve the stability and accuracy of the model. In addition to the RS algorithm, improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN) has been used to construct machine learning hybrid models. It has been shown to improve the prediction ability of machine learning models in energy-related fields such as global solar radiation and wind direction prediction, which is also inspiring for the construction of landslide susceptibility models (Li et al., 2021; Ghimire et al., 2022). The advantages of hybrid models can be further compared and explored.

6 Conclusion

Landslide susceptibility prediction is necessary but often uncertain work, owing to the harmfulness and complexity of landslide processes. The main purpose of this study was to compare the applicability of different hybrid machine learning models, including RS-ROF, RS-FPA, and RS-RF, for landslide susceptibility assessment in Yaozhou District, Tongchuan City, China. The results were compared for grid units and slope units. In view of the geological environment of the study area, 12 influencing factors (elevation, slope angle, slope aspect, roughness, rainfall, lithology, distance to rivers, distance to roads, NDVI, TWI, plan curvature, and profile curvature) were used for evaluation, and the models were verified by the ROC curve, Kappa coefficient, F-score, MCC, and other performance metrics. The results showed that the 12 selected factors were all suitable for this study, among which elevation, lithology, slope, roughness, and NDVI were the main inducing factors of landslides. The prediction results of the three models based on slope units were all better than those based on grid units, which verified the efficiency and accuracy of the curvature-based slope unit division method. In typical Loess tableland regions, the curvature-based slope unit division method is worth applying. Curvature analysis improves the efficiency of slope unit partitioning, as observed in the landslide susceptibility assessments of the different machine learning hybrid models. A comprehensive comparison of the performance measures showed that the RS-RF model has excellent landslide susceptibility prediction ability. At the same time, the hyperparameter optimization of the models can be further studied and explored. Landslide susceptibility is an important branch of environmental geology, and the modeling approach could also be applied to evaluate a wide range of environmental problems. Research results based on machine learning hybrid models can play an important role in the development and utilization of urban and agricultural areas, aiming at the harmonious development of humans and the environment.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: http://www.gscloud.cn/sources/index?pid=1&rootid=1.

Author contributions

ND collected field data and conducted the landslide mapping and analysis. YL and JM wrote the manuscript. HS, MH, GO, and SC provided critical comments in planning this paper and edited the manuscript. All the authors discussed the results and edited the manuscript.

Funding

This study was supported by the National Natural Science Youth Foundation of China (Grant Nos. 41602359 and 41702377) and the Open Project of the Key Laboratory of Geological Process and Mineral Resources in the Northern Qinghai-Tibet Plateau of Qinghai Province (No. 2019-KZ-01).

Acknowledgments

The authors thank the Shaanxi Institute of Geological Survey of China for providing the geological survey data of the study area and Professor Chen Wei of Xi’an University of Science and Technology for his suggestions on the evaluation model and factor analysis in this paper. Furthermore, the authors thank the University of Kurdistan, Iran, and Universiti Teknologi Malaysia (UTM) for facilitating this international collaboration and scientific exchange.

Conflict of interest

Author YL was employed by the company Xi’an Meihang Remote Sensing Information Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fenvs.2022.1009433/full#supplementary-material

References

Abedini, M., Ghasemian, B., Shirzadi, A., and Bui, D. T. (2019a). A comparative study of support vector machine and logistic model tree classifiers for shallow landslide susceptibility modeling. Environ. Earth Sci. 78, 560. doi:10.1007/s12665-019-8562-z
