Optimal parameters of random forest for land cover classification with suitable data type and dataset on Google Earth Engine

Sun, Jing; Ongsomwang, Suwit

doi:10.3389/feart.2023.1188093

ORIGINAL RESEARCH article

Front. Earth Sci., 23 October 2023

Sec. Environmental Informatics and Remote Sensing

Volume 11 - 2023 | https://doi.org/10.3389/feart.2023.1188093

This article is part of the Research Topic: Advances in Characterizing and Monitoring Land Cover/Use and Associated Ecosystem Changes Using Remote Sensing Data

Jing Sun¹

Suwit Ongsomwang²*

¹Department of Geographic Information Science, School of Architectural Engineering, Tongling University, Tongling, China
²School of Mathematics and Geoinformatics, Institute of Science, Suranaree University of Technology, Nakhon Ratchasima, Thailand

An accurate land cover (LC) map is essential for understanding the development of human societies and for studying the impacts of climate and environmental change. To meet this need, optimal parameters of Random Forest (RF) for LC classification with a suitable data type and dataset on Google Earth Engine (GEE) were investigated. The research objectives were 1) to examine optimum parameters of RF for LC classification at local scale, 2) to classify LC data and assess accuracy in the model area (Hefei City), 3) to identify a suitable data type and dataset for LC classification, and 4) to validate the optimum parameters of RF for LC classification with a suitable data type and dataset in the test area (Nanjing City). The results suggest that the suitable data type for LC classification was Sentinel-2 data with auxiliary data, and that the suitable dataset was the monthly and seasonal medians of Sentinel-2 combined with elevation and nighttime light data. The appropriate values of the number of trees, the variables per split, and the bag fraction for RF were 800, 22, and 0.9, respectively. The overall accuracy (OA) and Kappa index of LC in the model area (Hefei City) with the suitable dataset were 93.17% and 0.9102, respectively, and those in the test area (Nanjing City) were 92.38% and 0.8914. Thus, the developed methodology can be applied to update LC maps where LC changes occur quickly.

1 Introduction

Detailed, timely, and accurate land cover (LC) information can help people understand the relationship between the development of human societies and climate and environmental change (Turner et al., 2007; Song et al., 2018; Liu et al., 2020b). LC classification and mapping are considered key technologies for obtaining surface information at various scales (Feddema et al., 2005), which is critical for understanding the impact of LC changes on agricultural production, ecotourism, carbon sequestration, water quality, runoff, and species conservation. In recent years, a growing number of fields have required LC maps with higher temporal and spatial resolution, and more government departments and institutions need LC maps as the basis for decision-making, planning, and budgeting. However, due to surface heterogeneity and spectral confusion, accurate LC classification and mapping still face many challenges, especially in utilizing time-series satellite data.

Landsat images have been a common data source for LC mapping and monitoring over the past 50 years (Hansen and Loveland, 2012; Gómez et al., 2016). The Sentinel-2 (S2) satellite provides higher spatial and spectral resolution data than Landsat and opens up new opportunities for LC classification (Pirotti et al., 2016; Forkuor et al., 2018). Recent studies have demonstrated that S2 data can classify LC from a single-date image (Clark, 2017; Mongus and Žalik, 2018). However, a single scene cannot capture the dynamic changes that help distinguish spectrally similar LC classes, and time-series images often perform better than single-date images (Sothe et al., 2017). S2 is superior to Landsat in spatial resolution, temporal resolution, and spectral bands, and it provides rich phenological, spatial, and spectral information, making it an increasingly important data source for LC classification.

Nevertheless, optical data become scarce under frequent cloud cover when more than one cloud-free observation per month is required in a time series for LC classification (Ju and Roy, 2008). Synthetic aperture radar (SAR) data have been increasingly utilized because they do not rely on sunlight and are not influenced by cloud and fog (DeFries, 2008; Freitas et al., 2008; Zeng et al., 2020). Because they exploit different physical principles, optical and radar data deliver complementary information and provide higher LC classification accuracy than a single data source does (Erasmi and Twele, 2009; Stefanski et al., 2014; Joshi et al., 2016). With the free open-access policy of the ESA Sentinel satellite constellation, both multi-sensor and multi-temporal LC mapping have become even more attractive (Drusch et al., 2012; Torres et al., 2012). Some studies have combined Sentinel-1 (S1) and S2 data for LC classification and mapping (Pesaresi et al., 2016; Ban et al., 2017; Chatziantoniou et al., 2017; Clerici et al., 2017).

LC is usually affected by natural conditions and human activities. Elevation data directly reflect the natural conditions of a region, while nighttime light images provide unique footprints of human activities and settlements. An increasing number of studies have added nighttime light and elevation data to LC classification and discussed their importance. For example, Tang et al. (2020) fused JL1-3B high-resolution nighttime light imagery with Sentinel-2 time-series imagery for impervious surface area mapping. Goldblatt et al. (2018) classified urban LC with a fusion approach using nighttime light data and Landsat imagery. Liu et al. (2020a) classified Landsat data for the Gannan Prefecture, and the results showed that the topographic features contributed the most, followed by the spectral indices and bands. Phan et al. (2020) used Landsat 8 data to classify the LC in Mongolia, and the results showed that elevation was the most important feature.

In recent years, deep learning (DL) and machine learning (ML) have gained increasing attention, and their use in LC classification is growing. DL is often confused with ML, but DL is a subset of ML, and both belong to the category of artificial intelligence (AI) (Chen et al., 2019; Klaiber, 2021). Commonly used ML algorithms include linear regression, logistic regression, naïve Bayes (NB), support vector machines (SVM), decision tree, Bayesian learning, K nearest neighbor (KNN), neural networks (NN), and random forest (RF) (Ray, 2019). DL algorithms are an upgraded version of artificial neural networks, and commonly used DL algorithms include the deep Boltzmann machine (DBM), deep belief network (DBN), convolutional neural network (CNN), graph convolutional network (GCN), recurrent neural network (RNN), and recursive neural network (RvNN) (Shrestha and Mahmood, 2019; Hong et al., 2021a; Hong et al., 2021b; Wu et al., 2022). In the field of LC classification, these state-of-the-art (SOTA) classification methods and their improved variants have been widely adopted in various research topics. Lin et al. (2018) extracted the LC types of Weihai from 1985 to 2015 using the SVM classification method with Landsat MSS/TM/OLI images. Thanh Noi and Kappas (2018) compared RF, KNN, and SVM for LC classification using Sentinel-2 in the Red River Delta. Sun et al. (2019) built a long short-term memory (LSTM) RNN model for LC classification in North Dakota with time-series Landsat data. Hong et al. (2021a) developed a new minibatch GCN for hyperspectral image classification, which allows large-scale GCNs to be trained in a minibatch fashion.

The RF (Breiman, 2001), an ML classifier built on decision trees, has become one of the most popular and promising LC classifiers because of its relatively stable and robust classification accuracy and its effectiveness in handling large, high-dimensional datasets, and it has been widely used in multi-temporal and multi-sensor image classification (Gislason et al., 2006; Rodriguez-Galiano et al., 2012; Pelletier et al., 2016; Ghorbanian et al., 2020; Phan et al., 2020; Ghorbanian et al., 2021). Bourgoin et al. (2020) used the RF algorithm for LC classification based on Landsat and S2 data and reported overall accuracy (OA) and Kappa index of 0.81 and 0.87, respectively.

Recently, the explosive growth in data volume, including multi-temporal and multi-sensor datasets that are effective for LC information extraction, has made processing on personal or workstation computers time-consuming and inefficient. The Google Earth Engine (GEE) provides easy access to multi-sensor and multi-temporal images and high-performance computation without downloading these data to a local machine (Gorelick et al., 2017; Kumar and Mutanga, 2018; Mutanga and Kumar, 2019; Amani et al., 2020; Tamiminia et al., 2020). Furthermore, temporal aggregation methods in the GEE, such as the median, significantly reduce cloud interference and resolve the problem of unavailable satellite data for specific periods. In addition, the availability of powerful classification models, such as RF (Shelestov et al., 2017), makes the GEE a widely used remote-sensing tool for LC research. Some studies have used the RF algorithm to classify LC on the GEE platform (Azzari and Lobell, 2017; Ghorbanian et al., 2020; Phan et al., 2020; Zhang et al., 2020; Yang et al., 2021).

Meanwhile, LC datasets at regional and global scales, including FROM-GLC30 (Gong et al., 2013), FROM-GLC10 (Gong et al., 2019), GlobeLand30 (Chen et al., 2015), GLC_FCS30 (Zhang et al., 2021), ESA WorldCover (Zanaga et al., 2021), and ESRI Land Cover (Krishna et al., 2021), are available for public use. In this study, LC datasets for 2020 were used as reference data to extract training and validation areas and to examine optimal parameters of RF for LC classification at a local scale on the GEE platform with S1 and S2 data in the model area. Then, the derived optimal parameters of RF were applied to classify LC types in the test area with the suitable dataset for validation.

Our specific research objectives were 1) to examine optimum parameters of RF for LC classification at local scale, 2) to classify LC data and assess accuracy in the model area (Hefei City), 3) to identify a suitable data type and dataset for LC classification, and 4) to validate the optimum parameters of RF for LC classification with a suitable data type and dataset in the test area (Nanjing City). The proposed method can reduce the cost of updating LC maps at local scale.

2 Materials and methods

2.1 Study area

Hefei City and Nanjing City were selected as the model and test areas respectively. Hefei City, with a population of more than 8 million, is the capital and the largest city in Anhui Province, China. It was chosen to examine an optimal parameter of RF and to identify a suitable data type and dataset for LC classification (Figure 1A). At the same time, Nanjing City, with a population of more than 8.5 million, is a sub-provincial city and the capital city in Jiangsu Province, China. It was selected to validate the optimal parameters of RF for LC classification under GEE with suitable datatype and dataset (Figure 1B).

FIGURE 1. Location of the study area: (A) model area (Hefei City), and (B) test area (Nanjing City).

2.2 Data

The data used in this study were categorized into three groups: 1) Sentinel satellite data, 2) auxiliary data, and 3) LC data. Brief information on each data group is summarized below.

2.2.1 Sentinel satellite data

Both the S1 and S2 satellites were launched by the European Space Agency (ESA).

(1) S1 data (ESA, 2023b). The S1 ground range detected (GRD) products were delivered with polarizations of the vertical transmit vertical receive (VV) and the vertical transmit horizontal receive (VH). The S1 can provide continuous all-weather, high spatial (10 m), and improved temporal resolution images at C-band unaffected by clouds, to support land monitoring (Cremer et al., 2020).

(2) S2 data (ESA, 2023c). The S2 multi-spectral instrument (MSI) sensor provides high spatial (10 m) and multi-spectral images over the global surface, with unprecedented potential in LC monitoring and mapping (Drusch et al., 2012; Spoto et al., 2012; Zheng et al., 2017).

In this study, S1 GRD and S2 MSI products acquired in 2020 were the primary data input for LC classification.

2.2.2 Auxiliary data

(1) Elevation (NASA, 2023). The 30 m spatial resolution elevations over the study area were extracted from the Shuttle Radar Topography Mission (SRTM) data (Su et al., 2021). It is helpful to distinguish various LC types if we adopt the combination of satellite images and physiography variables (Phan et al., 2020).

(2) Nighttime light products (EOG, 2023). Obtained from the NPP-VIIRS day/night band (DNB), the 464 m spatial resolution nighttime light products were used to distinguish between artificial surfaces and bare ground (Miller et al., 2013).

2.2.3 LC data

(1) ESA WorldCover (ESA, 2023a). It is a global LC product with 10 m spatial resolution published by the ESA WorldCover team based on S1 and S2 data, which consists of 11 LC classes (Zanaga et al., 2021).

(2) ESRI Land Cover (ESRI, 2023). It is also a global LC product with 10 m spatial resolution produced by ESRI Impact Observatory through S2 image, which consists of 10 LC classes (Krishna et al., 2021).

The areas where the LC types of the ESA WorldCover and ESRI Land Cover products agree were used to select training and validation samples in this study.

2.3 Methods

The research methodology consisted of 7 steps: 1) preprocessing of Sentinel satellite data, 2) feature extraction and dataset preparation, 3) selection of training and validation samples, 4) optimal parameters identification of RF, 5) LC classification and accuracy assessment in model area, 6) suitable datatype and dataset identification for LC classification, and 7) LC classification and accuracy assessment in test area. The workflow is displayed in Figure 2. Details are described in the following sections.

FIGURE 2. Workflow of research methodology.

2.3.1 Preprocessing of sentinel satellite data

There are two steps for preprocessing of Sentinel satellite data.

(1) Data selection, cloud pixels masking, and topographic correction for S2

Unlike S1 data, which are not affected by clouds, S2 scenes are often covered by clouds, so S2 data must be pre-processed to minimize the impact of cloud coverage. Only S2 scenes with a cloud coverage percentage of less than 85% were selected for subsequent steps, according to the mask information of QA60 in the S2 image collection; scenes with a cloud coverage percentage of more than 85% were removed. Then, the cloud pixels in the selected S2 scenes were identified and masked based on the QA60 band, so that they did not participate in subsequent processing. Moreover, topographic correction was performed (Soenen et al., 2005) to compensate for differences in solar irradiance, thereby minimizing terrain-induced reflectance changes. This step was implemented by executing open-source code on GEE (https://mygeoblog.com/2018/07/27/sentinel-2-terrain-correction/). Finally, all data used in this study were clipped to the extent of the study area to improve computational efficiency.

(2) Monthly/Seasonal median calculation for S1 and S2

To minimize the influence of holes in the image caused by the masked cloud cover pixels in the S2 scene in the previous step, and to reduce the amount of data to improve the speed of classification, the median calculation was performed. The monthly median for 12 months and seasonal median for four seasons were calculated by time aggregation for both S1 and S2 data (Luo et al., 2021; Shetty et al., 2021; Masroor et al., 2022).
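The two preprocessing steps can be illustrated for a single pixel and band with the plain-Python sketch below. It is not the GEE code used in the study. It assumes that cloud pixels are flagged by QA60 bit 10 (opaque clouds) and bit 11 (cirrus), and the month-to-season mapping in `SEASONS` is a hypothetical choice, since the text does not state which months form each season.

```python
from statistics import median

OPAQUE_CLOUD_BIT = 1 << 10  # QA60 bit 10: opaque clouds
CIRRUS_BIT = 1 << 11        # QA60 bit 11: cirrus clouds

# Hypothetical season definition (meteorological seasons); the text does not
# specify how the four seasons were delimited.
SEASONS = {
    "spring": (3, 4, 5),
    "summer": (6, 7, 8),
    "autumn": (9, 10, 11),
    "winter": (12, 1, 2),
}


def is_clear(qa60):
    """Return True when neither cloud flag is set in a QA60 value."""
    return (qa60 & (OPAQUE_CLOUD_BIT | CIRRUS_BIT)) == 0


def mask_clouds(values, qa60_values):
    """Replace cloud-flagged observations with None."""
    return [v if is_clear(q) else None for v, q in zip(values, qa60_values)]


def median_composite(observations, months):
    """Median of the unmasked values acquired in the given months.

    observations is a list of (month, value) pairs; value is None where the
    pixel was masked. Returns None when no clear observation is available.
    """
    clear = [v for m, v in observations if m in months and v is not None]
    return median(clear) if clear else None


def monthly_and_seasonal_medians(observations):
    """Return (12 monthly medians, 4 seasonal medians) for one pixel and band."""
    monthly = {m: median_composite(observations, (m,)) for m in range(1, 13)}
    seasonal = {
        name: median_composite(observations, months)
        for name, months in SEASONS.items()
    }
    return monthly, seasonal
```

A period with no clear observation yields `None`, which corresponds to the remaining holes that a longer aggregation window (season instead of month) is more likely to fill.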

2.3.2 Feature extraction and dataset preparation

For S1 data, the VV and VH polarizations of each monthly/seasonal median and the ratio between VH and VV were extracted.

For S2 data, spectral bands (Blue, Green, Red, Red Edge 1, Red Edge 2, Red Edge 3, NIR, Red Edge 4, SWIR 1, and SWIR 2) were extracted. Additionally, three indices representing vegetation, urban and built-up, and wetness features were calculated using the following equations: the normalized difference vegetation index (NDVI) (Tucker, 1979), the normalized difference built-up index (NDBI) (Zha et al., 2003), and the normalized difference water index (NDWI) (Gao, 1996).

$$\mathrm{NDVI} = \frac{\rho_{NIR} - \rho_{RED}}{\rho_{NIR} + \rho_{RED}} \quad (1)$$

$$\mathrm{NDBI} = \frac{\rho_{SWIR2} - \rho_{NIR}}{\rho_{SWIR2} + \rho_{NIR}} \quad (2)$$

$$\mathrm{NDWI} = \frac{\rho_{NIR} - \rho_{SWIR1}}{\rho_{NIR} + \rho_{SWIR1}} \quad (3)$$

where $\rho_{RED}$, $\rho_{NIR}$, $\rho_{SWIR1}$, and $\rho_{SWIR2}$ are the reflectances of Band 4, Band 8, Band 11, and Band 12 of the S2 satellite, respectively.
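Equations 1 to 3 share one normalized-difference form, which the following minimal sketch makes explicit. The function names are illustrative, and returning 0.0 for a zero denominator is an assumption made here to keep the example total; it is not a convention stated in the text.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b); 0.0 is returned when a + b is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(red, nir, swir1, swir2):
    """Compute NDVI, NDBI, and NDWI as defined in Eqs. 1-3.

    Inputs are reflectances of S2 Band 4 (red), Band 8 (nir),
    Band 11 (swir1), and Band 12 (swir2).
    """
    return {
        "NDVI": normalized_difference(nir, red),
        "NDBI": normalized_difference(swir2, nir),
        "NDWI": normalized_difference(nir, swir1),
    }
```

For example, `spectral_indices(red=0.1, nir=0.5, swir1=0.3, swir2=0.2)` gives a positive NDVI and NDWI and a negative NDBI, the pattern expected for a vegetated pixel.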

For auxiliary data, nighttime light data from the NPP-VIIRS DNB and elevation data from the SRTM were added to the S1 and S2 data.

Finally, eighteen datasets were designed to examine an optimal parameter of RF and suitable data type and dataset for LC classification (Table 1). According to the data type, the datasets were categorized into six groups: S1 data (D1–D3), S1 and auxiliary data (D4–D6), S2 data (D7–D9), S2 and auxiliary data (D10–D12), S1 and S2 data (D13–D15), and S1, S2, and auxiliary data (D16–D18).

TABLE 1. Combination of features in 18 datasets for classifying LC data.

2.3.3 Selection of training and validation samples

In this study, samples were selected from the areas where the ESA WorldCover and ESRI Land Cover data for 2020 have consistent LC attributes. In practice, the LC map from ESA WorldCover and the LC map from ESRI Land Cover were compared to find pixels with the same LC attributes, which served as potential samples. Then, a stratified proportional random sampling method was used to draw the final samples from the potential samples, which prevents the final samples from being too concentrated in a certain area or a certain LC type. As a result, a total of 5,000 samples were obtained in the model area (Hefei City), and they were split into 70% training samples and 30% validation samples by using the “randomColumn” function on the GEE platform.
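The sampling procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the study's GEE script: both reference maps are assumed to be already recoded to a common legend and flattened to equal-length lists, and the helper names are hypothetical. The split imitates a random-column split by drawing a uniform random number per sample.

```python
import random
from collections import defaultdict


def consistent_pixels(map_a, map_b):
    """Return (index, label) for pixels where both reference maps agree."""
    return [(i, a) for i, (a, b) in enumerate(zip(map_a, map_b)) if a == b]


def stratified_proportional_sample(candidates, n_samples, rng):
    """Draw samples per class in proportion to each class's candidate count."""
    by_class = defaultdict(list)
    for index, label in candidates:
        by_class[label].append(index)
    total = len(candidates)
    sample = []
    for label, indices in sorted(by_class.items()):
        # Rounding means the total can differ slightly from n_samples.
        k = min(len(indices), round(n_samples * len(indices) / total))
        sample.extend((i, label) for i in rng.sample(indices, k))
    return sample


def split_by_random_column(sample, rng, train_fraction=0.7):
    """Assign each sample a uniform random value and split at train_fraction."""
    train, validation = [], []
    for item in sample:
        (train if rng.random() < train_fraction else validation).append(item)
    return train, validation
```

Because the split is random per sample, the realized proportions are only approximately 70% and 30%.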

2.3.4 Optimal parameter of RF identification for LC classification

The operation of the RF algorithm on the GEE platform requires six parameters. In this study, the number of trees (NT) from 100 to 1,000 at 100-tree intervals, the variables per split (VPS) from 1 to 30 at 1-variable intervals, and the bag fraction (BF) from 0.1 to 1 at 0.1 intervals were examined to identify optimal parameters of RF for LC classification. As a result, a total of 3,000 combinations (10 NT values × 30 VPS values × 10 BF values) were generated and evaluated; that is, the RF classifier was executed 3,000 times for each dataset (D1 to D18). Then, the optimal values of the three primary parameters, NT, VPS, and BF, were selected based on the smallest out-of-bag (OOB) error. Meanwhile, the other three parameters, maximum nodes, minimum leaf population, and seed, were set to their default values of null, 1, and 0, respectively.
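The size of the search space follows directly from the three parameter ranges, as the short sketch below shows. The `oob_error` argument is a placeholder for whatever routine trains the classifier and returns its OOB error; it is a hypothetical callable introduced for illustration only.

```python
from itertools import product

NT_VALUES = list(range(100, 1001, 100))                  # 100, 200, ..., 1000
VPS_VALUES = list(range(1, 31))                          # 1, 2, ..., 30
BF_VALUES = [round(0.1 * i, 1) for i in range(1, 11)]    # 0.1, 0.2, ..., 1.0

# Every (NT, VPS, BF) combination examined for one dataset.
PARAMETER_GRID = list(product(NT_VALUES, VPS_VALUES, BF_VALUES))


def best_parameters(grid, oob_error):
    """Return the (NT, VPS, BF) tuple with the smallest OOB error.

    oob_error is a caller-supplied function taking (nt, vps, bf) and
    returning the OOB error of a classifier trained with those values.
    """
    return min(grid, key=lambda params: oob_error(*params))
```

With 18 datasets, an exhaustive search of this grid amounts to 54,000 classifier runs, which is why a cloud platform is practical for this step.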

2.3.5 LC classification and accuracy assessment in model area

This study applied the RF algorithm with optimum parameters on the GEE platform to classify six LC types in the model area: urban and built-up land, cropland, forest land, grassland, water bodies, and bare land. Urban and built-up land comprises rural and urban areas, commercial and industrial areas, transportation, utilities, and infrastructure. Cropland consists of crops, orchards, tea gardens, and vegetable fields. Forest land includes natural and man-made forests. Grassland encompasses natural grass and artificial grassland. Water bodies consist of rivers, streams, ponds, and lakes. Bare land includes abandoned fields, exposed rock or soil, and open land. The LC types in this study were chosen on the basis of the regional characteristics of Hefei City and the existing LC products. The LC classification was performed with the “ee.Classifier.smileRandomForest” function in the GEE.

In practice, the OA, Kappa index, producer’s accuracy (PA), and user’s accuracy (UA) were calculated using the “confusionMatrix.accuracy,” “confusionMatrix.kappa,” “confusionMatrix.producersAccuracy,” and “confusionMatrix.consumersAccuracy” functions on the GEE platform, respectively, and were used for the accuracy assessment (Huang et al., 2017; Lyons et al., 2018). Then, 18 groups of OA, Kappa index, PA, and UA values corresponding to the 18 datasets were obtained after executing the RF classifier.
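For reference, the four accuracy measures can be computed from an error matrix as in the sketch below. It is a plain-Python illustration, not the GEE implementation, and it assumes that rows hold the reference classes and columns hold the classified classes.

```python
def accuracy_metrics(matrix):
    """Compute OA, Kappa, PA, and UA from a square error matrix.

    matrix[i][j] is the number of validation samples whose reference class
    is i and whose classified class is j (an assumed layout).
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]                              # reference totals
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]   # classified totals
    diagonal = [matrix[i][i] for i in range(k)]

    oa = sum(diagonal) / n
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (oa - expected) / (1 - expected)
    # PA: correct / reference total; UA: correct / classified total.
    pa = [d / r if r else float("nan") for d, r in zip(diagonal, row_totals)]
    ua = [d / c if c else float("nan") for d, c in zip(diagonal, col_totals)]
    return {"OA": oa, "kappa": kappa, "PA": pa, "UA": ua}
```

For the two-class matrix `[[45, 5], [10, 40]]`, this gives an OA of 0.85 and a Kappa index of 0.70, which shows how Kappa discounts the agreement expected by chance.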

2.3.6 Suitable datatype and dataset identification for LC classification

In this step, the appropriate datasets for LC classification were selected by comparing the OA and Kappa index values. Average values of OA and Kappa index by data type, as categorized into six groups in Table 1, were used to identify an appropriate data type for LC classification using RF on the GEE platform. Meanwhile, Kappa index values were compared among the 18 datasets, with a threshold value of 0.8, to identify the suitable dataset for LC classification using RF on the GEE platform.

Moreover, a pairwise Z-test among the top three datasets with the highest Kappa values was applied to test whether their differences were significant, as suggested by Congalton and Green (2009):

$$Z = \frac{\left|\hat{K}_{1} - \hat{K}_{2}\right|}{\sqrt{\widehat{\mathrm{var}}(\hat{K}_{1}) + \widehat{\mathrm{var}}(\hat{K}_{2})}} \quad (4)$$

where Z is the test statistic, which follows the standard normal distribution, $\hat{K}_{1}$ and $\hat{K}_{2}$ are the Kappa indices of the first and the second dataset, and $\widehat{\mathrm{var}}(\hat{K}_{1})$ and $\widehat{\mathrm{var}}(\hat{K}_{2})$ are the variances of the Kappa indices of the first and the second dataset. The variance of the Kappa index was calculated through Eq. 5.

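$$\widehat{\mathrm{var}}(\hat{K}) = \frac{1}{n} \left\{ \frac{\theta_{1} (1 - \theta_{1})}{(1 - \theta_{2})^{2}} + \frac{2 (1 - \theta_{1}) (2 \theta_{1} \theta_{2} - \theta_{3})}{(1 - \theta_{2})^{3}} + \frac{(1 - \theta_{1})^{2} (\theta_{4} - 4 \theta_{2}^{2})}{(1 - \theta_{2})^{4}} \right\} \quad (5)$$

where

$$\theta_{1} = \frac{1}{n} \sum_{i=1}^{k} n_{ii},$$

$$\theta_{2} = \frac{1}{n^{2}} \sum_{i=1}^{k} n_{i+} n_{+i},$$

$$\theta_{3} = \frac{1}{n^{2}} \sum_{i=1}^{k} n_{ii} (n_{i+} + n_{+i}), \text{ and}$$

$$\theta_{4} = \frac{1}{n^{3}} \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} (n_{j+} + n_{+i})^{2}.$$

Here n is the total number of samples in the error matrix, k is the number of classes, $n_{ij}$ is the number of samples in row i and column j, and $n_{i+}$ and $n_{+i}$ are the totals of row i and column i, respectively.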

In principle, given the null hypothesis $H_{0}: K_{1} - K_{2} = 0$ and the alternative $H_{1}: K_{1} - K_{2} \neq 0$, the null hypothesis is rejected if $Z \geq Z_{\alpha/2}$ (Congalton and Green, 2009).
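Equations 4 and 5 can be checked numerically with the sketch below, which computes the Kappa index, its variance, and the pairwise Z statistic from two error matrices. The function names are illustrative, and the code assumes square matrices given as lists of lists with `matrix[i][j]` equal to $n_{ij}$.

```python
from math import sqrt


def kappa_and_variance(matrix):
    """Return (Kappa, variance of Kappa) following Eq. 5."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row = [sum(matrix[i]) for i in range(k)]                        # n_{i+}
    col = [sum(matrix[i][j] for i in range(k)) for j in range(k)]   # n_{+i}

    t1 = sum(matrix[i][i] for i in range(k)) / n
    t2 = sum(row[i] * col[i] for i in range(k)) / n ** 2
    t3 = sum(matrix[i][i] * (row[i] + col[i]) for i in range(k)) / n ** 2
    t4 = sum(
        matrix[i][j] * (row[j] + col[i]) ** 2
        for i in range(k)
        for j in range(k)
    ) / n ** 3

    kappa = (t1 - t2) / (1 - t2)
    variance = (
        t1 * (1 - t1) / (1 - t2) ** 2
        + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
        + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4
    ) / n
    return kappa, variance


def kappa_z_statistic(matrix_1, matrix_2):
    """Return the pairwise Z statistic of Eq. 4 for two error matrices."""
    k1, v1 = kappa_and_variance(matrix_1)
    k2, v2 = kappa_and_variance(matrix_2)
    return abs(k1 - k2) / sqrt(v1 + v2)
```

Two identical error matrices give Z = 0, and Z grows as the two Kappa indices diverge relative to their combined variance.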

2.3.7 LC classification and accuracy assessment in test area

The suitable data type for LC classification identified in the model area (Hefei City) was first prepared for the test area (Nanjing City). Then, the prepared datasets were used to classify LC types with the optimal parameters of RF identified in Section 2.3.4. The thematic accuracy of the classified LC map was then assessed. Finally, the thematic accuracy of Hefei City and Nanjing City was compared to validate LC classification using RF on the GEE platform.

3 Results

3.1 Optimal parameters of RF for LC classification

For this study, the three critical parameters, NT, VPS, and BF, of the RF classifier on the GEE platform were analyzed, and the OOB error was used to select the optimal one for 18 datasets. Figure 3 shows the average OOB error values of all 18 datasets at different NT. The OOB error tends to decrease as the NT increases, no matter what kind of datasets are used. When the NT is higher than 600, the OOB error value changes less; however, the execution time of the algorithm becomes longer when the NT increases. To balance the efficiency and accuracy of the RF algorithm, the optimal NT for all 18 datasets was set to 800 because the OOB errors of all datasets did not change much after increasing the NT to 900 and 1,000.

FIGURE 3. The average OOB error values for the number of trees (NT) of 18 datasets: (A) D1–D6; (B) D7–D12; and (C) D13–D18.

Figure 4 shows the average OOB error values of all 18 datasets at different BF when NT is 800. Regardless of the datasets used, OOB errors first decrease and then increase as the BF increases. For D1, D2, D3, and D5, the value of OOB reaches the minimum when the BF is around 0.6, while for other datasets, the value of OOB is the smallest when the BF is 0.9.

FIGURE 4. The average OOB error values for the bag fraction (BF) of 18 datasets: (A) D1–D6; (B) D7–D12; and (C) D13–D18.

Figure 5 shows the average OOB error values for all 18 datasets at different VPS when NT is 800. For D2, D5, D8, D11, D14, and D17, which use only the seasonal median or the seasonal median plus auxiliary data, the OOB error first decreases and then increases. For the other datasets, the OOB error decreases and then stabilizes as the VPS value increases. The VPS value marked by the red solid circle in Figure 5 is the value suggested by GEE (the square root of the number of variables). In Figure 5A, when only the S1 data or S1 plus auxiliary data are used, the VPS value suggested by GEE is appropriate, and the OOB error obtained is small. In Figures 5B, C, a VPS value larger than the GEE recommendation helps reduce the OOB error, especially for the four datasets D10, D12, D16, and D18. We therefore recommend using 1.5 times the square root of the number of variables as the value of VPS, as marked by the blue solid circle in the figure. With this value, the OOB errors of the 12 datasets D7–D18 are all small.
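The recommended rule can be written as a one-line function, sketched below. Rounding to the nearest integer is an assumption, since the text does not say how a fractional value is handled. The worked figure of 210 variables for D12 is also an inference from the dataset description (10 bands plus 3 indices, over 12 monthly and 4 seasonal medians, plus elevation and nighttime light), not a number taken from Table 1.

```python
from math import sqrt


def default_vps(n_variables):
    """Square root of the number of variables, rounded (GEE-style default)."""
    return max(1, round(sqrt(n_variables)))


def recommended_vps(n_variables, factor=1.5):
    """VPS as 1.5 times the square root of the number of variables, rounded."""
    return max(1, round(factor * sqrt(n_variables)))


# Assumed feature count for D12: (10 bands + 3 indices) * (12 + 4) + 2 = 210.
D12_VARIABLES = (10 + 3) * (12 + 4) + 2
```

Under these assumptions, `recommended_vps(D12_VARIABLES)` evaluates to 22, which agrees with the VPS value reported in the abstract, whereas `default_vps(D12_VARIABLES)` evaluates to 14.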

FIGURE 5. The average OOB error values for the variables per split (VPS) of 18 datasets: (A) D1–D6; (B) D7–D12; and (C) D13–D18.

Table 2 summarizes the optimal combination values for the parameter pair of VPS and BF with the smallest OOB error value when the NT value is set to 800. It was found that the optimal BF value was 0.9 for D7 to D18, see Table 2, which differs from the default value of BF (0.5) suggested by the GEE.

TABLE 2. Optimal combination values of VPS and BF.

3.2 LC classification in model area

The results of the LC classification in the model area (Hefei City) in 2020 for the 18 datasets (D1–D18), using the identified optimum parameters of the RF on the GEE, are presented in Figure 6 and Table 3.

FIGURE 6. LC maps of Hefei City in 2020 of 18 datasets: D1–D18.

TABLE 3. Area of each LC type of 18 datasets: D1–D18.

As shown in Figure 6, the patterns of LC distribution from the 18 datasets differ slightly according to the number of features in the datasets. Likewise, the area of each LC type in Table 3 changes according to the number of features in the datasets. These differences can be clearly observed in the change of each LC type across the 18 datasets (see Figure 7). Urban and built-up land areas increase after auxiliary data are added to the Sentinel-1 data (D1–D3), and they are relatively stable. Similarly, areas of forest land are stable for the D7–D18 datasets. In contrast, areas of cropland, grassland, water bodies, and bare land fluctuate across all 18 datasets.

FIGURE 7. Change of each LC area from 18 datasets: (A) urban and built-up land, (B) cropland, (C) forest land, (D) grassland, (E) water bodies, and (F) bare land.

3.3 Thematic accuracy assessment of LC map in model area

The results of the OA and Kappa index for the thematic accuracy assessment of LC maps of 18 datasets (D1–D18) in model area (Hefei City) are presented in Figure 8. As a result, the OA values vary from 71.37% for D2 to 93.62% for D18, and the Kappa index values vary from 0.6017 for D2 to 0.9154 for D18.

FIGURE 8. OA and Kappa index results for the 18 datasets: (A) OA and (B) Kappa index.

Meanwhile, the PA and UA values of each LC type are calculated on GEE. The PA values of urban and built-up land vary from about 51% for D2 to 95% for D18, the PA values of bare land vary from about 26% for D1 to 80% for D10, the PA values of cropland vary from about 89% for D2 to 98% for D10, the PA values of forest land vary from about 58% for D2 to 95% for D16, the PA values of grassland vary from about 12% for D1 to 82% for D12, and the PA values of water bodies vary from about 95% for D5 to 99% for D9. The UA values of urban and built-up land vary from about 67% for D3 to 93% for D12, the UA values of bare land vary from about 46% for D2 to 91% for D10, the UA values for cropland vary from about 69% for D2 to 93% for D18, the UA values of forest land vary from about 64% for D2 to 98% for D16, the UA values of grassland vary from about 45% for D2 to 95% for D18, and the UA values of water bodies vary from about 97% for D3 to 100% for D14 and D16.

In addition, the average PA and UA of three main data types of 18 datasets, S1 (D1–D6), S2 (D7–D12), and S1 and S2 (D13–D18), are summarized in Table 4.

TABLE 4. Average PA and UA of three primary data types of 18 datasets (D1–D18).

For average PA in Table 4, the six datasets of S1 data (without and with auxiliary data), D1-D6, delivered average PA values from 32.20% for bare land to 96.84% for water bodies. On the contrary, the six datasets of S2 data (without and with auxiliary data), D7–D12, delivered average PA values from 73.00% for grassland to 98.16% for water bodies. Meanwhile, the six datasets of S1 and S2 data (without and with auxiliary data), D13–D18, delivered average PA values from 72.69% for bare land to 97.55% for water bodies.

Like PA, for average UA in Table 4, the six datasets of S1 data (without and with auxiliary data), D1-D6, delivered average UA values from 54.75% for grassland to 97.88% for water bodies. On the contrary, the six datasets of S2 data (without and with auxiliary data), D7-D12, delivered average UA values from 84.19% for grassland to 98.16% for water bodies. Meanwhile, the six datasets of S1 and S2 data (without and with auxiliary data), D13-D18, delivered average UA values from 84.55% for bare land to 99.47% for water bodies.

Furthermore, all six datasets of S1 data (without and with auxiliary data), (D1–D6), delivered a lower average PA and UA than the other two main data types.

3.4 Suitable data type for LC classification in model area

As summarized in Table 5, the OA, the Kappa index, and their average values by data type for S2 data without auxiliary data (D7–D9) and with auxiliary data (D10–D12) were higher than those of S1 data without auxiliary data (D1–D3) and with auxiliary data (D4–D6). The average values of OA for the datasets of S2 data without and with auxiliary data were 90.94% and 92.64%, respectively, but the average values of OA for the datasets of S1 data without and with auxiliary data were only 72.61% and 82.66%, respectively. Likewise, the average values of the Kappa index for datasets of S2 data without and with auxiliary data were 0.8796 and 0.9004, respectively, but the average values of the Kappa index for datasets of S1 data without and with auxiliary data were only 0.6216 and 0.7667, respectively. Therefore, S2 data with and without auxiliary data were suitable for LC classification in terms of data type.

TABLE 5. OA, Kappa index, and average value by data type.

3.5 Suitable dataset for LC classification in model area

To identify a suitable dataset for LC classification, Kappa index values were compared among the 18 datasets (see Table 5). The top three datasets, D18, D16, and D12, provided the highest Kappa index, with values of 0.9154, 0.9133, and 0.9102, respectively. In contrast, three datasets, D3, D1, and D2, displayed the lowest Kappa index, with values of 0.6434, 0.6198, and 0.6017, respectively.

The z-statistic values for the pairs D12 and D16, D12 and D18, and D16 and D18 are 0.130737, 0.341274, and 0.208664, respectively, all of which are less than the critical values of 1.28 (80% confidence level), 1.65 (90% confidence level), and 2.58 (99% confidence level). This means that, among the top three datasets (D12, D16, and D18), the increase in the Kappa index of D16 and D18 over D12 is not statistically significant. Therefore, D12 was selected as the suitable dataset for LC classification, because it avoids the time needed to prepare S1 data.
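The significance check described above can be sketched as follows. The block derives two-tailed critical values of the standard normal distribution and compares the z-statistics reported in this section against them. The function names `kappa_z` and `critical_z` are hypothetical. The form of `kappa_z` (difference of two Kappa values divided by the square root of the sum of their variances) is the commonly used pairwise test and is assumed here, because the Kappa variances are not reported in this section.

```python
from math import sqrt
from statistics import NormalDist


def kappa_z(kappa_a: float, var_a: float, kappa_b: float, var_b: float) -> float:
    """Pairwise z-statistic for two independent Kappa estimates (assumed form)."""
    return abs(kappa_a - kappa_b) / sqrt(var_a + var_b)


def critical_z(confidence: float) -> float:
    """Two-tailed critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


# z-statistics as reported in the text for the top three datasets.
reported_z = {
    "D12 vs D16": 0.130737,
    "D12 vs D18": 0.341274,
    "D16 vs D18": 0.208664,
}

for level in (0.80, 0.90, 0.99):
    threshold = critical_z(level)
    for pair, z in reported_z.items():
        verdict = "significant" if z > threshold else "not significant"
        print(f"{pair}: z = {z:.6f}, critical = {threshold:.2f} "
              f"({level:.0%} level), {verdict}")
```

Because every reported z-statistic is far below even the 80% critical value, the choice among D12, D16, and D18 can be made on practical grounds such as data preparation effort.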

3.6 LC classification in test area

The results of LC classification in the test area (Nanjing City) in 2020 from the datasets of the suitable data type (D7–D12) using RF on the GEE platform are presented in Figure 9 and Table 6.

FIGURE 9. LC maps of Nanjing City in 2020 from datasets: D7–D12.

TABLE 6. Area of each LC type and OA and Kappa index results in Nanjing City in 2020.

As shown in Figure 9, the patterns of LC distribution from the six datasets differ according to the number of features in each dataset. Meanwhile, across the D7–D12 datasets, the areas of forest land are stable, while the areas of urban and built-up land, cropland, grassland, and water bodies fluctuate slightly (see Table 6).

Furthermore, the thematic accuracy assessment of the LC maps from the six datasets (D7–D12) is presented in Table 6. The OA values vary from 88.10% for D8 to 92.38% for D12, and the Kappa index values vary from 0.8361 for D8 to 0.8914 for D12. As a result, the LC map classified in the test area (Nanjing City) using the optimal RF parameters identified in the model area (Hefei City) with the suitable dataset can be accepted.

4 Discussion

4.1 Optimal parameters for RFs

Based on the OOB error measurement, the optimal number of trees (NT) was 800, which balances the efficiency and accuracy of the RF (see Figure 3). Meanwhile, the optimal combination of VPS and BF values of the RF algorithm for each dataset (D1–D18) was identified by trial and error, selecting the combination with the smallest OOB error value, as summarized in Table 2. The selection of a suitable NT and appropriate VPS and BF values can increase the classification accuracy of the RF classifier. This finding is consistent with the study of Svoboda et al. (2022), which used appropriate values of NT, VPS, and BF of the RF for land use change and forestry in the Czech Republic.
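The trial-and-error search over VPS and BF described above can be sketched as a small grid search that keeps the combination with the smallest OOB error. This is only an illustration under stated assumptions: `best_vps_bf` is a hypothetical helper, and `toy_oob_error` is a made-up stand-in for training an RF and reading its OOB error. Its shape and minimum are arbitrary and do not reproduce the values in Table 2.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def best_vps_bf(
    oob_error: Callable[[int, float], float],
    vps_values: Iterable[int],
    bf_values: Iterable[float],
) -> Tuple[int, float, float]:
    """Return (VPS, BF, OOB error) for the grid point with the smallest OOB error."""
    best = None
    for vps, bf in product(vps_values, bf_values):
        error = oob_error(vps, bf)
        # Keep the first combination that reaches the smallest error.
        if best is None or error < best[2]:
            best = (vps, bf, error)
    if best is None:
        raise ValueError("The parameter grid is empty.")
    return best


def toy_oob_error(vps: int, bf: float) -> float:
    """Hypothetical error surface used only to exercise the search."""
    return 0.10 + 0.0002 * (vps - 10) ** 2 + 0.05 * (bf - 0.7) ** 2


vps_grid = range(5, 31, 5)            # 5, 10, ..., 30
bf_grid = [0.5, 0.6, 0.7, 0.8, 0.9]
vps, bf, error = best_vps_bf(toy_oob_error, vps_grid, bf_grid)
print(f"Selected VPS = {vps}, BF = {bf}, OOB error = {error:.4f}")
```

In practice, the callable would wrap one RF training run per grid point, so the cost of the search grows with the product of the two grid sizes.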

4.2 Suitable data type and dataset for LC classification in model area

According to the OA, the Kappa index, and the average values by data type in Table 5, S2 data without and with auxiliary data, D7–D9 and D10–D12, respectively, were the suitable data types compared with S1 data without and with auxiliary data, D1–D3 and D4–D6, respectively. In this study, S2 data could distinguish between forest land and grassland better than S1 data, since broad vegetation classes are discriminated more easily by their physiology, as observed by the optical sensor, than by their physical structure, as observed by the radar sensor. This finding is consistent with LC studies combining optical and radar data in other geographic regions (Vaglio Laurin et al., 2013; Stefanski et al., 2014).

For suitable dataset identification, when comparing the OA and Kappa index of the 18 datasets, the top three datasets, D18, D16, and D12, provided the highest OA and Kappa index, as reported in Table 5. However, the Kappa index values of these three datasets were not significantly different according to the pairwise Z test. This study therefore selected D12 (S2 monthly median + S2 season median + DNB + elevation) as the suitable dataset for LC classification. This combination of S2 and auxiliary data provided an OA and a Kappa index of 93.17% and 0.9102, respectively. This dataset was sufficient and acceptable to classify LC with high accuracy, as suggested by Anderson et al. (1976) and Rosenfield and Fitzpatrick-Lins (1986).

4.3 LC classification and its accuracy in model area

For the LC classification in the model area (Hefei City) shown in Figure 6 and Table 3, the patterns of LC distribution and the area of each LC type depended on the selected features of each dataset (D1–D18). Meanwhile, the OA values of the 18 datasets varied from 71.37% to about 94%, and their Kappa index values varied from 0.6017 (or about 60%) to 0.9154 (or about 92%) (see Figure 8). The OA and Kappa index of the LC maps from the D7–D18 datasets were higher than 85% and 0.8, respectively, indicating that the classification results are acceptable (Anderson et al., 1976; Rosenfield and Fitzpatrick-Lins, 1986). The results suggested that the datasets of S2 data without and with auxiliary data, D7–D12, could classify LC better than the datasets of S1 data without and with auxiliary data, D1–D6. The results are consistent with other studies (Gislason et al., 2006; Rodriguez-Galiano et al., 2012; Pelletier et al., 2016; Ghorbanian et al., 2020; Phan et al., 2020; Ghorbanian et al., 2021); that is, when classifying LC, the classification accuracy of S2 data is better than that of S1 data.

According to the PA and UA results of each LC type (Supplementary Appendix S1, S2) and the average PA and UA of the three primary data types of the 18 datasets (Table 4), the derived PA and UA values of each LC type depend on the data type and its features for LC classification using RF. In this study, water bodies can be classified with highly accurate PA and UA values in all 18 datasets. Similarly, cropland can be classified with a highly accurate PA value in all 18 datasets, whereas it can be classified with a highly accurate UA value in 15 datasets, but not in D1–D3. In contrast, urban and built-up land, bare land, forest land, and grassland can be classified with highly accurate PA and UA values in 12 datasets, but not in D1–D6.

4.4 LC classification and its accuracy in test area

For the LC classification in the test area (Nanjing City) displayed in Figure 9 and Table 6, the patterns of LC distribution and the area of each LC type also depend on the selected features of each dataset (D7–D12). Meanwhile, the OA values of the six datasets ranged between 88.10% and 92.38%, and their Kappa index values varied from 0.8361 (or about 84%) to 0.8914 (or about 89%) (see Table 6). The OA and Kappa index of each LC map from the D7–D12 datasets were higher than 85% and 0.8, respectively, indicating that they can provide acceptable results (Anderson et al., 1976).

In terms of both OA and the Kappa index, the datasets of S2 data with auxiliary data (D10–D12) score higher than the datasets of S2 data without auxiliary data (D7–D9). In addition, the datasets that combine the monthly median and the season median (D9 and D12) have the highest OA and Kappa index, followed by the datasets using the monthly median (D7 and D10). Although the datasets using the season median (D8 and D11) yield the lowest OA and Kappa index, their values are still higher than 85% and 0.8, respectively, which are acceptable results (Anderson et al., 1976).

4.5 Validation of optimal parameter of RF for LC classification

With regard to NT, the OOB error decreases as NT increases, which means that a larger NT tends to improve classification accuracy. However, a larger NT also requires more memory and computation time, which reduces classification efficiency. In addition, the gain in accuracy brought by additional trees diminishes as NT increases. Therefore, it is vital to choose an appropriate NT. In this study, 800 was chosen as a suitable value of NT, which is of a similar order to the values of 300 or 500 set in previous studies (Nguyen et al., 2020; Fekri et al., 2021; Piao et al., 2021; Xiao et al., 2021; Yang et al., 2021).

With regard to BF, the OOB error first decreases and then increases as BF increases. For the D1, D2, D3, and D5 datasets, the most suitable BF value is about 0.6, while for the other datasets (D4, D6, and D7–D18), the most suitable BF value is about 0.9; both values are higher than the value suggested by GEE (0.5). This finding is consistent with the work of Kacic et al. (2021), who concluded that the classification accuracy is higher when BF is equal to 0.9. However, it differs markedly from the study of Svoboda et al. (2022), who set the bag fraction to 0.1 when using S2 data for land use change and forestry.

With regard to VPS, for the datasets using season median data, without or with auxiliary data (D2, D5, D8, D11, D14, and D17), the OOB error first decreases and then increases as VPS increases. For the other 12 datasets, the OOB error decreases as VPS increases. In addition, the suitable VPS values for the 18 datasets are related to the number of features in each dataset. In this study, 1.5 times the square root of the number of features was chosen as the appropriate value, which is higher than the value suggested by GEE (one times the square root of the number of features). This choice is consistent with the research of Kacic et al. (2021), who proposed that VPS is positively correlated with classification accuracy. However, it differs slightly from studies that used the VPS value suggested by GEE (Ghorbanian et al., 2021; Venter and Sydenham, 2021).
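The VPS rule discussed above can be written as a short helper, shown below as a minimal sketch. The function names `variables_per_split` and `rf_parameters`, the rounding to the nearest integer, and the feature count of 100 used in the example are assumptions for illustration. The dictionary keys follow the argument names of `ee.Classifier.smileRandomForest` in the Earth Engine API; whether this study used that exact constructor is not stated here, so it is treated as an assumption.

```python
import math
from typing import Dict, Union


def variables_per_split(n_features: int, factor: float = 1.5) -> int:
    """VPS as factor times the square root of the feature count.

    Rounding to the nearest integer and clamping to [1, n_features]
    are assumptions; the text only states the 1.5 * sqrt(n) rule.
    """
    if n_features < 1:
        raise ValueError("n_features must be a positive integer.")
    vps = round(factor * math.sqrt(n_features))
    return max(1, min(n_features, vps))


def rf_parameters(
    n_features: int,
    number_of_trees: int = 800,
    bag_fraction: float = 0.9,
    factor: float = 1.5,
) -> Dict[str, Union[int, float]]:
    """Collect the three RF parameters discussed in this section."""
    return {
        "numberOfTrees": number_of_trees,
        "variablesPerSplit": variables_per_split(n_features, factor),
        "bagFraction": bag_fraction,
    }


# Hypothetical dataset with 100 features (not a feature count from this study).
params = rf_parameters(100)
print(params)

# With the Earth Engine Python API installed and authenticated, the
# dictionary could be passed on as, for example:
#   classifier = ee.Classifier.smileRandomForest(**params)
```

Setting `factor=1.0` gives the square-root rule that the text attributes to GEE, which makes it easy to compare the two settings on the same dataset.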

As shown in Figure 8, the OA and Kappa index of the LC map in Hefei City using D12 as the dataset with RF on GEE are 93.17% and 0.9102, respectively. Meanwhile, in Table 6, the OA and Kappa index of the LC map in Nanjing City from D12 are 92.38% and 0.8914, respectively. The OA and Kappa index of the LC maps of the two cities are higher than 90% and 0.8, respectively, indicating that the appropriate data types, datasets, and RF parameters obtained in this study have a certain generalizability and can be applied to other cities.

4.6 Impact of missing Sentinel-2 data on LC classification

Since S2 data are susceptible to atmospheric influences, it is likely that some regions or some years have no S2 median data during the rainy season. To study the effect of missing S2 median data on LC classification, the D7 dataset (OA = 90.51%) was taken as a reference, and the S2 median data of one month at a time were artificially removed. We found that when the S2 median data of any single month are missing, the OA does not change much, and the change in OA is within ±1%. OA decreased the most (−0.83%) when the median data for June were missing, and OA improved the most (+0.94%) when the median data for July were missing. A possible reason is that there were more cloudy and rainy days in July and some cloud-covered pixels were not identified by the QA band, which caused the July data to play a negative role in the LC classification. Similarly, the D8 dataset (OA = 90.44%) was taken as a reference, and the S2 median data of one season at a time were artificially removed. We found that when the median data of Spring, Summer, Autumn, and Winter were missing, the OA changed to 90.62%, 88.71%, 89.98%, and 90.22%, respectively. This shows that the median data of Spring have a slight negative impact on LC classification. The median data of Summer are more important than those of the other seasons: when they are missing, the classification accuracy of LC is reduced by about 1.7%. At the same time, it should be noted that the quality of data in some Summer months may be low. Therefore, accurate identification of cloud-contaminated pixels is conducive to improving the OA of LC classification.
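The seasonal comparison above amounts to subtracting the reference OA from the OA obtained with one season removed. The short sketch below repeats that arithmetic with the OA values reported in this section; the helper name `oa_changes` and the variable names are hypothetical.

```python
from typing import Dict


def oa_changes(baseline: float, ablated: Dict[str, float]) -> Dict[str, float]:
    """Change in OA (percentage points) when each input is removed."""
    return {name: round(oa - baseline, 2) for name, oa in ablated.items()}


# OA of the D8 reference and of D8 with one season removed, as reported in the text.
baseline_oa = 90.44
oa_without_season = {
    "Spring": 90.62,
    "Summer": 88.71,
    "Autumn": 89.98,
    "Winter": 90.22,
}

changes = oa_changes(baseline_oa, oa_without_season)
# The season whose removal lowers OA the most contributes the most.
most_important = min(changes, key=changes.get)

for season, delta in changes.items():
    print(f"Without {season}: OA change = {delta:+.2f} percentage points")
print(f"Largest loss when removed: {most_important}")
```

A positive change means that removing the season raised OA, which is the sense in which the Spring medians are described as having a slight negative impact.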

4.7 Strengths and limitations of Google Earth Engine

GEE’s data archive contains more than 40 years of historical datasets that are updated and expanded daily, including Landsat, Sentinel, MODIS, land cover data, and so on. GEE also provides a variety of ML classification algorithms, such as CART, RF, SVM, naive Bayes, and decision tree. Since it is cloud-based, there is no need to download a large number of image files, and once the relevant parameters are set, city-scale LC classification results can be obtained within a few minutes. By comparison, when running analyses on platforms such as ERDAS and ENVI, it can take hours or even days to download the data and complete the processing. Therefore, the research method developed on GEE can be applied to update LC maps where LC changes occur quickly. However, GEE also has limitations in some cases, such as memory overflow, and other cloud platforms such as Amazon Web Services and Pixel Information Expert (PIE) Engine can be tried in the future.

5 Conclusion

In recent years, the advantages of GEE’s abundance of available data and fast processing of remote sensing data have facilitated the remarkable development of RF algorithm for LC classification. To this end, it is crucial to investigate suitable data types, datasets, and input parameters of RF for LC classification.

This study classified the LC of the model area (Hefei City) and evaluated the accuracy to determine the suitable data types, datasets, and RF input parameters. The results show that the OA, Kappa index, PA, and UA of all six datasets of S2 data (D7–D12) are higher than those of S1 data (D1–D6).

Meanwhile, the most suitable dataset for LC classification is D12, which combines S2 and auxiliary data. The OA and Kappa index of Hefei City reach 93.17% and 0.9102, respectively, when the values of the three primary RF parameters NT, VPS, and BF are 800, 22, and 0.9, respectively. Then, the suitable dataset and RF parameters obtained in the model area (Hefei City) were verified in a test area (Nanjing City). The results show that the OA and Kappa index of Nanjing City are 92.38% and 0.8914, respectively. The OA and Kappa index of LC in Nanjing City and Hefei City are higher than 90% and 0.85, respectively, and the OAs are also higher than the accuracies reported by the LC product data providers themselves: ESA WorldCover reported 74% and ESRI Land Cover reported 85% (Venter et al., 2022). This indicates that the suitable data type, the dataset, and the RF input parameters obtained from the model area (Hefei City) can be generalized to the test area (Nanjing City). In conclusion, the proposed method and the appropriate data types, datasets, and RF parameters obtained in this study have a certain generality and reference value, and can be used to update local LC information in other cities at low cost and high speed in the future. However, LC classification usually depends heavily on samples, data, and algorithms. In future work, we will study sample generation strategies based on data distribution characteristics, open source databases, and existing land cover products; data integration techniques based on multi-scale, multi-platform, and multi-modal data; and algorithm integration of various ML and DL algorithms.

Data availability statement

Publicly available datasets were analyzed in this study. These data can be found here: https://developers.google.com/earth-engine/datasets/catalog/NOAA_VIIRS_DNB_MONTHLY_V1_VCMCFG?hl=en, https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v100, https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S1_GRD, https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2_SR, https://livingatlas.arcgis.com/landcover/, https://developers.google.com/earth-engine/datasets/catalog/USGS_SRTMGL1_003?hl=en.

Author contributions

JS: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, original draft preparation, visualization, project administration, funding acquisition. SO: Conceptualization, methodology, validation, review and editing, supervision. All authors contributed to the article and approved the submitted version.

Funding

This research was funded by the Natural Science Research Project of the Anhui Education Department, grant number KJ2019A0707.

Acknowledgments

The facility support from Tongling University is gratefully acknowledged by the authors. Special thanks from the authors go to the reviewers for their valuable comments and suggestions that improved our manuscript from various perspectives.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feart.2023.1188093/full#supplementary-material

References

Amani, M., Ghorbanian, A., Ahmadi, S. A., Kakooei, M., Moghimi, A., Mirmazloumi, S. M., et al. (2020). Google Earth engine cloud computing platform for remote sensing big data applications: a comprehensive review. IEEE J. Sel. Top. Appl. Earth Observations Remote Sens. 13, 5326–5350. doi:10.1109/jstars.2020.3021052
