Remote sensing and integration of machine learning algorithms for above-ground biomass estimation in Larix principis-rupprechtii Mayr plantations: a case study using Sentinel-2 and Landsat-9 data in northern China

Ali, Jamshid; Haoran, Wang; Mehmood, Kaleem; Hussain, Wakeel; Iftikhar, Farhan; Shahzad, Fahad; Hussain, Khadim; Qun, Yin; Zhongkui, Jia

doi:10.3389/fenvs.2025.1577298

ORIGINAL RESEARCH article

Front. Environ. Sci., 02 April 2025

Sec. Environmental Informatics and Remote Sensing

Volume 13 - 2025 | https://doi.org/10.3389/fenvs.2025.1577298


Wang Haoran¹

Farhan Iftikhar⁴

Yin Qun¹*

¹State Key Laboratory of Efficient Production of Forest Resources, Engineering Technology Research Center of Pinus tabuliformis of National Forestry and Grassland Administration, Beijing Forestry University, Beijing, China
²Key Laboratory for Silviculture and Conservation of Ministry of Education, Beijing Forestry University, Beijing, China
³School of Geophysics and Geomatics, China University of Geosciences, Wuhan, China
⁴School of Soil and Water Conservation, Beijing Forestry University, Beijing, China
⁵Mapping and 3S Technology Center, Beijing Forestry University, Beijing, China
⁶State Forestry and Grassland Administration Key Laboratory of Forest Resources and Environmental Management, Beijing Forestry University, Beijing, China

Estimating above-ground biomass (AGB) is important for ecological assessment, carbon stock evaluation, and forest management. This study assesses the performance of three machine learning algorithms, XGBoost, SVM, and RF, using data from the Sentinel-2 and Landsat-9 satellites, and examines the influence of spectral bands and vegetation indices on the accuracy of AGB estimates. The results indicate that Sentinel-2 data were more effective than Landsat-9 data, mainly because their higher spatial and spectral resolution enabled the models to capture vegetation gradients and structural attributes more accurately. The XGBoost model performed best, with an R² of 0.82 and RMSE of 0.73 Mg/ha for Sentinel-2 and an R² of 0.80 and RMSE of 0.71 Mg/ha for Landsat-9. SVM also showed substantial accuracy, with an R² of 0.79 and RMSE of 0.73 Mg/ha for Sentinel-2 and an R² of 0.76 and RMSE of 0.80 Mg/ha for Landsat-9. Random forest achieved an R² of 0.74 and RMSE of 0.93 Mg/ha for Sentinel-2 and an R² of 0.72 and RMSE of 0.88 Mg/ha for Landsat-9. Variable importance analysis showed that vegetation indices and spectral bands were highly important in predicting AGB. Consistent with their established use in biomass research, these predictors emerged as highly significant across models and datasets. This study demonstrates the potential of integrating machine learning with remote sensing data to achieve accurate and efficient biomass assessment.

1 Introduction

Forests are crucial for sustaining biodiversity because they provide habitats for a wide range of species, and they are essential for maintaining ecological balance (Ali et al., 2023; Stephenson and Damerell, 2022). An estimate of forest biomass therefore indicates a given ecosystem’s ability to capture carbon and maintain a stable carbon stock (Jafri et al., 2022; Wang et al., 2022). Accurately assessing forest biomass is critical for analyzing the global carbon cycle and addressing numerous concerns, including climate change, forest strength, and service regulation (Hu et al., 2022; Titus et al., 2021).

Conventional field measurements or remote sensing techniques often evaluate AGB in forests (Santoro et al., 2021). Satellite imagery is better than traditional forest inventories and surveys using LiDAR technology because it can cover larger areas at a lower cost and in less time (López Serrano et al., 2022). Integrating reference values within satellite data is a significant process in estimating aboveground biomass or forest inventories from airborne LiDAR more precisely (Campbell et al., 2021; Labrière et al., 2023).

The next step involves employing spatial prediction algorithms to generate precise geographic proportions of AGB (Das et al., 2024; Sun et al., 2023). Researchers have made significant progress in mapping forest AGB by combining modeling methods with better predictor applications from satellite data (Hojo et al., 2023). Past research has shown that a combination of diverse remote sensing techniques has successfully quantified and monitored biomass from forests at a regional level (Coops et al., 2023; Zhang and Shao, 2021). Consequently, current research indicates that diverse remote sensing methods, encompassing both passive and active sensors, can estimate AGB in a designated area (Ma et al., 2024). Researchers frequently utilize optical remote sensing imagery, distinguished by spatial, spectral, and temporal resolutions, to assess AGB at diverse scales (Sedano et al., 2021). Researchers primarily employ moderate and coarse-resolution data from the Moderate Resolution Imaging Spectroradiometer (MODIS) (Shahzad et al., 2025; Wongsai et al., 2020).

For small-scale AGB assessment, medium-resolution data from the Sentinel-2 and Landsat satellites are needed, while high-resolution commercial satellite data from IKONOS, QuickBird, and WorldView-2 can provide a good estimate of AGB at the forest stand level (Fu et al., 2022; Lin et al., 2022). Estimating AGB at the regional level with moderate spatial resolution requires microwave radar remote sensing data, including synthetic aperture radar (SAR), interferometric SAR (InSAR), and polarimetric interferometric SAR (PolInSAR) data (Godinho Cassol et al., 2021; Ramachandran et al., 2023). The first and essential step in constructing accurate models for predicting AGB is selecting the correct algorithm (Araza et al., 2022; Li et al., 2021).

Previous research has frequently used traditional statistical regression for aboveground biomass (AGB) estimation because of its simplicity and ease of computation (Luo et al., 2024). This method employs a regression model that combines test data and remote-sensing features (Han et al., 2019; Hussain et al., 2024). However, it does not fully capture the relationship between forest AGB and RS data (Zhang et al., 2022). Standard methodologies for predicting and mapping AGB include interpolation techniques, non-parametric models, and kriging. Researchers have applied geostatistics to AGB data to examine variations and develop sample designs for satellite and field-based forest monitoring (Li et al., 2020b; Su et al., 2020). It is challenging to map continuous forest characteristics in large, steep areas. Important site factors such as soil type, texture, nutrient status, solar flux density, moisture regime, and water holding capacity affect key tree attributes within different stand types, including diameter at breast height (DBH), height, and volume. During inventory establishment, measurements of forest trees and AGB showed spatial dependency within small areas of stand types (Carmenta et al., 2020; Octavia et al., 2022).

However, this spatial autocorrelation varies depending on the community’s topographical conditions, residential zones, and the locations of commercial logging activities (Gibson, 2018; Shahzad et al., 2024). For AGB estimation, several studies have highlighted the value of integrating remote sensing technology with geostatistical and machine learning methods (Musekiwa et al., 2022; Prăvălie et al., 2023). This combination is especially advantageous for predictions over extensive regions characterized by diverse bioclimatic conditions and irregular terrain (Masereti Makori et al., 2024).

Remote sensing-based AGB estimation uses machine learning techniques, including decision trees, random forests, and support vector regression. These techniques improve the accuracy of biomass predictions, particularly where the relationships involved are nonlinear. The literature published in the last decade reveals that decision tree-based algorithms like Random Forest (RF) and Gradient Boosting (GB) yield high accuracy in biomass estimation modeling (Cameron et al., 2022; Ghasemloo et al., 2022). Moreover, machine learning techniques have numerous adjustable hyperparameters that significantly influence the models, and the adjustment of these settings has at times been overlooked. Previous studies have shown that the tuning procedure significantly influences model performance, with the sensitivity of parameters varying between stochastic gradient boosting and random forests (Freeman et al., 2015; Li et al., 2020a; Prakash et al., 2022).

Research gaps persist in integrating RS data with machine learning models (XGBoost, SVM, and Random Forest) for biomass prediction, despite the use of RS data, particularly Sentinel-2 and Landsat-9, in biomass estimation. While earlier literature has established the advantages of higher spatial resolution imagery such as Sentinel-2 over lower spatial resolution imagery such as Landsat-9, there is still a lack of comparative research on how these datasets perform under different environmental conditions, especially in structurally homogeneous forests such as the 50-year-old Larix principis-rupprechtii plantations in northern China. Some studies have established the significance of vegetation indices like NDVI, TNDVI, and NDI45 as biomass predictors. However, they have not comprehensively studied their contribution to machine learning algorithms such as XGBoost, SVM, and Random Forest. In addition, indices like GNDVI and NDI45 have not been comprehensively investigated for biomass modeling, particularly in temperate forests.

The present study is expected to enhance the precision and generality of the AGB estimation in L. principis-rupprechtii Mayr plantations at the Saihanba Mechanical Forest Farm in Hebei Province, northern China. This is accomplished by integrating machine learning techniques with remote sensing data. This research aims to evaluate the performance of three popular machine learning algorithms, namely XGBoost, Support Vector Machine (SVM), and Random Forest (RF), to estimate AGB using Sentinel-2 and Landsat-9 data.

2 Materials and methods

2.1 Location and description of the study area

The study area is the Saihanba Forest Farm in Hebei Province, northern China (41°22′–42°58′N, 116°53′–118°3′E). The research site is in the warm temperate continental monsoon climate zone. The altitude of the area ranges from 1,010 to 1,940 m. The mean annual temperature is −1.2°C, and temperatures range from −43.3°C to 33.4°C over the year. The annual rainfall is 452.2 mm, and the annual evaporation is 1,388 mm.

The typical soil types in the region include aeolian sandy soil, meadow soil, brown soil, and grey forest soil. The total operating area is 94,000 ha, of which the forest area is 73,333 ha (57,333 ha of planted forest and 16,000 ha of natural forest); the forest coverage rate is 80%, and the total forest volume is 5.025 million m³.

The most important vegetation zones include grassland, meadow, conifer and broad-leaved mixed forest, broad-leaved forest, and shrub forest; the forest density is 75.5%. The main trees are L. principis-rupprechtii, Picea asperata Mast., and Betula platyphylla Suk., and the main shrubs are Rhododendron micranthum Turcz., Syringa oblata Lindl. var. alba Rehder., and Sambucus williamsii Hance. The main herbaceous plants are Galium verum L. and Menyanthes trifoliata L. (Tao et al., 2023; Xu et al., 2022) (Figure 1).


Figure 1. Location of the study sites in China, Chengde City, Hebei Province, Saihanba Mechanical Forest Farm.

2.2 Forest inventory and biomass estimation

We conducted the forest inventory in August 2023. Sampling locations were chosen carefully, and non-forest regions were excluded. In total, 45 sampling plots were established in the 50-year-old L. principis-rupprechtii Mayr plantation. We recorded the coordinates of each tree and plot using Real-Time Kinematic (RTK) positioning. For each plot, we recorded elevation, aspect, slope, tree height (in meters), stem density (trees per hectare), and DBH, measured 1.3 m above the ground using a calibrated diameter tape and caliper. Stem density was determined by counting the number of trees within each plot. Tree heights were measured using a relascope (Almeida et al., 2021). Soil samples were obtained from the upper 20 cm layer using a soil auger to determine soil organic carbon (SOC) (Liu N. et al., 2021). These samples were placed in plastic bags, air-dried, and then taken to a laboratory for further tests.

To estimate the distribution of total biomass, we applied allometric equations to all tree components (stem, branches, leaves, and roots). This allows above- and belowground biomass to be calculated separately (Zhao et al., 2016). To estimate aboveground carbon (AGC) and belowground carbon (BGC), the AGB and belowground biomass (BGB) values were multiplied by 0.5, assuming that biomass has a 50% carbon content (Aye et al., 2022; Eshetu and Hailu, 2020) (Figure 2; Supplementary Figure S3).
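The biomass-to-carbon conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and the plot values are hypothetical, and only the 0.5 carbon fraction comes from the text.

```python
CARBON_FRACTION = 0.5  # 50% carbon content, as assumed in the text


def biomass_to_carbon(biomass_mg_ha, carbon_fraction=CARBON_FRACTION):
    """Convert biomass (Mg/ha) to carbon stock (Mg C/ha)."""
    if biomass_mg_ha < 0:
        raise ValueError("biomass must be non-negative")
    return biomass_mg_ha * carbon_fraction


# Hypothetical plot values for illustration (not taken from the study's data)
agb = 8.0  # above-ground biomass, Mg/ha
bgb = 2.0  # below-ground biomass, Mg/ha

agc = biomass_to_carbon(agb)  # above-ground carbon
bgc = biomass_to_carbon(bgb)  # below-ground carbon
total_carbon = agc + bgc
print(f"AGC = {agc} Mg C/ha, BGC = {bgc} Mg C/ha, total = {total_carbon} Mg C/ha")
```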

Figure 2. Observed above-ground biomass (AGB) (Mg/ha) by plot during the forest inventory.

2.3 Pre-processing of Sentinel-2 and Landsat 9 satellite data and derivation of variables

The European Space Agency’s Sentinel-2 provides medium-resolution multispectral imagery for Earth observation. We acquired and pre-processed Sentinel-2A and Landsat 9 images for the study area using the Google Earth Engine (GEE) platform (https://earthengine.google.com/). The Sentinel-2A data were ortho-corrected bottom-of-atmosphere reflectance images, with Bands 2, 3, 4, and 8 selected for analysis, while Bands 1, 5, 6, and 9 were excluded because they are mainly relevant to atmospheric correction and hydrological applications. Preprocessing included filtering out images with cloud cover exceeding 5% and applying the Sen2Cor processor to Sentinel-2 Level-1C products. Cloud-covered pixels were identified, masked, and corrected accordingly. For Landsat 9, we selected four bands (Bands 2, 3, 4, and 5) considered important for reducing errors in forest AGB estimation and for enabling a meaningful comparison. Only images with less than 5% cloud cover were retained. We applied the CFMask algorithm, integrated into the Landsat Surface Reflectance product, to mask cloud pixels, replacing them with maximum value composites for data consistency. The datasets were resampled to consistent grids (10 m for Sentinel-2 and 30 m for Landsat 9) using bilinear interpolation and aligned to the WGS84 coordinate system. Processing of Landsat 9 also included orthorectification, georectification, and registration to ensure high-quality data. The final preprocessed data were split into training (70%) and validation (30%) samples (Li et al., 2024) (Table 1; Figure 3).
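A 70/30 split of plot-level samples might look like the following sketch. The source does not say how the split was randomized or how the fractional plot was handled, so the seed, the truncation of 0.7 × 45 to 31 training plots, and the function name are all assumptions made for illustration.

```python
import random


def split_plots(plot_ids, train_fraction=0.7, seed=42):
    """Randomly split plot identifiers into training and validation sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(plot_ids)
    rng.shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))  # truncation is an assumption
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# 45 field plots, numbered 1..45 for illustration
train_ids, valid_ids = split_plots(range(1, 46))
print(len(train_ids), "training plots;", len(valid_ids), "validation plots")
```

With only 45 plots the validation set is small, so repeated splits or cross-validation would be a natural alternative; the text does not indicate whether either was used.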

Table 1. Spectral band characteristics of Sentinel-2A and Landsat 9 sensors.


Figure 3. A flowchart for evaluating Vegetation Indices and modeling algorithms for mapping forest Above Ground Biomass using Sentinel 2 and Landsat 9 data.

2.4 The extraction of remote sensor parameters from field plots

Diverse methodological strategies are utilized to obtain field remote-sensing data (Aslam et al., 2024). In this study, the coordinates of the southwestern corner of each field plot were used to define the center point, serving as the geographic anchor for plot-level remote sensing data extraction. Satellite imagery from Sentinel-2 and Landsat 9 was resampled to match the spatial extent of the field plots as closely as possible. Despite these efforts, minor spatial mismatches remained between the pixel grid and plot boundaries due to geometric distortion, sensor resolution, and terrain variability.

To reduce the effect of such spatial discrepancies, circular buffer zones were applied: a 10-m radius for Sentinel-2 (10 m resolution) and a 30-m radius for Landsat 9 (30 m resolution) around each plot center. These buffer radii were selected to balance the minimization of geolocation error with the need to avoid spectral contamination from adjacent land covers with contrasting canopy structures. The average pixel value within each buffer was extracted to represent the spectral signal associated with each plot’s biophysical parameters (Turner et al., 2015). While the buffer-based averaging approach cannot eliminate all residual spatial discrepancies, it substantially reduces the influence of pixel-level misalignment by integrating spectral information across a spatial area representative of the plot. This methodological choice reflects a practical balance between spatial accuracy and ecological representativeness and is commonly adopted in similar studies that integrate field and remote sensing data. We acknowledge that some degree of residual uncertainty may persist, but this was considered during model interpretation and is not expected to significantly bias the results at the scale of analysis.
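The buffer-based averaging can be illustrated with a small, self-contained sketch. It assumes a local coordinate frame whose origin is the raster corner and selects pixels by whether their centres fall inside the buffer; the actual extraction rule used in the study (for example, area-weighted versus centre-based) is not stated, and the grid values below are hypothetical.

```python
import math


def buffer_mean(raster, pixel_size, center_x, center_y, radius):
    """Mean of the pixels whose centres lie within `radius` of the plot centre.

    raster[row][col] holds the pixel value. The centre of pixel (row, col) is
    taken to be x = (col + 0.5) * pixel_size, y = (row + 0.5) * pixel_size in a
    local frame (an assumption made for this sketch).
    """
    selected = []
    for row, line in enumerate(raster):
        for col, value in enumerate(line):
            px = (col + 0.5) * pixel_size
            py = (row + 0.5) * pixel_size
            if math.hypot(px - center_x, py - center_y) <= radius:
                selected.append(value)
    if not selected:
        raise ValueError("no pixel centre falls inside the buffer")
    return sum(selected) / len(selected)


# Hypothetical 4 x 4 NDVI grid at 10 m resolution (Sentinel-2-like)
ndvi_grid = [
    [0.60, 0.62, 0.64, 0.66],
    [0.61, 0.70, 0.72, 0.67],
    [0.63, 0.74, 0.76, 0.68],
    [0.65, 0.66, 0.69, 0.70],
]

# Plot centre at (20 m, 20 m) with the 10 m buffer radius used for Sentinel-2
plot_mean = buffer_mean(ndvi_grid, pixel_size=10.0,
                        center_x=20.0, center_y=20.0, radius=10.0)
print(round(plot_mean, 3))
```

In this toy grid the buffer captures the four central pixels, so the plot value is their average rather than the value of any single pixel.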

2.5 Techniques of modeling and evaluation

The machine learning techniques were chosen because they can handle the complexity of forest biomass estimation, where relationships among variables are often non-linear and there are many predictors and drivers. Their ease of use, limited need for normalization, and insensitivity to outliers make them well suited to this task. To evaluate the multisensor indices against field-measured AGB, we used Pearson’s product-moment correlation in a paired analysis (Chen et al., 2018).

We screened the dataset for multicollinearity, using the variance inflation factor (VIF) to identify and remove redundant variables (Mehmood et al., 2024a; Thompson et al., 2017). Predictor variables with a correlation coefficient magnitude of 0.8 or higher and VIF values of 10 or more were systematically removed before regression analysis (Kristensen et al., 2015; Pérez-Girón et al., 2020). All analyses were performed in the R statistical computing environment (Table 2).
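The multicollinearity screen can be sketched as follows. The study reports using R; this Python version is only an illustration under stated assumptions: the predictor values are hypothetical, and VIF is computed as the diagonal of the inverse correlation matrix, which is equivalent to 1 / (1 − R²) from regressing each predictor on the others. The 0.8 and 10 thresholds are those given in the text.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfectly collinear predictors")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(predictors):
    """VIF per predictor, taken from the diagonal of the inverse correlation matrix."""
    names = list(predictors)
    corr = [[pearson(predictors[a], predictors[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


# Hypothetical plot-level predictor values (not the study's data)
ndvi_values = [0.60, 0.65, 0.70, 0.72, 0.75, 0.80]
tndvi_values = [math.sqrt(v + 0.5) for v in ndvi_values]  # nearly collinear with NDVI
green_values = [0.08, 0.05, 0.09, 0.04, 0.07, 0.06]
predictors = {"NDVI": ndvi_values, "TNDVI": tndvi_values, "Green": green_values}

names = list(predictors)
high_pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
              if abs(pearson(predictors[a], predictors[b])) >= 0.8]
vifs = vif(predictors)
flagged = [name for name, value in vifs.items() if value >= 10]
print("highly correlated pairs:", high_pairs)
print("VIF >= 10:", flagged)
```

In practice one member of each flagged pair would be dropped and the VIFs recomputed, since removing a single variable changes the VIF of those that remain.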


Table 2. Overview of machine learning models: Key features, parameters, and references.

2.6 Enumeration of the tested algorithms

XGBoost, SVR, and RF were selected because these algorithms efficiently disentangle intricate and non-linear relationships in natural systems such as forest ecosystems (Zennaro et al., 2021). They can identify the most important predictors and are well equipped to handle high-dimensional datasets. They are robust, support ensemble analysis, and employ state-of-the-art methods, which aligns with our objective of accurate AGB estimation. Evaluating several algorithms in detail enhances the scientific credibility of the results and helps ensure that the most appropriate approach is selected for estimating AGB in temperate forests (Oehmcke et al., 2024; Pham et al., 2023).

2.7 Machine learning methods

XGBoost is an ensemble learning model that combines gradient boosting with regularization techniques to improve predictive performance. It iteratively adds new models to correct the errors of previous ones while minimizing a user-specified loss function approximated by a second-order Taylor expansion. Researchers have found XGBoost highly effective for complex, multivariable predictive modeling because it handles missing values and resists overfitting (Mehmood et al., 2024b; Zhang and Jánošík, 2024).

SVR transforms the input data into a higher-dimensional space through a kernel function and fits a hyperplane that minimizes prediction error. This approach is useful for identifying non-linear relationships in the data. SVR can be applied to a range of regression problems because it has low computational complexity and reasonable empirical risk (Hussain et al., 2025; Lee et al., 2020). Random Forest is an ensemble learning technique that combines multiple decision trees through bagging. Each tree is grown on a bootstrap sample, and the remaining data are used to compute the out-of-bag (OOB) error. At every node, a random subset of explanatory variables is considered to find the best split. RF is widely applied in environmental modeling and habitat suitability evaluation because it suits both classification and regression problems (Anees et al., 2024; Teng et al., 2023).
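The bagging step behind Random Forest can be made concrete with a short sketch of bootstrap sampling and the resulting out-of-bag set. It illustrates only the sampling mechanism, not tree fitting; the number of trees (500) and the random seed are hypothetical, while the 45 plots come from the text.

```python
import random


def bootstrap_with_oob(n_samples, rng):
    """Draw one bootstrap sample (with replacement) and return it with its OOB indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


rng = random.Random(0)
n_plots = 45    # number of field plots in the study
n_trees = 500   # hypothetical forest size

oob_fractions = []
for _ in range(n_trees):
    in_bag, oob = bootstrap_with_oob(n_plots, rng)
    oob_fractions.append(len(oob) / n_plots)

mean_oob_fraction = sum(oob_fractions) / n_trees
expected = (1 - 1 / n_plots) ** n_plots  # probability a plot is never drawn
print(f"mean OOB fraction: {mean_oob_fraction:.3f} (expected about {expected:.3f})")
```

Roughly a third of the plots are left out of each tree, which is why the OOB error can serve as an internal validation estimate even with a small sample.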

2.8 Optimizing model parameters

Key hyperparameters tuned for the XGBoost model include “nrounds” (the number of boosting iterations), “max depth,” “min child weight,” “gamma,” and “subsample.” We used a grid search to identify the optimal parameter values and improve the model’s performance. The variable importance measures for XGBoost are “Gain” and “Frequency” (Asselman et al., 2023; Mehmood et al., 2025).

SVR was tuned by choosing the kernel function and the “C” parameter, which controls the trade-off between the width of the margin and the tolerance of errors beyond it. This flexibility allows SVR to handle complex non-linear relationships effectively (Mahmood et al., 2023; Shams et al., 2024). RF parameters include “ntree,” the number of trees, and “mtry,” the number of features randomly sampled as candidates at each split. Grid search optimized the values of these parameters. We measured variable importance using %IncMSE and IncNodePurity (Bouslihim et al., 2024; Li Yudong et al., 2020) (Table 3).
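An exhaustive grid search of the kind described here can be sketched generically. The candidate values and the scoring function below are hypothetical stand-ins: in a real workflow the scoring function would fit the model with the given settings and return a validation error, and the grid actually used in the study is the one reported in its Table 3.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for the two RF parameters named in the text
rf_grid = {"ntree": [100, 300, 500], "mtry": [2, 3, 4]}


def toy_validation_rmse(params):
    # Stand-in for "fit the model with these settings and return validation RMSE".
    return abs(params["ntree"] - 300) / 1000 + abs(params["mtry"] - 3) / 10


best_params, best_score = grid_search(rf_grid, toy_validation_rmse)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (nine fits here), which is why grids for models with many hyperparameters, such as XGBoost, are usually kept coarse.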


Table 3. Optimizing algorithm hyperparameters for peak performance.

2.9 The performance of the models

The performance of the XGBoost, SVR, and RF algorithms was then compared for each variable group. The evaluations used the correlation coefficient r, RMSE, RRMSE, mean absolute error (MAE), RMAE, and mean error (ME), calculated with Equations 1–5 (Bouslihim et al., 2024; Liu H. et al., 2021). The approach with the highest accuracy was used for predictive mapping of the AGB distribution for each variable group.

In Equations 1–5, $y_{i}$ represents the observed aboveground biomass (AGB) values, with n = 45; $\hat{y}_{i}$ represents the estimated AGB values derived from each model; and $\bar{y}$ denotes the mean of the observed AGB values. The objective is to minimize the root mean squared error (RMSE), relative mean absolute error (RMAE), and mean error (ME) while maximizing the correlation coefficient r to achieve more accurate predictions.

$$RMSE = \sqrt{\sum_{i=1}^{n} \frac{\left(y_{i} - \hat{y}_{i}\right)^{2}}{n}} \tag{1}$$

$$RRMSE = \left(\frac{RMSE}{\bar{y}}\right) \times 100 \tag{2}$$

$$MAE = \sum_{i=1}^{n} \frac{\left|y_{i} - \hat{y}_{i}\right|}{n} \tag{3}$$

$$RMAE = \left(\frac{MAE}{\bar{y}}\right) \times 100 \tag{4}$$

$$ME = \sum_{i=1}^{n} \frac{\left(y_{i} - \hat{y}_{i}\right)}{n} \tag{5}$$
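Equations 1–5 translate directly into code. The sketch below is an illustrative Python version (the study reports using R); the function name and the three observed/predicted values are hypothetical and serve only to exercise the formulas.

```python
import math


def agb_metrics(observed, predicted):
    """Compute RMSE, RRMSE (%), MAE, RMAE (%), and ME as in Equations 1-5."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal length")
    mean_obs = sum(observed) / n
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in residuals) / n)   # Eq. 1
    mae = sum(abs(e) for e in residuals) / n              # Eq. 3
    me = sum(residuals) / n                               # Eq. 5
    return {
        "RMSE": rmse,
        "RRMSE": rmse / mean_obs * 100,                   # Eq. 2
        "MAE": mae,
        "RMAE": mae / mean_obs * 100,                     # Eq. 4
        "ME": me,
    }


# Hypothetical AGB values in Mg/ha (not the study's data)
observed = [6.0, 8.0, 10.0]
predicted = [7.0, 8.0, 9.0]
metrics = agb_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that ME can be zero even when individual errors are large, because positive and negative residuals cancel; this is why it is read as a bias indicator alongside RMSE and MAE rather than on its own.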

3 Results

3.1 Field observations and descriptive statistics

Table 4 presents a comprehensive statistical overview of a 50-year-old Larix principis-rupprechtii forest stand, providing insights into its structural attributes and biomass dynamics. Despite the homogeneity in species composition and age, the forest exhibits notable variability in several key parameters, reflecting the inherent complexity of natural systems. The mean DBH of 25.43 cm, accompanied by a standard deviation of 1.31 cm, suggests a moderate variation in tree size within the stand, typically ranging from 22.61 cm to 29.05 cm. Similarly, tree height, with a mean of 19.42 m and a relatively low standard deviation of 0.89 m, points to a uniform vertical structure, with tree heights distributed between 17.1 m and 21.64 m.


Table 4. Forest stand descriptive statistics.

The stand density averaged 677 trees per hectare with a large standard deviation of 146 trees per hectare, highlighting the variability in tree spacing and distribution, which ranged from 450 to 1000 trees per hectare. The stand’s average AGB is 7.57 Mg/ha, with a standard deviation of 1.44 Mg/ha and values ranging from 5.07 to 10.29 Mg/ha. This variation in AGB reflects differences among individual trees in growth potential and carbon sequestration capacity within the stand (Table 4).
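One way to compare the variability of attributes measured in different units is the coefficient of variation (CV = standard deviation / mean × 100). The sketch below applies it to the means and standard deviations quoted in the text; the CV itself is not reported in the source and is computed here only for illustration.

```python
def coefficient_of_variation(mean, sd):
    """Coefficient of variation in percent."""
    return sd / mean * 100


# (mean, standard deviation) pairs as quoted in the text for Table 4
stand_stats = {
    "DBH (cm)": (25.43, 1.31),
    "Height (m)": (19.42, 0.89),
    "Density (trees/ha)": (677, 146),
    "AGB (Mg/ha)": (7.57, 1.44),
}

cv = {name: round(coefficient_of_variation(mean, sd), 1)
      for name, (mean, sd) in stand_stats.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value}%")
```

On this measure, DBH and height vary by about 5% around their means, whereas stand density and AGB vary by roughly 20%, which is consistent with the description of a uniform vertical structure but variable spacing and biomass.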

3.2 Correlation analysis of Sentinel-2 and Landsat-9 data

The correlation analysis of Sentinel-2 and Landsat-9 data provided key findings on the relationship between the remote sensing predictors and AGB. For Sentinel-2 (Figure 4A), correlation coefficients varied from 0.32 to 0.58. Vegetation indices emerged as the strongest predictors of AGB, particularly TNDVI and NDI45, both with correlations of 0.58.

Other indices, such as NDVI (0.56) and GNDVI (0.52), also demonstrated strong correlations, reflecting their effectiveness in capturing vegetation density and photosynthetic activity, which are key factors influencing biomass accumulation. The NIR band (0.54) showed the strongest correlation among individual spectral bands. Visible bands, such as Red (0.44) and Green (0.37), exhibited moderate correlations, with indices like SAVI (0.39) and MSAVI2 (0.42) showing secondary relevance. The WDVI (0.32) exhibited the weakest correlation.

For Landsat-9 (Figure 4B), the correlation coefficients varied between 0.30 and 0.56, with TNDVI (0.56), NDI45 (0.55), and NDVI (0.53) emerging as the top predictors of AGB. Bands B5 (0.50) and B4 (0.42) showed moderate correlations. Visible bands such as B2 and B3, which are often influenced by soil background and atmospheric conditions, exhibited weaker correlations (0.30 and 0.35, respectively). Some vegetation indices, such as PSSRa (0.43) and MSAVI2 (0.38), also showed moderate correlations, suggesting their complementary role in enhancing model performance.

The WDVI (0.30) displayed the weakest correlation, highlighting its limited predictive power for AGB in the context of Landsat-9 data. When comparing the two datasets, Sentinel-2 consistently outperformed Landsat-9 in correlation strength across all variables. Sentinel-2’s superior spatial and spectral resolution allows it to capture vegetation characteristics such as canopy structure, leaf area index, and photosynthetic activity more precisely. The stronger correlations of Sentinel-2-derived indices such as TNDVI and NDVI suggest that Sentinel-2 can model AGB more accurately (Figure 4).
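For readers unfamiliar with the indices compared above, the sketch below computes four of them from band reflectances. The source does not print its formulas, so these are the commonly used definitions (NDVI, GNDVI, TNDVI as the square root of NDVI + 0.5, and SAVI with a soil adjustment factor L = 0.5), and the reflectance values are hypothetical.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def gndvi(nir, green):
    """Green NDVI: the green band replaces the red band."""
    return (nir - green) / (nir + green)


def tndvi(nir, red):
    """Transformed NDVI: square root of (NDVI + 0.5)."""
    return math.sqrt(ndvi(nir, red) + 0.5)


def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index; soil_factor = 0.5 is a common default."""
    return (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)


# Hypothetical surface reflectances for a dense conifer canopy
green, red, nir = 0.05, 0.04, 0.36

ndvi_value = ndvi(nir, red)
gndvi_value = gndvi(nir, green)
tndvi_value = tndvi(nir, red)
savi_value = savi(nir, red)
print(f"NDVI={ndvi_value:.3f} GNDVI={gndvi_value:.3f} "
      f"TNDVI={tndvi_value:.3f} SAVI={savi_value:.3f}")
```

Because TNDVI is a monotonic transform of NDVI, the two carry nearly the same information, which helps explain why their correlations with AGB are so similar and why both rarely survive a multicollinearity screen together.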

Figure 4. Correlations between the predictors and field-measured biomass for Sentinel-2 (A) and Landsat 9 (B).

3.3 Variable importance analysis for AGB estimation

The variable importance analysis of predictors from Sentinel-2 and Landsat 9 images shows how they contribute to the performance of machine learning models such as XGBoost, SVM, and Random Forest (RF). The analysis focuses on the effects of the spectral bands and vegetation indices on the models’ biomass and carbon stock assessment performance. For Sentinel-2, GNDVI and GEMI are the most influential predictors across all models, particularly within XGBoost. GNDVI stands out due to its high sensitivity to plant properties, capturing subtle changes in vegetation vigor. It is a key variable for modeling above-ground biomass (AGB) and carbon stock because it can distinguish between important biophysical features like chlorophyll content and leaf area. Similarly, for Landsat 9, the NDI45 index is the most important predictor, particularly in XGBoost. NDI45 is informative about plant health; it is sensitive to changes in leaf area and chlorophyll content, which are important biophysical features for estimating biomass and carbon stock. Both the Sentinel-2 and Landsat 9 datasets reveal that indices such as SAVI, TNDVI, WDVI, PSSRa, and IPVI contribute moderately to the models’ performance. These indices provide valuable supplementary information on vegetation structure, density, and canopy properties, further refining biomass estimates.

In particular, WDVI and PSSRa in the Sentinel-2 and Landsat 9 datasets make notable contributions by capturing information related to vegetation moisture content and plant stress, which are important for biomass modeling under varying environmental conditions. To further clarify their relative importance, we quantified and compared the normalized variable importance scores of vegetation indices across the XGBoost, SVM, and RF models. This comparison revealed that although GNDVI consistently ranked highest in both datasets (e.g., 0.183 in Sentinel-2 and 0.064 in Landsat-9 using XGBoost), indices such as WDVI (Sentinel-2: 0.026 in XGBoost; Landsat-9: 0.045) and PSSRa (Sentinel-2: 0.030; Landsat-9: 0.045) demonstrated moderate yet model-consistent importance across all approaches. The systematic comparison also showed that WDVI and PSSRa ranked higher in RF and SVM models relative to XGBoost, indicating that their influence varies by algorithm but remains non-negligible. These findings are summarized in Supplementary Tables S1, S2, where the importance values of each VI across models are presented. This numerical evidence strengthens our interpretation of the ecological relevance of these indices in AGB estimation (Figure 5).

Figure 5. Variable importance derived from XGBoost, SVM, and RF for Sentinel-2 (A) and Landsat 9 (B).

3.4 Performance evaluation using Sentinel-2 and Landsat-9 data

For this study, we used data from Sentinel-2 and Landsat-9 to compare how well three machine learning models—XGBoost, Support Vector Machine (SVM), and Random Forest (RF)—estimated AGB. Even though Landsat-9 has lower spatial and spectral resolution than Sentinel-2, both datasets helped estimate biomass, but Sentinel-2 consistently did better than Landsat-9.

XGBoost consistently delivered the best performance across both datasets. With Sentinel-2, it achieved a coefficient of determination (R²) of 0.82, an RMSE of 0.73 Mg/ha, and an MAE of 0.60 Mg/ha, while for Landsat-9 it achieved an R² of 0.80, an RMSE of 0.71 Mg/ha, and an MAE of 0.58 Mg/ha. Sentinel-2’s high spatial and spectral resolution made it easier for the model to pick up on small changes in canopy reflectance, vegetation structure, and biomass-related parameters, which is reflected in its higher R²; the RMSE and MAE for Landsat-9 were nevertheless marginally lower. XGBoost thus maintained robust performance with Landsat-9, demonstrating adaptability across datasets with varying resolutions. The SVM model also exhibited strong performance, with Sentinel-2 achieving an R² of 0.79, RMSE of 0.73 Mg/ha, and MAE of 0.63 Mg/ha, while Landsat-9 produced an R² of 0.76, RMSE of 0.80 Mg/ha, and MAE of 0.66 Mg/ha.

SVM’s capacity to model non-linear relationships was evident, especially with appropriate kernel selection, making it a viable alternative for biomass estimation. In contrast, Random Forest (RF) showed the weakest performance, with Sentinel-2 yielding an R² of 0.74, RMSE of 0.93 Mg/ha, and MAE of 0.76 Mg/ha, and Landsat-9 producing an R² of 0.72, RMSE of 0.88 Mg/ha, and MAE of 0.74 Mg/ha. RF’s performance lagged behind the other models, particularly with Landsat-9, where its reduced ability to capture fine-scale variations in vegetation structure likely contributed to the lower accuracy. The coarser resolution of Landsat-9 likely hindered RF’s capacity to capture the variability needed for precise biomass estimation effectively. Overall, the results highlight the critical influence of satellite data resolution on model performance, with Sentinel-2 providing superior results due to its higher resolution. However, Landsat-9, despite its limitations, remains a valuable tool for global biomass estimation, particularly when paired with effective machine-learning models like XGBoost and SVM (Figure 6).
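As a rough illustration of Equation 2, the reported RMSE values can be expressed relative to the mean observed AGB. The calculation below assumes that the stand mean of 7.57 Mg/ha from the descriptive statistics serves as $\bar{y}$; the mean of the validation sample actually used may differ, so these percentages are indicative only.

$$RRMSE_{\text{XGBoost, Sentinel-2}} \approx \frac{0.73}{7.57} \times 100 \approx 9.6\%, \qquad RRMSE_{\text{XGBoost, Landsat-9}} \approx \frac{0.71}{7.57} \times 100 \approx 9.4\%$$

$$RRMSE_{\text{SVM, Sentinel-2}} \approx \frac{0.73}{7.57} \times 100 \approx 9.6\%, \qquad RRMSE_{\text{SVM, Landsat-9}} \approx \frac{0.80}{7.57} \times 100 \approx 10.6\%$$

$$RRMSE_{\text{RF, Sentinel-2}} \approx \frac{0.93}{7.57} \times 100 \approx 12.3\%, \qquad RRMSE_{\text{RF, Landsat-9}} \approx \frac{0.88}{7.57} \times 100 \approx 11.6\%$$

Under this assumption, all six model and sensor combinations have relative errors of roughly 9% to 12% of the mean AGB, so the differences between sensors are small compared with the differences between algorithms.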

Figure 6. Predicted above-ground biomass (AGB) using Sentinel-2 (A–C) and Landsat-9 (D–F) data.

3.5 Comparative analysis of Sentinel-2 and Landsat-9 for AGB mapping using machine learning models

Comparing the Sentinel-2 and Landsat-9 AGB predictions made with the XGBoost, SVM, and Random Forest models shows how spatial and spectral resolution affect the accuracy of biomass mapping. The Sentinel-2-based maps in Figure 7 (S1, S2, S3) showed finer spatial detail. The AGB values ranged from 5.39 to 9.15 Mg/ha, showing clear differences in biomass across the study area. The XGBoost model, in particular, excelled in delineating high- and low-biomass zones, reflecting its ability to model complex spatial patterns with high precision.

Figure 7. AGB maps produced with the XGBoost, SVM, and RF models using Sentinel-2 (S1, S2, S3) and Landsat-9 (L1, L2, L3) data.

The SVM model had similar biomass ranges, but the changes between biomass classes were smoother because its kernel-based approach favors global trends over local variability. The Random Forest model, in contrast, displayed more localized spatial noise, with a slightly wider range of AGB predictions, from 4.71 to 10.15 Mg/ha.

When the same models were applied to Landsat-9 data (Figure 7: L1, L2, L3), the lower spatial resolution reduced the level of detail, but the overall trends in biomass distribution were still well captured. XGBoost’s predictions for Landsat-9 resembled those for Sentinel-2, highlighting its robustness in handling lower-resolution data. The SVM model again showed smoother transitions, while the Random Forest model introduced more variability and noise, particularly with the Landsat-9 dataset. A comparison of the two datasets showed that Sentinel-2, with its higher resolution, consistently produced more detailed biomass maps. Landsat-9, despite its lower resolution, could still produce accurate biomass predictions for larger-scale uses. Among the machine learning models, XGBoost consistently outperformed the others in spatial accuracy and in its ability to capture non-linear relationships and model complex interactions between input features. This study underscores the importance of selecting the appropriate remote sensing data and machine learning model for biomass estimation, with Sentinel-2 offering clear advantages for studies requiring fine-scale detail and XGBoost emerging as the most effective model for both datasets. The findings have significant implications for ecological monitoring, carbon accounting, and sustainable land management, highlighting the potential for combining high-resolution satellite data with advanced machine-learning techniques to improve AGB mapping (Figure 7).

4 Discussion

This research assesses the capability of Sentinel-2 and Landsat-9 satellite data integrated with machine learning models for predicting AGB in a 50-year-old Larix principis-rupprechtii forest stand. It uses sound statistical and machine-learning techniques to demonstrate the usefulness of both Sentinel-2 and Landsat-9 satellite data in estimating forest biomass. The results are helpful for carbon stock assessment, forest evaluation, and land-use activities.

4.1 Comparisons of correlation analysis

The results consistently demonstrate that Sentinel-2 outperforms Landsat-9 in AGB estimation, which is attributable to Sentinel-2’s higher spatial and spectral resolution. Indices from Sentinel-2, such as TNDVI and NDI45 (correlation = 0.58) and NDVI (correlation = 0.56), had stronger links with AGB than their Landsat-9 counterparts (TNDVI = 0.56, NDI45 = 0.55, and NDVI = 0.53). This finding aligns with Castillo et al. (2017), who highlighted the advantages of higher-resolution imagery in capturing fine-scale vegetation gradients and structural attributes critical for biomass estimation.

Fassnacht et al. (2021) reported comparable results from a correlation analysis between remote-sensing variables and field-measured AGB. These indices rely on the NIR and red-edge regions of the spectrum, making them sensitive to canopy architecture, chlorophyll content, and vegetation vigor.
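The correlation analysis discussed above can be illustrated with a short sketch that derives the indices from band reflectances and computes their Pearson correlation with AGB. The index formulas below are the commonly used definitions (NDVI from NIR and red, TNDVI as the square root of NDVI + 0.5, and NDI45 from the first red-edge band and red); they are assumed here because this section does not restate them, and all reflectance and AGB values are hypothetical.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def tndvi(nir, red):
    """Transformed NDVI: sqrt(NDVI + 0.5); requires NDVI >= -0.5."""
    return math.sqrt(ndvi(nir, red) + 0.5)


def ndi45(red_edge, red):
    """Normalized difference of the first red-edge band and the red band."""
    return (red_edge - red) / (red_edge + red)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-plot surface reflectances and AGB (Mg/ha), illustration only.
plots = [
    {"red": 0.06, "red_edge": 0.12, "nir": 0.30, "agb": 5.6},
    {"red": 0.05, "red_edge": 0.13, "nir": 0.34, "agb": 6.4},
    {"red": 0.05, "red_edge": 0.12, "nir": 0.33, "agb": 7.1},
    {"red": 0.04, "red_edge": 0.13, "nir": 0.38, "agb": 7.9},
    {"red": 0.04, "red_edge": 0.14, "nir": 0.37, "agb": 8.8},
]

agb = [p["agb"] for p in plots]
indices = {
    "NDVI": [ndvi(p["nir"], p["red"]) for p in plots],
    "TNDVI": [tndvi(p["nir"], p["red"]) for p in plots],
    "NDI45": [ndi45(p["red_edge"], p["red"]) for p in plots],
}
for name, values in indices.items():
    print(f"{name}: r = {pearson_r(values, agb):.2f}")
```

Since TNDVI is a monotonic transform of NDVI, the two indices tend to give similar correlation coefficients, which matches the close values reported for them in this study.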

4.2 The variable importance analysis

The variable importance analysis of the Sentinel-2 and Landsat-9 images shows how important spectral bands and vegetation indices are in machine-learning models for estimating biomass and carbon stock. This study found that GNDVI was the best predictor for Sentinel-2 data in all models, especially in XGBoost, which is similar to the findings of Morales-Gallegos et al. (2023). Its sensitivity to subtle variations in vegetation properties, such as chlorophyll content and leaf area, positions it as an essential feature for biomass modeling. Similarly, NDI45 was the best predictor for Landsat-9, especially in XGBoost. This aligns with recent research highlighting how indices like NDI45 and its modified versions can capture important vegetation dynamics for biomass estimation (Pham et al., 2020). The study also revealed that other indices, such as WDVI and SAVI, had a relatively low correlation with biomass, despite their widespread application in remote sensing for vegetation and biomass estimation. These results are similar to those of Moghimi et al. (2024), who reported that these indices do not significantly contribute to biomass estimation.

Our findings demonstrate that WDVI and PSSRa play a crucial role in the Landsat-9 models. Consistent with the findings of Vidican et al. (2023), these indices helped capture moisture stress and vegetation health, both of which are important for biomass modeling. This study also found that spectral bands such as Red and Blue in Sentinel-2 and Band 3 in Landsat-9 significantly affected biomass estimation.

These findings resonate with the work of Dong et al. (2020), who highlighted the importance of these bands for accurate biomass and carbon stock assessment. Overall, Sentinel-2 and Landsat 9 data complement each other well. Using them together improves model performance, making tracking vegetation and classifying land cover easier.
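One model-agnostic way to rank predictors such as GNDVI or SAVI is permutation importance: shuffle one predictor at a time and record how much the prediction error increases. The sketch below is an illustration under stated assumptions and is not necessarily the importance measure used in this study; the linear predictor, the predictor values, and the AGB values are all hypothetical.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean squared error of paired values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def permutation_importance(predict, rows, target, n_repeats=20, seed=0):
    """Mean increase in RMSE when each predictor column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(target, [predict(r) for r in rows])
    importance = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy each row with only the chosen predictor replaced.
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            score = rmse(target, [predict(r) for r in permuted])
            increases.append(score - baseline)
        importance[name] = sum(increases) / n_repeats
    return importance


def toy_model(row):
    """Hypothetical fitted model: AGB depends on GNDVI only."""
    return 2.0 + 8.0 * row["GNDVI"]


# Hypothetical predictor values and field AGB in Mg/ha (illustration only).
rows = [
    {"GNDVI": 0.40, "SAVI": 0.31},
    {"GNDVI": 0.45, "SAVI": 0.28},
    {"GNDVI": 0.50, "SAVI": 0.35},
    {"GNDVI": 0.55, "SAVI": 0.30},
    {"GNDVI": 0.60, "SAVI": 0.33},
    {"GNDVI": 0.65, "SAVI": 0.29},
]
agb = [5.3, 5.5, 6.1, 6.3, 6.9, 7.1]

scores = permutation_importance(toy_model, rows, agb)
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: mean RMSE increase = {value:.3f} Mg/ha")
```

A predictor that the model does not use, like SAVI in this toy example, receives an importance of zero, which makes the measure easy to compare across models such as XGBoost, SVM, and RF.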

4.3 Comparison of model performance

The results reveal that XGBoost consistently outperformed SVM and Random Forest across the Sentinel-2 and Landsat-9 datasets, achieving the highest R² values. For Sentinel-2 data, XGBoost attained an R² of 0.82, compared with 0.79 for SVM and 0.74 for Random Forest. Similarly, with Landsat-9, XGBoost achieved an R² of 0.80, outperforming SVM (R² = 0.76) and Random Forest (R² = 0.72). These results are similar to those of Liu H. et al. (2021), who reported that XGBoost’s gradient-boosting method is excellent at dealing with complicated, non-linear interactions in environmental datasets, especially when estimating biomass.

Similarly, Li et al. (2022) found that XGBoost outperformed traditional tree-based methods for forest biomass modeling, particularly when integrating multiple vegetation indices. Miao et al. (2022) also support XGBoost’s ability to provide high accuracy, especially when combining data from different sources such as Sentinel-2 and Landsat-9, which allows a fuller understanding of how vegetation changes over time. SVM exhibited strong predictive capabilities, particularly in capturing non-linear relationships. However, its performance was slightly lower than that of XGBoost across both datasets; with Sentinel-2, for example, SVM reached an R² of 0.79 and an MAE of 0.63 Mg/ha, compared with an R² of 0.82 and an MAE of 0.60 Mg/ha for XGBoost. This finding is similar to Mehmood et al. (2012). SVM’s kernel-based approach can sometimes smooth differences too strongly, which highlights the need for careful kernel selection to preserve spatial transitions. Random Forest produced comparatively lower accuracy, particularly with Landsat-9 data, which recorded an RMSE of 0.88 Mg/ha. Thanh Noi and Kappas (2017) reported that RF is popular among ecological modeling techniques because of its stability and applicability to high-dimensional data. However, Yin et al. (2021) reported that it is sensitive to noise and prone to overfitting in complex landscapes.

4.4 Recommendations

Future research should further integrate complementary datasets, such as LiDAR and hyperspectral imagery, to improve biomass prediction accuracy. Multitemporal analyses that account for seasonal and phenological variations could offer a more dynamic understanding of biomass changes. Also, creating ensemble methods that use the best parts of several machine learning models could avoid the problems that come with single algorithms and make predictions more accurate in complex and varied environments.
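As a simple illustration of the ensemble idea, predictions from several models can be averaged with weights inversely proportional to each model's validation error. The sketch below uses the Sentinel-2 RMSE values reported in this study as example weights; the per-plot predictions and the weighting scheme itself are hypothetical assumptions, not a method evaluated here.

```python
def inverse_error_weights(rmse_by_model):
    """Weights proportional to 1/RMSE, normalized to sum to 1."""
    inverse = {name: 1.0 / err for name, err in rmse_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_predict(predictions_by_model, weights):
    """Weighted average of per-plot predictions across models."""
    n_plots = len(next(iter(predictions_by_model.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in predictions_by_model.items())
        for i in range(n_plots)
    ]


# Sentinel-2 RMSE values (Mg/ha) as reported in the Results section.
validation_rmse = {"XGBoost": 0.73, "SVM": 0.73, "RF": 0.93}

# Hypothetical AGB predictions (Mg/ha) for three plots, illustration only.
predictions = {
    "XGBoost": [6.2, 7.4, 8.9],
    "SVM": [6.0, 7.6, 8.5],
    "RF": [5.5, 7.9, 9.4],
}

weights = inverse_error_weights(validation_rmse)
combined = ensemble_predict(predictions, weights)
print({name: round(w, 3) for name, w in weights.items()})
print([round(v, 2) for v in combined])
```

In practice, the weights would need to be derived from errors on held-out plots that are not used to evaluate the ensemble, otherwise the combined accuracy would be overstated.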

5 Conclusion

This study showed that remote sensing data from both Sentinel-2 and Landsat-9 can be used to estimate the AGB of a L. principis-rupprechtii forest stand. It focused on evaluating the performance of three machine learning models: XGBoost, SVM, and RF. The paper concludes that Sentinel-2, with its higher spatial and spectral resolution, performs better in estimating biomass than Landsat-9, resulting in higher accuracy and more detailed AGB prediction. Vegetation indices such as TNDVI and NDI45, together with spectral bands such as NIR, proved useful for characterizing canopy structure and chlorophyll content, two properties closely related to biomass.

The correlation analysis indicated that indices from Sentinel-2 had stronger correlations with AGB than the indices from Landsat-9. This demonstrates the significance of spatial and spectral resolution in remote sensing applications. Nevertheless, despite the coarser spatial resolution of Landsat-9, its results were useful for larger-scale biomass mapping, especially when integrated with machine learning algorithms such as XGBoost and SVM, which showed good flexibility in handling data of different spatial resolutions. XGBoost was the best model in this study, with the highest accuracy in biomass predictions, followed by SVM, which was also very good at capturing non-linear patterns. The comparison of Sentinel-2 and Landsat-9 for AGB mapping demonstrates how valuable it can be to combine remote sensing data with modern machine learning techniques to obtain more accurate and less time-consuming biomass estimates. This approach has much potential for ecological assessment, carbon stock estimation, and sustainable forest management, particularly in areas that need accurate biomass information to address climate change and conserve biological diversity.

While Sentinel-2 provides superior accuracy for high-resolution biomass mapping, Landsat-9 remains a valuable tool for large-scale applications, especially when paired with effective machine-learning models. The findings from this study highlight the importance of selecting the appropriate remote sensing platform and machine learning technique to optimize biomass estimation, thereby contributing to the broader field of remote sensing-based environmental monitoring. Future research should focus on integrating multi-temporal satellite data and exploring more advanced machine learning algorithms to further enhance the accuracy and applicability of biomass mapping for global carbon accounting and sustainable land management initiatives.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding authors.

Author contributions

JA: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. WHa: Resources, Writing – original draft, Writing – review and editing. KM: Software, Writing – original draft, Writing – review and editing. WHu: Formal Analysis, Software, Writing – original draft, Writing – review and editing. FI: Resources, Writing – original draft, Writing – review and editing. FS: Writing – original draft, Writing – review and editing. KH: Writing – original draft, Writing – review and editing. YQ: Funding acquisition, Project administration, Supervision, Writing – original draft, Writing – review and editing. JZ: Funding acquisition, Project administration, Supervision, Writing – original draft, Writing – review and editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. Project: Effects of spatial variability and biological factors on trunk respiration of L. principis-rupprechtii and its internal mechanism (Grant Number: 31870387).

Acknowledgments

This work was made possible by the support, cooperation, and collaboration of the Saihanba Mechanical Forest Farm staff, who provided invaluable assistance and expertise throughout the project. The Hebei Forest Department members also played a crucial role in supporting our efforts. The State Key Laboratory of Efficient Production of Forest Resources contributed significantly with its advanced research and resources. Furthermore, the Engineering Technology Research Center of Pinus tabuliformis of National Forestry and Grassland Administration offered essential technological insights and guidance. Their collective efforts were instrumental in the successful completion of this work.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fenvs.2025.1577298/full#supplementary-material

References

Ali, J., Malik, S. U., Ashraf, M. I., Zhongkui, J., Husnain, Z., and Gulzar, S. (2023). Exploring the potential of carbon sequestration in sub-tropical pine forest ecosystem: a case study in district kurram, Pakistan. Sarhad J. Agric. 39. doi:10.17582/JOURNAL.SJA/2023/39.3.647.654
