A hybrid XGBoost-SMOTE model for optimization of operational air quality numerical model forecasts

Ke, Huabing; Gong, Sunling; He, Jianjun; Zhang, Lei; Mo, Jingyue

doi:10.3389/fenvs.2022.1007530

ORIGINAL RESEARCH article

Front. Environ. Sci., 23 September 2022

Sec. Atmosphere and Climate

Volume 10 - 2022 | https://doi.org/10.3389/fenvs.2022.1007530

This article is part of the Research Topic Machine-Learning in Studies of Atmospheric Environment and Climate Change

Huabing Ke¹,²

Sunling Gong²*

Jianjun He²

Lei Zhang²

Jingyue Mo¹,²

¹Climate and Weather Disasters Collaborative Innovation Center, Nanjing University of Information Science and Technology, Nanjing, China
²State Key Laboratory of Severe Weather and Key Laboratory of Atmospheric Chemistry of CMA, Chinese Academy of Meteorological Sciences, Beijing, China

As a main technical tool, the air quality numerical model is widely used to forecast atmospheric pollutants, and its development is of great significance to the atmospheric environment and human health. In this study, a hybrid XGBoost-SMOTE model, which automatically finds the optimal hyperparameters and features without human intervention, was developed and applied to optimize the PM_2.5 and O₃ concentrations forecasted by the Chinese operational air quality forecasting model, the CMA Unified Atmospheric Chemistry Environment model (CUACE). Supported by a knowledge base including ground-observed pollutant data, CUACE-forecasted pollutant and meteorological data, and some auxiliary variables, and based on an evaluation of 46 selected key national cities, the XGBoost-SMOTE model was found to achieve satisfactory optimization of the operational model, in particular a significant improvement of extreme pollutant values on high-pollution days. After optimization, the 5-day average correlation coefficient (R), mean error (ME) and root mean square error (RMSE) reach 0.87, 10.34 µg/m³ and 16.53 µg/m³ for PM_2.5, and 0.89, 14.53 µg/m³ and 18.83 µg/m³ for O₃, far better than those of the original CUACE model and the XGBoost model. Furthermore, the optimization of the spatial distribution of pollutants from the CUACE model and an impact analysis of the input features by the SHAP method were also explored. The developed hybrid model shows good application prospects in the field of environmental meteorology forecasting.

1 Introduction

With the continuous development of China’s economy and the acceleration of industrialization and urbanization in the past 40 years, a large number of anthropogenic gases and aerosol particles have been emitted, making the problem of air pollution increasingly serious. Especially in recent years, the frequent occurrence of haze and photochemical pollution incidents in various regions of China has made air quality a major national strategic issue that has received extensive attention from the government and the public (He et al., 2017; Zhong et al., 2019). At present, among the six common air pollutants, PM_2.5 and O₃ are of greatest concern, as they have important impacts not only on the atmospheric environment but also on human health, the earth’s radiation balance and ecosystems (Forouzanfar et al., 2016). Studies have shown that chronic human exposure to high levels of PM_2.5 and O₃ can lead to respiratory diseases, diabetes, asthma, myocardial infarction, and cardiovascular disease (Goldberg et al., 2013; Cohen et al., 2017; Requia et al., 2017). In addition, PM_2.5 and O₃ can affect the global climate by altering the Earth’s radiative energy balance (Fu et al., 2019). Moreover, high concentrations of O₃ can inhibit vegetation growth and reduce crop yields (Sitch et al., 2007).

At present, through government action plans such as the Air Pollution Prevention and Control Action Plan and the Three-Year Plan on Defending the Blue Sky, China’s air pollution control has achieved remarkable results. Data from the Ministry of Ecology and Environment (MEEC) (available at http://www.mep.gov.cn) show that the PM_2.5 concentration and the number of heavy pollution days have been significantly reduced, and ambient air quality has been significantly improved (Zheng et al., 2018; Fu et al., 2019; Zhai et al., 2019). However, the concentration of PM_2.5 has not yet reached the interim target-1 standard of 35 µg/m³ proposed by the World Health Organization (WHO). At the same time, the concentration of O₃ has shown a trend of rapid increase and spread (Zheng et al., 2018; Gong et al., 2022). The combined air pollution represented by high concentrations of PM_2.5 and O₃ is becoming one of the bottlenecks restricting the continuous improvement of the atmospheric environment, and it has attracted widespread attention from scientists. Therefore, accurate forecasting of atmospheric pollutant concentrations is of great significance for atmospheric environment monitoring and human health protection.

As a mainstream tool to study air quality, numerical models use meteorological principles and mathematical methods to simulate various physical and chemical processes such as emission, diffusion, transport, transformation, and deposition in the actual atmosphere to analyze the temporal and spatial evolution of air pollution and internal mechanism (Gong et al., 2003; Grell et al., 2005; Werner et al., 2016). However, the current numerical models also have some limitations. For example, uncertain factors such as initial meteorological fields, boundary conditions, emission inventories, and physical and chemical parameterization schemes will affect the simulation results of the model, resulting in certain errors (Ritter et al., 2012; Li et al., 2013; Gavidia-Calderón et al., 2018; Bao et al., 2019; Peng et al., 2021).

To overcome the above limitations and uncertainties of numerical models, and with the development of information science and artificial intelligence technology, a large number of studies have begun to use machine learning or deep learning methods to optimize model output products. Currently, many studies employ multi-source data fusion techniques, based on air quality numerical models and additional datasets such as satellite aerosol optical depth (AOD) data and other auxiliary data with long-term records and high resolution, to improve pollutant concentrations or spatiotemporal distributions using machine learning models such as random forests, neural networks or extreme gradient boosting (Fang et al., 2016; Ma et al., 2016; Lin et al., 2018; Xiao et al., 2018; Xue et al., 2019; Geng et al., 2021; Huang et al., 2021; Wei et al., 2021; Zhong et al., 2021). However, because these datasets are not available in forecast time, such studies can only address the optimization of historical or near real-time model simulations, which makes them infeasible for operational air quality numerical models. In addition, previous studies usually underestimate extreme PM_2.5 concentrations on high-pollution days significantly, because the small number of high-pollution samples leads to their associations being masked by normal samples (Xue et al., 2019; Wei et al., 2020).

In this study, a hybrid XGBoost-SMOTE model was developed that automatically selects the optimal hyperparameters and features without human intervention. It combines ground-observed pollutant data, CUACE-forecasted meteorological data, CUACE-forecasted pollutant data and some auxiliary variables to optimize the PM_2.5 and O₃ concentrations of the Chinese operational CUACE model forecasts. Additionally, the XGBoost-SMOTE model can balance the uneven proportion of high-pollution and normal samples, significantly improving the optimization performance of the numerical model at high concentration levels.

The paper is organized as follows: Section 2 presents the framework of the proposed XGBoost-SMOTE model and briefly introduces the knowledge base, CUACE model, XGBoost algorithm, and SMOTE algorithm. The results and discussion, which include the performance evaluation of the XGBoost-SMOTE model optimization, the spatial distribution optimization of the CUACE model, and the feature analysis, are given in Section 3. Finally, the conclusions are presented in Section 4.

2 Data and methodology

This study is based on the developed hybrid XGBoost-SMOTE model to optimize PM_2.5 and O₃ concentrations from the operational CUACE model forecasts. The framework of the XGBoost-SMOTE model is shown in Figure 1 and consists of three main components.

(1) The knowledge base: The core basis of the machine learning system, which in this study includes ground-observed pollutant data, CUACE-forecasted meteorological data, CUACE-forecasted pollutant data, and some auxiliary variables.

(2) The data preprocessing: The data from the knowledge base are combined by date and then go through data cleaning, normalization, and division into training and test sets. In addition, the SMOTE technique is used here to reconstruct samples based on a high-pollution indicator to improve the forecasting performance for pollutant concentrations on high-pollution days.

(3) The XGBoost algorithm: The processed sample data enter this component, which includes the modules of feature selection, hyperparameter selection, performance evaluation, feature importance analysis and system memory, and which automatically finds the best “hyperparameters + features” without manual intervention for different pollutants and cities.

FIGURE 1. The framework of the hybrid XGBoost-SMOTE model.

2.1 CMA unified atmospheric chemistry environment model description

The CMA Unified Atmospheric Chemistry Environment model (CUACE) is a national haze numerical forecast operation system that couples chemical weather models online on the basis of the Mesoscale Model version 5 (MM5), and it has been applied to the national environmental meteorology operation (Gong and Zhang, 2008; Zhou et al., 2012). CUACE contains four main functional subsystems, namely emissions, gas phase chemistry, aerosol microphysics and data assimilation (Niu et al., 2008). In the aerosol module, seven aerosol components, i.e., sulfate, nitrate, ammonium salts, sea salt, black carbon, organic carbon, and sand/dust, are divided into 12 size bins, where an internal mixture is assumed for all aerosol components within a bin and an external mixture is adopted between different bins for the particle number density. The aerosol module covers the major atmospheric aerosol processes of production, transport, hygroscopic growth, coagulation, nucleation, condensation, dry and wet deposition, in-cloud and below-cloud removal, and aerosol-cloud interaction (Gong et al., 2003; Wang et al., 2010).

The gas chemistry module is based on the Regional Acid Deposition Model (RADM II) mechanism with 63 gaseous species through 21 photochemical reactions and 121 gas-phase reactions applicable under a wide variety of environmental conditions, especially for smog (Stockwell et al., 1990). In addition, CUACE adopts the thermodynamic equilibrium model ISORROPIA (Nenes et al., 1999; Yu et al., 2005) to calculate nitrate and ammonium aerosols.

The simulation domain uses the Lambert projection to cover the entire territory of China, with a horizontal resolution of 15 km and 360 and 320 grid points in the east-west and north-south directions, respectively. The vertical direction is divided into 23 layers from the ground to the height of 100 hPa at unequal intervals, of which approximately eight layers are within the boundary layer. The operational Global Forecast System (GFS) data from the National Centers for Environmental Prediction (NCEP) provide the initial and lateral meteorological conditions for CUACE, with temporal and spatial resolutions of 3 h and 0.5° × 0.5°. Anthropogenic emissions are taken from the Multi-resolution Emission Inventory for China version 1.3 (MEIC v1.3, http://www.meicmodel.org/, last access: 31 June 2022) with the base year of 2017, developed by Tsinghua University. The operational CUACE model starts running at 08:00 and 20:00 BJT every day, with a forecast length of 5 days. In this study, only the model forecast outputs from 08:00 BJT were used for the optimization experiments, with a time interval of 3 h. Since each model run produces 5 days of output, the entire dataset is divided into five subsets corresponding to the first to fifth forecast days. For each forecast day, an independent XGBoost-SMOTE model is built, and the best “hyperparameters + features” are then found automatically through training without manual intervention, to obtain the best optimization results.
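The division of one 5-day forecast into five forecast-day subsets can be sketched as follows. This is a minimal illustration only: the `lead_hour` key and the convention that lead times of 3 to 24 h belong to the first forecast day are assumptions made here, not details stated in the text.

```python
from collections import defaultdict

def forecast_day(lead_hour: int) -> int:
    """Map a forecast lead time in hours (e.g. 3, 6, ..., 120) to forecast day 1..5.

    Assumed convention: lead hours 1-24 form day 1, 25-48 form day 2, and so on.
    """
    if not 1 <= lead_hour <= 120:
        raise ValueError("lead_hour must lie within the 5-day forecast window")
    return (lead_hour - 1) // 24 + 1

def split_by_forecast_day(samples):
    """Group samples (dicts with a hypothetical 'lead_hour' key) by forecast day."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[forecast_day(sample["lead_hour"])].append(sample)
    return dict(subsets)

# One model run with 3-hourly output over 5 days.
samples = [{"lead_hour": h} for h in range(3, 121, 3)]
subsets = split_by_forecast_day(samples)
```

Each of the five subsets would then be used to train its own model, so that the growth of forecast error with lead time is handled separately for each day.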

2.2 Knowledge base

Table 1 summarizes all the input data in the knowledge base, consisting of ground-observed pollutant data, CUACE-forecasted meteorological data, CUACE-forecasted pollutant data, and some auxiliary variables from 1 January 2021 to 31 December 2021. The hourly ground-observed pollutant concentration data used in this study were obtained from the China National Environmental Monitoring Center (CNEMC, http://www.cnemc.cn). With reference to the National Air Quality Forecast Information Release System (https://air.cnemc.cn:18014), 46 key national cities were selected in this study, namely Baoding, Beijing, Cangzhou, Changchun, Chengde, Chengdu, Chongqing, Dalian, Fuzhou, Guangzhou, Guiyang, Haikou, Handan, Hangzhou, Harbin, Hefei, Hengshui, Hohhot, Jinan, Kunming, Langfang, Lanzhou, Lhasa, Nanchang, Nanjing, Nanning, Ningbo, Qinhuangdao, Qingdao, Shanghai, Shenyang, Shenzhen, Shijiazhuang, Taiyuan, Tangshan, Tianjin, Urumqi, Wuhan, Xiamen, Xi’an, Xingtai, Xining, Yinchuan, Zhangjiakou and Zhengzhou (Figure 2).

TABLE 1. Summary of the knowledge base.

FIGURE 2. The locations of 46 selected key national cities in China.

The CUACE-forecasted meteorological data include sea level pressure (SLP), 2-m temperature (T), 2-m mixing ratio (Q), 10-m u-component wind (U), 10-m v-component wind (V), precipitable water (PWAT), 2-m relative humidity (RH) and visibility (VIS), and the model-forecasted pollutant data include PM_2.5, PM₁₀, SO₂, CO, NO₂, and O₃, taking advantage of the data available in the model forecasting results. Moreover, a high-pollution indicator (HPI) is defined to improve the pollutant forecasting performance on highly polluted days, when PM_2.5 or O₃ are usually underestimated by statistical and machine learning models (Huang et al., 2021; Wei et al., 2021). This high-pollution indicator is calculated from the CUACE-forecasted pollutant data and describes whether the pollutant concentration at each city exceeds the yearly mean by two standard deviations:

\mathrm{HPI}_{i} = \begin{cases} 1, & x_{i} \geq \bar{x} + 2\sigma \\ 0, & x_{i} < \bar{x} + 2\sigma \end{cases}

where $x_{i}$ is the forecasted pollutant concentration of the $i$th sample, and $\bar{x}$ and $\sigma$ refer to the yearly mean value and the standard deviation of the pollutant concentration data. In addition, other auxiliary variables such as latitude (LAT) and longitude (LON), altitude (ALTI), and hour indicator (HOUR, from 1 to 24) are added to provide geographic and periodic information that may affect regional air quality forecasts in the operational model.
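The indicator defined above can be computed as in the following minimal sketch. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev

def high_pollution_indicator(concentrations):
    """Return 1 for samples at or above mean + 2*sigma, otherwise 0.

    `concentrations` stands for one city's forecasted pollutant series over a year.
    Assumption: sigma is the population standard deviation.
    """
    threshold = mean(concentrations) + 2 * pstdev(concentrations)
    return [1 if x >= threshold else 0 for x in concentrations]

# Toy series: nineteen ordinary values and one extreme value (in µg/m³).
forecast = [20.0] * 19 + [200.0]
hpi = high_pollution_indicator(forecast)
```

In this toy series only the last sample exceeds the threshold, so it is the only one flagged as a high-pollution sample.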

In summary, nineteen input features are used in the study, including the CUACE-forecasted meteorological variables of SLP, T, Q, U, V, PWAT, RH, and VIS, the CUACE-forecasted pollutants of PM_2.5, PM₁₀, SO₂, CO, NO₂, and O₃, and the auxiliary variables of HPI, LAT, LON, ALTI, and HOUR. The output is the optimized pollutant concentration of the operational CUACE model. All the input data are matched by date, and samples with missing values are removed. In the data preprocessing, extreme values are not deleted but retained, so that the optimization performance of the hybrid XGBoost-SMOTE model is more realistic. The processed data are then normalized and scaled to (−1, 1) to improve the accuracy of the algorithm and speed up its convergence. The whole year of data (2021) is randomly split into training and test sets with a ratio of 3:1, and 5-fold cross-validation (CV) is used during the training process. The test set is finally used to evaluate the performance of the trained model.
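The preprocessing steps described above can be illustrated with a short sketch. Min-max scaling is an assumption here, since the text only says that the data are scaled to (−1, 1), and all function names are hypothetical.

```python
import random

def scale_to_unit_range(values):
    """Min-max scale a feature column to [-1, 1] (assumed scaling method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def train_test_split_3_to_1(n_samples, seed=0):
    """Randomly assign sample indices to training and test sets with a 3:1 ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples // 4
    return indices[n_test:], indices[:n_test]

def k_folds(indices, k=5):
    """Partition the training indices into k disjoint validation folds."""
    return [indices[i::k] for i in range(k)]

scaled = scale_to_unit_range([0.0, 5.0, 10.0])
train_idx, test_idx = train_test_split_3_to_1(100)
folds = k_folds(train_idx, k=5)
```

During cross-validation each fold serves once as the validation set while the remaining four are used for fitting; the held-out test indices are touched only for the final evaluation.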

2.3 XGBoost model description

XGBoost (eXtreme Gradient Boosting) is an efficient engineering implementation of the Gradient Boosted Decision Tree (GBDT) algorithm. The basic idea of XGBoost is the same as that of GBDT, but several optimizations have been made, such as using the second derivative to approximate the loss function more accurately, adding a regularization term to avoid tree overfitting, and using block storage so that computations can run in parallel (Chen and Guestrin, 2016). XGBoost is efficient, flexible and lightweight, and has been widely used in data mining, recommendation systems and other fields. The principle of the algorithm is as follows:

Assuming that $K$ decision trees have been trained, the final predicted value for the $i$th sample is:

\hat{y}_{i} = \sum_{k=1}^{K} f_{k}(x_{i}), \quad f_{k} \in F

where $x_{i}$ denotes the features of the $i$th sample, and $f_{k}(x_{i})$ is the prediction of the $k$th tree for the $i$th sample. Adding the results of all trees together gives the final predicted value $\hat{y}_{i}$, and the true label of the sample is $y_{i}$. The objective function is therefore constructed as follows:

Obj = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}) + \sum_{k=1}^{K} \Omega(f_{k})

where the first term $\sum_{i=1}^{n} l(y_{i}, \hat{y}_{i})$ is the loss function, which measures the difference between the model predicted values and the true values. The second term $\sum_{k=1}^{K} \Omega(f_{k})$ is the regularization term, which controls the complexity of the model and prevents overfitting.
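The additive prediction and the objective function can be made concrete with a toy ensemble of depth-1 trees. This is a minimal sketch, not the XGBoost library itself: the squared-error loss is an assumption, and the penalty uses the form Ω(f) = γT + ½λΣw² from Chen and Guestrin (2016), where T is the number of leaves and w are the leaf weights.

```python
def stump(feature, threshold, left_weight, right_weight):
    """A depth-1 regression tree, returned as (predict function, leaf weights)."""
    def predict(x):
        return left_weight if x[feature] < threshold else right_weight
    return predict, [left_weight, right_weight]

def ensemble_predict(trees, x):
    """Final prediction: the sum of the outputs of all K trees."""
    return sum(predict(x) for predict, _ in trees)

def omega(leaf_weights, gamma, lam):
    """Complexity penalty of one tree: gamma * T + 0.5 * lambda * sum(w^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(trees, X, y, gamma=1.0, lam=1.0):
    """Training loss (squared error, an assumption) plus regularization."""
    loss = sum((yi - ensemble_predict(trees, xi)) ** 2 for xi, yi in zip(X, y))
    penalty = sum(omega(weights, gamma, lam) for _, weights in trees)
    return loss + penalty

trees = [stump(0, 0.5, 1.0, 3.0), stump(1, 0.0, -0.5, 0.5)]
X = [[0.0, -1.0], [1.0, 1.0]]
y = [0.5, 3.5]
obj = objective(trees, X, y)
```

Here the two trees fit the toy targets exactly, so the objective consists only of the penalty, which shows how larger or more numerous leaf weights are discouraged even when the training loss is zero.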

2.4 Synthetic minority oversampling technique algorithm description

SMOTE (Synthetic Minority Oversampling Technique) is an improved version of the random oversampling method. Random oversampling directly re-uses minority class samples, which produces many duplicate samples in the training set and can easily lead to model overfitting. In contrast, the basic idea of the SMOTE algorithm is, for each minority class sample $x_{i}$, to randomly select a sample $\hat{x}_{i}$ from its nearest neighbors ($\hat{x}_{i}$ is a sample in the minority class, which is denoted by 1 by default), and then to randomly select a point on the line connecting $x_{i}$ and $\hat{x}_{i}$ as a newly synthesized minority class sample. In this study, the minority class refers to the high-pollution samples and the majority class to the normal samples. The algorithm consists of three steps:

1) For each sample $x_{i}$ in the minority class, the Euclidean distance to all other samples in the minority class sample set $S_{\min}$ is calculated to obtain its $k$ nearest neighbors. In practice, the value of $k$ must be set by the user in advance, and there is no known rule for choosing the $k$ value that makes the algorithm optimal, which is a shortcoming of the SMOTE algorithm. This study uses the imbalanced-learn (imblearn) Python library to implement the SMOTE algorithm, with $k$ set to its default value of five.

2) A sampling rate N, which depends on the number of majority and minority class samples, is used to resample the dataset to equalize the number of samples in different classes. For each minority class sample $x_{i}$ , several samples are randomly selected from its k-nearest neighbors, denoted as ${\hat{x}}_{i}$ .

3) For each randomly selected nearest neighbor ${\hat{x}}_{i}$ , construct a new sample with $x_{i}$ according to the following formula:

x_{new} = x_{i} + \mathrm{rand}(0,1) \times (\hat{x}_{i} - x_{i})
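The three steps above can be sketched in plain Python as follows. This is an illustrative simplification rather than the imblearn implementation used in the study: the base sample is drawn at random instead of looping over every minority sample with a sampling rate N, and the function name and arguments are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating between minority neighbors.

    `minority` is a list of feature vectors and must contain at least two samples.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # Step 1: k nearest minority neighbors of x_i by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(x_i, minority[j]),
        )[:k]
        # Step 2: pick one of those neighbors at random.
        x_hat = minority[rng.choice(neighbours)]
        # Step 3: x_new = x_i + rand(0, 1) * (x_hat - x_i).
        r = rng.random()
        synthetic.append([a + r * (b - a) for a, b in zip(x_i, x_hat)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=10, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region already occupied by the minority class instead of duplicating existing points.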

In this study, high-pollution days account for only about 4.0% of the training data set, which hinders the model’s ability to characterize the relationship between high-pollution events and the selected input features. The SMOTE technique is therefore adopted to oversample the data set and strike a balance between high-pollution and normal samples.

3 Results and discussion

3.1 Performance evaluation of XGBoost-SMOTE optimization

The test set of 46 key national cities from the knowledge base is used to study the effectiveness of the proposed hybrid XGBoost-SMOTE model in optimizing the forecasting results of the operational CUACE model. To highlight the advantages of the XGBoost-SMOTE model, especially on high-pollution days, the results of the standalone XGBoost model are also compared.

Figures 3, 4 show the performance comparison of the first-day PM_2.5 forecasting results for the original CUACE, XGBoost, and XGBoost-SMOTE models. After optimization by the XGBoost or XGBoost-SMOTE model, the PM_2.5 forecasting performance of the CUACE model is greatly improved: the average R increases from 0.51 to 0.75 and 0.88, the average ME decreases from 20.71 µg/m³ to 11.69 µg/m³ and 9.85 µg/m³, and the average RMSE drops from 35.80 µg/m³ to 22.54 µg/m³ and 16.21 µg/m³ for the XGBoost and XGBoost-SMOTE models, respectively. In the original CUACE model, the forecasting performance in North China and Central China is generally better, with R values mostly above 0.6, followed by the southeast coast and Northeast China, with R values of 0.4–0.5, while Northwest China and the Qinghai-Tibet Plateau are the worst, with R values of many cities even lower than 0.3. In terms of ME and RMSE, Northwest China and some cities of Central China perform poorly, possibly due to the influence of northwest dust weather and regional pollutant transport. Overall, the ME and RMSE values in southern China are significantly lower than those in northern China, which is also in line with the current pollution situation. After optimization by the XGBoost model, the forecasting performance of the operational CUACE model for the various cities improves significantly, with the R values increasing by more than 0.2, the ME values decreasing by about 9.0 µg/m³, and the RMSE values decreasing by about 13.0 µg/m³. However, in Northwest China and Southwest China, the forecasting performance is still somewhat unsatisfactory, and a considerable gap remains compared with eastern China.
The XGBoost-SMOTE model adopts the high-pollution indicator and the SMOTE technique, so it can better capture the relationship between the pollutant concentration and the input features in both normal and high-pollution events, further improving the forecasting ability of the CUACE model; the details are shown in Figure 5. Test samples from four central cities in different regions (Beijing, Xi’an, Guangzhou, and Harbin) were selected to compare the performance of the XGBoost and XGBoost-SMOTE models in optimizing the PM_2.5 concentration from the operational CUACE model, especially for high-pollution samples. The figure shows that, for all four cities, the XGBoost-SMOTE model improves the forecasts of high-pollution samples substantially compared with the XGBoost model, reducing the PM_2.5 concentration bias by tens to hundreds of µg/m³.

FIGURE 3. Scatter plots of performance comparison on the 1st-day PM_2.5 and O₃ forecasting results for the original CUACE, XGBoost, and XGBoost-SMOTE model.

FIGURE 4. Performance comparison of the 1st-day PM_2.5 forecasting results for the original CUACE, XGBoost, and XGBoost-SMOTE model.

FIGURE 5. Results of PM_2.5 concentration for OBS (observation), XGBoost, and XGBoost-SMOTE model.

Figures 3, 6 show the performance comparison of the first-day O₃ forecasting results for the original CUACE, XGBoost and XGBoost-SMOTE models. After optimization by the XGBoost or XGBoost-SMOTE model, the O₃ forecasting performance of the CUACE model is also greatly improved: the average R increases from 0.55 to 0.90 and 0.91, the average ME decreases from 36.03 µg/m³ to 13.83 µg/m³ and 13.27 µg/m³, and the average RMSE drops from 47.17 µg/m³ to 18.39 µg/m³ and 17.31 µg/m³ for the XGBoost and XGBoost-SMOTE models, respectively. In the original CUACE model, the forecasting performance in North China is generally better, with R values mostly above 0.7, followed by Northwest China, Central China, and East China, with R values of 0.4–0.6, and Northwest China and Southwest China are the worst, with R values of many cities even lower than 0.3. In terms of ME and RMSE, North China and Northeast China perform the best, while Central China and Southwest China perform poorly. After optimization by either the XGBoost or the XGBoost-SMOTE model, the O₃ forecasting results are all satisfactory, and XGBoost-SMOTE shows only a slight advantage over XGBoost, which differs clearly from the optimization of PM_2.5. This may be due to the stronger periodicity of O₃ concentrations, so that the single XGBoost model combined with the input features selected in this study can already fit their associations well, including during high-pollution events, as can be verified from Figure 7.

FIGURE 6. Comparison of the 1st-day O₃ forecasting results for the original CUACE, XGBoost, and XGBoost-SMOTE model.

FIGURE 7. Results of O₃ concentration for OBS (observation), XGBoost, and XGBoost-SMOTE model.

Statistics of the three evaluation criteria (R, ME, and RMSE) for different forecast days are given in Table 2. As the forecast day increases (from the first to the fifth day), the PM_2.5 and O₃ forecasting performance of CUACE gradually deteriorates, as does that of the XGBoost and XGBoost-SMOTE models, which is expected because the error of the input meteorological field inevitably grows with lead time. For all forecast days, the optimization performance of XGBoost-SMOTE for the CUACE forecast is always better than that of XGBoost, especially for PM_2.5. For example, the 5-day average R, ME and RMSE values of PM_2.5 for XGBoost-SMOTE are 0.87, 10.34 µg/m³ and 16.53 µg/m³, respectively, better than the corresponding values of 0.73, 12.34 µg/m³ and 22.79 µg/m³ for XGBoost. In addition, to further demonstrate the advantages of the hybrid XGBoost-SMOTE model, the evaluation criteria of the XGBoost-SMOTE model and other machine learning models are shown in Table 3. The results show that the hybrid XGBoost-SMOTE model consistently outperforms Multiple Linear Regression (MLR), Multilayer Perceptron (MLP), Random Forest (RF), and Gradient Boosted Decision Tree (GBDT) for different pollutants and forecast days. Overall, the evaluation criteria indicate that the hybrid XGBoost-SMOTE model achieves satisfactory performance in optimizing the operational CUACE forecasts of multiple pollutants for cities with different climate and pollution characteristics.
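For reference, the three evaluation criteria can be computed as in the sketch below. R is taken as the Pearson correlation coefficient, and ME is taken here as the mean absolute error; the latter is an assumption based on the reported values being positive, since the formula is not given in this section. The function names and toy values are hypothetical.

```python
import math

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and forecasts."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)

def mean_error(obs, pred):
    """ME, assumed here to be the mean absolute difference."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Toy observed and forecasted concentrations in µg/m³.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
r_value, me_value, rmse_value = pearson_r(obs, pred), mean_error(obs, pred), rmse(obs, pred)
```

Because RMSE squares the residuals, it penalizes the large errors typical of high-pollution days more strongly than ME does, which is why the two criteria are reported together.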

TABLE 2. Statistics of evaluation criteria for different pollutants and forecast days.

TABLE 3. Statistics of evaluation criteria between XGBoost-SMOTE model and other machine learning models.

3.2 Optimization of spatial distribution of CUACE model

In order to obtain the grid-optimized pollutant concentration distributions in other regions of the CUACE model, this study adopts the well-trained XGBoost-SMOTE hybrid algorithm based on the selected 46 cities and applies it to the optimization of model grid points. To increase the optimization efficiency, cyclic optimization is carried out with every four grid points as a group, and the optimized data set corresponding to the grid point data of the CUACE model in the simulation area is obtained.

Figure 8 shows the spatial distributions of PM_2.5 and O₃ concentrations during a high-pollution event from the CUACE forecast and after XGBoost-SMOTE optimization, respectively. Observed values for all cities in the model domain from the China National Environmental Monitoring Center are represented by colored dots. To make it easier to compare the observations with the CUACE forecast and the XGBoost-SMOTE optimization, the same color bar is shared by all data. The figure shows that the spatial distribution of pollutant concentrations optimized by the XGBoost-SMOTE algorithm is closer to the observations. The operational CUACE model forecasts pollutant concentrations poorly during the study period, with significantly overestimated PM_2.5 concentrations in central and southern Hebei Province, southern Henan Province, central Shaanxi Province, and southern Hubei Province, and generally overestimated O₃ concentrations in northern China. The spatial distribution of pollutants shows that the XGBoost-SMOTE hybrid algorithm can calibrate the simulation results of the CUACE model well, bringing the pollutant concentrations closer to the observed values. However, the 46 cities selected in this study are few compared with the number of model grid points, which leads to certain errors in the pollutant concentrations optimized by the XGBoost-SMOTE algorithm. For example, in some cities in central Hebei Province and Shaanxi Province, the optimization of the PM_2.5 concentration is poor and shows a relative underestimation.

FIGURE 8. Spatial distribution maps of observed (dots), CUACE-forecasted and XGBoost-SMOTE-optimized PM_2.5 and O₃ concentrations.

3.3 Feature analysis of the hybrid XGBoost-SMOTE model

To explore the effect of the input features on the optimization of pollutant concentrations, SHapley Additive exPlanations (SHAP), a unified post hoc interpretation framework with the three desirable properties of local accuracy, missingness, and consistency, was used to explain the trained machine learning model (Roth, 1988; Lundberg and Lee, 2017). SHAP quantifies the contribution of each feature to individual predictions by calculating the Shapley value of each feature for each sample (Lundberg et al., 2020).
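The idea of a per-sample Shapley value can be illustrated by exact enumeration on a toy model. This is a sketch under simplifying assumptions, not the SHAP library or its tree-specific algorithm: features outside a coalition are replaced by a fixed baseline value, and the model, inputs and function names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for one sample x.

    Features outside a coalition are set to their baseline value (an assumption).
    """
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi

def toy_model(z):
    """Toy model with one additive term and one interaction term."""
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The values sum to the difference between the model output for the sample and for the baseline, which is the local accuracy property mentioned above; the interaction term is shared equally between the two features involved.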

Figure 9A shows the SHAP summary plot for the optimization of the first-day PM_2.5 forecasting, which is used to demonstrate feature importance and feature impact. Each point on the summary plot is the Shapley value of one feature for one sample, and the larger the absolute Shapley value is, the more influence the feature has on the model output. The y-axis represents the input features, sorted by feature importance from top to bottom, and the x-axis represents the Shapley value of the corresponding point. The color (from red to blue) represents the values of the input features (from high to low), and overlapping points are jittered in the y-axis direction to show the distribution of Shapley values for each feature. The feature importance ranking in the figure shows that the high-pollution indicator ranks first, which means that it has the greatest impact on the model optimization of the PM_2.5 concentration, with positive and negative Shapley value contributions reaching tens to hundreds of micrograms per cubic meter. When the sample is a high-pollution sample with the feature value of INDICATOR marked as 1, the SHAP value is higher, indicating that the feature makes a positive contribution to the optimized PM_2.5. When the sample is a normal sample with the feature value of INDICATOR marked as 0, the feature makes a negative contribution to the optimized PM_2.5. This also explains why the high-pollution indicator combined with the SMOTE technique can significantly improve the performance of the XGBoost model in high-pollution events. In addition, auxiliary geographic features such as LAT and ALTI significantly affect the optimized PM_2.5 concentration with a positive correlation, i.e., higher values of LAT and ALTI tend to increase the optimized value (Figures 9B,C).
In fact, LAT and ALTI represent not only the longitude and altitude in the geographical sense but also the local topographic conditions and industrial structure data behind them, which are temporarily not included in the knowledge base in this study. The higher LAT values tend to increase the optimized values, which is the rule learned by the XGBoost-SMOTE model from 46 central cities, i.e., the concentration of pollutants in the relatively northern region is higher, basically conforming to the current pollution situation in China. Higher ALTI values are associated with higher Shapley values, mainly due to the generally higher PM_2.5 concentrations in central and western cities with higher terrain, which are caused by sparse vegetation coverage and strong secondary industries. Furthermore, the features of PM_2.5, PM₁₀, RH, and HOUR also affect the optimized PM_2.5 concentration basely with positive correlation, while physical features of Q, PWAT, T, SLP, U, and V are negatively correlated with the optimized PM_2.5 concentration.
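The two readings taken from a summary plot, the importance ranking and the direction of each feature's effect, can be sketched as follows. This is a minimal illustration that assumes importance is measured as the mean absolute SHAP value and direction as the sign of the Pearson correlation between feature value and SHAP value; the four samples and their SHAP values are invented for the example and are not results from the study.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def summarize(feature_names, feature_rows, shap_rows):
    """Rank features by mean |SHAP| and report the direction of their effect."""
    summary = []
    for j, name in enumerate(feature_names):
        values = [row[j] for row in feature_rows]
        contribs = [row[j] for row in shap_rows]
        importance = mean([abs(c) for c in contribs])
        direction = pearson(values, contribs)
        summary.append((name, importance, direction))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Hypothetical data: one row per sample, one column per feature
names = ["INDICATOR", "LAT", "T"]
feature_rows = [[1, 39.9, 275], [0, 23.1, 295], [0, 34.3, 285], [1, 45.8, 268]]
shap_rows = [[60.0, 5.0, 2.0], [-12.0, -6.0, -3.0],
             [-10.0, 1.0, -1.0], [55.0, 8.0, 4.0]]

ranking = summarize(names, feature_rows, shap_rows)
for name, importance, direction in ranking:
    sign = "positive" if direction > 0 else "negative"
    print(f"{name}: mean |SHAP| = {importance:.2f}, {sign} correlation")
```

A correlation sign is only a coarse summary; a dependence plot such as Figures 9B,C shows how the contribution changes across the range of the feature.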

FIGURE 9. SHAP summary plot (A), partial dependence plot of LAT (B) and partial dependence plot of ALTI (C) for the optimization of the 1st-day PM_2.5 forecasting.

Additionally, Figure 10A shows the SHAP summary plot for the optimization of the first-day O₃ forecasting. In the feature importance ranking, the high-pollution indicator again ranks first, meaning that it has the greatest impact on the model optimization of O₃ concentration, with positive and negative Shapley value contributions reaching 20 µg/m³–50 µg/m³. Compared with the PM_2.5 optimization, the contribution of the high-pollution indicator to the O₃ concentration is greatly reduced, which indirectly confirms that the improvement of O₃ optimization by XGBoost-SMOTE is far smaller than that of PM_2.5 optimization. In addition, temperature (T) significantly affects the optimized O₃ concentration with a positive correlation, i.e., the optimized value increases at high values of T. The main reason is that solar radiation is strongly correlated with surface temperature and directly affects the production of O₃, so T significantly affects the surface O₃ concentration, which is consistent with the conclusions of other studies (Chen et al., 2020; Jodzis and Baran, 2022). Figure 10B shows more details about the effect of T on the SHAP values. When T is low, the SHAP value is negative, with the extreme occurring at T of around 270 K, which reduces the optimized O₃ concentration. Once T exceeds around 290 K, the temperature starts to make a positive contribution to the SHAP values, and this contribution gradually increases with T. Figure 10C shows details about the effect of the feature HOUR on the SHAP values. HOUR makes a pronounced periodic contribution to the generation of O₃, with the largest positive contribution to the SHAP values at about 14:00–17:00 and the largest negative contribution at about 7:00–8:00 each day.
Furthermore, the features O₃, LAT, V, and VIS also affect the optimized O₃ concentration, mostly with a positive correlation, while the features RH, SLP, PWAT, NO₂, and U are negatively correlated with the optimized O₃ concentration, which may be mainly related to photochemical reactions and ozone production (He et al., 2017; Chen et al., 2019; Chen et al., 2020; Ma et al., 2021; Mousavinezhad et al., 2021).

FIGURE 10. SHAP summary plot (A), partial dependence plot of T (B) and partial dependence plot of HOUR (C) for the optimization of the 1st-day O₃ forecasting.

4 Discussion

It is worth noting that only 46 cities were used in this study to train the XGBoost-SMOTE model, so its performance in optimizing the spatial distribution of pollutants from the CUACE model is somewhat unsatisfactory. This is understandable, because the pollution mechanisms learned from the 46 cities are hard to apply to the whole country, especially to the vast border areas with no or only a few training cities. In our subsequent work, more cities will be added to better capture the pollution mechanisms of regions across China with different climatic and topographical characteristics. In addition, the knowledge base needs to be supplemented with more auxiliary data, such as emission source inventories and vegetation cover data, so that the XGBoost-SMOTE model can better analyze the dominant physicochemical processes or dominant factors of pollutant generation, transport, and dissipation in different regions.

5 Conclusion

A hybrid XGBoost-SMOTE model was established in this study and applied to the optimization of PM_2.5 and O₃ concentrations from the operational CUACE model. Ground-observed pollutant data, CUACE-forecasted meteorological data, CUACE-forecasted pollutant data, and some auxiliary variables form the basis of the model application. The XGBoost-SMOTE model automatically selects the optimal hyperparameters and features without human intervention and, combined with a self-defined high-pollution indicator, significantly improves the pollutant forecasting performance of the numerical model on highly polluted days.

The results showed that after optimization by the XGBoost-SMOTE model, the PM_2.5 forecasting performance of the CUACE model was greatly improved, with the 5-day average R, ME, and RMSE values changing from 0.49 to 0.87, 22.35 µg/m³ to 10.34 µg/m³, and 38.52 µg/m³ to 16.53 µg/m³, respectively, which are far better than the values of 0.73, 12.34 µg/m³, and 22.79 µg/m³ obtained with the XGBoost model. For the optimization of O₃ forecasting performance, the 5-day average R, ME, and RMSE values changed from 0.47 to 0.89, 37.74 µg/m³ to 14.53 µg/m³, and 49.52 µg/m³ to 18.83 µg/m³, respectively, which are also better than the values of 0.87, 15.33 µg/m³, and 20.35 µg/m³ obtained with the XGBoost model. A comparison of the time series of four central cities (Beijing, Xi’an, Guangzhou, and Harbin) verified that the XGBoost-SMOTE model greatly improves the extreme values of the pollutant concentrations forecasted by the CUACE model on high-pollution days, thereby achieving excellent forecasting results, especially for PM_2.5. In addition, applying the optimization to other model grid points showed that the spatial distribution of pollutant concentrations optimized by the XGBoost-SMOTE algorithm was closer to the observed results. Furthermore, the impact analysis of the input features by SHAP showed that the high-pollution indicator ranked first in feature importance. For PM_2.5 optimization, the features LAT, ALTI, PM_2.5, PM₁₀, RH, and HOUR display a positive correlation, while the physical features Q, PWAT, T, SLP, U, and V display a negative correlation. For O₃ optimization, the features T, HOUR, O₃, LAT, V, and VIS display a positive correlation, while the features RH, SLP, PWAT, NO₂, and U display a negative correlation.
To summarize, the hybrid XGBoost-SMOTE model developed here can achieve reliable optimization of the operational CUACE model and shows good application prospects in the field of atmospheric environmental forecasting.
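The size of these improvements can be expressed as relative error reductions. The short sketch below uses only the 5-day average ME and RMSE values quoted above; the helper name `relative_reduction` and the percentage form are illustrative choices, not metrics reported in the study.

```python
def relative_reduction(before, after):
    """Percentage by which an error metric drops from `before` to `after`."""
    return (before - after) / before * 100.0


# 5-day average errors quoted in the text, in µg/m³:
# (original CUACE forecast, after XGBoost-SMOTE optimization)
errors = {
    ("PM2.5", "ME"): (22.35, 10.34),
    ("PM2.5", "RMSE"): (38.52, 16.53),
    ("O3", "ME"): (37.74, 14.53),
    ("O3", "RMSE"): (49.52, 18.83),
}

reductions = {key: relative_reduction(*pair) for key, pair in errors.items()}

for (pollutant, metric), pct in reductions.items():
    print(f"{pollutant} {metric}: {pct:.1f}% lower than the original CUACE forecast")
```

On these figures, both error metrics fall by more than half for both pollutants.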

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

SG put forward the ideas and formulated overarching research goals. HK carried them out and wrote the manuscript with suggestions from all authors. JH and LZ participated in the scientific interpretation and discussion. JM assisted with data acquisition and processing. All authors contributed to the discussion and improvement of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Nos 42090030, 91744209), the National Key Foundation Study Developing Programs (Nos 2019YFC0214601 and 2019YFC0214801), and CAMS Basic Research Project (Grant No. 2019Z009 and No. 2020Y001).

Acknowledgments

The authors would like to thank the National Meteorological Centre for providing historical operational CUACE data.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bao, Y., Zhu, L., Guan, Q., Guan, Y., Lu, Q., Petropoulos, G. P., et al. (2019). Assessing the impact of Chinese FY-3/MERSI AOD data assimilation on air quality forecasts: Sand dust events in northeast China. Atmos. Environ. 205, 78–89. doi:10.1016/j.atmosenv.2019.02.026
