Climate-informed monthly runoff prediction model using machine learning and feature importance analysis

Yan, Lei; Lei, Qingwen; Jiang, Cong; Yan, Pengtao; Ren, Zheng; Liu, Bin; Liu, Zhangjun

doi:10.3389/fenvs.2022.1049840

ORIGINAL RESEARCH article

Front. Environ. Sci., 17 November 2022

Sec. Atmosphere and Climate

Volume 10 - 2022 | https://doi.org/10.3389/fenvs.2022.1049840

This article is part of the Research TopicHydro-Climate Extremes and Natural Disasters During Global Warming: Observation, Projection, and MitigationView all 11 articles

Climate-informed monthly runoff prediction model using machine learning and feature importance analysis

Lei Yan^1,2*

Qingwen Lei^1,2

Cong Jiang³

Pengtao Yan⁴

Zheng Ren^1,2

Bin Liu^1,2

Zhangjun Liu⁵

¹College of Water Conservancy and Hydropower, Hebei University of Engineering, Handan, China
²Hebei Key Laboratory of Intelligent Water Conservancy, Hebei University of Engineering, Handan, China
³School of Environmental Studies, China University of Geosciences, Wuhan, China
⁴School of Physics and Electronic Engineering, Xingtai University, Xingtai, China
⁵Jiangxi Academy of Water Science and Engineering, Nanchang, China

Accurate runoff prediction can provide a reliable decision-making basis for flood and drought disaster prevention and scientific allocation of water resources. Selecting appropriate predictors is an effective way to improve the accuracy of runoff prediction. However, the runoff process is influenced by numerous local and global hydrometeorological factors, and there is still no universal approach about the selection of suitable predictors from these factors. To address this problem, we proposed a runoff prediction model by combining machine learning (ML) and feature importance analysis (FIA-ML). Specifically, take the monthly runoff prediction of Yingluoxia, China as an example, the FIA-ML model uses mutual information (MI) and feature importance ranking method based on random forest (RF) to screen suitable predictors, from 130 global climate factors and several local hydrometeorological information, as the input of ML models, namely the hybrid kernel support vector machine (HKSVM), extreme learning machine (ELM), generalized regression neural network (GRNN), and multiple linear regression (MLR). An improved particle swarm optimization (IPSO) is used to estimate model parameters of ML. The results indicated that the performance of the FIA-ML is better than widely-used long short-term memory neural network (LSTM) and seasonal autoregressive integrated moving average (SARIMA). Particularly, the Nash-Sutcliffe Efficiency coefficients of the FIA-ML models with HKSVM and ELM were both greater than 0.9. More importantly, the FIA-ML models can explicitly explain which physical factors have significant impacts on runoff, thus strengthening the physical meaning of the runoff prediction model.

1 Introduction

Water resources are important for social and economic development and the ecological environment, and accurate runoff forecasting can provide a reasonable decision-making basis for the optimal allocation and utilization of water resources (Huang et al., 2014; Xiong et al., 2019; Feng et al., 2020a; Yan et al., 2021a; Jian et al., 2022). However, under changing environments, the runoff process and associated hydrological system have been altered by human activities and climate change (Song et al., 2018; Sun et al., 2018, 2022; Jiang et al., 2019; Lu et al., 2020; Yan et al., 2020; Hu et al., 2022), and the runoff series becomes nonlinear and nonstationary, which makes it challenging to capture the variation characteristics of runoff (Sun et al., 2014; Lin et al., 2020; Yan et al., 2021b; Samantaray et al., 2022a; Samantaray et al., 2022b; Samantaray et al., 2022c; Zhou et al., 2022). Therefore, there is an urgent need to develop a runoff prediction model with robustness and high forecasting accuracy under a changing environment (Sit et al., 2020; Niu et al., 2021; Zhao et al., 2021). In recent years, there have been many studies trying to transform the complex runoff series into stationary sub-sequences using wavelets or mode decomposition methods, and then predict the sub-sequences to improve the accuracy of prediction (Meng et al., 2019; Feng et al., 2020b; Niu et al., 2020). However, these studies often ignore the relationship between hydrometeorological factors and the variation characteristics of runoff.

As revealed by recent studies, the changing characteristics of runoff process are controlled by numerous factors, such as astronomy, meteorology, ocean, and underlying surface conditions (Tang et al., 2018; Shi et al., 2021; Bian et al., 2022; Ma et al., 2022). When there are no significant changes for the underlying surface conditions, the runoff process mainly depends on precipitation and evaporation, which are influenced by atmospheric circulations (Talaee et al., 2014; Huang et al., 2017; Luo et al., 2017). On the other hand, the interaction between the ocean and the atmosphere drives the exchange of matter and energy between regions of the Earth, which can affect regional weather and climate change (Nugent and Matthews, 2012; Singh and Roxy, 2022). Therefore, the sea temperature indices, such as Pacific Decadal Oscillation Index (PDO) and El Niño–Southern Oscillation index (ENSO), are also important affecting factors of runoff (Yang et al., 2021; You et al., 2021). From the analysis of the physical mechanism of runoff, atmospheric circulation, sea temperature index, precipitation, and evaporation are all important influencing factors in the runoff prediction.

Variations of runoff are closely related to large-scale climate factors in hydrometeorological teleconnection analysis (Lima and Lall, 2010; Peters et al., 2013; Steinschneider and Lall, 2016; Wang et al., 2022). Therefore, many studies combining teleconnection analysis have been proposed to strengthen the physical meaning of runoff prediction. Wang et al. (2020) established a runoff prediction model by combining teleconnection analysis and ensemble empirical mode decomposition (EEMD) and achieved good application results, but they mentioned the cross-correlation method used in their paper is not suitable for analyzing nonlinearity and non-stationary time series. Maity and Kashid (2011) combined genetic programming (GP) with importance analysis, and used global climate factors and local meteorological variables as predictors to carry out the weekly-scale runoff forecasting for the Mahanadi River in India. It has been proved that the model derived by GP for a complex runoff system can effectively improve the accuracy of weekly runoff prediction. Yang et al. (2017) selected 17 known climate phenomenon indices, such as PDO and ENSO, and used several machine learning (ML) regression models to predict and simulate the runoff 1-month in advance of headwater reservoirs in USA and China. The results indicated that the climate phenomenon indices are beneficial for improving the accuracy of monthly or seasonal reservoir runoff prediction.

Based on the previous studies, we can find that enriching the input features of regression models is an effective method to improve the accuracy of runoff prediction. However, since numerous physical factors are expected to jointly affect the variation of runoff, there is expected to exist a mutual correlation among these factors. Thus, finding a suitable set of predictors from the high-dimensional data, consist of physical factors, as the input of the runoff prediction model is still a challenge for medium and long-term runoff prediction. In information theory, mutual information (MI) method is a powerful tool for analyzing linear and nonlinear relationships between variables (Elkiran et al., 2021), but simply selecting predictors through correlation analysis provided by MI will introduce many redundant features (Tiwari and Chaturvedi 2022). To solve this problem, in traditional high-dimensional data preprocessing, principal component analysis (PCA) is typically used to reduce dimensionality and it can effectively reduce redundant variables and feature dimensions (Ouyang et al., 2022). But PCA cannot accurately measure the importance of each feature on runoff prediction results. Thus, how to select the input predictors of regression model needs further investigation. In this study, we attempted to develop a runoff prediction model by combining ML with importance analysis for each feature (FIA-ML) to improve the accuracy of runoff prediction and explain the variation law of runoff from the view of physical causes.

In a summary, the aims of this study are: 1) to select appropriate predictors according to the importance measures computed by random forest (RF) for each feature; and 2) to construct the runoff prediction models by fitting the correlation between predictors and monthly runoff based on the ML models, whose hyper-parameters are optimized by an improved particle swarm optimization (IPSO). To fulfill these aims, the monthly runoff data collected from the Yingluoxia station in China was selected for illustration purposes, and the performance of the developed FIA-ML models was compared with two widely-used runoff time series analysis models, i.e., long short-term memory neural network (LSTM) and seasonal autoregressive integrated moving average (SARIMA).

2 Study area and data

2.1 Hydrological characteristics of the study area

The Heihe River Basin (HRB) is the second-largest inland watershed in Northwest China, whose geographic coordinates are roughly between $98^{\circ} - 101^{\circ} E$ and $38^{\circ} - 42^{\circ} N$ , with a drainage area of about $142,900 {km}^{2}$ . From the upper to the lower reaches of HRB, the annual average precipitation is: 284, 112, and 41 mm, respectively, and the annual average water surface evaporation is: 891, 1221, and 1,372–2081 mm, respectively. The climate of HRB is mainly influenced by the westerly circulation in the middle and high latitudes and the polar cold air mass, and the precipitation is sparse and concentrated.

The HRB originates from the Qilian Mountains in the south, with a total length of 821 km. It flows through Qinghai, Gansu, and Inner Mongolia provinces and finally into Juyanhai. In this study, the Yingluoxia basin (YLX) is selected as the study area, which is in the upper reaches of the HRB, with a controlled catchment area of about ${10,018 km}^{2}$ (Figure 1). The main sources of surface water supply in YLX are the melting of snow and ice in the mountains and atmospheric precipitation, and the base flow is mainly supplied by groundwater, and most of the water resources of YLX concentrate in spring and summer. We collected 684 months of runoff data at YLX from 1960 to 2016 for the further analysis and research. The average annual runoff of YLX is $16.41 \times 10^{8} m^{3}$ . Meanwhile, the Mann-Kendall method (Mann, 1945; Kendall, 1975) was used to analyze the trend of the annual runoff series. The results indicated that the annual runoff series showed an upward trend after 1980, and a significant upward trend was detected after 1988. The annual runoff rising rate is $0.1 \times 10^{8} m^{3} / yr$ from 1960 to 2016. The increasing trend of the annual runoff of YLX is most likely due to global warming, since YLX is in the upper reaches of the HRB and is less affected by human activities (Wen et al., 2019; Zou et al., 2022).

FIGURE 1

FIGURE 1. The location of the study area and meteorological observation stations. YNG denotes Yeniugou (Meteorological station 1), ZY denotes Zhangye (Meteorological station 2), QL denotes Qilian (Meteorological station 3) and YLX denotes Yingluoxia (Hydrological station).

2.2 Meteorological and large-scale climate factors

The 130 large-scale climate factors used in this study are provided by the National Climate Center of China Meteorological Administration (NCC-CMA) (https://cmdp.ncc-cma.net/cn/download.htm), including 88 atmospheric circulation indices, 26 sea temperature indices, and 16 other indices. In this study, these large-scale climate factors are sequentially numbered for 1–130, which is consistent with the serial number of the datasets provided by the NCC-CMA. As for the local meteorological information, namely the evaporation (factor numbers: 131–133) and precipitation data (factor numbers: 134–136) of the Yeniugou (YNG), Zhangye (ZY), and Qilian (QL) meteorological stations, are collected from the National Tibetan Plateau/Third Pole Environment Data Center (https://data.tpdc.ac.cn/zh-hans/). In addition, we also considered the previous runoff of YLX (factor number: 137) as a factor. Therefore, a total of 137 physical factors were considered for further building the monthly runoff prediction model in this study.

3 Methodologies

3.1 Mutual information based on k-nearest neighbors

Mutual information (MI) is a nonparametric statistical method used to measure the degree of correlation between variables. MI does not have any special requirements for the distribution type of variables, and it can describe both linear and nonlinear correlation. Therefore, it has been widely used in variable selection in hydrology, meteorology, and other fields (He et al., 2015; Fang et al., 2018). Given variables $X$ and $Y$ , the MI between them is defined as follows:

I (X; Y) = \iint μ (x, y) \log \frac{μ (x, y)}{μ_{X} (x) μ_{Y} (y)} d x d y (1)

where $μ (x, y)$ represents the joint probability density of variables $X$ and $Y$ . $μ_{X} (x)$ and $μ_{Y} (y)$ are the marginal probability densities of $X$ and $Y$ , respectively. The greater the MI, the more information the variable $X$ contains about $Y$ , in other words, it would demonstrate a stronger dependence between these two variables.

MI has difficulties in estimating probability density. Thus, Kraskov et al. (2004) proposed a method based on k-nearest neighbors to avoid directly estimating the probability density of the variables. As for the significance test of MI, the method proposed by Sharma (2000) is applied, and the significance level is set to be 0.01 in this study.

3.2 Feature importance analysis based on random forest

Random forest (RF) is an extended variant of the bagging parallel ensemble learning method (Breiman, 2001). During the training period, it uses bootstrap sampling to generate a subset of training samples. Therefore, each base learner only uses about 63.2% of the samples in the initial training set, and the remaining 36.8% of the samples can be used as the validation set to evaluate the generalizability of RF. This way of assessing the generalization of the model is called out-of-bag (OOB) estimation.

As shown in many studies, reasonable predictors can significantly improve the accuracy and robustness of the regression model in runoff prediction (Taormina and Chau, 2015; Sharifi et al., 2017). RF can efficiently deal with the multivariate samples and measure the importance of each variable (Genuer et al., 2010; Hapfelmeier et al., 2014). The importance measure of each feature in runoff forecasting is computed as follows: 1) to obtain the mean square error vector ${M S E_{t} | t = 1, 2, \dots, T}$ for the $T$ base decision trees in OOB estimation; 2) shuffling each feature of the OOB samples in turn, and the variation of $M S E_{t}$ for each feature perturbation is represented by the following matrix.

[\begin{array}{c} M S E_{11} & M S E_{12} & \dots & M S E_{1 T} \\ M S E_{21} & M S E_{22} & \dots & M S E_{2 T} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ M S E_{m 1} & M S E_{m 2} & \dots & M S E_{m T} \end{array}] (2)

where $m$ and $T$ are the number of features and the number of decision trees, respectively. $M S E_{m t}$ is the OOB mean square error of the $t$ base decision tree after shuffling the $m - th$ feature. Therefore, Eq. 3 can be used to measure the perturbation degree of the $m - th$ feature to the model.

H_{m}^{o o b} = \frac{1}{T} \sum_{t = 1}^{T} (M S E_{m t} - M S E_{t}) (3)

If the feature is shuffled, the greater the accuracy of OOB estimation decreases, the greater the degree of disturbance to the model by the feature. By normalizing $H_{m}^{o o b}$ to the range of [0,1], the importance measure of each feature (FIM) can be obtained. In previous studies, the impurity weighted increments of the leaf nodes of the tree model were mostly used to measure the importance of features (Zuo et al., 2020). However, the impurity-based method usually causes the feature importance to drop rapidly, and the identification of feature importance is relatively insensitive. Therefore, we used the decreases of accuracy of OOB estimation to measure the influence of each feature on the runoff prediction result in this study.

3.3 Regression model

3.3.1 Hybrid kernel support vector machine

Support vector machines (SVM) can solve nonlinear problems, mainly using the kernel function to map the input factors to the high-dimensional space and then performing linear regression. Different kernel functions will significantly affect the fitting ability of SVM to different kinds of nonlinear problems. Among the commonly used kernel functions, the polynomial kernel function $κ_{P o l y}$ (see Eq. 4) has strong generalization ability and weak learning ability. On the contrary, the localized Gaussian radial basis kernel function $κ_{R b f}$ (see Eq. 5) has robust learning ability, but its generalization ability is relatively weak. According to the Mercer kernel theory, it can effectively overcome the shortcomings of the existing kernel functions by combining these two types of kernel functions into a new hybrid kernel function (Zheng et al., 2005; Zhou et al., 2018). Therefore, the polynomial kernel function and the Gaussian radial basis kernel function are combined into a hybrid kernel function $κ_{H K}$ (see Eq. 6), and the hybrid kernel support vector machine (HKSVM) is developed in this study.

κ_{P o l y} (x, x_{i}) = (γ_{p} (x \cdot x_{i}) + 1)^{3} (4)

κ_{R b f} (x, x_{i}) = \exp (γ_{r} {‖ x - x_{i} ‖}^{2}) (5)

κ_{H K} (x, x_{i}) = λ κ_{P o l y} + (1 - λ) κ_{R b f} (6)

where $x$ is the input predictors, and $x_{i}$ is the input predictors of the $i - th$ training sample. $γ_{p}$ and $γ_{r}$ are the kernel parameters. $λ \in [0,1]$ represents the proportion of $κ_{P o l y}$ and $κ_{R b f}$ in the hybrid kernel function $κ_{H K}$ . It should be noted that HKSVM also has two important parameters, which are the penalty factor $C$ and the insensitivity coefficient $ε$ .

3.3.2 Extreme learning machine

Extreme learning machine (ELM) is a machine learning method based on feedforward neural network. ELM consists of a three-layer structure: input layer, hidden layer, and output layer. During the training period, the weights of the output layer can be obtained by the least square method (LSM). Given the input predictors $x$ , the runoff prediction value $y$ is calculated by the following equation:

y (x) = \sum_{l = 1}^{L} β_{l} G (ω_{l}, b_{l}, x) (7)

where $L$ is the number of hidden layer neurons. $ω_{l}$ and $b_{l}$ are the input weights and thresholds of the hidden layer neurons, respectively. $G (\cdot)$ is the activation function. $β_{l}$ is the connection weight vector connecting the hidden layer neurons and the output layer neurons, which can be obtained by the Moore-Penrose generalized inverse method (Huang et al., 2006).

3.3.3 General regression neural network

Generalized Regression Neural Network (GRNN) is of high fault tolerance and strong robustness, which is suitable for solving nonlinear problems and can also handle unstable data. The network structure of GRNN consists of the input layer, mode layer, summation layer and output layer. After the input predictors $x$ is input from the input layer, the following Eq. 8 can express the runoff prediction value.

y (x) = \frac{\sum_{i = 1}^{N} y_{i} \exp [- {‖ x - x_{i} ‖}^{2} / 2 σ^{2}]}{\sum_{i = 1}^{N} \exp [- {‖ x - x_{i} ‖}^{2} / 2 σ^{2}]} (8)

where $N$ is the number of training samples, and $y_{i}$ is the observed runoff corresponding to the $i - th$ training sample. $σ$ is the smoothing factor.

3.3.4 Multiple linear regression

Multiple Linear Regression (MLR) can be used to fit a linear relationship between multiple independent and dependent variables. After the specific MLR equation is obtained through training, the dependent variable can be predicted by the following equation:

y (x) = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{m} x_{m} + μ (9)

where $x_{i}$ and $b_{i}$ are the $i - th$ input predictor and regression coefficient, respectively. $μ$ is a random error satisfying the Gaussian distribution. The solution of the regression coefficients in Eq. 9 usually adopts the LSM.

3.4 Hyper-parameters optimization for regression models

The performance of machine learning (ML) may be limited in practice since its forecasting results are largely influenced by the choice of hyper-parameters. Therefore, it is critical to obtain the hyper-parameters combination with the best generalization performance. In this study, we use the K-Fold cross-validation method (Soper, 2021) to evaluate the model performance. However, the cross-validation results of ML models are sensitive to the choice of hyper-parameters. When using conventional particle swarm optimization algorithms to estimate their hyper-parameters, there are problems of premature maturity and local convergence. To overcome these problems, we employed an improved particle swarm optimization algorithm (IPSO) in this study to determine a set of hyper-parameters that optimize the cross-validation results. Please refer to Lei et al. (2022) for more details about IPSO. The hyper-parameters that need to be tuned by IPSO are displayed in Table 1.

TABLE 1

TABLE 1. The range of hyper-parameters to be optimized for the model used in this paper.

3.5 Model performance evaluation metrics

In this study, we selected several indicators to quantitatively analyze the performance of various models, which are Nash-Sutcliffe efficiency (NSE), root mean square error (RMSE), the coefficient of correlation (R), and Kling-Gupta efficiency (KGE) (Gupta et al., 2009).

The value range of $N S E$ is $(- \infty, 1]$ . The closer it is to 1, the better the prediction effect and the higher the reliability of the model. It is calculated by the following equation:

N S E = 1 - \frac{\sum_{i = 1}^{n} {(y_{o, i} - y_{s, i})}^{2}}{\sum_{i = 1}^{n} {(y_{o, i} - {\bar{y}}_{o})}^{2}} (10)

where $n$ is the number of testing samples. $y_{o, i}$ and $y_{s, i}$ are the $i - th$ observed value and predicted value. ${\bar{y}}_{o}$ is the mean of the observed series.

The smaller the RMSE, the higher the prediction accuracy. It is calculated by the following equation:

R M S E = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{o, i} - y_{s, i})}^{2}} (11)

$R$ is a statistical indicator that reflects the correlation between the predictions and the observations. The closer it is to 1, the higher the prediction accuracy. It is calculated by the following equation:

R = \frac{\sum_{i = 1}^{n} (y_{o, i} - {\bar{y}}_{o}) (y_{s, i} - {\bar{y}}_{s})}{\sqrt{\sum_{i = 1}^{n} {(y_{o, i} - {\bar{y}}_{o})}^{2} \sum_{i = 1}^{n} {(y_{s, i} - {\bar{y}}_{s})}^{2}}} (12)

where ${\bar{y}}_{s}$ is the mean of the predicted series.

$K G E$ is a new metric proposed to address the deficiencies of $N S E$ in model calibration and evaluation. It is calculated by the following equation:

K G E = 1 - \sqrt{{(r - 1)}^{2} + {(\frac{σ_{s}}{σ_{o}} - 1)}^{2} + {(\frac{μ_{s}}{μ_{o}} - 1)}^{2}} (13)

where $r$ is the linear regression coefficient between the observed and predicted values. $μ_{o}$ , $σ_{o}$ , $μ_{s}$ , $σ_{s}$ correspond to the mean and standard deviation of the observed and predicted series, respectively.

3.6 Monthly runoff prediction using the proposed feature importance analysis and machine learning model

The main purpose of this study is to screen out suitable input predictors for the ML models and to explore which physical factors have a significant impact on the monthly runoff of YLX station. In this study, the monthly runoff of the first 564 months (1960–2006) of YLX station was used as training data to train the model, and the monthly runoff of the last 120 months (2007–2016) was used as testing data to test the predictive accuracy of the monthly runoff prediction model. The flowchart of the monthly runoff prediction model by combining ML with feature importance analysis (FIA-ML) is presented in Figure 2.

FIGURE 2

FIGURE 2. The flowchart of monthly runoff prediction using the FIA-ML model.

This study proposed a novel monthly runoff prediction model combining ML with teleconnection analysis, which is different from the commonly used time series analysis model (TSAM). Therefore, to show the superiority of the proposed FIA-ML model, it is necessary to make a comparison with some traditional TSAMs. In the following analysis, we compared the FIA-ML model with the widely used long short-term memory neural network (LSTM) and seasonal autoregressive integrated moving average (SARIMA) in previous studies. LSTM is a variant of recurrent neural network (RNN), which can effectively solve the gradient explosion or disappearance of simple RNN, and control the transfer of runoff time series information through a gating mechanism (Yuan et al., 2018; Ghose et al., 2022). Considering that the monthly runoff series is affected by the interaction of seasonal effects, long-term trend effects and random fluctuations, SARIMA transforms the nonstationary monthly runoff series into stationary series by performing trend difference and seasonal difference operation, and then establish a statistical analysis model (Valipour, 2015). The TSAM requires less basin information and is easy to use, so it has been widely used in practice. However, with the impact of changing environmental on the stationariness of the runoff series, the prediction accuracy of TSAM will also be affected to a certain extent. In contrast, although the FIA-ML model is more complex, it fully considers a variety of physical factors affecting runoff, and it has better application prospects in the context of climate change. It should be noted that when comparing the runoff prediction accuracy of TSAM and FIA-ML models below, the input predictors of the FIA-ML models adopted their corresponding best input scenario.

4 Results and discussion

4.1 selection of model input predictors

The choice of model input predictors will directly affect the final runoff prediction results. In this study, we aimed to construct a set of physical predictors for the runoff prediction model by identifying the key physical factors affecting runoff, from the 137 physical factors including large-scale climate index, precipitation, and evaporation, etc. To ensure the quality of the data and the reliability of this study, we directly discarded the physical factors with missing data to avoid unreasonable interpolation. It should be mentioned that the influence of the physical factors on runoff has a lag effect. Thus, in this study, the predictors with 1–12-month lags were employed to inform the runoff prediction model, considering the seasonal variation characteristics of monthly runoff.

As shown in Figure 3, there are several comments should be noted as follows: 1) among the 88 large-scale circulation factors, NPVI (factor number 55: Northern Hemisphere Polar Vortex Intensity Index) with a 1-month lag and EATI (factor number 64: East Asian Trough Intensity Index) with a 12-month lag are the most important factors for improving the accuracy of runoff prediction. 2) The effect of evaporation on runoff is more important than precipitation, which is consistent with the climate characteristics of more evaporation and less precipitation in Northwest China. 3) YLX basin was less affected by oceanic action, such as ENSO and PDO. Among the sea temperature indices, the influence of IOWPA (factor number 101: Indian Ocean Warm Pool Area Index) with a 4-month lag and WPWPA (factor number 103: Western Pacific Warm Pool Area Index) with an 11-month lag was relatively significant.

FIGURE 3

FIGURE 3. the mutual information value among the 137 factors and the observed runoff at a lag of 1–12 months, and the blanks are missing factors (A), and the importance score of the physical factors whose mutual information values are ranked in the top 50, which are filled with color (B).

In this study, we synthesized 10 scenarios (Figure 2) for 4 prediction models, that is HKSVM (M1), ELM (M2), GRNN (M3), and MLR (M4). Thus, four metrics, namely NSE, KGE, RMSE and R, are used to find the optimal input scenario for each model. According to the level of MI (Figure 3A), we selected the top 50 physical factors, and then RF is used to order the importance of these factors (Figure 3B). Based on the order of importance, we sequentially added 5 physical factors each time as the model input. So, a total of 10 input scenarios was generated (Figure 2). Synthesizing the prediction accuracy of the four regression models under 10 input scenarios, the HKSVM and MLR perform best in scenario 4, and the ELM and GRNN perform best in scenario 3. It can be seen from Figure 4 that with the increase of input features, the prediction accuracy of each model firstly becomes better, but after the optimal input scenario appears, increasing the input features is not conducive to further improve the prediction accuracy. There are two main reasons for this result: 1) the input features are too few to reveal the complex variation mechanism of runoff, but the increase the number of input features will introduce some redundant features; 2) if the number of input features are too large, it will increase the observation error of samples and the complexity of the model, which is not good for training the model and could weaken its promotion potential. Therefore, this study adopted the method of gradually increasing the number of input features in order of the feature importance, and selected the best predictors set according to the performance of each model during the testing period.

FIGURE 4

FIGURE 4. The results of NSE (A), KGE (B), RMSE (C), and R (D) of each model for 10 input scenarios. The best input scenario is selected by the red circle. The models represented by M1-M4 are shown in Table 2.

4.2 Monthly runoff prediction simulation

The statistical results of the prediction accuracy evaluation indicators of each model during the training period and the testing period were summarized in Table 2. The training period is the learning phase of the model, and the quality of the learning will directly affect the actual runoff prediction effect. In practical engineering, the performance of the model during the testing phase is usually more concerned, because it reflects the generalization ability and practical application effect of the model. As shown in Table 2, the overall performance of FIA-ML models showed better during the testing period, compared with the runoff time series analysis models. Furthermore, among the FIA-ML models, HKSVM and ELM have better runoff forecasting ability, which further demonstrates that choosing an appropriate machine learning algorithm is also a way to improve the accuracy of runoff prediction. Besides, we can see that GRNN is the best model during the training period, but its performance is not good enough during the testing period. Obviously, GRNN shows overfitting, which is because GRNN is too sensitive to the samples appearing in the learning stage, resulting in the lack of ability to explore the general variation of out-of-sample data. In contrast, HKSVM combines the advantages of polynomial kernel and Gaussian radial basis kernel function, and has relatively strong learning ability and generalization ability. To further compare the effect of each model in practical application, we show the fitting quality of each model during the testing period Figures 5,6,7,8.

TABLE 2

TABLE 2. Prediction accuracy evaluation metrics of different models during training and testing period.

FIGURE 5

FIGURE 5. Comparison of observed and fitted runoff by different models for 2007–2016 (testing period).

FIGURE 6

FIGURE 6. Boxplot of absolute residuals (A) and Taylor diagrams (B) of the results of monthly runoff prediction during the testing period. The models represented by M1–M6 are shown in Table 2.

FIGURE 7

FIGURE 7. Scatter plot of observations and predictions by HKSVM (A), ELM (B), GRNN (C), MLR (D), LSTM (E), and SARIMA (F) during the testing period.

FIGURE 8

FIGURE 8. Observations and predictions of peak monthly runoff with different models during the testing period are displayed in (A), and the number of years that the predictions of each model meet the design requirements is displayed in (B). The models represented by M1-M6 are shown in Table 2.

Figure 5 shows the fitting performance of the 120-month runoff observations and the predictions of the monthly runoff prediction models during the testing simulation phase. The monthly runoff of YLX during the wet season fluctuates obviously across different years. The FIA-ML models can relatively better capture the variation of monthly runoff, especially with HKSVM and ELM regression models.

Figure 6A is a boxplot of absolute residuals, which can describe the distribution of the difference between the predicted and observed values. Figure 6B is a Taylor diagram (Taylor, 2001), which skillfully integrates the correlation coefficient, centered root mean square error, and standard deviation into a polar plot, avoiding the limitations of a single evaluation metric. Therefore, it can more intuitively show the difference between the predicted and observed values. One can see from Figure 6 that the simulation prediction results of M1 (HKSVM) and M2 (ELM) are closer to the actual runoff observations. This indicated that ELM and HKSVM have more robust capabilities to reveal the correlation between predictors and observed runoff.

Figure 7 shows the linear relationship between the observed runoff and the simulated value obtained by different prediction methods during the testing period. It is found that both the FIA-ML models and the runoff time series analysis models are feasible in tracking the dynamic changes of monthly runoff. In addition, ELM has the best effect from the perspective of $R^{2}$ , and MLR is the worst. While if we analyze from the trend line of linear fitting, HKSVM is closer to the actual trend line, and the runoff time series analysis models perform worse.

In practical engineering, the prediction accuracy of peak monthly runoff is more significant than runoff in other months. For this reason, we compared the prediction effect of each model on peak monthly runoff during the testing period. According to the standard for hydrological information and hydrological forecasting formulated by the Ministry of Water Resources of China, for medium and long-term runoff prediction, when the relative error of the prediction result is less than 20%, it is considered a valid prediction value (Zhang et al., 2011). As displayed in Figure 8, M1 (HKSVM) and M2 (ELM) have eight years to meet the design requirements, but the performance of M3 (GRNN) and M6 (SARIMA) are relatively poor, with only five years to meet the design requirements.

4.3 Discussion

When using ML for runoff prediction, the choice of predictors is critical. Because the quality of the effective information carried by the features will directly affect the accuracy of runoff prediction. Our simulations revealed that appropriately increasing the number of features can improve the accuracy of monthly runoff prediction. However, too many input features will increase the complexity of training the model and reduce the ability to capture the general changing characteristics of runoff process. In this study, the NSEs of FIA-ML models (M1-M4) were initially 0.878, 0.866, 0.815, and 0.818. When the optimal number of input features for each model was reached, their NSEs were improved by 3.4%, 5.2%, 7.1%, and 8.2%, respectively. However, if the number of input features continue to increase, the accuracy of runoff prediction will not continue to get better. This should be mainly because too many inputs will make the model training subject to some abnormal information interference, and reduce the training efficiency.

Although RF is an effective tool to measure the feature importance, it also has some shortcomings. One of the most important issues is that if there are too many input features in RF, the efficiency and accuracy of calculating the importance score of each feature will be affected. Thus, this study first uses MI to select the top 50 factors related to the observed runoff, which is equivalent to a rough screening from numerous possible factors. The method combined RF with MI can effectively overcome the shortcomings of the single use RF or MI to select predictors as the inputs of the ML models.

Compared with the TSAM, the FIA-ML model can improve the NSE by up to 5%, and more importantly, can explain the variation characteristics of runoff from the physical meaning. However, it should be noted that although the FIA-ML models can improve the prediction accuracy to a certain extent, some issues should be noted in the application. One is that the meteorological information we used is relatively less in this study, so, in future work, we can further consider more meteorological factors related to prediction of monthly runoff. Another is that although the IPSO can effectively improve the global search ability of particles, there is still the problem of falling into local convergence, so it is necessary to explore a more efficient methods for improving the global search ability in selecting model hyper-parameters. In addition, simply depending on the correlation between the runoff process and these physical factors is not enough to fully reveal the variation characteristics of runoff, since the runoff prediction system is open and complex. Therefore, the system dynamics characteristics of the runoff process can be considered in machine learning in further studies.

5 Conclusion

We proposed the model by combining ML with the feature importance analysis (FIA-ML), which can select key predictors from the numerous physical factors and effectively integrate hydrometeorological information and teleconnection climate factors into the ML models. This paper verified the applicability of the model in the YLX basin, and compared with the traditional time series analysis model (TSAM). Under changing environments, the TSAM cannot accurately capture the impact of climate change on the characteristics of runoff variability. By contrast, the FIA-ML models not only have better runoff prediction ability, but can, more importantly, explain which physical factors have a significant impact on the runoff in the YLX basin.

The FIA-ML models can effectively improve the learning efficiency of ML models and the accuracy of runoff prediction. Especially, HKSVM and ELM optimized by IPSO have a good fitting ability for the relationship between observed runoff and input predictors. Therefore, the FIA-ML model is a useful attempt to improve the accuracy of runoff prediction by establishing the teleconnection between climate change and runoff change.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

LY: conceptualization, formal analysis, writing-original draft, writing-review and editing, supervision. QL: methodology, writing-original draft, writing-review, and editing. CJ: writing-original draft. PY: writing-review and editing. BL: writing-review and editing. ZR: writing-review and editing. ZL: writing-review and editing.

Funding

This study is financially supported jointly by the National Natural Science Foundation of China (No. 51909053, U2240201, 51909112), the Youth Foundation of Education Department of Hebei Province (QN2019132), Xingtai City Science and Technology Bureau (2021ZZ032), all of which are greatly appreciated.

Acknowledgments

We are very grateful to the editors and reviewers for their critical comments and thoughtful suggestions.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Bian, Y., Sun, P., Zhang, Q., Luo, M., and Liu, R. (2022). Amplification of non-stationary drought to heatwave duration and intensity in eastern China: Spatiotemporal pattern and causes. J. Hydrol. X. 612, 128154. doi:10.1016/j.jhydrol.2022.128154

Climate-informed monthly runoff prediction model using machine learning and feature importance analysis

1 Introduction

2 Study area and data

2.1 Hydrological characteristics of the study area

2.2 Meteorological and large-scale climate factors

3 Methodologies

3.1 Mutual information based on k-nearest neighbors

3.2 Feature importance analysis based on random forest

3.3 Regression model

3.3.1 Hybrid kernel support vector machine

3.3.2 Extreme learning machine

3.3.3 General regression neural network

3.3.4 Multiple linear regression

3.4 Hyper-parameters optimization for regression models

3.5 Model performance evaluation metrics

3.6 Monthly runoff prediction using the proposed feature importance analysis and machine learning model

4 Results and discussion

4.1 selection of model input predictors

4.2 Monthly runoff prediction simulation

4.3 Discussion

5 Conclusion

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

References

94% of researchers rate our articles as excellent or good