Investigating Predictability of the TRHR Seasonal Precipitation at Long Lead Times Using a Generalized Regression Model with Regularization

Peng, Xiao; Li, Tiejian; Albertson, John D.

doi:10.3389/feart.2021.724599

ORIGINAL RESEARCH article

Front. Earth Sci., 10 August 2021

Sec. Atmospheric Science

Volume 9 - 2021 | https://doi.org/10.3389/feart.2021.724599

This article is part of the Research TopicMachine Learning in Natural Complex SystemsView all 15 articles

Investigating Predictability of the TRHR Seasonal Precipitation at Long Lead Times Using a Generalized Regression Model with Regularization

Xiao Peng¹

Tiejian Li²

John D. Albertson¹*

¹School of Civil and Environmental Engineering, Cornell University, Ithaca, NY, United States
²Department of Hydraulic Engineering, Tsinghua University, Beijing, China

Skillful long-lead climate forecast is of great importance in managing large water systems and can be made possible using teleconnections between regional climate and large-scale circulations. Recent innovations in machine learning provide powerful tools in exploring linear/nonlinear associations between climate variables. However, while it is hard to give physical interpretation of the more complex models, the simple models can be vulnerable to over-fitting, especially when dealing with the highly “non-square” climate data. Here, as a compromise of interpretability and complexity, we proposed a regression model by coupling pooling and a generalized regression with regularization. Performance of the model is tested in estimating the Three-Rivers Headwater Region wet-season precipitation using the sea surface temperatures at lead times of 0–24 months. The model shows better predictive skill for certain long lead times when compared with some commonly used regression methods including the Ordinary Least Squares (OLS), Empirical Orthogonal Function (EOF), and Canonical Correlation Analysis (CCA) regressions. The high skill is found to relate to the persistent regional correlation patterns between the predictand precipitation and predictor SSTs as also confirmed by a correlation analysis. Furthermore, flexibility of the model is demonstrated using a multinomial regression model which shows good skill around the long lead time of 22 months. Consistent clusters of SSTs are found to contribute to both models. Two SST indices are defined based on the major clusters of predictors and are found to be significantly correlated with the predictand precipitation at corresponding lead times. In conclusion, the proposed regression model demonstrates great flexibility and advantages in dealing with collinearity while preserving simplicity and interpretability, and shows potential as a cheap preliminary analysis tool to guide further study using more complex models.

1 Introduction

Skillful long-lead (seasonal to annual) climate forecast is of great importance in managing large water systems. Examples include but are not limited to making water transferring plans for multi-reservoir systems running at annual to inter-annual time scales (Carpenter and Georgakakos, 2001; Block, 2011), informing long-term agricultural decision making (Lemos et al., 2002; Hansen et al., 2011), and developing early warning system for disaster mitigation (Wilhite and Svoboda, 2000; Verdin et al., 2005). While local climate variability always fails to persist through such long lead times, the prediction can be made possible using long-lead teleconnections between regional climate and large-scale circulations. Anomalies of large-scale atmospheric circulations can be anchored by ocean memory due to massive heat capacity of ocean water and be released to perturb other circulations at a much later time (Xie et al., 2009; Xie et al., 2016). These perturbations can therefore be indicated by SST anomalies. There are already well-established SST-based climate indices that have seen good use in long-lead climate forecasts such as the Niño SST indices (Rasmusson and Carpenter, 1982; Trenberth, 1997; Trenberth and Stepaniak, 2001), the Pacific Decadal Oscillation (PDO) (Mantua et al., 1997; Zhang et al., 1997), the Tropical Northern Atlantic (TNA) and the Tropical Southern Atlantic (TSA) indices (Enfield et al., 1999) etc.

Approaches commonly used in developing long-term prediction models based on large-scale teleconnections can be roughly categorized into two classes: 1) physically-based simulation and 2) statistical models. While the physically-based simulation is widely used in investigating causality chains of climate processes, it is usually computationally intensive and requires expertise for parameter calibration (Menemenlis et al., 2005; Sahastrabuddhe and Ghosh, 2021). Its statistical counterpart, in the meantime, provides an easy access to examining statistical associations between climate variables which could be further used to develop prediction models. The statistical models are becoming increasingly popular thanks to the advances of sensing technology and internet which makes tremendously more data available at exceptionally high temporal and spatial resolutions (Liu, 2015). Early efforts of statistical modelling are featured by qualitative analysis comparing time series of different climate variables (Thornthwaite, 1948; Von Storch and Zwiers, 2001). Most early work relied on insights of the expert researchers and were done with data of rather limited size. Recent innovations in machine learning have developed powerful tools for examining linear/non-linear associations between climate variables in massive volumes in a more automated way. Just to name a few examples here: Kernelization is used to extend study domain from linear associations to nonlinear associations (Ali et al., 2019; Bueso et al., 2020). Data processing tips like pooling and convolution are used to enhance model robustness by discarding/smoothing noises (Devineni and Sankarasubramanian, 2010; Schepen et al., 2018). Of the many machine learning approaches, neural network has become extremely popular across a wide range of spatial scales (local-global) (Goddard et al., 2001; Mekanik et al., 2013; Fan et al., 2015; Ham et al., 2019; Reichstein et al., 2019). Ham et al. (2019) even successfully extended lead time of skillful ENSO forecast to one and a half years using a convolution neural network trained on historical simulations, which beat many state-of-the-art dynamical systems in terms of correlation skill for the Niño 3.4 Index. However, even though efforts are being made to improve model interpretability (Gilpin et al., 2018; Carvalho et al., 2019; Worland et al., 2019), tools for explaining the machine learning models are still insufficient (Gilpin et al., 2018) and to find physical interpretation of these models is usually hard or even impossible due to high model complexity. In this study, we looked at a generalized regularized regression method (i.e., elastic net) coupled with pooling as a compromise between model interpretability and complexity, and examined its performance in predicting regional seasonal precipitation based on large-scale SST anomalies.

Regression has been broadly used in climate research and related fields. Typical applications include 1) change point detection (Solow, 1987; Mudelsee, 2000), 2) developing forecast models (Krishnamurti et al., 1999; Mekanik et al., 2013; Kharin and Zwiers, 2002), and 3) identification of covariates with high predictive skill (Wakabayashi and Kawamura, 2004; Matsui and Konishi, 2011). Not only can the regression model identify the linear relationships between the climate variables at a given temporal basis (e.g., monthly or annual), but it also has good interpretability for guiding further research using more complex, nonlinear statistical methods or physically-based modeling experiments. These two features make regression especially popular in exploring teleconnections between regional climate and large-scale circulations. Hurrell (1996) used a multivariate linear regression model to link changes in northern hemisphere temperature to extratropical climate indices. Krishnamurti et al. (1999) developed a superensemble method for improving weather and climate forecast skills by using coefficients from multiple regressions. Wakabayashi and Kawamura (2004) extracted four major teleconnection patterns in predicting Japan summer climate anomalies by combining the empirical orthogonal function (EOF) and regression. Van Oldenborgh and Burgers (2005) developed a synthetic precipitation generator with regression models using the Niño 3.4 Index as the sole regressor to examine decadal variation in global ENSO-precipitation teleconnections. Yang and DelSole (2012) used the regression coefficient maps to explore teleconnections between ENSO and different climate fields. More recently, Zhang et al. (2020) examined teleconnections between the Arctic sea ice decline and major climate indices using the quantile regression analysis. Only a few examples are listed here for context, as our intent is to test a generalized regression model with regularization in long-term seasonal precipitation forecast instead of doing a thorough review applications of regression in climate research.

However, over-fitting and over-parameterization are important issues for most regression analysis. These issues are particularly pertinent in climate research, since the remote sensing data are usually highly “non-square” (i.e., the total number of time series largely exceeds the length of the time series). Therefore, it is necessary to reduce the effective dimensionality of the problems. Two commonly used approaches have been: 1) to select only a few predominant features/patterns (e.g., the Principal Component Analysis, PCA (Schoof and Pryor, 2001; Wakabayashi and Kawamura, 2004; Li et al., 2020) or the Canonical Correlation Analysis, CCA (Mo, 2003; Yang and DelSole, 2012)); 2) to use only a few well-established climate indices (Van Oldenborgh and Burgers, 2005; Rust et al., 2015; Tan and Shao, 2017; Zhang et al., 2020). However, both methods have intrinsic disadvantages: Traditional dimensionality reduction methods like PCA and CCA try to decompose the global covariance structure of the predictors (PCA) or between the predictand and the predictors (CCA) and can miss important regional patterns while the climate indices are only defined by prior knowledge and thus could limit the domain where we want to explore the potential teleconnections.

In the past few decades, regularization has become increasingly popular in dealing with multicollinearity in regression. The two regularization approaches commonly used with regression are the L-1 norm (the least absolute shrinkage and selection operator, LASSO (Tibshirani, 1996)) and the L-2 norm (the ridge regression (Hoerl and Kennard, 1970)) of regression coefficients. Other popular regularization approaches include the Akaike’s Information Criterion (AIC) (Akaike, 1998) and the Bayesian Information Criterion (BIC) (Schwarz, 1978). Both L-1 and L-2 norm regularizations have shown good performance in alleviating or avoiding over-fitting in regression models in climate research (Matsui and Konishi, 2011; Soleh et al., 2015; DelSole and Banerjee, 2017; Kim et al., 2017; Li et al., 2020). Yet, it should be noted that the ridge regression does not directly provoke sparsity of the regression model while the LASSO regression tends to assign non-zero value to only one of many correlated predictors which can make the model difficult to interpret. The lack of interpretability of the lasso model is also pointed out in a recent paper from Stevens et al. (2021) where a graph-guided variation is used as an extra regularization to improve robustness of the regression model in predicting Southwestern United States winter precipitation. Here, we propose to use the elastic net regularization (Zou and Hastie, 2005) which linearly combines the LASSO and ridge regression regularizations. While the LASSO regularization guarantees sparsity of the model, the ridge regression regularization helps improve visualization of the regression coefficient map and therefore, interpretability of the model (Peng et al., 2020). On top of that, a pooling layer is added before developing the regression model. The pooling layer is commonly used in machine learning for reducing spatial dimensions (Zeiler and Fergus, 2013; Yu et al., 2014; Kalchbrenner et al., 2014). And this extra pooling layer should help improve robustness of the model by avoiding the realistic problem that major “hot” regions defining large-scale circulations are not fixed to certain spatial grids naturally.

The proposed model is tested to predict the Three-Rivers Headwater Region (TRHR) wet-season precipitation using the Pacific Ocean and Indian Ocean SSTs. The TRHR, located in the eastern Tibetan Plateau (TP), is often called China’s Water Tower as from it flow the three major rivers of China: the Yellow River, the Yangtze River, and the Lancang (Mekong) River. Consequently, the TRHR plays a critical role in providing invaluable ecological goods and serviced as well as other resources like energy and food. While great efforts have been devoted to studying teleconnections between the broader TP precipitation and large-scale climates (Benn and Owen, 1998; Shaman and Tziperman, 2005; Feng and Zhou, 2012; Dong et al., 2020), the quantitative studies focusing solely on the TRHR are rather limited and only looks at short lead times (Zhang et al., 2019; Zhao et al., 2019). In this study, we extend the forecast lead time up to 24 months and the performance are compared against some widely-used regression methods including OLS multi-linear regression, the EOF regression, and the CCA regression. The precipitation is predicted in true amplitudes and binary states (wetter or drier than normal) to demonstrate flexibility of the model. In this study, we seek a model that is computationally tractable for fast decision making support for stakeholders while retaining a relatively direct physical interpretation to aid further investigation of the underlying physical processes.

2 Data

We base our analysis on monthly precipitation data from Jan 1981 through Dec 2019 as collected from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS). The original gridded precipitation data incorporates satellite data with in-situ station data, with a resolution of 0.05° by 0.05° (Funk et al., 2015). In this study, we spatially averaged the TRHR precipitation over a rectangular area masking 89°E to 103°E and 31°N to 37°N as shown in Figure 1A. The monthly climatology for the spatially averaged precipitation is monomodal showing that over 80% of the annual precipitation falls during the 5-month period of May-Sept, which we define as the wet season in this study (also known as the growing season for the TRHR (Chen et al., 2020)).

FIGURE 1

FIGURE 1. (A) The study region of TRHR (black box) as plotted in an elevation map based on ETOPO-5 (Center, 1988) and (B) time series of the spatially averaged precipitation from CHIRPS (red) and CMA (blue) after standardization.

The CHIRPS precipitation is double checked against the station-based precipitation (1988–2017) collected from the China Meteorological Administration (CMA). 55 stations with missing data ratio lower than 20% within the study region are selected and the arithmetically averaged precipitation time series is compared against that from the CHIRPS precipitation. A systematic shift is observed around 2000 for both precipitation though the shift is less significant for the station-based precipitation. While difference in the shifts can be due to non-uniform distribution of the CMA stations, we do not want to diverge into this topic. Instead, the shift is removed by separately standardizing the precipitation over 1981–2000 and 2001–2019. The standardization is done by subtracting the mean and dividing by the standard deviation. The standardized CHIRPS precipitation shows good consistency with the station-based precipitation as shown in Figure 1B, and correlation is 0.67 for 39 samples (p-value < 0.01). The binary TRHR precipitation is used in the multinomial regression model and the two states are defined as: 0 or dry for standardized precipitations smaller than 0 and 1 or wet for standardized precipitations greater than 0.

SST is selected as the primary predictor since it can indicate perturbations in large-scale atmospheric circulations “anchored” in ocean memory (Xie et al., 2009, Xie et al., 2016). Also, the SST field is less spatially heterogeneous compared to that of other common climate variables including geopotential height, vertical velocity of atmosphere (OMEGA) and wind velocities (Peng et al., 2020), which can help improve robustness of the regression models. Monthly SST data is collected from the Hadley Centre Sea Ice and Sea Surface Temperature (HadISST) data set with a spatial resolution of 1° by 1° (Rayner et al., 2003) over Jan 1979–Dec 2019. Only SSTs from the Pacific Ocean and Indian Ocean basins are used to limit our study to regional processes and the basin range is based on the definitions from the National Oceanic and Atmospheric Administration (NOAA) via https://www.nodc.noaa.gov/woce/woce_v3/wocedata_1/woce-uot/summary/bound.htm. Similar positive shifts are observed for most parts of the Pacific Ocean and Indian Ocean as seen in Supplementary Figure S1. To be consistent with the standardization of precipitation, the SSTs are too standardized separately for 1979–2000 and 2001–2019 to remove effects of the trends. This step is to ensure that model skill as measured by the correlation coefficient in the later sections will not be biased by the trends. The only difference here is that standardization of the SSTs is done locally for each grid and uses the monthly climatology means and standard deviations to remove the seasonal cycle.

3 Methods

3.1 The Regularized Regression

Here, we propose a two-step generalized regression model with regularization for dealing with collinearity when developing linear prediction models. The model first reduces dimensionality of the predictors using pooling which is a commonly used method for down-sampling input representations. Then a regularized regression model is fitted using the pooled predictors to estimate real-valued or categorical predictand. A schematic is shown in Figure 2. In the pooling step, a new grid (in green) is defined by some characteristic values (e.g., maximum or median) of the four small grids. The extra step of standardization is to remove the systematic shift in precipitation around 2000 and to rescale precipitation and SSTs (as they have different amplitudes). In this study, we compared model performance using different pooling approaches (i.e., maximum/median/minimum pooling) with a squared window of four grids by four grids.

FIGURE 2

FIGURE 2. A schematic of the proposed the regression model coupling max pooling and elastic net.

The regularized regression is given by Eq. 1 with regression coefficient (β, β₀) (β₀ is the intercept), and it allows flexibility by using different deviance function (Dev) for predictands of different types. For example, the mean squared error (MSE) function is used for estimating the real-valued predictand and the log-likelihood function is used for categorical predictand (Hastie et al., 2009). The deviance function is rescaled by one over the total sample length N. The elastic net regression is adopted here and the regularization term uses a linear combination of the L-1 and L-2 norms of the regression coefficients as shown in Eq. 2.

(β, β_{0}) = \min_{β, β_{0}} (\frac{1}{N} D e v (β, β_{0}) + λ P_{α} (β)) (1)

P_{α} (β) = \frac{1 - α}{2} {‖ β ‖}_{2}^{2} + α {‖ β ‖}_{1} (2)

There are two hyperparameters in the model: α and λ. α balances the regularization between the L-1 and L-2 norms of the regression coefficients β and is set to 0.01 for better visualization (Peng et al., 2020). λ is usually decided using a k-fold (e.g., 5-fold) cross validation (CV) (Tibshirani, 1996) and the λ value associated with minimum cross-validated mean squared errors is used (often referred to as the MinMSE λ). However, this procedure can be computationally burdensome since we have to repeat the CV for all lead times. Therefore, a constant λ is firstly determined using the training set data at 0-months lead and is used for all lead times. A preliminary study demonstrates that the significant predictive skill spikes in the testing period are not sensitive over a rather broad range of λs as shown in Supplementary Figure S2. For the true-amplitude predictand, the Pearson’s correlation coefficient (CC) and the Nash–Sutcliffe efficiency (NSE) score are used for model performance evaluation. For the binary predictand, an accuracy score S is defined as given by Eq. 3 where 1 is an indicator function and $\hat{y}$ is the predicted probability of y being 1 (i.e., wet).

S = \frac{1}{N} \sum_{i = 1}^{N} \{y_{i} \cdot 1 ({\hat{y}}_{i} \geq 0.5) + (1 - y_{i}) \cdot 1 ({\hat{y}}_{i} < 0.5)\} (3)

3.2 Other Regression Models

Performance of the elastic net is compared against some commonly used regression methods in the two-step scheme including the OLS mutil-linear regression (see Hurrell (1996) for details), the EOF regression (see Wakabayashi and Kawamura (2004) for details), and the CCA regression (see Sun and Kim (2016) for details). All regression methods use the same pooled SSTs as the predictors. Though pooling can alleviate the issue of over-fitting, dimensionality of the pooled predictors is still highly non-square (39 years by 1,343 grids). The OLS and CCA regression seek for a linear combination of predictors that maximizes its correlation with the predictand and does not regularize the model complexity. The EOF regression first projects the original predictors onto some “dominant” basis vectors (often referred as EOFs) by decomposing the covariance matrix of the predictors, and then uses the EOFs as the new predictors. It can implicitly regularize the model complexity by using only a few EOFs explaining most variance of the original predictors. The EOF is implemented such that the original (pooled) SSTs are projected onto a set of orthonormal time series which constitute the predictors. It should be noted that it impossible to develop a prediction model this way since we are using data from testing set to construct the basis vectors. Here, the most dominant 50 EOFs accounting for over 88% variance of the original SST data are used in the EOF regression model.

3.3 The Correlation Analysis

A correlation analysis is designed to measure if any correlation patterns between the predictand precipitation and the predictor SSTs persist through time. We propose a new correlation metric L to quantitatively measure persistence of any correlation patterns: for a certain lead time, we first compute lagged correlations between the TRHR precipitation and SSTs at every grid for 1981–2000 and vectorize the correlation map into a column vector denoted by M₁; then this step is repeated for 2001–2019 to compute the column vector M₂; at last, L is defined by computing the correlation coefficients between the vectorized correlation maps M₁ and M₂. L is bounded by a upper limit of 1, which represents the extreme scenario where the correlations between the TRHR precipitation and the SSTs are perfectly consistent before and after 2001 and therefore, a good regression model trained on 1981–2000 should produce significantly high predictive skill on the testing period of 2001–2019. However, L being close to zero does not necessarily mean no predictive skill for regression models since L measures persistence of the global correlations between the TRHR precipitation and the SSTs while the regression model could pick some regional clusters of SSTs that have a persistent correlation with the predictand precipitation. The metric L is used here to estimate how much degradation of performance is resulted from over-fitting by comparing against the testing period predictive skill from the regression models.

4 Results and Discussions

4.1 Comparison of Regression Models

Performance of the regularized regression models in predicting the TRHR precipitation in true amplitudes is examined in this section. The period of 1981–2000 is set as the training period while 2001–2019 is set as the testing period to be consistent with the standardization procedure. Prelim analysis with randomly split data sets demonstrated consistent model skill patterns as shown in Supplementary Figure S3. Since both pooling and regularization are designed for effective dimensionality reduction and thus to avoid over-fitting, we first justify using the extra step of pooling by comparing predictive skill of the regularized regression models with and without pooling. The comparison of testing period correlation coefficients are shown for regression models without pooling and with maximum, median, and minimum pooling in Figure 3. Two spikes are observed at lead times of 13–14 months and 21–24 months. Significant improvement in model skill is shown for lead times of 13–14 months when pooling is used and at the lead time of 13 months, the predictive skill drops to below p-value = 0.1 significance level using non-pooled SSTs. For lead times of 21–24 months, consistent improvement, though less significant, is observed. The improvement could be due to the fact that while there exist some consistent large-scale circulation patterns, the signals may not be fixed to certain grids depending on the spatial resolution and projection coordinate system. Therefore, the model robustness can be improved by including signals of the neighboring grids with pooling. However, though not examined here, one must be careful with choosing the pooling window size since displacement of some circulation patterns can be important indicators of climate anomalies (McGregor et al., 2014; Manatsa et al., 2014) and this information may not be resolved when the pooling window is too large. It should be noted that the λ is re-calibrated for the regression models using non-pooled SSTs. And the statistical significance for the regularized regression is not straightforward to calculate and thus is not reported (Javanmard and Montanari, 2014). The maximum pooling is used in the following analyses.

FIGURE 3

FIGURE 3. Comparison of the predictive skill in the testing period from the elastic net regression models with no pooling (dark blue), maximum pooling (light blue), median pooling (green) and minimum pooling (yellow). The p-value = 0.1 significance level is plotted in the black dashed line.

Model skill for the testing period data as measured by the NSE scores is reported in Figure 4. Consistent patterns are observed as two spikes of NSE scores are found around lead times of 14 and 22 months. However, even at those two lead times, the predictive skill is barely satisfactory. By further looking at comparison between the observed and estimated precipitation, we figured that the elastic net model markedly underestimated the predictand amplitude. However, the extent of shrinking is consistent across the training and testing periods (as shown in Supplementary Figure S4A) and thus, in practice, one could ‘learn’ how much the amplitude is shrunk by looking at the training data and can then rescale testing period estimations. The rescaled estimations show significantly improved NSE scores (NSE = 0.36 for lead time of 14 months and 0.38 for lead time of 22 months (as shown in Supplementary Figure S4B)). A plausible explanation is that the elastic net regression sacrifices accuracy in amplitude estimation for model robustness by selecting only a few predictors and shrinking amplitudes of regression coefficients. This effect is more significant with highly non-square data as in our case since greater regularization must be applied. Therefore, amplitude-based measures such as NSE and the root mean square error (RMSE) may not be applicable for model evaluation. The rescaling method discussed above is not recommended since the model is designed to be biased for more robustness, and in following analyses, the Pearson’s correlation coefficient is used as the primary measure of model skill.

FIGURE 4

FIGURE 4. Predictive skill for the testing set as a function of lead time reported in the NSE scores.

We then justify using the regularization by comparing predictive skill of the regression with and without regularization. Comparison of the model skill from the elastic net, OLS, EOF, and CCA regression models are shown in Figure 5. Statistically significant positive predictive skill is only observed for the elastic and CCA regressions models. The elastic net models show two spikes of good predictive skill at lead times of 13–14 months and 21–24 months. Statistically significant positive skill is observed for the CCA regression models only at the lead time of 18 months while that of the elastic net regression almost hit the p-value = 0.1 significance level at the lead time of 19 months. Overall, the elastic net regression shows more potential in finding the linear associations between the TRHR precipitation and the SSTs. While statistically significant positive skill are only found at rather long lead times, this does not necessarily mean that there is no connection between the TRHR precipitation and large-scale climate fields at shorter lead times. We are limiting our analysis to only using SSTs from the Pacific and Indian Oceans which is only one sector of the complex large-scale circulations including a wider range of variables like geopotential heights, humidity, vertical velocity of atmosphere (OMEGA) and horizontal winds etc. Furthermore, we are limiting our analysis in the frame of linear models as we compare different types of regression models. Instead of developing accurate forecast models, our intent is to examine how pooling and regularization would improve performance of the linear models at rather low costs. The better performance of elastic net is understandable here since it explicitly regularizes model complexity and provokes sparsity in regression coefficients by using the L-1 norm regularization.

FIGURE 5

FIGURE 5. Prediction skill for the testing set of 2001–2019 as measured by the correlation coefficients as function of lead times for elastic net (red), OLS (blue), EOF (green), and CCA (yellow). The p-value = 0.1 significance level is plotted in the black dashed lines.

4.2 Source of High Model Skill

A correlation analysis is conducted to measure at a certain lead time, how well a linear model based on the global correlation between the TRHR precipitation and the SSTs can perform. The potential predictive skill is estimated by the new correlation metric L defined earlier as L measures how the time-shifted global correlation patterns persist from 1981 to 2000 (the training period) to 2001–2019 (the testing period). A comparison between L and the predictive skill of the elastic net model at varying lead times is shown in Figure 6. Spikes in L are observed at lead times of 7, 14, 18, and 22 months. Three of the spikes coincide with good predictive skill from the elastic net model (i.e., lead times of 14, 19, and 22 months) while only one of the spike coincide with good skill from other regression models (i.e., lead time of 18 months for the CCA regression model). To estimate how much potential are realized for each model, correlation coefficients between the series of L and model skill from regression models over the lead times of 0–24 months are computed and reported in Table 1. The only statistically significant correlation is found for the elastic net model (0.82 for 25 samples, p-value < 0.01) while rather low correlations are found for other regression models. The results suggest that the OLS, EOF, and CCA regression models do not perform well even when there exist persistent correlations between the predictand precipitation and the SSTs.

FIGURE 6

FIGURE 6. Comparison between the predictive skill from the elastic net model (bar) and Ls (blue diamond) at varying lead times. The statistical significance levels are not shown here as we are comparing correlation coefficients calculated using samples of different lengths (19 for the predictive skill and 39 for the Ls).

TABLE 1

TABLE 1. Correlations between L and prediction skill at varying lead times for different regression models.

Possible explanations are proposed here based on algorithms of the regression methods. For the OLS and CCA regression, the models could be over-fitted to the noisy SST signals for high training skill as both methods decompose the covariance between the predictand precipitation and SSTs to seek a linear combination that either minimizes the MSE (for OLS) or maximizes the correlation (for CCA). Thus, the models are less robust and can perform poorly when evaluated using the testing period samples. As for the EOF regression, the EOF re-constructs the predictors by projecting the global covariance of the SSTs onto some dominant orthonormal basis vectors (EOFs). There are two limiting factors: 1) the assumption of orthogonality may not be appropriate as the new predictors are constructed from the physical variable of SST; 2) any regional persistent correlation patterns between the TRHR precipitation and the SSTs could be lost if they do not make significantly large contribution to the global covariance. Technically, we are not predicting the TRHR precipitation with the EOF regression models since data of the full study period (1981–2019) is used for constructing the new set of predictors (i.e., EOFs).

Interestingly, comparably high predictive skill are observed for the elastic net models at lead times of 14 and 22 months while persistence of the global correlation is much lower at the lead time of 14 months as indicated by L. We specifically looked at the correlation maps between the TRHR precipitation and the SSTs before and after 2000 for the lead times of 14 and 22 months as shown in Figures 7, 8, respectively. For the lead time of 14 months: before 2000, the correlation map features a cluster of positive correlations [180E-210E, 45S-15S] to the east of Australia and an extended band of positive correlations over the mid-north Pacific Ocean [120E-210E, 15N-30N]. Scattered and less significant positive correlations are observed over the northern Indian Ocean and to the west of South America; After 2000, the correlation map is dominated by two major clusters of positive correlations to the east of Australia [180E-210E, 45S-15S] and to the west of South America [260E-280E, 60S-15S] and one major cluster of negative correlations over the southwestern Indian Ocean [30E-60E, 60S-30S]. Less significantly positive correlations are observed over the north-western Pacific which overlaps with the extended band before 2000. For the lead time of 22 months, both correlation maps before and after 2000 are dominated by large clusters of positive correlations over the northern Indian Ocean [60E-90E, 15S-15N] and eastern tropical Pacific Ocean [210E-270E, 15S-0]. The major difference is that clusters of positive correlations over the southern-eastern Pacific and the mid-western Pacific [120E-150E, 15N-30N] get enhanced in correlation amplitude and extended in spatial coverage. A comparison between Figures 7, 8 suggests a higher level of persistence in the global correlation between the predictand precipitation and the SSTs at the lead time of 22 months, which is consistent with the higher value of L in Figure 6. However, a lower value of L does not necessarily mean low predictive skill potential for a linear model as some regional persistent correlation patterns are observed (i.e., the positive correlation cluster to the east of Australia) for the lead time of 14 months. The correlation maps are further compared with the regression coefficient maps in Section 4.3 as we attempt to interpret the high model skill of the elastic net regression.

FIGURE 7

FIGURE 7. Correlation maps between the TRHR precipitation and the SSTs over the training period of 1981–2000 (top) and the testing period of 2001–2019 (bottom) at the lead time of 14 months.

FIGURE 8

FIGURE 8. Correlation maps between the TRHR precipitation and the SSTs over the training period of 1981–2000 (top) and the testing period of 2001–2019 (bottom) at the lead time of 22 months.

4.3 An Alternative Multinomial Regression Model

In this section, the binary precipitation is predicted using the multinomial regression version (i.e., the elastic net logistic regression) of our model. While new machine learning techniques like the classification and regression tree (CART) can have better model skill in multi-class prediction of climate variables (Choubin et al., 2018; Huang et al., 2021), the elastic net logistic regression is tested to demonstrate flexibility and consistency with altered deviance functions. The multinomial regression may have more use in practical application since amplitudes of the predictand tend to be underestimated because of the regularization (Peng et al., 2020). The logistic regression is implemented by simply replacing the MSE function with the log-likelihood function for Dev (Hastie et al., 2009). Though the logistic regression is a special case of the multinomial regression, the model could be easily generalized for predictand of more than two categories by separately fitting a regularized Poisson regression for each category of which the coefficients are used to estimate the coefficients of the multinomial regression model (Baker, 1994).

To extend the sample size, the leave-one-out cross validation is used (the testing sample size is thus increased from 19 to 39 here). The leave-one-out CV is not used in the previous analysis for predicting real-valued predictand since the Pearson’s correlation coefficient is used for performance evaluation and the leave-one-out CV could result in bias for violating continuity in data. The accuracy as measured by the S score as a function of lead time is shown in Figure 9. A consistent spike of high accuracy (over 70% accuracy) is observed at around lead times of 22 months. While a local maximum of accuracy is observed at the lead time of 14 months, it is not statistically significant. A major reason could be that the S score is not the equivalent measure of the correlation coefficient as we used in Figures 3, 5. A example of high correlation but low S is when the model could well predict the extreme events but performs poorly for the less extreme events.

FIGURE 9

FIGURE 9. Accuracy of predicting the wet-dry state of TRHR precipitation at varying lead times. The base skill (the null model of totally random guess) is plotted in the black dashed line. The one-tailed p-value = 0.1 significance level is estimated using bootstrapping and is plotted in the blue dashed line. The mean accuracy using a totally random guess strategy for 39 samples is collected from 10,000 repeated experiments and the 90th quantile value is used as estimation of the p-value = 0.1 significance level.

Consistency between the regression models using predictand and deviance function of different types is further examined by comparing the maps of the regressions coefficients for lead times of 14 and 22 months as shown in Figures 10, 11, respectively. For the lead time of 14 months, both models feature a major cluster of positive coefficients to the east of Australia while more non-zero coefficients are observed for the logistic regression model over the southeastern Pacific Ocean and the Indian Ocean. As for the lead time of 22 months, both maps are dominated by the large cluster of positive coefficients over the eastern tropical Pacific Ocean. While positive coefficients are observed over the northern Indian Ocean for both models, the coefficients are more sparsely distributed for the model using the true-amplitude predictand. More non-zero coefficients are found for the logistic regression model at both lead times, which could be due to a less optimal regularization as we only did the cross validation on a relatively sparse sequence of λs with a lead time of 0 months. But overall, the major clusters of non-zero regression coefficients are consistent across the models. And this is also confirmed by the results that statistically significant correlations are found between the vectorized coefficient maps of the two models. The correlation coefficients are 0.45 for the lead time of 14 months and 0.44 for the lead time of 22 months (1,343 samples) and are 0.35 for the lead time of 14 months (327 samples) and 0.40 for the lead time of 22 months (359 samples) when only the non-zero coefficients are considered.

FIGURE 10

FIGURE 10. Maps of regression coefficients from the elastic net models using true-amplitude (top) and binary (bottom) predictand precipitation at the lead time of 14 months.

FIGURE 11

FIGURE 11. Maps of regression coefficients from the elastic net models using true-amplitude (top) and binary (bottom) predictand precipitation at the lead time of 22 months.

The comparably high predictive skill at the lead time of 14 months is easy to interpret if we compare the coefficient maps from Figures 10, 11 to the correlation maps from Figures 7, 8. While less persistence is observed for the global correlation at the lead time of 14 months, the elastic net model managed to select the regional persistent cluster of positive correlations to the east of Australia. At last, two SST indices are defined based on the consistent patterns from the two regression models: the SST index 1 is defined by the mean SST over the domain of [180°W-160°W, 42°S-18°S]; the SST index 2 is defined by the mean SST over the domain of [120°W-88°W, 22°S-2°N]. Lagged correlations between the TRHR precipitation and the SST indices calculated using the non-pooled and pooled SSTs are plotted in Figure 12. Statistically significant positive correlations are observed at the corresponding lead times and consistent results are shown for non-pooled and pooled SSTs. The results suggest that the proposed framework managed to select certain regional SSTs that are consistently correlated with the predictand precipitation and demonstrated good interpretability. While correlation does not necessarily imply causation, the elastic net regression models show good potential here in guiding further research with its highly interpretable and flexible models.

FIGURE 12

FIGURE 12. Lagged correlations between the TRHR precipitation and the SST indices. The p-value = 0.01 significance level is plotted in the black dashed line.

5 Conclusion

In this paper, we tested a generalized regression model with regularization coupled with pooling in predicting the TRHR wet-season precipitation at lead times of 0–24 months using the Pacific and Indian Ocean SSTs. The regression is first tested using the true-amplitude predictand and is compared against some widely-used regression models including the OLS multi-linear regression, the EOF regression and the CCA regression. Significantly good predictive skill are observed using the elastic net regression models for certain long lead times which are further examined using a correlation analysis. The results suggest that the elastic net regression achieves good performance in identifying and using the persistent correlation patterns while the other three regression models show relatively poor performance. Low model skill at shorter lead times can be due to that only SST is used as the predictor while teleconnection signals can propagate through other climate fields chronologically. A multinomial elastic net regression model is then used to demonstrate flexibility and consistency of the proposed framework. Consistent model skill and regression coefficient maps are observed even when predictand and deviance functions of different types are used. By comparing the correlation analysis and the regression coefficient maps, we found that the elastic net model managed to select regional persistent correlation patterns as the contributing predictors while the other widely-used regression models are based on the global covariance either between the predictand and the predictors or within the predictors (and thus are vulnerable to over-fitting). At last, two SST indices are defined based on the major clusters of non-zeros coefficients from the elastic net models and are found to be significantly correlated to the TRHR precipitation at the corresponding lead times. Overall, the proposed framework demonstrates good interpretability in identifying covariates with high predictive skill and the potential in guiding further investigation using more complex, nonlinear statistical models or physically based modeling experiments.

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material. The codes are processed data can be found via https://github.com/cruiseryy/TRHR_elasticnet_pooling and further inquiries can be directed to the corresponding author.

Author Contributions

The idea originated from a discussion among the three authors. XP and JA designed the study and XP carried out the experiment, and wrote the article. JA oversaw the entire project. JA and TL offered technical advice during the experiment and provided inputs to the article writing. All authors contributed to the article and approved the submitted version.

Funding

The Cornell authors were supported by the Cornell China Center under the project of “Food-Energy-Water Security in the Three-Rivers Headwater Region: Integrating Science, Data and Decision Tools to Manage Under a Changing Climate and Development Pressure”. The Tsinghua author was supported by the National Key Research and Development Program of China (No. 2016YFE0201900), and the Research Fund of the State Key Laboratory of Hydroscience and Engineering, Tsinghua University (No. 2017-KY-04).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

The rainfall data were collected from the CHIRPS (http://chg.geog.ucsb.edu/data/chirps/) and the China Meteorological Administration (CMA). The SST data were collected from the HadISST dataset (https://www.metoffice.gov.uk/hadobs/hadisst/). We thank the two reviewers for their insightful comments.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feart.2021.724599/full#supplementary-material

References

Akaike, H. (1998). “Information Theory and an Extension of the Maximum Likelihood Principle,” in Selected Papers of Hirotugu Akaike (New York: Springer), 199–213. doi:10.1007/978-1-4612-1694-0_15