A Knowledge-Aided Robust Ensemble Kalman Filter Algorithm for Non-Linear and Non-Gaussian Large Systems

Lopez-Restrepo, Santiago; Yarce, Andres; Pinel, Nicolás; Quintero, O. L.; Segers, Arjo; Heemink, A. W.

doi:10.3389/fams.2022.830116

ORIGINAL RESEARCH article

Front. Appl. Math. Stat. , 09 March 2022

Sec. Dynamical Systems

Volume 8 - 2022 | https://doi.org/10.3389/fams.2022.830116

Santiago Lopez-Restrepo^1,2,3^*

Andres Yarce^1,2,3^*

Nicolás Pinel⁴

O. L. Quintero¹

Arjo Segers⁵

A. W. Heemink²

¹Mathematical Modelling Research Group, Universidad EAFIT, Medellín, Colombia
²Department of Applied Mathematics, TU Delft, Delft, Netherlands
³SimpleSpace, Medellín, Colombia
⁴Grupo de Investigación en Biodiversidad Evolución y Conservación (BEC), Departamento de Ciencias Biológicas, Universidad EAFIT, Medellín, Colombia
⁵TNO Department of Climate, Air and Sustainability, Utrecht, Netherlands

This work proposes a robust and non-Gaussian version of the shrinkage-based knowledge-aided EnKF implementation called Ensemble Time Local H_∞ Filter Knowledge-Aided (EnTLHF-KA). The EnTLHF-KA requires a target covariance matrix to integrate previously obtained information and knowledge directly into the data assimilation (DA). The proposed method is based on the robust H_∞ filter and on its ensemble time-local version the EnTLHF, using an adaptive inflation factor depending on the shrinkage covariance estimated matrix. This implies a theoretical and solid background to construct robust filters from the well-known covariance inflation technique. The proposed technique is implemented in a synthetic assimilation experiment, and in an air quality application using the LOTOS-EUROS model over the Aburrá Valley to evaluate its potential for non-linear and non-Gaussian large systems. In the spatial distribution of the PM_2.5 concentrations along the valley, the method outperforms the well-known Local Ensemble Transform Kalman Filter (LETKF), and the non-robust knowledge-aided Ensemble Kalman filter (EnKF-KA). In contrast to the other simulations, the ability to issue warnings for high concentration events is also increased. Finally, the simulation using EnTLHF-KA has lower error values than using EnKF-KA, indicating the advantages of robust approaches in high uncertainty systems.

1. Introduction

Data assimilation (DA) is a family of mathematical methods that combine observations and models. The model is used to fill observational gaps, and the observations constrain the model dynamics [1, 2]. Most DA methods aim to minimize the variance of the estimation error. For instance, the Kalman filter (KF) is an optimal method that minimizes the mean squared error of the estimate when the dynamic system is linear [3]. The Ensemble Kalman filter (EnKF) is a Monte Carlo approximation of the KF for cases in which the state space is large and the model is non-linear [4]. The EnKF uses an ensemble of model realizations to approximate the first and second moments of the background error, making it efficient for large-scale models and suitable in the presence of non-linearities. However, in real DA applications, the assumptions required to obtain the optimal solution may not hold, degrading the filter performance [4, 5]. Additionally, small ensemble sizes may produce a poor approximation of the model uncertainty, reducing the filter accuracy or even causing filter divergence. When the system conditions do not satisfy the requirements of the KF-based methods, robust filters are a powerful and practical alternative for solving the estimation problem. Motivated by robust control ideas that have been established over many years in the field of control engineering [6], robust filters emphasize the robustness of the estimation in order to better tolerate sources of high uncertainty. Since their purpose is not optimality of the estimate, robust estimators do not require a strict statistical representation of the system and the observations [7], and they perform better than the KF-based methods in scenarios with a poor statistical representation of the uncertainty [8, 9].

There are several robust ensemble-based DA schemes based on different principles, such as the H_∞ formulation [8], the replacement of the traditional L₂ norm [10–12], robust covariance estimation [13, 14], and covariance inflation [6, 7]. The approach that we propose uses a shrinkage-based covariance estimator that improves the model robustness and performance when the ensemble size is small [15]. Additionally, our method incorporates adaptive covariance inflation closely related to the H_∞ formulation.

The uncertainty in chemical transport model (CTM) simulations can be reduced by improving the emission inventory and upgrading the meteorological data. Alternatively, one can incorporate ground data, satellite information, or vertical profiles into the simulations using DA techniques to reduce the uncertainty [16–19]. Lopez-Restrepo et al. [19] applied DA over the Aburrá Valley using the LOTOS-EUROS CTM, building on earlier applications [16–18]. The Aburrá Valley's pollution-related air quality problems have worsened over the last 10 years. Because the valley's meteorological dynamics transition between dry and rainy seasons, the air quality deteriorates dramatically twice a year, around the arrival of the Intertropical Convergence Zone (March-April, and with lower intensity in October-November) [20, 21]. During these periods, the atmospheric boundary layer remains below the canyon's rim throughout the day, trapping the pollutants from the city in the lower atmosphere. The resulting concentrations of particulate matter smaller than 10 μm (PM₁₀) and 2.5 μm (PM_2.5) remain at levels considered hazardous for the general population, leading to biannual periods of worsened air quality known locally as “environmental contingencies,” during which special measures are taken. In this study, the application of the LOTOS-EUROS CTM to reproduce PM_2.5 concentrations over the valley, integrating ground-based observations, is taken as a real-life case study.

The study is organized as follows. Section 2 describes the basic DA concepts used and introduces the derivation of the proposed method. In section 3, using numerical experiments with a low-scale model, we compare the robustness and performance of the proposed method against related DA algorithms. In section 4, we evaluate the proposed method in a complex real-life application and discuss the results in terms of its ability to reproduce particulate matter concentrations and its forecasting capability. Finally, section 5 offers some concluding remarks and outlines future work. The description of the CTM implementation is presented in the Appendix.

2. Robust Ensemble-based DA Using Prior Knowledge

In ensemble-based DA, an ensemble of model realizations

\begin{array}{l} X^{b} = [x^{b [1]}, x^{b [2]}, \dots, x^{b [N]}] \in ℝ^{n \times N}, & (1) \end{array}

is employed to estimate the first (x^b) and second (B) moments of the background error distribution, where x^b[e] ∈ ℝ^n×1 is the e-th ensemble member, and N is the total number of ensemble members. Hence

\begin{array}{l} x^{b} \approx {\bar{x}}^{b} = \frac{1}{N} \cdot \sum_{e = 1}^{N} x^{b [e]} \in ℝ^{n \times 1}, & (2) \end{array}

and

\begin{array}{l} B \approx P^{b} = \frac{1}{N - 1} \cdot Δ X \cdot {Δ X}^{T} \in ℝ^{n \times n}, & (3) \end{array}

where

\begin{array}{l} Δ X = X^{b} - {\bar{x}}^{b} \cdot 1^{T} \in ℝ^{n \times N}, & (4) \end{array}

is the anomalies matrix, ${\bar{x}}^{b}$ is the ensemble mean, P^b is the sample covariance matrix, and 1 is a vector whose components are all ones. Once an observation is available, the posterior state can be computed via an ensemble-based method such as the EnKF [4] or one of its variants, for instance the EnKS [4], EnHF [22], or 4DEnVAR [22]. The widely used stochastic EnKF computes the analysis state as a combination of the prior state and the differences between the observations and the model outputs [4]:

\begin{array}{l} X^{a} = X^{b} + P^{b} \cdot H^{T} \cdot {[R + H \cdot P^{b} \cdot H^{T}]}^{- 1} \cdot D \in ℝ^{n \times N}, & (5) \end{array}

where X^a is the analysis ensemble, H is the linear (or linearized) output operator, and the e-th column of the innovation matrix D ∈ ℝ^m×N, built from the synthetic (perturbed) observations, reads $d^{[e]} = y + ϵ^{[e]} - H (x^{b [e]}) \in ℝ^{m \times 1}$, with $ϵ^{[e]} ~ N (0, R)$. The quality of the analysis corrections depends directly on how accurately P^b estimates B, which is highly sensitive to the limited number of ensemble members, the state distribution, and the quantification of the system uncertainty.
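As an illustration of the analysis step in Equation (5), the following minimal sketch applies a stochastic EnKF update to a toy problem. It is not the implementation used in the experiments. The function name `stochastic_enkf`, the toy dimensions, and the 1/(N − 1) normalization of the sample covariance are assumptions made for this example only.

```python
import numpy as np

def stochastic_enkf(Xb, y, H, R, rng):
    """One stochastic EnKF analysis step, Eq. (5), with perturbed observations."""
    n, N = Xb.shape
    m = y.size
    x_mean = Xb.mean(axis=1, keepdims=True)        # ensemble mean
    dX = Xb - x_mean                               # anomalies, Eq. (4)
    Pb = dX @ dX.T / (N - 1)                       # sample covariance (assumed N - 1)
    # One perturbation eps ~ N(0, R) per ensemble member, stored as columns (m x N)
    eps = rng.multivariate_normal(np.zeros(m), R, size=N).T
    D = y[:, None] + eps - H @ Xb                  # innovation matrix, m x N
    S = R + H @ Pb @ H.T                           # innovation covariance, m x m
    return Xb + Pb @ H.T @ np.linalg.solve(S, D)   # analysis ensemble, n x N

# Toy problem (hypothetical sizes): 4 states, the first 2 observed, 50 members.
rng = np.random.default_rng(0)
n, m, N = 4, 2, 50
H = np.eye(m, n)
R = 1e-6 * np.eye(m)
y = np.array([1.0, -2.0])
Xb = rng.normal(size=(n, N))
Xa = stochastic_enkf(Xb, y, H, R, rng)
```

With very accurate observations, as in this toy setting, the observed components of the analysis mean are pulled almost exactly onto y, while the unobserved components are corrected only through the sampled cross-covariances in P^b.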

2.1. LETKF

One of the most commonly used implementations of the EnKF method is the local ensemble transform Kalman filter (LETKF) [23], where the assimilation process is performed independently for each model variable. Around each model variable (grid point), a sub-domain of radius r is constructed, and the assimilation process is carried out within the local domain. Each local analysis is mapped onto the global domain to obtain the global analysis, and the assimilation is completed. In the assimilation process, all the information found within the sub-domain (i.e., observed components and error correlations) is used. The local approach of the LETKF has made it an attractive alternative for large-scale systems, so we use this method as a baseline against which to compare our proposed algorithm. The analysis state can be obtained following the implementation by Shin et al. [24]:

\begin{array}{l} Δ X = X^{b} - {\bar{x}}^{b} \cdot 1^{T} \in ℝ^{n \times N}, & (6a) \end{array}

\begin{array}{l} Δ Y = H \cdot Δ X & (6b) \end{array}

\begin{array}{l} P^{a} = {[Δ Y^{T} \cdot R^{- 1} \cdot Δ Y + (N - 1) \cdot I]}^{- 1}, & (6c) \end{array}

\begin{array}{l} D = y - H \cdot {\bar{x}}^{b}, & (6d) \end{array}

\begin{array}{l} w^{a} = P^{a} \cdot Δ Y^{T} \cdot R^{- 1} \cdot D, & (6e) \end{array}

\begin{array}{l} {\bar{x}}^{a} = {\bar{x}}^{b} + Δ X \cdot w^{a}, & (6f) \end{array}

\begin{array}{l} X^{a} = {\bar{x}}^{a} \cdot 1^{T} + Δ X \cdot {[(N - 1) \cdot P^{a}]}^{1 / 2}, & (6g) \end{array}

where n, m, and N are the model resolution, the number of observations, and the number of ensemble members, respectively, P^a ∈ ℝ^N×N is the analysis covariance matrix in ensemble space, and 1 is a vector of consistent dimension whose components are all ones. In the LETKF algorithm, the above analysis is applied per grid cell. The algorithm uses the following steps:

1. Compute in each domain simulated observations for all ensemble members.

2. Collect per domain also the observations from neighboring domains that are within distance r.

3. Loop over grid cells.

(a) Select observations and simulations that are within range r.

(b) Compute analysis weights w^a.

4. Once all the local analyses are performed, map those to the global domain.

Note that the background error covariance matrix approximation in the LETKF is the sample covariance matrix (3), therefore for large radii of influence, the quality of the LETKF results could be influenced by spurious correlations.
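The local analysis and the loop over grid cells described above can be sketched as follows. This is an illustrative sketch, not the implementation of [23, 24]. It assumes the ensemble-space normalization N − 1 and the mean-plus-perturbation update customary for the ETKF, a symmetric matrix square root, a one-dimensional grid whose distances are index differences, and uncorrelated observation errors with a common variance `obs_var`. The function names are hypothetical.

```python
import numpy as np

def etkf_local(Xb, y, H, R):
    """ETKF analysis for one local domain, following Eqs. (6a)-(6g)."""
    n, N = Xb.shape
    xb = Xb.mean(axis=1, keepdims=True)
    dX = Xb - xb                                    # (6a) state anomalies
    dY = H @ dX                                     # (6b) observation-space anomalies
    Pa = np.linalg.inv(dY.T @ np.linalg.solve(R, dY) + (N - 1) * np.eye(N))  # (6c)
    d = y[:, None] - H @ xb                         # (6d) innovation of the mean
    wa = Pa @ dY.T @ np.linalg.solve(R, d)          # (6e) mean weights
    xa = xb + dX @ wa                               # (6f) analysis mean
    lam, V = np.linalg.eigh((N - 1) * Pa)           # symmetric square root (assumed)
    W = (V * np.sqrt(lam)) @ V.T
    return xa + dX @ W                              # (6g) analysis ensemble

def letkf_1d(Xb, y, obs_idx, obs_var, r):
    """Loop over grid cells of a 1-D grid; each cell uses observations within r."""
    n, N = Xb.shape
    Xa = Xb.copy()
    for j in range(n):
        sel = np.abs(obs_idx - j) <= r              # step 3(a): local observations
        if not sel.any():
            continue                                # no local data: keep background
        idx = obs_idx[sel]
        H_loc = np.eye(n)[idx]                      # selects the observed grid points
        Xa_loc = etkf_local(Xb, y[sel], H_loc, obs_var * np.eye(idx.size))
        Xa[j] = Xa_loc[j]                           # step 4: map cell j to the global state
    return Xa

# Toy example (hypothetical sizes): 8 grid points, 5 members, 2 observed points.
rng = np.random.default_rng(3)
Xb_demo = rng.normal(size=(8, 5))
obs_idx_demo = np.array([0, 1])
y_demo = np.array([0.5, -0.5])
Xa_local = letkf_1d(Xb_demo, y_demo, obs_idx_demo, obs_var=0.5, r=1)
Xa_wide = letkf_1d(Xb_demo, y_demo, obs_idx_demo, obs_var=0.5, r=8)
```

In this toy setting, grid points farther than r from every observation keep their background values, and choosing r larger than the domain reproduces a single global ETKF analysis.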

2.2. Shrinkage-Based EnKF

A more robust family of covariance estimators for the case n ≫ N is that of the shrinkage-based estimators [25, 26]. These estimators have the form [27]:

\begin{array}{l} B \approx \hat{B} (α) = α \cdot T + (1 - α) \cdot P^{b} \in ℝ^{n \times n}, & (7) \end{array}

where α ∈ [0, 1], and T ∈ ℝ^n×n is a user-defined target matrix. The optimal weight α^* minimizes the expected squared error of the estimator,

\begin{array}{l} α^{*} = \arg \min_{α} 𝔼 [‖ B - \hat{B} (α) ‖_{F}^{2}], & (8) \end{array}

where ||•||_F represents the Frobenius norm. A closed-form expression for the weight α using a general target matrix T_KA is proposed in [28, 29] (hereafter KA estimator),

\begin{array}{l} {\hat{B}}_{K A} = α_{K A} \cdot T_{K A} + (1 - α_{K A}) \cdot P^{b} \in ℝ^{n \times n}, & (9a) \end{array}

with

\begin{array}{l} α_{K A} = \min (\frac{\frac{1}{N^{2}} \cdot \sum_{e = 1}^{N} ‖ {Δ x}^{[e]} ‖^{4} - \frac{1}{N} \cdot ‖ P^{b} ‖^{2}}{‖ P^{b} - T_{K A} ‖^{2}}, 1) . & (9b) \end{array}

This general target matrix enables the incorporation of prior information about the system into the error covariance matrix. Although T_KA must meet all the requirements of a covariance matrix, it need not satisfy any particular structural requirement and can also change dynamically, which gives complete freedom in how the matrix is computed. Sections 3 and 4 and Lopez-Restrepo et al. [15] show some examples of how to compute T_KA. Additionally, the KA estimator does not make any distributional assumptions and can therefore also be used for non-Gaussian covariance matrix estimation [29]. An implementation of the EnKF can be obtained using the KA estimator, known as EnKF-KA [15]:

\begin{array}{l} X^{a} = X^{b} + {\hat{B}}_{K A} \cdot H^{T} \cdot {[R + H \cdot {\hat{B}}_{K A} \cdot H^{T}]}^{- 1} \cdot D . \end{array}

In Lopez-Restrepo et al. [19], it is shown that incorporating prior information about the system in the data assimilation process can outperform the EnKF when n ≫ N and when there are errors in the model specifications.
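The KA estimator of Equations (9a) and (9b) can be illustrated with a short sketch. It assumes that the matrix norms in (9b) are Frobenius norms, that the vector norm is Euclidean, and that P^b uses the 1/(N − 1) normalization. The lower clip of α at zero is an additional numerical safeguard that is not part of (9b), and the identity target matrix is a hypothetical choice for this example.

```python
import numpy as np

def ka_shrinkage(Xb, T):
    """Knowledge-aided shrinkage covariance, Eqs. (9a)-(9b)."""
    n, N = Xb.shape
    dX = Xb - Xb.mean(axis=1, keepdims=True)        # anomalies dx^[e] as columns
    Pb = dX @ dX.T / (N - 1)                        # sample covariance (assumed N - 1)
    fourth = np.sum(np.sum(dX**2, axis=0) ** 2) / N**2   # (1/N^2) sum_e ||dx^[e]||^4
    num = fourth - np.linalg.norm(Pb, "fro") ** 2 / N
    den = np.linalg.norm(Pb - T, "fro") ** 2
    # min(., 1) as in (9b); the clip at 0 is a safeguard added in this sketch
    alpha = 1.0 if den == 0.0 else float(np.clip(num / den, 0.0, 1.0))
    B_ka = alpha * T + (1.0 - alpha) * Pb           # (9a)
    return B_ka, alpha, Pb

# Toy case with n >> N (hypothetical sizes and target matrix).
rng = np.random.default_rng(1)
n, N = 40, 10
Xb = rng.normal(size=(n, N))
T_KA = np.eye(n)
B_KA, alpha_KA, Pb = ka_shrinkage(Xb, T_KA)
```

In this example the sample covariance has rank at most N − 1, whereas the shrinkage estimate is full rank as soon as α_KA > 0 and the target is positive definite, which is one reason the estimator is attractive when n ≫ N.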

2.3. Ensemble Time-Local H_∞ Filter

One of the most widely used robust filters is the H_∞ Filter (HF) [30]. The HF is based on the criterion of minimizing the supremum of the L₂ norm of the uncertainty sources [8]. The ideas behind the HF come from robust control theory and from applications in linear and low-scale systems [31]. In recent years, several works have started to develop implementations of the HF for DA because of its potential to overcome some limitations of the EnKF [6, 7, 9, 31]. The HF ensures that the total energy of the estimation errors is not larger than the uncertainty energy times a factor 1/γ:

\begin{array}{l} \sum_{k = 0}^{M} ‖ x_{k}^{t} - x_{k}^{a} ‖_{S_{k}}^{2} \leq \frac{1}{γ} (‖ x_{0}^{t} - x_{0}^{a} ‖_{Δ_{0}^{- 1}}^{2} + \sum_{k = 0}^{M} ‖ u_{k} ‖_{Q_{k}^{- 1}}^{2} \\ + \sum_{k = 0}^{M} ‖ v_{k} ‖_{R_{k}^{- 1}}^{2}), & (10) \end{array}

where x^t is the true state, x^a is the analysis state, S is a user-chosen matrix of weights, u and v are the model and observation uncertainties, respectively, Δ₀, Q, and R are the uncertainty weighting matrices with respect to the initial conditions, model error, and observation error, and M is the length of the DA window [7]. To solve (10), the cost function $J^{H F}$ is defined as follows:

\begin{array}{l} J^{H F} = \frac{\sum_{k = 0}^{M} ‖ x_{k}^{t} - x_{k}^{a} ‖_{S_{k}}^{2}}{‖ x_{0}^{t} - x_{0} ‖_{Δ_{0}^{- 1}}^{2} + \sum_{k = 0}^{M} ‖ u_{k} ‖_{Q_{k}^{- 1}}^{2} + \sum_{k = 0}^{M} ‖ v_{k} ‖_{R_{k}^{- 1}}^{2}} . & (11) \end{array}

Then inequality (10) is equivalent to $J^{H F} \leq \frac{1}{γ}$ . Let γ^* be the value such that

\begin{array}{l} \frac{1}{γ^{*}} = \underset{{x_{k}^{a}}}{\inf} \underset{x_{0}, {u_{k}}, {v_{k}}}{\sup} J^{H F}, k \leq M, & (12) \end{array}

the optimal HF is then achieved when γ = γ^*. In this formulation, the evaluation of γ^* is an application of the minimax rule [32], a strategy that aims to provide robust estimates and differs from its Bayesian counterpart [7]. An ensemble-based HF implementation for non-linear DA problems is the Ensemble time-local H_∞ filter (EnTLHF) proposed by Luo et al. [7]. In the EnTLHF, a local cost function is proposed:

\begin{array}{l} J_{k}^{H F} = \frac{‖ x_{k}^{t} - x_{k}^{a} ‖_{S_{k}}^{2}}{‖ x_{0}^{t} - x_{0} ‖_{Δ_{0}^{- 1}}^{2} + ‖ u_{k} ‖_{Q_{k}^{- 1}}^{2} + ‖ v_{k} ‖_{R_{k}^{- 1}}^{2}} . & (13) \end{array}

The local performance level γ_k satisfies:

\begin{array}{l} \frac{1}{γ_{k}} \geq \frac{1}{γ_{k}^{*}} = \underset{{x_{k}^{a}}}{\inf} \underset{x_{0}, {u_{k}}, {v_{k}}}{\sup} J_{k}^{H F}, & (14) \end{array}

The EnTLHF can be expressed in terms of the EnKF algorithm using the notation of Luo et al. [7]:

\begin{array}{l} [P_{k}^{a}, K_{k}] = EnKF (x_{k}^{a}, Q_{k}, H), & (15a) \end{array}

\begin{array}{l} G_{k} = {[I_{n} - γ_{k} \cdot P_{k}^{a} \cdot S_{k}]}^{- 1} \cdot K_{k}, & (15b) \end{array}

\begin{array}{l} x_{k}^{a (i)} = x_{k}^{b (i)} + G_{k} \cdot [y_{k} - H_{k} \cdot x_{k}^{b (i)} + v_{k}^{i}], & (15c) \end{array}

\begin{array}{l} x_{k}^{a} = (\sum_{i = 1}^{N} x_{k}^{a (i)}) / N, & (15d) \end{array}

\begin{array}{l} {(Δ_{k}^{a})}^{- 1} = {(P_{k}^{a})}^{- 1} - γ_{k} \cdot S_{k}, & (15e) \end{array}

subject to the constraint

\begin{array}{l} {(Δ_{k}^{a})}^{- 1} = {(P_{k}^{a})}^{- 1} - γ_{k} \cdot S_{k} \geq 0, & (15f) \end{array}

where the operator EnKF(·, ·, ·) means that $P_{k}^{a}$ and K_k are obtained through the EnKF.

2.4. Adaptive Inflation

A particular issue with ensemble-based DA algorithms is the covariance undersampling. Undersampling leads to further problems such as the ensemble collapse to an overconfident, but incorrect state, or even filter divergence [33]. The covariance inflation artificially increases uncertainties in the background covariance avoiding the underestimation of uncertainties and undersampling [34]. The magnitude of the inflation depends to a large degree on each system and application [35].

In (15e), the extra term −γ_k · S_k inflates the EnKF covariance matrix. In this way, the EnTLHF can be interpreted as an EnKF formulation with a specific amount of inflation, which provides a solid theoretical background for constructing robust filters. Consider the case S = I_n, which corresponds to an inflation of the eigenvalues of the analysis covariance matrix. To satisfy the constraint (15f), or equivalently, to make ${(Δ_{k}^{a})}^{- 1}$ positive semidefinite, consider the singular value decomposition (SVD) of $P_{k}^{a}$

\begin{array}{l} P_{k}^{a} = V_{k} \cdot Σ_{k} \cdot U_{k}, & (16) \end{array}

where Σ_k = diag(σ_t,1, ..., σ_t,n) is a diagonal matrix with all the eigenvalues of $P_{k}^{a}$ in descending order, that is, σ_t,1 ≥ σ_t,2 ≥ .... ≥ σ_t,n and γ_k is a variable that satisfies

\begin{array}{l} {σ_{t, 1}}^{- 1} - γ_{k} \geq 0, \end{array}

that corresponds with

\begin{array}{l} γ_{k} \leq \frac{1}{σ_{t, 1}}, \end{array}

guaranteeing that ${(Δ_{k}^{a})}^{- 1}$ is positive semidefinite. It is convenient to introduce a performance level coefficient (PLC) c by defining

\begin{array}{l} γ_{k} \leq \frac{c}{σ_{t, 1}} . & (17) \end{array}

In contrast to conventional inflation schemes, γ_k is adaptive in time even for a fixed value of c, and it is directly related to the analysis covariance matrix.
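The eigenvalue inflation implied by (15e) and (17) can be made concrete with a small numerical sketch. It assumes S = I_n, takes the bound (17) with equality, γ_k = c/σ_t,1, and assumes 0 ≤ c < 1 so that the constraint (15f) holds strictly. The function name and the 2 × 2 example matrix are hypothetical.

```python
import numpy as np

def adaptive_hinf_inflation(Pa, c=0.5):
    """Inflated covariance Delta^a from (15e) with S = I and gamma = c / sigma_max."""
    sigma, V = np.linalg.eigh(Pa)                 # eigenvalues in ascending order
    gamma = c / sigma[-1]                         # Eq. (17), taken with equality (assumed)
    # (Delta^a)^{-1} = (P^a)^{-1} - gamma * I acts on each eigenvalue separately:
    sigma_infl = sigma / (1.0 - gamma * sigma)
    Delta = (V * sigma_infl) @ V.T
    return Delta, gamma

Pa_demo = np.diag([4.0, 1.0])
Delta_demo, gamma_demo = adaptive_hinf_inflation(Pa_demo, c=0.5)
```

Under these assumptions, each eigenvalue σ of the analysis covariance becomes σ/(1 − γ_k σ), so the largest eigenvalue is multiplied by 1/(1 − c), a factor of 2 for c = 0.5, while smaller eigenvalues are inflated less. Because γ_k is recomputed from the current covariance at every analysis time, the inflation adapts in time even for fixed c.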

2.5. Ensemble Time Local H_∞ Filter Knowledge Aided (EnTLHF-KA)

According to sections 2.3 and 2.4, a robust version of the EnKF can be obtained with a specific structure and inflation value. Although the EnTLHF has been shown to perform better than the EnKF in scenarios with high uncertainty [7, 36, 37], the robust version inherits the limitations of the EnKF with respect to the ensemble size and the assumption of a Gaussian ensemble distribution. When the ensemble size is small (N << n), sampling errors can degrade the quality of the covariance matrix estimate, causing problems such as filter divergence and spurious correlations [4, 35]. Even though many localization techniques have been developed to mitigate those problems, their implementation is usually prohibitive in high-dimensional applications [38]. Shrinkage covariance estimators have shown better performance than the classical sample covariance matrix in scenarios with small ensemble sizes and non-Gaussianities [27, 39–41].

We propose a robust implementation of the shrinkage-based EnKF-KA method, denoted EnTLHF-KA, that follows the principles of the EnTLHF and of adaptive inflation. The EnTLHF-KA can be obtained similarly to the EnTLHF by taking the EnKF-KA as the base:

\begin{array}{l} [{\hat{B}}_{K A}^{a}, K_{k}] = EnKF-KA (x_{k}^{a}, T_{K A}, H), & (18a) \end{array}

\begin{array}{l} G_{k} = {[I_{n} - γ_{k} \cdot {\hat{B}}_{K A}^{a} \cdot S_{k}]}^{- 1} \cdot K_{k}, & (18b) \end{array}

\begin{array}{l} x_{k}^{a (i)} = x_{k}^{b (i)} + G_{k} \cdot [y_{k} - H_{k} \cdot x_{k}^{b (i)} + v_{k}^{i}], & (18c) \end{array}

\begin{array}{l} x_{k}^{a} = (\sum_{i = 1}^{N} x_{k}^{a (i)}) / N, & (18d) \end{array}

where the operator EnKF-KA(·, ·, ·) represents the EnKF-KA shrinkage-based method (see section 2.2). For a specific PLC, the inflation value is obtained using (17).
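One possible reading of Equations (18a) to (18d) is sketched below for a single analysis time. Several choices are assumptions of this sketch rather than statements of the method: the analysis covariance returned by the EnKF-KA operator is taken as (I − K·H)·B̂_KA, the weight matrix is S = I_n, γ_k is obtained from (17) with equality, and the target matrix is the identity. All function names are hypothetical.

```python
import numpy as np

def ka_covariance(Xb, T):
    """Shrinkage estimate of Eqs. (9a)-(9b); the lower clip at 0 is a safeguard."""
    n, N = Xb.shape
    dX = Xb - Xb.mean(axis=1, keepdims=True)
    Pb = dX @ dX.T / (N - 1)
    num = np.sum(np.sum(dX**2, axis=0) ** 2) / N**2 - np.linalg.norm(Pb) ** 2 / N
    den = np.linalg.norm(Pb - T) ** 2
    alpha = 1.0 if den == 0.0 else float(np.clip(num / den, 0.0, 1.0))
    return alpha * T + (1.0 - alpha) * Pb

def robust_gain(Ba, K, c):
    """Gain of Eq. (18b) with S = I_n and gamma = c / lambda_max(Ba) (assumed)."""
    n = Ba.shape[0]
    gamma = c / np.linalg.eigvalsh(Ba)[-1]
    G = np.linalg.solve(np.eye(n) - gamma * Ba, K)
    return G, gamma

def entlhf_ka_step(Xb, y, H, R, T, c, rng):
    n, N = Xb.shape
    B = ka_covariance(Xb, T)                              # background covariance
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)          # EnKF-KA gain
    Ba = (np.eye(n) - K @ H) @ B                          # assumed analysis covariance (18a)
    Ba = 0.5 * (Ba + Ba.T)                                # enforce symmetry numerically
    G, gamma = robust_gain(Ba, K, c)                      # (18b)
    V = rng.multivariate_normal(np.zeros(y.size), R, size=N).T   # v_k^i ~ N(0, R)
    Xa = Xb + G @ (y[:, None] - H @ Xb + V)               # (18c)
    return Xa, Xa.mean(axis=1), G, K, Ba, gamma           # mean is (18d)

# Toy problem (hypothetical sizes): 6 states, 3 observed, 5 members.
rng = np.random.default_rng(7)
Xb = rng.normal(size=(6, 5))
H = np.eye(3, 6)
R = 0.1 * np.eye(3)
y = np.array([1.0, 0.0, -1.0])
Xa, xa_mean, G, K, Ba, gamma = entlhf_ka_step(Xb, y, H, R, np.eye(6), 0.5, rng)
```

Setting c = 0 recovers the EnKF-KA gain K, so under these assumptions the PLC acts as a single knob that moves the filter from the non-robust shrinkage-based update toward a more observation-driven one.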

3. Results in Low-Scale System

A series of synthetic DA experiments allows us to expose the benefits of the robust filter over the earlier methods and to evaluate its robustness in controlled scenarios. The Lorenz-96 model is one of the most widely used benchmarks for testing DA algorithms. The model is highly non-linear, with a strong relationship between the states. The Lorenz-96 dynamics are described by [42, 43]:

\begin{array}{l} \frac{d x_{j}}{d t} = \left\{ \begin{array}{ll} (x_{2} - x_{n - 1}) \cdot x_{n} - x_{1} + F & \text{for } j = 1, \\ (x_{j + 1} - x_{j - 2}) \cdot x_{j - 1} - x_{j} + F & \text{for } 2 \leq j \leq n - 1, \\ (x_{1} - x_{n - 2}) \cdot x_{n - 1} - x_{n} + F & \text{for } j = n, \end{array} \right. & (19) \end{array}

where n is the state number chosen as 40 and F is the external force. For consistency, periodic boundary conditions are assumed. We take the next considerations for the numerical experiments:

• The assimilation window consists of M = 500 observations.

• The number of observed components is m = 20, representing 50% of the model components.

• The observation statistics are associated with the Gaussian distribution,

\begin{array}{l} y_{t} ~ N (H \cdot x_{t}^{*}, ρ_{o}^{2} \cdot I), for 1 \leq t \leq M, & (20) \end{array}

where x_t^* denotes the true (reference) state, ρ_o = 0.001, and H is a linear operator that randomly chooses the m observed components.

• To avoid random fluctuations, each experiment is repeated 20 times (L = 20).

• We compare the performance and robustness of the EnTLHF-KA against the non-robust methods EnKF and EnKF-KA, and the robust method EnTLHF.

• We use a Gaspari-Cohn [44] matrix with an influence radius of 2 as target matrix T_KA for the EnKF-KA and the EnTLHF-KA. Following [7], we do not use covariance localization to avoid complicating the analysis of our experiment results.

• We take the Root-Mean-Square-Error (RMSE) of L experiments as a measure of performance,

\begin{array}{l} RMSE = \frac{1}{L} \cdot \sum_{l = 1}^{L} (\sqrt{\frac{1}{M} \cdot \sum_{t = 1}^{M} {[x_{t}^{*} - x_{t}^{a}]}^{T} \cdot [x_{t}^{*} - x_{t}^{a}]}) . & (21) \end{array}

• We chose a PLC value c = 0.5 for all the experiments, following Luo and Hoteit [7]. Other c values have been tested (not reported here), but no performance improvements were obtained.
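The experimental setup listed above can be sketched in a few lines. This is an illustrative sketch only: the fourth-order Runge-Kutta integrator, the time step dt = 0.01, the initial perturbation, and the spin-up length are assumptions of this example, since the integration scheme is not specified here. The error measure below is the square root of the time-averaged squared Euclidean error for a single experiment; averaging over the L repetitions is left out.

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Right-hand side of Eq. (19); np.roll implements the periodic boundaries."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt, F=8.0):
    """One classical Runge-Kutta step (integrator choice is an assumption)."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def rmse(x_true, x_analysis):
    """Root of the time-averaged squared Euclidean error; inputs have shape (M, n)."""
    err = x_true - x_analysis
    return float(np.sqrt(np.mean(np.sum(err**2, axis=1))))

def make_observation(x_true, obs_idx, rho_o, rng):
    """Observation of the selected components with Gaussian noise of std rho_o."""
    return x_true[obs_idx] + rho_o * rng.normal(size=obs_idx.size)

rng = np.random.default_rng(42)
n, m = 40, 20
x_ref = 8.0 + 0.01 * rng.normal(size=n)          # small perturbation of the rest state
for _ in range(100):                             # short spin-up (hypothetical length)
    x_ref = rk4_step(x_ref, dt=0.01)
obs_idx = np.sort(rng.choice(n, size=m, replace=False))   # random observed components
y_obs = make_observation(x_ref, obs_idx, rho_o=0.001, rng=rng)
```

Writing the right-hand side with `np.roll` covers the three cases of Equation (19) at once, because the periodic boundary conditions identify x_0 with x_n, x_−1 with x_n−1, and x_n+1 with x_1.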

3.1. Robustness Against Ensemble Members

When the state dimension is large, it is important to test the performance with relatively small ensemble sizes. We evaluate both the accuracy and the robustness of the EnTLHF-KA with respect to the ensemble size. For this case, we set the observation error δ = 1 × 10⁻³, the observation frequency f = 1, and the external force F = 8. The ensemble size is N ∈ [10, 20, 50, 100, 1000]. Figure 1 presents the RMSE for those values of N.

FIGURE 1

Figure 1. Error evaluation of the robust and non-robust methods with respect to the ensemble member number.

The EnTLHF-KA has the most stable RMSE values across the different N. The other methods show variation in their performance when the ensemble size changes. In general, the RMSE values decrease for larger N for all the methods. For N = 10, the EnTLHF-KA performs better than the others, followed by the EnKF-KA. This behavior is attributed to the shrinkage-based estimator used in both methods, which has been shown to give a better covariance estimate when N << n [19, 41]. In addition, the adaptive inflation factor of the EnTLHF and the EnTLHF-KA improves the performance of these methods relative to their non-robust counterparts. For larger ensemble sizes, the EnTLHF-KA and the EnKF-KA tend to converge to the EnTLHF and the EnKF, respectively, since the sample covariance matrix becomes a good estimator of the covariance matrix and ${\hat{B}}_{K A}$ converges to P^b. Because P^b estimates B well and all the EnKF assumptions are satisfied, the non-robust methods have lower RMSE values for large ensemble sizes. This example illustrates the advantages and disadvantages of the robust approach compared to the optimal approach. Although the EnTLHF-KA is not the best performer in every scenario, its robustness allows it to keep low RMSE values in all of them.

3.2. Robustness Against Observation Error

Figure 2 shows the RMSE when δ ∈ [1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻², 1 × 10⁻¹]. The other parameters are N = 20, f = 1, and F = 8. The aim now is to evaluate the impact of the observation error on the new robust EnTLHF-KA. The performance of the non-robust methods degrades as the observation error increases, to the point that the EnKF-KA diverges. This kind of behavior is one of the main reasons for developing new robust techniques [12]. The impact of the observation error is much lower in the robust methods, whose performance is almost constant, especially for the EnTLHF-KA. When δ = 1 × 10⁻⁴, the EnKF and the EnKF-KA perform better than their robust counterparts, but the robust filters maintain good performance even for large observation errors.

FIGURE 2

Figure 2. Error evaluation of the robust and non-robust methods with respect to the observation error.

3.3. Robustness Against Model Errors

To evaluate the robustness of the EnTLHF-KA with respect to model errors, we compare the performance of the methods when F ∈ [6, 7, 8, 9, 10]. F = 8 corresponds to the assumption of a perfect model. Figure 3 presents the RMSE for each F value and the comparison among the four filters. The RMSE values remain almost constant for both robust filters, with smaller values for the EnTLHF-KA. The adaptive inflation makes the analysis covariance matrix larger in the robust filters than in their non-robust counterparts, given the same background covariance. Consequently, the EnTLHF and the EnTLHF-KA put more weight on the observations, which is convenient when the model errors are larger.

FIGURE 3

Figure 3. Error evaluation of the robust and non-robust methods with respect to errors in the model.

3.4. Robustness Against Ensemble Distribution

The standard EnKF assumes that the ensemble state has a Gaussian distribution. This assumption is especially important because the state covariance B is approximated by the ensemble sample covariance P^b. Although the ensemble at t₀ is Gaussian, non-linearities in the model dynamics can modify the ensemble distribution, causing the approximation of B by P^b to lose accuracy. Figure 4 presents an evaluation of the ensemble distribution at different time steps using the Lorenz-96 model. We use the Shapiro-Wilk test to evaluate the Gaussianity of each state variable [45]. We take an initial Gaussian ensemble of 100 members as a reference. After 15 time steps, some variables begin to depart from their initial distribution, and after 30 time steps, the Gaussian assumption is no longer valid for the ensemble.

FIGURE 4

Figure 4. Shapiro-Wilk test for each Lorenz component at different time steps. The ensemble size is 100. White indicates that the null hypothesis is not rejected (the ensemble for that specific variable is Gaussian). Gray indicates that the null hypothesis is rejected (the ensemble for that specific variable is non-Gaussian).

We perform different experiments varying the observation frequency, that is, the number of time steps between two consecutive observations. Figure 5 shows the time-averaged RMSE for the EnKF, EnKF-KA, EnTLHF, and EnTLHF-KA using an observation frequency f ∈ [1, 5, 10, 20, 30, 50] time steps. We set an ensemble size of N = 20, an observation error of δ = 1 × 10⁻³, and the external force F = 8. The EnKF performance decreases considerably when f increases, and beyond f = 30 the method diverges. This result illustrates the importance of the Gaussian distribution for obtaining a good representation of B through P^b. The adaptive inflation increases the robustness and performance of the EnTLHF, even though both the EnKF and the EnTLHF use the same approximation of B. Nevertheless, the EnTLHF performance decreases considerably when f = 50. In contrast, the EnKF-KA and the EnTLHF-KA use a shrinkage-based estimator for B. The KA estimator does not assume a Gaussian distribution, as other shrinkage-based estimators do [27, 46]. Thus, the EnKF-KA performs better than the EnKF for large f values and reaches error levels similar to those of the EnTLHF without incorporating adaptive inflation. In the case of the EnTLHF-KA, the combination of the shrinkage-based estimator and the adaptive inflation produces high robustness and performance even when the ensemble distribution is non-Gaussian.

FIGURE 5

Figure 5. Error evaluation of the robust and non-robust methods with respect to the observation frequency.

4. Application to a Non-linear Non-Gaussian Large Scale System

The implementation of the LOTOS-EUROS CTM over the Aburrá Valley is used as a real case study. This application involves a non-linear and non-Gaussian large system, so it is a good opportunity to test the potential of the proposed method. The complete description of the implementation and the observations is presented in the Appendix. The period of interest for all data evaluations, simulations, and DA experiments spans from February 25 to March 15, 2019. During these days, the PM concentrations are higher due to the northbound transit of the Intertropical Convergence Zone over the study domain. The assimilated data are located at the surface, but the proposed method also applies to satellite data at different scales and resolutions.

In order to test the proposed method, we performed a total of four different LOTOS-EUROS simulations:

1. a LOTOS-EUROS model simulation without DA (henceforth LE), that is, a free model run under regular initial and boundary conditions, used as a reference for comparison;

2. a DA simulation using the LETKF introduced in section 2.1 (henceforth LE-LETKF);

3. a DA simulation using the shrinkage-based EnKF-KA developed in Lopez-Restrepo et al. [15] (henceforth LE-KA);

4. a DA simulation using the robust and shrinkage-based EnTLHF-KA proposed in section 2.5 (henceforth LE-Robust).

The set of validation sites is split into two groups: the stations located in the bottom part of the valley (BS, represented by circles in Figure 12), and the stations located in the city's outskirts or hills (OS, represented by stars in Figure 12). The objective of this division is to evaluate the simulation performance in regions where the PM_2.5 concentration regimes are different. All the simulations were evaluated on both sets of validation stations using the performance metrics Mean Fractional Bias (MFB) [47], Root Mean Square Error (RMSE) [48], and Pearson Correlation Factor [49]. The three ensemble-based algorithms estimate both concentrations and emissions, following the stochastic representation presented in Lopez-Restrepo et al. [19]. For all the methods, an ensemble size N of 25 members and a localization radius r of 5 km were used.
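For reference, the three validation metrics can be computed as in the sketch below. The formulas are the commonly used definitions, with the MFB written as 2/K times the sum of (model − observation)/(model + observation) over K paired values. They are stated here as an assumption, since the exact expressions are given in [47–49] and not restated in this section. The concentration values are made-up numbers for illustration, not measurements.

```python
import numpy as np

def mean_fractional_bias(model, obs):
    """MFB = (2 / K) * sum((model - obs) / (model + obs)); bounded in [-2, 2]."""
    return float(2.0 * np.mean((model - obs) / (model + obs)))

def root_mean_square_error(model, obs):
    return float(np.sqrt(np.mean((model - obs) ** 2)))

def pearson_r(model, obs):
    """Pearson correlation between the simulated and observed series."""
    return float(np.corrcoef(model, obs)[0, 1])

# Hypothetical hourly PM2.5 values (ug/m3) at one station, for illustration only.
obs_demo = np.array([10.0, 20.0, 30.0, 40.0])
model_demo = 3.0 * obs_demo                      # a model that overestimates threefold
mfb_demo = mean_fractional_bias(model_demo, obs_demo)
rmse_demo = root_mean_square_error(model_demo, obs_demo)
r_demo = pearson_r(model_demo, obs_demo)
```

The example shows why the metrics are complementary: a model that overestimates every value by the same factor has a perfect correlation but a large positive MFB and a large RMSE.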

The DA methods are evaluated with forecast experiments, in which a model simulation over a limited number of days is performed using information from the assimilation. Forecasting experiments were performed to test the model's capability to predict the PM concentrations in the valley up to three days ahead. We applied the methodology proposed by Lopez-Restrepo et al. [50], with all days from March 9 to 13 having predictions as the first, second, and third day of a forecast. We are especially interested in evaluating the ability of the model to predict warning-triggering episodes (AQI in orange, red, or purple levels, as shown in Table 1). All forecast simulations used the estimated emission correction factors from the last assimilation day in each of the three forecast days. This inheritance scheme has been shown to be the best option for the LE implementation over the Aburrá Valley [19].

TABLE 1

Table 1. Air Quality Index (AQI) as defined for the Aburrá Valley with respect to PM_2.5 concentrations according to the ranges established by the Metropolitan Area.

This is especially relevant because the robust method is evaluated in forecast mode, which tests its capability to reduce uncertainty in an operational fashion and its suitability for direct use in decision making within our applied research programs in air pollution.

4.1. Target Matrix

The shrinkage-based algorithm EnKF-KA and the robust EnTLHF-KA were implemented for use with the LOTOS-EUROS model. This was mainly motivated by the great opportunities for DA applied to CTMs and air pollution scenarios for decision making: the more challenging the problem, the more creative the solutions that arise. The aim of the EnKF-KA and the robust EnTLHF-KA algorithms is to improve the model representation in the complex orography conditions of the Aburrá Valley. Both shrinkage-based algorithms require a target matrix T_KA to compute the covariance matrix B according to Equation (10). The matrix T_KA should guide the covariance structure in B by limiting the spurious correlations between elements at a large distance [40], or in the case of the EnKF-KA and the EnTLHF-KA, by incorporating previously obtained knowledge directly in the DA process [15]. For this application, we are interested in using the target matrix to represent the valley's complex orography in the covariance estimation. Previous works have reported difficulties in reproducing the pollutant dynamics in the Aburrá Valley due to the limited representation of the valley in the simulation model [19, 21]. Even with high-resolution meteorological simulations, it is still challenging to capture the transport of pollutants in narrow valleys [51].

The main purpose of the T_KA matrix is to reduce the covariance between elements in the state that are distant in the vertical direction but close in the horizontal direction. Thus, observations located in the bottom part of the valley (where the pollutant concentrations are higher) should not have a high impact on the city's outskirts (where the concentrations are lower) and vice versa. A first version of the target matrix $T_{K A}^{*}$ was built using a fourth-order-polynomial covariance function as described in Gaspari and Cohn [44]. To incorporate the previous knowledge and improve the valley representation in the model, we reduced the correlation as a function of vertical distance, with zero correlation for vertical distances exceeding 600 m. Other distances were tested too, without significant changes in the result. The chosen formulation preserves the dependency on the horizontal distance that is necessary to remove the spurious correlations and incorporates the physical restriction of the valley. To ensure that T_KA is positive semidefinite, we applied the method presented in Higham [52] to obtain the positive semidefinite matrix that is closest to $T_{K A}^{*}$ in the Frobenius norm. Figure 6 illustrates the influence area of the Gaspari-Cohn based covariance matrix, the $T_{K A}^{*}$ covariance matrix, and the T_KA covariance matrix for two locations. The influence area corresponds to a row (or column) of the covariance matrix. The proposed $T_{K A}^{*}$ matrix (Figure 6C) follows the valley shape according to the orography shown in Figure 6B, unlike the Gaspari-Cohn covariance matrix (Figure 6A). This generalization applies to very complex boundary conditions in large-scale systems, not only for the solution of the differential equations but also for the estimation tasks of the robust filters. Additionally, there are no significant differences between the T_KA matrix (Figure 6D) and the $T_{K A}^{*}$ matrix. Finally, the T_KA matrix is used as the target matrix for both the EnKF-KA and EnTLHF-KA methods. Note that the final covariance between the states inside and outside the valley will not necessarily be zero, because the final covariance matrix B_KA is a convex combination of T_KA and P^b.
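The construction described above can be sketched as follows. Several pieces are assumptions made for illustration and are not taken from the paper: the horizontal correlation uses the widely used compactly supported fifth-order piecewise rational function from Gaspari and Cohn, the vertical reduction is a linear taper reaching zero at 600 m, the positive semidefinite projection is done by clipping negative eigenvalues of the symmetrized matrix, and the weight `gamma`, the grid coordinates, and the length scale are hypothetical (in the paper the weight comes from the shrinkage estimator of Equation (10)).

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Compactly supported correlation function; exactly zero beyond 2c."""
    r = np.abs(np.asarray(dist, float)) / c
    out = np.zeros_like(r)
    near = r <= 1.0
    far = (r > 1.0) & (r < 2.0)
    rn, rf = r[near], r[far]
    out[near] = (-0.25 * rn**5 + 0.5 * rn**4 + 0.625 * rn**3
                 - (5.0 / 3.0) * rn**2 + 1.0)
    out[far] = (rf**5 / 12.0 - 0.5 * rf**4 + 0.625 * rf**3
                + (5.0 / 3.0) * rf**2 - 5.0 * rf + 4.0 - 2.0 / (3.0 * rf))
    return out

def nearest_psd(a):
    """Closest symmetric PSD matrix in the Frobenius norm (eigenvalue clipping)."""
    sym = 0.5 * (a + a.T)
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T

def target_matrix(xy, z, c_horizontal, z_max=600.0):
    """Return (T*, T_KA) for grid cells at horizontal xy and height z."""
    dh = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    dz = np.abs(z[:, None] - z[None, :])
    # ASSUMPTION: linear vertical taper, zero for |dz| >= z_max.
    vertical = np.clip(1.0 - dz / z_max, 0.0, None)
    t_star = gaspari_cohn(dh, c_horizontal) * vertical
    return t_star, nearest_psd(t_star)

def shrink(p_b, t_ka, gamma):
    """Convex combination of the target matrix and the sample covariance."""
    return gamma * t_ka + (1.0 - gamma) * p_b

# Hypothetical cells: two on the valley floor, two on a slope (km, m).
xy = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
z = np.array([1500.0, 1550.0, 2200.0, 2300.0])
t_star, t_ka = target_matrix(xy, z, c_horizontal=5.0)

p_b = np.full((4, 4), 0.5) + 0.5 * np.eye(4)  # hypothetical sample covariance
b_ka = shrink(p_b, t_ka, gamma=0.3)           # hypothetical fixed weight
print(np.round(t_star, 3))
print(np.round(b_ka, 3))
```

In this toy setup the floor and slope cells are more than 600 m apart vertically, so their entries in $T_{K A}^{*}$ vanish even though they are horizontally close, while the corresponding entries of the combined matrix remain nonzero because the sample covariance still contributes.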

FIGURE 6

Figure 6. Comparison of the influence area of two selected states (blue dots) between a distance depending localization, and the target covariance matrix based on the distance and the orography. (A) Influence area Gaspari-Cohn matrix. (B) Aburra Valley orography. (C) Influence area $T_{K A}^{*}$ . (D) Influence area T_KA.

4.2. Evaluation of LE simulations

The concentration fields produced by model simulations with or without DA were compared with the observations from official monitoring stations (Figure 12), dividing the study into stations at the bottom of the valley (BS stations) and stations at the outskirts of the city (OS stations). The assessment statistics averaged over the validation stations are shown in Table 2. In all validation stations, the simulation results without DA (LE) underestimated the observed concentrations. This is reflected, for example, in a high RMSE value. The correlation coefficient was low, which means that the model could not fully capture the temporal variations at hourly and daily scales. The three simulations using DA had MFB values close to 0 for the BS stations (bottom of the valley), without a noticeable difference between them. DA was thus successful in reducing the discrepancy between the model and observations. The RMSE also decreased by 45.03% in the LE-LETKF, 41.57% in the LE-KA, and 41.91% in the LE-Robust simulations compared to the RMSE of the LE simulation. According to Mogollón-Sotelo et al. [53] (their Table 2, based on EPA [54] and Boylan and Russell [47]), the R values were all above the criterion for good results. In contrast, over the OS stations (outskirts of the city), the simulations using the shrinkage-based methods presented better statistics than the LE-LETKF. For instance, the RMSE improvements in OS stations using shrinkage-based methods are 15.02% for the LE-KA and 22.22% for the LE-Robust compared with the LE-LETKF.

TABLE 2

Table 2. Statistical evaluation of different simulations.

TABLE 3

Table 3. Weather Research and Forecasting (WRF) model domains description.

TABLE 4

Table 4. WRF model set up.

In general, all DA simulations showed lower scores in the OS stations than in the BS stations, mainly because of the poor representation of these areas by the background simulation (LE simulation) and the lack of nearby observations. Even so, the LE-Robust simulation shows the most consistent performance across all the stations.

Figure 7 shows diurnal cycles at four chosen validation stations during the simulation phase. These stations illustrate the differences between BS and OS, and are representative of all validation stations. The LE diurnal cycle differs from the observations in magnitude in the BS stations, and in both magnitude and temporal behavior in the OS stations. The highest concentration peak in the BS stations, around 09:00, is primarily due to traffic dynamics and is partially captured by the LE simulation. For example, the LE morning peak emerged earlier in the simulation at station 44 than in the observations. This time lag could be due to a poor spatial representation of mobile sources in the emission inventory, or a failure by the meteorology or the model to reproduce the dynamics of the valley, indicating premature transport of particulate matter to these regions. In comparison, at 22:00, the LE simulation presents its highest point at station 44 (Figure 7C), which does not correspond with the observations. The LE simulation at the other OS station, station 85 (Figure 7D), cannot fit the observation interval, indicating a late morning peak and a minimum around 21:00 that does not appear in the measurements. The LE simulation shows a general underestimation of concentrations, with a better replication of the PM_2.5 dynamics at the bottom of the valley.

FIGURE 7

Figure 7. Daily cycle at different stations. The upper panel corresponds with stations located at the bottom of the valley. The bottom panel corresponds with stations located on the outskirts of the city. (A) Daily cycle at station 25. (B) Daily cycle at station 28. (C) Daily cycle at station 44. (D) Daily cycle at station 85.

The simulations using DA presented diurnal cycles closer to the observations, with a marked difference in performance between BS stations and OS stations. In the BS stations (Figures 7A,B), the three methods showed very similar daily cycles capturing the magnitude and the variability of the observations with high accuracy. These simulations corrected the concentration underestimation presented in the LE simulation and improved the temporal profile. Unlike in the BS stations, in the OS stations, the three DA methods showed different results.

The LE-LETKF tends to overestimate the concentrations and has a different diurnal variability with respect to the observations. In station 44, the LE-LETKF persistently displayed higher values than those observed, and a low variability throughout the day, with small peaks and valleys. In station 85, the LE-LETKF showed higher concentration values than the observations, and the morning peak appears later (similar to the LE simulation). The discrepancy in the magnitude and the lack of representation of the temporal variability suggest that the LE-LETKF simulation assimilates observations located in regions where the PM presents a different temporal behavior from that of the grid cells located in the outskirts.

On the other hand, the two simulations using the shrinkage-based covariance estimator and the target matrix T_KA (LE-KA and LE-Robust) improve the performance in the OS stations. The LE-KA simulation showed a temporal variability similar to the observations in both OS stations, although with an underestimation of the concentrations.

The LE-Robust displayed a high agreement between the simulated daily cycle and the observations. The difference in magnitude between the LE-Robust and LE-KA simulations can be explained by the fact that the robust methods tend to put more weight on the observations when there is high uncertainty in the background [7], as is the case in this application. Finally, the shrinkage-based simulations tend to follow the diurnal variability, which suggests that the T_KA matrix could limit the influence of observations from areas with a different temporal profile.

4.3. Spatial Distribution

To better understand the influence of the target matrix T_KA on shrinkage-based methods, it is important to analyze the spatial distribution of the concentrations over the valley. Figure 8 shows a three-dimensional representation of the average value of PM_2.5 over March 9. In these graphs, values less than 5 μg/m³ are omitted. The averaged observed values are shown using the same color bar for all the validation stations by a circle and a star for the BS and OS stations, respectively.

FIGURE 8

Figure 8. 3D maps of concentrations averaged over March 9 for different simulations. The values less than 5 μg/m³ are omitted. The circles correspond with BS stations, and the stars correspond with OS stations. (A) LE, (B) LE-LETKF, (C) LE-KA, and (D) LE-Robust.

The LE simulation has a spatial pattern similar to the observations, with the highest concentrations in the central and southern parts of the city of Medellín (refer to Figure 12 for reference). In general, the concentrations are higher in the bottom part of the valley, where most of the population and industrial facilities are located. This characteristic is well captured by the LE simulation. Nevertheless, the LE simulation tends to underestimate the concentration along the valley and the hills.

The three DA simulations are able to correct the concentration bias in the bottom part of the valley. The LE-LETKF assimilation increases the concentrations in the hills to values higher than the observations. In station 85, located on the west slope of the valley (see Figure 12 for reference), the concentrations simulated by LE-LETKF are almost everywhere higher than those observed. This is because the concentrations in the west hill are influenced by observations located in the lower part of the valley, characterized by high concentrations. Those observations influence the grid cells located on the hill, generating values that do not correspond to the validation station. Both shrinkage-based simulations match the observations on the hills better. In the case of station 85, both methods have the same range of values as the observed concentrations.

The use of the T_KA matrix limits the influence of the observations located at the bottom of the valley on the grid cells at the slopes. As shown in Figure 6D, the influence of the observations is limited by horizontal and vertical distance, representing better the dynamics in the valley. A particular situation is observed at station 94 (see Figure 12 for reference), located on the top of the east slope. Although the observed values are in the range of 5–10 μg/m³, all the simulations, even the DA simulations, show values under 5 μg/m³ (not plotted in Figure 8). The underestimation can be explained by an absence of emissions in the emission inventory (emission uncertainties), and the limited number of observations in that part of the domain.

4.4. Forecast Results

A fundamental prerequisite for an air quality simulation and assimilation method to be valuable for a decision-making process is that it can predict the concentrations a few days in advance. Figure 9 shows examples of forecasts from March 12, 16:00 to March 15, 16:00. As mentioned previously, the forecast runs use the emission correction factors estimated between March 10, 16:00 and March 11, 16:00. The LE simulation persistently underestimates the concentrations, as observed in the assimilation window's results. In the BS stations, the three assimilation methods initiate a forecast that is quite close to the observations on the first day and remains acceptably similar in the following two forecast days. As shown in the previous evaluations, the concentrations in the assimilation window are very similar for the three methods in the lower part of the valley. Thus, the estimated emission correction factors are also similar, leading to rather small differences between the forecasts. However, in the OS stations, the LE-LETKF forecasts show magnitudes and a temporal behavior that are different from the observations. This discrepancy in the values suggests an incorrect estimation of the emission correction factors on the slopes of the valley by LE-LETKF. The forecasts generated by the shrinkage-based methods are more similar to the observations. The LE-KA and LE-Robust show good forecasting skill for the OS stations, with temporal behavior and magnitudes close to those observed for the first and second forecast days.

FIGURE 9

Figure 9. Forecast from March 12, 16:00 to March 15, 16:00 at different stations. The gray vertical dashed line represents the end of the assimilation window and the beginning of the forecast window. Bottom station (A) Forecast cycle at station 25. (B) Forecast cycle at station 28. Outskirt stations (C) Forecast cycle at station 44. (D) Forecast cycle at station 85.

To be valuable for the public, a forecast should correctly warn of elevated air pollution events. The proportion of true negatives, true positives, false negatives, and false positives regarding the prediction of warning-triggering episodes (AQI in orange, red, or purple levels, see Table 1) is summarized by the confusion matrix [55].
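A minimal sketch of how such a confusion matrix can be tallied is given below. It assumes one observed and one simulated concentration per station and day, and a single warning threshold. The threshold value and the sample concentrations are hypothetical placeholders, not the Table 1 limits or data from this study.

```python
import numpy as np

def warning_confusion(obs_conc, sim_conc, threshold):
    """Count warning outcomes; a warning is any value at or above threshold."""
    obs_w = np.asarray(obs_conc, float) >= threshold
    sim_w = np.asarray(sim_conc, float) >= threshold
    return {
        "TP": int(np.sum(obs_w & sim_w)),    # warning observed and predicted
        "FP": int(np.sum(~obs_w & sim_w)),   # false alert
        "FN": int(np.sum(obs_w & ~sim_w)),   # missed alert
        "TN": int(np.sum(~obs_w & ~sim_w)),  # correctly no warning
    }

THRESHOLD = 40.0  # hypothetical placeholder (ug/m3), not the Table 1 value

# Hypothetical daily mean PM2.5 per station-day (ug/m3).
obs = [45.0, 30.0, 50.0, 20.0, 40.0]
sim = [42.0, 41.0, 35.0, 18.0, 44.0]

cm = warning_confusion(obs, sim, THRESHOLD)
correct_rate = (cm["TP"] + cm["TN"]) / sum(cm.values())
print(cm, correct_rate)
```

Counting per station and day keeps false alerts and missed alerts separate, which matters here because the two kinds of error have different consequences for public warnings.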

Figure 10 shows the confusion matrices for the LE-LETKF, LE-KA, and LE-Robust assimilations and forecasts. In the assimilation and forecast windows, the LE simulation did not give an alert at any station; for that reason, we do not provide its confusion matrix. The DA simulations have a combined rate of true negatives and true positives equal to or greater than 90%. Of the 20 alarms registered in the assimilation window, 18 correspond to BS stations.

FIGURE 10

Figure 10. Comparison of confusion matrices for the data assimilation (DA) and forecast window depending on warning or no warning per station. The values are calculated across all the days of the corresponding window. The value of 0 corresponds with no warning, the value of 1 corresponds with a warning. For the LE simulation, there are neither warnings in the DA window nor forecast windows (A,B).

FIGURE 11

Figure 11. WRF and LOTOS-EUROS model nested domain configuration. The red squares correspond with the LOTOS-EUROS domains, the black squares correspond with the WRF domains.

FIGURE 12

Figure 12. (A) Validation network. The circles and stars represent the bottom part stations (BS), and the outskirt stations (OS), respectively. (B) Assimilation network. The gray raster corresponds with the LOTOS-EUROS model grid, and the black lines are the municipalities' borders.

In the forecast window, the forecast skill of the three models was lower than in the assimilation window. Of the 10 alerts actually observed in the forecast period, the DA simulations could replicate 8. A higher proportion of false-positive alerts was reported by the LE-LETKF, which produced nine more false alerts than the shrinkage-based approaches. The high number of false-positive alerts is due to the overestimation of the LE-LETKF concentrations in the OS stations, where the additional alerts were recorded incorrectly. In general, the LE-KA and LE-Robust simulations had better alert forecast performance than the LE-LETKF simulation.

4.5. Discussion and Comments

The LOTOS-EUROS model in a free-run scenario (without DA) has served as a case study in several previous contributions. Previous studies already suggested the need for meteorological fields at a higher resolution to correctly represent the dynamics and transport of pollutants in the Aburrá Valley [19]. The simulation without DA and using Weather Research and Forecasting (WRF) model meteorology (LE simulation) shows an improvement compared to implementations using the lower resolution ECMWF meteorology: the underestimation of PM_2.5 concentrations is strongly reduced (although still present) and an increase in the correlation is observed. It is important to continue evaluating the model's performance with different configurations of the WRF model, specifically to reproduce the dominant dynamics of pollutant transport in inhabited valleys [21, 51]. Additionally, it is necessary to carry out a more exhaustive evaluation of the model's vertical resolution, given the new possibilities offered by the coupling with the WRF model. Finally, a reduction in the uncertainty of the meteorology will improve the estimation of the emissions using DA and could help to create more accurate emission inventories. Data assimilation for uncertainty reduction of the WRF model is currently under research.

The DA considerably improves the simulations by the model. With each of the three assimilation methods, smaller differences and higher similarities between the simulated and observed concentrations were found, as shown in Table 2. The standard metrics that are used to compare the various algorithms showed an improvement compared to previous EnKF implementations, assimilating the same observations [50]. This improvement is due to the better background obtained using WRF meteorology and the impact of the localization schemes present in the DA algorithms. Using the new assimilation schemes, the spatial distribution of concentrations within the valley is better resolved.

Under the assumption that the WRF meteorological fields improve the model representation of reality, we focus on the main differences between the model in a free run and the assimilation. Using a target covariance matrix to adapt the covariances computed from the ensemble results in a better representation of the actual covariance structure. The target covariance matrix limits the influence of observations located in the lower part of the valley on the grid cells located in the hills of the valley and vice versa. This makes it possible to separate the different regimes and avoids incorrect corrections in concentrations, as could occur with the standard LETKF method. The forecast experiments also suggest a better estimate of the emission correction factors when shrinkage methods are employed. As a result, the forecasts of dangerous pollution levels are improved at all the stations (shown in Figure 10). These results encourage further improvement of these types of methods and the incorporation of more prior knowledge in the covariance estimation. Possible new directions include dynamic target matrices dependent on the weather or on patterns in public behavior.

Both shrinkage-based methods, EnKF-KA and EnTLHF-KA, showed lower error statistics than the standard LETKF. The use of the shrinkage estimator and the incorporation of orography information through the T_KA matrix allow both methods to achieve satisfactory results with a relatively low number of ensemble members (25). Previous experiments in toy models (Lorenz96 and a 2D advection-diffusion model) and pseudo-real applications (SPEEDY model) have shown that the shrinkage-based family of methods can improve DA when the size of the ensemble is small [15, 40], a finding supported by our results in a real high-dimensional application. This capability is important given the computational difficulty involved in generating many simulations of highly complex models. Although the overall performance of both methods is similar, the robust method achieves better results, especially in stations on the slopes of the valley. This is very important for this family of models because it seems to improve estimation results even if the solution of the differential equations is not highly accurate.

The EnTLHF-KA algorithm tends to put more weight on the observations than the EnKF-KA in the analysis step due to the adaptive inflation term that is present. Additionally, the robust methods do not require a completely correct characterization of the observation representation errors or the uncertainties of the model [7]. This characteristic benefits the EnTLHF-KA in our application, given the lack of precise information on the modeling system's uncertainties, e.g., emissions inventory, meteorology, composition, and reaction schemes.
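The effect of an inflation term on the weight given to observations can be illustrated with a scalar toy case. This is only a sketch under simplifying assumptions: a single state observed directly, a fixed multiplicative inflation factor chosen by hand, and hypothetical numbers. The adaptive factor of the EnTLHF-KA is computed differently and is not reproduced here.

```python
def scalar_gain(p_b, r, inflation=1.0):
    """Kalman gain for a scalar state observed directly (H = 1)."""
    p = inflation * p_b  # inflated background variance
    return p / (p + r)

def scalar_analysis(x_b, y, p_b, r, inflation=1.0):
    """Analysis value: background plus gain times innovation."""
    return x_b + scalar_gain(p_b, r, inflation) * (y - x_b)

# Hypothetical values: background 20, observation 40, equal error variances.
x_b, y = 20.0, 40.0
p_b, r = 4.0, 4.0

k_plain = scalar_gain(p_b, r)                     # no inflation
k_infl = scalar_gain(p_b, r, inflation=3.0)       # hypothetical factor of 3
x_plain = scalar_analysis(x_b, y, p_b, r)
x_infl = scalar_analysis(x_b, y, p_b, r, inflation=3.0)
print(k_plain, k_infl, x_plain, x_infl)
```

With inflation the gain grows from 0.5 to 0.75, so the analysis moves from the midpoint between background and observation to a value closer to the observation, which is the qualitative behavior described above.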

Although the methods presented in this work were tested in a specific setting, their formulation is quite general and could be used in other applications [15]. The basic concept of both EnKF-KA and EnTLHF-KA is to incorporate information or prior system knowledge that is not captured by the model directly in the DA.

In our case, for example, this principle works as a modification to the well-known concept of distance-based localization. Several works have followed this line, mainly in history matching applications [56, 57] but with a different approach. We believe that EnKF-KA and EnTLHF-KA possess sufficiently interesting characteristics to be applied and tested in areas other than that shown in this work.

5. Conclusion

This study introduces the concept of robustness from control and systems theory to a family of DA techniques. We aimed at the natural development of a family of filters that not only avoids spurious correlations but is also generalizable, computationally efficient, and very robust, inspired by real-life complex systems [15, 19]. We developed the intuition for adding H_∞ robustness to a shrinkage-based estimator, finding a simple and very understandable solution. Using a low-scale model implementation, easily extendable, for example, to biological systems [58–60] or closed-loop estimators for biotechnological processes [61, 62], we compared the proposed method's robustness and performance against the standard EnKF, the shrinkage-based EnKF-KA, and the robust filter EnTLHF. The EnTLHF-KA has lower RMSE values than the other methods in conditions with high observation and model errors. When the number of ensemble members is small, the shrinkage estimator gives a better approximation of the background covariance matrix than the sample covariance matrix, generating lower errors in both shrinkage-based algorithms, especially in the EnTLHF-KA. The combination of the non-Gaussian shrinkage estimator and the adaptive inflation grants higher robustness to the EnTLHF-KA when the ensemble distribution is non-Gaussian.

Additionally, we presented an application using the chemical transport model LOTOS-EUROS over a densely populated valley. The proposed method outperforms the standard LETKF, especially in places with complex orography. Incorporating the orography characteristics in the DA through a target matrix limits the influence of observations on grid cells that are far away in vertical distance. The final result can be understood as a localization scheme that depends not only on the horizontal distance, but also on the change in orography. The robustness of the EnTLHF-KA allows a high similarity between the simulated and observed PM_2.5 concentrations, even with a small ensemble size and an incomplete representation of the system uncertainties. The model's forecasting capabilities are also improved, achieving a good representation of the concentrations on the first forecast day and remaining acceptable until the third day. After assimilation, the model is an accurate tool for forecasting alerts for high levels of air pollution.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

Author Contributions

SL-R: conceptualization, methodology, software, and writing—original draft. AY: methodology and software. NP: conceptualization, methodology, writing—review, and editing. OQ: conceptualization, methodology, writing—original draft, editing, and supervision. AS: methodology, software, writing—review, and editing. AH: writing—review, editing, and supervision. All authors have read and agreed to the published version of the manuscript.

Conflict of Interest

SL-R and AY were employed by the company SimpleSpace.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

The authors acknowledge the supercomputing resources made available by the Centro de Computación Científica Apolo at Universidad EAFIT (http://www.eafit.edu.co/apolo) to conduct this work.

Abbreviations

DA, Data Assimilation; KF, Kalman Filter; EnKF, ENsemble Kalman Filter; LETKF, Local Ensemble Transform Kalman Filter; KA, Knowledge-Aided; EnKF-KA, Ensemble Kalman Filter Knowledge-Aided; HF, H_∞ Filter; EnTLHF, ENsemble Time Local H_∞ Filter; EnTLHF-KA, ENsemble Time Local H_∞ Filter Knowledge-Aided; RMSE, Root Mean Square Error; CTM, Chemical Transport Model; LE, LOTOS-EUROS simulation without data assimilation; LE-LETKF, LOTOS-EUROS simulation using the LETKF; LE-KA, LOTOS-EUROS simulation using the EnKF-KA; LE-Robust, LOTOS-EUROS simulation using the EnTLHF-KA; BS, Bottom Stations; OS, Outskirts Stations.

References

1. Lahoz WA, Schneider P. Data assimilation: making sense of earth observation. Front Environ Sci. (2014) 2:16. doi: 10.3389/fenvs.2014.00016
