A survey on Bayesian nonparametric learning for time series analysis

Vélez-Cruz, Nayely

doi:10.3389/frsip.2023.1287516

REVIEW article

Front. Signal Process., 16 January 2024

Sec. Statistical Signal Processing

Volume 3 - 2023 | https://doi.org/10.3389/frsip.2023.1287516

This article is part of the Research TopicWomen in Signal Processing 2023View all 4 articles

A survey on Bayesian nonparametric learning for time series analysis

Nayely Vélez-Cruz*

School of Electrical, Computer and Energy Engineering, School of Complex Adaptive Systems, Arizona State University, Tempe, AZ, United States

Time series analysis aims to understand underlying patterns and relationships in data to inform decision-making. As time series data are becoming more widely available across a variety of academic disciplines, time series analysis has become a rapidly growing field. In particular, Bayesian nonparametric (BNP) methods are gaining traction for their power and flexibility in modeling, predicting, and extracting meaningful information from time series data. The utility of BNP methods lies in their ability to encode prior information and represent complex patterns in the data without imposing strong assumptions about the underlying distribution or functional form. BNP methods for time series analysis can be applied to a breadth of problems, including anomaly detection, noise density estimation, and time series clustering. This work presents a comprehensive survey of the existing literature on BNP methods for time series analysis. Various temporal BNP models are discussed along with notable applications and possible approaches for inference. This work also highlights current research trends in the field and potential avenues for further development and exploration.

1 Introduction

Time series data collection has become increasingly prevalent in recent years across a range of industries, including finance, healthcare, and social media. The growth of cloud computing platforms has also facilitated the storage and processing of large and high-dimensional time series data. Time series analysis is thus becoming a rapidly growing field. Several challenges in this area include anomaly or change point detection, clustering multiple time series based on similar underlying patterns, and making predictions from time series with missing values or irregular sampling. To this extent, Bayesian nonparametric (BNP) methods are gaining traction for their power and flexibility in modeling, forecasting, handling missing values, and extracting meaningful information from time series data. The utility of BNP methods lies in their ability to encode prior information and represent complex patterns in the data without imposing strong assumptions about the underlying distribution or functional form. This makes them well-suited for a large range of time series problems where traditional models are too restrictive.

Bayesian nonparmetric methods center on the construction of statistical models over infinite-dimensional probability spaces. Unlike parametric methods, which assume a specific form for the underlying data distribution, BNP methods allow the model to learn from the data and adapt in complexity accordingly. BNP models use stochastic processes as their building blocks, the main ones being Dirichlet processes, Beta processes, and Gaussian processes. The rest of this work is organized as follows: In Section 2, I summarize the standard BNP models, how they are employed for various time series analysis problems, and highlight some important extensions. Section 3 centers on current state-of-the art developments, which focus on integrating BNP methods with deep learning for the analysis of high-dimensional data, as well as scaling these methods to large datasets. Section 4 highlights three areas for practical application: object tracking, healthcare and biomedical data analysis, and speech signal processing. I conclude by discussing some potential avenues for further exploration. Refer to Table 1 for the list of acronyms used throughout this work.

TABLE 1

TABLE 1. Table of acronyms.

2 Standard nonparameric priors

2.1 Dirichlet processes and their extensions

At the core of BNP methods is the Dirichlet process (DP), which can be thought of as a probability distribution over the space of probability distributions. More formally, given a base distribution H over a measurable space Ω and a positive real number α, a random distribution $G$ is said to be a draw from a DP, $G \sim DP (α, H)$ , if for any finite partition A₁, A₂, … , A_k of Ω, the vector $[G (A_{1}), \dots, G (A_{k})]$ is Dirichlet-distributed with parameters [αH (A₁), …, αH (A_k)]. The most commonly-used constructive definitions of the DP are the Chinese Restaurant Process (CRP) and Pólya urn, which are obtained after marginalizing out the random measure $G$ , as well as the stick-breaking representation. They offer intuitive ways to understand the properties of DPs and have led to the development of straightforward inference algorithms based on Markov Chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) methods (Das, 2014). These representations are discussed in Teh (2007). The main applications of Dirichlet processes are for clustering and density estimation (Rabaoui et al., 2011; Görür and Edward Rasmussen, 2010; Caron et al., 2012; Li et al., 2019). The primary advantage of using DPs for clustering is that one does not have to specify the number of clusters a priori, whereas density estimation tasks do not require a priori specification of a functional form.

Extensions of the standard DP have been developed for problems in which the distribution of the observations is assumed to change in time, leading to several constructions of time-dependent DPs. These are well-suited for evolutionary clustering tasks, where the number of clusters and their associated parameters can vary with time (Caron et al., 2012; Ahmed and Xing, 2008; Ren et al., 2008; Zhu et al., 2005; Moraffah and Papandreou-Suppappola, 2022). The construction of time-dependent DPs is based on the dependent Dirichlet process (DDP), which is a stochastic process indexed by covariates such as space or time (Griffin and Steel, 2006). This construction can be done through Poisson processes (Campbell et al., 2013; Lin et al., 2010), stick-breaking (Campbell et al., 2013), or through Pólya urn and CRP (Caron et al., 2012; Ahmed and Xing, 2008). In a time-dependent DP, dependency is introduced between successive mixing distributions indexed by time t, ${G_{t}}_{t = 1}^{T}$ , where t = 1, … , T is a discrete-time index. Following the setup in Caron et al. (2012), let $y_{t} = [y_{1, t}, \dots, y_{n_{t}, t}]$ denote a vector of n_t observations at time t which are independently and identically distributed (iid) samples from Eq. 1:

F_{t} (\cdot) = \int_{Θ} f (y_{t} | θ_{t}) d G_{t} (θ_{t}) (1)

where F_t is an unknown pdf to be estimated and Θ is the set of latent parameters for the mixed pdf f (y_t|θ_t). The random probability measure (RPM) $G_{t}$ is DP-distributed according to Eq. 2:

G_{t} \sim DP (α, H) (2)

Note that $G_{t}$ also satisfies the following stick-breaking construction (Eq. 3):

G_{t} = \sum_{k = 1}^{\infty} V_{k, t} δ_{θ_{k, t}} (3)

where k denotes the cluster index, $V_{k, t} = β_{k, t} \prod_{j = 1}^{k - 1} (1 - β_{j, t})$ , and $β_{k, t} \overset{iid}{\sim} β (1, α)$ . The hierarchical model is given by Eqs 4, 5:

θ_{k, t} | G_{t} \overset{iid}{\sim} G_{t} (4)

y_{k, t} | θ_{k, t} \overset{iid}{\sim} f (y_{k, t} | θ_{k, t}) (5)

Time-dependency can be introduced through a stochastic process on the weights (Nieto-Barajas et al., 2012; Griffin and Steel, 2006), and/or atoms (Griffin and Steel, 2009). Note that in the construction in Eq. 3, time-dependency is introduced in both. In Mena and Ruggiero (2016), dependency is introduced on the weights by replacing the stick-breaking construction with a one-dimensional Wright-Fisher diffusion. Another diffusion process construction based on Fleming-Voit diffusion is introduced in Ascolani et al. (2021). In Rodriguez and Ter Horst (2008), dependency is introduced on the atoms through a dynamic linear model. In Lucca et al. (2013), dependency is introduced on the atoms through a simple linear autoregressive process, and an extension based on the Ornstein-Uhlenbeck process is developed. In Campbell et al. (2013), dependency is introduced on both the weights and atoms through a Poisson process construction. Although these approaches focus on discrete-time, an extension to the continuous time domain based on geometric stick-breaking processes is introduced in Mena et al. (2011), where dependency is introduced on the weights. These are summarized in Table 2.

TABLE 2

TABLE 2. Summary of time-dependent Dirichlet processes based on the stick-breaking construction. Time-dependency on the weights and/or atoms, the specific construction, and inference algorithm are listed. See reference associated with each model for more details.

Many constructions of time-dependent DPs are based on the Pólya urn scheme or the CRP, which facilitate efficient inference (Caron et al., 2012; Ahmed and Xing 2008; Zhu et al., 2005; Blei and Frazier, 2011). It is important to note how the cluster assignment probabilities are computed in these various methods. Time-dependency alters the standard formulas for calculating the probabilities of assigning new data to a cluster, as they become dependent in some way on the number of times a cluster has been chosen at previous time points. For example, the model may take into account the entire history’s cluster assignments for t = 1: t − 1 (Caron et al., 2017; Caron et al., 2012), the previous time point’s (t − 1) assignments (Özkan et al., 2011; Ahmed and Xing, 2008), or be based on some other metric (Saad and Mansinghka, 2018). These are summarized in Table 3. In Zhu et al. (2005), the authors develop a time-sensitive DP model for time-varying topic modeling. They introduce a decaying exponential weight function into the probabilities of assigning data to a cluster which incorporates the entire history of previous cluster assignments. This model is quite flexible as it allows different clusters to have different decay rates and can capture different types of dynamic behavior such as periodicity, but it is not consistent under marginalization. In Ahmed and Xing (2008), the authors introduce a time-dependent DP based on the recurrent CRP. This approach assumes that the data arrive in T consecutive epochs, where data in the same epoch are assumed to be fully exchangeable and there are an infinite number of clusters in each epoch. The cluster assignment probabilities are computed by taking into account the previous epoch’s (t − 1) cluster assignments and the number of points already assigned to the cluster in the current epoch rather than the entire previous history. An advantage of this approach is that it captures time-varying cluster popularity. Other notable works include the distance dependent CRP, which captures the property that data points which are closer in time are more likely to cluster together (Blei and Frazier, 2011). The model introduced in Ren et al. (2008) also exhibits this property while simultaneously allowing the possibility of repetition, as temporally distant data may share parameters. Many of these time-dependent DPs also allow clusters to stay, re-emerge, and die out over time (Caron et al., 2017; Lin et al., 2010). These dynamics can be incorporated through a cluster removal step as in Caron et al. (2017), Caron et al. (2012), which has been used for time-varying density estimation (Jaoua et al., 2014; Rodriguez and Ter Horst, 2008).

TABLE 3

TABLE 3. Summary of time-dependent Dirichlet processes based on the Pólya urn or CRP constructions. Note the differences in computing the cluster assignment probabilities.

Of particular note are more recent works which focus on clustering multiple time series exhibiting similar behaviors. In Lin et al. (2019), the authors introduce the DP nonlinear State-Space Mixture (DPnSSM), which clusters multiple time series which exhibit nonlinear dynamics. By placing DP priors over the unknown parameters in the nonlinear transition dynamics, the model induces clustering of the time series based on their specific dynamics. Similar work by Nieto-Barajas and Contreras-Cristán (2014) employs a Poisson-Dirichlet process mixture model which can use trends, seasonality, or temporal components as a basis for clustering. Interestingly, Saad and Mansinghka (2018) introduce a temporally re-weighted CRP and a hierarchical extension for forecasting, missing data imputation, and clustering multivariate time series. Their approach identifies similar segments within individual time series, and is then used to cluster hundreds of time series into groups, where each group’s underlying dynamics are modeled jointly. Several approaches for specifically clustering biological time series have also been developed. In McDowell et al. (2018), the authors use a DP mixture of Gaussian processes to cluster gene expression time series. A Gaussian process prior, which will be discussed in depth in Section 2.2, is placed over the unknown transition dynamics while the DP allows clustering of gene expression time series data based on these dynamics. In Yu et al. (2016), fetal heart rate signals are clustered using the hierarchical Dirichlet process (HDP), which facilitated the identification of clusters specific to diseased and healthy fetuses. Overall, these methods have been successful in identifying shared features in multiple time series across a variety systems and have been applied to a diverse range problems including multiple object tracking, evolutionary topic modeling, and video segmentation (Neiswanger et al., 2014; Moraffah et al., 2019; Srebro and Roweis, 2005; Barker and Davis, 2014).

2.1.1 Hidden markov models

An important extension of the Dirichlet process for the analysis of time series data is its application to Hidden Markov Models (HMMs). Unlike the traditional HMM, which requires the number of hidden states to be specified a priori, incorporating a DP prior on the state transition dynamics provides a distribution over an infinite number of states. This allows the number of hidden states to be learned directly from the data. The HDP-HMM and its extensions have a range of applicability, including speech recognition, image segmentation, and genomics (Fox et al., 2011b; Teh and Jordan, 2010; Yau et al., 2011). A summary of these models is given in Table 4.

TABLE 4

TABLE 4. Summary of Bayesian nonparametric HMMs. See reference associated with model of interest for more information.

2.1.1.1 Infinite hidden markov model (iHMM)

One early development is the infinite hidden Markov Model (iHMM) (Beal et al., 2001), which uses a two-level DP to model the state transition dynamics (the transition probabilities for each row of the transition matrix). The iHMM is constructed via a coupled set of Pólya urns. Specifically, the transition from state s_t to s_t+1 is modeled as a DP with concentration β according to Eq. 6:

P (s_{t + 1} = j | s_{t} = i, n, β) = \frac{n_{i j}}{\sum_{l = 1}^{K} n_{i l} + β} (6)

where n_ij is the number of transitions from state i to j. Novel transitions occur with a finite probability $\frac{β}{\sum_{l = 1}^{K} n_{i l} + β}$ . When this occurs, the model draws a state from a second DP (Eq. 7) with concentration parameter γ and counts $n^{o}$ , referred to as the “oracle”

P (s_{t + 1} = j | s_{t} = i, n^{o}, γ) = \{\begin{cases} \frac{n_{j}^{o}}{\sum_{l = 1}^{K} n_{l}^{o} + γ} & If j is within the set of states \{1, \dots, K\} \\ \frac{γ}{\sum_{l = 1}^{K} n_{l}^{o} + γ} & If j is a new state \end{cases} (7)

A prior mass α is used to control the probability of self transitions. To complete the model, parameters θ_k for the emission distribution corresponding to each state k are drawn from a base distribution H and the likelihood is given as $F (y_{t} | s_{t}, θ_{s_{t}})$ . With a DP prior, there are a small number of states which appear few times. To incorporate power law behavior in the model, a Pitman-Yor process prior can be used instead (Van Gael, 2012). An important extension of the iHMM is the input/output iHMM (Van Gael, 2012), where the hidden states s_t, observations y_t, or both can be dependent on a set of inputs e_t. Although this can capture the effects of exogenous inputs on the system dynamics, the number of parameters to be learned increases to $O (E K^{2})$ compared with $O (K^{2})$ for the standard iHMM.

2.1.1.2 Hierarchical Dirichlet process-hidden markov model (HDP-HMM)

An alternative construction is the Hierarchical Dirichlet Process-Hidden Markov Model (HDP-HMM) (Teh et al., 2004). In an HDP-HMM each row in the transition probability matrix corresponds to a draw from an HDP where an HDP prior is used over an infinite state space. This allows outgoing states to share transitions into the same set of states. The generative model is constructed via stick-breaking and the hierarchy is given by Eqs 8–12 as follows.

θ_{k} | H \sim H, for k = 1, \dots, \infty (8)

β | γ \sim GEM (γ) (9)

π_{k} | α_{0} \sim DP (α_{0}, β) (10)

s_{t} | s_{t - 1}, π_{s_{t - 1}} \sim π_{s_{t - 1}} (11)

y_{t} | s_{t}, θ_{s_{t}} \sim F_{θ_{s_{t}}} (12)

Where H is the HDP base distribution, $π_{s_{t - 1}}$ is the vector of transition probabilities going from s_t−1 into a new sate, α₀, β, γ are the HDP hyperparameters, and $θ_{s_{t}}$ are the emission distribution parameters which associate each observation y_t with a state s_t. Teh et al. (2004) showed that the Chinese restaurant franchise representation of the HDP-HMM is equivalent to the iHMM. Although, in general, the HDP-HMM encourages sparse transition matrices, an issue is that it switches too rapidly between redundant states. This is addressed by the “sticky” HDP-HMM.

2.1.1.3 “Sticky” HDP-HMM and extensions

Influential work by Fox et al. (2011a) extended the HDP-HMM to capture state persistence, known as the “sticky” HDP-HMM (Eq. 13). It is given by

π_{j} | β, α, κ \sim DP (α + κ, \frac{α β + κ δ_{j}}{α + κ}) (13)

where the parameter κ captures state persistence by increasing the probability of self-transition. The sticky HDP-HMM prior allows the dynamical model to switch between an unknown number of states while preventing the model from switching too quickly between redundant states. Two extensions of the “sticky” HDP-HMM were introduced in Fox et al. (2011a) to learn switching linear dynamical systems, where each state in the HMM is associated with a linear dynamical model. The first is the HDP-Autoregressive-HMM (HDP-AR-HMM), which places an HDP prior on the mode-specific matrices ${A_{1}^{(k)}, \dots, A_{r}^{(k)}}$ and noise covariances Σ^(k). The observation dynamics are modeled by Eq. 14

y_{t} = \sum_{i = 1}^{r} A_{i}^{(s_{t})} y_{t - i} + e_{t}^{(s_{t})} (14)

where e_t is the noise vector and the mode dynamics are given by $s_{t} | s_{t - 1} \sim π_{s_{t - 1}}$ . The second extension is the HDP-Switching Linear Dynamical Systems (SLDS), which is a linear state space model with unknown states x_t. It is given by Eqs 15, 16

x_{t} = A^{(s_{t})} x_{t - 1} + e_{t}^{(s_{t})} (15)

y_{t} = C x_{t} + w_{t} (16)

Where w_t is the observation noise and C is the measurement matrix. Again, the mode dynamics are given by $s_{t} | s_{t - 1} \sim π_{s_{t - 1}}$ . Recent work by Hong et al. (2023) has applied the sticky HDP-HMM to the problem of attack detection in cyber-physical systems, where the attacks and their lengths are unknown. An advantage of this approach is that it can be applied to data that is of mixed continuous and categorical types.

2.1.1.4 Beta process-autoregressive HMM (BP-AR-HMM)

An issue with the HDP-HMM arises when applied to the analysis of multiple time series in that it assumes that all time series sequences share the same sets of states. To address this problem and to increase the flexibility of these methods, another type of Bayesian nonparametric HMM has been developed using the Beta process (BP), referred to as the Beta Process-Autoregressive HMM (BP-AR-HMM) (Fox et al., 2013). Note that the Beta process is used as a nonparametric prior over latent binary feature matrices. This construction captures shared dynamical behaviors across multiple time series and indicates which behaviors are exhibited by each time series (Fox et al., 2013). The hierarchical model is given by Eqs 17–21:

B | B_{0} \sim BP (1, B_{0}) (17)

X_{i} | B \sim BeP (B) (18)

π_{j}^{(i)} | f_{i}, γ, κ \sim Dir ([γ, \dots, γ + κ, γ, \dots] \otimes f_{i}) (19)

s_{t}^{(i)} \sim π_{s_{t - 1}^{(i)}}^{(i)} (20)

y_{t}^{(i)} = \sum_{j = 1}^{r} A_{j, s_{t}^{(i)}} y_{t - j}^{(i)} + e_{t}^{(i)} (s_{t}^{(i)}) (21)

Where the global weights are provided by B, which is a draw from a Beta Process (BP), and X_i is a realization of a Bernoulli process such that $X_{i} = \sum_{k} f_{i k} δ_{θ_{k}}$ , f_ik is the binary indicator for the ith time series and kth feature in the vector x. Note that κ is the parameter capturing state persistence in the “sticky” HDP-HMM. Finally, 21 is the vector autoregressive process describing the dynamics of the observations. The main advantage of this approach is that it allows each time series to exhibit a subsest of a larger collection of states observed over all sequences while allowing hidden states to be shared across sequences (Niekum 2015; Fox et al., 2013). Furthermore, jointly modeling the time series can help supplement limited data and can improve parameter estimates by pooling information from other time series, similar to multi-task learning. This circumvents the need for model selection since the number of modes is learned directly from the data.

2.1.1.5 HDP-hidden semi-markov model (HDP-HsMM)

In the standard HDP-HMM and “sticky” HDP-HMM, the state durations are geometrically distributed and the self-transition parameter is shared across all states. These assumptions are not suitable for situations where we may want to capture bimodal or multimodal state distributions. For instance, in modeling the behavior of consumers in a shopping mall, some users might briefly check something and leave, while others might spend a long time browsing. An alternative to the “sticky” HDP-HMM is the HDP-Hidden Semi-Markov Model (HDP-HsMM), which incorporates explicit duration semi-Markovianity by placing a distribution over the state duration (Johnson and Willsky, 2012). Once the state is reached, a duration time is drawn from the duration distribution and the system stays in the state until the duration period ends before transitioning to a new state. Let ${(z_{s})}_{s = 1}^{S}$ denote S super-states, which emit segments of observations y_{t(s):t(s+1)−1}. As well, let ${(x_{t})}_{t = 1}^{T}$ denote a label sequence, where each x_t corresponds to an observation. The model (Eqs 22–29) is given as:

β | γ \sim GEM (γ) (22)

π_{k} | β, α \sim DP (α, β), for k = 1, \dots, \infty (23)

θ_{k} | H, λ \sim H (λ) (24)

ω_{k} | Ω \sim Ω (25)

z_{s} | {\{π_{k}\}}_{j = 1}^{\infty}, z_{s - 1} \sim {\tilde{π}}^{(z_{s - 1})} (26)

D_{s} | ω \sim D (ω_{z_{s}}) (27)

x_{t (s) : t (s + 1) - 1} = z_{s} (28)

y_{t (s) : t (s + 1) - 1} \overset{iid}{\sim} f (θ^{(z_{s})}) (29)

Where w_k are duration distribution-specific parameters for each state k = 1, … ∞, ${\tilde{π}}^{(z_{s - 1})}$ are transition probabilities constructed by removing the atom corresponding to z_s−1 so that there are no self-transitions in the superstate sequence. The HDP-HsMM was shown to mix faster compared to the sticky HDP-HMM when applied to the problem of speaker diarization (Johnson and Willsky, 2012).

2.2 Gaussian processes

Another set of fundamental BNP methods for analyzing time series data is based on the Gaussian process (GP). The GP is formally defined as a potentially infinite collection of random variables such that the joint distribution of any finite subset is a multivariate Gaussian. It is used as a prior over unknown functions. A draw from a GP is denoted as f(x) ∼GP(m_f(x), k_f(x, x′)), where $x \in R^{N}$ , m_f(x) is the mean function and k_f(x, x′) is the covariance function. Given the observations of the function f on x, the predictive distribution of f* = f(x*) at new inputs x* is given by Eq. 30

p (f^{*} | x^{*}, f, x) = N (f^{*} | μ_{post}, Σ_{post}) (30)

The posterior mean and covariance are given by Eqs 31, 32:

μ_{post} = m_{x^{*}} + K_{x^{*}, x} {[K_{x, x} + σ^{2} I]}^{- 1} y (31)

Σ_{post} = K_{x^{*}, x^{*}} - K_{x^{*}, x} {[K_{x, x} + σ^{2} I]}^{- 1} K_{x, x^{*}} (32)

One of the key strengths of GPs is in their ability to incorporate prior knowledge about a range of interesting dynamics, such as change points, periodicity, delays, long and short-term dynamics, smooth variation and more (Roberts et al., 2013; Saad et al., 2023). Moreover, they are suitable for handling missing data, modeling errors, quantifying uncertainty, and tend to be more robust to overfitting, making them a powerful tool. Some notable uses of GPs in the realm of time series analysis are for system identification in state-space models (SSMs), where the GP is used as a prior over the transition dynamics and/or measurement function. This is a particularly challenging problem because the unknown transition function depends on the unknown state at time t. In general, for unknown states x_t at time t, measurements y_t, and unknown transition functions f, the state space formulation is given by the following hierarchy in Eqs 33–35:

f (x) \sim GP (m_{f} (x), k_{f} (x, x^{'})) (33)

x_{t} = f (x_{t - 1}) + w_{t - 1} (34)

y_{t} = g (x_{t}) + v_{t} (35)

Where w_t−1 and v_t are typically additive Gaussian noise terms for the process and measurement dynamics, respectively. Several works have tackled the system identification issue using different strategies (Eleftheriadis et al., 2017; Frigola et al., 2013; Frigola et al., 2014). In Deisenroth et al. (2012), the authors introduce a novel method for computing a Gaussian approximation to the smoothing distribution referred to as the Rauch-Tung-Striebal (RTS) smoother, which outperforms popular smoothing algorithms such as the extended Kalman filter (EKF) and unscented Kalman filter (UKF). In Frigola et al. (2013), the authors introduce a multi-step approach which involves marginalizing out the unknown function f and drawing a sample from the smoothing distribution using particle Gibbs with ancestor sampling (PGAS). Thus, the unknown states can be sampled from the smoothing distribution without knowledge of the transition function. Follow-up work in Frigola et al. (2014) introduces a more efficient learning approach based on variational sparse GPs, which reduces the computational complexity to be linear in the length of the time series, making it faster to compute predictions of future trajectories. As well, GP approaches have been shown to effectively predict chaotic time series (Petelin and Kocijan, 2014). In Aalto et al. (2018); McDowell et al. (2018), GPs were used to infer underlying dynamics which were then used as a basis for clustering gene regulatory network time series.

Although appealing, one drawback of these methods is that they suffer from scalability issues, making them unsuitable for large and high-dimensional datasets. However, methods for improving scalability will be discussed in Section 3.4. As well, GPs cannot model multimodal or heavy-tailed marginal distributions. This can make them less robust to outliers and results in less accurate uncertainty estimates. Instead, a Sudent’s-t process may be more appropriate for such situations (Tracey and Wolpert, 2018), which has an extra parameter controlling the kurtosis of the distribution. Lastly, the expressivity of GPs is limited by the choice of kernel, making them unsuitable to learn complex relationships or features from time series data. An alternative to these last two issues is to use a deep learning BNP approach, which can equip GPs with the power to learn complex representations from data. This will be discussed in Section 4.1.

2.3 Multi-output GPs

Standard GPs, also called single-output GPs, model a single output variable as a function of input variables. A useful extension of GPs is the multi-output GP (MOGP), where an input or set of inputs can have multiple correlated outputs. For example, in a healthcare scenario, changing the drug dosage may have an affect on heart rate, blood sugar, and cholesterol levels. In such a scenario, the MOGP would model each output (heart rate, blood sugar, and cholesterol levels) jointly. The joint modeling leads to better predictive performance compared to single-output GPs. As well, sharing information between outputs typically reduces overfitting and is particularly useful when some outputs have limited data. To construct an MOGP, one just needs to specify a covariance kernel over the outputs in addition to one over the inputs. For k = 1, … , K outputs, the joint prior is given by Eqs 36, 37:

f (x) = [\begin{matrix} f_{1} (x) \\ f_{2} (x) \\ ⋮ \\ f_{K} (x) \end{matrix}] \sim G P ([\begin{matrix} m_{1} (x) \\ m_{2} (x) \\ ⋮ \\ m_{K} (x) \end{matrix}], K (x, x^{'})) (36)

f (x) \sim M O G P (m (x), K (x, x^{'})) (37)

K (x, x^{'}) = [\begin{matrix} k ((x, 1), (x^{'}, 1)) & k ((x, 1), (x^{'}, 2)) & \dots & k ((x, 1), (x^{'}, K)) \\ k ((x, 2), (x^{'}, 1)) & k ((x, 2), (x^{'}, 2)) & \dots & k ((x, 2), (x^{'}, K)) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ k ((x, K), (x^{'}, 1)) & k ((x, K), (x^{'}, 2)) & \dots & k ((x, K), (x^{'}, K)) \end{matrix}] (38)

where in general, the covariance kernel in Eq. 38 is given by the following multiplication of two separable kernels (Eq. 39):

k ((x, k), (x^{'}, k^{'})) = k_{input} (x, x^{'}) \times k_{output} (k, k^{'}) (39)

The regression model is specified as follows

y_{k} (x) = f_{k} (x) + ϵ_{k} (40)

ϵ_{k} \sim N (0, σ_{k}^{2})

The model in Equation 40 can easily be extended to time series through the addition of a time index as y_k,t = f_k (x_t) + ϵ_k,t. Note that the covariance function takes into account correlations between inputs x and x′ as well as correlations between outputs k and k′. Through the construction of the covariance kernel, it is possible to simultaneously capture a range of underlying structures. In general construction can be as straightforward as adding or multiplying two or more separable kernels (e.g., multiplying a periodic kernel by a squared exponential kernel), or can be done through coregionalization models in which one defines a matrix B where entry (i, j) describes how outputs i and j are correlated with each other (Liu et al., 2022a). In addition to leveraging information from more data-rich outputs to help inform other data-poor outputs, MOGPs can also be applied to problems involving heterogeneous outputs (e.g., continuous, binary, and categorical) (Moreno-Muñoz et al., 2018). As well, they have been applied to multi-fidelity datasets with multiple correlated outputs (Lin et al., 2021). Unfortunately, the computational complexity of MOGPs is $O (Q^{3} N^{3})$ for Q outputs due to the operations on the covariance matrices. Alternatively, using a convolution process, which results in a non-separable kernel, can capture dependencies between outputs in a sparse GP framework and reduces the computational complexity to $O (N^{3} Q)$ (Alvarez and Lawrence, 2008). Sparse GPs are also used in Moreno-Muñoz et al. (2018); Lin et al. (2021). MOGPs have exhibited success particularly in the realm of biomedicine and can aid in predicting patient health outcomes (Cheng et al., 2020).

2.3.1 Multi-task GP

A notable application of MOGPs is to multi-task learning. In multi-task learning, the objective is to improve the performance of multiple related learning tasks by pooling the information across the different tasks (Bonilla et al., 2007). The underlying assumption is that the tasks are not completely independent and can benefit from the knowledge contained in one another. It should be distinguished from transfer learning, which aims to use the knowledge gained from one or more source tasks to help learn a different, but related target task. In contrast, multi-task learning aims to improve the learning of multiple related tasks by learning them simultaneously, allowing the model to leverage shared information across the tasks. Multi-task learning therefore focuses on the identification of shared structures, making it suitable for multivariate time series analysis. Bayesian nonparametric methods for multi-task learning leverage the power of Gaussian processes to extract knowledge from multiple related time series, resulting in the multi-task GP (MTGP).

The key concept underlying the MTGP is the construction of a multi-task covariance kernel, which characterizes the correlation within and between tasks. This proceeds in the same manner as in a MOGP, except that each output k is now a task. Although better able to capture rich structures in data compared to standard GPs, these methods suffer from a few drawbacks. First, it is assumed that all tasks are equally important which may not be the case in real-world applications. Furthermore, these methods are computationally intensive due to the large number of parameters involved, as the covariance matrix associated each task or time series will have its own set of hyperparameters. Lastly, the complexity associated with the inversion of the covariance matrix makes it unsuitable for large and high-dimensional time series data. To address these issues, recent work has focused on developing deep learning paradigms with sparse inference methods for BNP multi-task learning, which will be discussed in Section 3.

2.3.2 Multi-resolution GP

Many real-world problems involve the analysis of data collected from different sources with varying resolutions or where the underlying phenomenon may have spatial and/or temporal multi-scale features. For example, EEGs may exhibit long-term patterns relating to circadian rhythms, but abrupt changes or sharp spikes may also be present due to a seizure. A limitation of standard GPs is that although they can capture long-range dependencies and sudden changes, they cannot do so simultaneously. To address this limitation, the multi-resolution GP (mGP) was introduced Fox and Dunson (2012), which is constructed by coupling a collection of smooth GPs in a hierarchical manner. Each GP is defined over an element of a nested partition $A = {A^{0}, A^{1}, \dots, A^{l - 1}}$ . A global GP is introduced to capture long-range dependencies f⁰ ∼GP(0, c⁰) and GPs are drawn independently from Eq. 41 for each $A_{i}^{l}$ as

f^{l} (A_{i}^{l}) \sim GP (f^{l - 1} (A_{i}^{l}), c_{i}^{l}) (41)

where c_i is the covariance function. Recent work in Longi et al. (2022) applied the mGP framework to model the effects of multiple time scales in GP-SSMs, which can capture slow and fast transitions. mGPs have also been extended to the multi-task setting in Hamelijnck et al. (2019).

2.4 Posterior inference

Approaches for posterior inference are based on Markov Chain Monte Carlo (MCMC) or variational inference (VI). MCMC methods are asymptotically exact, but can be slow and difficult to scale, particularly in high-dimensional settings. The most commonly-used MCMC method is Gibbs sampling, which iteratively samples from the conditional distributions of the variables of interest and constructs a Markov chain whose stationary distribution is the target posterior distribution. In contrast, VI is an optimization-based approach, which seeks the distribution within the chosen variational family that is closest to the true posterior, as measured by the Kullback-Leibler (KL) divergence. VI is scalable to large datasets but can be more prone to approximation errors as the accuracy depends on the choice of the variational family. A stochastic extension of VI (SVI) is often used in conjunction with sparse Gaussian processes for scalability to large datasets. This section discusses both classes of methods. Table 5 provides references for the reader to further explore inference methods associated with their model of interest.

TABLE 5

TABLE 5. Summary of inference approaches available for the three classes of BNP models.

2.4.1 Markov chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) methods generate samples from a target posterior distribution $π_{(} x)$ by constructing a Markov chain whose stationary distribution is the posterior. Gibbs sampling is widely used for BNP model inference due its ease of implementation and theoretical accuracy. When using conjugate priors, the conditional posteriors which are iteratively sampled from are obtained in closed form. However, the Gibbs sampler is slow to converge and tends to exhibit poor mixing as the chain may get stuck in posterior modes. To make the computations more tractable, one can set a truncation level where the infinite mixture is approximated by a finite mixture with T components. Unfortunately, this method may be prone to approximation errors. A solution is to introduce an adaptive truncation method which adapts to the complexity of the data (Griffin 2016). Another alternative is to use slice sampling, which proceeds by sampling uniformly from horizontal “slices” in the target posterior distribution (Neal, 2003). Although slice sampling does not require the use of conjugate priors, can be parallelized, and is straightforward to implement (Zhu et al., 2020), its computational efficiency and accuracy decreases in high dimensions. Slice sampling is also less effective when there are high correlations between variables. An MCMC method which outperforms slice sampling in high-dimensional settings is Hamiltonian Monte Carlo (HMC), which updates parameters using a mechanism derived form Hamiltonian dynamics. HMC efficiently moves through the parameter space by exploiting the gradient of the log-probability distribution to guide the construction of proposals. Furthermore, due to the use of Hamiltonian dynamics, the HMC can make large steps in the parameter space instead of making small local moves and is well-suited to dealing with highly correlated variables (Betancourt, 2017). It is less likely to get stuck in posterior modes since all of the parameters are updated simultaneously. However, the gradient computation leads to scalability issues for large datasets and online settings.

2.4.2 Stochastic gradient MCMC (SGMCMC)

Stochastic Gradient MCMC (SGMCMC) presents a scalable alternative to standard MCMC methods by using noisy gradient estimates obtained from mini-batches of the data. Two such methods are stochastic gradient HMC (SGHMC) and stochastic gradient Langevin dynamics (SGLD), which introduce noise to the gradient computations of the log-posterior densities. In SGLD, Langevin dynamics are introduced into the standard stochastic gradient descent (SGD) and the parameters are updated by adding a Gaussian noise term (Nemeth and Fearnhead, 2021). The trajectory of the parameters resembles the behavior of particles undergoing Brownian motion as described by Langevin dynamics. As the name suggests, SGHMC is a stochastic extension of HMC. In Chen et al., (2014), a friction term is added to the momentum update to ensure that the Hamiltonian dynamics have the target distribution as the invariant distribution as simply injecting the noise to the gradient results in the target distribution no longer being the invariant distribution. The addition of the friction term thus stabilizes the Hamiltonian dynamics. An advantage of SGLD over SGHMC is that it is typically easier to implement and requires tuning fewer hyperparameters. However, since SGLD lacks the momentum component present in SGHMC, SGLD may be less efficient in exploring the parameter space in high-dimensional settings (Chen et al., 2014).

2.4.3 Optimization-based inference

Optimization-based methods, such as VI and stochastic variational inference (SVI), are often faster and outperform MCMC on large datasets at the expense of precision. The main idea underlying standard VI is to posit a family of simpler, tractable distributions (known as the variational family) and then find the member of this family that is closest to the target distribution (Blei et al., 2017). Optimization is performed through coordinate ascent or gradient descent on the entire dataset, making it computationally intensive for large datasets, although parallel VI is discussed in Campbell et al. (2015). In contrast, SVI operates similarly to SGMCMC by computing noisy estimates of the gradient on mini batches of the data, thereby making it more computationally efficient and well-suited to online settings. As in SGMCMC, the added stochasticity aids the model in escaping local optima. However, compared to VI, the added noise in SVI can lead to more errors. As well, the accuracy is influenced by the choice of minibatch size and the specific choices of the minibatches themselves may result in biased estimates.

2.5 Concluding remarks

Standard BNP approaches based on Dirichlet processes, Gaussian processes, and their most widely used extensions constitute a powerful statistical toolkit for a diverse range of time series analysis problems, such as within- and across-time series clustering, inferring unknown transition dynamics, and identifying change points in switching systems. As large and high-dimensional time series are becoming increasingly prevalent, more sophisticated approaches are needed to extract complex features and relationships from such data. The next section highlights some ongoing research trends in this regard as well as potential avenues for further exploration.

3 Current research trends

Although this work has highlighted several advantages of BNP methods for time series analysis, significant challenges still remain when analyzing both large and high-dimensional time series data that coalesce around the “curse of dimensionality”. This manifests in local kernel methods, such as the GP, as local kernels degenerate to one-nearest-neighbor classifiers in high dimensions (Agrawal, 2020; Bengio et al., 2005). As well, due to the lack of expressivity of the kernels, GPs lack the power to learn complex and abstract representations from high-dimensional datasets and instead act as smoothers (Ober et al., 2021). Dirichlet processes also face certain challenges in high-dimensional settings. Clusters may be more difficult to identify in high dimensions as distances between points become increasingly homogeneous and not all dimensions may be informative for clustering. Density estimation in high-dimensions also presents a challenge as large amounts of data are required to obtain meaningful estimates. In regard to inference, standard MCMC approaches also suffer from the “curse of dimensionality”, as data in high-dimensional settings can be sparse so the rate of convergence tends to decrease with an increase in dimension (Nagler and Czado, 2016). To address these challenges, current research trends center on the development of BNP deep learning methods for large and high-dimensional data. Integrating deep learning architectures and BNP methods enables the models to learn sophisticated representations of data which can improve predictions, provide reliable uncertainty estimates, and reduce the overfitting which deterministic neural networks are prone to. Sections 3.1 and 3.2 highlight two approaches: deep kernel learning (Wilson et al., 2016a; Al-Shedivat et al., 2017) and deep Gaussian processes (Damianou and Lawrence, 2013). However, an increase in model complexity necessitates computationally efficient methods. To this extent, another area of ongoing research centers on the development of scalable inference for BNP deep learning models, which will be discussed in Section 3.4.

3.1 Deep kernel learning

Pivotal research by Neal (1996) has demonstrated an important theoretical connection between GPs and neural networks, which is that infinitely wide neural networks converge to a GP. This result also extends to deep neural networks (Lee et al., 2017). His work provides a theoretical basis for which neural networks and GPs can be interwoven. Deep kernel learning (DKL) is one such manifestation. In DKL, a deep learning neural network is used as the kernel function of a GP. The neural network transforms high-dimensional inputs into a lower dimensional feature space representation where these features become an input into a GP. This confers a model with both of the advantages of GPs and neural networks: quantifying uncertainty and learning complex abstract representations. Equation 42 defines a deep kernel as follows:

k_{deep} (x, x^{'}) = k (ϕ (x; ω), ϕ (x^{'}; ω) | θ, ω) (42)

where $k : R^{D} \times R^{D} \mapsto R$ is any base kernel with hyperparameters θ and $ϕ : R^{M} \mapsto R^{D}$ is a nonlinear mapping given by a deep neural network with weights ω. Stochastic variational inference is typically used to train the model, making it amenable to large datasets (Wilson et al., 2016b). Other methods such as stochastic gradient Langevin dynamics (SGLD) and semistochastic block gradient descent can be used as well. DKL methods have garnered much attention in the machine learning community (Wilson et al., 2016b; Strohbeck et al., 2022; Botteghi et al., 2022), particularly for their reputation for avoiding overfitting through optimization of the marginal likelihood. However, recent work by Ober et al. (2021) has illuminated a fascinating issue. When the number of hyperparameters is large, the marginal likelihood becomes a poor training objective which in turn decreases the performance of the DKL model, resulting in overfitting. In these situations, the DKL model performs worse than standard GP regression and deterministic neural networks (Ober et al., 2021). Fortunately, this issue is resolved through the use of sampling-based methods, such as Hamiltonian Monte Carlo (which does not scale to large datasets) and SGLD (which scales to large datasets through the use of mini-batching). A similar approach was employed in Li et al. (2021), who proposed a deep Bayesian Gaussian process which combines DKL with Bayesian inference. This has the advantage of producing more reliable uncertainty estimates and predictions compared to standard DKL and deep Bayesian neural networks when applied to large-scale electronic health records. These works indicate that DKL benefits from employing a fully Bayesian treatment.

DKL has recently been applied to the discovery of dynamical models and latent states from high-dimensional noisy time series data (Botteghi et al., 2022). In Botteghi et al. (2022), an encoder is used to compress high-dimensional measurements into low-dimensional state variables and SVI is used for learning. The efficacy of this model was demonstrated through learning the stochastic motion of a pendulum with external perturbations from high-dimensional noisy images. Although not specifically DKL, a related method from the image processing community integrates DP clustering with deep learning neural networks. In Wang et al. (2022), the authors develop Deep Nonparametric Bayes (DNB) for jointly estimating the number of clusters, cluster labels, and learning deep representations in image data. This is done by first passing the images through a convolutional neural network, then performing DP clustering on the features. These methods have also been applied to the development of deep factor analysis models (Mittal et al., 2020) and deep tracking models (Zhang and Paisley, 2018). An interesting possible application is to time series image data, which may be useful in biomedical applications, for example, in grouping heterogeneous cancer patient populations in subpopulations based on the progression of their cancer type.

3.2 Composite models: Deep vs. nested

Another set of approaches centers on the construction of deep Gaussian processes. The term “deep” is used rather inconsistently in the literature to refer to several different nested or composite extensions of standard Bayesian nonparametric models with deep architectures. These research areas are fairly new, so there has not yet been sufficient time to develop a standardized language. Based on the analysis of the literature, it was found that there are two terms used to refer to GPs with deep architectures: nested Gaussian processes (nGPs) and deep Gaussian processes (DGPs). In an nGP, or more generally any nested BNP model, the hyperparameters of the model are drawn from another BNP model. In the case of an nGP, for example, the mean function or the hyperparameters of the covariance kernel of a GP are themselves drawn from a GP prior. Similarly, the atoms of a nested Dirichlet process (nDP) are themselves draws from a DP. The term DGP is used more broadly. According to Damianou and Lawrence (2013), a DGP is a multi-layer network consisting of GPs, where the input of each GP is the output of another GP. Intriguingly, the literature on DGPs also contains several works which use an nGP model but refer to it as a DGP (Zhao, 2021; Zhao et al., 2021; Lu et al., 2020). Adding to the convolution, nGP-centric papers do not reference DGP studies and vice versa, obscuring the distinct dependencies each model encapsulates. Per the definition in Damianou and Lawrence (2013), the DGP model refers to a cascade of transformations on the input in which one GP feeds its output into another GP. This introduces a state or sequence-specific dependency in the model structure based on direct input/output relationships. In contrast, nGPs or other nested BNP models exhibit behavior-specific dependency. For example, in an nGP, the behavior of the primary predictive GP at a specific input point is influenced by the hyperparameters at that input. Instead of the hyperparameters being static, they are dynamically modeled by another GP. This allows them to vary across the input space, resulting in heightened flexibility and adaptability in comparison to standard GPs with unchanging hyperparameters. To enhance conceptual clarity, I will refer to DGPs defined in Damianou and Lawrence (2013) as sequence-dependent DGPs, while nGPs and DGPs built upon the nGP foundation will be designated as behavior-dependent DGPs. This choice of nomenclature retains the inherent deep architecture of both models but underscores the unique dependencies each encapsulates. I will continue using the term “nested” for non-GP nested BNP models.

3.2.1 Sequence-dependent DGPs

Although there are several ways to construct a sequence-dependent DGP (Dunlop et al., 2018), this work highlights construction via composition since it provides the most intuitive understanding of the architecture. A sequence-dependent DGP consists of three types of nodes: parent latent node Z, intermediate latent space nodes X_h−1, and leaf nodes Y (Teng et al., 2018). The parent latent node represents the initial input of the DGP. The intermediate latent node corresponds to h = 1, … , H − 1, where H is the number of hidden layers. The leaf node Y represents observed output, or target variables of the model. The graphical model in Figure 1 depicts the architecture for H hidden layers.

FIGURE 1

FIGURE 1. Graphical model representation for the deep Gaussian process adapted from Salimbeni and Deisenroth (2017). The output of each layer becomes the input into the next layer.

The generative process is given by Eqs 43, 44:

Y = f^{Y} (X_{1}) + ϵ_{1} (43)

⋮

X_{H - 1} = f^{H - 1} (Z) + ϵ_{H - 1} (44)

where the joint posterior is given in Eq. 45:

p (Y, {\{f^{h}\}}_{h = 1}^{H - 1}) = p (Y | f^{H - 1} (f^{H - 2} (\dots f^{1} (X)))) \prod_{h = 1}^{H - 1} p (f^{H}) (45)

Note that a temporal extension (Eqs 46–49) can be formulated as follows:

Y_{t} = f^{Y} (X_{t, 1}) + ϵ_{t, 1} (46)

⋮

X_{t, H - 1} = f^{H - 1} (Z_{t}) + ϵ_{t, H - 1} (47)

Y_{t} = f^{2} (f^{1} (X_{t})) (48)

p (Y, {\{f^{h}\}}_{h = 1}^{H - 1}) = \prod_{t - 1}^{T} p (Y_{t} | f^{H - 1} (f^{H - 2} (\dots f^{1} (X_{t})))) \prod_{h = 1}^{H - 1} p (f^{H}) (49)

Where independent GP priors are placed over the functions corresponding to each layer $f^{h} \sim GP (m_{f}^{h} (X), k_{f}^{h} (X, X^{'}))$ for h = 1, … , H and f^Y denotes the function corresponding to the final layer of the DGP which outputs the leaf node Y. Note that ϵ_t,H−1 is the error term corresponding to H − 1 layer. Standard GP kernels, such as the radial basis function (RBF) or squared exponential are typically used for $, k_{f}^{h} (X, X^{'})$ . For image data, the convolutional kernel is a more suitable option (Kumar et al., 2018). Through the nonlinear transformations which occur at each level of the sequence-dependent DGP they are able to capture highly complex and nonlinear mappings as well as model non-Gaussian distributions. This is because as the data passes through the layers, they undergo a series of transformations that gradually build up a hierarchy of increasingly abstract representations. Early layers might capture simple patterns or local structures, while deeper layers might capture more complex, global relationships. Increasingly abstract features can be learned at each layer. Furthermore, they can capture correlations between multiple outputs making them easily extendable to multi-task settings (Agrawal, 2020; Alaa and van der Schaar, 2017). As with standard GPs, sequence-dependent DGPs provide a quantification of uncertainty in predictions, which is desirable in tasks where safety is critical, such as healthcare or product development. Sampling- and optimization-based approaches can be used for inference. Sequential Monte Carlo (SMC) inference of sequence-dependent DGP state space models is introduced in Liu et al. (2022b). Implementation of these methods using scalable inference algorithms, such as those in Salimbeni and Deisenroth (2017); Wang et al. (2016), makes them well-suited for analysis of high-dimensional and large data sets.

Although there are few applications of sequence-dependent DGPs to the analysis of time series data, recent work by Chen et al. (2020) developed a sequence-dependent DGP approach to predict flight trajectories. They found that the ability of the sequence-dependent DGP to represent nonlinear features helped improve prediction accuracy as it could better capture flight environment stochasticity. Other work has extended sequence-dependent DGPs to state-space models (DGP-SSM) (Eq. 50) (Liu et al., 2022b; Taubert et al., 2022; Zhao, 2021). The formulation is given as

\begin{align} Transition: x_{0, t} & = f_{0} (x_{0, t - 1}) + ϵ_{t} \\ Deep: x_{l, t} & = f_{l} (x_{l - 1, t}) + ϵ_{l - 1, t}, for l = 1 \dots L layers \\ Measurement: y_{t} & = g (x_{L, t}) + ϵ_{L, t} \end{align} (50)

where ϵ_L,t is the error term corresponding to the Lth layer at time t, g is the possibly unknown measurement function, f_l is the function mapping x_l−1,t to x_l,t, and f₀ is the transition function mapping the state x_0,t−1 at time t − 1 to the state x_0,t at time t. In Liu et al. (2022b), the authors consider a DGP-SSM where both the state transition and measurement functions are assumed unknown. To induce sparsity, a feature-based representation of the GPs is used. An SMC algorithm is developed for inference on a simulated dataset consisting of two hidden layers, a five-dimensional measurement, and two-dimensional state vector. How this type of SMC algorithm scales to high-dimensional and large time series data remains an open question. To this extent, Taubert et al. (2022) introduce a computationally efficient algorithm combining sparse GPs and stochastic expectation propagation for inference in DGP-SSMs. The algorithm was evaluated on kinematic data with 99 degrees of freedom and was shown to outperform the standard GP dynamical model in terms of prediction accuracy via mean squared error (MSE). Comparison to other comparable methods such as DKL is needed to fully evaluate the efficacy of their approach, but this is an exciting area for future research.

3.2.2 Behavior-dependent DGPs and nested BNP models

In a behavior-dependent DGP (Eq. 51), a GP prior is placed on the parameters of a GP, such as the mean function or the hyperparameters of the covariance kernel (Zhao et al., 2021). Like the sequence-dependent DGP, this layering can continue indefinitely. An example of such a construction is given by

\begin{align} f_{t} \sim GP (μ_{t}, k_{t}) \\ μ_{t} \sim GP (μ_{p}, k_{p}), \end{align} (51)

where t and p correspond to different levels of nesting and the mean of f_t, which is μ_t is drawn from a GP. Although sequence and behavior-dependent DGPs can model complex relationships between data in a hierarchical manner, they offer different advantages. In particular, the behavior-dependent DGP construction allows for explicit modeling of non-stationary behavior as the GP parameters are allowed to vary across the input space. Non-stationarity is inherent in its construction. In contrast, sequence-specific DGPs typically implicitly model non-stationarity through the successive nonlinear transformations of the input. However, non-stationarity can also be introduced in the kernel construction. Incorporating behavior-dependent DGPs in a state-space model framework enables us to capture several levels of spatial and/or temporal granularity in the dynamics. By allowing a GP to modulate the parameters of another GP, they can adapt very locally to data. This makes them well-suited for modeling data with rapid fluctuations or change points, varying scales of trends, or input region specific patterns and have the ability to model a larger class of functions (Zhao et al., 2021). Behavior-dependent DGPs have exhibited success in imputing missing data in high-dimensional settings while modeling complex spatiotemporal relationships in healthcare settings (Imani et al., 2019).

Other related methods are based on nested partition models (Mansinghka et al., 2016), such as nested Dirichlet processes (nDPs, Eqs 52–55) (Rodríguez et al., 2008), which can be used to cluster entire distributions. The hierarchical model for the nDP is given as

G_{r} | γ, G_{0} \sim DP (γ, G_{0}) (52)

G_{j} | α, Q \sim DP (α DP (γ G_{0})) (53)

θ_{j i} | G_{j} \sim DP (α DP (γ H)), also denoted by distribution Q (54)

x_{j i} | θ_{j i} \sim F (θ_{j i}) (55)

Which admits the stick-breaking construction in Eq. 56

G_{r} = \sum_{l = 1}^{\infty} w_{l r} δ ϕ_{l r}^{*}, ϕ_{l r}^{*} \sim G_{0}, {(w_{l r})}_{l = 1}^{\infty} \sim GEM (γ) (56)

and

G_{j} \sim Q ≜ \sum_{r = 1}^{\infty} π_{r} δ_{G_{r}}, {(π_{r})}_{r = 1}^{\infty} \sim GEM (α) (57)

≜ DP (α DP (γ G_{0})) (58)

Note that Q in Eq. 57 is defined as an infinite-dimensional distribution over all possible DPs. With probability π_r, a DP G_r is selected which sets G_j = G_r. This allows clustering of the distributions themselves. A marginalized nDP based on the Pólya urn construction is introduced in Zuanetti et al. (2018). Unfortunately, the nDP can degenerate to the fully exchangeable case if two populations share at least one latent variable (Camerlenghi et al., 2019). More recently, a latent nested partition model was introduced in Camerlenghi et al. (2019) to overcome the degeneracy issue of the nDP. An open area of research is the development of temporal extensions of these nested partition models to allow for time-varying distributional cluster membership, which may be more reflective of the dynamic nature of real-world populations.

3.3 Deep Gaussian process multi-task learning

Extending DGPs to multi-task settings results in powerful models equipped with the advantages of MTGPs and the ability to learn complex relationships and patterns in data. An advantage of incorporating a deep architecture in an MTGP model is that it is less sensitive to the specific form of the covariance kernel, which is not the case in standard GP or MTGP analysis. The deep architecture helps mitigate this sensitivity, making the model more robust and adaptable to various types of data. Deep architectures also provide increased expressivity. This is especially advantageous when handling tasks characterized by intricate and nonlinear dependencies. Such complexities may pose challenges for shallower models (Boustati et al., 2020; Boustati and Savage, 2019). Deep MTGP models not only maintain the ability to provide informative uncertainty quantifications but also uphold robustness against overfitting. Moreover, the hierarchical representation and feature learning capabilities of DGPs contribute to better generalization across tasks. The model can leverage shared features and representations to make predictions on new or unseen tasks, promoting more effective transfer of knowledge between related tasks.

3.4 Posterior inference scaling to large datasets

Many of the approaches just discussed are computationally intensive and require methods to scale to large datasets. Although it is important to note that GP inference in general admits parallelization and distributed inference (Agrawal, 2020), one should exercise caution when dealing with time series data which exhibit strong dependencies or high correlations so that the integrity of the temporal dependencies is maintained. Stochastic variational inference algorithms for GPs have been successfully applied to large datasets (Hensman et al., 2013) and provide a path for scaling deep learning BNP methods to large and high-dimensional datasets (Hoffman et al., 2013). One recent approach is the use of sparse variational Gaussian processes (SVGPs). SVGPs reduce the computational complexity of a standard Gaussian process from $O (n^{3})$ to $O (n m^{2})$ by fitting a GP using a set of m ≪ n inducing points, thereby providing an approximation to the exact GP. This approach has been extended to sequence-dependent DGPs in Salimbeni and Deisenroth (2017), who introduce a doubly stochastic variational inference method which does not impose independence between layers, as there are typically high correlations between layers. It was also shown to scale to billions of data points and can be applied to inference problems using non-conjugate priors. In Adam et al. (2021), the authors introduce a dual parameterization for the SVGP which resulted in faster hyperparameter learning and inference. However, an issue with SVGPs is that for large M, the computational complexity scales to $O (m^{3} + n m^{2})$ . This issue is addressed in Cunningham et al. (2023), who use an inter-domain approach to project the GP onto a basis of B-splines, facilitating computationally efficient linear algebra. An alternative method is deep ensemble kernel learning (DEKL) (Agrawal, 2020). DEKL partitions the feature network architecture into several learners, which facilitates model parallelism. As well, the kernel hyperparameters admit optimization in closed form and the model does not rely on inducing point approximations. Stochastic gradient Langevin dynamics (SGLD) can also be used. Dubey et al. (2016) introduced a method to reduce the variance in the noisy gradients, and an algorithm for distributed inference was introduced in Das et al. (2016). In Havasi et al. (2018), stochastic gradient Hamiltonian Monte Carlo (SGHMC) was shown to provide more accurate inference in sequence-dependent DGPs than variational inference methods when applied to medium-large datastets. Methods scalable to large datastets are introduced in Gürbüzbalaban et al. (2021).

4 Practical applications

Bayesian nonparametric (BNP) methods showcase versatility in numerous domains. This section highlights their diverse applications in three areas, specifically focusing on object tracking, healthcare and biomedical data analysis, and speech signal processing.

4.1 Object tracking

Object tracking aims to locate and follow the movement of objects captured by a sensor, such as radar, GPS, or a camera over time. Unlike traditional methods that assume a fixed number of objects or rely on predefined detection models, Bayesian nonparametric methods allow for flexibility in handling varying object counts and diverse types of objects, as well as adapting to varying environments (Moraffah et al., 2020). In Fox et al. (2006), the authors consider the problem of multiple-object tracking when the number of objects is unknown. This involves assigning measurements collected by a sensor to their underlying targets. Using a linear state-space model to model the dynamics of each target and measurement, a Dirichlet process (DP) prior is placed on the number of targets. Gibbs sampling is used to infer the target-measurement associations and identify 10 targets based on their measurement trajectories. In Caron et al. (2012); Neiswanger et al. (2014), a time-dependent DP construction based on the generalized Pólya urn scheme is introduced to track multiple objects in videos. Each object is modeled as a multiplication of a multivariate Gaussian and multinomial distribution to capture its location and RGB color distribution. A dependent Dirichlet process (DDP) prior is placed on the set of parameters corresponding to these distributions, which allows the model to capture a range of object shapes and orientations (Neiswanger et al., 2014). Using a variety of inference algorithms including particle Gibbs and sequential Monte Carlo (SMC) with a local Gibbs step, the authors apply their method to three scenarios. The first is a video containing six ants exhibiting erratic behaviors where the video background is a similar color scheme to the ants. The second is human motion tracking, and the third aims to track a population of T cells, where there is a large number of T cells per frame. Their method exhibited high performance accuracy in each of these scenarios. In, Moraffah et al. (2020), the authors consider the problem of tracking a moving object in a highly-cluttered environment. As the goal is to identify whether the measurement corresponds to the object of interest or whether it is “clutter”, this is ultimately a clustering problem. As such, the authors use two conditionally independent DPs as priors on the labels corresponding to the target and clutter measurement labels. Their approach was shown to outperform comparable methods including traditional Bayesian filtering and nearest-neighbor filters, demonstrating the utility of taking into account clutters. Approaches based on Hierarchical Dirichlet Processes (HDPs) for identifying human motions have also been introduced (Tu et al., 2019; Dhir et al., 2016). In Tu et al. (2019), the authors introduce the Multi-label Hierarchical Dirichlet Process (ML-HDP) for multi-action recognition. Interestingly, their three-tier model employs a similar construction to those used in topic modeling, as the model consists of high-level actions at one level which are combinations of atomic actions at the second level (similar to latent topics in a topic model), which themselves are combinations of local features at the lowest level. Such a construction has the ability to capture a wide range of human behaviors and is well-suited for weakly-supervised settings (Tu et al., 2019).

On the other hand, object tracking methods based on deep BNP models are limited and the full potential of these models to the object tracking problem has yet to be realized. To this extent, although not specific to time series, Sun et al. (2021) developed a deep kernel learning (DKL) method for recognizing targets in remote sensing images which relies on deep saliency kernel learning analysis. The problems posed by remote sensing include the effects of varying weather conditions on the images as well as the presence of clutter-induced noise. Furthermore, remote sensing images can exhibit diverse patterns and features, but subtle differences may be present. A poorly designed kernel mapping function may struggle to differentiate between similar features. The flexibility of DKL architectures allow for the design of network architectures that are better suited to handle the specific challenges posed by the structure of kernel mapping functions in remote sensing applications. The approach introduced in Sun et al. (2021) was shown to outperform methods such as support vector machines, dynamic Bayesian networks, and convolutional neural networks on a range of real-world and synthetic datasets.

The field of object tracking has witnessed significant advancements through the application of Bayesian nonparametric methods, particularly in scenarios with unknown object counts and diverse types of objects. While deep Bayesian nonparametric models are yet to be fully explored in this context, recent developments, such as the work in Sun et al. (2021), showcase the potential for addressing complex challenges in object tracking applications, such as those encountered in remote sensing. As research continues to evolve, these methodologies hold promise for enhancing the robustness and adaptability of object tracking systems across diverse real-world scenarios.

4.2 Healthcare and biomedical data analysis

Analyzing healthcare and biomedical data poses several challenges, such as missing data and irregularly spaced samples, high dimensional and large datasets, diverse time series originating from different patients and different systems, and the amalgamation of mixed data types. In tackling these complexities, GPs have shown to be exceptionally valuable for healthcare data analysis. GPs and their variants offer distinctive advantages in this domain, excelling in tasks like missing data imputation (Imani et al., 2019), predictive modeling (Colopy et al., 2016), multi-task learning (Dürichen et al., 2014), and early warning detection (Zhang et al., 2022). In Rinta-Koski et al. (2018), a standard GP was employed to predict in-hospital mortality among premature infants. The study examined data from 598 NICU patients for which seven variables were considered, including gestational age at birth, birth weight, systolic, and mean and diastolic arterial blood pressure. A three-part covariance kernel consisting of a sum of squared exponential, linear, and constant kernels was used to capture the bias and linear trend, as well as nonlinear effects. Comparison to other classifiers such as the support vector machine and the linear probit model demonstrated the utility of GPs in predicting in-hospital death. Particularly noteworthy was the model’s efficacy when combining features from time series data, such as ECG heart rate and arterial blood pressure, with clinical scores calculated upon admission. A similar study in Colopy et al. (2016) aimed to identify which patients in the step-down unit (SDU) are at risk of readmission to the ICU by forecasting patient heart rate time series. Using SDU time series data consisting of 333 patients and measurements for heart and respiratory rates, blood-oxygen saturation, and systolic and diastolic blood pressure, the study employed a change-point detection approach to identify the transition from a normal heart rate state to an abnormal state. This was done by considering the deviation of the observed measurements from the forecast. If such a deviation was sufficient, then this indicated a deteriorating patient condition, which provided 6–8 h of advanced warning detection.

Employing multi-task Gaussian processes (MTGPs) further enhances the benefits derived from the analysis of healthcare and biomedical data compared to standard GPs. MTGPs can leverage information from different patients and different types of time series types, offering solutions to the aforementioned challenges and improving modeling accuracy by taking into account the correlation between different types of physiological time series. In Dürichen et al. (2014), an MTGP is applied to real-world and synthetic datasets consisting of different physiological time series. The data are sparse, noisy and contained unevenly-spaced samples. The study found that taking into account correlation between the different vital sign time series yields improved predictive performance in comparison to standard GPs, particularly in regions of incomplete data. In Chen et al. (2023), an MTGP model was introduced to estimate treatment effects in panel data. Their model accounts for temporal correlations within and across treatment and control groups. Furthermore, like MOGPs, MTGPs can also handle data of mixed types. In Zhang and Shen (2012), MTGPs were applied to the problem of Alzheimer’s disease diagnosis using multi-modal data, where the clinical variables of interest were of mixed continuous and categorical types. However, the high computational cost $O (m^{3} n^{3})$ presents an obstacle. To this extent, sparse multi-output GP methods can be employed which involve placing sparsity-inducing priors on the weight matrix for the multi-output covariance kernel (Cheng et al., 2020). This facilitates scalability to large datasets. In summary, these studies underscore the diverse applications of MTGPs in addressing the intricacies of multi-faceted healthcare and biomedicine datasets, showcasing their potential to improve predictive accuracy and handle data of varying types and complexities.

Extending BNP deep learning approaches to multi-task settings results in powerful models equipped with the advantages of MTGPs and the ability to learn complex relationships and patterns in data. This is demonstrated in work by Zhang et al. (2022), who developed a real-time early-warning model to predict COVID-19 patients at risk of being placed on a ventilator. In their work, an MTGP was used for missing data imputation and for transferring irregularly sampled data to a regularly spaced grid. The data was then fed into a neural network for prediction of the risk score trajectory. Their approach allowed the model to predict the outcome prior to the patient needing to be placed on a ventilator. A different but related model was developed in Alaa and van der Schaar (2017), who employed a multi-task DGP to assess a patient’s risk of multiple adverse outcomes. The dataset consists of variables corresponding to covariates associated with each subject, the time until an event (e.g., cardiovascular or cancer) occurred, and an indicator denoting the type of event that occurred. The event times are modeled as a multi-output function of the patients’ covariates using a multi-task DGP. This allows for non-Gaussian outputs as survival times may exhibit asymmetric distributions. The model consists of a two-layer sequence-dependent DGP, where the latent variables are the outputs of a multivariate GP which in turn become the inputs to the GP modeling the survival times. Furthermore, the use of a multi-task DGP facilitates the joint modeling of complex survival distributions and complex interactions between the different covariates with minimal assumptions (Alaa and van der Schaar, 2017). The efficacy of the multi-task DGP was evaluated on both real-world and synthetic datasets. In particular, the synthetic dataset was constructed with high heterogeneity between patient cohorts. In this case, the multi-task DGP model was shown to outperform the MTGP due to the highly nonlinear relationships between the covariates and survival times, as well as the complex form of the survival time distributions. Similar results were obtained for a real-world breast cancer survivor dataset consisting of 61,050 subjects. These findings highlight the potential of BNP deep learning approaches in multi-task settings to enhance predictive modeling in healthcare, offering valuable insights for personalized patient care and medical decision-making.

4.3 Speech signal processing

Speech signal processing is a multifaceted area encompassing various objectives such as speech signal representation, feature extraction (e.g., formant analysis and pitch extraction), speech recognition, speaker diarization, noise reduction, and emotion recognition. Techniques such as infinite Hidden Markov Models (iHMMs), GPs, and HDP-HMMs enable a data-driven analysis of speech signals, facilitating the discovery of underlying structures, improving speech synthesis methods, and enhancing the overall efficiency and accuracy of speech-related tasks. This section highlights the application of BNP methods to two problems: speaker diarization and speech synthesis.

Speaker diarization is the task of partitioning an audio recording of a conversation into segments corresponding to individual speakers (Fox et al., 2011b). This is a challenging problem since the number of speakers as well and their individual speech patterns are often unknown a priori. The flexible nature of the nonparametric paradigm places no assumptions in this regard. The class of BNP approaches largely suited for this challenge consists of HDP-HMMs and iHMMs. Fox et al. (2011b) introduced the “sticky” HDP-HMM (see Section 2.1.1), which extends the HDP-HMM to capture state persistence and prevents the model from switching too quickly between states. The inclusion of a state persistence parameter reflects the natural tendency of speakers to exhibit persistence in their speech patterns. Both the transition and emission distributions receive a nonparametric treatment, as speaker specific emissions are better approximated by a multimodal distribution (Fox et al., 2011b). The sticky HDP-HMM was shown to exhibit improved performance on real-world and synthetic datasets. However, the geometrically distributed state durations places restrictions on the duration structure. This led to the development of the Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HsMM), allowing for a more versatile selection of state duration distributions. In Johnson and Willsky (2012), the Poisson distribution is used as the state duration distribution. It is frequently employed to model the count of events within fixed intervals. Unlike the geometric distribution, the Poisson distribution accommodates variations in the rate of event occurrences, which is advantageous in scenarios where speakers may exhibit varying speech patterns or engage in dynamic conversational behaviors, such as in a debate.

Speech synthesis is the artificial production of human speech by a computer or other device. It involves extracting dependencies between acoustic and linguistic features to produce speech patterns. To produce more natural sounding speech patterns, DGP-based models have recently been introduced (Koriyama and Kobayashi, 2019b; Koriyama and Kobayashi, 2019a; Mitsui et al., 2021). This hierarchical nature enables the model to capture dependencies at different levels of abstraction, from low-level acoustic features to high-level linguistic and semantic information. DGPs are shown to be more effective than deep neural networks (DNNs), as DGPs are less vulnerable to overfitting since the training objective is based on the maximization of the marginal likelihood (Koriyama and Kobayashi, 2019b). DGP latent variable models have also been introduced for semi-supervised prosody modeling (Koriyama and Kobayashi, 2019a). By treating missing prosody labels as latent variables, the model is able to learn and generate expressive and natural-sounding synthetic speech, even when some prosody information is not explicitly provided during training. The application of DGPs to speech synthesis has been extended to multi-speaker speech synthesis (Mitsui et al., 2021). Instead of using one model per speaker, one model is used for multiple speakers, in a similar vein to multi-task learning. Two methods are introduced for multi-speaker speech synthesis. In the first method, simple one-hot speaker codes are combined with a DGP model for training, similar to a single-speaker model. The second method incorporates a more complex model, the Deep Gaussian Process Latent Variable Model (DGPLVM), into a DGP-based acoustic model and considers both acoustic features and speaker representations as observed and latent variables, respectively. This trains the system to generate speech while accounting for speaker similarity and other factors like speaking rates. The research demonstrates the ability to generate speech for non-existent speakers by sampling from the latent space learned by DGPLVM, offering potential applications in synthesizing diverse voices while safeguarding speaker privacy. This capability can be utilized for creative purposes, such as generating multiple characters for entertainment purposes or providing users with their preferred voices in multi-speaker speech synthesis.

5 Conclusion and future directions

This work has presented a comprehensive survey on existing Bayesian nonparametric methods for time series analysis. These methods provide potential solutions for several challenges which arise when analyzing time series data, including those associated with high-dimensional and large datasets (Hoffman et al., 2013; Al-Shedivat et al., 2017), irregularly spaced or missing samples (Imani et al., 2019), unknown underlying mechanisms (Frigola et al., 2014), and mixed data types (Hong et al., 2023). The use of deep BNP methods presents an exciting area of research which increases the expressivity of standard BNP methods and captures state-or behavior-specific dependencies while generally remaining robust to overfitting and quantifying uncertainty. However, there are several areas for improvement and future research.

To begin with, the increased expressivity of deep BNP models comes at the expense of interpretability. Due to the many transformations of the input, models with deeper architectures can obscure the relationships between inputs and output predictions. How do we balance this tradeoff? One strategy involves approximating the DGP as a GP, where the moments of the DGP are used to construct effective GP kernels with analytic forms (Lu et al., 2020). This simplifies the modeling process and paves the way for enhanced interpretability, although it may introduce approximation errors.

Second, the application of deep BNP methods to time series data is limited. Thus, the application and evaluation of deep BNP models on different types of time series and domains is needed. Does the efficacy of DKL over DGP, or vise versa, depend on the specific dataset? What are the features of such datasets where this may be the case? Do certain datasets require deeper architectures than others for extracting meaningful information? One particular domain that has been left largely unexplored by deep BNP models is climate data. Climate change is one of the most pressing problems that our society faces. Predicting future climate change trajectories and developing adequate intervention is necessary for ensuring the health of our planet. However, climate data are complex, nonlinear, and noisy, and it is notoriously difficult to make predictions in climate systems. As well, climate extremes exhibit non-stationary behavior which are caused by temporal variation in the statistical properties in climatic factors (Abrahamczyk and Uzair, 2023). Climate systems also connect local short-term weather patterns with long-term global climate change, which necessitates the use of methods that can capture long and short-range dependencies. Given the myriad benefits that deep BNP models have to offer in these areas, their application to climate data could yield significant insights concerning future climate change trajectories. Moreover, the ability to provide robust uncertainty estimates is essential to inform policy-making.

Another avenue for future research is applying deep BNP methods to cluster multiple time series data from different dynamical systems with unknown transition dynamics. DKL can be used as a prior on the state transition dynamics, and a DP prior on the DKL can induce clustering based on similar dynamics. This can be advantageous in scenarios where there are multiple time series from various systems or processes and there is little domain expertise to inform the construction of the covariance kernel. Furthermore, an extension to a state-space setting can be done in a straightforward manner, where the transition and/or measurement functions are unknown. Such an approach will likely result in an increase in computational complexity and necessitate the development of more sophisticated inference algorithms. However, the inference methods discussed in Section 3.4 are promising in this regard.

Lastly, future research could focus on the development of algorithms which can determine the optimal number of layers in deep BNP models. This can aid in reducing computational complexity, mitigate biases by alleviating the need for manual tuning, and potentially bolster interpretability by employing no more than the requisite number of layers. Furthermore, the optimal model depth might vary depending on the specific characteristics of the dataset. Developing an algorithm which optimizes the number of layers can increase the longevity of these types of deep learning BNP methods as the time series data landscape continues to evolve.

Author contributions

NV-C: Conceptualization, Investigation, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aalto, A., Viitasaari, L., Ilmonen, P., Mombaerts, L., and Goncalves, J. (2018). Continuous time Gaussian process dynamical models in gene regulatory network inference. arXiv preprint arXiv:1808.08161.