State-space estimation of spatially dynamic room impulse responses using a room acoustic model-based prior

MacWilliam, Kathleen; Dietzen, Thomas; Ali, Randall; van Waterschoot, Toon

doi:10.3389/frsip.2024.1426082

ORIGINAL RESEARCH article

Front. Signal Process., 02 September 2024

Sec. Audio and Acoustic Signal Processing

Volume 4 - 2024 | https://doi.org/10.3389/frsip.2024.1426082

This article is part of the Research TopicAudio and Acoustics of MovementView all 5 articles

State-space estimation of spatially dynamic room impulse responses using a room acoustic model-based prior

Kathleen MacWilliam¹*

Thomas Dietzen¹

Randall Ali^1,2

Toon van Waterschoot¹

¹STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium
²Institute of Sound Recording, Department of Music and Media, University of Surrey, Guildford, United Kingdom

Room impulse responses (RIRs) between static loudspeaker and microphone locations can be estimated using a number of well-established measurement and inference procedures. While these procedures assume a time-invariant acoustic system, time variations need to be considered for the case of spatially dynamic scenarios where loudspeakers and microphones are subject to movement. If the RIR is modeled using image sources, then movement implies that the distance to each image source varies over time, making the estimation of the spatially dynamic RIR particularly challenging. In this paper, we propose a procedure to estimate the early part of the spatially dynamic RIR between a stationary source and a microphone moving on a linear trajectory at constant velocity. The procedure is built upon a state-space model, where the state to be estimated represents the early RIR, the observation corresponds to a microphone recording in a spatially dynamic scenario, and time-varying distances to the image sources are incorporated into the state transition matrix obtained from static RIRs at the start and end points of the trajectory. The performance of the proposed approach is evaluated against state-of-the-art RIR interpolation and state-space estimation methods using simulations, demonstrating the potential of the proposed state-space model.

1 Introduction

A room impulse response (RIR) is the time-domain representation of the linear time-invariant (LTI) system that uniquely characterizes the cumulative impact of a room on sound waves between a specific static source and a microphone’s position, effectively representing the room’s acoustic environment. This concept is fundamental to various acoustic signal processing applications, including source localization (Evers et al., 2020), dereverberation (Naylor and Gaubitch, 2010), echo cancellation (Elko et al., 2003), and spatial audio reproduction (Schissler et al., 2017). Considerable research has been dedicated to developing robust measurement (Stan et al., 2002; Szöke et al., 2018) and estimation (Lin and Lee, 2006; Crocco and Bue, 2015; Ratnarajah et al., 2022) techniques for capturing RIRs, particularly in scenarios where the source and microphone remain static within the acoustic environment (Stan et al., 2002; Szöke et al., 2018). However, real-world situations often involve spatially dynamic scenarios, where sources and microphones are subject to movement. This paper addresses the challenge of accurately estimating the early part of RIRs along a trajectory in a time-varying acoustic scenario with a stationary source and a moving microphone. This can be thought of as a time-varying system identification problem where the system to be identified may be referred to as a time-variant RIR.

In this context, it is important to consider the apparent contradiction between the previously defined RIR as an LTI system representation and the concept of a time-variant RIR. In this paper, the acoustic environment itself is assumed to be time-invariant, in which case an RIR between two static positions indeed corresponds to an LTI system. RIRs, however, are inherently location-variant—they depend on the locations of the source and the microphone. The time-variant RIR, as considered here, can be defined as the time-varying system representation (Cherniakov, 2003) relating the source signal to the microphone signal if the microphone moves—that is, has a time-varying location. In discrete time, this time-variant RIR can also be thought of as a collection of RIRs in the LTI sense over different time-instances: the location of the microphone changes at each time step, leading to a new discrete position in space associated with a time-invariant RIR. Time-variant RIR estimation, as defined above, has relevance across numerous acoustic signal processing applications, especially amid the growing interest in virtual acoustic environments. For instance, Ajdler et al. (2007) demonstrated the use of time-varying acoustic system models for enabling the rapid measurement of head-related impulse responses. Moreover, sub-optimal estimations of time-variant RIRs can impair the effectiveness of echo-cancellation systems in telepresence and communication technologies (Nophut et al., 2024).

In order to contextualize our contribution in estimating the early part of a time-variant RIR, it is necessary to provide a brief outline of the related literature. In this context, the concept of spatial RIR interpolation is highly relevant given that we can consider a time-varying RIR as a collection of RIRs over a discrete set of locations. RIR interpolation facilitates sound field rendering for dynamic source–microphone positions by filling spatial gaps in RIR data, as measurements or simulations are usually limited to sparse grids. Numerous approaches have been proposed for RIR interpolation, including compressed sensing methods (Mignot et al., 2013; Katzberg et al., 2018), spherical harmonics (Borra et al., 2019), physics-based models (Antonello et al., 2017; Hahmann and Fernandez-Grande, 2022), directional RIRs (Zhao et al., 2022), and neural networks (Pezzoli et al., 2022; Karakonstantis et al., 2024). Haneda et al. (1999) introduced a frequency-domain approach for interpolating room transfer functions (RTFs), which is particularly effective at lower frequencies; it was extended by Das et al. (2021). Additionally, several techniques for the interpolation of head-related transfer functions (HRTFs), which are critical for accurate binaural rendering and share some methodological approaches with RTF interpolation, have been explored (Carty, 2010). For the purpose of this paper, we focus on the interpolation of an RIR at a point between two microphone locations given their estimated RIRs for a common stationary source. Kearney et al. (2009) introduced Dynamic Time Warping (DTW)-based interpolation, dividing RIRs into early reflections and diffuse decay. This method temporally aligns and linearly interpolates early reflections while modeling the tail using the approach of Masterson et al. (2009) and laying the foundation for subsequent developments. Garcia-Gomez and Lopez (2018) expanded on Kearney’s work, enhancing algorithm robustness and computational efficiency while retaining the core interpolation technique. Building on this, Bruschi et al. (2020) refined peak finding and matching aspects of the interpolation technique. Geldert et al. (2023) proposed a novel approach using partial optimal transport for interpolation, enabling non-bijective mapping of sound events between the early part of RIRs. It is important to highlight that these interpolation approaches operate outside of a conventional system identification framework. Instead, they use a limited set of measured RIRs and predominantly rely on a room acoustic sound propagation model such as the image source method (ISM) (Allen and Berkley, 1979). As opposed to interpolation strategies, fully data-driven approaches for estimating time-variant RIRs have also been investigated. In this case, the estimation relies directly on the source and microphone signals and can be framed as a system identification or adaptive filtering problem. This approach has been largely motivated by the need to obtain rapid measurements of head-related impulse responses (Enzner, 2008; Hahn and Spors, 2015) and for echo cancellation (Antweiler and Symanzik, 1995; Enzner, 2010), where a typical scenario involves an excitation signal being continuously captured by a moving microphone. Using carefully designed excitation signals (Hahn and Spors, 2015; Kuhl et al., 2018), time-variant RIRs can be estimated using a normalized least mean squares (NLMS) algorithm (Enzner, 2008; Antweiler et al., 2012) or more generally using a Kalman filter (Enzner, 2010). In the context of this paper, a Kalman filter is of particular interest as it is derived from a state-space model of the dynamic system where the state to be estimated is the time-variant RIR (Enzner, 2010). In such a state-space model, the evolution of the time-variant RIR is explicitly modeled using a first-order difference equation, which allows more modeling flexibility than other adaptive algorithms such as NLMS. One popular choice for the first-order difference model is to relate the states at two time instants by a transition factor and an additional process noise term. The transition coefficient and process noise covariance are typically set according to the expected variability of the state, influencing the convergence behavior of the Kalman filter. For instance, in Nophut et al. (2024), the transition coefficient was modeled as a function of the microphone velocity.

We here consider the problem of estimating the early part of the time-variant RIR between a stationary source and a microphone moving on a linear trajectory at constant velocity. We propose integrating RIR interpolation, derived from a room acoustic model, into a state-space-based framework for RIR estimation, thereby merging data-driven approaches with physical modeling. More specifically, rather than relying solely on a state transition factor within the state equation, we propose incorporating the ISM into a state transition matrix between the early segments of consecutive RIRs. We derive an analytical model for this transition matrix and subsequently propose estimating it from static RIRs at the trajectory’s start and end points using a DTW-based algorithm. The proposed approach's performance is evaluated through simulations by comparing it to two alternatives: one that relies solely on the state equation for RIR estimation, resembling a pure interpolation method with the ISM-based transition matrix, and another that uses a conventional state-space estimation with a simple state transition factor. Our findings suggest that the proposed state-space model outperforms both alternatives in terms of normalized misalignment between the simulated “ground-truth” RIRs and estimated RIRs.

The subsequent sections of this paper are organized as follows. Section 2 introduces the signal model and provides an overview of the most pertinent state-of-the-art methods. Section 3 elaborates on the proposed RIR state-space model and outlines the update equations of the Kalman filter used to recursively estimate the defined state. Section 4 offers detailed derivations of the proposed room acoustic model-based state transition matrix. Finally, Section 5 presents experimental validation through simulations, followed by a discussion of the results.

2 Signal model, problem statement, and related state of the art

2.1 Signal model and problem statement

We first introduce the signal model to formally define the concept of an RIR as employed in this paper. When we henceforth use the term “RIR”, it will specifically pertain to the early part of the RIRs, as we do not address the estimation of the late reverberant tail. We also assume that the source location remains static. Let $k$ denote the time index, and let the location of the microphone relative to the linear trajectory be denoted by the one-dimensional location index $l$ . If the microphone location is time-variant, an RIR defines the LTI system relating a source signal $x (k)$ to an observed signal $y (l, k)$ . Let the RIR for microphone location $l$ be denoted by $h (l, n)$ , where $n$ indexes the time-shift of the RIR samples. The signals $x (k)$ and $y (l, k)$ are then related by

y (l, k) = \sum_{n = 0}^{N - 1} h (l, n) x (k - n) + v (k), (1)

where $v (k)$ is a noise term that we generally assume to be present in the observed signal and may include late reverberation. Now, defining the vectors $x (k)$ and $h (l)$ as

x (k) = {(\begin{matrix} x (k) & x (k - 1) & \dots & x (k - N + 1) \end{matrix})}^{T}, (2)

h (l) = {(\begin{matrix} h (l, 0) & h (l, 1) & \dots & h (l, N - 1) \end{matrix})}^{T}, (3)

the convolution in Equation 1 can alternatively be written as

y (l, k) = x^{T} (k) h (l) + v (k) . (4)

Within the scope of this paper, we assume a linear microphone trajectory of length $d$ , divided into $L$ equidistant locations spaced by $Δ d$ and indexed as $l = 0, \dots, L - 1$ , as illustrated in Figure 1, such that $L = d / Δ d + 1$ . We consider the problem of estimating $h (l)$ using the source signal $x (k)$ and the observed signal $y (l, k)$ if the microphone moves along the trajectory at a constant velocity $μ_{rx}$ —that is, if $l$ varies over time. For this, we make the assumptions that $h (0)$ and $h (L - 1)$ are known, and that the time intervals which the individual reflections occupy within the time-variant RIR over the range of the trajectory do not overlap. An example of $h (0)$ and $h (L - 1)$ including only first-order reflections is shown in Figure 2, where peaks corresponding to the same source or reflection are labeled by the same number in each RIR.

Figure 1

Figure 1. Linear microphone trajectory of length $d$ divided into $L$ equidistant locations spaced by $Δ d$ and indexed as $l = 0, \dots, L - 1$ such that $d = (L - 1) Δ d$ .

Figure 2

Figure 2. An example of simulated RIRs, $h (0)$ and $h (L - 1)$ , including only first-order reflections. Peaks corresponding to the same source or reflection are labeled by the same number in each RIR.

2.2 Related state of the art

Before introducing the proposed approach, it is instructive to briefly introduce the most relevant concepts used in the state of the art. On the one hand, we consider RIR interpolation approaches that estimate $h (l)$ from $h (0)$ and $h (L - 1)$ only based on a room acoustic sound propagation model, without making use of an observed signal $y (l, k)$ . On the other hand, we consider data-driven approaches that estimate $h (l)$ in an adaptive manner from a data set containing $x (k)$ and $y (l, k)$ over multiple time indexes $k$ for a moving microphone, but without exploiting a room acoustic sound propagation model.

As previously mentioned, some RIR interpolation approaches (Kearney et al., 2009; Geldert et al., 2023) are motivated by the ISM for room acoustic sound propagation, which expresses an RIR as a sum of contributions from the original source and so-called image sources representing reflections from the boundaries of the room. Based on this model, RIR interpolation approaches infer the location-variant time of arrival (TOA) as well as the amplitude of the direct component and individual reflections (or equivalently, source and image source components) at a particular location $l$ using the RIRs $h (0)$ and $h (L - 1)$ . The general principle can be described as follows. Assuming that reflections corresponding to the same image source in $h (0)$ and $h (L - 1)$ can be identified, let $n_{r} (0)$ and $n_{r} (L - 1)$ be vectors collecting the respective time-shift indices of the direct component and these reflections in $h (0)$ and $h (L - 1)$ . The corresponding time-shift indices of $h (l)$ can be interpolated as

n_{r} (l) = n_{r} (0) + r o u n d (β (l) (n_{r} (L - 1) - n_{r} (0))), (5)

where $n_{r} (L - 1) - n_{r} (0)$ can be conceived as sample-based time differences of arrival (TDOAs), $β (l) \in [0,1]$ with $β (0) = 0$ and $β (L - 1) = 1$ , and $r o u n d (\cdot)$ defines a rounding operation. The function $β (l)$ can be defined such that the interpolation is correct for the direct component (Kearney et al., 2009). Applying a similar interpolation rule for the amplitudes of the direct component and reflections, an estimate of $h (l)$ can then be obtained. The collections of corresponding indices $n_{r} (0)$ and $n_{r} (L - 1)$ can, for instance, be found by means of Dynamic Time Warping with $h (0)$ and $h (L - 1)$ as inputs (Kearney et al., 2009; Bruschi et al., 2020). Here, an accurate identification of all corresponding reflections requires that their order of arrival is the same in $h (0)$ and $h (L - 1)$ .

Data-driven approaches (Enzner, 2010) to the estimation of location- or time-variant RIRs do not rely on explicit modeling of room acoustic sound propagation, but instead perform adaptive system identification given a data set containing $x (k)$ and $y (l, k)$ over multiple time indexes $k$ . For the moment, let us assume that the moving microphone crosses one location $l$ at each time instant $k$ , such that the location-variant RIR $h (l)$ can equivalently be thought of as a time-variant RIR $h (k)$ with $k = l$ . The convolution in Equation 1 then becomes

y (k) = \sum_{n = 0}^{N - 1} h (k, n) x (k - n) + v (k) . (6)

This can be expressed similarly to Equation 4 using the definitions in Equations 2 and 3. A popular approach to adaptively estimate $h (k)$ is the Kalman filter, whose update equations are obtained from a state-space model where $h (k)$ is the state. A commonly used state-space model can be defined as

h (k) = α h (k - 1) + w (k), (7)

y (k) = x^{T} (k) h (k) + v (k), (8)

where the state equation in Equation 7 models the evolution of the state as a first-order difference equation, and the observation equation (Equation 8) relates the state to the observation $y (k)$ and is identical to Equation 6. The model commonly chosen in Equation 7 defines a factor $α$ , referred to as the “forgetting” or “transition” factor, and a noise term $w (k)$ , referred to as “process noise”, for which the covariance matrix is required in the Kalman filter. In practice, the choice of $α$ and the covariance matrix of $w (k)$ is typically not directly motivated by a physical model of the RIR evolution. Instead, they are often considered tuning parameters that can be used to control the convergence behavior of the Kalman filter, although they can be tuned depending on physical parameters such as the velocity of the microphone (Nophut et al., 2024).

3 Proposed RIR state-space model

In the proposed approach, we aim to include prior knowledge on room acoustic sound propagation into a state-space model for time-variant RIR estimation. To this end, rather than resorting to a scalar factor as in Equation 7, we assume that the relation between two RIRs $h (l)$ and $h (l - 1)$ on the trajectory can approximately be modeled by a location-invariant transition matrix $A$ as

h (l) \approx A h (l - 1) (9)

for $l = 1, \dots, L - 1$ . The transition matrix $A$ will be used to model changes in the TOA of the direct component and the early reflections between the neighboring locations $l$ and $l - 1$ and will be applied in the state equation of the proposed state-space model. As described in Section 4, the analytical definition of $A$ will be based on the ISM, which also serves as the foundation of the RIR interpolation approaches discussed in Section 2.2. Furthermore, we will show that the transition matrix $A$ can be modeled as location-invariant if the time intervals which the individual reflections occupy in the time-variant RIR over the range of the trajectory do not overlap, which is also implicitly assumed in Kearney et al. (2009) and Geldert et al. (2023). In practice, an estimate of $A$ can be obtained from the presumed knowledge of the RIRs $h (0)$ and $h (L - 1)$ at the start and the end of the trajectory, as will be discussed in Section 4.3.

While it is possible to interpolate between $h (0)$ and $h (L - 1)$ based on $A$ only as $h (l) \approx A^{l} h (0)$ , a more accurate estimate can be obtained by using the signals recorded along the trajectory with a microphone moving at a constant velocity. Without loss of generality, let $k = 0$ denote the start of the recording at location $l = 0$ . For a given velocity $μ_{rx}$ of the microphone and temporal sampling period $T_{s}$ , the smallest possible spatial period $Δ d$ that could be used is given by $μ_{rx} T_{s}$ . In this case, one RIR estimate can be obtained per time sample. In the proposed approach, we also consider spatial periods of

Δ d = Ω μ_{rx} T_{s},

where $Ω$ is a positive integer that can be understood as a spatial down-sampling factor. If the microphone is at location with index $l$ , we then observe sample

k = l Ω .

With Equation 9 and defining $y_{Ω} (l) = y (l Ω)$ , $x_{Ω} (l) = x (l Ω)$ , and $v_{Ω} (l) = v (l Ω)$ for ease of presentation, we can define the state-space model of the proposed approach as

h (l) = A h (l - 1) + w (l), (10)

y_{Ω} (l) = x_{Ω}^{T} (l) h (l) + v_{Ω} (l), (11)

where $w (l)$ is the process noise accounting for modeling errors in the definition of $A$ . From the state-space model, an estimate $\hat{h} (l)$ of $h (l)$ can be obtained using the Kalman filter with one recursion per location $l$ . The Kalman filter update equations are outlined in the next section.

3.1 Update equations of the Kalman filter

The Kalman filter (Simon, 2006) can be used to recursively estimate the state defined by a state-space model. For the proposed state-space model in Equations 10 and 11, let $Q = E [w (l) w^{T} (l)]$ and $R = E [v_{Ω}^{2} (l)]$ be the presumed location-invariant covariance matrix of $w (l)$ and variance of $v_{Ω} (l)$ , respectively, where $E [\cdot]$ denotes the expectation operation. The update equations of the Kalman filter for the model in Equations 10 and 11 are then given by

\hat{h} (l) = A {\hat{h}}^{+} (l - 1), . (12)

P (l) = A P^{+} (l - 1) A^{T} + Q, . (13)

k (l) = \frac{P (l) x_{Ω} (l)}{x_{Ω}^{T} (l) P (l) x_{Ω} (l) + R}, (14)

{\hat{h}}^{+} (l) = \hat{h} (l) + k (l) (y_{Ω} (l) - x_{Ω}^{T} (l) \hat{h} (l)), . (15)

P^{+} (l) = (I - k (l) x_{Ω}^{T} (l)) P (l), (16)

where Equations 12 and 13 are commonly referred to as the prediction step producing prior estimates, and Equations 14–16 are referred to as the update step producing posterior estimates. In these equations, $\hat{h} (l)$ and ${\hat{h}}^{+} (l)$ are state estimates, $P (l)$ and $P^{+} (l)$ are estimates of the state-estimation error covariance matrix, $k (l)$ is the Kalman gain, $I$ the identity matrix, and the superscript ⁺ distinguishes posterior from prior estimates. In the prediction step Equations 12 and 13, the previously acquired posterior estimates ${\hat{h}}^{+} (l - 1)$ and $P^{+} (l - 1)$ are propagated based on the state Equation 10 only, without taking into account the observation at recursion $l$ , yielding the prior estimates $\hat{h} (l)$ and $P (l)$ . The update step Equations 14–16 is based on the observation Equation 11. Here, the Kalman gain $k (l)$ and the error term $y_{Ω} (l) - x_{Ω}^{T} (l) \hat{h} (l)$ are computed based on $y_{Ω} (l)$ and $x_{Ω} (l)$ and are used to update the prior estimates $\hat{h} (l)$ and $P (l)$ , yielding the posterior estimates ${\hat{h}}^{+} (l)$ and $P^{+} (l)$ .

To implement the Kalman filter, $Q$ and $R$ are required as inputs. While $R$ can potentially be obtained from background noise recordings, $Q$ can be considered a tuning parameter as the variance of $w (l) = h (l) - A h (l - 1)$ will not be known exactly in practice. The Kalman filter also needs to be initialized by defining ${\hat{h}}^{+} (0)$ and $P^{+} (0)$ . As we assume that $h (0)$ is known, we set ${\hat{h}}^{+} (0) = h (0)$ . This choice of ${\hat{h}}^{+} (0)$ corresponds to the assumption that the initial state estimation error is 0, such that $P^{+} (0)$ can be chosen to have a small norm. Finally, a practical implementation requires an estimate of $A$ , which will be discussed in more detail in Sections 4.2 and 4.3.

At this point, it is instructive to interpret Equations 12–16 in relation to the state of the art as discussed in Section 2. In the interpolation approaches in Kearney et al. (2009) and Geldert et al. (2023) on the one hand, recorded signals are not available, which corresponds to the assumption that $y_{Ω} (l) = 0$ and $x_{Ω} (l) = 0$ . In this case, we find that ${\hat{h}}^{+} (l) = \hat{h} (l)$ and $P^{+} (l) = P (l)$ in Equations 15 and 16, i.e. $\hat{h} (l)$ no longer depends on $P (l)$ and the Kalman filter update equations effectively reduce to $\hat{h} (l) = A \hat{h} (l - 1) = A^{l} h (0)$ . As $A$ is defined by interpolating TOAs of reflections between $h (0)$ and $h (L - 1)$ based on a room acoustic model, this case is conceptually similar to the aforementioned interpolation approaches and can be referred to as “linear interpolation.” The state-space space approaches in Enzner (2010), on the other hand, are obtained if $Ω = 1$ and the state transition matrix $A$ is replaced by a scalar $α$ .

4 Proposed room acoustic model-based transition matrix

This section provides a detailed explanation of the methodology followed to obtain a suitable room acoustic model-based transition matrix for use in Equation 10 and is organized as follows. In Section 4.1, we derive an analytical expression for a location-variant transition matrix model $A (l)$ based on the ISM. A suitable modification of the analytical transition matrix for use in the state-space model is then proposed in Section 4.2 in order to obtain a location-invariant transition matrix model $A$ valid within the limits of the trajectory. The models presented are parameterized by the exact reflection TOAs, which are not inherently available in practice. Therefore, Section 4.3 introduces an approach based on DTW to obtain an estimate of $A$ from $h (0)$ and $h (L - 1)$ by TDOA interpolation.

4.1 Analytical location-variant transition matrix model

The objective of this section is to derive a transition matrix model $A (l) \in R^{N \times N}$ that approximates the mapping between two RIRs $h (l)$ and $h (l - 1)$ :

h (l) \approx A (l) h (l - 1) . (17)

An illustration of this relation along a given trajectory is shown in Figure 3a.

Figure 3

Figure 3. (a) Location-variant transition matrix model $A (l)$ providing mapping between two RIRs $h (l)$ and $h (l - 1)$ along a given trajectory. (b) Location-invariant transition matrix model $A$ that approximates the mapping between two RIRs $h (l)$ and $h (l - 1)$ along a given trajectory where the assumptions detailed in Section 4.2 hold. (c) Equivalent transition matrix models, $A (1)$ and $A$ , for the special case of $L = 2$ —that is, a transition matrix between the start and end points of the trajectory.

To achieve this objective, we make use of the ISM. In this model, the RIR is expressed as the sum of contributions from the original sound source and additional image sources, which represent reflections within the room. Let the RIR at location $l$ with continuous time-shift $t$ be denoted by $h (l, t)$ . With $δ (t)$ denoting the Dirac delta function, the ISM models $h (l, t)$ as

h (l, t) = \sum_{r \in R} a_{r} (l) δ (t - τ_{r} (l)), (18)

where $r \in R = {0,1, . . R - 1}$ is the (image) source or reflection index, and $τ_{r} (l)$ and $a_{r} (l)$ , respectively, denote the TOA and amplitude of the (image) source $r$ observed at location $l$ . Ultimately, we aim to express a discretized version of $h (l, t)$ in terms of a discretized version of $h (l - 1, t)$ as in Equation 17. To this end, we start by introducing parameters of $h (l - 1, t)$ into $h (l, t)$ as follows.

h (l, t) = \sum_{r \in R} a_{r} (l - 1) \frac{a_{r} (l)}{a_{r} (l - 1)} δ (t - τ_{r} (l) + τ_{r} (l - 1) - τ_{r} (l - 1)) . (19)

The inclusion of $a_{r} (l - 1) / a_{r} (l - 1)$ and $τ_{r} (l - 1) - τ_{r} (l - 1)$ preserves the definition of $h (l, t)$ in Equation 18, while allowing us to find a representation that more closely resembles $h (l - 1, t)$ . By incorporating $τ_{r} (l - 1) - τ_{r} (l - 1)$ , we can isolate a Dirac delta function that only includes a time shift of $τ_{r} (l - 1)$ using the following identity derived from the sifting property (Hoskins, 2009):

δ (t - τ_{r} (l) + τ_{r} (l - 1) - τ_{r} (l - 1)) = \int_{- \infty}^{+ \infty} δ (t - τ_{r} (l) + τ_{r} (l - 1) - t^{'}) δ (t^{'} - τ_{r} (l - 1)) d t^{'}

Therefore, Equation 19 can alternatively be expressed as

h (l, t) = \sum_{r \in R} \int_{t^{'}} \frac{a_{r} (l)}{a_{r} (l - 1)} δ (t - τ_{r} (l) + τ_{r} (l - 1) - t^{'}) a_{r} (l - 1) δ (t^{'} - τ_{r} (l - 1)) d t^{'} . (20)

To obtain a discrete time-shift representation in terms of the time-shift index $n$ , we impose a band limit with critical sampling period $T_{s}$ . Band-limiting a Dirac delta function results in a sinc-function (Hoskins, 2009), which will be sampled at discrete time indices. Defining

Δ_{r} (l) = τ_{r} (l) - τ_{r} (l - 1) (21)

as the TDOA of reflection $r$ between locations $l$ and $l - 1$ , we can now write Equation 20 as

h (l, n) = \sum_{r \in R} \sum_{n^{'}} \frac{a_{r} (l)}{a_{r} (l - 1)} s i n c (n - \frac{Δ_{r} (l)}{T_{s}} - n^{'}) a_{r} (l - 1) s i n c (n^{'} - \frac{τ_{r} (l - 1)}{T_{s}}) . (22)

At this point, it is advantageous to introduce a time-shift-dependent approximation of Equation 22. At any $n$ , the RIR sample $h (l, n)$ is dominated by a subset of reflections only, which we denote as $\tilde{R} (l, n)$ . Due to the decaying nature of the sinc-function, the product $s i n c (n - Δ_{r} (l) / T_{s} - n^{'}) s i n c (n^{'} - τ_{r} (l - 1) / T_{s})$ in Equation 22 will have a prominent peak only if the two sinc-functions are closely aligned—that is, if $n T_{s}$ is near $Δ_{r} (l) + τ_{r} (l - 1) = τ_{r} (l)$ . In the following, the product is considered negligible for $| n T_{s} - τ_{r} (l) | > ϵ$ , where $ϵ$ is a misalignment threshold and $2 ϵ$ can be understood as the effective temporal width of reflection $r$ . For ease of presentation, we define the interval of width $2 ϵ$ around $τ_{r} (l)$ as

T_{r} (l) = [τ_{r} (l) - ϵ, τ_{r} (l) + ϵ], (23)

which contains the TOAs of the non-negligible components of reflection $r$ in the RIR at location $l$ . With Equation 23, the misalignment condition $| n T_{s} - τ_{r} (l) | > ϵ$ can alternatively be expressed as $n \notin T_{r} (l) / T_{s}$ . Based on this, we define $\tilde{R} (l, n)$ as¹

\tilde{R} (l, n) = \{r \in R |n \in \frac{T_{r} (l)}{T_{s}}\} . (24)

With Equation 24, we can therefore approximate Equation 22 as

h (l, n) \approx \sum_{n^{'}} \sum_{r \in \tilde{R} (l, n)} \frac{a_{r} (l)}{a_{r} (l - 1)} s i n c (n - \frac{Δ_{r} (l)}{T_{s}} - n^{'}) a_{r} (l - 1) s i n c (n^{'} - \frac{τ_{r} (l - 1)}{T_{s}}), (25)

where we have swapped the order of summation over $n$ and $r$ . Analogous to Equation 18, we introduce the coefficients $h (l - 1, n^{'})$ defined by the ISM as

h (l - 1, n^{'}) = \sum_{r^{'} \in R} a_{r^{'}} (l - 1) s i n c (n^{'} - \frac{τ_{r^{'}} (l - 1)}{T_{s}})

and rewrite Equation 25 as

h (l, n) \approx \sum_{n^{'}} \sum_{r \in \tilde{R} (l, n)} \frac{a_{r} (l)}{a_{r} (l - 1)} s i n c (n - \frac{Δ_{r} (l)}{T_{s}} - n^{'}) h (l - 1, n^{'}) - e (l, n) (26)

with the error term $e (l, n)$ defined as

e (l, n) = \sum_{n^{'}} \sum_{\begin{array}{c} r \in \tilde{R} (l, n) \\ r^{'} \in R \\ r^{'} \neq r \end{array}} \frac{a_{r} (l)}{a_{r} (l - 1)} s i n c (n - \frac{Δ_{r} (l)}{T_{s}} - n^{'}) a_{r^{'}} (l - 1) s i n c (n^{'} - \frac{τ_{r^{'}} (l - 1)}{T_{s}}) . (27)

Given the relation between $h (l, n)$ and $h (l - 1, n^{'})$ in Equation 26, we can define the elements of $A (l)$ in Equation 17 assuming that $e (l, n)$ as defined in Equation 27 is negligible. Before turning to the elements of $A (l)$ , let us first verify the conditions under which this assumption holds. As before, we can argue that the product $s i n c (n - Δ_{r} (l) / T_{s} - n^{'}) s i n c (n^{'} - τ_{r^{'}} (l - 1) / T_{s})$ will be negligible if the two sinc-functions are misaligned, where $| n T_{s} - Δ_{r} (l) - τ_{r^{'}} (l - 1) | > ϵ$ . As shown in Supplementary Appendix SA, this condition is equivalent to

T_{r^{'}} (l - 1) \cap T_{r} (l - 1) = \emptyset, r^{'} \neq r, (28)

which essentially says that the reflections in $h (l - 1, n)$ should be well-separated in time. Neglecting $e (l, n)$ and referring back to the vector representation of RIRs in Equation 3, it can be seen that Equation 26 can be expressed in the form $h (l) \approx A (l) h (l - 1)$ as anticipated in Equation 17, where the element of $A (l)$ at index $(n + 1, n^{'} + 1)$ is given by

A_{n + 1, n^{'} + 1} (l) = \sum_{r \in \tilde{R} (l, n)} \frac{a_{r} (l)}{a_{r} (l - 1)} s i n c (n - \frac{Δ_{r} (l)}{T_{s}} - n^{'}) (29)

if $\tilde{R} (l, n) \neq \emptyset$ and otherwise $A_{n + 1, n^{'} + 1} (l) = 0$ —that is, at indices where $h (l, n)$ does not contain reflections. In the actual implementation of such a matrix, one may choose to truncate the sinc-functions in Equation 29 for $| n^{'} T_{s} - τ_{r} (l - 1) | > ϵ$ . As illustrated in Figure 4, this results in a matrix $A (l)$ composed of Toeplitz-structured submatrices of size $⌊ \frac{2 ϵ}{T_{s}} ⌋ \times ⌊ \frac{2 ϵ}{T_{s}} ⌋$ containing sampled sinc-functions, with one such submatrix per reflection $r$ . These sampled sinc-functions are vertically and horizontally centered around $τ_{r} (l) / T_{s}$ and $τ_{r} (l - 1) / T_{s}$ , which are non-integer in general.² Practically speaking, we note that submatrices centered above the main diagonal of $A (l)$ shift the corresponding reflection in $h (l - 1)$ toward smaller TOAs, while entries below shift it toward larger TOAs.

Figure 4

Figure 4. An example of an analytical location-variant transition matrix $A (l)$ constructed from the early segments of two impulse responses, $h (l)$ and $h (l - 1)$ . The grid lines indicate the positions of the respective reflection TOAs.

4.2 Analytical location-invariant transition matrix model

For the application at hand, we anticipate employing a location-invariant transition matrix model as described in Equation 9 and illustrated in Figure 3b. We show that by using a location-invariant matrix, in which TOA intervals are defined to span the entire trajectory, we can reduce the necessity for accurate TOA estimates. This efficiency is achieved because these TOA intervals, along with TDOA estimates, can be obtained directly from $h (0)$ to $h (L - 1)$ using the DTW-based method outlined in Section 4.3. In this section, we therefore derive a location-invariant transition matrix $A$ from the definition of $A (l)$ in Equation 29. As will be shown, location-invariance can be achieved within the limits of the trajectory under some assumptions.

In Equation 29, we observe two dependencies on $l$ : in the TDOA $Δ_{r} (l)$ and in the set $\tilde{R} (l, n)$ . The dependency of $Δ_{r} (l)$ on $l$ disappears under the following conditions. As illustrated in Figure 5, under the assumptions of a far-field scenario where the (image) sources are far away in relation to the length of the trajectory, $a_{r} (l) / a_{r} (l - 1) = 1$ and $Δ_{r} (l) = Δ_{r}$ , meaning corresponding reflection TOAs are shifted by the same amount at each step along the trajectory. Consequently, the reflection TOA $τ_{r} (l)$ can be approximated by the linear relation

τ_{r} (l) = τ_{r} (0) + l Δ_{r} . (30)

Figure 5

Figure 5. Illustration of the far-field assumption used to mitigate the dependence of $Δ_{r} (l)$ on $l$ .

The dependence of $\tilde{R} (l, n)$ on $l$ as defined in Equation 24 can be resolved as follows. The model in Equation 9 is required to hold for $l = 1, \dots, L - 1$ . Over this range of $l$ , the components of reflection $r$ occupy the interval

T_{r} = [τ_{r | m i n}, τ_{r | m a x}] (31)

with τ_{r | m i n} = m i n (τ_{r} (1), τ_{r} (L - 1)) - ϵ, (32)

τ_{r | m a x} = m a x (τ_{r} (1), τ_{r} (L - 1)) + ϵ, (33)

which can be understood as an extension of Equation 23.³

Based on Equation 31, we define the set similar to Equation 24 without the dependency on $l$ :

\tilde{R} (n) = \{r \in R |n \in \frac{T_{r}}{T_{s}}\} (34)

The approximation in Equation 26 can then be replaced by

h (l, n) \approx \sum_{n^{'}} \sum_{r \in \tilde{R} (n)} s i n c (n - \frac{Δ_{r}}{T_{s}} - n^{'}) h (l - 1, n^{'}) - e (l, n), (35)

with the error term $e (l, n)$ redefined as

e (l, n) = \sum_{n^{'}} \sum_{\begin{array}{c} r \in \tilde{R} (n) \\ r^{'} \in R \\ r^{'} \neq r \end{array}} s i n c (n - \frac{Δ_{r}}{T_{s}} - n^{'}) s i n c (n^{'} - \frac{τ_{r^{'}} (l - 1)}{T_{s}}) . (36)

Again, the product $s i n c (n - Δ_{r} / T_{s} - n^{'}) s i n c (n^{'} - (τ_{r^{'}} (l - 1) / T_{s})$ in Equation 36 will be negligible if the two sinc-functions are misaligned, where $| n T_{s} - Δ_{r} - τ_{r^{'}} (l - 1) | > ϵ$ . As shown in the Supplementary Appendix SB, this condition is equivalent to

T_{r^{'}}^{-} \cap T_{r}^{-} = \emptyset, r^{'} \neq r (37)

where

\begin{align} T_{r}^{-} & = T_{r} - Δ_{r} \\ = [τ_{r | m i n}^{-}, τ_{r | m a x}^{-}], \end{align} (38)

with τ_{r | m i n}^{-} = τ_{r | m i n} - Δ_{r} = m i n (τ_{r} (0), τ_{r} (L - 2)) - ϵ - Δ_{r}, (39)

τ_{r | m a x}^{-} = τ_{r | m a x} - Δ_{r} = m a x (τ_{r} (0), τ_{r} (L - 2)) + ϵ - Δ_{r}, (40)

and the limits in Equations 39, 40 are obtained from Equations 32, 33, and $τ_{r} (1) - Δ_{r} = τ_{r} (0)$ and $τ_{r} (L - 1) - Δ_{r} = τ_{r} (L - 2)$ according to Equation 30. The condition in Equation 37 essentially states that the intervals that the individual reflections occupy over the range of the trajectory (with exclusion of the end point $l = L - 1$ ) may not overlap. This can be understood as an extension of condition Equation 28 derived for the location-variant transition matrix. Neglecting $e (l, n)$ and referring back to the vector representation of RIRs in Equation 3, it can be seen that Equation 35 can be expressed in the form $h (l) \approx A h (l - 1)$ as anticipated in Equation 9, where the element of $A$ at index $(n + 1, n^{'} + 1)$ is given by

A_{n + 1, n^{'} + 1} = \sum_{r \in \tilde{R} (n)} s i n c (n - \frac{Δ_{r}}{T_{s}} - n^{'}) . (41)

if $\tilde{R} (n) \neq \emptyset$ and otherwise $A_{n + 1, n^{'} + 1} (l) = 0$ .

Given the assumption of access to exact reflection TOAs $τ_{r} (0)$ and $τ_{r} (L - 1)$ , we estimate $Δ_{r}$ using Equation 30 as follows:

Δ_{r} = \frac{τ_{r} (L - 1) - τ_{r} (0)}{L - 1}, (42)

Note that from a conceptual point of view, the terms $τ_{r} (l)$ , $τ_{r} (0)$ , $τ_{r} (L - 1) - τ_{r} (0)$ , and $l / (L - 1)$ in Equations 30 and 42 are comparable to $n_{r} (l)$ , $n_{r} (0)$ , $n_{r} (L - 1) - n_{r} (0)$ , and $β (l)$ in Equation 5, respectively. Note, however, that the latter are sample-based, while the former are defined in continuous time-shifts and can indeed be used to implement non-integer shifts by using the sinc-functions in Equation 29.

Keeping in mind Figures 3a and b, for an intuitive understanding of the relationship between $A (l)$ and $A$ , we refer the reader to the toy example illustrated in Figure 6. Here, a trajectory with $L = 3$ is considered, and matrices $A (l)$ and $A$ are, respectively, formed using Equations 29 or 41 with integer TOAs and $2 ϵ / T_{s} = 3$ . It can be seen that the result of taking $A^{L - 1}$ yields an identical matrix to that of the multiplication of $A (L - 1) \dots A (2) A (1)$ (Figures 6c, e). Furthermore, we note that the resulting matrix is equivalent to the matrix constructed using either Equations 29 or 41 for the case of $L = 2$ : a transition matrix between the start and end points of the trajectory as shown in Figure 3c.

Figure 6

Figure 6. Illustration of the analytical transition matrices for a toy problem along a trajectory with $L = 3$ , with $\frac{2 ϵ}{T_{s}} = 3$ and integer reflection TOAs. Subfigures (a) and (b) show the location-variant matrices $A (l)$ for $l = 1$ and $l = 2$ , respectively. Subfigure (c) shows their product, A(2) A(1). Subfigure (d) shows the location-invariant matrix $A$ , while (e) depicts $A^{2}$ (i.e., $A^{L - 1}$ for L = 3).

4.3 Dynamic Time Warping transition matrices

As it stands, the analytical solution proposed in Section 4.2 cannot be applied directly to RIRs and requires inherent knowledge of the TOAs in $h (0)$ and $h (L - 1)$ . Therefore, we propose an approach to estimate the matrix $A$ using a DTW algorithm to temporally align elements in $h (0)$ and $h (L - 1)$ . DTW achieves temporal alignment by flexibly warping time, ensuring that similar patterns in the two sequences are matched, despite variations in timing. As discussed in Section 2.2, prior applications of DTW have involved linear interpolation between two impulse responses. However, the method proposed here builds upon the observation that the warp path derived from DTW can be linked to both the set $\tilde{R} (n)$ and $τ_{r} (L - 1) - τ_{r} (0)$ , the latter of which is necessary for estimating $Δ_{r}$ via Equation 42. Leveraging these parameters enables the construction of an appropriate analytical location-invariant transition matrix $A$ , denoted as ${\hat{A}}_{dtw}$ .

The DTW algorithm (Müller, 2007) involves computing an accumulated cost matrix $D \in R^{(N + 1) \times (N + 1)}$ , representing the cumulative costs of aligning each element in $h (L - 1)$ with each element in $h (0)$ , respectively, indexed by $n$ and $n^{'}$ . Here, the cost is defined as the Euclidean distance $‖ h (L - 1, n) - h (0, n^{'}) ‖_{2}$ , and $D$ is constructed using the recurrence relation

D_{n + 2, n^{'} + 2} = ‖ h (L - 1, n) - h (0, n^{'}) ‖_{2} + \min \{D_{n + 1, n^{'} + 2}, D_{n + 2, n^{'} + 1}, D_{n + 1, n^{'} + 1}\}, (43)

which is initialized such that $D_{1,1} = 0$ , while $D_{n + 2,1}$ and $D_{1, n^{'} + 2}$ are set to $\infty$ for all $n$ and $n^{'}$ , respectively⁴. Owing to the prominent reflection peaks in the structure of early RIRs, the matrix $D$ will be formed with grid-like entries (Figure 7b)

Figure 7

Figure 7. (a) Location-invariant matrix $A$ for the special case of $L = 2$ and (b) accumulated distance matrix $D$ constructed from $h (0)$ and $h (L - 1)$ overlaid with DTW warp path (white line).

A warp path, denoted by pairs of indices $(n + 2, n^{'} + 2)$ representing optimal alignment in the sense of smallest accumulated cost, is established by backtracking from $D_{N + 2, N + 2}$ to $D_{2,2}$ . The recurrence relation in Equation 43 guarantees a monotonically increasing order, thus ensuring that peaks in sequences are mapped in the same order they appear, preserving temporal correspondence during alignment. Note that a single point in one sequence can correspond to multiple points in the other, and vice versa. An illustration of the warp path through $D$ can be seen in Figure 7b.

The elements along the horizontal and vertical “gridlines” in $D$ contain local maxima values, whose row and column indices should respectively correspond to integer estimates of reflection TOA time-shift indices $\frac{τ_{r} (L - 1)}{T_{s}}$ and $\frac{τ_{r} (0)}{T_{s}}$ . The elements at the intersections of these “gridlines” represent local minima in $D$ , signifying potential alignment points between reflection peaks in $h (L - 1)$ and $h (0)$ . Consequently, the warp path tends to traverse these intersections in a diagonal manner, guided by the minimization of alignment costs. Here, a clear link between the shape of the warp path through $D$ (Figure 7b) and the analytical matrix $A$ constructed using Equation 41 for $L = 2$ (Figure 7a) becomes apparent. Recall that the case of $L = 2$ results in a matrix $A$ that maps the RIR at the start of the trajectory to the RIR at its end (Figure 3c).

In order to use the shape of the warp path through $D$ , we need to construct a matrix $W \in R^{N \times N}$ , with ones placed at the indices corresponding to the pairs in the warp path and zeros elsewhere:

W_{n + 1, n^{'} + 1} = \{\begin{cases} 1 & if (n + 2, n^{'} + 2) is in the warp path \\ 0 & otherwise \end{cases}

Such a matrix $W$ is shown in Figure 8d for the previous toy problem with integer TOAs and $\frac{2 ϵ}{T_{s}} = 3$ . It can be seen that the diagonal segments of $W$ are consistent with the toy problem matrix $A^{L - 1}$ , which is again displayed in Figure 8b. We therefore propose to leverage the positions of the diagonal segments in $W$ to obtain the estimates of $Δ_{r}$ and $\tilde{R} (n)$ needed to construct $A$ as follows.

Figure 8

Figure 8. Illustration of the analytical transition matrices for a toy problem along a trajectory with $L = 3$ , with $\frac{2 ϵ}{T_{s}} = 3$ and integer reflection TOAs. Subfigure (a) shows the location-invariant matrix A, while (b) depicts $A^{2}$ (i.e., $A^{L - 1}$ for $L = 3$ ). Subfigures (c) and (d) respectively present the matrix $W$ , derived from the DTW warp path, and the estimated location-invariant matrix ${\hat{A}}_{dtw}$ .

Firstly, integer estimates of $(τ_{r} (L - 1) - τ_{r} (0)) / T_{s}$ are obtained by measuring the “distances” (number of elements) from the diagonal segments in $W$ to the main diagonal. These estimates can in turn be used to find estimates of $Δ_{r}$ using the relation in Equation 42 and the known sampling period $T_{s}$ . We employ ${\hat{Δ}}_{r}$ and ${\hat{τ}}_{r}$ to distinguish the estimates found in this manner:

{\hat{Δ}}_{r} = \frac{{\hat{τ}}_{r} (L - 1) - {\hat{τ}}_{r} (0)}{L - 1}, (44)

Secondly the start and end indices of the diagonal segments, denoted, respectively, as $(n_{r | st}, n_{r | st}^{'})$ and $(n_{r | en}, n_{r | en}^{'})$ , are retrieved. These indices are then used to find a suitable estimate of the range $T_{r}$ for the set $\tilde{R} (n)$ in Equation 34 as

{\hat{T}}_{r} = [{\hat{τ}}_{r | m i n}, {\hat{τ}}_{r | m a x}] (45)

{\hat{τ}}_{r | m i n} = T_{s} \cdot m i n (n_{r | st}^{'} + \frac{{\hat{Δ}}_{r}}{T_{s}}, n_{r | st}), (46)

{\hat{τ}}_{r | m a x} = T_{s} \cdot m a x (n_{r | en}, n_{r | en}^{'} + \frac{{\hat{Δ}}_{r}}{T_{s}}), (47)

It should be ensured that the estimated ranges still result in the condition in Equation 37 being met—that is, that corresponding column ranges do not overlap. Finally, we construct the matrix ${\hat{A}}_{dtw}$ using Equations 41 and 34 with $Δ_{r} = {\hat{Δ}}_{r}$ and $T_{r} = {\hat{T}}_{r}$ where ${\hat{Δ}}_{r} and {\hat{T}}_{r}$ are obtained using Equations 44 to 47. Figure 8c illustrates an example of a matrix obtained using this method for the previous toy problem. It is apparent that ${\hat{A}}_{dtw}$ closely aligns with the desired $A$ matrix shown in Figure 8a. Importantly, the described method allows us to derive $A$ without needing to explicitly estimate individual TOAs.

5 Simulations

In this section, we systematically evaluate the performance of transition matrices $A$ and ${\hat{A}}_{dtw}$ in state-space RIR estimation. The algorithms considered are summarized in Table 1, with $LI - A$ and $KF - α$ serving as reference methods, and $KF - A$ and $KF - {\hat{A}}_{dtw}$ representing our primary contributions from the proposed methods outlined in Section 3. In Section 5.1, we provide details on the experimental setup, including the simulated acoustic environment, Kalman filter parameters, and the performance measure used. The results of the experiments are presented in Section 6.

Table 1

Table 1. Description of the algorithms compared in the experimental results.

5.1 Acoustic environment

For simplicity, we consider a box-shaped room with dimensions bounded by $[0,4.50] \times [0,5.80] \times [0,2.90]$ m. We place a fixed-position sound source at $[1.05, 2.98, 1.17]$ m, and define a linear trajectory for the microphone spanning approximately 0.7 m, starting at $[1.94, 3.10, 1.09]$ m and ending at $[1.99, 2.95, 0.37]$ m. Along this trajectory, we have established that all first-order reflections arrive in the same order. The set-up is shown in Figure 9.

Figure 9

Figure 9. 2-D illustration depicting the source position and trajectory of the microphone within a box-shaped room, as utilized for simulations.

At a standard speech sampling rate of 16 kHz (Nophut et al., 2024) and with a microphone velocity of 0.25 m/s, the ISM (Allen and Berkley, 1979) is used to simulate the ground-truth early RIRs $h (l)$ , including reflections up to the second order, at various points along the trajectory. Subsequently, the observed output signal $y_{Ω} (l)$ is obtained using Equation 11 for a given input $x_{Ω} (l)$ and measurement noise $v_{Ω} (l) \sim N (0, σ_{y}^{2})$ , where the variance $σ_{v}^{2}$ is adjusted based on the experimental conditions. White Gaussian noise is chosen as the input signal, denoted as $x (l) \sim N (0, σ_{x}^{2})$ , with the variance $σ_{x}^{2}$ set as $10 \log_{10} (σ_{x}^{2}) = - 20$ dB.

5.2 Kalman filter parameters

As outlined in Section 3.1, the Kalman filter relies on specific input parameters for its operation. These parameters include the process noise covariance matrix $Q$ and the variance of the measurement noise $R$ . Additionally, the filter requires initialization with a state estimate ${\hat{h}}^{+} (0)$ and an initial guess for the state estimation error covariance matrix $P^{+} (0)$ .

Given the assumption of knowledge of the RIR at the start of the trajectory $h (0)$ , the filter is initialized with ${\hat{h}}^{+} (0) = h (0)$ . Since we are simulating measurement noise with variance $σ_{v}^{2}$ , we set $R = σ_{v}^{2}$ . As previously mentioned, $R$ can in practice potentially be estimated using background noise recordings. The appropriate $A$ matrix is selected for each algorithm, as outlined in Table 1, with the considered width $2 ϵ$ of the $s i n c$ functions in the proposed analytical transition matrices set to $\frac{2 ϵ}{T_{s}} = 20$ . It was found that for the reference Kalman filter, a transition factor of $α = 1$ was suitable, and varying this did not change the results in a significant manner. After some initial testing of the Kalman filter, the tuning parameter $Q$ was chosen as $Q = σ_{w}^{2} I$ , where $I$ is an identity matrix and $10 \log (σ_{w}^{2}) = - 30$ dB.

5.3 Performance measure

In assessing the accuracy of our estimated RIRs ${\hat{h}}^{+} (l)$ , we employ the normalized misalignment $M_{dB} (l)$ . This metric is defined thus:

M_{dB} (l) = 20 \log_{10} (\frac{‖ {\hat{h}}^{+} (l) - h (l) ‖_{2}}{‖ h (l) ‖_{2}}) .

Here, $h (l)$ represents the ground truth early RIRs.

5.4 Experiments

We categorize experiments into the following four sets to evaluate algorithm performance.

$•$ Experiment 1: Ideal case. In the first set of experiments, we examine the “ideal” scenario where no noise is present on the observed output signal $y_{Ω} (l)$ $(σ_{v}^{2} = 0)$ and every sample along the trajectory is considered $(Ω = 1)$ . Simulated RIRs exclusively include first-order reflections.

$•$ Experiment 2: Noise sensitivity. Building upon the setup in Experiment 1 with $Ω = 1$ and only first-order reflections included, the second set of experiments evaluates algorithm robustness when different noise levels are added to the observed output signal—that is, varying $SNR = {6,0, - 6}$ dB.

$•$ Experiment 3: Spatial sampling effects. The third set of experiments employs different spatial downsampling factors $Ω$ to assess the impact of using fewer data points along the trajectory. In this case, we once again consider $σ_{v}^{2} = 0$ and only first-order reflections, while varying $Ω = {2,8,32}$ . Note that $Ω = 2$ corresponds to $50 %$ of the samples along the trajectory being used.

$•$ Experiment 4: Second-order reflections included. The final set of experiments investigates the inclusion of second-order reflections in the RIRs and the corresponding simulated observed signal. We once again examine the “ideal” scenario where no noise is present on the observed output signal $y_{Ω} (l)$ $(σ_{v}^{2} = 0)$ and every sample along the trajectory is considered $(Ω = 1)$ . Please note that the analytical transition matrix $A$ is still constructed solely using first-order TOAs, as not all second-order reflections meet the necessary assumptions along the given trajectory. As a robustness measure, after constructing $A$ , ones are added along the main diagonal in any part of the matrix with empty rows. This step ensures that second-order reflections are not entirely discarded. On the other hand, the DTW algorithm operates directly on the RIRs, and thus, when constructing ${\hat{A}}_{dtw}$ , it inherently assumes that all reflections arrive in the same order in $h (0)$ and $h (L - 1)$ . In other words, it may incorrectly associate some reflections in $h (0)$ and $h (L - 1)$ as belonging to the same image source.

6 Results

We present the results corresponding to the experiments outlined above. The reader is urged to take careful note of the different $y$ -axis scaling in the presented results.

6.1 Result 1: ideal case

The results of Experiment 1 are presented at the top of Figure 10, where the normalized misalignment $M_{dB} (l)$ between the ground-truth RIR $h (l)$ and the estimated RIR $\hat{h} (l)$ is displayed for the various algorithms at different positions along the trajectory. The bottom of Figure 10 shows the start and end point RIRs used to obtain our transition matrices. This is provided to give the reader an idea of how the source and reflection TOAs change over the trajectory. Furthermore, it can be seen from the labeled TOAs that there are no overlapping intervals, which aligns with our assumptions.

Figure 10

Figure 10. (Top) Result 1: Normalized misalignment along the trajectory between the “ground-truth” RIR $h (l)$ and the estimated RIR $\hat{h} (l)$ using algorithms detailed in Table 1. This is for the case of a noiseless observed output signal $y_{Ω} (l)$ . (Bottom) Simulated RIRs corresponding to the start and end of the trajectory, denoted as $h (0)$ and $h (L - 1)$ , respectively. Peaks corresponding to the same source or reflection are labeled by the same number in each RIR.

At the beginning of the trajectory, all algorithms exhibit zero error owing to their initialization with a known RIR $h (0)$ . Following a subsequent steep increase in error, distinctive properties of the curves resulting from the different algorithms become evident. The reference Kalman filter algorithm $(KF - α)$ , which utilizes a transition factor of $α = 1$ and relies solely on microphone observations, demonstrates a gradual decrease in error throughout the remainder of the trajectory. Excluding the start and end segments, it can be observed that Algorithm $LI - A$ , whose performance solely relies on the model of $A$ , results in an error curve with a subtle dip in it. This non-monotonic behavior can be attributed to the position of the direct source relative to the linear microphone trajectory (Figure 9) and the fact that the source is not sufficiently in the far-field for our assumption that TDOAs remain approximately constant to hold. Toward the end of the trajectory, Algorithm $LI - A$ exhibits a rapid decrease in error. This is because the matrix $A$ is derived from accurate TOAs at the start and end of the trajectory, and therefore estimates should most closely align with the ground-truth RIRs around these positions.

In general, changes in the error curve of the Kalman filter algorithms using a transition matrix ( $KF - A$ and $KF - {\hat{A}}_{dtw}$ ) correlate with changes in the error curve of Algorithm $LI - A$ . However, a more significant dip in error is observed, and the curves nearly converge with those of Algorithm $KF - α$ at the end of the trajectory. This behavior may be a result of the rapid convergence of $KF - A$ and $KF - {\hat{A}}_{dtw}$ under ideal conditions (i.e., no noise), making the Kalman filter more sensitive to the degradation of the transition matrix model accuracy.

If we now consider the relative performance of the algorithms, we note that the Kalman filter employing “ideal” analytical transition matrix Algorithm $KF - A$ showcases a substantial improvement which is, on average, approximately $- 15$ dB compared to the reference Kalman filter, Algorithm $KF - α$ . Importantly, similar improvement extends to the case where the analytical transition matrix approximated using DTW is used—Algorithm $KF - {\hat{A}}_{dtw}$ . Moreover, the advantage of employing a state-space model over a linear interpolation method becomes evident from the comparatively poor overall results of Algorithm $LI - A$ . The observed enhancements in accuracy support our preliminary hypothesis: the combination of room acoustic model-based interpolation and a data-driven state-space model provides a more accurate approach to early RIR estimation.

6.2 Result 2: noise sensitivity

The results of Experiment 2 are presented in Figure 11 and once again exhibit error curve shapes consistent with those observed in Experiment 1. However, as anticipated, overall performance is worse, except for Algorithm $LI - A$ , which remains unaffected since it does not rely on the noisy microphone measurements. Additionally, we observe slower convergence and less pronounced dips in the error curves of the algorithms $KF - A$ and $KF - {\hat{A}}_{dtw}$ compared to the noiseless case. This is likely because the less reliable observations prevent convergence to such low error levels in the first place, making the dips less noticeable.

Figure 11

Figure 11. Result 2: Normalized misalignment along the trajectory between the “ground-truth” RIR $h (l)$ and the estimated RIR $\hat{h} (l)$ using algorithms detailed in Table 1. This assessment is conducted for different levels of noise added to the observed output signal $y_{Ω} (l)$ .

It is evident that even under poor SNR conditions, our methods consistently outperform both reference algorithms. Notably, at an SNR of $- 6$ dB, an average of approximately $- 5$ dB difference persists, further underscoring the advantages of the proposed approach, even when measurements contain high levels of noise.

6.3 Result 3: spatial sampling effects

The results of Experiment 3 are presented in Figure 12. As fewer measurements are used, the disparity between the results obtained using the linear interpolation method and those obtained using a Kalman filter naturally diminishes. Additionally, the performance of the reference Kalman filter $(KF - α)$ deteriorates more rapidly than that of our proposed methods ( $KF - A$ and $KF - {\hat{A}}_{dtw}$ ). We further note an apparent shift in the position of the dip in the error curve of Algorithms $KF - A$ and $KF - {\hat{A}}_{dtw}$ as fewer samples are used. This is likely a consequence of the change in how often the state equation is propagated.

Figure 12

Figure 12. Result 3: Normalized misalignment along the trajectory between the “ground-truth” RIR $h (l)$ and the estimated RIR $\hat{h} (l)$ using algorithms detailed in Table 1. This is for the case of a noiseless observed output signal $y_{Ω} (l)$ with different spatial sampling along the trajectory as a function of $Ω$ .

For $Ω = 8$ , equivalent to approximately $12.5 %$ of the available sampling points, Algorithms $KF - α$ and $LI - A$ exhibit similar performance, while our proposed algorithms consistently outperform both by an average of approximately $- 12$ dB. Despite the trivial choice between the two reference methods when only a small number of measurements are available, as demonstrated by the case of $Ω = 32$ (equivalent to $3.13 %$ of sampling points), our proposed method still demonstrates an overall improvement in performance.

6.4 Result 4: second-order reflections included

The results of Experiment 4 are shown at the top of Figure 13. It is important to recall that in Experiment 4, the analytical transition matrix $A$ used in Algorithm $KF - A$ is still constructed using only first-order TOAs. This is because not all second-order reflections meet the necessary assumptions along the given trajectory, as indicated by the reflection TOA labels toward the end of the RIR at the bottom of Figure 13—for example, the interval over which reflection 6 moves overlaps with reflections 7, 8, and 9. However, by adding ones along the diagonal of $A$ in any empty rows, we ensure that second-order reflections are not entirely discarded. This allows us to leverage prior knowledge, even if incomplete, and explains the strong convergence with Algorithm $KF - α$ toward the end of the trajectory. That being said, Algorithm $KF - A$ still demonstrates an average improvement of approximately $- 7$ dB over Algorithm $KF - α$ .

Figure 13

Figure 13. (Top) Result 4: Normalized misalignment along the trajectory between the “ground-truth” RIR $h (l)$ and the estimated RIR $\hat{h} (l)$ using algorithms detailed in Table 1. This is for the case of a noiseless observed output signal $y_{Ω} (l)$ with simulated RIRs inclusive of second-order reflections. (Bottom) Simulated RIRs corresponding to the start and end of the trajectory, denoted as $h (0)$ and $h (L - 1)$ , respectively. Peaks corresponding to the same source or reflection are labeled by the same number in each RIR.

It is important to reiterate that in estimating the matrix ${\hat{A}}_{dtw}$ , we directly utilize the RIRs $h (0)$ and $h (L - 1)$ , thus implicitly incorporating second-order reflection information. However, in this case, not all second-order reflections adhere to the assumption that the time intervals’ individual reflections occupy over the trajectory do not overlap. Given that DTW inherently maps sequences in a monotonic manner, assuming consistent order among significant points like peaks, certain reflection peaks might be incorrectly mapped between RIRs. Consequently, the points along the trajectory where second-order reflections in the ground-truth RIRs begin to overlap likely correspond to the jumps observed in the error curve of Algorithm $KF - {\hat{A}}_{dtw}$ . Nonetheless, this serves as a robust test demonstrating that despite deviations from certain assumptions regarding all reflections, the proposed estimated matrix still yields an overall enhancement over Algorithm $KF - α$ .

7 Conclusion

This paper investigates the estimation of early segments of time-varying RIRs through a state-space model incorporating the ISM within the state transition matrix. Simulation results indicate that this approach outperforms both RIR interpolation and a purely data-driven state-space model using a transition factor. Moreover, a practical method for estimating such a matrix has been proposed and has a similar performance to the “ideal” analytical transition matrix derived. It is important to acknowledge that certain assumptions inherent to our method limit its application within specific areas of a room. This necessitates further research in order to improve the robustness of the approach, potentially through adaptive estimation of the transition matrix. Furthermore, experimental validation using real measurement data is required to assess the effectiveness of the proposed approach in real-world scenarios.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author contributions

KM: conceptualization, formal analysis, investigation, methodology, validation, writing–original draft, and writing–review and editing. TD: conceptualization, methodology, supervision, writing–original draft, and writing–review and editing. RA: conceptualization, supervision, writing–original draft, and writing–review and editing. TV: conceptualization, funding acquisition, supervision, and writing–review and editing.

Funding

The authors declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by

$•$ The European Research Council under the European Union’s Horizon 2020 research and innovation program/ERC Consolidator Grant: SONORA (no. 773268). This paper reflects only the authors’ views, and the EU is not liable for any use that may be made of the contained information.

$•$ KU Leuven Internal Funds C14/21/075 “A holistic approach to the design of integrated and distributed digital signal processing algorithms for audio and speech communication devices”

$•$ FWO Research Project: “The Boundary Element Method as a State-Space Realization Problem” (G0A0424N)

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frsip.2024.1426082/full#supplementary-material

Footnotes

¹For notational convenience when working with an interval $[a, b]$ , we introduce $\frac{[a, b]}{c}$ to represent $[\frac{a}{c}, \frac{b}{c}]$ (see Equations 24 and 34), and $[a, b] - c$ to represent $[a - c, b - c]$ (Equation 38).

²In the special case that $τ_{r} (l) / T_{s}$ and $τ_{r} (l - 1) / T_{s}$ are integers, the sinc-functions are sampled symmetrically around their peak, which implies that all but the peak sample lie at the zero crossings of the sinc-functions, resulting in diagonal submatrices.

³Note that if $Δ_{r} < 2 ϵ$ , we find that $T_{r} = ⋃_{l = 1}^{L - 1} T_{r} (l)$ .

⁴In this paper, matrices are indexed from (1,1), following the convention of many programming languages. However, our RIR sequences are indexed from 0. To align with the zero-based indexing of our RIRs and accommodate the additional row and column used solely for initialization, we index the entries of the accumulated cost matrix as $D_{n + 2, n^{'} + 2}$

References

Ajdler, T., Sbaiz, L., and Vetterli, M. (2007). Dynamic measurement of room impulse responses using a moving microphone. J. Acoust. Soc. Amer. (JASA) 122, 1636–1645. doi:10.1121/1.2766776

PubMed Abstract | CrossRef Full Text | Google Scholar

Allen, J. B., and Berkley, D. A. (1979). Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Amer. (JASA) 65, 943–950. doi:10.1121/1.382599

CrossRef Full Text | Google Scholar

Antonello, N., De Sena, E., Moonen, M., Naylor, P. A., and Van Waterschoot, T. (2017). Room impulse response interpolation using a sparse spatio-temporal representation of the sound field. IEEE/ACM Trans. Audio Speech Lang. Process. 25, 1929–1941. doi:10.1109/taslp.2017.2730284

CrossRef Full Text | Google Scholar

Antweiler, C., and Symanzik, H. G. (1995). “Simulation of time variant room impulse responses,” in Proc. 1995 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’95), Detroit, MI, May 08-12, 1995, Vol. 5, 3031–3034. doi:10.1109/icassp.1995.479484

CrossRef Full Text | Google Scholar

Antweiler, C., Telle, A., Vary, P., and Enzner, G. (2012). “Perfect-sweep NLMS for time-variant acoustic system identification,” in Proc. 2012 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’12), Kyoto, Japan, 25-30 March 2012, 517–520. doi:10.1109/ICASSP.2012.6287930

CrossRef Full Text | Google Scholar

Borra, F., Gebru, I. D., and Markovic, D. (2019). “Soundfield reconstruction in reverberant environments using higher-order microphones and impulse response measurements,” in Proc. 2021 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’19), Brighton, Great Britain, 12-17 May 2019, 281–285. doi:10.1109/ICASSP.2019.8682961

CrossRef Full Text | Google Scholar

Bruschi, V., Nobili, S., Cecchi, S., and Piazza, F. (2020). “An innovative method for binaural room impulse responses interpolation,” in Audio Engineering Society Convention 148 (AES148Conv).

Google Scholar

Carty, B. (2010). “Movements in binaural space: issues in HRTF interpolation and reverberation, with applications to computer music,”. Ph.D. thesis (Maynooth, Ireland: National University of Ireland).

Google Scholar

Cherniakov, M. (2003). An introduction to parametric digital filters and oscillators. John Wiley & Sons.

Google Scholar

Crocco, M., and Bue, A. D. (2015). “Room impulse response estimation by iterative weighted l1-norm,” in Proc. 23rd European Signal Process. Conf. (EUSIPCO ’15), Nice, France, August 31-September 4, 2015, 1895–1899.

Google Scholar

Das, O., Calamia, P., and Gari, S. V. A. (2021). “Room impulse response interpolation from a sparse set of measurements using a modal architecture,” in Proc. 2021 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’21), Toronto, Canada, 6-11 June 2021, 960–964. doi:10.1109/ICASSP39728.2021.9414399

CrossRef Full Text | Google Scholar

Elko, G. W., Diethorn, E., and Gänsler, T. (2003). “Room impulse response variation due to temperature fluctuations and its impact on acoustic echo cancellation,” in Proc. 2003 Int. Workshop Acoustic Echo Noise Control (IWAENC ’03), Kyoto, Japan, September 8-11, 2003, 67–70.

Google Scholar

Enzner, G. (2008). “Analysis and optimal control of lms-type adaptive filtering for continuous-azimuth acquisition of head related impulse responses,” in Proc. 2008 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’08), Las Vegas, Nevada, 31 March-4 April 2008, 393–396. doi:10.1109/ICASSP.2008.4517629

CrossRef Full Text | Google Scholar

Enzner, G. (2010). “Bayesian inference model for applications of time-varying acoustic system identification,” in Proc. 18th European Signal Process. Conf. (EUSIPCO ’10), Aalborg, Denmark, August 23-27, 2010, 2126–2130.

Google Scholar

Evers, C., Löllmann, H. W., Mellmann, H., Schmidt, A., Barfuss, H., Naylor, P. A., et al. (2020). The locata challenge: Acoustic source localization and tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1620–1643. doi:10.1109/taslp.2020.2990485

CrossRef Full Text | Google Scholar

Garcia-Gomez, V., and Lopez, J. J. (2018). “Binaural room impulse responses interpolation for multimedia real-time applications,” in AES 144th Convention (AES148Conv), Milan, Italy, May 23-26, 2018.

Google Scholar

Geldert, A., Meyer-Kahlen, N., and Schlecht, S. J. (2023). “Interpolation of spatial room impulse responses using partial optimal transport,” in Proc. 2023 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’23), Rhodes Island, Greece, June 4-June 10, 2023, 1–5. doi:10.1109/ICASSP49357.2023.10095452

CrossRef Full Text | Google Scholar

Hahmann, M., and Fernandez-Grande, E. (2022). A convolutional plane wave model for sound field reconstruction. J. Acoust. Soc. Amer. (JASA) 152 (5), 3059–3068. doi:10.1121/10.0015227

CrossRef Full Text | Google Scholar

Hahn, N., and Spors, S. (2015). “Continuous measurement of impulse responses on a circle using a uniformly moving microphone,” in Proc. 23rd European Signal Process. Conf. (EUSIPCO ’15), Nice, France, August 31-September 4, 2015, 2536–2540. doi:10.1109/EUSIPCO.2015.7362842

CrossRef Full Text | Google Scholar

Haneda, Y., Kaneda, Y., and Kitawaki, N. (1999). Common-acoustical-pole and residue model and its application to spatial interpolation and extrapolation of a room transfer function. IEEE Trans. Speech, Audio Process 7, 709–717. doi:10.1109/89.799696

CrossRef Full Text | Google Scholar

Hoskins, R. F. (2009). Delta functions: Introduction to generalised functions. Horwood Publishing.

Google Scholar

Karakonstantis, X., Caviedes-Nozal, D., Richard, A., and Fernandez-Grande, E. (2024). Room impulse response reconstruction with physics-informed deep learning. J. Acoust. Soc. Amer. (JASA) 155 (2), 1048–1059. doi:10.1121/10.0024750

CrossRef Full Text | Google Scholar

Katzberg, F., Mazur, R., Maass, M., Böhme, M., and Mertins, A. (2018). “Spatial interpolation of room impulse responses using compressed sensing,” in Proc. 2018 Int. Workshop Acoustic Signal Enhancement (IWAENC ’18), Tokyo, Japan, September 17–20, 2018, 426–430. doi:10.1109/IWAENC.2018.8521390

CrossRef Full Text | Google Scholar

Kearney, G., Masterson, C., Adams, S., and Boland, F. (2009). “Dynamic time warping for acoustic response interpolation: Possibilities and limitations,” in Proc. 17th European Signal Process. Conf. (EUSIPCO ’09), Glasgow, Scotland, 24-28 August 2009, 705–709.

Google Scholar

Kuhl, S., Nagel, S., Kabzinski, T., Antweiler, C., and Jax, P. (2018). “Tracking of time-variant linear systems: Influence of group delay for different excitation signals,” in Proc. 2018 Int. Workshop Acoustic Signal Enhancement (IWAENC ’18), Tokyo, Japan, September 17–20, 2018, 131–135. doi:10.1109/IWAENC.2018.8521372

CrossRef Full Text | Google Scholar

Lin, Y., and Lee, D. (2006). Bayesian regularization and nonnegative deconvolution for room impulse response estimation. IEEE Trans. Signal Process 54, 839–847. doi:10.1109/tsp.2005.863030

CrossRef Full Text | Google Scholar

Masterson, C., Kearney, G., and Boland, F. (2009). “Acoustic impulse response interpolation for multichannel systems using dynamic time warping,” in Proc. AES 35th Int. Conf. Audio for Games, London, United Kingdome, February 11-13, 2009.

Google Scholar

Mignot, R., Daudet, L., and Ollivier, F. (2013). Room reverberation reconstruction: Interpolation of the early part using compressed sensing. IEEE Trans. Audio Speech Lang. Process. 21, 2301–2312. doi:10.1109/tasl.2013.2273662

CrossRef Full Text | Google Scholar

Müller, M. (2007). “Dynamic time warping,” in Information retrieval for music and motion. Berlin, German: Springer, 69–84.

Google Scholar

Naylor, P. A., and Gaubitch, N. D. (2010). Speech dereverberation. Vol. 2. Berlin, Germany: Springer.

Google Scholar

Nophut, M., Preihs, S., and Peissig, J. (2024). Velocity-controlled Kalman filter for an improved echo cancellation with continuously moving microphones. J. Audio Eng. Soc. (JAES) 72, 33–43. doi:10.17743/jaes.2022.0116

CrossRef Full Text | Google Scholar

Pezzoli, M., Perini, D., Bernardini, A., Borra, F., Antonacci, F., and Sarti, A. (2022). Deep prior approach for room impulse response reconstruction. Sensors 22, 2710. doi:10.3390/s22072710

PubMed Abstract | CrossRef Full Text | Google Scholar

Ratnarajah, A., Ananthabhotla, I., Ithapu, V. K., Hoffmann, P. F., Manocha, D., and Calamia, P. T. (2022). “Towards improved room impulse response estimation for speech recognition,” in Proc. 2023 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP ’23), Rhodes Island, Greece, 4-10 June 2023, 1–5. doi:10.1109/ICASSP49357.2023.10094770

CrossRef Full Text | Google Scholar

Schissler, C., Stirling, P., and Mehra, R. (2017). “Efficient construction of the spatial room impulse response,” in 2017 IEEE Virtual Reality (VR), Los Angeles, CA, 18-22 March 2017, 122–130. doi:10.1109/VR.2017.7892239

CrossRef Full Text | Google Scholar

Simon, D. (2006). Optimal state estimation: Kalman, H infinity, and nonlinear approaches. Hoboken, NJ: John Wiley & Sons.

Google Scholar

Stan, G.-B., Embrechts, J.-J., and Archambeau, D. (2002). Comparison of different impulse response measurement techniques. J. Audio Eng. Soc. (JAES) 50, 249–262.

Google Scholar

Szöke, I., Skácel, M., Mošner, L., Paliesek, J., and Černocký, J. H. (2018). Building and evaluation of a real room impulse response dataset. IEEE J. Sel. Topics Signal Process. (JSTSP) 13, 863–876. doi:10.1109/JSTSP.2019.2917582

CrossRef Full Text | Google Scholar

Zhao, J., Zheng, X., Ritz, C., and Jang, D. (2022). Interpolating the directional room impulse response for dynamic spatial audio reproduction. Appl. Sci. 12, 2061. doi:10.3390/app12042061

CrossRef Full Text | Google Scholar

Keywords: state-space model, transition matrix, acoustic room impulse response interpolation, time-varying system, dynamic time warping

Citation: MacWilliam K, Dietzen T, Ali R and van Waterschoot T (2024) State-space estimation of spatially dynamic room impulse responses using a room acoustic model-based prior. Front. Sig. Proc. 4:1426082. doi: 10.3389/frsip.2024.1426082

Received: 30 April 2024; Accepted: 01 August 2024;
Published: 02 September 2024.

Edited by:

David V. Anderson, Georgia Institute of Technology, United States

Reviewed by:

Jelena Ćertić, University of Belgrade, Serbia
Victor Lazzarini, Maynooth University, Ireland

Copyright © 2024 MacWilliam, Dietzen, Ali and van Waterschoot. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kathleen MacWilliam, a2F0aGxlZW4ubWFjd2lsbGlhbUBlc2F0Lmt1bGV1dmVuLmJl

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

State-space estimation of spatially dynamic room impulse responses using a room acoustic model-based prior

1 Introduction

2 Signal model, problem statement, and related state of the art

2.1 Signal model and problem statement

2.2 Related state of the art

3 Proposed RIR state-space model

3.1 Update equations of the Kalman filter

4 Proposed room acoustic model-based transition matrix

4.1 Analytical location-variant transition matrix model

4.2 Analytical location-invariant transition matrix model

4.3 Dynamic Time Warping transition matrices

5 Simulations

5.1 Acoustic environment

5.2 Kalman filter parameters

5.3 Performance measure

5.4 Experiments

6 Results

6.1 Result 1: ideal case

6.2 Result 2: noise sensitivity

6.3 Result 3: spatial sampling effects

6.4 Result 4: second-order reflections included

7 Conclusion

Data availability statement

Author contributions

Funding

Conflict of interest

Publisher’s note

Supplementary material

Footnotes

References

94% of researchers rate our articles as excellent or good

94% of researchers rate our articles as excellent or good