Information geometry of Markov Kernels: a survey

Wolfer, Geoffrey; Watanabe, Shun

doi:10.3389/fphy.2023.1195562

REVIEW article

Front. Phys., 27 July 2023

Sec. Statistical and Computational Physics

Volume 11 - 2023 | https://doi.org/10.3389/fphy.2023.1195562

This article is part of the Research Topic Advances in Information Geometry: Beyond the Conventional Approach

Geoffrey Wolfer¹*

Shun Watanabe²

¹RIKEN, Center for AI Project, Tokyo, Japan
²Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Tokyo, Japan

Information geometry and Markov chains are two powerful tools used in modern fields such as finance, physics, computer science, and epidemiology. In this survey, we explore their intersection, focusing on the theoretical framework. We attempt to provide a self-contained treatment of the foundations without requiring a solid background in differential geometry. We present the core concepts of information geometry of Markov chains, including information projections and the pivotal information geometric construction of Nagaoka. We then delve into recent advances in the field, such as geometric structures arising from time reversibility, lumpability of Markov chains, or tree models. Finally, we highlight practical applications of this framework, such as parameter estimation, hypothesis testing, large deviation theory, and the maximum entropy principle.

1 Introduction

Markov chains are stochastic models that describe the probabilistic evolution of a system over time and have been successfully used in a wide variety of fields, including physics, engineering, and computer science. Information geometry, in turn, is a mathematical framework that provides a geometric interpretation of probability distributions and their properties, with applications in diverse areas such as statistics, machine learning, and neuroscience. By combining the insights and methods from both fields, researchers have, in recent years, developed novel approaches for analyzing and modeling systems with time dependencies.

1.1 Outline and scope

As the fields of information geometry and Markov chains are broad, it is not possible to review all topics exhaustively, and we had to confine the scope of our survey to certain basic topics. Our focus will be on time-discrete, time-homogeneous Markov chains that take values from a finite alphabet. In particular, we will not cover time-continuous Markov chains [1, 2] nor discuss quantum information geometry or hidden Markov models [3, 4]. Our introduction to information geometry in the distribution setting will be limited to the basics. For a more comprehensive treatment, we recommend referring to the monographs [5, 6].

This survey is structured into five sections.

Section 1 is a brief introduction that provides an outline, lists the main concepts and results found in this survey, and clarifies its scope.

In Section 2, we lay out the notation that will be used throughout this paper and provide a primer on irreducible Markov chains and information geometry in the context of distributions. Along the way, we recall how to extend notions of entropy and Kullback–Leibler (KL) divergence from distributions to Markov chains.

In Section 3, following Nagaoka [7], we introduce a Fisher metric and a pair of dual affine connections on the set of irreducible stochastic matrices, which allows us to define the orthogonality of curves and parallel transport. We then proceed to define exponential families (e-families) and mixture families (m-families) of Markov chains. Importantly, the set of irreducible stochastic matrices is shown to form both an e-family and m-family, endowing it with the structure of a dually flat manifold. We explore minimality conditions for exponential families and chart transition maps between their natural and expectation parameters. Additionally, we define geodesics and their generalizations and conclude the section with a discussion on information projections and decomposition theorems. Specifically, similar to the distribution setting, the dual affine connections induce two notions of convexity, leading to Pythagorean identities.

In Section 4, we explore some recent developments in the field. First, we list and analyze the geometric properties of important subfamilies of stochastic matrices, such as symmetric or bistochastic Markov chains. The highlights of this section include the analysis of geometric properties induced by the time reversibility of Markov chains. This analysis leads to the establishment of the em-family structure of the reversible set, the derivation of closed-form expressions for reversible information projections, and the characterization of the reversible set as geodesic hulls of contained families. We continue this section by discussing some notable advancements in the context of data processing of Markov chains. Mirroring congruent embeddings in a distribution setting, we present a construction of embeddings of families of stochastic matrices that are congruent with respect to the lumping operation of Markov chains. These embeddings preserve the Fisher metric, the pair of dual affine connections, and the e-family structure. Additionally, we explore the establishment of a foliation structure on the manifold of lumpable stochastic matrices. Lastly, we conclude this section by presenting results in the context of tree models.

Section 5 is devoted to applications of the information geometry framework to large deviations, estimation theory, hypothesis testing, and the maximum entropy principle.

2 Preliminaries

2.1 Notation

Let $X$ be a finite space of symbols. All vectors will be written as row vectors. A vector $v \in R^{X}$ is non-negative (resp., positive), indicated by v ≥ 0 (resp., v > 0), when v(x) ≥ 0 (resp., v(x) > 0) for any $x \in X$ . For $x \in X$ , the vector $e_{x} \in R^{X}$ is defined by $e_{x} (x^{'}) = δ [x = x^{'}]$ for $x^{'} \in X$ , where $δ [\cdot]$ is the function that takes the value 1 when the predicate in the argument is true and 0 otherwise. For two vectors $u, v \in R^{X}$ , the Hadamard product of u and v is defined by (u∘v)(x) = u(x)v(x), and we will also use the shorthand (u/v)(x) = u(x)/v(x). For convenience, for k vectors u₁, …, u_k, we write $\bigcirc_{i = 1}^{k} u_{i} = u_{1} \circ u_{2} \circ \dots \circ u_{k}$ , and for a vector u and a positive real number α, u^∘α is such that u^∘α(x) = u(x)^α. For p > 0, we write ${‖v‖}_{p} = {(\sum_{x \in X} {|v (x)|}^{p})}^{1 / p}$ . We denote by $P (X)$ the set of all distributions over $X$ ,

P (X) ≜ \{μ \in R^{X} : μ \geq 0, {‖μ‖}_{1} = 1\},

and $P_{+} (X) \subset P (X)$ refers to the positive subset. X ∼ μ means that the random variable X is distributed according to a distribution $μ \in P (X)$ , and for $μ, ν \in P (X)$ , the absolute continuity of ν with respect to μ is denoted by ν ≪ μ.

2.2 Irreducible Markov chains

A time-discrete, time-homogeneous Markov chain is a random process $X = {\{X_{t}\}}_{t \in N}$ that takes values on the state space $X$ and satisfies the Markov property. Namely, for t ≥ 2 and for any $x_{1}, \dots, x_{t}, x_{t + 1} \in X$ ,

P_{μ} (X_{t + 1} = x_{t + 1} | X_{t} = x_{t}, \dots, X_{1} = x_{1}) = P_{μ} (X_{t + 1} = x_{t + 1} | X_{t} = x_{t}),

with $P_{μ} (X_{1} = x_{1}) = μ (x_{1})$ for an initial distribution $μ \in P (X)$ . The transition probabilities of the process can be organized in a row-stochastic matrix P, where $P (x, x^{'}) = P_{μ} (X_{t + 1} = x^{'} | X_{t} = x)$ . We write X ∼ (μ, P) for the Markov chain started from μ and with transition matrix P. We let $F (X) ≜ R^{X^{2}}$ denote the vector space whose elements can be conveniently represented by real square matrices of size $|X|$ , simultaneously understood as linear operators on $R^{X}$ . We introduce the set of all row-stochastic matrices over the space $X$ ,

W (X) ≜ \{P \in F (X) : \forall x \in X, e_{x} P \in P (X)\} . (1)

As we assume $|X| < \infty$ , for any member P of $W (X)$ , there exists a fixed point $π \in P (X)$ such that πP = π, and we call π a stationary distribution for P. Let $E \subset X^{2}$ define the set of positive probability transitions on the state space. When $(X, E)$ is a strongly connected digraph, we say that P is irreducible. Algebraically, this means that for any pair of states $x, x^{'} \in X$ , there exists $p \in N$ such that P^p(x, x′) > 0, or less tersely, there exists a path on the graph $(X, E)$ from x to x′. When P defines an irreducible Markov chain, the stationary distribution π is unique and positive. Moreover, when the initial distribution μ = π, we say that the chain is stationary, write $P_{π} (\cdot)$ for probability statements over a stationary trajectory and X ∼ P as a shorthand for X ∼ (π, P). We denote the irreducible set:

W (X, E) ≜ \{P \in W (X) : P is irreducible over (X, E)\} .
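As a minimal numerical sketch of the two notions above (not taken from the survey), the following NumPy snippet tests irreducibility through strong connectivity of the support digraph and computes the stationary distribution as a left eigenvector. The helper names `is_irreducible` and `stationary`, as well as the example matrix, are illustrative assumptions.

```python
import numpy as np

def is_irreducible(P):
    """True when the digraph of positive transitions is strongly connected.

    Uses the classical criterion (I + A)^(m-1) > 0 entry-wise, where A is
    the adjacency matrix of the support of P and m the number of states.
    """
    m = P.shape[0]
    A = (P > 0).astype(float)
    reach = np.linalg.matrix_power(np.eye(m) + A, m - 1)
    return bool(np.all(reach > 0))

def stationary(P):
    """Stationary distribution pi (row vector) solving pi P = pi."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmax(eigvals.real)      # eigenvalue 1 has the largest real part
    pi = eigvecs[:, k].real
    return pi / pi.sum()             # fixes the sign and normalizes

# Hypothetical three-state chain with some forbidden transitions.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.0, 0.7]])
pi = stationary(P)
print(is_irreducible(P), pi)
```

The eigenvector route is chosen for brevity; solving the linear system πP = π with a normalization constraint would serve equally well.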

It will also be convenient to define $F (X, E)$ , the real functions over $E$ , and identify this set with all functions over $X^{2}$ that are null outside of $E$ . Note that $F (X, E)$ can be endowed with the structure of a $|E|$ -dimensional vector space. We write $F_{+} (X, E) \subset F (X, E)$ for the positive subset. For $n \in N$ , the probability of observing a stationary path x₁ → x₂ → ⋯ → x_n induced from a π-stationary P is given by

Q^{(n)} (x_{1}, x_{2}, \dots, x_{n}) ≜ P_{π} (X_{1} = x_{1}, \dots, X_{n} = x_{n}) = π (x_{1}) \prod_{t = 1}^{n - 1} P (x_{t}, x_{t + 1}) . (2)

In particular,

Q ≜ Q^{(2)} \in P (X^{2})

is called the edge measure pertaining to P. Observe that the map from an irreducible transition matrix P to its edge measure is one-to-one (see, e.g., [8]) and that the set of all edge measures $Q (X, E)$ can be expressed as

Q (X, E) = \{Q \in P (X^{2}) \cap F_{+} (X, E) : \sum_{x^{'} \in X} Q (x, x^{'}) = \sum_{x^{'} \in X} Q (x^{'}, x), \forall x \in X\} . (3)
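The correspondence between P and its edge measure can be illustrated numerically. The sketch below, with hypothetical helper names `edge_measure` and `transition_matrix`, builds Q(x, x′) = π(x)P(x, x′), checks the marginal constraint in (3), and recovers P by row normalization.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible row-stochastic matrix."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    return pi / pi.sum()

def edge_measure(P):
    """Edge measure Q(x, x') = pi(x) P(x, x')."""
    return stationary(P)[:, None] * P

def transition_matrix(Q):
    """Inverse map: P(x, x') = Q(x, x') / sum_y Q(x, y)."""
    return Q / Q.sum(axis=1, keepdims=True)

# Hypothetical irreducible chain on three states.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.0, 0.7]])
Q = edge_measure(P)
# Both marginals of Q coincide with the stationary distribution.
print(Q.sum(), Q.sum(axis=1), Q.sum(axis=0))
```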

We refer the reader to Levin et al. [9] for a thorough treatment of Markov chains.

2.3 Entropy and divergence rates for Markov chains

Let us first recall the definition of the Shannon entropy of a random variable. We let $μ \in P (X)$ and X ∼ μ. The entropy H of the random variable X, which measures the average level of surprise inherent to the possible outcomes, is defined by

H (X) = - \sum_{x \in X} μ (x) \log μ (x),

where, by convention, 0 log 0 = 0. The entropy rate of a stationary stochastic process $X = {(X_{t})}_{t \in N}$ corresponds to the number of bits needed to describe one random variable of the process, averaged over time. Namely,

H (X) ≜ \lim_{n \to \infty} \frac{1}{n} H (X_{1}, X_{2}, \dots, X_{n}), (4)

where for any $n \in N$ , H(X₁, X₂, …, X_n) is the joint entropy of the random variables X₁, X₂, …, X_n. Particularly, when X forms an irreducible Markov chain with transition matrix $P \in W (X, E)$ and stationary distribution π, the entropy rate can be written as

H (X) = - \sum_{(x, x^{'}) \in E} Q (x, x^{'}) \log P (x, x^{'}),

where Q is the edge measure pertaining to P. In other words, the entropy rate of the process is computed from P only. We can thus overload H to define

\begin{aligned} H : W (X, E) & \to R_{+} \\ P & \mapsto H (P) = H (X), for X \sim P . \end{aligned}
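The closed-form expression for the entropy rate lends itself to a direct computation. The following sketch assumes natural logarithms (nats rather than bits) and uses a hypothetical helper `entropy_rate`; for a two-state chain it can be compared with the stationary average of the row entropies.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible row-stochastic matrix."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    return pi / pi.sum()

def entropy_rate(P):
    """H(P) = -sum over E of Q(x, x') log P(x, x'), in nats."""
    Q = stationary(P)[:, None] * P
    E = P > 0                         # restrict the sum to the support E
    return float(-np.sum(Q[E] * np.log(P[E])))

# Hypothetical two-state chain with flip probabilities a and b.
a, b = 0.2, 0.4
P = np.array([[1 - a, a], [b, 1 - b]])
H = entropy_rate(P)
print(H)
```

When all rows of P equal a common distribution μ, the chain is i.i.d. and the same function returns the Shannon entropy of μ.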

For two random variables X ∼ μ, X′ ∼ μ′ with $μ, μ^{'} \in P (X)$ , we define the Kullback–Leibler divergence from X′ to X by

D (X ‖ X^{'}) ≜ \begin{cases} \sum_{x \in X} μ (x) \log \frac{μ (x)}{μ^{'} (x)} & when μ ≪ μ^{'}, \\ \infty & otherwise . \end{cases} (5)

Extending the aforementioned definition to Markov processes, the information divergence rate [10] (see also [73, Section 3.5]) of $X \sim P \in W (X, E)$ , from another chain $X^{'} \sim P^{'} \in W (X, E^{'})$ , is given by

\begin{aligned} D (X ‖ X^{'}) & = \lim_{n \to \infty} \frac{1}{n} D (X_{1}, X_{2}, \dots, X_{n} ‖ X_{1}^{'}, X_{2}^{'}, \dots, X_{n}^{'}) \\ & = \begin{cases} \sum_{(x, x^{'}) \in E} Q (x, x^{'}) \log \frac{P (x, x^{'})}{P^{'} (x, x^{'})} & when E \subset E^{'}, \\ \infty & otherwise, \end{cases} \end{aligned}

which is also agnostic on initial distributions, inviting us to lift the definition of D to stochastic matrices:

\begin{aligned} D : W (X, E) \times W (X, E^{'}) & \to R_{+} \cup \{\infty\} \\ P, P^{'} & \mapsto D (X ‖ X^{'}) for X \sim P and X^{'} \sim P^{'} . \end{aligned} (6)
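A short sketch of the divergence rate (6) follows, again in nats and with a hypothetical function name `divergence_rate`. It returns infinity when the support of P is not contained in that of P′, and it illustrates the asymmetry of D in its arguments.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible row-stochastic matrix."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    return pi / pi.sum()

def divergence_rate(P, P2):
    """D(P || P') = sum_E Q log(P / P') when E is contained in E', else inf."""
    E, E2 = P > 0, P2 > 0
    if np.any(E & ~E2):               # some transition of P is forbidden in P'
        return np.inf
    Q = stationary(P)[:, None] * P
    return float(np.sum(Q[E] * np.log(P[E] / P2[E])))

# Hypothetical chains: P and P2 have full support, P3 forbids 0 -> 0.
P = np.array([[0.9, 0.1], [0.4, 0.6]])
P2 = np.array([[0.5, 0.5], [0.5, 0.5]])
P3 = np.array([[0.0, 1.0], [0.5, 0.5]])
print(divergence_rate(P, P2), divergence_rate(P2, P))
print(divergence_rate(P, P3), divergence_rate(P3, P))
```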

2.4 Information geometry

We briefly introduce basic concepts related to information geometry in the context of distributions. The central idea is to regard $P_{+} (X)$ as a $(|X| - 1)$ -dimensional smooth manifold and statistical models, i.e., parametric families of distributions $M = {\{μ_{θ}\}}_{θ \in Θ \subset R^{d}}$ , as smooth submanifolds of $P_{+} (X)$ . At each point $μ \in P_{+} (X)$ , we define a (0,2)-tensor,

\begin{aligned} g_{μ} : T_{μ} P_{+} (X) \times T_{μ} P_{+} (X) \to R \\ U_{μ}, V_{μ} \mapsto g_{μ} (U_{μ}, V_{μ}) = \sum_{x \in X} μ (x) (U_{μ} \log μ (x)) (V_{μ} \log μ (x)), \end{aligned}

where $T_{μ} P_{+} (X)$ is the tangent plane at the point μ, and U_μ log μ(x) is the directional derivative of the $C^{\infty} (P_{+} (X))$ function, μ↦ log μ(x) with respect to the tangent vector U_μ. This leads to the definition of a Riemannian metric, termed Fisher metric [5, Section 2.2]:

\begin{aligned} g : Γ (T P_{+} (X)) \times Γ (T P_{+} (X)) \to C^{\infty} (P_{+} (X)) \\ U, V \mapsto g (U, V) : P_{+} (X) \to R, μ \mapsto g (U, V) (μ) = g_{μ} (U_{μ}, V_{μ}), \end{aligned}

where $Γ (T P_{+} (X))$ is the set of all vector fields [5, Section 1.3] and $C^{\infty} (P_{+} (X))$ the set of all smooth real functions on $P_{+} (X)$ . Letting $θ : M \to Θ \subset R^{d}$ be a chart map¹, μ_θ denote the distribution at coordinates θ = (θ¹, …, θ^d), and ∂_i = ∂ ⋅/∂θⁱ, we write ${(\partial_{i})}_{i \in [d]}$ for the θ-induced basis of $T_{μ_{θ}} P_{+} (X)$ . We can express the Fisher metric at coordinates θ as

g_{i j} (θ) = \sum_{x \in X} μ_{θ} (x) \partial_{i} \log μ_{θ} (x) \partial_{j} \log μ_{θ} (x) .

In addition to $g$ , we define a pair of affine connections by their associated covariant derivatives [5, Chapter 1, (1.38)]:

\nabla^{(e)}, \nabla^{(m)} : Γ (T P_{+} (X)) \times Γ (T P_{+} (X)) \to Γ (T P_{+} (X)) .

In the parametrization θ, the connections are specified by their coefficients (Christoffel symbols):

\begin{aligned} Γ_{i j, k}^{(e)} (θ) & ≜ g_{μ_{θ}} (\nabla_{\partial_{i}}^{(e)} \partial_{j}, \partial_{k}) = \sum_{x \in X} \partial_{i} \partial_{j} \log μ_{θ} (x) \partial_{k} μ_{θ} (x), \\ Γ_{i j, k}^{(m)} (θ) & ≜ g_{μ_{θ}} (\nabla_{\partial_{i}}^{(m)} \partial_{j}, \partial_{k}) = \sum_{x \in X} \partial_{i} \partial_{j} μ_{θ} (x) \partial_{k} \log μ_{θ} (x), \end{aligned}

where $\nabla_{\partial_{i}}^{(e)} \partial_{j}$ is the covariant derivative of ∂_j with respect to ∂_i. The canonical divergence associated with $g, \nabla^{(e)}$ and ∇^(m) is the Kullback–Leibler divergence (5). The connections ∇^(e) and ∇^(m) are conjugate [5, Chapter 3, (3.1)] in the sense that for any vector fields $U, V, W \in Γ (T P_{+} (X))$ ,

U g (V, W) = g (\nabla_{U}^{(e)} V, W) + g (V, \nabla_{U}^{(m)} W) .

As a consequence, the curvature tensors associated with ∇^(e), ∇^(m) vanish simultaneously. In particular, they vanish for $P_{+} (X)$ , and we say that the manifold is dually flat. A complete review of the distribution setting, including exponential and mixture families, is outside the scope of this survey. We refer the reader to Amari and Nagaoka [5] for a complete treatment of the topic.

3 The dually flat manifold of irreducible stochastic matrices

Similar to the distributional setting, we regard $W (X, E)$ , the set of irreducible stochastic matrices over some prescribed strongly connected digraph $(X, E)$ , as a smooth manifold, on which we introduce a Riemannian metric together with a dually flat structure (Section 3.1). In turn, we will define exponential and mixture families of stochastic matrices. We will further examine notions of geodesic convexity and information projections.

3.1 The information manifold

Our first order of business is to establish a dually flat structure on the set of stochastic matrices, following Nagaoka [7]. A smooth manifold structure can be established on $W (X, E)$ , using the map introduced by Nagaoka [7, p.2], reported in (15). One possible construction is based on the definition of the informational divergence between two Markov processes at (6) and gives rise to a metric and dual affine connections [11, 12]. We proceed to confirm that while the structure can be defined without invoking asymptotic notions, the obtained Fisher metric and affine connections are indeed asymptotically consistent with their distributional counterparts for path measures.

3.1.1 Divergence as a general contrast function

Recall the definition of the information divergence from one stochastic matrix $P^{'} \in W (X, E^{'})$ to another $P \in W (X, E)$ given at (6). We henceforth focus on the setting where the supports are identical $E = E^{'}$ ; that is, stochastic matrices P, P′ belong to $W (X, E)$ and $D (P ‖ P^{'}) < \infty$ . We are interested in parametric families of irreducible matrices. Namely, for some open and connected parameter space $Θ \subset R^{d}$ , we define

V = \{P_{θ} : θ \in Θ\} \subset W (X, E),

and regard $V$ as a smooth submanifold of $W (X, E)$ with a global coordinate system θ. For $P, P^{'} \in V$ , for simplicity, let us write θ = (θ¹, …, θ^d) = θ(P), θ′ = θ(P′), $\partial_{i} = \partial / \partial θ^{i}$ , and $\partial_{i}^{'} = \partial / \partial θ^{' i}$ and use the shorthand $D (θ ‖ θ^{'}) = D (P_{θ} ‖ P_{θ^{'}})$ . The information divergence rate $D : V \times V \to R_{+}$ we defined in (6) is C³ and satisfies the following properties of a contrast function:

(i) $D (θ ‖ θ^{'}) \geq 0$ for any θ, θ′ ∈ Θ (non-negativity).

(ii) $D (θ ‖ θ^{'}) = 0$ if and only if θ = θ′ (identity of indiscernibles).

(iii) $\partial_{i} D (θ ‖ θ^{'}) |_{θ = θ^{'}} = \partial_{j}^{'} D (θ ‖ θ^{'}) |_{θ = θ^{'}} = 0$ for any i, j ∈ [d] (vanishing gradient on the diagonal).

(iv) $- \partial_{i} \partial_{j}^{'} D (θ ‖ θ^{'}) |_{θ = θ^{'}} = \partial_{i}^{'} \partial_{j}^{'} D (θ ‖ θ^{'}) |_{θ = θ^{'}} = \partial_{i} \partial_{j} D (θ ‖ θ^{'}) |_{θ = θ^{'}}$ is positive definite.

We call

\begin{aligned} D^{*} : V \times V & \to R \\ (P, P^{'}) & \mapsto D^{*} (θ ‖ θ^{'}) = D (θ^{'} ‖ θ), \end{aligned}

the dual divergence of D.

3.1.2 Fisher metric and dual affine connections

From any divergence function D on a manifold $V$ verifying the aforementioned properties (i), (ii), (iii), and (iv), one can construct a conjugate connection manifold:

(V, g, \nabla, \nabla^{*}),

where the Riemannian metric $g$ and Christoffel symbols of ∇ and ∇^* are expressed in the chart $θ : V \to Θ$ and for any i, j, k ∈ [d] as

\begin{aligned} g_{i j} (θ) & = g_{P_{θ}} (\partial_{i}, \partial_{j}) = - \partial_{i} \partial_{j}^{'} D (θ ‖ θ^{'}) |_{θ = θ^{'}}, \\ Γ_{i j, k} (θ) & = g_{P_{θ}} (\nabla_{\partial_{i}} \partial_{j}, \partial_{k}) = - \partial_{i} \partial_{j} \partial_{k}^{'} D (θ ‖ θ^{'}) |_{θ = θ^{'}}, \\ Γ_{i j, k}^{*} (θ) & = g_{P_{θ}} (\nabla_{\partial_{i}}^{*} \partial_{j}, \partial_{k}) = - \partial_{i}^{'} \partial_{j}^{'} \partial_{k} D (θ ‖ θ^{'}) |_{θ = θ^{'}} . \end{aligned} (7)

As the metric and connections are derived from the KL divergence, they all depend solely on the transition matrices and are, in particular, agnostic of initial distributions. A direct calculation yields the Fisher metric [7, (9)]:

\begin{aligned} g_{i j} (θ) = \sum_{(x, x^{'}) \in E} Q_{θ} (x, x^{'}) \partial_{i} \log P_{θ} (x, x^{'}) \partial_{j} \log P_{θ} (x, x^{'}), \end{aligned} (8)

and the coefficients for the pair of torsion-free affine connections ∇^(e) (e-connection) and ∇^(m) (m-connection) [7, (19, 20)]:

\begin{aligned} Γ_{i j, k}^{(e)} (θ) = \sum_{(x, x^{'}) \in E} \partial_{i} \partial_{j} \log P_{θ} (x, x^{'}) \partial_{k} Q_{θ} (x, x^{'}), \\ Γ_{i j, k}^{(m)} (θ) = \sum_{(x, x^{'}) \in E} \partial_{i} \partial_{j} Q_{θ} (x, x^{'}) \partial_{k} \log P_{θ} (x, x^{'}) . \end{aligned} (9)

On the one hand, the metric encodes notions of distance and angles on the manifold. In particular, the information divergence D locally corresponds to the Fisher metric. In other words, for $θ \in Θ \subset R^{d}$ and $δ θ \in R^{d}$ such that θ + δθ ∈ Θ,

\begin{aligned} D (θ + δ θ ‖ θ) & = \frac{1}{2} \sum_{i, j \in [d]} δ θ^{i} δ θ^{j} \partial_{i}^{'} \partial_{j}^{'} D (θ^{'} ‖ θ) |_{θ^{'} = θ} + o ({‖δ θ‖}_{2}^{2}) = \frac{1}{2} δ θ g (θ) δ θ^{⊺} + o ({‖δ θ‖}_{2}^{2}), \\ D (θ ‖ θ + δ θ) & = \frac{1}{2} \sum_{i, j \in [d]} δ θ^{i} δ θ^{j} \partial_{i}^{'} \partial_{j}^{'} D (θ ‖ θ^{'}) |_{θ^{'} = θ} + o ({‖δ θ‖}_{2}^{2}) = \frac{1}{2} δ θ g (θ) δ θ^{⊺} + o ({‖δ θ‖}_{2}^{2}) . \end{aligned}
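The local agreement between the divergence rate and the Fisher metric (8) can be checked on a toy model. The sketch below assumes a hypothetical one-parameter family of two-state chains (`family`, with the second row held fixed) and approximates the derivatives in (8) by central finite differences; neither choice comes from the survey.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible row-stochastic matrix."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    return pi / pi.sum()

def family(theta, b=0.3):
    """Hypothetical one-parameter family: only the first row depends on theta."""
    return np.array([[1.0 - theta, theta], [b, 1.0 - b]])

def fisher_metric(theta, eps=1e-6):
    """Fisher metric (8), with d/dtheta log P taken by central differences."""
    P = family(theta)
    Q = stationary(P)[:, None] * P
    dlogP = (np.log(family(theta + eps)) - np.log(family(theta - eps))) / (2 * eps)
    return float(np.sum(Q * dlogP ** 2))

def divergence_rate(P, P2):
    """Divergence rate (6) for two chains with full support."""
    Q = stationary(P)[:, None] * P
    return float(np.sum(Q * np.log(P / P2)))

theta, delta = 0.4, 1e-3
g = fisher_metric(theta)
D = divergence_rate(family(theta + delta), family(theta))
# D should be close to (1/2) g delta^2 for small delta.
print(g, D, 0.5 * g * delta ** 2)
```

For this particular family the metric has the closed form π(0)/(θ(1 − θ)) with π(0) = b/(θ + b), which gives an independent check of the finite-difference value.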

Consider two curves $γ, σ : R \to V$ , and suppose that they intersect at some point $P_{0} \in V$ , achieved without loss of generality at γ(0) and σ(0). The angle between the curves γ and σ at P₀ is measured in the tangent space at P₀ through the inner product of their velocity vectors:

g_{P_{0}} (\dot{γ} (0), \dot{σ} (0)),

and we will say that the two curves are orthogonal at P₀ when the inner product is null. On the other hand, affine connections define notions of straightness on the manifold. The fact that the connections are coupled with the metric $g$ introduces a generalization of the invariance of the inner product under the parallel translation of Euclidean geometry. Letting $Π_{γ}^{(e)}, Π_{γ}^{(m)} : T_{P} V \to T_{P^{'}} V$ denote parallel translations along a curve γ from P to P′ with respect to ∇^(e) and ∇^(m), for any $U, V \in T_{P} V$ ,

g_{P^{'}} (Π_{γ}^{(e)} (U), Π_{γ}^{(m)} (V)) = g_{P} (U, V) .

3.1.3 Asymptotic consistency with information rates

Recall from (2) that a stationary Markovian trajectory has a probability described by the path measure Q⁽ⁿ⁾. For every $n \in N$ , one can consider the manifold $Q^{(n)} \subset P (X^{n})$ of all path measures of length n. Computing the metric and connection coefficients on these manifolds and taking the normalized limit yields [7, 13]

\begin{aligned} \lim_{n \to \infty} \frac{1}{n} g_{i j}^{[n]} (θ) & = g_{i j} (θ), \\ \lim_{n \to \infty} \frac{1}{n} Γ_{i j, k}^{[n], (e)} (θ) & = Γ_{i j, k}^{(e)} (θ), \\ \lim_{n \to \infty} \frac{1}{n} Γ_{i j, k}^{[n], (m)} (θ) & = Γ_{i j, k}^{(m)} (θ), \end{aligned} (10)

where $g^{[n]}$ , $\nabla^{[n], (e)}$ , and $\nabla^{[n], (m)}$ are the Fisher metric and e/m-connections on $P (X^{n})$ , with $g^{[n]} (θ) = g_{Q_{θ}^{(n)}}$ . Therefore, the Fisher metric for stochastic matrices essentially corresponds to the time density of the average Fisher metric, and a similar interpretation can be proposed for the affine connections.
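The first limit in (10) can be observed numerically on a small example. The sketch below enumerates all stationary paths of length n for a hypothetical two-state family (`family`), computes the Fisher information of the path measure by finite differences, and compares its per-step value with the metric (8). The observation that consecutive path-level values differ by exactly g(θ) is an expectation based on the additivity of the score along a stationary path, stated here as an assumption and checked only numerically.

```python
import itertools
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible row-stochastic matrix."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    return pi / pi.sum()

def family(theta, b=0.3):
    """Hypothetical one-parameter family of two-state chains."""
    return np.array([[1.0 - theta, theta], [b, 1.0 - b]])

def path_measure(theta, n):
    """Vector of Q^(n)(x_1, ..., x_n) over all 2^n paths, following (2)."""
    P = family(theta)
    pi = stationary(P)
    probs = []
    for path in itertools.product(range(2), repeat=n):
        p = pi[path[0]]
        for s, t in zip(path[:-1], path[1:]):
            p *= P[s, t]
        probs.append(p)
    return np.array(probs)

def fisher_path(theta, n, eps=1e-6):
    """Fisher information of the path measure Q^(n) at theta."""
    q = path_measure(theta, n)
    dlog = (np.log(path_measure(theta + eps, n))
            - np.log(path_measure(theta - eps, n))) / (2 * eps)
    return float(np.sum(q * dlog ** 2))

def fisher_rate(theta, eps=1e-6):
    """Fisher metric (8) of the chain at theta."""
    P = family(theta)
    Q = stationary(P)[:, None] * P
    dlogP = (np.log(family(theta + eps)) - np.log(family(theta - eps))) / (2 * eps)
    return float(np.sum(Q * dlogP ** 2))

theta = 0.4
g = fisher_rate(theta)
for n in (2, 4, 8):
    print(n, fisher_path(theta, n) / n, g)
```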

3.2 Exponential families and mixture families

Similar to the distribution setting, we proceed to define exponential families (e-families) and mixture families (m-families) of stochastic matrices.

3.2.1 Definition of exponential families

Definition 3.1. (e-family of stochastic matrices [7]). Let $Θ = R^{d}$ . We say that the parametric family of stochastic matrices

V_{e} = \{P_{θ} : θ = (θ^{1}, \dots, θ^{d}) \in Θ\} \subset W (X, E)

is an exponential family (e-family) of stochastic matrices with natural parameter θ, when there exist functions $K, g_{1}, \dots, g_{d} \in F (X, E)$ and $R \in R^{Θ \times X}, ψ \in R^{Θ}$ , such that, for any $(x, x^{'}) \in E$ and θ ∈ Θ,

\log P_{θ} (x, x^{'}) = K (x, x^{'}) + \sum_{i = 1}^{d} θ^{i} g_{i} (x, x^{'}) + R (θ, x^{'}) - R (θ, x) - ψ (θ) . (11)

For some fixed θ ∈ Θ, we may write for convenience ψ_θ for ψ(θ) and R_θ for $R (θ, \cdot) \in R^{X}$ .

Note that R and ψ are analytic functions of θ and that ψ is a convex potential function. R and ψ are completely determined from g₁, …, g_d and K by the Perron–Frobenius (PF) theory, and we can introduce a stochastic rescaling mapping [7, 13]:

\begin{aligned} s : F_{+} (X, E) & \to W (X, E) \\ \tilde{P} (x, x^{'}) & \mapsto P (x, x^{'}) = \frac{\tilde{P} (x, x^{'}) v (x^{'})}{ρ v (x)}, \end{aligned} (12)

where ρ and v are, respectively, the PF root and right PF eigenvector of $\tilde{P}$ . Following this notation, we can rewrite Definition 3.1 more simply as

P_{θ} = s (\exp (K + \sum_{i = 1}^{d} θ^{i} g_{i})),

where exp is understood to be entry-wise. In particular, $W (X, E)$ forms an e-family. Indeed, with $X ≅ [m]$ and $m \in N$ in the parametrization proposed by Ito and Amari [14], we pick an arbitrary $x_{*} \in X$ and write

\begin{aligned} \log P (x, x^{'}) = & \sum_{i = 1, i \neq x_{*}}^{m} \log \frac{P (x_{*}, i) P (i, x_{*})}{P (x_{*}, x_{*}) P (x_{*}, x_{*})} δ_{i} (x^{'}) \\ & + \sum_{i = 1, i \neq x_{*}}^{m} \sum_{j = 1, j \neq x_{*}}^{m} \log \frac{P (i, j) P (x_{*}, x_{*})}{P (x_{*}, j) P (i, x_{*})} δ_{i} (x) δ_{j} (x^{'}) \\ & + \log P (x, x_{*}) - \log P (x^{'}, x_{*}) + \log P (x_{*}, x_{*}) . \end{aligned} (13)

The basis is given by

\begin{aligned} g_{i} & = 1^{⊺} δ_{i}, i \in [m], i \neq x_{*} \\ g_{i j} & = δ_{i}^{⊺} δ_{j}, i, j \in [m], i, j \neq x_{*} \end{aligned}

and the parameters are

\begin{aligned} θ^{i} = \log \frac{P (x_{*}, i) P (i, x_{*})}{P (x_{*}, x_{*}) P (x_{*}, x_{*})}, θ^{i j} = \log \frac{P (i, j) P (x_{*}, x_{*})}{P (x_{*}, j) P (i, x_{*})} . \end{aligned}
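The decomposition (13) can be verified entry by entry on a random example. The following sketch assumes full support (E = X²), zero-based state labels, and the reference state x_* = 0; the dictionaries `theta1` and `theta2` are hypothetical containers for the parameters θⁱ and θ^{ij}.

```python
import numpy as np

rng = np.random.default_rng(1)
m, star = 4, 0                        # number of states and reference state x_*
P = rng.random((m, m)) + 0.1          # random positive matrix ...
P /= P.sum(axis=1, keepdims=True)     # ... normalized to be row-stochastic
L = np.log(P)

others = [i for i in range(m) if i != star]
# Natural parameters theta^i and theta^{ij} as given after (13).
theta1 = {i: L[star, i] + L[i, star] - 2 * L[star, star] for i in others}
theta2 = {(i, j): L[i, j] + L[star, star] - L[star, j] - L[i, star]
          for i in others for j in others}

# Rebuild log P(x, x') from the right-hand side of (13).
recon = np.empty((m, m))
for x in range(m):
    for y in range(m):
        val = L[x, star] - L[y, star] + L[star, star]
        if y != star:
            val += theta1[y]
        if x != star and y != star:
            val += theta2[(x, y)]
        recon[x, y] = val

print(np.max(np.abs(recon - L)), len(theta1) + len(theta2))
```

The number of parameters, (m − 1) + (m − 1)², equals m² − m, which matches |E| − |X| for the full support.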

We can alternatively define e-families as e-autoparallel submanifolds of $W (X, E)$ [7, Theorem 6], where a submanifold $V \subset W (X, E)$ is said to be autoparallel with respect to an affine connection ∇ when for any $U, V \in Γ (T V)$ , it holds that $\nabla_{U} V \in Γ (T V)$ .
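The stochastic rescaling (12) and the resulting form P_θ = s(exp(K + θg)) can be sketched as follows for a one-parameter family. The function name `rescale` and the randomly drawn K and g₁ are illustrative assumptions, and full support is assumed for simplicity.

```python
import numpy as np

def rescale(P_tilde):
    """Stochastic rescaling s of (12): P(x,x') = P~(x,x') v(x') / (rho v(x))."""
    eigvals, eigvecs = np.linalg.eig(P_tilde)
    k = np.argmax(eigvals.real)       # Perron-Frobenius root has the largest real part
    rho = eigvals[k].real
    v = eigvecs[:, k].real
    v = v / v.sum()                   # positive right PF eigenvector
    return P_tilde * v[None, :] / (rho * v[:, None])

rng = np.random.default_rng(0)
K = rng.normal(size=(3, 3))           # hypothetical carrier function K
g1 = rng.normal(size=(3, 3))          # hypothetical generator g_1

def P_theta(theta):
    """Member of the one-parameter e-family at natural parameter theta."""
    return rescale(np.exp(K + theta * g1))

P = P_theta(0.7)
print(P.sum(axis=1))
```

Applying `rescale` to a matrix that is already stochastic leaves it unchanged, since its PF root is 1 and its right PF eigenvector is constant.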

3.2.2 Affine structures and characterization of minimal exponential families

We define the set of functions [7, 13, 15]

N (X, E) ≜ \{h \in F (X, E) : \exists (f, c) \in R^{X} \times R, h (x, x^{'}) = f (x^{'}) - f (x) + c\}, (14)

and observe that we can endow $N (X, E)$ with the structure of a $|X|$ -dimensional vector subspace of the $|E|$ -dimensional space $F (X, E)$ . We can thus define the quotient space of generators

G (X, E) ≜ F (X, E) / N (X, E),

of dimension $|E| - |X|$ and the diffeomorphism

\begin{aligned} Δ : G (X, E) & \to W (X, E) \\ g & \mapsto Δ (g) = s (\exp \circ g), \end{aligned} (15)

where ∘ stands here for function composition. Essentially, there is a one-to-one correspondence between affine subspaces of $G (X, E)$ and e-families.
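That Δ is well defined on the quotient can be checked numerically: adding an element of N(X, E) to a generator should not change the rescaled matrix. The sketch below assumes full support and uses a hypothetical helper `rescale` implementing (12).

```python
import numpy as np

def rescale(P_tilde):
    """Stochastic rescaling s of (12)."""
    eigvals, eigvecs = np.linalg.eig(P_tilde)
    k = np.argmax(eigvals.real)
    rho = eigvals[k].real
    v = eigvecs[:, k].real
    v = v / v.sum()
    return P_tilde * v[None, :] / (rho * v[:, None])

rng = np.random.default_rng(2)
m = 4
g = rng.normal(size=(m, m))           # arbitrary representative of a class in G
f = rng.normal(size=m)                # arbitrary function on the state space
c = 0.8                               # arbitrary constant
h = f[None, :] - f[:, None] + c       # h(x, x') = f(x') - f(x) + c, an element of N

P_from_g = rescale(np.exp(g))
P_from_gh = rescale(np.exp(g + h))
print(np.max(np.abs(P_from_g - P_from_gh)))
```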

Theorem 3.1. ([7, Theorem 2]). A submanifold $V \subset W (X, E)$ forms an e-family if and only if there exists an affine subspace $A \subset G (X, E)$ such that $Δ (A) = V$ . In this case, $\dim V = \dim A$ .

As a corollary [7, Corollary 1], $W (X, E)$ is trivially an exponential family of dimension $|E| - |X|$ . A family $V$ will be called minimal (or full) whenever the functions g₁, …, g_d in Definition 3.1 are linearly independent in $G (X, E)$ . In this case, we will say that g₁, …, g_d form a basis for $V$ .

3.2.3 Mixture families

In the stochastic matrix setting, the notion of a mixture family is naturally defined in terms of edge measures.

Definition 3.2. (m-family of stochastic matrices [15]). We say that a family of irreducible stochastic matrices $V_{m}$ is a mixture family (m-family) of irreducible stochastic matrices on $(X, E)$ when there exist affinely independent $Q_{0}, Q_{1}, \dots, Q_{d} \in Q (X, E)$ such that

V_{m} = \{P_{ξ} \in W (X, E) : Q_{ξ} = (1 - \sum_{i = 1}^{d} ξ^{i}) Q_{0} + \sum_{i = 1}^{d} ξ^{i} Q_{i}, ξ \in Ξ\},

where $Ξ = \{ξ \in R^{d} : Q_{ξ} (x, x^{'}) > 0, \forall (x, x^{'}) \in E\}$ , and Q_ξ is the edge measure that pertains to P_ξ. Note that Ξ is an open set, ξ is called the mixture parameter, and d is the dimension² of the family $V_{m}$ .

It is easy to verify that $W (X, E)$ also forms an m-family, and it is possible to define m-families as m-autoparallel submanifolds of $W (X, E)$ .

3.2.4 Dual expectation parameter and chart transition maps

For an exponential family $V_{e}$ with natural parametrization [θⁱ], following Definition 3.1, one may introduce [7] the expectation parameter [η_i] as follows. For i ∈ [d] and θ ∈ Θ,

η_{i} (θ) = \sum_{(x, x^{'}) \in E} Q_{θ} (x, x^{'}) g_{i} (x, x^{'}) = E_{(X, X^{'}) \sim Q_{θ}} [g_{i} (X, X^{'})], (16)

where Q_θ is the edge measure corresponding to the stochastic matrix at coordinates θ. When $V_{e}$ is minimal, η defines an alternative coordinate system to the natural parametrization θ for $V_{e}$ .

Theorem 3.2. [15, Lemma 4.1] The following statements are equivalent:

(i) The functions g₁, …, g_d are linearly independent in $G (X, E)$ .

(ii) The mappings θ∘η⁻¹ and η∘θ⁻¹ are one-to-one.

(iii) The Hessian matrix ${[\partial_{i} \partial_{j} ψ (θ)]}_{i j} ≻ 0$ for any θ ∈ Θ.

(iv) The Hessian matrix ${[\partial_{i} \partial_{j} ψ (θ)]}_{i j} ≻ 0$ for θ = 0.

(v) The parametrization $θ : V \to Θ$ is faithful.

Defining the Shannon negentropy³ potential function $φ : R^{d} \to R$ to satisfy

φ (η) + ψ (θ) = ⟨ θ, η ⟩,

we can express [7, Theorem 4] the chart transition maps (see Figure 1) between the expectation [η_i] and natural [θⁱ] parameters of the e-family $V_{e}$ as

\begin{aligned} η ◦ θ^{- 1} : R^{d} & \to R^{d} \\ θ & \mapsto η_{i} (θ) = \partial_{i} ψ (θ), \\ θ ◦ η^{- 1} : R^{d} & \to R^{d} \\ η & \mapsto θ^{i} (η) = \partial^{i} φ (η), \end{aligned}

where we wrote ∂ⁱ⋅ = ∂ ⋅/∂η_i. We can also obtain the counterpart [13, Lemma 5] of (16) for θ◦η⁻¹,

\begin{aligned} θ^{i} (η) = \sum_{(x, x^{'}) \in E} \partial^{i} Q_{η} (x, x^{'}) (\log P_{η} (x, x^{'}) - K (x, x^{'})) . \end{aligned} (17)
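The relation η_i(θ) = ∂_iψ(θ) can be checked numerically for a one-parameter e-family. The sketch below relies on the identification ψ(θ) = log ρ(θ), where ρ(θ) is the PF root of exp(K + θg); this follows from comparing (11) with (12) and is made explicit here as an assumption of the example. The functions `pf`, `psi`, and `eta` and the random K and g₁ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3
K = rng.normal(size=(m, m))           # hypothetical carrier function
g1 = rng.normal(size=(m, m))          # hypothetical generator

def pf(theta):
    """PF root of exp(K + theta g1) and the rescaled stochastic matrix P_theta."""
    Pt = np.exp(K + theta * g1)
    eigvals, eigvecs = np.linalg.eig(Pt)
    k = np.argmax(eigvals.real)
    rho = eigvals[k].real
    v = eigvecs[:, k].real
    v = v / v.sum()
    return rho, Pt * v[None, :] / (rho * v[:, None])

def psi(theta):
    """Potential function, taken to be the logarithm of the PF root."""
    return float(np.log(pf(theta)[0]))

def eta(theta):
    """Expectation parameter (16): mean of g1 under the edge measure Q_theta."""
    _, P = pf(theta)
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    pi = pi / pi.sum()
    return float(np.sum(pi[:, None] * P * g1))

theta, eps = 0.5, 1e-5
dpsi = (psi(theta + eps) - psi(theta - eps)) / (2 * eps)   # central difference
print(eta(theta), dpsi)
```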

FIGURE 1. Natural and expectation parametrizations of an e-family $V_{e}$ , together with their chart transition maps.

3.2.5 Dual flatness

A straightforward computation shows that all the e-connection coefficients $Γ_{i j, k}^{(e)}$ for an e-family $V_{e}$ and all the m-connection coefficients $Γ_{i j, k}^{(m)}$ for an m-family $V_{m}$ are null. We say that $V_{e}$ is e-flat and that $V_{m}$ is m-flat. From the conjugacy of the affine connections, curvature tensors associated with ∇^(e) and ∇^(m) vanish simultaneously. As a consequence, for any smooth submanifold $V \subset W (X, E)$ ,

V is m-flat \Leftrightarrow V is e-flat,

which is sometimes called the fundamental theorem of information geometry [79, Theorem 3]. In other words, e-families and m-families are both e-flat and m-flat [7, Theorem 5], and for any $V$ , it is enough to find an affine coordinate system in which either the e-connection or m-connection coefficients are null for it to be dually flat. For i, j ∈ [d], recall that $g_{i j} (θ) = g_{P_{θ}} (\partial_{i}, \partial_{j})$ . Similarly, we define $g^{i j} (η) = g_{P_{η}} (\partial^{i}, \partial^{j})$ . The coefficients of the Fisher metric and its inverse are recovered by

\begin{aligned} g_{i j} (θ) & = \partial_{i} \partial_{j} ψ (θ), \\ g^{i j} (η) & = \partial^{i} \partial^{j} φ (η), \\ {[g^{i j} (η)]}^{i j} & = {[g_{i j} (θ)]}_{i j}^{- 1} . \end{aligned} (18)

Thus, φ is also strictly convex, and the coordinate systems [θⁱ] and [η_i] are mutually dual with respect to $g$ . The two coordinate systems are related by the Legendre transformation, and we can express their dual potential functions as

\begin{aligned} φ (η) = \max_{θ \in Θ} \{〈 θ, η 〉 - ψ (θ)\}, ψ (θ) = \max_{η \in H} \{〈 θ, η 〉 - φ (η)\} . \end{aligned}
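The first identity in (18) and the Legendre relation can be probed numerically for a one-parameter e-family. As a sketch under stated assumptions, ψ is taken to be the logarithm of the PF root of exp(K + θg), derivatives are approximated by finite differences, and the maximum defining φ is checked only on a finite grid of parameter values. All names (`pf`, `psi`, `edge_measure`, `fisher`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
K = rng.normal(size=(3, 3))           # hypothetical carrier function
g1 = rng.normal(size=(3, 3))          # hypothetical generator

def pf(theta):
    """PF root of exp(K + theta g1) and the rescaled stochastic matrix."""
    Pt = np.exp(K + theta * g1)
    eigvals, eigvecs = np.linalg.eig(Pt)
    k = np.argmax(eigvals.real)
    rho, v = eigvals[k].real, eigvecs[:, k].real
    v = v / v.sum()
    return rho, Pt * v[None, :] / (rho * v[:, None])

def psi(theta):
    return float(np.log(pf(theta)[0]))

def edge_measure(P):
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = eigvecs[:, np.argmax(eigvals.real)].real
    return (pi / pi.sum())[:, None] * P

def fisher(theta, eps=1e-5):
    """Fisher metric (8) with d/dtheta log P_theta by central differences."""
    Q = edge_measure(pf(theta)[1])
    dlogP = (np.log(pf(theta + eps)[1]) - np.log(pf(theta - eps)[1])) / (2 * eps)
    return float(np.sum(Q * dlogP ** 2))

theta, eps = 0.5, 1e-4
# Second-order central difference of the potential psi.
hess = (psi(theta + eps) - 2 * psi(theta) + psi(theta - eps)) / eps ** 2
eta = float(np.sum(edge_measure(pf(theta)[1]) * g1))
phi = theta * eta - psi(theta)        # dual potential at eta = eta(theta)
grid = np.linspace(-2.0, 3.0, 51)
legendre_gap = phi - max(t * eta - psi(t) for t in grid)
print(hess, fisher(theta), legendre_gap)
```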

3.2.6 Geodesics and geodesic hulls

An affine connection ∇ defines a notion of the straightness of curves. Namely, a curve γ is called a ∇-geodesic whenever it is ∇-autoparallel, $\nabla_{\dot{γ}} \dot{γ} = 0$ , where $\dot{γ} (t)$ is the velocity vector at time parameter t. The geodesic between two points $P_{0}, P_{1} \in W (X, E)$ is the straight curve that goes through the two points. As our manifold is equipped with two dual connections, there are two distinct notions of straight lines, and the arc between the two points will not necessarily correspond to the shortest path between the two elements with respect to the Riemannian metric, unlike in Euclidean geometry. Specifically, the e-geodesic going through P₀ and P₁ is given [7, Corollary 2] by

γ_{P_{0}, P_{1}}^{(e)} ≜ \{P_{t} = s (P_{0}^{◦ (1 - t)} ◦ P_{1}^{◦ t}) : t \in R\}, (19)

and the m-geodesic [7, Theorem 7] by

γ_{P_{0}, P_{1}}^{(m)} ≜ \{P_{t} : Q_{t} = (1 - t) Q_{0} + t Q_{1}, t \in R, Q_{t} \in Q (X, E)\}, (20)

where $Q (X, E)$ is the set of all edge measures introduced in (3). A submanifold $V \subset W (X, E)$ forms an e-family if and only if for any two points $P_{0}, P_{1} \in V$ , $γ_{P_{0}, P_{1}}^{(e)}$ lies entirely in $V$ [7, Corollary 3], and a similar claim holds for m-families. We generalize the aforementioned objects beyond two points to more general subsets of $W (X, E)$ , by defining geodesic hulls [13] (see Figure 2).
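As a numerical illustration of (19) and (20), the following minimal Python sketch computes points on the e-geodesic and the m-geodesic between two positive stochastic matrices. It assumes that the map $s$ of (12) is the Perron-Frobenius rescaling $s (\tilde{P}) (x, x^{'}) = \tilde{P} (x, x^{'}) v (x^{'}) / (ρ v (x))$ and that the edge measure of (3) is $Q (x, x^{'}) = π (x) P (x, x^{'})$ . The function names and the power-iteration solver are choices made for this illustration only and are not taken from the cited works.

```python
def stationary(P, iters=2000):
    """Approximate the stationary distribution of a positive stochastic
    matrix (list of rows) by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def rescale(M, iters=2000):
    """Assumed form of s in (12): M(x, y) v(y) / (rho v(x)), where (rho, v)
    is the Perron-Frobenius eigenpair of the positive matrix M."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        top = max(w)
        v = [wi / top for wi in w]
    # At convergence (M v)(x) = rho v(x), so dividing by (M v)(x) normalizes rows.
    Mv = [sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
    return [[M[x][y] * v[y] / Mv[x] for y in range(n)] for x in range(n)]


def e_geodesic(P0, P1, t):
    """Point P_t on the e-geodesic (19): rescaled entrywise geometric mean."""
    n = len(P0)
    M = [[P0[x][y] ** (1.0 - t) * P1[x][y] ** t for y in range(n)]
         for x in range(n)]
    return rescale(M)


def m_geodesic(P0, P1, t):
    """Point P_t on the m-geodesic (20): edge measures are mixed linearly.
    For t outside [0, 1], Q_t must remain positive for P_t to be defined."""
    n = len(P0)
    pi0, pi1 = stationary(P0), stationary(P1)
    Q = [[(1.0 - t) * pi0[x] * P0[x][y] + t * pi1[x] * P1[x][y]
          for y in range(n)] for x in range(n)]
    return [[Q[x][y] / sum(Q[x]) for y in range(n)] for x in range(n)]


P0 = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
P1 = [[0.2, 0.2, 0.6], [0.3, 0.3, 0.4], [0.7, 0.1, 0.2]]
Pe_half = e_geodesic(P0, P1, 0.5)  # midpoint of the e-geodesic
Pm_half = m_geodesic(P0, P1, 0.5)  # midpoint of the m-geodesic
```

Both midpoints are stochastic matrices, but they generally differ, which reflects the two distinct notions of straightness.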

FIGURE 2. E-hull $\text{e-hull} (\{P, P^{'}, P^{″}\})$ of three points. It is instructive to note that although a set of three points forms a zero-dimensional manifold, we construct a manifold of dimension possibly up to two.

Definition 3.3. (Exponential hull [13, Definition 7]). Let $V \subset W$ :

\begin{aligned} \text{e-hull} (V) = \left\{s (\tilde{P}) : \tilde{P} = \bigcirc_{i = 1}^{k} P_{i}^{◦ α_{i}}, k \in N, α_{1}, \dots, α_{k} \in R, \sum_{i = 1}^{k} α_{i} = 1, P_{1}, \dots, P_{k} \in V\right\}, \end{aligned}

where $s$ is defined in (12).

Definition 3.4. (Mixture hull [13, Definition 8]). Let $V \subset W$ :

\begin{aligned} \text{m-hull} (V) = \left\{P : Q \in Q (X, E), Q = \sum_{i = 1}^{k} α_{i} Q_{i}, k \in N, α_{1}, \dots, α_{k} \in R, P_{1}, \dots, P_{k} \in V\right\}, \end{aligned}

where Q (resp., Q_i) is the edge measure that pertains to P (resp., P_i).

When a family $V$ forms both an m-family and an e-family, we say it forms an em-family.

3.3 Information projections and decomposition theorems

The projection of a point onto a surface is among the most natural geometric concepts. In Euclidean geometry, projecting onto a closed convex body yields a unique closest point. However, the dually flat geometry on $W (X, E)$ is based on two different notions of straightness, inducing two different flavors of geodesic convexity. Furthermore, the divergence function we consider is not symmetric in its arguments, hence the need for two definitions of projections, as minimizers with respect to the first and the second argument. This section returns to the divergence defined in (6), develops the associated notions of projection and orthogonality, and explores the Bregman geometry of $W (X, E)$ .

3.3.1 Information divergence as a Bregman divergence

For a continuously differentiable and strictly convex function $f : Ξ \to R$ on a convex domain $Ξ \subset R^{d}$ , we call Bregman divergence B_f [16] with generator f (see Figure 3) the function

\begin{aligned} B_{f} : Ξ \times Ξ & \to R_{+} \\ (ξ, ξ^{'}) & \mapsto B_{f} (ξ : ξ^{'}) = f (ξ) - f (ξ^{'}) - \sum_{i \in [d]} \partial_{i} f (ξ^{'}) (ξ^{i} - ξ^{' i}) . \end{aligned}

FIGURE 3. Geometrical interpretation of a Bregman divergence.
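To make the definition concrete, here is a minimal Python sketch of a Bregman divergence with a user-supplied generator and gradient. The helper names are hypothetical. The example uses the simpler distribution setting (not the Markov setting of this survey), where the generator $f (ξ) = \sum_{i} ξ^{i} \log ξ^{i}$ recovers the KL divergence between probability vectors, and the generator $f (ξ) = \frac{1}{2} \sum_{i} {(ξ^{i})}^{2}$ recovers half the squared Euclidean distance.

```python
import math


def bregman(f, grad_f, xi, xi_p):
    """B_f(xi : xi') = f(xi) - f(xi') - <grad f(xi'), xi - xi'>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_f(xi_p), xi, xi_p))
    return f(xi) - f(xi_p) - inner


def negentropy(xi):
    return sum(a * math.log(a) for a in xi)


def grad_negentropy(xi):
    return [math.log(a) + 1.0 for a in xi]


def half_sq_norm(xi):
    return 0.5 * sum(a * a for a in xi)


def grad_half_sq_norm(xi):
    return list(xi)


p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

kl_direct = sum(a * math.log(a / b) for a, b in zip(p, q))
kl_bregman = bregman(negentropy, grad_negentropy, p, q)

half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(p, q))
euclid_bregman = bregman(half_sq_norm, grad_half_sq_norm, p, q)
```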

When $P_{θ}, P_{θ^{'}} \in V_{e}$ for some e-family $V_{e}$ following Definition 3.1, one can verify by direct computation [15, 17] that

\begin{aligned} D (θ ‖ θ^{'}) & = ψ (θ^{'}) - ψ (θ) - \sum_{i \in [d]} \partial_{i} ψ (θ) (θ^{' i} - θ^{i}) = B_{ψ} (θ^{'} : θ), \\ H (θ) & = ψ (θ) - \sum_{i \in [d]} η_{i} θ^{i} = - φ (η) . \end{aligned} (21)

As ψ and φ are convex conjugate,

\begin{aligned} D (θ ‖ θ^{'}) = B_{ψ^{*}} (η : η^{'}) = B_{φ} (η : η^{'}), \end{aligned}

where we used the shorthands η = η(θ) and η′ = η(θ′); hence, the KL divergence is the Bregman divergence associated with the Shannon negentropy function, and like any Bregman divergence, it satisfies the law of cosines:

\begin{aligned} B_{φ} (η : η^{'}) + B_{φ} (η^{'} : η^{″}) = B_{φ} (η : η^{″}) + \sum_{i \in [d]} (\partial^{i} φ (η^{″}) - \partial^{i} φ (η^{'})) (η_{i} - η_{i}^{'}), \end{aligned} (22)

which can be re-expressed [7, (23)] as

\begin{aligned} D (θ ‖ θ^{'}) + D (θ^{'} ‖ θ^{″}) & = D (θ ‖ θ^{″}) + \sum_{i \in [d]} (θ^{'' i} - θ^{' i}) (η_{i} - η_{i}^{'}) \\ & = D (θ ‖ θ^{″}) + g_{P_{θ^{'}}} (\dot{γ} (0), \dot{σ} (0)), \end{aligned}

for γ an m-geodesic going through P_θ and P_θ′ and σ an e-geodesic going through P_θ′ and P_θ″.
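The law of cosines (22) is a purely algebraic identity and can be checked numerically for any differentiable generator. The sketch below does so for an illustrative generator $φ (η) = \sum_{i} η_{i} \log η_{i}$ on positive vectors; this generator and the three test points are assumptions made for the example and are not the potential function of an e-family of stochastic matrices.

```python
import math


def phi(eta):
    """Illustrative strictly convex generator on positive vectors."""
    return sum(a * math.log(a) for a in eta)


def grad_phi(eta):
    return [math.log(a) + 1.0 for a in eta]


def bregman(eta, eta_p):
    """B_phi(eta : eta') = phi(eta) - phi(eta') - <grad phi(eta'), eta - eta'>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(eta_p), eta, eta_p))
    return phi(eta) - phi(eta_p) - inner


eta = [0.2, 0.3, 0.5]
eta1 = [0.4, 0.4, 0.2]   # plays the role of eta'
eta2 = [0.1, 0.7, 0.2]   # plays the role of eta''

lhs = bregman(eta, eta1) + bregman(eta1, eta2)
cross = sum((g2 - g1) * (a - b)
            for g2, g1, a, b in zip(grad_phi(eta2), grad_phi(eta1), eta, eta1))
rhs = bregman(eta, eta2) + cross
```

The cross term vanishes exactly when the two difference vectors are orthogonal in the dual pairing, which is the situation exploited by the Pythagorean theorems of this section.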

3.3.2 Canonical divergence

One may naturally wonder whether it is possible to recover the divergence D defined at (6) from $g$ and ∇^(e), ∇^(m) only. This is referred to as the inverse problem in information geometry. Such a divergence is not unique: there exist infinitely many divergence functions that could have given rise to the dually flat geometry on $W (X, E)$ [18]. However, it is possible to single out one particular divergence, termed canonical divergence [5], which is uniquely defined from $g$ and ∇^(e), ∇^(m). For $P, P^{'} \in W (X, E)$ , its expression is given in a dual coordinate system [θⁱ], [η_i] by

D (P ‖ P^{'}) = φ (η) + ψ (θ^{'}) - \sum_{i \in [d]} η_{i} θ^{' i},

where η = η(P) and θ′ = θ(P′). One can verify from (21) that we indeed recover the expression at (6).

3.3.3 Geodesic convexity and convexity properties of information divergence

Geodesic convexity is a natural generalization of convexity in Euclidean geometry for subsets of Riemannian manifolds and functions defined on them. As straight lines are defined with respect to an affine connection ∇, a subset $C$ of $W (X, E)$ is said to be geodesically convex with respect to ∇ when the ∇-geodesic joining⁴ any two points in $C$ remains in $C$ at all times. In particular, $C$ is e-convex (resp., m-convex) when for any $P_{0}, P_{1} \in C$ and any t ∈ [0, 1], it holds that $γ_{P_{0}, P_{1}}^{(e)} (t) \in C$ (resp., $γ_{P_{0}, P_{1}}^{(m)} (t) \in C$ ), where $γ_{P_{0}, P_{1}}^{(e)}$ and $γ_{P_{0}, P_{1}}^{(m)}$ are defined in (19, 20). An immediate consequence is that an e-family (resp., m-family) $V \subset W (X, E)$ is e-convex (resp., m-convex). On a geodesically convex domain $C \subset W (X, E)$ , a function $f : C \to R$ is said to be geodesically convex (resp., strictly geodesically convex) if the composition $f ◦ γ : [0,1] \to R$ is a convex (resp., strictly convex) function for any geodesic $γ : [0,1] \to C$ contained within $C$ . In particular, the information divergence defined in (6) is strictly m-convex in its first argument and strictly e-convex in its second argument [15, Theorem 3.3]. Namely, for t ∈ (0,1), $P, P_{0}, P_{1} \in W (X, E)$ , with P₀ ≠ P₁,

\begin{aligned} D (γ_{P_{0}, P_{1}}^{(m)} (t) ‖ P) & < (1 - t) D (P_{0} ‖ P) + t D (P_{1} ‖ P), \\ D (P ‖ γ_{P_{0}, P_{1}}^{(e)} (t)) & < (1 - t) D (P ‖ P_{0}) + t D (P ‖ P_{1}) . \end{aligned}

However, for $|t| > 1$ , the opposite inequality holds [13]:

\begin{aligned} D (P ‖ γ_{P_{0}, P_{1}}^{(e)} (t)) > (1 - t) D (P ‖ P_{0}) + t D (P ‖ P_{1}) . \end{aligned}

Unlike in the distribution setting, where the KL divergence is jointly m-convex, this property does not hold true for stochastic matrices [21, Remark 4.2].
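The strict m-convexity of the divergence in its first argument can be observed numerically. The following sketch assumes that the divergence of (6) is the KL divergence rate $D (P ‖ P^{'}) = \sum_{x, x^{'}} π (x) P (x, x^{'}) \log (P (x, x^{'}) / P^{'} (x, x^{'}))$ with π the stationary distribution of P, and that edge measures are given by $Q (x, x^{'}) = π (x) P (x, x^{'})$ . The helper names and the example matrices are hypothetical.

```python
import math


def stationary(P, iters=2000):
    """Stationary distribution of a positive stochastic matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def divergence(P, P_other):
    """Assumed form of (6): sum_x pi(x) sum_y P(x, y) log(P(x, y) / P'(x, y))."""
    n = len(P)
    pi = stationary(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / P_other[x][y])
               for x in range(n) for y in range(n))


def m_geodesic(P0, P1, t):
    """Point on the m-geodesic (20): linear interpolation of edge measures."""
    n = len(P0)
    pi0, pi1 = stationary(P0), stationary(P1)
    Q = [[(1.0 - t) * pi0[x] * P0[x][y] + t * pi1[x] * P1[x][y]
          for y in range(n)] for x in range(n)]
    return [[Q[x][y] / sum(Q[x]) for y in range(n)] for x in range(n)]


P0 = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
P1 = [[0.2, 0.2, 0.6], [0.3, 0.3, 0.4], [0.7, 0.1, 0.2]]
P = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.4, 0.3, 0.3]]

t = 0.5
lhs = divergence(m_geodesic(P0, P1, t), P)
rhs = (1.0 - t) * divergence(P0, P) + t * divergence(P1, P)
# Strict m-convexity in the first argument predicts lhs < rhs.
```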

3.3.4 Pythagorean inequalities

In the more familiar Euclidean geometry, projecting a point P onto a subset $C$ of $R^{d}$ consists in finding the point in $C$ that minimizes the Euclidean distance to P. If $C$ is closed and convex, the minimization problem admits a unique solution and a Pythagorean inequality holds between the point, its projection, and any other point in $C$ . Similar ideas are made possible on $W (X, E)$ by the Bregman geometry induced from D. Let $C_{m} \subset W (X, E^{'})$ (resp., $C_{e} \subset W (X, E^{'})$ ) with $E^{'} \subset E$ be non-empty, closed, and m-convex (resp., e-convex). We define the e-projection onto $C_{m}$ as the mapping

\begin{aligned} P_{e} : W (X, E) \to C_{m}, P \mapsto \underset{\bar{P} \in C_{m}}{\arg\min} D (\bar{P} ‖ P), \end{aligned}

and the m-projection onto $C_{e}$ as the mapping

\begin{aligned} P_{m} : W (X, E) \to C_{e}, P \mapsto \underset{\bar{P} \in C_{e}}{\arg\min} D (P ‖ \bar{P}) . \end{aligned}

For a point P in context, we simply write P_e = P_e(P) and P_m = P_m(P).

Theorem 3.3. (Pythagorean inequalities for geodesic e-convex [21, Proposition 4.2], m-convex sets [23, Lemma 1]). The following statements hold.

(i) P_e exists in the sense that the minimum is attained at a unique element of $C_{m}$ .

(ii) For $P_{0} \in C_{m}$ , P₀ = P_e if and only if

\forall \bar{P} \in C_{m}, D (\bar{P} ‖ P) \geq D (\bar{P} ‖ P_{0}) + D (P_{0} ‖ P) .

(iii) P_m exists in the sense that the minimum is attained at a unique element of $C_{e}$ .

(iv) For $P_{0} \in C_{e}$ , P₀ = P_m if and only if

\forall \bar{P} \in C_{e}, D (P ‖ \bar{P}) \geq D (P ‖ P_{0}) + D (P_{0} ‖ \bar{P}) .

3.3.5 Pythagorean equality for linear families

Inequalities become equalities when projecting onto e-families and m-families, that is, when $C_{m}$ is an m-family and $C_{e}$ is an e-family.

Theorem 3.4. (Pythagorean theorem for e-families, m-families [19], [15, Section 4.4]). The following statements hold.

(i) P_e exists in the sense that the minimum is attained at a unique element of $C_{m}$ .

(ii) For $P_{0} \in C_{m}$ , P₀ = P_e if and only if

\forall \bar{P} \in C_{m}, D (\bar{P} ‖ P) = D (\bar{P} ‖ P_{0}) + D (P_{0} ‖ P) .

(iii) P_m exists in the sense that the minimum is attained at a unique element of $C_{e}$ .

(iv) For $P_{0} \in C_{e}$ , P₀ = P_m if and only if

\forall \bar{P} \in C_{e}, D (P ‖ \bar{P}) = D (P ‖ P_{0}) + D (P_{0} ‖ \bar{P}) .

3.4 Bibliographical remarks

The construction of the conjugate connection manifold from a general contrast function in Section 3.1.1 and Section 3.1.2 follows the general scheme of Eguchi [11, 12], which can also be found in [79, Definition 5, Theorem 4]. The expression for the Fisher metric at (Eq. 8) and the conjugate affine connections at (Eq. 8) were introduced by Nagaoka [7, (9), (19), (20)]. One-dimensional e-families of stochastic matrices were first introduced by Nakagawa and Kanaya [19], whereas the general construction in the multi-dimensional setting was done by Nagaoka [7], who also established the characterization in Theorem 3.1 of minimal e-families in terms of affine structures in [7, Theorem 2]. Curved exponential families of transition matrices and mixture families make their first named appearances in Hayashi and Watanabe [15, Section 8.3, Section 4.2]. See also [13, Definition 1] for two alternative equivalent definitions of an m-family. The expectation parameter for exponential families in (16) and its expression as the gradient of the potential function were discussed on multiple occasions [7, Theorem 4], [19, (28)], [15, Lemma 5.1]. Theorem 3.2 was taken from [15, Lemma 4.1]. The expression for the chart transition map from expectation to natural parameters in (17) was obtained from [13, Lemma 5]. Geodesics discussed in Section 3.2.6 were introduced in one dimension in [19] and in multiple dimensions in [7], whereas mixture and exponential hulls of sets first appeared in [13]. Nagaoka [7] established the dual flatness of the manifold discussed in Section 3.2.5 and matched the information divergence with the canonical divergence. The expression of the information divergence and entropy for exponential families in (21) was given in [15, 17]. The law of cosines was also mentioned by Adamčík [20] for general Bregman projections.
The convexity properties of the divergence appeared in Hayashi and Watanabe [15, Theorem 3.3, Lemma 4.5], and their strict version was discussed in [21, Section 4] together with the case $|t| > 1$ . The Pythagorean inequality for projections onto m-convex sets [Theorem 3.3 (i), (ii)] was shown to hold by Csiszár et al. [23, Lemma 1]. The inequality for the “reversed projection” onto e-convex sets was found in [21]. The equality in the Pythagorean theorem for e-families and m-families was first found in [19, Lemma 5] for the one-dimensional setting and in [15, Corollary 4.7, Corollary 4.8] for multiple dimensions.

3.4.1 Timeline

The idea of tilting or exponential change of measure, which gives rise to e-families in the context of distributions, can be traced back to Miller [22]. However, in this section, we focused on the milestones toward the geometric construction of Nagaoka [7], and we deferred the history of the development of the large deviation theory to Section 5.2. The first to recognize the exponential family structure of stochastic matrices were Csiszár et al. [23], who considered information projections onto linearly constrained sets and obtained exponential families as the solution to the maximum entropy problem, as discussed in more detail in Section 5.1. The notion of an asymptotic exponential family was implicitly described by Ito and Amari [14] and was formalized by Takeuchi and Barron [24] and Takeuchi and Kawabata [25]. A later result by Takeuchi and Nagaoka [26] proved that asymptotic exponential families and their non-asymptotic counterparts are in fact equivalent.

3.4.2 Alternative constructions

Some alternative definitions of exponential families of Markov chains include [27–32]. However, they do not enjoy the same geometric properties as the one of Definition 3.1. Thus, we do not discuss them in detail.

4 Recent advances

One area of recent progress has been the analysis of the geometric properties of significant submanifolds of $W (X, E)$ . In Section 4.1, we briefly discuss symmetric, bistochastic, and memoryless classes. In Section 4.2, we turn the spotlight onto the structure-rich family of irreducible and reversible stochastic matrices. In Section 4.3, we mention some recent progress in connecting the dually flat geometry of Section 3.1 to the theory of lumpability of Markov chains. We end with a discussion on finite state machine (FSMX) models in Section 4.4.

4.1 Symmetric, bistochastic, and memoryless stochastic matrices

In this section, we briefly survey known geometric properties of notable submanifolds of $W (X, E)$ . We also refer the reader to Table 1, adapted from [13, Table 1], for a more visual classification.

TABLE 1. Geometry of submanifolds of irreducible Markov kernels for $|X| \geq 3$ .

4.1.1 Memoryless class

We say that a stochastic matrix $P \in W (X)$ is memoryless, when it can be expressed as

P = \begin{pmatrix} \text{---} & π & \text{---} \\ \text{---} & π & \text{---} \\ \text{---} & π & \text{---} \end{pmatrix},

for $π \in P (X)$ . We note that π is the stationary distribution of P, and that for such P to be irreducible, it is necessary that π > 0; hence, $P \in W (X, X^{2})$ . Markov chains defined by a memoryless stochastic matrix correspond in fact to an iid process. We write $W_{i i d} (X, X^{2})$ for the set of all memoryless stochastic matrices.

Lemma 4.1. ([13, Lemma 7, Lemma 8]). The two following statements hold:

(i) $W_{i i d} (X, X^{2})$ forms an e-family of dimension $|X| - 1$ .

(ii) $W_{i i d} (X, X^{2})$ does not form an m-family.

Recall the parametrization of Ito and Amari [14], reported in (13). Coefficients θ^ij in the expression represent memory in the process, and thus vanish. For $X ≅ [m]$ and an arbitrary x_* ∈ [m], we can re-write

\begin{aligned} \log P (x, x^{'}) = & \sum_{\substack{i = 1 \\ i \neq x_{*}}}^{m} θ^{i} g_{i} (x, x^{'}) + \log π (x_{*}), \end{aligned} (23)

where for i ∈ [m], i ≠ x_*,

θ^{i} = \log \frac{π (i)}{π (x_{*})}, g_{i} (x, x^{'}) = δ_{i} (x^{'}) .
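The parametrization (23) can be checked directly on a small example. In the sketch below, the distribution `pi`, the reference state `x_star`, and the use of zero-based state labels (instead of [m]) are assumptions made for illustration.

```python
import math

pi = [0.2, 0.3, 0.5]            # assumed positive distribution on X = {0, 1, 2}
m = len(pi)
x_star = 0                      # arbitrary reference state
P = [pi[:] for _ in range(m)]   # memoryless matrix: every row equals pi

# Natural parameters theta^i = log(pi(i) / pi(x_star)) for i != x_star.
theta = {i: math.log(pi[i] / pi[x_star]) for i in range(m) if i != x_star}


def g(i, x, x_next):
    """Generator g_i(x, x') = delta_i(x'): depends on the arrival state only."""
    return 1.0 if x_next == i else 0.0


def log_P_from_theta(x, x_next):
    """Right-hand side of (23)."""
    return (sum(theta[i] * g(i, x, x_next) for i in theta)
            + math.log(pi[x_star]))


max_err = max(abs(log_P_from_theta(x, y) - math.log(P[x][y]))
              for x in range(m) for y in range(m))

# pi is stationary for P, as noted in the text.
pi_next = [sum(pi[x] * P[x][y] for x in range(m)) for y in range(m)]
```

The number of free parameters is m − 1, in agreement with the dimension stated in Lemma 4.1 (i).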

4.1.2 Bistochastic class

Bistochastic matrices, also called doubly stochastic matrices, are both row- and column-stochastic. In other words, $P \in W (X)$ is bistochastic if and only if its transpose satisfies $P^{⊺} \in W (X)$ . In particular, the stationary distribution of a bistochastic matrix is uniform. We denote by $W_{b i s} (X, X^{2})$ the set of positive bistochastic matrices.

Lemma 4.2. The two following statements hold:

(i) $W_{b i s} (X, X^{2})$ forms an m-family of dimension ${(|X| - 1)}^{2}$ [15, Example 4].

(ii) For $|X| > 2$ , $W_{b i s} (X, X^{2})$ does not form an e-family [13, Lemma 10].

4.1.3 Symmetric class

A symmetric stochastic matrix P satisfies P(x, x′) = P(x′, x) for any pair of states $x, x^{'} \in X$ . Writing $W_{s y m} (X, X^{2})$ for the set of positive symmetric matrices, note that $W_{s y m} (X, X^{2})$ lies at the intersection of reversible (see Section 4.2) and doubly stochastic matrices, enjoying all their properties (e.g., uniform stationary distribution, self-adjointness). However, perhaps surprisingly, $W_{s y m} (X, X^{2})$ does not form an e-family.

Lemma 4.3. ([13, Lemma 9, Lemma 10]). The two following statements hold:

(i) $W_{s y m} (X, X^{2})$ forms an m-family of dimension $|X| (|X| - 1) / 2$ ,

(ii) For $|X| > 2$ , $W_{s y m} (X, X^{2})$ does not form an e-family.

4.2 Time-reversible stochastic matrices

In Section 4.2.1, we begin by briefly introducing time reversals and time reversibility in the context of Markov chains. In Section 4.2.2, we proceed to analyze geometric structures that are invariant under the time reversal operation. In Section 4.2.3, we inspect the e-family and m-family nature of the submanifold of reversible stochastic matrices and reversible edge measures. In Section 4.2.4 and Section 4.2.5, we, respectively, discuss reversible information projections and how to generate the reversible set as a geodesic hull of structured subfamilies.

4.2.1 Reversibility

Consider a Markov chain ${(X_{t})}_{1 \leq t \leq n}$ with transition matrix $P \in W (X, E)$ , started from its stationary distribution π. When we look at the random process in reverse time ${(X_{n + 1 - t})}_{1 \leq t \leq n}$ , the Markov property is still verified. In fact, the transition matrix P^* of this time-reversed Markov chain is given by P^*(x, x′) = π(x′)P(x′, x)/π(x). The time reversal P^* shares the same stationary distribution as the original chain, and irreducibility is preserved, although $P^{*} \in W (X, E^{*})$ , where $E^{*} = \{(x^{'}, x) : (x, x^{'}) \in E\}$ is the symmetric image of the connection digraph $E$ . When P^* = P, the transition probabilities of the chain forward and backward in time coincide, and we say that the chain is time-reversible. Equivalently, we may say that P verifies the detailed balance equation:

π (x) P (x, x^{'}) = π (x^{'}) P (x^{'}, x) .

We write $W_{r e v} (X, E)$ for the set of reversible chains that are irreducible with connection digraph $(X, E)$ . Note that the edge set must necessarily satisfy $E = E^{*}$ ; otherwise, $W_{r e v} (X, E) = \emptyset$ .
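The time reversal and the detailed balance condition translate directly into code. The following is a minimal sketch for positive stochastic matrices; the function names, the power-iteration solver, and the numerical tolerance are choices made for this illustration.

```python
def stationary(P, iters=2000):
    """Stationary distribution of a positive stochastic matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def time_reversal(P):
    """P*(x, x') = pi(x') P(x', x) / pi(x)."""
    n = len(P)
    pi = stationary(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


def is_reversible(P, tol=1e-10):
    """Check detailed balance pi(x) P(x, x') = pi(x') P(x', x) up to tol."""
    n = len(P)
    pi = stationary(P)
    return all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) <= tol
               for x in range(n) for y in range(n))


P = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
P_star = time_reversal(P)
P_back = time_reversal(P_star)          # reversing twice returns P
P_sym = [[0.5 * (P[x][y] + P_star[x][y]) for y in range(3)] for x in range(3)]
```

Here `P` fails detailed balance, whereas the average of `P` and its reversal satisfies it.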

Time-reversibility is a central concept across a myriad of scientific fields, from computer science (queuing networks [33], storage models, Markov Chain Monte Carlo algorithms [34], etc.) to physics (many classical or quantum natural laws appear as being time-reversible [35]). The theory of reversibility for Markov chains was originally developed by Kolmogorov [36, 37], and we refer the reader to [38] for a more complete historical exposition.

Reversible Markov chains enjoy a particularly rich mathematical structure. Perhaps first and foremost, reversibility implies self-adjointness of P as an operator on the Hilbert space ℓ₂(π) of real functions over $X$ endowed with the weighted inner product ${⟨ f, g ⟩}_{π} = \sum_{x \in X} f (x) g (x) π (x)$ . Key properties of reversible stochastic matrices induced from self-adjointness include a real spectrum, control of the mixing time from above and below by the inverse of the absolute spectral gap [9, Chapter 12], and stability of spectrum estimation procedures [39]. Reversibility has also been explored in the context of algebraic statistics [40] or Bayesian statistics [41]. In this section, we focus on the properties of reversibility and families of reversible stochastic matrices from an information geometric viewpoint.

4.2.2 Geometric invariants

The time reversal operation is known to preserve some geometric properties of families of transition matrices. Consider $V \subset W (X, E)$ , a family of irreducible stochastic matrices. The time-reversal family [13, Definition 3], denoted as $V^{*}$ , is defined by

V^{*} ≜ \{P^{*} : P \in V\} .

Lemma 4.4. ([13, Proposition 1]). Let $V_{e}$ (resp., $V_{m}$ ) be an e-family (resp., m-family) in $W (X, E)$ . Then, $V_{e}^{*}$ (resp., $V_{m}^{*}$ ) forms an e-family (resp., m-family) in $W (X, E^{*})$ .

Moreover, the time reversal operation leaves the divergence between stochastic matrices unchanged [80, Proof of Proposition 2]:

P_{1}, P_{2} \in W (X, E) \Rightarrow D (P_{1} ‖ P_{2}) = D (P_{1}^{*} ‖ P_{2}^{*}) . (24)
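The invariance (24) can be observed numerically. The sketch below assumes that the divergence of (6) is the KL divergence rate $\sum_{x, x^{'}} π_{1} (x) P_{1} (x, x^{'}) \log (P_{1} (x, x^{'}) / P_{2} (x, x^{'}))$ , with π₁ the stationary distribution of P₁; the helper names and example matrices are hypothetical.

```python
import math


def stationary(P, iters=2000):
    """Stationary distribution of a positive stochastic matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def time_reversal(P):
    """P*(x, x') = pi(x') P(x', x) / pi(x)."""
    n = len(P)
    pi = stationary(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


def divergence(P, P_other):
    """Assumed form of (6): sum_x pi(x) sum_y P(x, y) log(P(x, y) / P'(x, y))."""
    n = len(P)
    pi = stationary(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / P_other[x][y])
               for x in range(n) for y in range(n))


P1 = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
P2 = [[0.2, 0.2, 0.6], [0.3, 0.3, 0.4], [0.7, 0.1, 0.2]]

d_forward = divergence(P1, P2)
d_reversed = divergence(time_reversal(P1), time_reversal(P2))
```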

When $V_{r} \subset W_{r e v} (X, E)$ , we say that the family $V_{r}$ is reversible, and in this case $V_{r}^{*} = V_{r}$ , with $E^{*} = E$ . From the definition of an e-family $V_{e}$ , it is possible to determine whether $V_{e}$ is reversible. It is convenient to first introduce the class of log-reversible functions [13, Definition 4, Corollary 1]:

\begin{aligned} F_{r e v} (X, E) ≜ \{h \in F (X, E) : \exists f \in R^{X}, \forall x, x^{'} \in X, h (x, x^{'}) = h (x^{'}, x) + f (x^{'}) - f (x)\} . \end{aligned} (25)

Lemma 4.5. ([13, Theorem 2]). Let $V_{e} \subset W (X, E)$ be an e-family that follows the expression of (11). Then $V_{e} = V_{e}^{*}$ if and only if $E = E^{*}$ , $K \in F_{r e v} (X, E)$ , and $g_{i} \in F_{r e v} (X, E)$ for all i ∈ [d].

4.2.3 The em-family of reversible stochastic matrices

The class of functions $F_{r e v} (X, E)$ introduced in (25) can be endowed with the structure of a vector space [13, Lemma 4], which verifies the following inclusions:

N (X, E) \subset F_{r e v} (X, E) \subset F (X, E),

where $N (X, E)$ was defined in (14). Immediately, $|X| \leq \dim F_{r e v} (X, E) \leq |E|$ , and this enables us to further define the quotient space of reversible generators:

G_{r e v} (X, E) ≜ F_{r e v} (X, E) / N (X, E) .

It is possible to verify that

W_{r e v} (X, E) = Δ (G_{r e v} (X, E)),

where Δ is the diffeomorphism defined in (15). The following result is then a consequence of Theorem 3.1.

Theorem 4.1. ([13, Theorem 3, Theorem 5, Theorem 6]). $W_{r e v} (X, E)$ forms an e-family and an m-family of dimension

\begin{aligned} \dim W_{r e v} (X, E) = \frac{|E| + |ℓ (E)|}{2} - 1, \end{aligned} (26)

where $ℓ (E) ≜ \{(x, x^{'}) \in E : x^{'} = x\}$ is the set of loops in the connection graph $(X, E)$ .
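The dimension formula (26) depends only on the edge set. As a quick sanity check, the following sketch evaluates it for two small connection graphs. The function name is hypothetical, and irreducibility of the graph is assumed rather than verified.

```python
def dim_reversible(edges):
    """Evaluate (|E| + |l(E)|) / 2 - 1 from (26) for a symmetric edge set.

    Irreducibility of the digraph is assumed and not checked here.
    """
    E = set(edges)
    if any((y, x) not in E for (x, y) in E):
        raise ValueError("edge set must be symmetric (E = E*)")
    loops = sum(1 for (x, y) in E if x == y)
    # Non-loop edges come in pairs and each loop is counted twice,
    # so len(E) + loops is always even.
    return (len(E) + loops) // 2 - 1


full = [(x, y) for x in range(3) for y in range(3)]
no_loops = [(x, y) for x in range(3) for y in range(3) if x != y]

dim_full = dim_reversible(full)          # (9 + 3) / 2 - 1
dim_no_loops = dim_reversible(no_loops)  # (6 + 0) / 2 - 1
```

For the full edge set on three states, the value agrees with the count of free entries of a symmetric edge measure summing to one.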

Theorem 4.2. ([13, Theorem 4, Theorem 5]). Let $P \in W_{r e v} (X, E)$ , with stationary distribution π. Pick an arbitrary element $e_{*} = (x_{*}, x_{*}^{'}) \in E \setminus ℓ (E)$ , and define

\begin{aligned} T (E) & ≜ \{(x, x^{'}) \in E : x^{'} \leq x, (x, x^{'}) \neq e_{*}\}, \\ g_{*} & ≜ δ_{x_{*}}^{⊺} δ_{x_{*}^{'}} + δ_{x_{*}^{'}}^{⊺} δ_{x_{*}} . \end{aligned}

For $(i, j) \in T (E)$ , the collection of functions

g_{i j} = δ_{i}^{⊺} δ_{j} + δ_{j}^{⊺} δ_{i},

forms a basis for $W_{r e v} (X, E)$ . We can write P as a member of the m-family of reversible stochastic matrices by expressing its edge measure Q as

\begin{aligned} Q = \frac{g_{*}}{2} + \sum_{(i, j) \in T (E)} (g_{i j} - g_{*}) \frac{Q (i, j)}{1 + δ_{i} (j)}, \end{aligned}

and we can write P as a member of the e-family,

\begin{aligned} \log P (x, x^{'}) = & \sum_{(i, j) \in T (E)} \frac{1}{2 (1 + δ_{i} (j))} \log \frac{P (i, j) P (j, i)}{P (x_{*}, x_{*}^{'}) P (x_{*}^{'}, x_{*})} g_{i j} (x, x^{'}) \\ & + \frac{1}{2} \log π (x^{'}) - \frac{1}{2} \log π (x) + \frac{1}{2} \log P (x_{*}, x_{*}^{'}) P (x_{*}^{'}, x_{*}), \end{aligned}

when $(x, x^{'}) \in E$ , and P(x, x′) = 0 otherwise.

4.2.4 Reversible information projections

Let $P \in W (X, E)$ with $E^{*} = E$ . We recall the definitions (see Section 3.3) of the m-projection P_m and the e-projection P_e of P onto $W_{r e v} (X, E)$ ,

\begin{aligned} P_{m} ≜ \underset{\bar{P} \in W_{r e v} (X, E)}{\arg\min} D (P ‖ \bar{P}), \quad P_{e} ≜ \underset{\bar{P} \in W_{r e v} (X, E)}{\arg\min} D (\bar{P} ‖ P) . \end{aligned}

There are known closed-form expressions for P_m and P_e. Moreover, the fact that $W_{r e v} (X, E)$ forms an em-family (Theorem 4.1) leads to a pair of Pythagorean identities (see Figure 4), and the invariance of D under time reversals highlighted in (24) implies a bisection property.

FIGURE 4. Information projections onto $W_{r e v} (X, E)$ , and illustrations of Pythagorean identities and bisection property of Theorem 4.3.

Theorem 4.3. ([13, Theorem 7, Proposition 2]). Let $P \in W (X, E)$ with $E^{*} = E$ :

\begin{aligned} P_{m} & = \frac{P + P^{*}}{2}, \\ P_{e} & = s ({\tilde{P}}_{e}), \text{ with } {\tilde{P}}_{e} (x, x^{'}) = \sqrt{P (x, x^{'}) P^{*} (x, x^{'})}, \end{aligned}

where $s$ is defined in Eq. (12). Moreover, for any $\bar{P} \in W_{r e v} (X, E)$ , P_m and P_e satisfy the following Pythagorean identities:

\begin{aligned} D (P ‖ \bar{P}) & = D (P ‖ P_{m}) + D (P_{m} ‖ \bar{P}), \\ D (\bar{P} ‖ P) & = D (\bar{P} ‖ P_{e}) + D (P_{e} ‖ P) . \end{aligned}

Furthermore, the following bisection property holds

\begin{aligned} D (P ‖ P_{m}) = D (P^{*} ‖ P_{m}), D (P_{e} ‖ P) = D (P_{e} ‖ P^{*}) . \end{aligned}
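The closed forms, the Pythagorean identities, and the bisection property of Theorem 4.3 can all be checked numerically on a small positive example. The sketch below assumes that the divergence of (6) is the KL divergence rate weighted by the stationary distribution of the first argument, and that $s$ in (12) is the Perron-Frobenius rescaling $s (\tilde{P}) (x, x^{'}) = \tilde{P} (x, x^{'}) v (x^{'}) / (ρ v (x))$ . Function names, example matrices, and the power-iteration solvers are hypothetical choices.

```python
import math


def stationary(P, iters=2000):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def time_reversal(P):
    n, pi = len(P), stationary(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


def rescale(M, iters=2000):
    """Assumed form of s in (12), via power iteration on the right eigenvector."""
    n = len(M)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        top = max(w)
        v = [wi / top for wi in w]
    Mv = [sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
    return [[M[x][y] * v[y] / Mv[x] for y in range(n)] for x in range(n)]


def divergence(P, P_other):
    n, pi = len(P), stationary(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / P_other[x][y])
               for x in range(n) for y in range(n))


n = 3
P = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]   # not reversible
P_star = time_reversal(P)
P_m = [[0.5 * (P[x][y] + P_star[x][y]) for y in range(n)] for x in range(n)]
P_e = rescale([[math.sqrt(P[x][y] * P_star[x][y]) for y in range(n)]
               for x in range(n)])

# An arbitrary reversible matrix, built by normalizing symmetric weights.
S = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 1.0]]
P_bar = [[S[x][y] / sum(S[x]) for y in range(n)] for x in range(n)]

gap_m = divergence(P, P_bar) - divergence(P, P_m) - divergence(P_m, P_bar)
gap_e = divergence(P_bar, P) - divergence(P_bar, P_e) - divergence(P_e, P)
bisect_m = divergence(P, P_m) - divergence(P_star, P_m)
bisect_e = divergence(P_e, P) - divergence(P_e, P_star)
```

All four quantities vanish up to numerical error, as predicted by the theorem.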

Finally, we mention that the entropy production σ(P) for a Markov chain with transition matrix P, which plays a central role in discussing irreversible phenomena in non-equilibrium systems, can be expressed in terms of the canonical divergence [81, (22)] as follows:

σ (P) = \frac{1}{2} \sum_{x, x^{'} \in X} (Q (x, x^{'}) - Q (x^{'}, x)) \log \frac{Q (x, x^{'})}{Q (x^{'}, x)} = \frac{1}{2} (D (P ‖ P^{*}) + D (P^{*} ‖ P)) .
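The two expressions for the entropy production can be compared numerically. As before, the sketch assumes that the divergence of (6) is the KL divergence rate and that the edge measure is $Q (x, x^{'}) = π (x) P (x, x^{'})$ ; the helper names and the example matrix are hypothetical.

```python
import math


def stationary(P, iters=2000):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def time_reversal(P):
    n, pi = len(P), stationary(P)
    return [[pi[y] * P[y][x] / pi[x] for y in range(n)] for x in range(n)]


def divergence(P, P_other):
    n, pi = len(P), stationary(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / P_other[x][y])
               for x in range(n) for y in range(n))


def entropy_production(P):
    """sigma(P) = 1/2 sum (Q(x, x') - Q(x', x)) log(Q(x, x') / Q(x', x))."""
    n, pi = len(P), stationary(P)
    Q = [[pi[x] * P[x][y] for y in range(n)] for x in range(n)]
    return 0.5 * sum((Q[x][y] - Q[y][x]) * math.log(Q[x][y] / Q[y][x])
                     for x in range(n) for y in range(n))


P = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]   # not reversible
P_star = time_reversal(P)

sigma = entropy_production(P)
symmetrized = 0.5 * (divergence(P, P_star) + divergence(P_star, P))

# For a reversible chain, Q is symmetric and the entropy production vanishes.
S = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 2.0, 1.0]]
P_rev = [[S[x][y] / sum(S[x]) for y in range(3)] for x in range(3)]
sigma_rev = entropy_production(P_rev)
```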

4.2.5 Characterization of the reversible family as geodesic hulls

It is known that the set of bistochastic matrices, also known as the Birkhoff polytope, is the convex hull of the set of permutation matrices (theorem of Birkhoff and von Neumann [42–44]). Recalling from Section 3.2.6 the definition of geodesic hulls (Definition 3.3, Definition 3.4) of families of stochastic matrices, we note that results in a similar spirit are known for generating the positive and reversible family as geodesic hulls of particular subfamilies.

Theorem 4.4. ([13, Theorem 9, Theorem 10]). It holds that

(i)

\text{m-hull} (W_{i i d} (X, X^{2})) = W_{r e v} (X, X^{2}),

where $W_{i i d} (X, X^{2})$ is the family of memoryless stochastic matrices discussed in Section 4.1.1.

(ii) For $|X| \geq 3$ , ⁵

\text{e-hull} (W_{s y m} (X, X^{2})) = W_{r e v} (X, X^{2}),

where $W_{s y m} (X, X^{2})$ is the family of positive symmetric stochastic matrices discussed in Section 4.1.3.

4.3 Markov morphisms, lumping, and embeddings of Markov chains

In the context of distributions, Čencov [45] introduced Markov morphisms in an axiomatic manner as the natural mappings to consider for statistics. The Fisher information metric can then be characterized as the unique invariant metric tensor under Markov morphisms [45–47]. In the context of stochastic matrices, we saw that the metric and connections introduced in Section 3 were asymptotically consistent with Markov models. This section connects with the axiomatic approach of Čencov and proposes a class of data processing operations that are arguably natural in the Markov setting.

4.3.1 Lumpability

We briefly recall lumpability in the context of distributions and data processing. Consider a distribution $μ \in P (Y)$ , and let Y₁, Y₂, … be a sequence of random variables independently sampled from μ. Suppose we define a deterministic, surjective map $κ : Y \to X$ , where $X$ is a space not larger than $Y$ , and we inspect the random process defined by ${(κ (Y_{t}))}_{t \in N}$ . Note that κ induces a partition of the space $Y = ⋃_{x \in X} S_{x}$ , $x \neq x^{'} \Rightarrow S_{x} \cap S_{x^{'}} = \emptyset$ with $S_{x} = \{y \in Y : κ (y) = x\} = κ^{- 1} (\{x\})$ . The new process is again a sequence of independent random variables sampled identically from the push-forward distribution κ(μ) = μ ∘ κ⁻¹, where we used an overloaded definition $κ : P (Y) \to P (X)$ . Namely, the probability of the realization $x \in X$ is the probability of the preimage $S_{x}$ ; for $x \in X$ ,

κ (μ) (x) = \sum_{y \in Y} δ [κ (y) = x] μ (y) .

When $X = Y$ , symbols are merely being permuted. As with any data-processing operation, monotonicity of information dictates that two distributions can only be brought closer together with respect to D by the action of κ:

D (κ (μ) ‖ κ (ν)) \leq D (μ ‖ ν) .
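For distributions, the push-forward and the resulting monotonicity of the divergence are straightforward to illustrate. In the sketch below, the lumping map is encoded as a list `kappa` with `kappa[y]` the image of symbol `y`; this encoding, the function names, and the example distributions are assumptions made for illustration.

```python
import math


def push_forward(mu, kappa, n_out):
    """kappa(mu)(x) = sum of mu(y) over all y with kappa(y) = x."""
    out = [0.0] * n_out
    for y, mass in enumerate(mu):
        out[kappa[y]] += mass
    return out


def kl(mu, nu):
    """KL divergence between two distributions on the same finite space."""
    return sum(a * math.log(a / b) for a, b in zip(mu, nu) if a > 0)


kappa = [0, 0, 1, 1]             # lumps {0, 1} into 0 and {2, 3} into 1
mu = [0.1, 0.2, 0.3, 0.4]
nu = [0.25, 0.25, 0.25, 0.25]

mu_lumped = push_forward(mu, kappa, 2)
nu_lumped = push_forward(nu, kappa, 2)

d_before = kl(mu, nu)
d_after = kl(mu_lumped, nu_lumped)   # never exceeds d_before
```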

Crucially, in the independent and identically distributed setting, the lumping operation can be understood both as a form of processing of the stream of observations and as an algebraic manipulation of the distribution that generated the random process.

For Markov chains, the concept of lumpability is vastly richer. The first fact one must come to terms with is that a Markov chain may lose its Markov property after a processing operation on the data stream [48, 49], even for an operation as basic as a lumping. A chain is said to be lumpable [50] with respect to a lumping map $κ : Y \to X$ , when the Markov property is preserved for the lumped process.

Theorem 4.5. ([50, Theorem 6.3.2]). Let $P \in W (Y, E)$ . P is lumpable if and only if for all $x, x^{'} \in X$ and for all $y_{1}, y_{2} \in S_{x}$ , it holds that $P (y_{1}, S_{x^{'}}) = P (y_{2}, S_{x^{'}})$ , where for $y \in Y, S \subset Y$ , $P (y, S) = \sum_{y^{'} \in S} P (y, y^{'})$ .

The subset of $W (Y, E)$ of all lumpable stochastic matrices is written $W_{κ} (Y, E)$ . We overload the operation $κ : W_{κ} (Y, E) \to W (X, D)$ and the κ-lumped stochastic matrix is denoted as κ(P) with, for any $x, x^{'} \in X$ ,

κ (P) (x, x^{'}) = P (y, S_{x^{'}}), y \in S_{x} .
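The criterion of Theorem 4.5 and the definition of the lumped matrix κ(P) can be sketched as follows. The list encoding of κ, the function names, the tolerance, and the example matrices are assumptions made for this illustration.

```python
def blocks(kappa, n_out):
    """Partition S_x = {y : kappa(y) = x}, returned as a list indexed by x."""
    return [[y for y, x in enumerate(kappa) if x == b] for b in range(n_out)]


def is_lumpable(P, kappa, n_out, tol=1e-12):
    """Check that P(y, S_x') is constant over y in S_x, for all x, x'."""
    S = blocks(kappa, n_out)
    for S_x in S:
        for S_xp in S:
            masses = [sum(P[y][yp] for yp in S_xp) for y in S_x]
            if max(masses) - min(masses) > tol:
                return False
    return True


def lump(P, kappa, n_out):
    """kappa(P)(x, x') = P(y, S_x') for any representative y in S_x."""
    S = blocks(kappa, n_out)
    return [[sum(P[S[x][0]][yp] for yp in S[xp]) for xp in range(n_out)]
            for x in range(n_out)]


kappa = [0, 1, 1]   # merge states 1 and 2 of Y into state 1 of X

P_good = [[0.5, 0.2, 0.3],
          [0.4, 0.1, 0.5],
          [0.4, 0.3, 0.3]]   # rows 1 and 2 send equal mass to each block
P_bad = [[0.5, 0.2, 0.3],
         [0.4, 0.1, 0.5],
         [0.1, 0.6, 0.3]]    # rows 1 and 2 disagree on the mass sent to S_0

lumped = lump(P_good, kappa, 2)
```

Calling `lump` is only meaningful when `is_lumpable` returns `True`; otherwise the result depends on the chosen representative.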

4.3.2 Embeddings of Markov chains

Embeddings of stochastic matrices that correspond to conditional models were proposed and analyzed in [51–53]. However, the question of Markov chains, where one considers the stochastic process, was only recently explored in [21]. Looking at reverse operations to lumping, we are interested in embedding an irreducible family of chains $V \subset W (X, D)$ into a space of irreducible chains $W (Y, E)$ defined on a larger state space $Y$ , with some compatible edge set $E$ . In [21], it is postulated that natural morphisms should satisfy the following requirements:

A.1 Morphisms should preserve the Markov property.

A.2 Morphisms should be expressible as algebraic operations on stochastic matrices.

A.3 Morphisms should have operational meaning on trajectories of observations.

The following definition of a Markov morphism was proposed in [21].

Definition 4.1. (Markov morphism for stochastic matrices [21, Definition 3.2]). A map $λ : W (X, D) \to W_{κ} (Y, E)$ is called a κ-compatible Markov morphism for stochastic matrices when for any $(y, y^{'}) \in E$ ,

λ (P) (y, y^{'}) = P (κ (y), κ (y^{'})) Λ (y, y^{'}),

where $Λ \in F_{+} (Y, E)$ , and for any $y \in Y, x^{'} \in X$ , it holds that

(κ (y), x^{'}) \in D \Rightarrow {(Λ (y, y^{'}))}_{y^{'} \in S_{x^{'}}} \in P (S_{x^{'}}) .

The constraints on the function Λ in Definition 4.1 ensure that the objects produced by λ are stochastic matrices and are κ-lumpable. Furthermore, given the full description of P and Λ, one can directly compute the embedded λ(P), thereby satisfying A.1 and A.2. Alternatively, when given a sequence of observations ${\{X_{t}\}}_{1 \leq t \leq n} \sim P$ and without even knowing P, one can apply a random mapping ϕ_Λ on the trajectory and simulate a trajectory ${\{ϕ_{Λ} (X_{t})\}}_{1 \leq t \leq n} \sim λ (P)$ generated from the embedded chain, essentially satisfying axiom A.3. A key feature of a Markov morphism λ is that the divergence between two points and their image is unchanged [21, Lemma 3.1]. Namely, for two points $P, P^{'} \in V \subset W (X, D)$ ,

D (λ (P) ‖ λ (P^{'})) = D (P ‖ P^{'}) .
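The embedding of Definition 4.1 and the preservation of the divergence can be illustrated on a two-state chain embedded into three states. The sketch assumes that the divergence of (6) is the KL divergence rate weighted by the stationary distribution of the first argument; the list encoding of κ, the particular Λ, and all function names are hypothetical.

```python
import math


def stationary(P, iters=2000):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def divergence(P, P_other):
    n, pi = len(P), stationary(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / P_other[x][y])
               for x in range(n) for y in range(n))


def embed(P, kappa, Lam):
    """lambda(P)(y, y') = P(kappa(y), kappa(y')) * Lambda(y, y')."""
    n = len(kappa)
    return [[P[kappa[y]][kappa[yp]] * Lam[y][yp] for yp in range(n)]
            for y in range(n)]


kappa = [0, 1, 1]   # S_0 = {0}, S_1 = {1, 2}

# For each y, Lambda(y, .) restricted to each block S_x' sums to one.
Lam = [[1.0, 0.3, 0.7],
       [1.0, 0.6, 0.4],
       [1.0, 0.5, 0.5]]

P = [[0.7, 0.3], [0.4, 0.6]]
P_other = [[0.2, 0.8], [0.5, 0.5]]

lam_P = embed(P, kappa, Lam)
lam_P_other = embed(P_other, kappa, Lam)

d_original = divergence(P, P_other)
d_embedded = divergence(lam_P, lam_P_other)
```

The factor Λ cancels inside the logarithm, which is why the two divergences agree up to numerical error.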

As a consequence, the Fisher metric and affine connections are preserved [21, Lemma 3.1] (see Figure 5), in the sense that for $U_{P}, V_{P} \in T_{P} V$ ,

\begin{aligned} g_{P} (U_{P}, V_{P}) = g_{λ (P)} (λ_{*} (U_{P}), λ_{*} (V_{P})), \end{aligned}

and for any vector fields $U, V \in Γ (T V)$ ,

\begin{aligned} λ_{*} (\nabla_{U}^{(m)} V) & = \nabla_{λ_{*} (U)}^{(m)} λ_{*} (V), \\ λ_{*} (\nabla_{U}^{(e)} V) & = \nabla_{λ_{*} (U)}^{(e)} λ_{*} (V), \end{aligned}

where

λ_{*} : T_{P} V \to T_{λ (P)} λ (V)

defined by ${(λ_{*} (U_{P}))}_{λ (P)} = {(d λ)}_{P} (U_{P})$ is the pushforward map associated with the diffeomorphism λ. Furthermore, Markov morphisms are e-geodesic affine maps [21, Theorem 3.2]. Namely, for any $P_{0}, P_{1} \in W (X, D)$ ,

λ (γ_{P_{0}, P_{1}}^{(e)}) = γ_{λ (P_{0}), λ (P_{1})}^{(e)} .

FIGURE 5. Markov morphisms (Definition 4.1) preserve the Fisher metric and the pair of dual affine connections.

However, they are not m-geodesic affine, which means that generally

λ (γ_{P_{0}, P_{1}}^{(m)}) \neq γ_{λ (P_{0}), λ (P_{1})}^{(m)} .
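The formula of Definition 4.1 and the divergence invariance can be checked numerically on a toy case. The sketch below is only an illustration: the lumping map `kappa`, the weights `Lam`, and the two matrices `P` and `Q` are made-up values, and the divergence is assumed to be the divergence rate $D (P ‖ P^{'}) = \sum_{x, x^{'}} π (x) P (x, x^{'}) \log (P (x, x^{'}) / P^{'} (x, x^{'}))$ with π the stationary distribution of P.

```python
import math

def stationary(P, iters=2000):
    """Stationary distribution of a positive stochastic matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

def divergence_rate(P, Q):
    """Assumed definition: sum_x pi(x) sum_x' P(x,x') log(P(x,x') / Q(x,x'))."""
    pi = stationary(P)
    n = len(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / Q[x][y])
               for x in range(n) for y in range(n) if P[x][y] > 0)

def embed(P, kappa, Lam):
    """lambda(P)(y, y') = P(kappa(y), kappa(y')) * Lambda(y, y')."""
    m = len(kappa)
    return [[P[kappa[y]][kappa[z]] * Lam[y][z] for z in range(m)] for y in range(m)]

def block_sums(M, kappa, n):
    """For each row y, the mass sent to each block S_x' = kappa^{-1}(x')."""
    m = len(kappa)
    return [[sum(M[y][z] for z in range(m) if kappa[z] == x) for x in range(n)]
            for y in range(m)]

# Hypothetical example: Y = {0, 1, 2} lumps onto X = {0, 1} with S_0 = {0, 1}, S_1 = {2}.
kappa = [0, 0, 1]
# Each row of Lam restricted to a block S_x' is a probability vector on that block.
Lam = [[0.3, 0.7, 1.0],
       [0.6, 0.4, 1.0],
       [0.5, 0.5, 1.0]]
P = [[0.2, 0.8], [0.6, 0.4]]
Q = [[0.5, 0.5], [0.3, 0.7]]

lam_P, lam_Q = embed(P, kappa, Lam), embed(Q, kappa, Lam)
print(block_sums(lam_P, kappa, 2))   # rows in the same block share their block sums
print(divergence_rate(P, Q), divergence_rate(lam_P, lam_Q))
```

In this example the block sums of row y reproduce the row κ(y) of P, which is the κ-lumpability of λ(P), and the two printed divergences agree up to floating-point error.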

A more restricted class of embeddings, termed memoryless embeddings, preserve m-geodesics [21, Lemma 3.6], whereas e-geodesics are even preserved by the more general class of exponential embeddings [21, Theorem 3.2]. The concept of lumpability is easily extended to bivariate functions [21, Definition 3.3].

Definition 4.2. (κ-lumpable function). $f \in F (Y, E)$ is a κ-lumpable function if and only if for any $x, x^{'} \in X$ and for any $y_{1}, y_{2} \in S_{x}$ , it holds that

f (y_{1}, S_{x^{'}}) = f (y_{2}, S_{x^{'}}) .

The set of all κ-lumpable functions is denoted as $F_{κ} (Y, E)$ .

Lumpable functions $F_{κ} (Y, E)$ form a vector space of dimension $|E| + |D| - \sum_{(x, x^{'}) \in D} |S_{x}|$ [21, Lemma 3.3].

Definition 4.3. (Linear congruent embedding). A linear map $ϕ : F (X, D) \to F_{κ} (Y, E)$ is called a κ-congruent embedding when it is a right inverse of κ and satisfies the two following monotonicity conditions. For any lumpable function $f \in F (X, D)$ ,

\begin{aligned} f \geq 0 & \Rightarrow ϕ (f) \geq 0, \\ f > 0 & \Rightarrow ϕ (f) > 0 . \end{aligned}

Theorem 4.6. (Characterization of Markov morphisms as congruent linear embeddings). Let $ϕ : W (X, D) \to W_{κ} (Y, E)$ . The two following statements are equivalent:

(i) ϕ is a κ-congruent linear embedding.

(ii) ϕ is a κ-compatible Markov morphism.

Theorem 4.6 is the counterpart of a similar fact for finite measure spaces in the distribution setting, which can be found in Ay et al. [6, Example 5.2].

As Markov morphisms and linear congruent embeddings can be identified, it will be convenient to refer to them simply as Markov embeddings. We proceed to give two examples of embeddings.

4.3.2.1 Hudson expansions

Let ${\{X_{t}\}}_{t \in N}$ be a Markov chain with transition matrix $\bar{P} \in W (X, X^{2})$ . The stochastic process ${\{(X_{t}, X_{t + 1})\}}_{t \in N}$ also forms a Markov chain on state space $X^{2}$ . Considered by Kemeny and Snell [50] to be the natural reverse operation of lumping, the Hudson [21, 50] expansion can be expressed as a Markov embedding [21, Theorem 3.4]. In particular, this yields an example of an embedding that is not m-geodesically convex [21, Lemma 3.4].

4.3.2.2 Symmetrization embedding for grained reversible stochastic matrices

Suppose we are given a stochastic matrix $\bar{P} \in W_{\mathrm{rev}} ([n], D)$ with stationary distribution $\bar{π} (x) = p (x) / m$ for $p \in N^{n}$ and $m \in N$ . The embedding $λ : W (X, D) \to W_{κ} (Y, E)$ constructed [21, Corollary 3.2] by

\begin{aligned} κ (j) & = \operatorname*{arg\,min}_{i \in [n]} \{\sum_{k = 1}^{i} p (k) \geq j\}, \quad j \in [m], \\ Λ (j, j^{'}) & = \frac{δ [(κ (j), κ (j^{'})) \in D]}{p (κ (j^{'}))}, \end{aligned}

is such that $λ (\bar{P}) \in W_{\mathrm{sym}} ([m], E)$ , with $E = \{(j, j^{'}) \in {[m]}^{2} : (κ (j), κ (j^{'})) \in D\}$ . The constructed embedding is memoryless, thus m-geodesically affine. This approach can be used to reduce inference problems in Markov chains from a reversible to a symmetric setting [54].
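As a small sanity check of this construction, the following sketch builds κ and λ(P̄) for a two-state reversible chain with stationary distribution (1/3, 2/3), so that p = (1, 2) and m = 3. The function name `symmetrize` and the example matrix are illustrative choices, and states are indexed from 0 rather than 1.

```python
def symmetrize(P_bar, p):
    """Symmetrization embedding for a reversible P_bar with pi(x) = p[x] / sum(p).

    State x is split into p[x] copies; the entry towards a copy of x' is
    P_bar(x, x') / p(x'), and zero entries of P_bar stay zero.
    """
    n = len(p)
    kappa = [x for x in range(n) for _ in range(p[x])]   # block S_x has p[x] elements
    m = len(kappa)
    M = [[P_bar[kappa[j]][kappa[k]] / p[kappa[k]] for k in range(m)] for j in range(m)]
    return M, kappa

# Hypothetical reversible chain with stationary distribution (1/3, 2/3).
P_bar = [[0.2, 0.8],
         [0.4, 0.6]]
p = [1, 2]

M, kappa = symmetrize(P_bar, p)
for row in M:
    print(row)
```

Each row of the resulting 3 × 3 matrix sums to one, and the matrix equals its transpose because detailed balance gives $p (x) \bar{P} (x, x^{'}) = p (x^{'}) \bar{P} (x^{'}, x)$ .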

4.3.3 The foliated manifold of lumpable stochastic matrices

There is generally no left inverse for a lumping map κ. However, for any κ-lumpable $P_{0} \in W_{κ} (Y, E)$ , there always exists a Markov morphism $λ^{(P_{0})} : W (X, D) \to W_{κ} (Y, E)$ , termed canonical embedding [21, Lemma 3.2], such that

P_{0} = (λ^{(P_{0})} ◦ κ) (P_{0}) . (27)

For fixed ${\bar{P}}_{0} \in W (X, D)$ and $P_{0} \in W (Y, E)$ , it is interesting to introduce the two following submanifolds:

\begin{aligned} L ({\bar{P}}_{0}) & ≜ κ^{- 1} (\{{\bar{P}}_{0}\}), \\ J (P_{0}) & ≜ λ^{(P_{0})} (W (X, D)) . \end{aligned}

Less tersely, $L ({\bar{P}}_{0})$ corresponds to the set of stochastic matrices that lump into ${\bar{P}}_{0}$ , whereas $J (P_{0})$ is the image of the entire set $W (X, D)$ by the canonical embedding (27) associated with $P_{0}$ . It can be shown [21, Lemma 5.1] that $L ({\bar{P}}_{0})$ and $J (P_{0})$ , respectively, form an m-family and an e-family in $W (Y, E)$ , of dimensions

\begin{aligned} \dim L ({\bar{P}}_{0}) & = |E| - \sum_{(x, x^{'}) \in D} |S_{x}|, \\ \dim J (P_{0}) & = |D| - |X| . \end{aligned}

It is not hard to show that the submanifold $W_{κ} (Y, E)$ of $W (Y, E)$ is generally not autoparallel with respect to either the e-connection or the m-connection. Perhaps surprisingly, it is nevertheless possible to construct a mutually dual foliated structure on $W_{κ} (Y, E)$ (see Figure 6).

FIGURE 6. Mutually dual foliated structure on $W_{κ} (Y, E)$ .

Theorem 4.7. ([21, Theorem 5.1]). Let ${\bar{P}}_{0} \in W (X, D)$ . Then,

\begin{aligned} W_{κ} (Y, E) & = \bigcup_{P_{0} \in L ({\bar{P}}_{0})} J (P_{0}), \\ \forall P_{0}, P_{0}^{'} \in L ({\bar{P}}_{0}), \; P_{0} \neq P_{0}^{'} & \Rightarrow J (P_{0}) \cap J (P_{0}^{'}) = \emptyset, \\ \dim W_{κ} (Y, E) & = |E| - \sum_{(x, x^{'}) \in D} |S_{x}| + |D| - |X| . \end{aligned}

The following Pythagorean identity [21, Theorem 5.2] follows as a direct application of Theorem 4.7. For ${\bar{P}}_{0} \in W (X, D)$ , $P_{0}, P_{0}^{'} \in L ({\bar{P}}_{0})$ , and $P^{'} \in J (P_{0})$ ,

D (P_{0}^{'} ‖ P^{'}) = D (P_{0}^{'} ‖ P_{0}) + D (P_{0} ‖ P^{'}),

and $P_{0}$ is both the e-projection of $P^{'}$ onto $L ({\bar{P}}_{0})$ and the m-projection of $P_{0}^{'}$ onto $J (P_{0})$ (see Figure 6).

4.4 Tree models

For a finite alphabet $Y$ , let $Y^{*} = \{ϵ\} \cup Y \cup Y^{2} \cup \dots$ be the set of all finite length sequences on $Y$ , where ϵ is the null string. For a string $y_{1}^{n} = (y_{1}, \dots, y_{n})$ , strings $y_{1}^{n}, y_{2}^{n}, \dots, y_{n - 1}^{n}, y_{n}$ and ϵ are called postfixes of $y_{1}^{n}$ . A finite subset $T \subset Y^{*}$ is termed a tree if all postfixes of any element of T belong to T. An element of T is termed a leaf if it is not a postfix of any other element of T. The set of all leaves of T is denoted by ∂T.

For a string $s \in Y^{*}$ , let γ(s) be the element of ∂T that matches a postfix of s, if it exists. We refer to γ(s) as the context of the string s, and $|s|$ denotes the length of the string s. When $| s | \geq \max_{s^{'} \in \partial T} | s^{'} |$ , γ(s) is uniquely defined.

Definition 4.4. (Tree model). For a given tree T and

k = \max_{s^{'} \in \partial T} | s^{'} |, (28)

let us consider the set $W (Y^{k}, E)$ of kth order Markov transition matrices, where

E = \{((y_{1}, \dots, y_{k}), (y_{1}^{'}, \dots, y_{k}^{'})) : y_{i} = y_{i - 1}^{'} \forall i = 2, \dots, k\} . (29)

The tree model induced by the tree T is

M_{T} ≔ \{P \in W (Y^{k}, E) : \forall y^{k}, {\tilde{y}}^{k} \in Y^{k}, γ (y^{k}) = γ ({\tilde{y}}^{k}) \Rightarrow P (y^{k}, \cdot) = P ({\tilde{y}}^{k}, \cdot)\} . (30)

The tree model is a well-studied model of Markov sources in the context of data compression [55, 56], and it can be categorized based on the structure of the underlying tree as follows:

Definition 4.5. (Finite State Machine X (FSMX) model). For a tree model $M_{T}$ induced by tree T, if ∂T satisfies the condition that γ(sy) is defined for all $(s, y) \in \partial T \times Y$ (this means that sy is not an internal node of T for every $(s, y) \in \partial T \times Y$ ), then the tree model $M_{T}$ is referred to as an FSMX model. If a tree model is not FSMX, it is referred to as non-FSMX (see Figure 7).

FIGURE 7. Example of an FSMX tree (left) and a non-FSMX tree (right).

Theorem 4.8. ([25, 57]). A tree model $M_{T}$ is an e-family if and only if it is an FSMX model.
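The combinatorial notions of this section (postfix-closed trees, leaves, the context map γ, and the FSMX condition of Definition 4.5) are easy to check mechanically. The sketch below represents strings over the binary alphabet as Python strings, with the most recent symbol last. The two trees are hypothetical examples chosen for illustration and are not claimed to be those of Figure 7.

```python
def is_tree(T):
    """A finite set of strings is a tree if it contains every postfix of its elements."""
    return all(s[i:] in T for s in T for i in range(len(s) + 1))

def leaves(T):
    """Leaves: elements that are not a postfix of any other element of T."""
    return {s for s in T if not any(t != s and t.endswith(s) for t in T)}

def context(s, boundary):
    """gamma(s): the leaf matching a postfix of s, or None if there is none."""
    for leaf in boundary:
        if s.endswith(leaf):
            return leaf
    return None

def is_fsmx(T, alphabet):
    """FSMX condition: gamma(s + y) is defined for every leaf s and symbol y."""
    boundary = leaves(T)
    return all(context(s + y, boundary) is not None
               for s in boundary for y in alphabet)

alphabet = "01"
T_fsmx = {"", "0", "1", "00", "10"}                    # leaves: 1, 00, 10
T_non_fsmx = {"", "0", "1", "01", "11", "001", "101"}  # leaves: 0, 11, 001, 101

print(sorted(leaves(T_fsmx)), is_fsmx(T_fsmx, alphabet))
print(sorted(leaves(T_non_fsmx)), is_fsmx(T_non_fsmx, alphabet))
```

In the second tree, appending the symbol 1 to the leaf 0 yields the internal node 01, for which no context is defined, so the FSMX condition fails.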

5 Applications

In this section, we give details of some application domains of the geometric perspective.

5.1 Maximum entropy principle

Recall that the maximum entropy probability distribution over a fixed alphabet $X$ is uniform. In the Markovian setting, for a fixed fully connected digraph $(X, E)$ , the stochastic matrix $U \in W (X, E)$ that maximizes the entropy rate H defined in (4) [58–61] is given by $s (δ_{E})$ , where $δ_{E} : X^{2} \to \{0,1\}$ is defined by $δ_{E} (x, x^{'}) = δ [(x, x^{'}) \in E]$ , and where $s$ is the stochastic rescaling map introduced in (12). Let $L \subset W (X, E)$ be an m-family of stochastic matrices. One can express $L$ as a polytope generated by a set of linear constraints:

L = \{P \in W (X, E) : \forall i \in [d], \sum_{(x, x^{'}) \in E} Q (x, x^{'}) g_{i} (x, x^{'}) = c_{i}\} .

It is known [23] that the e-projection (Section 3.3) of an arbitrary $P \in W (X, E)$ onto $L$ belongs to an e-family. Namely, for $ξ \in R^{d}$ , let

P_{ξ} = s ({\tilde{P}}_{ξ}), \quad {\tilde{P}}_{ξ} = P \circ \exp (\sum_{i \in [d]} ξ^{i} g_{i}),

and write ψ(ξ) for the logarithm of the PF root of ${\tilde{P}}_{ξ}$ . By the Lagrange multiplier method, the solution to the minimization problem is readily obtained to be at $ξ^{*} = \operatorname*{arg\,max}_{ξ \in R^{d}} \{⟨ ξ, c ⟩ - ψ (ξ)\}$ . By rewriting,

\operatorname*{arg\,min}_{\bar{P} \in L} D (\bar{P} ‖ P) = \operatorname*{arg\,max}_{\bar{P} \in L} \{H (\bar{P}) + E_{(X, X^{'}) \sim \bar{Q}} \log P (X, X^{'})\},

and observing that for $P = U$ , the maxentropic chain, the term $E_{(X, X^{'}) \sim \bar{Q}} \log U (X, X^{'})$ depends only on the graph $(X, E)$ ⁶ , we obtain that

\operatorname*{arg\,min}_{\bar{P} \in L} D (\bar{P} ‖ U) = \operatorname*{arg\,max}_{\bar{P} \in L} H (\bar{P}) .

In other words, the e-projection onto $L$ follows the principle of maximum entropy.
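To make the maxentropic chain $U = s (δ_{E})$ concrete, the following sketch computes it for the "golden mean" graph on two states, where the transition from the second state to itself is forbidden. It assumes that the rescaling map of (12) has the form $s (F) (x, x^{'}) = F (x, x^{'}) v (x^{'}) / (ρ v (x))$ , with ρ and v the PF root and right PF eigenvector of F; the power-iteration helpers and the comparison chain `other` are illustrative choices.

```python
import math

def perron(F, iters=2000):
    """PF root and right eigenvector of a primitive nonnegative matrix (power iteration)."""
    n = len(F)
    v = [1.0 / n] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(F[x][y] * v[y] for y in range(n)) for x in range(n)]
        rho = sum(w)                  # v sums to one, so sum(F v) converges to rho
        v = [wx / rho for wx in w]
    return rho, v

def rescale(F):
    """Assumed form of s: s(F)(x, x') = F(x, x') v(x') / (rho v(x))."""
    rho, v = perron(F)
    n = len(F)
    return [[F[x][y] * v[y] / (rho * v[x]) for y in range(n)] for x in range(n)]

def stationary(P, iters=2000):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

def entropy_rate(P):
    """H(P) = - sum_x pi(x) sum_x' P(x, x') log P(x, x')."""
    pi = stationary(P)
    n = len(P)
    return -sum(pi[x] * P[x][y] * math.log(P[x][y])
                for x in range(n) for y in range(n) if P[x][y] > 0)

delta_E = [[1.0, 1.0],
           [1.0, 0.0]]               # edge indicator of the golden mean graph
rho, _ = perron(delta_E)
U = rescale(delta_E)
other = [[0.5, 0.5],
         [1.0, 0.0]]                 # another chain supported on the same graph

print(entropy_rate(U), math.log(rho), entropy_rate(other))
```

For this graph the entropy rate of U equals the logarithm of the PF root of $δ_{E}$ , here the golden ratio, and the other chain on the same edge set has a strictly smaller entropy rate.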

5.2 Large deviation theory

Large deviation theory studies the probabilities of rare events or fluctuations in stochastic systems, where the likelihood of these events is exponentially small in the system parameters. In this context, we provide a concise overview of the classical asymptotic results and offer references to recent developments of finite sample upper bounds for the probability of large deviations. For a Markov chain $X_{1}, \dots, X_{n}$ started from an initial distribution μ and with transition matrix P, a function $f : X \to R$ , and some $η \geq E_{π} f$ , we are interested in the rate of decay of the following probability:

P_{μ} (\frac{1}{n} \sum_{t = 1}^{n} f (X_{t}) \geq η) .

Similar in spirit to the heart of the approach taken in the iid setting, we proceed with an exponential change of measure (also known as tilting or twisting) of P and define for $θ \in R$ ,

{\tilde{P}}_{θ} (x, x^{'}) = P (x, x^{'}) e^{θ f (x^{'})} .

We denote by $ρ_{θ}$ the Perron–Frobenius root of the matrix ${\tilde{P}}_{θ}$ , by $ψ (θ) = \log ρ_{θ}$ its logarithm, and by $v_{θ}$ its associated right eigenvector. We then define $P_{θ} = s ({\tilde{P}}_{θ}) \in W (X, E)$ and note that ${\{P_{θ}\}}_{θ \in R}$ is a one-dimensional exponential family of transition matrices generated by f.

5.2.1 Asymptotic theory

The large deviation rate is given by the convex conjugate (Fenchel–Legendre dual) of the log-Perron–Frobenius eigenvalue of the matrix ${\tilde{P}}_{θ}$ .

Theorem 5.1. ([64, Theorem 3.1.2]). For $η \geq E_{π} f$ ,

\lim_{n \to \infty} - \frac{1}{n} \log P_{μ} (\frac{1}{n} \sum_{t = 1}^{n} f (X_{t}) \geq η) = R^{*} (η) = \sup_{θ \in R} \{θ η - ψ (θ)\} .
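The rate function of Theorem 5.1 can be evaluated numerically by tilting P, taking the logarithm of the PF root, and maximizing the concave map $θ \mapsto θ η - ψ (θ)$ . The sketch below is one possible implementation under simplifying assumptions: the PF root is obtained by power iteration, and since $η \geq E_{π} f$ the supremum is searched by ternary search over the bounded interval [0, `theta_max`], where `theta_max` is an arbitrary cap chosen for the example.

```python
import math

def perron_root(F, iters=500):
    """PF root of a positive matrix by power iteration."""
    n = len(F)
    v = [1.0 / n] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(F[x][y] * v[y] for y in range(n)) for x in range(n)]
        rho = sum(w)
        v = [wx / rho for wx in w]
    return rho

def psi(P, f, theta):
    """Log PF root of the tilted matrix P(x, x') exp(theta f(x'))."""
    n = len(P)
    tilted = [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]
    return math.log(perron_root(tilted))

def rate(P, f, eta, theta_max=30.0, steps=100):
    """sup over theta in [0, theta_max] of theta * eta - psi(theta), by ternary search."""
    def objective(theta):
        return theta * eta - psi(P, f, theta)
    lo, hi = 0.0, theta_max
    for _ in range(steps):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    theta_star = (lo + hi) / 2
    return objective(theta_star), theta_star

f = [0.0, 1.0]
P_iid = [[0.5, 0.5], [0.5, 0.5]]        # identical rows: an iid Bernoulli(1/2) sequence
P_markov = [[0.9, 0.1], [0.2, 0.8]]     # stationary mean of f is 1/3

R_iid, theta_iid = rate(P_iid, f, 0.75)
R_markov, theta_markov = rate(P_markov, f, 0.5)
print(R_iid, theta_iid)
print(R_markov, theta_markov)
```

When the rows of P are identical the chain is iid, and the computed rate reduces to the Kullback–Leibler divergence between Bernoulli(η) and Bernoulli(1/2), which gives a convenient consistency check.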

Theorem 5.2. ([75, Theorem 6.3]). When

\sup_{θ \in R} \{θ η - ψ (θ)\} = R^{*} (η),

is achieved for $θ = θ^{*}$ , as $n \to \infty$ ,

P_{μ} (\frac{1}{n} \sum_{t = 1}^{n} f (X_{t}) \geq η) \sim \frac{E_{X \sim μ} [v_{θ^{*}} (X)]}{θ^{*} \sqrt{2 π n σ_{θ^{*}}^{2}}} e^{- n R^{*} (η)},

where $σ_{θ^{*}}^{2} = \partial^{2} ψ (θ) |_{θ = θ^{*}}$ is the asymptotic variance ⁷ of f, and $v_{θ^{*}}$ is the right Perron–Frobenius eigenvector of ${\tilde{P}}_{θ^{*}}$ .

5.2.2 Finite sample theory

Moulos and Anantharam [62] achieved the most recent and tightest result. They established a finite sample bound with a prefactor that does not depend on the deviation η, which holds for a large class of Markov chains, surpassing the earlier results [17, 63, 64].

Theorem 5.3. ([62, Theorem 1]). Let $P \in W (X, X^{2})$ , with stationary distribution π and a function $f : X \to R$ . Then, for $η \geq E_{π} f$ ,

P (\frac{1}{n} \sum_{t = 1}^{n} f (X_{t}) \geq η) \leq C (P, f) e^{- n R^{*} (η)},

with

C (P, f) ≜ \max_{x, x^{'}, x^{″} \in X} \frac{P (x, x^{'})}{P (x, x^{″})} .

Lastly, the subsequent uniform multiplicative ergodic theorem is known to hold.

Theorem 5.4. ([62, Theorem 3]). For $P \in W (X, X^{2})$ and $f : X \to R$ ,

\sup_{θ \in R} |ψ_{n} (θ) - ψ (θ)| \leq \frac{\log C (P, f)}{n},

where $ψ_{n}$ is the scaled log-moment-generating function,

ψ_{n} (θ) ≜ \frac{1}{n} \log E_{μ} [\exp (θ \sum_{t = 1}^{n} f (X_{t}))],

and C(P, f) is the constant defined in Theorem 5.3.

For a more detailed exposition of the aforementioned results in a broader context, please refer to [62].

5.2.3 Timeline

5.3 Parameter estimation

Let $g : X^{2} \to R$ , and suppose we wish to estimate $E_{(X, X^{'}) \sim Q} [g (X, X^{'})]$ from one trajectory $X_{1}, \dots, X_{n}$ of a stationary Markov chain with transition matrix $P \in W (X, E)$ and stationary distribution $π \in P_{+} (X)$ . An important special case is when there exists $f \in R^{X}$ such that for any $x, x^{'} \in X$ , $g (x, x^{'}) = f (x^{'})$ . Then, the quantity of interest is simply $E_{π} f$ . The sample mean evaluated on a stationary Markov trajectory $X_{1}, \dots, X_{n}$ is defined by

{\hat{f}}_{n} (X_{1}, \dots, X_{n}) = \frac{1}{n} \sum_{t = 1}^{n} f (X_{t}) .

The statistical behavior of ${\hat{f}}_{n}$ is of particular interest for the topic of Markov Chain Monte Carlo methods. By using the strong law of large numbers, the almost sure convergence to the true expectation holds:

{\hat{f}}_{n} (X_{1}, \dots, X_{n}) \overset{\mathrm{a.s.}}{\to} E_{π} f (X_{1}) .

Furthermore, defining the asymptotic variance of f as

σ_{\infty}^{2} (f) ≜ \lim_{m \to \infty} \operatorname{Var} [\frac{1}{\sqrt{m}} \sum_{t = 1}^{m} f (X_{t})], (31)

the following Markov chain version of the central limit theorem [65] holds, with convergence in distribution:

\sqrt{n} ({\hat{f}}_{n} (X_{1}, \dots, X_{n}) - E_{π} f) \overset{d}{\to} N (0, σ_{\infty}^{2} (f)) .

Although asymptotic analysis may be of mathematical interest, for modern tasks, it is crucial to have a finite sample theory that explains the behavior of the sample mean. With regard to the original bivariate function problem, the sample mean for a sliding window of pairs of observations can be defined as follows:

{\hat{g}}_{n} (X_{1}, \dots, X_{n}) ≜ \frac{1}{n - 1} \sum_{t = 1}^{n - 1} g (X_{t}, X_{t + 1}) .

One can construct by exponential tilting the following one-dimensional parametric family of transition matrices:

V_{e} = \{P_{θ} (x, x^{'}) = P (x, x^{'}) \exp (θ g (x, x^{'}) + R_{θ} (x^{'}) - R_{θ} (x) - ψ (θ)) : θ \in R\},

where $R_{θ}$ and ψ are fixed using the PF theory (see Section 3.2). Essentially, $V_{e}$ is a one-dimensional e-family of transition matrices, and for θ = 0, the original P is recovered. At any natural parameter $θ \in R$ , the quantity of interest $E_{(X, X^{'}) \sim Q_{θ}} [g (X, X^{'})]$ is the expectation parameter η(θ) of $P_{θ}$ . Recall from (18) that the Fisher information at coordinates θ can be expressed as the second derivative of the potential function, that is, $\partial^{2} ψ (θ) = g (θ)$ , where $g (θ)$ denotes the Fisher metric and not the function g. There exists [15, Lemma 6.2] a constant $C \in R$ such that

\frac{1}{n} g (0) {(1 - \frac{C}{\sqrt{n}})}^{2} \leq \operatorname{Var} [{\hat{g}}_{n} (X_{1}, \dots, X_{n})] \leq \frac{1}{n} g (0) {(1 + \frac{C}{\sqrt{n}})}^{2} .

Defining the asymptotic variance for the bivariate g as

σ_{\infty}^{2} (g) ≜ \lim_{m \to \infty} \operatorname{Var} [\frac{1}{\sqrt{m}} \sum_{t = 1}^{m - 1} g (X_{t}, X_{t + 1})],

it follows that

σ_{\infty}^{2} (g) = g (0) .

Note that it coincides with the reciprocal of the Fisher information with respect to the expectation parameter; see Eq. 18. Essentially, this establishes that the sample mean evaluated on pairs of observations ${\hat{g}}_{n} (X_{1}, \dots, X_{n})$ is asymptotically efficient; it attains the Markov counterpart of the Cramér–Rao lower bound. Similar results for the multi-parametric case, non-stationary case, and curved exponential families are obtained in [15].
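The identity between the asymptotic variance and the second derivative of the potential at θ = 0 can be checked numerically on a two-state chain. In the sketch below, ψ is the log PF root of the tilted matrix $P (x, x^{'}) \exp (θ g (x, x^{'}))$ , its second derivative is approximated by a central finite difference with a hand-picked step `h`, and the result is compared with the classical closed form for the asymptotic variance of the occupation frequency of one state of a two-state chain. The chain parameters `a` and `b` are arbitrary example values.

```python
import math

def perron_root(F, iters=2000):
    """PF root of a positive matrix by power iteration."""
    n = len(F)
    v = [1.0 / n] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(F[x][y] * v[y] for y in range(n)) for x in range(n)]
        rho = sum(w)
        v = [wx / rho for wx in w]
    return rho

def psi(P, g, theta):
    """Potential function: log PF root of P(x, x') exp(theta g(x, x'))."""
    n = len(P)
    tilted = [[P[x][y] * math.exp(theta * g[x][y]) for y in range(n)] for x in range(n)]
    return math.log(perron_root(tilted))

def second_derivative_at_zero(P, g, h=1e-3):
    """Central second difference of psi at theta = 0 (numerical approximation)."""
    return (psi(P, g, h) - 2.0 * psi(P, g, 0.0) + psi(P, g, -h)) / h ** 2

a, b = 0.1, 0.2
P = [[1 - a, a],
     [b, 1 - b]]
g = [[0.0, 1.0],
     [0.0, 1.0]]          # g(x, x') = 1 when x' is the second state

fisher = second_derivative_at_zero(P, g)
# Closed form pi_0 pi_1 (1 + lam) / (1 - lam) with lam = 1 - a - b.
closed_form = a * b * (2 - a - b) / (a + b) ** 3
print(fisher, closed_form)
```

Both quantities are close to 1.2593 for these parameters, whereas the stationary variance of g alone is 2/9, which illustrates how the correlations along the trajectory inflate the variance of the sample mean.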

5.4 Hypothesis testing

We let $P_{0}, P_{1} \in W (X, E)$ be two irreducible stochastic matrices with respective stationary distributions $π_{0}$ and $π_{1}$ . We call $P_{0}$ the null hypothesis and $P_{1}$ the alternative hypothesis. We observe a trajectory $X_{0}, X_{1}, \dots, X_{n}$ sampled from an unknown Markov chain (either $P_{0}$ or $P_{1}$ ). A randomized test function is defined by

\begin{aligned} T_{n} : X^{n} & \to [0,1] \\ (x_{0}, x_{1}, \dots, x_{n}) & \mapsto T_{n} (x_{0}, x_{1}, \dots, x_{n}) . \end{aligned}

We interpret $T_{n}$ as the probability of rejecting the null hypothesis ⁸ under a random experiment [76, p.58]. In particular, if the range of $T_{n}$ is $\{0,1\}$ , the randomized test becomes deterministic. The set of all test functions will be denoted by

T_{n} (X) ≜ \{T_{n} : X^{n} \to [0,1]\} .

We write $P_{0}, P_{1}, E_{0}, E_{1}$ to denote probability statements and expectations under the null and alternative hypotheses. We define the error probability of the first kind α (also known as the size of the test, type I error, or significance) and second kind β (type II error), respectively, as follows:

\begin{aligned} α (T_{n}) & ≜ E_{0} [T_{n} (X_{0}, \dots, X_{n})] \\ β (T_{n}) & ≜ E_{1} [1 - T_{n} (X_{0}, \dots, X_{n})] . \end{aligned}

Then, 1 − β is called the power of the test. Fixing $\bar{α} \in R^{+}$ , we define the most powerful test to be the test function $T_{n}^{*}$ that maximizes the power under the size constraint $α (T_{n}) \leq \bar{α}$ :

(i) $α (T_{n}^{*}) \leq \bar{α}$ .

(ii) $β (T_{n}^{*}) \leq β (T_{n})$ for any $T_{n} \in T_{n} (X)$ .

The Neyman–Pearson lemma asserts that a most powerful test exists and can be realized as a likelihood ratio test.

Lemma 5.1. [78]. There exist $T_{n}^{*} \in T_{n} (X)$ and $η \in R^{+}$ such that

(i)

(a) $α (T_{n}^{*}) = \bar{α}$ .

(b) $T_{n}^{*} (x_{0}, x_{1}, \dots, x_{n}) = \begin{cases} 1 & \text{if } \frac{P_{1} (X_{0} = x_{0}, \dots, X_{n} = x_{n})}{P_{0} (X_{0} = x_{0}, \dots, X_{n} = x_{n})} > η, \\ 0 & \text{if } \frac{P_{1} (X_{0} = x_{0}, \dots, X_{n} = x_{n})}{P_{0} (X_{0} = x_{0}, \dots, X_{n} = x_{n})} \leq η . \end{cases}$

(ii) If $T_{n} \in T_{n} (X)$ satisfies (a) and (b) for $η \in R^{+}$ , then $T_{n}$ is most powerful at level $\bar{α}$ .

If we ignore the effect of the initial distribution, which is negligible asymptotically, the Neyman–Pearson test accepts the null hypothesis if

\frac{1}{n} \sum_{t = 1}^{n - 1} \log \frac{P_{0} (x_{t}, x_{t + 1})}{P_{1} (x_{t}, x_{t + 1})} \geq η

for a threshold η and observation $(x_{1}, \dots, x_{n})$ . Employing the large deviation bound (e.g., [17, Section 8]), we can evaluate the Neyman–Pearson test’s performance in terms of rare events as follows:

\begin{aligned} \lim_{n \to \infty} - \frac{1}{n} \log P_{0} (\frac{1}{n} \sum_{t = 1}^{n - 1} \log \frac{P_{0} (X_{t}, X_{t + 1})}{P_{1} (X_{t}, X_{t + 1})} \leq η) & = D (P_{θ (η)} ‖ P_{0}), \\ \lim_{n \to \infty} - \frac{1}{n} \log P_{1} (\frac{1}{n} \sum_{t = 1}^{n - 1} \log \frac{P_{0} (X_{t}, X_{t + 1})}{P_{1} (X_{t}, X_{t + 1})} > η) & = D (P_{θ (η)} ‖ P_{1}), \end{aligned}

where

V_{e} ≜ \{P_{θ} (x, x^{'}) ≔ s [\exp (θ \log P_{0} (x, x^{'}) + (1 - θ) \log P_{1} (x, x^{'}))] : θ \in R\}

is the exponential family passing through $P_{0}$ and $P_{1}$ (see Figure 8), and $P_{θ (η)} \in V_{e}$ is the intersection of $V_{e}$ and the mixture family $V_{m}$ given by

V_{m} ≜ \{P \in W (X, E) : \sum_{(x, x^{'}) \in E} P (x, x^{'}) \log \frac{P_{0} (x, x^{'})}{P_{1} (x, x^{'})} = η\} .

FIGURE 8. Geometric interpretation of the Neyman–Pearson test as the orthogonal bisector to the e-geodesic passing through both the null and alternative hypotheses.

Note that the e-family $V_{e}$ and the m-family $V_{m}$ are orthogonal in that the Pythagorean identity holds

D (P ‖ P_{θ}) = D (P ‖ P_{θ (η)}) + D (P_{θ (η)} ‖ P_{θ}),

for any $P \in V_{m}$ and $P_{θ} \in V_{e}$ . The Neyman–Pearson test can be understood as a method that bisects the space $W (X, E)$ by means of an m-family, which is perpendicular to the e-family that links the two hypotheses. For a given $0 < r < D (P_{0} ‖ P_{1})$ , if we set the threshold $η = η (r)$ so that $D (P_{θ (η (r))} ‖ P_{0}) = r$ , the Neyman–Pearson test attains the exponential trade-off:

\begin{aligned} \lim_{n \to \infty} - \frac{1}{n} \log P_{0} (\frac{1}{n} \sum_{t = 1}^{n - 1} \log \frac{P_{0} (X_{t}, X_{t + 1})}{P_{1} (X_{t}, X_{t + 1})} \leq η (r)) & = r, \\ \lim_{n \to \infty} - \frac{1}{n} \log P_{1} (\frac{1}{n} \sum_{t = 1}^{n - 1} \log \frac{P_{0} (X_{t}, X_{t + 1})}{P_{1} (X_{t}, X_{t + 1})} > η (r)) & = D (P_{θ (η (r))} ‖ P_{1}) . \end{aligned}

In fact, it can be proved that $D (P_{θ (η (r))} ‖ P_{1})$ is the optimal attainable exponent of the type II error probability among all tests whose type I error probability is less than $e^{- n r}$ . Furthermore, it also holds that

D (P_{θ (η (r))} ‖ P_{1}) = \min \{D (P ‖ P_{1}) : P \in W (X, E), D (P ‖ P_{0}) \leq r\},

and the optimal exponential trade-off between the type I and type II error probability can be attained by the so-called Hoeffding test. For a more detailed derivation of these results and finite length analysis, see [17, 19].
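The e-family $V_{e}$ joining the two hypotheses and the pair of exponents along it can be traced numerically. The sketch below is an illustration under stated assumptions: the rescaling map is taken to be $s (F) (x, x^{'}) = F (x, x^{'}) v (x^{'}) / (ρ v (x))$ with ρ and v obtained by power iteration, the divergence is the divergence rate weighted by the stationary distribution of its first argument, and the two positive matrices `P0` and `P1` are arbitrary example hypotheses.

```python
import math

def rescale(F, iters=2000):
    """Assumed form of s: F(x, x') v(x') / (rho v(x)), with (rho, v) the PF pair of F."""
    n = len(F)
    v = [1.0 / n] * n
    rho = 1.0
    for _ in range(iters):
        w = [sum(F[x][y] * v[y] for y in range(n)) for x in range(n)]
        rho = sum(w)
        v = [wx / rho for wx in w]
    return [[F[x][y] * v[y] / (rho * v[x]) for y in range(n)] for x in range(n)]

def stationary(P, iters=2000):
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

def divergence_rate(P, Q):
    pi = stationary(P)
    n = len(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / Q[x][y])
               for x in range(n) for y in range(n) if P[x][y] > 0)

def e_geodesic(P0, P1, theta):
    """P_theta = s(exp(theta log P0 + (1 - theta) log P1)), entrywise."""
    n = len(P0)
    F = [[P0[x][y] ** theta * P1[x][y] ** (1.0 - theta) for y in range(n)]
         for x in range(n)]
    return rescale(F)

P0 = [[0.9, 0.1], [0.2, 0.8]]   # null hypothesis (example values)
P1 = [[0.5, 0.5], [0.6, 0.4]]   # alternative hypothesis (example values)

for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    P_theta = e_geodesic(P0, P1, theta)
    r = divergence_rate(P_theta, P0)          # type I exponent
    exponent = divergence_rate(P_theta, P1)   # type II exponent
    print(theta, round(r, 6), round(exponent, 6))
```

With this parametrization θ = 1 recovers $P_{0}$ and θ = 0 recovers $P_{1}$ , so moving θ between the two endpoints trades one exponent against the other.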

5.4.1 Historical remarks and timeline

Binary hypothesis testing is one of the well-studied problems in information theory. The use of the Perron–Frobenius theory in this context can be traced back to the 1970s and 1980s [63, 66–68]. The geometrical interpretation of binary hypothesis testing for Markov chains was first studied in [19]. More recently, the finite length analysis of binary hypothesis testing for Markov chains was developed in [17] using tools from information geometry. Binary hypothesis testing is also well studied for quantum systems; for results on quantum systems with memory, see [69].

Author contributions

GW drafted the initial version, which was subsequently reviewed and edited by both authors. All authors contributed to the article and approved the submitted version.

Funding

GW was supported by the Special Postdoctoral Researcher Program (SPDR) of RIKEN and the Japan Society for the Promotion of Science KAKENHI under Grant 23K13024. SW was supported in part by JSPS KAKENHI under Grant 20H02144.

Acknowledgments

The authors are thankful to the referees for their numerous comments, which helped improve the quality of this manuscript, and for bringing reference [81] to their attention.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

¹As is customary in the literature, θ denotes both the coordinates of a point in context and the corresponding chart map.

²In our definition of m-family, we do not allow a redundant choice of $Q_{0}, Q_{1}, \dots, Q_{d}$ to express $V_{m}$ ; if we allow a redundant choice, Ξ need not be an open set and d need not coincide with the dimension of $V_{m}$ .

³The reason for this name will become clear in (21).

⁴When discussing geodesic convexity in this section, we only consider the section of the geodesic joining the two points, achieved for parameter t ∈ [0, 1], not the entire geodesic.

⁵For $|X| = 2$ , $W_{\mathrm{sym}} (X, X^{2})$ itself is an e-family, which is a strict submanifold of $W_{\mathrm{rev}} (X, X^{2}) = W (X, X^{2})$ .

⁶Note that log U is of the form f(x′) − f(x) + c for some function f and constant c.

⁷The fact that the second derivative of ψ(θ) coincides with the asymptotic variance was clarified in [15].

⁸Note that Nakagawa and Kanaya [19] used a different notation convention, where $T_{n}$ outputs the probability of accepting the null hypothesis.

References

1. Diaconis P, Miclo L. On characterizations of Metropolis type algorithms in continuous time. ALEA: Latin Am J Probab Math Stat (2009) 6:199–238.

2. Choi MCH, Wolfer G. Systematic approaches to generate reversiblizations of non-reversible Markov chains (2023). arXiv:2303.03650.

3. Hayashi M. Local equivalence problem in hidden Markov model. Inf Geometry (2019) 2, 1–42. doi:10.1007/s41884-019-00016-z

4. Hayashi M. Information geometry approach to parameter estimation in hidden Markov model. Bernoulli (2022) 28, 307–42. doi:10.3150/21-BEJ1344

5. Amari S-i., Nagaoka H. Methods of information geometry, 191. American Mathematical Soc. (2007).

6. Ay N, Jost J, Vân Lê H, Schwachhöfer L. Information geometry, 64. Springer (2017).

7. Nagaoka H. The exponential family of Markov chains and its information geometry. In: The proceedings of the symposium on information theory and its applications, 28-2 (2005). p. 601–604.

8. Vidyasagar M. An elementary derivation of the large deviation rate function for finite state Markov chains. Asian J Control (2014) 16:1–19. doi:10.1002/asjc.806

9. Levin DA, Peres Y, Wilmer EL. Markov chains and mixing times. second edition. American Mathematical Soc. (2009).

10. Rached Z, Alajaji F, Campbell LL. The Kullback-Leibler divergence rate between Markov sources. IEEE Trans Inf Theor (2004) 50:917–21. doi:10.1109/TIT.2004.826687

11. Eguchi S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann Stat (1983) 11:793–803. doi:10.1214/aos/1176346246

12. Eguchi S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math J (1985) 15:341–91. doi:10.32917/hmj/1206130775

13. Wolfer G, Watanabe S. Information geometry of reversible Markov chains. Inf Geometry (2021) 4:393–433. doi:10.1007/s41884-021-00061-7

14. Ito H, Amari S. Geometry of information sources. In: Proceedings of the 11th symposium on information theory and its applications. SITA ’88 (1988). p. 57–60.

15. Hayashi M, Watanabe S. Information geometry approach to parameter estimation in Markov chains. Ann Stat (2016) 44:1495–535. doi:10.1214/15-AOS1420

16. Bregman LM. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput Math Math Phys (1967) 7:200–17. doi:10.1016/0041-5553(67)90040-7

17. Watanabe S, Hayashi M. Finite-length analysis on tail probability for Markov chain and application to simple hypothesis testing. Ann Appl Probab (2017) 27:811–45. doi:10.1214/16-AAP1216

18. Matumoto T Any statistical manifold has a contrast function—On the C3-functions taking the minimum at the diagonal of the product manifold. Hiroshima Math J (1993) 23:327–32. doi:10.32917/hmj/1206128255

19. Nakagawa K, Kanaya F. On the converse theorem in statistical hypothesis testing for Markov chains. IEEE Trans Inf Theor (1993) 39:629–33. doi:10.1109/18.212294

20. Adamčík M. The information geometry of Bregman divergences and some applications in multi-expert reasoning. Entropy (2014) 16:6338–81. doi:10.3390/e16126338

21. Wolfer G, Watanabe S. Geometric aspects of data-processing of Markov chains (2022). arXiv:2203.04575.

22. Miller H. A convexity property in the theory of random variables defined on a finite Markov chain. Ann Math Stat (1961) 32:1260–70. doi:10.1214/aoms/1177704865

23. Csiszár I, Cover T, Choi B-S. Conditional limit theorems under Markov conditioning. IEEE Trans Inf Theor (1987) 33:788–801. doi:10.1109/TIT.1987.1057385

24. Takeuchi J-i., Barron AR. Asymptotically minimax regret by Bayes mixtures. In: Proceedings 1998 IEEE International Symposium on Information Theory (Cat No 98CH36252). IEEE (1998). p. 318.

25. Takeuchi J, Kawabata T. Exponential curvature of Markov models. In: Proceedings. 2007 IEEE International Symposium on Information Theory; June 2007; Nice, France. IEEE (2007). p. 2891–5.

26. Takeuchi J, Nagaoka H. On asymptotic exponential family of Markov sources and exponential family of Markov kernels (2017). [Dataset].

27. Feigin PD. Conditional exponential families and a representation theorem for asympotic inference. Ann Stat (1981) 9:597–603. doi:10.1214/aos/1176345463

28. Küchler U, Sørensen M. On exponential families of Markov processes. J Stat Plann inference (1998) 66:3–19. doi:10.1016/S0378-3758(97)00072-4

29. Hudson IL. Large sample inference for Markovian exponential families with application to branching processes with immigration. Aust J Stat (1982) 24:98–112. doi:10.1111/j.1467-842X.1982.tb00811.x

30. Stefanov VT. Explicit limit results for minimal sufficient statistics and maximum likelihood estimators in some Markov processes: Exponential families approach. Ann Stat (1995) 23:1073–101. doi:10.1214/aos/1176324699

31. Küchler U, Sørensen M. Exponential families of stochastic processes: A unifying semimartingale approach. Int Stat Review/Revue Internationale de Statistique (1989) 57:123–44. doi:10.2307/1403382

32. Sørensen M. On sequential maximum likelihood estimation for exponential families of stochastic processes. Int Stat Review/Revue Internationale de Statistique (1986) 54:191–210. doi:10.2307/1403144

33. Kelly FP. Reversibility and stochastic networks. Cambridge University Press (2011).

34. Brooks S, Gelman A, Jones G, Meng X-L. Handbook of Markov chain Monte Carlo. Chapman & Hall/CRC Press (2011).

35. Schrödinger E. Über die umkehrung der naturgesetze. Sitzungsberichte der preussischen Akademie der Wissenschaften, physikalische mathematische Klasse (1931) 8:144–53.

36. Kolmogorov A. Zur theorie der Markoffschen ketten. Mathematische Annalen (1936) 112:155–60. doi:10.1007/BF01565412

37. Kolmogorov A. Zur umkehrbarkeit der statistischen naturgesetze. Mathematische Annalen (1937) 113:766–72. doi:10.1007/BF01571664

38. Dobrushin RL, Sukhov YM, Fritz J. A.N. Kolmogorov - the founder of the theory of reversible Markov processes. Russ Math Surv (1988) 43:157–82. doi:10.1070/RM1988v043n06ABEH001985

39. Hsu D, Kontorovich A, Levin DA, Peres Y, Szepesvári C, Wolfer G. Mixing time estimation in reversible Markov chains from a single sample path. Ann Appl Probab (2019) 29:2439–80. doi:10.1214/18-AAP1457

40. Pistone G, Rogantin MP. The algebra of reversible Markov chains. Ann Inst Stat Math (2013) 65:269–93. doi:10.1007/s10463-012-0368-7

41. Diaconis P, Rolles SW. Bayesian analysis for reversible Markov chains. Ann Stat (2006) 34:1270–92. doi:10.1214/009053606000000290

42. König D. Theorie der endlichen und unendlichen Graphen: Kombinatorische Topologie der Streckenkomplexe, 16. Akademische Verlagsgesellschaft mbh (1936).

43. Birkhoff G. Three observations on linear algebra. Univ Nac Tucumán, Rev Ser A (1946) 5:147–51.

44. Von Neumann J. A certain zero-sum two-person game equivalent to the optimal assignment problem. Contrib Theor Games (1953) 2:5–12. doi:10.1515/9781400881970-002

45. Čencov NN. Statistical decision rules and optimal inference, Transl. Math. Monographs, 53. Providence-RI: Amer. Math. Soc. (1981).

46. Campbell LL. An extended Čencov characterization of the information metric. Proc Am Math Soc (1986) 98:135–41. doi:10.1090/S0002-9939-1986-0848890-5

47. Lê HV. The uniqueness of the Fisher metric as information metric. Ann Inst Stat Math (2017) 69:879–96. doi:10.1007/s10463-016-0562-0

48. Burke C, Rosenblatt M. A Markovian function of a Markov chain. Ann Math Stat (1958) 29:1112–22. doi:10.1214/aoms/1177706444

49. Rogers LC, Pitman J. Markov functions. Ann Probab (1981) 9:573–82. doi:10.1214/aop/1176994363

50. Kemeny JG, Snell JL. Markov chains, 6. New York: Springer-Verlag (1976).

51. Lebanon G. An extended Čencov-Campbell characterization of conditional information geometry. In: Proceedings of the 20th conference on Uncertainty in artificial intelligence; July 2004 (2004). p. 341–8.

52. Lebanon G. Axiomatic geometry of conditional models. IEEE Trans Inf Theor (2005) 51:1283–94. doi:10.1109/TIT.2005.844060

53. Montúfar G, Rauh J, Ay N. On the Fisher metric of conditional probability polytopes. Entropy (2014) 16:3207–33. doi:10.3390/e16063207

54. Wolfer G, Watanabe S. A geometric reduction approach for identity testing of reversible Markov chains. In: Geometric Science of Information (to appear): 6th International Conference, GSI 2023; August–September, 2023; Saint-Malo, France. Springer (2023). Proceedings 6.

55. Weinberger MJ, Rissanen J, Feder M. A universal finite memory source. IEEE Trans Inf Theor (1995) 41:643–52. doi:10.1109/18.382011

56. Willems F, Shtar’kov Y, Tjalkens T. The context tree weighting method: Basic properties. IEEE Trans Inf Theor (1995) 41:653–64. doi:10.1109/18.382012

57. Takeuchi J, Nagaoka H. Information geometry of the family of Markov kernels defined by a context tree. In: 2017 IEEE Information Theory Workshop (ITW). IEEE (2017). p. 429–33.

58. Spitzer F. A variational characterization of finite Markov chains. Ann Math Stat (1972) 43:303–7. doi:10.1214/aoms/1177692723

59. Justesen J, Hoholdt T. Maxentropic Markov chains (corresp). IEEE Trans Inf Theor (1984) 30:665–7. doi:10.1109/TIT.1984.1056939

60. Duda J. Optimal encoding on discrete lattice with translational invariant constrains using statistical algorithms (2007). arXiv preprint arXiv:0710.3861.

61. Burda Z, Duda J, Luck J-M, Waclaw B. Localization of the maximal entropy random walk. Phys Rev Lett (2009) 102:160602. doi:10.1103/PhysRevLett.102.160602

62. Moulos V, Anantharam V. Optimal Chernoff and Hoeffding bounds for finite state Markov chains (2019). arXiv preprint arXiv:1907.04467.

63. Davisson L, Longo G, Sgarro A. The error exponent for the noiseless encoding of finite ergodic Markov sources. IEEE Trans Inf Theor (1981) 27:431–8. doi:10.1109/TIT.1981.1056377

64. Dembo A, Zeitouni O. Large deviations techniques and applications. Springer (1998).

65. Jones GL. On the Markov chain central limit theorem. Probab Surv (2004) 1:299–320. doi:10.1214/154957804100000051

66. Boza LB. Asymptotically optimal tests for finite Markov chains. Ann Math Stat (1971) 42:1992–2007. doi:10.1214/aoms/1177693067

67. Vašek K. On the error exponent for ergodic Markov source. Kybernetika (1980) 16:318–29.

68. Natarajan S. Large deviations, hypotheses testing, and source coding for finite Markov chains. IEEE Trans Inf Theor (1985) 31:360–5. doi:10.1109/TIT.1985.1057036

69. Mosonyi M, Ogawa T. Two approaches to obtain the strong converse exponent of quantum hypothesis testing for general sequences of quantum states. IEEE Trans Inf Theor (2015) 61:6975–94. doi:10.1109/TIT.2015.2489259

70. Donsker MD, Varadhan SS. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun Pure Appl Math (1975) 28:1–47.

71. Ellis RS. Large deviations for a general class of random vectors. Ann Probab (1984) 12:1–12. doi:10.1214/aop/1176993370

72. Gärtner J. On large deviations from the invariant measure. Theor Probab Its Appl (1977) 22:24–39. doi:10.1137/1122003

73. Gray RM. Entropy and information theory. Springer Science & Business Media (2011).

74. Balaji S, Meyn SP. Multiplicative ergodicity and large deviations for an irreducible Markov chain. Stochastic Process their Appl (2000) 90:123–44. doi:10.1016/S0304-4149(00)00032-6

75. Kontoyiannis I, Meyn SP. Spectral theory and limit theorems for geometrically ergodic Markov processes. Ann Appl Probab (2003) 13:304–62. doi:10.1214/aoap/1042765670

76. Lehmann EL, Romano JP, Casella G. Testing statistical hypotheses, 3. Springer (2005).

77. Nakagawa K. The geometry of M/D/1 queues and large deviation. Int Trans Oper Res (2002) 9:213–22. doi:10.1111/1475-3995.00351

78. Neyman J, Pearson ES. IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Trans R Soc Lond Ser A, Containing Pap a Math or Phys Character (1933) 231:289–337. doi:10.1098/rsta.1933.0009

79. Nielsen F. An elementary introduction to information geometry. Entropy (2020) 22:1100. doi:10.3390/e22101100

80. Čencov NN. Algebraic foundation of mathematical statistics. Ser Stat (1978) 9:267–76. doi:10.1080/02331887808801428

81. Gaspard P. Time-reversed dynamical entropy and irreversibility in Markovian random processes. J Stat Phys (2004) 117:599–615. doi:10.1007/s10955-004-3455-1

Keywords: Markov chains (60J10), data processing, information geometry, congruent embeddings, Markov morphisms

Citation: Wolfer G and Watanabe S (2023) Information geometry of Markov Kernels: a survey. Front. Phys. 11:1195562. doi: 10.3389/fphy.2023.1195562

Received: 28 March 2023; Accepted: 08 June 2023;
Published: 27 July 2023.

Edited by:

Jun Suzuki, The University of Electro-Communications, Japan

Reviewed by:

Antonio Maria Scarfone, National Research Council (CNR), Italy
Marco Favretti, University of Padua, Italy
Fabio Di Cosmo, Universidad Carlos III de Madrid, Spain

Copyright © 2023 Wolfer and Watanabe. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Geoffrey Wolfer, geoffrey.wolfer@riken.jp

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
