Least Square Approach to Out-of-Sample Extensions of Diffusion Maps

Wang, Jianzhong

doi:10.3389/fams.2019.00024

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 16 May 2019

Sec. Mathematics of Computation and Data Science

Volume 5 - 2019 | https://doi.org/10.3389/fams.2019.00024

Least Square Approach to Out-of-Sample Extensions of Diffusion Maps

Jianzhong Wang^*

Department of Mathematics and Statistics, Sam Houston State University, Huntsville, TX, United States

Let X = X ∪ Z be a data set in ℝ^D, where X is the training set and Z the testing one. Assume that a kernel method produces a dimensionality reduction (DR) mapping 𝔉: X → ℝ^d (d ≪ D) that maps the high-dimensional data X to its row-dimensional representation Y = 𝔉(X). The out-of-sample extension of dimensionality reduction problem is to find the dimensionality reduction of X using the extension of 𝔉 instead of re-training the whole data set X. In this paper, utilizing the framework of reproducing kernel Hilbert space theory, we introduce a least-square approach to extensions of the popular DR mappings called Diffusion maps (Dmaps). We establish a theoretic analysis for the out-of-sample DR Dmaps. This analysis also provides a uniform treatment of many popular out-of-sample algorithms based on kernel methods. We illustrate the validity of the developed out-of-sample DR algorithms in several examples.

1. Introduction

Recently, in many scientific and technological areas, we need to analyze and process high-dimensional data, such as speech signals, images and videos, text documents, stock trade records, and others. Due to the curse of dimensionality [1, 2], directly analyzing and processing high-dimensional data are often infeasible. Therefore, dimensionality reduction (DR) (see the books [3, 4]) becomes a critical step in high-dimensional data processing. DR maps high-dimensional data into a low-dimensional space so that the data process can be carried out on its low-dimensional representation. There exist many DR methods in literature. The famous linear method is principle component analysis (PCA) [5]. However, PCA cannot effectively reduce the dimension for the data set, which essentially resides on a nonlinear manifold. Therefore, to reduce the dimensions of such data sets, people employ non-linear DR methods [6–12], among which, the method of Diffusion Maps (Dmaps) introduced by Coifman and his research group [13, 14] have been proved attractive and effective. Adopting the ideas of the spectral clustering [15, 16] and Laplacian eigenmaps [17], Dmaps integrates them into a more conceptual framework—the geometric harmonics.

As a spectral method, Dmaps employs the diffusion kernel to define the similarity on a given data set X ⊂ ℝ^D. The principal d-dimensional eigenspace (d ≪ D) of the kernel provides the feature space of X, so that a diffusing mapping 𝔉 maps X to the set Y = 𝔉(X), which is called a DR of X.

Note that the mapping 𝔉 is constructed by the spectral decomposition of the kernel, which is data-dependent. If the set X is enlarged to X = X ∪ Z and we want to make DR of X by Dmaps, we have to retrain the set X in order to construct a new diffusing mapping. The retraining approach is often unpractical if the cardinality of X becomes very large, or the new data set Z comes as a time-stream.

Out-of-example DR extension method finds the DR of X by extending the diffusing mapping 𝔉 onto X. In most cases, we can assume that the new data set Z has the similar features as X. Therefore, instead of retraining the whole set X, we realize the DR of X by extending the mapping 𝔉 from X to X only.

Lots of papers have introduced various out-of-example extension algorithms (see [14, 18, 19] and their references). However, the mathematical analysis on out-of-example extension is not studied sufficiently.

The main purpose of this paper is to give a mathematical analysis on the out-of-sample DR extension of Dmaps. In Wang [20], we preliminarily studied out-of-sample DR extensions for kernel PCA. Since the structure of kernels for Dmaps are different from kernel PCA, it needs a special analysis. In this paper we deal with the DR extensions of Dmaps in the framework of reproducing kernel Hilbert space (RKHS), in which Dmaps extension can be classified as the least square one.

The paper is organized as follows: In section 2, we introduce the general out-of-sample extensions in the RKHS framework. In section 3, we establish the least square out-of-sample DR extensions of Dmaps. In section 4, we give the mathematical analysis and algorithms for the Dmaps DR extension. In the last section, we give several examples for the extension.

2. Preliminary

We first introduce some notions and notations. Let μ be a finite (positive) measure on a data set X ⊂ ℝ^D. We denoted by L²(X, μ) the (real) Hilbert space on X, equipped with the inner product

\begin{array}{l} {〈 f, g 〉}_{L^{2} (X, μ)} = \int_{X} f (x) g (x) d μ (x), f, g \in L^{2} (X, μ) . \end{array}

Then, $| | f | |_{L^{2} (X, μ)} = \sqrt{{〈 f, f 〉}_{L^{2} (X, μ)}}$ . Later, we will abbreviate L²(X, μ) to L²(X) (or L²) if the measure μ (and the set X) is (are) not stressed.

Definition 1 A function k: X² → ℝ is called a Mercer's kernel if it satisfies the following conditions:

1. k is symmetric: k(x, y) = k(y, x);

2. k is positive semi-definite;

3. k is bounded on X², that is, there is an M > 0 such that |k(x, y)| ≤ M, (x, y) ∈ X².

In this paper, we only consider Mercer's kernels. Hence, the term kernel will stand for Mercer's one. The kernel distance (associated with k) between two points x, y ∈ X is defined by

\begin{array}{l} d_{k} (x, y) = \sqrt{k (x, x) + k (y, y) - 2 k (x, y)} . \end{array}

A kernel k defines an RKHS H_k, in which the inner product satisfies [21]

\begin{array}{l} {〈 f (\cdot), k (x, \cdot) 〉}_{H_{k}} = f (x), f \in H_{k}, x \in X . & (1) \end{array}

Later, we will use H instead of H_k if the kernel k is not stressed. Recall that k has a dual identity. It derives the identity operator on H, as shown in 1, and also derives the following compact operator K on L²(X):

\begin{array}{l} (K f) (x) = {〈 f (\cdot), k (x, \cdot) 〉}_{L^{2}} = \int_{X} k (x, y) f (y) d μ (y), f \in L^{2} . \end{array}

In Wang [20], we proved that if

\begin{array}{l} k (x, y) = \sum_{j = 1}^{m} ϕ_{j} (x) ϕ_{j} (y), \end{array}

where the set {ϕ₁, ··· , ϕ_m} is linearly independent, then the set is an o.n. basis of H. Therefore, for f, g ∈ H with $f = \sum_{j} c_{j} ϕ_{j}$ and $g = \sum_{j} d_{j} ϕ_{j}$ , we have ${〈 f, g 〉}_{H_{k}} = \sum_{j} c_{j} d_{j}$ .

Let the spectral decomposition of k be the following:

\begin{array}{l} k (x, y) = \sum_{j = 1}^{m} λ_{j} v_{j} (x) v_{j} (y), 0 \leq m \leq \infty, & (2) \end{array}

where the eigenvalues are arranged decreasingly, λ₁ ≥ ··· ≥ λ_m > 0, and the eigenfunctions v₁, v₁, ··· , v_m, are normalized to satisfy

\begin{array}{l} {〈 v_{i}, v_{j} 〉}_{L^{2} (X)} = δ_{i, j} . \end{array}

Write $γ_{i} (x) = \sqrt{λ_{i}} v_{i} (x)$ . Then, {γ₁, ··· , γ_m} is an o.n. basis of H, which is called the canonic basis of H. We also call $k (x, y) = \sum_{j = 1}^{m} γ_{j} (x) γ_{j} (y)$ the canonic decomposition of k. By 2, we have

\begin{array}{l} γ_{j} = \frac{1}{λ_{j}} \int_{X} k (x, y) γ_{j} (y) d μ (y) . \end{array}

Thus, if f ∈ H have the canonic representation $f = \sum_{j = 1}^{m} c_{j} γ_{j}$ , then, for any g ∈ H, the inner product 〈f, g〉_H has the following integral form:

\begin{array}{l} {〈 f, g 〉}_{H} = \sum_{j = 1}^{m} \frac{c_{j}}{λ_{j}} \int_{X} g (x) γ_{j} (x) d μ (x) . \end{array}

To investigate the out-of-sample DR extension, we first recall some general results on function extensions. Let X = X ∪ Z. To stress that a point x ∈ X is also in X, we use x instead of x. Similarly, we denote by k(x, y) the restriction of k(x, y) on X². That is,

\begin{array}{l} k (x, y) = k (x, y), (x, y) \in X^{2} . \end{array}

We also denote by H the RKHS associated with k. Then a continuous map E : H → H is called an extension if

E (f) (x) = f (x), \forall f \in H .

Correspondingly, we define the restriction R: H → H by

\begin{array}{l} R (f) (x) = f (x), \forall f \in H, x \in X . \end{array}

It is obvious that the extensions from X to X are not unique if Z is not empty. So, we define the set of all extensions of f ∈ H by

\begin{array}{l} A_{f} = {f \in H; R (f) = f}, \end{array}

and call $\hat{f} \in A_{f}$ the least-square extension of f if

\begin{array}{l} | | \hat{f} | |_{H} = min_{f \in A_{f}} | | f | |_{H} . \end{array}

It is evident that the least-square extension of a function is unique. We denote by T : H → H the operator of the least-square extension.

In Wang [20], we already prove the following:

1. Let {v₁, ··· , v_d} be the canonic basis of H and σ₁ ≥ σ₂ ≥ ··· ≥ σ_d > 0 be the eigenvalues of the kernel k(x, y). Then the least-square extension of v_j is

\begin{array}{l} {\hat{v}}_{j} (x) (= T (v_{j}) (x)) = \frac{1}{σ_{j}} \int_{X} k (x, y) v_{j} (y) d μ (y), x \in X, 1 \leq j \leq d . & (3) \end{array}

Therefore, for any $f = \sum_{j = 1}^{d} c_{j} v_{j} \in H$ ,

\begin{array}{l} \hat{f} (x) (= T (f) (x)) = \sum_{j = 1}^{d} c_{j} \frac{1}{σ_{j}} \int_{X} k (x, y) v_{j} (y) d μ (y) . \end{array}

2. Let Ĥ = T(H) and T* : H → H be the joint operator of T. Then P = TT* is an orthogonal projection from H to Ĥ.

3. Let $\hat{k} (x, y)$ be the kernel of the RKHS Ĥ. Then $k_{0} (x, y) = k (x, y) - \hat{k} (x, y)$ is a Mercer's kernel such that $k_{0} (x, y) = 0, (x, y) \in X^{2} \ X^{2}$ . Denote by H₀ the RKHS associated with k₀. Then, H = Ĥ ⊕ H₀ and Ĥ ⊥ H₀.

4. If k(x, y) is a Gramian-type DR kernel [20], and ${[v_{1} (X), \cdot\cdot\cdot, v_{d} (X)]}^{T}$ gives the DR of X, then ${[{\hat{v}}_{1} (X), \cdot\cdot\cdot, {\hat{v}}_{d} (X)]}^{T}$ provides the least-square out-of-sample DR extension on X.

3. Least-Square Out-of-Sample DR Extensions for Dmaps

The kernels of Dmaps are constructed based on the Gaussian kernel

\begin{array}{l} w (x, y) = exp (- \frac{| | x - y | |^{2}}{ϵ}), (x, y) \in X^{2}, ϵ > 0 . \end{array}

The function

\begin{array}{l} S (x) = \int_{X} w (x, y) d μ (y) \end{array}

defines a mass density on X, and $M = \int_{X} S (x) d μ (x)$ is the total mass of X.

There are two important forms of the kernels of Dmaps: The Graph-Laplacian diffusion kernel and the Laplace-Beltrami one.

3.1. Dmaps With the Graph-Laplacian Kernel

We first discuss the least-square out-of-sample DR Extensions for the Dmaps with the Graph-Laplacian (GL) kernel. Normalizing the Gaussian kernel by S(x), we obtain the following Graph-Laplacian diffusion kernel [4, 13]:

\begin{array}{l} g (x, y) = \frac{w (x, y)}{\sqrt{S (x) S (y)}} . \end{array}

This kernel relates to the data set X equipped with an undirected (weighted) graph. It is known that 1 is the greatest eigenvalue of g(x, y) and its corresponding normalized eigenfunction is $\sqrt{\frac{S (x)}{M}}$ .

Let H_g be the RKHS associated with the kernel g and {ϕ₀, ··· , ϕ_m} be its canonic basis, which suggest the following spectral decomposition of g(x, y):

\begin{array}{l} g (x, y) = \sum_{j = 0}^{m} λ_{j} v_{j} (x) v_{j} (y), \end{array}

where 1 = λ₀ ≥ λ₁ ≥ ··· ≥ λ_m > 0 and $v_{j} (x) = ϕ_{j} (x) / \sqrt{λ_{j}}$ . Because $ϕ_{0} = \sqrt{\frac{S (x)}{M}}$ provides only the mass information of the data set, it should not reside on the feature space. Hence, we define the feature space as the RKHS associated with the kernel $\sum_{j = 1}^{m} ϕ_{j} (x) ϕ_{j} (y)$ , where ϕ₀ is removed.

Definition 2. The mapping $Φ : X \to ℝ^{m} : Φ (x) = {[ϕ_{1} (x), \cdot\cdot\cdot, ϕ_{m} (x)]}^{T}$ is called the diffusion mapping and the data set Φ(X) ⊂ ℝ^m is called a DR of X.

Remark. In Wang [20], we already pointed out that each orthogonal transformation of the set Φ(X) can also be considered as a DR of X. Hence, any non-canonical o.n. basis of the feature space also provides a DR mapping.

To study the out-of-sample extension, as what was done in the preceding section, we assume X = X ∪ Z and denote by g(x, y) the Graph-Laplacian kernel on X, that is,

\begin{array}{l} g (x, y) = \frac{w (x, y)}{\sqrt{S (x) S (y)}}, \end{array}

where S(x) is the mass density on X, and

\begin{array}{l} w (x, y) = w (x, y), (x, y) \in X^{2}, \end{array}

Assume that spectral decomposition of g is given by

\begin{array}{l} g (x, y) = \sum_{j = 0}^{d} σ_{j} v_{j} (x) v_{j} (y) . & (4) \end{array}

Then the RKHS H_g associated with g has the canonic basis {φ₀, φ₁, ··· , φ_d}:

\begin{array}{l} g (x, y) = \sum_{j = 0}^{d} φ_{j} (x) φ_{j} (y), \end{array}

where $φ_{j} = \sqrt{σ_{j}} v_{j}$ . Because S(x) ≠ S(x), in general,

\begin{array}{l} g (x, y) \neq g (x, y), (x, y) \in X^{2} . \end{array}

Hence, we cannot directly apply the extension technique in the preceding section to g. Our main purpose in this subsection is to introduce the extension from H_g to H_g.

Denote by H_w and H_w the RKHSs associated with the kernels w and w, respectively. Because w(x, y) = w(x, y) for (x, y) ∈ X², the extension technique in the preceding section can be applied.

Let $u_{j} (x) = \sqrt{S (x)} φ_{j} (x)$ and $u_{j} (x) = \sqrt{S (x)} ϕ_{j} (x)$ . Then we have

\begin{array}{l} w (x, y) = \sum_{j = 0}^{d} u_{j} (x) u_{j} (y), w (x, y) = \sum_{j = 0}^{m} u_{j} (x) u_{j} (y) . \end{array}

Lemma 3 The least-square extension operator T : H_w → H_w has the following representation:

\begin{array}{l} T (u_{j}) (x) = \frac{1}{σ_{j}} \int_{X} w (x, y) \frac{u_{j} (y)}{S (y)} d μ (y), j = 0, 1, \cdot\cdot\cdot, d . & (5) \end{array}

Proof. Because ${u_{j}}_{j = 0}^{d}$ is not a canonic o.n. basis of H_w, we cannot directly apply the extension formula 3. Recall that the formula 3 can also be written as T(f)(x) = 〈f, k(x, ·)〉_H. (In the considered case, the kernel w replaces k.) Note that

\begin{array}{l} {〈 u_{j}, w (x, \cdot) 〉}_{H_{w}} = u_{j} (x) = \sqrt{S (x)} φ_{j} (x) = \sqrt{S (x)} \frac{1}{σ_{j}} \int_{X} g (x, y) φ_{j} (y) \\ d μ (y) = \frac{1}{σ_{j}} \int_{X} w (x, y) \frac{u_{j} (y)}{S (y)} d μ (y), \end{array}

which implies that, for any f ∈ H_w, we have

\begin{array}{l} {〈 f, u_{j} 〉}_{H_{w}} = \frac{1}{σ_{j}} \int_{X} f (y) \frac{u_{j} (y)}{S (y)} d μ (y) . \end{array}

Therefore, the formula T(u_j)(x) = 〈w(x, ·),_{u_j〉H_w} yields 5. ■

We now write û_j = T(u_j) and define

\begin{array}{l} ŵ (x, y) = \sum_{j = 0}^{d} û_{j} (x) û_{j} (y) . \end{array}

Then the RKHS H_ŵ associated with the kernel ŵ is the extension of H_w.

The function S(x) induces the following multiplicator from H_g to H_w:

\begin{array}{l} 𝔖_{S} (f) (x) = \sqrt{S (x)} f (x), x \in X . \end{array}

Similarly, the function S(x) induces the following multiplicator from H_g to H_w:

\begin{array}{l} 𝔖_{S} (f) (x) = \sqrt{S (x)} f (x), x \in X . \end{array}

It is clear that the operator 𝔖_S (𝔖_S) is an isometric mapping. With the aid of 𝔖_S and 𝔖_S, we define the least-square extension $T$ from H_g to H_g by

\begin{array}{l} T = {(𝔖_{S})}^{- 1} \circ T \circ 𝔖_{S} . & (6) \end{array}

The following diagram shows the strategy of the out-of-sample extension using Graph-Laplacian diffusion mapping.

yes

We now derive the integral representation of the operator $T$ .

Lemma 4 Let the canonic decomposition of g be given by 4 and $f = \sum_{j = 0}^{d} c_{j} φ_{j} \in H_{g}$ . Then

\begin{array}{l} T (f) (x) = \sum_{j = 0}^{d} c_{j} {\hat{ϕ}}_{j} (x) = \sum_{j = 0}^{d} \frac{c_{j}}{σ_{j}} \int_{X} \frac{w (x, y)}{\sqrt{S (x) S (y)}} φ_{j} (y) d μ (y) . & (7) \end{array}

Its adjoint operator $T^{*} : H_{g} \to H_{g}$ is given by

\begin{array}{l} T^{*} (h) (x) = h (x) \sqrt{\frac{S (x)}{S (x)}}, h \in H_{g} . & (8) \end{array}

Proof. Write ${\hat{ϕ}}_{j} = T (φ_{j})$ . By 6, we have ${\hat{ϕ}}_{j} (x) = \frac{T (u_{j}) (x)}{\sqrt{S (x)}} .$ By Lemma 3, we obtain

\begin{array}{l} {\hat{ϕ}}_{j} (x) = \frac{1}{σ_{j}} \int_{X} \frac{w (x, y)}{\sqrt{S (x) S (y)}} φ_{j} (y) d μ (y), & (9) \end{array}

which yields 7. Recall that $\frac{w (x, y)}{\sqrt{S (x) S (y)}} = g (x, y) \sqrt{\frac{S (y)}{S (y)}}$ . For any h ∈ H_g, by 〈h, g(·, y)〉_{H_g} = h(y), we have

\begin{array}{l} {〈 h, T (f) 〉}_{H_{g}} = \sum_{j = 0}^{d} \frac{c_{j}}{σ_{j}} \int_{X} h (y) \sqrt{\frac{S (y)}{S (y)}} φ_{j} (y) d μ (y) = {〈 \sqrt{\frac{S (\cdot)}{S (\cdot)}} h (\cdot), f 〉}_{H_{g}}, \end{array}

which yields 8. ■

We now give the main theorem in this subsection.

Theorem 5 Let $T$ be the operator defined in 6. Define $ĝ (x, y) = \sum_{j = 0}^{d} {\hat{ϕ}}_{j} (x) {\hat{ϕ}}_{j} (y)$ , where ${\hat{ϕ}}_{j} = T (φ_{j})$ , and let H_ĝ be the RKHS associated with ĝ. Then,

(i) $T^{*} T = I$ on H_g.

(ii) ${{\hat{ϕ}}_{0}, \cdot\cdot\cdot, {\hat{ϕ}}_{d}}$ is an orthonormal system in H_g, so that H_ĝ is a subspace of H_g and $P = T T^{*}$ is an orthogonal projection from H_g to H_ĝ. Therefore, we have $P ({\hat{ϕ}}_{j}) = {\hat{ϕ}}_{j}$ and $T^{*} ({\hat{ϕ}}_{j}) = φ_{j}$ .

(iii) The function g₀(x, y) = g(x, y) − ĝ(x, y) is a Mercer's kernel. The RKHS H_g₀ associated with g₀ is (m − d) dimensional. Besides, H_g = H_ĝ ⊕ H_g₀ and H_ĝ ⊥ H_g₀.

(iv) For any function f ∈ H_g₀, f(x) = 0, x ∈ X.

Proof. Recall that {φ₀, φ₁, ··· , φ_d} is an on. basis of H_g. By 8 and 9, we have

\begin{array}{l} T^{*} T (φ_{j}) (x) = {\hat{ϕ}}_{j} (x) \sqrt{\frac{S (x)}{S (x)}} = φ_{j} (x), j = 0, 1, \cdot\cdot\cdot, d, \end{array}

which yields $T^{*} T (φ_{j}) = φ_{j}$ . Hence, $T^{*} T = I$ on H_g. The proof of (i) is completed.

Note that

\begin{array}{l} {〈 {\hat{ϕ}}_{i}, {\hat{ϕ}}_{j} 〉}_{H_{g}} = {〈 φ_{i}, T^{*} T (φ_{j}) 〉}_{H_{g}} = {〈 φ_{i}, φ_{j} 〉}_{H_{g}} = δ_{i, j}, \end{array}

which indicates that ${{\hat{ϕ}}_{0} (x), \cdot\cdot\cdot, {\hat{ϕ}}_{d} (x)}$ is an orthonormal system in H_g and H_ĝ is a subspace of H_g. Because $P^{2} = P$ and $P ({\hat{ϕ}}_{j}) = {\hat{ϕ}}_{j}, j = 0, 1, \cdot\cdot\cdot, d$ , $P$ is an orthogonal projection from $H_{\tilde{g}}$ to H_ĝ, which proves (ii).

It is clear that (iii) is a direct consequence of (ii). Finally, we have $P (f) = 0$ for f ∈ H_g₀, which yields $T^{*} (f) = 0$ . Therefore, f(x) = 0, x ∈ X. The proof of (iv) is completed, ■

By Definition 2, the mapping $Φ : Φ (x) = {[φ_{1} (x), \cdot\cdot\cdot, φ_{d} (x)]}^{T}$ is a diffusion mapping from X to ℝ^d and the set Φ(X) is a DR of X. We now give the following definition.

Definition 6 Let $T$ be the operator defined in 6 and ${\hat{ϕ}}_{j} = T (φ_{j})$ . Then the set $\hat{Φ} (X) = {[{\hat{ϕ}}_{1} (X), \cdot\cdot\cdot, {\hat{ϕ}}_{d} (X)]}^{T} \subset ℝ^{d}$ is called the least-square out-of-sample DR extension of the Dmaps with the Graph-Laplacian kernel.

A DR extension on X is called exact if it is equal to a DR of X as defined in Definition 2 (see [20]). The following corollary is a direct consequence of Theorem 5.

Corollary 7 The least-square out-of-sample DR extension given by $T$ from H_g to H_g is exact if and only if dim(H_g) = dim(H_g), or equivalently, H_g₀ = {0}.

3.2. Dmaps With the Laplace-Beltrami Kernel

The discussion on the out-of-sample DR extension of Dmaps with the Laplace-Beltrami (BL) kernel is similar to that in the previous subsection. Hence, in this subsection, we only outline the main results, skipping the details. We start the discussion from the asymmetrically normalized kernel

\begin{array}{l} m (x, y) = \frac{1}{S (x)} w (x, y), \end{array}

which defines a random walk on the data set X such that m(x, y) is the probability of the walk from the node x to the node y after a unit time. From the viewpoint of the random walk, we naturally modify the Gaussian kernel w(x, y) to the following:

\begin{array}{l} a (x, y) = \frac{w (x, y)}{S (x) S (y)} . \end{array}

Then, we normalize it to

\begin{array}{l} b (x, y) = \frac{a (x, y)}{\sqrt{P (x) P (y)}} = \frac{w (x, y)}{\sqrt{R (x) R (y)}}, \end{array}

where

\begin{array}{l} P (x) = \int_{X} a (x, y) d μ (y), R (x) = S^{2} (x) P (x) . \end{array}

We call b(x, y) the Laplace-Beltrami kernel of Dmaps, which relates to the data set X sampled from a manifold in ℝ^D. The greatest eigenvalue of b(x, y) is also 1, which corresponds to the normalized eigenfunction $\sqrt{\frac{P (x)}{L}}$ , where

\begin{array}{l} L = \int_{X} P (x) d μ (x) . \end{array}

Let H_b be the RKHS associated with b and assume that the spectral decomposition of b is

\begin{array}{l} b (x, y) = \sum_{j = 0}^{m} β_{j} q_{j} (x) q_{j} (y) = \sum_{j = 0}^{m} ψ_{j} (x) ψ_{j} (y), \end{array}

where 1 = β₀ ≥ β₁ ≥ ··· ≥ β_m > 0 and $ψ_{j} (x) = \sqrt{β_{j}} q_{j} (x)$ . Similar to the discussion in the previous subsection, since $ψ_{0} = \sqrt{\frac{P (x)}{L}}$ does not contains any feature of the data set, we exclude it from the feature space.

Definition 8 The mapping $Ψ : X \to ℝ^{m} : Ψ (x) = {[ψ_{1} (x), \cdot\cdot\cdot, ψ_{m} (x)]}^{T}$ is called the Laplace-Beltrami diffusion mapping and the data set Ψ(X) ⊂ ℝ^m is called a DR of X associated with Laplace-Beltrami Dmaps.

We new assume again X = X ∪ Z and denote by b(x, y) the Laplace-Beltrami kernel on X. Assume that spectral decomposition of b is

\begin{array}{l} b (x, y) = \sum_{j = 0}^{d} γ_{j} q_{j} (x) q_{j} (y) . & (10) \end{array}

Then the RKHS H_b associated with b has the canonic basis {ω₀, ω₁, ··· , ω_d}, where $ω_{j} = \sqrt{γ_{j}} q_{j}$ .

Define the multiplicator from H_b to H_w by

\begin{array}{l} 𝔖_{R} (f) (x) = \sqrt{R (x)} f (x), x \in X, \end{array}

and the multiplicator from H_b to H_w by

\begin{array}{l} 𝔖_{R} (f) (x) = \sqrt{R (x)} f (x), x \in X . \end{array}

The operator 𝔖_R (𝔖_R) is an isometric mapping. We now define the least-square extension $M$ from H_b to H_b by

\begin{array}{l} M = {(𝔖_{R})}^{- 1} \circ T \circ 𝔖_{R} . & (11) \end{array}

The integral representation of $M$ is give by the following lemma:

Lemma 9 Let {ω₀, ω₁, ··· , ω_d}be the canonic basis of b. Write ${\hat{ψ}}_{j} = M (ω_{j})$ . Then

\begin{array}{l} {\hat{ψ}}_{j} (x) = \frac{1}{γ_{j}} \int_{X} \frac{w (x, y)}{\sqrt{R (x) R (y)}} ω_{j} (y) d μ (y) . & (12) \end{array}

Particularly, for $f = \sum_{j = 0}^{d} c_{j} ω_{j} \in H_{b}$ , we have

\begin{array}{l} M (f) (x) = \sum_{j = 0}^{d} c_{j} {\hat{ψ}}_{j} (x) = \sum_{j = 0}^{d} \frac{c_{j}}{γ_{j}} \int_{X} \frac{w (x, y)}{\sqrt{R (x) R (y)}} ω_{j} (y) d μ (y) . \end{array}

Its adjoint operator $M^{*} : H_{b} \to H_{b}$ is given by

\begin{array}{l} M^{*} (h) (x) = h (x) \sqrt{\frac{P (x)}{P (x)}}, h \in H_{b} . \end{array}

Since the proof is similar to that for Lemma 4, we skip it here.

Theorem 10 Let $M$ be the operator defined in 11. Define $\hat{b} (x, y) = \sum_{j = 0}^{d} {\hat{ψ}}_{j} (x) {\hat{ψ}}_{j} (y)$ , where ${\hat{ψ}}_{j} = M (ω_{j})$ , and let $H_{\hat{b}}$ be the RKHS associated with $\hat{b}$ . Then,

1. $M^{*} M = I$ on H_b.

2. ${{\hat{ψ}}_{0}, \cdot\cdot\cdot, {\hat{ψ}}_{d}}$ is an orthonormal system in H_b, so that $H_{\hat{b}}$ is a subspace of H_b and $Q = M M^{*}$ is an orthogonal projection from H_b to $H_{\hat{b}}$ . Therefore, we have $Q ({\hat{ψ}}_{j}) = {\hat{ψ}}_{j}$ and $M^{*} ({\hat{ψ}}_{j}) = ω_{j}$ .

3. The function $b_{0} (x, y) = b (x, y) - \hat{b} (x, y)$ is a Mercer's kernel. The RKHS H_b₀ associated with b₀ is (m−d) dimensional. Besides, $H_{b} = H_{\hat{b}} \oplus H_{b_{0}}$ and $H_{\hat{b}} ⊥ H_{b_{0}}$ .

4. For any function f ∈ H_b₀, f(x) = 0, x ∈ X.

We skip the proof of Theorem 10 because it is similar to that for Theorem 5. We now give the following definition:

Definition 11 Let $M$ be the operator defined in 11 and ${\hat{ψ}}_{j} = M (ω_{j})$ . Then the set $\hat{Ψ} (X) = {[{\hat{ψ}}_{1} (X), \cdot\cdot\cdot, {\hat{ψ}}_{d} (X)]}^{T} \subset ℝ^{d}$ is called the least-square out-of-sample DR extension of the Dmaps with the Laplace-Beltrami kernel.

Corollary 12 The least-square out-of-sample DR extension given by $M$ from H_b to H_b is exact if and only if dim(H_b) = dim(H_b), or equivalently, H_b₀ = {0}.

3.3. Algorithms for Out-of-Sample DR Extension of Dmaps

In this subsection, we present the algorithm for out-of-sample DR extension of Dmaps. The algorithm contains two parts. In the first part, we construct the DR for X by 4 and 10. In the second part, we extend the DR to the set X, by 9 and 12.

In the algorithm, we represent the data sets X, Z, and X = X ∪ Z as the D×N, D×M, and D×(N+M) matrices, respectively, so that X = [X, Z]. We assume the measure dμ(x) = dx. Write X = [x₁, ··· , x_N], Z = [z₁, ··· , z_M], and X = [x₁, ··· , x_(N+M)], where x_j = x_j, 1 ≤ j ≤ N and x_j = z_j−N, N + 1 ≤ j ≤ N + M. Then we represent all kernels by matrices and all functions by vectors. For example, w is now represented by the N × N matrix with $w_{i, j} = exp (- | | x_{i} - x_{j} | |^{2} / ϵ)$ . To treat GL-map and LB-map in a uniform way, we write $S_{i} = \sum_{j} w_{i, j}$ and define

d_{i} = {\begin{array}{l} \sqrt{S_{i}}, & for GL-map \\ \sqrt{S_{i} \sum_{j} (w_{i, j} / S_{j})}, & for LB-map \end{array}

Then we set either kernel on X as the N × N matrix k with

\begin{array}{l} k_{i, j} = \frac{w_{i, j}}{d_{i} d_{j}} . \end{array}

The pseudo-code is given in Algorithm 1.

Algorithm 1: Out-of-Sample DR Extension for Dmaps

4. Illustrative Examples

In this section, we give several illustrative examples to show the validity of the Dmaps out-of-sample extensions. We employ four benchmark artificial data sets, S-curve, Swiss roll, punched sphere, and 3D cluster, in our samples. The graphs of these four data sets are give in Figure 1.

FIGURE 1

Figure 1. S-curve, Swiss Roll, Punched Sphere, and 3D-Cluster.

4.1. Out-of-Sample Extension by Graph-Laplacian Mapping

We first show the examples for the out-of-sample extensions provided by Graph-Laplacian mapping for the four benchmark figures. We set the size of each of these data sets by |X| = 2, 048. When the out-of-example algorithm is applied, we choose the size of the training data set to be |X| = 1, 843, which is 90% of the all samples, and choose the size of the testing set |Z| = 205, which is 10% of all samples. The parameters for the Graph-Laplacian kernel are set as follows: For obtaining the sparse kernel, we choose 25 nearest neighbors for every node, and assign the diffusion parameter ϵ = 1 for S-curve, Punched Sphere, and 3D Cluster, while assign ϵ = ∞ for Swiss Roll. We compare the DR result of the whole set X obtained by out-of-example extension with that obtained without out-of-example extension in the Figures 2–5. The figures show that the DRs obtained by out-of-sample extensions are satisfactory.

FIGURE 2

Figure 2. GL out-of-sample extension for DR of S-Curve.

FIGURE 3

Figure 3. GL out-of-sample extension for DR of Punched Sphere.

FIGURE 4

Figure 4. GL out-of-sample extension for DR of 3D Cluster.

FIGURE 5

Figure 5. GL out-of-sample extension for DR of Swiss Roll.

4.2. Out-of-Sample Extension by Laplace-Beltrami Mapping

We now show the examples for the out-of-sample extensions provided by Laplace-Beltrami mapping for the same four benchmark figures. We set the same sizes for |X|,|X|, and |Z|, respectively. The parameters for the Laplace-Beltrami kernel are also set the same as for Graph-Laplacian kernel. The results of the comparisons are give in Figures 6–9.

FIGURE 6

Figure 6. LB out-of-sample extension for DR of S-Curve.

FIGURE 7

Figure 7. LB out-of-sample extension for DR of Punched Sphere.

FIGURE 8

Figure 8. LB out-of-sample extension for DR of 3D Cluster.

FIGURE 9

Figure 9. LB out-of-sample extension for DR of Swiss Roll.

To give more detailed comparisons, in Figures 10–13, we show the DRs of the training data and the testing data obtained by out-of-extensions and without extensions, respectively, for LB mapping.

FIGURE 10

Figure 10. Comparisons of DRs of training data and the testing data, respectively, for S-curve.

FIGURE 11

Figure 11. Comparisons of DRs of training data and the testing data, respectively, for Punched Sphere.

FIGURE 12

Figure 12. Comparisons of DRs of training data and the testing data, respectively, for 3D Cluster.

FIGURE 13

Figure 13. Comparisons of DRs of training data and the testing data, respectively, for Swiss Roll.

It is a common sense that if we reduce the size of the training set while increase the size for the testing set, the out-of-sample extension will introduce larger errors for DR. Figures 14–15 show that, in a relative large scope, say, the size of the testing set is no greater than the size of the training set. the out-of-sample extension still produces the acceptable results.

FIGURE 14

Figure 14. LB out-of-sample extension for different sizes of the test sets of S-curve (I).

FIGURE 15

Figure 15. LB out-of-sample extension for different sizes of the test sets of S-curve (II).

Data Availability

No datasets were generated or analyzed for this study.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Bellman R. Adaptive Control Processes: A Guided Tour. Princeton, NJ: Princeton University Press (1961).

Google Scholar

2. Scott DW, Thompson JR. Probability density estimation in higher dimensions. In: Gentle JE, editor. Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface. Amsterdam; New York, Ny; Oxford: North Holland-Elsevier Science Publishers (1983). p. 173–9.

Google Scholar

3. Lee JA, Verleysen M. Nonlinear Dimensionality Reduction. New York, NY: Springer (2007).

Google Scholar

4. Wang JZ. Geometric Structure of High-Dimensional Data and Dimensionality Reduction. Beijing; Berlin; Heidelberg: Higher Educaiton Press; Springer (2012).

Google Scholar

5. Jolliffe IT. Principal Component Analysis. Springer Series in Statistics. Berlin: Springer-Verlag (1986).

PubMed Abstract | Google Scholar

6. Zhang ZY, Zha HY. Principal manifolds and nonlinear dimensionality reduction via local tangent space alignment. SIAM J Sci Comput. (2004) 26:313–38.

Google Scholar

7. Schölkopf B, Smola A, Müller K-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. (1998) 10:1299–319.

Google Scholar

8. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. (2000) 290:2323–6. doi: 10.1126/science.290.5500.2323

PubMed Abstract | CrossRef Full Text | Google Scholar

9. Donoho DL, Grimes C. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proc Natl Acad Sci USA. (2003) 100:5591–6. doi: 10.1073/pnas.1031596100

CrossRef Full Text | Google Scholar

10. Shmueli Y, Sipola T, Shabat G, Averbuch A. Using affinity perturbations to detect web traffic anomalies. In: The 11th International Conference on Sampling Theory and Applications (Bremen) (2013).

Google Scholar

11. Shmueli Y, Wolf G, Averbuch A. Updating kernel methods in spectral decomposition by affinity perturbations. Linear Algebra Appl. (2012) 437:1356–65. doi: 10.1016/j.laa.2012.04.035

CrossRef Full Text | Google Scholar

12. Balasubramanian M, Schwaartz E, Tenenbaum J, de Silva V, Langford J. The isomap algorithm and topological staility. Science (2002) 295:7. doi: 10.1126/science.295.5552.7a

PubMed Abstract | CrossRef Full Text | Google Scholar

13. Coifman RR, Lafon S. Diffusion maps. Appl Comput Harmon Anal. (2006) 21:5–30. doi: 10.1016/j.acha.2006.04.006

CrossRef Full Text | Google Scholar

14. Coifman RR, Lafon S. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Appl Comput Harmon Anal. (2006) 2:31–52. doi: 10.1016/j.acha.2005.07.005

CrossRef Full Text | Google Scholar

15. Ng A, Jordan M, Weiss Y. On spectral clustering: analysis and an algorithm. Adv Neural Inform Process Syst. (2001) 14:849–56.

Google Scholar

16. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell. (2000) 22:888–905. doi: 10.1109/34.868688

CrossRef Full Text | Google Scholar

17. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. (2003) 15:1373–96. doi: 10.1162/089976603321780317

CrossRef Full Text | Google Scholar

18. Aizenbud Y, Bermanis A, Averbuch A. PCA-based out-of-sample extension for dimensionality reduction. arXiv: 1511.00831 (2015).

Google Scholar

19. Bengio Y, Paiement J, Vincent P, Delalleau O, Le Roux N, Ouimet M. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In: Thrun S, Saul L, Schölkopf B, editors. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press (2004).

Google Scholar

20. Wang JZ. Mathematical analysis on out-of-sample extensions. Int J Wavelets Multiresol Inform Process. (2018) 16:1850042. doi: 10.1142/S021969131850042X

CrossRef Full Text | Google Scholar

21. Aronszajn N. Theory of reproducing kernels. Trans Amer Math Soc. (1950) 68:337–404.

Google Scholar

Keywords: out-of-sample extension, dimensionality reduction, reproducing kernel Hilbert space, least-square method, diffusion maps

AMS Subject Classification: 62-07, 42B35, 47A58, 30C40, 35P15

Citation: Wang J (2019) Least Square Approach to Out-of-Sample Extensions of Diffusion Maps. Front. Appl. Math. Stat. 5:24. doi: 10.3389/fams.2019.00024

Received: 02 March 2019; Accepted: 25 April 2019;
Published: 16 May 2019.

Edited by:

Ding-Xuan Zhou, City University of Hong Kong, Hong Kong

Reviewed by:

Bo Zhang, Hong Kong Baptist University, Hong Kong
Shao-Bo Lin, Wenzhou University, China
Sui Tang, Johns Hopkins University, United States

Copyright © 2019 Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jianzhong Wang, anp3YW5nQHNoc3UuZWR1

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.