Deep Nets for Local Manifold Learning

Chui, Charles K.; Mhaskar, Hrushikesh N.

doi:10.3389/fams.2018.00012

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 29 May 2018

Sec. Mathematics of Computation and Data Science

Volume 4 - 2018 | https://doi.org/10.3389/fams.2018.00012

Charles K. Chui¹

Hrushikesh N. Mhaskar²*

¹Department of Mathematics, Hong Kong Baptist University, Hong Kong, Hong Kong
²Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA, United States

The problem of extending a function f defined on a training data set $C$ on an unknown manifold 𝕏 to the entire manifold and a tubular neighborhood of this manifold is considered in this paper. For 𝕏 embedded in a high dimensional ambient Euclidean space ℝ^D, a deep learning algorithm is developed for finding a local coordinate system for the manifold without eigen-decomposition, which reduces the problem to the classical problem of function approximation on a low dimensional cube. Deep nets (or multilayered neural networks) are proposed to accomplish this approximation scheme by using the training data. Our methods do not involve such optimization techniques as back-propagation, while assuring optimal (a priori) error bounds on the output in terms of the number of derivatives of the target function. In addition, these methods are universal, in that they do not require prior knowledge of the smoothness of the target function, but adjust the accuracy of approximation locally and automatically, depending only upon the local smoothness of the target function. Our ideas are easily extended to solve both the pre-image problem and the out-of-sample extension problem, with a priori bounds on the growth of the function thus extended.

1. Introduction

Machine learning is an active sub-field of Computer Science concerned with algorithmic development for learning and making predictions based on some given data, with a long list of applications that range from computational finance and advertisement, to information retrieval, to computer vision, to speech and handwriting recognition, and to structural healthcare and medical diagnosis. In terms of function approximation, the data for learning and prediction can be formulated as {(x, f_x)}, obtained with an unknown probability distribution. Examples include: the Boston housing problem (of predicting the median price f_x of a home based on some vector x of 13 other attributes [1]) and the flour market problem [2, 3] (that deals with the indices of the wheat flour pricing in three major markets in the United States). For such problems, the objective is to predict the index f_x in the next month, say, based on a vector x of their values over the past few months. Other similar problems include the prediction of blood glucose level f_x of a patient based on a vector x of the previous few observed levels [4, 5], and the prediction of box office receipts (f_x) on the date of release of a movie in preparation, based on a vector x of the survey results about the movie [6]. It is pointed out in Mhaskar [7], Maggioni and Mhaskar [8] and Ehler et al. [9] that all pattern classification problems can also be viewed fruitfully as problems of function approximation. While allowing non-numeric input x is a subject of ongoing research (e.g., [10]), we restrict our attention in this paper to the consideration of x ∈ ℝ^D, for some integer D ≥ 1.

In the following discussion, the first component x is considered as input, while the second component f_x is considered the output of the underlying process. The central problem is to estimate the conditional expectation of f_x given x. Various statistical techniques and theoretical advances in this direction are well-known (see, for example [11]). In the context of neural and radial-basis-function networks, an explicit formulation of the input/output machines was pointed out in Girosi and Poggio [12] and Girosi et al. [13]. More recently, the nature of deep learning as an input/output process is formulated in the same way, as explained in LeCun et al. [14] and Rosasco et al. [15]. To complement the statistical perspective and understand the theoretical capabilities of these processes, it is customary to think of the expected value of f_x, given x, as a function f of x. The question of empirical estimation in this context is to carry out the approximation of f given samples $\{(x, f (x))\}_{x \in C}$, where $C$ is a finite training data set. In practice, because of the random nature of the data, it may be possible that there are several pairs of the form (x, f_x) in the data for the same values of x. In this case, a statistical scheme, such as some kind of averaging of f_x being the simplest one, can be used to obtain a desired value f(x) for the sample of f at x, $x \in C$. From this perspective, the problem of extending f from the training data set $C$ to x not in $C$ in machine learning is called the generalization problem.
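The averaging step mentioned above can be sketched as follows. This is only an illustration of the simplest scheme (a plain mean over repeated inputs); the function name `average_duplicate_samples` and the toy data are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def average_duplicate_samples(pairs):
    """Collapse repeated inputs x by averaging their observed outputs f_x.

    `pairs` is an iterable of (x, f_x), where x is a sequence of D numbers.
    Returns a dict mapping tuple(x) -> mean of the observed f_x values.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, fx in pairs:
        key = tuple(x)  # tuples are hashable, so they can index the dicts
        sums[key] += fx
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data (assumed): the input (0, 1) was observed twice with different outputs.
samples = [((0.0, 1.0), 2.0), ((0.0, 1.0), 4.0), ((1.0, 1.0), 5.0)]
training = average_duplicate_samples(samples)
```

After this step each x in the training set carries a single value f(x), which is the form of data assumed in the rest of the discussion.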

We will illustrate this general line of ideas by using neural networks as an example. To motivate this idea, let us first recall a theorem originating with Kolmogorov and Arnold [16, Chapter 17, Theorem 1.1]. According to this theorem, there exist universal Lipschitz continuous functions ϕ₁, ⋯ , ϕ_{2D+1} and universal scalars λ₁, ⋯ , λ_D ∈ (0, 1), for which every continuous function f :[0, 1]^D → ℝ can be written as

\begin{array}{rcl} f (x) = \sum_{j = 1}^{2 D + 1} g (\sum_{k = 1}^{D} λ_{k} ϕ_{j} (x_{k})), x = (x_{1}, \dots, x_{D}) \in {[0, 1]}^{D}, & (1.1) \end{array}

where g is a continuous function that depends on f. In other words, for a given f, only one function g has to be determined to give the representation formula (1.1) of f.

A neural network, used as an input/output machine, consists of an input layer, one or more hidden layers, and an output layer. Each hidden layer consists of a number of neurons arranged according to the network architecture. Each of these neurons has a local memory and performs a simple non-linear computation upon its input. The input layer fans out the input x ∈ ℝ^D to the neurons at the first hidden layer. The output layer typically takes a linear combination of the outputs of the neurons at the last hidden layer. The right hand side of (1.1) is a neural network with two hidden layers. The first contains 2D + 1 neurons, where the j-th neuron computes the sum $\sum_{k = 1}^{D} λ_{k} ϕ_{j} (x_{k})$. The next hidden layer also contains 2D + 1 neurons, the j-th of which evaluates the function g on the output of the j-th neuron in the first hidden layer. The output layer takes the sum of the results as indicated in (1.1).

From a practical point of view, such a network is clearly hard to construct, since only the existence of the functions ϕ_j and g is known, without a numerical procedure for computing these. In the early mathematical development of neural networks during the late 1980s and early 1990s, instead of finding these functions for the representation of a given continuous function f in (1.1), the interest was to study the existence and characterization of universal functions σ :ℝ → ℝ, called activation functions of the neural networks, such that each neuron evaluates the activation function upon an affine transform of its input, and the network is capable of approximating any desired real-valued continuous target function f : K → ℝ arbitrarily closely on K, where K ⊂ ℝ^D is any compact set.

For example, a neural network with one hidden layer can be expressed as a function

\begin{array}{rcl} N (x) = N_{n} (\{w_{k}\}, \{a_{k}\}, \{b_{k}\}; x) = \sum_{k = 1}^{n} a_{k} σ (w_{k} \cdot x + b_{k}), x \in ℝ^{D} . & (1.2) \end{array}

Here, the hidden layer consists of n neurons, each of which has a local memory. The local memory of the k-th neuron contains the weights $w_{k} \in ℝ^{D}$, and the threshold b_k ∈ ℝ. Upon receiving the input x ∈ ℝ^D from the input layer, the k-th neuron evaluates σ(w_k · x + b_k) as its output, where σ is a non-linear activation function. The output layer is just one circuit where the coefficients {a_k} are stored in a local memory, and that evaluates the linear combination as indicated in (1.2). Training of this network in order to learn a function f on a compact subset K ⊂ ℝ^D to an accuracy of ϵ > 0 involves finding the parameters {a_k}, {w_k}, {b_k} so that

\begin{array}{rcl} max_{x \in K} | f (x) - N (x) | < ϵ . & (1.3) \end{array}

The most popular technique for doing this is the so-called back-propagation, which seeks to find these quantities by minimizing an error functional, usually with some regularization parameters. We remark that the number n of neurons in the approximant (1.2) must increase if the tolerance ϵ > 0 in the approximation of the target function f is required to be smaller.
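As a concrete reading of (1.2) and (1.3), the following minimal sketch evaluates a network with one hidden layer and estimates its error on a finite set of test points. The logistic sigmoid is used as one example of a sigmoidal activation; the function names and the parameter values are hypothetical, and the sampled maximum is only a lower estimate of the maximum over K in (1.3).

```python
import math


def sigmoid(t):
    """Logistic function: tends to 1 as t -> +inf and to 0 as t -> -inf."""
    return 1.0 / (1.0 + math.exp(-t))


def shallow_net(x, weights, thresholds, coeffs, activation=sigmoid):
    """Evaluate N(x) = sum_k a_k * sigma(w_k . x + b_k) as in (1.2)."""
    total = 0.0
    for w_k, b_k, a_k in zip(weights, thresholds, coeffs):
        # Each hidden neuron applies sigma to an affine function of the input.
        pre_activation = sum(w_i * x_i for w_i, x_i in zip(w_k, x)) + b_k
        total += a_k * activation(pre_activation)
    return total


def sampled_max_error(f, net, points):
    """Largest |f(x) - N(x)| over a finite sample of points from K."""
    return max(abs(f(x) - net(x)) for x in points)


# Assumed toy parameters: n = 2 neurons, D = 2 inputs.
weights = [(1.0, 0.0), (0.0, 1.0)]
thresholds = [0.0, 0.0]
coeffs = [2.0, 1.0]
value = shallow_net((0.0, 0.0), weights, thresholds, coeffs)  # 2*0.5 + 1*0.5
```

Here the parameters are fixed by hand; in the optimization-based setting described above they would instead be found by minimizing an error functional.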

From a theoretical perspective, the main attraction of neural networks with one hidden layer is their universal approximation property as formulated in (1.3), which overshadows the properties of their predecessors, namely: the perceptrons [17]. In particular, the question of finding sufficient conditions on the activation function σ that ensure the universal approximation property was investigated in great detail by many authors, with emphasis on the most popular sigmoidal function, defined by the property σ(t) → 1 for t → ∞ and σ(t) → 0 for t → −∞. For example, Funahashi [18] applied some discretization of an integral formula from Irie and Miyake [19] to prove the universal approximation property for some sigmoidal function σ. A similar theorem was proved by Hornik et al. [20] by using the Stone-Weierstrass theorem, and another by Cybenko [21] by applying the Hahn-Banach and Riesz Representation theorems. A constructive proof via approximation by ridge functions was given in our paper [22], with an algorithm for implementation presented in our follow-up work [23]. A complete characterization of which activation functions are allowed to achieve the universal approximation property was given later in Mhaskar and Micchelli [24] and Leshno et al. [25].

However, for neural networks with one hidden layer, one of the severe limitations to applying training algorithms based on optimization, such as back-propagation or those proposed in the book [11] of Vapnik, is that it is necessary to know the number of neurons in $N$ in advance. Therefore, one major problem in the 1990s, known as the complexity problem, was to estimate the number of neurons required to approximate a function to a desired accuracy. In practice, this gives rise to a trade-off: to achieve a good approximation, one needs to have a large number of neurons, which makes the implementation of the training algorithm harder.

In this regard, nearly a century of research in approximation theory suggests that the higher the order of smoothness of the target function, the smaller the number of neurons needed to achieve the desired accuracy. There are many different definitions of smoothness that give rise to different estimates. For example, under the condition that the Fourier transform of the target function f satisfies $\int_{ℝ^{D}} | ω \hat{f} (ω) | d ω < \infty$, Barron [26] proved the existence of a neural network with $O (ϵ^{- 2})$ neurons that gives an L²([0, 1]^D) error of $O (ϵ)$. While it is interesting to note that this number of neurons is essentially independent of the dimension D, the constants involved in the $O$ term as well as the number of derivatives needed to ensure the condition on the target function may increase with D. Several authors have subsequently improved upon such results under various conditions on the activation function as well as the target function so as to ensure that the constants depend polynomially on D (e.g., [27, 28, 29] and references therein).

The most commonly understood definition of smoothness is just the number of derivatives of the target function. It is well-known from the theory of n-widths that if r ≥ 1 is an integer, and the only a priori information assumed on the unknown target function is that it is an r-times continuously differentiable function, then a stable and uniform approximation to within ϵ by neural networks must have at least a constant multiple of $ϵ^{- D / r}$ neurons. In Mhaskar [30], we gave an explicit construction for a neural network that achieves the accuracy of ϵ using $O (ϵ^{- D / r})$ neurons arranged in a single hidden layer. It follows that this construction suffers from the curse of dimensionality, in that the number of neurons increases exponentially with the input dimension D. Clearly, if the smoothness r of the function increases linearly with D, as it has to in order to satisfy the condition in Barron [26], then this bound is also “dimension independent.”
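To make the growth rate ϵ^{−D/r} tangible, the following small sketch tabulates it for a fixed accuracy. The multiplicative constant hidden in the bound is not specified in the text, so the parameter `constant` (set to 1 here) is purely an assumption, and the numbers indicate scaling only.

```python
import math


def neuron_count(eps, D, r, constant=1.0):
    """Order-of-magnitude neuron count constant * eps**(-D/r).

    The true constant in the n-width bound is unknown here; constant=1.0
    is a placeholder chosen only to expose the dependence on D and r.
    """
    return math.ceil(constant * eps ** (-D / r))


# Accuracy eps = 1/2 and smoothness r = 2: the count doubles with every
# two additional input dimensions.
low_dim = neuron_count(0.5, 4, 2)    # 2**2
high_dim = neuron_count(0.5, 20, 2)  # 2**10

# If r grows linearly with D (here r = D), the count no longer depends on D.
matched = neuron_count(0.5, 10, 10)  # 2**1
```

The last line mirrors the remark that the bound becomes dimension independent once the smoothness r is proportional to D.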

While this is definitely unavoidable for neural networks with one hidden layer, the most natural way out is to achieve local approximation; i.e., given an input x, construct a network with a uniformly bounded number of neurons that approximates the target function with the optimal rate of approximation near the point x, preferably using the values of the function also in a neighborhood of x. Unfortunately, this can never be achieved as we proved in Chui et al. [31]. Furthermore, we have proved in Chui et al. [32] that even if we allow each neuron to evaluate its own activation function, this local approximation fails. Therefore the only way out is to use a neural network with more than one hidden layer, called a deep net (for deep neural network). Indeed, local approximation can be achieved by a deep net as proved in our papers [7, 33]. In this regard, it is of interest to point out that an adaptive version of Mhaskar [7, 33] was derived in Mhaskar and Khachikyan [34] for prediction of time series, yielding as much as 150% improvement upon the state-of-the-art at that time, in the study of the flour market problem.

Of course, the curse of dimensionality is inherent to the problem itself, whether with one or more hidden layers. Thus, while it is possible to construct a deep net to approximate a function at each point arbitrarily closely by using a uniformly bounded number of neurons, the uniform approximation on an entire compact set, such as a cube, would still require an approximation at a number of points in the cube, with this number increasing exponentially with the input dimension. Equivalently, the effective number of neurons for approximation on the entire cube is still exponentially increasing with the input dimension.

In addition to the high dimensionality, another difficulty in solving the function approximation problem is that the data may be not just high dimensional but unstructured and sparse. A relatively recent idea which has been found very useful in applications, in fact, too many to list exhaustively, is to consider the points x as being sampled from an unknown, low dimensional sub-manifold 𝕏 of the ambient high dimensional space ℝ^D. The understanding of the geometry of 𝕏 is the subject of the bulk of modern research in the area of diffusion geometry. An introduction to this subject can be found in the special issue [35] of Applied and Computational Harmonic Analysis. The basic idea is to construct the so-called diffusion matrix from the data, and use its eigen-decomposition for finding local coordinate charts and other useful aspects of the manifold. The convergence of the eigen-decomposition of the matrices to that of the Laplace-Beltrami and other differential operators on the manifold is discussed, for example, in Belkin and Niyogi [36], Lafon [37], and Singer [38]. It is shown in Jones et al. [39, 40] that some of the eigenfunctions on the manifolds yield a local coordinate chart on the manifold. In the context of deep learning, this idea is explored as a function approximation problem in Mishne et al. [41], where a deep net is developed in order to learn the coordinate system given by the eigenfunctions.

On the other hand, while much of the research in this direction is focused on understanding the data geometry, the theoretical foundations for the problems of function approximation and harmonic analysis on such data-defined manifolds are developed extensively in Maggioni and Mhaskar [8], Filbir and Mhaskar [42, 43], Mhaskar [44, 45] and Chui and Mhaskar [46]. The theory is developed more recently for kernel construction on directed graphs and analysis of functions on changing data in our paper [47]. However, a drawback of the approach based on data-defined manifolds, known as the out-of-sample extension problem, is that since the diffusion matrix is constructed entirely using the available data, the whole process must be done again if new data become available. A popular idea is then to extend the eigen-functions to the ambient space by using the so-called Nyström extension [48].

The objective of this present paper is to describe a deep learning approach to the problem of function approximation, using three groups of networks in the deep net. The lowest layer accomplishes dimensionality reduction by learning the local coordinate charts on the unknown manifold without using any eigen-decomposition. Having found the local coordinate system, the problem is reduced to the classical problem of approximating a function on a cube in a relatively low dimensional Euclidean space. For the next two layers, we may now apply the powerful techniques from approximation theory to approximate the target function f, given the samples on the training data set $C$ . We describe two approaches to construct the basis functions using multi-layered neural networks, and to construct other networks to use these basis functions in the next layer to accomplish the desired function approximation.

We summarize some of the highlights of our paper.

• We give a very simple learning method for learning the local coordinate chart near each point. The subsequent approximation process is then entirely local to each coordinate patch.

• Our method allows us to solve the pre-image problem easily; i.e., to generate a point on the manifold corresponding to a given local coordinate description.

• The learning method itself does not involve any optimization based technique, except probably for reducing the noise in the values of the function itself.

• We provide optimal error bounds on approximation based on the smoothness of the function, while the method itself does not require an a priori knowledge of such smoothness.

• Our methods can solve easily the out-of-sample extension problem. Unlike the Nyström extension process, our method does not require any elaborate construction of kernels defined on the ambient space and commuting with certain differential operators on the unknown manifold.

• Our method is designed to control the growth of the out-of-sample extension in a tubular neighborhood of 𝕏, and is local to each coordinate patch.

This paper is organized as follows. In section 2, we describe the main ideas in our approach. The local coordinate system is described in detail in section 2.2. Having thus found a local coordinate chart around the input, the problem of function approximation reduces to the classical one. In section 2.3, we demonstrate how the popular basis functions used in this theory can be implemented using neural networks with one or more hidden layers. The function approximation methods which work with unstructured data without using optimization are described in section 2.6. In section 3, we explain how our method can be used to solve both the pre-image problem and the out-of-sample extension problem.

2. Main Ideas and Results

The purpose of this paper is to develop a deep learning algorithm to learn a function f :𝕏 → ℝ, where 𝕏 is a d dimensional compact Riemannian sub-manifold (without boundary) of a Euclidean space ℝ^D, with d ≪ D, given training data of the form $\{(x_{j}, f (x_{j}))\}_{j = 1}^{M}$, x_j ∈ 𝕏. It is important to note that 𝕏 itself is not known in advance; the points x_j are known only as D-dimensional vectors, presumed to lie on 𝕏. In section 2.1, we explain our main idea briefly. In section 2.2, we derive a simple construction of the local coordinate chart for 𝕏. In section 2.3, we describe the construction of a neural network with one or more hidden layers to implement two of the basis functions used commonly in function approximation. While the well known classical approximation algorithms require a specific placement of the training data, one has no control on the location of the data in the current problem. In section 2.6, we give algorithms suitable for the purpose of solving this problem.

2.1. Outline of the Main Idea

Our approach is the following.

1. 𝕏 is a finite union of local coordinate neighborhoods, and x belongs to one of them, say 𝕌. We find a local coordinate system for this neighborhood in terms of Euclidean distances on ℝ^D, say Φ : 𝕌 → [−1, 1]^d, where d is the dimension of the manifold. Let y = Φ(x), and with a relabeling for notational convenience, let $\{x_{j}\}_{j = 1}^{K}$ be the points in 𝕌, y_j = Φ(x_j). This way, we have reduced the problem to approximating g = f ◦ Φ⁻¹:[−1, 1]^d → ℝ at y, given the values $\{(y_{j}, g (y_{j}))\}_{j = 1}^{K}$, where g(y_j) = f(x_j). We note that {y_j} is a subset of the unit cube of low dimensional Euclidean space, representing a local coordinate patch on 𝕏. Thus, the problem of approximation of f on this patch is reduced to that of approximation of g, a well studied classical approximation problem.

2. We will summarize the solution to this problem using neural networks with one or more hidden layers, e.g., an implementation of multivariate tensor product spline approximation using a multi-layered neural network.

Thus, our deep learning networks will have three main layers.

1. The bottom layer receives the input x, figures out which of the points x_j are in the coordinate neighborhood of x, and computes the local coordinates y, y_j.

2. The next several layers compute the local basis functions necessary for the approximation, for example, the B-splines and their translates using the multi-layered neural network as in Mhaskar [7].

3. The last layer receives the data $\{(y_{j}, g (y_{j}))\}_{j = 1}^{K}$, and computes the approximation described in Step 2 above.

2.2. Local Coordinate Learning

We assume that 1 ≤ d ≤ D are integers, 𝕏 is a d dimensional smooth, compact, connected, Riemannian sub-manifold (without boundary) of a Euclidean space ℝ^D, with geodesic distance ρ.

Before we discuss our own construction of a local coordinate chart on 𝕏, we wish to motivate the work by describing a result from Jones et al. [40]. Let $\{λ_{k}^{2}\}_{k = 0}^{\infty}$ be the sequence of eigenvalues of the (negative of the) Laplace-Beltrami operator on 𝕏, and for each k ≥ 0, ϕ_k be the eigenfunction corresponding to the eigenvalue $λ_{k}^{2}$. We define a formal “heat kernel” by

\begin{array}{rcl} K_{t} (x, y) = \sum_{k = 0}^{\infty} exp (- λ_{k}^{2} t) ϕ_{k} (x) ϕ_{k} (y) . & (2.1) \end{array}

The following result is a paraphrasing of the heat triangulation theorem proved in [40, Theorem 2.2.7] under weaker assumptions on 𝕏.

Theorem 2.1. (cf. [40, Theorem 2.2.7]) Let $x_{0}^{*} \in 𝕏$ . There exist constants R > 0, c₁, ⋯ , c₆ > 0 depending on $x_{0}^{*}$ with the following property. Let p₁, ⋯ , p_d be d linearly independent vectors in ℝ^d, and $x_{j}^{*} \in 𝕏$ be chosen so that $x_{j}^{*} - x_{0}^{*}$ is in the direction of p_j, j = 1, ⋯ , d, and

c_{1} R \leq ρ (x_{j}^{*}, x_{0}^{*}) \leq c_{2} R, j = 1, \dots, d,

and $t = c_{3} R^{2}$ . Let B ⊂ 𝕏 be the geodesic ball of radius c₄R, centered at $x_{0}^{*}$ , and

\begin{array}{rcl} Φ_{j m s} (x) = R^{d} (K_{t} (x, x_{1}^{*}), \dots, K_{t} (x, x_{d}^{*})), x \in B . & (2.2) \end{array}

Then

\begin{array}{rcl} \frac{c_{5}}{R} ρ (x_{1}, x_{2}) \leq ‖ Φ_{j m s} (x_{1}) - Φ_{j m s} (x_{2}) ‖_{d} \leq \frac{c_{6}}{R} ρ (x_{1}, x_{2}), x_{1}, x_{2} \in B . & (2.3) \end{array}

Since the paper [40] deals with a very general manifold, the mapping Φ_jms is not claimed to be a diffeomorphism, although it is obviously one-one on B.

We note that even in the simple case of a Euclidean sphere, an explicit expression for the heat kernel is not known. In practice, the heat kernel has to be approximated using appropriate Gaussian networks [37]. In this section, we aim to obtain a local coordinate chart that is computed directly in terms of Euclidean distances on ℝ^D, and depends upon d + 2 trainable parameters. The construction of this chart constitutes the first hidden layer of our deep learning process. As explained in the introduction, once this chart is in place, the question of function extension on the manifold reduces locally to the well studied problem of function extension on a d dimensional unit cube.

To describe our constructions, we first develop some notation.

In this section, it is convenient to use the notation x = (x¹, ⋯ , x^D) ∈ ℝ^D rather than x = (x₁, ⋯ , x_D), which we will use in the rest of the sections. If 1 ≤ d ≤ D is an integer, and x ∈ ℝ^d, ‖x‖_d denotes the Euclidean norm of x. If x ∈ ℝ^D, we will write $π_{c} (x) = (x^{1}, \dots, x^{d})$ , ‖x‖_d = ‖π_c(x)‖_d. If x ∈ ℝ^d, r > 0,

B_{d} (x, r) = \{y \in ℝ^{d} : ‖ x - y ‖_{d} \leq r\} .

There exists δ* > 0 with the following properties. The manifold is covered by finitely many geodesic balls such that for the center $x_{0}^{*} \in 𝕏$ of any of these balls, there exists a diffeomorphism, namely, the exponential coordinate map u = (u¹, ⋯ , u^D) from $B_{d} (0, δ^{*})$ to the geodesic ball around $x_{0}^{*} = u (0)$ [49, p. 65]. If J is the Jacobian matrix for u, given by $J_{i, j} (y) = D_{i} u^{j} (y)$ , $y \in B_{d} (0, δ^{*})$ , then

\begin{array}{rcl} J (0) = [I_{d} | 0_{d, D - d}] . & (2.4) \end{array}

Further, there exists κ > 0 (independent of x*) such that

\begin{array}{rcl} ‖ J (q) - J (0) ‖ \leq κ ‖ q ‖_{d}, q \in B_{d} (0, δ^{*}), & (2.5) \end{array}

where the matrix norm is the induced norm. Let η*: = min(δ*, 1/(2κ)). Then (2.5) implies that

\begin{array}{rcl} 1 / 2 \leq 1 - κ ‖ q ‖_{d} \leq ‖ J (q) ‖ \leq 1 + κ ‖ q ‖_{d} \leq 2, q \in B_{d} (0, η^{*}) . & (2.6) \end{array}

In turn, this leads to

\begin{array}{rcl} (1 / 2) ρ (u (p), u (q)) \leq ‖ p - q ‖_{d} \leq 2 ρ (u (p), u (q)), p, q \in B_{d} (0, η^{*}) . & (2.7) \end{array}

Let $x_{ℓ}^{*} = u (q_{ℓ})$ , ℓ = 1, ⋯ , d, be chosen with the following properties:

\begin{array}{rcl} ‖ q_{ℓ} ‖_{d} \leq η^{*}, ℓ = 1, \dots, d, & (2.8) \end{array}

and, with the matrix function U defined by

\begin{array}{rcl} U_{i, j} (q) = u^{i} (q) - {(x_{j}^{*})}^{i}, & (2.9) \end{array}

we have

\begin{array}{rcl} ‖ J (0) U (0) y ‖_{d} \geq γ > 0, ‖ y ‖_{d} = 1 . & (2.10) \end{array}

Any set $\{x_{ℓ}^{*}\}$ with these properties will be called coordinate stars around $x_{0}^{*}$. We note that the matrix J(0)U(0) has columns given by $π_{c} (x_{0}^{*} - x_{j}^{*})$, j = 1, ⋯ , d, and hence, can be computed without reference to the map u. Let

\begin{array}{rcl} β^{*} : = (1 / 2) min (\frac{1}{2 κ}, δ^{*}, \frac{γ}{8 \sqrt{d}}) . & (2.11) \end{array}

Theorem 2.2. Let $Ψ (q) : = {(‖ u (q) - u (q_{ℓ}) ‖_{D}^{2})}_{ℓ = 1}^{d} \in ℝ^{d}$ . Then

(a) Ψ is a diffeomorphism on $B_{d} (0, 2 β^{*})$ . If $p, q \in B_{d} (0, 2 β^{*})$ , x = u(p), y = u(q), then

\begin{array}{rcl} \frac{γ}{2} ρ (x, y) \leq ‖ Ψ (p) - Ψ (q) ‖_{d} \leq 32 \sqrt{d} η^{*} ρ (x, y) . & (2.12) \end{array}

(b) The function Ψ is a diffeomorphism from $B_{d} (0, β^{*})$ onto $B_{d} (Ψ (0), β^{*})$ .

Remark 2.1. Let $𝔹 = u (B_{d} (0, β^{*})) \subset 𝕏$ be a geodesic ball around $x_{0}^{*}$ . For x ∈ 𝔹, we define

ϕ (x) = Ψ (u^{- 1} (x)) = (‖ x - x_{ℓ}^{*} ‖_{D}^{2}) .

Then Theorem 2.2(b) shows that ϕ is a diffeomorphism from 𝔹 onto $B_{d} (Ψ (0), β^{*})$ . Since $Ψ (0) = (‖ x_{0}^{*} - x_{ℓ}^{*} ‖_{D}^{2})$ ,

Φ (x) = \frac{\sqrt{d}}{β^{*}} (ϕ (x) - Ψ (0)), x \in 𝔹

maps 𝔹 diffeomorphically onto $B_{d} (0, \sqrt{d}) \supset {[- 1, 1]}^{d}$. Let 𝕌 = Φ⁻¹([−1, 1]^d). Then 𝕌 is a neighborhood of $x_{0}^{*}$ and Φ maps 𝕌 diffeomorphically onto [−1, 1]^d. We observe that 𝕏 is a union of finitely many neighborhoods of the form 𝕌, so that any x ∈ 𝕏 belongs to at least one such neighborhood. Moreover, Φ(x) can be computed entirely in terms of the description of x in terms of its D-dimensional coordinates. □
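The map Φ of Remark 2.1 uses only squared Euclidean distances in ℝ^D, which the following minimal sketch makes explicit. The function names, the unit circle used as a toy manifold (d = 1, D = 2), the choice of the coordinate star, and the value β* = 0.25 are all assumptions made for illustration; in particular, β* is not computed from (2.11) here.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance ||x - y||_D^2 in the ambient space."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def local_coordinates(x, center, stars, beta):
    """Phi(x) = (sqrt(d) / beta) * (phi(x) - Psi(0)) as in Remark 2.1.

    phi(x) has components ||x - x_l^*||^2, and Psi(0) has components
    ||x_0^* - x_l^*||^2, where `center` is x_0^* and `stars` lists x_1^*..x_d^*.
    """
    d = len(stars)
    scale = math.sqrt(d) / beta
    return [
        scale * (squared_distance(x, s) - squared_distance(center, s))
        for s in stars
    ]


def coordinates_in_cube(x, center, stars, beta):
    """Check Phi(x) in [-1, 1]^d. This is only a necessary condition for
    x to lie in the patch; it presumes x is already known to be near x_0^*."""
    return all(abs(c) <= 1.0 for c in local_coordinates(x, center, stars, beta))


# Toy setting (assumed): unit circle in R^2, center at angle 0, star at angle 0.5.
center = (1.0, 0.0)
stars = [(math.cos(0.5), math.sin(0.5))]
beta = 0.25  # hypothetical value of beta*, chosen by hand

origin_image = local_coordinates(center, center, stars, beta)
y = local_coordinates((math.cos(0.1), math.sin(0.1)), center, stars, beta)
```

The center is sent to the origin of the cube, and a nearby point on the circle receives a single coordinate in [−1, 1], without any reference to the exponential map u.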

Remark 2.2. The trainable parameters are thus β*, and the points $x_{0}^{*}, \dots, x_{d}^{*}$ . Since ‖J(0)‖ = 1, the condition (2.10) is satisfied if $x_{ℓ}^{*} - x_{0}^{*}$ are along linearly independent directions as in Theorem 2.1. □

Remark 2.3. Since the mapping Φ in Remark 2.1 is a quadratic polynomial in x, it can be implemented as a neural network with a single hidden layer using the activation function given in (2.22) as described in section 2.4.
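One way to see why a single hidden layer suffices is the elementary identity t² = σ(t) + σ(−t) for the rectified quadratic unit σ of (2.22). The sketch below uses it to compute one component ‖x − x_ℓ*‖²_D of ϕ with 2D hidden neurons; this is an illustration under that identity and is not claimed to be the construction given in section 2.4. The function names are hypothetical.

```python
def requ(t):
    """Rectified quadratic unit: t**2 for t >= 0, and 0 for t < 0."""
    return t * t if t >= 0 else 0.0


def squared_distance_requ(x, anchor):
    """Compute ||x - anchor||^2 with one hidden layer of 2D ReQU neurons.

    For each coordinate i, two neurons evaluate sigma(x^i - a^i) and
    sigma(-x^i + a^i); exactly one of them is nonzero, and their sum is
    (x^i - a^i)^2. The output layer adds all hidden outputs with weight 1.
    """
    hidden = []
    for x_i, a_i in zip(x, anchor):
        hidden.append(requ(x_i - a_i))   # active when x^i >= a^i
        hidden.append(requ(-x_i + a_i))  # active when x^i <= a^i
    return sum(hidden)


# Coordinate differences are 1, -3, 0, so the result is 1 + 9 + 0.
example = squared_distance_requ((1.0, -2.0, 0.5), (0.0, 1.0, 0.5))
```

Each hidden neuron applies σ to an affine function of x, and the affine rescaling by √d/β* and the shift by Ψ(0) can be absorbed into the output layer.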

Example 2.1. Let 0 < a < 1, and M = M_a be the circular helix defined by

u (s) = {(cos a s, sin a s, \sqrt{1 - a^{2}} s)}^{T} .

Clearly, M is a one dimensional manifold, and s is the arclength parameter, measured from (1, 0, 0)^T. The curvature at any point is a². For any point z₀ ∈ M, U_z₀ = M, with the diffeomorphism given by u(tan(πt/2)), t ∈ (−1, 1). An interesting fact is that $‖ u (s + 2 π / a) - u (s) ‖_{3} = 2 π \sqrt{1 - a^{2}} / a$ can be made arbitrarily small by choosing a close to 1, even though the geodesic distance between u(s + 2π/a) and u(s) is 2π/a. Let s₀ ∈ ℝ, s₀ + π/4 ≤ s₁ ≤ s₀ + 3π/8, and U: = {u(s) : |s − s₀| ≤ π/8}. Let $Ψ (s) : = ‖ u (s) - u (s_{1}) ‖_{3}^{2}$. It is easy to calculate that

Ψ (s) = 2 - 2 cos (a (s - s_{1})) + (1 - a^{2}) {(s - s_{1})}^{2},

so that

Ψ^{'} (s) = 2 a sin (a (s - s_{1})) + 2 (1 - a^{2}) (s - s_{1}) .

If u(s) ∈ U, then π/8 ≤ s₁ − s ≤ π/2. Therefore, using the well known estimates

\frac{2 θ}{π} \leq sin (θ) \leq θ, θ \in [0, π / 2],

we obtain that

\begin{array}{rcl} | Ψ^{'} (s) | \leq 2 a^{2} (s_{1} - s) + 2 (1 - a^{2}) (s_{1} - s) \leq π, & (2.13) \end{array}

and

\begin{array}{rcl} | Ψ^{'} (s) | \geq \frac{4 a^{2}}{π} (s_{1} - s) + 2 (1 - a^{2}) (s_{1} - s) \geq \frac{π}{8} (2 - 2 a^{2} (1 - 2 / π)) \geq 1 / 2 . & (2.14) \end{array}

Hence, for any points u(t₁), u(t₂) ∈ U, we have

(1 / 2) | t_{1} - t_{2} | \leq | ‖ u (t_{1}) - u (s_{1}) ‖_{3}^{2} - ‖ u (t_{2}) - u (s_{1}) ‖_{3}^{2} | \leq π | t_{1} - t_{2} | .

We note that the neighborhood U where this estimate holds and the constants are independent of the curvature. □
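The two-sided estimate of Example 2.1 can be checked numerically on a grid. The sketch below is only a sanity check for a few sample values of a and s₁, not a proof; the function names, the grid size, and the small floating-point tolerance are assumptions.

```python
import math


def helix(s, a):
    """Point u(s) on the circular helix with parameter 0 < a < 1."""
    return (math.cos(a * s), math.sin(a * s), math.sqrt(1.0 - a * a) * s)


def psi(s, s1, a):
    """Psi(s) = ||u(s) - u(s1)||_3^2."""
    return sum((p - q) ** 2 for p, q in zip(helix(s, a), helix(s1, a)))


def check_bilipschitz(a, s1_offset, s0=0.0, samples=40, tol=1e-12):
    """Verify (1/2)|t1 - t2| <= |Psi(t1) - Psi(t2)| <= pi |t1 - t2|
    for all grid points t1, t2 with |t - s0| <= pi/8, where s1 = s0 + s1_offset."""
    s1 = s0 + s1_offset
    grid = [
        s0 - math.pi / 8 + (math.pi / 4) * i / (samples - 1)
        for i in range(samples)
    ]
    for t1 in grid:
        for t2 in grid:
            gap = abs(psi(t1, s1, a) - psi(t2, s1, a))
            dt = abs(t1 - t2)
            if not (0.5 * dt - tol <= gap <= math.pi * dt + tol):
                return False
    return True


# s1 - s0 must lie in [pi/4, 3*pi/8]; both endpoints are tried below.
results = [
    check_bilipschitz(a, offset)
    for a in (0.1, 0.5, 0.99)
    for offset in (math.pi / 4, 3 * math.pi / 8)
]
```

In line with the closing remark of the example, the same constants 1/2 and π are used for every value of a tried here.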

The remainder of this section is devoted to the proof of Theorem 2.2.

Lemma 2.1. Let $q \in B_{d} (0, η^{*})$. Each of the following statements holds for the matrix U defined in (2.9):

\begin{array}{rcl} ‖ U (q) - U (0) ‖ \leq 2 \sqrt{d} ‖ q ‖_{d}, & (2.15) \end{array}

\begin{array}{rcl} ‖ U (q) ‖ \leq 2 \sqrt{d} max_{1 \leq ℓ \leq d} ‖ q - q_{ℓ} ‖_{d} \leq 4 \sqrt{d} η^{*}, & (2.16) \end{array}

\begin{array}{rcl} ‖ J (q) U (q) - J (0) U (0) ‖ \leq 4 \sqrt{d} ‖ q ‖_{d} . & (2.17) \end{array}

With β* as in (2.11), for $‖ q ‖_{d} \leq 2 β^{*}$ , ‖y‖_d = 1,

\begin{array}{rcl} ‖ J (q) U (q) y ‖_{d} \geq γ / 2 . & (2.18) \end{array}

Proof. In view of (2.6) and the mean value theorem, we have for $‖ p ‖_{d} \leq η^{*}$,

\begin{array}{rcl} ‖ u (q) - u (p) ‖_{D} \leq 2 ‖ q - p ‖_{d} . & (2.19) \end{array}

We observe further that for any integers m, ℓ, $U {(q)}_{m, ℓ} - U {(0)}_{m, ℓ} = u^{m} (q) - u^{m} (0)$ . Consequently, for any y ∈ ℝ^d, ‖y‖_d ≤ 1,

‖ (U (q) - U (0)) y ‖_{D} = ‖ u (q) - u (0) ‖_{D} | \sum_{ℓ = 1}^{d} y^{ℓ} | \leq 2 \sqrt{d} ‖ q ‖_{d} ‖ y ‖_{d} .

This proves (2.15).

In view of (2.19), used with q_ℓ in place of p, ℓ = 1, ⋯ , d, we obtain for all y ∈ ℝ^d, ‖y‖_d ≤ 1,

\begin{array}{rcl} {‖ \sum_{ℓ = 1}^{d} y^{ℓ} (u (q) - u (q_{ℓ})) ‖}_{D} & \leq & \sum_{ℓ = 1}^{d} | y^{ℓ} | {‖ u (q) - u (q_{ℓ}) ‖}_{D} \\ & \leq & 2 \sqrt{d} \max_{1 \leq ℓ \leq d} ‖ q - q_{ℓ} ‖_{d} . \end{array}

This proves (2.16).

In view of (2.5), (2.16), (2.15), we obtain for $‖ q ‖_{d} \leq η^{*} \leq 1 / (2 κ)$ that

\begin{array}{l} ‖ J (q) U (q) - J (0) U (0) ‖ = \\ ‖ (J (q) - J (0)) U (q) + J (0) (U (q) - U (0)) ‖ \\ \leq ‖ J (q) - J (0) ‖ ‖ U (q) ‖ + ‖ J (0) ‖ ‖ U (q) - U (0) ‖ \\ \leq 4 \sqrt{d} η^{*} κ ‖ q ‖_{d} + 2 \sqrt{d} ‖ q ‖_{d} = 2 \sqrt{d} (1 + 2 η^{*} κ) ‖ q ‖_{d} \\ \leq 4 \sqrt{d} ‖ q ‖_{d} . \end{array}

This proves (2.17). The estimate (2.18) follows easily from this and (2.10). □

PROOF OF THEOREM 2.2. In this proof only, let $\mathcal{J} (q)$ be the Jacobian of Ψ: ${\mathcal{J}}_{i, j} (q) = D_{i} (‖ u (q) - u (q_{j}) ‖^{2})$ . Then $\mathcal{J} (q) = 2 J (q) U (q)$ . The estimate (2.10) shows that $\mathcal{J} (0)$ is invertible, and that $‖ \mathcal{J} {(0)}^{- 1} ‖ \leq 1 / (2 γ)$ . The estimate (2.17) then shows that

\begin{array}{rcl} ‖ \mathcal{J} (q) - \mathcal{J} (0) ‖ \leq 1 / (2 ‖ \mathcal{J} {(0)}^{- 1} ‖), ‖ q ‖_{d} \leq 2 β^{*} . & (2.20) \end{array}

Therefore, the inverse function theorem as given in [50, Theorem 9.24 and its proof] implies that Ψ is a diffeomorphism on B(0, 2β*) as claimed. For $‖ q ‖_{d} \leq 2 β^{*}$ , (2.18) shows that $‖ \mathcal{J} {(q)}^{- 1} ‖ \leq 1 / γ$ . Also, (2.16) and (2.6) show that $‖ \mathcal{J} (q) ‖ \leq 16 \sqrt{d} η^{*}$ . Hence, the mean value theorem implies that

\begin{array}{rcl} γ ‖ q - p ‖_{d} \leq ‖ Ψ (q) - Ψ (p) ‖_{d} \leq 16 \sqrt{d} η^{*} ‖ q - p ‖_{d} . & (2.21) \end{array}

Together with (2.7), this implies (2.12).

Part (b) also follows from [50, Theorem 9.24 and its proof] and (2.20). □

2.3. Local Basis Functions

Having found a local coordinate map Φ on a neighborhood 𝕌 of x on 𝕏, the problem of extending f from {x_j}∩𝕌 to 𝕌 is reduced to extending f◦Φ from $C = \{y_{j} = Φ (x_{j})\} \subset {[- 1, 1]}^{d}$ , a classical approximation problem. There is, of course, more than a century of research on this subject. We restrict ourselves to two examples, which can be implemented using neural networks with one or more hidden layers. One of the most popular activation functions in the deep learning literature (e.g., [14]) is the rectified linear unit function

t_{+} = max (0, t) .

Since this function is not continuously differentiable, there are some technical difficulties in using common algorithms like back-propagation with this activation function. Although we do not need back-propagation in our theory, we prefer to deal with a rectified quadratic unit function defined for t ∈ ℝ by

\begin{array}{rcl} σ (t) = \begin{cases} t^{2}, & \text{if } t \geq 0, \\ 0, & \text{if } t < 0, \end{cases} & (2.22) \end{array}

which is continuously differentiable on ℝ. Our theory will work in general with any activation function of order k ≥ 2; i.e., with a function σ that satisfies

\begin{array}{rcl} lim_{t \to - \infty} \frac{σ (t)}{t^{k}} = 0, lim_{t \to \infty} \frac{σ (t)}{t^{k}} = 1, & (2.23) \end{array}

but for the sake of clarity of exposition, we will use only the activation function σ defined in (2.22).

2.4. Polynomials

The most basic class of classical approximants is the set of all polynomials. For n > 0, we denote the class of all algebraic polynomials of coordinatewise degree at most n in d variables by $Π_{n}^{d}$ . (It is convenient to use the same notation also when n is not an integer; in this case, $Π_{n}^{d}$ is just $Π_{⌊ n ⌋}^{d}$ .)

The basic implementation of polynomials is given in [22, Proof of Theorem 3.1], where an explicit construction is given for finding the weights {w_k}, the thresholds {b_k} and the coefficients {a_k} used in (2.24) below.

Theorem 2.3. Let n > 0, $N = 2^{⌈ {log}_{2} n ⌉}$ , $P \in Π_{N}^{d} \supseteq Π_{n}^{d}$ , then there exist weights $w_{1}, \dots, w_{dim (Π_{N}^{d})}$ and real numbers $a_{1}, \dots, a_{dim (Π_{N}^{d})}$ , $b_{1}, \dots, b_{dim (Π_{N}^{d})}$ such that

\begin{array}{rcl} P (x) = \sum_{k = 1}^{dim (Π_{N}^{d})} a_{k} {(w_{k} \cdot x + b_{k})}^{N}, x \in ℝ^{d} . & (2.24) \end{array}

Here, the weights {w_k} and the thresholds {b_k} are independent of P and the coefficients {a_k} are linear functionals on $Π_{N}^{d}$ .

We observe that

\begin{array}{rcl} t^{2} = σ (t) + σ (- t), t \in ℝ, & (2.25) \end{array}

while

\begin{array}{rcl} t^{N} = {({(t^{2})}^{2} \dots)}^{2}, ({log}_{2} N times), & (2.26) \end{array}

so that the expression on the right hand side of (2.24) can be expressed as a neural network with log₂N hidden layers.
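As an illustration of (2.25) and (2.26), the following minimal Python sketch computes t^N, for N a power of 2, by composing log₂N squaring layers, each built from two rectified quadratic units. The function names `requ`, `square`, and `power_of_two` are ours and are not part of the construction in [22]; only scalar inputs are handled.

```python
def requ(t):
    """Rectified quadratic unit (2.22): t^2 for t >= 0, and 0 otherwise."""
    return t * t if t >= 0 else 0.0


def square(t):
    """One hidden layer with two ReQU neurons, using t^2 = requ(t) + requ(-t) as in (2.25)."""
    return requ(t) + requ(-t)


def power_of_two(t, N):
    """Return (t**N, number of squaring layers used) for N a power of 2, following (2.26)."""
    if N < 1 or N & (N - 1) != 0:
        raise ValueError("N must be a power of 2")
    layers = 0
    while N > 1:
        t = square(t)  # each pass doubles the exponent
        N //= 2
        layers += 1
    return t, layers


value, depth = power_of_two(-1.5, 8)  # depth equals log2(8) = 3
```

The depth returned by `power_of_two` matches the count of log₂N hidden layers stated above; the linear combination over k in (2.24) would form one further output layer.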

We note that a neural network with one hidden layer is given in Mhaskar [30], but using a C^∞ activation function σ; e.g., σ(t) = (1 + e^{−t})^{−1}. This uses the fact that for w, x ∈ ℝ^d, multi-integers k ≥ 0, and b ∈ ℝ such that none of the derivatives σ^(j)(b), j = 0, 1, ⋯ , is equal to 0,

\begin{array}{rcl} \frac{\partial^{| k |}}{\partial w^{k}} σ (w \cdot x + b) \Big|_{w = 0} = x^{k} σ^{(| k |)} (b) . & (2.27) \end{array}

A finite difference scheme to implement this differentiation yields a neural network with one hidden layer, containing exactly $dim (Π_{n}^{d})$ neurons, and should be stable for C^∞ functions. If stability is a greater concern, then one may use other numerical differentiation schemes to implement this formula, e.g., spectral methods [51].
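To make the finite difference idea concrete in the simplest case d = 1, k = 2, the sketch below approximates x² by a second-order central difference in w of σ(wx + b) at w = 0, which involves three sigmoidal neurons. The step size `h`, the threshold `b = 1.0`, and the function names are illustrative assumptions of ours, not the scheme of [30].

```python
import math


def sigmoid(t):
    """Logistic function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_second_derivative(b):
    """Closed form sigma''(b) = s (1 - s) (1 - 2 s), with s = sigma(b)."""
    s = sigmoid(b)
    return s * (1.0 - s) * (1.0 - 2.0 * s)


def square_via_sigmoid(x, b=1.0, h=1e-3):
    """Approximate x^2 from (2.27) with d = 1, k = 2.

    The second derivative in w at w = 0 is replaced by a central difference,
    so the result is a combination of the three neurons
    sigma(h x + b), sigma(b), and sigma(-h x + b).
    """
    second_diff = (sigmoid(h * x + b) - 2.0 * sigmoid(b) + sigmoid(-h * x + b)) / h**2
    return second_diff / sigmoid_second_derivative(b)  # requires sigma''(b) != 0


approx = square_via_sigmoid(0.3)  # close to 0.09, up to discretization error
```

Shrinking `h` reduces the truncation error but amplifies rounding error through the division by h², which is the stability trade-off mentioned above.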

2.5. B-splines

For t ∈ ℝ, and integer m ≥ 1, let

t_{+}^{m} = \begin{cases} t^{m}, & \text{if } t \geq 0, \\ 0, & \text{otherwise.} \end{cases}

A tensor product cardinal B-spline at y∈[−1, 1]^d is defined by

\begin{array}{rcl} N_{m} (y) = \frac{1}{(m - 1)!^{d}} \sum_{\substack{k \in ℤ^{d}, \\ k \geq 0, | k |_{\infty} \leq m}} {(- 1)}^{| k |_{1}} \prod_{j = 1}^{d} \binom{m}{k_{j}} \prod_{j = 1}^{d} {(y_{j} - k_{j})}_{+}^{m} . & (2.28) \end{array}

It is explained in Mhaskar [7, 33] that the quantity N_m(y) can be computed using a neural network with a sigmoidal function of order m−1 consisting of finitely many neurons arranged in multiple hidden layers (the number of neurons and layers depending on m and d alone). Thus, if m is a power of 2, then each of the terms ${(y_{j} - k_{j})}_{+}^{m}$ can be implemented as an iterated power of ${(y_{j} - k_{j})}_{+}^{2}$ (cf. (2.26)). The product of d such expressions can be implemented using either Theorem 2.3 as a network with multiple hidden layers and utilizing the rectified quadratic unit function as the activation function, or a discretization of the formula (2.27) using a C^∞ sigmoidal function as explained in section 2.4.

2.6. Function Approximation

In this section, if y ∈ ℝ^d, |y|_∞ is the ℓ^∞ norm of y.

2.6.1. Spline Based Approximation

In [52, section 4.5], [53], a quasi-interpolatory spline function is defined by

\begin{array}{rcl} Q_{m} (f) (y) = \sum_{k \in ℤ^{d}} λ_{m}^{*} (f (\cdot + k)) N_{m} (y + m / 2 + k), y \in ℝ^{d}, & (2.29) \end{array}

where $λ_{m}^{*}$ are compactly supported linear functionals, designed specifically to ensure that Q_m(P) = P for every polynomial P of coordinatewise degree at most m − 1 in d variables. With Q_{m, h}(f)(y) = Q_m(f(h(·)))(y/h), h > 0, one has the approximation bound for small h:

\begin{array}{rcl} max_{y \in {[- 1, 1]}^{d}} | Q_{m, h} (f) (y) - f (y) | = O (h^{m}) . & (2.30) \end{array}

The linear functionals $λ_{m}^{*}$ are based on finitely many samples of f at the grid points in a compact subset of ℤ^d. In our context, the data for approximating f is not in this form. Therefore, we may use the following algorithm given in Mhaskar et al. [54], where we assume that $λ_{m}^{*}$ is scaled so as to be supported on [−1, 1]^d.

Given: A set $C = \{ξ_{j}\}$ of points in [−1, 1]^d. Let

δ (C) = max_{x \in {[- 1, 1]}^{d}} min_{ξ \in C} | x - ξ |_{\infty},

and $δ (C)$ be sufficiently small.

Objective: To find real numbers ${\{a_{ξ}\}}_{ξ \in C}$ such that the functional

\begin{array}{rcl} γ (f) : = \sum_{ξ \in C} a_{ξ} f (ξ) & (2.31) \end{array}

satisfies

γ (P) = λ_{m}^{*} (P) \quad \text{if } P \text{ is a } d\text{-variate polynomial of coordinatewise degree} \leq m - 1 .

Steps:

1. Divide [−1, 1]^d into congruent subcubes of side not exceeding $2 δ (C)$ .

2. Choose $C_{0} \subseteq C$ , so that each subcube has exactly one point of $C_{0}$ .

3. Solve the following (underdetermined) system of equations for the unknowns a_ξ, $ξ \in C_{0}$ .

\begin{array}{rcl} \sum_{ξ \in C} a_{ξ} ξ^{k} = λ_{m}^{*} ({(\cdot)}^{k}), | k |_{\infty} \leq m - 1 . & (2.32) \end{array}

4. Set a_ξ := 0 if $ξ \in C \setminus C_{0}$ .

5. Output ${(a_{ξ})}_{ξ \in C}$ .

Substituting γ in place of $λ_{m}^{*}$ in the definition of Q_m(f) yields the desired spline approximation

\begin{array}{rcl} {\tilde{Q}}_{m} (f) (y) = \sum_{k \in ℤ^{d}} γ (f (\cdot + k)) N_{m} (y + m / 2 + k), y \in ℝ^{d}, & (2.33) \end{array}

and it is proved in Mhaskar et al. [54] that the estimate (2.30) holds with ${\tilde{Q}}_{m}$ replacing Q_m.

2.6.2. Polynomial Quasi-Interpolation

A standard method for polynomial approximation is to consider a filtered projection defined in (2.38) below.

The Chebyshev polynomials (of first kind) are defined recursively for t ∈ ℝ and integer m ≥ 0 by

\begin{array}{rcl} T_{0} (t) = 1, T_{1} (t) = t, T_{m} (t) = 2 t T_{m - 1} (t) - T_{m - 2} (t) . & (2.34) \end{array}

In terms of monomials, the Chebyshev polynomials are given by

\begin{array}{l} T_{2 n} (t) = {(- 1)}^{n} \sum_{j = 0}^{n} \frac{{(- 4)}^{j}}{(2 j)!} \prod_{ℓ = 1}^{j} (n^{2} - {(j - ℓ)}^{2}) t^{2 j}, \\ T_{2 n + 1} (t) = {(- 1)}^{n} (2 n + 1) \sum_{j = 0}^{n} \frac{{(- 1)}^{j}}{(2 j + 1)!} \prod_{ℓ = 1}^{j} ({(2 n + 1)}^{2} - {(2 j - 2 ℓ + 1)}^{2}) t^{2 j + 1} . & (2.35) \end{array}

For y ∈ ℝ^d and multi-integer m = (m¹, ⋯ , m^d) ≥ 0, the tensor product Chebyshev polynomial is defined by

\begin{array}{rcl} T_{m} (y) = \prod_{j = 1}^{d} T_{m^{j}} (y^{j}) . & (2.36) \end{array}

We choose a smooth low pass filter h; i.e., an even function h:ℝ → [0, 1] such that h(u) = 1 if |u| ≤ 1/2, and h(u) = 0 if |u| ≥ 1, and abuse the notation as usual to define

h (u) = \prod_{j = 1}^{d} h (u^{j}) .

With this filter, we define the kernel

\begin{array}{rcl} Φ_{n} (y, t) = \sum_{k \in ℤ^{d}} h (k / n) T_{k} (y) T_{k} (t), n > 0, y, t \in {[- 1, 1]}^{d} . & (2.37) \end{array}

Then the filtered projection operator is defined by

\begin{array}{rcl} V_{n} (f) (y) = \int_{{[- 1, 1]}^{d}} f (t) Φ_{n} (y, t) \frac{d t}{\sqrt{(1 - {(t^{1})}^{2}) \dots (1 - {(t^{d})}^{2})}}, y \in {[- 1, 1]}^{d} . & (2.38) \end{array}

It is well known that if f is any continuous function on [−1, 1]^d, then {V_n(f)} converges uniformly to f at the near optimal rate of approximation. For example, if f has partial derivatives up to order r in each variable, then analogously to (2.30), but for large n rather than small h,

\begin{array}{rcl} max_{y \in {[- 1, 1]}^{d}} | V_{n} (f) (y) - f (y) | = O (n^{- r}) . & (2.39) \end{array}

Theoretically, the question then is to compute V_n(f) using the data $C$ as in section 2.6.1. The procedure we describe below from Mhaskar [55, 56] and Mhaskar et al. [5] also describes the choice of the parameter n depending upon the data.

Given: A set $C = \{ξ_{j}\}$ of points in [−1, 1]^d. Let

δ^{◦} (C) = max_{x \in {[- 1, 1]}^{d}} min_{ξ \in C} max_{1 \leq j \leq d} | arccos (x^{j}) - arccos (ξ^{j}) | .

and $δ^{◦} (C)$ be sufficiently small. We also fix an integer n > 0.

Objective: To find real numbers ${\{w_{ξ}\}}_{ξ \in C}$ such that the functional

\begin{array}{rcl} γ^{◦} (f) : = \sum_{ξ \in C} w_{ξ} f (ξ) & (2.40) \end{array}

satisfies

γ^{◦} (P) = \int_{{[- 1, 1]}^{d}} \frac{P (t) d t}{\sqrt{(1 - {(t^{1})}^{2}) \dots (1 - {(t^{d})}^{2})}},

for all d-variate polynomials P of coordinatewise degree ≤ n − 1.

Steps:

1. Divide [−1, 1]^d into congruent subcubes of side not exceeding $2 δ^{◦} (C)$ .

2. Choose $C_{0} \subseteq C$ , so that each subcube has exactly one point of $C_{0}$ .

3. Solve the following (underdetermined) system of equations for the unknowns w_ξ, $ξ \in C_{0}$ .

\begin{array}{rcl} \sum_{ξ \in C} w_{ξ} T_{k} (ξ) = \int_{{[- 1, 1]}^{d}} \frac{T_{k} (t) d t}{\sqrt{(1 - {(t^{1})}^{2}) \dots (1 - {(t^{d})}^{2})}} = δ_{k, 0}, | k |_{\infty} \leq n - 1 . & (2.41) \end{array}

4. Set w_ξ := 0 if $ξ \in C \setminus C_{0}$ .

5. Output ${(w_{ξ})}_{ξ \in C}$ .
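The steps above can be sketched in Python for d = 1 as follows. This is only an illustration under our own assumptions: the number of subintervals `n_cells` is passed in directly instead of being derived from δ°(C), empty subintervals are simply skipped, and the underdetermined system (2.41) is solved in the minimum-norm least squares sense with `numpy.linalg.lstsq`. The function name `quadrature_weights` is hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb


def quadrature_weights(xi, n, n_cells):
    """Weights w_xi with sum_xi w_xi T_k(xi) = delta_{k,0} for k = 0, ..., n-1 (d = 1)."""
    xi = np.asarray(xi, dtype=float)

    # Steps 1-2: split [-1, 1] into n_cells congruent subintervals and keep
    # the first data point found in each nonempty subinterval (the set C_0).
    cell = np.minimum(((xi + 1.0) / 2.0 * n_cells).astype(int), n_cells - 1)
    _, keep = np.unique(cell, return_index=True)

    # Step 3: one equation per degree k, one unknown per point of C_0.
    A = cheb.chebvander(xi[keep], n - 1).T  # A[k, j] = T_k(xi_j)
    rhs = np.zeros(n)
    rhs[0] = 1.0  # right-hand side delta_{k,0}, as in (2.41)
    w0, *_ = np.linalg.lstsq(A, rhs, rcond=None)  # minimum-norm solution

    # Steps 4-5: zero weights outside C_0, then output all weights.
    w = np.zeros_like(xi)
    w[keep] = w0
    return w


rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=200)
weights = quadrature_weights(points, n=8, n_cells=40)
```

Monitoring the condition number of `A`, or the residual of the solved system, as `n` grows is one way to carry out the practical choice of n discussed below.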

It is proved in the papers cited above that with the discretized operator

\begin{array}{rcl} V_{n, C} (f) (y) = \sum_{ξ \in C} w_{ξ} f (ξ) Φ_{n} (y, ξ), & (2.42) \end{array}

one obtains the near-best rates of approximation. In particular, if f has partial derivatives up to order r in each variable, then (2.39) holds with $V_{n, C} (f)$ replacing V_n(f). In practice, we choose n to be the largest integer for which the condition number of the system of equations in (2.41) is “reasonable”, or else we choose n by checking the resulting errors in (2.41) [57].

The formula (2.42) can be re-written in the form

\begin{array}{rcl} V_{n, C} (f) (y) = \sum_{| k |_{\infty} \leq n - 1} h (k / n) \hat{f} (C, k) T_{k} (y), & (2.43) \end{array}

where

\begin{array}{rcl} \hat{f} (C, k) = \sum_{ξ \in C} w_{ξ} f (ξ) T_{k} (ξ) . & (2.44) \end{array}

Therefore, rather than evaluating the Chebyshev polynomials as defined in (2.36), (2.35), one can use a multi-layered network to evaluate $V_{n, C} (f)$ in a more stable manner as follows. The first layer computes the coefficients $h (k / n) \hat{f} (C, k)$ using the available data. The output of this layer is input to a recurrent network to execute a multi-variate version of the well known Clenshaw algorithm [58, pp. 78–80].
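For orientation, the univariate Clenshaw recurrence for a Chebyshev sum Σ_k c_k T_k(y) can be sketched as follows; a tensor-product version can apply it one coordinate at a time. The function name `clenshaw` and the plain-list coefficient layout are our assumptions, and this sketch is not the recurrent network implementation itself.

```python
def clenshaw(c, y):
    """Evaluate sum_{k=0}^{n} c[k] * T_k(y) without forming any T_k explicitly.

    Uses the backward recurrence b_k = c_k + 2 y b_{k+1} - b_{k+2}, which
    mirrors the three-term recurrence (2.34), with b_{n+1} = b_{n+2} = 0.
    """
    b1 = 0.0  # plays the role of b_{k+1}
    b2 = 0.0  # plays the role of b_{k+2}
    for ck in reversed(c[1:]):
        b1, b2 = ck + 2.0 * y * b1 - b2, b1
    return c[0] + y * b1 - b2


# 1*T_0 + 2*T_1 + 3*T_2 at y = 0.5 equals 1 + 1 + 3*(2*0.25 - 1) = 0.5
value = clenshaw([1.0, 2.0, 3.0], 0.5)
```

In the setting of (2.43), the list `c` would hold the filtered coefficients $h (k / n) \hat{f} (C, k)$ computed by the first layer.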

3. Extensions

Since the starting point of diffusion geometry is to consider the eigen-decomposition of a diffusion matrix, which is constructed using the available data, the entire computation needs to be redone whenever new data become available. Since the manifold 𝕏 is only an abstract model, it is not even clear that the new data will belong to this manifold. This gives rise to two related questions. One is to find new points on the manifold, the so-called pre-image problem [41]; the other is the out-of-sample extension problem, i.e., to extend the target function to points not necessarily on 𝕏. In this section, we make some comments on how to use the theory described in the previous sections for solving these problems.

3.1. Pre-image Problem

In Remark 2.1, we have given an onto diffeomorphism Φ:𝕌 → [−1, 1]^d where 𝕌 ⊂ 𝕏. The pre-image problem is the following. Given a point y ∈ [−1, 1]^d, find x ∈ 𝕏 such that Φ(x) = y. This amounts to approximating the D-output function Φ⁻¹ on [−1, 1]^d, given its values at the known points $\{Φ (ξ) : ξ \in C\}$ , and can therefore be solved using any of the techniques described in the previous sections.

3.2. Out of Sample Extension

One well known strategy for function extension outside the manifold is the following. One starts with a compact, positive-semi-definite symmetric kernel K : ℝ^D × ℝ^D → ℝ and considers the eigen-decomposition of K restricted to 𝕏 × 𝕏; thus, for example, if μ* is the volume measure on 𝕏, one finds numbers λ_k ≥ 0 and orthonormal functions ϕ_k on 𝕏 such that

\begin{array}{l} \int_{𝕏} K (x, y) ϕ_{k} (y) d μ^{*} (y) = λ_{k} ϕ_{k} (x), k = 0, 1, \dots, x \in 𝕏 . & (3.1) \end{array}

A function on 𝕏 can then be expanded in terms of the orthonormal system of functions {ϕ_k}. The extension to ℝ^D is achieved by treating (3.1) as a definition of ϕ_k on ℝ^D (which can be done since K is defined on ℝ^D × ℝ^D), and taking as the desired extension the expansion of the original function f in which the basis functions are interpreted as extended by (3.1). This leads to a variety of theoretical problems related to the judicious construction of kernels defined on the whole space whose eigenfunctions are meaningful as functions on 𝕏 (e.g., kernels that commute with the Laplace-Beltrami operator) so as to allow such a construction. In the end, it is not clear how well this extension will behave outside of 𝕏.

Our construction gives an alternative method for extending a function on 𝕏 to a tubular neighborhood of 𝕏, which we feel is more appropriate for most applications rather than trying to extend the function to the entire ambient space. Toward this goal, we first explain the local coordinate learning phase for tubular neighborhood of 𝕏.

Let s ≥ d be an integer, s ≤ D. For q ∈ ℝ^s (or q ∈ ℝ^D), we write $π_{c} (q) = (q^{1}, \dots, q^{d})$ , and

v (q) = u (π_{c} (q)) + (\underbrace{0, \dots, 0}_{d\ \text{times}}, q^{d + 1}, \dots, q^{s}, \underbrace{0, \dots, 0}_{D - s\ \text{times}}) \in ℝ^{D} .

If J_s is the Jacobian matrix for v, i.e., ${(J_{s})}_{i, j} (q) = D_{i} v^{j} (q)$ , then it is easy to check that

\begin{array}{rcl} J_{s} (0) = [I_{s} | 0_{s, D - s}] . & (3.2) \end{array}

Moreover, if $p, q \in B_{d} (0, δ^{*})$ , then p = π_c(p), q = π_c(q). Therefore,

\begin{array}{rcl} (J_{s} (q) - J_{s} (p)) (z) = (J_{d} (π_{c} (q)) - J_{d} (π_{c} (p))) (z), p, q \in B_{d} (0, δ^{*}), z \in ℝ^{D} . & (3.3) \end{array}

Consequently,

\begin{array}{rcl} ‖ J_{s} (q) - J_{s} (0) ‖ \leq κ ‖ π_{c} (q) ‖_{d} \leq κ ‖ q ‖_{s} . & (3.4) \end{array}

If we now define for $p, q \in B_{d} (0, δ^{*})$

ρ_{1} {(v (p), v (q))}^{2} : = ρ {(u (π_{c} (p)), u (π_{c} (q)))}^{2} + \sum_{k = d + 1}^{s} | p^{k} - q^{k} |^{2},

then following the same argument as the one leading to (2.7) leads to

(1 / 2) ρ_{1} (v (p), v (q)) \leq ‖ p - q ‖_{s} \leq 2 ρ_{1} (v (p), v (q)), ‖ p ‖_{s}, ‖ q ‖_{s} \leq η^{*} .

Thus, there is no loss of generality in assuming that 𝕏 is already an s-dimensional submanifold of ℝ^D, defined by v, and with geodesic distance ρ₁. This has several consequences. Even if one overestimates the dimension of the original manifold to be s rather than d, the resulting “distance respecting” coordinate system will also be “distance respecting” for the original manifold, except for the presumably small error resulting from the overestimate. If we have no information about d (or s), we may take s = D. This would answer the question regarding points off the manifold, as well as take noise into account. However, then the advantage of dimension reduction is lost. Also, all the constants will depend upon D rather than s (or d).

Having defined a local coordinate system for the tubular neighborhood of 𝕏 in this way, we can then construct the local basis functions on this neighborhood as in section 2.3. However, since the original data $C$ is not dense on the local coordinate patch in the tubular neighborhood, one cannot use the ideas in section 2.6. We would like instead to keep some control on the growth of the extension operator. For this purpose, we propose to use the minimal Sobolev norm (MSN) interpolant introduced in Chandrasekaran et al. [59] and used very fruitfully in solutions of partial differential equations [60] and image segmentation [61].

Thus, using the procedures explained in these papers, we consider a differential operator Δ depending upon the application, and find an integer N and coefficients $c_{k}^{*}$ , |k|_∞ ≤ N, so that the s-variate polynomial $P^{*} = \sum_{k : | k |_{\infty} \leq N} c_{k}^{*} T_{k}$ minimizes

\begin{array}{l} \int_{{[- 1, 1]}^{s}} | Δ P (t) |^{2} \frac{d t}{\sqrt{(1 - {(t^{1})}^{2}) \dots (1 - {(t^{s})}^{2})}} & (3.5) \end{array}

over all s-variate polynomials P of coordinatewise degree ≤ N, subject to the conditions

\begin{array}{rcl} P (Φ (x_{j})) = f (x_{j}), x_{j} \in C . & (3.6) \end{array}

The polynomial P* then defines an extension of f to the tubular neighborhood of the local coordinate patch of 𝕏.
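A univariate (s = 1) sketch of such a constrained minimization is given below. It rests on assumptions of ours that are not taken from [59–61]: the operator is chosen as Δ = I − ((1 − t²) d²/dt² − t d/dt), for which ΔT_k = (1 + k²)T_k, so that (3.5) becomes a weighted sum of squares of the Chebyshev coefficients and the problem reduces to a small linear system for the Lagrange multipliers. The names `msn_interpolant` and `energy`, and the choice of N, are hypothetical.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb


def _weights(num_coeffs):
    """Diagonal of the quadratic form (3.5) for the assumed operator Delta."""
    k = np.arange(num_coeffs)
    norms = np.where(k == 0, np.pi, np.pi / 2.0)  # integrals of T_k^2 / sqrt(1 - t^2)
    return norms * (1.0 + k**2) ** 2              # times the squared eigenvalue of Delta


def energy(coeffs):
    """Value of (3.5) for P = sum_k coeffs[k] * T_k under the assumed Delta."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sum(_weights(len(coeffs)) * coeffs**2))


def msn_interpolant(nodes, values, N):
    """Chebyshev coefficients minimizing energy subject to P(nodes[j]) = values[j].

    Requires distinct nodes and N + 1 >= len(nodes).
    """
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    V = cheb.chebvander(nodes, N)                # V[j, k] = T_k(nodes[j])
    w = _weights(N + 1)
    gram = (V / w) @ V.T                         # V W^{-1} V^T
    multipliers = np.linalg.solve(gram, values)  # Lagrange multipliers
    return (V.T @ multipliers) / w               # c = W^{-1} V^T lambda


nodes = np.array([-0.8, -0.3, 0.1, 0.6, 0.9])   # stand-ins for Phi(x_j)
values = np.sin(3.0 * nodes)                    # stand-ins for f(x_j)
c_star = msn_interpolant(nodes, values, N=20)
```

The resulting polynomial can be evaluated anywhere on [−1, 1] with `cheb.chebval`, and `energy(c_star)` gives an explicit handle on the size of the extension under the assumed Δ.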

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Professor Tomaso Poggio for his comments and for sharing the manuscript [15] with us. An earlier version of this work is available as a preprint [62] and the authors hold the copyright of the preprint.

CC is also associated with the Statistics Department of Stanford University, CA 94305, and his research is partially supported by U.S. ARO Grant # W911NF-15-1-0385, Hong Kong Research Council Grant # 12300917, and Hong Kong Baptist University Grant # HKBU-RC-ICRS/16-17/03. The research of HM is supported in part by ARO Grant W911NF-15-1-0385.

References

1. Vapnik V. The Nature of Statistical Learning Theory. New York, NY: Springer Science & Business Media (2013).

2. Tiao GC, Tsay RS. Model specification in multivariate time series. J Roy Statist Soc Ser B (1989) 51:157–213.

3. Chakraborty K, Mehrotra K, Mohan CK, Ranka S. Forecasting the behavior of multivariate time series using neural networks. Neural Netw. (1992) 5:961–70. doi: 10.1016/S0893-6080(05)80092-9

4. Naumova V, Pereverzyev SV, Sivananthan S. Adaptive parameter choice for one-sided finite difference schemes and its application in diabetes technology. J Complex. (2012) 28:524–38. doi: 10.1016/j.jco.2012.06.001

5. Mhaskar HN, Naumova V, Pereverzyev SV. Filtered Legendre expansion method for numerical differentiation at the boundary point with application to blood glucose predictions. Appl Math Comput. (2013) 224:835–47. doi: 10.1016/j.amc.2013.09.015

6. Sharda R, Delen D. Forecasting Box-Office Receipts of Motion Pictures Using Neural Networks. Stillwater, OK: Citeseer (2002).

7. Mhaskar HN. Approximation properties of a multilayered feedforward artificial neural network. Adv Comput Math. (1993) 1:61–80. doi: 10.1007/BF02070821

8. Maggioni M, Mhaskar HN. Diffusion polynomial frames on metric measure spaces. Appl Comput Harm Anal. (2008) 24:329–53. doi: 10.1016/j.acha.2007.07.001

9. Ehler M, Filbir F, Mhaskar HN. Locally Learning Biomedical Data Using Diffusion Frames. J Comput Biol. (2012) 19:1251–64. doi: 10.1089/cmb.2012.0187

10. Chui CK, Filbir F, Mhaskar HN. Representation of functions on big data: graphs and trees. Appl Comput Harm Anal. (2015) 38:489–509. doi: 10.1016/j.acha.2014.06.006

11. Vapnik VN, Vapnik V. Statistical Learning Theory. New York, NY: Wiley (1998).

12. Girosi F, Poggio T. Networks and the best approximation property. Biol Cybernet. (1990) 63:169–76. doi: 10.1007/BF00195855

13. Girosi F, Jones MB, Poggio T. Regularization theory and neural networks architectures. Neural Comput. (1995) 7:219–69. doi: 10.1162/neco.1995.7.2.219

14. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature (2015) 521:436–44. doi: 10.1038/nature14539

15. Rosasco L, Shashua A, Cohen N, Poggio T. Notes on Hierarchical Splines, DCLNs, Convolutional Kernels and i-Theory. Boston, MA: Manuscript, Center for Brains, Minds, and Machines, MIT (2015).

16. Lorentz GG, von Golitschek M, Makovoz Y. Constructive Approximation: Advanced Problems. Berlin: Springer (1996).

17. Minsky M, Papert S. Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press (1988).

18. Funahashi KI. On the approximate realization of continuous mappings by neural networks. Neural Netw. (1989) 2:183–92. doi: 10.1016/0893-6080(89)90003-8

19. Irie B, Miyake S. Capabilities of three-layered perceptrons. In: Neural Networks, 1988., IEEE International Conference on. IEEE (1988). p. 641–8.

20. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw. (1989) 2:359–66. doi: 10.1016/0893-6080(89)90020-8

21. Cybenko G. Approximation by superposition of sigmoidal functions. Math Contl Signals Syst. (1989) 2:303–14. doi: 10.1007/BF02551274

22. Chui CK, Li X. Approximation by ridge functions and neural networks with one hidden layer. J Approx Theory (1992) 70:131–41. doi: 10.1016/0021-9045(92)90081-X

23. Chui CK, Li X. Realization of Neural Networks with One Hidden Layer. Singapore: Multivariate Approximation: From CAGD to Wavelets, World Scientific, (1993). p. 77–89.

24. Mhaskar HN, Micchelli CA. Approximation by superposition of sigmoidal and radial basis functions. Adv Appl Math. (1992) 13:350–73. doi: 10.1016/0196-8858(92)90016-P

25. Leshno M, Lin VY, Pinkus A, Schocken S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. (1993) 6:861–7. doi: 10.1016/S0893-6080(05)80131-5

26. Barron AR. Universal approximation bounds for superpositions of a sigmoidal function. Inf Theory IEEE Trans. (1993) 39:930–45. doi: 10.1109/18.256500

27. Kurková V, Sanguineti M. Bounds on rates of variable basis and neural network approximation. IEEE Trans Inf Theory (2001) 47:2659–65. doi: 10.1109/18.945285

28. Kurková V, Sanguineti M. Comparison of worst case errors in linear and neural network approximation. IEEE Trans Inf Theory (2002) 48:264–75. doi: 10.1109/18.971754

29. Mhaskar HN. On the tractability of multivariate integration and approximation by neural networks. J Complex. (2004) 20:561–90. doi: 10.1016/j.jco.2003.11.004

30. Mhaskar HN. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. (1996) 8:164–77. doi: 10.1162/neco.1996.8.1.164

31. Chui CK, Li X, Mhaskar HN. Neural networks for localized approximation. Math Comput. (1994) 63:607–23. doi: 10.1090/S0025-5718-1994-1240656-2

32. Chui CK, Li X, Mhaskar HN. Limitations of the approximation capabilities of neural networks with one hidden layer. Adv Comput Math. (1996) 5:233–43. doi: 10.1007/BF02124745

33. Mhaskar HN. Neural networks for localized approximation of real functions. In: Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop. IEEE (1993). p. 190–96.

34. Mhaskar HN, Khachikyan L. Neural networks for function approximation. In: Neural Networks for Signal Processing [1995] V. Proceedings of the 1995 IEEE Workshop. IEEE (1995). p. 21–9.

35. Chui CK, Donoho DL. Special issue: diffusion maps and wavelets. Appl. Comput. Harm Anal. (2006) 21:1–2. doi: 10.1016/j.acha.2006.05.005

36. Belkin M, Niyogi P. Towards a theoretical foundation for Laplacian-based manifold methods. J Comput Syst Sci. (2008) 74:1289–308. doi: 10.1016/j.jcss.2007.08.006

37. Lafon SS. Diffusion Maps and Geometric Harmonics. Yale, MI: Yale University (2004).

38. Singer A. From graph to manifold Laplacian: the convergence rate. Appl Comput Harm Anal. (2006) 21:128–34. doi: 10.1016/j.acha.2006.03.004

39. Jones PW, Maggioni M, Schul R. Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. Proc Natl Acad Sci USA. (2008) 105:1803–8. doi: 10.1073/pnas.0710175104

40. Jones PW, Maggioni M, Schul R. Universal local parametrizations via heat kernels and eigenfunctions of the Laplacian. Ann Acad Sci Fenn Math. (2010) 35:131–74. doi: 10.5186/aasfm.2010.3508

41. Mishne G, Shaham U, Cloninger A, Cohen I. Diffusion nets. arXiv:150607840v1 (2015).

42. Filbir F, Mhaskar HN. A quadrature formula for diffusion polynomials corresponding to a generalized heat kernel. J Fourier Anal Appl. (2010) 16:629–57. doi: 10.1007/s00041-010-9119-4

43. Filbir F, Mhaskar HN. Marcinkiewicz–Zygmund measures on manifolds. J Complex. (2011) 27:568–96. doi: 10.1016/j.jco.2011.03.002

44. Mhaskar HN. Eignets for function approximation on manifolds. Appl Comput Harm Anal. (2010) 29:63–87. doi: 10.1016/j.acha.2009.08.006

45. Mhaskar HN. A generalized diffusion frame for parsimonious representation of functions on data defined manifolds. Neural Netw. (2011) 24:345–59. doi: 10.1016/j.neunet.2010.12.007

46. Chui CK, Mhaskar HN. Smooth function extension based on high dimensional unstructured data. Math Comput. (2014) 83:2865–91. doi: 10.1090/S0025-5718-2014-02819-6

47. Mhaskar HN. A unified framework for harmonic analysis of functions on directed graphs and changing data. Appl Comput Harm Anal. (2018) 44:611–644. doi: 10.1016/j.acha.2016.06.007

48. Coifman RR, Lafon S. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Appl Comput Harm Anal. (2006) 21:31–52. doi: 10.1016/j.acha.2005.07.005

49. do Carmo Valero MP. Riemannian Geometry. Birkhäuser (1992).

50. Rudin W. Principles of Mathematical Analysis (International Series in Pure & Applied Mathematics). Singapore: McGraw-Hill Publishing Co. (1976).

51. Gottlieb D, Orszag SA. Numerical Analysis of Spectral Methods: Theory and Applications. Siam (1977).

52. Chui CK. An Introduction to Wavelets. San Diego, CA: Academic Press (1992).

53. Chui CK, Diamond H. A characterization of multivariate quasi-interpolation formulas and its applications. Numerische Mathematik. (1990) 57:105–21. doi: 10.1007/BF01386401

54. Mhaskar HN, Narcowich FJ, Ward JD. Quasi-interpolation in shift invariant spaces. J Math Anal Appl. (2000) 251:356–63. doi: 10.1006/jmaa.2000.7051

55. Mhaskar HN. Approximation theory and neural networks. In: Wavelet Analysis and Applications, Proceedings of the International Workshop, Delhi (1999). p. 247–89.

56. Mhaskar HN. Polynomial operators and local smoothness classes on the unit interval, II. Jaén J Approx. (2009) 1:1–25.

57. Le Gia QT, Mhaskar HN. Localized linear polynomial operators and quadrature formulas on the sphere. SIAM J Numer Anal. (2009) 47:440–66. doi: 10.1137/060678555

58. Gautschi W. Orthogonal Polynomials: Computation and Approximation. Oxford University Press on Demand (2004).

59. Chandrasekaran S, Jayaraman KR, Mhaskar HN. Minimum Sobolev norm interpolation with trigonometric polynomials on the torus. J Comput Phys. (2013) 249:96–112. doi: 10.1016/j.jcp.2013.03.041

60. Chandrasekaran S, Jayaraman KR, Gu M, Mhaskar HN, Moffitt J. Higher order numerical discretization methods with Sobolev norm minimization. Proc Comput Sci. (2011) 4:206–15. doi: 10.1016/j.procs.2011.04.022

61. Chandrasekaran S, Jayaraman KR, Moffitt J, Mhaskar HN, Pauli S. Minimum Sobolev Norm schemes and applications in image processing. In: IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics (2010). p. 753507.

62. Chui CK, Mhaskar HN. Deep nets for local manifold learning. arXiv:160707110 (2016).

Keywords: deep learning, function approximation, manifold learning, neural networks, local approximation

Citation: Chui CK and Mhaskar HN (2018) Deep Nets for Local Manifold Learning. Front. Appl. Math. Stat. 4:12. doi: 10.3389/fams.2018.00012

Received: 05 February 2018; Accepted: 24 April 2018;
Published: 29 May 2018.

Edited by:

Ding-Xuan Zhou, City University of Hong Kong, Hong Kong

Reviewed by:

Xin Guo, Hong Kong Polytechnic University, Hong Kong
Shao-Bo Lin, Wenzhou University, China

Copyright © 2018 Chui and Mhaskar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Hrushikesh N. Mhaskar, hrushikesh.mhaskar@cgu.edu