- 1Department of Mathematics, Hong Kong Baptist University, Hong Kong, Hong Kong
- 2Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA, United States
The problem of extending a function f defined on a training data on an unknown manifold 𝕏 to the entire manifold and a tubular neighborhood of this manifold is considered in this paper. For 𝕏 embedded in a high dimensional ambient Euclidean space ℝD, a deep learning algorithm is developed for finding a local coordinate system for the manifold without eigen-decomposition, which reduces the problem to the classical problem of function approximation on a low dimensional cube. Deep nets (or multilayered neural networks) are proposed to accomplish this approximation scheme by using the training data. Our methods do not involve such optimization techniques as back-propagation, while assuring optimal (a priori) error bounds on the output in terms of the number of derivatives of the target function. In addition, these methods are universal, in that they do not require a prior knowledge of the smoothness of the target function, but adjust the accuracy of approximation locally and automatically, depending only upon the local smoothness of the target function. Our ideas are easily extended to solve both the pre-image problem and the out-of-sample extension problem, with a priori bounds on the growth of the function thus extended.
1. Introduction
Machine learning is an active sub-field of Computer Science on algorithmic development for learning and making predictions based on some given data, with a long list of applications that range from computational finance and advertisement, to information retrieval, to computer vision, to speech and handwriting recognition, and to structural healthcare and medical diagnosis. In terms of function approximation, the data for learning and prediction can be formulated as {(x, fx)}, obtained with an unknown probability distribution. Examples include: the Boston housing problem (of predicting the median price fx of a home based on some vector x of 13 other attributes [1]) and the floor market problem [2, 3] (that deals with the indices of the wheat floor pricing in three major markets in the United States). For such problems, the objective is to predict the index fx in the next month, say, based on a vector x of their values over the past few months. Other similar problems include the prediction of blood glucose level fx of a patient based on a vector x of the previous few observed levels [4, 5], and the prediction of box office receipts (fx) on the date of release of a movie in preparation, based on a vector x of the survey results about the movie [6]. It is pointed out in Mhaskar [7], Maggioni and Mhaskar [8] and Ehler et al. [9] that all the pattern classification problems can also be viewed fruitfully as problems of function approximation. While it is an ongoing research to allow non-numeric input x (e.g., [10]), we restrict our attention in this paper to the consideration of x ∈ ℝD, for some integer D ≥ 1.
In the following discussion, the first component x is considered as input, while the second component fx is considered the output of the underlying process. The central problem is to estimate the conditional expectation of fx given x. Various statistical techniques and theoretical advances in this direction are well-known (see, for example [11]). In the context of neural and radial-basis-function networks, an explicit formulation of the input/output machines was pointed out in Girosi and Poggio [12] and Girosi et al. [13]. More recently, the nature of deep learning as an input/output process is formulated in the same way, as explained in LeCun et al. [14] and Rosasco et al. [15]. To complement the statistical perspective and understand the theoretical capabilities of these processes, it is customary to think of the expected value of fx, given x , as a function f of x. The question of empirical estimation in this context is to carry out the approximation of f given samples , where is a finite training data set. In practice, because of the random nature of the data, it may be possible that there are several pairs of the form (x, fx) in the data for the same values of x. In this case, a statistical scheme, such as some kind of averaging of fx being the simplest one, can be used to obtain a desired value f(x) for the sample of f at x, . From this perspective, the problem of extending f from the traning data set to x not in in machine learning is called the generalization problem.
We will illustrate this general line of ideas by using neural networks as an example. To motivate this idea, let us first recall a theorem originating with Kolmogorov and Arnold [16, Chapter 17, Theorem 1.1]. According to this theorem, there exist universal Lipschitz continuous functions ϕ1, ⋯ , ϕ2D+1 and universal scalars λ1, ⋯ , λD ∈ (0, 1), for which every continuous function f :[0, 1]D → ℝ can be written as
where g is a continuous function that depends on f. In other words, for a given f, only one function g has to be determind to give the representation formula (1.1) of f.
A neural network, used as an input/output machine, consists of an input layer, one or more hidden layers, and an output layer. Each hidden layer consists of a number of neurons arranged according to the network architecture. Each of these neurons has a local memory and performs a simple non-linear computation upon its input. The input layer fans out the input x ∈ ℝD to the neurons at the first hidden layer. The output layer typically takes a linear combination of the outputs of the neurons at the last hidden layer. The right hand side of (1.1) is a neural network with two hidden layers. The first contains D neurons, where the j-th neuron computes the sum . The next hidden layer contains 2D + 1 neurons each evaluating the function g on the output of the j-th neuron in the first hidden layer. The output layers takes the sum of the results as indicated in (1.1).
From a practical point of view, such a network is clearly hard to construct, since only the existence of the functions ϕj and g is known, without a numerical procedure for computing these. In the early mathematical development of neural networks during the late 1980s and early 1990s, instead of finding these functions for the representation of a given continuous function f in (1.1), the interest was to study the existence and characterization of universal functions σ :ℝ → ℝ, called activation functions of the neural networks, such that each neuron evaluates the activation function upon an affine transform of its input, and the network is capable of approximating any desired real-valued continuous target function f : K → ℝ arbitrarily closely on K, where K ⊂ ℝD is any compact set.
For example, a neural network with one hidden layer can be expressed as a function
Here, the hidden layer consists of n neurons, each of which has a local memory. The local memory of the k-th neuron contains the weights , and the threshold bk ∈ ℝ. Upon receiving the input x ∈ ℝD from the input later, the k-th neuron evaluates σ(wk · x + bk) as its output, where σ is a non-linear activation function. The output layer is just one circuit where the coefficients {ak} are stored in a local memory, and that evaluates the linear combination as indicated in (1.2). Training of this network in order to learn a function f on a compact subset K ⊂ ℝD to an accuracy of ϵ > 0 involves finding the parameters {ak}, {wk}, {bk} so that
The most popular technique for doing this is the so called back-propagation, which seeks to find these quantities by minimizing an error functional usually with some regularization parameters. We remark that the number n of neurons in the approximant (1.2) must increase, if the tolerance ϵ > 0 in the approximation of the target function f is required to be smaller.
From a theoretical perspective, the main attraction of neural networks with one hidden layer is their universal approximation property as formulated in (1.3), which overshadows the properties of their predecessors, namely: the perceptrons [17]. In particular, the question of finding sufficient conditions on the actvation function σ that ensure the universal approximation property was investigated in great detail by many authors, with emphasis on the most popular sigmoidal function, defined by the property σ(t) → 1 for t → ∞ and σ(t) → 0 for t → −∞. For example, Funahashi [18] applied some discretization of an integral formula from Irie and Miyake [19] to prove the universal approximation property for some sigmoidal function σ. A similar theorem was proved by Hornik et al. [20] by using the Stone-Weierstrass theorem, and another by Cybenko [21] by applying the Hahn-Banach and Riesz Representation theorems. A constructive proof via approximation by ridge functions was given in our paper [22], with algorithm for implementation presented in our follow-up work [23]. A complete characterization of which activation functions are allowed to achieve the universal approximation property was given later in Mhaskar and Micchelli [24] and Leshno et al. [25].
However, for neural networks with one hidden layer, one of the severe limitations to applying training algorithms based on optimization, such as back-propagation or those proposed in the book [11] of Vapnik, is that it is neccessary to know the number of neurons in in advance. Therefore, one major problem in the 1990s, known as the complexity problem, was to estimate the number of neurons required to approximate a function to a desired accuracy. In practice, this gives rise to a trade-off: to achieve a good approximation, one needs to have a large number of neurons, which makes the implementation of the training algorithm harder.
In this regard, nearly a century of research in approximation theory suggests that the higher the order of smoothness of the target function, the smaller the number of neurons should be, needed to achieve the desired accuracy. There are many different definitions of smoothness that give rise to different estimates. For example, under the condition that the Fourier transform of the target function f satisfies , Barron [26] proved the existence of a neural network with (ϵ−2) neurons that gives an L2([0, 1]D) error of (ϵ). While it is interesting to note that this number of neurons is essentially independent of the dimension D, the constants involved in the term as well as the number of derivatives needed to ensure the condition on the target function may increase with D. Several authors have subsequently improved upon such results under various conditions on the activation function as well as the target function so as to ensure that the constants depend polynomially on D (e.g., [27, 28, 29] and references therein).
The most commonly understood definition of smoothness is just the number of derivatives of the target function. It is well-known from the theory of n-widths that if r ≥ 1 is an integer, and the only a priori information assumed on the unknown target function is that it is r-times continuously differentiable function, then a stable and uniform approximation to within ϵ by neural networks must have at least a constant multiple of ϵ−D/r neurons. In Mhaskar [30], we gave an explicit construction for a neural network that achieves the accuracy of ϵ using (ϵ−D/r) neurons arranged in a single hidden layer. It follows that this suffers from a curse of dimensionality, in that the number of neurons increases exponentially with the input dimension D. Clearly, if the smoothness r of the function increases linearly with D, as it has to in order to satisfy the condition in Barron [26], then this bound is also “dimension independent.”
While this is definitely unavoidable for neural networks with one hidden layer, the most natural way out is to achieve local approximation; i.e., given an input x, construct a network with a uniformly bounded number of neurons that approximates the target function with the optimal rate of approximation near the point x, preferably using the values of the function also in a neighborhood of x. Unfortunately, this can never be achieved as we proved in Chui et al. [31]. Furthermore, we have proved in Chui et al. [32] that even if we allow each neuron to evaluate its own activation function, this local approximation fails. Therefore the only way out is to use a neural network with more than one hidden layer, called deep net (for deep neural network). Indeed, local approximation can be achieved by a deep net as proved in our papers [7, 33]. In this regard, it is of interest to point out that an adaptive version of Mhaskar [7, 33] was derived in Mhaskar and Khachikyan [34] for prediction of time series, yielding as much as 150% improvement upon the state-of-the-art at that time, in the study of the floor market problem.
Of course, the curse of dimensionality is inherent to the problem itself, whether with one or more hidden layers. Thus, while it is possible to construct a deep net to approximate a function at each point arbitrarily closely by using a uniformly bounded number of neurons, the uniform approximation on an entire compact set, such as a cube, would still require an approximation at a number of points in the cube, and this number increasing exponentially with the input dimension. Equivalently, the effective number of neurons for approximation on the entire cube is still exponentially increasing with the input dimension.
In addition to the high dimensionality, another difficulty in solving the function approximation problem is that the data may be not just high dimensional but unstructured and sparse. A relatively recent idea which has been found very useful in applications, in fact, too many to list exhaustively, is to consider the points x as being sampled from an unknown, low dimensional sub-manifold 𝕏 of the ambient high dimensional space ℝD. The understanding of the geometry of 𝕏 is the subject of the bulk of modern research in the area of diffusion geometry. An introduction to this subject can be found in the special issue [35] of Applied and Computational Harmonic Analysis. The basic idea is to construct the so-called diffusion matrix from the data, and use its eigen-decomposition for finding local coordinate charts and other useful aspects of the manifold. The convergence of the eigen-decomposition of the matrices to that of the Laplace-Beltrami and other differential operators on the manifold is discussed, for example, in Belkin and Niyogi [36], Lafon [37], and Singer [38]. It is shown in Jones et al. [39, 40] that some of the eigenfunctions on the manifolds yield a local coordinate chart on the manifold. In the context of deep learning, this idea is explored as a function approximation problem in Mishne et al. [41], where a deep net is developed in order to learn the coordinate system given by the eigenfunctions.
On the other hand, while much of the research in this direction is focused on understanding the data geometry, the theoretical foundations for the problems of function approximation and harmonic analysis on such data-defined manifold are developed extensively in Maggioni and Mhaskar [8], Filbir and Mhaskar [42, 43], Mhaskar [44, 45] and Chui and Mhaskar [46]. The theory is developed more recently for kernel construction on directed graphs and analysis of functions on changing data in our paper [47]. However, a drawback of the approach based on data-defined manifolds, known as the out-of-sample extension problem, is that since the diffusion matrix is constructed entirely using the available data, the whole process must be done again if new data become available. A popular idea is then to extend the eigen-functions to the ambient space by using the so called Nyström extension [48].
The objective of this present paper is to describe a deep learning approach to the problem of function approximation, using three groups of networks in the deep net. The lowest layer accomplishes dimensionality reduction by learning the local coordinate charts on the unknown manifold without using any eigen-decomposition. Having found the local coordinate system, the problem is reduced to the classical problem of approximating a function on a cube in a relatively low dimensional Euclidean space. For the next two layers, we may now apply the powerful techniques from approximation theory to approximate the target function f, given the samples on the training data set . We describe two approaches to construct the basis functions using multi-layered neural networks, and to construct other networks to use these basis functions in the next layer to accomplish the desired function approximation.
We summarize some of the highlights of our paper.
• We give a very simple learning method for learning the local coordinate chart near each point. The subsequent approximation process is then entirely local to each coordinate patch.
• Our method allows us to solve the pre-image problem easily; i.e., to generate a point on the manifold corresponding to a given local coordinate description.
• The learning method itself does not involve any optimization based technique, except probably for reducing the noise in the values of the function itself.
• We provide optimal error bounds on approximation based on the smoothness of the function, while the method itself does not require an a priori knowledge of such smoothness.
• Our methods can solve easily the out-of-sample extension problem. Unlike the Nyström extension process, our method does not require any elaborate construction of kernels defined on the ambient space and commuting with certain differential operators on the unknown manifold.
• Our method is designed to control the growth of the out-of-sample extension in a tubular neighborhood of 𝕏, and is local to each coordinate patch.
This paper is organized as follows. In section 2, we describe the main ideas in our approach. The local coordinate system is described in detail in section 2.2. Having thus found a local coordinate chart around the input, the problem of function approximation reduces to the classical one. In section 2.3, we demonstrate how the popular basis functions used in this theory can be implemented using neural networks with one or more hidden layers. The function approximation methods which work with unstructured data without using optimization are described in section 2.6. In section 3, we explain how our method can be used to solve both the pre-image problem and the out-of-sample extension problem.
2. Main Ideas and Results
The purpose of this paper is to develop a deep learning algorithm to learn a function f :𝕏 → ℝ, where 𝕏 is a d dimensional compact Riemannian sub-manifold (without boundary) of a Euclidean space ℝD, with d ≪ D, given training data of the form , xj ∈ 𝕏. It is important to note that 𝕏 itself is not known in advance; the points xj are known only as D-dimensional vectors, presumed to lie on 𝕏. In section 2.1, we explain our main idea briefly. In section 2.2, we derive a simple construction of the local coordinate chart for 𝕏. In section 2.3, we describe the construction of a neural network with one or more hidden layers to implement two of the basis functions used commonly in function approximation. While the well known classical approximation algorithms require a specific placement of the training data, one has no control on the location of the data in the current problem. In section 2.6, we give algorithms suitable for the purpose of solving this problem.
2.1. Outline of the Main Idea
Our approach is the following.
1. 𝕏 is a finite union of local coordinate neighborhoods, and x belongs to one of them, say 𝕌. We find a local coordinate system for this neighborhood in terms of Euclidean distances on ℝD, say Φ : 𝕌 → [−1, 1]d, where d is the dimension of the manifold. Let y = Φ(x), and with a relabeling for notational convenience, be the points in 𝕌, yj = Φ(xj). This way, we have reduced the problem to approximating g = f ◦ Φ:[−1, 1]d → ℝ at y, given the values , where g(yj) = f(xj). We note that {yj} is a subset of the unit cube of low dimensional Euclidean space, representing a local coordinate patch on 𝕏. Thus, the problem of approximation of f on this patch is reduced that of approximation of g, a well studied classical approximation problem.
2. We will summarize the solution to this problem using neural networks with one or more hidden layers, e.g., an implementation of multivariate tensor product spline approximation using multi-layerd neural network.
Thus, our deep learning networks will have three main layers.
1. The bottom layer receives the input x, figures out which of the points xj are in the coordinate neighborhood of x, and computes the local coordinates y, yj.
2. The next several layers compute the local basis functions necessary for the approximation, for example, the B-splines and their translates using the multi-layered neural network as in Mhaskar [7].
3. The last layer receives the data , and computes the approximation described in Step 2 above.
2.2. Local Coordinate Learning
We assume that 1 ≤ d ≤ D are integers, 𝕏 is a d dimensional smooth, compact, connected, Riemannian sub-manifold (without boundary) of a Euclidean space ℝD, with geodesic distance ρ.
Before we discuss our own construction of a local coordinate chart on 𝕏, we wish to motivate the work by describing a result from Jones et al. [40]. Let be the sequence of eigenvalues of the (negative of the) Laplace-Beltrami operator on 𝕏, and for each k ≥ 0, ϕk be the eigenfunction corresponding to the eigenvalue . We define a formal “heat kernel” by
The following result is a paraphrasing of the heat triangulation theorem proved in [40, Theorem 2.2.7] under weaker assumptions on 𝕏.
Theorem 2.1. (cf. [40, Theorem 2.2.7]) Let . There exist constants R > 0, c1, ⋯ , c6 > 0 depending on with the following property. Let p1, ⋯ , pd be d linearly independent vectors in ℝd, and be chosen so that is in the direction of pj, j = 1, ⋯ , d, and
and . Let B ⊂ 𝕏 be the geodesic ball of radius c4R, centered at , and
Then
Since the paper [40] deals with a very general manifold, the mapping Φjms is not claimed to be a diffeomorphism, although it is obviously one-one on B.
We note that even in the simple case of a Euclidean sphere, an explicit expression for the heat kernel is not known. In practice, the heat kernel has to be approximated using appropriate Gaussian networks [37]. In this section, we aim to obtain a local coordinate chart that is computed directly in terms of Euclidean distances on ℝD, and depends upon d + 2 trainable parameters. The construction of this chart constitutes the first hidden layer of our deep learning process. As explained in the introduction, once this chart is in place, the question of function extension on the manifold reduces locally to the well studied problem of function extension on a d dimensional unit cube.
To describe our constructions, we first develop some notation.
In this section, it is convenient to use the notation x = (x1, ⋯ , xD) ∈ ℝD rather than x = (x1, ⋯ , xD), which we will use in the rest of the sections. If 1 ≤ d ≤ D is an integer, and x ∈ ℝd, ‖x‖d denotes the Euclidean norm of x. If x ∈ ℝD, we will write , ‖x‖d = ‖πc(x)‖d. If x ∈ ℝd, r > 0,
There exists δ* > 0 with the following properties. The manifold is covered by finitely many geodesic balls such that for the center of any of these balls, there exists a diffeomorphism, namely, the exponential coordinate map u = (u1, ⋯ , uD) from to the geodesic ball around [49, p. 65]. If J is the Jacobian matrix for u, given by , , then
Further, there exists κ > 0 (independent of x*) such that
where the matrix norm is the induced norm. Let η*: = min(δ*, 1/(2κ)). Then (2.5) implies that
In turn, this leads to
Let , ℓ = 1, ⋯ , d, be chosen with the following properties:
and, with the matrix function U defined by
we have
Any set with these properties will be called coordinate stars around . We note that the matrix J(0)U(0) has columns given by , j = 1, ⋯ , d, and hence, can be computed without reference to the map u. Let
Theorem 2.2. Let . Then
(a) Ψ is a diffeomorphism on . If , x = u(p), y = u(q), then
(b) The function Ψ is a diffeomorphism from onto .
Remark 2.1. Let be a geodesic ball around . For x ∈ 𝔹, we define
Then Theorem 2.2(b) shows that ϕ is a diffeomorphism from 𝔹 onto . Since ,
maps 𝔹 diffeomorphically onto . Let 𝕌 = Φ−1([−1, 1]d). Then 𝕌 is a neighborhood of and Φ maps 𝕌 diffeomorphically onto [−1, 1]d. We oberve that 𝕏 is a union of finitely many neighborhoods of the form 𝕌, so that any x ∈ 𝕏 belongs to at least one such neighborhood. Moreover, Φ(x) can be computed entirely in terms of the description of x in terms of its D-dimensional coordinates. □
Remark 2.2. The trainable parameters are thus β*, and the points . Since ‖J(0)‖ = 1, the condition (2.10) is satisfied if are along linearly independent directions as in Theorem 2.1. □
Remark 2.3. Since the mapping Φ in Remark 2.1 is a quadratic polynomial in x, it can be implemented as a neural network with a single hidden layer using the activation function given in (2.22) as described in section 2.4.
Example 2.1. Let 0 < a < 1, and M = Ma be the circular helix defined by
Clearly, M is a one dimensional manifold, and s is the arclength parameter, measured from (1, 0, 0)T. The curvature at any point is a2. For any point z0 ∈ M, Uz0 = M, with the diffeomorphism given by u(tan−1(πt/2)), t ∈ (−1, 1). An interesting fact is that can be made arbitrarily small by choosing a close to 1, even though the geodesic distance between u(s + 2π) and u(s) is 2π. Let s0 ∈ ℝ, s0 + π/4 ≤ s1 ≤ s0 + 3π/8, and U: = {u(s) : |s − s0| ≤ π/8}. Let . It is easy to calculate that
so that
If u(s) ∈ U, then π/8 ≤ s1 − s ≤ π/2. Therefore, using the well known estimates
we obtain that
and
Hence, for any points u(t1), u(t2) ∈ U, we have
We note that the neighborhood U where this estimate holds and the constants are independent of the curvature. □
The remainder of this section is devoted to the proof of Theorem 2.2.
Lemma 2.1. Let . Each of the following statements hold for the matrix U defined in (2.9):
With β* as in (2.11), for , ‖y‖d = 1,
proof. In view of (2.6) and the mean value theorem, we have for ,
We observe further that for any integers m, ℓ, . Consequently, for any y ∈ ℝd, ‖y‖d ≤ 1,
This proves (2.15).
In view of (2.19), used with qℓ in place of p, ℓ = 1, ⋯ , d, we obtain for all y ∈ ℝd, ‖y‖d ≤ 1,
This proves (2.16).
In view of (2.5), (2.16), (2.15), we obtain for that
This proves (2.17). The estimate (2.18) follows easily from this and (2.10). □
PROOF OF THEOREM 2.2. In this proof only, let (q) be the Jacobian of Ψ: . Then (q) = 2J(q)U(q). The estimate (2.10) shows that (0) is invertible, and that ‖(0)−1‖ ≤ 1/(2γ). The estimate (2.17) then shows that
Therefore, the inverse function theorem as given in [50, Theorem 9.24 and its proof] implies that Ψ is a diffeomorphism on B(0, 2β*) as claimed. For , (2.18) shows that ‖(q)−1‖ ≤ 1/γ. Also, (2.16) and (2.6) show that . Hence, the mean value theorem implies that
Together with (2.7), this implies (2.12).
The part (b) follows also from [50, Theorem 9.24 and its proof] and (2.20). □
2.3. Local Basis Functions
Having found a local coordinate map Φ on a neighborhood 𝕌 of x on 𝕏, the problem of extending f from {xj}∩𝕌 to 𝕌 is reduced to extending f◦Φ from , a classical approximation problem. There is, of course, 100+ years of research on this subject. We restrict ourselves to two examples, which can be implemented using neural networks with one or more hidden layers. One of the most popular activation function in the deep learning literature (e.g., [14]) is the rectified linear unit function
Since this function is not continuously differentiable, there are some technical difficulties to use common algorithms like back-propagation with these activation functions. Although we do not need back-propagation in our theory, we prefer to deal with a rectified quadratic unit function defined for t ∈ ℝ by
which is continuously differentiable on ℝ. Our theory will work in general with any activation function of order k ≥ 2; i.e., with a function σ that satisfies
but for the sake of clarity of exposition, we will use only the activation function σ defined in (2.22).
2.4. Polynomials
The most basic class of classical approximants is the set of all polynomials. For n > 0, we denote the class of all algebraic polynomials of coordinatewise degree at most n in d variables by . (It is convenient to use the same notation also when n is not an integer; in this case, is just .
The basic implementation of polynomials is given in [22, Proof of Theorem 3.1], where an explicit construction is given for finding the weights {wk}, the thresholds {bk} and the coefficients {ak} used in (2.24) below.
Theorem 2.3. Let n > 0, , , then there exist weights and real numbers , such that
Here, the weights {wk} and the thresholds {bk} are independent of P and the coefficients {ak} are linear functionals on .
We observe that
while
so that the expression on the right hand side of (2.24) can be expressed as a neural network with log2N hidden layers.
We note that a neural network with one hidden layer is given in Mhaskar [30], but using a C∞ activation function σ; e.g., σ(t) = (1 + e−t)−1. This uses the fact that for w, x ∈ ℝd and b ∈ ℝ, such that none of the derivatives σ(j)(b), j = 0, 1, ⋯ , equal to 0,
A finite difference scheme to implement this differentiation yields a neural network with one hidden layer, containing exactly neurons, and should be stable for C∞ functions. If stability is a greater concern, then one may use other numerical differentiation schemes to implement this formula, e.g., spectral methods [51].
2.5. B-splines
For t ∈ ℝ, and integer m ≥ 1, let
A tensor product cardinal B-spline at y∈[−1, 1]d is defined by
It is explained in Mhaskar [7, 33] that the quantity Nm(y) can be computed using a neural network with a sigmoidal function of order m−1 consisting of finitely many neurons arranged in multiple hidden layers (the number of neurons and layers depending on m and d alone). Thus, if m is a power of 2, then each of the terms can be implemented as an iterated power of (cf. (2.26).) The product of d such expressions can be implemented using either Theorem 2.3 as a network with mulitple hidden layer and utilizing the rectified quadratic unit function as the activation function, or a discretization of the formula (2.27) using a C∞ sigmoidal function as explained in section 2.4.
2.6. Function Approximation
In this section, if y ∈ ℝd, |y|∞ is the ℓ∞ norm of y.
2.6.1. Spline Based Approximation
In [52, section 4.5], [53], a quasi-interpolatory spline function is defined by
where are compactly supported linear functionals, designed specifically to ensure that Qm(P) = P for every polynomial P of coordinatewise degree at most m − 1 in d variables. With Qm, h(f)(y) = Qm(f(h(·)))(y/h), h > 0, one has the approximation bound for small h:
The linear functionals are based on finitely many samples of f at the grid points in a compact subset of ℤd. In our context, the data for approximating f is not in this form. Therefore, we may use the following algorithm given in Mhaskar et al. [54], where we assume that is scaled so as to be supported on [−1, 1]d.
Given: A set of points in [−1, 1]d. Let
and be sufficiently small.
Objective: To find real numbers such that the functional
satisfies
Steps:
1. Divide [−1, 1]d into congruent subcubes of side not exceeding .
2. Choose , so that each subcube has exactly one point of .
3. Solve the following (underdetermined) system of equations for the unknowns aξ, .
4. Set aξ: = 0 if .
5. Output .
Substituting γ in place of in the definition of Qm(f) yields the desired spline approximation
and it is proved in Mhaskar et al. [54] that the estimate (2.30) holds with replacing Qm.
2.6.2. Polynomial Quasi-Interpolation
A standard method for polynomial approximation is to consider a filtered projection defined in (2.38) below.
The Chebyshev polynomials (of first kind) are defined recursively for t ∈ ℝ and integer m ≥ 0 by
In terms of monomials, the Chebyshev polynomials are given by
For y ∈ ℝd and multi-integer m = (m1, ⋯ , md) ≥ 0, the tensor product Chebyshev polynomial is defined by
We choose a smooth low pass filter h; i.e., an even function h:ℝ → [0, 1] such that h(u) = 1 if |u| ≤ 1/2, and h(u) = 0 if |u| ≥ 1, and abuse the notation as usual to define
With this filter, we define the kernel
Then the filtered projection operator is defined by
It is well known that if f is any continuous function on [−1, 1]d, then {Vn(f)} converges uniformly to f at the near optimal rate of approximation. For example, if f has partial derivatives up to order r in each variable, then analogously to (2.30), but for large n rather than small h,
Theoretically, the question then is to compute Vn(f) using the data as in section 2.6.1. The procedure we describe below from Mhaskar [55, 56] and Mhaskar et al. [5] also describes the choice of the parameter n depending upon the data.
Given: A set of points in [−1, 1]d. Let
and be sufficiently small. We also fix an integer n > 0.
Objective: To find real numbers such that the functional
satisfies
for all d-variate polynomials P of coordinatewise degree ≤ n − 1.
Steps:
1. Divide [−1, 1]d into congruent subcubes of side not exceeding .
2. Choose , so that each subcube has exactly one point of .
3. Solve the following (underdetermined) system of equations for the unknowns wξ, .
4. Set wξ := 0 if .
5. Output .
It is proved in the papers cited above that with the discretized operator
one obtains the near best rates of approximation. In particular, if f has partial derivatives up to order r in each variable, then (2.39) holds with replacing Vn(f). In practice, we choose n to be the largest integer such that either the condition number of the system of equations in (2.41) is “reasonable” or else by checking the resulting errors in (2.41) [57].
The formula (2.42) can be re-written in the form
where
Therefore, rather than evaluating the Chebyshev polynomials as defined in (2.36), (2.35), one can use a multi-layered network to evaluate in a more stable manner as follows. The first layer computes the coefficients using the available data. The output of this layer is input to a recurrent network to execute a multi-variate version of the well known Clenshaw algorithm [58, pp. 78–80].
3. Extensions
Since the starting point of diffusion geometry is to consider eigen-decomposition of a diffusion matrix, which is constructed using the available data, the entire computation needs to be redone if a new data becomes available. Since the manifold 𝕏 is only an abstract model, it is not even clear that the new data will belong to this manifold. This gives rise to two related questions. One is to find new points on the manifold; the so called pre-image problem [41] and the other is the out-of-sample extension problem; i.e., extend the target function to points not necessarily on 𝕏. In this section, we make some comments on how to use the theory described in the previous sections for solving these problems.
3.1. Pre-image Problem
In Remark 2.1, we have given an onto diffeomorphism Φ:𝕌 → [−1, 1]d where 𝕌 ⊂ 𝕏. The pre-image problem is the following. Given a point y ∈ [−1, 1]d, find x ∈ 𝕏 such that Φ(x) = y. This amounts to approximating the D-output function Φ−1 on [−1, 1]d, given its values at the known points , and can therefore be solved using any of the techniques described in the previous sections.
3.2. Out of Sample Extension
One well known strategy for function extension outside the manifold is the following. One starts with a compact, positive-semi-definite symmetric kernel K : ℝD × ℝD and considers the eigen-decomposition of K restricted to 𝕏 × 𝕏; thus, for example, if μ* is the volume measure on 𝕏, one finds numbers λk ≥ 0 and orthonormal functions ϕk on 𝕏 such that
A function on 𝕏 can then be expanded in terms of the orthonormal system of functions {ϕk}. The extension to ℝD is achieved by treating (3.1) as a definition of ϕk on ℝD (which can be done since K is defined on ℝD × ℝD), and the expansion of the original function f where the basis functions are now interpreted as extended by (3.1) as the desired extension. This leads to a variety of theoretical problems related to the judicious construction of kernels defined on the whole space whose eigenfunctions are meaningful as functions on 𝕏 (e.g., kernels that commute with the Laplace-Beltrami operator) so as to allow such a construction. In the end, it is not clear how well this extension will behave outside of 𝕏.
Our construction gives an alternative method for extending a function on 𝕏 to a tubular neighborhood of 𝕏, which we feel is more appropriate for most applications rather than trying to extend the function to the entire ambient space. Toward this goal, we first explain the local coordinate learning phase for tubular neighborhood of 𝕏.
Let s ≥ d be an integer, s ≤ D. For q ∈ ℝs (or q ∈ ℝD), we write , and
If Js is the Jacobian matrix for v, i.e., , then it is easy to check that
Moreover, if , then p = πc(p), q = πc(q). Therefore,
Consequently,
If we now define for
then following the same argument as the one leading to (2.7) leads to
Thus, there is no loss of generality in assuming that 𝕏 is already a s dimensional submanifold of ℝD, defined by v, and with geodesic distance ρ1. This has several consequences. Even if one overestimates the dimension of the original manifold to be s rather than d, the resulting “distance respecting” coordinate system will also be “distance respecting” for the original manifold, except for the presumably small error resulting from the overestimate. If we have no information about d (or s), we may take s = D. This would answer the question regarding points off the manifold, as well as take noise into account. However, then the advantage of dimension reduction is lost. Also, all the constants will depend upon D rather than s (or d).
Having defined a local coordinate system for the tubular neighborhood of 𝕏 in this way, we can then construct the local basis functions on this neighborhood as in section 2.3. However, since the original data is not dense on the local coordinate patch in the tubular neighborhood, one cannot use the ideas in section 2.6. We would like instead to keep some control on the growth of the extension operator. For this purpose, we propose to use the minimal Sobolev norm (MSN) interpolant introduced in Chandrasekaran et al. [59] and used very fruitfully in solutions of partial differential equations [60] and image segmentation [61].
Thus, using the procedures explained in these papers, we consider a differential operator Δ depending upon the application, and find an integer N and coefficients , |k|∞ ≤ N, so as the s-variate polynomial minimizes
over all s-variate polynomials P of coordinatewise degree ≤ N, subject to the conditions
The polynomial P* then defines an extension of f to the tubular neighborhood of the local coordinate patch of 𝕏.
Author Contributions
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
We would like to thank Professor Tomaso Poggio for his comments and for sharing the manuscript [15] with us. An earlier version of this work is available as a preprint [62] and the authors hold the copyright of the preprint.
CC is also associated with the Statistics Department of Stanford University, CA 94305, and his research is partially supported by U.S. ARO Grant # W911NF-15-1-0385, Hong Kong Research Council Grant # 12300917, and Hong Kong Baptist University Grant # HKBU-RC-ICRS/16-17/03. The research of HM is supported in part by ARO Grant W911NF-15-1-0385.
References
1. Vapnik V. The Nature of Statistical Learning Theory. New York, NY: Springer Science & Business Media (2013).
2. Tiao GC, Tsay RS. Model specification in multivariate time series. J Roy Statist Soc Ser B (1989) 51:157–213.
3. Chakraborty K, Mehrotra K, Mohan CK, Ranka S. Forecasting the behavior of multivariate time series using neural networks. Neural Netw. (1992) 5:961–70. doi: 10.1016/S0893-6080(05)80092-9
4. Naumova V, Pereverzyev SV, Sivananthan S. Adaptive parameter choice for one-sided finite difference schemes and its application in diabetes technology. J Complex. (2012) 28:524–38. doi: 10.1016/j.jco.2012.06.001
5. Mhaskar HN, Naumova V, Pereverzyev SV. Filtered Legendre expansion method for numerical differentiation at the boundary point with application to blood glucose predictions. Appl Math Comput. (2013) 224:835–47. doi: 10.1016/j.amc.2013.09.015
6. Sharda R, Delen D. Forecasting Box-Office Receipts of Motion Pictures Using Neural Networks. Stillwater, OK: Citeseer (2002).
7. Mhaskar HN. Approximation properties of a multilayered feedforward artificial neural network. Adv Comput Math. (1993) 1:61–80. doi: 10.1007/BF02070821
8. Maggioni M, Mhaskar HN. Diffusion polynomial frames on metric measure spaces. Appl Comput Harm Anal. (2008) 24:329–53. doi: 10.1016/j.acha.2007.07.001
9. Ehler M, Filbir F, Mhaskar HN. Locally Learning Biomedical Data Using Diffusion Frames. J Comput Biol. (2012) 19:1251–64. doi: 10.1089/cmb.2012.0187
10. Chui CK, Filbir F, Mhaskar HN. Representation of functions on big data: graphs and trees. Appl Comput Harm Anal. (2015) 38:489–509. doi: 10.1016/j.acha.2014.06.006
12. Girosi F, Poggio T. Networks and the best approximation property. Biol Cybernet. (1990) 63:169–76. doi: 10.1007/BF00195855
13. Girosi F, Jones MB, Poggio T. Regularization theory and neural networks architectures. Neural Comput. (1995) 7:219–69. doi: 10.1162/neco.1995.7.2.219
15. Rosasco L, Shashua A, Cohen N, Poggio T. Notes on Hierarchical Splines, DCLNs, Convolutional Kernels and i-Theory. Boston, MA: Manuscript, Center for Brains, Minds, and Machines, MIT (2015).
16. Lorentz GG, von Golitschek M, Makovoz Y. Constructive Approximation: Advanced Problems. Berlin: Springer (1996).
17. Minsky M, Papert S. Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press (1988).
18. Funahashi KI. On the approximate realization of continuous mappings by neural networks. Neural Netw. (1989) 2:183–92. doi: 10.1016/0893-6080(89)90003-8
19. Irie B, Miyake S. Capabilities of three-layered perceptrons. In: Neural Networks, 1988., IEEE International Conference on. IEEE (1988). p. 641–8.
20. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw. (1989) 2:359–66. doi: 10.1016/0893-6080(89)90020-8
21. Cybenko G. Approximation by superposition of sigmoidal functions. Math Contl Signals Syst. (1989) 2:303–14. doi: 10.1007/BF02551274
22. Chui CK, Li X. Approximation by ridge functions and neural networks with one hidden layer. J Approx Theory (1992) 70:131–41. doi: 10.1016/0021-9045(92)90081-X
23. Chui CK, Li X. Realization of Neural Networks with One Hidden Layer. Singapore: Multivariate Approximation: From CAGD to Wavelets, World Scientific, (1993). p. 77–89.
24. Mhaskar HN, Micchelli CA. Approximation by superposition of sigmoidal and radial basis functions. Adv Appl Math. (1992) 13:350–73. doi: 10.1016/0196-8858(92)90016-P
25. Leshno M, Lin VY, Pinkus A, Schocken S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. (1993) 6:861–7. doi: 10.1016/S0893-6080(05)80131-5
26. Barron AR. Universal approximation bounds for superpositions of a sigmoidal function. Inf Theory IEEE Trans. (1993) 39:930–45. doi: 10.1109/18.256500
27. Kurková V, Sanguineti M. Bounds on rates of variable basis and neural network approximation. IEEE Trans Inf Theory (2001) 47:2659–65. doi: 10.1109/18.945285
28. Kurková V, Sanguineti M. Comparison of worst case errors in linear and neural network approximation. IEEE Trans Inf Theory (2002) 48:264–75. doi: 10.1109/18.971754
29. Mhaskar HN. On the tractability of multivariate integration and approximation by neural networks. J Complex. (2004) 20:561–90. doi: 10.1016/j.jco.2003.11.004
30. Mhaskar HN. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. (1996) 8:164–77. doi: 10.1162/neco.1996.8.1.164
31. Chui CK, Li X, Mhaskar HN. Neural networks for localized approximation. Math Comput. (1994) 63:607–23. doi: 10.1090/S0025-5718-1994-1240656-2
32. Chui CK, Li X, Mhaskar HN. Limitations of the approximation capabilities of neural networks with one hidden layer. Adv Comput Math. (1996) 5:233–43. doi: 10.1007/BF02124745
33. Mhaskar HN. Neural networks for localized approximation of real functions. In: Neural Networks for Processing [1993] III. Proceedings of the 1993 IEEE-SP Workshop. IEEE (1993). p. 190–96.
34. Mhaskar HN, Khachikyan L. Neural networks for function approximation. In: Neural Networks for Signal Processing [1995] V. Proceedings of the 1995 IEEE Workshop. IEEE (1995). p. 21–9.
35. Chui CK, Donoho DL. Special issue: diffusion maps and wavelets. Appl. Comput. Harm Anal. (2006) 21:1–2. doi: 10.1016/j.acha.2006.05.005
36. Belkin M, Niyogi P. Towards a theoretical foundation for Laplacian-based manifold methods. J Comput Syst Sci. (2008) 74:1289–308. doi: 10.1016/j.jcss.2007.08.006
38. Singer A. From graph to manifold Laplacian: the convergence rate. Appl Comput Harm Anal. (2006) 21:128–34. doi: 10.1016/j.acha.2006.03.004
39. Jones PW, Maggioni M, Schul R. Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. Proc Natl Acad Sci USA. (2008) 105:1803–8. doi: 10.1073/pnas.0710175104
40. Jones PW, Maggioni M, Schul R. Universal local parametrizations via heat kernels and eigenfunctions of the Laplacian. Ann Acad Sci Fenn Math. (2010) 35:131–74. doi: 10.5186/aasfm.2010.3508
42. Filbir F, Mhaskar HN. A quadrature formula for diffusion polynomials corresponding to a generalized heat kernel. J Fourier Anal Appl. (2010) 16:629–57. doi: 10.1007/s00041-010-9119-4
43. Filbir F, Mhaskar HN. Marcinkiewicz–Zygmund measures on manifolds. J Complex. (2011) 27:568–96. doi: 10.1016/j.jco.2011.03.002
44. Mhaskar HN. Eignets for function approximation on manifolds. Appl Comput Harm Anal. (2010) 29:63–87. doi: 10.1016/j.acha.2009.08.006
45. Mhaskar HN. A generalized diffusion frame for parsimonious representation of functions on data defined manifolds. Neural Netw. (2011) 24:345–59. doi: 10.1016/j.neunet.2010.12.007
46. Chui CK, Mhaskar HN. Smooth function extension based on high dimensional unstructured data. Math Comput. (2014) 83:2865–91. doi: 10.1090/S0025-5718-2014-02819-6
47. Mhaskar HN. A unified framework for harmonic analysis of functions on directed graphs and changing data. Appl Comput Harm Anal. (2018) 44:611–644. doi: 10.1016/j.acha.2016.06.007
48. Coifman RR, Lafon S. Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions. Appl Comput Harm Anal. (2006) 21:31–52. doi: 10.1016/j.acha.2005.07.005
50. Rudin W. Principles of Mathematical Analysis (International Series in Pure & Applied Mathematics). Singapore: McGraw-Hill Publishing Co. (1976).
51. Gottlieb D, Orszag SA. Numerical Analysis of Spectral Methods: Theory and Applications. Siam (1977).
53. Chui CK, Diamond H. A characterization of multivariate quasi-interpolation formulas and its applications. Numerische Mathematik. (1990) 57:105–21. doi: 10.1007/BF01386401
54. Mhaskar HN, Narcowich FJ, Ward JD. Quasi-interpolation in shift invariant spaces. J Math Anal Appl. (2000) 251:356–63. doi: 10.1006/jmaa.2000.7051
55. Mhaskar HN. Approximation theory and neural networks. In: Wavelet Analysis and Applications, Proceedings of the International Workshop, Delhi (1999). p. 247–89.
56. Mhaskar HN. Polynomial operators and local smoothness classes on the unit interval, II. Jaén J Approx. (2009) 1:1–25.
57. Le Gia QT, Mhaskar HN. Localized linear polynomial operators and quadrature formulas on the sphere. SIAM J Numer Anal. (2009) 47:440–66. doi: 10.1137/060678555
58. Gautschi W. Orthogonal Polynomials: Computation and Approximation. Oxford University Press on Demand (2004).
59. Chandrasekaran S, Jayaraman KR, Mhaskar HN. Minimum Sobolev norm interpolation with trigonometric polynomials on the torus. J Comput Phys. (2013) 249:96–112. doi: 10.1016/j.jcp.2013.03.041
60. Chandrasekaran S, Jayaraman KR, Gu M, Mhaskar HN, Mofftt J. Higher order numerical discretization methods with sobolev norm minimization. Proc Comput Sci. (2011) 4:206–15. doi: 10.1016/j.procs.2011.04.022
61. Chandrasekaran S, Jayaraman KR, Moffitt J, Mhaskar HN, Pauli S. Minimum Sobolev Norm schemes and applications in image processing.In: IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics (2010). p. 753507.
Keywords: deep learning, function approximation, manifold learning, neural networks, local approximation
Citation: Chui CK and Mhaskar HN (2018) Deep Nets for Local Manifold Learning. Front. Appl. Math. Stat. 4:12. doi: 10.3389/fams.2018.00012
Received: 05 February 2018; Accepted: 24 April 2018;
Published: 29 May 2018.
Edited by:
Ding-Xuan Zhou, City University of Hong Kong, Hong KongReviewed by:
Xin Guo, Hong Kong Polytechnic University, Hong KongShao-Bo Lin, Wenzhou University, China
Copyright © 2018 Chui and Mhaskar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Hrushikesh N. Mhaskar, hrushikesh.mhaskar@cgu.edu