Regularized Kernel Algorithms for Support Estimation

Rudi, Alessandro; De Vito, Ernesto; Verri, Alessandro; Odone, Francesca

doi:10.3389/fams.2017.00023

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 08 November 2017

Sec. Mathematics of Computation and Data Science

Volume 3 - 2017 | https://doi.org/10.3389/fams.2017.00023

Regularized Kernel Algorithms for Support Estimation

Alessandro Rudi^1,2

Ernesto De Vito³^*

Alessandro Verri⁴

Francesca Odone⁴

¹INRIA—Sierra team—École Normale Supérieure, Paris, France
²Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia, Genova, Italy
³Dipartimento di Matematica, Università di Genova, Genova, Italy
⁴DIBRIS, Università di Genova, Genova, Italy

In the framework of non-parametric support estimation, we study the statistical properties of a set estimator defined by means of Kernel Principal Component Analysis. Under a suitable assumption on the kernel, we prove that the algorithm is strongly consistent with respect to the Hausdorff distance. We also extend the above analysis to a larger class of set estimators defined in terms of a low-pass filter function. We finally provide numerical simulations on synthetic data to highlight the role of the hyper parameters, which affect the algorithm.

1. Introduction

A classical issue in statistics is support estimation, i.e., the problem of learning the support of a probability distribution from a set of points identically sampled according to the distribution. For example, the Devroye-Wise algorithm [1] estimates the support with the union of suitable balls centered in the training points. In the last two decades, many algorithms have been proposed and their statistical properties analyzed [1–14] and references therein.

An instance of the above setting, which plays an important role in applications, is the problem of novelty/anomaly detection, see Campos et al. [15] for an updated review. In this context, in Hoffmann [16] the author proposed an estimator based on Kernel Principal Component Analysis (KPCA), first introduced in Schölkopf et al. [17] in the context of dimensionality reduction. The algorithm was successfully tested in many applications from computer vision to biochemistry [18–24]. In many of these examples the data are often represented by high dimensional vectors, but they actually live close to a nonlinear low dimensional submanifold of the original space, and the proposed estimator takes advantage of the fact that KPCA provides an efficient compression/dimensionality reduction of the original data [16, 17], whereas many classical set estimators refer to the dimension of the original space, as it happens for the Devroye-Wise algorithm.

In this paper we prove that KPCA is a consistent estimator of the support of the distribution with respect to the Hausdorff distance. The result is based on an intriguing property of the reproducing kernel, called separating condition, first introduced in De Vito et al. [25]. This assumption ensures that any closed subset of the original space is represented in the feature space by a linear subspace. We show that this property remains true if the data are recentered to have zero mean in the feature space. Together with the results in De Vito et al. [25], we conclude that the consistency of KPCA algorithm is preserved by recentering of the data, which can be regarded as a degree of freedom to improve the empirical performance of the algorithm in a specific application.

Our main contribution is sketched in the next subsection together with some basic properties of KPCA and some relevant previous works. In section 2, we describe the mathematical framework and the related notations. Section 3 introduces the spectral support estimator and informally discusses its main features, whereas its statistical properties and the meaning of the separating condition for the kernel are analyzed in section 4. Finally section 5 presents the effective algorithm to compute the decision function and discusses the role of the two meta-parameters based on the previous theoretical analysis. In the Appendix (Supplementary Material), we collect some technical results.

1.1. Sketch of the Main Result and Previous Works

In this section we sketch our main result by first recalling the construction of the KPCA estimator introduced in Hoffmann [16]. We have at disposal a training set ${x_{1}, \dots, x_{n}} \in D \subset ℝ^{d}$ of n points independently sampled according to some probability distribution P. The input space $D$ is a known compact subset of ℝ^d, but the probability distribution P is unknown and the goal is to estimate the support C of P from the empirical data. We recall that C is the smallest closed subset of $D$ such that ℙ[C] = 1 and we stress that C is in general a proper subset of $D$ , possibly of low dimension.

Classical Principal Component Analysis (PCA) is based on the construction of the vector space V spanned by the first m eigenvectors associated with the largest eigenvalues of the empirical covariance matrix

\frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x}) \otimes (x_{i} - \bar{x}),

where $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$ is the empirical mean. However, if the data do not live on an affine subspace, the set V is not a consistent estimator of the support. In order to take into account non-linear models, following the idea introduced in Schölkopf et al. [17] we consider a feature map Φ from the input space $D$ into the corresponding feature space $H$ , which is assumed to be a Hilbert space, and we replace the empirical covariance matrix with the empirical covariance operator

\hat{T_{n}^{c}} = \frac{1}{n} \sum_{i = 1}^{n} (Φ (x_{i}) - {\hat{μ}}_{n}) \otimes (Φ (x_{i}) - {\hat{μ}}_{n}),

where ${\hat{μ}}_{n} = \frac{1}{n} \sum_{i = 1}^{n} Φ (x_{i})$ is the empirical mean in the feature space. As it happens in PCA, we consider the subspace ${\hat{V}}_{m, n}$ of $H$ spanned by the first m- eigenvectors ${\hat{f}}_{1}, \dots, {\hat{f}}_{m}$ of $\hat{T_{n}^{c}}$ . According to the proposal in Hoffmann [16], we consider the following estimator of the support of the probability distribution P

{\hat{C}}_{n} = {x \in D ∣ ‖ {\hat{μ}}_{n} + \sum_{j = 1}^{m} 〈 Φ (x) - {\hat{μ}}_{n}, {\hat{f}}_{j} 〉 {\hat{f}}_{j} - Φ (x) ‖ \leq τ_{n}},

where τ_n is a suitable threshold depending on the number of examples and

{\hat{μ}}_{n} + \sum_{j = 1}^{m} 〈 Φ (x) - {\hat{μ}}_{n}, {\hat{f}}_{j} 〉 {\hat{f}}_{j}

is the projection of an arbitrary point $x \in D$ onto the affine subspace ${\hat{μ}}_{n} + {\hat{V}}_{m, n}$ . We show that, under a suitable assumption on the feature map, called separating property, ${\hat{C}}_{n}$ is a consistent estimator of C with respect to the Hausdorff distance between compact sets, see Theorem 3 of section 4.2.

The separating property was introduced in De Vito et al. [25] and it ensures that the feature space is rich enough to learn any closed subset of $D$ . This assumption plays the same role of the notion of universal kernel [26] in supervised learning.

Moreover, following [25, 27] we extend the KPCA estimator to a class of learning algorithms defined in terms of a low-pass filter function r_m(σ) acting on the spectrum of the covariance matrix and depending on a regularization parameter m ∈ ℕ. The projection of ${\hat{Φ}}_{n} (x)$ onto ${\hat{V}}_{m, n}$ is replaced by the vector

{\hat{Φ}}_{n, m} (x) = \sum_{j = 1}^{+ \infty} r_{m} ({\hat{σ}}_{j}) 〈 Φ (x) - {\hat{μ}}_{n}, {\hat{f}}_{j} 〉 {\hat{f}}_{j},

where ${{\hat{f}}_{j}}_{j}$ is the family of eigenvectors of $\hat{T_{n}^{c}}$ and ${{\hat{σ}}_{j}}_{j}$ is the corresponding family of eigenvalues. The support is then estimated by the set

{x \in D ∣ | | {\hat{μ}}_{n} + {\hat{Φ}}_{n, m} (x) - Φ (x) | | \leq τ_{n}} .

Note that KPCA corresponds to the choice of the hard-cut off filter

r_{m} ({\hat{σ}}_{j}) = {\begin{array}{l} 1 & i \leq m \\ 0 & i > m \end{array} .

However, other filter functions can be considered, inspired by the theory of regularization for inverse problems [28] and by supervised learning algorithms [29, 30]. In this paper we show that the explicit computation of these spectral estimators reduces to a finite dimensional problem depending only on the kernel K(x, w) = 〈Φ(x), Φ(w)〉 associated with the feature map, as for KPCA. The computational properties of each learning algorithm depend on the choice of the low-pass filter r_m(σ), which can be tuned to out-perform of some specific data set, see the discussion in Rudi et al. [31].

We conclude this section with two considerations. First, in De Vito et al. [25, 27] it is proven a consistency result for a similar estimator, where the subspace ${\hat{V}}_{n, m}$ is computed with respect to the non-centered covariance matrix in the feature space $H$ , instead of the covariance matrix. In this paper we analyze the impact of recentering the data in the feature space $H$ on the support estimation problem, see Theorem 1 below. This point of view is further analyzed in Rudi et al. [32, 33].

Finally note that, our consistency results are based on convergence rates of empirical subspaces to true subspaces of the covariance operator, see Theorem 2 below. The main difference between our result and the result in Blanchard et al. [34], is that we prove the consistency for the case when the dimension m = m_n of the subspace ${\hat{V}}_{m, n}$ goes to infinity slowly enough. On the contrary, in their seminal paper [34] the authors analyze the most specific case when the dimension of the projection space is fixed.

2. Mathematical Assumptions

In this section we introduce the statistical model generating the data, the notion of separating feature map and the properties of the filter function. Furthermore, we show that KPCA can be seen as a filter function and we recall the main properties of the covariance operators.

We assume that the input space $D$ is a bounded closed subset of ℝ^d. However, our results also hold true by replacing $D$ with any compact metric space. We denote by d(x, w) the Euclidean distance |x − w| between two points x, w ∈ ℝ^d and by d_{_H}(A, B) the Hausdorff distance between two compact subsets $A, B \subset D$ , explicitly given by

d_{_{H}} (A, B) = \max {\sup_{x \in A} d (x, B), \sup_{x \in B} d (x, A)},

where $d (x, A) = {inf}_{w \in A} d (x, w)$ .

2.1. Statistical Model

The statistical model is described by a random vector X taking value in $D$ . We denote by P the probability distribution of X, defined on the Borel σ-algebra of $D$ , and by C the support of P.

Since the probability distribution P is unknown, so is its support. We aim to estimate C from a training set of empirical data, which are described by a family X₁, …, X_n of random vectors, which are independent and identically distributed as X. More precisely, we are looking for a closed subset ${\hat{C}}_{n} = {\hat{C}}_{X_{1}, \dots, X_{n}} \subset D$ , depending only on X₁, …, X_n, but independent of P, such that

ℙ [\lim_{n \to + \infty} d_{_{H}} ({\hat{C}}_{n}, C) = 0] = 1

for all probability distributions P. In the context of regression estimate, the above convergence is usually called universal strong consistency [35].

2.2. Mercer Feature Maps and Separating Condition

To define the estimator ${\hat{C}}_{n}$ we first map the data into a suitable feature space, so that the support C is represented by a linear subspace.

Assumption 1. Given a Hilbert space $H$ , take $Φ : D \to H$ satisfying the following properties:

(H1) the set $Φ (D)$ is total in $H$ , i.e.,

\bar{span} {\hat{Φ} (x) ∣ x \in D} = H,

where $\bar{span} {\cdot}$ denotes the closure of the linear span;

(H2) the map Φ is continuous.

The space $H$ is called the feature space and the map Φ is called a Mercer feature map.

In the following the norm and scalar product of $H$ are denoted by ||·|| and 〈··〉, respectively.

Assumptions (H1) and (H2) are standard for kernel methods, see Steinwart and Christmann [36]. We now briefly recall some basic consequences. First of all, the map $K : D \times D \to ℝ$

K (x, w) = 〈 Φ (x) Φ (w) 〉

is a Mercer kernel and we denote by $H_{K}$ the corresponding (separable) reproducing kernel Hilbert space, whose elements are continuous functions on $D$ . Moreover, each element $f \in H$ defines a function $f_{Φ} \in H_{K}$ by setting f_Φ(x) = 〈f, Φ(x)〉 for all $x \in D$ . Since $Φ (D)$ is total in $H$ , the linear map f ↦ f_Φ is an isometry from $H$ onto $H_{K}$ . In the following, with slight abuse of notation, we write f instead of f_Φ, so that the elements $f \in H$ are viewed as functions on $D$ satisfying the reproducing property

f (x) = 〈 f, Φ (x) 〉 x \in D .

Finally, since $D$ is compact and Φ is continuous, it holds that

\begin{matrix} R = \sup_{x \in D} | | Φ (x) | |^{2} = \sup_{x \in D} K (x, x) < + \infty . & (1) \end{matrix}

Following De Vito et al. [27], we call Φ a separating Mercer feature map if the following the separating property also holds true.

(H3) The map Φ is injective and for all closed subsets $C \subset D$

\begin{matrix} Φ (C) = Φ (D) \cap \bar{span} {Φ (x) ∣ x \in C} . & (2) \end{matrix}

It states that any closed subset $C \subset D$ is mapped by Φ onto the intersection of $Φ (D)$ and the closed subspace $\bar{span} {Φ (x) ∣ x \in C} \subset H$ . Examples of kernels satisfying the separating property are for $D \subset ℝ^{d}$ [27]:

• Sobolev kernels with smoothness index $s > \frac{d}{2}$ ;

• the Abel/Laplacian kernel K(x, w) = e^{− γ|x − w|} with γ > 0;

• the ℓ₁-kernel $K (x, w) = e^{- γ | x - w |_{1}}$ , where |·|₁ is the ℓ₁-norm and γ > 0.

As shown in De Vito et al. [25], given a closed set $C$ the equality (2) is equivalent to the condition that for every $x_{0} \notin C$ there exists $f \in H$ such that

\begin{matrix} f (x_{0}) \neq 0 and f (x) = 0 \forall x \in C . & (3) \end{matrix}

Clearly, an arbitrary Mercer feature map is not able to separate all the closed subsets, but only few of them. To better describe these sets, we introduce the elementary learnable sets, namely

C_{f} = {x \in D ∣ f (x) = 〈 f, Φ (x) 〉 = 0},

where $f \in H$ . Clearly, $C_{f}$ is closed and the equality (3) holds true. Furthermore the intersection of an arbitrary family of elementary learnable sets $\cap_{f \in F} C_{f}$ with $F \subset H$ satisfies (3), too. Conversely, if $C$ is a set satisfying (2), select a maximal family {f_j}_j∈J of orthonormal functions in $H$ such that

f_{j} (x) = 〈 f_{j}, Φ (x) 〉 = 0 \forall x \in C, j \in J,

i.e., a basis of the orthogonal complement of ${Φ (x) ∣ x \in C}$ , then it is easy to prove that

\begin{matrix} C = {x \in D ∣ f_{j}, Φ (x) 〉 = 0 \forall j \in J} = \underset{j \in J}{\cap} C_{f_{j}}, & (4) \end{matrix}

so that any set which is separating by Φ is the (possibly denumerable) intersection of elementary sets. Assumption (H3) is hence a requirement that the family of the elementary learnable sets, labeled by the elements of $H$ , is rich enough to parameterize all the closed subsets of $D$ by means of (4). In section 4.3 we present some examples.

The Gaussian kernel K(x, w) = e^{−γ|x−w|2} is a popular choice in machine learning, however it is not separating. Indeed, since K is analytic, the elements of the corresponding reproducing kernel Hilbert space are analytic functions, too [36]. It is known that, given an analytic function f ≠ 0, the corresponding elementary learnable set $C_{f} = {x \in D ∣ f (x) = 0}$ is a closed set whose interior is the empty set. Hence also the denumerable intersections have empty interior, so that K can not separate a support with not-empty interior. In Figure 1 we compare the decay behavior of the eigenvalues of the Laplacian and the Gaussian kernels.

FIGURE 1

Figure 1. Eigenvalues in logarithmic scale of the Covariance operator when the kernel is Abel (blue) and Gaussian (red) and the distribution is uniformly supported on the “8” curve in Figure 2. Note that the eigenvalue decay rate of the first operator has a polynomial behavior while the second has an exponential one.

2.3. Filter Function

The second building block is a low pass filter, we introduce to avoid that the estimator overfits the empirical data. The filter functions were first introduced in the context of inverse problem, see Engl et al. [28] and references therein, and in the context of supervised learning, see Lo Gerfo et al. [29] and Blanchard and Mucke [30].

We now fix some notations. For any $f \in H$ , we denote by f ⊗ f the rank one operator (f ⊗ f)g = 〈gf〉f. We recall that a bounded operator A on $H$ is a Hilbert-Schmidt operator if for some (any) basis {f_j}_j the series $| | A | |_{2}^{2} : = \sum_{j} | | A f_{j} | |^{2}$ is finite, ||A||₂ is called the Hilbert-Schmidt norm and ||A||_∞ ≤ ||A||₂, where ||·||_∞ is the spectral norm. We denote by $S_{2}$ the space of Hilbert-Schmidt operators, which is a separable Hilbert space under the scalar product ${〈 A B 〉}_{2} = \sum_{j} 〈 A f_{j} B f_{j} 〉$ .

Assumption 2. A filter function is a sequence of functions r_m:[0, R] → [0, 1], with m ∈ ℕ, satisfying

(H4) for any m ∈ ℕ, r_m(0) = 0;

(H5) for all σ > 0, $lim_{m \to + \infty} r_{m} (σ) = 1$ ;

(H6) for all m ∈ ℕ, there is L_m > 0 such that

| r_{m} (σ^{'}) - r_{m} (σ) | \leq L_{m} | σ^{'} - σ |,

i.e., r_m is a Lipschitz function with Lipschitz constant L_m.

For fixed m, r_m is a filter cutting the smallest eigenvalues (high frequencies). Indeed, (H4) and (H6) with σ′ = 0 give

\begin{matrix} | r_{m} (σ) | \leq L_{m} | σ | . & (5) \end{matrix}

On the contrary, if m goes to infinity, by (H5) r_m converges point-wisely to the Heaviside function

Θ (σ) = {\begin{array}{l} 1 & σ > 0 \\ 0 & σ = 0 \end{array} .

Since r_m(σ) converges to Θ(σ), which does not satisfy (5), we have that $lim_{m \to + \infty} L_{m} = + \infty$ .

We fix the interval [0, R] as domain of the filter functions r_m since the eigenvalues of the operators we are interested belong to [0, R], see (23).

Examples of filter functions are

• Tikhonov filter

r_{m} (σ) = \frac{m σ}{m σ + R} L_{m} = \frac{m}{R} .

• Soft cut-off

r_{m} (σ) = {\begin{matrix} 1 & σ \geq \frac{R}{m} \\ \frac{m σ}{R} & σ < \frac{R}{m} \end{matrix} L_{m} = \frac{m}{R} .

• Landweber iteration

r_{m} (σ) = \frac{σ}{R} \sum_{k = 0}^{m} {(1 - \frac{σ}{R})}^{k} = 1 - {(1 - \frac{σ}{R})}^{m + 1} L_{m} = \frac{m + 1}{R} .

We recall a technical result, which is based on functional calculus for compact operators. If A is a positive Hilbert-Schmidt operator, Hilbert-Schmidt theorem (for compact self-adjoint operators) gives that there exist a basis {f_j}_j of $H$ and a family {σ_j}_j of positive numbers such that

\begin{matrix} A = \sum_{j} σ_{j} f_{j} \otimes f_{j} \Leftrightarrow A f_{j} = σ_{j} f_{j} . & (6) \end{matrix}

If the spectral norm ||A||_∞ ≤ R, then all the eigenvalues σ_j belong to [0, R] and the spectral calculus defines r_m(A) as the operator on $H$ given by

r_{m} (A) = \sum_{j} r_{m} (σ_{j}) f_{j} \otimes f_{j} \Leftrightarrow r_{m} (A) f_{j} = r_{m} (σ_{j}) f_{j} .

With this definition each f_j is still an eigenvector of r_m(A), but the corresponding eigenvalue is shrunk to r_m(σ_j). Proposition 1 in the Appendix in Supplementary Material summarizes the main properties of r_m(A).

2.4. Kernel Principal Component Analysis

As anticipated in the introduction, the estimators we propose are a generalization of KPCA suggested by Hoffmann [16] in the context of novelty detection. In our framework this corresponds to the hard cut-off filter, i.e., by labeling the different eigenvalues of A in a decreasing order¹ σ₁ > σ₂ > … > σ_m > σ_m+1 > …, the filter function is

r_{m} (σ) = {\begin{array}{l} 1 & σ \geq σ_{m} \\ 0 & σ < σ_{m} \end{array} .

Clearly, r_m satisfies (H4) and (H5), but (H6) does not hold. However, the Lipschitz assumption is needed only to prove the bound (21e) and, for the hard cut-off filter, r_m(A) is simply the orthogonal projector onto the linear space spanned by the eigenvectors whose eigenvalues are bigger than σ_m+1. For such projections [37] proves the following bound

| | r_{m} (A^{'}) - r_{m} (A) | |_{2} \leq \frac{2}{σ_{m + 1} - σ_{m}} | | A^{'} - A | |_{2},

so that (21e) holds true with $L_{m} = \frac{2}{σ_{m + 1} - σ_{m}}$ . Hence, our results also hold for hard cut-off filter at the price to have a Lipschitz constant L_m depending on the eigenvalues of A.

2.5. Covariance Operators

The third building block is made of the eigenvectors of the distribution dependent covariance operator and of its empirical version. The covariance operators are computed by first mapping the data in the feature space $H$ .

As usual, we introduce two random variables Φ(X) and Φ(X) ⊗ Φ(X), taking value in $H$ and in $S_{2}$ , respectively. Since Φ is continuous and X belongs to the compact subset $D$ , both random variables are bounded. We set

\begin{array}{l} μ = 𝔼 [Φ (X)] = \int_{D} Φ (x) d P (x), & (7 a) \end{array}

\begin{array}{l} T = 𝔼 [Φ (X) \otimes Φ (X)] = \int_{D} Φ (x) \otimes Φ (x) d P (x), & (7 b) \end{array}

\begin{array}{l} T^{c} = T - μ \otimes μ, & (7 c) \end{array}

where the integrals are in the Bochner sense.

We denote by ${\hat{μ}}_{n}$ and ${\hat{T}}_{n}$ the empirical mean of Φ(X) and Φ(X) ⊗ Φ(X), respectively, and by $\hat{T_{n}^{c}}$ the empirical covariance operator, respectively. Explicitly,

\begin{array}{l} {\hat{μ}}_{n} = \frac{1}{n} \sum_{i} Φ (X_{i}), & (8 a) \end{array}

\begin{array}{l} {\hat{T}}_{n} = \frac{1}{n} \sum_{i} Φ (X_{i}) \otimes Φ (X_{i}), & (8 b) \end{array}

\begin{array}{l} \hat{T_{n}^{c}} = {\hat{T}}_{n} - {\hat{μ}}_{n} \otimes {\hat{μ}}_{n} . & (8 c) \end{array}

The main properties of the covariance operator and its empirical version are summarized in Proposition 2 in the Appendix in Supplementary Material.

3. The Estimator

Now we are ready to construct the estimator, whose computational aspects are discussed in section 5. The set ${\hat{C}}_{n}$ is defined by the following three steps:

a) the points $x \in D$ are mapped into the corresponding centered vectors $Φ (x) - {\hat{μ}}_{n} \in H$ , where the center is the empirical mean;

b) the operator $r_{m} (\hat{T_{n}^{c}})$ is applied to each vector $Φ (x) - {\hat{μ}}_{n}$ ;

c) the point $x \in D$ is assigned to ${\hat{C}}_{n}$ if the distance between $r_{m_{n}} ({\hat{T}}_{n}^{c}) (Φ (x) - {\hat{μ}}_{n})$ and $Φ (x) - {\hat{μ}}_{n}$ is smaller than a threshold τ.

Explicitly we have that

\begin{array}{l} {\hat{C}}_{n} = {x \in D ∣ | | r_{m_{n}} ({\hat{T}}_{n}^{c}) (Φ (x) - {\hat{μ}}_{n}) - (Φ (x) - {\hat{μ}}_{n}) | | \leq τ_{n}}, & (9) \end{array}

where τ = τ_n and m = m_n are chosen as a function of the number n of training data.

With the choice of the hard cut-off filter, this reduces to the KPCA algorithm [16, 17]. Indeed, $r_{m} ({\hat{T}}_{n}^{c})$ is the projection Q^m onto the vector space spanned by the first m eigenvectors. Hence ${\hat{C}}_{n}$ is the set of points x whose image $Φ (x) - {\hat{μ}}_{n}$ is close to Q^m. For an arbitrary filter function r_m, Q^m is replaced by $r_{m} ({\hat{T}}_{n}^{c})$ , which can be interpreted as a smooth version of Q^m. Note that, in general, $r_{m} ({\hat{T}}_{n}^{c})$ is not a projection.

In De Vito et al. [27] a different estimator is defined. In that paper the data are mapped in the feature space $H$ without centering the points with respect to the empirical mean and the estimator is given by

{\tilde{C}}_{n} = {x \in D ∣ | 〈 Φ (x) - r_{m_{n}} ({\hat{T}}_{n}) Φ (x) Φ (x) 〉 | \leq τ_{n}^{2}},

where the filter function r_m is as in the present work, but $r_{m_{n}} ({\hat{T}}_{n})$ is defined in terms of the eigenvectors of the non-centered second momentum ${\hat{T}}_{n}$ . To compare the two estimators note that

\begin{array}{l} | | r_{m} ({\hat{T}}_{n}^{c}) (Φ (x) - {\hat{μ}}_{n}) - (Φ (x) - {\hat{μ}}_{n}) | |^{2} \\ = 〈 {(I - r_{m} ({\hat{T}}_{n}^{c}))}^{2} (Φ (x) - {\hat{μ}}_{n}) Φ (x) - {\hat{μ}}_{n} 〉 \\ = 〈 (I - r_{m}^{*} ({\hat{T}}_{n}^{c})) (Φ (x) - {\hat{μ}}_{n}) Φ (x) - {\hat{μ}}_{n} 〉, \end{array}

where $r_{m}^{*} (σ) = 2 r_{m} (σ) - r_{m} {(σ)}^{2}$ , which is a filter function too, possibly with a Lipschitz constant $L_{m}^{*} \leq 2 L_{m}$ . Note that for the hard cut-off filter $r_{m}^{*} (σ) = r_{m} (σ)$ .

Though $r_{m_{n}} ({\hat{T}}_{n})$ and $r_{m_{n}} ({\hat{T}}_{n}^{c})$ are different, both ${\hat{C}}_{n}$ and ${\tilde{C}}_{n}$ converge to the support of the probability distribution P, provided that the separating property (H3) holds true. Hence, one has the freedom to choose if the empirical data have or not zero mean in the feature space.

4. Main Results

In this section, we prove that the estimator ${\hat{C}}_{n}$ we introduce is strongly consistent. To state our results, for each n ∈ ℕ, we fix an integer m_n ∈ ℕ and set ${\hat{F}}_{n} : D \to H$ to be

{\hat{F}}_{n} (x) = (I - r_{m_{n}} ({\hat{T}}_{n}^{c})) (Φ (x) - {\hat{μ}}_{n}),

so that Equation (9) becomes

\begin{array}{l} {\hat{C}}_{n} = {x \in D ∣ | | {\hat{F}}_{n} (x) | | \leq τ_{n}} . & (10 \end{array}

4.1. Spectral Characterization

First of all, we characterize the support of P by means of Q^c, the orthogonal projector onto the null space of the distribution dependence covariance operator T^c. The following theorem will show that the centered feature map

Φ^{c} : D \to H Φ^{c} (x) = Φ (x) - μ

sends the support C onto the intersection of $Φ^{c} (D)$ and the closed subspace $(I - Q^{c}) H$ , i.e,

Φ^{c} (C) = Φ^{c} (D) \cap (I - Q^{c}) H .

Theorem 1. Assume that Φ is a separating Mercer feature map, then

\begin{array}{l} C = {x \in D ∣ Q^{c} (Φ (x) - μ) = 0}, & (11) \end{array}

where Q^c is the orthogonal projector onto the null space of the covariance operator T^c.

Proof. To prove the result we need some technical lemmas, we state and prove in the Appendix in Supplementary Material. Assume first that $x \in D$ is such that Q^c(Φ(x) − μ) = 0. Denoted by Q the orthogonal projection onto the null space of T, by Lemma 2 QQ^c = Q and Qμ = 0, so that

Q Φ (x) = Q (Φ (x) - μ) = Q Q^{c} (Φ (x) - μ) = 0.

Hence Lemma 1 implies that x ∈ C.

Conversely, if x ∈ C, then as above Q(Φ(x) − μ) = 0. By Lemma 2 we have that Q^c(1 − Q) = ||Q^cμ||⁻²Q^c μ ⊗ Q^cμ. Hence it is enough to prove that

〈 Q^{c} μ Φ (x) - μ 〉 = 0 \Leftrightarrow Q^{c} Φ (x) = Q^{c} μ,

which holds true by Lemma 3.

4.2. Consistency

Our first result is about the convergence of ${\hat{F}}_{n}$ .

Theorem 2. Assume that Φ is a Mercer feature map. Take the sequence {m_n}_n such that

\begin{array}{l} \lim_{n \to \infty} m_{n} = + \infty, & (12 a) \end{array}

\begin{array}{l} L_{m_{n}} \leq κ \frac{\sqrt{n}}{\ln n}, & (12 b) \end{array}

for some constant κ > 0, then

\begin{array}{l} ℙ [\lim_{n \to \infty} \sup_{x \in D} | | {\hat{F}}_{n} (x) - Q^{c} (Φ (x) - μ) | | = 0] = 1. & (13) \end{array}

Proof. We first prove Equation (13). Set $A_{n} = I - r_{m_{n}} (\hat{T_{n}^{c}})$ . Given $x \in D$ ,

\begin{array}{l} | | A_{n} (Φ (x) - {\hat{μ}}_{n}) - Q^{c} (Φ (x) - μ) | | \\ \leq | | (A_{n} - Q^{c}) (Φ (x) - μ) | | + | | A_{n} (μ - {\hat{μ}}_{n}) | | \\ \leq | | r_{m_{n}} ({\hat{T}}_{n}^{c}) - r_{m_{n}} (T^{c}) | |_{2} | | Φ (x) - μ | | \\ + | | (r_{m_{n}} (T^{c}) - (I - Q^{c})) (Φ (x) - μ) | | + | | A_{n} (μ - {\hat{μ}}_{n}) | | \\ \leq 2 \sqrt{R} L_{m_{n}} | | {\hat{T}}_{n}^{c} - T^{c} | |_{2} \\ + | | (r_{m_{n}} (T^{c}) - (I - Q^{c})) (Φ (x) - μ) | | + | | μ - {\hat{μ}}_{n} | |, \end{array}

where the fourth line is due to (21e), the bound $| | A_{n} | |_{\infty} = sup_{σ \in [0, 1]} | 1 - r_{m} (σ) | \leq 1$ , and the fact that both Φ(x) and μ are bounded by $\sqrt{R}$ . By (12b) it follows that

\begin{array}{l} \sup_{x \in D} | | A_{n} (Φ (x) - {\hat{μ}}_{n}) - Q^{c} (Φ - μ) | | \leq 2 \sqrt{R} κ \frac{\sqrt{n}}{\ln n} | | {\hat{T}}_{n}^{c} - T^{c} | |_{2} \\ + \sup_{x \in D} | | (r_{m_{n}} (T^{c}) - (I - Q^{c})) (Φ (x) - μ) | | - | | μ - {\hat{μ}}_{n} | |, \end{array}

so that, taking into account (24a) and (24c), it holds that

\lim_{n \to + \infty} \sup_{x \in D} | | A_{n} (Φ (x) - {\hat{μ}}_{n}) - Q^{c} (Φ - μ) | | = 0

almost surely, provided that lim_n→+∞∥(r_{m_n}(T^c) − (I − Q^c))(Φ(x) − μ)∥ = 0. This last limit is a consequence of (21d) observing that ${Φ (x) - μ ∣ x \in D}$ is compact since $D$ is compact and Φ is continuous.

We add some comments. Theorem 1 suggests that the consistency depends on the fact that the vector $(I - r_{m_{n}} ({\hat{T}}_{n}^{c})) (Φ (x) - {\hat{μ}}_{n})$ is a good approximation of Q^c(Φ(x)−μ). By the law of large numbers, ${\hat{T}}_{n}$ and ${\hat{μ}}_{n}$ converge to T and μ, respectively, and Equation (21d) implies that, if m is large enough, (I − r_m(T))(Φ(x)−μ) is closed to Q^c(Φ(x)−μ). Hence, if m_n is large enough, see condition (12a), we expect that $r_{m_{n}} ({\hat{T}}_{n}^{c})$ is close to $r_{m_{n}} (T^{c})$ . However, this is true only if m_n goes to infinity slowly enough, see condition (12b). The rate depends on the behavior of the Lipschitz constant L_m, which goes to infinity if m goes to infinity. For example, for Tikhonov filter a sufficient condition is that $m_{n} ~ n^{\frac{1}{2} - ϵ}$ with ϵ > 0. With the right choice of m_n, the empirical decision function ${\hat{F}}_{n}$ converges uniformly to the function F(x) = Q^c(Φ(x) − μ), see Equation (13).

If the map Φ is separating, Theorem 1 gives that the zero level set of F is precisely the support C. However, if C is not learnable by Φ, i.e., the equality (2) does not hold, then the zero level set of F is bigger than C. For example, if $D$ is connected, C has not-empty interior and Φ is the feature map associated with the Gaussian kernel, it is possible to prove that F is an analytic function, which is zero on an open set, hence it is zero on the whole space $D$ . We note that, in real applications the difference between Gaussian and Abel kernel, which is separating, is not so big and in our experience the Gaussian kernel provides a reasonable estimator.

From now on we assume that Φ is separating, so that Theorem 1 holds true. However, the uniform convergence of ${\hat{F}}_{n}$ to F does not imply that the zero level sets of ${\hat{F}}_{n}$ converges to C = F⁻¹(0) with respect to the Hausdorff distance. For example, with the Tikhonov filter ${\hat{F}}_{n}^{- 1} (0)$ is always the empty set. To overcome the problem, ${\hat{C}}_{n}$ is defined as the τ_n-neighborhood of the zero level set of ${\hat{F}}_{n}$ , where the threshold τ_m goes to zero slowly enough.

Define the data dependent parameter ${\hat{τ}}_{n}$ as

\begin{array}{l} {\hat{τ}}_{n} = \max_{1 \leq i \leq n} | | {\hat{F}}_{n} (X_{i}) | | . & (14) \end{array}

Since ${\hat{F}}_{n} \in [0, 1]$ , clearly ${\hat{τ}}_{n} \in [0, 1]$ and the set estimator becomes

{\hat{C}}_{n} = {x \in D ∣ | | {\hat{F}}_{n} (x) | | \leq {\hat{τ}}_{n}} .

The following result shows that ${\hat{C}}_{n}$ is a universal strongly consistent estimator of the support of the probability distribution P. Note that for KPCA the consistency is not universal since the choice of m_n depends on some a-priori information about the decay of the eigenvalues of the covariance operator T^c, which depends on P.

Theorem 3. Assume that Φ is a separating Mercer feature map. Take the sequence {m_n}_n satisfying (12a)-(12b) and define ${\hat{τ}}_{n}$ by (14). Then

\begin{array}{l} ℙ [\lim_{n \to \infty} {\hat{τ}}_{n} = 0] = 1, & (15 a) \end{array}

\begin{array}{l} ℙ [\lim_{n \to \infty} d_{_{H}} ({\hat{C}}_{n}, C) = 0] = 1. & (15 b) \end{array}

Proof. We first show Equation (15a). Set F(x) = Q^c(Φ(x) − μ) and let E be the event on which ${\hat{F}}_{n}$ converges uniformly to F(x), and F be the event such that X_i ∈ C for all i ≥ 1. Theorem 2 shows that ℙ[E] = 1 and, since C is the support, then ℙ[F] = 1. Take ω ∈ E ∩ F and fix ϵ > 0, then there exists n₀ > 0 (possibly depending on ω and ϵ) such that for all n ≥ n₀ $| {\hat{F}}_{n} (x) - F (x) | \leq ϵ$ for all $x \in D$ . By Theorem 1 F(x) = 0 for all x ∈ C and X₁(ω), …, X_n(ω) ∈ C, it follows that $| {\hat{F}}_{n} (X_{i} (ω)) | \leq ϵ$ for all 1 ≤ i ≤ n so that $0 \leq {\hat{τ}}_{n} (ω) \leq ϵ$ , so that the sequence ${\hat{τ}}_{n} (ω)$ goes to zero. Since ℙ[E ∩ F] = 1 Equation (15a) holds true.

We split the proof of Equation (15b) in two steps. We first show that with probability one $lim_{n \to + \infty} sup_{x \in {\hat{C}}_{n}} d (x, C) = 0$ . On the event E ∩ F, suppose, by contraction, that the sequence ${{sup}_{x \in {\hat{C}}_{n}} d (x, C)}_{n}$ does not converge to zero. Possibly passing to a subsequence, for all n ∈ ℕ there exists $x_{n} \in {\hat{C}}_{n}$ such that $d (x_{n}, {\hat{C}}_{n}) \geq ϵ_{0}$ for some fixed ϵ₀ > 0. Since $D$ is compact, possibly passing to a subsequence, {x_n}_n converges to $x_{0} \in D$ with d(x₀, C) ≥ ϵ₀. We claim that x₀ ∈ C. Indeed,

\begin{array}{l} | | Q^{c} (Φ (x_{0}) - μ) | | \leq | | Q^{c} (Φ (x_{0}) - Φ (x_{n})) | | \\ + | | {\hat{F}}_{n} (x_{n}) - Q^{c} (Φ (x_{n}) - μ) | | + | | {\hat{F}}_{n} (x_{n}) | | \\ \leq | | Q^{c} (Φ (x_{0}) - Φ (x_{n})) | | \\ + \sup_{x \in D} | | {\hat{F}}_{n} (x) - Q^{c} (Φ (x) - μ) + τ_{n}, \end{array}

since $x_{n} \in {\hat{C}}_{n}$ means that $| | {\hat{F}}_{n} (x_{n}) | | \leq τ_{n}$ . If n goes to infinity, since Φ is continuous and by the definition of E and F, the right side of the above inequality goes to zero, so that $| | Q^{c} (Φ (x_{0}) - μ) | | = 0$ , i.e., by Theorem 1 we get x₀ ∈ C, which is a contraction since by construction d(x₀, C) ≥ ϵ₀ > 0.

We now prove that

\lim_{n \to \infty} \sup_{x \in C} d (x, {\hat{C}}_{n}) .

For any $x \in D$ , set X_1,n(x) to be a first neighbor of x in the training set {X₁, …, X_n}. It is known that for all x ∈ C,

\begin{array}{l} ℙ [\lim_{n \to + \infty} d (X_{1, n} (x), x) = 0] = 1, & (16) \end{array}

see for example Lemma 6.1 of Györfi et al. [35].

Choose a denumerable family {z_j}_{j ∈ J} in C such that is dense in C. By Equation (16) there exists an event G with such that ℙ[G] = 1 and, on G, for all j ∈ J

\lim_{n \to + \infty} d (X_{1, n} (z_{j}), z_{j}) = 0.

Fix ω ∈ G, we claim that $lim_{n} {sup}_{x \in C} d (x, {\hat{C}}_{n}) = 0$ . Observe that, by definition of ${\hat{τ}}_{n}$ , $X_{i} \in {\hat{C}}_{n}$ for all 1 ≤ i ≤ n and

\sup_{x \in C} d (x, {\hat{C}}_{n}) \leq \sup_{x \in C} \min_{1 \leq i \leq n} d (x, X_{i}) = \sup_{x \in C} d (X_{1, n} (x), x),

so that it is enough to show that $\lim_{n \to + \infty} {sup}_{x \in C} d (X_{1, n} (x), x) = 0$ .

Fix ϵ > 0. Since C is compact, there is a finite subset J_ϵ ⊂ J such that {B(z_j, ϵ)}_{j ∈ J_ϵ} is a finite covering of C. Furthermore,

\begin{array}{l} \sup_{x \in C} d (X_{1, n} (x), x) \leq \max_{j \in J_{ϵ}} d (X_{1, n} (z_{j}), z_{j}) + ϵ . & (17) \end{array}

Indeed, fix x ∈ C, there exists an index j ∈ J_ϵ such that x ∈ B(z_j, ϵ). By definition of first neighbor, clearly

d (X_{1, n} (x), x) \leq d (X_{1, n} (z_{j}), x),

so that by triangular inequality we get

\begin{array}{l} d (X_{1, n} (x), x) \leq d (X_{1, n} (z_{j}), x) \leq d (X_{1, n} (z_{j}), z_{j}) + d (z_{j}, x) \\ \leq d (X_{1, n} (z_{j}), z_{j}) + ϵ \\ \leq \max_{j \in J_{ϵ}} d (X_{1, n} (z_{j}), z_{j}) + ϵ . \end{array}

Taking the supremum over C we get the claim. Since ω ∈ G and J_ϵ is finite,

\lim_{n \to + \infty} \max_{j \in J_{ϵ}} d (X_{1, n} (z_{j}), z_{j}) = 0,

so that by Equation (17)

\underset{n \to + \infty}{limsup} \sup_{x \in C} d (X_{1, n} (x), x) \leq ϵ .

Since ϵ is arbitrary, we get $lim_{n \to + \infty} {sup}_{x \in C} d (X_{1, n} (x), x) = 0$ , which implies that

\lim_{n \to + \infty} \sup_{x \in C} d (x, {\hat{C}}_{n}) = 0.

Theorem 3 is an asymptotic result. Up to now, we are not able to provide finite sample bounds on d_H(Ĉ_n, C). It is possible to have finite sample bounds on $| | {\hat{F}}_{n} (x) - Q^{c} (Φ (x) - μ) | |$ , as in Theorem 7 of De Vito et al. [25] with the same kind of proof.

4.3. The Separating Condition

The following two examples clarify the notion of the separating condition.

Example 1. Let $D$ be a compact subset of ℝ², $H = ℝ^{6}$ with the euclidean scalar product, and $Φ : D \to ℝ^{6}$ be the feature map

Φ ((x, y)) = (x^{2}, y^{2}, \sqrt{2} x y, \sqrt{2} x, \sqrt{2} y, 1),

whose corresponding Mercer kernel is a polynomial kernel of degree two, explicitly given by

K (x_{1}, y_{1}; x_{2}, y_{2}) = {(x_{1} x_{2} + y_{1} y_{2} + 1)}^{2} .

Given a vector $f = {(f_{1}, \dots, f_{6})}^{⊤}$ , the corresponding elementary set is the conic

\begin{array}{l} C_{f, c} = {(x, y) \in D ∣ f_{1} x^{2} + f_{2} y^{2} + f_{3} \sqrt{2} x y \\ + f_{4} \sqrt{2} x + f_{5} \sqrt{2} y + f_{6} = 0}, \end{array}

Conversely, all the conics are elementary sets. The family of all the intersections of at most five conics, i.e., the sets whose cartesian equation is a system of the form

{\begin{array}{l} f_{11} x^{2} + f_{12} y^{2} + f_{13} \sqrt{2} x y + f_{14} \sqrt{2} x + f_{15} \sqrt{2} y + f_{16} = 0 \\ ⋮ \\ f_{51} x^{2} + f_{52} y^{2} + f_{53} \sqrt{2} x y + f_{54} \sqrt{2} x + f_{55} \sqrt{2} y + f_{56} = 0 \end{array},

where f₁₁, …, f₅₆ ∈ ℝ.

Example 2. The data are the random vectors in ℝ²

X_{i} = (\sin (a Θ_{i} + b), \sin (c Θ_{i} + d)),

where a, c ∈ ℕ, b, d ∈ [0, 2π] and Θ₁, …, Θ_n are independent random variables, each of them uniformly distributed on [0, 2π]. Setting $D = {[- 1, 1]}^{2}$ , clearly $X_{i} \in D$ and the support of their common probability distribution is the Lissajous curve

C = {Lis}_{a, b, c, d} = {(\sin (a θ + b), \sin (c θ + d)) \in D ∣ θ \in [0, 2 π]} .

Figure 2 shows two examples of Lissajous curves. As a filter function r_m, we fix the hard cut-off filter where m is the number of eigenvectors corresponding to the highest eigenvalues we keep. We denote by $Ĉ_{n}^{m, τ}$ the corresponding estimator given by (10).

FIGURE 2

Figure 2. Examples of Lissajous curves for different values of the parameters. (Left) ${Lis}_{1, 0, 1, \frac{π}{2}}$ . (Right) Lis_2,0.11,1,0.3.

In the first two tests we use the polynomial kernel (18), so that the elementary learnable sets are conics. One can check that the rank of T^c is less or equal than 5. More precisely, if Lis_a,b,c,d is a conic, the rank of T^c is 4 and we need to estimate five parameters, whereas if Lis_a,b,c,d is not a conic, Lis_a,b,c,d is not a learnable set and the rank of T^c is 5.

In the first test the data are sampled from the distribution supported on the circumference ${Lis}_{1, 0, 1, \frac{π}{2}}$ (see panel left of Figure 2). In Figure 3 we draw the set $Ĉ_{n}^{m, τ}$ for different values of m and τ when n varies. In this toy example n = 5 is enough to learn exactly the support, hence for each n = 2, …, 5 the corresponding values of m_n and τ_n are m_n = 1, 2, 3, 4 and τ_n = 0.01, 0.005, 0.005, 0.002.

FIGURE 3

Figure 3. From left to right, top to bottom. The set $Ĉ_{n}^{m, τ}$ with n, respectively 2, 3, 4, 5, m = 1, 2, 3, 4 and τ = 0.1, 0.005, 0.005, 0.002.

In the second test the data are sampled from the distribution supported on the curve Lis_2,0.11,1,0.3, which is not a conic (see panel right of Figure 2). In Figure 4 we draw the set $Ĉ_{n}^{m, τ}$ for n = 10, 20, 50, 100, m = 4, and τ = 0.01. Clearly, ${\hat{C}}_{n}$ is not able to estimate Lis_2,0.11,1,0.3.

FIGURE 4

Figure 4. The set $Ĉ_{n}^{4, 0.01}$ with n respectively 10, 20, 50, 100.

In the third test, we use the Abel kernel with the data sampled from the distribution supported on the curve Lis_2,0.11,1,0.3 (see panel right of Figure 2). In Figure 5 we show the set $Ĉ_{n}^{m, τ}$ for n = 20, 50, 100, 500, m = 5, 20, 30, 50, and τ = 0.4, 0.35, 0.3, 0.2. According the fact that the kernel is separating, ${\hat{C}}_{n}$ is able to estimate Lis_2,0.11,1,0.3 correctly.

FIGURE 5

Figure 5. From left to right and top to bottom. The learned set $Ĉ_{n}^{m, τ}$ with n respectively of 20, 50, 100, 500, and m = 5, 20, 30, 50, τ = 0.4, 0.35, 0.3, 0.2.

We now briefly discuss how to select the parameter m_n and τ_n from the data. The goal of set-learning problem is to recover the support of the probability distribution generating the data by using the given input observations. Since no output is present, set-learning belongs to the category of unsupervised learning problems, for which there is not a general framework accounting for model selection. However there are some possible strategies (whose analysis is out of the scope of this paper). A first approach, we used in our simulations, is based on the monotonicity properties of $Ĉ_{n}^{m, τ}$ with respect to m, τ. More precisely, given f ∈ (0, 1), we select (the smallest) m and (the biggest) τ such that at most nf observed points belong to the the estimated set. It is possible to prove that this method is consistent when f tends to 1 as the number of observations increases. Another way to select the parameters consists in transforming the set-learning problem is a supervised one and then performing standard model selection techniques like cross validation. In particular set-learning can be casted in a classification problem by associating the observed example to the class +1 and by defining an auxiliary measure μ (e.g., uniform on a ball of interest in $D$ ) associated to −1, from which n i.i.d. points are drawn. It is possible to prove that this last method is consistent when μ(suppρ) = 0.

4.4. The Role of the Regularization

We now explain the role of the filter function. Given a training set X₁, …, X_n of size n, the separating property (3) applied to the support of the empirical distribution gives that

{X_{1}, \dots, X_{n}} = {x \in D ∣ \hat{Q_{n}^{c}} (Φ (x) - {\hat{μ}}_{n}) = 0},

where ${\hat{μ}}_{n}$ is the empirical mean and $I - \hat{Q_{n}^{c}}$ is the orthogonal projection onto the linear space spanned by the family ${Φ (X_{1}) - {\hat{μ}}_{n}, \dots, Φ (X_{n}) - {\hat{μ}}_{n}}$ , which are the centered images of the examples. Hence, given a new point x ∈ D the condition $| \hat{Q_{n}^{c}} (Φ (x) - {\hat{μ}}_{n}) | | \leq τ$ with τ ≪ 1 is satisfied if only if x is close to one of the examples in the training set. Hence the naive estimator ${x \in D | ‖ \hat{Q_{n}^{c}} (Φ (x) - \hat{μ_{n}}) ‖ \leq τ}$ overfits the data. Hence we would like to replace $\hat{Q_{n}^{c}}$ with an operator, which should be close to the identity on the linear subspace spanned by ${Φ (X_{1}) - {\hat{μ}}_{n}, \dots, Φ (X_{n}) - {\hat{μ}}_{n}}$ and it should have a small range. To modulate the two requests, one can consider the following optimization problem

\min_{A \in S_{2}} (\frac{1}{n} \sum_{i = 1}^{n} {‖ (I - A) (Φ (X_{i}) - {\hat{μ}}_{n}) ‖}^{2} + λ {‖ A ‖}_{2}^{2}) .

We note that if A is a projection its Hilbert-Schmidt norm ||A||₂ is the square root of the dimension of the range of A. Since

\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} | | (I - A) (Φ (X_{i}) - {\hat{μ}}_{n}) | |^{2} \\ = Tr ((I - A^{⊤}) (I - A)) \frac{1}{n} \sum_{i = 1}^{n} (Φ (X_{i}) - {\hat{μ}}_{n}) \otimes (Φ (X_{i}) - {\hat{μ}}_{n})) \\ = Tr ((I - A^{⊤}) (I - A) \overset{⌢}{T_{n}^{c}}), \end{array}

where Tr(A) is the trace, A^⊤ is the transpose and $| | A | |_{2} = \sqrt{Tr (A^{⊤} A)}$ is the Hilbert-Schmidt norm, then

\begin{array}{l} \frac{1}{n} \sum_{i = 1}^{n} | | (I - A) (Φ (X_{i}) - {\hat{μ}}_{n}) | |^{2} + λ | | A | |_{2}^{2} \\ = Tr ((I - A^{⊤}) (I - A) {\hat{T}}_{n}^{c} + λ_{m} A^{⊤} A), \end{array}

and the optimal solution is given by

A_{opt} = \hat{T_{n}^{c}} {(\hat{T_{n}^{c}} + λ_{m})}^{- 1},

i.e., A_opt is precisely the operator $r_{m} (\hat{T_{n}^{c}})$ with the Tikhonov filter $r_{m} (σ) = \frac{σ}{σ + λ}$ and $λ = \frac{R}{m}$ . A different choice of the filter function r_m corresponds to a different regularization of the least-square problem

\min \frac{1}{n} \sum_{i = 1}^{n} | | (I - A) (Φ (X_{i}) - {\hat{μ}}_{n}) | |^{2} .

5. The Kernel Machine

In this section we show that the computation of $| | {\hat{F}}_{n} (x) | |$ , in terms of which is defined the estimator ${\hat{C}}_{n}$ , reduces to a finite dimensional problem, depending only on the Mercer kernel K, associated with the feature map. We introduce the centered sampling operator

S_{n}^{c} : ℋ \to ℝ^{n} {(S_{n}^{c} f)}_{i} = 〈 f, Φ (X_{i}) - {\hat{μ}}_{n} 〉,

whose transpose is given by

S_{n}^{c}^{⊤} : ℝ^{n} \to ℋ S_{n}^{c}^{⊤} v = \sum_{i = 1} v_{i} (Φ (X_{i}) - {\hat{μ}}_{n}),

where v_i is the i-th entry of the column vector v ∈ ℝⁿ. Hence, it holds that

\begin{array}{l} \frac{1}{n} S_{n}^{c}^{⊤} S_{n}^{c} = \hat{T_{n}^{c}}, \\ \frac{1}{n} S_{n}^{c} S_{n}^{c}^{⊤} = (I_{n} - \frac{1 1^{⊤}}{n}) \frac{K_{n}}{n} (I_{n} - \frac{1 1^{⊤}}{n}), \end{array}

where K_n is the n × n matrix whose (i, j)-entry is K(X_i, X_j) and I_n is the identity n × n matrix, so that the (i, j)-entry of $S_{n}^{c} {S_{n}^{c}}^{⊤}$ is

K (X_{i}, X_{j}) - \frac{1}{n} \sum_{b} K (X_{i}, X_{b}) - \frac{1}{n} \sum_{a} K (X_{a}, X_{j}) + \frac{1}{n^{2}} \sum_{a, b} K (X_{a}, X_{b}) .

Denoted by ℓ the rank of $S_{n}^{c} {S_{n}^{c}}^{⊤}$ , take the singular value decomposition of $S_{n}^{c} {S_{n}^{c}}^{⊤} / n$ , i.e.,

\frac{S_{n}^{c} S_{n}^{c}^{⊤}}{n} = V Σ V^{⊤},

where V is an n × ℓ matrix whose columns $v_{j} \in ℝ^{n}$ are the normalized eigenvectors, $V^{⊤} V = I_{ℓ}$ , and Σ is a diagonal ℓ × ℓ matrix with the strictly positive eigenvalues on the diagonal, i.e., $Σ = diag ({\hat{σ}}_{1}, \dots, {\hat{σ}}_{ℓ})$ . Set $U = {S_{n}^{c}}^{⊤} V Σ^{- \frac{1}{2}}$ , regarded as operator from ℝ^ℓ to $H$ , then a simple calculation shows that

\hat{T_{n}^{c}} = U Σ U^{⊤} r_{m} (\hat{T_{n}^{c}}) = U r_{m} (Σ) U^{⊤},

where r_m(Σ) is the diagonal ℓ × ℓ matrix

r_{m} (Σ) = diag (r_{m} ({\hat{σ}}_{1}), \dots, r_{m} ({\hat{σ}}_{ℓ})),

and the equation for $r_{m} (\hat{T_{n}^{c}})$ holds true since by assumption r_m(0) = 0. Hence

\begin{array}{l} | | {\hat{F}}_{n} (x) | |^{2} = 〈 (I - r_{m} (\hat{T_{n}^{c}})) (Φ (x) - {\hat{μ}}_{n}) (I - r_{m} (\hat{T_{n}^{c}})) (Φ (x) - {\hat{μ}}_{n}) 〉 \\ = 〈 Φ (x) - {\hat{μ}}_{n} Φ (x) - {\hat{μ}}_{n} 〉 - 〈 (2 r_{m} (\hat{T_{n}^{c}}) \\ - r_{m} {(\hat{T_{n}^{c}})}^{2}) (Φ (x) - {\hat{μ}}_{n}) 〉 Φ (x) - {\hat{μ}}_{n} \\ = 〈 Φ (x) - {\hat{μ}}_{n} Φ (x) - {\hat{μ}}_{n} 〉 - 〈 U (2 r_{m} (Σ) \\ - r_{m} {(Σ)}^{2}) U^{⊤} (Φ (x) - {\hat{μ}}_{n}) 〉 Φ (x) - {\hat{μ}}_{n} \\ = 〈 Φ (x) - {\hat{μ}}_{n} Φ (x) - {\hat{μ}}_{n} 〉 - 〈 V Σ^{- \frac{1}{2}} (2 r_{m} (Σ) \\ - r_{m} {(Σ)}^{2}) Σ^{- \frac{1}{2}} V^{⊤} S_{n}^{c} (Φ (x) - {\hat{μ}}_{n}) 〉 S_{n}^{c} (Φ (x) - {\hat{μ}}_{n}) \\ = w (x) - v {(x)}^{⊤} G_{m} v (x), & (19) \end{array}

where the real number $w (x) = 〈 Φ (x) - {\hat{μ}}_{n} Φ (x) - {\hat{μ}}_{n} 〉$ is

w (x) = K (x, x) - \frac{2}{n} \sum_{b} K (x, X_{b}) + \frac{1}{n^{2}} \sum_{a, b} K (X_{a}, X_{b}),

the i-th entry of the column vector v(x) ∈ ℝⁿ is

\begin{array}{l} v {(x)}_{i} = {(S_{n}^{c} (Φ (x) - {\hat{μ}}_{n}))}_{i} = K (x, X_{i}) - \frac{1}{n} \sum_{a} K (X_{a}, x) \\ - \frac{1}{n} \sum_{b} K (X_{i}, X_{b}) + \frac{1}{n^{2}} \sum_{a, b} K (X_{a}, X_{b}), \end{array}

the diagonal ℓ × ℓ matrix $R_{m} (Σ) = Σ^{- 1} (2 r_{m} (Σ) - r_{m} {(Σ)}^{2})$ is

R_{m} (Σ) = diag (\frac{2 r_{m} ({\hat{σ}}_{1}) - r_{m} {({\hat{σ}}_{1})}^{2}}{{\hat{σ}}_{1}}, \dots, \frac{2 r_{m} ({\hat{σ}}_{ℓ}) - r_{m} {({\hat{σ}}_{ℓ})}^{2}}{{\hat{σ}}_{ℓ}}),

and the n × n-matrix G_m is

\begin{matrix} G_{m} = V R_{m} (Σ) V^{⊤} . & (20) \end{matrix}

In Algorithm 1 we list the corresponding MatLab Code.

Algorithm 1. Matlab code for Set Learning.

The above equations make clear that both ${\hat{F}}_{n}$ and ${\hat{C}}_{n}$ can be computed in terms of the singular value decomposition (V, Σ) of the n × n Gram matrix K_n and of the filter function r_m, so that ${\hat{F}}_{n}$ belongs to the class of kernel methods and ${\hat{C}}_{n}$ is a plug-in estimator. For the hard cut-off filter, one simply has

R_{m} {(Σ)}_{i i} = {\begin{array}{l} \frac{1}{{\hat{σ}}_{i}} & {\hat{σ}}_{i} \geq {\hat{σ}}_{m} \\ 0 & {\hat{σ}}_{i} < {\hat{σ}}_{m} \end{array} .

For real applications, a delicate issue is the choice of the parameters m and τ, we refer to Rudi et al. [31] for a detailed discussion. Here, we add some simple remarks.

We first discuss the role of τ. According to (10), $Ĉ_{n}^{m, τ} \subseteq Ĉ_{n}^{m, τ^{'}}$ whenever τ < τ′. We exemplify this behavior with the dataset of Example 2. The training set is sampled from the distribution supported on the curve Lis_2,0.11,1,0.3 (see panel right of Figure 2) and we compute ${\hat{C}}_{n}$ with the Abel kernel, n = 100 and m ranging over 5, 10, 20, 50. Figure 6 shows the nested sets when τ runs in the associated color-bar.

FIGURE 6

Figure 6. From left to right and top to bottom: The family of sets $Ĉ_{100}^{5, τ}$ , $Ĉ_{100}^{10, τ}$ , $Ĉ_{100}^{20, τ}$ , $Ĉ_{100}^{50, τ}$ with τ varying as in the related colorbar.

Analyzing the role of m, we now show that, for the the hard cut-off filter, $Ĉ_{n}^{m^{'}, τ} \subseteq Ĉ_{n}^{m, τ}$ whenever m′ ≤ m. Indeed, this filter satisfies $r_{m^{'}} (σ) \leq r_{m} (σ)$ and, since 0 ≤ r_m(σ) ≤ 1, one has ${(1 - r_{m} (σ))}^{2} \leq {(1 - r_{m^{'}} (σ))}^{2}$ . Hence, denoted by ${{\hat{u}}_{j}}_{j}$ a base of eigenvectors of $\hat{T_{n}^{c}}$ , it holds that

\begin{array}{l} | | (I - r_{m} (\hat{T_{n}^{c}})) (Φ (x) - {\hat{μ}}_{n}) | |^{2} = \sum_{j} (1 - r_{m} ({\hat{σ}}_{j}))^{2} {〈 Φ (x) - {\hat{μ}}_{n} {\hat{u}}_{j} 〉}^{2} \\ \leq \sum_{j} (1 - r_{m^{'}} ({\hat{σ}}_{j}))^{2} {〈 Φ (x) - {\hat{μ}}_{n} {\hat{u}}_{j} 〉}^{2} \\ = | | (I - r_{m^{'}} (\hat{T_{n}^{c}})) (Φ (x) - {\hat{μ}}_{n}) | |^{2} . \end{array}

Hence, for any point in $x \in Ĉ_{n}^{m^{'}, τ}$ ,

\begin{array}{l} | | (I - r_{m} (\hat{T_{n}^{c}})) (Φ (x) - {\hat{μ}}_{n}) | |_{ℋ}^{2} \\ \leq | | (I - r_{m^{'}} (\hat{T_{n}^{c}})) (Φ (x) - {\hat{μ}}_{n}) | |_{ℋ}^{2} \leq τ^{2}, \end{array}

so that $x \in Ĉ_{n}^{m, τ}$ .

As above, we illustrate the different choices of m with the data sampled from the curve Lis_2,0.11,1,0.3 and the Abel kernel where n = 100 and τ ranges over 0.25, 0.3, 0.4, 0.5. Figure 7 shows the nested sets when m runs in the associated color-bar.

FIGURE 7

Figure 7. From left to right and top to bottom: The family of sets $Ĉ_{100}^{m, 0.25}$ , $Ĉ_{100}^{m, 0.3}$ , $Ĉ_{100}^{m, 0.4}$ , $Ĉ_{100}^{m, 0.5}$ with m varying as in the related colorbar.

6. Discussion

We presented a new class of set estimators, which are able to learn the support of an unknown probability distribution from a training set of random data. The set estimator is defined through a decision function, which can be seen as a novelty/anomality detection algorithm as in Schölkopf et al. [6].

The decision function we defined is a kernel machine. It is computed by the singular value decomposition of the empirical (kernel)-covariance matrix and by a low pass filter. An example of filter is the hard cut-off function and the corresponding decision function reduces to KPCA algorithm for novelty detection first introduced by Hoffmann [16]. However, we showed that it is possible to choose other low pass filters, as it was done for a class of supervised algorithms in the regression/classification setting [38].

Under some weak assumptions on the low pass filter, we proved that the corresponding set estimator is strongly consistent with respect to the Hausdorff distance, provided that the kernel satisfies a suitable separating condition, as it happens, for example, for the Abel kernel. Furthermore, by comparing Theorem 2 with a similar consistency result in De Vito et al. [27], it appears clear that the algorithm correctly learns the support both if the data have zero mean, as in our paper, and if the data are not centered, as in De Vito et al. [27]. On the contrary, if the separating property does not hold, the algorithm learns only the supports that are mapped into linear subspaces by the feature map defined by the kernel.

The set estimator we introduced depends on two parameters: the effective number m of eigenvectors defining the decision function and the thickness τ of the region estimating the support. The role of these parameters and of the separating property was briefly discussed by a few tests on toy data.

We finally observe that our class of set learning algorithms is very similar to classical kernel machines in supervised learning. So, in order to reduce both the computational cost and the memory requirements, there is the possibility to successfully implement some new advanced approximation techniques, for which there exist theoretical guarantees for the statistical learning setting. For example random features [39, 40], Nyström projections [41, 42] or mixed approaches with iterative regularization and preconditioning [43, 44].

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

ED is member of the Gruppo Nazionale per l'Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM).

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2017.00023/full#supplementary-material

Footnotes

1. ^Here, the labeling is different from the one in (6), where the eigenvalues are repeated according to their multiplicity.

References

1. Devroye L, Wise GL. Detection of abnormal behavior via nonparametric estimation of the support. SIAM J Appl Math. (1980) 38:480–8. doi: 10.1137/0138038

CrossRef Full Text | Google Scholar

2. Korostelëv AP, Tsybakov AB. Minimax Theory of Image Reconstruction. New York, NY: Springer-Verlag (1993).

Google Scholar

3. Dümbgen L, Walther G. Rates of convergence for random approximations of convex sets. Adv Appl Probab. (1996) 28:384–93. doi: 10.2307/1428063

CrossRef Full Text | Google Scholar

4. Cuevas A, Fraiman R. A plug-in approach to support estimation. Ann Stat. (1997) 25:2300–12. doi: 10.1214/aos/1030741073

CrossRef Full Text | Google Scholar

5. Tsybakov AB. On nonparametric estimation of density level sets. Ann Stat. (1997) 25:948–69. doi: 10.1214/aos/1069362732

CrossRef Full Text | Google Scholar

6. Schölkopf B, Platt J, Shawe-Taylor J, Smola A, Williamson R. Estimating the support of a high-dimensional distribution. Neural Comput. (2001) 13:1443–71. doi: 10.1162/089976601750264965

CrossRef Full Text | Google Scholar

7. Cuevas A, Rodríguez-Casal A. Set estimation: an overview and some recent developments. In: Recent Advances and Trends in Nonparametric Statistics. Elsevier: Amsterdam (2003). p. 251–64.

8. Reitzner M. Random polytopes and the Efron-Stein jackknife inequality. Ann Probab. (2003) 31:2136–66. doi: 10.1214/aop/1068646381

CrossRef Full Text | Google Scholar

9. Steinwart I, Hush D, Scovel C. A classification framework for anomaly detection. J Mach Learn Res. (2005) 6:211–32.

Google Scholar

10. Vert R, Vert JP. Consistency and convergence rates of one-class SVMs and related algorithms. J Mach Learn Res. (2006) 7:817–54.

Google Scholar

11. Scott CD, Nowak RD. Learning minimum volume sets. J Mach Learn Res. (2006) 7:665–704.

Google Scholar

12. Biau G, Cadre B, Mason D, Pelletier B. Asymptotic normality in density support estimation. Electron J Probab. (2009) 91:2617–35.

Google Scholar

13. Cuevas A, Fraiman R. Set estimation. In: W. Kendall and I. Molchanov, editors. New Perspectives in Stochastic Geometry. Oxford: Oxford University Press (2010). p. 374–97.

14. Bobrowski O, Mukherjee S, Taylor JE. Topological consistency via kernel estimation. Bernoulli (2017) 23:288–328. doi: 10.3150/15-BEJ744

CrossRef Full Text | Google Scholar

15. Campos GO, Zimek A, Sander J, Campello RJ, Micenková B, Schubert E, et al. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Discov. (2016) 30:891–927. doi: 10.1007/s10618-015-0444-8

CrossRef Full Text | Google Scholar

16. Hoffmann H. Kernel PCA for novelty detection. Pattern Recognit. (2007) 40:863–74. doi: 10.1016/j.patcog.2006.07.009

CrossRef Full Text | Google Scholar

17. Schölkopf B, Smola A, Müller KR. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. (1998) 10:1299–319.

Google Scholar

18. Ristic B, La Scala B, Morelande M, Gordon N. Statistical analysis of motion patterns in AIS data: anomaly detection and motion prediction. In: 2008 11th International Conference on Information Fusion (2008). p. 1–7.

Google Scholar

19. Lee HJ, Cho S, Shin MS. Supporting diagnosis of attention-deficit hyperactive disorder with novelty detection. Artif Intell Med. (2008) 42:199–212. doi: 10.1016/j.artmed.2007.11.001

CrossRef Full Text | Google Scholar

20. Valero-Cuevas FJ, Hoffmann H, Kurse MU, Kutch JJ, Theodorou EA. Computational models for neuromuscular function. IEEE Rev Biomed Eng. (2009) 2:110–35. doi: 10.1109/RBME.2009.2034981

CrossRef Full Text | Google Scholar

21. He F, Yang JH, Li M, Xu JW. Research on nonlinear process monitoring and fault diagnosis based on kernel principal component analysis. Key Eng Mater. (2009) 413:583–90. doi: 10.4028/www.scientific.net/KEM.413-414.583

CrossRef Full Text | Google Scholar

22. Maestri ML, Cassanello MC, Horowitz GI. Kernel PCA performance in processes with multiple operation modes. Chem Prod Process Model. (2009) 4:1934–2659. doi: 10.2202/1934-2659.1383

CrossRef Full Text | Google Scholar

23. Cheng P, Li W, Ogunbona P. Kernel PCA of HOG features for posture detection. In: VCNZ'09. 24th International Conference on Image and Vision Computing New Zealand, 2009. Wellington (2009). p. 415–20.

Google Scholar

24. Sofman B, Bagnell JA, Stentz A. Anytime online novelty detection for vehicle safeguarding. In: 2010 IEEE International Conference on Robotics and Automation (ICRA). Pittsburgh, PA (2010). p. 1247–54.

Google Scholar

25. De Vito E, Rosasco L, Toigo A. A universally consistent spectral estimator for the support of a distribution. Appl Comput Harmonic Anal. (2014) 37:185–217. doi: 10.1016/j.acha.2013.11.003

CrossRef Full Text | Google Scholar

26. Steinwart I. On the influence of the kernel on the consistency of support vector machines. J Mach Learn Res. (2002) 2:67–93. doi: 10.1162/153244302760185252

CrossRef Full Text | Google Scholar

27. De Vito E, Rosasco L, Toigo A. Spectral regularization for support estimation. In: NIPS. Vancouver, BC (2010). p. 1–9.

Google Scholar

28. Engl HW, Hanke M, Neubauer A. Regularization of Inverse Problems. Vol. 375 of Mathematics and its Applications. Dordrecht: Kluwer Academic Publishers Group (1996).

29. Lo Gerfo L, Rosasco L, Odone F, De Vito E, Verri A. Spectral algorithms for supervised learning. Neural Comput. (2008) 20:1873–97. doi: 10.1162/neco.2008.05-07-517

CrossRef Full Text | Google Scholar

30. Blanchard G, Mücke N. Optimal rates for regularization of statistical inverse learning problems. In: Foundations of Computational Mathematics. (2017). Available online at: https://arxiv.org/abs/1604.04054

Google Scholar

31. Rudi A, Odone, F, De Vito E. Geometrical and computational aspects of Spectral Support Estimation for novelty detection. Pattern Recognit Lett. (2014) 36:107–16. doi: 10.1016/j.patrec.2013.09.025

CrossRef Full Text | Google Scholar

32. Rudi A, Canas, GD, Rosasco L. On the sample complexity of subspace learning. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems. Lake Tahoe: Neural Information Processing Systems Conference (2013). p. 2067–75.

Google Scholar

33. Rudi A, Canas GD, De Vito E, Rosasco L. Learning Sets and Subspaces. In: Suykens JAK, Signoretto M, and Argyriou A, editors. Regularization, Optimization, Kernels, and Support Vector Machines. Boca Raton, FL: Chapman and Hall/CRC (2014). p. 337.

Google Scholar

34. Blanchard G, Bousquet O, Zwald L. Statistical properties of kernel principal component analysis. Machine Learn. (2007) 66:259–94. doi: 10.1007/s10994-006-8886-2

CrossRef Full Text | Google Scholar

35. Györfi L, Kohler M, Krzy zak A, Walk H. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. New York, NY: Springer-Verlag (2002).

Google Scholar

36. Steinwart I, Christmann A. Support Vector Machines. Information Science and Statistics. New York, NY: Springer (2008).

37. Zwald L, Blanchard G. On the Convergence of eigenspaces in kernel principal component analysis. In: Weiss Y, Schölkopf B, Platt J, editors. Advances in Neural Information Processing Systems 18. Cambridge, MA: MIT Press (2006). p. 1649–56.

Google Scholar

38. De Vito E, Rosasco L, Caponnetto A, De Giovannini U, Odone F. Learning from examples as an inverse problem. J Machine Learn Res. (2005) 6:883–904.

Google Scholar

39. Rahimi A, Recht B. Random features for large-scale kernel machines. In: Koller D, Schuurmans D, Bengio, Y, Bottou L, editors. Advances in Neural Information Processing Systems. Vancouver, BC: Neural Information Processing Systems Conference (2008). p. 1177–84.

Google Scholar

40. Rudi A, Camoriano R, Rosasco L. Generalization properties of learning with random features. arXiv preprint arXiv:160204474 (2016).

Google Scholar

41. Smola AJ, Schölkopf B. Sparse greedy matrix approximation for machine learning. In: Proceeding ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning. Stanford, CA: Morgan Kaufmann (2000). p. 911–18.

Google Scholar

42. Rudi A, Camoriano R, Rosasco L. Less is more: nyström computational regularization. In: Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R, editors. Advances in Neural Information Processing Systems. Montreal, QC: Neural Information Processing Systems Conference (2015). p. 1657–1665.

Google Scholar

43. Camoriano R, Angles T, Rudi A, Rosasco L. NYTRO: when subsampling meets early stopping. In: Gretton A, Robert CC, editors. Artificial Intelligence and Statistics. Cadiz: Proceedings of Machine Learning Research (2016). p. 1403–11.

Google Scholar

44. Rudi A, Carratino L, Rosasco L. FALKON: an optimal large scale Kernel method. arXiv preprint arXiv:170510958 (2017).

Google Scholar

45. Folland G. A Course in Abstract Harmonic Analysis. Studies in Advanced Mathematics. Boca Raton, FL: CRC Press (1995).

Google Scholar

46. Birman MS, Solomyak M. Double operator integrals in a Hilbert space. Integr Equat Oper Theor. (2003) 47:131–68. doi: 10.1007/s00020-003-1157-8

CrossRef Full Text | Google Scholar

47. De Vito E, Umanità V, Villa S. A consistent algorithm to solve Lasso, elastic-net and Tikhonov regularization. J Complex. (2011) 27:188–200. doi: 10.1016/j.jco.2011.01.003

CrossRef Full Text | Google Scholar

Keywords: support estimation, Kernel PCA, novelty detection, dimensionality reduction, regularized Kernel methods

Citation: Rudi A, De Vito E, Verri A and Odone F (2017) Regularized Kernel Algorithms for Support Estimation. Front. Appl. Math. Stat. 3:23. doi: 10.3389/fams.2017.00023

Received: 11 August 2017; Accepted: 25 October 2017;
Published: 08 November 2017.

Edited by:

Ding-Xuan Zhou, City University of Hong Kong, Hong Kong

Reviewed by:

Ting Hu, Wuhan University, China
Sergiy Pereverzyev, University of Innsbruck, Austria
Qiang Wu, Middle Tennessee State University, United States

Copyright © 2017 Rudi, De Vito, Verri and Odone. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ernesto De Vito, ZGV2aXRvQGRpbWEudW5pZ2UuaXQ=

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.