Optimal Rates for the Regularized Learning Algorithms under General Source Condition

Rastogi, Abhishake; Sampath, Sivananthan

doi:10.3389/fams.2017.00003

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 27 March 2017

Sec. Mathematics of Computation and Data Science

Volume 3 - 2017 | https://doi.org/10.3389/fams.2017.00003

Abhishake Rastogi^*

Sivananthan Sampath

Department of Mathematics, Indian Institute of Technology Delhi, New Delhi, India

We consider the learning algorithms under general source condition with the polynomial decay of the eigenvalues of the integral operator in vector-valued function setting. We discuss the upper convergence rates of Tikhonov regularizer under general source condition corresponding to increasing monotone index function. The convergence issues are studied for general regularization schemes by using the concept of operator monotone index functions in minimax setting. Further we also address the minimum possible error for any learning algorithm.

1. Introduction

Learning theory [1–3] aims to learn the relation between inputs and outputs from finite random samples. We require some underlying space in which to search for the relation function. From prior experience we have some idea about this underlying space, which is called the hypothesis space. Learning algorithms try to infer the best estimator f over the hypothesis space such that f(x) gives the maximum information about the output variable y for any unseen input x. The given samples $\{(x_{i}, y_{i})\}_{i = 1}^{m}$ are not exact, in the sense that for the underlying relation function f we have f(x_i) ≠ y_i but f(x_i) ≈ y_i. We assume that the uncertainty follows the probability distribution ρ on the sample space X × Y and the underlying function (called the regression function) for the probability distribution ρ is given by

f_{ρ} (x) = \int_{Y} y d ρ (y | x), x \in X,

where ρ(y|x) is the conditional probability measure for given x. The problem of obtaining an estimator from examples is ill-posed. Therefore, we apply regularization schemes [4–7] to stabilize the problem. Various regularization schemes are studied for inverse problems. In the context of learning theory [2, 3, 8–10], the square loss-regularization (Tikhonov regularization) is widely considered to obtain the regularized estimator [9, 11–16]. Gerfo et al. [6] introduced general regularization in learning theory and provided error bounds under Hölder's source condition [5]. Bauer et al. [4] discussed the convergence issues for general regularization under general source condition [17] by removing the Lipschitz condition on the regularization considered in Gerfo et al. [6]. Caponnetto and De Vito [12] discussed the square-loss regularization under the polynomial decay of the eigenvalues of the integral operator L_K with Hölder's source condition. For the inverse statistical learning problem, Blanchard and Mücke [18] analyzed the convergence rates for general regularization schemes under Hölder's source condition in the scalar-valued function setting. Here we discuss the convergence issues of general regularization schemes under general source condition and the polynomial decay of the eigenvalues of the integral operator in the vector-valued framework. We present the minimax upper convergence rates for Tikhonov regularization under general source condition Ω_{ϕ, R}, for a monotone increasing index function ϕ. For general regularization the minimax rates are obtained using the operator monotone index function ϕ. The concept of effective dimension [19, 20] is exploited to achieve the convergence rates, and the effective dimension plays an important role in the choice of regularization parameters. We also discuss the lower convergence rates for any learning algorithm under the smoothness conditions. We present the results in the vector-valued function setting; in particular, they can therefore be applied to multi-task learning problems.

The structure of the paper is as follows. In the second section, we introduce some basic assumptions and notations for supervised learning problems. In Section 3, we present the upper and lower convergence rates under the smoothness conditions in minimax setting.

2. Learning From Examples: Notations and Assumptions

In the learning theory framework [2, 3, 8–10], the sample space Z = X × Y consists of two spaces: The input space X (locally compact second countable Hausdorff space) and the output space (Y, 〈·, ·〉_Y) (the real separable Hilbert space). The input space X and the output space Y are related by some unknown probability distribution ρ on Z. The probability measure can be split as ρ(x, y) = ρ(y|x)ρ_X(x), where ρ(y|x) is the conditional probability measure of y given x and ρ_X is the marginal probability measure on X. The only available information is the random i.i.d. samples z = ((x₁, y₁), …, (x_m, y_m)) drawn according to the probability measure ρ. Given the training set z, learning theory aims to develop an algorithm which provides an estimator f_z : X → Y such that f_z(x) predicts the output variable y for any given input x. The goodness of the estimator can be measured by the generalization error of a function f which can be defined as

\begin{array}{rcl} E (f) : = E_{ρ} (f) = \int_{Z} V (f (x), y) d ρ (x, y), & (1) \end{array}

where V : Y × Y → ℝ is the loss function. The minimizer of $E (f)$ for the square loss function $V (f (x), y) = | | f (x) - y | |_{Y}^{2}$ is given by

\begin{array}{rcl} f_{ρ} (x) : = \int_{Y} y d ρ (y | x), & (2) \end{array}

where f_ρ is called the regression function. The regression function f_ρ belongs to the space of square integrable functions provided that

\begin{array}{rcl} \int_{Z} | | y | |_{Y}^{2} d ρ (x, y) < \infty . & (3) \end{array}

We search the minimizer of the generalization error over a hypothesis space $H$ ,

\begin{array}{rcl} f_{H} : = \underset{f \in H}{argmin} \left\{ \int_{Z} \| f (x) - y \|_{Y}^{2} d ρ (x, y) \right\}, & (4) \end{array}

where f_$H$ is called the target function. In case f_ρ ∈ $H$ , f_$H$ becomes the regression function f_ρ.

Because of inaccessibility of the probability distribution ρ, we minimize the regularized empirical estimate of the generalization error over the hypothesis space $H$ ,

\begin{array}{rcl} f_{z, λ} : = \underset{f \in H}{argmin} \left\{ \frac{1}{m} \sum_{i = 1}^{m} \| f (x_{i}) - y_{i} \|_{Y}^{2} + λ \| f \|_{H}^{2} \right\}, & (5) \end{array}

where λ is the positive regularization parameter. The regularization schemes [4–7, 10] are used to incorporate various features in the solution such as boundedness, monotonicity and smoothness. In order to optimize the vector-valued regularization functional, one of the main problems is to choose the appropriate hypothesis space which is assumed to be a source to provide the estimator.

2.1. Reproducing Kernel Hilbert Space as a Hypothesis Space

Definition 2.1. (Vector-valued reproducing kernel Hilbert space) For non-empty set X and the real Hilbert space (Y, 〈·, ·〉_Y), the Hilbert space ( $H$ , 〈·, ·〉_$H$) of functions from X to Y is called reproducing kernel Hilbert space if for any x ∈ X and y ∈ Y the linear functional which maps f ∈ $H$ to 〈y, f(x)〉_Y is continuous.

By Riesz lemma [21], for every x ∈ X and y ∈ Y there exists a linear operator K_x : Y → $H$ such that

{〈 y, f (x) 〉}_{Y} = {〈 K_{x} y, f 〉}_{H}, \forall f \in H .

Therefore, the adjoint operator $K_{x}^{*} : H \to Y$ is given by $K_{x}^{*} f = f (x)$ . Through the linear operator K_x : Y → $H$ we define the linear operator K(x, t) : Y → Y,

K (x, t) y : = K_{t} y (x) .

From Proposition 2.1 [22], the linear operator $K (x, t) \in L (Y)$ (the set of bounded linear operators on Y), K(x, t) = K(t, x)* and K(x, x) is a non-negative bounded linear operator. For any m ∈ ℕ, {x_i: 1 ≤ i ≤ m} ⊂ X, {y_i: 1 ≤ i ≤ m} ⊂ Y, we have that $\sum_{i, j = 1}^{m} 〈 y_{i}, K (x_{i}, x_{j}) y_{j} 〉_{Y} \geq 0$ . The operator-valued function $K : X \times X \to L (Y)$ is called the kernel.

There is one to one correspondence between the kernels and reproducing kernel Hilbert spaces [22, 23]. So a reproducing kernel Hilbert space $H$ corresponding to a kernel K can be denoted as $H$ _K and the norm in the space $H$ can be denoted as ||·||_{$H$ _K}. In the following article, we suppress K by simply using $H$ for reproducing kernel Hilbert space and ||·||_$H$ for its norm.

Throughout the paper we assume the reproducing kernel Hilbert space $H$ is separable such that

(i) K_x : Y → $H$ is a Hilbert-Schmidt operator for all x ∈ X and $κ : = \sqrt{sup_{x \in X} T r (K_{x}^{*} K_{x})} < \infty$ .

(ii) The real function from X × X to ℝ, defined by (x, t) ↦ 〈K_tv, K_xw〉_$H$, is measurable ∀v, w ∈ Y.

By the representation theorem [22], the solution of the penalized regularization problem (5) will be of the form:

f_{z, λ} = \sum_{i = 1}^{m} K_{x_{i}} c_{i}, for (c_{1}, \dots, c_{m}) \in Y^{m} .

Definition 2.2. Let $H$ be a separable Hilbert space and $\{e_{k}\}_{k = 1}^{\infty}$ be an orthonormal basis of $H$ . Then for any positive operator $A \in L (H)$ we define $T r (A) = \sum_{k = 1}^{\infty} 〈 A e_{k}, e_{k} 〉$ . It is well-known that the number Tr(A) is independent of the choice of the orthonormal basis.

Definition 2.3. An operator $A \in L (H)$ is called Hilbert-Schmidt operator if Tr(A*A) < ∞. The family of all Hilbert-Schmidt operators is denoted by $L_{2} (H)$ . For $A \in L_{2} (H)$ , we define $T r (A) = \sum_{k = 1}^{\infty} 〈 A e_{k}, e_{k} 〉$ for an orthonormal basis ${e_{k}}_{k = 1}^{\infty}$ of $H$ .

It is well-known that $L_{2} (H)$ is the separable Hilbert space with the inner product,

{〈 A, B 〉}_{L_{2} (H)} = T r (B^{*} A)

and its norm satisfies

| | A | |_{L (H)} \leq | | A | |_{L_{2} (H)} \leq T r (| A |),

where $| A | = \sqrt{A^{*} A}$ and $| | \cdot | |_{L (H)}$ is the operator norm (For more details see [24]).

For the positive trace class operator $K_{x} K_{x}^{*}$ , we have

| | K_{x} K_{x}^{*} | |_{L (H)} \leq | | K_{x} K_{x}^{*} | |_{L_{2} (H)} \leq T r (K_{x} K_{x}^{*}) \leq κ^{2} .

Given the ordered set $x = (x_{1}, \dots, x_{m}) \in X^{m}$ , the sampling operator $S_{x} : H \to Y^{m}$ is defined by S_x(f) = (f(x₁), …, f(x_m)) and its adjoint $S_{x}^{*} : Y^{m} \to H$ is given by $S_{x}^{*} y = \frac{1}{m} \sum_{i = 1}^{m} K_{x_{i}} y_{i}, \forall y = (y_{1}, \dots, y_{m}) \in Y^{m} .$

The regularization scheme (5) can be expressed as

\begin{array}{rcl} f_{z, λ} = \underset{f \in H}{argmin} \left\{ \| S_{x} f - y \|_{m}^{2} + λ \| f \|_{H}^{2} \right\}, & (6) \end{array}

where $| | y | |_{m}^{2} = \frac{1}{m} \sum_{i = 1}^{m} | | y_{i} | |_{Y}^{2}$ .

We obtain the explicit expression of f_{z, λ} by taking the functional derivative of above expression over RKHS $H$ .

Theorem 2.1. For the positive choice of λ, the functional (6) has unique minimizer:

\begin{array}{rcl} f_{z, λ} = {(S_{x}^{*} S_{x} + λ I)}^{- 1} S_{x}^{*} y . & (7) \end{array}
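To make (7) concrete, the following is a minimal sketch for the scalar case Y = ℝ. Using the identity $(S_{x}^{*} S_{x} + λ I)^{- 1} S_{x}^{*} = S_{x}^{*} (S_{x} S_{x}^{*} + λ I)^{- 1}$ and $S_{x} S_{x}^{*} = \frac{1}{m} K$ for the kernel matrix K = (K(x_i, x_j)), the estimator takes the form $f_{z, λ} = \sum_{i} c_{i} K_{x_{i}}$ with c = (K + mλI)⁻¹y. The Gaussian kernel, the synthetic data and the names `gaussian_kernel`, `tikhonov_fit` and `tikhonov_predict` are assumptions made only for illustration and are not part of the paper.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    """Scalar Gaussian kernel matrix K[i, j] = exp(-(a_i - b_j)^2 / (2 width^2))."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * width**2))

def tikhonov_fit(x, y, lam, width=1.0):
    """Coefficients c of f = sum_i c_i K(x_i, .), i.e. c = (K + m*lam*I)^{-1} y."""
    m = len(x)
    K = gaussian_kernel(x, x, width)
    return np.linalg.solve(K + m * lam * np.eye(m), y)

def tikhonov_predict(x_train, c, x_new, width=1.0):
    """Evaluate f(x_new) = sum_i c_i K(x_new, x_i)."""
    return gaussian_kernel(x_new, x_train, width) @ c

# Synthetic noisy samples (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
c = tikhonov_fit(x, y, lam=1e-3)
```

The coefficients satisfy the linear system (K + mλI)c = y, which is the finite-dimensional form of the normal equation behind (7).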

Define f_λ as the minimizer of the optimization functional,

\begin{array}{rcl} f_{λ} : = \underset{f \in H}{argmin} \left\{ \int_{Z} \| f (x) - y \|_{Y}^{2} d ρ (x, y) + λ \| f \|_{H}^{2} \right\} . & (8) \end{array}

Using the fact $E (f) = | | L_{K}^{1 / 2} (f - f_{H}) | |_{H}^{2} + E (f_{H})$ , we get the expression of f_λ,

\begin{array}{rcl} f_{λ} = {(L_{K} + λ I)}^{- 1} L_{K} f_{H}, & (9) \end{array}

where the integral operator $L_{K} : L_{ρ_{X}}^{2} \to L_{ρ_{X}}^{2}$ is a self-adjoint, non-negative, compact operator, defined as

L_{K} (f) (x) : = \int_{X} K (x, t) f (t) d ρ_{X} (t), x \in X .

The integral operator L_K can also be defined as a self-adjoint operator on $H$ . We use the same notation L_K for both the operators defined on different domains. It is well-known that $L_{K}^{1 / 2}$ is an isometry from the space of square integrable functions to reproducing kernel Hilbert space.

In order to achieve the uniform convergence rates for learning algorithms we need some prior assumptions on the probability measure ρ. Following the notion of Bauer et al. [4] and Caponnetto and De Vito [12], we consider the class of probability measures $P_{ϕ}$ which satisfies the assumptions:

(i) For the probability measure ρ on X × Y,

\begin{array}{rcl} \int_{Z} | | y | |_{Y}^{2} d ρ (x, y) < \infty . & (10) \end{array}

(ii) The minimizer of the generalization error f_$H$ (4) over the hypothesis space $H$ exists.

(iii) There exist some constants M, Σ such that for almost all x ∈ X,

\begin{array}{rcl} \int_{Y} (e^{| | y - f_{H} (x) | |_{Y} / M} - \frac{| | y - f_{H} (x) | |_{Y}}{M} - 1) d ρ (y | x) \leq \frac{Σ^{2}}{2 M^{2}} . & (11) \end{array}

(iv) The target function f_$H$ belongs to the class Ω_{ϕ, R} with

\begin{array}{rcl} Ω_{ϕ, R} : = \left\{ f \in H : f = ϕ (L_{K}) g \text{ and } \| g \|_{H} \leq R \right\}, & (12) \end{array}

where ϕ is a continuous increasing index function defined on the interval [0, κ²] with the assumption ϕ(0) = 0. This condition is usually referred to as general source condition [17].

In addition, we consider the set of probability measures $P_{ϕ, b}$ which satisfies the conditions (i), (ii), (iii), (iv) and the eigenvalues t_n's of the integral operator L_K follow the polynomial decay: For fixed positive constants α, β and b > 1,

\begin{array}{rcl} α n^{- b} \leq t_{n} \leq β n^{- b} \forall n \in ℕ . & (13) \end{array}

Under the polynomial decay of the eigenvalues the effective dimension $N (λ)$ , to measure the complexity of RKHS, can be estimated from Proposition 3 [12] as follows,

\begin{array}{rcl} N (λ) : = T r ({(L_{K} + λ I)}^{- 1} L_{K}) \leq \frac{β b}{b - 1} λ^{- 1 / b}, for b > 1 & (14) \end{array}

and without the polynomial decay condition (13), we have

N (λ) \leq | | {(L_{K} + λ I)}^{- 1} | |_{L (H)} T r (L_{K}) \leq \frac{κ^{2}}{λ} .
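The two bounds on the effective dimension can be compared numerically. The sketch below evaluates $N (λ) = \sum_{n} t_{n} / (t_{n} + λ)$ for the model spectrum t_n = β n^{−b} and compares it with the right-hand side of (14). The values β = 1 and b = 2, the truncation of the spectrum at 2 × 10⁵ terms and the function names are assumptions chosen for illustration.

```python
def effective_dimension(eigenvalues, lam):
    """N(lam) = sum_n t_n / (t_n + lam) for the given (truncated) spectrum."""
    return sum(t / (t + lam) for t in eigenvalues)

def polynomial_decay_bound(beta, b, lam):
    """Right-hand side of (14): beta * b / (b - 1) * lam^(-1/b)."""
    return beta * b / (b - 1.0) * lam ** (-1.0 / b)

beta, b = 1.0, 2.0
# Model spectrum t_n = beta * n^(-b), truncated (assumption for the sketch).
spectrum = [beta * n ** (-b) for n in range(1, 200001)]
trace = sum(spectrum)  # plays the role of Tr(L_K), which is at most kappa^2

for lam in (1e-1, 1e-2, 1e-3):
    n_eff = effective_dimension(spectrum, lam)
    print(lam, n_eff, polynomial_decay_bound(beta, b, lam), trace / lam)
```

For small λ the bound of order λ^{−1/b} is much smaller than the generic bound of order 1/λ, which is what makes the polynomial decay assumption useful in the parameter choice.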

We discuss the convergence issues for the learning algorithms (z → f_z ∈ $H$ ) in probabilistic sense by exponential tail inequalities such that

\mathrm{Prob}_{z} \left\{ \| f_{z} - f_{ρ} \|_{ρ} \leq ε (m) \log \left( \frac{1}{η} \right) \right\} \geq 1 - η

for all 0 < η ≤ 1 and ε(m) is a positive decreasing function of m. Using these probabilistic estimates we can obtain error estimates in expectation by integration of tail inequalities:

\begin{array}{l} E_{z} (| | f_{z} - f_{ρ} | |_{ρ}) = \int_{0}^{\infty} P r o b_{z} (| | f_{z} - f_{ρ} | |_{ρ} > t) d t \\ \leq \int_{0}^{\infty} exp (- \frac{t}{ε (m)}) d t = ε (m), \end{array}

where $\| f \|_{ρ} = \| f \|_{L_{ρ_{X}}^{2}} = \left\{ \int_{X} \| f (x) \|_{Y}^{2} d ρ_{X} (x) \right\}^{1 / 2}$ and $E_{z} (ξ) = \int_{Z^{m}} ξ d ρ (z_{1}) \dots d ρ (z_{m})$ .

3. Convergence Analysis

In this section, we analyze the convergence issues of the learning algorithms on reproducing kernel Hilbert space under the smoothness priors in the supervised learning framework. We discuss the upper and lower convergence rates for vector-valued estimators in the standard minimax setting. Therefore, the estimates can be utilized particularly for scalar-valued functions and multi-task learning algorithms.

3.1. Upper Rates for Tikhonov Regularization

In general, we consider Tikhonov regularization in learning theory. Tikhonov regularization is briefly discussed in the literature [7, 9, 10, 25]. The error estimates for Tikhonov regularization are discussed theoretically under Hölder's source condition [12, 15, 16]. We establish the error estimates for the Tikhonov regularization scheme under the general source condition f_$H$ ∈ Ω_{ϕ, R} for some continuous increasing index function ϕ and the polynomial decay of the eigenvalues of the integral operator L_K.

In order to estimate the error bounds, we consider the following inequality used in the papers [4, 12] which is based on the results of Pinelis and Sakhanenko [26].

Proposition 3.1. Let ξ be a random variable on the probability space $(Ω, B, P)$ with values in real separable Hilbert space $H$ . If there exist two constants Q and S satisfying

\begin{array}{rcl} E \left\{ \| ξ - E (ξ) \|_{H}^{n} \right\} \leq \frac{1}{2} n! S^{2} Q^{n - 2} \quad \forall n \geq 2, & (15) \end{array}

then for any 0 < η < 1 and for all m ∈ ℕ,

\begin{array}{l} \mathrm{Prob} \left\{ (ω_{1}, \dots, ω_{m}) \in Ω^{m} : \left\| \frac{1}{m} \sum_{i = 1}^{m} [ξ (ω_{i}) - E (ξ (ω_{i}))] \right\|_{H} \leq 2 \left( \frac{Q}{m} + \frac{S}{\sqrt{m}} \right) \log \left( \frac{2}{η} \right) \right\} \geq 1 - η . \end{array}

In particular, the inequality (15) holds if

\| ξ (ω) \|_{H} \leq Q \quad \text{and} \quad E (\| ξ (ω) \|_{H}^{2}) \leq S^{2} .

We estimate the error bounds for the regularized estimators by measuring the effect of random sampling and the complexity of f_$H$. The quantities described in Proposition 3.2 express the probabilistic estimates of the perturbation measure due to random sampling. The expressions of Proposition 3.3 describe the complexity of the target function f_$H$ which are usually referred to as the approximation errors. The approximation errors are independent of the samples z.

Proposition 3.2. Let z be i.i.d. samples drawn according to the probability measure ρ satisfying the assumptions (10), (11) and $κ = \sqrt{sup_{x \in X} T r (K_{x}^{*} K_{x})}$ . Then for all 0 < η < 1, we have

\begin{array}{rcl} \left\| {(L_{K} + λ I)}^{- 1 / 2} \{ S_{x}^{*} y - S_{x}^{*} S_{x} f_{H} \} \right\|_{H} \leq 2 \left( \frac{κ M}{m \sqrt{λ}} + \sqrt{\frac{Σ^{2} N (λ)}{m}} \right) \log \left( \frac{4}{η} \right) & (16) \end{array}

and

\begin{array}{rcl} \| S_{x}^{*} S_{x} - L_{K} \|_{L_{2} (H)} \leq 2 \left( \frac{κ^{2}}{m} + \frac{κ^{2}}{\sqrt{m}} \right) \log \left( \frac{4}{η} \right) & (17) \end{array}

with the confidence 1 − η.

The proof of the first expression is the content of the step 3.2 of Theorem 4 [12] while the proof of the second expression can be obtained from Theorem 2 in De Vito et al. [25].

Proposition 3.3. Suppose f_$H$ ∈ Ω_ϕ,R. Then,

(i) Under the assumption that $ϕ (t) \sqrt{t}$ and $\sqrt{t} / ϕ (t)$ are non-decreasing functions, we have

\begin{array}{rcl} | | f_{λ} - f_{H} | |_{ρ} \leq R ϕ (λ) \sqrt{λ} . & (18) \end{array}

(ii) Under the assumption that ϕ(t) and t/ϕ(t) are non-decreasing functions, we have

\begin{array}{rcl} | | f_{λ} - f_{H} | |_{ρ} \leq R κ ϕ (λ) & (19) \end{array}

and

\begin{array}{rcl} | | f_{λ} - f_{H} | |_{H} \leq R ϕ (λ) . & (20) \end{array}

Under the source condition f_$H$ ∈ Ω_{ϕ, R}, the proposition can be proved using the ideas of Theorem 10 [4].

Theorem 3.1. Let z be i.i.d. samples drawn according to the probability measure $ρ \in P_{ϕ}$ where ϕ is the index function satisfying the conditions that ϕ(t), t/ϕ(t) are non-decreasing functions. Then for all 0 < η < 1, with confidence 1 − η, for the regularized estimator f_z,λ (7) the following upper bound holds:

\| f_{z, λ} - f_{H} \|_{H} \leq 2 \left\{ R ϕ (λ) + \frac{2 κ M}{m λ} + \sqrt{\frac{4 Σ^{2} N (λ)}{m λ}} \right\} \log \left( \frac{4}{η} \right)

provided that

\begin{array}{rcl} \sqrt{m} λ \geq 8 κ^{2} log (4 / η) . & (21) \end{array}

Proof. The error of regularized solution f_{z, λ} can be estimated in terms of the sample error and the approximation error as follows:

\begin{array}{rcl} | | f_{z, λ} - f_{H} | |_{H} \leq | | f_{z, λ} - f_{λ} | |_{H} + | | f_{λ} - f_{H} | |_{H} . & (22) \end{array}

Now f_{z, λ} − f_λ can be expressed as

f_{z, λ} - f_{λ} = {(S_{x}^{*} S_{x} + λ I)}^{- 1} \left\{ S_{x}^{*} y - S_{x}^{*} S_{x} f_{λ} - λ f_{λ} \right\} .

Then $f_{λ} = {(L_{K} + λ I)}^{- 1} L_{K} f_{H}$ implies

L_{K} f_{H} = L_{K} f_{λ} + λ f_{λ} .

Therefore,

f_{z, λ} - f_{λ} = {(S_{x}^{*} S_{x} + λ I)}^{- 1} \left\{ S_{x}^{*} y - S_{x}^{*} S_{x} f_{λ} - L_{K} (f_{H} - f_{λ}) \right\} .

Employing RKHS-norm we get,

\begin{array}{rcll} \| f_{z, λ} - f_{λ} \|_{H} & \leq & \left\| {(S_{x}^{*} S_{x} + λ I)}^{- 1} \left\{ S_{x}^{*} y - S_{x}^{*} S_{x} f_{H} + (S_{x}^{*} S_{x} - L_{K}) (f_{H} - f_{λ}) \right\} \right\|_{H} & \\ & \leq & I_{1} I_{2} + I_{3} \| f_{λ} - f_{H} \|_{H} / λ, & (23) \end{array}

where $I_{1} = | | {(S_{x}^{*} S_{x} + λ I)}^{- 1} {(L_{K} + λ I)}^{1 / 2} | |_{L (H)}$ , $I_{2} = | | {(L_{K} + λ I)}^{- 1 / 2} (S_{x}^{*} y - S_{x}^{*} S_{x} f_{H}) | |_{H}$ and $I_{3} = | | S_{x}^{*} S_{x} - L_{K} | |_{L (H)}$ .

The estimates of I₂, I₃ can be obtained from Proposition 3.2 and the only task is to bound I₁. For this we consider

\begin{array}{l} {(S_{x}^{*} S_{x} + λ I)}^{- 1} {(L_{K} + λ I)}^{1 / 2} = \left\{ I - {(L_{K} + λ I)}^{- 1} (L_{K} - S_{x}^{*} S_{x}) \right\}^{- 1} {(L_{K} + λ I)}^{- 1 / 2} \end{array}

which implies

\begin{array}{rcl} I_{1} \leq \sum_{n = 0}^{\infty} | | {(L_{K} + λ I)}^{- 1} (L_{K} - S_{x}^{*} S_{x}) | |_{L (H)}^{n} | | {(L_{K} + λ I)}^{- 1 / 2} | |_{L (H)} & (24) \end{array}

provided that $| | {(L_{K} + λ I)}^{- 1} (L_{K} - S_{x}^{*} S_{x}) | |_{L (H)} < 1$ . To verify this condition, we consider

| | {(L_{K} + λ I)}^{- 1} (S_{x}^{*} S_{x} - L_{K}) | |_{L (H)} \leq I_{3} / λ .

Now using Proposition 3.2 we get with confidence 1 − η/2,

| | {(L_{K} + λ I)}^{- 1} (S_{x}^{*} S_{x} - L_{K}) | |_{L (H)} \leq \frac{4 κ^{2}}{\sqrt{m} λ} log (\frac{4}{η}) .

From the condition (21) we get with confidence 1 − η/2,

\begin{array}{rcl} | | {(L_{K} + λ I)}^{- 1} (S_{x}^{*} S_{x} - L_{K}) | |_{L (H)} \leq \frac{1}{2} . & (25) \end{array}

Consequently, using (25) in the inequality (24) we obtain with probability 1 − η/2,

\begin{array}{l} I_{1} = | | {(S_{x}^{*} S_{x} + λ I)}^{- 1} {(L_{K} + λ I)}^{1 / 2} | |_{L (H)} \\ \leq 2 | | {(L_{K} + λ I)}^{- 1 / 2} | |_{L (H)} \leq \frac{2}{\sqrt{λ}} . & (26) \end{array}

From Proposition 3.2 we have with confidence 1 − η/2,

| | S_{x}^{*} S_{x} - L_{K} | |_{L (H)} \leq 2 (\frac{κ^{2}}{m} + \frac{κ^{2}}{\sqrt{m}}) log (\frac{4}{η}) .

Again from the condition (21) we get with probability 1 − η/2,

\begin{array}{rcl} I_{3} = | | S_{x}^{*} S_{x} - L_{K} | |_{L (H)} \leq \frac{λ}{2} . & (27) \end{array}

Therefore, the inequality (23) together with (16), (20), (26), (27) provides the desired bound.□
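The key step of the proof, the Neumann series estimate leading to (26), can be checked numerically on matrices. In the sketch below a diagonal matrix plays the role of L_K and a symmetric perturbation of operator norm λ/2 plays the role of $S_{x}^{*} S_{x} - L_{K}$, so that the analog of (25) holds by construction. The dimension, the spectrum n⁻², the value of λ and the function name are illustrative assumptions.

```python
import numpy as np

def sample_error_factors(L, S, lam):
    """Return (perturbation norm, I_1, 2 / sqrt(lam)) for symmetric matrices L and S."""
    d = L.shape[0]
    A = L + lam * np.eye(d)                              # L_K + lam I
    pert = np.linalg.norm(np.linalg.solve(A, L - S), 2)  # ||(L_K + lam I)^{-1}(L_K - S)||
    w, V = np.linalg.eigh(A)
    A_half = (V * np.sqrt(w)) @ V.T                      # (L_K + lam I)^{1/2}
    I1 = np.linalg.norm(np.linalg.solve(S + lam * np.eye(d), A_half), 2)
    return pert, I1, 2.0 / np.sqrt(lam)

rng = np.random.default_rng(1)
d, lam = 20, 0.05
L = np.diag(1.0 / np.arange(1, d + 1) ** 2)   # stand-in for L_K with eigenvalues n^{-2}
E = rng.standard_normal((d, d))
E = (E + E.T) / 2.0                           # symmetric perturbation
E *= (lam / 2.0) / np.linalg.norm(E, 2)       # rescale so that ||E|| = lam / 2
S = L + E                                     # stand-in for S_x^* S_x
pert, I1, bound = sample_error_factors(L, S, lam)
print(pert, I1, bound)
```

Whenever the perturbation norm is at most 1/2, the computed I₁ stays below 2/√λ, in line with (26).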

The following theorem discusses the error estimates in the $L^{2}$-norm. The proof is similar to that of the above theorem.

Theorem 3.2. Let z be i.i.d. samples drawn according to the probability measure $ρ \in P_{ϕ}$ and let f_{z, λ} be the regularized solution (7) corresponding to Tikhonov regularization. Then for all 0 < η < 1, with confidence 1 − η, the following upper bounds hold:

(i) Under the assumption that ϕ(t), $\sqrt{t} / ϕ (t)$ are non-decreasing functions,

\begin{array}{l} \| f_{z, λ} - f_{H} \|_{ρ} \leq 2 \left\{ R ϕ (λ) \sqrt{λ} + \frac{2 κ M}{m \sqrt{λ}} + \sqrt{\frac{4 Σ^{2} N (λ)}{m}} \right\} \log \left( \frac{4}{η} \right) \end{array}

(ii) Under the assumption that ϕ(t), t/ϕ(t) are non-decreasing functions,

\begin{array}{l} \| f_{z, λ} - f_{H} \|_{ρ} \leq \left\{ R (κ + \sqrt{λ}) ϕ (λ) + \frac{4 κ M}{m \sqrt{λ}} + \sqrt{\frac{16 Σ^{2} N (λ)}{m}} \right\} \log \left( \frac{4}{η} \right) \end{array}

provided that

\begin{array}{rcl} \sqrt{m} λ \geq 8 κ^{2} log (4 / η) . & (28) \end{array}

We derive the convergence rates of Tikhonov regularizer based on data-driven strategy of the parameter choice of λ for the class of probability measure $P_{ϕ, b}$ .

Theorem 3.3. Under the same assumptions of Theorem 3.2 and hypothesis (13), the convergence of the estimator f_{z, λ} (7) to the target function f_$H$ can be described as:

(i) If ϕ(t) and $\sqrt{t} / ϕ (t)$ are non-decreasing functions, then under the parameter choice λ ∈ (0, 1], λ = Ψ⁻¹(m^{−1/2}), where $Ψ (t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ (t)$ , we have

\begin{array}{l} \mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{ρ} \leq C {(Ψ^{- 1} (m^{- 1 / 2}))}^{1 / 2} ϕ (Ψ^{- 1} (m^{- 1 / 2})) \log \left( \frac{4}{η} \right) \right\} \geq 1 - η \end{array}

and

\begin{array}{l} \lim_{τ \to \infty} \limsup_{m \to \infty} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{ρ} > τ {(Ψ^{- 1} (m^{- 1 / 2}))}^{1 / 2} ϕ (Ψ^{- 1} (m^{- 1 / 2})) \right\} = 0, \end{array}

(ii) If ϕ(t) and t/ϕ(t) are non-decreasing functions, then under the parameter choice λ ∈ (0, 1], λ = Θ⁻¹(m^{−1/2}), where $Θ (t) = t^{\frac{1}{2 b}} ϕ (t)$ , we have

\mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{ρ} \leq C^{'} ϕ (Θ^{- 1} (m^{- 1 / 2})) \log \left( \frac{4}{η} \right) \right\} \geq 1 - η

and

\begin{array}{l} \lim_{τ \to \infty} \limsup_{m \to \infty} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{ρ} > τ ϕ (Θ^{- 1} (m^{- 1 / 2})) \right\} = 0. \end{array}
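For a general index function the inverses Ψ⁻¹ and Θ⁻¹ in the parameter choice usually have no closed form, but they can be computed numerically because Ψ and Θ are increasing on (0, 1] when ϕ is non-decreasing. The following is a minimal sketch using bisection on a logarithmic scale. The function names, the bracket [10⁻³⁰, 1] and the iteration count are assumptions for illustration, and the sketch presumes that the solution lies inside the bracket.

```python
import math

def invert_increasing(func, target, lo=1e-30, hi=1.0, iters=200):
    """Solve func(t) = target by bisection, assuming func is increasing on (0, 1]."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)   # geometric midpoint: lam may span many orders of magnitude
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

def parameter_choice(phi, b, m, rule="psi"):
    """Return lam = Psi^{-1}(m^{-1/2}) (rule="psi") or lam = Theta^{-1}(m^{-1/2}) (otherwise)."""
    if rule == "psi":
        func = lambda t: t ** (0.5 + 0.5 / b) * phi(t)   # Psi(t) = t^{1/2 + 1/(2b)} phi(t)
    else:
        func = lambda t: t ** (0.5 / b) * phi(t)         # Theta(t) = t^{1/(2b)} phi(t)
    return invert_increasing(func, m ** -0.5)

# Example with the Hoelder index function phi(t) = t^{1/2} and b = 2.
lam_psi = parameter_choice(lambda t: t ** 0.5, b=2.0, m=10 ** 4)
lam_theta = parameter_choice(lambda t: t ** 0.5, b=2.0, m=10 ** 4, rule="theta")
```

For ϕ(t) = t^r the result can be compared with the closed forms m^{−b/(2br+b+1)} and m^{−b/(2br+1)}, which gives a simple check of the routine.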

Proof. (i) Let $Ψ (t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ (t)$ . Then it follows,

lim_{t \to 0} \frac{Ψ (t)}{\sqrt{t}} = lim_{t \to 0} \frac{t^{2}}{Ψ^{- 1} (t)} = 0 .

Under the parameter choice λ = Ψ⁻¹(m^−1/2) we have,

lim_{m \to \infty} m λ = \infty .

Therefore, for sufficiently large m,

\frac{1}{m λ} = \frac{λ^{\frac{1}{2 b}} ϕ (λ)}{\sqrt{m λ}} \leq λ^{\frac{1}{2 b}} ϕ (λ) .

Since λ ≤ 1, it follows from Theorem 3.2 and Equation (14) that with confidence 1 − η,

\begin{array}{rcl} | | f_{z, λ} - f_{H} | |_{ρ} \leq C {(Ψ^{- 1} (m^{- 1 / 2}))}^{1 / 2} ϕ (Ψ^{- 1} (m^{- 1 / 2})) log (\frac{4}{η}), & (29) \end{array}

where $C = 2 R + 4 κ M + 4 \sqrt{β b Σ^{2} / (b - 1)}$ .

Now defining $τ : = C log (\frac{4}{η})$ gives

η = η_{τ} = 4 e^{- τ / C} .

The estimate (29) can be reexpressed as

\begin{array}{rcl} \mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{ρ} > τ {(Ψ^{- 1} (m^{- 1 / 2}))}^{1 / 2} ϕ (Ψ^{- 1} (m^{- 1 / 2})) \right\} \leq η_{τ} . & (30) \end{array}

(ii) Suppose $Θ (t) = t^{\frac{1}{2 b}} ϕ (t)$ . Then the condition (28) implies that

\sqrt{m λ} \geq \frac{8 κ^{2} log (4 / η)}{\sqrt{λ}} \geq \frac{8 κ^{2}}{\sqrt{λ}} .

Hence under the parameter choice λ ∈ (0, 1], λ = Θ⁻¹(m^−1/2) we have

\frac{1}{m \sqrt{λ}} \leq \frac{\sqrt{λ}}{8 κ^{2} \sqrt{m}} \leq \frac{λ^{\frac{1}{2} + \frac{1}{2 b}} ϕ (λ)}{8 κ^{2}} \leq \frac{ϕ (λ)}{8 κ^{2}} .

From Theorem 3.2 and Equation (14), it follows that with confidence 1 − η,

\begin{array}{rcl} | | f_{z, λ} - f_{H} | |_{ρ} \leq C^{'} ϕ (Θ^{- 1} (m^{- 1 / 2})) log (\frac{4}{η}), & (31) \end{array}

where $C^{'} : = R (κ + 1) + M / 2 κ + 4 \sqrt{β b Σ^{2} / (b - 1)} .$

Now defining $τ : = C^{'} log (\frac{4}{η})$ gives

η = η_{τ} = 4 e^{- τ / C^{'}} .

The estimate (31) can be reexpressed as

\begin{array}{rcl} \mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{ρ} > τ ϕ (Θ^{- 1} (m^{- 1 / 2})) \right\} \leq η_{τ} . & (32) \end{array}

Then from Equations (30) and (32) our conclusions follow. □

Theorem 3.4. Under the same assumptions of Theorem 3.1 and hypothesis (13) with the parameter choice λ ∈ (0, 1], λ = Ψ⁻¹(m^{−1/2}), where $Ψ (t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ (t)$ , the convergence of the estimator f_{z, λ} (7) to the target function f_$H$ can be described as

\mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{H} \leq C ϕ (Ψ^{- 1} (m^{- 1 / 2})) \log \left( \frac{4}{η} \right) \right\} \geq 1 - η

and

\begin{array}{l} \lim_{τ \to \infty} \limsup_{m \to \infty} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \| f_{z, λ} - f_{H} \|_{H} > τ ϕ (Ψ^{- 1} (m^{- 1 / 2})) \right\} = 0 . \end{array}

The proof of the theorem follows the same steps as that of Theorem 3.3 (i). We obtain the following corollary as a consequence of Theorems 3.3 and 3.4.

Corollary 3.1. Under the same assumptions of Theorems 3.3 and 3.4 for Tikhonov regularization with Hölder's source condition $f_{H} \in Ω_{ϕ, R}, ϕ (t) = t^{r}$ , for all 0 < η < 1, with confidence 1 − η, for the parameter choice $λ = m^{- \frac{b}{2 b r + b + 1}}$ , we have

\| f_{z, λ} - f_{H} \|_{H} \leq C m^{- \frac{b r}{2 b r + b + 1}} \log \left( \frac{4}{η} \right) \quad \text{for } 0 \leq r \leq 1,

\| f_{z, λ} - f_{H} \|_{ρ} \leq C m^{- \frac{2 b r + b}{4 b r + 2 b + 2}} \log \left( \frac{4}{η} \right) \quad \text{for } 0 \leq r \leq \frac{1}{2}

and for the parameter choice $λ = m^{- \frac{b}{2 b r + 1}}$ , we have

\| f_{z, λ} - f_{H} \|_{ρ} \leq C^{'} m^{- \frac{b r}{2 b r + 1}} \log \left( \frac{4}{η} \right) \quad \text{for } 0 \leq r \leq 1 .
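The exponents in Corollary 3.1 are easy to tabulate. The sketch below returns, for given b and r, the exponents a such that λ = m^{−a} or the error bound is of order m^{−a}, using exact rational arithmetic. The function name and the dictionary keys are hypothetical labels introduced only for this illustration.

```python
from fractions import Fraction

def holder_rates(b, r):
    """Exponents from Corollary 3.1 for phi(t) = t^r and eigenvalue decay index b > 1."""
    b, r = Fraction(b), Fraction(r)
    out = {}
    if 0 <= r <= 1:
        # Parameter choice lam = Psi^{-1}(m^{-1/2}) and the RKHS-norm rate.
        out["lambda_psi"] = b / (2 * b * r + b + 1)
        out["rate_H"] = b * r / (2 * b * r + b + 1)
        # Parameter choice lam = Theta^{-1}(m^{-1/2}) and the L^2-norm rate.
        out["lambda_theta"] = b / (2 * b * r + 1)
        out["rate_rho_theta"] = b * r / (2 * b * r + 1)
    if 0 <= r <= Fraction(1, 2):
        # L^2-norm rate under the first parameter choice.
        out["rate_rho_psi"] = (2 * b * r + b) / (4 * b * r + 2 * b + 2)
    return out

rates = holder_rates(2, Fraction(1, 2))
```

For b = 2 and r = 1/2, for instance, the first parameter choice is λ = m^{−2/5} with an RKHS-norm rate of m^{−1/5}.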

3.2. Upper Rates for General Regularization Schemes

Bauer et al. [4] discussed the error estimates for general regularization schemes under general source condition. Here we study the convergence issues for general regularization schemes under general source condition and the polynomial decay of the eigenvalues of the integral operator L_K. We define regularization in the learning theory framework similarly to the definition used for ill-posed inverse problems (see Section 3.1 [4]).

Definition 3.1. A family of functions $g_{λ} : [0, κ^{2}] \to ℝ$ , 0 < λ ≤ κ², is said to be the regularization if it satisfies the following conditions:

• $\exists D : sup_{σ \in (0, κ^{2}]} | σ g_{λ} (σ) | \leq D$ .

• $\exists B : sup_{σ \in (0, κ^{2}]} | g_{λ} (σ) | \leq \frac{B}{λ}$ .

• $\exists γ : sup_{σ \in (0, κ^{2}]} | 1 - g_{λ} (σ) σ | \leq γ$ .

• The maximal p satisfying the condition:

sup_{σ \in (0, κ^{2}]} | 1 - g_{λ} (σ) σ | σ^{p} \leq γ_{p} λ^{p}

is called the qualification of the regularization g_λ, where γ_p does not depend on λ.
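As a quick sanity check of Definition 3.1, the sketch below estimates the four suprema on a grid for the Tikhonov filter g_λ(σ) = 1/(σ + λ), which is the filter behind (7). The function names, the grid size and the normalization κ² = 1 are illustrative assumptions; a grid maximum only approximates the supremum.

```python
def tikhonov_filter(sigma, lam):
    """Tikhonov regularization: g_lam(sigma) = 1 / (sigma + lam)."""
    return 1.0 / (sigma + lam)

def regularization_constants(g, lam, kappa2=1.0, p=1.0, grid=2000):
    """Grid estimates of D, B, gamma and gamma_p from Definition 3.1 at a fixed lam."""
    sigmas = [kappa2 * (k + 1) / grid for k in range(grid)]   # points in (0, kappa^2]
    D = max(abs(s * g(s, lam)) for s in sigmas)
    B = lam * max(abs(g(s, lam)) for s in sigmas)
    gamma = max(abs(1.0 - g(s, lam) * s) for s in sigmas)
    gamma_p = max(abs(1.0 - g(s, lam) * s) * s ** p for s in sigmas) / lam ** p
    return D, B, gamma, gamma_p

for lam in (0.5, 0.1, 0.01):
    print(lam, regularization_constants(tikhonov_filter, lam, p=1.0))
    print(lam, regularization_constants(tikhonov_filter, lam, p=2.0))
```

With p = 1 all four constants stay below 1 for every λ, whereas with p = 2 the estimate of γ_p grows as λ decreases, which is consistent with the standard fact that Tikhonov regularization has qualification 1.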

The properties of general regularization are satisfied by a large class of learning algorithms, which comprises essentially all the linear regularization schemes. We refer to Section 2.2 [10] for a brief discussion of the regularization schemes. Here we consider the general regularized solution corresponding to the above regularization:

\begin{array}{l} f_{z, λ} = g_{λ} (S_{x}^{*} S_{x}) S_{x}^{*} y . & (33) \end{array}
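In the scalar case Y = ℝ, the estimator (33) can be computed from the eigendecomposition of the kernel matrix. Since $g_{λ} (S_{x}^{*} S_{x}) S_{x}^{*} = S_{x}^{*} g_{λ} (S_{x} S_{x}^{*})$ and $S_{x} S_{x}^{*} = \frac{1}{m} K$, one gets $f_{z, λ} = \sum_{i} c_{i} K_{x_{i}}$ with c = g_λ(K/m) y / m. The sketch below is one way to implement this. The Gaussian-type kernel, the data and the function names are assumptions, and the spectral cut-off filter is a standard textbook example that is not taken from the text.

```python
import numpy as np

def spectral_coefficients(K, y, lam, g):
    """Coefficients c of f = sum_i c_i K(x_i, .) for the filter g applied to K / m."""
    m = len(y)
    w, V = np.linalg.eigh(K / m)          # spectrum of the empirical operator S_x S_x^*
    w = np.clip(w, 0.0, None)             # remove tiny negative round-off
    return V @ (g(w, lam) * (V.T @ y)) / m

def tikhonov(sigma, lam):
    """g_lam(sigma) = 1 / (sigma + lam)."""
    return 1.0 / (sigma + lam)

def spectral_cutoff(sigma, lam):
    """g_lam(sigma) = 1 / sigma for sigma >= lam and 0 otherwise (assumed example filter)."""
    keep = sigma >= lam
    return np.where(keep, 1.0 / np.where(keep, sigma, 1.0), 0.0)

# Illustrative data and kernel matrix.
x = np.linspace(0.0, 1.0, 25)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)
y = np.cos(3.0 * x)
lam = 1e-2
c_tik = spectral_coefficients(K, y, lam, tikhonov)
c_cut = spectral_coefficients(K, y, lam, spectral_cutoff)
```

For the Tikhonov filter the coefficients coincide with the solution of (K + mλI)c = y, so the spectral form reduces to (7) in that case.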

Here we are discussing the connection between the qualification of the regularization and general source condition [17].

Definition 3.2. The qualification p covers the index function ϕ if the function $t \to \frac{t^{p}}{ϕ (t)}$ on t ∈ (0, κ²] is non-decreasing.

The following result is a restatement of Proposition 3 [17].

Proposition 3.4. Suppose ϕ is a non-decreasing index function and the qualification of the regularization g_λ covers ϕ. Then

sup_{σ \in (0, κ^{2}]} | 1 - g_{λ} (σ) σ | ϕ (σ) \leq c_{g} ϕ (λ), c_{g} = max (γ, γ_{p}) .

Generally, the index function ϕ is not stable with respect to perturbations of the integral operator L_K. In practice, we have access only to the perturbed empirical operator $S_{x}^{*} S_{x}$ , whereas the source condition is expressed in terms of L_K only. So we want to control the difference $ϕ (L_{K}) - ϕ (S_{x}^{*} S_{x})$ . In order to obtain the error estimates for general regularization, we further restrict the index functions to operator monotone functions, which are defined as follows.

Definition 3.3. A function ϕ₁:[0, d] → [0, ∞) is said to be operator monotone index function if ϕ₁(0) = 0 and for every non-negative pair of self-adjoint operators A, B such that ||A||, ||B|| ≤ d and A ≤ B we have ϕ₁(A) ≤ ϕ₁(B).

We consider the class of operator monotone index functions:

\begin{array}{l} F_{μ} = \left\{ ϕ_{1} : [0, κ^{2}] \to [0, \infty) \text{ operator monotone}, \ ϕ_{1} (0) = 0, \ ϕ_{1} (κ^{2}) \leq μ \right\} . \end{array}

For the above class of operator monotone functions from Theorem 1 [4], given ϕ₁ ∈ F_μ there exists c_ϕ₁ such that

| | ϕ_{1} (S_{x}^{*} S_{x}) - ϕ_{1} (L_{K}) | |_{L (H)} \leq c_{ϕ_{1}} ϕ_{1} (| | S_{x}^{*} S_{x} - L_{K} | |_{L (H)}) .
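A standard example of an operator monotone function is ϕ₁(t) = √t, for which the inequality above is known to hold with constant 1 for positive semi-definite operators. This example is not taken from the text; the sketch below checks it numerically on random matrices, where the matrix A stands in for L_K and B for $S_{x}^{*} S_{x}$. The dimensions, the random seed and the function name are illustrative assumptions.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(2)
G = rng.standard_normal((15, 15))
A = G @ G.T / 15.0                        # stand-in for L_K
H = rng.standard_normal((15, 15))
B = A + 0.05 * (H @ H.T) / 15.0           # nearby PSD matrix, stand-in for S_x^* S_x

lhs = np.linalg.norm(psd_sqrt(A) - psd_sqrt(B), 2)   # ||phi_1(B) - phi_1(A)||
rhs = np.sqrt(np.linalg.norm(A - B, 2))              # phi_1(||B - A||)
print(lhs, rhs, np.linalg.norm(A - B, 2))
```

Because √t is much larger than t for small t, the bound on the left-hand side decays more slowly than the distance between the two operators.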

Here we observe that the rate of convergence of $ϕ_{1} (S_{x}^{*} S_{x})$ to ϕ₁(L_K) is slower than the convergence rate of $S_{x}^{*} S_{x}$ to L_K. Therefore, we consider the following class of index functions:

\begin{array}{l} F = \left\{ ϕ = ϕ_{2} ϕ_{1} : ϕ_{1} \in F_{μ}, \ ϕ_{2} : [0, κ^{2}] \to [0, \infty) \text{ non-decreasing Lipschitz}, \ ϕ_{2} (0) = 0 \right\} . \end{array}

The splitting of ϕ = ϕ₂ϕ₁ is not unique. So we can take ϕ₂ as a Lipschitz function with Lipschitz constant 1. Now using Corollary 1.2.2 [27] we get

| | ϕ_{2} (S_{x}^{*} S_{x}) - ϕ_{2} (L_{K}) | |_{L_{2} (H)} \leq | | S_{x}^{*} S_{x} - L_{K} | |_{L_{2} (H)} .

The general source condition f_$H$ ∈ Ω_{ϕ, R} corresponding to the class of index functions $F$ covers a wide range of source conditions, such as Hölder's source condition ϕ(t) = t^r and the logarithmic source condition $ϕ (t) = t^{p} \log^{- ν} (\frac{1}{t})$ . Following the analysis of Bauer et al. [4], we develop the error estimates of general regularization for the class of index functions $F$ under suitable priors on the probability measure ρ.

Theorem 3.5. Let z be i.i.d. samples drawn according to the probability measure $ρ \in P_{ϕ}$ . Suppose f_z,λ is the regularized solution (33) corresponding to general regularization and the qualification of the regularization covers ϕ. Then for all 0 < η < 1, with confidence 1 − η, the following upper bound holds:

\|f_{z, λ} - f_{H}\|_{H} \leq \left\{ R c_{g} (1 + c_{ϕ_{1}}) ϕ(λ) + \frac{4 R μ γ κ^{2}}{\sqrt{m}} + \frac{2 \sqrt{2} ν_{1} κ M}{m λ} + \sqrt{\frac{8 ν_{1}^{2} Σ^{2} N(λ)}{m λ}} \right\} \log\left(\frac{4}{η}\right)

provided that

\sqrt{m} λ \geq 8 κ^{2} \log(4 / η). \qquad (34)
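To see how the four terms of this bound trade off, the right-hand side can be evaluated directly. The sketch below is only an illustration: the function name `theorem_3_5_bound`, the default constants and the sample inputs are hypothetical placeholders, and the effective dimension N(λ) has to be supplied by the caller.

```python
import math

def theorem_3_5_bound(m, lam, eta, phi, eff_dim, *, R=1.0, c_g=1.0, c_phi1=1.0,
                      mu=1.0, gamma=1.0, kappa=1.0, nu1=2.0, M=1.0, Sigma=1.0):
    """Right-hand side of the RKHS-norm bound of Theorem 3.5.

    phi     : index function, called as phi(lam)
    eff_dim : value of the effective dimension N(lam), supplied by the caller
    The default constants are placeholders, not values from the paper.
    """
    log_term = math.log(4.0 / eta)
    if math.sqrt(m) * lam < 8.0 * kappa**2 * log_term:
        raise ValueError("condition (34) fails: sqrt(m) * lam < 8 kappa^2 log(4/eta)")
    term_phi = R * c_g * (1.0 + c_phi1) * phi(lam)                 # R c_g (1 + c_phi1) phi(lam)
    term_sqrt_m = 4.0 * R * mu * gamma * kappa**2 / math.sqrt(m)   # 4 R mu gamma kappa^2 / sqrt(m)
    term_m_lam = 2.0 * math.sqrt(2.0) * nu1 * kappa * M / (m * lam)
    term_eff_dim = math.sqrt(8.0 * nu1**2 * Sigma**2 * eff_dim / (m * lam))
    return (term_phi + term_sqrt_m + term_m_lam + term_eff_dim) * log_term

bound = theorem_3_5_bound(m=10_000, lam=0.5, eta=0.1, phi=math.sqrt, eff_dim=3.0)
print(bound)
```

Raising an error when condition (34) fails is a design choice of this sketch: outside that regime the theorem makes no statement, so returning a number would be misleading.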

Proof. We consider the error expression for general regularized solution (33),

f_{z, λ} - f_{H} = g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H}) - r_{λ}(S_{x}^{*} S_{x}) f_{H}, \qquad (35)

where r_λ(σ) = 1 − g_λ(σ)σ.

Now the first term can be expressed as

\begin{array}{l} g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H}) = g_{λ}(S_{x}^{*} S_{x}) (S_{x}^{*} S_{x} + λ I)^{1/2} \\ \quad (S_{x}^{*} S_{x} + λ I)^{-1/2} (L_{K} + λ I)^{1/2} \\ \quad (L_{K} + λ I)^{-1/2} (S_{x}^{*} y - S_{x}^{*} S_{x} f_{H}). \end{array}

Taking the RKHS-norm we get,

\|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H})\|_{H} \leq I_{2} I_{5} \|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)}, \qquad (36)

where $I_{2} = \|(L_{K} + λ I)^{-1/2}(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H})\|_{H}$ and $I_{5} = \|(S_{x}^{*} S_{x} + λ I)^{-1/2}(L_{K} + λ I)^{1/2}\|_{L(H)}$.

The estimate of I₂ follows from the first estimate of Proposition 3.2. From the second estimate of Proposition 3.2 together with condition (34), we obtain with probability 1 − η/2,

\begin{array}{l} \|(L_{K} + λ I)^{-1/2}(L_{K} - S_{x}^{*} S_{x})(L_{K} + λ I)^{-1/2}\|_{L(H)} \\ \leq \frac{1}{λ} \|S_{x}^{*} S_{x} - L_{K}\|_{L(H)} \leq \frac{4 κ^{2}}{\sqrt{m} λ} \log\left(\frac{4}{η}\right) \leq \frac{1}{2}, \end{array}

which implies that with confidence 1 − η/2,

\begin{array}{l} I_{5} = \|(S_{x}^{*} S_{x} + λ I)^{-1/2}(L_{K} + λ I)^{1/2}\|_{L(H)} \\ = \|(L_{K} + λ I)^{1/2}(S_{x}^{*} S_{x} + λ I)^{-1}(L_{K} + λ I)^{1/2}\|_{L(H)}^{1/2} \\ = \|\{I - (L_{K} + λ I)^{-1/2}(L_{K} - S_{x}^{*} S_{x})(L_{K} + λ I)^{-1/2}\}^{-1}\|_{L(H)}^{1/2} \\ \leq \sqrt{2}. \qquad (37) \end{array}
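The estimate (37) only uses the fact that the two operators differ by at most λ/2 in operator norm. A minimal numerical sketch, assuming random symmetric matrices as toy stand-ins for L_K and $S_{x}^{*} S_{x}$ (the names `L_K`, `S`, `sym_power` and the values of `n` and `lam` are illustrative):

```python
import numpy as np

def sym_power(a, p):
    """a^p for a symmetric positive definite matrix a, via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.power(w, p)) @ v.T

rng = np.random.default_rng(1)
n, lam = 8, 0.2
g = rng.standard_normal((n, n))
L_K = g @ g.T / n + (lam / 2) * np.eye(n)   # toy stand-in for L_K, eigenvalues >= lam/2
e = rng.standard_normal((n, n))
e = (e + e.T) / 2
e *= (lam / 2) / np.linalg.norm(e, 2)       # enforce ||S - L_K|| = lam/2
S = L_K + e                                 # toy stand-in for S_x^* S_x (still PSD)

shift = lam * np.eye(n)
I5 = np.linalg.norm(sym_power(S + shift, -0.5) @ sym_power(L_K + shift, 0.5), 2)
print(I5, np.sqrt(2))
```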

From the properties of the regularization we have,

\begin{array}{l} \|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x})^{1/2}\|_{L(H)} \leq \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ) \sqrt{σ}| \\ \leq \left( \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ) σ| \ \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ)| \right)^{1/2} \leq \sqrt{\frac{B D}{λ}}. \qquad (38) \end{array}

Hence it follows,

\begin{array}{l} \|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)} \leq \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ)(σ + λ)^{1/2}| \\ \leq \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ) \sqrt{σ}| + \sqrt{λ} \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ)| \leq \frac{ν_{1}}{\sqrt{λ}}, \qquad (39) \end{array}

where $ν_{1} : = B + \sqrt{B D}$ .
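As a sanity check of (38) and (39), one can take the Tikhonov filter g_λ(σ) = (σ + λ)⁻¹ and evaluate the suprema on a grid. This is a sketch under stated assumptions: B and D are read here as the constants in sup|g_λ(σ)| ≤ B/λ and sup|g_λ(σ)σ| ≤ D, and they are estimated from the grid rather than taken from the paper.

```python
import numpy as np

lam, kappa = 0.05, 1.0
sigma = np.linspace(1e-8, kappa**2, 200_001)
g = 1.0 / (sigma + lam)            # Tikhonov filter g_lam(sigma) = (sigma + lam)^{-1}

B = lam * np.max(np.abs(g))        # grid estimate of B in sup |g| <= B / lam
D = np.max(np.abs(g * sigma))      # grid estimate of D in sup |g sigma| <= D
nu1 = B + np.sqrt(B * D)

lhs_38 = np.max(np.abs(g * np.sqrt(sigma)))         # left side of (38)
lhs_39 = np.max(np.abs(g * np.sqrt(sigma + lam)))   # left side of (39)
print(lhs_38, np.sqrt(B * D / lam))
print(lhs_39, nu1 / np.sqrt(lam))
```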

Therefore, using (16), (37) and (39) in Equation (36) we conclude that with probability 1 − η,

\|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H})\|_{H} \leq 2 \sqrt{2} ν_{1} \left\{ \frac{κ M}{m λ} + \sqrt{\frac{Σ^{2} N(λ)}{m λ}} \right\} \log\left(\frac{4}{η}\right). \qquad (40)

Now we consider the second term,

\begin{array}{l} r_{λ}(S_{x}^{*} S_{x}) f_{H} = r_{λ}(S_{x}^{*} S_{x}) ϕ(L_{K}) v = r_{λ}(S_{x}^{*} S_{x}) ϕ(S_{x}^{*} S_{x}) v \\ \quad + r_{λ}(S_{x}^{*} S_{x}) ϕ_{2}(S_{x}^{*} S_{x})(ϕ_{1}(L_{K}) - ϕ_{1}(S_{x}^{*} S_{x})) v \\ \quad + r_{λ}(S_{x}^{*} S_{x})(ϕ_{2}(L_{K}) - ϕ_{2}(S_{x}^{*} S_{x})) ϕ_{1}(L_{K}) v. \end{array}

Employing RKHS-norm we get

\begin{array}{l} \|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{H} \leq R c_{g} ϕ(λ) + R c_{g} c_{ϕ_{1}} ϕ_{2}(λ) ϕ_{1}(\|L_{K} - S_{x}^{*} S_{x}\|_{L(H)}) \\ \quad + R μ γ \|L_{K} - S_{x}^{*} S_{x}\|_{L_{2}(H)}. \end{array}

Here we used the fact that if the qualification of the regularization covers ϕ = ϕ₁ϕ₂, then the qualification also covers ϕ₁ and ϕ₂ both separately.

From Equations (17) and (34) we have with probability 1 − η/2,

\|S_{x}^{*} S_{x} - L_{K}\|_{L(H)} \leq \frac{4 κ^{2}}{\sqrt{m}} \log\left(\frac{4}{η}\right) \leq λ / 2. \qquad (41)

Therefore, with probability 1 − η/2,

\|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{H} \leq R c_{g} (1 + c_{ϕ_{1}}) ϕ(λ) + \frac{4 R μ γ κ^{2}}{\sqrt{m}} \log\left(\frac{4}{η}\right). \qquad (42)

Combining the bounds (40) and (42) we get the desired result.

□

Theorem 3.6. Let z be i.i.d. samples drawn according to the probability measure $ρ \in P_{ϕ}$ and let $f_{z, λ}$ be the regularized solution (33) corresponding to general regularization. Then for all 0 < η < 1, with confidence 1 − η, the following upper bounds hold:

(i) If the qualification of the regularization covers ϕ,

\begin{array}{l} \|f_{z, λ} - f_{H}\|_{ρ} \leq \left\{ R c_{g} (1 + c_{ϕ_{1}})(κ + \sqrt{λ}) ϕ(λ) + \frac{4 R μ γ κ^{2} (κ + \sqrt{λ})}{\sqrt{m}} \right. \\ \quad \left. + \frac{2 \sqrt{2} ν_{2} κ M}{m \sqrt{λ}} + \sqrt{\frac{8 ν_{2}^{2} Σ^{2} N(λ)}{m}} \right\} \log\left(\frac{4}{η}\right), \end{array}

(ii) If the qualification of the regularization covers $ϕ(t) \sqrt{t}$,

\begin{array}{l} \|f_{z, λ} - f_{H}\|_{ρ} \leq \left\{ 2 R c_{g} (1 + c_{ϕ_{1}}) ϕ(λ) \sqrt{λ} + \frac{4 R μ (γ + c_{g}) κ^{2} \sqrt{λ}}{\sqrt{m}} \right. \\ \quad \left. + \frac{2 \sqrt{2} ν_{2} κ M}{m \sqrt{λ}} + \sqrt{\frac{8 ν_{2}^{2} Σ^{2} N(λ)}{m}} \right\} \log\left(\frac{4}{η}\right) \end{array}

provided that

\sqrt{m} λ \geq 8 κ^{2} \log(4 / η). \qquad (43)

Proof. Here we establish the $L^{2}$-norm estimate for the error expression:

f_{z, λ} - f_{H} = g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H}) - r_{λ}(S_{x}^{*} S_{x}) f_{H}.

Applying the $L^{2}$-norm to the first term we get,

\|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H})\|_{ρ} \leq I_{2} I_{5} \|L_{K}^{1/2} g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)}, \qquad (44)

where $I_{2} = \|(L_{K} + λ I)^{-1/2}(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H})\|_{H}$ and $I_{5} = \|(S_{x}^{*} S_{x} + λ I)^{-1/2}(L_{K} + λ I)^{1/2}\|_{L(H)}$.

The estimates of I₂ and I₅ can be obtained from Proposition 3.2 and Theorem 3.5 respectively. Now we consider

\begin{array}{l} \|L_{K}^{1/2} g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)} \\ \leq \|L_{K}^{1/2} - (S_{x}^{*} S_{x})^{1/2}\|_{L(H)} \|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)} \\ \quad + \|(S_{x}^{*} S_{x})^{1/2} g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)}. \end{array}

Since $ϕ(t) = \sqrt{t}$ is an operator monotone function, from Equation (41) we get, with probability 1 − η/2,

\|L_{K}^{1/2} - (S_{x}^{*} S_{x})^{1/2}\|_{L(H)} \leq (\|L_{K} - S_{x}^{*} S_{x}\|_{L(H)})^{1/2} \leq \sqrt{λ}.

Then using the properties of the regularization and Equation (38) we conclude that with probability 1 − η/2,

\begin{array}{l} \|L_{K}^{1/2} g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x} + λ I)^{1/2}\|_{L(H)} \\ \leq \sqrt{λ} \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ)(σ + λ)^{1/2}| + \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ)(σ^{2} + λ σ)^{1/2}| \\ \leq \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ) σ| + λ \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ)| + 2 \sqrt{λ} \sup_{0 < σ \leq κ^{2}} |g_{λ}(σ) \sqrt{σ}| \\ \leq B + D + 2 \sqrt{B D} =: ν_{2}. \qquad (45) \end{array}

From Equation (44) together with Equations (16), (37), and (45) we obtain with probability 1 − η,

\|g_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} y - S_{x}^{*} S_{x} f_{H})\|_{ρ} \leq 2 \sqrt{2} ν_{2} \left\{ \frac{κ M}{m \sqrt{λ}} + \sqrt{\frac{Σ^{2} N(λ)}{m}} \right\} \log\left(\frac{4}{η}\right). \qquad (46)

The second term can be expressed as

\begin{array}{l} \|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{ρ} \leq \|L_{K}^{1/2} - (S_{x}^{*} S_{x})^{1/2}\|_{L(H)} \|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{H} \\ \quad + \|(S_{x}^{*} S_{x})^{1/2} r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{H} \\ \leq \|L_{K} - S_{x}^{*} S_{x}\|_{L(H)}^{1/2} \|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{H} \\ \quad + \|r_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x})^{1/2} ϕ(S_{x}^{*} S_{x}) v\|_{H} \\ \quad + \|r_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x})^{1/2} ϕ_{2}(S_{x}^{*} S_{x})(ϕ_{1}(S_{x}^{*} S_{x}) - ϕ_{1}(L_{K})) v\|_{H} \\ \quad + \|r_{λ}(S_{x}^{*} S_{x})(S_{x}^{*} S_{x})^{1/2}(ϕ_{2}(S_{x}^{*} S_{x}) - ϕ_{2}(L_{K})) ϕ_{1}(L_{K}) v\|_{H}. \end{array}

Here two cases arise:

Case 1. If the qualification of the regularization covers ϕ, then we get with confidence 1 − η/2,

\|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{ρ} \leq (κ + \sqrt{λ}) \left( R c_{g} (1 + c_{ϕ_{1}}) ϕ(λ) + R μ γ \|S_{x}^{*} S_{x} - L_{K}\|_{L_{2}(H)} \right).

Therefore, using Equation (17) we obtain with probability 1 − η/2,

\|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{ρ} \leq (κ + \sqrt{λ}) \left( R c_{g} (1 + c_{ϕ_{1}}) ϕ(λ) + \frac{4 R μ γ κ^{2}}{\sqrt{m}} \log\left(\frac{4}{η}\right) \right). \qquad (47)

Case 2. If the qualification of the regularization covers $ϕ(t) \sqrt{t}$, we get with probability 1 − η/2,

\|r_{λ}(S_{x}^{*} S_{x}) f_{H}\|_{ρ} \leq 2 R c_{g} (1 + c_{ϕ_{1}}) ϕ(λ) \sqrt{λ} + 4 R μ (γ + c_{g}) κ^{2} \sqrt{\frac{λ}{m}} \log\left(\frac{4}{η}\right). \qquad (48)

Combining the error estimates (46), (47) and (48) we get the desired results.

□

We discuss the convergence rates of the general regularizer based on a data-driven strategy for the choice of the parameter λ for the class of probability measures $P_{ϕ, b}$. The proofs of Theorems 3.7 and 3.8 are similar to that of Theorem 3.3.

Theorem 3.7. Under the same assumptions of Theorem 3.5 and hypothesis (13) with the parameter choice λ ∈ (0, 1], $λ = Ψ^{-1}(m^{-1/2})$ where $Ψ(t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ(t)$, the convergence of the estimator $f_{z, λ}$ (33) to the target function $f_{H}$ can be described as

\mathrm{Prob}_{z} \left\{ \|f_{z, λ} - f_{H}\|_{H} \leq \tilde{C} ϕ(Ψ^{-1}(m^{-1/2})) \log\left(\frac{4}{η}\right) \right\} \geq 1 - η,

where $\tilde{C} = R c_{g} (1 + c_{ϕ_{1}}) + 4 R μ γ κ^{2} + 2 \sqrt{2} ν_{1} κ M + \sqrt{8 β b ν_{1}^{2} Σ^{2} / (b - 1)}$ and

\lim_{τ \to \infty} \limsup_{m \to \infty} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z, λ} - f_{H}\|_{H} > τ ϕ(Ψ^{-1}(m^{-1/2})) \right\} = 0.
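For a general index function the parameter choice $λ = Ψ^{-1}(m^{-1/2})$ has no closed form, but since Ψ is increasing it can be computed by bisection. The following is a minimal sketch; the helper names `inverse_increasing` and `parameter_choice` and the sample values of m, b, r are hypothetical. In the Hölder case ϕ(t) = t^r the result can be compared with the closed form $λ = m^{-b/(2br+b+1)}$.

```python
import math

def inverse_increasing(func, target, lo=1e-300, hi=1.0, iters=200):
    """Solve func(t) = target on (lo, hi] by bisection; func must be increasing."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)   # geometric midpoint: t spans many orders of magnitude
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

def parameter_choice(m, b, phi):
    """lambda = Psi^{-1}(m^{-1/2}) with Psi(t) = t^(1/2 + 1/(2b)) * phi(t)."""
    psi = lambda t: t ** (0.5 + 0.5 / b) * phi(t)
    return inverse_increasing(psi, m ** -0.5)

m, b, r = 10_000, 2.0, 0.75
lam_numeric = parameter_choice(m, b, lambda t: t ** r)
lam_closed = m ** (-b / (2 * b * r + b + 1))   # closed form in the Hölder case
print(lam_numeric, lam_closed)
```

The same routine applies to Θ(t) = t^{1/(2b)}ϕ(t) by replacing the exponent in `psi`.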

Theorem 3.8. Under the same assumptions of Theorem 3.6 and hypothesis (13), the convergence of the estimator $f_{z, λ}$ (33) to the target function $f_{H}$ can be described as

(i) If the qualification of the regularization covers ϕ, then under the parameter choice λ ∈ (0, 1], $λ = Θ^{-1}(m^{-1/2})$ where $Θ(t) = t^{\frac{1}{2 b}} ϕ(t)$, we have

\mathrm{Prob}_{z} \left\{ \|f_{z, λ} - f_{H}\|_{ρ} \leq \tilde{C}_{1} ϕ(Θ^{-1}(m^{-1/2})) \log\left(\frac{4}{η}\right) \right\} \geq 1 - η

where $\tilde{C}_{1} = R c_{g} (1 + c_{ϕ_{1}})(κ + 1) + 4 R μ γ κ^{2} (κ + 1) + \frac{ν_{2} M}{2 \sqrt{2} κ} + \sqrt{8 β b ν_{2}^{2} Σ^{2} / (b - 1)}$ and

\lim_{τ \to \infty} \limsup_{m \to \infty} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z, λ} - f_{H}\|_{ρ} > τ ϕ(Θ^{-1}(m^{-1/2})) \right\} = 0,

(ii) If the qualification of the regularization covers $ϕ(t) \sqrt{t}$, then under the parameter choice λ ∈ (0, 1], $λ = Ψ^{-1}(m^{-1/2})$ where $Ψ(t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ(t)$, we have

\mathrm{Prob}_{z} \left\{ \|f_{z, λ} - f_{H}\|_{ρ} \leq \tilde{C}_{2} (Ψ^{-1}(m^{-1/2}))^{1/2} ϕ(Ψ^{-1}(m^{-1/2})) \log\left(\frac{4}{η}\right) \right\} \geq 1 - η

where ${\tilde{C}}_{2} = 2 R c_{g} (1 + c_{ϕ_{1}}) + 4 R μ (γ + c_{g}) κ^{2} + 2 \sqrt{2} ν_{2} κ M + \sqrt{8 β b ν_{2}^{2} Σ^{2} / (b - 1)}$ and

\lim_{τ \to \infty} \limsup_{m \to \infty} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z, λ} - f_{H}\|_{ρ} > τ (Ψ^{-1}(m^{-1/2}))^{1/2} ϕ(Ψ^{-1}(m^{-1/2})) \right\} = 0.

We obtain the following corollary as a consequence of Theorems 3.7 and 3.8.

Corollary 3.2. Under the same assumptions of Theorems 3.7 and 3.8 for general regularization of qualification p with Hölder's source condition $f_{H} \in Ω_{ϕ, R}, ϕ(t) = t^{r}$, for all 0 < η < 1, with confidence 1 − η, for the parameter choice $λ = m^{-\frac{b}{2 b r + b + 1}}$, we have

\|f_{z, λ} - f_{H}\|_{H} \leq \tilde{C} m^{-\frac{b r}{2 b r + b + 1}} \log\left(\frac{4}{η}\right) \quad \text{for } 0 \leq r \leq p,

\|f_{z, λ} - f_{H}\|_{ρ} \leq \tilde{C}_{2} m^{-\frac{2 b r + b}{4 b r + 2 b + 2}} \log\left(\frac{4}{η}\right) \quad \text{for } 0 \leq r \leq p - \frac{1}{2}

and for the parameter choice $λ = m^{-\frac{b}{2 b r + 1}}$, we have

\|f_{z, λ} - f_{H}\|_{ρ} \leq \tilde{C}_{1} m^{-\frac{b r}{2 b r + 1}} \log\left(\frac{4}{η}\right) \quad \text{for } 0 \leq r \leq p.
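The exponents in Corollary 3.2 are easy to tabulate for concrete smoothness parameters. The sketch below uses exact rational arithmetic; the function name `holder_rates`, the dictionary keys and the sample values b = 2, r = 1/2 are illustrative choices.

```python
from fractions import Fraction

def holder_rates(b, r):
    """Exponents e such that the quantity is of order m^{-e} in Corollary 3.2 (phi(t) = t^r)."""
    b, r = Fraction(b), Fraction(r)
    return {
        "lambda (choice 1)": b / (2 * b * r + b + 1),
        "H-norm": b * r / (2 * b * r + b + 1),
        "L2-norm (choice 1)": (2 * b * r + b) / (4 * b * r + 2 * b + 2),
        "lambda (choice 2)": b / (2 * b * r + 1),
        "L2-norm (choice 2)": b * r / (2 * b * r + 1),
    }

rates = holder_rates(b=2, r=Fraction(1, 2))
for name, exponent in rates.items():
    print(f"{name}: m^(-{exponent})")
```

For these sample values the first parameter choice gives the faster $L^{2}$ rate (exponent 2/5 against 1/3), at the price of the stronger qualification requirement r ≤ p − 1/2.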

Remark 3.1. It is important to observe from Corollaries 3.1 and 3.2 that, using the operator monotonicity of the index function, we achieve the same error estimates for general regularization as for Tikhonov regularization, up to a constant multiple.

Remark 3.2. (Related work) Corollary 3.1 provides the same order of convergence as Theorem 1 [12] for Tikhonov regularization under Hölder's source condition $f_{H} \in Ω_{ϕ, R}$ for $ϕ(t) = t^{r}$ $(\frac{1}{2} \leq r \leq 1)$ and the polynomial decay of the eigenvalues (13). Blanchard and Mücke [18] addressed the convergence rates of general regularization for the inverse statistical learning problem under Hölder's source condition with the assumption $f_{ρ} \in H$. In particular, the upper convergence rates discussed in Blanchard and Mücke [18] agree with Corollary 3.2 for the learning problem considered here, which is referred to as the direct learning problem in Blanchard and Mücke [18]. Using the fact $N(λ) \leq \frac{κ^{2}}{λ}$, from Theorems 3.5 and 3.6 we obtain estimates similar to Theorem 10 [4] for general regularization schemes without the polynomial decay condition of the eigenvalues (13).

Remark 3.3. For the real valued functions and multi-task algorithms (the output space Y ⊂ ℝ^m for some m ∈ ℕ) we can obtain the error estimates from our analysis without imposing any condition on the conditional probability measure (11) for the bounded output space Y.

Remark 3.4. We can address the convergence issues of the binary classification problem [28] using our error estimates, similarly to the discussion in Section 3.3 [4] and Section 5 [16].

3.3. Lower Rates for General Learning Algorithms

In this section, we discuss the estimates of minimum possible error over a subclass of the probability measures $P_{ϕ, b}$ parameterized by suitable functions $f \in H$. Throughout this section we assume that Y is finite-dimensional.

Let $\{v_{j}\}_{j = 1}^{d}$ be a basis of Y and f ∈ Ω_{ϕ, R}. Then we parameterize the probability measure based on the function f,

ρ_{f}(x, y) := \frac{1}{2 d L} \sum_{j = 1}^{d} \left( a_{j}(x) δ_{y + d L v_{j}} + b_{j}(x) δ_{y - d L v_{j}} \right) ν(x), \qquad (49)

where $a_{j}(x) = L - \langle f, K_{x} v_{j} \rangle_{H}$, $b_{j}(x) = L + \langle f, K_{x} v_{j} \rangle_{H}$, $L = 4 κ ϕ(κ^{2}) R$ and δ_ξ denotes the Dirac measure with unit mass at ξ. It is easy to observe that the marginal distribution of ρ_f over X is ν and the regression function for the probability measure ρ_f is f (see Proposition 4 [12]). In addition to this, for the conditional probability measure ρ_f(y|x) we have,

\begin{array}{l} \int_{Y} \left( e^{\|y - f(x)\|_{Y} / M} - \frac{\|y - f(x)\|_{Y}}{M} - 1 \right) d ρ_{f}(y | x) \\ \leq (d^{2} L^{2} - \|f(x)\|_{Y}^{2}) \sum_{i = 2}^{\infty} \frac{(d L + \|f(x)\|_{Y})^{i - 2}}{M^{i} i!} \leq \frac{Σ^{2}}{2 M^{2}} \end{array}

provided that

d L + L / 4 \leq M \quad \text{and} \quad \sqrt{2} d L \leq Σ.

We assume that the eigenvalues of the integral operator L_K follow the polynomial decay (13) for the marginal probability measure ν. Then we conclude that the probability measure ρ_f parameterized by f belongs to the class $P_{ϕ, b}$ .
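The claim that f is the regression function of ρ_f can be checked on a toy example. The following sketch makes several assumptions that are not in the text: Y = ℝ^d with the standard (orthonormal) basis, the Dirac mass $δ_{y + dLv_{j}}$ is read as sitting at y = −dLv_j, and the reproducing property is used to replace 〈f, K_x v_j〉 by the j-th coordinate of f(x). The function name `rho_f_conditional` and the sample vector are hypothetical.

```python
import numpy as np

def rho_f_conditional(f_x, L):
    """Atoms and weights of rho_f(. | x) from (49), with {v_j} the standard basis of R^d.

    Assumes the Dirac mass delta_{y + dLv_j} sits at y = -dL v_j (and
    delta_{y - dLv_j} at y = +dL v_j), and that |<f(x), v_j>| <= L.
    """
    f_x = np.asarray(f_x, dtype=float)
    d = f_x.size
    basis = np.eye(d)
    a = L - f_x                                   # a_j(x) = L - <f(x), v_j>
    b = L + f_x                                   # b_j(x) = L + <f(x), v_j>
    atoms = np.vstack([-d * L * basis, d * L * basis])
    weights = np.concatenate([a, b]) / (2 * d * L)
    return atoms, weights

f_x = np.array([0.3, -0.7, 0.1])
atoms, weights = rho_f_conditional(f_x, L=1.0)
total_mass = weights.sum()     # should be 1: rho_f(. | x) is a probability measure
mean = weights @ atoms         # conditional mean, should reproduce f(x)
print(total_mass, mean)
```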

Concepts from information theory, such as the Kullback-Leibler information and Fano inequalities (Lemma 3.3 [29]), are the main ingredients in the analysis of lower bounds. In the literature [12, 29], the closeness of probability measures is described through the Kullback-Leibler information: given two probability measures ρ₁ and ρ₂, it is defined as

K(ρ_{1}, ρ_{2}) := \int_{Z} \log(g(z)) \, d ρ_{1}(z),

where g is the density of ρ₁ with respect to ρ₂, that is, $ρ_{1} (E) = \int_{E} g (z) d ρ_{2} (z)$ for all measurable sets E.
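For discrete measures the density g is simply the ratio of the probability vectors, and the Kullback-Leibler information of m-fold product measures is m times that of the factors, a property used later in the lower bound argument. A minimal sketch, with hypothetical helper names and sample distributions:

```python
import math
from itertools import product

def kl_information(p, q):
    """K(p, q) = sum_z p(z) log(p(z)/q(z)) for discrete p, q with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product_measure(p, m):
    """Probabilities of the m-fold product measure, enumerated over all m-tuples."""
    return [math.prod(t) for t in product(p, repeat=m)]

p, q, m = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2], 3
k_single = kl_information(p, q)
k_product = kl_information(product_measure(p, m), product_measure(q, m))
print(k_single, k_product, m * k_single)
```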

Following the analysis of Caponnetto and De Vito [12] and DeVore et al. [29] we establish the lower rates of accuracy that can be attained by any learning algorithm.

To estimate the lower rates of learning algorithms, we generate N_ε functions belonging to Ω_ϕ,R for given ε > 0 such that (53) and (54) hold. Then we construct the probability measures $ρ_{i} \in P_{ϕ, b}$ from Equation (49), parameterized by these functions f_i (1 ≤ i ≤ N_ε). On applying Lemma 3.3 [29], we obtain the lower convergence rates using the Kullback-Leibler information.

Theorem 3.9. Let z be i.i.d. samples drawn according to the probability measure $ρ \in P_{ϕ, b}$ under the hypothesis dim(Y) = d < ∞. Then for any learning algorithm ($z \to f_{z} \in H$) there exists a probability measure $ρ_{*} \in P_{ϕ, b}$ with $f_{ρ_{*}} \in H$ such that for all 0 < ε < ε_o,

\mathrm{Prob}_{z} \left\{ \|f_{z} - f_{ρ_{*}}\|_{H} > ε / 2 \right\} \geq \min \left\{ \frac{1}{1 + e^{- ℓ_{ε} / 24}}, \ ϑ e^{\left( \frac{ℓ_{ε}}{48} - \frac{c m ε^{2}}{ℓ_{ε}^{b}} \right)} \right\}

where $ϑ = e^{-3/e}$, $c = \frac{64 β}{15 (b - 1) d L^{2}} (1 - \frac{1}{2^{b - 1}})$ and $ℓ_{ε} = \lfloor \frac{1}{2} (\frac{α}{ϕ^{-1}(ε / R)})^{1 / b} \rfloor$.

Proof. For given ε > 0, we define

g = \sum_{n = ℓ + 1}^{2 ℓ} \frac{ε σ^{n - ℓ} e_{n}}{\sqrt{ℓ} ϕ (t_{n})},

where σ = (σ¹, …, σ^ℓ) ∈ {−1, +1}^ℓ, the t_n are the eigenvalues of the integral operator L_K, and the e_n are the eigenvectors of the integral operator L_K and the orthonormal basis of the RKHS $H$. Under the decay condition on the eigenvalues $α \leq n^{b} t_{n}$, we get

\|g\|_{H}^{2} = \sum_{n = ℓ + 1}^{2 ℓ} \frac{ε^{2}}{ℓ ϕ^{2}(t_{n})} \leq \sum_{n = ℓ + 1}^{2 ℓ} \frac{ε^{2}}{ℓ ϕ^{2}(\frac{α}{n^{b}})} \leq \frac{ε^{2}}{ϕ^{2}(\frac{α}{2^{b} ℓ^{b}})}.

Hence $f = ϕ(L_{K}) g \in Ω_{ϕ, R}$ provided that $\|g\|_{H} \leq R$ or equivalently,

ℓ \leq \frac{1}{2} \left( \frac{α}{ϕ^{-1}(ε / R)} \right)^{1 / b}. \qquad (50)

For $ℓ = ℓ_{ε} = \lfloor \frac{1}{2} (\frac{α}{ϕ^{-1}(ε / R)})^{1 / b} \rfloor$, choose ε_o such that ℓ_{ε_o} > 16. Then according to Proposition 6 [12], for every positive ε < ε_o (ℓ_ε > ℓ_{ε_o}) there exist N_ε ∈ ℕ and $σ_{1}, \dots, σ_{N_{ε}} \in \{-1, +1\}^{ℓ_{ε}}$ such that

\sum_{n = 1}^{ℓ_{ε}} (σ_{i}^{n} - σ_{j}^{n})^{2} \geq ℓ_{ε}, \quad \text{for all } 1 \leq i, j \leq N_{ε}, \ i \neq j \qquad (51)

and

N_{ε} \geq e^{ℓ_{ε} / 24}. \qquad (52)

Now we suppose f_i = ϕ(L_K)g_i and for ε > 0,

g_{i} = \sum_{n = ℓ_{ε} + 1}^{2 ℓ_{ε}} \frac{ε σ_{i}^{n - ℓ_{ε}} e_{n}}{\sqrt{ℓ_{ε}} ϕ(t_{n})}, \quad \text{for } i = 1, \dots, N_{ε},

where $σ_{i} = (σ_{i}^{1}, \dots, σ_{i}^{ℓ_{ε}}) \in \{-1, +1\}^{ℓ_{ε}}$. Then from Equation (51) we get,

ε \leq \|f_{i} - f_{j}\|_{H}, \quad \text{for all } 1 \leq i, j \leq N_{ε}, \ i \neq j. \qquad (53)

For 1 ≤ i, j ≤ N_ε, we have

\begin{array}{l} \|f_{i} - f_{j}\|_{L_{ν}^{2}(X)}^{2} \leq \sum_{n = ℓ_{ε} + 1}^{2 ℓ_{ε}} \frac{β ε^{2} (σ_{i}^{n - ℓ_{ε}} - σ_{j}^{n - ℓ_{ε}})^{2}}{ℓ_{ε} n^{b}} \leq \sum_{n = ℓ_{ε} + 1}^{2 ℓ_{ε}} \frac{4 β ε^{2}}{ℓ_{ε} n^{b}} \\ \leq \frac{4 β ε^{2}}{ℓ_{ε}} \int_{ℓ_{ε}}^{2 ℓ_{ε}} \frac{1}{x^{b}} \, d x = c^{'} \frac{ε^{2}}{ℓ_{ε}^{b}}, \qquad (54) \end{array}

where $c^{'} = \frac{4 β}{(b - 1)} (1 - \frac{1}{2^{b - 1}})$ .
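The last two steps of (54) compare a finite sum with an integral and then evaluate the integral in closed form. A small sketch confirming both steps numerically; the function names and the sample values of ℓ and b are illustrative:

```python
def tail_sum(ell, b):
    """Sum of n^{-b} over n = ell + 1, ..., 2 * ell."""
    return sum(n ** -b for n in range(ell + 1, 2 * ell + 1))

def integral_bound(ell, b):
    """Closed form of the integral of x^{-b} over [ell, 2 * ell], for b > 1."""
    return (1 - 2 ** (1 - b)) / ((b - 1) * ell ** (b - 1))

checks = [(ell, b, tail_sum(ell, b), integral_bound(ell, b))
          for ell in (17, 100) for b in (1.5, 2, 3)]
for ell, b, s, i in checks:
    print(ell, b, s, i)
```

Multiplying `integral_bound(ell, b)` by 4βε²/ℓ gives exactly c′ε²/ℓ^b with the constant c′ stated above.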

We define the sets,

A_{i} = \left\{ z : \|f_{z} - f_{i}\|_{H} < \frac{ε}{2} \right\}, \quad \text{for } 1 \leq i \leq N_{ε}.

It is clear from Equation (53) that the A_i are disjoint sets. On applying Lemma 3.3 [29] with the probability measures $ρ_{f_{i}}^{m}, 1 \leq i \leq N_{ε}$, we obtain that either

p := \max_{1 \leq i \leq N_{ε}} ρ_{f_{i}}^{m}(A_{i}^{c}) \geq \frac{N_{ε}}{N_{ε} + 1} \qquad (55)

or

\min_{1 \leq j \leq N_{ε}} \frac{1}{N_{ε}} \sum_{i = 1, i \neq j}^{N_{ε}} K(ρ_{f_{i}}^{m}, ρ_{f_{j}}^{m}) \geq Ψ_{N_{ε}}(p), \qquad (56)

where $Ψ_{N_{ε}}(p) = \log(N_{ε}) + (1 - p) \log(\frac{1 - p}{p}) - p \log(\frac{N_{ε} - p}{p})$. Further,

\begin{array}{l} Ψ_{N_{ε}}(p) \geq (1 - p) \log(N_{ε}) + (1 - p) \log(1 - p) - \log(p) + 2 p \log(p) \\ \geq - \log(p) + \log(\sqrt{N_{ε}}) - 3 / e, \qquad (57) \end{array}

since the minimum value of x log(x) on [0, 1] is −1/e.

For the joint probability measures $ρ_{f_{i}}^{m}$, $ρ_{f_{j}}^{m}$ $(ρ_{f_{i}}, ρ_{f_{j}} \in P_{ϕ, b}, 1 \leq i, j \leq N_{ε})$, from Proposition 4 [12] and Equation (54) we get,

K(ρ_{f_{i}}^{m}, ρ_{f_{j}}^{m}) = m K(ρ_{f_{i}}, ρ_{f_{j}}) \leq \frac{16 m}{15 d L^{2}} \|f_{i} - f_{j}\|_{L_{ν}^{2}(X)}^{2} \leq \frac{c m ε^{2}}{ℓ_{ε}^{b}}, \qquad (58)

where $c = 16 c^{'} / (15 d L^{2})$.

Therefore, Equations (55), (56), together with Equations (57) and (58) imply

p := \max_{1 \leq i \leq N_{ε}} \left( \mathrm{Prob} \left\{ z : \|f_{z} - f_{i}\|_{H} > \frac{ε}{2} \right\} \right) \geq \min \left\{ \frac{N_{ε}}{N_{ε} + 1}, \ \sqrt{N_{ε}} e^{- \frac{3}{e} - \frac{c m ε^{2}}{ℓ_{ε}^{b}}} \right\}.

The result follows from Equation (52) for the probability measure ρ_* such that $p = ρ_{*}^{m}(A_{i}^{c})$. □

The lower estimates in the $L^{2}$-norm can be obtained similarly to the above theorem.

Theorem 3.10. Let z be i.i.d. samples drawn according to the probability measure $ρ \in P_{ϕ, b}$ under the hypothesis dim(Y) = d < ∞. Then for any learning algorithm ($z \to f_{z} \in H$) there exists a probability measure $ρ_{*} \in P_{ϕ, b}$ with $f_{ρ_{*}} \in H$ such that for all 0 < ε < ε_o,

\mathrm{Prob}_{z} \left\{ \|f_{z} - f_{ρ_{*}}\|_{L_{ν}^{2}(X)} > ε / 2 \right\} \geq \min \left\{ \frac{1}{1 + e^{- ℓ_{ε} / 24}}, \ ϑ e^{\left( \frac{ℓ_{ε}}{48} - \frac{64 m ε^{2}}{15 d L^{2}} \right)} \right\}

where $ϑ = e^{-3/e}$, $ℓ_{ε} = \lfloor (\frac{α}{ψ^{-1}(ε / R)})^{1 / b} \rfloor$ and $ψ(t) = \sqrt{t} ϕ(t)$.

Theorem 3.11. Under the same assumptions of Theorem 3.10 for $ψ(t) = t^{1/2} ϕ(t)$ and $Ψ(t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ(t)$, the estimator f_z corresponding to any learning algorithm converges to the regression function f_ρ with the following lower rate:

\lim_{τ \to 0} \liminf_{m \to \infty} \inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z}^{l} - f_{ρ}\|_{L_{ν}^{2}(X)} > τ ψ(Ψ^{-1}(m^{-1/2})) \right\} = 1,

where $A$ denotes the set of all learning algorithms $l : z \to f_{z}^{l}$ .

Proof. Under the condition $ℓ_{ε} = \lfloor (\frac{α}{ψ^{-1}(ε / R)})^{1 / b} \rfloor$ from Theorem 3.10 we get,

\mathrm{Prob}_{z} \left\{ \|f_{z} - f_{ρ_{*}}\|_{L_{ν}^{2}(X)} > \frac{ε}{2} \right\} \geq \min \left\{ \frac{1}{1 + e^{- ℓ_{ε} / 24}}, \ ϑ e^{- \frac{1}{48}} e^{\left\{ \frac{1}{48} \left( \frac{α}{ψ^{-1}(ε / R)} \right)^{1 / b} - \frac{64 m ε^{2}}{15 d L^{2}} \right\}} \right\}.

Choosing $ε_{m} = τ R ψ(Ψ^{-1}(m^{-1/2}))$, we obtain

\mathrm{Prob}_{z} \left\{ \|f_{z} - f_{ρ_{*}}\|_{L_{ν}^{2}(X)} > τ \frac{R}{2} ψ(Ψ^{-1}(m^{-1/2})) \right\} \geq \min \left\{ \frac{1}{1 + e^{- ℓ_{ε} / 24}}, \ ϑ e^{- \frac{1}{48}} e^{c (Ψ^{-1}(m^{-1/2}))^{-1/b}} \right\},

where $c = (\frac{α^{1 / b}}{48} - \frac{64 R^{2} τ^{2}}{15 d L^{2}}) > 0$ for $0 < τ < \min(\frac{\sqrt{5 d} L α^{\frac{1}{2 b}}}{32 R}, 1)$.

Now as m goes to ∞, ε → 0 and ℓ_ε → ∞. Therefore, for c > 0 we conclude that

\liminf_{m \to \infty} \inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z}^{l} - f_{ρ}\|_{L_{ν}^{2}(X)} > τ \frac{R}{2} ψ(Ψ^{-1}(m^{-1/2})) \right\} = 1.

□

Choosing $ε_{m} = τ R ϕ (Ψ^{- 1} (m^{- 1 / 2}))$ we get the following convergence rate from Theorem 3.9.

Theorem 3.12. Under the same assumptions of Theorem 3.10 for $Ψ (t) = t^{\frac{1}{2} + \frac{1}{2 b}} ϕ (t)$ , the estimator f_z corresponding to any learning algorithm converges to the regression function f_ρ with the following lower rate:

\lim_{τ \to 0} \liminf_{m \to \infty} \inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z}^{l} - f_{ρ}\|_{H} > τ ϕ(Ψ^{-1}(m^{-1/2})) \right\} = 1.

We obtain the following corollary as a consequence of Theorems 3.11 and 3.12.

Corollary 3.3. For any learning algorithm under Hölder's source condition $f_{ρ} \in Ω_{ϕ, R}, ϕ(t) = t^{r}$ and the polynomial decay condition (13) for b > 1, the lower convergence rates can be described as

\lim_{τ \to 0} \liminf_{m \to \infty} \inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z}^{l} - f_{ρ}\|_{L_{ν}^{2}(X)} > τ m^{-\frac{2 b r + b}{4 b r + 2 b + 2}} \right\} = 1

and

\lim_{τ \to 0} \liminf_{m \to \infty} \inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \mathrm{Prob}_{z} \left\{ \|f_{z}^{l} - f_{ρ}\|_{H} > τ m^{-\frac{b r}{2 b r + b + 1}} \right\} = 1.

If the minimax lower rate coincides with the upper convergence rate for λ = λ_m, then the choice of parameter is said to be optimal. For the parameter choice $λ = Ψ^{-1}(m^{-1/2})$, Theorem 3.3 and Theorem 3.8 share the upper convergence rate with the lower convergence rate of Theorem 3.11 in the $L^{2}$-norm. For the same parameter choice, Theorem 3.4 and Theorem 3.7 share the upper convergence rate with the lower convergence rate of Theorem 3.12 in the RKHS-norm. Therefore, the choice of the parameter is optimal.

It is important to observe that we get the same convergence rates for b = 1.

3.4. Individual Lower Rates

In this section, we discuss the individual minimax lower rates that describe the behavior of the error for the class of probability measure $P_{ϕ, b}$ as the sample size m grows.

Definition 3.4. A sequence of positive numbers a_n (n ∈ ℕ) is called the individual lower rate of convergence for the class of probability measures $P$, if

\inf_{l \in A} \sup_{ρ \in P} \limsup_{m \to \infty} \left( \frac{E_{z}(\|f_{z}^{l} - f_{H}\|^{2})}{a_{m}} \right) > 0,

where $A$ denotes the set of all learning algorithms $l : z \mapsto f_{z}^{l}$.

Theorem 3.13. Let z be i.i.d. samples drawn according to a probability measure $ρ \in P_{ϕ, b}$, where ϕ is the index function satisfying the conditions that $ϕ(t) / t^{r_{1}}$, $t^{r_{2}} / ϕ(t)$ are non-decreasing functions and dim(Y) = d < ∞. Then for every ε > 0, the following lower bound holds:

\inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \limsup_{m \to \infty} \left( \frac{E_{z}(\|f_{z}^{l} - f_{H}\|_{L_{ν}^{2}(X)}^{2})}{m^{-(b c_{2} + ε) / (b c_{1} + ε + 1)}} \right) > 0,

where c₁ = 2r₁ + 1 and c₂ = 2r₂ + 1.

We consider the class of probability measures such that the target function $f_{H}$ is parameterized by $s = (s_{n})_{n = 1}^{\infty} \in \{-1, +1\}^{\infty}$. Suppose for ε > 0,

g = \sum_{n = 1}^{\infty} s_{n} R \sqrt{\frac{ε}{ε + 1} \frac{α}{n^{b} t_{n}}} \left( \frac{ϕ(α / n^{b})}{ϕ(t_{n})} \right) n^{-(ε + 1) / 2} e_{n},

where $s = (s_{n})_{n = 1}^{\infty} \in \{-1, +1\}^{\infty}$, the t_n are the eigenvalues of the integral operator L_K, and the e_n are the eigenvectors of the integral operator L_K and the orthonormal basis of the RKHS $H$. Then the target function $f_{H} = ϕ(L_{K}) g$ satisfies the general source condition. We assume that the conditional probability measure ρ(y|x) follows the normal distribution centered at $f_{H}$ and the marginal probability measure ρ_X = ν. Now we can derive the individual lower rates over the considered class of probability measures from the ideas of the literature [12, 30].

Theorem 3.14. Let z be i.i.d. samples drawn according to a probability measure $ρ \in P_{ϕ, b}$, where ϕ is the index function satisfying the conditions that $ϕ(t) / t^{r_{1}}$, $t^{r_{2}} / ϕ(t)$ are non-decreasing functions and dim(Y) = d < ∞. Then for every ε > 0, the following lower bound holds:

\inf_{l \in A} \sup_{ρ \in P_{ϕ, b}} \limsup_{m \to \infty} \left( \frac{E_{z}(\|f_{z}^{l} - f_{H}\|_{H}^{2})}{m^{-(b c_{2} - b + ε) / (b c_{1} + ε + 1)}} \right) > 0.

4. Conclusion

In our analysis we derive the upper and lower convergence rates over a wide class of probability measures, considering the general source condition in the vector-valued setting. In particular, our minimax rates can be used for scalar-valued functions and multi-task learning problems. The lower convergence rates coincide with the upper convergence rates for the optimal parameter choice based on the smoothness parameters b, ϕ. We can also develop various parameter choice rules such as the balancing principle [31], the quasi-optimality principle [32] and the discrepancy principle [33] for the regularized solutions provided in our analysis.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors are grateful to the reviewers for their helpful comments and for pointing out a subtle error, which led to an improvement in the quality of the paper.

References

1. Cucker F, Smale S. On the mathematical foundations of learning. Bull Am Math Soc. (2002) 39:1–49. doi: 10.1090/S0273-0979-01-00923-5

2. Evgeniou T, Pontil M, Poggio T. Regularization networks and support vector machines. Adv Comput Math. (2000) 13:1–50. doi: 10.1023/A:1018946025316

3. Vapnik VN. Statistical Learning Theory. New York, NY: Wiley (1998).

4. Bauer F, Pereverzev S, Rosasco L. On regularization algorithms in learning theory. J Complex. (2007) 23:52–72. doi: 10.1016/j.jco.2006.07.001

5. Engl HW, Hanke M, Neubauer A. Regularization of Inverse Problems. Dordrecht: Kluwer Academic Publishers Group (1996).

6. Gerfo LL, Rosasco L, Odone F, De Vito E, Verri A. Spectral algorithms for supervised learning. Neural Comput. (2008) 20:1873–97. doi: 10.1162/neco.2008.05-07-517

7. Tikhonov AN, Arsenin VY. Solutions of Ill-Posed Problems. Washington, DC: W. H. Winston (1977).

8. Bousquet O, Boucheron S, Lugosi G. Introduction to statistical learning theory. In: Bousquet O, von Luxburg U, Ratsch G editors. Advanced Lectures on Machine Learning, Volume 3176 of Lecture Notes in Computer Science. Berlin; Heidelberg: Springer (2004). pp. 169–207.

9. Cucker F, Zhou DX. Learning Theory: An Approximation Theory Viewpoint. Cambridge, UK: Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press (2007).

10. Lu S, Pereverzev S. Regularization Theory for Ill-posed Problems: Selected Topics. Berlin: De Gruyter (2013).

11. Abhishake, Sivananthan S. Multi-penalty regularization in learning theory. J Complex. (2016) 36:141–65. doi: 10.1016/j.jco.2016.05.003

12. Caponnetto A, De Vito E. Optimal rates for the regularized least-squares algorithm. Found Comput Math. (2007) 7:331–68. doi: 10.1007/s10208-006-0196-8

13. Smale S, Zhou DX. Estimating the approximation error in learning theory. Anal Appl. (2003) 1:17–41. doi: 10.1142/S0219530503000089

14. Smale S, Zhou DX. Shannon sampling and function reconstruction from point values. Bull Am Math Soc. (2004) 41:279–306. doi: 10.1090/S0273-0979-04-01025-0

15. Smale S, Zhou DX. Shannon sampling II: connections to learning theory. Appl Comput Harmon Anal. (2005) 19:285–302. doi: 10.1016/j.acha.2005.03.001

16. Smale S, Zhou DX. Learning theory estimates via integral operators and their approximations. Constr Approx. (2007) 26:153–72. doi: 10.1007/s00365-006-0659-y

17. Mathé P, Pereverzev SV. Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Probl. (2003) 19:789–803. doi: 10.1088/0266-5611/19/3/319

18. Blanchard G, Mücke N. Optimal rates for regularization of statistical inverse learning problems. arXiv:1604.04054 (2016).

19. Mendelson S. On the performance of kernel classes. J Mach Learn Res. (2003) 4:759–71.

20. Zhang T. Effective dimension and generalization of kernel learning. In: Thrun S, Becker S, Obermayer K, editors. Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press (2003). pp. 454–61.

21. Akhiezer NI, Glazman IM. Theory of Linear Operators in Hilbert Space. Translated from the Russian and with a preface by Merlynd Nestell. New York, NY: Dover Publications Inc (1993).

22. Micchelli CA, Pontil M. On learning vector-valued functions. Neural Comput. (2005) 17:177–204. doi: 10.1162/0899766052530802

23. Aronszajn N. Theory of reproducing kernels. Trans Am Math Soc. (1950) 68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7

24. Reed M, Simon B. Functional Analysis, Vol. 1. San Diego, CA: Academic Press (1980).

25. De Vito E, Rosasco L, Caponnetto A, De Giovannini U, Odone F. Learning from examples as an inverse problem. J Mach Learn Res. (2005) 6:883–904.

26. Pinelis IF, Sakhanenko AI. Remarks on inequalities for the probabilities of large deviations. Theory Prob Appl. (1985) 30:127–31. doi: 10.1137/1130013

27. Peller VV. Multiple operator integrals in perturbation theory. Bull Math Sci. (2016) 6:15–88. doi: 10.1007/s13373-015-0073-y

28. Boucheron S, Bousquet O, Lugosi G. Theory of classification: a survey of some recent advances. ESAIM: Prob Stat. (2005) 9:323–75. doi: 10.1051/ps:2005018

29. DeVore R, Kerkyacharian G, Picard D, Temlyakov V. Approximation methods for supervised learning. Found Comput Math. (2006) 6:3–58. doi: 10.1007/s10208-004-0158-6

30. Györfi L, Kohler M, Krzyzak A, Walk H. A Distribution-Free Theory of Nonparametric Regression. New York, NY: Springer Series in Statistics, Springer-Verlag (2002).

31. De Vito E, Pereverzyev S, Rosasco L. Adaptive kernel methods using the balancing principle. Found Comput Math. (2010) 10:455–79. doi: 10.1007/s10208-010-9064-2

32. Bauer F, Reiss M. Regularization independent of the noise level: an analysis of quasi-optimality. Inverse Prob. (2008) 24:055009. doi: 10.1088/0266-5611/24/5/055009

33. Lu S, Pereverzev SV, Tautenhahn U. A model function method in regularized total least squares. Appl Anal. (2010) 89:1693–703. doi: 10.1080/00036811.2010.492502

Keywords: learning theory, general source condition, vector-valued RKHS, error estimate, optimal rates

Mathematics Subject Classification 2010: 68T05, 68Q32

Citation: Rastogi A and Sampath S (2017) Optimal Rates for the Regularized Learning Algorithms under General Source Condition. Front. Appl. Math. Stat. 3:3. doi: 10.3389/fams.2017.00003

Received: 02 November 2016; Accepted: 09 March 2017;
Published: 27 March 2017.

Edited by:

Yiming Ying, University at Albany, SUNY, USA

Reviewed by:

Xin Guo, The Hong Kong Polytechnic University, Hong Kong
Ernesto De Vito, University of Genoa, Italy

Copyright © 2017 Rastogi and Sampath. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Abhishake Rastogi, abhishekrastogi2012@gmail.com
