Corrigendum: Multi-task learning based on Stochastic Configuration Networks
- Collaborative Innovation Center of Statistical Data Engineering, Technology & Application, School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China
When the human brain learns multiple related or consecutive tasks, knowledge is shared and transferred among them, which makes fast and effective learning possible. This idea motivates multi-task learning. The key to multi-task learning is to find the correlations among tasks and to build a fast and effective model based on this relational information. This paper proposes a multi-task learning framework based on stochastic configuration networks. It organically combines the classical parameter-sharing idea of multi-task learning with the constrained stochastic configuration mechanism of stochastic configuration networks, and it provides an efficient multi-kernel function selection mechanism. The convergence of the proposed algorithm is proved theoretically. Experimental results on one simulation data set and four real-life data sets verify the effectiveness of the proposed algorithm.
1 Introduction
In supervised machine learning, we often encounter situations that require building models for several related tasks, such as locating cancer sites, identifying cancer types, and judging cancer stages, based on cancer image data. Generally, these tasks are undertaken separately, which we refer to as single-task supervised learning (STSL) in traditional machine learning (Ben-David and Schuller, 2003). These models do not consider the correlation among multiple tasks, so some common information in the model parameters or data features is lost. In particular, when the training sample size of a single task is insufficient, it is difficult for STSL to capture enough information, which results in poor generalization performance. Multi-task supervised learning (MTSL) provides a solution for such situations. It improves the performance of each task by setting up shared representations among related tasks (Baxter, 2000; Argyriou et al., 2007; Liu et al., 2017). In a sense, a very important reason why human beings can learn from a small number of samples is that they can make full use of their various senses to obtain enough information and synthesize the relevant parts. MTSL is one way to realize this idea.
Classical MTSL can be roughly divided into two categories: MTSL based on constraint sharing and MTSL based on parameter sharing. For the first category, Argyriou et al. (2008) proposed MTL-L21, a regularization strategy that adds a joint penalty on the coefficient matrices of all tasks to the cost function. However, this method performs poorly when the data features are collinear. To reduce the impact of this problem, Chen et al. (2012) added a quadratic regularization term on all tasks' coefficients to MTL-L21. In 2015, Duong et al. (2015) used the L2 distance to regularize the parameters of their multi-task neural networks, so that each task has similar but distinct model parameters. In 2017, Yang and Hospedales (2017) used the trace norm to implement Duong's model. In 2019, Oliveira et al. (2019) conceived a group LASSO with an asymmetric transference formulation for multi-task learning, seeking the best of both worlds in a framework that admits overlapping groups. Since all of these MTSL methods need to learn sparse features, their performance is not ideal when the data has few features. MTSL methods based on parameter sharing (Caruana, 1997; Jacob et al., 2009) are not affected by this problem. In 1997, Caruana (1997) proposed an MTSL method (MTL) based on backpropagation neural networks, which mirrors the correlations among tasks by sharing the input and hidden-layer neurons across different tasks. In 1998, Lecun et al. (1998) used convolutional neural networks, named LeNet-5, for document recognition on the basis of MTL; their results clearly demonstrated the advantages of training a recognizer at the word level rather than on presegmented, hand-labeled, isolated characters. In 2018, Ma et al. (2018) proposed the multi-gate mixture-of-experts (MMoE), which adapts the mixture-of-experts (MoE) structure to multi-task learning by sharing the expert submodels across all tasks while training a gating network to optimize each task. In 2021, Zhang et al. (2021) developed a programming framework, AutoMTL, which generates compact multi-task models given an arbitrary backbone convolutional neural network and a set of tasks. However, these methods have high computational complexity and poor learning performance when the training samples are insufficient.
To address the aforementioned problems, this paper proposes an MTSL method based on a constraint-sharing framework of stochastic configuration networks (SCNs) proposed by Wang and Li (2017a; 2017b). Instead of using the complex gradient descent method to solve for the weight parameters of hidden-layer nodes, as in general neural networks, SCNs use a supervisory mechanism to stochastically configure these parameters. This stochastic configuration mechanism greatly reduces the computational complexity. Inspired by this idea, we establish a multi-task supervised learning algorithm based on stochastic configuration radial basis networks (MTSL-SCRBN). The main contributions of this study are as follows.
1. We organically combine the constraint sharing of SCNs with the parameter sharing of MTSL. The shared parameters are stochastically configured under certain constraints, which keeps the computational complexity low. At the same time, to improve learning performance, radial basis functions (Powell, 1987; Broomhead and Lowe, 1988) with different scale parameters are used as the basis functions in place of the original sigmoid functions of SCNs.
2. Two types of hard-to-choose hyperparameters of the proposed model, the scale parameters and the centers of the radial basis functions, are stochastically configured during the learning process.
The rest of the paper is organized as follows. In Section 2, we briefly review MTL-L21, MTEN, MTL, and SCNs. Section 3 details the proposed MTSL-SCRBN algorithm and proves its convergence. Section 4 reports experimental results for these algorithms on one simulation data set and four real data sets. Section 5 concludes the paper.
2 Related Work
Firstly, we introduce some notations. Suppose that there are M supervised learning tasks. The samples of the m-th task are given by,
where
2.1 Multi-Task Learning Methods Based on Constraint Sharing
Inspired by group sparsity, Argyriou et al. (2008) proposed the MTL-L21 method to learn the correlation among multiple tasks under a regularization strategy. It can be described as the following optimization problem,
where
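As a concrete (though hedged) illustration of this regularization strategy, the following sketch evaluates an MTL-L21-style objective: each column of W holds one task's coefficients, and the L2,1 penalty sums the L2 norms of the rows, coupling the tasks at the feature level. Variable names and the exact loss scaling are assumptions, not the paper's notation.

```python
import numpy as np

def mtl_l21_objective(Xs, Ys, W, lam):
    """Xs, Ys: per-task input/target lists; W: (d, M) coefficient matrix; lam: penalty weight."""
    # Sum of squared errors over all tasks.
    loss = sum(np.sum((X @ W[:, m] - y) ** 2) for m, (X, y) in enumerate(zip(Xs, Ys)))
    # L2,1 norm: sum over features of the L2 norm of each row (feature shared across tasks).
    l21 = np.sum(np.linalg.norm(W, axis=1))
    return loss + lam * l21
```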
When the data features are collinear, MTL-L21 has unstable prediction performance. Chen et al. (2012) proposed the MTEN method by adding another quadratic regularization term on the objective function coefficients of all tasks to MTL-L21. It can be described as the following optimization problem,
where ρ ∈ [0, 1] represents the elastic net mixing parameter. For the input
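A similarly hedged sketch of the MTEN-style objective adds a quadratic term, mixed with the L2,1 part by ρ as in an elastic net; again, the names and scaling are illustrative rather than the paper's exact formulation.

```python
import numpy as np

def mten_objective(Xs, Ys, W, lam, rho):
    """rho in [0, 1] mixes the sparsity-inducing L2,1 term and the stabilising quadratic term."""
    loss = sum(np.sum((X @ W[:, m] - y) ** 2) for m, (X, y) in enumerate(zip(Xs, Ys)))
    l21 = np.sum(np.linalg.norm(W, axis=1))   # feature-level coupling across tasks
    quad = np.sum(W ** 2)                     # quadratic term that tempers collinearity effects
    return loss + lam * (rho * l21 + (1.0 - rho) * quad)
```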
In the case of insufficient data features and data size, the two algorithms MTL-L21 and MTEN cannot obtain enough information by learning sparse features, which leads to poor prediction performance.
2.2 Multi-Task Learning Methods Based on Parameter Sharing
Caruana (1997) implemented MTSL on backpropagation nets by sharing the input and hidden-layer neurons among different tasks. Essentially, this method uses the correlations among tasks to guide the choice of function space and thus obtains better internal weight parameters.
Figure 1 shows how traditional backpropagation nets handle four related tasks separately. This approach ignores the information shared among related tasks; especially when the data samples are insufficient, such models may suffer from problems like over-fitting.
Figure 2 shows the multi-task backpropagation net (MTL) conceived by Caruana. In MTL, each task shares input and hidden layer neurons.
Compared with the data form given in Eq. 1, the data form suitable for MTL is that different tasks have the same input,
and the output is,
MTL can be described as follows
where S is the number of hidden layer nodes,
From a mathematical point of view, the essence of the backpropagation net is the gradient descent algorithm. In single-task supervised learning, a backpropagation net may fall into a local optimum. In multi-task supervised learning, however, the local optima of different tasks lie in different positions, and the interaction among tasks can help the hidden layer escape from local optima (Caruana, 1997).
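To make the parameter-sharing idea concrete, the following is a minimal sketch (not Caruana's original code) of a network with one hidden layer shared by all tasks and one linear output unit per task, trained jointly by backpropagation on the summed squared errors; the toy targets, layer sizes and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, S, M = 70, 1, 20, 4                # samples, input dim, hidden nodes, tasks
X = rng.uniform(-1, 1, (N, d))
Y = np.hstack([np.sin((k + 1) * np.pi * X) for k in range(M)])   # toy related targets

W1 = rng.normal(0, 1, (d, S)); b1 = np.zeros(S)   # hidden layer shared by all tasks
W2 = rng.normal(0, 1, (S, M)); b2 = np.zeros(M)   # one output unit per task

lr = 0.05
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)             # shared representation
    P = H @ W2 + b2                      # predictions for all tasks at once
    E = P - Y                            # residuals of every task
    # Backpropagation: gradients of the mean of the per-task squared errors.
    gW2 = H.T @ E / N;  gb2 = E.mean(0)
    gH = (E @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ gH / N; gb1 = gH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```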
2.3 Stochastic Configuration Networks
Wang and Li (2017a; 2017b) proposed supervised stochastic configuration networks and implemented SCNs using three algorithms, SC-i (i = I, II, III). SC-i starts with a small network structure (Kwok and Yeung, 1997) and uses a supervisory mechanism to add hidden-layer neurons until the model meets a predetermined error criterion. Since SC-III performs best among the three algorithms, we next describe its implementation.
Suppose a SC-III with L − 1 hidden layer nodes has already been constructed, that is,
where
For training data set
if
The leading model
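A compact single-output reading of the SC-III construction is sketched below: candidate hidden nodes are drawn at random, accepted only when they satisfy a supervisory inequality on the current residual, and the output weights are then refit by least squares. The specific inequality, the relaxing sequence μ_L and all constants are assumptions made for illustration; see Wang and Li (2017a) for the exact conditions.

```python
import numpy as np

def sc3_regress(X, y, L_max=50, r=0.99, candidates=100, seed=0):
    """X: (N, d) inputs, y: (N,) targets; returns the hidden output matrix and output weights."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    H = np.empty((N, 0))                        # hidden-layer output matrix
    beta = np.zeros(0)                          # output weights
    e = y.copy()                                # current residual
    for L in range(1, L_max + 1):
        mu = (1 - r) / (L + 1)                  # assumed relaxing sequence, mu_L -> 0
        best_g, best_xi = None, -np.inf
        for _ in range(candidates):
            w = rng.uniform(-1, 1, X.shape[1]); b = rng.uniform(-1, 1)
            g = 1 / (1 + np.exp(-(X @ w + b)))  # candidate node output (sigmoid)
            # Assumed supervisory criterion: accept only if xi >= 0.
            xi = (e @ g) ** 2 / (g @ g) - (1 - r - mu) * (e @ e)
            if xi > best_xi:
                best_g, best_xi = g, xi
        if best_xi < 0:                         # no admissible candidate found
            break
        H = np.column_stack([H, best_g])
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # refit output weights
        e = y - H @ beta
        if np.linalg.norm(e) < 1e-3:            # predetermined error criterion (illustrative)
            break
    return H, beta
```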
3 Multi-Task Supervised Learning Based on Stochastic Configuration Radial Basis Networks
In this section, we introduce the proposed MTSL-SCRBN algorithm.
3.1 Model Introduction
In order to combine SCNs and MTL organically, we need to change the data form given in Eq. 1. In MTSL-SCRBN, we first require each task to have the same number of samples, namely, N1 =…= NM =: N. (If the numbers of samples differ among tasks, this requirement can be achieved by random sampling.) Then we merge the input data of the different tasks into new input data; that is, the i-th new input data is
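The merging step might be read as concatenating, sample by sample, the inputs of all tasks and stacking the corresponding outputs; the sketch below follows that reading, including random sub-sampling to equalize the task sample sizes, but it is only one plausible interpretation of the merged data form.

```python
import numpy as np

def merge_tasks(task_inputs, task_outputs, n_common, seed=0):
    """task_inputs[m]: (N_m, d_m) array; task_outputs[m]: (N_m,) array."""
    rng = np.random.default_rng(seed)
    Xs, Ys = [], []
    for Xm, ym in zip(task_inputs, task_outputs):
        idx = rng.choice(len(Xm), size=n_common, replace=False)   # equalize sample sizes
        Xs.append(Xm[idx]); Ys.append(ym[idx])
    X_merged = np.hstack(Xs)          # i-th row: inputs of all tasks joined together
    Y_merged = np.column_stack(Ys)    # i-th row: outputs of all M tasks
    return X_merged, Y_merged
```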
In order to obtain good learning performance, we use the following radial basis function kσ(x, x′) as the model's basis function,
where x is the input, x′ represents the center and σ is the scale parameter.
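For reference, a minimal implementation of a Gaussian radial basis function with center x′ and scale σ; the exact scaling convention in the exponent may differ from the paper's kσ.

```python
import numpy as np

def rbf(x, center, sigma):
    """Gaussian radial basis function of the distance between x and the center."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))
```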
Suppose f0 = [0, 0, …, 0] ∈ RM. For L = 2, 3, …, assume that an MTSL-SCRBN with L − 1 hidden-layer nodes has already been constructed as follows,
where
Denote
Similar to that in SCNs, we introduce a variable ξm,L in our multi-task learning case as follows,
Here
If
The leading model,
will have an improved residual error. Repeat the above steps to add hidden-layer nodes until the residual error meets the predetermined error criterion.
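The add-and-refit loop just described can be sketched as follows. This is a hedged reading of the text, not the authors' Algorithm 1: each candidate node is a Gaussian RBF whose center is a randomly chosen training input and whose scale is drawn from a candidate set, a node is accepted when the task-summed criterion is positive, and the output-weight matrix is then refit by least squares; all constants are illustrative.

```python
import numpy as np

def mtsl_scrbn_fit(X, Y, L_max=60, r=0.99, candidates=100, scales=(0.1, 0.5, 1.0), seed=0):
    """X: (N, d) merged inputs; Y: (N, M) outputs of the M tasks."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    H = np.empty((N, 0))                         # hidden-layer output matrix
    B = np.zeros((0, Y.shape[1]))                # output-weight matrix
    E = Y.copy()                                 # per-task residual matrix
    for L in range(1, L_max + 1):
        mu = (1 - r) / (L + 1)                   # assumed relaxing sequence, mu_L -> 0
        best_g, best_xi = None, -np.inf
        for _ in range(candidates):
            c = X[rng.integers(N)]               # stochastically configured center
            s = rng.choice(scales)               # stochastically configured scale
            g = np.exp(-np.sum((X - c) ** 2, axis=1) / (2 * s ** 2))
            # xi_{m,L} for every task, summed over tasks (the relaxed condition of Remark 1).
            xi = np.sum((E.T @ g) ** 2 / (g @ g) - (1 - r - mu) * np.sum(E ** 2, axis=0))
            if xi > best_xi:
                best_g, best_xi = g, xi
        if best_xi < 0:                          # no admissible candidate found
            break
        H = np.column_stack([H, best_g])
        B, *_ = np.linalg.lstsq(H, Y, rcond=None)   # least-squares output weights (step 19)
        E = Y - H @ B
        if np.linalg.norm(E) < 1e-3:             # predetermined error criterion (illustrative)
            break
    return H, B
```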
The implementation process of the proposed MTSL-SCRBN algorithm described above is summarized as follows.
Algorithm 1. The MTSL-SCRBN algorithm
Notice that in step 19, we calculate the parameter matrix based on the standard least squares method,
where
3.2 The Convergence Theorem of the MTSL-SCRBN Algorithm
We extend the method in (Wang and Li, 2017a) to the multi-task learning framework of this paper and prove the convergence of the proposed algorithm.
Theorem 1. Assume that there are some pk ∈ R+, satisfying
If the basis function
and the external weight parameter vector is evaluated by,
Then, we have limL→+∞‖Y − fL‖F = 0.
Proof of Theorem 1. Define intermediate values
and
Then, the following inequality holds,
Since limL→+∞μL = 0, and 0 < r < 1, we have
Remark 1. Unlike SC-III, we relax the condition on the configuration parameters in formula (∗). SC-III requires each task to meet the inequality condition, whereas MTSL-SCRBN only requires the sum over all tasks to satisfy it. The rationality of this condition is also verified by the experimental results in the next section.
4 Experiment Results
To show the effectiveness of the proposed algorithm, this section compares two classical STSL algorithms, SVM (Cortes and Vapnik, 1995) and SC-III (Wang and Li, 2017a), with seven MTSL algorithms: the proposed MTSL-SCRBN, MTL (Caruana, 1997), MTEN (Chen et al., 2012), DMTRL (Yang and Hospedales, 2017), MMoE (Ma et al., 2018), GAMTL (Oliveira et al., 2019) and AutoMTL (Zhang et al., 2021). All calculations are conducted using Python 3.6.5 on a computer with a 2.60 GHz CPU and 8 GB of RAM. The input features are scaled into [−1, 1] and the output remains unchanged. All results reported in this paper are averages over 20 independent trials, except for the SVM and MTEN algorithms, whose results are deterministic. The accuracy (ACC) and root mean square error (RMSE) are chosen as the classification and regression evaluation indicators, where
with
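For concreteness, minimal definitions of the two indicators as commonly understood (the paper's exact formulas are given in the equations above, which are not reproduced here):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error for regression tasks."""
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def acc(y_true, y_pred):
    """Classification accuracy: fraction of correctly predicted labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```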
For different data sets, some of the algorithms used in the following experiments stochastically configure their hyperparameters within specified ranges or determine them by cross-validation. Table 1 gives the specific selection range of each parameter.
4.1 Experimental Results and Analysis on Simulated Data
The simulation data set used in this paper is generated by the following five functions, which we refer to as five tasks,
Figure 3 depicts the five functions on [0, 1]. As we can see, when the independent variables of the five functions are the same, the function values follow similar trends. Therefore, learning the values of the five functions at the same independent variable can be regarded as multi-task learning. Here, we independently draw 100 one-dimensional inputs from the same uniform distribution, compute the corresponding function values of the five functions, and add white Gaussian noise with a standard deviation of 0.01 to form 100 five-dimensional outputs. In the following experiments, we randomly select 70% of the data as training data and 30% as test data. From the plots of the five functions, it can be seen that 70 training samples alone are not enough to achieve good single-task learning results. We verify this point by experimenting with single-task and multi-task algorithms.
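A hedged sketch of this data-generation protocol is given below. The five true functions are defined by the paper's equations and are not reproduced here, so the functions in the list are placeholders only; the input sampling, noise level and 70/30 split follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder functions only; the actual five task functions are given in the paper.
fs = [np.sin, np.cos, lambda t: t ** 2, np.sqrt, lambda t: np.exp(-t)]

x = rng.uniform(0.0, 1.0, 100)                    # 100 one-dimensional inputs
Y = np.column_stack([f(x) for f in fs])           # 100 five-dimensional outputs
Y += rng.normal(0.0, 0.01, Y.shape)               # white Gaussian noise, std 0.01

idx = rng.permutation(100)
train_idx, test_idx = idx[:70], idx[70:]          # 70% training / 30% test split
X_train, Y_train = x[train_idx, None], Y[train_idx]
X_test, Y_test = x[test_idx, None], Y[test_idx]
```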
First, we compare the learning performance of the proposed MTSL-SCRBN with that of two STSL algorithms, SVM and SC-III, on the five tasks. For MTSL-SCRBN, the five tasks are combined and learned together. The training and test RMSEs on the five tasks for these three methods are given in Table 2. Clearly, the proposed multi-task learning model produces better performance on each task than the two STSL models, which use only 70 samples to learn each task independently. Furthermore, we show the learning results of the three algorithms on Task 1 in Figure 4. It can be seen that the proposed MTSL-SCRBN maintains good learning performance where the data change dramatically.
Next, the comparison results of three MTSL algorithms, MTSL-SCRBN, MTL and MTEN, on the simulation data set are recorded in Table 3 and Figure 5. In Table 3, the values in parentheses are the standard deviations over the 20 trials. According to these results, the proposed MTSL-SCRBN has better approximation ability than MTEN and MTL.
4.2 Experimental Results and Analysis on Benchmark Datasets
This subsection further compares seven MTSL algorithms, MTL, MTEN, DMTRL, MMoE, GAMTL, AutoMTL and the proposed MTSL-SCRBN, on four benchmark datasets. According to the characteristics of the data sets and the algorithms, different algorithms are selected for comparative analysis on different data sets. The four benchmark datasets comprise three regression problems, on the stock portfolio performance data set, the bionic robot data set SARCOS and the School data set, and one classification problem, on the Mnist data set from Yann LeCun.1 The basic information of the four datasets is summarized in Table 4.
First, we compare the performance of MTL, MTEN, DMTRL, MMoE, GAMTL and MTSL-SCRBN on different sizes of the three regression data sets. For the three data sets, we randomly choose 15/30/3,500 samples outside the training set as the respective test sets. At the same time, we select 5 tasks, each with more than 230 samples, from the 139 tasks in the School data set. Table 5 shows the specific experimental results. As we can see, the performance of each algorithm tends to become better and more stable as the number of training samples increases. Furthermore, the proposed MTSL-SCRBN algorithm exhibits good performance even with a small number of training observations.
Then, to further verify the performance of MTSL-SCRBN in classification cases, we compare the results of MTSL-SCRBN, DMTRL, MMoE, GAMTL and AutoMTL on the Mnist data set. We randomly choose 50/100/150 samples from each task in the Mnist data set as the training set and 10,000 of the remaining samples as the test set. Considering that this is a high-dimensional small-sample problem, we first reduce the dimensionality of the data set and then use MTSL-SCRBN for training and prediction. There are many dimensionality reduction methods, such as Principal Component Analysis (PCA) (Pearson, 1901), Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the Sequential Markov Blanket Criterion (SMBC) (Pratama et al., 2017), the Auto Encoder (Hinton and Salakhutdinov, 2006) and so on. Here, using the performance of the dimensionality-reduced data in MTSL-SCRBN as the selection criterion, we choose the Auto Encoder to reduce the original data set to 30 dimensions. It can be seen from Table 6 that MTSL-SCRBN on the dimensionality-reduced data performs slightly better than the other multi-task deep learning algorithms, but as the sample size increases, the performance gap between the algorithms gradually narrows.
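A minimal stand-in for the dimensionality-reduction step is sketched below: a single-hidden-layer autoencoder trained by gradient descent to produce a 30-dimensional code. The autoencoder actually used (Hinton and Salakhutdinov, 2006) is deeper; the architecture, learning rate and epoch count here are assumptions.

```python
import numpy as np

def train_autoencoder(X, code_dim=30, lr=0.01, epochs=500, seed=0):
    """X: (n, d) data matrix; returns an encoder mapping data to code_dim dimensions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    We = rng.normal(0, 0.1, (d, code_dim)); be = np.zeros(code_dim)   # encoder weights
    Wd = rng.normal(0, 0.1, (code_dim, d)); bd = np.zeros(d)          # decoder weights
    for _ in range(epochs):
        H = np.tanh(X @ We + be)                 # 30-dimensional code
        R = H @ Wd + bd                          # reconstruction of the input
        E = (R - X) / n                          # reconstruction-error gradient scale
        gWd = H.T @ E; gbd = E.sum(0)
        gH = (E @ Wd.T) * (1 - H ** 2)
        gWe = X.T @ gH; gbe = gH.sum(0)
        We -= lr * gWe; be -= lr * gbe; Wd -= lr * gWd; bd -= lr * gbd
    return lambda Z: np.tanh(Z @ We + be)        # apply to training and test data alike
```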
4.3 Comparison Experiment Results for Different Activation Functions
The previous results show that the proposed MTSL-SCRBN algorithm is effective for multi-task learning with small samples. This subsection discusses the impact of different activation functions on algorithm performance. We select three other commonly used activation functions: the sigmoid, Tanh and ReLU functions. After replacing the radial basis functions in MTSL-SCRBN with these three functions, the resulting models are called MTSL-SCSGM, MTSL-SCTANH and MTSL-SCReLU, respectively. We conduct comparative experiments on the Stock and SARCOS data sets.
For different data sets, the parameters of each algorithm are either randomly set within a certain range or determined by cross-validation. The specific selection range of each parameter is given in Table 7.
Table 8 reports the RMSE results of the stochastic configuration multi-task learning models based on the four activation functions. For training sets of different sizes on the two data sets, MTSL-SCRBN, which is based on radial basis functions, has certain performance advantages over the other three models.
5 Conclusion
In this paper, we propose a multi-task supervised learning framework based on stochastic configuration radial basis networks. It can be used effectively in classification and regression problems when a single task has a small number of samples. The experimental results on the four data sets show that the proposed MTSL-SCRBN achieves good performance compared with several existing methods.
Interesting directions for future work include applying the proposed algorithm to hyperspectral remote sensing image classification and other related areas, further investigating the impact of different activation functions in the network, and exploring the range of sample sizes for which the multi-task learning method is beneficial.
Data Availability Statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.
Author Contributions
X-MD: Conceptualization, Methodology, Formal analysis, Supervision, Project administration. XK: Writing-original draft, Software, Investigation. XZ: Validation, Writing—review and editing.
Funding
This work was supported in part by the Characteristic and Preponderant Discipline of Key Construction Universities in Zhejiang Province (Zhejiang Gongshang University- Statistics) and the special fund for basic business expenses of colleges and universities in Zhejiang Province.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1These four datasets can be obtained from http://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance, http://gaussianprocess.org/gpml/data, http://bristol.ac.uk/cmm/learning/support/datasets and http://yann.lecun.com/exdb/mnist/, respectively.
References
Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex Multi-Task Feature Learning. Mach. Learn 73, 243–272. doi:10.1007/s10994-007-5040-8
Argyriou, A., Evgeniou, T., and Pontil, M. (2007). “Multi-task Feature Learning,” in Advances in Neural Information Processing Systems, 41–48. doi:10.7551/mitpress/7503.003.0010
Ben-David, S., and Schuller, R. (2003). “Exploiting Task Relatedness for Multiple Task Learning,” in Learning Theory and Kernel Machines (Springer), 567–580. doi:10.1007/978-3-540-45167-9_41
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. J. Mach. Learn. Res. 3, 993–1022. doi:10.1162/jmlr.2003.3.4-5.993
Broomhead, D., and Lowe, D. (1988). Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks. Malvern, United Kingdom: Royal Signals and Radar Establishment Malvern United Kingdom. Tech. rep.
Chen, X., He, J., Lawrence, R., and Carbonell, J. (2012). “Adaptive Multi-Task Sparse Learning with an Application to Fmri Study,” in Proceedings of the 2012 SIAM International Conference on Data Mining, 212–223. doi:10.1137/1.9781611972825.19
Cortes, C., and Vapnik, V. (1995). Support-vector Networks. Mach. Learn 20, 273–297. doi:10.1007/BF00994018
Duong, L., Cohn, T., Bird, S., and Cook, P. (2015). “Low Resource Dependency Parsing: Cross-Lingual Parameter Sharing in a Neural Network Parser,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2, 845–850. Short Papers. doi:10.3115/v1/p15-2139
Hinton, G. E., and Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313, 504–507. doi:10.1126/science.1127647
Jacob, L., Vert, J., and Bach, F. (2009). “Clustered Multi-Task Learning: A Convex Formulation,” in Advances in Neural Information Processing Systems, 745–752.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based Learning Applied to Document Recognition. Proc. IEEE 86, 2278–2324. doi:10.1109/5.726791
Liu, T., Tao, D., Song, M., and Maybank, S. J. (2017). Algorithm-dependent Generalization Bounds for Multi-Task Learning. IEEE Trans. Pattern Anal. Mach. Intell. 39, 227–241. doi:10.1109/TPAMI.2016.2544314
Ma, J., Zhao, Z., Yi, X., Chen, J., Hong, L., and Chi, E. (2018). “Modeling Task Relationships in Multi-Task Learning with Multi-Gate Mixture-Of-Experts,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (London, United Kingdom: Association for Computing Machinery), 1930–1939. doi:10.1145/3219819.3220007
Oliveira, S. H. G., Gonçalves, A. R., and Von Zuben, F. J. (2019). “Group Lasso with Asymmetric Structure Estimation for Multi-Task Learning,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19 (Macao, China: International Joint Conferences on Artificial Intelligence Organization), 3202–3208. doi:10.24963/ijcai.2019/444
Pearson, K. (1901). LIII. On Lines and Planes of Closest Fit to Systems of Points in Space. Lond. Edinb. Dublin Philosophical Mag. J. Sci. 2, 559–572. doi:10.1080/14786440109462720
Powell, M. (1987). Radial Basis Functions for Multivariable Interpolation: A Review. Oxford, United Kingdom: Algorithms for approximation, 143–167.
Pratama, M., Lu, J., Lughofer, E., Zhang, G., and Er, M. J. (2017). An Incremental Learning of Concept Drifts Using Evolving Type-2 Recurrent Fuzzy Neural Networks. IEEE Trans. Fuzzy Syst. 25, 1175–1192. doi:10.1109/tfuzz.2016.2599855
Kwok, T.-Y., and Yeung, D.-Y. (1997). Objective Functions for Training New Hidden Units in Constructive Neural Networks. IEEE Trans. Neural Netw. 8, 1131–1148. doi:10.1109/72.623214
Wang, D., and Li, M. (2017). Robust Stochastic Configuration Networks with Kernel Density Estimation for Uncertain Data Regression. Inf. Sci. 412-413, 210–222. doi:10.1016/j.ins.2017.05.047
Wang, D., and Li, M. (2017). Stochastic Configuration Networks: Fundamentals and Algorithms. IEEE Trans. Cybern. 47, 3466–3479. doi:10.1109/tcyb.2017.2734043
Yang, Y., and Hospedales, T. (2017). "Deep Multi-Task Representation Learning: A Tensor Factorisation Approach," in International Conference on Learning Representations.
Keywords: multi-task learning, neural networks, stochastic configuration, knowledge sharing and transfer, supervised mechanism
Citation: Dong X-M, Kong X and Zhang X (2022) Multi-Task Learning Based on Stochastic Configuration Networks. Front. Bioeng. Biotechnol. 10:890132. doi: 10.3389/fbioe.2022.890132
Received: 05 March 2022; Accepted: 16 June 2022;
Published: 04 August 2022.
Edited by:
Gongfa Li, Wuhan University of Science and Technology, China
Reviewed by:
Inci M Baytas, Boğaziçi University, Turkey
Tiantian He, Agency for Science, Technology and Research (A∗STAR), Singapore
Copyright © 2022 Dong, Kong and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Xudong Kong, simon_kong@163.com