ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 06 May 2020

Sec. Mathematics of Computation and Data Science

Volume 6 - 2020 | https://doi.org/10.3389/fams.2020.00013

Convergence of a Relaxed Variable Splitting Coarse Gradient Descent Method for Learning Sparse Weight Binarized Activation Neural Network

  • Department of Mathematics, University of California, Irvine, Irvine, CA, United States

Abstract

Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. Binarized activation offers an additional computational saving for inference. Due to the vanishing gradient issue in training networks with binarized activation, the coarse gradient (a.k.a. straight-through estimator) is adopted in practice. In this paper, we study the problem of coarse gradient descent (CGD) learning of a one-hidden-layer convolutional neural network (CNN) with binarized activation function and sparse weights. It is known that when the input data is Gaussian distributed, a no-overlap one-hidden-layer CNN with ReLU activation and general weight can be learned by GD in polynomial time with high probability in regression problems with ground truth. We propose a relaxed variable splitting method integrating thresholding and coarse gradient descent. The sparsity in network weight is realized through thresholding during the CGD training process. We prove that under thresholding of ℓ1, ℓ0, and transformed-ℓ1 penalties, a no-overlap binary activation CNN can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel sparsifying operation. We find explicit error estimates of the sparse weights from the true weights.

1. Introduction

Deep neural networks (DNN) have achieved state-of-the-art performance on many machine learning tasks such as speech recognition [1], computer vision [2], and natural language processing [3]. Training such networks is a problem of minimizing a high-dimensional non-convex and non-smooth objective function, and is often solved by first-order methods such as stochastic gradient descent (SGD). Nevertheless, the success of neural network training remains to be understood from a theoretical perspective. Progress has been made in simplified model problems. Blum and Rivest [4] showed that even training a three-node neural network is NP-hard, and Shamir [5] showed learning a simple one-layer fully connected neural network is hard for some specific input distributions. Recently, several works [6, 7] focused on the geometric properties of loss functions, which is made possible by assuming that the input data distribution is Gaussian. They showed that SGD with random or zero initialization is able to train a no-overlap neural network in polynomial time.

Another prominent issue is that DNNs contain millions of parameters and many redundancies, potentially causing over-fitting and poor generalization [8], besides spending unnecessary computational resources. One way to reduce complexity is to sparsify the network weights using an empirical technique called pruning [9] so that the non-essential weights are zeroed out with minimal loss of performance [10–12]. Recently, a surrogate ℓ0 regularization approach based on a continuous relaxation of Bernoulli random variables in the distribution sense was introduced with encouraging results on small image data sets [13]. This motivated our work here to study deterministic regularization of ℓ0 via its Moreau envelope and related ℓ1 penalties in a one-hidden-layer convolutional neural network model [7]. Moreover, we consider binarized activation, which further reduces computational costs [14].

The architecture of the network is illustrated in Figure 1, similar to Brutzkus and Globerson [7]. We consider the convolutional setting in which a sparse filter w ∈ ℝd is shared among different hidden nodes. The input sample is Z ∈ ℝk×d. Note that this is identical to the one-layer non-overlapping case where the input is x ∈ ℝk×d with k non-overlapping patches, each of size d. We also assume that the entries of Z are i.i.d. Gaussian random variables with zero mean and unit variance; let 𝒩(0, 1) denote this distribution. Finally, let σ denote the binarized ReLU activation function, σ(z) := χ{z>0}, which equals 1 if z > 0, and 0 otherwise. The output of the network in Figure 1 is given by:
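As a sketch in code, the forward map of this network can be written as follows, assuming the output sums the binarized responses of the k patches (the function names are ours; an averaging output layer would only rescale the loss):

```python
import numpy as np

def binarized_relu(z):
    # sigma(z) = 1 if z > 0, else 0, applied elementwise
    return (z > 0).astype(float)

def network_output(w, Z):
    """One-hidden-layer non-overlap CNN with binarized activation.

    Z has shape (k, d): k non-overlapping patches of size d, all
    sharing the filter w.  Assumed form: the k binarized responses
    are summed at the output layer.
    """
    return binarized_relu(Z @ w).sum()
```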

We address the realizable case, where the response training data is mapped from the input training data Z by Equation (1) with a ground truth unit weight vector w*. The input training data is generated by sampling m training points Z1, .., Zm from a Gaussian distribution. The learning problem seeks w to minimize the empirical risk function:

Due to the binarized activation, the gradient of l in w is almost everywhere zero, hence ineffective for descent. Instead, an approximate gradient on the coarse scale, the so-called coarse gradient, is adopted as a proxy and is proved to drive the iterations to a global minimum [14].

Figure 1

In the limit m ↑ ∞, the empirical risk l converges to the population risk:

which is more regular in w than l. However, the “true gradient” ∇wf is inaccessible in practice. On the other hand, the coarse gradient in the limit m ↑ ∞ forms an acute angle with the true gradient [14]. Hence the expected coarse gradient descent (CGD) essentially minimizes the population risk f as desired.

Our task is to sparsify w in CGD. We note that iterative thresholding (IT) algorithms are commonly used for retrieving sparse signals [15–19, and references therein]. In the high-dimensional setting, IT algorithms provide simplicity and low computational cost, while also promoting sparsity of the target vector. We shall investigate the convergence of CGD with simultaneous thresholding for the following objective function

where f(w) is the population loss function of the network, and P is the ℓ0, ℓ1, or transformed-ℓ1 (Tℓ1) function: a one-parameter family of bilinear transformations composed with the absolute value function [20, 21]. When acting on vectors, the Tℓ1 penalty interpolates ℓ0 and ℓ1, with thresholding in closed analytical form for any parameter value [19]. The ℓ1 thresholding function is known as soft-thresholding [15, 22], and that of ℓ0 as hard-thresholding [17, 18]. The thresholding step must be properly integrated with CGD to be applicable to learning CNNs. As pointed out in Louizos et al. [13], it is beneficial to attain sparsity during the optimization (training) process.
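The soft and hard thresholding maps referred to above can be sketched as follows (the Tℓ1 thresholding also has a closed form [19], omitted here for brevity):

```python
import numpy as np

def soft_threshold(x, alpha):
    # proximal map of alpha*||.||_1: shrink each entry toward 0 by alpha
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def hard_threshold(x, alpha):
    # proximal map of alpha*||.||_0: keep entries with |x| > sqrt(2*alpha)
    return np.where(np.abs(x) > np.sqrt(2.0 * alpha), x, 0.0)
```

Note the qualitative difference: soft-thresholding shrinks the surviving entries by alpha, while hard-thresholding keeps them untouched, which matters for the bias of the recovered weights.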

1.1. Contribution

We propose a Relaxed Variable Splitting (RVS) approach combining thresholding and CGD for minimizing the following augmented objective function

for a positive parameter β. We note in passing that minimizing in u recovers the original objective (4) with the penalty P replaced by its Moreau envelope [23]. We shall prove that our algorithm (RVSCGD), alternately minimizing u and w, converges for the ℓ0, ℓ1, and Tℓ1 penalties to a global limit with high probability. A key estimate is the Lipschitz inequality of the expected coarse gradient (Lemma 4). The descent of the Lagrangian function (9) and of the angle between the iterated w and w* then follows. The limit is a novel shrinkage of the true weight w* up to a scalar multiple, and hence a sparse approximation of w*. To our best knowledge, this result is the first to establish the convergence of CGD for sparse weight binarized activation networks. In numerical experiments, we observed that the limit of RVSCGD with the ℓ0 penalty recovers sparse w* accurately.

1.2. Outline

In section 2, we briefly overview related mathematical results in the study of neural networks and complexity reduction. Preliminaries are in section 3. In section 4, we state and discuss the main results. The proofs of the main results are in section 5, numerical experiments in section 6, and the conclusion in section 7.

2. Related Work

In recent years, significant progress has been made in the study of convergence in neural network training. From a theoretical point of view, optimizing (training) a neural network is a non-convex, non-smooth optimization problem. Blum and Rivest [4], Livni et al. [24], and Shalev-Shwartz et al. [25] showed that training a neural network is hard in the worst case. Shamir [5] showed that if either the target function or the input distribution is “nice,” optimization algorithms used in practice can succeed. Optimization methods in deep neural networks are often categorized into (stochastic) gradient descent methods and others.

Stochastic gradient descent methods were first proposed by Robbins and Monro [26]. The popular back-propagation algorithm was introduced in Rumelhart et al. [27]. Since then, many well-known SGD methods with adaptive learning rates were proposed and applied in practice, such as the Polyak momentum [28], AdaGrad [29], RMSProp [30], Adam [31], and AMSGrad [32].

The behavior of gradient descent methods in neural networks is better understood when the input has Gaussian distribution. Tian [6] showed that population gradient descent can recover the true weight vector with random initialization for a one-layer, one-neuron model. Brutzkus and Globerson [7] proved that a convolution filter with non-overlapping input can be learned in polynomial time. Du et al. [33] showed that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time, with a convergence rate depending on the smoothness of the input distribution and the closeness of patches. Du et al. [34] analyzed the polynomial convergence guarantee of a randomly initialized gradient descent algorithm for learning a one-hidden-layer convolutional neural network. A hybrid projected SGD (so-called BinaryConnect) is widely used for training various weight-quantized DNNs [35, 36]. Recently, a Moreau envelope based relaxation method (BinaryRelax) was proposed and analyzed to advance weight quantization in DNN training [37]. Also, a blended coarse gradient descent method [14] was introduced to train DNNs with fully quantized weights and activation functions, and to overcome vanishing gradients. For earlier work on the coarse gradient (a.k.a. straight-through estimator), see [38–40] among others.

Non-SGD methods for deep learning include the Alternating Direction Method of Multipliers (ADMM) to transform a fully-connected neural network into an equality-constrained problem [41]; method of auxiliary coordinates (MAC) to replace a nested neural network with a constrained problem without nesting [42]. Zhang et al. [43] handled deep supervised hashing problem by an ADMM algorithm to overcome vanishing gradients.

For a model similar to (9) and its treatment in a general context, see [44]; for image processing, see [45].

3. Preliminaries

3.1. The One-Layer Non-overlap Network

Consider the network introduced in Figure 1. Let σ denote the binarized ReLU activation function, σ(z): = χ{z>0}. The training sample loss is

where w* ∈ ℝd is the underlying (non-zero) teaching parameter. Note that (5) is invariant under the scaling w → w/c, w* → w*/c, for any scalar c > 0. Without loss of generality, we assume ‖w*‖ = 1. Given independent training samples {Z1, …, ZN}, the associated empirical risk minimization reads

The empirical risk function in (6) is piece-wise constant and has zero partial gradient in w almost everywhere. If σ were differentiable, then back-propagation would rely on:

However, σ has zero derivative almost everywhere, rendering (7) inapplicable. We study the coarse gradient descent with σ′ in (7) replaced by the (sub)derivative μ′ of the regular ReLU function μ(x) := max(x, 0). More precisely, we use the following surrogate of the gradient:

with μ′(x) = σ(x). Scaling by a constant corresponds to a ReLU function μ with smaller slope, and will be necessary for the stronger convergence result in our main findings. To simplify our analysis, we let N ↑ ∞ in (6), so that the coarse gradient approaches 𝔼Z[g(w, Z)]. The following lemma asserts that 𝔼Z[g(w, Z)] has positive correlation with the true gradient ∇f(w), and consequently, −𝔼Z[g(w, Z)] gives a reasonable descent direction.
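A minimal sketch of the empirical coarse gradient with the straight-through substitution μ′ = σ, assuming the squared sample loss and the summed-patch output form (the names are ours):

```python
import numpy as np

def sigma(z):
    # binarized ReLU: chi_{z > 0}
    return (z > 0).astype(float)

def coarse_grad(w, w_star, Z):
    """Coarse gradient of l(w; Z) = 0.5*(sum_j sigma(Z_j.w) - sum_j sigma(Z_j.w*))^2.

    The true chain rule would insert sigma'(Z_j.w) = 0 almost everywhere;
    following (8), it is replaced by mu'(Z_j.w) = sigma(Z_j.w), the
    derivative of the regular ReLU (straight-through estimator).
    """
    residual = sigma(Z @ w).sum() - sigma(Z @ w_star).sum()
    # backpropagate with sigma' replaced by sigma
    return residual * (sigma(Z @ w)[:, None] * Z).sum(axis=0)
```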

Lemma 1. [14] If θ(w, w*) ∈ (0, π), and ‖w‖ ≠ 0, then the inner product between the expected coarse and true gradient w.r.t. w is

3.2. The Relaxed Variable Splitting Coarse Gradient Descent Method

Suppose we want to train the network so that wt converges to a limit in some neighborhood of w*, and we also want to promote sparsity in the limit. A classical approach is to minimize the Lagrangian ϕ(w) = f(w) + λ‖w‖1, for some λ > 0. In practice, the ℓ1 penalty can also be replaced by ℓ0 or Tℓ1. Our proposed relaxed variable splitting (RVS) proceeds by first extending ϕ into a function of two variables f(w) + λ‖u‖1 and then considering the augmented Lagrangian:

Let Sα be the soft-thresholding operator, Sα(x) = sgn(x) max{|x| − α, 0}. The resulting RVSCGD method is described in Algorithm 1:

Algorithm 1

1:  Input: The step size η, parameters λ, β
2:  Initialize: u1, w1 with ‖w1‖ = 1
3:  for t = 1, 2, …, T do
4:      ut+1 = Sλ/β(wt)
5:      w̃t+1 = wt − η(𝔼Z[g(wt, Z)] + β(wt − ut+1))
6:      wt+1 = w̃t+1/‖w̃t+1‖
7:  Output: uT, wT

RVSCGD Algorithm
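The RVSCGD loop can be sketched as follows, with the expected coarse gradient passed in as a black box; the step forms (exact soft-thresholding minimization in u, one coarse gradient step on f(w) + (β/2)‖w − u‖2, then normalization) are reconstructed from the surrounding text:

```python
import numpy as np

def soft_threshold(x, alpha):
    # S_alpha(x) = sign(x) * max(|x| - alpha, 0)
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def rvscgd(coarse_grad, w0, eta, lam, beta, T):
    """Sketch of RVSCGD: alternate an exact u-minimization
    (soft-thresholding) with one coarse gradient descent step on
    f(w) + (beta/2)*||w - u||^2, then renormalize w."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(T):
        u = soft_threshold(w, lam / beta)                 # u-update
        w = w - eta * (coarse_grad(w) + beta * (w - u))   # w-update
        w = w / np.linalg.norm(w)                         # keep ||w|| = 1
    return u, w
```

Any descent oracle can stand in for the population coarse gradient; the loop structure, and in particular the per-iteration normalization, is what the convergence analysis relies on.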

3.3. Comparison With ADMM

A well-known modern method to solve the minimization problem ϕ(w) = f(w) + λ‖w‖1 is the Alternating Direction Method of Multipliers (ADMM). In ADMM, we consider the Lagrangian

L(w, u, z) = f(w) + λ‖u‖1 + 〈z, w − u〉 + (β/2)‖w − u‖2,

and apply the updates:

wt+1 = argminw L(w, ut, zt),
ut+1 = argminu L(wt+1, u, zt),
zt+1 = zt + β(wt+1 − ut+1).

Although widely used in practice, the ADMM method has several drawbacks when it comes to regularizing deep neural networks. Firstly, the ℓ1 penalty is often replaced by ℓ0 in practice; but ‖·‖0 is non-differentiable and non-convex, so current optimization theory does not apply [46]. Secondly, the argmin update of w is not applicable in practice on DNNs, as it requires full knowledge of how f(w) behaves. In most ADMM adaptations on DNNs, this step is replaced by a simple gradient descent step. Lastly, the Lagrange multiplier zt tends to reduce the sparsity of the limit of ut, as it seeks to close the gap between wt and ut.

In contrast, the RVSCGD method resolves these difficulties of ADMM. Firstly, without the linear term 〈z, w − u〉, one has an explicit formula for the update of u, which is easily implemented. Secondly, the update of wt is not an argmin update, but rather a gradient descent iteration itself, so our theory does not deviate from practice. Lastly, without the Lagrange multiplier term zt, there is a gap between wt and ut at the limit. The limit of ut is much sparser than in the case of ADMM, and numerical results show that f(wt) and f(ut) behave very similarly on deep networks. An intuitive explanation is that when the dimension of wt is high, most of the components pruned off to obtain ut have very small magnitudes and are often redundant weights.

In short, the RVSCGD method is easier to implement (no need to keep track of the variable zt), can greatly increase sparsity in the weight variable ut, while also maintaining the same performance as the ADMM method. Moreover, RVSCGD has convergence guarantee and limit characterization as stated below.

4. Main Results

Theorem 1. Suppose that the initialization and penalty parameters of the RVSCGD algorithm satisfy:

(i) θ(w0, w*) ≤ π − δ, for some δ > 0;

(ii) λ and β are sufficiently small, as specified in Lemmas 5–7; and

(iii) η is small relative to the Lipschitz constant L in Lemma 4, with the corresponding step-size condition holding for all t.

Then the Lagrangian decreases monotonically, and (ut, wt) converges sub-sequentially to a limit point with unit-norm w-component, such that:

(i) Let θ denote the angle between the limit weight and w*; then θ < δ;

(ii) The limit point satisfies

where Sλ/β(·) is the soft-thresholding operator of ℓ1, for some constant;

(iii) The limit point is close to the ground truth w*, in the sense that

Remark 1. As the sign of the limit weight agrees with that of w*, Equation (12) implies that w* equals an expansion of the limit weight; equivalently, the limit weight is (up to a scalar multiple) a shrinkage of w*, which explains the source of its sparsity. The assumption on η is reasonable: as shown below, the relevant gradient norm is bounded away from zero, and thus the admissible step size is also bounded.

The proof is provided in details in section 5. Here we provide an overview of the key steps. First, we show that there exists a constant Lf such that

then we show that the Lipschitz gradient property still holds when the true gradient is replaced by the coarse gradient:

and subsequently show

These inequalities hold when ‖wt‖ ≥ M, for some M > 0. It can be shown that with bad initialization, one may have ‖wt‖ → 0 as t → ∞. We circumvent this problem by normalizing wt at each iteration.

Next, we show the iterations satisfy θt+1 ≤ θt, and that the Lagrangian decreases. Finally, an analysis of the stationary point yields the desired bound.

None of these steps uses convexity of the ℓ1 penalty term. Hence we can extend our result to ℓ0 and transformed-ℓ1 (Tℓ1) regularization [21].

Corollary 1.1. Suppose that the initialization of the RVSCGD algorithm satisfies the conditions in Theorem 1, and that the ℓ1 penalty is replaced by ℓ0 or Tℓ1. Then the RVSCGD iterations converge to a limit point satisfying Equation (12), with the ℓ0 hard-thresholding operator [18] or the Tℓ1 thresholding [19] replacing Sλ/β, and a bound similar to (13) holds.

5. Proof of Main Results

The following Lemmas give an outline for the proof of Theorem 1.

Lemma 2. If every entry of Z is i.i.d. sampled from 𝒩(0, 1), and ‖w‖ ≠ 0, then the true gradient of the population loss f(w) is

for θ(w, w*) ∈ (0, π); and the expected coarse gradient w.r.t. w is

Lemma 3. (Properties of true gradient)

Given w1, w2 with min{‖w1‖, ‖w2‖} = c > 0 and max{‖w1‖, ‖w2‖} = C, there exists a constant Lf > 0 depending on c and C such that

Moreover, we have

Lemma 4. (Properties of expected coarse gradient)

If w1, w2 satisfy ‖w1‖ = ‖w2‖ = 1, then there exists a constant K such that

Moreover, there exists a constant L such that

Remark 2. The normalization condition in Lemma 4 is to match the RVSCGD algorithm and to give an explicit value for K. The result still holds in general when 0 < c ≤ ‖w1‖, ‖w2‖ ≤ C. Compared to Lemma 3, the coarse gradient case admits a sharper bound under the same normalization.

Lemma 5. (Angle Descent)

Let θt := θ(wt, w*). Suppose the initialization of the RVSCGD algorithm satisfies θ0 ≤ π − δ and β is sufficiently small; then θt+1 ≤ θt.

Lemma 6. (Lagrangian Descent)

Suppose the step size η of the RVSCGD algorithm is sufficiently small relative to the Lipschitz constant L in Lemma 4; then the Lagrangian decreases monotonically.

Lemma 7. (Properties of limit point)

Suppose the initialization of the RVSCGD algorithm satisfies θ(w0, w*) ≤ π − δ, for some δ > 0, that λ is sufficiently small, and that η is sufficiently small. Then (ut, wt) converges to a limit point such that

Lemmas 2 and 3 follow directly from Yin et al. [14]. The proofs of Lemmas 4–7 are provided below.

5.1. Proof of Lemma 4

First suppose ‖w1‖ = ‖w2‖ = 1. By Lemma 2, we have

for j = 1, 2. Consider the plane spanned by wj and w*; since ‖wj‖ = ‖w*‖ = 1, the vectors wj and w* form an isosceles triangle (see Figure 2).

Figure 2

Simple geometry shows

Thus the expected coarse gradient simplifies to

which implies

with .

Now suppose . By Equation (15), we have , for all C > 0. Then,

where the first inequality follows from (19), and the second inequality is from the constraint , with equality when . Letting , the first claim is proved.

It remains to show the gradient descent inequality. By Yin et al. [14], we have

Let . Then

We will show

for ‖w1‖ = ‖w2‖ = 1 and θ2 ≤ θ1. By Equation (18),

It remains to show

or there exists a constant K1 such that

Notice that by writing , we have

where the last equality follows since ‖w1‖ = ‖w2‖ = 1 implies 〈w1 + w2, w2w1 〉 = 0. On the other hand,

so it suffices to show there exists a constant K2 such that

Notice the function θ ↦ θ + cosθ is monotonically increasing on [0, π]. For θ1, θ2 ∈ [0, π] with θ2 ≤ θ1, the LHS is non-positive, and the inequality holds. Thus, one can take , and .

5.2. Proof of Lemma 5

Due to normalization in the RVSCGD algorithm, ‖wt‖ = 1 for all t. By Equation (18), we have

and the update of u is the well-known soft-thresholding of w [15, 22]:

where Sλ/β(·) is the soft-thresholding operator:

and Sλ/β(w) applies the thresholding to each component of w. Then the update of w has the form

for some constant Ct > 0. Suppose the initialization satisfies θ(w0, w*) ≤ π − δ, for some δ > 0. It suffices to show that if θt ≤ π − δ, then θt+1 ≤ π − δ. To this end, since , we have . Consider the worst case scenario: wt, w*, ut+1 are co-planar with , and w*, ut+1 are on two sides of wt (see Figure 3). We need to be in region I. This condition is satisfied when β is small such that

or

since , we have ‖ut+1‖ ≤ 1. Thus, it suffices to have .

Figure 3

5.3. Proof of Lemma 6

By definition of the update on u, we have . It remains to show . First notice that since

where Ct > 0 is the normalizing constant, thus

For a fixed u: = ut+1 we have

Since ‖wt‖, ‖wt+1‖ = 1, we know (wt+1wt) bisects the angle between wt+1 and −wt. The assumption guarantees and θ(−wt, wt+1) < π. It follows that θ(wt+1wt, wt) and θ(wt+1wt, wt+1) are strictly less than . On the other hand, also lies in the plane bounded by wt+1 and −wt. Therefore,

This implies . Moreover, when Ct ≥ 1:

And when :

Thus, we have

Therefore, if η is small so that and , the update on w will decrease . Since Ct ≤ 2, the condition is satisfied when .

5.4. Proof of Lemma 7

Since the Lagrangian is non-negative, by Lemmas 5 and 6 it converges to some limit. This implies (ut, wt) converges to some stationary point. By the update of wt, we have

for some constant , where , c1 > 0 due to our assumption, and . For expression (20) to hold, we need

Expression (21) implies that the limit point and w* are co-planar. From expression (21), and the fact that the weights have unit norm, we have

which implies a relation between the angles. Recall cos(a + b) = cos a cos b − sin a sin b. Thus,

which implies

By the initialization of β, we have . This implies θ < δ.

Finally, expression (20) can also be written as

From expression (23), we see that w*, after subtracting some vector whose signs agree with those of the limit weight, and whose non-zero components are at most λ/β, is parallel to the limit weight. This implies the limit weight is a soft-thresholded version of w*, modulo normalization. Moreover, for sufficiently small λ, we must have

On the other hand,

therefore, , for some constant C such that .

Finally, consider the isosceles triangle formed by the limit weight and w*. By the law of sines,

as θ is small, the corresponding ratio is near 1. Together with expression (22), we have

5.5. Proof of Theorem 1

Combining Lemmas 2–7, Theorem 1 is proved.

5.6. Proof of Corollary 1.1

Lemma 8. [19] Let

where gλ denotes the closed-form Tℓ1 proximal formula of [19]. Then the operator is the Tℓ1 thresholding: equal to gλ(x) if |x| > t, and zero elsewhere, where the threshold t is given piecewise as in [19].

Lemma 9. [18] The corresponding operator for ℓ0 is the hard thresholding: equal to x if |x| exceeds the threshold, and zero elsewhere.

We proceed by an outline similar to the proof of Theorem 1:

Step 1. First we show that the corresponding Lagrangians for ℓ0 and Tℓ1 both decrease under the updates of ut and wt. To see this, notice that the update on ut decreases each Lagrangian by definition. Then, for a fixed u = ut+1, the update on wt decreases each Lagrangian by an argument similar to that found in Theorem 1.

Step 2. Next, we show θ(wt, w*) ≤ π − δ, for some δ > 0, for all t, with initialization θ(w0, w*) ≤ π − δ. For the Tℓ1 penalty, by Lemma 8, we have

And for the ℓ0 penalty, by Lemma 9,

In both cases, each component of ut+1 is a thresholded version of the corresponding component of wt. This implies ‖ut+1‖ ≤ ‖wt‖ = 1, and thus the argument in Theorem 1 follows through, and we have θ(wt, w*) ≤ π − δ, for all t.

Step 3. Finally, the equilibrium condition from Equation (21) still holds for the limit point, and a similar argument yields the corresponding error bound.

6. Numerical Experiments

In this section, we demonstrate two simple experiments on implementing RVSCGD in practice.

Firstly, we numerically verify our result on the one-layer non-overlap network, using RVSCGD with the ℓ0 penalty. The experiment was run with parameters k = 20, d = 50, β = 4.e-3, λ = 1.e-4, and η = 1.e-5. Results are displayed in Figure 4. It can be seen that RVSCGD converges quickly for this toy model, and the angle θ(wt, w*) and the Lagrangian decrease monotonically, as stated in Theorem 1.

Figure 4
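A verification in the spirit of this experiment can be sketched as follows, with smaller sizes than the k = 20, d = 50 above, and with an empirical coarse gradient averaged over fresh Gaussian samples standing in for the population one (all names and parameter choices here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    # binarized ReLU
    return (z > 0).astype(float)

def hard_threshold(x, tau):
    # l0 thresholding: keep entries with |x| > tau
    return np.where(np.abs(x) > tau, x, 0.0)

def coarse_grad(w, w_star, Z):
    # straight-through coarse gradient of the squared sample loss
    residual = sigma(Z @ w).sum() - sigma(Z @ w_star).sum()
    return residual * (sigma(Z @ w)[:, None] * Z).sum(axis=0)

k, d = 8, 10                        # toy sizes
w_star = np.zeros(d)
w_star[:3] = 1.0
w_star /= np.linalg.norm(w_star)    # sparse unit ground truth

w = w_star + 0.3 * rng.standard_normal(d)
w /= np.linalg.norm(w)              # initial unit weight

eta, beta, lam, m = 1e-2, 1.0, 1e-4, 512
angles = []
for t in range(200):
    u = hard_threshold(w, np.sqrt(2 * lam / beta))          # u-update
    g = np.mean([coarse_grad(w, w_star, rng.standard_normal((k, d)))
                 for _ in range(m)], axis=0)                # averaged coarse gradient
    w = w - eta * (g + beta * (w - u))                      # w-update
    w /= np.linalg.norm(w)                                  # renormalize
    angles.append(float(np.arccos(np.clip(w @ w_star, -1.0, 1.0))))
```

Tracking `angles` over the run plays the role of the θ(wt, w*) curve in Figure 4: the angle should trend downward as the iterations align w with the sparse ground truth.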

Secondly, we extend our method to a multi-layer network. Consider a variation of LeNet [47], where we replace all ReLU activations with the binarized ReLU function. The model is then trained on the MNIST dataset for 100 epochs using SGD with momentum 0.9, weight decay 5.e-4, and learning rate 1.e-3, which is decayed by a factor of 10 at epoch 60. The RVSCGD algorithm is applied to this model using the same training setting. The results are displayed in Table 1. Notice that the base model has an accuracy of 89.31%, which is lower than reported in LeCun et al. [47]; this is because of the binarized ReLU replacement. Table 1 also shows that RVSCGD can effectively sparsify this variation of LeNet, with sparsity up to 83.76% at a 4.39% loss in accuracy. We believe the loss in accuracy is mainly from the 1-bit ReLU activation, which has too low a resolution to preserve important deep network information. We believe that with higher-bit quantization of weights and/or activations, networks can be more effectively pruned while still maintaining good performance (see [14]). This is a topic for our future studies.

Table 1

Penalty       β    λ        Accuracy (%)   Sparsity (%)
Base model    1    0        89.31          0
RGSM (GL)     1    1.e-7    87.17          33.31
              1    1.e-5    85.34          66.67
              1    1.e-3    84.92          83.76

Accuracy and sparsity of RVSCGD on a LeNet variation, on the MNIST dataset.

7. Conclusion

We introduced a variable splitting coarse gradient descent method to learn a one-hidden layer neural network with sparse weight and binarized activation in a regression setting. The proof is based on the descent of a Lagrangian function and the angle between the sparse and true weights, and applies to ℓ1, ℓ0 and Tℓ1 sparse penalties. We plan to extend our work to a classification setting in the future.

Statements

Data availability statement

All datasets generated for this study are included in the article/supplementary material.

Author contributions

TD performed the analysis. All authors contributed to the discussions and production of the manuscript.

Funding

The work was partially supported by NSF grants IIS-1632935 and DMS-1854434.

Acknowledgments

This manuscript has been released as a pre-print at http://export.arxiv.org/pdf/1901.09731 [48].

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1.

Hinton G, Deng L, Yu D, Dahl GE, Mohamed A, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag. (2012) 29:82–97. 10.1109/MSP.2012.2205597

  • 2.

Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Bartlett PL, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012 (2012). p. 1106–14. Available online at: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks

  • 3.

Dauphin YN, Fan A, Auli M, Grangier D. Language modeling with gated convolutional networks. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. Sydney, NSW (2017). p. 933–41. 10.5555/3305381.3305478

  • 4.

Blum AL, Rivest RL. Training a 3-node neural network is NP-complete. In: Hanson SJ, Remmele W, Rivest RL, editors. Machine Learning: From Theory to Applications: Cooperative Research at Siemens and MIT. Berlin; Heidelberg: Springer (1993). p. 9–28. 10.1007/3-540-56483-7_20

  • 5.

Shamir O. Distribution-specific hardness of learning neural networks. J Mach Learn Res. (2018) 19:1532–35. 10.5555/3291125.3291157

  • 6.

Tian Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. Sydney, NSW: JMLR.org (2017). p. 3404–13. 10.5555/3305890.3306033

  • 7.

Brutzkus A, Globerson A. Globally optimal gradient descent for a convnet with Gaussian inputs. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. Sydney, NSW: JMLR.org (2017). p. 605–14.

  • 8.

Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. In: 5th International Conference on Learning Representations, ICLR 2017. Toulon (2017). Available online at: https://openreview.net/forum?id=Sy8gdB9xx

  • 9.

LeCun Y, Denker J, Solla S. Optimal brain damage. In: Proceedings of the 2nd International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press (1989). p. 589–605. 10.5555/109230.109298

  • 10.

Han S, Mao H, Dally WJ. Deep compression: compressing deep neural network with pruning, trained quantization and Huffman coding. In: Bengio Y, LeCun Y, editors. 4th International Conference on Learning Representations, ICLR 2016. San Juan (2016). Available online at: http://arxiv.org/abs/1510.00149

  • 11.

Ullrich K, Meeds E, Welling M. Soft weight-sharing for neural network compression. In: 5th International Conference on Learning Representations, ICLR 2017. Toulon (2017). Available online at: https://openreview.net/forum?id=HJGwcKclx

  • 12.

Molchanov D, Ashukha A, Vetrov D. Variational dropout sparsifies deep neural networks. In: Proceedings of the 34th International Conference on Machine Learning, Vol. 70. Sydney, NSW: JMLR.org (2017). p. 2498–507. 10.5555/3305890.3305939

  • 13.

Louizos C, Welling M, Kingma D. Learning sparse neural networks through L0 regularization. In: 6th International Conference on Learning Representations, ICLR 2018. Vancouver, BC (2018). Available online at: https://openreview.net/forum?id=H1Y8hhg0b

  • 14.

Yin P, Zhang S, Lyu J, Osher S, Qi Y, Xin J. Blended coarse gradient descent for full quantization of deep neural networks. Res Math Sci. (2019) 6:14. 10.1007/s40687-018-0177-6

  • 15.

Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun Pure Appl Math. (2004) 57:1413–57. 10.1002/cpa.20042

  • 16.

Candès E, Romberg J, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. (2006) 59:1207–23. 10.1002/cpa.20124

  • 17.

Blumensath T, Davies M. Iterative thresholding for sparse approximations. J Fourier Anal Appl. (2008) 14:629–54. 10.1007/s00041-008-9035-z

  • 18.

Blumensath T. Accelerated iterative hard thresholding. Signal Process. (2012) 92:752–6. 10.1016/j.sigpro.2011.09.017

  • 19.

Zhang S, Xin J. Minimization of transformed l1 penalty: closed form representation and iterative thresholding algorithms. Commun Math Sci. (2017) 15:511–37. 10.4310/CMS.2017.v15.n2.a9

  • 20.

Nikolova M. Local strong homogeneity of a regularized estimator. SIAM J Appl Math. (2000) 61:633–58. 10.1137/S0036139997327794

  • 21.

Zhang S, Xin J. Minimization of transformed l1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. Math Program Ser B. (2018) 169:307–36. 10.1007/s10107-018-1236-x

  • 22.

Donoho D. De-noising by soft-thresholding. IEEE Trans Inform Theor. (1995) 41:613–27. 10.1109/18.382009

  • 23.

Moreau JJ. Proximité et dualité dans un espace hilbertien. Bull Soc Math France. (1965) 93:273–99. 10.24033/bsmf.1625

  • 24.

Livni R, Shalev-Shwartz S, Shamir O. On the computational efficiency of training neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 1. Cambridge, MA: MIT Press (2014). p. 855–63.

  • 25.

Shalev-Shwartz S, Shamir O, Shammah S. Failures of gradient-based deep learning. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning, Vol. 70. Sydney, NSW: PMLR (2017). p. 3067–75. Available online at: http://proceedings.mlr.press/v70/shalev-shwartz17a.html

  • 26.

Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. (1951) 22:400–7. 10.1214/aoms/1177729586

  • 27.

Rumelhart D, Hinton G, Williams R. Learning representations by back-propagating errors. Nature. (1986) 323:533–6. 10.1038/323533a0

  • 28.

Polyak B. Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys. (1964) 4:1–17. 10.1016/0041-5553(64)90137-5

  • 29.

    Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Machine Learn Res. (2011) 12:2121–59. 10.5555/1953048.2021068

  • 30.

    Tieleman T, Hinton G. Divide the Gradient by a Running Average of Its Recent Magnitude. Technical report. Coursera: Neural Networks for Machine Learning (2017).

  • 31.

    Kingma D, Ba J. Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y, editors. 3rd International Conference on Learning Representations, ICLR 2015. San Diego, CA (2015). Available online at: http://arxiv.org/abs/1412.6980

  • 32.

    Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond. In: 6th International Conference on Learning Representations, ICLR 2018. Vancouver, BC (2018). Available online at: https://openreview.net/forum?id=ryQu7f-RZ

  • 33.

    Du S, Lee J, Tian Y. When is a convolutional filter easy to learn? arXiv [preprint] arXiv:1709.06129. (2017).

  • 34.

    Du S, Lee J, Tian Y, Singh A, Poczos B. Gradient descent learns one-hidden-layer CNN: don't be afraid of spurious local minima. In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning, Vol. 80. Stockholm: PMLR (2018). p. 1339–48. Available online at: http://proceedings.mlr.press/v80/du18b.html

  • 35.

    Courbariaux M, Bengio Y, David J-P. BinaryConnect: training deep neural networks with binary weights during propagations. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, Vol. 2. Cambridge, MA: MIT Press (2015). p. 3123–31.

  • 36.

    Yin P, Zhang S, Qi Y, Xin J. Quantization and training of low bit-width convolutional neural networks for object detection. J Comput Math. (2019) 37:349–59. 10.4208/jcm.1803-m2017-0301

  • 37.

    Yin P, Zhang S, Lyu J, Osher S, Qi Y, Xin J. BinaryRelax: a relaxation approach for training deep neural networks with quantized weights. SIAM J Imag Sci. (2018) 11:2205–23. 10.1137/18M1166134

  • 38.

    Hinton G. Neural Networks for Machine Learning. Coursera, Video Lectures (2012).

  • 39.

    Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y. Binarized neural networks: training neural networks with weights and activations constrained to +1 or −1. arXiv [preprint] arXiv:1602.02830. (2016).

  • 40.

    Cai Z, He X, Sun J, Vasconcelos N. Deep learning with low precision by half-wave Gaussian quantization. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI (2017). p. 5406–14.

  • 41.

    Taylor G, Burmeister R, Xu Z, Singh B, Patel A, Goldstein T. Training neural networks without gradients: a scalable ADMM approach. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning, Vol. 48. New York, NY: JMLR.org (2016). p. 2722–31.

  • 42.

    Carreira-Perpinan M, Wang W. Distributed optimization of deeply nested systems. In: Kaski S, Corander J, editors. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. Reykjavik: PMLR (2014). p. 10–19. Available online at: http://proceedings.mlr.press/v33/carreira-perpinan14.html

  • 43.

    Zhang Z, Chen Y, Saligrama V. Efficient training of very deep neural networks for supervised hashing. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV (2016). p. 1487–95.

  • 44.

    Attouch H, Bolte J, Redont P, Soubeyran A. Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Lojasiewicz inequality. Math Oper Res. (2010) 35:438–57. 10.1287/moor.1100.0449

  • 45.

    Wu T. Variable splitting based method for image restoration with impulse plus Gaussian noise. Math Probl Eng. (2016) 2016:3151303. 10.1155/2016/3151303

  • 46.

    Wang Y, Zeng J, Yin W. Global convergence of ADMM in nonconvex nonsmooth optimization. J Sci Comput. (2019) 78:29–63. 10.1007/s10915-018-0757-z

  • 47.

    Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. (1998) 86:2278–324. 10.1109/5.726791

  • 48.

    Dinh T, Xin J. Convergence of a relaxed variable splitting coarse gradient descent method for learning sparse weight binarized activation neural network. arXiv [preprint] arXiv:1901.09731. (2019).

Keywords

sparsification, 1-bit activation, regularization, convergence, coarse gradient descent

Citation

Dinh T and Xin J (2020) Convergence of a Relaxed Variable Splitting Coarse Gradient Descent Method for Learning Sparse Weight Binarized Activation Neural Network. Front. Appl. Math. Stat. 6:13. doi: 10.3389/fams.2020.00013

Received

24 January 2020

Accepted

14 April 2020

Published

06 May 2020


Edited by

Lucia Tabacu, Old Dominion University, United States

Reviewed by

Jianjun Wang, Southwest University, China; Yuguang Wang, University of New South Wales, Australia

*Correspondence: Thu Dinh

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
