Skip to main content

ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 11 December 2023
Sec. Mathematics of Computation and Data Science

Algorithm unfolding for block-sparse and MMV problems with reduced training overhead

  • 1Electrical Engineering and Computer Science, Communications and Information Theory, Technical University Berlin, Berlin, Germany
  • 2Institute of Optical Sensor Systems, German Aerospace Center (DLR), Berlin, Germany
  • 3Institute for Mathematical Stochastics, Carl-Friedrich-Gauß-Faculty, Technical University Brunswick, Brunswick, Germany

In this study, we consider algorithm unfolding for the multiple measurement vector (MMV) problem in the case where only few training samples are available. Algorithm unfolding has been shown to empirically speed-up in a data-driven way the convergence of various classical iterative algorithms, but for supervised learning, it is important to achieve this with minimal training data. For this, we consider learned block iterative shrinkage thresholding algorithm (LBISTA) under different training strategies. To approach almost data-free optimization at minimal training overhead, the number of trainable parameters for algorithm unfolding has to be substantially reduced. We therefore explicitly propose a reduced-size network architecture based on the Kronecker structure imposed by the MMV observation model and present the corresponding theory in this context. To ensure proper generalization, we then extend the analytic weight approach by Liu and Chen to LBISTA and the MMV setting. Rigorous theoretical guarantees and convergence results are stated for this case. We show that the network weights can be computed by solving an explicit equation at the reduced MMV dimensions which also admits a closed-form solution. Toward more practical problems, we then considered convolutional observation models and show that the proposed architecture and the analytical weight computation can be further simplified and thus open new directions for convolutional neural networks. Finally, we evaluate the unfolded algorithms in numerical experiments and discuss connections to other sparse recovering algorithms.

1 Introduction

This study connects the multiple measurement vector (MMV) problem, block- or joint-sparsity and recent results of deep unfolding of the iterative shrinkage thresholding algorithm (ISTA) to reconstruct unknown joint-sparse vectors from given linear observations. Such vectors could be, for example, signals received at the different antennas in a wireless communication problem or, in a computational imaging setup, discrete images observed at different detectors or aggregation stages. Compressed sensing is a way to reconstruct compressive measurements from their underdetermined systems and first theoretical breakthroughs were achieved by Candés et al. [1] and Donoho [2], leading to an approach where fewer samples can be used, than stated within the Nyquist–Shannon sampling theorem [3]. They were able to show that unknown vectors can be reconstructed using convex optimization if the linear mapping fulfilled certain assumptions [1, 4]. These idea rely on minimizing ℓ1-norm to promote sparsity and the approach of basis pursuit [5]. These convex optimization problems could then be solved with iterative algorithms, in Figueiredo et al. [6] gradient projection approaches are presented and in Fornasier and Rauhut [7] the idea of thresholding algorithms, which will be also discussed in this study. Although this is already a well researched field, in practice leading to high computational effort because of many iterations and large underlying systems and is thus not suitable for real-world applications. Thus, Karol Gregor and Yann LeCun proposed to use the iterative structure of these algorithms for a neural network and to train each iteration step [8], which is also referred to as deep unfolding and will be discussed in Section 2. Convergence for deep unfolding of the iterative thresholding algorithm has also been studied by Chen et al. [9]. Liu and Chen [10] proposed in to exclude the weight matrix from the data-driven optimization approach and pre-compute this by data-free optimization and thus presented Analytical LISTA (ALISTA). In recent results, Chen et al. could reduce the training procedure even more, by showing that only tuning three hyperparameters is sufficient, proposing HyperLISTA [11]. Creating large sets of training data is often difficult in practice, and thus, it is important to reduce the trainable parameters. Therefore, we will extend the already stated concepts for the block-sparse setting and especially for the multiple measurement vector problem, developing suitable learned algorithms, with only a few trainable parameters and similar theoretical guarantees as ALISTA.

1.1 Multiple measurement vector problems

Multiple measurement vector (MMV) problems occur in many applications, for example, in tomography [12], communication [13], blurred image reconstruction, or superresolution [14]. In the following, we derive the connection to block-sparsity. In an MMV problem, we assume that we derived d ∈ ℕ measurements yl ∈ ℝm, l = 1, …, d from d sparse signal vectors xl ∈ ℝn sharing the same support supp(xl)={i:|xil|0}, which is referred to as joint-sparsity [15, 16]. The MMV problem can then be presented as solving the following d equations

yl=Kxl+ϵ~    (1)

for l = 1, …, d and with K ∈ ℝm×n. This can be rewritten in the following matrix equation form

Y=KX+,    (2)

where X = (x1, …, xd) ∈ ℝn×d, Y = (y1, …, yd) ∈ ℝm×d. With the vectorizing operator vec(·), stacking each column of a matrix on each other, we can cast Equation (2) into a block-sparse problem. We have

Y=KXYT=XTKTvec(YT)=(KId)vec(XT),

where we used the well-known vectorization property of matrix equations, see for example Schacke [17]. Here, ⊗ is the Kronecker product. The vector x = vec(XT) is block-sparse with n blocks of length d, if the signals xl are jointly sparse. With D=(KId)ny×nx, where ny = m · d and nx = n · d, we obtain the block-sparse setting considered in this work.

1.2 Block sparsity

In the more general setting, we want to reconstruct an unknown vector xnx from a given matrix Dny×nx and given yny

y=Dx+ϵ,    (3)

with nx = nd, ny = md for some n, m, d ∈ ℕ. We assume that noise ϵny is added to Dx. In applications, we often have this problem is ill-posed, i.e., ny < nx or D is not invertible. We assume that x is the concatenation of n “smaller” vectors of length d, called blocks, i.e., x[i] ∈ ℝd,

xT=[x1xd=x[1]xd+1x2d=x[2]xnx-d+1xnx=x[n]]T.    (4)

Following the notation in Yonina and Eldar [18], we define

x2,0=i=1nI(x[i]2>0),

where I(‖x[i]‖2 > 0) = 1 if ‖x[i]‖2 > 0 and equal to zero otherwise and the ℓ2-norm is defined as x22=i=1n|xi|2. We call xnx s-block-sparse if ‖x2,0s. Similar to Equation (4), we can construct the matrix D in Equation (3) from n “smaller” matrices D[i]ny×d, i = 1, …, n

D=(D[1]D[2]D[n]).

Without loss of generality, we can assume that these blocks are orthonormal, see Yonina and Eldar [18], i.e., D[i]TD[i]=Id, where Id is the d × d identity matrix. This assumption simplifies the presentation of several statements below.

Definition 1. The block-coherence of a matrix Dny×nx is defined as

μb(D)=maxij1dD[i]TD[j]2.    (5)

Here, we use A2=λmax(ATA), where λmax denotes the largest eigenvalue of matrix ATA. Note that the block coherence can also be introduced with a normalization factor 1/(‖D[i]‖2D[j]‖2), but since we assume orthonormal blocks, we can neglect this. This reduces to the already known coherence μ for d = 1

μ(D)=maxij|D:,iTD:,j|,

where D:, i denotes the ith column of D, see Donoho et al. [19]. In Yonina and Eldar [18], it is shown that 0 ≤ μb(D) ≤ μ(D) ≤ 1, and it is possible to derive recovery statements for small μb similar to μ, for details see Yonina and Eldar [18]. We can consider also the cross block coherence, which compares two matrices and will be important in the following.

Definition 2. For B,Dny×nx with B[i]TD[i]=Id the cross block coherence is defined as

μb(B,D)=maxij1dB[i]TD[j]2.    (6)

Similar to ordinary basis pursuit and LASSO [5, 2022], we use the following ℓ2,1-LASSO to solve Equation (3)

minxnx12Dx-y22+αx2,1    (7)

where x2,1=i=1nx[i]2 is the ℓ2,1-norm of x, which will promote block-sparsity for the solution of Equation (7). This convex programm can be solved by the fixed point iteration known as the block iterative shrinkage thresholding algroithm (Block-ISTA/BISTA):

x(k)=ηαγ(x(k-1)-γ[DT(Dx(k-1)-y)]),    (8)

where ηα is the block-soft-thresholding operator, given as

ηα(x)[i]=max{0,1-αx[i]2}x[i].    (9)

BISTA is an already well-studied proximal gradient algorithm based on the functional in Equation (7). It is known that the fixed point iteration in Equation (8) converges for γ(0,1L), where L=D22 is the Lipschitz constant of the least squares term in Equation (7) w.r.t to x, to a solution of Equation (7), if at least one solution exists, see for example Byrne [23], Bauschke et al. [24], and Beck [25]. On the other hand, the choice of the regularization parameter α has to be done empirically and is very crucial for a “good” recovery. If α is set too large, this can lead to too much damping, possibly setting blocks to 0 that actually have a non-zero norm. If α is too small, we get reverse effects. In practice, this leads to problems since computing the iterations require high computational effort. Deep unfolding is a way to tackle these problems, i.e., reduce the number of iterations by learning optimal regularization parameters and step-sizes. There are already classical concepts in increasing the convergence speed by using an additional step in updating the current iterate x(k), by using the previous x(k−1), resulting in Block Fast ISTA [25, 26]. On the other hand, the choice of optimal parameters is still solved empirically.

2 Deep unfolding and learned BISTA

Recently, the idea of deep unfolding has been developed, where the goal is to optimize these parameters of such an iterative algorithm [8, 10, 27]. This in turn gives us an iterative algorithm with optimal chosen step-size γ and regularization parameters, but we will see that we do not have to restrict our self only to those parameters. Recently, Fu et al. proposed Ada-BlockLISTA by applying deep unfolding to block-sparse recovery [28]. They show an increase in the convergence speed with numerical examples but do not cover theoretical studies.

In the following, we are going to present the idea of deep unfolding, then we are going to derive Learned BISTA (LBISTA).

2.1 Deep unfolding

We will now formalize the concept of deep unfolding for an arbitrary operator that depends on a certain set of parameters, before we apply this to the previously presented fixed point iteration. To this end, we define an operator

T(·;θ,y):XX    (10)

which depends on a set of parameters θ ∈ Θ, such as the stepsize of a gradient descent operator and an input y. For example, for BISTA, this would be θ = (α, γ) with

T(x;θ,y)=ηαγ(x-γ[DT(Dx-y)]).

We assume that Fix(T(·;θ, y)) ≠ ∅ and that we have convergence for the fixed point iteration

Tk(x(0);θ,y)=x(k)    (11)

for an arbitrary x(0)X. Deep unfolding now interprets each iteration step as the layer of a neural network and uses the parameters θ ∈ Θ as trainable variables. In more detail, this means we look at the Kth iteration of Equation (11), i.e., the composition

(TT)Ktimes(x(0);θ,y)=x(K)

for some K ∈ ℕ. By unfolding this iterative scheme, we define

Tθ(k-1)(·;y):=T(·;θ(k),y)

for k = 1, …, K and set θ(k) as the set of trainable variables in this operator, so they can vary in each iteration step. For example, in LBISTA, we will get θ(k) = (α(k), γ(k)) in the later called tied case. With this we get the following composition

(Tθ(K-1)Tθ(0))(x(0);y)=x(K).    (12)

and define the operator

Tθ~:=Tθ(K-1)Tθ(0),    (13)

where we get the full parameter space Θ~=Θ××Θ, with trainable variables θ~=i=1Kθ(i). TΘ~ will then be the neural network which will be trained with respect to θ~. So in the end, after training, we have an iterative algorithm with fixed but optimized parameters. It seems thus that deep unfolding can be applied to any iterative algorithm and help us to estimate the best choice of parameters, but we will only present deep unfolding for BISTA and consider deep unfolding for arbitrary operators in future studies.

2.2 Learning

In this section, we give an overview for the training procedure used in this study. The general idea of supervised learning is to choose model parameters such that the predictions are close, in some sense, to the unknown target, i.e., in our case, the unknown vector x in Equation (3) generating the measurement y. Hence, we aim to minimize an expected loss over an unknown distribution D:

minθ[R(θ):=𝔼x*˜D[(x^-x*)]]    (14)

where ℓ(·) is a given loss function, x^=Tθ~(x(0);y) is the output of the model, and x* is the ground-truth. Here, we will use the squared ℓ2-loss (x)=12x22. The objective functional R(θ)=𝔼x*˜D[(x^-x*)] is also called risk of the model. Since the underlying distribution D is unknown, we take a batch of S independently drawn samples of input and output data (xj*,yj) for j = 1, …, ntrain according to Equation (3) and minimize instead data-driven the empirical risk

RS(θ)=1ntrainj=1ntrain(x^j-xj*).    (15)

Proceeding in this way for all layer at once is sometimes referred as end-to-end learning. Because of the special structure of our deep unfolding models, and inspired by Musa et al. [29], we instead train the network layer-wise, by optimizing only θcase(k-1) for layer k yielding the following training procedure: Let k ∈ {1, …, K},

minθcase(k-1)𝔼x*~D[(x^(k)-x*)],

where x^(k) is the output of the kth layer. We realized this training as follows, we generate a validation set (xi,validation*,yi,validation), used to evaluate the model while training and a training set (xi,train*,yi,train), i = 1, …, ntrain, used to calculate (15). This objective is locally minimized by gradient descent methods. As a stopping criteria, we evaluate the normalized mean square error, defined as

NMSE(x,x^(k))=x^(k)-x*22x*22,

depending on the validation set, and stop if the maximum of all evaluated NMSE stays the same for a given number of iterations. See Algorithm 1, where Adam is the ADAM Optimizer [30] depending on an training rate tr and the functional which should be minimized with respect to given variables, here the loss function ℓ(·) with respect to θk−1.

ALGORITHM 1
www.frontiersin.org

Algorithm 1. Training.

2.3 Learned BISTA

In the following, we present four different unfolding techniques for BISTA. We present a tied (weights are shared between different layers) and untied (individual weights per layer) case, which refers to different training approaches

S=I-BTDB=γD.

Tied LBISTA: The idea of LBISTA is now to fix the matrices S and B for all layers but also include them in our set of trainable variables:

x(k)=ηα(k-1)(Sx(k-1)+BTy).    (16)

For LBISTA (Equation 16), we get trainable variables

θ=((α(k))k=0K-1,S,B),

where we initialize S = IBTD and B = γD. Algorithm (16) is also referred to as vanilla LISTA in the sparse case. Inspired by the LISTA-CP model, i.e., LISTA with coupled parameters, proposed in Liu and Chen [10], we will also consider LBISTA-CP

x(k)=ηα(k-1)(x(k-1)-γ(k-1)BT(Dx(k-1)-y)).    (17)

For LBISTA-CP (Equation 17), we get

θ=((α(k))k=0K-1,(γ(k))k=0K-1,B),

where we initialize B = D.

Untied LBISTA: The idea of untied LBISTA is then to use in each layer different matrices S and B to train, i.e.,

x(k)=ηα(k-1)(S(k-1)x(k-1)+(B(k-1))Ty).    (18)

For LBISTA (untied) Equation (18), we get trainable variables

θ=((α(k))k=0K-1,(S(k))k=0K-1,(B(k))k=0K-1),

where we initialize S(k) = IBTD and B(k) = γD for every k = 0, …, K − 1. Inspired by Algorithm (17), we will also consider Algorithm (19), which will be referred to as LBISTA-CP (untied), [9, 10]:

x(k)=ηα(k-1)(x(k-1)-γ(k-1)(B(k-1))T(Dx(k-1)-y)).    (19)

For LBISTA-CP (untied) (Equation 19), we get

θ=((α(k))k=0K-1,(B(k))k=0K-1),

where we initialize B(k) = D for every k = 0, …, K − 1. Hence, compared to O(nynx+K) parameters in the tied case, now more training data and longer training time is required to train now O(Knynx+K) parameters.

We initialize the trainable variables with values from original BISTA. In Chen et al. [9], it has been shown that convergence of LISTA-CP (untied) can be guaranteed if the matrices B(k) belong to a certain set and their proof can be extended to block-sparsity. The steps are very similar to the convergence proof for learned block analytical ISTA given in the next section.

3 Analytical LBISTA

In the previous section, we presented several approaches for learned BISTA where optimal weights are optimized in a data-driven fashion. In Liu and Chen [10] instead proposed to analytically pre-compute the weights and only train step-size and threshold parameters. It turns out that this Analytic LISTA (ALISTA) with the so called analytical weight matrix is as good as the learned weights. In the following, we are going to extend and improving the theoretical statements for ALISTA to the block-sparse case and propose analytical LBISTA. In contrast to Liu and Chen [10], we will provide a direct solution and also show different ways to calculate the analytical weight matrix in different settings.

3.1 Upper and lower bound

This part of the study will focus on combining and extending several theoretical statements from Chen et al. [9] and Liu and Chen [10] and applying these for the block-sparse case. With the following two theorems, we are then going to present analytical LBISTA, by showing that this is as good as LBISTA-CP (untied) with a pre-computed B.

3.1.1 Upper bound

In this section, we start with an upper bound for the error of the approximation generated by Equation (19), i.e., LBISTA-CP (untied), and the exact solution x* for given parameters. For this, we modify Assumption 1 from Liu and Chen [10] to be consistent with the block-sparse setting.

Assumption 1. We assume (x,ϵ)Xb(M,s,σ) with

(x,ϵ)Xb(M,s,σ)          ={x:x[i]2Mi=1,,n,   x2,0s,ϵ2σ}.    (20)

As already mentioned, a matrix with small block-coherence has good recovery conditions. In Liu and Chen [10] proposed analytical LISTA, where the pre-computed matrix B is minimizing the mutual cross-coherence. This motivates the following definition.

Definition 3. With Dny×nx, we define the generalized mutual block-coherence

μ~b(D)=infBny×nxBT[l]D[l]=Id1ln{maxij1i,jn1dBT[i]D[j]2}             =infBny×nxBT[l]D[l]=Id1ln{μb(B,D)}.    (21)

We define, analogously to Liu and Chen [10] with Wb(D) the set of all Bny×nx which attain the infimum in Equation (21), i.e., Wb(D)={Bny×nx:μb(B,D)=μ~b(D)}.

Note that the set Wb(D) is non-empty because the set of feasible matrices {Bny×nx:BT[i]D[i]=Id,1in} contains at least D because we assume DT[i]D[i]=Id, also 0μ~b(D)μb(D) and therefore, Equation (21) is a feasible and bounded program, see Supplementary material to [9]. We will call matrices from Wb(D) analytical weight matrices.

Definition 4. The block-support of a block-sparse-vector xXB(b,s) is defined as

suppb(x)={i:x[i]20i=1,n}.    (22)

We will now derive an upper bound for the ℓ2-error and thus showing convergence of LBISTA-CP for a special matrix B and given parameters α(k) and γ(k). In [9], Liu et al. showed linear convergence for unfolded ISTA with additional noise, more precivesly for LISTA-CP (untied), if the matrices B(k) belonged to a certain set. In Liu and Chen [10], it was shown that we can pre-compute such a matrix B, chose B(k) = B, chosen by a data-free optimization problem and still have the same performance. For this, new proposed unfolded algorithm linear convergence was also shown, without additional noise. In Liu and Chen [10], convergence only in the noiseless case was shown, but the results derived in Chen et al. [9] were derived with bounded noise. Thus, we are going to combine these two proofs and extend it to block-sparsity:

For given (x, ϵ), y = Dx + ϵ and parameters {θ(k)}k=1K, we abbreviate with {x(k)(x,ϵ)}k=1K the sequence generated by (19) with x(0) = 0. Further, we define

CX(k)=sup(x,ϵ)Xb(M,s,σ){x(k)(x,ϵ)-x2,1}       C=supkmaxj=1,...,n|γ(k)|B[j]2    (23)

Theorem 1. For any BWb(D) and any sequence γ(k)(0,2μ(2s-1)+1) and parameters (B(k): = B, α(k), γ(k)), for kK, with

α(k)-Cσγ(k)μ(D)CX(k)[1,κ]    (24)

for some κ ≥ 1, with μ=dμ~b(D). With M > 0 and s < (μ−1 + 1)/2, we have

suppb(x(k)(x*))suppb(x*)=:𝕊,    (25)
x(k)-x*2exp(-τ=0k-1ã(τ))sM    (26)
    +Cσ(1+τ=0k-1exp(-s=τk-τã(s))).

where

ã(τ)=-log(γ(τ)μ((κ+1)s-1)+|1-γ(τ)|)         >0.

Note that in the noiseless case and with κ = 1, we obtain the results in Liu and Chen [10]. However, from the latter theorem, it follows that one can relax this condition to κ > 1. We also see from the proof of Theorem 1 that one needs at least α(k)dγ(k)μ~b(D)CXk()+Cσ to have Equation (25). On the other hand, one can always find such a κ from the trained (and then fixed) parameters and thus use therefore the theorem afterward. Obviously, a worse α effects the upper bound of the ℓ2-error and thus appears in ã(τ). Summarizing, above theorem shows now convergence on the training set even if κ ≠ 1.

3.1.2 Lower bound

This section states the lower bound for the ℓ2,1-error, showing that for convergence in the ℓ2,1-norm the defined parameters in Theorem 1 are optimally chosen. We now modify Assumption 2 from Liu and Chen [10] to be consistent with the block-sparse setting.

Assumption 2. x* is sampled from PX. PX satisfies: 2 ≤ 𝕊 ≤ s and 𝕊 is uniformly distributed over the whole index set. The non-zero blocks of x* satisfy the uniform distribution and ‖x[i]‖2M for all i ∈ 𝕊. And we assume, ϵ = 0.

The latter theorem states that the analytical weight matrix should minimize the generalized mutual block coherence. Therefore, for a lower bound, we will only consider matrices that are bounded away from the identity.

Definition 5. With Dny×nx, s ≤ 2, σ̄min>0 we set

W̄(D,s,σ̄min):={Bny×nx    (27)
      :σmin(I(B[(jS)])TD[(jS)]) σ¯min}.

The parameters are chosen from the following set.

Definition 6. Let {x(k)}k=1 be generated by x(k+1)=ηα(k)(x(k)-(B(k))T(Dx(k)-y)) with parameters {B(k),α(k)}k=0 and x(0) = 0. We define the set of all parameters guaranteeing no false positive blocks in x(k) by

T={{B(k)W̄(D,s,σ̄min),θ(k)}k=0:    (28)
         suppB(x)S,x*Xb(M,s,0),k}.

This set is non-empty since Equation (25) holds true if α(k) are chosen large enough. Following mainly the proof in Liu and Chen [10] by extending the setting from sparsity to block-sparsity, the lower bound for the ℓ2,1-norm can be stated as follows.

Theorem 2. Let {x(k)}k=1 be generated by x(k+1)=ηα(k)(x(k)-(B(k))T(Dx(k)-y)). Under Assumptions 2, {W(k)W̄(D,s,σ̄min),θ(k)}k=0T and ϵ > 0, we have

x(k)-x*2,1ϵx*2exp(-ck),    (29)

with probability 1-ϵs32-ϵ2 and c=log3-logσ̄min.

3.2 Analytical LBISTA

Analogously to Liu and Chen [10] and following the previous two theorems, decompose LBISTA-CP (untied), Algorithm 17, into two steps:

x(k)=ηα(k-1)(x(k-1)-γ(k-1)B~T(Dx(k-1)-y)),    (30)

where, in the first step, B~ is pre-computed, such that

μb(B~,D)=μ~b(D).

In the second step, the parameters θ=((α(k))k=0K-1,(γ(k))k=0K-1) are trained layer wise, as discussed in the previous section. This results in a comparable method, with only O(K) trainable parameters, instead of O(nynx+K) for LBISTA-CP (Equation 17) or even O(Knynx+K) for LBISTA (untied) Equation 18.

4 Computing the analytical weight matrix

ALBISTA relies on the analytical weight matrix, deriving this matrix can be challenging in practice, thus this section focuses on computing this matrix. We follow the procedure in Liu and Chen [10] by estimating Equation (21) with an upper bound. But in addition to Liu and Chen [10], we state a closed form for the upper bound, and currently, this is done by a projected gradient descent approach.

4.1 Solving an upper bound

Since the objective in Equation (21) is not differentiable, one solves the following upper bound problem

minBny×nx1dBTDF2    (31)
s.t.BT[i]D[i]=Idfori=1,,n.

This is derived from the following inequality

maxij1dBT[i]D[j]22maxij1dBT[i]D[j]F2                                             1di,jBT[i]D[j]F2                                             =1dBTDF2.

In Liu and Chen [10], this is solved by a projected gradient method, but since this is a constrained linear least squares problem in the vector space of matrices with Frobenius inner product the following Theorem states a closed form of the solution of Equation (31).

Theorem 3. The minimizer Bny×nx of Equation (31) is given as the concatenation

B=(B[1],B[2],,B[n]),

where the n blocks are given as

B[i]=Ki+(D[i]-EiHi)

with

Ki=(2DDT)2+D[i]D[i]T,Ei=2DDTD[i],Ri=D[i]-2DDTKi+Ei,Si=-D[i]TKi+Ei,Li=RiTRi+SiTSi,Mi=Ki+Ei(I-Li+Li),Hi=Li+SiT+(I-Li+Li)(I+MiTMi)-1        (Ki+Ei)TKi+(D[i]-EiLi+SiT),

for i = 1, …, n.

The proof can be found in Appendix C.

Let d = 1 and the singular value decomposition of D given as D = VΣUT and assume B=VΣ~UT. Then, the solution of Equation (31) is given in an even simpler form since

BTDF2=UΣ~VTVΣUTF2                    =UΣ~ΣUTF2                    =Σ+ΣF2.

Choosing B=D+,Tdiag(d~)-1, where d~=diag(D+D), i.e., a normalized pseudo-inverse, yields also a solution for Equation (31). Here, diag follows the matlab/python notation, where diag of a matrix gives the vector of the main diagonal and gives a diagonal matrix with a given vector on its main diagonal. For d ≥ 2, the orthonormal block constraints in Equation (31) would not be met since with this construction we can only guarantee that the diagonal elements of BTD are equal to one but cannot control what happens on the off-diagonal, thus not yielding a feasible solution.

4.2 Computing the analytical weight matrix in a MMV problem

In practice, often large data sets are obtained, i.e., by a large amount of measurements or measurements yl, xl representing pictures. For instance, 1, 000 pixels and 200 measurements lead to a matrix D with (1, 000·200)2 elements. Applying the theory for analytical LBISTA could thus be difficult in practice. Although D is sparse, it can take a long time to calculate the analytical weight matrix. The following Theorem states the connection between the MMV setting and the block sparse setting for ALBISTA, showing that it is sufficient to minimize the generalized mutual coherence Equation (32), see Liu and Chen [10], for K instead of minimizing the generalized mutual block-coherence (Equation 21) for D = KId. Moreover, in many applications the measurement model involves convolutions, see for example Ahmadi et al. [14], so we conclude this section by considering the cases where K is a circular or Toeplitz matrix.

Theorem 4. Let B~ attain the minimum in

infB~n×nB~:,iTK:,i=11inmaxij|B~:,iTK:,j|,    (32)

where B~:,i refers to the ith column. Then the minimum of

infBny×nxBT[l]D[l]=11ln{maxij1i,jn1dBT[i]D[j]2},

is given as B=B~Id, if D = KId.

The proof can be found in Appendix D. Moreover, the following relation holds, if D = KId,

μb(D)=1dμ(K),

thus, the block-coherence of D can be enhanced by increasing the number of measurements d. It is feasible to solve Equation (31) by the pseudo inverse in the MMV setting, since this is solved for K, i.e., d = 1

4.3 Circular matrix case

Consider now the following setting, where the measurements yl ∈ ℝn are obtained by a circular convolution of xl with vector k, i.e.,

yl=kxl    (33)
    =Kxl,l=1,,d,

where K is a circular matrix generated by a vector k ∈ ℝn, i.e., K = circ(k) ∈ ℝn×n. Applying Theorem 3 to the circular case yields the following lemma.

Lemma 1. Let k ∈ ℝn and let B = circ(b) ∈ ℝn×n where b ∈ ℝn is given by

(2KKTkkT0)(bλ)=(01),    (34)

λ ∈ ℝ. Then, B attains the minimum in Equation (31).

The latter statement implies a simpler way to compute b ∈ ℝn by using singular value decomposition

B*K=U*diag(σ(B))UU*diag(σ(K))U         =U*diag(σ(B)σ(K))U,

where σ(B) ∈ ℝn is the vector of singular values of B, respectively K, filled with zeros and U an unitary matrix. With U=1/nF, where F is the Discrete Fourier Transform (DFT) matrix, this leads to the conclusion σ(B)=Fb=b^, and thus

b=F-1(1/k^¯),    (35)

where k^=Fk. The expression 1/k^ should be interpreted point wise and to be zero if k^i=0. This also concludes that the computation of B tends to be difficult in practice if K has not full rank, since this means that k^ has at least one zero entry. In this case, b has to be scaled with nrank(K) since

bTk=1nb^Tk^=1nk^0        =rank(K)n=!1.

4.4 Toeplitz matrix case

Considering the more general convolutional setting,

y=k*x,    (36)

where km~ and x ∈ ℝn with m~<n. This results in a Toeplitz matrix K

K=(k100k2k1 0km~km~ 0km~ 00km~)m×n.

The reasoning of the previous section cannot be applied to show that the solution of Equation (31) must be a Toeplitz matrix. But the following can be observed: Let b be constructed as discussed for K~=circ(k), where k is the concatenation of k and a zero vector of suitable dimensions, i.e., the first column of K. The analytical weight matrix B w.r.t. K can be constructed as

B:,i=Ti-1b,i=1,,n,

where T is the m × m cyclic shift matrix, i.e., only the m × n submatrix of B~=circ(b)m×m is used. The columns of K can also be expressed through the cyclic shift of k. Hence,

B:,iTK:,i=bTT-(i-1)Ti-1k=bTk=1,

i.e., B is a feasible solution of Equation (31) for K. On the other hand, the cross coherence is bounded since

maxi,jn.ij|B:,iTK:,j|=maxi,jn,ij|bTT-(i-1)Tj-1k|                                             =maxi,jn,ij|bTTj-ik|                                             maxi,jm,ij|bTTj-ik|                                             =maxi,jm,ij|B~:,iTK~:,j|                                             =B~TK~-Immax                                             =nF-1diag(σ(B)σ(K))1nF-Immax                                             =F-1ImF-Immax=0,

where ‖Dmax = maxi,j|dij| is the maximum norm and diag(σ(B)⊙σ(K)) = Im derives from Equation (35). Note: To have this upper bound K~ needs to have full rank, which is the case if the full time continuous FT of k has no zero points. Or one has to adjust the discrete grid. Thus, constructing B by extending K to a circular matrix is a feasible approach.

4.5 Connection to CNNs

It is known that using the Fast Fourier Transform (FFT) in CNNs can decrease the computation time if the convolutional filter is big [31, 32]. In Pratt et al. [31] showed that training weights in the Fourier domain can reduce training time while maintaining efficiency. By using this, the costs of O(n2) operations could be reduced to O(nlog(n)) operations. To connect the theory of FFT-CNNs and of unrolling ISTA in the context of deconvolution (Equations 33 or 36), the gradient step can be viewed as follows:

x-γb*(k*x-y)  =x-γb*k*x+γb*y  =(e-γb*k)*x+γb*y  =f(γ)*x+γb~.    (37)

This can be interpreted as a convolutional layer with kernel f(γ) = (e − γb * k) and bias b~=b*y, where e = [1, 0, …, 0]. This means that ALISTA, with a Toeplitz matrix, can be interpreted as a CNN only two trainable parameters per layer, γ, λ. In the setting of FFT-CNN, the update rule can be formulated as

x-γb*(k*x-y)  =F-1((ê-γb^k^)x^)+γb~.

On the other hand, using FFT CNNs shows only a speed up if we deal with large data sets, i.e. by evaluating high-resolution images, or large filters, i.e., if m~ is greater than log(n).

5 Numerical examples

In the following, we are going to present numerical results achieved by the presented algorithms.1 We will investigate two MMV scenarios. First, the measurements yl are obtained with a random Gaussian matrix, and second obtained by a redcued-rank random circular convolution. In each scenario, we will enforce the Kronecker structure and thus reducing the training cost by training only low dimensional m × n matrices. Furthermore, in the convolutional setting, the circular structure will be also enforced, thus reducing the training costs even more. To have a fair comparison, all algorithms are initialized with the analytical weight matrix.

5.1 MMV setting

The training data are sampled from an unknown distribution X and generated as follows. The signals x are generated for a given number of blocks n, given block length d and a possibility if a block x[i] is active or not, i.e., if ‖x[i]‖2 ≠ 0 or ‖x[i]‖2 = 0, called pnz (probability of non-zeros). If a block is active, the elements of this blocks are given by a normal Gaussian distribution with variance σ2 = 1. The measurements y are obtained by Equation (3), where the elements of ϵ are given from a normal Gaussian distribution with variance σ2=pnz·nx/ny·10-SNRdB/10. SNRdB is the signal-to-noise ratio given in decibel. We consider the following cases. In each case, we generate x with d = 15, n = 128, m = 32, and pnz = 10%.

Gaussian measurement matrix: In the Gaussian setting, we sample a m × n matrix K iid from a Gaussian distribution with variance σ2 = 1. We normalize the columns, s.t. D has orthonormal blocks, as assumed in the beginning.

Circular convolution matrix: We construct the circular matrix as follows. At first, we generate a random iid sampled vector ã but set a certain amount of elements to zero. We define k = ℜ(F−1ã), and thus we can generate a rank deficient Matrix D = KId, where K = circ(k). Thus, yl is obtained through a circular convolution with symmetric k^=Fk. It is important for this section to generate a rank deficient matrix to have compressive observations. Otherwise, we would get a trivial problem if K, respectively D, has full rank. Computing D = KId yields the desired matrix. The properties of these two matrices can be found in Table 1.

TABLE 1
www.frontiersin.org

Table 1. Properties of matrix K in both scenarios.

5.2 Discussion

The results of the proposed methods can be found in Figure 1A for the Gaussian Problem case in Figure 1B for the circular case. We also consider the performance of a version of AMP [33]. AMP can be viewed as an Bayesian extension of ISTA with an additional Onsager correction term, before applying the thresholding operator, i.e.,

v(k)=y-Dx(k)+b(k)v(k-1)x(k+1)=ηα(x(k)+γDTv(k))

where b(k) = 𝔼[η′(x(k))]. Here, we will train only γ, α, with the same procedure as already discussed and choose DT = BT. By using BT instead of DT we are resembling Orthogonal AMP (OAMP) [34], which follows a similar idea as ALISTA. In Ma and Ping [34], B is chosen to be de-correlated with respect to K, i.e.,

tr(Iny-BTD)=0.

The analytical weight matrix B satisfies also this condition. Moreover, in Ma and Ping [34], the choice of different matrices is also discussed. Thus, untrained ALISTA with correction term can be viewed as a special case of OAMP. Note also, that the structure of Trainable ISTA, proposed in Ito et al. [35], is also based on OAMP and thus there are interrelations between AMP, unfolding ISTA, and analytical ISTA. We refer to learned AMP with analytical matrix as ALAMP. Different from Ma and Ping [34], we use the ℓ2,1-regularizer, instead of only ℓ1-regularization, since we consider the MMV setting. This is also discussed in [36, 37]. Every proposed algorithm performs better, in terms of NMSE, as their untrained original. Interestingly learned AMP, with analytical weight matrix B, has a performance almost as good as LBISTA (untied) with only a fraction of trainable parameters. This may come from the fact, that block-soft thresholding is not the correct MMSE estimator for the generated signals x and thus the correction of b(k) yields the better estimation for x. As expected, we get almost the same performance of ALBISTA and LBISTA CP (untied). In Figure 2A, we show similar plots for the justification of Theorem 1, as also seen in Liu and Chen [10]. In particular, Figure 2A shows that α(k)γ(k) is proportional to the maximal ℓ2,1-error over all training signals. An interesting behavior, which carries over from the sparse case, is that the learned ℓ2,1-regularization parameters approach zero, as k increases. If α is close to zero, we approach a least-squares problem. This means that after LBISTA found the support of the unknown signal x* the algorithm consist only of the least squares fitting. Figure 2B shows that the trained γ(k) are bound in an interval. Note that, in contrast to Liu and Chen [10], Theorem 1 is based on a more general assumption, onto the thresholding parameters. One can take a suitable κ and obtains the upper bound for the ℓ2-error and thus have convergence on the training set if the sparsity assumptions are met. Figure 3 shows the training loss over the training iterations for the results presented in Figure 1. One can see that ALBISTA needs less training iterations as LBISTA CP (untied) or LBISTA (untied) and thus less training data. The observed jumps occur when moving from one layer to the next due to the layer-wise training, Algorithm 1. A layer is defined to be optimized if NMSE converged within 1e − 5.

FIGURE 1
www.frontiersin.org

Figure 1. NMSE in dB over layers/iterations, pnz = 10%, without noise. BISTA, and FastBISTA are evaluated with α = 1 and γ = 1/(1.01‖D2). (A) Gaussian measurement problem. (B) Circular convolution problem.

FIGURE 2
www.frontiersin.org

Figure 2. Plots justifying results in Theorem 1. (A) Parameters and maximal ℓ2,1-error in the noiseless case on the problem with circular convolution matrix. (B) Trained γ over layers/iterations on the problem with Gaussian measurement matrix.

FIGURE 3
www.frontiersin.org

Figure 3. Training history for results shown in Figure 1A respectively Figure 1B. One can see that ALBISTA needs less training iterations than LBISTA CP (untied) or LBISTA (untied). (A) Gaussian measurement problem. (B) Circular convolution problem.

6 Conclusion

We proposed ALISTA for the block-sparse and MMV case, important for many real-world applications, and derived corresponding theoretical convergence and recovery results. We relaxed the conditions for the regularization parameter and thus obtained a more precise upper-bound after ALISTA is trained. Nevertheless, this is still dependent on a sharp sparsity assumption on the unknown signals. We investigated and derived a direct solution for the analytical weight matrix in the general block sparse setting as well for one convolutional scenarios. The last section provides numerical results and includes interrelations to AMP.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/janhauffen/Block-ALISTA.

Author contributions

JH, NM, and PJ contributed to the concept of the work. JH wrote the first draft of the article. NM and PJ supervised the work. All authors contributed to manuscript revision, read, and approved the submitted version.

Acknowledgments

We acknowledge support by the German Research Foundation and the Open Access Publication Fund of TU Berlin.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2023.1205959/full#supplementary-material

Footnotes

References

1. Candès EJ, Romberg J, Tao T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory. (2006) 52:489–509. doi: 10.1109/TIT.2005.862083

CrossRef Full Text | Google Scholar

2. Donoho DL. Compressed sensing. IEEE Trans Inf Theory. (2006) 52:1289–306. doi: 10.1109/TIT.2006.871582

CrossRef Full Text | Google Scholar

3. Shannon CE. Communication in the presence of noise. Proc IRE. (1949) 37:10–21. doi: 10.1109/JRPROC.1949.232969

CrossRef Full Text | Google Scholar

4. Candes EJ, Tao T. Decoding by linear programming. IEEE Trans Inf Theory. (2005) 51:4203–15. doi: 10.1109/TIT.2005.858979

CrossRef Full Text | Google Scholar

5. Rudelson M, Vershynin R. Geometric approach to error-correcting codes and reconstruction of signals. Int Mathem Res Notices. (2005) 2005:4019–41. doi: 10.1155/IMRN.2005.4019

CrossRef Full Text | Google Scholar

6. Figueiredo MA, Nowak RD, Wright SJ. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J Sel Top Signal Process. (2007) 1:586–97. doi: 10.1109/JSTSP.2007.910281

CrossRef Full Text | Google Scholar

7. Fornasier M, Rauhut H. Iterative thresholding algorithms. Appl Comput Harmon Anal. (2008) 25:187–208. doi: 10.1016/j.acha.2007.10.005

CrossRef Full Text | Google Scholar

8. Gregor K, LeCun Y. Learning fast approximations of sparse coding. In: Proceedings of the 27th International Conference on International Conference on Machine Learning (2010). p. 399–406.

Google Scholar

9. Chen X, Liu J, Wang Z, Yin W. Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In: Conference on Neural Information Processing Systems (NeurIPS 2018) (2018).

Google Scholar

10. Liu J, Chen X. ALISTA: Analytic weights are as good as learned weights in LISTA. In: International Conference on Learning Representations (ICLR) (2019).

Google Scholar

11. Chen X, Liu J, Wang Z, Yin W. Hyperparameter tuning is all you need for LISTA. In: Advances in Neural Information Processing Systems (2021). p. 34.

Google Scholar

12. Gorodnitsky IF, George JS, Rao BD. Neuromagnetic source imaging with FOCUSS: a recursive weighted minimum norm algorithm. Electroencephalogr Clin Neurophysiol. (1995) 95:231–51. doi: 10.1016/0013-4694(95)00107-A

PubMed Abstract | CrossRef Full Text | Google Scholar

13. Fengler A, Musa O, Jung P, Caire G. Pilot-based unsourced random access with a massive MIMO receiver, interference cancellation, and power control. IEEE J Select Areas Commun. (2022) 40:1522–34. doi: 10.1109/JSAC.2022.3144748

CrossRef Full Text | Google Scholar

14. Ahmadi S, Burgholzer P, Mayr G, Jung P, Caire G, Ziegler M. Photothermal super resolution imaging: a comparison of different thermographic reconstruction techniques. NDT E Int. (2020) 111:102228. doi: 10.1016/j.ndteint.2020.102228

CrossRef Full Text | Google Scholar

15. Ziniel J, Schniter P. Efficient high-dimensional inference in the multiple measurement vector problem. IEEE Trans Signal Proc. (2012) 61:340–54. doi: 10.1109/TSP.2012.2222382

CrossRef Full Text | Google Scholar

16. Chen J, Huo X. Theoretical results on sparse representations of multiple-measurement vectors. IEEE Trans Signal Proc. (2006) 54:4634–43. doi: 10.1109/TSP.2006.881263

CrossRef Full Text | Google Scholar

17. Schacke K. On the kronecker product. Master's thesis, University of Waterloo. (2004).

Google Scholar

18. Yonina C, Eldar HB. Block-sparsity: coherence and efficient recovery. arXiv preprint arXiv:08120329. (2008). doi: 10.48550/arXiv.0812.0329

CrossRef Full Text | Google Scholar

19. Donoho DL, Elad M, Temlyakov VN. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans Inf Theory. (2005) 52:6–18. doi: 10.1109/TIT.2005.860430

CrossRef Full Text | Google Scholar

20. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM Rev. (2001) 43:129–59. doi: 10.1137/S003614450037906X

CrossRef Full Text | Google Scholar

21. Kutyniok G. Compressed sensing. Mitteilungen der Deutschen Mathematiker-Vereinigung. (2014) 1:24–9. doi: 10.1515/dmvm-2014-0014

CrossRef Full Text | Google Scholar

22. Foucart S, Rauhut H. A mathematical introduction to compressive sensing. Bull Am Math. (2017) 54:151–65. doi: 10.1090/bull/1546

CrossRef Full Text | Google Scholar

23. Byrne CL. Applied Iterative Methods. Wellesley, MA: AK Peters. (2008). doi: 10.1201/9780429295492

CrossRef Full Text | Google Scholar

24. Bauschke HH, Combettes PL, Bauschke HH, Combettes PL. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York, NY: Springer. (2011). doi: 10.1007/978-1-4419-9467-7

CrossRef Full Text | Google Scholar

25. Beck A. A fast iterative shrinkage-thresholding algorithm for linear inverse problem. Soc Ind Appl Mathem. (2009) 2:183–202. doi: 10.1137/080716542

CrossRef Full Text | Google Scholar

26. Combettes PL, Pesquet JC. Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer (2011). p. 185–212. doi: 10.1007/978-1-4419-9569-8_10

CrossRef Full Text | Google Scholar

27. Kim D, Park D. Element-wise adaptive thresholds for learned iterative shrinkage thresholding algorithms. IEEE Access. (2020) 8:45874–86. doi: 10.1109/ACCESS.2020.2978237

CrossRef Full Text | Google Scholar

28. Fu R, Monardo V, Huang T, Liu Y. Deep unfolding network for block-sparse signal recovery. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2021). p. 2880–2884. doi: 10.1109/ICASSP39728.2021.9414163

CrossRef Full Text | Google Scholar

29. Musa O, Jung P, Caire G. Plug-and-play learned gaussian-mixture approximate message passing. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2021). p. 4855–4859. doi: 10.1109/ICASSP39728.2021.9414910

CrossRef Full Text | Google Scholar

30. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2015). doi: 10.48550/arXiv.1412.6980

CrossRef Full Text | Google Scholar

31. Pratt H, Williams B, Coenen F, Zheng Y. FCNN: Fourier convolutional neural networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer (2017). p. 786–798. doi: 10.1007/978-3-319-71249-9_47

CrossRef Full Text | Google Scholar

32. Chitsaz K, Hajabdollahi M, Karimi N, Samavi S, Shirani S. Acceleration of convolutional neural network using FFT-based split convolutions. arXiv preprint arXiv:200312621. (2020). doi: 10.48550/arXiv.2003.12621

CrossRef Full Text | Google Scholar

33. Donoho DL, Maleki A, Montanari A. Message-passing algorithms for compressed sensing. Proc Nat Acad Sci. (2009) 106:18914–9. doi: 10.1073/pnas.0909892106

PubMed Abstract | CrossRef Full Text | Google Scholar

34. Ma J, Ping L. Orthogonal amp. IEEE Access. (2017) 5:2020–33. doi: 10.1109/ACCESS.2017.2653119

CrossRef Full Text | Google Scholar

35. Ito D, Takabe S, Wadayama T. Trainable ISTA for sparse signal recovery. IEEE Trans Signal Proc. (2019) 67:3113–25. doi: 10.1109/TSP.2019.2912879

CrossRef Full Text | Google Scholar

36. Kim J, Chang W, Jung B, Baron D, Ye JC. Belief propagation for joint sparse recovery. arXiv preprint arXiv:1102.3289. (2011). doi: 10.48550/arXiv.1102.3289

CrossRef Full Text | Google Scholar

37. Chen Z, Sohrabi F, Yu W. Sparse activity detection for massive connectivity. IEEE Trans Signal Proc. (2018) 66:1890–904. doi: 10.1109/TSP.2018.2795540

CrossRef Full Text | Google Scholar

Keywords: unfolding, compressed sensing, multiple measurement vector problem, deep learning, block-sparsity

Citation: Hauffen JC, Jung P and Mücke N (2023) Algorithm unfolding for block-sparse and MMV problems with reduced training overhead. Front. Appl. Math. Stat. 9:1205959. doi: 10.3389/fams.2023.1205959

Received: 14 April 2023; Accepted: 17 October 2023;
Published: 11 December 2023.

Edited by:

Holger Rauhut, RWTH Aachen University, Germany

Reviewed by:

Junhong Lin, Zhejiang University, China
Stefan Kunis, Osnabrück University, Germany

Copyright © 2023 Hauffen, Jung and Mücke. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jan Christian Hauffen, ai5oYXVmZmVuQHR1LWJlcmxpbi5kZQ==

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.