
ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 07 September 2021
Sec. Mathematics of Computation and Data Science
This article is part of the Research Topic High-Performance Tensor Computations in Scientific Computing and Data Science.

A Block-Sparse Tensor Train Format for Sample-Efficient High-Dimensional Polynomial Regression

  • Department of Mathematics, Technische Universität Berlin, Berlin, Germany

Low-rank tensors are an established framework for the parametrization of multivariate polynomials. We propose to extend this framework by including the concept of block-sparsity to efficiently parametrize homogeneous, multivariate polynomials with low-rank tensors. This provides a representation of general multivariate polynomials as a sum of homogeneous, multivariate polynomials, represented by block-sparse, low-rank tensors. We show that this sum can be concisely represented by a single block-sparse, low-rank tensor.

We further prove cases where low-rank tensors are particularly well suited, by showing that for banded symmetric tensors of homogeneous polynomials the block sizes in the block-sparse multivariate polynomial space can be bounded independently of the number of variables.

We showcase this format by applying it to high-dimensional least squares regression problems where it demonstrates improved computational resource utilization and sample efficiency.

1 Introduction

An important problem in many applications is the identification of a function from measurements or random samples. For this problem to be well-posed, some prior information about the function has to be assumed and a common requirement is that the function can be approximated in a finite dimensional ansatz space. For the purpose of extracting governing equations the most famous approach in recent years has been SINDy [1]. However, the applicability of SINDy to high-dimensional problems is limited since truly high-dimensional problems require a nonlinear parameterization of the ansatz space. One particular reparametrization that has proven itself in many applications is given by tensor networks. These allow for a straightforward extension of SINDy [2] but can also encode additional structure as presented in [3]. The compressive capabilities of tensor networks originate from this ability to exploit additional structure like smoothness, locality or self-similarity and they have hence been used in solving high-dimensional equations [4–7]. In the context of optimal control, tensor train networks have been utilized for solving the Hamilton–Jacobi–Bellman equation in [8,9], for solving backward stochastic differential equations in [10] and for the calculation of stock option prices in [11,12]. In the context of uncertainty quantification they are used in [13–15] and in the context of image classification they are used in [16,17].

A common thread in these publications is the parametrization of a high-dimensional ansatz space by a tensor train network which is then optimized. In most cases this means that the least-squares error of the parametrized function with respect to the data is minimized. There exist many methods to perform this minimization. A well-known algorithm in the mathematics community is the alternating least-squares (ALS) [18,19], which is related to the famous DMRG method [20] for solving the Schrödinger equation in quantum physics. Although not directly suitable for recovery tasks, it became apparent that DMRG and ALS can be adapted to work in this context. Two of these extensions of the ALS algorithm are the stabilized ALS approximation (SALSA) [21] and the block alternating steepest descent for recovery (bASD) algorithm [13]. Both adapt the tensor network ranks and are better suited to the problem of data identification. Since the set of tensor trains of fixed rank forms a manifold [22] it is also possible to employ gradient-based optimization schemes [48]. This however is not a path that we pursue in this work. Our contribution extends the ALS (and SALSA) algorithm and we believe that it can be applied to many of the fields stated above.

In this work we consider ansatz spaces of homogeneous polynomials of fixed degree and their extension to polynomials of bounded degree. We introduce the concept of block-sparsity as an efficient way to parametrize homogeneous polynomials with low-rank tensors. Although this is not the first instance in which sparsity is used in the context of low-rank tensors (see [24–26]), we believe that this is the first time block-sparsity is used to parametrize homogeneous polynomials. The sparsity used in the previous works is substantially different from the block-sparsity discussed in this work. Block-sparsity is preserved under most tensor network operations, such as summation, orthogonalization, rounding and the parametrization of tangent spaces, which is not the case for standard sparsity. This is important since orthogonalization is an essential part of numerically stable and efficient optimization schemes and means that most of the existing tensor methods, like the HSVD (see [27]), ALS, SALSA or Riemannian optimization, can be performed in this format. We also show that, if the symmetric tensor of a homogeneous polynomial is banded, it can be represented very efficiently in the tensor train format, since the sizes of the non-zero blocks can be bounded independently of the number of variables. In physics this property can be associated with the property of locality, which can be used to identify cases where tensor trains work exceptionally well.

Quantum physicists have used the concept of block-sparsity for at least a decade [28] but it was introduced to the mathematics community only recently in [29]. In the language of quantum mechanics one would say that there exists an operator for which the coefficient tensor of any homogeneous polynomial is an eigenvector. This encodes a symmetry, where the eigenvalue of this eigenvector is the degree of the homogeneous polynomial, which acts as a quantum number and corresponds to the particle number of bosons and fermions.

The presented approach is very versatile and can be combined with many polynomial approximation strategies like the use of Taylor’s theorem in [30] and there exist many approximation theoretic results that ensure a good approximation with a low degree polynomial for many classes of functions (see e.g. [31]).

In addition to the approximation theoretic results, we can motivate these polynomial spaces by thinking about the sample complexity for successful recovery in the case of regression problems. In [32] it was shown that for tensor networks the sample complexity, meaning the number of data points needed, is related to the dimension of the high-dimensional ansatz space. But these huge sample sizes are not needed in most practical examples [14]. This suggests that the regularity of the sought function must have a strong influence on the number of samples that are required. However, for most practical applications, suitable regularity guarantees cannot be made, neither for the best approximation, nor for the initial guess, nor for any iterate of the optimization process. By restricting ourselves to spaces of homogeneous polynomials, the gap between the observed sample complexity and the proven worst-case bound is reduced.

In the regression setting, this means that we kill two birds with one stone. By applying block-sparsity to the coefficient tensor we can restrict the ansatz space to well-behaved functions which can be identified with a reasonable sample size. At the same time we reduce the number of parameters and speed up the least-squares minimization task. Finally, note that this parametrization allows practitioners to devise algorithms that are adaptive in the degree of the polynomial, thereby increasing the computational resource utilization even further. This solves a real problem in practical applications where the additional and unnecessary degrees of freedom of conventional low-rank tensor formats cause many optimization algorithms to get stuck in local minima.

The remainder of this work is structured as follows. Notation introduces basic tensor notation and the different parametrizations of polynomials that are used in this work and then formulates the associated least-squares problems. In Theoretical Foundation we state the known results on sampling complexity and block sparsity. Furthermore, we set the two results in relation and argue why this leads to more favorable ansatz spaces. This includes a proof of rank bounds for a class of homogeneous polynomials which can be represented particularly efficiently as tensor trains. Method Description derives two parametrizations from the results of Theoretical Foundation and presents the algorithms that are used to solve the associated least-squares problems. Finally, Numerical Results gives some numerical results for different classes of problems, focusing on the comparison of the sample complexity for the full and sub-spaces. Most notable is the recovery of a quantity of interest for a parametric PDE, where our approach achieves successful recovery with relatively few parameters and samples. We observed that for suitable problems the number of parameters can be reduced by a factor of almost 10.

2 Notation

In our opinion, using a graphical notation for the involved contractions in a tensor network drastically simplifies the expressions, making the whole setup more approachable. This section introduces this graphical notation for tensor networks, the spaces that will be used in the remainder of this work and the regression framework.

2.1 Tensors and Indices

Definition 2.1. Let $d \in \mathbb{N}$. Then $n = (n_1, \ldots, n_d) \in \mathbb{N}^d$ is called a dimension tuple of order $d$ and $x \in \mathbb{R}^{n_1 \times \cdots \times n_d} =: \mathbb{R}^n$ is called a tensor of order $d$ and dimension $n$. Let $\mathbb{N}_n = \{1, \ldots, n\}$. Then a tuple $(l_1, \ldots, l_d) \in \mathbb{N}_{n_1} \times \cdots \times \mathbb{N}_{n_d} =: \mathbb{N}_n$ is called a multi-index and the corresponding entry of $x$ is denoted by $x(l_1, \ldots, l_d)$. The positions $1, \ldots, d$ of the indices $l_1, \ldots, l_d$ in the expression $x(l_1, \ldots, l_d)$ are called modes of $x$.

To define further operations on tensors it is often useful to associate each mode with a symbolic index.

Definition 2.2. A symbolic index $i$ of dimension $n$ is a placeholder for an arbitrary but fixed natural number between 1 and $n$. For a dimension tuple $n$ of order $d$ and a tensor $x \in \mathbb{R}^n$ we may write $x(i_1, \ldots, i_d)$ and tacitly assume that $i_k$ are indices of dimension $n_k$ for each $k = 1, \ldots, d$. When standing for itself this notation means $x(i_1, \ldots, i_d) = x \in \mathbb{R}^n$ and may be used to slice the tensor

$$x(i_1, l_2, \ldots, l_d) \in \mathbb{R}^{n_1},$$

where $l_k \in \mathbb{N}_{n_k}$ are fixed indices for all $k = 2, \ldots, d$. For any dimension tuple $n$ of order $d$ we define the symbolic multi-index $i_n = (i_1, \ldots, i_d)$ of dimension $n$ where $i_k$ is a symbolic index of dimension $n_k$ for all $k = 1, \ldots, d$.

Remark 2.3. We use the letters i and j (with appropriate subscripts) for symbolic indices while reserving the letters k, l and m for ordinary indices.

Example 2.4. Let $x$ be an order 2 tensor with mode dimensions $n_1$ and $n_2$, i.e. an $n_1$-by-$n_2$ matrix. Then $x(1, j)$ denotes the first row of $x$ and $x(i, 2)$ denotes the second column of $x$.

Inspired by Einstein notation we use the concept of symbolic indices to define different operations on tensors.

Definition 2.5. Let $i_1$ and $i_2$ be (symbolic) indices of dimension $n_1$ and $n_2$, respectively, and let $\varphi$ be a bijection

$$\varphi : \mathbb{N}_{n_1} \times \mathbb{N}_{n_2} \to \mathbb{N}_{n_1 n_2}.$$

We then define the product of indices with respect to $\varphi$ as $j = \varphi(i_1, i_2)$ where $j$ is a (symbolic) index of dimension $n_1 n_2$. In most cases the choice of bijection is not important and we will write $i_1 i_2 \coloneqq \varphi(i_1, i_2)$ for an arbitrary but fixed bijection $\varphi$. For a tensor $x$ of dimension $(n_1, n_2)$ the expression

$$y(i_1 i_2) = x(i_1, i_2)$$

defines the tensor $y$ of dimension $(n_1 n_2)$ while the expression

$$x(i_1, i_2) = y(i_1 i_2)$$

defines $x \in \mathbb{R}^{n_1 \times n_2}$ from $y \in \mathbb{R}^{n_1 n_2}$.

Definition 2.6. Consider the tensors $x \in \mathbb{R}^{n_1\times a\times n_2}$ and $y \in \mathbb{R}^{n_3\times b\times n_4}$. Then the expression

$$z(i_{n_1}, i_{n_2}, j_1, j_2, i_{n_3}, i_{n_4}) = x(i_{n_1}, j_1, i_{n_2})\, y(i_{n_3}, j_2, i_{n_4}) \tag{1}$$

defines the tensor $z \in \mathbb{R}^{n_1\times n_2\times a\times b\times n_3\times n_4}$ in the obvious way. Similarly, for $a = b$ the expression

$$z(i_{n_1}, i_{n_2}, j, i_{n_3}, i_{n_4}) = x(i_{n_1}, j, i_{n_2})\, y(i_{n_3}, j, i_{n_4}) \tag{2}$$

defines the tensor $z \in \mathbb{R}^{n_1\times n_2\times a\times n_3\times n_4}$. Finally, also for $a = b$ the expression

$$z(i_{n_1}, i_{n_2}, i_{n_3}, i_{n_4}) = x(i_{n_1}, j, i_{n_2})\, y(i_{n_3}, j, i_{n_4}) \tag{3}$$

defines the tensor $z \in \mathbb{R}^{n_1\times n_2\times n_3\times n_4}$ as

$$z(i_{n_1}, i_{n_2}, i_{n_3}, i_{n_4}) = \sum_{k=1}^{a} x(i_{n_1}, k, i_{n_2})\, y(i_{n_3}, k, i_{n_4}).$$

We choose this description mainly because of its simplicity and how it relates to the implementation of these operations in the numeric libraries numpy [33] and xerus [34].
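For readers who think in code, the three operations of Definition 2.6 map directly onto numpy's einsum. The following small sketch is our own illustration (not taken from the paper's code or from xerus) and uses single modes in place of the dimension tuples $n_1,\ldots,n_4$:

```python
import numpy as np

x = np.random.rand(2, 5, 3)   # shape (n1, a, n2)
y = np.random.rand(4, 7, 6)   # shape (n3, b, n4)

# Operation (1): no shared index, result has shape (n1, n2, a, b, n3, n4).
z1 = np.einsum('iam,kbn->imabkn', x, y)
assert z1.shape == (2, 3, 5, 7, 4, 6)

# Operation (2): shared index j that is kept (element-wise along j), a = b.
y2 = np.random.rand(4, 5, 6)  # shape (n3, a, n4)
z2 = np.einsum('iam,kan->imakn', x, y2)
assert z2.shape == (2, 3, 5, 4, 6)

# Operation (3): shared index j that is summed over (contraction), a = b.
z3 = np.einsum('iam,kan->imkn', x, y2)
assert z3.shape == (2, 3, 4, 6)
```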

2.2 Graphical Notation and Tensor Networks

This section will introduce the concept of tensor networks [35] and a graphical notation for certain operations which will simplify working with these structures. To this end we reformulate the operations introduced in the last section in terms of nodes, edges and half-edges.

Definition 2.7. For a dimension tuple $n$ of order $d$ and a tensor $x \in \mathbb{R}^n$ the graphical representation of $x$ is given by a node with $d$ half-edges (diagram omitted), where the node represents the tensor and the half-edges represent the $d$ different modes of the tensor, illustrated by the symbolic indices $i_1, \ldots, i_d$.

With this definition we can write the reshapings of Definition 2.5 simply as

[Diagram omitted.]

and also simplify the binary operations of Definition 2.6.

Definition 2.8. Let $x \in \mathbb{R}^{n_1\times a\times n_2}$ and $y \in \mathbb{R}^{n_3\times b\times n_4}$ be two tensors. Then Operation Eq. 1 is represented by

[Diagram omitted.]

and defines $z \in \mathbb{R}^{\cdots\times a\times b\times\cdots}$. For $a = b$ Operation Eq. 2 is represented by

[Diagram omitted.]

and defines $z \in \mathbb{R}^{\cdots\times a\times\cdots}$, and Operation Eq. 3 defines $z \in \mathbb{R}^{\cdots}$ by

[Diagram omitted.]

With these definitions we can compose entire networks of multiple tensors which are called tensor networks.

2.3 The Tensor Train Format

A prominent example of a tensor network is the tensor train (TT) [19,36], which is the main tensor network used throughout this work. This network is discussed in the following subsection.

Definition 2.9. Let $n$ be a dimension tuple of order $d$. The TT format decomposes an order $d$ tensor $x \in \mathbb{R}^n$ into $d$ component tensors $x_k \in \mathbb{R}^{r_{k-1}\times n_k\times r_k}$ for $k = 1, \ldots, d$ with $r_0 = r_d = 1$. This can be written in tensor network formula notation as

$$x(i_1,\ldots,i_d) = x_1(i_1, j_1)\, x_2(j_1, i_2, j_2)\cdots x_d(j_{d-1}, i_d).$$

The tuple $(r_1, \ldots, r_{d-1})$ is called the representation rank of this representation.

In graphical notation the TT format is depicted as a chain of component nodes that are connected via the rank indices $j_1, \ldots, j_{d-1}$.

[Diagram omitted.]

Remark 2.10. Note that this representation is not unique. For any pair of matrices $(A, B)$ that satisfies $AB = \mathrm{Id}$ we can replace $x_k$ by $x_k(i_1, i_2, j)\, A(j, i_3)$ and $x_{k+1}$ by $B(i_1, j)\, x_{k+1}(j, i_2, i_3)$ without changing the tensor $x$.

The representation rank of x is therefore dependent on the specific representation of x as a TT, hence the name. Analogous to the concept of matrix rank we can define a minimal necessary rank that is required to represent a tensor x in the TT format.

Definition 2.11. The tensor train rank of a tensor $x \in \mathbb{R}^n$ with tensor train components $x_1 \in \mathbb{R}^{n_1\times r_1}$, $x_k \in \mathbb{R}^{r_{k-1}\times n_k\times r_k}$ for $k = 2, \ldots, d-1$ and $x_d \in \mathbb{R}^{r_{d-1}\times n_d}$ is the tuple

$$\operatorname{TT-rank}(x) = (r_1, \ldots, r_{d-1})$$

of minimal $r_k$'s such that the $x_k$ compose $x$.

In [[22], Theorem 1a] it is shown that the TT-rank can be computed by simple matrix operations. Namely, $r_k$ can be computed by joining the first $k$ indices and the remaining $d-k$ indices and computing the rank of the resulting matrix.
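This characterization can be turned into a few lines of numpy directly. The sketch below is our own illustration and computes the TT-rank of a full tensor via the ranks of its unfoldings; it is only feasible for small tensors since it forms the full unfolding matrices:

```python
import numpy as np

def tt_rank(x):
    """Compute the TT-rank of a full tensor x via the ranks of its unfoldings."""
    d = x.ndim
    ranks = []
    for k in range(1, d):
        # Join the first k modes and the remaining d - k modes.
        unfolding = x.reshape(int(np.prod(x.shape[:k])), -1)
        ranks.append(np.linalg.matrix_rank(unfolding))
    return tuple(ranks)

# A rank-one tensor of order 3 has TT-rank (1, 1).
a, b, c = np.random.rand(4), np.random.rand(5), np.random.rand(6)
x = np.einsum('i,j,k->ijk', a, b, c)
print(tt_rank(x))  # (1, 1)
```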

At last, we need to introduce the concept of left and right orthogonality for the tensor train format.

Definition 2.12. Let $m$ be a dimension tuple of order $d$ and $n \in \mathbb{N}$. We call a tensor $x \in \mathbb{R}^{m\times n}$ of order $d+1$ left orthogonal if

$$x(i_m, j_1)\, x(i_m, j_2) = \mathrm{Id}(j_1, j_2).$$

Similarly, for $m \in \mathbb{N}$ and a dimension tuple $n$ of order $d$, we call a tensor $x \in \mathbb{R}^{m\times n}$ of order $d+1$ right orthogonal if

$$x(i_1, j_n)\, x(i_2, j_n) = \mathrm{Id}(i_1, i_2).$$

A tensor train is left orthogonal if all component tensors $x_1, \ldots, x_{d-1}$ are left orthogonal. It is right orthogonal if all component tensors $x_2, \ldots, x_d$ are right orthogonal.

Lemma 2.1 [36]. For every tensor $x \in \mathbb{R}^n$ of order $d$ we can find left and right orthogonal decompositions.
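In practice such decompositions are obtained by sweeping QR decompositions over the components, as in the TT-SVD/HSVD. The following sketch of a single left-orthogonalization step is our own illustration and uses the mechanism of Remark 2.10 (the non-orthogonal QR factor is pushed into the next component):

```python
import numpy as np

def left_orthogonalize_step(x_k, x_next):
    """Make component x_k (shape (r_prev, n_k, r_k)) left orthogonal and
    push the remaining factor into x_next (shape (r_k, n_next, r_next))."""
    r_prev, n_k, r_k = x_k.shape
    q, r = np.linalg.qr(x_k.reshape(r_prev * n_k, r_k))
    new_rank = q.shape[1]
    x_k_new = q.reshape(r_prev, n_k, new_rank)
    x_next_new = np.einsum('ij,jkl->ikl', r, x_next)
    return x_k_new, x_next_new

# Sanity check: the contraction of the two components is unchanged.
x_k, x_next = np.random.rand(2, 4, 3), np.random.rand(3, 4, 2)
y_k, y_next = left_orthogonalize_step(x_k, x_next)
assert np.allclose(np.einsum('iaj,jbk->iabk', x_k, x_next),
                   np.einsum('iaj,jbk->iabk', y_k, y_next))
# Left orthogonality: the left unfolding has orthonormal columns.
q = y_k.reshape(-1, y_k.shape[-1])
assert np.allclose(q.T @ q, np.eye(q.shape[1]))
```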

For technical purposes it is also useful to define the so-called interface tensors, which are based on left and right orthogonal decompositions.

Definition 2.13. Let $x$ be a tensor train of order $d$ with rank tuple $r$.

For every $k = 1, \ldots, d$ and $\ell = 1, \ldots, r_k$, the $\ell$-th left interface vector is given by

$$\tau_{\le k,\ell}(x)(i_1, \ldots, i_k) = x_1(i_1, j_1)\cdots x_k(j_{k-1}, i_k, \ell),$$

where $x$ is assumed to be left orthogonal. The $\ell$-th right interface vector is given by

$$\tau_{\ge k+1,\ell}(x)(i_{k+1}, \ldots, i_d) = x_{k+1}(\ell, i_{k+1}, j_{k+1})\cdots x_d(j_{d-1}, i_d),$$

where $x$ is assumed to be right orthogonal.

2.4 Sets of Polynomials

In this section we specify the setup for our method and define the majority of the different sets of polynomials that are used. We start by defining dictionaries of one dimensional functions which we then use to construct the different sets of high-dimensional functions.

Definition 2.14. Let $p \in \mathbb{N}$ be given. A function dictionary of size $p$ is a vector-valued function $\Psi = (\Psi_1, \ldots, \Psi_p) : \mathbb{R} \to \mathbb{R}^p$.

Example 2.15. Two simple examples of a function dictionary that we use in this work are given by the monomial basis of dimension $p$, i.e.

$$\Psi_{\text{monomial}}(x) = \begin{pmatrix} 1 & x & x^2 & \cdots & x^{p-1}\end{pmatrix}^\intercal \tag{4}$$

and by the basis of the first $p$ Legendre polynomials, i.e.

$$\Psi_{\text{Legendre}}(x) = \begin{pmatrix} 1 & x & \tfrac{1}{2}(3x^2 - 1) & \tfrac{1}{2}(5x^3 - 3x) & \cdots \end{pmatrix}^\intercal. \tag{5}$$
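Both dictionaries can be evaluated with standard numpy routines; in particular, numpy.polynomial.legendre.legvander produces exactly the unnormalized Legendre polynomials of Eq. 5. The helper functions below are our own sketch and not part of the paper's code:

```python
import numpy as np

def psi_monomial(x, p):
    """Monomial dictionary of Eq. 4: (1, x, x^2, ..., x^(p-1))."""
    return np.asarray(x)[..., None] ** np.arange(p)

def psi_legendre(x, p):
    """Legendre dictionary of Eq. 5: the first p Legendre polynomials."""
    return np.polynomial.legendre.legvander(np.asarray(x), p - 1)

x = np.array([-1.0, 0.0, 0.5, 1.0])
print(psi_monomial(x, 4))   # rows: (1, x, x^2, x^3)
print(psi_legendre(x, 4))   # rows: (1, x, (3x^2-1)/2, (5x^3-3x)/2)
```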

Using function dictionaries we can define the following high-dimensional space of multivariate functions. Let $\Psi$ be a function dictionary of size $p \in \mathbb{N}$. The $d$-th order product space that corresponds to the function dictionary $\Psi$ is the linear span

$$V_p^d \coloneqq \Bigl\langle\, \textstyle\prod_{k=1}^{d} \Psi_{m_k}(x_k) \,:\, m \in \mathbb{N}_p^d \Bigr\rangle. \tag{6}$$

This means that every function $u \in V_p^d$ can be written as

$$u(x_1,\ldots,x_d) = c(i_1,\ldots,i_d)\,\prod_{k=1}^{d} \Psi(x_k)(i_k) \tag{7}$$

with a coefficient tensor $c \in \mathbb{R}^p$, where $p = (p,\ldots,p)$ is a dimension tuple of order $d$. Note that Eq. 7 uses the index notation from Definition 2.6 with arbitrary but fixed $x_k$'s. Since $\mathbb{R}^p$ is an intractably large space, it makes sense for numerical purposes to consider the subset

$$\mathcal{T}_r(V_p^d) \coloneqq \bigl\{u \in V_p^d : \operatorname{TT-rank}(c) \le r\bigr\} \tag{8}$$

where the TT-rank of the coefficient is bounded. Every $u \in \mathcal{T}_r(V_p^d)$ can thus be represented graphically by contracting the TT components $C_k$ of $c$ with the dictionary vectors $\Psi(x_k)$ (Eq. 9; diagram omitted), where the $C_k$'s are the components of the tensor train representation of the coefficient tensor $c \in \mathbb{R}^p$ of $u \in V_p^d$.

Remark 2.16. In this way every tensor $c \in \mathbb{R}^p$ (in the tensor train format) corresponds one-to-one to a function $u \in V_p^d$.
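Concretely, evaluating $u$ at a point $x$ with a TT coefficient tensor amounts to contracting each component with the dictionary vector $\Psi(x_k)$ and accumulating small rank-sized intermediate vectors. The following sketch is our own illustration (the function names are hypothetical):

```python
import numpy as np

def evaluate(components, psi, x):
    """Evaluate u(x) = c(i_1,...,i_d) prod_k Psi(x_k)(i_k) for a TT coefficient c.

    components[k] has shape (r_{k-1}, p, r_k) with r_0 = r_d = 1 and
    psi(t) returns the dictionary vector of length p at the point t."""
    result = np.ones((1,))
    for C_k, x_k in zip(components, x):
        # contract the rank mode with the running vector and the physical mode with Psi(x_k)
        result = np.einsum('i,ijk,j->k', result, C_k, psi(x_k))
    return result[0]

# Example: d = 3, p = 2, monomial dictionary; this rank-1 TT encodes u(x) = x1*x2*x3.
psi = lambda t: np.array([1.0, t])
components = [np.array([[[0.0], [1.0]]]) for _ in range(3)]  # shape (1, 2, 1)
print(evaluate(components, psi, [2.0, 3.0, 4.0]))  # 24.0
```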

An important subspace of $V_p^d$ is the space of homogeneous polynomials. For the purpose of this paper we define the subspace of homogeneous polynomials of degree $g$ as the space

$$W_g^d \coloneqq \Bigl\langle\, \textstyle\prod_{k=1}^{d}\Psi_{m_k}(x_k) \,:\, m \in \mathbb{N}_p^d \text{ and } \textstyle\sum_{k=1}^{d} m_k = d + g \Bigr\rangle, \tag{10}$$

where again $\langle\,\cdot\,\rangle$ is the linear span. From this definition it is easy to see that a homogeneous polynomial of degree $g$ can be represented as an element of $V_p^d$ where the coefficient tensor $c$ satisfies

$$c(m_1,\ldots,m_d) = 0, \quad\text{if } \textstyle\sum_{k=1}^{d} m_k \neq d + g.$$

In Theoretical Foundation we will introduce an efficient representation of such coefficient tensors c in a block sparse tensor format.

Using $W_g^d$ we can also define the space of polynomials of degree at most $g$ by

$$S_g^d = \bigoplus_{\tilde g = 0}^{g} W_{\tilde g}^d. \tag{11}$$

Based on this characterization we will define a block-sparse tensor train version of this space in Theoretical Foundation.

2.5 Parametrizing Homogeneous Polynomials by Symmetric Tensors

In algebraic geometry the space $W_g^d$ is classically considered only for the dictionary $\Psi_{\text{monomial}}$ of monomials and is typically parameterized by a symmetric tensor

$$u(x) = B(i_1,\ldots,i_g)\,x(i_1)\cdots x(i_g), \qquad x \in \mathbb{R}^d, \tag{12}$$

where $d = (d,\ldots,d)$ is a dimension tuple of order $g$ and $B \in \mathbb{R}^d$ satisfies $B(m_1,\ldots,m_g) = B(\sigma(m_1,\ldots,m_g))$ for every permutation $\sigma$ in the symmetric group $S_g$. We conclude this section by showing how the representation Eq. 7 can be calculated from the symmetric tensor representation Eq. 12, and vice versa. By equating coefficients we find that for every $(m_1,\ldots,m_d) \in \mathbb{N}_p^d$ either $m_1 + \cdots + m_d \neq d + g$ and $c(m_1,\ldots,m_d) = 0$ or

$$c(m_1,\ldots,m_d) = \sum_{\nu\in\{\sigma(n)\,:\,\sigma\in S_g\}} B(\nu) \qquad\text{where}\qquad (n_1,\ldots,n_g) = (\underbrace{1,\ldots,1}_{m_1-1\text{ times}},\underbrace{2,\ldots,2}_{m_2-1\text{ times}},\ldots) \in \mathbb{N}_d^g.$$

Since $B$ is symmetric the sum simplifies to

$$\sum_{\nu\in\{\sigma(n)\,:\,\sigma\in S_g\}} B(\nu) = \binom{g}{m_1-1,\ldots,m_d-1}\,B(n_1,\ldots,n_g).$$

From this it follows that for $(n_1,\ldots,n_g) \in \mathbb{N}_d^g$

$$B(n_1,\ldots,n_g) = \binom{g}{m_1-1,\ldots,m_d-1}^{-1} c(m_1,\ldots,m_d) \qquad\text{where}\qquad m_k = 1 + \sum_{\ell=1}^{g}\delta_{k,n_\ell}\ \text{ for all } k = 1,\ldots,d,$$

and $\delta_{k,\ell}$ denotes the Kronecker delta. This demonstrates how our approach can alleviate, in a very simple fashion, the difficulties that arise when symmetric tensors are represented in the hierarchical Tucker format [37].
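For $g = 2$ these formulas simply state that the coefficient tensor $c$ stores each off-diagonal entry of the symmetric matrix $B$ with the multinomial factor 2 and each diagonal entry with factor 1. The following sketch (our own code, monomial dictionary, $d = 3$) verifies that both representations evaluate to the same value:

```python
import numpy as np

d, g, p = 3, 2, 3            # p = g + 1 as assumed later in the paper
B = np.random.rand(d, d)
B = 0.5 * (B + B.T)          # symmetric coefficient tensor of Eq. 12

# Build the order-d coefficient tensor c of Eq. 7 from B.
c = np.zeros((p,) * d)
for i in range(d):
    for j in range(d):
        m = np.zeros(d, dtype=int)   # 0-based degrees m_k - 1 of each variable
        m[i] += 1
        m[j] += 1
        c[tuple(m)] += B[i, j]       # the double loop produces the multinomial factor

x = np.random.rand(d)
u_sym = x @ B @ x                                   # evaluation via Eq. 12
psi = x[:, None] ** np.arange(p)                    # monomial dictionary per variable
u_tt = np.einsum('abc,a,b,c->', c, psi[0], psi[1], psi[2])   # evaluation via Eq. 7
assert np.isclose(u_sym, u_tt)
```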

2.6 Least Squares

Let in the following $V_p^d$ be the product space of a function dictionary $\Psi$ such that $V_p^d \subseteq L^2(\Omega)$. Consider a high-dimensional function $f \in L^2(\Omega)$ on some domain $\Omega \subseteq \mathbb{R}^d$ and assume that the point-wise evaluation $f(x)$ is well-defined for $x \in \Omega$. In practice it is often possible to choose $\Omega$ as a product domain $\Omega = \Omega_1\times\Omega_2\times\cdots\times\Omega_d$ by extending $f$ accordingly. To find the best approximation $u_W$ of $f$ in the space $W \subseteq V_p^d$ we then need to solve the problem

$$u_W = \operatorname*{arg\,min}_{u\in W}\ \|f - u\|_{L^2(\Omega)}^2. \tag{13}$$

A practical problem that often arises when computing $u_W$ is that computing the $L^2(\Omega)$-norm is intractable for large $d$. Instead of using classical quadrature rules one often resorts to a Monte Carlo estimation of the high-dimensional integral. This means one draws $M$ random samples $\{x^{(m)}\}_{m=1,\ldots,M}$ from $\Omega$ and estimates

$$\|f - u\|_{L^2(\Omega)}^2 \approx \frac{1}{M}\sum_{m=1}^{M}\bigl\|f(x^{(m)}) - u(x^{(m)})\bigr\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm. With this approximation we can define an empirical version of $u_W$ as

$$u_{W,M} = \operatorname*{arg\,min}_{u\in W}\ \frac{1}{M}\sum_{m=1}^{M}\bigl\|f(x^{(m)}) - u(x^{(m)})\bigr\|_F^2. \tag{14}$$

For a linear space $W$, computing $u_{W,M}$ amounts to solving a linear system and does not pose an algorithmic problem. We use the remainder of this section to comment on the minimization problem Eq. 14 when a set of tensor trains is used instead.

Given samples $(x^{(m)})_{m=1,\ldots,M}$ we can evaluate $u \in V_p^d$ for each $x^{(m)} = (x_1^{(m)},\ldots,x_d^{(m)})$ using Eq. 7. If the coefficient tensor $c$ of $u$ can be represented in the TT format then we can use Eq. 9 to perform this evaluation efficiently for all samples $(x^{(m)})_{m=1,\ldots,M}$ at once. For this we introduce for each $k = 1,\ldots,d$ the matrix

$$\Xi_k = \begin{pmatrix}\Psi(x_k^{(1)}) & \cdots & \Psi(x_k^{(M)})\end{pmatrix} \in \mathbb{R}^{p\times M}. \tag{15}$$

Then the $M$-dimensional vector of evaluations of $u$ at all given sample points is given by the contraction of the TT components $C_k$ with the matrices $\Xi_k$ (diagram omitted), where we use Operation Eq. 2 to join the different $M$-dimensional indices.

The alternating least-squares algorithm cyclically updates each component tensor $C_k$ by minimizing the residual corresponding to this contraction. To formalize this we define the operator $\Phi_k \in \mathbb{R}^{M\times r_{k-1}\times n_k\times r_k}$ as the contraction of the evaluation network above with the component $C_k$ removed (diagram omitted). (16)

Then the update for $C_k$ is given by a minimal residual solution of the linear system

$$C_k = \operatorname*{arg\,min}_{C\in\mathbb{R}^{r_{k-1}\times n_k\times r_k}}\ \bigl\|\Phi_k(j, i_1, i_2, i_3)\,C(i_1, i_2, i_3) - F(j)\bigr\|_F^2,$$

where $F(m) \coloneqq y^{(m)} \coloneqq f(x^{(m)})$ and $i_1, i_2, i_3, j$ are symbolic indices of dimensions $r_{k-1}$, $n_k$, $r_k$ and $M$, respectively. The particular algorithm that is used for this minimization may be adapted to the problem at hand. These contractions are the basis for our algorithms in Method Description. We refer to [19] for more details on the ALS algorithm.

Note that it is possible to reuse parts of the contractions in $\Phi_k$ through so-called stacks. In this way the entire contraction does not have to be recomputed for every $k$. In the graphical representation of Eq. 16, dashed boxes mark the parts of the contraction that can be reused. Details on this can be found in [38].
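To make the structure of a single ALS micro-step concrete, the following sketch (our own simplified code, without block-sparsity, stacks, regularization or rank adaption) assembles $\Phi_k$ for the middle component of an order-3 coefficient tensor and solves the resulting least-squares problem:

```python
import numpy as np

# Problem data: M samples of a function of d = 3 variables, dictionary size p.
M, p = 200, 4
X = np.random.uniform(-1, 1, size=(M, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2                              # toy target
Xi = [(X[:, k][:, None] ** np.arange(p)).T for k in range(3)]     # Eq. 15, shape (p, M)

# Random TT components of shapes (1, p, r1), (r1, p, r2), (r2, p, 1).
r1, r2 = 3, 3
C = [np.random.rand(1, p, r1), np.random.rand(r1, p, r2), np.random.rand(r2, p, 1)]

# Micro-step for k = 1 (the middle component): build Phi_1 and solve for C[1].
left = np.einsum('ipj,pm->mj', C[0], Xi[0])               # shape (M, r1)
right = np.einsum('jpi,pm->mj', C[2], Xi[2])              # shape (M, r2)
Phi = np.einsum('mi,pm,mj->mipj', left, Xi[1], right)     # shape (M, r1, p, r2)
sol, *_ = np.linalg.lstsq(Phi.reshape(M, -1), y, rcond=None)
C[1] = sol.reshape(r1, p, r2)

# Relative residual of the model after this micro-step.
pred = np.einsum('mi,ipj,pm,mj->m', left, C[1], Xi[1], right)
print(np.linalg.norm(pred - y) / np.linalg.norm(y))
```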

3 Theoretical Foundation

3.1 Sample Complexity for Polynomials

The accuracy of the solution uW,M of Eq. 14 in relation to uW is subject to tremendous interest on the part of the mathematics community. Two particular papers that consider this problem are [32,39]. While the former provides sharper error bounds for the case of linear ansatz spaces the latter generalizes the work and is applicable to tensor network spaces. We now recall the relevant result for convenience.

Proposition 3.1. For any set $W \subseteq L^2(\Omega)\cap L^\infty(\Omega)$, define the variation constant

$$K(W) \coloneqq \sup_{v\in W\setminus\{0\}} \frac{\|v\|_{L^\infty(\Omega)}^2}{\|v\|_{L^2(\Omega)}^2}.$$

Let $\delta \in (0, 2^{-1/2})$. If $W$ is a subset of a finite dimensional linear space and $k \coloneqq \max\{K(\{f - u_W\}),\, K(\{u_W\} - W)\} < \infty$, it holds that

$$\mathbb{P}\Bigl[\|f - u_{W,M}\|_{L^2(\Omega)} \le (3 + 4\delta)\,\|f - u_W\|_{L^2(\Omega)}\Bigr] \ge 1 - q,$$

where $q$ decreases exponentially with a rate of $\ln(q) \in \mathcal{O}(-M\delta^2 k^{-2})$.

Proof. Since $k < \infty$, Theorems 2.7 and 2.12 in [32] ensure that

$$\|f - u_{W,M}\|_{L^2(\Omega)} \le \Bigl(1 + 2\sqrt{\tfrac{1+\delta}{1-\delta}}\Bigr)\,\|f - u_W\|_{L^2(\Omega)}$$

holds with a probability of at least $1 - 2C\exp\bigl(-\tfrac{1}{2}M\delta^2 k^{-2}\bigr)$. The constant $C$ is independent of $M$ and, since $W$ is a subset of a finite dimensional linear space, depends only polynomially on $\delta$ and $k^{-1}$. For $\delta \in (0, 2^{-1/2})$ it holds that $\sqrt{\tfrac{1+\delta}{1-\delta}} \le 1 + 2\delta$. This concludes the proof.

Note that the value of $k$ depends only on $f$ and on the set $W$ but not on the particular choice of representation of $W$. However, the variation constant of spaces like $V_p^d$ still depends on the underlying dictionary $\Psi$. Although the proposition indicates that a low value of $k$ is necessary to achieve fast convergence, the tensor product space $V_p^d$ considered thus far does not exhibit a small variation constant. The consequence of Proposition 3.1 is that elements of this space are hard to learn in general and may require an infeasible number of samples. To see this consider $\Omega = [-1,1]^d$ and the function dictionary $\Psi_{\text{Legendre}}$ of Legendre polynomials Eq. 5. Let $\Lambda \subseteq \mathbb{N}_p^d$ and define $P_\lambda(x) \coloneqq \prod_{k=1}^{d}\sqrt{2\lambda_k - 1}\,\bigl(\Psi_{\text{Legendre}}(x_k)\bigr)_{\lambda_k}$ for all $\lambda\in\Lambda$. Then $\{P_\lambda\}_{\lambda\in\Lambda}$ is an $L^2$-orthonormal basis for the linear subspace $V_\Lambda \coloneqq \langle P_\lambda : \lambda\in\Lambda\rangle \subseteq V_p^d$ and one can show that

$$K(V_\Lambda) = \sup_{x\in\Omega}\sum_{\lambda\in\Lambda}P_\lambda(x)^2 = \sum_{\lambda\in\Lambda}\prod_{k=1}^{d}(2\lambda_k - 1), \tag{17}$$

by using techniques from [32] and the fact that each $P_\lambda$ attains its maximum at $1$. If $\Lambda = \mathbb{N}_p^d$, we can interchange the sum and product in Eq. 17 and conclude that $K(V_p^d) = p^{2d}$. This means that we have to restrict the space $V_p^d$ to obtain an admissible variation constant. We propose to use the space $W_g^d$ of homogeneous polynomials of degree $g$. Employing Eq. 17 with $\Lambda = \{\lambda : |\lambda| = d + g\}$ we obtain the upper bound

$$K(W_g^d) \le \binom{d-1+g}{d-1}\,\max_{|\lambda| = d+g}\prod_{k=1}^{d}(2\lambda_k - 1) \le \binom{d-1+g}{d-1}\,\bigl(2\lfloor\tfrac{g}{d}\rfloor + 3\bigr)^{g \bmod d}\,\bigl(2\lfloor\tfrac{g}{d}\rfloor + 1\bigr)^{d - (g\bmod d)},$$

where the maximum is estimated by observing that $(2(\ell_1+1)-1)(2\ell_2-1) \le (2\ell_1-1)(2(\ell_2+1)-1) \iff \ell_2 \le \ell_1$. For $g \le d$ this results in the simplified bound $K(W_g^d) \le \bigl(3e\,\tfrac{d-1+g}{g}\bigr)^g$, where $e$ is Euler's number. This improves the variation constant substantially compared to the bound $K(V_p^d) \le p^{2d}$ when $g \ll d$. A similar bound for the dictionary of monomials $\Psi_{\text{monomial}}$ is more involved but can theoretically be computed in the same way.

In this work, we focus on the case where the samples are drawn according to a probability measure on $\Omega$. This however is not a necessity and it is indeed beneficial to draw the samples from an adapted sampling measure. Doing so, the theory in [32] ensures that $K(V) = \dim(V)$ for all linear spaces $V$, independent of the underlying dictionary $\Psi$. This in turn leads to the bounds $K(V_p^d) = p^d$ and $K(W_g^d) = \binom{d-1+g}{d-1} \le \bigl(e\,\tfrac{d-1+g}{g}\bigr)^g$ for $g \le d$. These optimally weighted least-squares methods, however, are not the focus of this work and we refer the interested reader to the works [39,40].
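The difference between the two unweighted bounds is easy to appreciate numerically. The sketch below (our own code) evaluates $p^{2d}$ and the bound on $K(W_g^d)$ derived above for a moderate $d$ and a small $g$:

```python
from math import comb, floor

def K_V_bound(p, d):
    """Bound p^(2d) for the product space V_p^d (Legendre dictionary)."""
    return p ** (2 * d)

def K_W_bound(g, d):
    """Bound on K(W_g^d) derived from Eq. 17 for the homogeneous space."""
    q, r = floor(g / d), g % d
    return comb(d - 1 + g, d - 1) * (2 * q + 3) ** r * (2 * q + 1) ** (d - r)

d, g = 20, 3
print(f"K(V) bound: {K_V_bound(g + 1, d):.3e}")   # p = g + 1
print(f"K(W) bound: {K_W_bound(g, d):.3e}")
```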

3.2 Block Sparse Tensor Trains

Now that we have seen that it is advantageous to restrict ourselves to the space $W_g^d$, we need to find a way to do so without losing the advantages of the tensor train format. In [29] it was rediscovered from the physics community (see [28]) that if a tensor train is an eigenvector of certain Laplace-like operators it admits a block-sparse structure. This means that for such a tensor train $c$ the components $C_k$ have zero blocks. Furthermore, this block-sparse structure is preserved under key operations, like e.g. the TT-SVD. One possible operator which introduces such a structure is the Laplace-like operator

$$L = \sum_{k=1}^{d} \Bigl(\textstyle\bigotimes_{\ell=1}^{k-1} I_p\Bigr) \otimes \operatorname{diag}(0,1,\ldots,p-1) \otimes \Bigl(\textstyle\bigotimes_{\ell=k+1}^{d} I_p\Bigr). \tag{18}$$

This is the operator mentioned in the introduction encoding a quantum symmetry. In the context of quantum mechanics this operator is known as the bosonic particle number operator, but we simply call it the degree operator. The reason for this is that for the function dictionary of monomials $\Psi_{\text{monomial}}$ the eigenspaces of $L$ for eigenvalue $g$ are associated with homogeneous polynomials of degree $g$. Simply put, if the coefficient tensor $c$ of the multivariate polynomial $u \in V_p^d$ is an eigenvector of $L$ with eigenvalue $g$, then $u$ is homogeneous and the degree of $u$ is $g$. In general there are polynomials in $V_p^d$ with degree up to $(p-1)d$. To state the results on the block-sparse representation of the coefficient tensor we need the partial operators

$$L_{\le k} = \sum_{m=1}^{k} \Bigl(\textstyle\bigotimes_{\ell=1}^{m-1} I_p\Bigr) \otimes \operatorname{diag}(0,1,\ldots,p-1) \otimes \Bigl(\textstyle\bigotimes_{\ell=m+1}^{k} I_p\Bigr), \qquad L_{\ge k+1} = \sum_{m=k+1}^{d} \Bigl(\textstyle\bigotimes_{\ell=k+1}^{m-1} I_p\Bigr) \otimes \operatorname{diag}(0,1,\ldots,p-1) \otimes \Bigl(\textstyle\bigotimes_{\ell=m+1}^{d} I_p\Bigr),$$

for which we have

$$L = L_{\le k} \otimes \Bigl(\textstyle\bigotimes_{\ell=k+1}^{d} I_p\Bigr) + \Bigl(\textstyle\bigotimes_{\ell=1}^{k} I_p\Bigr) \otimes L_{\ge k+1}.$$

In the following we adopt the notation x = Lc to abbreviate the equation

$$x(i_1,\ldots,i_d) = L(i_1,\ldots,i_d,j_1,\ldots,j_d)\,c(j_1,\ldots,j_d),$$

where L is a tensor operator acting on a tensor c with result x.
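For small $d$ and $p$ the degree operator Eq. 18 can be assembled explicitly from Kronecker products and the eigenvector property can be checked numerically. The following sketch is our own illustration for $d = 3$, $p = 3$ and the monomial dictionary:

```python
import numpy as np
from functools import reduce

d, p, g = 3, 3, 2
I = np.eye(p)
D = np.diag(np.arange(p))               # diag(0, 1, ..., p-1)

# Degree operator L of Eq. 18 as a p^d x p^d matrix.
L = sum(reduce(np.kron, [D if j == k else I for j in range(d)]) for k in range(d))

# Coefficient tensor of the homogeneous polynomial u(x) = x1*x2 + x3^2
# (entries indexed by the 0-based degrees of the monomial dictionary).
c = np.zeros((p,) * d)
c[1, 1, 0] = 1.0      # x1 * x2
c[0, 0, 2] = 1.0      # x3^2

assert np.allclose(L @ c.flatten(), g * c.flatten())
```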

Recall that by Remark 2.16 every TT corresponds to a polynomial by multiplying function dictionaries onto the cores. This means that for every $\ell = 1,\ldots,r_k$ the TT $\tau_{\le k,\ell}(c)$ corresponds to a polynomial in the variables $x_1,\ldots,x_k$ and the TT $\tau_{\ge k+1,\ell}(c)$ corresponds to a polynomial in the variables $x_{k+1},\ldots,x_d$. In general these polynomials are not homogeneous, i.e. they are not eigenvectors of the degree operators $L_{\le k}$ and $L_{\ge k+1}$. But since TTs are not uniquely defined (cf. Remark 2.10) it is possible to find transformations of the component tensors $C_k$ and $C_{k+1}$ that do not change the tensor $c$ or the rank $r$ but result in a representation where each $\tau_{\le k,\ell}(c)$ and each $\tau_{\ge k+1,\ell}(c)$ corresponds to a homogeneous polynomial. Thus, if $c$ represents a homogeneous polynomial of degree $g$ and $\tau_{\le k,\ell}(c)$ is homogeneous with $\deg(\tau_{\le k,\ell}(c)) = \tilde g$, then $\tau_{\ge k+1,\ell}(c)$ must be homogeneous with $\deg(\tau_{\ge k+1,\ell}(c)) = g - \tilde g$.

This is put rigorously in the first assertion of the subsequent Theorem 3.2. There $S_{k,\tilde g}$ contains all the indices $\ell$ for which the reduced basis polynomials satisfy $\deg(\tau_{\le k,\ell}(c)) = \tilde g$. Equivalently, it groups the basis functions $\tau_{\ge k+1,\ell}(c)$ into functions of order $g - \tilde g$. The second assertion of Theorem 3.2 states that we can only obtain a homogeneous polynomial of degree $\tilde g + m$ in the variables $x_1,\ldots,x_k$ by multiplying a homogeneous polynomial of degree $\tilde g$ in the variables $x_1,\ldots,x_{k-1}$ with a univariate polynomial of degree $m$ in the variable $x_k$. This provides a constructive argument for the proof and can be used to ensure block-sparsity in the implementation. Note that this condition forces entire blocks of the component tensor $C_k$ in Eq. 20 to be zero and thus decreases the degrees of freedom.

Theorem 3.2 [[29], Theorem 3.2]. Let $p = (p,\ldots,p)$ be a dimension tuple of size $d$ and let $c \in \mathbb{R}^p\setminus\{0\}$ be a tensor train of rank $r = (r_1,\ldots,r_{d-1})$. Then $Lc = gc$ if and only if $c$ has a representation with component tensors $C_k \in \mathbb{R}^{r_{k-1}\times p\times r_k}$ that satisfies the following two properties.

1. For all $\tilde g \in \{0,1,\ldots,g\}$ there exist $S_{k,\tilde g} \subseteq \{1,\ldots,r_k\}$ such that the left and right interface vectors satisfy

$$L_{\le k}\,\tau_{\le k,\ell}(c) = \tilde g\,\tau_{\le k,\ell}(c) \quad\text{and}\quad L_{\ge k+1}\,\tau_{\ge k+1,\ell}(c) = (g-\tilde g)\,\tau_{\ge k+1,\ell}(c) \tag{19}$$

for $\ell \in S_{k,\tilde g}$.

2. The component tensors satisfy a block structure in the sets $S_{k,\tilde g}$: for $m = 1,\ldots,p$

$$C_k(\ell_1, m, \ell_2) \neq 0 \;\Longrightarrow\; \exists\, 0 \le \tilde g \le g-m+1 :\ \ell_1 \in S_{k-1,\tilde g} \wedge \ell_2 \in S_{k,\tilde g+m-1}, \tag{20}$$

where we set $S_{0,0} = S_{d,g} = \{1\}$.

Note that this generalizes to other dictionaries and is not restricted to monomials.

Although block sparsity also appears for $g + 1 \neq p$, we restrict ourselves to the case $g + 1 = p$ in this work. Note that then the eigenspace of $L$ for the eigenvalue $g$ has a dimension equal to the dimension of the space of homogeneous polynomials, namely $\binom{d-1+g}{d-1}$. Defining $\rho_{k,\tilde g} \coloneqq |S_{k,\tilde g}|$, we can derive the following rank bounds.

Lemma 3.3 [[29], Lemma 3.6]. Let $p = (p,\ldots,p)$ be a dimension tuple of size $d$ and $c \in \mathbb{R}^p\setminus\{0\}$ with $Lc = gc$. Assume that $g + 1 = p$. Then the block sizes $\rho_{k,\tilde g}$ from Theorem 3.2 are bounded by

$$\rho_{k,\tilde g} \le \min\Bigl\{\binom{k+\tilde g-1}{k-1},\ \binom{d-k+g-\tilde g-1}{d-k-1}\Bigr\} \tag{21}$$

for all $k = 1,\ldots,d-1$ and $\tilde g = 0,\ldots,g$, and $\rho_{k,0} = \rho_{k,g} = 1$.

The proof of this lemma is based on a simple combinatorial argument. For every $k$ consider the sizes of the groups $\rho_{k-1,\bar g}$ for $\bar g \le \tilde g$. Then $\rho_{k,\tilde g}$ cannot exceed the sum of these sizes. Similarly, $\rho_{k,\tilde g}$ cannot exceed $\sum_{\bar g\ge\tilde g}\rho_{k+1,\bar g}$. Solving these recurrence relations yields the bound.

Example 3.1 (Block Sparsity). Let $p = 4$ and $g = 3$ be given and let $c$ be a tensor train such that $Lc = gc$. Then for $k = 2,\ldots,d-1$ the component tensors $C_k$ of $c$ exhibit the following block sparsity (up to permutation). For indices $i$ of dimension $r_{k-1}$ and $j$ of dimension $r_k$

$$C_k(i,1,j) = \begin{pmatrix} * & 0 & 0 & 0\\ 0 & * & 0 & 0\\ 0 & 0 & * & 0\\ 0 & 0 & 0 & * \end{pmatrix},\quad C_k(i,2,j) = \begin{pmatrix} 0 & * & 0 & 0\\ 0 & 0 & * & 0\\ 0 & 0 & 0 & *\\ 0 & 0 & 0 & 0 \end{pmatrix},\quad C_k(i,3,j) = \begin{pmatrix} 0 & 0 & * & 0\\ 0 & 0 & 0 & *\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix},\quad C_k(i,4,j) = \begin{pmatrix} 0 & 0 & 0 & *\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{pmatrix},$$

where each $*$ denotes a possibly non-zero block.

This block structure results from sorting the indices $i$ and $j$ in such a way that $\max S_{k,\tilde g} + 1 = \min S_{k,\tilde g+1}$ for every $\tilde g$.

The maximal block sizes $\rho_{k,\tilde g}$ for $k = 1,\ldots,d-1$ that result from Lemma 3.3 are

$$\rho_{k,0} = 1,\qquad \rho_{k,1} = \min\Bigl\{k,\ \tbinom{d-k+1}{2}\Bigr\},\qquad \rho_{k,2} = \min\Bigl\{\tbinom{k+1}{2},\ d-k\Bigr\},\qquad \rho_{k,3} = 1.$$

As one can see from Lemma 3.3, the block sizes $\rho_{k,\tilde g}$ can still be quite high.
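The bound of Eq. 21 is cheap to evaluate. The sketch below (our own code) prints the block-size bounds and the induced rank bound $r_k = \sum_{\tilde g}\rho_{k,\tilde g}$ for $d = 8$ and $g = 2$, which is the setting of the Riccati experiment in Numerical Results and reproduces the maximal block size 4 and the rank bound 6 quoted there:

```python
from math import comb

def block_size_bound(d, g, k, g_tilde):
    """Bound of Lemma 3.3 (Eq. 21) on the block size rho_{k, g_tilde}."""
    left = comb(k + g_tilde - 1, k - 1)
    right = comb(d - k + (g - g_tilde) - 1, d - k - 1)
    return min(left, right)

d, g = 8, 2
for k in range(1, d):
    rho = [block_size_bound(d, g, k, gt) for gt in range(g + 1)]
    print(f"k = {k}: block sizes {rho}, TT-rank bound {sum(rho)}")
# k = 4 attains the maximal block size 4 and the maximal rank bound 6.
```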

The expressive power of tensor train parametrizations can be understood by different concepts, such as locality or self similarity. We use the remainder of this section to provide d-independent rank bounds in the context of locality.

Definition 3.2. Let $u \in W_g^d$ be a homogeneous polynomial and let $B$ be the symmetric coefficient tensor introduced in Parametrizing Homogeneous Polynomials by Symmetric Tensors. We say that $u$ has a variable locality of $K_{loc}$ if $B(\ell_1,\ldots,\ell_g) = 0$ for all $(\ell_1,\ldots,\ell_g) \in \mathbb{N}_d^g$ with

$$\max\bigl\{|\ell_{m_1} - \ell_{m_2}| : m_1, m_2 = 1,\ldots,g\bigr\} > K_{loc}.$$

Example 3.3. Let $u$ be a homogeneous polynomial of degree 2 with variable locality $K_{loc}$. Then the symmetric matrix $B$ (cf. Eq. 12) is $K_{loc}$-banded. For $K_{loc} = 0$ this means that $B$ is diagonal and that $u$ takes the form

$$u(x) = \sum_{\ell=1}^{d} B(\ell,\ell)\,x_\ell^2.$$

This shows that variable locality removes mixed terms.

Remark 3.4. The locality condition in the following Theorem 3.4 is a sufficient, but in no way necessary, condition for a low rank. But since locality is a prominent feature of many physical phenomena, this condition allows us to identify an entire class of highly relevant functions which can be approximated very efficiently.

Consider, for example, a many-body system in one dimension, where each body is described by position and velocity coordinates. If the influence of neighboring bodies is much higher than the influence of more distant ones, the coefficients of the polynomial parts that depend on multiple variables can often be neglected. The forces in this system then exhibit a locality structure. An example of this is given in Eq. 6 of [3], where this structure is exhibited by the force that acts on the bodies. A similar structure also appears in the microscopic traffic models of [41].

Another example is given by the polynomial chaos expansion of the stochastic process

$$X_t(\xi_1,\ldots,\xi_t) \approx c(i_1,\ldots,i_t)\prod_{k=1}^{t}\Psi(\xi_k)(i_k)$$

for $t \in \mathbb{N}$, where $\Psi$ is the function dictionary of Hermite polynomials. In many applications it is justified to assume that the magnitude of the covariance $\operatorname{Cov}(X_{t_1}, X_{t_2})$ decays with the distance of the indices $|t_1 - t_2|$. If the covariance decays fast enough, the coefficient tensor exhibits approximate locality, i.e. it can be well approximated by a coefficient tensor that satisfies the locality condition. Examples of this are Gaussian processes with a Matérn kernel [42,43] or Markov processes.

Theorem 3.4. Let $p = (p,\ldots,p)$ be a dimension tuple of size $d$ and let $c \in \mathbb{R}^p\setminus\{0\}$ correspond to a homogeneous polynomial of degree $g = p - 1$ (i.e. $Lc = gc$) with variable locality $K_{loc}$. Then the block sizes $\rho_{k,\tilde g}$ are bounded by

$$\rho_{k,\tilde g} \le \sum_{\ell=1}^{K_{loc}} \min\Bigl\{\binom{K_{loc}-\ell+\tilde g-1}{K_{loc}-\ell},\ \binom{\ell+g-\tilde g-2}{\ell-1}\Bigr\} \tag{22}$$

for all $k = 1,\ldots,d-1$ and $\tilde g = 1,\ldots,g-1$, as well as $\rho_{k,0} = \rho_{k,g} = 1$.

Proof. For fixed $g > 0$ and a fixed component $C_k$ recall that for each $l$ the tensor $\tau_{\le k,l}(c)$ corresponds to a reduced basis function $v_l$ in the variables $x_1,\ldots,x_k$ and that for each $l$ the tensor $\tau_{\ge k+1,l}(c)$ corresponds to a reduced basis function $w_l$ in the variables $x_{k+1},\ldots,x_d$. Further recall that the sets $S_{k,\tilde g}$ group these $v_l$ and $w_l$. For all $l \in S_{k,\tilde g}$ it holds that $\deg(v_l) = \tilde g$ and $\deg(w_l) = g - \tilde g$. For $\tilde g = 0$ and $\tilde g = g$ we know from Lemma 3.3 that $\rho_{k,\tilde g} = 1$. Now fix any $0 < \tilde g < g$ and arrange all the polynomials $v_l$ of degree $\tilde g$ in a vector $v$ and all polynomials $w_l$ of degree $g - \tilde g$ in a vector $w$. Then every polynomial of the form $v^\intercal Q w$ for some matrix $Q$ satisfies the degree constraint and the maximal possible rank of $Q$ provides an upper bound for the block size $\rho_{k,\tilde g}$. However, due to the locality constraint we know that certain entries of $Q$ have to be zero. We denote a variable of a polynomial as inactive if the polynomial is constant with respect to changes in this variable and active otherwise. Assume that the polynomials in $v$ are ordered (ascendingly) according to the smallest index of their active variables and that the polynomials in $w$ are ordered (ascendingly) according to the largest index of their active variables. With this ordering $Q$ takes the form

$$Q = \begin{pmatrix} 0 & 0 & \cdots & 0\\ Q_1 & 0 & \cdots & 0\\ * & Q_2 & \ddots & \vdots\\ \vdots & \ddots & \ddots & 0\\ * & \cdots & * & Q_{K_{loc}} \end{pmatrix}.$$

This means that for $\ell = 1,\ldots,K_{loc}$ each block $Q_\ell$ matches a polynomial $v_l$ of degree $\tilde g$ in the variables $x_{k-K_{loc}+\ell},\ldots,x_k$ with a polynomial $w_l$ of degree $g-\tilde g$ in the variables $x_{k+1},\ldots,x_{k+\ell}$.

Observe that the number of rows in $Q_\ell$ decreases while the number of columns increases with $\ell$. This means that we can subdivide $Q$ as

$$Q = \begin{pmatrix} 0 & 0 & 0\\ Q_C & 0 & 0\\ * & Q_R & 0 \end{pmatrix},$$

where $Q_C$ contains the blocks $Q_\ell$ with more rows than columns (i.e. full column rank) and $Q_R$ contains the blocks $Q_\ell$ with more columns than rows (i.e. full row rank). So $Q_C$ is a tall-and-skinny matrix while $Q_R$ is a short-and-wide matrix, and the rank of a general $Q$ is bounded by the sum over the column sizes of the $Q_\ell$ in $Q_C$ plus the sum over the row sizes of the $Q_\ell$ in $Q_R$, i.e.

$$\operatorname{rank}(Q) \le \sum_{\ell=1}^{K_{loc}}\operatorname{rank}(Q_\ell).$$

To conclude the proof it remains to compute the row and column sizes of $Q_\ell$. Recall that the number of rows of $Q_\ell$ equals the number of polynomials $u$ of degree $\tilde g$ in the variables $x_{k-K_{loc}+\ell},\ldots,x_k$ that can be represented as $u(x_{k-K_{loc}+\ell},\ldots,x_k) = x_{k-K_{loc}+\ell}\,\tilde u(x_{k-K_{loc}+\ell},\ldots,x_k)$. This corresponds to all possible $\tilde u$ of degree $\tilde g - 1$ in the $K_{loc}-\ell+1$ variables $x_{k-K_{loc}+\ell},\ldots,x_k$. This means that

$$\#\mathrm{rows}(Q_\ell) \le \binom{K_{loc}-\ell+\tilde g-1}{K_{loc}-\ell}$$

and a similar argument yields

$$\#\mathrm{columns}(Q_\ell) \le \binom{\ell+g-\tilde g-2}{\ell-1}.$$

This concludes the proof.

This lemma demonstrates how the combination of the model space Wgd with a tensor network space can reduce the space complexity by incorporating locality.

Remark 3.5. The rank bound in Theorem 3.4 is only sharp for the highest possible rank. The precise bounds can be much smaller, especially for the first and last ranks, but are quite technical to write down. For this reason, we do not provide them.

One sees that the bound only depends on g and Kloc and is therefore d-independent.

Remark 3.6. The rank bounds presented in this section do not only hold for the monomial dictionary $\Psi_{\text{monomial}}$ but for all polynomial dictionaries $\Psi$ that satisfy $\deg(\Psi_k) = k - 1$ for all $k = 1,\ldots,p$. When we speak of polynomials of degree $g$, we mean the space $W_g^d = \{v \in V_p^d : \deg(v) = g\}$. For the dictionary of monomials $\Psi_{\text{monomial}}$ the space $W_g^d$ contains only homogeneous polynomials in the classical sense. However, when the basis of Legendre polynomials $\Psi_{\text{Legendre}}$ is used one obtains a space in which the functions are not homogeneous in the classical sense. Note that we use polynomials since they have been applied successfully in practice, but other function dictionaries can be used as well. Also note that the theory is much more general, as shown in [29], and is not restricted to the degree counting operator.

With Theorem 3.4 one sees that tensor trains are well suited to parametrize homogeneous polynomials of fixed degree where the symmetric coefficient tensor $B$ (cf. Eq. 12) is approximately banded (see also Example 3.3). This means that there exists a $K_{loc}$ such that the error of a best approximation of $B$ by a tensor $\tilde B$ with variable locality $K_{loc}$ is small. However, $K_{loc}$ is not known precisely in practice but can only be assumed by physical understanding of the problem at hand. Therefore, we still rely on rank-adaptive schemes to find appropriate rank and block sizes. Moreover, the locality property heavily depends on the ordering of the modes. This ordering can be optimized, for example, by using entropy measures for the correlation of different modes, as is done in quantum chemistry (cf. [[44], Remark 4.2]), or by model selection methods (cf. [25,27,45,46]).

4 Method Description

In this section we utilize the insights of Theoretical Foundation to refine the approximation spaces $W_g^d$ and $S_g^d$ and adapt the alternating least-squares (ALS) method to solve the related least-squares problems. First, we define the subset

$$\mathcal{B}_\rho(W_g^d) \coloneqq \bigl\{u \in W_g^d : c \text{ is block-sparse with } \rho_{k,\tilde g} \le \rho \text{ for } 0\le\tilde g\le g,\ k = 1,\ldots,d\bigr\} \tag{23}$$

and provide an algorithm for the related least-squares problem in Algorithm 1, which is a slightly modified version of the classical ALS [19].1 With this definition a straightforward application of the concept of block-sparsity to the space $S_g^d$ is given by

$$S_{g,\rho}^d = \sum_{\tilde g=0}^{g} \mathcal{B}_\rho(W_{\tilde g}^d). \tag{24}$$

This means that every polynomial in $S_{g,\rho}^d$ can be represented by a sum of orthogonal coefficient tensors2

$$\sum_{\tilde g=0}^{g} c^{(\tilde g)} \quad\text{where}\quad L c^{(\tilde g)} = \tilde g\, c^{(\tilde g)}. \tag{25}$$

There is however another, more compact, way to represent this function. Instead of storing $g + 1$ different tensor trains $c^{(0)},\ldots,c^{(g)}$ of order $d$, we can merge them into a single tensor $c$ of order $d + 1$ such that $c(i_1,\ldots,i_d,\tilde g) = c^{(\tilde g)}(i_1,\ldots,i_d)$. The summation over $\tilde g$ can then be represented by contracting a vector of ones onto the $(d+1)$-th mode. To retain the block-sparse representation we can view the $(d+1)$-th component as an artificial component representing a shadow variable $x_{d+1}$.

Remark 4.1. The introduction of the shadow variable $x_{d+1}$ contradicts the locality assumptions of Theorem 3.4 and implies that the worst-case rank bounds must increase. This can be problematic since the block size contributes quadratically to the number of parameters. However, a proof similar to that of Theorem 3.4 can be made in this setting and one can show that the bounds remain independent of $d$:

$$\rho_{k,\tilde g} \le \underline{1+}\sum_{\ell=1}^{K_{loc}}\min\Bigl\{\binom{K_{loc}-\ell+\tilde g-1}{K_{loc}-\ell},\ \binom{\ell\,\underline{+1}+g-\tilde g-2}{\ell\,\underline{+1}-1}\Bigr\}, \tag{26}$$

where the changes to Eq. 22 are underlined. This is crucial, since in practice one can assume locality by physical understanding of the problem at hand. With this statement, we can guarantee that the ranks are only slightly changed by the auxiliary contraction and the locality property is not destroyed.

We denote the set of polynomials that results from this augmented block-sparse tensor train representation as

$$S_{g,\rho}^{d,\mathrm{aug}}, \tag{27}$$

where again ρ provides a bound for the block-size in the representation.

Since $S_{g,\rho}^{d,\mathrm{aug}}$ is defined analogously to $\mathcal{B}_\rho(W_g^d)$ we can use Algorithm 1 to solve the related least-squares problem by changing the contraction Eq. 16 to the corresponding contraction for the augmented order-$(d+1)$ coefficient tensor, where a vector of ones is contracted onto the additional mode (diagram omitted). (28)


Algorithm 1. Extended ALS (SALSA) for the least-squares problem on $\mathcal{B}_\rho(W_g^d)$.

To optimize the coefficient tensors $c^{(0)},\ldots,c^{(g)}$ in the space $S_{g,\rho}^d$ we resort to an alternating scheme. Since the coefficient tensors are mutually orthogonal we propose to optimize each $c^{(\tilde g)}$ individually while keeping the other summands $\{c^{(k)}\}_{k\neq\tilde g}$ fixed. This means that we solve the problem

$$u_{\tilde g} = \operatorname*{arg\,min}_{u\in W_{\tilde g}^d}\ \frac{1}{M}\sum_{m=1}^{M}\Bigl\|f(x^{(m)}) - \sum_{\substack{k=0\\ k\neq\tilde g}}^{g} u_k(x^{(m)}) - u(x^{(m)})\Bigr\|_F^2, \tag{29}$$

which can be solved using Algorithm 1. The original problem Eq. 14 is then solved by alternating over $\tilde g$ until a suitable convergence criterion is met. The complete algorithm is summarized in Algorithm 2.


Algorithm 2. Alternating extended ALS (SALSA) for the least-squares problem on $S_{g,\rho}^d$.

The proposed representation has several advantages. The optimization with the tensor train structure is computationally less demanding than solving directly in $S_g^d$. Let $D = \dim(S_g^d) = \binom{d+g}{d}$. Then a reconstruction in $S_g^d$ requires the solution of a linear system of size $M\times D$ while a microstep in an ALS sweep only requires the solution of systems of size less than $M\times pr^2$ (depending on the block sizes $\rho_{k,\tilde g}$). Moreover, the stack contractions discussed in Least Squares also benefit from the block-sparse structure. This also means that the number of parameters of a full rank-$r$ tensor train can be much higher than the number of parameters of several $c^{(m)}$'s which individually have ranks that are even larger than $r$.

Remark 4.2. Let us comment on the practical considerations behind choosing $S_{g,\rho}^{d,\mathrm{aug}}$ or $S_{g,\rho}^d$ by stating some pros and cons of these parametrizations. We expect that solving the least-squares problem for $S_{g,\rho}^{d,\mathrm{aug}}$ will be faster than for $S_{g,\rho}^d$ since it is computationally more efficient to optimize all polynomials simultaneously than every degree individually in an alternating fashion. On the other hand, the hierarchical scheme of the summation approach may allow one to utilize multi-level Monte Carlo approaches. Together with the fact that every degree $\tilde g$ possesses a different optimal sampling density, this may result in a drastically improved best-case sample efficiency for the direct method. Additionally, with $S_{g,\rho}^d$ it is easy to extend the ansatz space simply by increasing $g$, which is not so straightforward for $S_{g,\rho}^{d,\mathrm{aug}}$. Which approach is superior depends on the problem at hand.

5 Numerical Results

In this section we illustrate the numerical viability of the proposed framework on some simple but common problems. We estimate the relative errors on test sets with respect to the sought function $f$ and are interested in the number of samples required for successful recovery. Our implementation is meant only as a proof of concept and does not lay any emphasis on efficiency. The rank is chosen a priori, the stopping criteria are naïvely implemented and rank adaptivity, as would be provided by SALSA, is missing altogether.3 For this reason we only compare the methods in terms of degrees of freedom and accuracy and not in terms of computing time. These are relevant quantities nonetheless, since the degrees of freedom are often the limiting factor in high dimensions and the computing time is directly related to the number of degrees of freedom.

In the following we always assume $p = g + 1$. We also restrict the group sizes to be bounded by the parameter $\rho_{\max}$. In our experiments we choose $\rho_{\max}$ without any particular strategy; ideally, $\rho_{\max}$ would be determined adaptively, for example by SALSA, which we did not do in this work. For every sample size the error plots show the distribution of the errors between the 0.15 and 0.85 quantile. The code for all experiments has been made publicly available at https://github.com/ptrunschke/block_sparse_tt.

5.1 Riccati Equation

In this section we consider the closed-loop linear quadratic optimal control problem

$$\begin{aligned}
\operatorname*{minimize}_{u}\quad & \|y\|_{L^2((0,\infty)\times(-1,1))}^2 + \lambda\,\|u\|_{L^2((0,\infty))}^2\\
\text{subject to}\quad & \partial_t y = \partial_x^2 y + u(t)\,\chi_{(-0.4,0.4)}, && (t,x)\in(0,\infty)\times(-1,1),\\
& y(0,x) = y_0(x), && x\in(-1,1),\\
& \partial_x y(t,-1) = \partial_x y(t,1) = 0.
\end{aligned}$$

After a spatial discretization of the heat equation with finite differences we obtain a d-dimensional system of the form

$$\operatorname*{minimize}_{u}\ \int_0^\infty y(t)^\intercal Q\,y(t) + \lambda\,u(t)^2\,\mathrm{d}t \quad\text{subject to}\quad \dot y = Ay + Bu \quad\text{and}\quad y(0) = y_0.$$

It is well known [47] that the value function of this problem takes the form $v(y_0) = y_0^\intercal P\, y_0$, where $P$ can be computed by solving the algebraic Riccati equation (ARE). It is therefore a homogeneous polynomial of degree 2. This function is a perfect example of a function that can be well approximated in the space $W_2^d$. We approximate the value function on the domain $\Omega = [-1,1]^d$ for $d = 8$ with the parameters $g = 2$ and $\rho_{\max} = 4$.

In this experiment we use the dictionary of monomials $\Psi = \Psi_{\text{monomial}}$ (cf. Eq. 4) and compare the ansatz spaces $W_2^8$, $\mathcal{B}_4(W_2^8)$, $\mathcal{T}_6(V_3^8)$ and $V_3^8$. Since the value function is a generic homogeneous polynomial of degree 2, we use Lemma 3.3 to calculate the maximal block size 4. This guarantees perfect reconstruction since $\mathcal{B}_4(W_2^8) = W_2^8$. The rank bound 6 is chosen such that $\mathcal{B}_4(W_2^8) \subseteq \mathcal{T}_6(V_3^8)$. The degrees of freedom of all spaces used are listed in Table 1. In Figure 1 we compare the relative errors of the respective ansatz spaces. It can be seen that the block-sparse ansatz space recovers almost as well as the sparse approach. As expected, the dense TT format is less favorable with respect to the sample size.
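Training data for this experiment can be generated cheaply because the value function is known in closed form once the ARE is solved. The following sketch is our own illustration and assumes a standard central-difference discretization of the heat equation with Neumann boundary conditions and unit cost weights; the exact matrices used in the paper may differ:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

d = 8                                   # number of spatial grid points / state dimension
x = np.linspace(-1, 1, d)
h = x[1] - x[0]

# Central-difference Laplacian with homogeneous Neumann boundary treatment.
A = (np.diag(-2 * np.ones(d)) + np.diag(np.ones(d - 1), 1) + np.diag(np.ones(d - 1), -1)) / h**2
A[0, 1] += 1 / h**2
A[-1, -2] += 1 / h**2
B = (np.abs(x) < 0.4).astype(float)[:, None]   # control shape function chi_(-0.4, 0.4)
Q = h * np.eye(d)                              # discretized L2 state cost
R = np.array([[1.0]])                          # control cost weight lambda = 1

P = solve_continuous_are(A, B, Q, R)

# Samples of the value function v(y0) = y0^T P y0, a homogeneous degree-2 polynomial.
Y = np.random.uniform(-1, 1, size=(1000, d))
values = np.einsum('mi,ij,mj->m', Y, P, Y)
print(values[:3])
```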


TABLE 1. Degrees of freedom for the full space $W_g^d$ of homogeneous polynomials of degree $g = 2$, the TT variant $\mathcal{B}_{\rho_{\max}}(W_g^d)$ with maximal block size $\rho_{\max} = 4$, the space $\mathcal{T}_r(V_p^d)$ with TT rank bounded by $r = 6$, and the full space $V_p^d$ for completeness.


FIGURE 1. 0.15–0.85 quantiles of the recovery error in $W_2^8$ (blue), $\mathcal{B}_4(W_2^8)$ (orange), and $\mathcal{T}_6(V_3^8)$ (green). The relative error is computed with respect to the $L^2$-norm using a Monte Carlo estimation with $10^6$ samples.

A clever change of basis, given by the diagonalization of $Q$, can reduce the required block size from 4 to 1. This allows us to extend the presented approach to higher-dimensional problems. The advantage over the classical Riccati approach becomes clear when considering non-linear versions of the control problem that do not admit a Riccati solution. This is done in [8,9] using the dense TT format $\mathcal{T}_r(V_p^d)$.

5.2 Gaussian Density

As a second example we consider the reconstruction of an unnormalized Gaussian density

$$f(x) = \exp\bigl(-\|x\|_2^2\bigr),$$

again on the domain $\Omega = [-1,1]^d$ with $d = 6$. For the dictionary $\Psi = \Psi_{\text{Legendre}}$ (cf. Eq. 5) we chose $g = 7$, $\rho_{\max} = 1$ and $r = 8$ and compare the reconstruction w.r.t. $S_g^d$, $S_{g,\rho_{\max}}^d$ and $\mathcal{T}_r(V_p^d)$, defined in Eqs. 11, 24 and 8. The degrees of freedom resulting from these different discretizations are compared in Table 2. This example is interesting because here the roles of the spaces are reversed. The function has the product structure

$$f(x) = \exp(-x_1^2)\cdots\exp(-x_d^2)$$

and can therefore be well approximated as a rank-1 tensor train with each component $C_k$ just being a best approximation of $\exp(-x_k^2)$ in the used function dictionary. Therefore, we expect the higher-degree polynomials to be important. A comparison of the relative errors to the exact solution is depicted in Figure 2. This example demonstrates the limitations of the ansatz space $S_7^6$ which is not able to exploit the low-rank structure of the function $f$. Using $S_{7,1}^6$ can partially remedy this problem, as can be seen by the improved sample efficiency. But since $S_{7,1}^6 \subseteq S_7^6$ the final approximation error of $S_{7,1}^6$ cannot fall below that of $S_7^6$. One can see that the dense format $\mathcal{T}_1(V_8^6)$ produces the best results but is quite unstable compared to the other ansatz classes. This instability is a result of the non-convexity of the set $\mathcal{T}_r(V_p^d)$ and we observe that the chance of getting stuck in a local minimum increases when the rank $r$ is reduced from 8 to 1. Finally, we want to address the peaks that are observable at $M \approx 500$ samples for $\mathcal{T}_8(V_8^6)$ and $M \approx 1716$ samples for $S_7^6$. For this recall that the approximation in $S_7^6$ amounts to solving a linear system which is underdetermined for $M < 1716$ samples and overdetermined for $M > 1716$ samples. In the underdetermined case we compute the minimum-norm solution and in the overdetermined case we compute the least-squares solution. It is well known that the solution of such a reconstruction problem is particularly unstable in the region of this transition [39]. Although the set $S_{7,1}^6$ is non-linear, we take the peak at $M \approx 500$ as evidence for a similar effect, produced by the similar linear systems that are solved in the micro-steps of the ALS.


TABLE 2. Degrees of freedom for the full space $S_g^d$, the TT variant $S_{g,\rho_{\max}}^d$ with maximal block size $\rho_{\max} = 1$, the space $\mathcal{T}_r(V_p^d)$ with TT rank bounded by $r = 1$, the space $\mathcal{T}_r(V_p^d)$ with TT rank bounded by $r = 8$, and the full space $V_p^d$ for completeness.


FIGURE 2. 0.15–0.85 quantiles of the recovery error in $S_7^6$ (blue), $S_{7,1}^6$ (orange), $\mathcal{T}_1(V_8^6)$ (green), and $\mathcal{T}_8(V_8^6)$ (red). The relative error is computed with respect to the $L^2$-norm using a Monte Carlo estimation with $10^6$ samples.

5.3 Quantities of Interest

The next considered problem often arises when computing quantities of interest from random partial differential equations. We consider the stationary diffusion equation

$$-\nabla_x\cdot\bigl(a(x,y)\,\nabla_x u(x,y)\bigr) = f(x)\quad x\in D, \qquad u(x,y) = 0\quad x\in\partial D$$

on $D = [-1,1]^2$. This equation is parametric in $y \in [-1,1]^d$. The randomness is introduced by the uniformly distributed random variable $y \sim \mathcal{U}([-1,1]^d)$ that enters the diffusion coefficient

$$a(x,y) \coloneqq 1 + \frac{6}{\pi^2}\sum_{k=1}^{d} k^{-2}\sin(\hat\varpi_k x_1)\sin(\check\varpi_k x_2)\,y_k$$

with $\hat\varpi_k = \pi\lfloor k/2\rfloor$ and $\check\varpi_k = \pi\lceil k/2\rceil$. The solution $u$ often measures the concentration of some substance in the domain $D$ and one is interested in the total amount of this substance in the entire domain,

$$\mathcal{M}(y) \coloneqq \int_D u(x,y)\,\mathrm{d}x.$$

An important result proven in [31] ensures the $\ell^p$-summability, for some $0 < p \le 1$, of the polynomial coefficients of the solution of this equation in the dictionary of Chebyshev polynomials. This means that the function is very regular and we presume that it can be well approximated in $S_g^d$ for the dictionary of Legendre polynomials $\Psi_{\text{Legendre}}$. For our numerical experiments we chose $d = 10$, $g = 5$ and $\rho_{\max} = 3$ and again compare the reconstruction w.r.t. $S_g^d$, the block-sparse TT representations $S_{g,\rho_{\max}}^d$ and $S_{g,\rho_{\max}}^{d,\mathrm{aug}}$ and a dense TT representation $\mathcal{T}_r(V_p^d)$ with rank $r \le 14$. Admittedly, the choice $d = 10$ is relatively small for this problem but was necessary since the computation on $S_g^d$ took prohibitively long for larger values. A comparison of the degrees of freedom for the different ansatz spaces is given in Table 3 and the relative errors to the exact solution are depicted in Figure 3. In this plot we can recognize the general pattern that a lower number of parameters can be associated with an improved sample efficiency. However, we also observe that for small $M$ the relative error for $S_{g,\rho}^d$ is smaller than for $S_{g,\rho}^{d,\mathrm{aug}}$. We interpret this as a consequence of the regularity of $u$, since the alternating scheme for the optimization in $S_{g,\rho}^d$ favors lower-degree polynomials by construction. In spite of this success, we have to point out that optimizing over $S_{g,\rho}^d$ took about 10 times longer than optimizing over $S_{g,\rho}^{d,\mathrm{aug}}$. Finally, we observe that the recovery in $\mathcal{T}_{14}(V_6^{10})$ produces unexpectedly large relative errors when compared to previous results in [13]. This suggests that the rank-adaptive algorithm from [13] has a strong regularizing effect that improves the sample efficiency.


TABLE 3. Degrees of freedom for the full space $S_g^d$, the TT variant $S_{g,\rho_{\max}}^d$ with maximal block size $\rho_{\max} = 3$, the space $\mathcal{T}_r(V_p^d)$ with TT rank bounded by $r = 14$, and the full space $V_p^d$ for completeness.


FIGURE 3. 0.15–0.85 quantiles of the recovery error in $S_5^{10}$ (blue), $S_{5,3}^{10}$ (orange), $S_{5,3}^{10,\mathrm{aug}}$ (green), and $\mathcal{T}_{14}(V_6^{10})$ (red). The relative error is computed with respect to the $L^2$-norm using a Monte Carlo estimation with $10^6$ samples. The experiment for $\mathcal{T}_{14}(V_6^{10})$ was stopped early at $M = 1{,}200$ due to its prohibitive computational demand and because the expected behaviour is already observable.

6 Conclusion

We introduce block sparsity [28,29] as an efficient tool to parametrize homogeneous, multivariate polynomials. We discuss how to extend this to general multivariate polynomials of bounded degree and prove bounds for the block sizes for certain classes of polynomials. As an application we discuss the problem of function identification from data for tensor train based ansatz spaces and give some insights into when these ansatz spaces can be used efficiently. For this we motivate the usage of low-degree multivariate polynomials by approximation results (e.g. [30,31]) and recent results on sample complexity [32]. This leads to a novel algorithm for the problem at hand. We then demonstrate the applicability of this algorithm to different problems. Up until now block-sparse tensor trains have not been used for these recovery tasks. The numerical examples, however, demonstrate that at least dense tensor trains cannot compete with our novel block-sparse approach. We observe that the sample complexity can be much more favorable for successful system identification with block-sparse tensor trains than with dense tensor trains or purely sparse representations. We expect that the inclusion of rank-adaptivity using techniques from SALSA or bASD is straightforward, which we therefore consider an interesting direction from an applied point of view for forthcoming papers. We expect that this would improve the numerical results even further. The introduction of rank-adaptivity would moreover alleviate the problem of having to choose a block size a priori. Finally, we want to reiterate that the spaces of polynomials with bounded degree are predestined for the application of least-squares recovery with an optimal sampling density (cf. [39]), which holds opportunities for further improvement of the sample efficiency. This leads us to the strong belief that the proposed algorithm can be applied successfully to other high-dimensional problems in which the sought function exhibits sufficient regularity.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/ptrunschke/block_sparse_tt.

Author Contributions

MG is the main contributor of the block-sparse tensor train section. RS is the main contributor of the method section. PT is the main contributor of the sample complexity section and of the implementation of the experiments, and contributed to the proof of the main theorem. All authors contributed to every section and contributed equally to the introduction and to the discussion.

Funding

MG was funded by DFG (SCHN530/15-1). RS was supported by the Einstein Foundation Berlin. PT acknowledges support by the Berlin International Graduate School in Model and Simulation based Research (BIMoS).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1It is possible to include rank adaptivity as in SALSA [21] or bASD [13] and we have noted this in the relevant places.

2The orthogonality comes from the symmetry of L which results in orthogonal eigenspaces.

3We expect that an application of the state-of-the-art SALSA algorithm to the block-sparse model class, as described in Section 4, is straightforward.

References

1. Brunton, SL, Proctor, JL, and Kutz, JN. Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems. Proc Natl Acad Sci USA (2016) 113(15):3932–7. doi:10.1073/pnas.1517384113

2. Gelß, P, Klus, S, Eisert, J, and Schütte, C. Multidimensional Approximation of Nonlinear Dynamical Systems. J Comput Nonlinear Dyn (2019) 14(6). doi:10.1115/1.4043148

3. Goeßmann, A, Götte, M, Roth, I, Ryan, S, Kutyniok, G, and Eisert, J. Tensor Network Approaches for Data-Driven Identification of Non-linear Dynamical Laws (2020). NeurIPS 2020 - Tensor Workshop.

4. Kazeev, V, and Khoromskij, BN. Low-Rank Explicit QTT Representation of the Laplace Operator and its Inverse. SIAM J Matrix Anal Appl (2012) 33(3):742–58. doi:10.1137/100820479

5. Kazeev, V, and Schwab, C. Quantized Tensor-Structured Finite Elements for Second-Order Elliptic PDEs in Two Dimensions. Numer Math (2018) 138(1):133–90. doi:10.1007/s00211-017-0899-1

6. Bachmayr, M, and Kazeev, V. Stability of Low-Rank Tensor Representations and Structured Multilevel Preconditioning for Elliptic PDEs. Found Comput Math (2020) 20(5):1175–236. doi:10.1007/s10208-020-09446-z

7. Eigel, M, Pfeffer, M, and Schneider, R. Adaptive Stochastic Galerkin FEM with Hierarchical Tensor Representations. Numer Math (2016) 136(3):765–803. doi:10.1007/s00211-016-0850-x

8. Dolgov, S, Kalise, D, and Kunisch, K. Tensor Decomposition Methods for High-Dimensional Hamilton-Jacobi-Bellman Equations. arXiv (2021). arXiv:1908.01533 [cs, math].

9. Oster, M, Sallandt, L, and Schneider, R. Approximating the Stationary Hamilton-Jacobi-Bellman Equation by Hierarchical Tensor Products. arXiv (2021). arXiv:1911.00279 [math].

10. Richter, L, Sallandt, L, and Nüsken, N. Solving High-Dimensional Parabolic PDEs Using the Tensor Train Format. arXiv (2021). arXiv:2102.11830 [cs, math, stat].

11. Bayer, C, Eigel, M, Sallandt, L, and Trunschke, P. Pricing High-Dimensional Bermudan Options with Hierarchical Tensor Formats. arXiv (2021). arXiv:2103.01934 [cs, math, q-fin].

12. Glau, K, Kressner, D, and Statti, F. Low-Rank Tensor Approximation for Chebyshev Interpolation in Parametric Option Pricing. SIAM J Finan Math (2020) 11(3):897–927. doi:10.1137/19m1244172

13. Eigel, M, Neumann, J, Schneider, R, and Wolf, S. Non-intrusive Tensor Reconstruction for High-Dimensional Random PDEs. Comput Methods Appl Math (2019) 19(1):39–53. doi:10.1515/cmam-2018-0028

14. Eigel, M, Schneider, R, Trunschke, P, and Wolf, S. Variational Monte Carlo–Bridging Concepts of Machine Learning and High-Dimensional Partial Differential Equations. Adv Comput Math (2019) 45(5):2503–32. doi:10.1007/s10444-019-09723-8

15. Zhang, Z, Yang, X, Oseledets, IV, Karniadakis, GE, and Daniel, L. Enabling High-Dimensional Hierarchical Uncertainty Quantification by ANOVA and Tensor-Train Decomposition. IEEE Trans Comput.-Aided Des Integr Circuits Syst (2015) 34(1):63–76. doi:10.1109/tcad.2014.2369505

16. Klus, S, and Gelß, P. Tensor-Based Algorithms for Image Classification. Algorithms (2019) 12(11):240. doi:10.3390/a12110240

17. Stoudenmire, E, and Schwab, DJ. Supervised Learning with Tensor Networks. In: Advances in Neural Information Processing Systems. Editors D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc. (2016) 29. Available at: https://proceedings.neurips.cc/paper/2016/file/5314b9674c86e3f9d1ba25ef9bb32895-Paper.pdf/

18. Oseledets, IV. DMRG Approach to Fast Linear Algebra in the TT-Format. Comput Methods Appl Math (2011) 11(3):382–393. doi:10.2478/cmam-2011-0021

19. Holtz, S, Rohwedder, T, and Schneider, R. The Alternating Linear Scheme for Tensor Optimization in the Tensor Train Format. SIAM J Sci Comput (2012) 34(2):A683–A713. doi:10.1137/100818893

20. White, SR. Density Matrix Formulation for Quantum Renormalization Groups. Phys Rev Lett (1992) 69(19):2863–6. doi:10.1103/physrevlett.69.2863

21. Grasedyck, L, and Krämer, S. Stable ALS Approximation in the TT-Format for Rank-Adaptive Tensor Completion. Numer Math (2019) 143(4):855–904. doi:10.1007/s00211-019-01072-4

22. Holtz, S, Rohwedder, T, and Schneider, R. On Manifolds of Tensors of Fixed TT-Rank. Numer Math (2012) 120(4):701–31. doi:10.1007/s00211-011-0419-7

23. Lubich, C, Oseledets, IV, and Vandereycken, B. Time Integration of Tensor Trains. SIAM J Numer Anal (2015) 53(2):917–41. doi:10.1137/140976546

24. Chevreuil, M, Lebrun, R, Nouy, A, and Rai, P. A Least-Squares Method for Sparse Low Rank Approximation of Multivariate Functions. Siam/asa J Uncertainty Quantification (2015) 3(1):897–921. doi:10.1137/13091899x

25. Grelier, E, Nouy, A, and Chevreuil, M. Learning with Tree-Based Tensor Formats. arXiv (2019). arXiv:1811.04455 [cs, math, stat].

26. Grelier, E, Nouy, A, and Lebrun, R. Learning High-Dimensional Probability Distributions Using Tree Tensor Networks. arXiv (2021). arXiv:1912.07913 [cs, math, stat].

27. Haberstich, C. Adaptive Approximation of High-Dimensional Functions with Tree Tensor Networks for Uncertainty Quantification. [PhD thesis]. École centrale de Nantes (2020).

28. Singh, S, Pfeifer, RNC, and Vidal., G. Tensor Network Decompositions in the Presence of a Global Symmetry. Phys Rev A (2010) 82(5):050301. doi:10.1103/physreva.82.050301

29. Bachmayr, M, Götte, M, and Pfeffer, M. Particle Number Conservation and Block Structures in Matrix Product States. arXiv (2021). arXiv:2104.13483 [math.NA, quant-ph].

30. Breiten, T, Kunisch, K, and Pfeiffer, L. Taylor Expansions of the Value Function Associated with a Bilinear Optimal Control Problem. Ann de l'Institut Henri Poincaré C, Analyse non linéaire (2019) 36(5):1361–99. doi:10.1016/j.anihpc.2019.01.001

31. Hansen, M, and Schwab, C. Analytic Regularity and Nonlinear Approximation of a Class of Parametric Semilinear Elliptic PDEs. Mathematische Nachrichten (2012) 286(8-9):832–60. doi:10.1002/mana.201100131

32. Eigel, M, Schneider, R, and Trunschke, P. Convergence Bounds for Empirical Nonlinear Least-Squares. arXiv (2020). arXiv:2001.00639 [cs, math].

33. Oliphant, T. Guide to NumPy (2006).

34. Huber, B, and Wolf, S. Xerus - A General Purpose Tensor Library (2014).

35. Espig, M, Hackbusch, W, Handschuh, S, and Schneider, R. Optimization Problems in Contracted Tensor Networks. Comput Vis Sci. (2011) 14(6):271–85. doi:10.1007/s00791-012-0183-y

36. Oseledets, IV. Tensor-Train Decomposition. SIAM J Sci Comput (2011) 33(5):2295–317. doi:10.1137/090752286

37. Hackbusch, W. On the Representation of Symmetric and Antisymmetric Tensors. Preprint. Leipzig, Germany: Max Planck Institute for Mathematics in the Sciences (2016).

38. Wolf, S. Low Rank Tensor Decompositions for High Dimensional Data Approximation, Recovery and Prediction. [PhD thesis]. TU Berlin (2019).

39. Cohen, A, and Migliorati, G. Optimal Weighted Least-Squares Methods. SMAI J Comput Math (2017) 3:181–203. doi:10.5802/smai-jcm.24

40. Haberstich, C, Nouy, A, and Perrin, G. Boosted Optimal Weighted Least-Squares. arXiv (2020). arXiv:1912.07075 [math.NA].

41. Göttlich, S, and Schillinger, T. Microscopic and Macroscopic Traffic Flow Models Including Random Accidents (2021).

42. Rasmussen, CE, and Williams, CKI. Gaussian Processes for Machine Learning. Cambridge: MIT Press (2006).

43. Cornford, D, Nabney, IT, and Williams, CKI. Modelling Frontal Discontinuities in Wind Fields. J Nonparametric Stat (2002) 14(1-2):43–58. doi:10.1080/10485250211392

44. Szalay, S, Pfeffer, M, Murg, V, Barcza, G, Verstraete, F, Schneider, R, et al. Tensor Product Methods and Entanglement Optimization for Ab Initio Quantum Chemistry. Int J Quan Chem (2015) 115(19):1342–91. doi:10.1002/qua.24898

45. Michel, B, and Nouy, A. Learning with Tree Tensor Networks: Complexity Estimates and Model Selection. arXiv (2021). arXiv:2007.01165 [math.ST].

46. Ballani, J, and Grasedyck, L. Tree Adaptive Approximation in the Hierarchical Tensor Format. SIAM J Sci Comput (2014) 36(4):A1415–A1431. doi:10.1137/130926328

47. Curtain, RF, and Zwart, H. An Introduction to Infinite-Dimensional Linear Systems Theory. New York: Springer (1995).

48. Steinlechner, M. Riemannian Optimization for High-Dimensional Tensor Completion. SIAM J Sci Comput (2016) 38(5):S461–S484. doi:10.1137/15m1010506

Keywords: sample efficiency, homogeneous polynomials, sparse tensor networks, alternating least square, empirical L2 approximation

Citation: Götte M, Schneider R and Trunschke P (2021) A Block-Sparse Tensor Train Format for Sample-Efficient High-Dimensional Polynomial Regression. Front. Appl. Math. Stat. 7:702486. doi: 10.3389/fams.2021.702486

Received: 29 April 2021; Accepted: 21 July 2021;
Published: 07 September 2021.

Edited by:

Edoardo Angelo Di Napoli, Helmholtz-Verband Deutscher Forschungszentren (HZ), Germany

Reviewed by:

Mazen Ali, École centrale de Nantes, France
Katharina Kormann, Uppsala University, Sweden
Antonio Falco, Universidad CEU Cardenal Herrera, Spain

Copyright © 2021 Götte, Schneider and Trunschke. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Philipp Trunschke, ptrunschke@mail.tu-berlin.de
