
ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 13 January 2022
Sec. Statistics and Probability
This article is part of the Research Topic 2022 Applied Mathematics and Statistics – Editor’s Pick

A Statistically and Numerically Efficient Independence Test Based on Random Projections and Distance Covariance

  • Georgia Institute of Technology, Atlanta, GA, United States

Testing for independence plays a fundamental role in many statistical techniques. Among the nonparametric approaches, distance-based methods (such as distance correlation-based hypothesis testing for independence) have many advantages compared with the alternatives. A known limitation of the distance-based methods is that their computational complexity can be high. In general, when the sample size is n, the computational complexity of a distance-based method, which typically requires computing all pairwise distances, is of order O(n^2). Recent advances have shown that in the univariate case, a fast method with O(n log n) computational complexity and O(n) memory requirement exists. In this paper, we introduce a test of independence based on random projections and distance correlation, which achieves nearly the same power as the state-of-the-art distance-based approach, works in the multivariate case, and enjoys O(nK log n) computational complexity and O(max{n, K}) memory requirement, where K is the number of random projections. Note that savings are achieved when K < n/log n. We name our method Randomly Projected Distance Covariance (RPDC). The statistical theoretical analysis takes advantage of techniques on random projections that are rooted in contemporary machine learning. Numerical experiments demonstrate the efficiency of the proposed method relative to numerous competitors.

1 Introduction

Testing for independence is a fundamental problem in statistics, with much existing work, including the maximal information coefficient (MIC) [1], copula-based measures [2,3], the kernel-based criterion [4], and the distance correlation [5,6], which motivated our current work. Note that the above works, as well as ours, focus on testing for independence, which can be formulated as a statistical hypothesis testing problem. On the other hand, interesting developments (e.g., [7]) aim at a more general framework for interpretable statistical dependence, which is not the goal of this paper.

Distance correlation, proposed by [6], is an important method in the test of independence. A direct implementation of distance correlation takes O(n^2) time, where n is the sample size, so its time cost can be substantial when the sample size is just a few thousand. When the random variables are univariate, there exist efficient numerical algorithms of time complexity O(n log n) [8]. However, for multivariate random variables, an extensive literature survey did not turn up any comparably efficient algorithm.

Independence tests of multivariate random variables could have a wide range of applications. In many problem settings, as mentioned in [9], each experimental unit will be measured multiple times, resulting in multivariate data. Researchers are often interested in exploring potential relationships among subsets of these measurements. For example, some measurements may represent attributes of physical characteristics while others represent attributes of psychological characteristics. It may be of interest to determine whether there exists a relationship between the physical and psychological characteristics. A test of independence between pairs of vectors, where the vectors may have different dimensions and scales, becomes crucial. Moreover, the number of experimental units, or equivalently, sample size, could be massive, which requires the test to be computationally efficient. This work will meet the demands for numerically efficient independence tests of multivariate random variables.

The newly proposed test of independence between two (potentially multivariate) random variables X and Y works as follows. First, both X and Y are randomly projected to one-dimensional spaces. Then the fast algorithm for the distance covariance between a pair of univariate random variables is used to compute a surrogate distance covariance. The above two steps are repeated numerous times. The final estimate of the distance covariance is the average of all the aforementioned surrogate distance covariances.

For numerical efficiency, we will show (in Theorem 3.1) that the newly proposed algorithm enjoys O(Kn log n) computational complexity and O(max{n, K}) memory requirement, where K is the number of random projections and n is the sample size. Regarding statistical efficiency, we will show (in Theorem 4.19) that the asymptotic power of the test of independence based on the newly proposed statistic is as efficient as its original multivariate counterpart, which achieves the state-of-the-art rates.

The rest of this paper is organized as follows. In Section 2, we review the definition of distance covariance, its fast algorithm in univariate cases, and related distance-based independence tests. Section 3 gives the detailed algorithm for distance covariance of random vectors and corresponding independence tests. In Section 4, we present some theoretical properties on distance covariance and the asymptotic distribution of the proposed estimator. In Section 5, we conduct numerical examples to compare our method against others in the existing literature. Some discussions are presented in Section 6. We conclude in Section 7. All technical proofs, as well as the formal presentation of algorithms, are relegated to the appendix when appropriate.

Throughout this paper, we adopt the following notations. We denote $c_p = \frac{\pi^{(p+1)/2}}{\Gamma((p+1)/2)}$ and $c_q = \frac{\pi^{(q+1)/2}}{\Gamma((q+1)/2)}$ as two constants, where Γ(⋅) denotes the Gamma function. We will also need the following constants: $C_p = \frac{c_1 c_{p-1}}{c_p} = \frac{\sqrt{\pi}\,\Gamma((p+1)/2)}{\Gamma(p/2)}$ and $C_q = \frac{c_1 c_{q-1}}{c_q} = \frac{\sqrt{\pi}\,\Gamma((q+1)/2)}{\Gamma(q/2)}$. For any vector v, let $v^t$ denote its transpose.
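For concreteness, here is a minimal sketch (ours, not part of the paper) that evaluates the constants $c_p$ and $C_p$ with SciPy's Gamma function and checks numerically that the two expressions for $C_p$ agree; the function names are illustrative.

```python
import numpy as np
from scipy.special import gamma


def c_const(p):
    """c_p = pi^((p+1)/2) / Gamma((p+1)/2)."""
    return np.pi ** ((p + 1) / 2) / gamma((p + 1) / 2)


def C_const(p):
    """C_p = c_1 * c_{p-1} / c_p = sqrt(pi) * Gamma((p+1)/2) / Gamma(p/2)."""
    return np.sqrt(np.pi) * gamma((p + 1) / 2) / gamma(p / 2)


# The two expressions for C_p should coincide (up to floating-point error).
p = 10
assert np.isclose(C_const(p), c_const(1) * c_const(p - 1) / c_const(p))
```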

2 Review of Distance Covariance: Definition, Fast Algorithm, and Related Independence Tests

In this section, we review some related existing works. In Section 2.1, we recall the concept of distance variances and correlations, as well as some of their properties. In Section 2.2, we discuss the estimators of distance covariances and correlations, as well as their computation. We present their applications in the test of independence in Section 2.3.

2.1 Definition of Distance Covariances

Measuring and testing the dependency between two random variables is a fundamental problem in statistics. The classical Pearson’s correlation coefficient can be inaccurate and even misleading when nonlinear dependency exists. The authors of [6] propose a novel measure, the distance correlation, which is exactly zero if and only if two random variables are independent. A limitation is that if the distance correlation is implemented based on its original definition, the corresponding computational complexity can be as high as O(n^2), which is not desirable when n is large.

We review the definition of the distance correlation in [6]. Let us consider two random variables $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$, $p \ge 1$, $q \ge 1$. Let the complex-valued functions $\phi_{X,Y}(\cdot)$, $\phi_X(\cdot)$, and $\phi_Y(\cdot)$ be the characteristic functions of the joint density of X and Y, the density of X, and the density of Y, respectively. For any function $\phi$, we denote $|\phi|^2 = \phi\bar{\phi}$, where $\bar{\phi}$ is the conjugate of $\phi$; in words, $|\phi|$ is the magnitude of $\phi$ at a particular point. For vectors, let us use |⋅| to denote the Euclidean norm. In [6], the definition of the distance covariance between random variables X and Y is

$$\mathcal{V}^2(X,Y) = \int_{\mathbb{R}^{p+q}} \frac{|\phi_{X,Y}(t,s) - \phi_X(t)\phi_Y(s)|^2}{c_p c_q\, |t|^{p+1}|s|^{q+1}}\, dt\, ds,$$

where the two constants $c_p$ and $c_q$ have been defined at the end of Section 1. The distance correlation is defined as

$$\mathcal{R}^2(X,Y) = \frac{\mathcal{V}^2(X,Y)}{\sqrt{\mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y)}}.$$

The following property has been established in the aforementioned paper.

Theorem 2.1 Suppose $X \in \mathbb{R}^p$, $p \ge 1$, and $Y \in \mathbb{R}^q$, $q \ge 1$, are two random variables. The following statements are equivalent:

1) X is independent of Y;

2) $\phi_{X,Y}(t, s) = \phi_X(t)\phi_Y(s)$, for any $t \in \mathbb{R}^p$ and $s \in \mathbb{R}^q$;

3) $\mathcal{V}^2(X,Y) = 0$;

4) $\mathcal{R}^2(X,Y) = 0$.

Given a sample (X1, Y1), …, (Xn, Yn), we can estimate the distance covariance by replacing the population characteristic functions with the sample characteristic functions: for $i = \sqrt{-1}$, $t \in \mathbb{R}^p$, $s \in \mathbb{R}^q$, we define

$$\hat{\phi}_X(t) = \frac{1}{n}\sum_{j=1}^{n} e^{i X_j^t t}, \quad \hat{\phi}_Y(s) = \frac{1}{n}\sum_{j=1}^{n} e^{i Y_j^t s}, \quad\text{and}\quad \hat{\phi}_{X,Y}(t,s) = \frac{1}{n}\sum_{j=1}^{n} e^{i X_j^t t + i Y_j^t s}.$$

Consequently, one has the following estimator for $\mathcal{V}^2(X,Y)$:

$$\mathcal{V}_n^2(X,Y) = \int_{\mathbb{R}^{p+q}} \frac{|\hat{\phi}_{X,Y}(t,s) - \hat{\phi}_X(t)\hat{\phi}_Y(s)|^2}{c_p c_q\, |t|^{p+1}|s|^{q+1}}\, dt\, ds.$$

Note that the above formula is convenient for defining the quantity but not for computing it, due to the integration on the right-hand side. In the literature, other estimators have been introduced; they are presented in the following.

2.2 Fast Algorithm in the Univariate Cases

The paper [10] gives an equivalent definition for the distance covariance between random variables X and Y:

$$\mathcal{V}^2(X,Y) = E[d(X,X')\,d(Y,Y')] = E[|X - X'||Y - Y'|] - 2E[|X - X'||Y - Y''|] + E[|X - X'|]\,E[|Y - Y'|]$$

(here $(X', Y')$ and $(X'', Y'')$ denote independent copies of $(X, Y)$),

where the double centered distance d(⋅, ⋅) is defined as

$$d(X,X') = |X - X'| - E_X[|X - X'|] - E_{X'}[|X - X'|] + E[|X - X'|],$$

where $E_X$, $E_{X'}$, and $E$ are expectations over $X$, $X'$, and $(X, X')$, respectively.

Motivated by the above definition, one can give an unbiased estimator for $\mathcal{V}^2(X,Y)$. The following notations will be utilized: for $1 \le i, j \le n$,

$$a_{ij} = |X_i - X_j|, \quad b_{ij} = |Y_i - Y_j|, \quad a_{i\cdot} = \sum_{l=1}^{n} a_{il}, \quad b_{i\cdot} = \sum_{l=1}^{n} b_{il}, \quad a_{\cdot\cdot} = \sum_{k,l=1}^{n} a_{kl}, \quad\text{and}\quad b_{\cdot\cdot} = \sum_{k,l=1}^{n} b_{kl}.$$

It has been proven [8, 28] that

$$\Omega_n(X,Y) = \frac{1}{n(n-3)}\sum_{i \ne j} a_{ij} b_{ij} - \frac{2}{n(n-2)(n-3)}\sum_{i=1}^{n} a_{i\cdot} b_{i\cdot} + \frac{a_{\cdot\cdot}\, b_{\cdot\cdot}}{n(n-1)(n-2)(n-3)}$$

is an unbiased estimator of $\mathcal{V}^2(X,Y)$. In addition, a fast algorithm has been proposed [8] for the aforementioned sample distance covariance in the univariate case, with computational complexity O(n log n) and storage O(n). We list the result below for reference purposes.

Theorem 2.2 (Theorem 3.2 & Corollary 4.1 in [8]). Suppose $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n \in \mathbb{R}$. The unbiased estimator $\Omega_n$ defined in (2.5) can be computed by an O(n log n) algorithm. In addition, as a byproduct, the following result is established in the same paper.

Corollary 2.3 The quantity

$$\frac{a_{\cdot\cdot}\, b_{\cdot\cdot}}{n(n-1)(n-2)(n-3)} = \frac{\left(\sum_{k,l=1}^{n} a_{kl}\right)\left(\sum_{k,l=1}^{n} b_{kl}\right)}{n(n-1)(n-2)(n-3)}$$

can be computed by an O(n log n) algorithm. We will use the above result in our test of independence. However, as far as we know, in the multivariate case there does not exist any fast algorithm with O(n log n) complexity. This paper fills this gap by introducing an O(nK log n) complexity algorithm in the multivariate case.
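As a concrete reference, the following is a minimal sketch (ours, not the authors' implementation) of the unbiased estimator $\Omega_n$ in Eq. 2.5, evaluated directly from its definition; this direct evaluation costs O(n^2) time and memory, whereas the fast O(n log n) univariate algorithm of [8] is considerably more involved and is not reproduced here. Samples are assumed to be stored as NumPy arrays of shape (n, p) and (n, q), and the function name is illustrative.

```python
import numpy as np


def omega_n_direct(X, Y):
    """Direct O(n^2) evaluation of the unbiased estimator Omega_n (Eq. 2.5)."""
    n = X.shape[0]
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # a_ij = |X_i - X_j|
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)  # b_ij = |Y_i - Y_j|
    a_row, b_row = a.sum(axis=1), b.sum(axis=1)                # a_i., b_i.
    term1 = (a * b).sum() / (n * (n - 3))                      # diagonal entries are zero
    term2 = 2.0 * (a_row @ b_row) / (n * (n - 2) * (n - 3))
    term3 = a.sum() * b.sum() / (n * (n - 1) * (n - 2) * (n - 3))
    return term1 - term2 + term3
```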

2.3 Distance Based Independence Tests

Ref. [6] proposed an independence test using the distance covariance. We summarize it below as a theorem, which serves as a benchmark. Our test will be aligned with the following one, except that we introduce a new test statistic, which can be computed more efficiently and has comparable asymptotic properties to the test statistic used below.

Theorem 2.4 ([6], Theorem 6). For potentially multivariate random variables X and Y, a prescribed level $\alpha_s$, and sample size n, one rejects independence if and only if

$$\frac{n\,\mathcal{V}_n^2(X,Y)}{S_2} > \left(\Phi^{-1}(1 - \alpha_s/2)\right)^2,$$

where $\mathcal{V}_n^2(X,Y)$ has been defined in (2.2), Φ(⋅) denotes the cumulative distribution function of the standard normal distribution, and

$$S_2 = \frac{1}{n^4}\sum_{i,j=1}^{n}|X_i - X_j| \sum_{i,j=1}^{n}|Y_i - Y_j|.$$

Moreover, let α(X, Y, n) denote the achieved significance level of the above test. If $E[|X| + |Y|] < \infty$, then for all $0 < \alpha_s < 0.215$, one can show the following:

$$\lim_{n\to\infty} \alpha(X,Y,n) \le \alpha_s, \quad\text{and}\quad \sup\left\{\lim_{n\to\infty}\alpha(X,Y,n) : \mathcal{V}(X,Y) = 0\right\} = \alpha_s.$$

Note that the quantity $\mathcal{V}_n^2(X,Y)$ used above, as in [6], differs from the one that will be used in our proposed method. As mentioned, we use the above as an illustration of distance-based tests of independence, as well as of the theoretical/asymptotic properties that such a test can achieve.
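For concreteness, the following is a minimal sketch (ours, hedged) of the benchmark test in Theorem 2.4. It uses the standard double-centering identity for $\mathcal{V}_n^2$ from [6] in place of the integral definition in (2.2), and assumes the samples are stored as NumPy arrays of shape (n, p) and (n, q); the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm


def dcov_v2(X, Y):
    """Biased sample distance covariance V_n^2 via double-centered distance matrices."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()


def benchmark_test(X, Y, alpha_s=0.05):
    """Reject independence iff n * V_n^2 / S_2 exceeds the squared normal quantile."""
    n = X.shape[0]
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    S2 = a.sum() * b.sum() / n ** 4
    return n * dcov_v2(X, Y) / S2 > norm.ppf(1 - alpha_s / 2) ** 2
```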

3 Numerically Efficient Method for Random Vectors

This section consists of two parts. In Section 3.1, we present a random-projection-based distance covariance estimator that will be proven to be unbiased and to have O(Kn log n) computational complexity. In Section 3.2, we describe how the test of independence can be carried out using the above estimator. For users’ convenience, stand-alone algorithms are furnished in the Supplementary Appendix.

3.1 Random Projection Based Methods for Approximating Distance Covariance

We consider how to use a fast algorithm for univariate random variables to compute or approximate the sample distance covariance of random vectors. The main idea works as follows: first, project the multivariate observations onto some random directions; then, use the fast algorithm to compute the distance covariance of the projections; finally, average the distance covariances obtained from the different projection directions.

More specifically, our estimator can be computed as follows. For potentially multivariate $X_1, \ldots, X_n \in \mathbb{R}^p$ and $Y_1, \ldots, Y_n \in \mathbb{R}^q$, let K be a predetermined number of iterations; we do the following:

1) For each k (1 ≤ k ≤ K), randomly generate $u_k$ and $v_k$ from $\text{Uniform}(S^{p-1})$ and $\text{Uniform}(S^{q-1})$, respectively. Here $S^{p-1}$ and $S^{q-1}$ are the unit spheres in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively.

2) Let $u_k^t X$ and $v_k^t Y$ denote the projections of the samples of X and Y onto the directions $u_k$ and $v_k$, respectively. That is, we have

$$u_k^t X = (u_k^t X_1, \ldots, u_k^t X_n), \quad\text{and}\quad v_k^t Y = (v_k^t Y_1, \ldots, v_k^t Y_n).$$

Note that the samples $u_k^t X$ and $v_k^t Y$ are now univariate.

3) Utilize the fast (i.e., order O(n log n)) algorithm mentioned in Theorem 2.2 to compute the unbiased estimator in Eq. 2.5 with respect to $u_k^t X$ and $v_k^t Y$. Formally, we denote

$$\Omega_n^{(k)} = C_p C_q\, \Omega_n(u_k^t X, v_k^t Y),$$

where Cp and Cq have been defined at the end of Section 1.

4) The above three steps are repeated K times. The final estimator is

$$\bar{\Omega}_n = \frac{1}{K}\sum_{k=1}^{K}\Omega_n^{(k)}.$$

To emphasize the dependency of the above quantity on K, we sometimes use the notation $\bar{\Omega}_{n,K} \equiv \bar{\Omega}_n$.
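The following is a minimal sketch (ours) of the estimator just described. For brevity, the univariate unbiased estimator is evaluated by the direct O(n^2) formula in Eq. 2.5; in the actual method it would be replaced by the O(n log n) algorithm of Theorem 2.2. The function names `rpdc` and `omega_n_1d` are illustrative.

```python
import numpy as np
from scipy.special import gamma


def C_const(p):
    # C_p = sqrt(pi) * Gamma((p+1)/2) / Gamma(p/2), as defined in Section 1.
    return np.sqrt(np.pi) * gamma((p + 1) / 2) / gamma(p / 2)


def omega_n_1d(x, y):
    """Unbiased estimator Omega_n for univariate samples (direct O(n^2) version)."""
    n = len(x)
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    return ((a * b).sum() / (n * (n - 3))
            - 2.0 * (a.sum(axis=1) @ b.sum(axis=1)) / (n * (n - 2) * (n - 3))
            + a.sum() * b.sum() / (n * (n - 1) * (n - 2) * (n - 3)))


def rpdc(X, Y, K=50, seed=None):
    """Randomly Projected Distance Covariance: average of K projected estimators."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    total = 0.0
    for _ in range(K):
        u = rng.standard_normal(p); u /= np.linalg.norm(u)  # u ~ Uniform(S^{p-1})
        v = rng.standard_normal(q); v /= np.linalg.norm(v)  # v ~ Uniform(S^{q-1})
        total += C_const(p) * C_const(q) * omega_n_1d(X @ u, Y @ v)
    return total / K
```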

See Algorithm 1 in the Supplementary Appendix for a stand-alone presentation of the above method. In the light of Theorem 2.2, we can handily declare the following.

Theorem 3.1 For potentially multivariate $X_1, \ldots, X_n \in \mathbb{R}^p$ and $Y_1, \ldots, Y_n \in \mathbb{R}^q$, the computational complexity of computing the aforementioned $\bar{\Omega}_n$ is O(Kn log n), with storage O(max{n, K}), where K is the number of random projections. The proof of the above theorem is omitted because it is straightforward from Theorem 2.2. The statistical properties of the proposed estimator $\bar{\Omega}_n$ will be studied in the subsequent section (specifically in Section 4.4).

3.2 Test of Independence

By a later result (cf. Theorem 4.19), we can apply $\bar{\Omega}_n$ in independence tests. The corresponding asymptotic distribution of the test statistic $\bar{\Omega}_n$ can be approximated by a Gamma(α, β) distribution with α and β given in Eq. 4.7. We can compute the significance level of the test statistic by permutation and conduct the independence test accordingly. Recall that we have potentially multivariate $X_1, \ldots, X_n \in \mathbb{R}^p$ and $Y_1, \ldots, Y_n \in \mathbb{R}^q$, and that K denotes the number of Monte Carlo iterations in our previous algorithm. Let $\alpha_s$ denote the prescribed significance level of the independence test. Let L denote the number of random permutations that we will adopt. We would like to test the null hypothesis H0—X and Y are independent—against its alternative. Recall that $\bar{\Omega}_n$ is our proposed estimator in Eq. 3.1. The following algorithm describes a test of independence, which applies permutation to generate a threshold.

1) For each ℓ, 1 ≤ ℓ ≤ L, generate a random permutation of the sample of Y: $Y^{*,\ell} = (Y^*_1, \ldots, Y^*_n)$;

2) Using the algorithm in Section 3.1, compute the estimator $\bar{\Omega}_n$ as in Eq. 3.1 for X and $Y^{*,\ell}$; denote the outcome by $V_\ell = \bar{\Omega}_n(X, Y^{*,\ell})$. Note that under random permutations, X and $Y^{*,\ell}$ are independent.

3) The above two steps are executed for all ℓ = 1, …, L. One rejects H0 if and only if we have

$$\frac{1 + \sum_{\ell=1}^{L} I(\bar{\Omega}_n > V_\ell)}{1 + L} \ge 1 - \alpha_s.$$

See Algorithm 2 in the Supplementary Appendix for a stand-alone description.
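A minimal sketch (ours) of the permutation procedure above; `statistic` can be any dependence estimate, for instance the `rpdc` sketch given after Section 3.1. It returns the permutation p-value, with rejection when the p-value is at most $\alpha_s$, which is equivalent to the decision rule stated above up to how ties are counted.

```python
import numpy as np


def permutation_independence_test(X, Y, statistic, L=200, alpha_s=0.05, seed=None):
    """Permutation test of H0: X and Y are independent."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, Y)
    # Recompute the statistic after randomly permuting the rows of Y, which
    # makes X and the permuted Y independent by construction.
    null_stats = np.array([statistic(X, Y[rng.permutation(len(Y))])
                           for _ in range(L)])
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + L)
    return p_value <= alpha_s, p_value
```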

Note that one can instead use the approximate asymptotic distribution to estimate a threshold for the independence test. The following describes such an approach. Recall that the random vectors $X_1, \ldots, X_n \in \mathbb{R}^p$ and $Y_1, \ldots, Y_n \in \mathbb{R}^q$, the number of random projections K, and a prescribed significance level $\alpha_s$ have been mentioned earlier.

1) For each k (1 ≤ k ≤ K), randomly generate $u_k$ and $v_k$ from $\text{Uniform}(S^{p-1})$ and $\text{Uniform}(S^{q-1})$, respectively.

2) Use the fast algorithm in Theorem 2.2 to compute the following quantities:

$$\Omega_n^{(k)} = C_p C_q\, \Omega_n(u_k^t X, v_k^t Y), \quad S_{n,1}^{(k)} = C_p^2 C_q^2\, \Omega_n(u_k^t X, u_k^t X)\,\Omega_n(v_k^t Y, v_k^t Y), \quad S_{n,2}^{(k)} = \frac{C_p\, a_{\cdot\cdot}^{u_k}}{n(n-1)}, \quad S_{n,3}^{(k)} = \frac{C_q\, b_{\cdot\cdot}^{v_k}}{n(n-1)},$$

where $C_p$ and $C_q$ have been defined at the end of Section 1 and, in the last two equations, $a_{\cdot\cdot}^{u_k}$ and $b_{\cdot\cdot}^{v_k}$ are defined as follows:

$$a_{ij}^{u_k} = |u_k^t(X_i - X_j)|, \quad b_{ij}^{v_k} = |v_k^t(Y_i - Y_j)|, \quad a_{\cdot\cdot}^{u_k} = \sum_{i,j=1}^{n} a_{ij}^{u_k}, \quad b_{\cdot\cdot}^{v_k} = \sum_{i,j=1}^{n} b_{ij}^{v_k}.$$

3) For the aforementioned k, randomly generate fresh directions $u_k'$ and $v_k'$ from $\text{Uniform}(S^{p-1})$ and $\text{Uniform}(S^{q-1})$, respectively. Use the fast algorithm mentioned in Theorem 2.2 to compute the following:

$$\Omega_{n,X}^{(k)} = C_p^2\, \Omega_n(u_k^t X, u_k'^t X), \qquad \Omega_{n,Y}^{(k)} = C_q^2\, \Omega_n(v_k^t Y, v_k'^t Y),$$

where Cp and Cq have been defined at the end of Section 1.

4) Repeat the previous steps for all k = 1, …, K. Then we compute the following quantities:

$$\bar{\Omega}_n = \frac{1}{K}\sum_{k=1}^{K}\Omega_n^{(k)}, \quad \bar{S}_{n,1} = \frac{1}{K}\sum_{k=1}^{K} S_{n,1}^{(k)}, \quad \bar{S}_{n,2} = \frac{1}{K}\sum_{k=1}^{K} S_{n,2}^{(k)}, \quad \bar{S}_{n,3} = \frac{1}{K}\sum_{k=1}^{K} S_{n,3}^{(k)}, \quad \bar{\Omega}_{n,X} = \frac{1}{K}\sum_{k=1}^{K}\Omega_{n,X}^{(k)}, \quad \bar{\Omega}_{n,Y} = \frac{1}{K}\sum_{k=1}^{K}\Omega_{n,Y}^{(k)},$$

$$\alpha = \frac{1}{2}\cdot\frac{\bar{S}_{n,2}^2\,\bar{S}_{n,3}^2}{\frac{K-1}{K}\bar{\Omega}_{n,X}\bar{\Omega}_{n,Y} + \frac{1}{K}\bar{S}_{n,1}}, \qquad \beta = \frac{1}{2}\cdot\frac{\bar{S}_{n,2}\,\bar{S}_{n,3}}{\frac{K-1}{K}\bar{\Omega}_{n,X}\bar{\Omega}_{n,Y} + \frac{1}{K}\bar{S}_{n,1}}.$$

5) Reject H0 if $n\bar{\Omega}_n + \bar{S}_{n,2}\bar{S}_{n,3} > \text{Gamma}(\alpha, \beta; 1 - \alpha_s)$; otherwise, accept it. Here $\text{Gamma}(\alpha, \beta; 1 - \alpha_s)$ is the $1 - \alpha_s$ quantile of the Gamma(α, β) distribution.

The above procedure is motivated by the observation that the asymptotic distribution of the test statistic nΩ̄n can be approximated by a Gamma distribution, whose parameters can be estimated by Eq. 3.2 and Eq. 3.3. A stand-alone description of the above procedure can be found in Algorithm 3 in the Supplementary Appendix.
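A minimal sketch (ours) of steps 4–5 above, assuming the averaged quantities $\bar{\Omega}_n$, $\bar{S}_{n,1}$, $\bar{S}_{n,2}$, $\bar{S}_{n,3}$, $\bar{\Omega}_{n,X}$, $\bar{\Omega}_{n,Y}$ have already been computed from the K random projections. SciPy's Gamma distribution is parameterized by shape and scale, so the rate β enters as scale = 1/β; the function name is illustrative.

```python
from scipy.stats import gamma as gamma_dist


def gamma_threshold_test(n, K, omega_bar, s1_bar, s2_bar, s3_bar,
                         omega_bar_X, omega_bar_Y, alpha_s=0.05):
    """Gamma-approximation test: reject H0 iff the statistic exceeds the
    (1 - alpha_s) quantile of the moment-matched Gamma distribution."""
    # Denominator approximates the sum of squared eigenvalues (cf. Proposition 4.21).
    denom = (K - 1) / K * omega_bar_X * omega_bar_Y + s1_bar / K
    alpha = 0.5 * (s2_bar * s3_bar) ** 2 / denom
    beta = 0.5 * (s2_bar * s3_bar) / denom
    threshold = gamma_dist.ppf(1 - alpha_s, a=alpha, scale=1.0 / beta)
    return n * omega_bar + s2_bar * s3_bar > threshold
```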

4 Theoretical Properties

In this section, we establish the theoretical foundation of the proposed method. In Section 4.1, we study some properties of the random projections and the subsequent average estimator. These properties will be needed in studying the properties of the proposed estimator. We study the properties of the proposed distance covariance estimator (Ωn) in Section 4.2, taking advantage of the fact that Ωn is a U-statistic. It turns out that the properties of eigenvalues of a particular operator play an important role. We present the relevant results in Section 4.3. The main properties of the proposed estimator (Ω̄n) are presented in Section 4.4.

4.1 Using Random Projections in Distance-Based Methods

In this section, we will study some properties of distance covariances of randomly projected random vectors. We begin with a necessary and sufficient condition of independence.

Lemma 4.1 Suppose u and v are points on the unit hyper-spheres: $u \in S^{p-1} = \{u \in \mathbb{R}^p : |u| = 1\}$ and $v \in S^{q-1}$. We have

random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ are independent

if and only if

$$\mathcal{V}^2(u^t X, v^t Y) = 0, \quad \text{for any } u \in S^{p-1},\ v \in S^{q-1}.$$

The proof is relatively straightforward; we relegate a formal proof to the appendix. This lemma indicates that independence is preserved under projections. The main contribution of the above result is to motivate the use of random projections to reduce multivariate random vectors to univariate random variables; as mentioned earlier, there exist fast algorithms of distance-based methods for univariate random variables. The following result allows us to regard the distance covariance of random vectors of any dimension as an integral of distance covariances of univariate random variables, which are the projections of the aforementioned random vectors. The formulas in the following lemma provide the foundation for our proposed method: the distance covariance in the multivariate case can be written as an integral of distance covariances in the univariate case. Our proposed method essentially adopts the principle of Monte Carlo to approximate such integrals. We again relegate the proof to the Supplementary Appendix.

Lemma 4.2 Suppose u and v are points on the unit hyper-spheres: $u \in S^{p-1} = \{u \in \mathbb{R}^p : |u| = 1\}$ and $v \in S^{q-1}$. Let μ and ν denote the uniform probability measures on $S^{p-1}$ and $S^{q-1}$, respectively. Then, for random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, we have

$$\mathcal{V}^2(X,Y) = C_p C_q \int_{S^{p-1}\times S^{q-1}} \mathcal{V}^2(u^t X, v^t Y)\, d\mu(u)\, d\nu(v),$$

where $C_p$ and $C_q$ are the two constants defined at the end of Section 1. Moreover, a similar result holds for the sample distance covariance:

$$\mathcal{V}_n^2(X,Y) = C_p C_q \int_{S^{p-1}\times S^{q-1}} \mathcal{V}_n^2(u^t X, v^t Y)\, d\mu(u)\, d\nu(v).$$

Besides the integral equations in the above lemma, we can also establish the following result for the unbiased estimator. Such a result provides the direct foundation of our proposed method. Recall that $\Omega_n$, defined in Eq. 2.5, is an unbiased estimator of the distance covariance $\mathcal{V}^2(X,Y)$. A proof is provided in the Supplementary Appendix.

Lemma 4.3 Suppose u and v are points on the unit hyper-spheres: $u \in S^{p-1} = \{u \in \mathbb{R}^p : |u| = 1\}$ and $v \in S^{q-1}$. Let μ and ν denote the measures corresponding to the uniform densities on the surfaces $S^{p-1}$ and $S^{q-1}$, respectively. Then, we have

$$\Omega_n(X,Y) = C_p C_q \int_{S^{p-1}\times S^{q-1}} \Omega_n(u^t X, v^t Y)\, d\mu(u)\, d\nu(v),$$

where $C_p$ and $C_q$ are the constants mentioned at the end of Section 1. From the above lemma, recalling the design of our proposed estimator $\bar{\Omega}_n$ in Eq. 3.1, it is straightforward to see that $\bar{\Omega}_n$ is an unbiased estimator of $\Omega_n(X, Y)$. For completeness, we state the following without a proof.

Corollary 4.4 The proposed estimator $\bar{\Omega}_n$ in Eq. 3.1 is an unbiased estimator of the estimator $\Omega_n(X, Y)$ that was defined in Eq. 2.5. Note that the estimator $\bar{\Omega}_n$ in Eq. 3.1 evidently depends on the number of random projections K. Recall that, to emphasize such a dependency, we sometimes use the notation $\bar{\Omega}_{n,K} \equiv \bar{\Omega}_n$. The following concentration inequality shows the speed at which $\bar{\Omega}_{n,K}$ converges to $\Omega_n$ as $K \to \infty$.

Lemma 4.5 Suppose $E[|X|^2] < \infty$ and $E[|Y|^2] < \infty$. For any ϵ > 0, we have

$$P\left(\left|\bar{\Omega}_{n,K} - \Omega_n\right| > \epsilon\right) \le 2\exp\left(-\frac{C K \epsilon^2}{\mathrm{Tr}[\Sigma_X]\,\mathrm{Tr}[\Sigma_Y]}\right),$$

where $\Sigma_X$ and $\Sigma_Y$ are the covariance matrices of X and Y, respectively, $\mathrm{Tr}[\Sigma_X]$ and $\mathrm{Tr}[\Sigma_Y]$ are their matrix traces, and $C = \frac{2}{25\, C_p^2 C_q^2}$ is a constant. The proof is a relatively standard application of Hoeffding's inequality [11] and has been relegated to the appendix. The above lemma essentially indicates that the quantity $|\bar{\Omega}_{n,K} - \Omega_n|$ converges to zero at a rate no worse than $O(1/\sqrt{K})$.

4.2 Asymptotic Properties of the Sample Distance Covariance Ωn

The asymptotic behavior of the sample distance covariance $\Omega_n$ in Eq. 2.5 has been studied in many places; see [5,8,10,12]. We find it worthwhile to present the relevant results here, as we will use them to establish the statistical properties of our proposed estimator. The asymptotic distribution of $\Omega_n$ will be studied under two situations: 1) the general case and 2) the case in which X and Y are assumed to be independent. We will see that the asymptotic distributions are different in these two situations.

It has been shown in ([8], Theorem 3.2) that $\Omega_n$ is a U-statistic. In the following, we state the result without formal proof. We will need the following function, denoted by $h_4$, which takes four pairs of input variables:

$$h_4\big((X_1,Y_1),(X_2,Y_2),(X_3,Y_3),(X_4,Y_4)\big) = \frac{1}{4}\sum_{\substack{1\le i,j\le 4 \\ i\ne j}} |X_i - X_j||Y_i - Y_j| - \frac{1}{4}\sum_{i=1}^{4}\Bigg(\sum_{\substack{1\le j\le 4 \\ j\ne i}}|X_i - X_j|\Bigg)\Bigg(\sum_{\substack{1\le j\le 4 \\ j\ne i}}|Y_i - Y_j|\Bigg) + \frac{1}{24}\Bigg(\sum_{\substack{1\le i,j\le 4 \\ i\ne j}}|X_i - X_j|\Bigg)\Bigg(\sum_{\substack{1\le i,j\le 4 \\ i\ne j}}|Y_i - Y_j|\Bigg).$$

Note that the definition of h4 coincides with Ωn when the number of observations n = 4.
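A minimal sketch (ours) of the kernel $h_4$; each argument is a pair of vectors, and by construction the value agrees with $\Omega_n$ computed on a sample of size n = 4.

```python
import numpy as np


def h4(pairs):
    """Evaluate h_4 on four (X_i, Y_i) pairs of vectors."""
    xs = np.array([x for x, _ in pairs], dtype=float)
    ys = np.array([y for _, y in pairs], dtype=float)
    a = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=2)
    b = np.linalg.norm(ys[:, None, :] - ys[None, :, :], axis=2)
    term1 = (a * b).sum() / 4.0                    # (1/4) sum_{i != j} a_ij b_ij
    term2 = (a.sum(axis=1) @ b.sum(axis=1)) / 4.0  # (1/4) sum_i a_i. b_i.
    term3 = a.sum() * b.sum() / 24.0               # (1/24) a.. b..
    return term1 - term2 + term3
```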

Lemma 4.6 (U-statistic). Let $\Psi_4$ denote the set of all distinct 4-subsets of {1, …, n}, and let us define $X_\psi = \{X_i \mid i \in \psi\}$ and $Y_\psi = \{Y_i \mid i \in \psi\}$. Then $\Omega_n$ is a U-statistic and can be expressed as

$$\Omega_n = \binom{n}{4}^{-1}\sum_{\psi\in\Psi_4} h_4\big(X_\psi, Y_\psi\big).$$

From the literature of the U-statistics, we know that the following quantities play critical roles. We state them here:

$$h_1((X_1,Y_1)) = E_{2,3,4}\big[h_4((X_1,Y_1),(X_2,Y_2),(X_3,Y_3),(X_4,Y_4))\big], \quad h_2((X_1,Y_1),(X_2,Y_2)) = E_{3,4}\big[h_4((X_1,Y_1),(X_2,Y_2),(X_3,Y_3),(X_4,Y_4))\big], \quad h_3((X_1,Y_1),(X_2,Y_2),(X_3,Y_3)) = E_4\big[h_4((X_1,Y_1),(X_2,Y_2),(X_3,Y_3),(X_4,Y_4))\big],$$

where $E_{2,3,4}$ stands for taking the expectation over $(X_2, Y_2)$, $(X_3, Y_3)$, and $(X_4, Y_4)$; $E_{3,4}$ stands for taking the expectation over $(X_3, Y_3)$ and $(X_4, Y_4)$; and $E_4$ stands for taking the expectation over $(X_4, Y_4)$. One immediate application of the above notations is the following result, which quantifies the variance of $\Omega_n$. Since the formula is a known result (see [13], Chapter 5.2.1, Lemma A), we state it without proof.

Lemma 4.7 (Variance of the U-statistic). The variance of Ωn could be written as

$$\mathrm{Var}(\Omega_n) = \binom{n}{4}^{-1}\sum_{l=1}^{4}\binom{4}{l}\binom{n-4}{4-l}\mathrm{Var}(h_l) = \frac{16}{n}\mathrm{Var}(h_1) + \frac{240}{n^2}\mathrm{Var}(h_1) + \frac{72}{n^2}\mathrm{Var}(h_2) + O\!\left(\frac{1}{n^3}\right),$$

where O(⋅) is the standard big-O notation. From the above lemma, we can see that Var(h1) and Var(h2) play indispensable roles in determining the variance of $\Omega_n$. The following lemma shows that, under some conditions, we can ensure that Var(h1) and Var(h2) are bounded. A proof has been relegated to the appendix.

Lemma 4.8 If we have $E[|X|^2] < \infty$, $E[|Y|^2] < \infty$, and $E[|X|^2 |Y|^2] < \infty$, then we have Var(h4) < ∞. Consequently, we also have Var(h1) < ∞ and Var(h2) < ∞. Even though, as indicated in Lemma 4.7, the quantities $h_1(X_1, Y_1)$ and $h_2((X_1, Y_1), (X_2, Y_2))$ play important roles in determining the variance of $\Omega_n$, in the generic case they do not have simple formulas. The following lemma gives the generic formulas for $h_1(X_1, Y_1)$ and $h_2((X_1, Y_1), (X_2, Y_2))$. Its calculation can be found in the Supplementary Appendix.

Lemma 4.9 (Generic h1 and h2). In the general case, assuming $(X_1, Y_1)$, $(X, Y)$, $(X', Y')$, and $(X'', Y'')$ are independent and identically distributed, we have

$$h_1((X_1,Y_1)) = \tfrac{1}{2}E[|X_1 - X||Y_1 - Y|] - \tfrac{1}{2}E[|X_1 - X||Y_1 - Y'|] + \tfrac{1}{2}E[|X_1 - X||Y' - Y''|] - \tfrac{1}{2}E[|X_1 - X||Y - Y'|] + \tfrac{1}{2}E[|X - X'||Y_1 - Y''|] - \tfrac{1}{2}E[|X - X'||Y_1 - Y|] + \tfrac{1}{2}E[|X - X'||Y - Y'|] - \tfrac{1}{2}E[|X - X'||Y - Y''|].$$

We have a similar formula for $h_2((X_1, Y_1), (X_2, Y_2))$ in (B.7); due to its length, we do not display it here. If one assumes that X and Y are independent, we can obtain simpler formulas for h1 and h2, as well as their corresponding variances. We list the results below, with detailed calculations relegated to the appendix. One can see that under independence, the corresponding formulas are much simpler.

Lemma 4.10 When X and Y are independent, we have the following. For (X, Y) and (X′, Y′) that are independent and identically distributed as (X1, Y1) and (X2, Y2), we have

$$h_1((X_1,Y_1)) = 0,$$

$$h_2((X_1,Y_1),(X_2,Y_2)) = \frac{1}{6}\big(|X_1 - X_2| - E[|X_1 - X|] - E[|X_2 - X|] + E[|X - X'|]\big)\big(|Y_1 - Y_2| - E[|Y_1 - Y|] - E[|Y_2 - Y|] + E[|Y - Y'|]\big),$$

$$\mathrm{Var}(h_2) = \frac{1}{36}\,\mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y),$$

where E stands for the expectation with respect to X, (X, X'), Y, or (Y, Y'), whichever is appropriate. If we have 0 < Var(h1) < ∞, it is known that the asymptotic distribution of $\Omega_n$ is normal, as stated in the following. Note that, based on Lemma 4.10, X and Y cannot be independent in that case; otherwise one would have h1 = 0 almost surely. The following theorem is based on a known result on the convergence of U-statistics (see [13], Chapter 5.5.1, Theorem A). We state it without a proof.

Theorem 4.11 Suppose 0 < Var(h1) < ∞ and Var(h4) < ∞. Then we have

$$\Omega_n \xrightarrow{\;P\;} \mathcal{V}^2(X,Y);$$

moreover, we have

$$\sqrt{n}\left(\Omega_n - \mathcal{V}^2(X,Y)\right) \xrightarrow{\;D\;} N\big(0,\, 16\,\mathrm{Var}(h_1)\big), \quad \text{as } n \to \infty.$$

When X and Y are independent, the asymptotic distribution of nΩn is no longer normal. In this case, from Lemma 4.10, we have

$$h_1((X_1,Y_1)) = 0 \ \text{almost surely}, \quad\text{and}\quad \mathrm{Var}[h_1((X_1,Y_1))] = 0.$$

The following theorem, which applies a result in ([13], Chapter 5.5.2), indicates that $n\Omega_n$ converges to a weighted sum of (possibly infinitely many) independent $\chi_1^2$ random variables.

Theorem 4.12 If X and Y are independent, the asymptotic distribution of Ωn is

$$n\Omega_n \xrightarrow{\;D\;} \sum_{i=1}^{\infty}\lambda_i (Z_i^2 - 1) = \sum_{i=1}^{\infty}\lambda_i Z_i^2 - \sum_{i=1}^{\infty}\lambda_i,$$

where the $Z_i^2 \sim \chi_1^2$ are i.i.d., and the $\lambda_i$'s are the eigenvalues of the operator G that is defined as

$$G g(x_1,y_1) = E_{x_2,y_2}\big[6\, h_2((x_1,y_1),(x_2,y_2))\, g(x_2,y_2)\big],$$

where function h2((⋅, ⋅), (⋅, ⋅)) was defined in (4.3).

Proof. The asymptotic distribution of $\Omega_n$ follows from the result in ([13], Chapter 5.5.2). See Section 4.3 for more details on methods for computing the values of the $\lambda_i$'s. In particular, we will show that $\sum_{i=1}^{\infty}\lambda_i = E[|X - X'|]\,E[|Y - Y'|]$ (Corollary 4.15) and $\sum_{i=1}^{\infty}\lambda_i^2 = \mathcal{V}^2(X,X)\,\mathcal{V}^2(Y,Y)$ (which essentially follows from Eq. 4.4 and Lemma 4.7).

4.3 Properties of Eigenvalues λi’s

From Theorem 4.12, we see that the eigenvalues $\lambda_i$ play an important role in determining the asymptotic distribution of $\Omega_n$. We study their properties here. Throughout this subsection, we assume that X and Y are independent. Let us recall the asymptotic distribution of the sample distance covariance $\Omega_n$:

$$n\Omega_n \xrightarrow{\;D\;} \sum_{i=1}^{\infty}\lambda_i (Z_i^2 - 1) = \sum_{i=1}^{\infty}\lambda_i Z_i^2 - \sum_{i=1}^{\infty}\lambda_i,$$

where the $\lambda_i$'s are the eigenvalues of the operator G that is defined as

$$G g(x_1,y_1) = E_{x_2,y_2}\big[6\, h_2((x_1,y_1),(x_2,y_2))\, g(x_2,y_2)\big],$$

where the function h2((⋅, ⋅), (⋅, ⋅)) was defined in Eq. 4.3. By definition, the eigenvalues λ1, λ2, … correspond to distinct solutions of the following equation:

$$G g(x_1,y_1) = \lambda\, g(x_1,y_1).$$

We now study the properties of λi’s. Utilizing Lemma 12 and Eq. 4.4 in [12], we can verify the following result. We give details of verifications in the Supplementary Appendix.

Lemma 4.13 Both of the following two functions are positive definite kernels:

$$h_X(X_1,X_2) = -|X_1 - X_2| + E[|X_1 - X|] + E[|X_2 - X|] - E[|X - X'|]$$

and

$$h_Y(Y_1,Y_2) = -|Y_1 - Y_2| + E[|Y_1 - Y|] + E[|Y_2 - Y|] - E[|Y - Y'|].$$

The above result gives us a foundation to apply the equivalence result that has been articulated thoroughly in [12]. Equipped with the above lemma, we have the following result, which characterizes a property of the $\lambda_i$'s. The detailed proof can be found in the Supplementary Appendix.

Lemma 4.14 Suppose {λ1, λ2, …} is the set of eigenvalues of the kernel $6 h_2((x_1, y_1), (x_2, y_2))$, and $\{\lambda_1^X, \lambda_2^X, \ldots\}$ and $\{\lambda_1^Y, \lambda_2^Y, \ldots\}$ are the sets of eigenvalues of the positive definite kernels $h_X$ and $h_Y$, respectively. We have the following:

$$\{\lambda_1, \lambda_2, \ldots\} = \big\{\lambda_j^X \cdot \lambda_{j'}^Y : j, j' = 1, 2, \ldots\big\};$$

that is, each $\lambda_i$ satisfying (4.5) can be written as, for some j, j′,

$$\lambda_i = \lambda_j^X\, \lambda_{j'}^Y,$$

where $\lambda_j^X$ and $\lambda_{j'}^Y$ are eigenvalues corresponding to the kernel functions $h_X(X_1, X_2)$ and $h_Y(Y_1, Y_2)$, respectively. The above lemma implies that the eigenvalues of h2 can be obtained immediately once the eigenvalues of $h_X$ and $h_Y$ are known. In practice, however, there usually does not exist an analytic solution even for the eigenvalues of $h_X$ or $h_Y$. Instead, given the observations $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$, we can compute the eigenvalues of the matrices $\tilde{K}_X = (h_X(X_i,X_j))_{n\times n}$ and $\tilde{K}_Y = (h_Y(Y_i,Y_j))_{n\times n}$ and use those empirical eigenvalues to approximate $\lambda_1^X, \lambda_2^X, \ldots$ and $\lambda_1^Y, \lambda_2^Y, \ldots$, and consequently λ1, λ2, … We end this subsection with the following corollary on the summations of the eigenvalues, which is necessary for the proof of Theorem 4.12. The proof can be found in the Supplementary Appendix.

Corollary 4.15 The aforementioned eigenvalues $\lambda_1^X, \lambda_2^X, \ldots$ and $\lambda_1^Y, \lambda_2^Y, \ldots$ satisfy

$$\sum_{i=1}^{\infty}\lambda_i^X = E[|X - X'|], \quad\text{and}\quad \sum_{i=1}^{\infty}\lambda_i^Y = E[|Y - Y'|].$$

As a result, we have

$$\sum_{i=1}^{\infty}\lambda_i = E[|X - X'|]\, E[|Y - Y'|],$$

and

$$\sum_{i=1}^{\infty}\lambda_i^2 = \mathcal{V}^2(X,X)\, \mathcal{V}^2(Y,Y).$$
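As an illustration of the empirical approximation described after Lemma 4.14, the following sketch (ours) forms the matrix $\tilde{K}_X = (h_X(X_i, X_j))_{n\times n}$ with all expectations replaced by sample means and returns its eigenvalues; dividing by n is our normalization choice, so that the trace matches the sample analogue of $E[|X - X'|]$ in Corollary 4.15. The same routine applied to Y gives the $\lambda^Y$'s, and products of the two sets approximate the $\lambda_i$'s (Lemma 4.14).

```python
import numpy as np


def empirical_kernel_eigenvalues(X):
    """Eigenvalues of K_X / n, where
    K_X[i, j] = -|X_i - X_j| + mean_l |X_i - X_l| + mean_l |X_j - X_l| - mean |X - X'|."""
    n = X.shape[0]
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    row_means = a.mean(axis=1)
    K = -a + row_means[:, None] + row_means[None, :] - a.mean()
    return np.linalg.eigvalsh(K / n)
```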

4.4 Asymptotic Properties of Averaged Projected Sample Distance Covariance Ω̄n

We reviewed the properties of the statistic $\Omega_n$ in a previous section (Section 4.2). The disadvantage of directly applying $\Omega_n$ (defined in Eq. 2.5) is that, for multivariate X and Y, the implementation may require at least O(n^2) operations. Recall that for univariate X and Y, an O(n log n) algorithm exists, cf. Theorem 2.2. The proposed estimator ($\bar{\Omega}_n$ in Eq. 3.1) is the average of distance covariances obtained after randomly projecting X and Y to one-dimensional spaces. In this section, we study the asymptotic behavior of $\bar{\Omega}_n$. It turns out that the analysis is similar to that in Section 4.2. The asymptotic distribution of $\bar{\Omega}_n$ differs in two cases: (1) the general case and (2) the case in which X and Y are independent.

As preparation for presenting the main result, we recall and introduce some notations. Recall the definition of Ω̄n:

$$\bar{\Omega}_n = \frac{1}{K}\sum_{k=1}^{K}\Omega_n^{(k)},$$

where

$$\Omega_n^{(k)} = C_p C_q\, \Omega_n(u_k^t X, v_k^t Y)$$

and the constants $C_p$, $C_q$ have been defined at the end of Section 1. By Corollary 4.4, we have $E[\Omega_n^{(k)}] = \Omega_n$, where E stands for the expectation with respect to the random projections. Note that, from the work in Section 4.2, the estimator $\Omega_n^{(k)}$ is a U-statistic. The following equation reveals that the estimator $\bar{\Omega}_n$ is also a U-statistic:

$$\bar{\Omega}_n = \binom{n}{4}^{-1}\sum_{\psi\in\Psi_4} \frac{C_p C_q}{K}\sum_{k=1}^{K} h_4\big(u_k^t X_\psi, v_k^t Y_\psi\big) \equiv \binom{n}{4}^{-1}\sum_{\psi\in\Psi_4} \bar{h}_4\big(X_\psi, Y_\psi\big),$$

where

$$\bar{h}_4\big(X_\psi, Y_\psi\big) = \frac{1}{K}\sum_{k=1}^{K} C_p C_q\, h_4\big(u_k^t X_\psi, v_k^t Y_\psi\big).$$

We have seen that quantities h1 and h2 play significant roles in the asymptotic behavior of statistic Ωn. Let us define the counterpart notations as follows:

$$\bar{h}_1((X_1,Y_1)) = E_{2,3,4}\big[\bar{h}_4((X_1,Y_1),(X_2,Y_2),(X_3,Y_3),(X_4,Y_4))\big] \equiv \frac{1}{K}\sum_{k=1}^{K} h_1^{(k)}, \qquad \bar{h}_2((X_1,Y_1),(X_2,Y_2)) = E_{3,4}\big[\bar{h}_4((X_1,Y_1),(X_2,Y_2),(X_3,Y_3),(X_4,Y_4))\big] \equiv \frac{1}{K}\sum_{k=1}^{K} h_2^{(k)},$$

where $E_{2,3,4}$ stands for taking the expectation over $(X_2, Y_2)$, $(X_3, Y_3)$, and $(X_4, Y_4)$, and $E_{3,4}$ stands for taking the expectation over $(X_3, Y_3)$ and $(X_4, Y_4)$; as well as the following:

$$h_1^{(k)} = E_{2,3,4}\big[C_p C_q\, h_4(u_k^t X_\psi, v_k^t Y_\psi)\big], \qquad h_2^{(k)} = E_{3,4}\big[C_p C_q\, h_4(u_k^t X_\psi, v_k^t Y_\psi)\big].$$

In the general case, we do not assume that X and Y are independent. Let U = (u1, …, uK) and V = (v1, …, vK) denote the collections of random projections. We can write the variance of $\bar{\Omega}_n$ as follows. The proof is an application of Lemma 4.7 and the law of total covariance; we relegate it to the Supplementary Appendix.

Lemma 4.16 Suppose $E_{U,V}[\mathrm{Var}_{X,Y}(\bar{h}_1 \mid U,V)] > 0$ and $\mathrm{Var}_{u,v}\big(\mathcal{V}^2(u^t X, v^t Y)\big) > 0$. Then the variance of $\bar{\Omega}_n$ is

$$\mathrm{Var}(\bar{\Omega}_n) = \frac{1}{K}\mathrm{Var}_{u,v}\big(\mathcal{V}^2(u^t X, v^t Y)\big) + \frac{16}{n} E_{U,V}\big[\mathrm{Var}_{X,Y}(\bar{h}_1 \mid U,V)\big] + \frac{72}{n^2} E_{U,V}\big[\mathrm{Var}_{X,Y}(\bar{h}_2 \mid U,V)\big] + O\!\left(\frac{1}{n^3}\right).$$

Equipped with the above lemma, we can summarize the asymptotic properties in the following theorem. We state it without proof as it is an immediate result from Lemma 4.16 as well as the contents in ([13], Chapter 5.5.1 Theorem A).

Theorem 4.17 Suppose $0 < E_{U,V}[\mathrm{Var}_{X,Y}(\bar{h}_1 \mid U,V)] < \infty$ and $E_{U,V}[\mathrm{Var}_{X,Y}(\bar{h}_4 \mid U,V)] < \infty$. Also, assume that K → ∞ and n → ∞; then we have

$$\bar{\Omega}_n \xrightarrow{\;P\;} \mathcal{V}^2(X,Y).$$

And, the asymptotic distribution of Ω̄n could differ under different conditions.

1) If K → ∞ and K/n → 0, then

$$\sqrt{K}\left(\bar{\Omega}_n - \mathcal{V}^2(X,Y)\right) \xrightarrow{\;D\;} N\Big(0,\ \mathrm{Var}_{u,v}\big(\mathcal{V}^2(u^t X, v^t Y)\big)\Big).$$

2) If n → ∞ and K/n → ∞, then

$$\sqrt{n}\left(\bar{\Omega}_n - \mathcal{V}^2(X,Y)\right) \xrightarrow{\;D\;} N\Big(0,\ 16\, E_{U,V}\big[\mathrm{Var}_{X,Y}(\bar{h}_1 \mid U,V)\big]\Big).$$

3) If n → ∞ and K/n → C, where C is some constant, then

$$\sqrt{n}\left(\bar{\Omega}_n - \mathcal{V}^2(X,Y)\right) \xrightarrow{\;D\;} N\Big(0,\ \frac{1}{C}\mathrm{Var}_{u,v}\big(\mathcal{V}^2(u^t X, v^t Y)\big) + 16\, E_{U,V}\big[\mathrm{Var}_{X,Y}(\bar{h}_1 \mid U,V)\big]\Big).$$

Since our main idea is to utilize Ω̄n to approximate the quantity Ωn, it is of interest to compare the asymptotic variance of Ωn in Theorem 4.11 with the asymptotic variances in the above theorem. We present some discussions in the following remark.

Remark 4.18 Let us recall the asymptotic properties of Ωn,

$$\sqrt{n}\left(\Omega_n - \mathcal{V}^2(X,Y)\right) \xrightarrow{\;D\;} N\big(0,\ 16\,\mathrm{Var}(h_1)\big).$$

Then, we make the comparison in the following different scenarios.

1) If K → ∞ and K/n → 0, then the convergence rate of Ω̄n is much slower than Ωn as K ≪ n.

2) If n → ∞ and K/n → ∞, then the convergence rate of $\bar{\Omega}_n$ is the same as that of $\Omega_n$, and their asymptotic variances are also the same.

3) If n → ∞ and K/n → C, where C is some constant, then the convergence rate of $\bar{\Omega}_n$ is the same as that of $\Omega_n$, but the asymptotic variance of $\bar{\Omega}_n$ is larger than that of $\Omega_n$.

Generally, when X is not independent of Y, $\bar{\Omega}_n$ is as good as $\Omega_n$ in terms of convergence rate. However, for the independence test, the convergence rate of the test statistic under the null hypothesis is of more interest. In the remainder of this section, we show that $\bar{\Omega}_n$ has the same convergence rate as $\Omega_n$ when X is independent of Y. Now, let us consider the case in which X and Y are independent. Similarly, by Lemma 4.10, we have

$$h_1^{(k)} = 0, \quad \bar{h}_1 = 0 \ \text{almost surely}, \quad\text{and}\quad \mathrm{Var}(\bar{h}_1) = 0.$$

And, by Lemma 4.1, we know that

$$\mathcal{V}^2(u^t X, v^t Y) = 0, \quad \forall\, u, v,$$

which implies

$$\mathrm{Var}_{u,v}\big(\mathcal{V}^2(u^t X, v^t Y)\big) = 0.$$

Therefore, we only need to consider $\mathrm{Var}_{X,Y}(\bar{h}_2 \mid U,V)$. Suppose (U, V) is given; a result in ([13], Chapter 5.5.2), together with Lemma 4.16, indicates that $n\bar{\Omega}_n$ converges to a weighted sum of (possibly infinitely many) independent $\chi_1^2$ random variables. The proof can be found in the Supplementary Appendix.

Theorem 4.19 If X and Y are independent, given the value of U = (u1, …, uK) and V = (v1, …, vK), the asymptotic distribution of Ω̄n is

$$n\bar{\Omega}_n \xrightarrow{\;D\;} \sum_{i=1}^{\infty}\bar{\lambda}_i (Z_i^2 - 1) = \sum_{i=1}^{\infty}\bar{\lambda}_i Z_i^2 - \sum_{i=1}^{\infty}\bar{\lambda}_i,$$

where the $Z_i^2 \sim \chi_1^2$ are i.i.d., and

$$\sum_{i=1}^{\infty}\bar{\lambda}_i = \frac{C_p C_q}{K}\sum_{k=1}^{K} E\big[|u_k^t(X - X')|\big]\, E\big[|v_k^t(Y - Y')|\big],$$

$$\sum_{i=1}^{\infty}\bar{\lambda}_i^2 = \frac{C_p^2 C_q^2}{K^2}\sum_{k,k'=1}^{K} \mathcal{V}^2\big(u_k^t X, u_{k'}^t X\big)\, \mathcal{V}^2\big(v_k^t Y, v_{k'}^t Y\big).$$

Remark 4.20 Let us recall that if X and Y are independent, the asymptotic distribution of Ωn is

$$n\Omega_n \xrightarrow{\;D\;} \sum_{i=1}^{\infty}\lambda_i (Z_i^2 - 1).$$

Theorem 4.19 shows that, under the null hypothesis, $\bar{\Omega}_n$ enjoys the same convergence rate as $\Omega_n$. There usually does not exist a closed-form expression for $\sum_{i=1}^{\infty}\bar{\lambda}_i Z_i^2$, but we can approximate it with the Gamma distribution whose first two moments are matched. Thus, $\sum_{i=1}^{\infty}\bar{\lambda}_i Z_i^2$ can be approximated by Gamma(α, β) with probability density function

$$\frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad x > 0,$$

where

$$\alpha = \frac{1}{2}\cdot\frac{\left(\sum_{i=1}^{\infty}\bar{\lambda}_i\right)^2}{\sum_{i=1}^{\infty}\bar{\lambda}_i^2}, \qquad \beta = \frac{1}{2}\cdot\frac{\sum_{i=1}^{\infty}\bar{\lambda}_i}{\sum_{i=1}^{\infty}\bar{\lambda}_i^2}.$$

See [14], Section 3, for an empirical justification of this Gamma approximation, and [15] for a survey of different approximation methods for weighted sums of chi-square distributions. The following result shows that both $\sum_{i=1}^{\infty}\bar{\lambda}_i$ and $\sum_{i=1}^{\infty}\bar{\lambda}_i^2$ can be estimated from the data; see the appendix for the corresponding justification.

Proposition 4.21 We can approximate $\sum_{i=1}^{\infty}\bar{\lambda}_i$ and $\sum_{i=1}^{\infty}\bar{\lambda}_i^2$ as follows:

$$\sum_{i=1}^{\infty}\bar{\lambda}_i \approx \frac{C_p C_q}{K\, n^2(n-1)^2}\sum_{k=1}^{K} a_{\cdot\cdot}^{u_k}\, b_{\cdot\cdot}^{v_k}, \qquad \sum_{i=1}^{\infty}\bar{\lambda}_i^2 \approx \frac{K-1}{K}\,\Omega_n(X,X)\,\Omega_n(Y,Y) + \frac{C_p^2 C_q^2}{K}\sum_{k=1}^{K}\Omega_n(u_k^t X, u_k^t X)\,\Omega_n(v_k^t Y, v_k^t Y).$$

5 Simulations

Our numerical studies follow the works of [4,6,12]. In Section 5.1, we study how the performance of the proposed estimator is influenced by some parameters, including the sample size, the dimensionalities of the data, as well as the number of random projections in our algorithm. We also study and compare the computational efficiency of the direct method and the proposed method in Section 5.2. The comparison of the corresponding independence test with other existing methods will be included in Section 5.3.

5.1 Impact of Sample Size, Data Dimensions and the Number of Monte Carlo Iterations

In this part, we use synthetic data to study the impact of the sample size n, the data dimensions (p, q), and the number of Monte Carlo iterations K on the convergence and test power of our proposed test statistic $\bar{\Omega}_n$. The significance level is set to $\alpha_s = 0.05$. Each experiment is repeated N = 400 times to obtain reliable means and variances of the estimators.

In the first two examples, we fix the data dimensions p = q = 10, let the sample size n vary in {100, 500, 1000, 5000, 10000}, and let the number of Monte Carlo iterations K vary in {10, 50, 100, 500, 1000}. The data generation mechanism is described as follows; it generates independent variables.

Example 5.1 We generate random vectors $X \in \mathbb{R}^{10}$ and $Y \in \mathbb{R}^{10}$. Each entry $X_i$ follows Unif(0, 1), independently. Each entry $Y_i = Z_i^2$, where $Z_i$ follows Unif(0, 1), independently.

See Figure 1 for the boxplots of the outcomes of Example 5.1. In each subfigure, we fix the Monte Carlo iteration number K and let the number of observations n grow. It is worth noting that the scale of each subfigure can differ so as to display the entire boxplots. This experiment shows that the estimator converges to 0 regardless of the number of Monte Carlo iterations. It also suggests that K = 50 Monte Carlo iterations should suffice in the independent cases. The following example studies dependent variables.

FIGURE 1. Boxplots of estimators in Example 5.1: the dimension of X and Y is fixed at p = q = 10; the result is based on 400 repeated experiments. In each subplot, the y-axis represents the value of the distance covariance estimators.

Example 5.2 We generate random vectors $X \in \mathbb{R}^{10}$ and $Y \in \mathbb{R}^{10}$. Each entry $X_i$ follows Unif(0, 1), independently. Let $Y_i$ denote the i-th entry of Y. We let $Y_1 = X_1^2$ and $Y_2 = X_2^2$. The remaining entries of Y are $Y_i = Z_i^2$, i = 3, …, 10, where $Z_i$ follows Unif(0, 1), independently.

See Figure 2 for the boxplots of the outcomes of Example 5.2. In each subfigure, we fix the number of Monte Carlo iterations K and let the number of observations n grow. This example shows that if K is fixed, the variation of the estimator remains regardless of the sample size n. In the dependent cases, the number of Monte Carlo iterations K plays a more important role in estimator convergence than the sample size n.

The outcomes of Examples 5.1 and 5.2 confirm the theoretical results that the proposed estimator converges to 0 as the sample size n grows in the independent case, and converges to some nonzero number as the number of Monte Carlo iterations K grows in the dependent case.

In the following two examples, we fix the sample size n = 2000, as we noticed that our method is more efficient than the direct method when n is large. We fix the number of Monte Carlo iterations K = 50, relax the restriction on the data dimensions to allow p ≠ q, and let p, q vary in {10, 50, 100, 500, 1000}. We continue with an independent case as follows.

FIGURE 2. Boxplots of our estimators in Example 5.2: the dimension of X and Y is fixed at p = q = 10; the result is based on 400 repeated experiments. In each subplot, the y-axis represents the value of the distance covariance estimators.

Example 5.3 We generate random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. Each entry of X follows Unif(0, 1), independently. Each entry $Y_i = Z_i^2$, where $Z_i$ follows Unif(0, 1), independently.

See Figure 3 for the boxplots of the outcomes of Example 5.3. In each subfigure, we fix the dimension of X and let the dimension of Y grow. It is worth noting that the scale of each subfigure can differ so as to display the entire boxplots. It shows that the proposed estimator converges fairly fast in independent cases regardless of the dimension of the data.

The following presents a dependent case. In this case, only a small number of entries in X and Y are dependent, which means that the dependency structure between X and Y is low-dimensional even though X or Y can be of high dimension.

FIGURE 3. Boxplots of estimators in Example 5.3: both the sample size and the number of Monte Carlo iterations are fixed, n = 2000, K = 50; the result is based on 400 repeated experiments. In each subplot, the y-axis represents the value of the distance covariance estimators.

Example 5.4 We generate random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. Each entry of X follows Unif(0, 1), independently. We let the first 5 entries of Y be the squares of the first 5 entries of X and let the remaining entries of Y be the squares of independent Unif(0, 1) random variables. Specifically, we let $Y_i = X_i^2$, i = 1, …, 5, and $Y_i = Z_i^2$, i = 6, …, q, where the $Z_i$'s are drawn independently from Unif(0, 1).

See Figure 4 for the boxplots of the outcomes of Example 5.4. In each subfigure, we fix the dimension of X and let the dimension of Y grow. The test power of the proposed test against the data dimensions can be seen in Table 1. It is worth noting that when the sample size is fixed, the test power of our method decays as the dimensions of X and Y increase. We use the Direct Distance Covariance (DDC) defined in Eq. 2.5 on the same data; as a contrast, the test power of DDC is 1.000 even when p = q = 1000. This example exposes a limitation of random projection: it may fail to detect low-dimensional dependency in high-dimensional data. A possible remedy for this issue is to perform dimension reduction before applying the proposed method. We do not pursue this direction further, since it is beyond the scope of this paper.

FIGURE 4. Boxplots of the proposed estimators in Example 5.4: both the sample size and the number of Monte Carlo iterations are fixed: n = 2000 and K = 50; the result is based on 400 repeated experiments. In each subplot, the y-axis represents the value of the distance covariance estimators.

TABLE 1. Test power in Example 5.4: this result is based on 400 repeated experiments; the significance level is 0.05.

5.2 Comparison With Direct Method

In this section, we illustrate the computational and space efficiency of the proposed method (RPDC). RPDC is much faster than the direct method (DDC, Eq. 2.5) when the sample size is large. It is worth noting that DDC is infeasible when the sample size is too large, as its space complexity is O(n^2). See Table 2 for a comparison of computing time (unit: seconds) against the sample size n. This experiment is run on a laptop (MacBook Pro Retina, 13-inch, Early 2015, 2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3) with MATLAB R2016b (9.1.0.441655).

TABLE 2. Speed comparison: Direct Distance Covariance vs. Randomly Projected Distance Covariance.

5.3 Comparison With Other Independence Tests

In this part, we compare the statistical test power of the proposed test (RPDC) with the Hilbert-Schmidt Independence Criterion (HSIC) [4], as HSIC is gaining attention in the machine learning and statistics communities. In our experiments, a Gaussian kernel with standard deviation σ = 1 is used for HSIC. We also compare with the Randomized Dependence Coefficient (RDC) [16], which utilizes the technique of random projection as we do. Two classical tests for multivariate independence, described below, are included in the comparison, as is the Direct Distance Covariance (DDC) defined in Eq. 2.5.

Wilks Lambda (WL): the likelihood ratio test of the hypothesis Σ12 = 0 with μ unknown is based on

$$\frac{\det(S)}{\det(S_{11})\det(S_{22})} = \frac{\det\big(S_{22} - S_{21}S_{11}^{-1}S_{12}\big)}{\det(S_{22})},$$

where det(⋅) is the determinant, S, S11, and S22 denote the sample covariances of (X, Y), X, and Y, respectively, and S12 is the sample covariance $\widehat{\mathrm{Cov}}(X, Y)$. Under multivariate normality, the test statistic

$$W = -n \log\det\big(I - S_{22}^{-1}S_{21}S_{11}^{-1}S_{12}\big)$$

has the Wilks Lambda distribution Λ(q, n − 1 − p, p), see [17].

Puri-Sen (PS) statistics: [18], Chapter 8, proposed similar tests based on more general sample dispersion matrices T. In that test S, S11, S12 and S22 are replaced by T, T11, T12 and T22, where T could be a matrix of Spearman’s rank correlation statistics. Then, the test statistic becomes

$$W = -n \log\det\big(I - T_{22}^{-1}T_{21}T_{11}^{-1}T_{12}\big).$$

The critical values of the Wilks Lambda (WL) and Puri-Sen (PS) statistics are given by Bartlett’s approximation ([19], Section 5.3.2b): if n is large and p, q > 2, then

$$-\Big(n - \frac{1}{2}(p + q + 3)\Big)\log\det\big(I - S_{22}^{-1}S_{21}S_{11}^{-1}S_{12}\big)$$

has an approximate χ2(pq) distribution.
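A minimal sketch (ours) of the Wilks Lambda test with Bartlett's chi-square approximation; the Puri–Sen variant would substitute Spearman rank-based dispersion matrices for the sample covariances. Samples are assumed to be NumPy arrays of shape (n, p) and (n, q), and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2


def wilks_lambda_test(X, Y, alpha_s=0.05):
    """Test Sigma_12 = 0 using Bartlett's approximation (n large, p, q > 2)."""
    n, p = X.shape
    q = Y.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    M = np.eye(q) - np.linalg.solve(S22, S21) @ np.linalg.solve(S11, S12)
    stat = -(n - 0.5 * (p + q + 3)) * np.log(np.linalg.det(M))
    return stat > chi2.ppf(1 - alpha_s, df=p * q)
```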

The reference distributions of RDC and HSIC are approximated by 200 permutations, and the reference distributions of DDC and RPDC are approximated by the Gamma distribution. The significance level is set to $\alpha_s = 0.05$, and each experiment is repeated N = 400 times to obtain reliable type-I error/test power estimates.

We start with an example in which (X, Y) is multivariate normal. In this case, WL and PS are expected to be optimal, as the assumptions of these two classical tests are satisfied. Surprisingly, DDC has comparable performance with the aforementioned two methods. RPDC can achieve satisfactory performance when the sample size is reasonably large.

Example 5.5 We set the dimension of the data to p = q = 10. We generate random vectors $X \in \mathbb{R}^{10}$ and $Y \in \mathbb{R}^{10}$ from the standard multivariate normal distribution $N(0, I_{10})$. The joint distribution of (X, Y) is also normal, and we have Cor(Xi, Yi) = ρ, i = 1, …, 10, while the remaining correlations are all 0. We set the value of ρ to 0 and 0.1 to represent the independent and correlated scenarios, respectively. The sample size n is set from 100 to 1500 with an increment of 100.

Figure 5 plots the type-I error in subfigure (a) and the test power in subfigure (b) against the sample size. In the independence case (ρ = 0.0), the type-I error of each test is always around the significance level $\alpha_s = 0.05$, which implies the Gamma approximation works well for the asymptotic distributions. In the dependent case (ρ = 0.1), the overall performance of RPDC is close to HSIC; RPDC outperforms when the sample size is smaller and underperforms when the sample size is larger. Unfortunately, RDC's test power is insignificant.

Next, we compare these methods when (X, Y) is no longer multivariate normal and the dependency between X and Y is nonlinear. We also add a noise term to compare their performance in both low and high noise-to-signal ratio scenarios. In this case, DDC and RPDC are much better than WL, PS, and RDC. The performance of HSIC is close to DDC and RPDC when the noise is low but much worse than those two when the noise is high.

FIGURE 5. Type-I error/test power vs. sample size n in Example 5.5: the result is based on 400 repeated experiments.

Example 5.6 We set the dimension of the data to p = q = 10. We generate the random vector $X \in \mathbb{R}^{10}$ from the standard multivariate normal distribution $N(0, I_{10})$. Let the i-th entry of Y be $Y_i = \log(X_i^2) + \epsilon_i$, i = 1, …, q, where the $\epsilon_i$'s are independent random errors with $\epsilon_i \sim N(0, \sigma^2)$. We set the value of σ to 1 and 3 to represent low and high noise ratios, respectively. In the σ = 1 case, the sample size n runs from 100 to 1000 with an increment of 20; in the σ = 3 case, the sample size n runs from 100 to 4000 with an increment of 100.

Figure 6 plots the test power of each test against the sample size. In both the low and high noise cases, none of WL, PS, and RDC has any test power. In the low noise case, all of RPDC, DDC, and HSIC have satisfactory test power (> 0.9) when the sample size is greater than 300. In the high noise case, RPDC and DDC achieve more than 0.8 in test power once the sample size is greater than 500, while the test power of HSIC reaches 0.8 only when the sample size exceeds 2000.

In the following example, we generate the data similarly to Example 5.6, with the difference that the dependency changes over time. Specifically, X and Y are independent at the beginning but become dependent after some time point. Since all these tests are invariant to the order of the observations, this experiment simply means that only a proportion of the observations are dependent while the rest are not.

FIGURE 6. Test power vs. sample size n in Example 5.6: the significance level is αs = 0.05; the result is based on N = 400 repeated experiments.

Example 5.7 We set the dimension of the data to p = q = 10. We generate random vectors $X_t \in \mathbb{R}^{10}$, t = 1, …, n, from the standard multivariate normal distribution $N(0, I_{10})$. Let the i-th entry of $Y_t$ be $Y_{t,i} = \log(Z_{t,i}^2) + \epsilon_{t,i}$ for t = 1, …, T and $Y_{t,i} = \log(X_{t,i}^2) + \epsilon_{t,i}$ for t = T + 1, …, n, where $Z_t \sim N(0, I_{10})$ i.i.d. and the $\epsilon_{t,i}$'s are independent random errors with $\epsilon_{t,i} \sim N(0, 1)$. We set the value of T to 0.5n and 0.8n to represent early and late dependency transitions, respectively. In the early change case, the sample size n runs from 500 to 2000 with an increment of 100; in the late change case, the sample size n runs from 500 to 4000 with an increment of 100.

Figure 7 plots the test power of each test against the sample size. In both the early and late change cases, none of WL, PS, and RDC has any test power. In the early change case, all of RPDC, DDC, and HSIC have satisfactory test power (> 0.9) when the sample size is greater than 1500. In the late change case, DDC and HSIC achieve more than 0.8 in test power once the sample size reaches 4000, while the test power of RPDC is only 0.6 at a sample size of 4000. As expected, the performance of DDC is better than RPDC in both cases, and the performance of HSIC lies between DDC and RPDC.

FIGURE 7. Test power vs. sample size n in Example 5.7: the significance level is αs = 0.05; the result is based on N = 400 repeated experiments.

Remark 5.8 The examples in this subsection show that although RPDC underperforms DDC when the sample size is relatively small, RPDC can achieve the same test power as DDC when the sample size is sufficiently large. Thus, when the sample size is large enough, RPDC is superior to DDC because of its computational efficiency in both time and space.

6 Discussions

6.1 A Discussion on the Computational Efficiency

We compared the computational efficiency of the proposed method (RPDC) and the direct method (DDC) in Section 5.2; we discuss this issue further here.

As $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$ are multivariate random variables, the effect of p and q on computing time can be significant when p and q are not negligible compared to the sample size n. We now analyze the computational efficiency of DDC and RPDC taking p and q into consideration. The computational complexity of DDC becomes O(n^2 (p + q)), and that of RPDC becomes O(nK(log n + p + q)). Let us denote the total number of operations in DDC by $O_1$ and that in RPDC by $O_2$. Then there exist constants $L_1$ and $L_2$ such that

$$O_1 \le L_1\, n^2 (p + q), \quad\text{and}\quad O_2 \le L_2\, nK(\log n + p + q).$$

There is no doubt that $O_2$ will eventually be much smaller than $O_1$ as the sample size n grows. Due to the complexity of the fast algorithm, we expect $L_2 > L_1$, which means the computing time of RPDC can even exceed that of DDC when the sample size is relatively small. We therefore study an interesting problem: what is the break-even point, in terms of sample size n, at which RPDC and DDC have the same computing time?

Let n0 = n0(p + q, K) denote the break-even point, which is a function of p + q and the number of Monte Carlo iterations K. For simplicity, we fix K = 50, since 50 iterations achieve satisfactory test power, as we showed in Example 5.4. Then n0 becomes a function depending solely on p + q. Since it is hard to derive the closed form of n0, we derive it numerically instead. For fixed p + q, we let the sample size vary and record the difference between the running times of the two methods. Then we fit the difference in running time against the sample size with a smoothing spline. The root of this spline is the numerical value of n0 at p + q.
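The following sketch (ours) mirrors this numerical procedure. The timing callables `time_ddc` and `time_rpdc` are hypothetical placeholders for routines that measure the running time of the two methods at a given sample size; the spline root is located with SciPy.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def break_even_sample_size(sample_sizes, time_ddc, time_rpdc):
    """Fit a smoothing spline to the running-time difference (DDC minus RPDC)
    and return its smallest root, i.e., the numerical break-even point n0."""
    diffs = np.array([time_ddc(n) - time_rpdc(n) for n in sample_sizes])
    spline = UnivariateSpline(np.asarray(sample_sizes, dtype=float), diffs, k=3)
    roots = spline.roots()
    return roots.min() if len(roots) else None
```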

We plot n0 against p + q in Figure 8. As the figure shows, the break-even sample size decreases as the data dimension increases, which implies that our proposed method is more advantageous than the direct method when the random variables are of high dimension. However, as shown in Example 5.4, the random projection-based method does not perform well when high-dimensional data have a low-dimensional dependency structure. We should therefore be cautious in using the proposed method when the dimension is high.

FIGURE 8. Break-even sample size n0 against data dimension p + q. This figure is based on 100 repeated experiments.

6.2 Connections With Existing Literature

It turns out that distance-based methods are not the only choices for independence tests; see [20] and the references therein for alternatives.

Our proposed method utilizes random projections, which bears similarity to the randomized feature mapping strategy [21] developed in the machine learning community. Such an approach has been proven effective in kernel-related methods [22–26]. However, a closer examination reveals the following difference: most of the aforementioned work is rooted in Bochner's theorem [27] from harmonic analysis, which states that a continuous kernel in Euclidean space is positive definite if and only if the kernel function is the Fourier transform of a non-negative measure. In this paper, we deal with the distance function, which is not a positive definite kernel. We manage to derive a counterpart to the randomized feature mapping, which was the influential idea used in [21].

Random projections have been used in [28] to develop a powerful two-sample test in high dimensions. The authors derived an asymptotic power function for their proposed test and then provided sufficient conditions for their test to achieve greater power than other state-of-the-art tests. They then used receiver operating characteristic (ROC) curves (generated from simulated data) to evaluate its performance against competing tests. The derivation of the asymptotic relative efficiency (ARE) is of independent interest. Despite the use of random projection, the details of their methodology are very different from the one studied in the present paper.

Several distribution-free tests that are based on sample space partitions were suggested in [29] for univariate random variables. They proved that all suggested tests are consistent and showed the connection between their tests and the mutual information (MI). Most importantly, they derived fast (polynomial-time) algorithms, which are essential for large sample size, since the computational complexity of the naive algorithm is exponential in sample size. Efficient implementations of all statistics and tests described in the aforementioned paper are available in the R package HHG, which can be freely downloaded from the Comprehensive R Archive Network, http://cran.r-project.org/. Null tables can be downloaded from the first author’s website.

Distance-based independence/dependence measurements sometimes have been utilized in performing a greedy feature selection, often via dependence maximization [8,30,31], and it has been effective on some real-world datasets. This paper simply mentions such a potential research line, without pursuing it.

Paper [32] derives an efficient approach to compute conditional distance correlations. We note that there are strong resemblances between the distance covariance and its conditional version. The search for a potential extension of the work in this paper to the conditional distance correlation can be a meaningful future topic of research.

Paper [33] provides some important insights into the power of distance covariance for multivariate data. In particular, they discover that distance-based independence tests have limiting power under some less common circumstances. As a remedy, they propose tests based on an aggregation of marginal sample distance and extend their approach to those based on Hilbert-Schmidt covariance and marginal distance/Hilbert-Schmidt covariance. It could be another interesting research direction but beyond the scope of this paper.

7 Conclusion

A significant contribution of this paper is we demonstrated that the multivariate variables in the independence tests need not imply the higher-order computational desideratum of the distance-based methods.

Distance-based methods are important in statistics, particularly in testing for independence. When the random variables are univariate, efficient numerical algorithms exist; for multivariate random variables, the problem has remained open. This paper studies a random projection approach to tackle it. The approach first turns the multivariate calculation into a univariate one via random projection, and then averages the statistics computed from the projected (and therefore univariate) samples to approximate the distance-based statistic that was originally intended. Theoretical analysis shows that the loss of asymptotic efficiency (in the form of the asymptotic variance of the test statistic) is likely insignificant. The new method can be numerically much more efficient when the sample size is large, as is to be expected in the current information (or big-data) era. Simulation studies validate the theoretical statements. The theoretical analysis takes advantage of some newly available results, such as the equivalence of distance-based methods and reproducing kernel Hilbert space-based methods [12]. The numerical method utilizes the recently developed fast algorithm of [8].
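The following sketch summarizes the resulting procedure (in Python; function names, the default number of projections, and the use of the naive O(n^2) univariate distance covariance are illustrative assumptions rather than the paper’s implementation, and the scaling constants that relate the average to the multivariate distance covariance are omitted). In practice, the O(n log  n) univariate algorithm of [8] replaces the naive computation, yielding the O(nK log  n) overall complexity.

```python
import numpy as np

def dcov2_univariate(x, y):
    """Naive O(n^2) squared sample distance covariance of two univariate samples;
    the fast O(n log n) algorithm of [8] can be substituted here."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()  # double centering
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

def rpdc_statistic(X, Y, K=50, seed=None):
    """Average of the univariate squared distance covariances over K random
    one-dimensional projections of the multivariate samples X (n-by-p) and Y (n-by-q)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    _, q = Y.shape
    total = 0.0
    for _ in range(K):
        u = rng.standard_normal(p); u /= np.linalg.norm(u)  # random direction for X
        v = rng.standard_normal(q); v /= np.linalg.norm(v)  # random direction for Y
        total += dcov2_univariate(X @ u, Y @ v)
    return total / K
```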

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

This material was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This work has also been partially supported by NSF grant DMS-1613152.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2021.779841/full#supplementary-material

References

1. Reshef, DN, Reshef, YA, Finucane, HK, Grossman, SR, Turnbaugh, PJ, Lander, ES, et al. Detecting Novel Associations in Large Data Sets. Science (2011) 334(6062):1518–24.

2. Schweizer, B, and Wolff, EF. On Nonparametric Measures of Dependence for Random Variables. Ann Stat (1981) 879–85. doi:10.1214/aos/1176345528

3. Siburg, KF, and Stoimenov, PA. A Measure of Mutual Complete Dependence. Metrika (2010) 71(2):239–51. doi:10.1007/s00184-008-0229-9

4. Gretton, A, Bousquet, O, Smola, A, and Schölkopf, B. Measuring Statistical Dependence with hilbert-schmidt Norms. International Conference on Algorithmic Learning Theory. Springer (2005). p. 63–77. doi:10.1007/11564089_7

5. Székely, GJ, and Rizzo, ML. Brownian Distance Covariance. Ann Appl Stat (2009) 3(4):1236–65. doi:10.1214/09-aoas312

6. Székely, GJ, Rizzo, ML, and Bakirov, NK. Measuring and Testing Dependence by Correlation of Distances. Ann Stat (2007) 35(6):2769–94. doi:10.1214/009053607000000505

7. Reimherr, M, and Nicolae, DL. On Quantifying Dependence: a Framework for Developing Interpretable Measures. Stat Sci (2013) 28(1):116–30.

8. Huo, X, and Székely, GJ. Fast Computing for Distance Covariance. Technometrics (2016) 58(4):435–47. doi:10.1080/00401706.2015.1054435

9. Taskinen, S, Oja, H, and Randles, RH. Multivariate Nonparametric Tests of independence. J Am Stat Assoc (2005) 100(471):916–25. doi:10.1198/016214505000000097

10. Lyons, R. Distance Covariance in Metric Spaces. Ann Probab (2013) 41(5):3284–305. doi:10.1214/12-aop803

11. Hoeffding, W. Probability Inequalities for Sums of Bounded Random Variables. J Am Stat Assoc (1963) 58(301):13–30. doi:10.1080/01621459.1963.10500830

12. Sejdinovic, D, Sriperumbudur, B, Gretton, A, and Fukumizu, K. Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing. Ann Stat (2013) 41(5):2263–91. doi:10.1214/13-aos1140

13. Serfling, RJ. Approximation Theorems of Mathematical Statistics (Wiley Series in Probability and Statistics). Wiley-Interscience (1980).

14. Box, GEP. Some Theorems on Quadratic Forms Applied in the Study of Analysis of Variance Problems, I. Effect of Inequality of Variance in the One-Way Classification. Ann Math Stat (1954) 25(2):290–302.

15. Bodenham, DA, and Adams, NM. A Comparison of Efficient Approximations for a Weighted Sum of Chi-Squared Random Variables. Stat Comput (2014) 1–12.

16. Lopez-Paz, D, Hennig, P, and Schölkopf, B. The Randomized Dependence Coefficient. In: Advances in Neural Information Processing Systems (2013). p. 1–9.

17. Wilks, SS. On the Independence of K Sets of Normally Distributed Statistical Variables. Econometrica (1935) 3:309–26. doi:10.2307/1905324

18. Puri, ML, and Sen, PK. Nonparametric Methods in Multivariate Analysis. Wiley Series in Probability and Mathematical Statistics. Probability and mathematical statistics. Wiley (1971).

19. Mardia, KV, Bibby, JM, and Kent, JT. Multivariate Analysis. Probability and Mathematical Statistics. New York, NY: Academic Press (1982).

20. Lee, K-Y, Li, B, and Zhao, H. Variable Selection via Additive Conditional independence. J R Stat Soc Ser B (Statistical Methodology) (2016) 78(Part 5):1037–55. doi:10.1111/rssb.12150

21. Rahimi, A, and Recht, B. Random Features for Large-Scale Kernel Machines. In: Advances in Neural Information Processing Systems (2007). p. 1177–84.

22. Achlioptas, D, McSherry, F, and Schölkopf, B. Sampling Techniques for Kernel Methods. In: Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference (2001).

23. Blum, A. Random Projection, Margins, Kernels, and Feature-Selection. In: Subspace, Latent Structure and Feature Selection. Springer (2006). p. 52–68.

24. Cai, T, Fan, J, and Jiang, T. Distributions of Angles in Random Packing on Spheres. J Mach Learn Res (2013) 14(1):1837–64.

25. Drineas, P, and Mahoney, MW. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. J Machine Learn Res (2005) 6(Dec):2153–75.

26. Frieze, A, Kannan, R, and Vempala, S. Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations. J Acm (2004) 51(6):1025–41. doi:10.1145/1039488.1039494

27. Rudin, W. Fourier Analysis on Groups. John Wiley & Sons (1990).

28. Lopes, M, Jacob, L, and Wainwright, MJ. A More Powerful Two-Sample Test in High Dimensions Using Random Projection. In: Advances in Neural Information Processing Systems (2011). p. 1206–14.

29. Heller, R, Heller, Y, Kaufman, S, Brill, B, and Gorfine, M. Consistent Distribution-free K-Sample and independence Tests for Univariate Random Variables. J Machine Learn Res (2016) 17(29):1–54.

30. Li, R, Zhong, W, and Zhu, L. Feature Screening via Distance Correlation Learning. J Am Stat Assoc (2012) 107(499):1129–39. doi:10.1080/01621459.2012.695654

31. Zhu, L-P, Li, L, Li, R, and Zhu, L-X. Model-free Feature Screening for Ultrahigh-Dimensional Data. J Am Stat Assoc (2012).

32. Wang, X, Pan, W, Hu, W, Tian, Y, and Zhang, H. Conditional Distance Correlation. J Am Stat Assoc (2015) 110(512):1726–34. doi:10.1080/01621459.2014.993081

33. Zhu, C, Zhang, X, Yao, S, and Shao, X. Distance-Based and RKHS-Based Dependence Metrics in High Dimension. Ann Stat (2020) 48(6):3366–94. doi:10.1214/19-aos1934

Keywords: independence test, distance covariance, random projection, hypotheses test, multivariate hypothesis test

Citation: Huang C and Huo X (2022) A Statistically and Numerically Efficient Independence Test Based on Random Projections and Distance Covariance. Front. Appl. Math. Stat. 7:779841. doi: 10.3389/fams.2021.779841

Received: 19 September 2021; Accepted: 16 November 2021;
Published: 13 January 2022.

Edited by:

David Banks, Duke University, United States

Reviewed by:

Dominic Edelmann, German Cancer Research Center, Germany
Gabor J. Szekely, National Science Foundation (NSF), United States

Copyright © 2022 Huang and Huo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xiaoming Huo, huo@gatech.edu
