ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 03 December 2020

Sec. Mathematics of Computation and Data Science

Volume 6 - 2020 | https://doi.org/10.3389/fams.2020.593406

The Spectral Underpinning of word2vec

  • 1. Program in Applied Mathematics, Yale University, New Haven, CT, United States

  • 2. Department of Pathology, Yale School of Medicine, New Haven, CT, United States

  • 3. Interdepartmental Program in Computational Biology and Bioinformatics, New Haven, CT, United States

  • 4. School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel

  • 5. Department of Mathematics, University of Washington, Seattle, WA, United States

Abstract

Word2vec, introduced by Mikolov et al., is a word-embedding method that is widely used in natural language processing. Despite its success and frequent use, a strong theoretical justification is still lacking. The main contribution of our paper is to propose a rigorous analysis of the highly nonlinear word2vec functional. Our results suggest that word2vec may be primarily driven by an underlying spectral method. This insight may open the door to obtaining provable guarantees for word2vec. We support these findings by numerical simulations. One fascinating open question is whether the nonlinear properties of word2vec that are not captured by the spectral method are beneficial and, if so, by what mechanism.

1 Introduction

Word2vec was introduced by Mikolov et al. [1] as an unsupervised scheme for embedding words based on text corpora. We will try to introduce the idea in the simplest possible terms and refer to [1–3] for the way it is usually presented. Let $X = \{x_1, \dots, x_n\}$ be a set of $n$ elements for which we aim to compute a numerical representation. These may be words, documents, or nodes in a graph. Our input consists of an $n \times m$ matrix $P$ with non-negative elements $p_{ij}$, which encode, by a numerical value, the relationship between the set $X$ and a set of context elements $C = \{c_1, \dots, c_m\}$. The meaning of contexts is determined by the specific application; in most cases the set of contexts is equal to the set of elements (i.e., $m = n$ and $c_i = x_i$ for every $i$) [2].

The larger the value of $p_{ij}$, the stronger the connection between $x_i$ and $c_j$. For example, such a connection can be quantified by the probability that a word appears in the same sentence as another word. Based on $P$, Mikolov et al. defined an energy function which depends on two sets of vector representations, $w$ and $c$. Maximizing the functional with respect to these sets yields $w$ and $c$, which can serve as low-dimensional representations for the words and contexts, respectively. Ideally, this embedding should encapsulate the relations captured by the matrix $P$.

Assuming a uniform prior over the $n$ elements, the energy function $E$ introduced by Mikolov et al. [1] can be written as

$$E(w, c) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij}\, w_i c_j - \sum_{i=1}^{n} \log\Big( \sum_{j=1}^{n} e^{w_i c_j} \Big). \qquad (1)$$

The exact relation between Eq. 1 and the formulation in [4] appears in the Supplementary Material. Word2vec is based on maximizing this expression over all $w, c \in \mathbb{R}^n$. There is no reason to assume that the maximum is unique. It has been observed that if $x_i$ and $x_j$ are similar elements in the data set (namely, words that frequently appear in the same sentence), then $w_i$ and $w_j$ (or $c_i$ and $c_j$) tend to have similar numerical values. Thus, the values $w_1, \dots, w_n$ are useful for embedding $x_1, \dots, x_n$. One could also try to maximize the symmetric functional that arises from enforcing $w = c$ and is given by

$$E(w) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij}\, w_i w_j - \sum_{i=1}^{n} \log\Big( \sum_{j=1}^{n} e^{w_i w_j} \Big). \qquad (2)$$

In Section 5 we show that the symmetric functional yields a meaningful embedding for various datasets. Here, the interpretation of the functional is straightforward: we wish to pick $w$ in a way that makes $\langle Pw, w\rangle$ large. Assuming $P$ is diagonalizable, this is achieved for $w$ that is a linear combination of the leading eigenvectors. At the same time, the exponential function places a penalty on large entries of $w$.
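The functional is compact enough to write down directly. The following sketch (variable names ours) evaluates our reading of the symmetric functional, $E(w) = \langle Pw, w\rangle - \sum_i \log \sum_j e^{w_i w_j}$, on a toy row-stochastic matrix; note that the energy at $w = 0$ is exactly $-n \log n$, which is why constant offsets can be ignored when maximizing.

```python
import numpy as np

def symmetric_energy(w, P):
    """E(w) = <Pw, w> - sum_i log sum_j exp(w_i * w_j).

    This follows our reading of the symmetric word2vec functional;
    normalizations in the original formulation may differ."""
    G = np.outer(w, w)                          # G[i, j] = w_i * w_j
    return w @ P @ w - np.logaddexp.reduce(G, axis=1).sum()

# toy row-stochastic P from a Gaussian kernel on 6 random points
rng = np.random.default_rng(0)
x = rng.normal(size=6)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
P = K / K.sum(axis=1, keepdims=True)

E0 = symmetric_energy(np.zeros(6), P)           # equals -6 * log(6)
```

The stable `logaddexp.reduce` avoids overflow once the entries of `w` grow.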

Our paper initiates a rigorous study of the energy functional; however, we emphasize that a complete description of the energy landscape remains an interesting open problem. We also emphasize that our analysis has direct implications for computational aspects: for instance, if one were interested in maximizing the nonlinear functional, the maximum of its linear approximation (which is easy to compute) is a natural starting point. A simple example is shown in Figure 1: the underlying dataset contains 200 points, where the first 100 points are drawn from one Gaussian distribution and the second 100 points are drawn from a second Gaussian distribution. The matrix $P$ is the row-stochastic matrix induced by a Gaussian kernel, $K_{ij} = \exp(-\|x_i - x_j\|^2/\alpha)$, where $\alpha$ is a scaling parameter discussed in Section 5. We observe that, up to scaling, the maximizer of the energy functional (black) is well approximated by the spectral methods introduced below.

FIGURE 1

2 Motivation and Related Works

Optimizing over energy functions such as Eq. 1 to obtain vector embeddings is done for various applications, such as words [4], documents [5], and graphs [6]. Surprisingly, few works have addressed the analytic aspects of optimizing the word2vec functional. Hashimoto et al. [7] derived a relation between word2vec and stochastic neighbor embedding [8]. Cotterell et al. [9] showed that when P is sampled according to a multinomial distribution, optimizing over Eq. 1 is equivalent to exponential family PCA [10]. If the number of elements is large, optimizing over Eq. 1 becomes impractical. As an efficient alternative, Mikolov et al. [4] suggested a variation based on negative sampling. Levy and Goldberg [11] showed that if the embedding dimension is sufficiently high, then optimizing over the negative-sampling functional suggested in [4] is equivalent to factorizing the shifted pointwise mutual information (PMI) matrix. This work was extended in [12], where similar results were derived for additional embedding algorithms such as [3, 13, 14]. Decomposition of the PMI matrix was also justified by Arora et al. [15], based on a generative random walk model. A different approach was introduced by Landgraf and Bellay [16], who related the negative-sampling loss function to logistic PCA.

In this work, we focus on approximating the highly nonlinear word2vec functional by Taylor expansion. We show that in the regime of embedding vectors with small magnitude, the functional can be approximated by the spectral decomposition of the matrix P. This draws a natural connection between word2vec and classical, spectral embedding methods such as [17, 18]. By rephrasing word2vec as a spectral method in the “small vector limit,” one gains access to a large number of tools that allow one to rigorously establish a framework under which word2vec can enjoy provable guarantees, such as in [19, 20].

3 Results

We now state our main results. In Section 3.1 we establish that the energy functional has a clean asymptotic expansion around $w = 0$ and corresponds naturally to a spectral method in that regime. Naturally, such an asymptotic expansion is only feasible if one has some control over the size of the entries of the extremizer. We establish in Section 3.2 that the vectors maximizing the functional are not too large. The results in Section 3.2 are closely matched by numerical results: in particular, we observe that, in practice, the norm of the maximizer is a logarithmic factor smaller than our upper bound. The proofs are given in Section 4, and explicit numerical examples are shown in Section 5. In Section 6 we show empirically that the relation between word2vec and the spectral approach also holds for embedding in more than one dimension.

3.1 First Order Approximation for Small Data

The main idea is simple: we make an ansatz assuming that the optimal vectors are roughly of a prescribed size. If we assume that the vectors are fairly “typical” vectors of norm $\|w\|$, then each entry is expected to scale approximately as $\|w\|/\sqrt{n}$. Our main observation is that this regime is governed by a regularized spectral method. Before stating our theorem, let $\lesssim$ denote inequality up to universal multiplicative constants.

Theorem 3.1 (Spectral Expansion). If the entries of $w$ and $c$ are of size at most $\varepsilon$, for a sufficiently small $\varepsilon$, then

$$E(w, c) = -n \log n + \sum_{i,j} p_{ij}\, w_i c_j - \frac{1}{n}\Big(\sum_{i} w_i\Big)\Big(\sum_{j} c_j\Big) + \text{lower-order terms}.$$

Naturally, since we are interested in maximizing this quantity, the constant factor $-n \log n$ plays no role. The leading terms can be rewritten as $\langle (P - \mathbf{1}/n)\, w, c\rangle$, where $\mathbf{1}$ is the matrix all of whose entries are 1. This suggests that the optimal $w, c$ maximizing the quantity should simply be the singular vectors associated to the matrix $P - \mathbf{1}/n$. The full expansion has a quadratic term that serves as an additional regularizer. The symmetric case (with ansatz $w = c$) is particularly simple, since we have

$$E(w) = -n \log n + \Big\langle \Big(P - \frac{\mathbf{1}}{n}\Big) w, w \Big\rangle - \frac{\|w\|^4}{2n} + \text{lower-order terms}.$$

Assuming $P$ is similar to a symmetric matrix, the optimal $w$ should be well described by the leading eigenvector of $P - \mathbf{1}/n$, with an additional regularization term ensuring that $\|w\|$ is not too large. We consider this simple insight to be the main contribution of this paper, since it explains succinctly why an algorithm like word2vec has a chance to be successful. We also give a large number of numerical examples showing that in many cases the result obtained by word2vec is extremely similar to what we obtain from the associated spectral method.
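The expansion can be sanity-checked numerically. The sketch below compares our reading of the symmetric functional, $E(w) = \langle Pw, w\rangle - \sum_i \log \sum_j e^{w_i w_j}$, against the second-order expansion $-n\log n + \langle (P - \mathbf{1}/n)w, w\rangle - \|w\|^4/(2n)$ (plus a small mean-dependent correction) for a vector with entries of size roughly 0.05; the two agree to within the expected higher-order error.

```python
import numpy as np

def symmetric_energy(w, P):
    # our reading of the symmetric word2vec functional
    G = np.outer(w, w)
    return w @ P @ w - np.logaddexp.reduce(G, axis=1).sum()

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
P = K / K.sum(axis=1, keepdims=True)

eps = 0.05
w = eps * rng.normal(size=n)                    # entries of size ~ eps

# second-order expansion (our reconstruction): the all-ones matrix enters as
# -s^2/n, and ||w||^4/(2n) acts as a regularizer; one small mean-dependent
# cross term is kept as well
s, q = w.sum(), (w ** 2).sum()
approx = (-n * np.log(n) + w @ P @ w - s ** 2 / n
          - q ** 2 / (2 * n) + q * s ** 2 / (2 * n ** 2))

gap = abs(symmetric_energy(w, P) - approx)
```

For entries of this size the neglected terms are several orders of magnitude below the leading ones.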

3.2 Optimal Vectors Are Not Too Large

Another basic question is as follows: how large is the norm of the vector(s) maximizing the energy function? This is of obvious importance in practice; as seen in Theorem 3.1, it also has some theoretical relevance: if $w$ has large entries, then clearly one cannot hope to capture the exponential nonlinearity with a polynomial expansion. Assuming $\|P\| \lesssim 1$, the global maximizer of the second-order approximation

$$\Big\langle \Big(P - \frac{\mathbf{1}}{n}\Big) w, w \Big\rangle - \frac{\|w\|^4}{2n}$$

satisfies $\|w\| \lesssim \sqrt{n}$. This can be seen as follows: if $w$ is a leading eigenvector of $P - \mathbf{1}/n$ with eigenvalue $\lambda$, the approximation evaluates to $\lambda \|w\|^2 - \|w\|^4/(2n)$. Plugging in $\|w\|^2 = \lambda n$ shows that the maximal energy is at least of size $\lambda^2 n/2$. For any vector whose norm exceeds a constant multiple of $\sqrt{n}$, we see that the energy is less than that, establishing the bound. We obtain similar boundedness properties for the fully nonlinear problem for a fairly general class of matrices.

Theorem 3.2 (Generic Boundedness). Let $P$ satisfy $\|P\| \le 1 - \delta$ for some $\delta > 0$. Then any global maximizer $w$ of the energy satisfies $\|w\| \lesssim \sqrt{n \log n}$, with constants depending on $\delta$. While we do not claim that this bound is sharp, it does nicely illustrate that the solutions of the optimization problem must be bounded. Moreover, if they are bounded, then so are their entries; more precisely, $\|w\| \lesssim \sqrt{n \log n}$ implies that, for “flat” vectors, the typical entry is of size $\sqrt{\log n}$ and thus firmly within the approximations that can be reached by a Taylor expansion. It is clear that a condition of this type on $\|P\|$ is required for boundedness of the solutions. This can be observed by considering the row-stochastic matrix

$$P = \begin{pmatrix} 1 - \varepsilon & \varepsilon \\ \varepsilon & 1 - \varepsilon \end{pmatrix}.$$

Writing $w = (x, y)$, we observe that the arising functional is quite nonlinear even in this simple case. However, it is fairly easy to understand the behavior of the gradient ascent method on the axis $y = 0$, since

$$E(x, 0) = (1 - \varepsilon)\, x^2 - \log\big(e^{x^2} + 1\big) - \log 2$$

is monotonically increasing until $x^2 \sim \log(1/\varepsilon)$. Therefore it is, a priori, unbounded, since $\varepsilon$ can be arbitrarily close to 0.
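This failure mode can be made concrete. Reading the example as the 2×2 row-stochastic matrix with diagonal $1 - \varepsilon$ (our reconstruction), the symmetric energy on the axis $w = (x, 0)$ is $(1-\varepsilon)x^2 - \log(e^{x^2} + 1) - \log 2$, whose maximizer satisfies $x^2 = \log((1-\varepsilon)/\varepsilon)$ and therefore blows up as $\varepsilon \to 0$:

```python
import numpy as np

def axis_energy(x, eps):
    # symmetric energy of P = [[1-eps, eps], [eps, 1-eps]] restricted to w = (x, 0)
    return (1 - eps) * x**2 - np.logaddexp(x**2, 0.0) - np.log(2)

# locate the maximizer on a fine grid for increasingly small eps
xs = np.linspace(0.0, 6.0, 200001)
argmax = {}
for eps in (1e-2, 1e-4, 1e-8):
    argmax[eps] = xs[np.argmax(axis_energy(xs, eps))]
```

The numerically found maximizer matches the closed form $x^2 = \log((1-\varepsilon)/\varepsilon)$ and grows without bound as $\varepsilon$ shrinks.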

In practice, one often uses word2vec for matrices whose spectral norm is equal to 1 and which have the additional property of being row-stochastic. We also observe empirically that the global optimizer has a mean value close to 0 (the expansion in Theorem 3.1 suggests why this should be the case). We obtain a similar boundedness theorem in which the only relevant operator norm is that of the operator restricted to the subspace of vectors having mean 0.

Theorem 3.3 (Boundedness for Row-Stochastic Matrices). Let $P$ be a row-stochastic matrix, let $P_0$ denote the restriction of $P$ to the subspace of vectors with mean 0, and suppose that $\|P_0\| \le 1 - \delta$ for some $\delta > 0$. Let $w$ be a global maximizer of the energy. If $w$ has a mean value sufficiently close to 0, then $\|w\| \lesssim \sqrt{n \log n}$. The matrix given above illustrates that some restrictions are necessary in order to obtain a nicely bounded gradient ascent. There is some freedom in the choice of the constants in Theorem 3.3. Numerical experiments show that the results are not merely theoretical: extremizing vectors tend to have a mean value sufficiently close to 0 for the theorem to be applicable.

3.3 Outlook

Summarizing, our main arguments are as follows:

  • The energy landscape of the word2vec functional is well approximated by a spectral method (or regularized spectral method) as long as the entries of the vector are uniformly bounded. In any compact interval around 0, the behavior of the exponential function can be appropriately approximated by a Taylor expansion of sufficiently high degree.

  • There are bounds suggesting that the norm of the embedding vector scales as $\|w\| \lesssim \sqrt{n \log n}$; this means that, for “flat” vectors, the individual entries grow at most like $\sqrt{\log n}$. Presumably the logarithmic factor is an artifact of the proof.

  • Finally, we present examples in Section 5 showing that in many cases the embedding obtained by maximizing the word2vec functional is indeed accurately predicted by the second-order approximation.

This suggests various interesting lines of research: it would be nice to have refined versions of Theorems 3.2 and 3.3 (an immediate goal being the removal of the logarithmic dependence and perhaps even pointwise bounds on the entries of w). Numerical experiments indicate that Theorems 3.2 and 3.3 are at most a logarithmic factor away from being optimal. A second natural avenue of research proposed by our paper is to differentiate the behavior of word2vec and that of the associated spectral method: are the results of word2vec (being intrinsically nonlinear) truly different from the behavior of the spectral method (arising as its linearization)? Or, put differently, is the nonlinear aspect of word2vec that is not being captured by the spectral method helpful for embedding?
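The last question can at least be probed numerically. The sketch below runs gradient ascent on our reading of the symmetric functional (whose gradient takes the closed form $(P + P^T - Q - Q^T)w$, where $Q_{ij}$ is the softmax of $w_i w_j$ over $j$) and correlates the maximizer with the leading eigenvector of $P - \mathbf{1}/n$; for clustered data the two agree closely, matching the experiments in Section 5.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
# two 1D Gaussian clusters (an illustrative dataset)
x = np.concatenate([rng.normal(-2.0, 0.5, 30), rng.normal(2.0, 0.5, 30)])
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
P = K / K.sum(axis=1, keepdims=True)

def grad(w, P):
    # gradient of <Pw, w> - sum_i log sum_j exp(w_i w_j)  (our reconstruction)
    G = np.outer(w, w)
    Q = np.exp(G - G.max(axis=1, keepdims=True))
    Q /= Q.sum(axis=1, keepdims=True)           # Q[i, j] = softmax_j(w_i w_j)
    return (P + P.T - Q - Q.T) @ w

w = 0.01 * rng.normal(size=n)                   # start near the origin
for _ in range(3000):
    w += 0.02 * grad(w, P)

# leading eigenvector of P - 1/n (the constant vector lies in its kernel)
M = P - np.ones((n, n)) / n
vals, vecs = np.linalg.eig(M)
u = vecs[:, np.argmax(vals.real)].real

r = np.corrcoef(w, u)[0, 1]                     # compare up to sign
```

On examples of this kind the absolute correlation is close to 1, and the norm of the maximizer stays far below the $\sqrt{n \log n}$ scale of Theorem 3.2.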

4 Proofs

Proof of Theorem 3.1. We recall our assumption that the entries of $w$ and $c$ are of size at most $\varepsilon$ (where the implicit constant affects all subsequent constants). We remark that the subsequent arguments could also be carried out for larger entries at the cost of different error terms; the arguments fail to be rigorous as soon as the entries are of constant size, since then, a priori, all terms in the Taylor expansion of the exponential could be of roughly the same size. We start with the Taylor expansion of $e^{w_i c_j}$ around 0, which shows that each inner sum $\sum_j e^{w_i c_j}$ is a small perturbation of $n$. We then use the series expansion of the logarithm around $n$ to expand $\log \sum_j e^{w_i c_j}$; here, the second-order sum can be somewhat simplified. Altogether, collecting these estimates and noting that the remaining terms are of lower order in $\varepsilon$, we have justified the desired expansion.

Proof of Theorem 3.2. Setting $w = 0$ results in the energy $-n \log n$. Now, let $w$ be a global maximizer. Since the term $j = i$ alone gives $\log \sum_j e^{w_i w_j} \ge w_i^2$, we obtain $E(w) \le \|P\| \|w\|^2 - \|w\|^2 \le -\delta \|w\|^2$. The energy of the maximizer must be at least the energy at the origin, so $-\delta \|w\|^2 \ge -n \log n$ and therefore $\|w\| \le \sqrt{n \log n / \delta}$, which is the desired result.

Proof of Theorem 3.3. We expand the vector $w$ into a multiple of the constant vector of norm 1, $\mathbf{1}/\sqrt{n}$, and its orthogonal complement, which we abbreviate by letting $w_0$ denote the component with mean value 0. We expand the quadratic term accordingly. Since $P$ is row-stochastic, we have $P\mathbf{1} = \mathbf{1}$, and thus the purely constant contribution can be computed explicitly. Moreover, the mixed terms simplify, since $w_0$ has mean value 0. We also observe, again because $w_0$ has mean value 0, that the contribution of $w_0$ is controlled by the restricted norm $\|P_0\|$. Collecting all these estimates, and recalling the Pythagorean theorem $\|w\|^2 = n\mu^2 + \|w_0\|^2$ (where $\mu$ denotes the mean of $w$), we can bound the energy from above. The resulting bound is monotonically increasing in the relevant range; thus, assuming that the mean value $\mu$ is sufficiently small, we get, after some elementary computation, an upper bound on the energy of the maximizer. However, we also recall from the proof of Theorem 3.2 that the energy at the origin is $-n \log n$. Altogether, since the energy in the maximum has to exceed the energy in the origin, the desired bound $\|w\| \lesssim \sqrt{n \log n}$ follows.

5 Examples

We validate our theoretical findings by comparing, for various datasets, the representations obtained by the following methods: i) optimizing over the symmetric functional in Eq. 2, ii) optimizing over the spectral method suggested by Theorem 3.1, and iii) computing the leading eigenvector of $P - \mathbf{1}/n$. We denote by $w$, $\tilde{w}$, and $u$ the three vectors obtained by (i)–(iii), respectively. The comparison is performed for two artificial datasets, two sets of images, a seismic dataset, and a text corpus. For the artificial, image, and seismic data, the matrix $P$ is obtained by the following steps: we compute a pairwise kernel matrix

$$K_{ij} = \exp\big(-\|x_i - x_j\|^2 / \alpha\big),$$

where $\alpha$ is a scale parameter set as in [21] using a max-min scale. The max-min scale is set to

$$\alpha = \max_{i}\ \min_{j \ne i} \|x_i - x_j\|^2.$$

This global scale guarantees that each point is connected to at least one other point. Alternatively, adaptive scales could be used, as suggested in [22]. We then compute $P$ via row normalization, $P = D^{-1}K$, where $D$ is the diagonal matrix with $D_{ii} = \sum_j K_{ij}$. The matrix $P$ can be interpreted as a random walk over the data points (see, for example, [18]). Theorem 3.1 holds for any matrix $P$ that is similar to a symmetric matrix; here we use the common construction from [18], but our results hold for other variations as well. To support our approximation in Theorem 3.1, Figure 7 shows a scatter plot of the scaled vector $u$ vs. $\tilde{w}$. In addition, we compute the correlation coefficient between $u$ and $\tilde{w}$ by

$$\rho(u, \tilde{w}) = \frac{\sum_i (u_i - \mu_u)(\tilde{w}_i - \mu_{\tilde{w}})}{\big(\sum_i (u_i - \mu_u)^2\big)^{1/2}\big(\sum_i (\tilde{w}_i - \mu_{\tilde{w}})^2\big)^{1/2}},$$

where $\mu_u$ and $\mu_{\tilde{w}}$ are the means of $u$ and $\tilde{w}$, respectively. A similar measure is computed for $w$ and $u$. In addition, we illustrate that the norm $\|w\|$ is comparable to $\sqrt{n \log n}$, which supports the upper bound in Theorem 3.3.
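For completeness, this pipeline can be sketched in a few lines (function and variable names are ours; the max-min rule is written out as we read it from [21]):

```python
import numpy as np

def build_P(X):
    """Row-stochastic diffusion matrix from a Gaussian kernel with max-min scale."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # squared distances
    # max-min scale: the largest nearest-neighbor squared distance, so every
    # point keeps at least one strong connection
    alpha = np.where(D2 > 0, D2, np.inf).min(axis=1).max()
    K = np.exp(-D2 / alpha)
    return K / K.sum(axis=1, keepdims=True)

def rho(u, v):
    """Pearson correlation coefficient between two representations."""
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    return float(np.mean(u * v))

X = np.random.default_rng(4).normal(size=(40, 3))
P = build_P(X)
```

`rho` is sign-sensitive; eigenvectors are only defined up to sign, so in practice one compares $|\rho|$.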

5.1 Noisy Circle

Here, the elements are generated by adding Gaussian noise with mean 0 and small variance to a unit circle (see the left panel of Figure 2). The right panel shows the extracted representations $w$ and $\tilde{w}$, along with the leading eigenvector $u$ scaled by $\sqrt{\lambda n}$, where $\lambda$ is the corresponding eigenvalue. The correlation coefficients $\rho(w, u)$ and $\rho(\tilde{w}, u)$ are equal to 0.98 and 0.99, respectively.

FIGURE 2

5.2 Binary MNIST

Next, we use a set of 300 images of the digits 3 and 4 from the MNIST dataset [23]. Two examples from each category are presented in the left panel of Figure 3. Here, the extracted representations $w$ and $\tilde{w}$ match the values of the scaled eigenvector $u$ (see the right panel of Figure 3). The correlation coefficients $\rho(w, u)$ and $\rho(\tilde{w}, u)$ are both higher than 0.999.

FIGURE 3

5.3 COIL100

In this example, we use images from the Columbia Object Image Library (COIL100) [24]. Our dataset contains 21 images of a cat captured at pose intervals of 5 degrees (see the left panel of Figure 4). We extract the embeddings $w$ and $\tilde{w}$ and reorder them based on the true angle of the cat in every image. In the right panel, we present the values of the reordered representations $w$, $\tilde{w}$, and $u$, overlaid with the corresponding objects. The values of all representations are strongly correlated with the angle of the object. Moreover, the correlation coefficients $\rho(w, u)$ and $\rho(\tilde{w}, u)$ are 0.97 and 0.99, respectively.

FIGURE 4

5.4 Seismic Data

Seismic recordings can be useful for identifying properties of geophysical events. We use a dataset collected in Israel and Jordan, described in [25]. The data consists of 1632 seismic recordings of earthquakes and explosions from quarries. Each recording is described by a sonogram with 13 frequency bins and 89 time bins [26] (see the left panel of Figure 5). Events can be categorized into 5 groups using manual annotations of their origin. We flatten each sonogram into a vector and extract the embeddings $w$, $\tilde{w}$, and $u$. In the right panel of the figure, we show the extracted representations of all events. We use dashed lines to annotate the different categories and sort the values within each category based on $u$. The coefficient $\rho(w, u)$ is equal to 0.89.

FIGURE 5

5.5 Text Data

As a final evaluation, we use a corpus of words from the book “Alice in Wonderland,” as processed in [27]. To define a co-occurrence matrix, we scan the sentences using a window covering 5 neighbors before and after each word. We subsample the top 1000 words in terms of occurrences in the book. The matrix $P$ is then defined by normalizing the co-occurrence matrix. In Figure 6 we present centered and normalized versions of the representations $w$ and $\tilde{w}$, and the leading left singular vector $v$ of $P - \mathbf{1}/n$. The coefficient $\rho(w, v)$ is equal to 0.77.

FIGURE 6
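The co-occurrence construction used here is standard; a minimal sketch on a toy corpus (vocabulary and sentences are illustrative) with a symmetric window of 5 words on each side:

```python
import numpy as np

def cooccurrence(sentences, vocab, window=5):
    """Count co-occurrences within `window` words before/after each position."""
    idx = {word: i for i, word in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        toks = [idx[word] for word in sent.split() if word in idx]
        for p, wi in enumerate(toks):
            for q in range(max(0, p - window), min(len(toks), p + window + 1)):
                if q != p:
                    C[wi, toks[q]] += 1
    return C

sentences = ["alice follows the white rabbit", "the rabbit looks at alice"]
vocab = ["alice", "rabbit", "the", "white", "follows", "looks", "at"]
C = cooccurrence(sentences, vocab)
P = C / np.maximum(C.sum(axis=1, keepdims=True), 1)     # row-normalize
```

The symmetric window makes the raw count matrix symmetric; the row normalization then yields the row-stochastic $P$ used throughout.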

6 Multi-Dimensional Embedding

In Section 3 we demonstrated that, under certain assumptions, the maximizer of the energy functional in Eq. 2 is governed by a regularized spectral method. For simplicity, we restricted our analysis to a one-dimensional symmetric representation, i.e., $w \in \mathbb{R}^n$. Here, we demonstrate empirically that this result holds even when embedding the $n$ elements in higher dimensions.

Let $w_i \in \mathbb{R}^d$ be the embedding vector associated with the $i$-th element, where $d$ is the embedding dimension. The symmetric word2vec functional is given by

$$E(W) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \langle w_i, w_j \rangle - \sum_{i=1}^{n} \log\Big( \sum_{j=1}^{n} e^{\langle w_i, w_j \rangle} \Big), \qquad (5)$$

where $W \in \mathbb{R}^{n \times d}$ is the matrix whose $i$-th row is $w_i$. A similar derivation to the one presented in Theorem 3.1 (the one-dimensional case) yields the following approximation of Eq. 5:

$$E(W) \approx -n \log n + \operatorname{tr}\Big( W^{T} \Big(P - \frac{\mathbf{1}}{n}\Big) W \Big) - \frac{1}{2n} \sum_{i,j} \langle w_i, w_j \rangle^2. \qquad (6)$$

Note that both the symmetric functional in Eq. 5 and its approximation in Eq. 6 are invariant to multiplying $W$ by an orthogonal matrix. That is, $W$ and $WQ$ produce the same value in both functionals, where $Q \in \mathbb{R}^{d \times d}$ is any orthogonal matrix.
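The invariance is easy to verify numerically: every term of the functional depends on $W$ only through the Gram matrix $WW^T$, which is unchanged under $W \mapsto WQ$. A sketch, using our reconstruction of the multi-dimensional functional:

```python
import numpy as np

def energy_multi(W, P):
    # depends on W only through the Gram matrix G = W W^T
    G = W @ W.T
    return (P * G).sum() - np.logaddexp.reduce(G, axis=1).sum()

rng = np.random.default_rng(5)
n, d = 30, 4
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)               # toy row-stochastic matrix
W = 0.3 * rng.normal(size=(n, d))

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal matrix
diff = abs(energy_multi(W, P) - energy_multi(W @ Q, P))
```

Since $(WQ)(WQ)^T = WQQ^TW^T = WW^T$, the two evaluations agree up to floating-point rounding.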

To understand how the maximizer of Eq. 5 is governed by a spectral approach, we perform the following comparison: i) we obtain the optimizer $W$ of Eq. 5 via gradient ascent and compute its left singular vectors, denoted $u_1, \dots, u_d$; ii) we compute the leading right singular vectors of $P - \mathbf{1}/n$, denoted by $v_1, \dots, v_d$; iii) we compute the pairwise absolute correlation values $|\rho(u_i, v_j)|$.
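Step (iii) amounts to a small helper that computes all pairwise absolute Pearson correlations between two families of vectors (names ours):

```python
import numpy as np

def abs_corr(U, V):
    """Pairwise absolute Pearson correlations between columns of U and of V."""
    Uc = (U - U.mean(axis=0)) / U.std(axis=0)
    Vc = (V - V.mean(axis=0)) / V.std(axis=0)
    return np.abs(Uc.T @ Vc) / U.shape[0]

U = np.random.default_rng(6).normal(size=(100, 3))
R = abs_corr(U, -U)          # correlations are compared up to sign
```

Because singular vectors are only defined up to sign, the absolute value is the natural quantity to inspect.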

We experiment on two datasets: 1) A collection of 5 Gaussians, and 2) images of hand written digits from MNIST. The transition probability matrix P was constructed as described in Section 5.

6.1 Data from Distinct Gaussians

In this experiment we generate a total of 2,500 samples, consisting of five sets of size 500. The samples in the $i$-th set are drawn independently according to a Gaussian distribution $\mathcal{N}(i \cdot r \cdot \mathbf{1}, I)$, where $\mathbf{1}$ is the all-ones vector and $r$ is a scalar that controls the separation between the Gaussian centers.
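Our reading of this construction, with the ambient dimension as an illustrative choice, can be sketched as:

```python
import numpy as np

def five_gaussians(r, n_per=500, dim=10, seed=7):
    """Five Gaussian clusters: the i-th set has mean i*r*1 and identity covariance."""
    rng = np.random.default_rng(seed)
    return np.concatenate(
        [rng.normal(loc=i * r, scale=1.0, size=(n_per, dim)) for i in range(1, 6)]
    )

X = five_gaussians(r=4.0)    # 2500 samples, cluster means at 4, 8, ..., 20 per coordinate
```

The transition matrix $P$ is then built from `X` exactly as in Section 5.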

Figure 8 shows the absolute values of the pairwise correlations between $u_i$ and $v_j$ for several values of $r$. The correlation between the result obtained via the word2vec functional in Eq. 5 and the right singular vectors of $P - \mathbf{1}/n$ increases when the separation between the Gaussians is high.

FIGURE 7

FIGURE 8

6.2 Multi-Class MNIST

The data consists of samples from the MNIST hand-written digit dataset, with an equal number of images from each digit. We compute a multi-dimensional word2vec embedding $W$ by optimizing Eq. 5.

Figure 9 shows the absolute correlations between the left singular vectors of $W$ and the right singular vectors of $P - \mathbf{1}/n$. As evident from the correlation matrix, the results obtained by both methods span similar subspaces.

FIGURE 9

Funding

YK is supported in part by NIH grants UM1DA051410, R01GM131642, P50CA121974, and R61DA047037. SS was funded by NSF-DMS 1763179 and the Alfred P. Sloan Foundation. EP has been partially supported by the Blavatnik Interdisciplinary Research Center (ICRC), the Federmann Research Center (Hebrew University), the Israel Science Foundation (research grant no. 1523/16), and by the DARPA PAI program (Agreement No. HR00111890032, Dr. T. Senator).

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.

Author contributions

AJ, EP, OL, SS, YK: designed research; EP, OL, AJ, SS, JP: performed research; SS, EP, AJ, OL contributed new reagents or analytic tools; EP, JP, OL: analyzed data; SS, EP, AJ, OL wrote the paper.

Acknowledgments

The authors thank James Garritano and the anonymous reviewers for their helpful feedback.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2020.593406/full#supplementary-material

References

  • 1.

Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: Bengio Y, LeCun Y, editors. 1st International Conference on Learning Representations, ICLR 2013; 2013 May 2–4; Scottsdale, AZ, USA. Workshop Track Proceedings (2013).

  • 2.

Goldberg Y, Levy O. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014).

  • 3.

Grover A, Leskovec J. node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA (2016). p. 855–64. 10.1145/2939672.2939754.

  • 4.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst (2013) 26:3111–9.

  • 5.

Le Q, Mikolov T. Distributed representations of sentences and documents. In: International Conference on Machine Learning, Beijing, China (2014). p. 1188–96.

  • 6.

Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S. graph2vec: learning distributed representations of graphs. arXiv preprint arXiv:1707.05005 (2017).

  • 7.

Hashimoto TB, Alvarez-Melis D, Jaakkola TS. Word embeddings as metric recovery in semantic spaces. TACL (2016) 4:273–86. 10.1162/tacl_a_00098.

  • 8.

Hinton GE, Roweis ST. Stochastic neighbor embedding. Adv Neural Inf Process Syst (2003) 15:857–64.

  • 9.

Cotterell R, Poliak A, Van Durme B, Eisner J. Explaining and generalizing skip-gram through exponential family principal component analysis. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain: Association for Computational Linguistics, Vol. 2, Short Papers (2017). p. 175–81.

  • 10.

Collins M, Dasgupta S, Schapire RE. A generalization of principal components analysis to the exponential family. Adv Neural Inf Process Syst (2002). p. 617–24.

  • 11.

Levy O, Goldberg Y. Neural word embedding as implicit matrix factorization. Adv Neural Inf Process Syst (2014) 3:2177–85.

  • 12.

Qiu J, Dong Y, Ma H, Li J, Wang K, Tang J. Network embedding as matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA. New York, NY: Association for Computing Machinery (2018). p. 459–67.

  • 13.

Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: Association for Computing Machinery (2014). p. 701–10.

  • 14.

Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q. LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, Florence, Italy. New York, NY: Association for Computing Machinery (2015). p. 1067–77.

  • 15.

Arora S, Li Y, Liang Y, Ma T, Risteski A. Random walks on context spaces: towards an explanation of the mysteries of semantic word embeddings. arXiv abs/1502.03520 (2015).

  • 16.

Landgraf AJ, Bellay J. word2vec skip-gram with negative sampling is a weighted logistic PCA. arXiv preprint arXiv:1705.09755 (2017).

  • 17.

Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput (2003) 15:1373–96. 10.1162/089976603321780317.

  • 18.

Coifman RR, Lafon S. Diffusion maps. Appl Comput Harmon Anal (2006) 21:5–30. 10.1016/j.acha.2006.04.006.

  • 19.

Singer A, Wu HT. Spectral convergence of the connection Laplacian from random samples. Inf Inference J IMA (2017) 6:58–123. 10.1093/imaiai/iaw016.

  • 20.

Belkin M, Niyogi P. Convergence of Laplacian eigenmaps. Adv Neural Inf Process Syst (2007) 19:129–36.

  • 21.

Lafon S, Keller Y, Coifman RR. Data fusion and multicue data matching by diffusion maps. IEEE Trans Pattern Anal Mach Intell (2006) 28:1784–97. 10.1109/tpami.2006.223.

  • 22.

Lindenbaum O, Salhov M, Yeredor A, Averbuch A. Gaussian bandwidth selection for manifold learning and classification. Data Min Knowl Discov (2020) 1–37. 10.1007/s10618-020-00692-x.

  • 23.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE (1998) 86:2278–324. 10.1109/5.726791.

  • 24.

Nene SA, Nayar SK, Murase H. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, Columbia University. Available at: https://www1.cs.columbia.edu/CAVE/publications/pdfs/Nene_TR96.pdf (1996).

  • 25.

Lindenbaum O, Bregman Y, Rabin N, Averbuch A. Multiview kernels for low-dimensional modeling of seismic events. IEEE Trans Geosci Rem Sens (2018) 56:3300–10. 10.1109/tgrs.2018.2797537.

  • 26.

Joswig M. Pattern recognition for earthquake detection. Bull Seismol Soc Am (1990) 80:170–86.

  • 27.

Johnsen P. A text version of “Alice's Adventures in Wonderland” [Dataset]. Available at: https://gist.github.com/phillipj/4944029 (2019). Accessed August 8, 2020.


Keywords

dimensionality reduction, word embedding, spectral method, word2vec, skip-gram model, nonlinear functional

Citation

Jaffe A, Kluger Y, Lindenbaum O, Patsenker J, Peterfreund E and Steinerberger S (2020) The Spectral Underpinning of word2vec. Front. Appl. Math. Stat. 6:593406. doi: 10.3389/fams.2020.593406

Received

10 August 2020

Accepted

21 October 2020

Published

03 December 2020

Volume

6 - 2020

Edited by

Dabao Zhang, Purdue University, United States

Reviewed by

Yuguang Wang, Max-Planck-Gesellschaft (MPG), Germany

Halim Zeghdoudi, University of Annaba, Algeria



*Correspondence: Ariel Jaffe,

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
