Published: March 2, 2016
Wednesday, March 02, 2016 3:00 PM - 4:00 PM
Main Campus - Engineering Classroom Wing - 257: Newton Lab
Alex Gittens; International Computer Science Institute; University of California, Berkeley

Why (some) nonlinear embeddings capture compositionality linearly 

Dimensionality reduction methods have been used to represent words with vectors in NLP applications since at least the introduction of latent semantic indexing in the late 1980s, but word embeddings developed in the past several years have exhibited a robust ability to map semantics, in a surprisingly straightforward manner, onto simple linear algebraic operations. These embeddings are trained on co-occurrence statistics and intuitively justified by appealing to the distributional hypothesis of Harris and Firth, but they are typically presented in an ad hoc, algorithmic manner. We consider the canonical skip-gram Word2vec embedding, one of the best known of these recent word embeddings, and establish a corresponding generative model that maps the composition of words onto the addition of their embeddings. We argue for the optimality of the canonical Word2vec embedding by drawing a connection to the sufficient dimensionality reduction principle of Globerson and Tishby. Because, first, words can be replaced more generally by arbitrary symbols and, second, there are natural connections between Word2vec, the classical RC(M) association model for contingency tables, Bayesian PCA decompositions, and Poisson matrix completion, our results are meaningful in a broad context.
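
To make the additive-composition claim concrete, here is a minimal, illustrative sketch (not the speaker's code or data): it uses invented toy embeddings, composes two words by adding their vectors, and retrieves the vocabulary item whose embedding is closest in cosine similarity. The vocabulary, vectors, and helper function below are hypothetical stand-ins for embeddings that would, in practice, be trained from corpus co-occurrence statistics.

```python
import numpy as np

# Toy, hand-made embeddings purely for illustration; real skip-gram vectors
# would be learned from co-occurrence statistics over a large corpus.
vocab = ["new", "york", "new_york", "city", "angeles"]
emb = np.array([
    [1.0, 0.0, 0.0],   # new
    [0.0, 1.0, 0.0],   # york
    [0.7, 0.7, 0.1],   # new_york
    [0.1, 0.3, 0.9],   # city
    [0.0, 0.2, 0.8],   # angeles
])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

def nearest(vec):
    """Vocabulary word whose embedding has the highest cosine similarity to vec."""
    sims = emb @ (vec / np.linalg.norm(vec))
    return vocab[int(np.argmax(sims))]

# Additive composition: approximate the embedding of the phrase "new york"
# by the sum of the embeddings of its constituent words.
composed = emb[vocab.index("new")] + emb[vocab.index("york")]
print(nearest(composed))  # prints "new_york" for these toy vectors
```

In this toy setup the sum of the constituents lands nearest to the phrase token, which is the behavior the talk's generative model is meant to explain for actual skip-gram embeddings.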