Published: Jan. 23, 2018

With the growing scale and complexity of datasets across scientific disciplines, traditional data analysis methods are no longer practical for extracting meaningful information and patterns. The need to process large-scale datasets with memory- and computation-efficient algorithms arises in all fields of science and engineering. Randomization and probabilistic techniques have become fundamental tools in modern data science and machine learning for analyzing large-scale datasets. In particular, randomized dimensionality reduction techniques are effective in modern data settings because they provide a non-adaptive, data-independent mapping of high-dimensional datasets into a lower-dimensional space. However, such methods require a strong theoretical understanding to ensure that the key properties of the original data are preserved.

In this talk, we will focus on two important topics in modern data analysis: (1) K-means clustering and (2) low-rank approximation of kernel matrices for analyzing datasets with highly complex and nonlinear structures. Specifically, we will present a randomized algorithm for K-means clustering in the one-pass streaming setting that does not require incoherence or distributional assumptions on the data. We will also discuss the Nyström method for generating low-rank approximations of the kernel matrices that arise in many machine learning problems, and show how randomized dimensionality reduction techniques allow us to obtain low-rank approximations that are more accurate and efficient than those of other state-of-the-art Nyström methods.