High Performance Approaches for Large-Scale Non-Linear Spectral Dimensionality Reduction of Scientific Data
The vast majority of today's big data, coming, for example, from high-performance, high-fidelity numerical simulations, high-resolution scientific instruments (microscopes, DNA sequencers, etc.), or Internet of Things streams and feeds, is the result of complex non-linear processes. While these non-linear processes can be characterized by low-dimensional manifolds, the actual observable data they generate is high-dimensional. This high-dimensional data is inherently difficult to explore and analyze, owing to the curse of dimensionality and the empty-space phenomenon, which render many statistical and machine learning techniques (e.g., clustering, classification, model fitting) inadequate. In this context, non-linear spectral dimensionality reduction has proved to be an indispensable tool. Non-linear spectral dimensionality reduction methods rely on a spectral decomposition of a feature matrix that captures properties of the underlying manifold from the given samples, and effectively bring the original data into a more human-intuitive low-dimensional space that makes quantitative and qualitative analysis of non-linear processes possible (e.g., by enabling visualization). However, the leading manifold learning methods, such as Isomap, which we consider here, remain ineffective in their out-of-the-box form when confronted with modern, large-scale datasets. Existing tools for dimensionality reduction limit the size of the datasets we can analyze, because their computational and memory complexity are at least quadratic in the number of input points. Manifolds whose samples are inherently skewed are also highly difficult to analyze, and a common point of failure for many methods. In this work, we address these obstacles, which are common to existing manifold learning methods, and propose state-of-the-art solutions. Specifically, we develop and propose methods which (i) learn data in a streaming fashion, i.e.
by performing the costly decomposition up-front to learn the manifold and then mapping new data points onto the manifold in a cost-efficient manner, (ii) adaptively learn neighborhoods of a manifold when data arrive in highly correlated samples, and (iii) allow for exact manifold learning of big data with the Apache Spark model, in a distributed-memory manner on a cluster, thereby enabling dimensionality reduction of datasets an order of magnitude larger than before. To exhibit correctness and showcase performance, we evaluate our methods on commonly used real and synthetic machine learning benchmark datasets. We demonstrate the practical value of our methods with applications in microfluidics, materials design and manufacturing, and information retrieval. Throughout, we identify the shortcomings of current methods, motivate the need for solutions to pressing issues in the presence of big data, and highlight the impact of our work in other domains.
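To make the streaming idea in (i) concrete, the sketch below uses scikit-learn's Isomap as a stand-in (it is not the implementation developed in this work): the expensive neighborhood-graph construction and spectral decomposition happen once on a training sample, after which new points are mapped onto the learned low-dimensional embedding cheaply via `transform`. The Swiss-roll data, neighborhood size, and batch size are arbitrary choices for illustration.

```python
# Sketch only: scikit-learn's Isomap stands in for the streaming approach
# described above; it is not the method developed in this work.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Swiss roll: a classic synthetic benchmark, a 2-D manifold embedded in 3-D.
X_train, _ = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)

# Costly step, done once up-front: geodesic distance estimation plus
# spectral decomposition, at least quadratic in the number of input points.
iso = Isomap(n_neighbors=10, n_components=2)
Y_train = iso.fit_transform(X_train)

# Cheap step, repeated per incoming batch: out-of-sample mapping of new
# points onto the already-learned 2-D embedding.
X_batch, _ = make_swiss_roll(n_samples=200, noise=0.05, random_state=1)
Y_batch = iso.transform(X_batch)

print(Y_train.shape, Y_batch.shape)  # (2000, 2) (200, 2)
```

This separation is what makes streaming feasible: once the manifold is learned, each new batch costs only a nearest-neighbor lookup and a small linear mapping rather than a fresh quadratic-cost decomposition.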