Methodologies for Learning Robust Feature Representations
In order to draw accurate inferences and make predictions from a given set of data samples, one needs a feature representation that efficiently models the underlying data manifold. The model should reflect the compact global structure of the data, capture its behavior, and be robust in the presence of noise. While learning the manifold is not feasible if the data follows an arbitrary distribution, most real-world data has rich structure and thus lends itself to compact modeling. However, because data sampling is always finite and tends to be noisy, one faces the technical challenges of selecting an appropriate manifold model and designing suitable regularizations. This thesis studies these problems in the context of data with independent subspace structure and of deep learning algorithms. Extant literature has predominantly analyzed data with independent subspace structure: given a K-class problem, each class lies near a linear subspace independent of the others, so that each pair of subspaces is disjoint. Face images and motion in videos have been found to exhibit this subspace linearity. In this thesis, we propose three dimensionality reduction algorithms for data that approximately satisfies this independent subspace property: 1) we show that random projections preserve the independence between subspaces even without knowledge of the actual data; 2) we develop an efficient supervised algorithm that preserves the subspace structure of K-class data using just 2K projection vectors; 3) we develop an algorithm that learns an embedding for labeled data such that the samples from each class lie in a low-dimensional subspace. However, the independent subspace structure assumption has a restricted range of applications.
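The first claim, that a data-agnostic random projection preserves the separation between independent subspaces, can be illustrated with a minimal numpy sketch. All dimensions and the principal-angle check below are illustrative choices, not the thesis's actual construction or proof:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random 5-dimensional subspaces of R^1000, given by orthonormal bases.
# (Dimensions here are arbitrary, chosen only for illustration.)
ambient, sub_dim, proj_dim = 1000, 5, 50
B1 = np.linalg.qr(rng.standard_normal((ambient, sub_dim)))[0]
B2 = np.linalg.qr(rng.standard_normal((ambient, sub_dim)))[0]

def max_cos_principal_angle(A, B):
    """Largest cosine of the principal angles between col(A) and col(B).

    A value near 1 means the subspaces nearly intersect; a value
    bounded away from 1 means they remain well separated."""
    Qa = np.linalg.qr(A)[0]
    Qb = np.linalg.qr(B)[0]
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)[0]

# A random Gaussian projection to 50 dimensions, built with no
# knowledge of the data or the subspaces.
R = rng.standard_normal((proj_dim, ambient)) / np.sqrt(proj_dim)

before = max_cos_principal_angle(B1, B2)
after = max_cos_principal_angle(R @ B1, R @ B2)

# With high probability the projected subspaces stay well separated
# (cosine bounded away from 1), even though R ignores the data.
print(before, after)
```

The point of the sketch is that `R` is drawn independently of `B1` and `B2`, yet the projected subspaces do not collapse onto each other.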
Thus, this thesis also studies a class of algorithms with wider applicability that can model data sampled from more general distributions. Deep learning algorithms have recently become popular for automatically learning useful features from data, obviating the need for hand-crafted features. We provide novel analysis and algorithms in this direction. Auto-Encoders (AEs) are a sub-class of algorithms widely used by the deep learning community and have recently become popular for learning data distributions. In doing so, they exploit the sparse distributed structure (a many-to-many relationship between the original and latent feature spaces) present in the data distribution. In this thesis, we analytically derive the conditions on activation functions and regularizations that encourage sparsity in the hidden representation of AEs. Our analysis shows that multiple regularized AEs and activation functions share similar underlying properties that encourage sparsity. We also study the first layer of neural networks with rectified linear and sigmoid activations. We show that if the observed data is generated from a true first-layer hidden representation whose distribution is bounded independent non-negative and sparse (BINS), then this representation can be recovered for every corresponding data sample, under a PAC bound, by forward propagating the data. We show that this view unifies multiple existing but disparate techniques in the deep learning community. Finally, we propose a novel technique called Normalization Propagation for avoiding internal covariate shift (ICS), which has been shown to slow down convergence when training deep neural networks.
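The role of the activation function in hidden-layer sparsity can be seen in a toy comparison. This is not the thesis's analysis; it is a minimal sketch with a single randomly initialized (untrained) encoder layer and standard-normal inputs, contrasting a rectified-linear code with a sigmoid one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup only: 256 samples with 100 features, one random
# encoder layer with 500 hidden units (no training involved).
X = rng.standard_normal((256, 100))
W = rng.standard_normal((100, 500)) / np.sqrt(100)
pre = X @ W  # pre-activations, symmetric around zero

relu_h = np.maximum(0.0, pre)             # rectified-linear hidden code
sigm_h = 1.0 / (1.0 + np.exp(-pre))       # sigmoid hidden code

# ReLU sets roughly half the units exactly to zero, a sparse code;
# the sigmoid keeps every unit strictly positive.
relu_sparsity = np.mean(relu_h == 0.0)
sigm_sparsity = np.mean(sigm_h == 0.0)
print(relu_sparsity, sigm_sparsity)
```

The contrast motivates why the choice of activation (and of regularization, in the trained case analyzed in the thesis) matters for how sparse the learned representation can be.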
Since ICS is caused by the shifting distribution of the inputs to hidden layers during training, our algorithm propagates the normalization performed at the data level to all higher layers, ensuring that each hidden layer's input follows a standard Normal distribution. We show that our proposed algorithm achieves state-of-the-art results on multiple benchmark datasets.
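The core idea, renormalizing each layer analytically rather than from batch statistics, can be sketched for one ReLU layer. Assuming a standard-normal pre-activation z, the post-ReLU mean and variance have the closed forms E[max(0,z)] = 1/sqrt(2*pi) and Var[max(0,z)] = 1/2 - 1/(2*pi), so the normalization can be propagated with fixed constants; the single-layer setup below is a simplification of the full method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplified single-layer sketch: the layer input is already
# standard normal (as the data-level normalization guarantees).
z = rng.standard_normal(1_000_000)
h = np.maximum(0.0, z)  # after ReLU, h is no longer zero-mean, unit-variance

# Analytic post-ReLU moments for a standard-normal input.
mean_c = 1.0 / np.sqrt(2.0 * np.pi)          # E[max(0, z)]
std_c = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))   # sqrt(Var[max(0, z)])

# Propagate the normalization using these fixed constants,
# with no batch statistics computed during training.
h_norm = (h - mean_c) / std_c
print(h_norm.mean(), h_norm.std())  # approximately 0 and 1
```

Because the correction uses fixed analytic constants, every hidden layer's input can be kept approximately standard normal without estimating means and variances from each mini-batch.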