Learning from and actively selecting pairwise constraints in data science
In the last decade, we have witnessed a data explosion in computer vision, data mining, and bioinformatics. In this big-data era, how to effectively and efficiently analyze increasingly large-scale datasets is attracting more and more attention. As datasets grow, so does the number of classes, which makes it hard to ask humans to provide an accurate class label for each data point. Accordingly, we study another form of human supervision: pairwise constraints, which express the degree of semantic similarity between a pair of data points and can be obtained relatively easily in large-scale scenarios. Based on pairwise constraints, we study two kinds of problems and make progress on both: utilizing pairwise constraints for distance metric learning, and actively selecting pairwise constraints for semi-supervised clustering.

First, we present an effective metric learning method based on the max-margin framework, together with its kernelized extension. To learn the distance function efficiently, we adopt the cutting-plane algorithm and propose an approximation method based on matching pursuit that allows linear-time training even in the kernel case. However, the learned metric is a single Mahalanobis metric, as in most metric learning methods, and therefore cannot handle heterogeneous data well; methods that learn multiple metrics throughout the feature space have demonstrated superior accuracy, but at a severe cost in computational efficiency. We therefore take a new angle on the metric learning problem and learn a single metric that implicitly adapts its distance function throughout the feature space. This adaptation is accomplished by using a random forest-based classifier to underpin the distance function, incorporating both absolute pairwise position and standard relative position into the representation.
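To make the core object concrete, here is a minimal, hypothetical sketch of learning a Mahalanobis metric d_M(x, y)^2 = (x - y)^T M (x - y) from must-link and cannot-link constraints. This is not the max-margin cutting-plane solver described above; it only illustrates the shared idea with a simple projected-gradient loop (hinge penalties on violated pairs, then projection onto the PSD cone), with all function names and parameters chosen for illustration:

```python
import numpy as np

def mahalanobis_sq(M, x, y):
    """Squared distance (x - y)^T M (x - y) under the learned metric M."""
    d = x - y
    return d @ M @ d

def learn_metric(X, must_link, cannot_link, margin=1.0, lr=0.01, iters=200):
    """X: (n, d) data; must_link / cannot_link: lists of index pairs.
    Pull must-link pairs inside the margin, push cannot-link pairs outside,
    and project M back onto the PSD cone so it remains a valid metric."""
    n, dim = X.shape
    M = np.eye(dim)                      # start from the Euclidean metric
    for _ in range(iters):
        G = np.zeros((dim, dim))
        for i, j in must_link:           # similar pair too far apart
            d = X[i] - X[j]
            if d @ M @ d > margin:
                G += np.outer(d, d)
        for i, j in cannot_link:         # dissimilar pair too close
            d = X[i] - X[j]
            if d @ M @ d < margin:
                G -= np.outer(d, d)
        M -= lr * G
        w, V = np.linalg.eigh(M)         # eigen-clip: nearest PSD matrix
        M = (V * np.clip(w, 0.0, None)) @ V.T
    return M
```

Under such a scheme, a noisy feature that separates must-link pairs gets its weight in M driven toward zero, while features that separate cannot-link pairs keep or gain weight.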
Our third contribution addresses fast nearest neighbor retrieval in large-scale datasets under a given learned distance. We propose a novel hashing method that represents each data point as a compact binary code, achieving both fast queries and efficient data storage. Unlike most traditional hashing methods, which focus on learning projection functions, our method studies a new quantization strategy that adaptively assigns varying numbers of bits to different hyperplanes based on their information content, which significantly improves on traditional hashing methods.

Finally, rather than selecting pairwise constraints at random, we actively select them by assessing how each candidate constraint would improve the final assignment in semi-supervised clustering. We propose a novel online framework for active semi-supervised spectral clustering that selects pairwise constraints as clustering proceeds, based on the principle of uncertainty reduction, and significantly outperforms random sampling.
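The adaptive bit allocation idea, spending more bits on more informative hyperplanes, can be sketched as follows. This is not the proposed quantization method itself; it is a hypothetical illustration in which "information content" is proxied by projection variance, bits are allocated greedily, and each projection is quantized into equal-frequency bins:

```python
import numpy as np

def allocate_bits(variances, total_bits):
    """Greedily grant one bit at a time to the hyperplane with the highest
    remaining variance; each extra bit cuts that hyperplane's share by ~4x
    (the usual drop in scalar quantization error per added bit)."""
    var = np.asarray(variances, dtype=float).copy()
    bits = np.zeros(len(var), dtype=int)
    for _ in range(total_bits):
        k = int(np.argmax(var))
        bits[k] += 1
        var[k] /= 4.0
    return bits

def encode(X, directions, bits):
    """Quantize each 1-D projection into 2^b equal-frequency bins and
    concatenate the b-bit binary codes of the bin indices."""
    codes = []
    for w, b in zip(directions, bits):
        if b == 0:
            continue                      # uninformative hyperplane: no bits
        p = X @ w
        qs = np.quantile(p, np.linspace(0, 1, 2**b + 1)[1:-1])
        idx = np.searchsorted(qs, p)      # bin index in [0, 2^b)
        codes.append(((idx[:, None] >> np.arange(b)[::-1]) & 1).astype(np.uint8))
    return np.hstack(codes)
```

Note how a fixed bit budget concentrates on high-variance directions, while near-degenerate hyperplanes may receive no bits at all.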
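The uncertainty-reduction principle behind active constraint selection can likewise be sketched. The heuristic below is a simplified, hypothetical stand-in for the proposed online framework: given the current soft cluster assignments, it queries a pair joining the most uncertain point to a confidently assigned point of its likeliest cluster:

```python
import numpy as np

def entropy(P):
    """Shannon entropy of each row of the soft assignment matrix P (n, k)."""
    P = np.clip(P, 1e-12, 1.0)
    return -(P * np.log(P)).sum(axis=-1)

def select_pair(P, queried):
    """Pick the next pair to ask the oracle about: pair the point whose
    cluster assignment is most uncertain with the most confidently
    assigned point of its likeliest cluster, skipping queried pairs."""
    H = entropy(P)
    for i in np.argsort(-H):              # most uncertain points first
        c = int(P[i].argmax())
        conf = P[:, c].copy()
        conf[i] = -np.inf                 # never pair a point with itself
        for j in np.argsort(-conf):
            if (min(i, j), max(i, j)) not in queried:
                return int(i), int(j)
    return None                           # every pair already queried
```

In an active loop, the oracle's must-link or cannot-link answer for the returned pair would be fed back into the semi-supervised clustering, the assignments P re-estimated, and the selection repeated.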