Techniques for Multiple Source Learning
The arrival of the big data era brings unique challenges and opportunities for data mining research. As vast amounts of data are produced daily by trillions of connected devices, useful knowledge is often deeply buried in data of multiple types, from different sources, in different formats, and with different representations. Many interesting patterns cannot be mined from a single source; they have to be discovered through integrative analysis of the multiple information sources available. Although many algorithms have been proposed to handle multiple information sources, real-life applications continue to pose new challenges: the data can be gigantic, noisy, unreliable, and highly imbalanced. In this thesis, we explore techniques for multiple source learning in these challenging scenarios. Learning from the correlation among multiple information sources has two major parts: supervised learning and unsupervised learning. In supervised learning, we focus on the task of classification with multiple information sources. Multiple data sources for the same set of objects can provide complementary predictive power, and by integrating their expertise, the prediction accuracy can be significantly improved. We demonstrate the benefits of combining multiple sources in two specific learning scenarios. In transfer learning, we propose a novel two-step framework to tackle the irrelevant sources and imbalanced distributions that hamper most existing work. In link prediction, we integrate multiple network sources to provide robust prediction of friendship relationships in a social network. For unsupervised learning, the goal of the thesis research is to explore the differences among multiple information sources in order to find inconsistency. Such knowledge cannot be discovered from a single source and can only be revealed by joint analysis of the correlation among multiple sources.
We propose a joint matrix factorization method and a deep network to detect the inconsistency embedded in multiple sources. The proposed approaches can benefit many applications; in particular, we show how they can help estimate information trustworthiness in online recommendation systems. In this thesis, we address the challenges faced by many applications with multiple data sources. With the proposed multiple source transfer learning framework, we handle the challenges of sources that may not be relevant and of imbalanced distributions. The proposed inconsistency detection differs substantially from traditional anomaly detection in that the inconsistencies it targets cannot be discovered by traditional methods. The algorithms developed in this thesis have proved useful in many areas, including social network analysis, cyber-security, and healthcare systems, and have the potential to be applied in many other areas. As both the amount of data and the number of sources in the world continue to grow explosively, there are great opportunities and challenges in inferring meaningful knowledge from massive multi-source data collections.