From hierarchies to metrics: Learning nonlinear models of semantic association
Johnson, David M.
Modeling the degree of semantic similarity or dissimilarity between instances is one of the most elemental problems in machine learning and data mining. With a strong enough similarity model, other problems such as classification, clustering, and retrieval become much more tractable, or even trivial. Unfortunately, selecting an appropriate measure is itself a difficult task: accurately representing semantic similarity within a particular space requires a deep understanding of the semantic setting, potentially including the relationship between every possible pair of instances. Two of the most popular approaches for generating rich semantic association models are metric learning, which seeks to explicitly learn a semantic distance function over a data space, and hierarchical clustering, which attempts to arrange data into a semantically meaningful multi-level tree structure. I show that these two approaches are in fact strongly interrelated, and that tools and techniques from one modality can be used to improve performance in the other.

I present four novel methods in this field. The Random Forest Distance approaches metric learning from the perspective of a classification problem, and uses the tree-based random forest classifier to solve that problem far more flexibly than is possible under traditional metric modalities. My work on semi-supervised hierarchical clustering explores new methods for generating semantically meaningful cluster trees, and demonstrates that a cluster hierarchy can itself function as a metric. The Hierarchy Forest Distance combines insights from both of these methods to construct a state-of-the-art semi-supervised nonlinear metric, and the Generalized Hierarchy Forest Distance extends it to multilabel and other complex data.
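The core idea behind casting metric learning as classification can be sketched as follows: label pairs of instances as similar or dissimilar, encode each pair as a feature vector, train a random forest on those pairs, and read the predicted probability of "dissimilar" as a learned distance. This is a minimal illustrative sketch, not the dissertation's implementation; the pair encoding (|x - y|, (x + y)/2), the toy data, and the use of scikit-learn's `RandomForestClassifier` are all assumptions made for demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: two well-separated 2-D clusters (assumed for illustration).
a = rng.normal(0.0, 0.3, size=(30, 2))
b = rng.normal(3.0, 0.3, size=(30, 2))
X = np.vstack([a, b])
y = np.array([0] * 30 + [1] * 30)

def pair_features(x1, x2):
    # One common pair encoding: element-wise difference plus midpoint.
    return np.hstack([np.abs(x1 - x2), (x1 + x2) / 2.0])

# Build labeled pairs: 0 = same cluster (similar), 1 = different (dissimilar).
pairs, labels = [], []
for _ in range(500):
    i, j = rng.integers(0, len(X), size=2)
    pairs.append(pair_features(X[i], X[j]))
    labels.append(int(y[i] != y[j]))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.array(pairs), labels)

def rf_distance(x1, x2):
    # Predicted probability of "dissimilar" serves as the learned distance.
    return clf.predict_proba(pair_features(x1, x2).reshape(1, -1))[0, 1]

within = rf_distance(a[0], a[1])   # same cluster: should be small
between = rf_distance(a[0], b[0])  # different clusters: should be large
```

Because each random forest tree partitions the pair-feature space with axis-aligned splits, the resulting distance is nonlinear and need not satisfy the triangle inequality, which is precisely the flexibility the classification framing buys over traditional Mahalanobis-style metrics.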