Mining hidden associations in text corpora through concept chain and graph queries
The availability of large volumes of text documents has created the potential to discover valuable information hidden in those texts. This in turn has created the need for automated methods of discovering such information without having to read every document. The main theme of this dissertation is based on the hypothesis that the whole (the document collection) is greater than the sum of its parts (the individual documents). Interesting links and hidden information that connect facts, propositions or hypotheses can be found by using novel text mining techniques along with traditional data mining techniques. We refer to this research area as unapparent information revelation (UIR). The goal of this dissertation is to automate techniques that sift through these extensive document collections and find such links. Previous work in our UIR group has defined Concept Chain Queries (CCQ) and Concept Graph Queries (CGQ), special cases of text mining in document collections that focus on detecting links between two or more concepts across text documents. A concept chain query involving concepts A and B has the following meaning: find the most plausible relationships between concept A and concept B, assuming that one or more instances of both concepts occur in the corpus, but not necessarily in the same document. Unlike traditional search, a CCQ is interpreted as finding the best concept chain and evidence trail across multiple documents that connect the two concepts. CCQ can be extended to CGQ, where three or more concepts are involved. In this dissertation, the UIR problem is approached from various perspectives. I adapt the traditional bag-of-words approach, the existing Association Rule Mining method and the Local Context Analysis technique to address this problem. Specifically, I show that it is possible to improve knowledge discovery in document collections by combining text retrieval and link analysis techniques.
In addition, an explanation of the retrieved chain (or graph) is generated in the form of a cross-document evidence trail that supports further investigation. Such a trail is a special case of a cross-document summary. Experiments on different data sets are presented that demonstrate the effectiveness of the proposed approach.
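To make the notion of a concept chain query concrete, the following is a minimal sketch and not the algorithm developed in this dissertation. It assumes that each document has already been reduced to a list of concepts, links two concepts whenever they co-occur in a document, and treats the shortest path between the two query concepts as a stand-in for the "most plausible" chain. The function names, the toy corpus `EXAMPLE_DOCS`, and the shortest-path criterion are all illustrative assumptions.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical toy corpus: document id -> concepts mentioned in that document.
# Concepts "A" and "B" never appear in the same document.
EXAMPLE_DOCS = {
    "d1": ["A", "X"],
    "d2": ["X", "B"],
    "d3": ["A", "Y"],
}


def build_cooccurrence_graph(docs):
    """Link two concepts whenever some document mentions both of them."""
    neighbors = defaultdict(set)   # concept -> concepts it co-occurs with
    evidence = defaultdict(set)    # frozenset({a, b}) -> supporting doc ids
    for doc_id, concepts in docs.items():
        for a, b in combinations(sorted(set(concepts)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
            evidence[frozenset((a, b))].add(doc_id)
    return neighbors, evidence


def concept_chain(docs, start, goal):
    """Return (chain, evidence_trail) for the shortest chain, or None.

    Assumption: the shortest co-occurrence path approximates the most
    plausible chain; a real system would rank chains by link strength.
    """
    neighbors, evidence = build_cooccurrence_graph(docs)
    if start not in neighbors or goal not in neighbors:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:                   # breadth-first search from start
        current = queue.popleft()
        if current == goal:
            break
        for nxt in sorted(neighbors[current]):
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    if goal not in parent:
        return None
    chain = []
    node = goal
    while node is not None:        # walk parent pointers back to start
        chain.append(node)
        node = parent[node]
    chain.reverse()
    # One list of supporting documents per link in the chain.
    trail = [sorted(evidence[frozenset(pair)]) for pair in zip(chain, chain[1:])]
    return chain, trail


if __name__ == "__main__":
    print(concept_chain(EXAMPLE_DOCS, "A", "B"))
```

On the toy corpus, the query for "A" and "B" yields the chain A, X, B with the evidence trail d1, d2: the two concepts are connected only across documents, which is the situation a concept chain query is meant to capture.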