Natural language summarization of text and videos using topic models
Probabilistic topic models have recently become the cornerstone of unsupervised exploratory analysis of text documents using Bayesian statistics. The strength of these models lies in their modularity: random variables can be introduced or modified to suit the requirements of different applications. Many of these models, however, consider only one particular view of the observations, such as treating documents as a flat collection of words, ignoring the nuances of the different classes of annotations that may be present in implicit and/or explicit form. We extend a few existing unsupervised topic models, such as Latent Dirichlet Allocation (LDA), to model documents that are annotated from two different perspectives: word-level tag annotations (e.g. part-of-speech, affect, or positional tags) and document-level highlighting (e.g. crowd-sourced document labels, captions of embedded multimedia). The new models are dubbed the Tag2 LDA class of models, whose primary goal is to combine the best aspects of supervised and unsupervised learning under one framework. Additionally, the correspondence class of Tag2 LDA models explored in this context is state-of-the-art among the family of parametric tag-topic models in terms of predictive log likelihood. These models are presented in Chapter 4.

The field of automatic summary generation is steadily gaining traction, and there is rising demand for summarization algorithms that are applicable to a wide variety of genres of text as well as other kinds of data (e.g. video). Producing short summaries in a human-readable form is particularly attractive for very large datasets. However, the problem is NP-hard even for smaller domains, such as summarizing small sets of newswire documents. We use the Tag2 LDA class of models in conjunction with local models (e.g. extracting syntactic and semantic roles of words, Rhetorical Structure trees, etc.) to perform multi-document summarization of text documents based on information needs that are guided by a common information model. The guided summarization task, as laid out in recent text summarization competitions, aims to cover information needs by asking questions like "who did what, when, and where?" We have also successfully applied multi-modal topic models to summarize domain-specific videos into natural language text directly from low-level features extracted from the videos. The experiments performed for this task are described in detail in Chapter 5.

Finally, in Chapter 6, we show that using topic models it is possible to outperform keyword summaries generated by annotating videos through state-of-the-art object recognition techniques from computer vision. Summarizing a video in terms of natural language generated from such keywords in context removes the laborious frame-by-frame drawing of bounding boxes around objects of interest, a scheme that is required when annotating videos to train a large number of object detectors. The topic models that we develop for this purpose instead use easily available short lingual descriptions of entire videos to predict text for a given domain-specific test video. The models are also novel in handling both text and video features, particularly with regard to multimedia topic discovery from captioned videos whose features can belong to both discrete and real-valued domains.
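As background for the extensions described above, standard (untagged) LDA treats each document as a bag of words and infers a per-document mixture over latent topics. A minimal sketch using scikit-learn's variational LDA is shown below; the toy corpus and the choice of two topics are illustrative assumptions, not data or settings from the thesis:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (hypothetical); real corpora would have thousands of documents.
docs = [
    "topic models discover latent themes in text",
    "videos can be summarized with multimodal topic models",
    "latent dirichlet allocation is a generative model of text",
    "summaries of videos use low level visual features",
]

# Bag-of-words view: exactly the "flat collection of words" assumption
# that annotation-aware models like Tag2 LDA relax.
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # normalized per-document topic mixtures

print(doc_topics.shape)  # (4, 2): one topic mixture per document
```

Each row of `doc_topics` sums to one; Tag2 LDA-style models enrich this generative story with word-level tags and document-level labels rather than replacing it.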
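The guided-summarization pipeline combines topic models with local linguistic models; as a much-simplified illustration of how inferred topic mixtures can drive extractive selection, the sketch below scores each candidate sentence by the cosine similarity between its topic mixture and the corpus-level mixture, keeping the top k. The scoring rule, the `summarize` helper, and all numbers are hypothetical, not the method from Chapter 5:

```python
import numpy as np

def summarize(sentence_topics, corpus_topics, k=2):
    """Return indices of the k sentences whose topic mixtures best
    match the corpus-level mixture (cosine similarity)."""
    norms = np.linalg.norm(sentence_topics, axis=1) * np.linalg.norm(corpus_topics)
    scores = sentence_topics @ corpus_topics / norms
    return np.argsort(scores)[::-1][:k]  # top-k sentence indices

# Hypothetical mixtures over 3 topics for 4 candidate sentences.
S = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.1, 0.8],
])
corpus = S.mean(axis=0)          # overall topic mixture of the document set
picked = summarize(S, corpus, k=2)
print(sorted(picked.tolist()))   # [0, 2]
```

Sentence 2 wins because its mixture is closest in direction to the corpus average; a real system would additionally penalize redundancy among selected sentences and respect the information-need questions ("who did what, when, and where?").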