Bayesian background models for retrieval of handwritten documents
There is a huge collection of handwritten documents on the web. Retrieving these documents requires transcribing and indexing them. Transcription is performed by a Handwritten Text Recognizer (HTR). In an unconstrained environment, state-of-the-art HTR has a very high error rate (~40%) for Latin-script languages such as English. Word spotting is an effective alternative for retrieval of handwritten documents: the dictionary is restricted to a few keywords, and the output consists of candidate regions in the documents where keywords are present. This dissertation has two parts. First, we have developed a script-independent, recognition-based keyword spotting framework that can be readily integrated with any recognizer. The framework incorporates a dynamic background model to separate keywords from non-keywords, and relies on local character-level and global word-level scores to learn a classifier. We present results on both a segmentation-free line-based recognizer and a segmentation-based word recognizer. We have developed a Bayesian formulation that incorporates variations in writing style into the model, and we apply variational inference to approximate the posterior. The approach is extended by learning weights on individual samples of keywords (and non-keywords) rather than on score features. Our model can be improved by adding labeled samples. The second part of the dissertation describes the use of Bayesian Active Learning to select samples from an unlabeled set so that a robust classifier can be learned. We have extended this approach to identify unlabeled samples where the parameters disagree but are simultaneously confident. A significant advantage of our approach is that the behavior of the recognizer remains consistent across different scripts in terms of the confidence scores returned for keyword and non-keyword samples. 
We demonstrate this by applying a prior learned on weights from one script to samples from another script in a multilingual setup. The methods developed have been validated on publicly available datasets of handwritten documents: the IAM dataset for English, the AMA dataset for Arabic, and the LAW dataset for Devanagari. The system is also evaluated on a synthetic multilingual dataset.