Word spotting in offline multilingual handwritten documents based on Hidden Markov Models
Recognition of unconstrained handwritten documents remains a challenging task, primarily because of the vast variability in writing styles and because many applications offer no means of constraining large vocabularies. Word spotting has often been proposed as an alternative to full transcription for keyword-based retrieval and indexing of document images. Word spotting techniques fall mainly into two categories: template-based and learning-based approaches. Template-based approaches require at least one query image in the training set and usually produce a high number of false positives. Learning-based approaches have recently been proposed as an alternative. While they regularly outperform template-based approaches, they are inefficient and do not scale across scripts. Previous work has addressed the scalability issue by building a separate system for each script or language, each with its own modules for preprocessing, query representation and word segmentation. In this dissertation, we describe a new methodology for word spotting that handles large background vocabularies without requiring separately built word or character segmentation for each script. Our approach is based on Hidden Markov Models of trained characters, which can be combined to model any keyword query, even one unseen in the training corpus. The main contribution of our approach is its use of script-independent methods for feature extraction, training and recognition, and their scalability across multiple scripts such as English, Arabic, and Devanagari. Our methodology of combining character filler models and background models outperforms state-of-the-art methods for line-based word spotting. It has been evaluated on public datasets in different languages: IAM for English, AMA for Arabic, and LAW for Devanagari (Indian languages).
This work lays the foundation for the development of the first learning-based word spotting system for multilingual handwritten documents. Both the 'script identifier based' and the 'script identifier free' approaches were used to spot keywords in multilingual documents. The 'script identifier free' approach, which omits the initial script identification switch, showed higher accuracy in detecting keywords in documents containing more than one script, and also better performance overall.
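The idea of modeling an arbitrary keyword from trained character models can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names `char_model` and `keyword_model`, the state counts, and the transition probabilities are all hypothetical, and emission distributions, filler models and background models are omitted. The sketch only shows how left-to-right character HMMs could be chained so that a keyword never seen in training still gets a model.

```python
from typing import Dict, List

Matrix = List[List[float]]


def char_model(n_states: int, stay: float) -> Matrix:
    """Left-to-right transition matrix for one character (hypothetical values).

    Row i puts `stay` on the self-loop and `1 - stay` on the move to state
    i + 1. The extra last column is the exit out of the character model.
    """
    trans = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for i in range(n_states):
        trans[i][i] = stay
        trans[i][i + 1] = 1.0 - stay
    return trans


def keyword_model(word: str, chars: Dict[str, Matrix]) -> Matrix:
    """Chain character models: the exit of one feeds the entry of the next."""
    missing = [c for c in word if c not in chars]
    if missing:
        raise KeyError(f"no trained character model for: {missing}")
    total = sum(len(chars[c]) for c in word)
    # One extra column again serves as the exit of the whole keyword model.
    trans = [[0.0] * (total + 1) for _ in range(total)]
    offset = 0
    for c in word:
        block = chars[c]
        for i, row in enumerate(block):
            for j, p in enumerate(row):
                trans[offset + i][offset + j] = p
        offset += len(block)
    return trans


# Assumed, illustrative character models; real ones would be trained.
chars = {
    "c": char_model(3, 0.6),
    "a": char_model(2, 0.5),
    "t": char_model(3, 0.7),
}
model = keyword_model("cat", chars)
print(len(model), "states in the keyword model")
```

Because the keyword model is assembled from per-character blocks, adding a new query requires no retraining, only that every character of the query has a trained model.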