A stochastic framework for font-independent Devanagari OCR
Font-independent OCR solutions for Latin and Oriental scripts are commercially available and widely used in digital library applications. However, accurate OCR is still not available for Devanagari, a script used by over 400 million people in more than forty languages, including Hindi and Sanskrit. Challenges in Devanagari OCR include: (i) a large number of character classes, (ii) character shapes made of complex primitives that cannot be easily segmented using conventional character segmentation approaches, (iii) variable representations of the same character in different fonts, and (iv) a preponderance of poor print quality or poor-quality paper that causes unpredictable character distortions. We address these challenges by segmenting the characters into components that are horizontally or vertically juxtaposed and connected along non-linear boundaries. Most techniques in the literature approximate the segmentation process by using sliding windows or projection profiles in a single direction. We adopt a Block Adjacency Graph (BAG) representation, where each node of the BAG represents a part of the character image and the edges represent their interconnections. Characters are segmented by selecting subgraphs, while also accommodating the natural breaks and joints in characters and the various ways in which alphabets can join. Instead of the common approach of using font-dependent rules to guide segmentation and classification, we have developed a recognition-driven segmentation method that generates multiple segmentation results for each character. Word hypotheses are generated by integrating image recognition results with a language model that encodes the frequencies of alphabets and syllabic characters. This contrasts with previous work, which has used language models primarily as a post-processing step.
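As a rough illustration of the BAG representation described above, the following Python sketch builds a simplified block adjacency graph from a binary image. It is not the implementation used in this work: the function names `row_runs` and `build_bag`, the rule for merging runs into blocks (identical column spans in consecutive rows, with no branching), and the toy T-shaped image are all assumptions made for illustration.

```python
def row_runs(row):
    """Return (start_col, end_col) for each maximal run of foreground pixels."""
    runs, col = [], 0
    while col < len(row):
        if row[col]:
            start = col
            while col < len(row) and row[col]:
                col += 1
            runs.append((start, col - 1))
        else:
            col += 1
    return runs


def build_bag(image):
    """Build a simplified block adjacency graph.

    Nodes are blocks [top, bottom, left, right]; edges link a block to the
    blocks directly below it. Merge rule (an assumption of this sketch): a run
    extends the block above only if the two runs touch no other run and have
    identical column spans.
    """
    blocks, edges = [], set()
    prev = []  # (start, end, block_id) for runs in the previous row
    for r, row in enumerate(image):
        runs = row_runs(row)
        cur = []
        for start, end in runs:
            above = [p for p in prev if p[0] <= end and start <= p[1]]
            merged = False
            if len(above) == 1:
                p_start, p_end, p_id = above[0]
                siblings = [q for q in runs if q[0] <= p_end and p_start <= q[1]]
                if len(siblings) == 1 and (p_start, p_end) == (start, end):
                    blocks[p_id][1] = r  # extend the existing block downward
                    cur.append((start, end, p_id))
                    merged = True
            if not merged:
                block_id = len(blocks)
                blocks.append([r, r, start, end])
                for _, _, p_id in above:
                    edges.add((p_id, block_id))
                cur.append((start, end, block_id))
        prev = cur
    return blocks, edges


# Toy example: a horizontal bar on top of a vertical stem.
T_SHAPE = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
blocks, edges = build_bag(T_SHAPE)
```

In this toy case the bar and the stem become two separate nodes joined by one edge, so a candidate segmentation corresponds to choosing which subgraph (bar, stem, or both) to pass to the classifier.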
We take advantage of the syllabic-alphabetic nature of the Devanagari script by designing a stochastic framework in which the primitives are syllabic characters made of one or more alphabets. We use dictionary lookup to enhance the word hypotheses. On a publicly available, multi-font test set of 10,606 words, we have achieved a top-choice word accuracy of 75% and a top-5 word accuracy of 85%. This is a significant improvement over the performance of previous techniques on the same test set.
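The idea of ranking word hypotheses by combining recognition results with a language model and a dictionary can be sketched as follows. This is only a minimal illustration under stated assumptions, not the framework developed in this work: the function `rank_hypotheses`, the bigram model over syllabic characters, the `FLOOR` probability for unseen bigrams, the fixed dictionary bonus, and all probabilities in the example are hypothetical.

```python
import math
from itertools import product

FLOOR = 1e-3  # assumed probability for bigrams missing from the model


def rank_hypotheses(candidates, bigram, dictionary, top_k=5, dict_bonus=1.0):
    """Rank word hypotheses by recognition score plus language-model score.

    candidates: one dict per character position, mapping label -> recognizer
                probability (all values are assumed to be > 0).
    bigram:     dict mapping (previous_label, label) -> probability, with
                "<s>" marking the start of a word.
    dictionary: set of known words; matches receive a fixed log-score bonus
                (an assumption of this sketch).
    """
    scored = []
    # Exhaustive enumeration is fine for a toy example; a beam search would
    # be needed for realistic candidate lists.
    for labels in product(*(c.keys() for c in candidates)):
        log_score, prev = 0.0, "<s>"
        for position, label in enumerate(labels):
            log_score += math.log(candidates[position][label])
            log_score += math.log(bigram.get((prev, label), FLOOR))
            prev = label
        word = "".join(labels)
        if word in dictionary:
            log_score += dict_bonus
        scored.append((word, log_score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical recognizer output for a three-character word image.
candidates = [
    {"क": 0.6, "फ": 0.4},
    {"म": 0.7, "स": 0.3},
    {"ल": 0.5, "न": 0.5},
]
bigram = {
    ("<s>", "क"): 0.5,
    ("<s>", "फ"): 0.2,
    ("क", "म"): 0.4,
    ("म", "ल"): 0.4,
    ("म", "न"): 0.2,
}
ranked = rank_hypotheses(candidates, bigram, dictionary={"कमल"})
```

Because the language-model term is added while hypotheses are being scored, rather than applied to a single finished recognition result, a visually ambiguous character can be resolved by its context. Returning the `top_k` list mirrors the top-choice and top-5 evaluation reported above.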