Language models and automatic topic categorization for information retrieval in handwritten documents
People have become accustomed to accessing (searching, retrieving, reading) information online with ease. This information is largely limited to ASCII documents, though access to printed and spoken documents has significantly improved in recent years. Despite several decades of research in handwriting recognition, the goal of having computers access handwritten information from unconstrained document images remains elusive. Current handwriting recognition systems can only recognize words that are present in a restricted lexicon, typically comprising 10 to 1,000 words. As the size of the lexicon grows, the recognition accuracy falls sharply and is reported to be around 30% for a 10K-word lexicon. It is generally believed in the Information Retrieval (IR) community that OCR accuracy needs to be around 70-80% for an acceptable user experience. The objective of this research is to raise the accuracy levels on unconstrained handwritten documents so that they can be accessed with the same ease as printed or ASCII documents. We have used statistical language modeling techniques to advance the state of the art in handwriting recognition. We use an innovative "noisy-channel" model to correct the errors made by OCR systems, and we use automatic topic categorization methods based on state-of-the-art statistical models such as Maximum Entropy and Latent Dirichlet Allocation to reduce the effective size of the lexicon. A beam-search technique is used to prune and re-rank the multiple hypotheses using the confidences returned by the recognizer and the topic categories. We have adapted the Vector Space Model and the Latent Semantic Analysis technique to account for the multiple (top-N) recognition hypotheses. In the retrieval phase we use convolution and relevance models to improve the precision of the retrieval results.
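As a rough illustration of the re-ranking step described above, the following minimal Python sketch runs a beam search over top-N word hypotheses, combining each recognizer confidence with a language-model probability in noisy-channel fashion. It is not the implementation used in this work: the function names `beam_search` and `toy_lm`, the bigram table, the confidence values, and the `beam_width` and `lm_weight` parameters are all hypothetical, and the toy bigram model merely stands in for a topic-specific language model.

```python
import math


def beam_search(candidates, lm_logprob, beam_width=3, lm_weight=1.0):
    """Re-rank top-N word hypotheses with a beam search (illustrative only).

    candidates: one list per word position, each holding (word, confidence)
                pairs as a recognizer might return them (confidence > 0).
    lm_logprob: function (previous_word_or_None, word) -> log probability.
    Returns the best (word_sequence, log_score) pair.
    """
    beam = [((), 0.0)]  # each entry: (words so far, accumulated log score)
    for position in candidates:
        expanded = []
        for seq, score in beam:
            prev = seq[-1] if seq else None
            for word, conf in position:
                # Noisy-channel style score: channel evidence (recognizer
                # confidence) plus weighted language-model prior.
                new_score = score + math.log(conf) + lm_weight * lm_logprob(prev, word)
                expanded.append((seq + (word,), new_score))
        # Prune: keep only the highest-scoring partial hypotheses.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]


# Hypothetical bigram probabilities standing in for a topic-specific model.
BIGRAMS = {
    (None, "the"): 0.5,
    ("the", "patient"): 0.4,
    ("the", "patent"): 0.05,
    ("patient", "has"): 0.3,
}


def toy_lm(prev, word):
    # Unseen bigrams get a small floor probability (an assumption of this sketch).
    return math.log(BIGRAMS.get((prev, word), 0.01))


# Hypothetical recognizer output: top-2 hypotheses for each of three words.
candidates = [
    [("the", 0.6), ("tho", 0.4)],
    [("patent", 0.5), ("patient", 0.4)],
    [("has", 0.7), ("hat", 0.3)],
]

best, score = beam_search(candidates, toy_lm)
print(" ".join(best))
```

In this toy input the recognizer's top choice for the second word is "patent", but the language-model prior favors "patient" in context, so the re-ranked sequence differs from simply taking the top choice at each position.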
The experimental evaluation was performed on the publicly available IAM dataset of over 1500 pages written by more than 600 writers covering a wide variety of topics, and on a separate dataset of 5000 hand-filled medical forms. Our topic categorization approach improves the top-choice accuracy of word recognition from 30% to 45% (on average), and the top-10 accuracy improves from 50% to the "sweet spot" range of 70%. Experiments using these improved top-10 results in conjunction with our improved retrieval models show performance on par with retrieval on the ground truth of the handwritten documents.