Indexing and retrieval of low quality handwritten documents
Decades of development in document analysis and recognition techniques have made it possible to convert large volumes of documents into electronic formats and store them on computers. In recent years, advances in information retrieval have provided powerful tools for prompt access to the information contained in those documents. Inspired by the success of applications in these two areas, this thesis investigates methods for improving the retrieval of handwritten document images. Unlike machine-printed documents, for which very high OCR accuracy can be expected, handwritten document images are harder to retrieve because of document analysis and recognition errors. Existing methods for retrieving handwritten document images usually build the index on the text collected from the top-n (n > 1) candidates returned by a word recognizer, and may weight the candidates according to their ranks. Effective as these basic methods are, they assume flawless word segmentation and isolated word recognition. They are therefore vulnerable to word segmentation errors and cannot take advantage of the language model, which has become a standard component of state-of-the-art handwriting recognition systems. Incorporating word segmentation scores (probabilities) and a language model into an existing indexing technique, however, generally increases the complexity of the problem. Our indexing method solves this problem by separating the term counts from standard IR models, estimating them at the word sequence level, and plugging them back into the IR models. We propose a fast dynamic programming algorithm to reduce the time complexity. Beyond document retrieval, we also use the word segmentation information in keyword retrieval.
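As a rough illustration of the baseline described above, in which the index is built from weighted top-n recognizer candidates, the following Python sketch accumulates expected term counts for one document. It is not the sequence-level estimator proposed in the thesis. The function name `expected_term_counts`, the candidate-list layout, and the choice to normalize recognizer scores within the top-n list are all assumptions made for this example.

```python
from collections import defaultdict


def expected_term_counts(word_images, top_n=3):
    """Accumulate soft term counts from top-n recognizer candidates.

    word_images: one entry per segmented word image; each entry is a list of
    (word, score) pairs sorted by descending score (hypothetical layout).
    Scores are normalized within the kept candidates so that every word
    image contributes a total count of 1 to the document.
    """
    counts = defaultdict(float)
    for candidates in word_images:
        kept = candidates[:top_n]
        total = sum(score for _, score in kept)
        if total <= 0:
            continue  # skip word images with no usable candidates
        for word, score in kept:
            counts[word] += score / total
    return dict(counts)


# Two word images: the first is ambiguous between three readings,
# the second between two.
doc = [
    [("cat", 0.6), ("cot", 0.3), ("cut", 0.1)],
    [("cat", 0.5), ("eat", 0.5)],
]
soft_counts = expected_term_counts(doc, top_n=3)
```

The resulting fractional counts could stand in for raw term frequencies in a standard IR weighting scheme. Note that this sketch still treats every word image in isolation, which is the limitation the abstract points out.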
In another major contribution of this thesis, we apply Markov random field (MRF) modeling to the binarization problem. The MRF can precisely describe the constraint of local smoothness in the image. The same smoothness constraint can also be used to remove the grid from form images, which is a useful step in form image preprocessing. This work in effect addresses a general topic in the preprocessing of degraded handwritten document images, and applications in both handwriting recognition and handwritten document image retrieval can benefit from our approach.
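To make the idea of a local smoothness constraint concrete, here is a minimal Python sketch of MRF-style binarization using iterated conditional modes (ICM) with an Ising-like pairwise penalty. This is a generic textbook formulation, not the model or inference procedure used in the thesis. The function name `binarize_icm`, the fixed class means (0 for ink, 255 for background), and the parameters `beta` and `sweeps` are assumptions chosen for illustration.

```python
def binarize_icm(image, threshold=128, beta=1.0, sweeps=5):
    """Binarize a grayscale image with a simple smoothness prior.

    image: list of rows of gray values in 0..255.
    Returns a grid of labels: 0 for ink, 1 for background.
    """
    h, w = len(image), len(image[0])
    # Start from a plain global threshold.
    labels = [[1 if image[y][x] >= threshold else 0 for x in range(w)]
              for y in range(h)]
    for _ in range(sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                best_label, best_energy = None, None
                for cand in (0, 1):
                    # Data term: squared distance to the assumed class mean,
                    # scaled to the range [0, 1].
                    mean = 255.0 * cand
                    data = ((image[y][x] - mean) / 255.0) ** 2
                    # Smoothness term: number of 4-neighbours that disagree.
                    disagree = 0
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != cand:
                            disagree += 1
                    energy = data + beta * disagree
                    if best_energy is None or energy < best_energy:
                        best_label, best_energy = cand, energy
                if best_label != labels[y][x]:
                    labels[y][x] = best_label
                    changed = True
        if not changed:
            break  # converged: no label changed during this sweep
    return labels


# A light page with one dark speck in the middle.
page = [[230] * 5 for _ in range(5)]
page[2][2] = 100
cleaned = binarize_icm(page)
```

A plain threshold would label the isolated speck as ink. With the smoothness penalty, its four background neighbours outweigh the data term and the pixel is relabeled as background, while larger connected ink regions keep their label.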