Content-based handwritten document indexing and retrieval
MetadataShow full item record
Information retrieval on textual data has been well studied and its applications (such as web searching) have become ubiquitous in our daily lives. However content-based image retrieval on handwritten document collections still remains a challenging problem. Here "content-based" means that the search will analyze the actual content of the images, instead of merely the metadata. In the context of handwritten documents, the word "content" might refer different things, such as writing style, shape of words and characters, or the truth of the writing. Accordingly, two different types of retrieval can be performed: "query by example" and semantic (or "query by text") retrieval. While both of them have their own applications in the real world, the second one is more intuitive and user-friendly, since it uses not only the low level underlying computational features, but also the understanding of documents. This work explores several automatic techniques to do both types of retrieval upon handwritten document collections. These techniques are three-fold: (i) indexing, (ii) "query by example" retrieval and (iii) "query by text" retrieval. For indexing, we focus on the problem of word segmentation and transcript mapping. Word segmentation is the task of segmenting text line images into word image, which is one of the most important preprocessing steps in order to perform any word level analysis or recognition. We propose the use of neural network with a new set of global and local features to make the classification between inter-word and intra-word gaps. The transcript mapping problem is an alignment problem between the handwritten document image and its transcript. It is not a trivial task simply because the word segmentation algorithm is error prone. A recognition based dynamic programming algorithm is proposed to solve this problem. It is also shown to improve the accuracy of automatic word segmentation. In "query by example" retrieval, the query can be either a full page document or a single word image. For the document level retrieval, a statistical model is learned to determine whether the writing styles of two documents are similar or not. Gamma and Gaussian distributions are used for the modeling. Word level retrieval is performed by a feature based similarity search algorithm. For each word image, a 1024-bit binary feature vector is extracted for this purpose. "Query by text" retrieval is a more challenging task because word level segmentation is error prone and word recognition with large lexicon size is still an unsolved problem. The current solution for this problem is to manually annotate the collection, which is costly. By taking the idea from machine translation in textual information retrieval, we propose a statistical approach for word recognition and use the probabilistic annotation results to do language model retrieval on handwritten documents. For all these approaches, their performances are empirically compared on several test collections. The main contributions of this work are a detailed examination of different levels of content-based image retrieval for handwritten documents, and the development of a retrieval system that allows either image or text queries. The new word segmentation method shows an improved performance over a previous method and is useful in forensic document analysis. In addition, a large handwriting database of 3824 pages (about 573,600 labeled words) was created using the proposed transcript-mapping algorithm. This database was used predominantly in this dissertation and it serves as a useful resource for future handwriting analysis and recognition research.