Probabilistic random field based method for annotated machine printed documents preprocessing
Today, the convenience of search, both on the personal computer hard disk and on the web, is essentially limited to machine-printed text documents and images because of the poor accuracy of handwriting recognizers. The proposed research will advance the state of the art in realizing search of hand-annotated documents. We primarily target machine-printed documents that have been annotated by hand by multiple writers in an office or collaborative environment. In applications where the annotations are action instructions (such as "make 4 copies" or "remove Figure X"), we can envision the proposed system serving as the front end of an OCR-based NLP module. We expect that the techniques developed in this dissertation will also be useful for retrieval of pages from material in languages for which accurate OCR systems do not exist. The main research task is that of segmenting a document image into handwritten text, machine-printed text, noise, and overlapped text, a task sometimes referred to as "ink separation". Prior techniques primarily use histogram thresholding and analysis of the connectivity of strokes. These algorithms, although effective, rely on heuristic rules about spatial constraints and do not scale across applications. We have developed a system composed of three parts: binarization of document images (with a focus on documents captured by hand-held devices), a boosted tree classifier that performs the initial classification, and a Markov random field (MRF) based approach that re-labels the initial segments based on their statistical dependencies within a neighborhood. The MRF-based binarization provides a reliable binarized document image for segmentation even under bad illumination. The boosted tree allows the training data set to be divided into several small clusters, with a simple classifier solving the initial labeling at the (homogeneous) cluster level. The overlapped text is further separated using an MRF-based method.
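The idea of re-labeling initial segments from their neighborhood dependencies can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes that the segments lie on a regular grid, that the initial classifier outputs per-label probabilities, that neighbors interact through a simple Potts penalty, and that the labeling is optimized with iterated conditional modes (ICM). The function name `icm_relabel`, the label set, and the parameter `beta` are hypothetical and chosen only for illustration.

```python
import math

# Assumed label set for illustration: index 0, 1, 2.
LABELS = ("machine_print", "handwriting", "noise")


def icm_relabel(probs, beta=1.0, max_sweeps=10):
    """Re-label a grid of segments with a Potts-model MRF using ICM.

    probs[r][c] is a list of per-label probabilities produced by some
    initial classifier (assumed input format). beta is the assumed
    penalty paid for each 4-connected neighbor with a different label.
    """
    rows, cols = len(probs), len(probs[0])
    n_labels = len(probs[0][0])
    eps = 1e-9  # guards against log(0)

    # Unary cost: negative log-probability from the initial classifier.
    unary = [
        [[-math.log(max(p, eps)) for p in probs[r][c]] for c in range(cols)]
        for r in range(rows)
    ]

    # Start from the classifier's own most likely label at every site.
    labels = [
        [min(range(n_labels), key=lambda k: unary[r][c][k]) for c in range(cols)]
        for r in range(rows)
    ]

    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                neighbours = [
                    labels[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols
                ]

                def cost(k):
                    # Local energy: data term plus disagreement with neighbors.
                    disagreements = sum(1 for n in neighbours if n != k)
                    return unary[r][c][k] + beta * disagreements

                best = min(range(n_labels), key=cost)
                if best != labels[r][c]:
                    labels[r][c] = best
                    changed = True
        if not changed:  # converged to a local minimum of the energy
            break
    return labels


if __name__ == "__main__":
    confident_hw = [0.1, 0.8, 0.1]  # strongly handwriting
    weak_print = [0.5, 0.4, 0.1]    # weakly machine print
    grid = [[confident_hw] * 3 for _ in range(3)]
    grid[1] = [confident_hw, weak_print, confident_hw]
    for row in icm_relabel(grid, beta=1.0):
        print([LABELS[k] for k in row])
```

In this toy grid the center segment is weakly classified as machine print while all of its neighbors are confidently handwriting, so the neighborhood term flips it to handwriting. With `beta=0.0` the neighborhood is ignored and the initial label is kept. ICM only finds a local minimum; it is used here because it is short, not because the source names it.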
The isolated handwritten text blocks are indexed in an unsupervised manner based on writing instrument, style, ink color, and similar attributes, which are possible indicators of different writers. We have shown the ability to selectively remove the annotations belonging to a particular writer, allowing the end user of the system to view an unmarked document even though the original document image is marked up. This feature is accomplished through intelligent document restoration, whereby the removal of overlapping strokes does not damage the underlying machine-printed text. We have performed experiments on a large document dataset and report the results.