Metadata Analysis in Unstructured Documents Using Classical and Deep Learning Methods
Nair, Rathin Radhakrishnan
Metadata is, by definition, any set of data that describes and provides information about other data. Document metadata, specifically, is any information that better represents a document or aids in understanding it. The most common document metadata (title, author, edit time, and so on) are auto-generated at the time of file creation. There is also content-based metadata, which is often overlooked, e.g. information contained in graphics or author-specific characteristics. In this thesis, we study approaches to extracting and understanding such implicit, content-based metadata in machine-printed and handwritten documents. The two key contributions of this work are (a) graphics metadata: new approaches to extracting and understanding information graphics, and (b) handwritten text metadata: capturing author-specific feature representations. The vast majority of publicly available scanned handwritten document collections are unstructured, yet current approaches such as OCR assume that the document under consideration maintains a uniform structure. They therefore struggle with non-uniform documents that mix text and non-text data; for example, an OCR system would overlook the text content embedded in a line plot. This calls for an automated technique to process and digitize such documents. In the first part of the thesis, we study a class of deep learning architectures that segment the different parts of a document image. Specifically, to facilitate segmentation, we discuss a novel approach that uses convolutional neural networks (CNNs) to learn a feature representation for different types of content, such as machine-printed text, handwritten text, and graphics. The second part of the thesis addresses extracting information from non-text data, which opens an unexplored avenue of metadata and advances existing text-understanding techniques.
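The patch-wise segmentation idea can be sketched as follows. This is a minimal illustration only, not the thesis's actual architecture: the edge-detecting kernel, the two-number feature summary, and the `segment`/`classify` names are all hypothetical stand-ins for a trained multi-layer CNN.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution, the basic CNN building block."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def extract_features(patch):
    """Toy feature extractor: one fixed edge kernel + ReLU + pooling.
    A real CNN would learn many such kernels from labeled patches."""
    edge_kernel = np.array([[-1, -1, -1],
                            [-1,  8, -1],
                            [-1, -1, -1]], dtype=float)
    fmap = np.maximum(conv2d(patch, edge_kernel), 0.0)  # ReLU
    return np.array([fmap.mean(), fmap.std()])          # global pooling

def segment(page, classify, patch_size=16):
    """Slide a window over the page and label each patch, e.g. as
    machine-printed, handwritten, or graphics."""
    labels = {}
    h, w = page.shape
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            feats = extract_features(page[y:y + patch_size, x:x + patch_size])
            labels[(y, x)] = classify(feats)
    return labels
```

In practice the per-patch classifier would be the CNN's learned decision layer; here any callable mapping features to a region label can be plugged in.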
We discuss novel methods to extract text and non-text data from information graphics such as line plots and phase diagrams, and to infer a representational message using Bayesian networks. Finally, we discuss a neural network model that performs adaptive handwriting recognition and works with limited labeled data. Long Short-Term Memory (LSTM) networks have been used in the domain of handwriting recognition for years. We postulate that authors follow a unique writing style, both in handwriting and in sentence formulation, and we therefore developed an adaptive LSTM-based handwriting recognition model that exploits user-specific features to better recognize handwritten text. In summary, this thesis describes an end-to-end system for converting a collection of documents into a digital archive that supports indexing and search. We implement a CNN-based network to spot the different sections in the individual pages of a collection. On the identified text sections, we apply an LSTM together with a neural-network-based language model for recognition and transcription. Finally, we discuss approaches to handling non-text data; since understanding graphics requires definitive goals, we focus specifically on information graphics such as line plots and phase diagrams.
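The Bayesian inference over a graphic's intended message can be illustrated with a toy example. The message categories, priors, and conditional probability tables below are invented for illustration and are not the thesis's values; a real model would condition on many extracted features of the plot, not just the slope sign.

```python
# Priors over hypothetical message categories for a line plot.
prior = {"rising_trend": 0.4, "falling_trend": 0.4, "stable": 0.2}

# P(observed slope sign | message) -- illustrative CPTs only.
likelihood = {
    "rising_trend":  {"pos": 0.90, "neg": 0.05, "flat": 0.05},
    "falling_trend": {"pos": 0.05, "neg": 0.90, "flat": 0.05},
    "stable":        {"pos": 0.10, "neg": 0.10, "flat": 0.80},
}

def posterior(observation):
    """Bayes rule: P(message | observation) up to normalization."""
    unnorm = {m: prior[m] * likelihood[m][observation] for m in prior}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}
```

Given a positive observed slope, the posterior concentrates on the rising-trend message; a full Bayesian network would chain such factors over several observed graphic attributes.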
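The LSTM recognition step can be sketched as a single forward-pass cell over a sequence of feature vectors, e.g. sliding-window slices of a handwritten text line. This is a minimal sketch: the weights here are random rather than trained, and the adaptation the thesis describes would correspond to fine-tuning such weights on a small set of author-specific labeled samples.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell, forward pass only, with randomly initialized
    weights for illustration."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        z = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, o = output, c = candidate.
        self.W = {g: rng.normal(0, 0.1, (hidden_size, z)) for g in "ifoc"}
        self.b = {g: np.zeros(hidden_size) for g in "ifoc"}

    def step(self, x, h, c):
        z = np.concatenate([h, x])
        i = sigmoid(self.W["i"] @ z + self.b["i"])   # input gate
        f = sigmoid(self.W["f"] @ z + self.b["f"])   # forget gate
        o = sigmoid(self.W["o"] @ z + self.b["o"])   # output gate
        g = np.tanh(self.W["c"] @ z + self.b["c"])   # candidate state
        c = f * c + i * g                            # new cell state
        h = o * np.tanh(c)                           # new hidden state
        return h, c

def run_sequence(cell, xs, hidden_size):
    """Encode a feature-vector sequence into a final hidden state,
    which a decoder would map to character probabilities."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x in xs:
        h, c = cell.step(x, h, c)
    return h
```

In a full recognizer, the hidden states would feed a softmax decoder (commonly trained with CTC for unsegmented handwriting), and the language model would rescore candidate transcriptions.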