Machine learning for person identification with applications in forensic document analysis
Person identification from evidence is the primary goal of forensic analysis. The term identification answers the question "Whose sample is this?", while verification answers "Are these two samples from the same person?", where an evidentiary sample comes from a forensic modality, e.g., handwriting, signature, fingerprint, or DNA. Forensic document analysis deals primarily with handwriting and signatures. The identification problem of searching a large corpus of handwriting/signature samples to retrieve all that originated from a given person is an Information Retrieval (IR) task. The task of identification from evidence is divided into three stages: (i) data extraction and indexing, (ii) data analysis and learning models, and (iii) inference/retrieval. The objective of this research is to identify and propose appropriate statistical machine learning tools for each of these three stages. In this work, the handwriting and signature modalities are used to validate the proposed approaches.

The task of data extraction and indexing is to process the data from its raw form and make it usable for the learning/analysis stage. We propose the use of Conditional Random Fields (CRFs) to identify and distinguish components such as signatures, machine print, handwriting, and noise in a given document. Line segmentation and word recognition are the next two steps in extracting discriminating features; we propose a robust statistical approach for line segmentation and CRFs for handwritten word recognition. Clustering the extracted features using infinite mixture models is proposed for fast and efficient retrieval from a large corpus.

The task of data analysis and learning models involves understanding the characteristics of the data. We propose two different statistical approaches: one termed learning and the other adaptation.
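The cluster-then-search idea behind the indexing stage can be sketched as follows. This is a toy illustration only: it uses plain k-means as a stand-in for the infinite mixture models the work actually proposes, and all function names and the choice of squared Euclidean distance are assumptions, not the thesis's method.

```python
import numpy as np

def build_cluster_index(features, k=2, iters=20, seed=0):
    """Toy k-means index over extracted feature vectors.

    Stand-in for the infinite-mixture-model clustering proposed in the
    work; shown only to illustrate cluster-based indexing for retrieval.
    """
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Assign each sample to its nearest center (squared Euclidean).
        labels = np.argmin(((features[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

def retrieve(query, features, centers, labels):
    """Fast retrieval: search only the cluster nearest to the query,
    then rank that cluster's members by distance to the query."""
    c = int(np.argmin(((centers - query) ** 2).sum(axis=1)))
    idx = np.where(labels == c)[0]
    order = np.argsort(((features[idx] - query) ** 2).sum(axis=1))
    return idx[order]
```

At query time only one cluster is scanned instead of the whole corpus, which is the efficiency argument the clustering stage is making.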
In the first approach (learning), a large collection of training data, comprising samples from a general population, is used. We propose constructing two ensembles of pairs from the whole population: one consisting of pairs of samples from the same individual, and the other of pairs of samples from different individuals. Learning from these ensembles of pairs is a one-time process. In the second approach (adaptation), multiple known samples of a person specific to the case at hand are used to learn the variation and similarities specific to that person. This information is then used to make a probabilistic decision on any given unknown sample using a Bayesian approach.

The third stage is an inference (verification) or retrieval task. The inference task involves verifying whether or not a given questioned sample belongs to the same person as the known sample(s); we propose an approach to quantify the strength of evidence for such verifications. In the retrieval task, a given questioned sample is matched against a database of samples, and the goal is to sort the database by similarity to the questioned sample. Here, query expansion and relevance feedback are two techniques analyzed to improve search results. The proposed approaches are validated by experiments conducted on handwriting and signature corpora.
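One common way to turn the two pair ensembles into a strength-of-evidence measure is a likelihood ratio over a scalar distance score. The sketch below is an assumption, not the thesis's stated formulation: it fits a Gaussian to the same-writer and different-writer distance distributions, and all names are illustrative.

```python
import numpy as np

def log_likelihood_ratio(d, same_dists, diff_dists):
    """Strength of evidence for a questioned-vs-known distance d.

    Positive values favor "same writer", negative values favor
    "different writers". Gaussian fits to the two pair ensembles are
    a simplifying assumption made for this sketch.
    """
    def gauss_loglik(x, sample):
        mu = np.mean(sample)
        sd = np.std(sample) + 1e-9  # guard against zero variance
        return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

    return gauss_loglik(d, same_dists) - gauss_loglik(d, diff_dists)
```

A distance typical of same-writer pairs yields a positive log-likelihood ratio; a distance typical of different-writer pairs yields a negative one, giving the verifier a calibrated, population-grounded decision score rather than a bare threshold.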