Evaluating the probability of identification in forensic science
Forensic identification is the task of determining whether or not observed evidence arose from a known source. The objective of this research is to make it possible for identification/exclusion opinions presented in court to be accompanied by a probability statement. At present, in most forensic domains outside of DNA evidence, such a statement cannot be made because the necessary probability distributions cannot be computed with reasonable accuracy, although the probabilistic approach itself is well understood. In principle, it involves determining a likelihood ratio (LR): the ratio of the joint probability of the evidence and source under the identification hypothesis (that the evidence came from the source) to that under the exclusion hypothesis (that the evidence did not arise from the source). Depending on the forensic modality, the evidence and/or source may be represented by different kinds of structures, such as scalars, vectors, and graphs. Evaluating the joint probability is computationally intractable when the number of variables is even moderately large. It is also statistically infeasible, since the number of parameters to be estimated from the data grows exponentially with the number of variables. An approximate method is to replace the joint probability with another probability: that of the distance (or similarity) between the evidence and the known source under the two hypotheses. While this reduces the number of parameters to a constant irrespective of the number of variables, it is an oversimplification that leads to errors. We consider a third method, which decomposes the LR into a product of two factors, one based on distance and the other on rarity. This is a known result for the univariate Gaussian case, and it has intuitive appeal: forensic examiners assign higher importance to rare attributes in the evidence.
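The univariate Gaussian distance-and-rarity decomposition can be made concrete. In the standard two-level model (variable names below are ours, chosen for illustration), two measurements x and y are each drawn as N(theta, sigma2) from a source whose mean theta is drawn as N(mu, tau2) from the population. Under that model the LR factors exactly into a term in the difference d = x - y (distance) and a term in the mean m = (x + y)/2 (rarity of the shared attribute in the population). A minimal sketch, checked against brute-force numerical integration of the same-source joint probability:

```python
import math

def npdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def lr_distance_rarity(x, y, mu, sigma2, tau2):
    """Two-factor LR for the univariate Gaussian model:
    x, y ~ N(theta, sigma2); theta ~ N(mu, tau2).
    The LR is a distance factor in d = x - y times a
    rarity factor in m = (x + y) / 2."""
    d = x - y
    m = (x + y) / 2.0
    distance = npdf(d, 0.0, 2 * sigma2) / npdf(d, 0.0, 2 * (sigma2 + tau2))
    rarity = npdf(m, mu, sigma2 / 2.0 + tau2) / npdf(m, mu, (sigma2 + tau2) / 2.0)
    return distance * rarity

def lr_numeric(x, y, mu, sigma2, tau2, n=20001, width=12.0):
    """Brute-force LR: trapezoid integration over theta for the
    same-source numerator; the different-source denominator is a
    product of the marginals N(mu, sigma2 + tau2)."""
    s = math.sqrt(tau2)
    lo, hi = mu - width * s, mu + width * s
    h = (hi - lo) / (n - 1)
    num = 0.0
    for i in range(n):
        th = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        num += w * npdf(x, th, sigma2) * npdf(y, th, sigma2) * npdf(th, mu, tau2)
    num *= h
    den = npdf(x, mu, sigma2 + tau2) * npdf(y, mu, sigma2 + tau2)
    return num / den
```

With close measurements (small d) the distance factor exceeds one, and the farther their shared mean m sits from the population mean mu, the larger the rarity factor, matching the intuition that rare attributes carry more identifying weight.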
We generalize this approach to more complex data such as vectors and graphs, proposing efficient algorithms for learning the structure of Bayesian networks that make the LR evaluation computationally tractable. Empirical evaluations of the three methods, carried out on several data types (continuous, binary, multinomial, and graph features) and several modalities (handwriting with binary features, handwriting with multinomial features, and footwear impressions with continuous features), show that the distance-and-rarity method is significantly better than the distance-only method.