Show simple item record

dc.contributor.advisorZola, Jaroslaw
dc.contributor.authorKaran, Subhadeep
dc.date.accessioned2019-07-30T15:12:03Z
dc.date.available2019-07-30T15:12:03Z
dc.date.issued2019
dc.date.submitted2019-05-17 21:28:59
dc.identifier.urihttp://hdl.handle.net/10477/80041
dc.descriptionPh.D.
dc.description.abstractBayesian networks (BNs) are probabilistic graphical models often used in big data analytics to capture conditional relationships among a set of random variables. They are critical class of models that are capable to represent both association and causation, and hence are indispensable in building AI systems. Over the years, BNs have been successfully applied in many domains including diagnostic systems, clinical decision support, systems biology, and genomics. However, the problem of learning structure of BNs from data, which is a typical starting point in BNs applications, is known to be NP-hard. To date, both heuristics and exact learning algorithms have been proposed to tackle the problem. While heuristics are widespread in real-life applications, exact algorithms are believed to learn {\it better} networks at the much higher computational cost. In this work, we first show that compared to heuristics the exact structure learning algorithms are delivering measurably better structures in terms of Structural Hamming Distance, and enable more accurate inferences, measured through the Kullback-Leiber divergence between the actual and inferred probability distributions. Then, we propose high performance strategies to address computational challenges of exact structure learning algorithms.The problem of learning exact BN structure is a data-driven optimization problem that involves three key subproblems: (i) counting queries, to evaluate objective function, (ii) parent sets identification to establish local dependencies between variables, and (iii) structure search, to combine locally optimal solutions into globally optimal structure. In this work, we propose efficient approach for each subproblem. Specifically, we design programming abstraction and algorithmic realizations for fast counting in machine learning applications, which offers order or magnitude higher throughput of counting queries than the existing techniques. We then introduce a new shared and distributed memory approach for the exact parent sets assignment problem. To achieve scalability, we derive theoretical bounds to constrain the search space when MDL scoring function is used, and we reorganize the underlying dynamic programming problem such that the computational density is increased and fine-grain synchronization is eliminated. We demonstrate that the resulting method maintains strong scalability on large Apache Spark clusters, and it can be used to efficiently process data sets with over 70 variables, far beyond the reach of the currently available solutions. To address the structure search problem, we reduce it to a single source shortest path problem in partial order lattices. Because the resulting graphs grow exponentially with the number of input variables, we introduce a new approach that exploits partial relationships between variables to constrain the number of ways in paths in the graph can be extended, while remaining provably optimal. Via experimental results, we demonstrate that the method provides up to three times improvement in runtime, and orders of magnitude reduction in memory consumption over the current best algorithms. The proposed algorithms are implemented in the open source SABNA package -- a high-quality software package that can be used by researchers and practitioners to analyze complex data sets, leveraging parallelism of modern shared-memory multi-core processors.
dc.formatapplication/pdf
dc.language.isoen
dc.publisherState University of New York at Buffalo
dc.rightsUsers of works found in University at Buffalo Institutional Repository (UBIR) are responsible for identifying and contacting the copyright owner for permission to reuse. University at Buffalo Libraries do not manage rights for copyright-protected works and cannot assist with permissions.
dc.subjectComputer science
dc.titleHigh Performance Algorithms for Exact Structure Learning of Bayesian Networks
dc.typeDissertation
dc.typeText
dc.rights.holderCopyright retained by author.
dc.contributor.departmentComputer Science and Engineering


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record