Making Machine Learning Work in Chemistry: Methodological Innovation, Software Development, and Application Studies
In this dissertation, we highlight recent developments in the application of machine learning (ML) to molecular modeling and simulation. After giving a brief overview of the foundations, components, and workflow of a typical supervised learning approach for chemical problems, we discuss how machine learning relates to, supports, and augments more traditional physics-based approaches in computational research. We introduce the design and major elements of the open-source ML program package ChemML, which facilitates the broader dissemination of cutting-edge machine learning techniques in an automated, yet flexible and customizable, format. ChemML aims to reach non-expert users for whom the novelty of ML research may be daunting, and to help share best practices and guidelines. We next use ChemML to underpin two application studies. For the first time, we introduce and assess the challenge of imbalanced data in the Harvard Clean Energy Project (CEP) data set. We present an ensemble method based on unsupervised learning approaches to extract the underlying classes of molecules in this data set of organic molecules. This method enables us to identify under-represented classes of molecules and to propose reasonable solutions that alleviate the side effects of learning from imbalanced data. Moreover, we present an innovative feature selection scheme in the space of molecular descriptors. It is based on systematic trends in the mean values of descriptors for compound classes with different target property values. It is a simple, intuitively motivated procedure, and its results lend themselves to chemical interpretation.
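To make the idea behind the feature selection scheme concrete, the following is a minimal sketch and not the procedure implemented in the dissertation or in ChemML. It assumes that compound classes are formed by equal-population binning of the target property and that a descriptor is scored by the absolute slope of its standardized class means across those classes. The function name, the binning rule, and the scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def rank_descriptors_by_class_mean_trend(X, y, n_classes=5):
    """Rank descriptors by the trend of their class means across target classes.

    Hypothetical illustration: classes are quantile bins of the target y, and
    the score is the absolute least-squares slope of the standardized class
    means against the class index. Assumes y has few ties, so no bin is empty.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Assumption: equal-population classes defined by quantiles of the target.
    edges = np.quantile(y, np.linspace(0.0, 1.0, n_classes + 1)[1:-1])
    labels = np.digitize(y, edges)  # integer class labels 0 .. n_classes - 1

    # Standardize descriptors so that slopes are comparable across columns.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant descriptors get a score of zero
    Z = (X - X.mean(axis=0)) / std

    # Mean value of every descriptor within every class.
    class_means = np.vstack(
        [Z[labels == k].mean(axis=0) for k in range(n_classes)]
    )

    # Least-squares slope of class mean versus (centered) class index.
    idx = np.arange(n_classes) - (n_classes - 1) / 2.0
    centered = class_means - class_means.mean(axis=0)
    scores = np.abs(idx @ centered) / np.sum(idx ** 2)

    order = np.argsort(scores)[::-1]  # most strongly trending descriptor first
    return order, scores
```

Calling `rank_descriptors_by_class_mean_trend(X, y)` on a descriptor matrix `X` (rows are molecules, columns are descriptors) and a target vector `y` returns the descriptor indices sorted by score together with the raw scores. Descriptors whose class means show no systematic trend receive scores near zero and are natural candidates for removal.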
We present a proof-of-principle study concerned with modeling the principal energy levels of organic semiconductor compounds in the CEP data set. In addition, we construct machine learning models that accurately and efficiently predict the density, polarizability, and refractive index (RI) values of 1.5 million organic molecules. Using transfer learning and fine-tuning approaches, we evaluate the applicability domain of the available models for out-of-sample predictions. This is the largest study to date that develops a data-driven model for the screening of high-RI candidates. The results of this study support the view that machine learning methods can accelerate the exploration of more remote regions of molecular space. We conclude by outlining challenges and future research directions that need to be addressed in order to make machine learning a mainstream chemical engineering tool.