Marc Thomson is entering his senior year at the University of Colorado Boulder, studying applied math and chemical engineering. His current research is in computational chemistry, where he hopes to apply machine learning. Marc is also interested in history and politics, particularly geopolitics. In his free time, Marc enjoys camping and hiking.
Project: Predicting Chemical Properties from Structure
Goal of the project:
This project predicts physical properties of organic compounds based on chemical structure. In particular, the code estimates melting point, boiling point, heat of fusion, heat of vaporization, heat of combustion and heat of formation. The estimates are made from chemical structure and from descriptors that can easily be calculated from structure. They do not require any experimental or quantum mechanical data as inputs.
Summary of work:
Mathematica includes a large repository of chemical data, with around 44,000 entries, that proved useful for this project. The data was filtered to remove entries with missing data, as well as compounds containing elements other than carbon, hydrogen, nitrogen and oxygen, leaving around 5,500 samples. Most of the work lay in processing the data and extracting useful features. These included molecular geometry, topology, functional groups and substructures, and moments of inertia, among others. Once the features were gathered and saved, properties were predicted with a random forest algorithm, which performed better than neural networks.
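The filtering step described above can be illustrated with a minimal sketch. The original work was done in Mathematica, so this Python version is only an illustration and not the project's code. The record layout, the field names (`formula`, `melting_point_K`) and the helpers `elements_in` and `keep_sample` are all hypothetical, and the three records are toy examples.

```python
import re

# Elements retained in the project: carbon, hydrogen, nitrogen and oxygen.
ALLOWED_ELEMENTS = {"C", "H", "N", "O"}

def elements_in(formula):
    """Return the set of element symbols in a simple formula such as 'C2H6O'."""
    # An element symbol is one capital letter, optionally followed by one lowercase letter.
    return set(re.findall(r"[A-Z][a-z]?", formula))

def keep_sample(sample):
    """Keep a sample only if its target value is present and it contains only C, H, N, O."""
    if sample.get("melting_point_K") is None:
        return False
    return elements_in(sample["formula"]) <= ALLOWED_ELEMENTS

# Toy records (hypothetical layout); None marks a missing entry for illustration.
samples = [
    {"name": "ethanol", "formula": "C2H6O", "melting_point_K": 159.0},
    {"name": "chloroform", "formula": "CHCl3", "melting_point_K": 209.7},
    {"name": "urea", "formula": "CH4N2O", "melting_point_K": None},
]

filtered = [s for s in samples if keep_sample(s)]
print([s["name"] for s in filtered])
```

Here chloroform is dropped because it contains chlorine, and the urea record is dropped because its target value is missing, so only ethanol remains.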
Results and future work:
The machine learning algorithm predicted melting point with moderate accuracy. The mean absolute error (MAE) of the predictions was about 30 K, with an average relative error of 10%. Predictions of other properties fared much the same. It may be that prediction from structure alone has an accuracy limit, and that quantum mechanical information is required to go beyond it. Future studies should include estimates of dipole moments, as well as a broader range of organic molecules (e.g. compounds containing sulfur, phosphorus and halogens). Further improvements can be made by describing how molecules fit together in 3D space.
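For reference, the two error measures quoted in the results can be computed as in the sketch below. It assumes that "average relative error" means the mean of |measured - predicted| / |measured|, which the source does not state explicitly. The function names and the temperature values are hypothetical and are not the project's data.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of |measured - predicted|, in the same units as the inputs (here kelvin)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_relative_error(y_true, y_pred):
    """Mean of |measured - predicted| / |measured|, as a fraction (0.10 means 10%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical melting points in kelvin, chosen only to illustrate the arithmetic.
measured = [300.0, 250.0, 400.0]
predicted = [330.0, 225.0, 360.0]

print(f"MAE: {mean_absolute_error(measured, predicted):.1f} K")
print(f"Mean relative error: {mean_relative_error(measured, predicted):.1%}")
```

Because melting points of organic compounds typically lie in the low hundreds of kelvin, an MAE of about 30 K and a relative error of about 10% are consistent with each other.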