Data Science Workshop

I recently participated in a three-week workshop run by UC Berkeley’s Graduate Data Science Organization (GDSO). Together with three other team members, I analyzed a data set of ceramic materials. We developed a model that predicts certain properties (density, fracture toughness, etc.) of an unknown material.

Determining the properties of a new material through experimentation and rigorous theoretical calculations is time-consuming and costly. This is particularly challenging when a large number of candidate materials need to be assessed for their suitability for a specific application. Machine learning, however, can infer material properties more efficiently. The results can be used to quickly select a smaller subset of candidate materials, which can then be investigated with experiments and calculations.

We used a data set containing over 4,000 materials from the NIST Structural Ceramics Database. The data set covers a large number of features, but only a few of them are available for any given entry. Our first challenge was therefore to consolidate entries. We then standardized the features, removed outliers, eliminated anomalous entries, and selected the most relevant features. We also added further features using the matminer package.
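To give a flavor of these preparation steps, here is a minimal sketch in Python with pandas. It is not the code from our project: the toy records, the column names, and the helper functions consolidate, drop_outliers, and standardize are all made up for illustration, and I am assuming an interquartile-range rule for outliers and z-score scaling for standardization. The order of the steps is also just one reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records: each material appears in several sparse rows,
# and each row reports only some of the properties.
records = pd.DataFrame({
    "material": ["Al2O3", "Al2O3", "SiC", "SiC", "ZrO2", "ZrO2",
                 "Si3N4", "MgO", "BadEntry"],
    "density": [3.95, np.nan, 3.21, np.nan, 6.05, np.nan,
                3.19, 3.58, 95.0],
    "fracture_toughness": [np.nan, 4.0, np.nan, 4.6, np.nan, 9.0,
                           6.1, 2.0, 3.0],
})


def consolidate(df, key):
    """Merge sparse rows into one row per material.

    GroupBy.first() keeps the first non-null value of each column
    within every group.
    """
    return df.groupby(key).first()


def drop_outliers(df, k=1.5):
    """Keep rows whose values all lie within k * IQR of the quartiles.

    Rows that still contain missing values are dropped as well,
    because comparisons with NaN evaluate to False.
    """
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    within = (df >= q1 - k * iqr) & (df <= q3 + k * iqr)
    return df[within.all(axis=1)]


def standardize(df):
    """Scale every column to zero mean and unit variance."""
    return (df - df.mean()) / df.std(ddof=0)


merged = consolidate(records, "material")  # one row per material
clean = drop_outliers(merged)              # removes the implausible entry
scaled = standardize(clean)
print(scaled)
```

In this toy example the two partial rows for each material collapse into one complete row, and the entry with an implausible density is filtered out before scaling.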

We then used linear regression and random forest regression to predict key physical properties. While linear regression strongly overfit the data, random forest regression gave a more accurate prediction.
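A comparison like this can be sketched with scikit-learn. The example below is only an illustration on synthetic data, not our actual analysis: the data, the number of features, and the helper compare_models are assumptions. It fits both models on a training split and reports the R² score on the training and test data, since a large gap between the two is the usual sign of overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): many features, few samples,
# and a target that depends nonlinearly on two of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 40))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=64)


def compare_models(X, y, seed=0):
    """Fit both regressors and return their train and test R^2 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(
            n_estimators=200, random_state=seed
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = {
            "train_r2": model.score(X_train, y_train),
            "test_r2": model.score(X_test, y_test),
        }
    return scores


scores = compare_models(X, y)
for name, result in scores.items():
    print(f"{name}: train R^2 = {result['train_r2']:.2f}, "
          f"test R^2 = {result['test_r2']:.2f}")
```

With almost as many features as training samples, the linear model can fit the training data closely while generalizing poorly. How the random forest compares depends on the data, so the numbers from this toy setup say nothing about our results on the ceramics data.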

I thought this workshop was a productive exercise, not only in applying different machine learning algorithms, but also because we worked with a real-world materials data set that required a good amount of preparation and cleaning. I also liked that I got some practical experience in developing a script in a team environment using git. This was a valuable experience because so far I’ve mostly developed code on my own.

You can find the GitHub repository of this project here.