December 2018 – Christopher Bronner

Telstra is Australia’s largest telecom provider. The company provides a dataset of service disruption events on Kaggle in which events are categorized in three levels of severity. I built a model to predict the severity of uncategorized events.

The first step of this project was to consolidate the data, which were provided in separate files. The files were linked by a common index, but a given index was sometimes assigned multiple entries, which I grouped together. I then reformatted the data to be completely numerical and checked for missing values and outliers. As several features were non-ordinal categorical features, I encoded them using one-hot encoding.
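A minimal sketch of this consolidation step, using made-up miniature tables rather than the actual Kaggle files. The column names (`id`, `location`, `feature`) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature tables; names and values are illustrative assumptions.
events = pd.DataFrame({"id": [1, 2, 3], "location": ["loc_1", "loc_2", "loc_1"]})
log_features = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "feature": ["f_a", "f_b", "f_a", "f_c", "f_a"],
})

# One-hot encode the non-ordinal categorical column.
log_encoded = pd.get_dummies(log_features, columns=["feature"], dtype=int)

# An id can appear in several rows; collapse them into one row per id.
log_per_id = log_encoded.groupby("id").sum()

# Encode the other categorical column and join both tables on the shared id.
events_encoded = pd.get_dummies(events, columns=["location"], dtype=int).set_index("id")
data = events_encoded.join(log_per_id)

# Check that the consolidated table is fully numerical and has no gaps.
print(data)
print("missing values:", int(data.isna().sum().sum()))
```

Summing the one-hot columns per id keeps the information that an id had several entries, while still yielding exactly one row per event.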

For model building, I started by trying out different classification algorithms, with random forests performing best. Due to the imbalance of the different classes, I chose the F1 score as an appropriate metric, and the best model achieved a score of 0.65.
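Such a comparison could be sketched as follows. The data here are synthetic (three imbalanced classes generated with scikit-learn), and both the choice of candidate models and the macro averaging of the F1 score are assumptions, so the printed scores say nothing about the Telstra results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: three classes with unequal frequencies.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3,
    weights=[0.65, 0.25, 0.10], random_state=0,
)

# Candidate models (an illustrative selection, not the original list).
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every model with 5-fold cross-validation on the macro-averaged F1 score.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    print(f"{name}: macro F1 = {scores[name]:.2f}")
```

Macro averaging weights all classes equally, which is one common way to keep a rare class from being drowned out by the frequent ones.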

To improve the model, I tried several feature selection and feature engineering techniques: choosing features by correlation with the target, recursive feature elimination (RFE), principal component analysis (PCA), adding interaction features, and combinations thereof. However, the best model remained the original random forest model.
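Two of these techniques, recursive feature elimination and principal component analysis, can be sketched with scikit-learn as shown here. The data are synthetic and the number of retained features or components (5) is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 15 features, of which 5 are informative.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)

# Recursive feature elimination: repeatedly fit the model and drop the
# least important feature until only 5 remain.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
)
X_rfe = selector.fit_transform(X, y)
print("features kept by RFE:", selector.support_.nonzero()[0])

# Principal component analysis: project onto the 5 directions of largest variance.
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print("variance explained by 5 components:", pca.explained_variance_ratio_.sum())
```

RFE keeps a subset of the original columns, whereas PCA builds new columns as linear combinations of all of them, so the two can also be chained.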

The Python code for this project is available in a Jupyter notebook on GitHub.

In this side project, I analyzed a dataset of house transactions in the Seattle area. In addition to visualizing the data, I built a regression model predicting the sales price based on other properties of the houses, such as the size of the living area, the number of bathrooms and bedrooms, and a rating of the house.

The underlying data are available on Kaggle. The dataset contains data of 21,613 house transactions. For each house, the price is provided along with 20 other features. I started with an exploratory data analysis, focusing on the sales price. Prices are between $75,000 and $7.7 million with a median sales price of $450,000. The exact distribution of prices is shown in the histogram below.

img-price-histogram

The feature most strongly correlated with price is the size of the living area, with a correlation coefficient of 0.70. The relationship between these two quantities is visualized in the hexagonal binning plot below.

img-price-sqft
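A correlation coefficient and a hexagonal binning plot of this kind could be produced along the following lines. The data are randomly generated, so the resulting coefficient is unrelated to the 0.70 reported above, and the column name `price` is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the house data (values are made up).
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 5000, size=2000)
price = 250 * sqft_living + rng.normal(0, 250_000, size=2000)
houses = pd.DataFrame({"sqft_living": sqft_living, "price": price})

# Pearson correlation between living area and price.
r = houses["sqft_living"].corr(houses["price"])
print(f"correlation: {r:.2f}")

# Hexagonal binning: color encodes how many sales fall into each hexagon.
fig, ax = plt.subplots()
hb = ax.hexbin(houses["sqft_living"], houses["price"], gridsize=40, mincnt=1)
ax.set_xlabel("sqft_living")
ax.set_ylabel("price")
fig.colorbar(hb, ax=ax, label="number of sales")
```

Calling `fig.savefig(...)` afterwards would write the figure to disk. Hexagonal binning is a reasonable choice for tens of thousands of points, where a plain scatter plot would overplot heavily.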

Another factor influencing sales is time. The provided data span a 12-month period in 2014/15. Throughout the year, the number of daily transactions varies based on two major cycles: a seasonal variation with a decrease in sales in the winter and an increase in sales during the summer, and a weekly cycle due to strongly decreased sales activity during the weekend (except for the first weekend in January).

img-prices-year
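The seasonal and weekly cycles can be exposed by counting transactions per day, per weekday, and per month. The sketch below does this with pandas on randomly generated sale dates. The exact date range and the column name `date` are assumptions, and because the dates are drawn uniformly, the output will not show the cycles seen in the real data.

```python
import numpy as np
import pandas as pd

# Synthetic sale dates drawn uniformly from an assumed 12-month window.
rng = np.random.default_rng(0)
days = pd.date_range("2014-05-01", "2015-04-30", freq="D")
sales = pd.DataFrame({"date": days[rng.integers(0, len(days), size=5000)]})

# Number of transactions on each calendar day.
daily_counts = sales["date"].dt.normalize().value_counts().sort_index()

# Weekly cycle: total transactions per day of the week.
by_weekday = sales["date"].dt.day_name().value_counts()

# Seasonal cycle: total transactions per calendar month.
by_month = sales["date"].dt.to_period("M").value_counts().sort_index()

print(by_weekday)
print(by_month)
```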

In the second phase of the project, I built a predictive regression model. I started with ordinary least squares regression, which readily achieves an R² score of 0.70. I then applied two different feature selection methods (successively adding features according to correlation with the target variable and recursive feature elimination), hoping that a subset of the features would yield the same result. While this was not the case, I learned which features had the strongest influence on model performance, particularly the size of the living area (“sqft_living”). This procedure is visualized in the plot below.

img-correlation
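The first of the two selection methods, adding features one at a time in order of their correlation with the target, might look roughly like this. The data are synthetic, and apart from `sqft_living` the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house data (values and most names are made up).
rng = np.random.default_rng(0)
n = 1000
features = pd.DataFrame({
    "sqft_living": rng.uniform(500, 5000, n),
    "bathrooms": rng.integers(1, 5, n),
    "bedrooms": rng.integers(1, 6, n),
    "grade": rng.integers(3, 13, n),
})
price = (200 * features["sqft_living"] + 30_000 * features["grade"]
         + rng.normal(0, 100_000, n))

# Rank features by the absolute value of their correlation with the price.
ranking = features.corrwith(price).abs().sort_values(ascending=False).index

X_train, X_test, y_train, y_test = train_test_split(features, price, random_state=0)

# Fit ordinary least squares on the top-k features for growing k
# and record the R^2 score on held-out data.
r2_by_count = []
for k in range(1, len(ranking) + 1):
    cols = list(ranking[:k])
    model = LinearRegression().fit(X_train[cols], y_train)
    r2_by_count.append(model.score(X_test[cols], y_test))
    print(f"{k} feature(s) {cols}: R^2 = {r2_by_count[-1]:.3f}")
```

Plotting `r2_by_count` against the number of features shows where additional features stop contributing to the score.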

Different regression algorithms (including Ridge and Lasso regression, Support Vector Regression, and Random Forest Regression) did not yield significantly better results. On the other hand, adding polynomial features to the feature set improved the score to 0.82. I attempted different combinations of polynomial features, RFE, and PCA, but the score didn’t improve any further.
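To illustrate why polynomial features can help a linear model, the sketch below compares plain least squares with a degree-2 polynomial expansion on synthetic data that is nonlinear by construction. The degree and the data are assumptions and are not taken from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic term and an interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: ordinary least squares on the raw features.
linear = LinearRegression().fit(X_train, y_train)
r2_linear = linear.score(X_test, y_test)

# Same model on squared and pairwise-product features (degree 2).
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X_train, y_train)
r2_poly = poly.score(X_test, y_test)

print(f"linear R^2: {r2_linear:.2f}, polynomial R^2: {r2_poly:.2f}")
```

The model stays linear in its coefficients. Only the feature set grows, which is why the number of columns, and with it the risk of overfitting, increases quickly with the degree.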

Finally, I used Basemap to visualize the house sales prices on a map. Each marker on the map below represents one sale, with the color indicating the price (red being a high price, blue being a low price). The map shows that houses are more expensive in the downtown area of Seattle, and even more so in Redmond (east of Lake Washington). In contrast, cheaper houses are located north and south of the city, particularly around the airport. However, even in those areas, prices are higher near the waterfront.

img-map

The Python code for this project is available in a Jupyter notebook on GitHub and on Kaggle.

Christopher Bronner

Data Scientist | Washington, D.C.

Predicting Severity of Service Disruptions in Telstra Network

House Sales in King County