Data Engineering – Christopher Bronner

Telstra is Australia’s largest telecom provider. The company provides a dataset of service disruption events on Kaggle, in which events are categorized into three levels of severity. I built a model to predict the severity of uncategorized events.

The first step of this project was to consolidate the data, which were provided in separate files. The files were linked by a common index, but a given index was sometimes assigned multiple entries, which I grouped together. I then reformatted the data to be completely numerical and checked for missing values and outliers. As several features were non-ordinal categorical features, I encoded them using one-hot encoding.
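
The consolidation steps above can be sketched roughly as follows. This is a minimal illustration with pandas, not the notebook's actual code: the two small tables, their column names (`id`, `location`, `log_feature`), and the choice to sum the indicator columns when grouping are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature versions of two of the separate files; the column
# names and values are made up for illustration.
events = pd.DataFrame({"id": [1, 2, 3], "location": ["loc_A", "loc_B", "loc_A"]})
log_features = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "log_feature": ["f1", "f2", "f1", "f3", "f3"],
})

# One-hot encode the non-ordinal categorical column, then collapse the
# multiple rows per id into a single row by summing the indicator columns.
log_wide = (
    pd.get_dummies(log_features, columns=["log_feature"], dtype=int)
    .groupby("id")
    .sum()
)

# One-hot encode the location and join both tables on the shared index.
events_wide = pd.get_dummies(events, columns=["location"], dtype=int).set_index("id")
data = events_wide.join(log_wide, how="left")

# Check for missing values; ids without log entries would show up as NaN here.
n_missing = int(data.isna().sum().sum())
data = data.fillna(0).astype(int)
print(data)
print("missing values before filling:", n_missing)
```

Summing keeps a count of how often a category occurred for an index; taking the maximum instead would give a plain present/absent flag.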

For model building, I started by trying out different classification algorithms, of which random forests performed best. Because the classes are imbalanced, I chose the F1 score as the metric, and the best model achieved a score of 0.65.
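
To illustrate why the F1 score is more informative than accuracy on imbalanced classes, here is a small worked example in plain Python. The labels are made up, and the macro average (the unweighted mean of the per-class F1 scores) is an assumption; the averaging scheme used for the 0.65 figure is not stated here.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up example: severity 0 dominates, and the "model" always predicts it.
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")
```

A classifier that ignores the two rare classes still looks good on accuracy, while the macro F1 drops sharply because each class counts equally.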

To improve the model, I tried several feature selection and feature engineering techniques: choosing features by correlation with the target, recursive feature elimination, principal component analysis, adding interaction features, and combinations thereof. However, the best model remained the original random forest model.
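
As a rough sketch of two of these techniques, the snippet below ranks features by their absolute Pearson correlation with the target and then adds one interaction feature. The data, the column names, the cutoff `k`, and the use of a simple product as the interaction term are all hypothetical choices for illustration. Treating the severity label as a number only makes sense because the levels are ordered.

```python
import pandas as pd

# Made-up feature matrix and target (severity levels 0, 1, 2).
X = pd.DataFrame({
    "feat_a": [0, 1, 2, 0, 1, 2],
    "feat_b": [1, 0, 1, 0, 1, 0],
    "feat_c": [5, 5, 4, 5, 4, 4],
})
y = pd.Series([0, 1, 2, 0, 1, 2], name="severity")

# Absolute Pearson correlation of each feature with the target, strongest first.
corr = X.corrwith(y).abs().sort_values(ascending=False)

# Keep the k most correlated features (k is a hypothetical tuning choice).
k = 2
selected = corr.index[:k].tolist()

# Add an interaction feature: the product of the two selected columns.
X_selected = X[selected].assign(interaction=X[selected[0]] * X[selected[1]])

print(corr)
print(X_selected.columns.tolist())
```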

The Python code for this project is available in a Jupyter notebook on GitHub.

[Figure: map of flights and visited places]

As someone who likes to travel, and an expat who often flies back to his native country, I accumulate a lot of flights. In fact, I keep a spreadsheet of every flight I’ve been on. Since I am currently teaching myself data engineering in Python, I thought it would be a good exercise to visualize the flight data in my spreadsheet, along with a list of places I’ve visited in my life.

The red markers in the above map represent places I’ve visited. The raw data were simply a list of cities I wrote down in a text file. I imported this file with pandas and used the geopy module to look up the geographic coordinates of each place. Then I created a map using the basemap module and plotted the visited places on it. Basemap allows you to draw maps in all kinds of projections, and since I have traveled exclusively in the Northern Hemisphere, the “polar Lambert azimuthal projection” seemed like a great choice.
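
To give a feel for what such a projection does, here is a minimal sketch of the north-polar Lambert azimuthal equal-area formulas on a sphere. It is not basemap's own code: the function name and parameters are hypothetical, the equal-area variant is an assumption, and a real map library would typically also account for the Earth's ellipsoidal shape.

```python
import math


def north_polar_laea(lat_deg, lon_deg, lon_0_deg=0.0, radius=1.0):
    """Project (lat, lon) in degrees onto a plane centered on the North Pole.

    Spherical north-polar Lambert azimuthal equal-area projection:
    the distance from the pole is rho = 2 R sin(pi/4 - lat/2), and the
    central meridian lon_0 points straight down (negative y).
    """
    phi = math.radians(lat_deg)
    dlam = math.radians(lon_deg - lon_0_deg)
    rho = 2.0 * radius * math.sin(math.pi / 4.0 - phi / 2.0)
    x = rho * math.sin(dlam)
    y = -rho * math.cos(dlam)
    return x, y


# The pole maps to the origin; points farther south land farther from it.
print(north_polar_laea(90.0, 0.0))
print(north_polar_laea(0.0, 0.0))
```

Because every place in the Northern Hemisphere ends up inside a single disk around the pole, nothing visited is cut off at a map edge.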

The flight data from my spreadsheet were also loaded into a pandas DataFrame. I then cross-referenced the three-letter IATA airport codes in my spreadsheet with a data set of over 50,000 airports and obtained each airport’s geographical coordinates by merging the respective DataFrames. Using these coordinates and basemap’s drawgreatcircle method, I drew each flight on the map.
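
The merge step can be sketched as follows. The two small tables, their column names, and the approximate coordinates are made up for illustration. Instead of drawing with basemap, the sketch computes the length of each great-circle route with the haversine formula, which is a stand-in chosen so the example runs without a map library.

```python
import math

import pandas as pd

# Made-up flights and a tiny airport table with approximate coordinates.
flights = pd.DataFrame({"origin": ["IAD", "FRA"], "destination": ["FRA", "IAD"]})
airports = pd.DataFrame({
    "iata": ["IAD", "FRA", "JFK"],
    "lat": [38.94, 50.03, 40.64],
    "lon": [-77.46, 8.57, -73.78],
})

# Merge once per endpoint, renaming the coordinate columns so that origin
# and destination coordinates do not collide.
routes = flights
for end in ("origin", "destination"):
    coords = airports.rename(
        columns={"iata": end, "lat": f"{end}_lat", "lon": f"{end}_lon"}
    )
    routes = routes.merge(coords, on=end, how="left")


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points in degrees (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


routes["distance_km"] = [
    great_circle_km(r.origin_lat, r.origin_lon, r.destination_lat, r.destination_lon)
    for r in routes.itertuples()
]
print(routes)
```

A left merge keeps every flight even if an airport code is missing from the lookup table, so unmatched codes appear as missing coordinates rather than silently dropping rows.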

The Jupyter notebook for this project is available on GitHub.

Christopher Bronner

Data Scientist | Washington, D.C.


Predicting Severity of Service Disruptions in Telstra Network

Flight Data Visualization