WaiPRACTICE August 2020
View the Project on GitHub women-in-ai-ireland/August-2020-WaiLearn-MachineLearning
Catherine Lalanne | LinkedIn|GitHub Heejin Yoon | LinkedIn|GitHub Luciana Azibuike | LinkedIn|GitHub
A newbie evaluation on trying out Machine Learning Models for Classification and Data Augmentation to Support better results
We are immensely grateful to Nabanita Roy for pointing out this very interesting dataset. Her previous work formed the base on which we were able to build; you can have a look at her work here (Part I and Part II)!
Based on the data exploration and performance analysis carried out by Nabanita, the performance of the previous models was very low on the recall metric, with a score of just 0.06 for the best performing model, SVM. This means that out of all the Actual Positive (hazardous) shifts, only 6% were predicted as Positive. Our focus was on improving this.
Predicting Positive here is predicting “the possibility of hazardous situation occurrence, where an appropriate supervision service can reduce a risk of rockburst (e.g. by distressing shooting) or withdraw workers from the threatened area. Good prediction of increased seismic activity is therefore a matter of great practical importance.”
We wouldn’t want to send miners into a mine with a substandard model!
On the other hand, Precision for the best performing model is only 0.67, which means that out of all the shifts predicted as hazardous, (1 - 0.67) = 33% were actually low risk. That would mean a not insignificant number of shifts where miners may have been told to stay home, and thus lower productivity.
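To make the two metrics concrete, here is a minimal sketch of how Recall and Precision come out of raw prediction counts. The counts below are hypothetical, chosen only so that the rounded scores match the 0.06 and 0.67 quoted above; they are not the actual confusion matrix of the earlier SVM model.

```python
def precision(tp, fp):
    """Share of shifts predicted hazardous that really were hazardous."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of truly hazardous shifts that the model flagged."""
    return tp / (tp + fn)


# Hypothetical counts (assumption, for illustration only)
tp = 2   # hazardous shifts correctly flagged
fp = 1   # low-risk shifts wrongly flagged as hazardous
fn = 32  # hazardous shifts the model missed

p = precision(tp, fp)
r = recall(tp, fn)
false_alarm_share = 1 - p  # share of "hazardous" predictions that were low risk

print(f"precision={p:.2f} recall={r:.2f} false alarms={false_alarm_share:.2f}")
```

With so few positive predictions, a single extra hit or miss moves both scores a lot, which is worth keeping in mind when comparing models.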
Our shared implementation notebook can be found here!
As a small improvement, we replaced the One-hot encoder for these Categorical variables with Numerical encoding: since the assessment codes are graded from Low to High, encoding them numerically adds meaningful information. To view the full data attributes, see here.
The encoding of these attributes was implemented by a simple mapping of the known values, as seen in line 'put line number showing numerical encoding' of our implementation notebook. This encoding allowed for easier interpretability, as we will see later. It was also relevant for plotting the correlation matrix, since attributes that are not numeric are completely ignored.
To depict the correlation of the final features with the Target (Class), a heatmap was plotted (shown in Figure 1):
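As a rough illustration of what such a mapping can look like, here is a minimal sketch in plain Python. The labels "a" to "d", their ordering, and the column name `ghazard` are assumptions made for the example, not a copy of the notebook code.

```python
# Assumed ordering of the assessment codes, from lowest to highest hazard
HAZARD_ORDER = {"a": 0, "b": 1, "c": 2, "d": 3}


def encode_ordinal(values, order):
    """Replace each category label by its rank; unknown labels raise KeyError."""
    return [order[v] for v in values]


# Hypothetical assessment codes for five shifts
ghazard = ["a", "b", "a", "c", "b"]
encoded = encode_ordinal(ghazard, HAZARD_ORDER)

print(encoded)
```

Unlike One-hot encoding, which would produce one unordered 0/1 column per label, the single numeric column keeps the Low to High ordering. On a pandas DataFrame the same dictionary could be applied with `Series.map`.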
Figure 1: Correlation Matrix
The original data has many more Class 0 than Class 1 points, which is obvious in the data visualisations and is likely to impact the performance of all models. As suggested by Nabanita Roy, we tried the Synthetic Minority Oversampling Technique (SMOTE), as seen in line 'put line number showing numerical encoding' of our implementation notebook.
Figure 2: Original dataset shape Counter({0: 2414, 1: 170})
Figure 3: Resampled dataset shape Counter({0: 1931, 1: 1931})
Interestingly, for discrete features like nBumps (the number of seismic bumps recorded within the previous shift), the new rows have some non-integer values!
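The non-integer values follow from how SMOTE builds new rows: each synthetic point is placed somewhere on the line segment between a minority-class point and one of its minority-class neighbours. The sketch below is a simplified, single-pair version of that interpolation step written for illustration; it is not the implementation used in our notebook, and the two rows and their feature values are made up.

```python
import random


def interpolate(point, neighbour, rng):
    """Create one synthetic row between two minority-class rows."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbour)]


rng = random.Random(0)  # fixed seed so the example is repeatable

# Hypothetical minority-class rows: [nbumps, energy]
row_a = [1, 2000.0]
row_b = [3, 5000.0]

synthetic = interpolate(row_a, row_b, rng)
print(synthetic)  # nbumps lands between 1 and 3, generally not on an integer

# One possible post-processing step (an assumption, not something we applied):
# round count-like features back to whole numbers.
nbumps_rounded = round(synthetic[0])
```

Whether rounding is appropriate depends on the model; tree-based models, for example, are usually not bothered by fractional counts.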
With the numerical encoding for the Categorical variables, we get the following results:
Note, with the previous One Hot encoders, we got slightly worse results:
The results in the tables above are ordered by f1-score, to strike a good compromise between Recall and Precision; however, if Recall were the most crucial criterion, the order would change slightly.
Note that the Test set was quite small, as the full raw dataset is quite small: 2584 rows x 16 features. So for some of the Top performing results:
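To show how the ranking can depend on the chosen metric, here is a small sketch that orders a few models first by f1-score and then by Recall. The model names and scores are hypothetical placeholders, not the results from our tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs, for illustration only
results = {
    "model_a": (0.50, 0.30),
    "model_b": (0.20, 0.60),
    "model_c": (0.40, 0.40),
}

by_f1 = sorted(results, key=lambda m: f1_score(*results[m]), reverse=True)
by_recall = sorted(results, key=lambda m: results[m][1], reverse=True)

print("ordered by f1:    ", by_f1)
print("ordered by recall:", by_recall)
```

In this made-up case the model with the best Recall comes last by f1-score, because its low Precision drags the harmonic mean down.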
Quite a small difference in absolute numbers!
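A quick back-of-the-envelope calculation shows why the absolute differences are small. It assumes a 20% test split, which is our reading of the resampled class count (1931 is about 80% of 2414) and not a figure stated above.

```python
total_rows = 2584
hazardous_rows = 170
test_fraction = 0.2  # assumption: 80/20 train/test split

test_rows = round(total_rows * test_fraction)
test_hazardous = round(hazardous_rows * test_fraction)

# Change in Recall caused by one more (or one fewer) correctly flagged shift
recall_step = 1 / test_hazardous

print(f"test rows: {test_rows}, hazardous test rows: {test_hazardous}")
print(f"one extra true positive moves Recall by about {recall_step:.3f}")
```

Under that assumption, a gap of a few Recall points between two models may come down to one or two shifts in the Test set.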
Going forward, we recommend further improving the obtained results by doing the following: