WaiPRACTICE August 2020

View the Project on GitHub women-in-ai-ireland/August-2020-WaiLearn-MachineLearning

Contributors

Catherine Lalanne | LinkedIn | GitHub
Heejin Yoon | LinkedIn | GitHub
Luciana Azibuike | LinkedIn | GitHub

Predicting Hazardous Seismic Bumps: Evaluating Machine Learning Models using Scikit-Learn

A newcomer's evaluation of machine learning models for classification, with data augmentation to support better results

Contents:

  1. Premise
  2. Our Contribution
  3. Our Conclusion/What we have Learnt
  4. Future Work
  5. References
  6. Contributors


We are immensely grateful to Nabanita Roy for pointing out this very interesting dataset. Her previous work formed the base on which we were able to build; you can have a look at it here (Part I and Part II)!
Based on the data exploration and performance analysis carried out by Nabanita, the previous models scored very low on the recall metric: just 0.06 for the best-performing model, SVM. In other words, out of all the actually positive (hazardous) shifts, only 6% were predicted as positive. Our focus was on improving this.


1. Premise

Predicting Positive here is predicting “ the possibility of hazardous situation occurrence, where an appropriate supervision service can reduce a risk of rockburst (e.g. by distressing shooting) or withdraw workers from the threatened area. Good prediction of increased seismic activity is therefore a matter of great practical importance. “

We wouldn’t want to send miners into a mine with a substandard model!

On the other hand, precision for the best-performing model is only 0.67, which means that out of all the shifts predicted as hazardous, (1 - 0.67) = 33% were actually low risk. That is a not-insignificant number of shifts where miners may have been told to stay home, and thus lower productivity.
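For reference, both metrics can be computed with scikit-learn. A minimal sketch on toy labels (the labels below are illustrative, not taken from the dataset):

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = hazardous shift, 0 = low-risk shift (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Recall: fraction of actually hazardous shifts that were predicted hazardous
recall = recall_score(y_true, y_pred)        # 1 of 4 -> 0.25
# Precision: fraction of predicted-hazardous shifts that really were hazardous
precision = precision_score(y_true, y_pred)  # 1 of 2 -> 0.5
print(recall, precision)
```

A model that misses most hazardous shifts will show exactly this pattern: acceptable precision but unusably low recall.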

Our shared implementation notebook can be found here!

2. Our Contribution

The encoding of these attributes was implemented by a simple mapping of the known values, as seen in line 'put line number showing numerical encoding' of our implementation notebook. This encoding allowed for easier interpretability, as we will see later. It was also needed for plotting the correlation matrix, since non-numeric attributes are completely ignored there.
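Such a mapping can be sketched with pandas as below. The column names and category codes follow the UCI seismic-bumps attribute descriptions; the exact mapping in our notebook may differ:

```python
import pandas as pd

# Illustrative rows for two categorical attributes of the seismic-bumps data
df = pd.DataFrame({
    "seismic": ["a", "a", "b"],  # hazard assessment result: a, b, c, d
    "shift": ["W", "N", "W"],    # W = coal-getting shift, N = preparation shift
})

# Simple ordinal mapping so these columns become numeric
df["seismic"] = df["seismic"].map({"a": 0, "b": 1, "c": 2, "d": 3})
df["shift"] = df["shift"].map({"W": 0, "N": 1})
print(df.dtypes)  # both columns are now integer-typed
```

Because the mapping is a plain dictionary, it is easy to invert when interpreting model outputs later.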


To depict the correlation of the final features with the target (Class), a heatmap was implemented (shown in Figure 1):


Figure 1: Correlation Matrix
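A heatmap like Figure 1 can be produced along these lines. This is a sketch using matplotlib's `imshow` on random placeholder data; the notebook may use seaborn's `heatmap` instead, and the column names here are illustrative:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# Placeholder numeric frame; in practice this is the encoded features + Class
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["energy", "maxenergy", "nbumps", "class"])

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("correlation_matrix.png")
```

The `vmin=-1, vmax=1` bounds keep the colour scale symmetric, so positive and negative correlations are visually comparable.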


The original data has far more Class 0 than Class 1 points, which is obvious in the data visualisations and is likely to impact the performance of all models. As suggested by Nabanita Roy, we tried Synthetic Minority Oversampling Technique (SMOTE) oversampling, as seen in line 'put line number showing numerical encoding' of our implementation notebook.


Figure 2: Original dataset shape Counter({0: 2414, 1: 170})

Figure 3: Resampled dataset shape Counter({0: 1931, 1: 1931})

Interestingly, for discrete features like nBumps (the number of seismic bumps recorded within the previous shift), the new rows have some non-integer values!
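These fractional values are expected: SMOTE creates each synthetic minority point by interpolating between a real minority sample and one of its nearest minority neighbours, so a discrete count like nBumps can land between two integers. A minimal numpy sketch of that interpolation step (not the imblearn implementation itself; the feature values are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two real minority-class samples; second component stands in for nBumps
x = np.array([10.0, 2.0])          # e.g. [energy-like feature, nBumps = 2]
neighbour = np.array([14.0, 5.0])  # a nearest minority neighbour, nBumps = 5

# SMOTE-style interpolation: x + gap * (neighbour - x), with gap ~ U(0, 1)
gap = rng.uniform(0, 1)
synthetic = x + gap * (neighbour - x)
print(synthetic)  # the nBumps component is generally non-integer
```

A common follow-up is to round such count features after resampling, at the cost of slightly perturbing the synthetic points.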


[Tables: model performance results on the resampled dataset]

Note that with the previous one-hot encoders, we got slightly worse results:


[Table: results with one-hot encoding]
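For comparison, the one-hot variant can be sketched with pandas (`get_dummies` here; scikit-learn's `OneHotEncoder` is an equivalent alternative; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"seismic": ["a", "b", "a"], "shift": ["W", "N", "W"]})

# One binary column per category, instead of a single ordinal column
encoded = pd.get_dummies(df, columns=["seismic", "shift"])
print(encoded.columns.tolist())
# ['seismic_a', 'seismic_b', 'shift_N', 'shift_W']
```

One-hot encoding avoids imposing an artificial order on the categories, but widens the feature space, which may explain the slightly different results.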

The results contained in the tables above are ordered by F1-score, so as to strike a good compromise between recall and precision; however, if recall is the most crucial criterion, the order would be slightly different.
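F1 is the harmonic mean of precision and recall, so it punishes a model that does well on one metric but badly on the other. Using the baseline SVM numbers quoted earlier (precision 0.67, recall 0.06), its F1 comes out low despite the decent precision:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Baseline SVM from the earlier analysis: precision 0.67, recall 0.06
print(round(f1_from(0.67, 0.06), 2))  # -> 0.11
```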

Note that the test set was quite small, as the full raw dataset itself is quite small: 2584 rows x 16 features. So for some of the top-performing results:

Quite a small difference in absolute numbers!


3. Our Conclusion/What we have Learnt

4. Future Work

Going forward, we recommend further improving the obtained results by doing the following:

5. References

  1. Introduction to Scikit-Learn: Understanding Classification Models for Supervised Machine Learning
  2. To understand model performance metrics
  3. To understand the feature data (data types, missing values, outliers, etc.): see Data Exploration and Preparation
  4. Visualization (boxplots, scatter plots, correlation matrix, etc.): see Data Exploration and Preparation