Predicting Hazardous Seismic: Evaluating Machine Learning Models using Scikit Learn

A beginner's evaluation of Machine Learning models for classification, with data augmentation to support better results

We are immensely grateful to Nabanita Roy for pointing out this very interesting dataset. Her previous work formed the base on which we were able to build; you can have a look at her work here (Part I and Part II)!
Based on the data exploration and performance analysis carried out by Nabanita, the previous models performed very poorly on the recall metric: the best performing model, SVM, scored just 0.06. In other words, out of all the actual positive (hazardous) shifts, only 6% were predicted as positive. Our focus was on improving this.

1. Premise

Predicting Positive here means predicting “the possibility of hazardous situation occurrence, where an appropriate supervision service can reduce a risk of rockburst (e.g. by distressing shooting) or withdraw workers from the threatened area. Good prediction of increased seismic activity is therefore a matter of great practical importance.”

We wouldn’t want to send miners into a mine with a substandard model!

On the other hand, Precision for the best performing model is only 0.67, which means that out of all the shifts predicted as hazardous, (1 - 0.67) = 33% were actually low risk. That is a not insignificant number of shifts where miners may have been told to stay home, and thus lower productivity.

Our shared implementation notebook can be found here!

2. Our Contribution

Pre-processing the Data Further

As a small improvement, we replaced the one-hot encoding of the categorical variables listed below with numerical encoding: because the assessment codes are graded from low to high, coding them numerically adds meaningful information. To view the full data attributes, see here.

Seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
Seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
Shift: information about type of a shift (W - coal-getting, N - preparation shift);

The encoding of these attributes was implemented as a simple mapping of the known values, as seen in line 'put line number showing numerical encoding' of our implementation notebook. This encoding allowed for easier interpretability, as we will see later. It also mattered for plotting the correlation matrix, because non-numeric attributes are left out of it entirely.
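As a rough illustration of the mapping described above, here is a minimal sketch using pandas. The column names follow the attribute names listed above, but the integer codes (a=0 up to d=3, N=0, W=1) and the tiny sample frame are our assumptions for illustration, not necessarily the values used in the notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has 2584 rows.
df = pd.DataFrame({
    "seismic": ["a", "b", "a", "b"],
    "seismoacoustic": ["a", "b", "c", "a"],
    "shift": ["N", "W", "W", "N"],
})

# Assumed ordinal codes: hazard grades run from low (a) to high (d).
hazard_codes = {"a": 0, "b": 1, "c": 2, "d": 3}
# Assumed binary code for the shift type.
shift_codes = {"N": 0, "W": 1}

encoded = df.copy()
encoded["seismic"] = encoded["seismic"].map(hazard_codes)
encoded["seismoacoustic"] = encoded["seismoacoustic"].map(hazard_codes)
encoded["shift"] = encoded["shift"].map(shift_codes)

# Any value missing from a mapping becomes NaN, so check before training.
assert not encoded.isna().any().any()
print(encoded)
```

Unlike one-hot encoding, this keeps a single column per attribute and preserves the low-to-high ordering of the hazard grades.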

Exploratory Data Analysis: Checking the Correlations between Features and the Target

To depict the correlation of final features to the Target (Class), a heatmap was implemented (shown in Figure 1):

Figure 1: Correlation Matrix
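To give an idea of how such a matrix can be produced, here is a minimal sketch with pandas. The sample values are made up and the column names are only stand-ins for the encoded features; `numeric_only=True` assumes a reasonably recent pandas version (1.5 or later).

```python
import pandas as pd

# Made-up numeric sample; "class" is the target (1 = hazardous shift).
df = pd.DataFrame({
    "seismic": [0, 1, 0, 1, 1, 0],
    "shift": [0, 1, 1, 1, 1, 0],
    "nbumps": [0, 2, 0, 3, 1, 0],
    "class": [0, 1, 0, 1, 0, 0],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first.
target_corr = corr["class"].drop("class").sort_values(ascending=False)
print(target_corr)

# The full matrix can then be drawn as a heatmap, for example with
# seaborn.heatmap(corr, annot=True) if seaborn is installed.
```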

Augmenting the Data Using the Synthetic Minority Oversampling Technique (SMOTE)

The original data has far more Class 0 than Class 1 points, which is obvious in the data visualisations and is likely to impact the performance of all models. As suggested by Nabanita Roy, we tried the Synthetic Minority Oversampling Technique (SMOTE), as seen in line 'put line number showing numerical encoding' of our implementation notebook.

Figure 2: Original dataset shape Counter({0: 2414, 1: 170})

Figure 3: Resampled dataset shape Counter({0: 1931, 1: 1931})

Interestingly, for discrete features like nbumps (the number of seismic bumps recorded within the previous shift), the new rows contain some non-integer values!
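The non-integer values follow from how SMOTE builds new rows: each synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step with NumPy. It is a simplification, not the code of a library implementation such as imbalanced-learn's `SMOTE`, and the helper `interpolate` and the two sample rows are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class rows: [nbumps, nbumps2].
sample = np.array([1.0, 0.0])
neighbour = np.array([3.0, 1.0])

def interpolate(x, x_nn, rng):
    """Return a synthetic point on the segment between x and x_nn."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (x_nn - x)

synthetic = interpolate(sample, neighbour, rng)
print(synthetic)

# The new point lies between the two originals, so integer counts
# generally turn into fractional values.
```

If integer counts matter downstream, one option is to round the synthetic columns after resampling. Note also that the 1931 Class 0 rows in Figure 3 equal the 2414 original Class 0 rows minus the 483 Class 0 rows in the test confusion matrices, which suggests the resampling was applied to the training split only.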

Model Test Results

With the numerical encoding for the Categorical variables, we get the following results:

Note, with the previous One Hot encoders, we got slightly worse results:

The results in the tables above are ordered by f1-score, to strike a good compromise between Recall and Precision; if Recall were the most crucial criterion, the order would change slightly.

Note that the Test set was quite small, as the full raw dataset is quite small: 2584 rows x 16 features. So, for some of the top performing results:

The RFC SMOTE max_leaf_nodes=10 Confusion Matrix has the top Recall score: 20 / (20 + 14) = 0.59

[[398 85]
[ 14 20]]

The XGBoost SMOTE Confusion Matrix Recall score is: 16 / (16 + 18) = 0.47

[[424 59]
[ 18 16]]

Quite a small difference in absolute numbers!
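The recall figures above, along with precision and f1-score, can be recomputed directly from the two confusion matrices. The sketch below assumes the layout [[TN, FP], [FN, TP]], which is what scikit-learn's `confusion_matrix` returns for labels 0 and 1 and which matches the recall calculations above; the helper name `metrics` is ours.

```python
def metrics(cm):
    """Recall, precision and F1 for the positive class.

    cm is laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

rfc = [[398, 85], [14, 20]]  # RFC SMOTE max_leaf_nodes=10
xgb = [[424, 59], [18, 16]]  # XGBoost SMOTE

for name, cm in [("RFC", rfc), ("XGBoost", xgb)]:
    recall, precision, f1 = metrics(cm)
    print(f"{name}: recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

On these numbers the random forest has the higher recall, while XGBoost raises fewer false alarms (59 against 85) and so edges ahead on precision and, narrowly, on f1-score.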

3. Our Conclusion/What we have Learnt

Training any model on the augmented dataset (SMOTE) improves the model performance significantly.
XGBoost wins by a small margin, based on shallow Decision Trees.
The RandomForestClassifier and Decision Tree Classifier perform well, although not in their GridSearchCV-optimised versions! It seems that optimising against the SMOTE dataset can be counter-productive, and that choosing values like max_leaf_nodes=10, intuitively adapted to a small dataset, works in this case.
It might be good practice to remove highly correlated features when performing feature selection.
Some models, like random forest and XGBoost, expose feature importances, which are great both for informing feature selection and checking for data leakage, and for getting feedback and trust from the customer.
In this case, it is recommended to look at the top performing models with feature importance support, as well as the best simple Decision Tree.
Plotting the feature importances to inform feature selection shows that nbumps (the number of seismic bumps recorded within the previous shift) is understandably a good predictor of hazard in the current shift, as is nbumps2 (the number of seismic bumps in energy range [10^2,10^3] registered within the previous shift).
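For readers who want to try the feature importance idea, here is a minimal sketch with scikit-learn. It uses a synthetic, imbalanced dataset from `make_classification` rather than the seismic data, and the feature names are placeholders, so the ranking it prints says nothing about the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for the seismic data (not the real dataset).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# Small trees, as with max_leaf_nodes=10 in the text.
model = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=10, random_state=0
)
model.fit(X, y)

# Importances are non-negative and sum to 1; higher means more influential.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```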

4. Future Work

Going forward, we recommend further improving the obtained results by doing the following:

Testing more ensemble machine learning models.
Investigating further to validate a number of assumptions, such as which metric is most important, and why there seems to be a correlation between the Shift type and the hazard Classification: could that be data leakage? There is a need to talk directly to the users.

5. References

Introduction to Scikit Learn: Understanding Classification Models for Supervised Machine Learning
To Understand Model Performance Metrics
To Understand the feature data (data types, missing values, outliers, etc.): see Data Exploration and Preparation
Visualization (boxplots, scatter plots, correlation matrix, etc.): see Data Exploration and Preparation

WaiPRACTICE

Contributors