Working with Imbalanced Data Sets

Theory

Imbalanced data sets, in the context of supervised classification problems, are those where the class distribution is highly skewed or disproportionate. Most standard supervised learning algorithms implicitly assume the classes are balanced and simply maximize overall accuracy, so on skewed data they tend to learn a bias towards the majority class. Using average accuracy instead of overall accuracy as the evaluation metric exposes, and to some extent addresses, this bias.

For a binary classification problem, average accuracy is the mean of the positive-class accuracy (sensitivity) and the negative-class accuracy (specificity). Under this metric, a model that simply predicts the majority class every time scores no better than 50%, even though its overall accuracy can look excellent.
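A minimal sketch of the difference, on a made-up 80/20 data set with a majority-biased model (scikit-learn also ships this metric as balanced_accuracy_score):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80/20 imbalance
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # misses half the positives

# Overall accuracy looks great despite the missed positive.
print("Overall accuracy:", (y_true == y_pred).mean())  # 0.9

# Average accuracy: mean of the per-class accuracies.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Average accuracy:", (tp / (tp + fn) + tn / (tn + fp)) / 2)  # 0.75

# scikit-learn provides the same quantity directly.
print("balanced_accuracy_score:", balanced_accuracy_score(y_true, y_pred))
```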

There are two ways of handling such problems:

Algorithm level 

Where we modify the learning algorithm itself to accommodate the class skew, for instance through cost-sensitive learning, where minority-class errors are penalized more heavily (see the sketch below).
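As a concrete illustration, a common algorithm-level tool is class weighting; the sketch below uses scikit-learn's class_weight parameter on an assumed synthetic toy data set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" scales each class's loss contribution by the
# inverse of its frequency, so minority-class errors cost more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Explicit per-class costs are also possible, e.g. a tenfold penalty
# for misclassifying the positive (minority) class.
clf_custom = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```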

Data level

Where we modify the training data itself. This will be enough for most beginner- and intermediate-level problems. Approaches include under-sampling, variants of over-sampling, and SMOTE (see the examples below and in the Implementation section).
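To make the under- and over-sampling options concrete, here is a minimal sketch using the imbalanced-learn package (introduced properly in the Implementation section); the toy data set is assumed for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))          # roughly {0: 900, 1: 100}

# Under-sampling: drop majority-class examples until the classes match.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Under-sampled:", Counter(y_u))   # {0: 100, 1: 100}

# Over-sampling: duplicate minority-class examples until the classes match.
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print("Over-sampled:", Counter(y_o))    # {0: 900, 1: 900}
```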

Implementation

Python has a wonderful package named imbalanced-learn that provides all the relevant resources. To use SMOTE with support vector machines, set kind="svm" (by default it is "regular"); note that in recent versions of the package this variant has moved into its own SVMSMOTE class.
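A minimal sketch of plain SMOTE and the SVM variant with the current class-based API, again on an assumed synthetic data set:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, SVMSMOTE

# Synthetic toy data: roughly 90% negatives, 10% positives.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))

# Plain ("regular") SMOTE: interpolates new minority samples between
# nearest minority-class neighbours.
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_s))

# SVM-SMOTE: uses the support vectors of an SVM to focus the synthetic
# samples near the decision boundary.
X_v, y_v = SVMSMOTE(random_state=0).fit_resample(X, y)
print("SVM-SMOTE:", Counter(y_v))
```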

In R, we use a package named ROSE.

Reading resources

A good resource from Analytics Vidhya on how to handle imbalanced data sets. You can also go through this KDD 2018 paper, where the authors make their deep learning architecture robust to class imbalance.

What is your take on this topic?