Whenever we do classification in ML, we often assume that the target label is evenly distributed in our dataset. This helps the training algorithm learn the features, since we have enough examples for all the different cases. For example, to learn a spam filter, we should have a good amount of data for both spam and non-spam emails.
This even distribution is not always possible. Take fraud detection as an example. In fraud detection, we look at a transaction and decide whether it is fraudulent or not. In the majority of cases the transaction will be normal, so the fraudulent samples are very few compared to the normal ones. In these cases there is an imbalance in the target labels, which affects the quality of the models we can build. So in this series of posts, we will discuss what class imbalance is and how to handle it in Python and Spark.
This is the second post in the series, where we discuss handling class imbalance using the undersampling technique. You can read all the blogs in the series here.
Undersampling is one of the techniques used for handling class imbalance. In this technique, we undersample the majority class to match the minority class. So in our example, we take a random sample of the non-fraud class to match the number of fraud samples. This makes sure that the training data has an equal number of fraud and non-fraud samples.
Undersampling in Python
The below are the steps to do the undersampling in Python.
1. Find the number of samples which are fraud
2. Get the indices of the non-fraud samples
3. Randomly sample the non-fraud indices
4. Find the indices of the fraud samples
5. Concatenate the fraud indices with the sampled non-fraud ones
6. Get the balanced dataframe
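The steps above can be sketched as follows. This is a minimal sketch, assuming a pandas DataFrame `df` with a binary `Class` column (1 = fraud, 0 = non-fraud); the toy data here stands in for the real transaction dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in data: 980 normal transactions, 20 fraudulent ones.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.normal(50, 10, 1000),
    "Class": [1] * 20 + [0] * 980,
})

# 1. Find the number of fraud samples
fraud_count = len(df[df["Class"] == 1])

# 2. Get the indices of the non-fraud samples
non_fraud_indices = df[df["Class"] == 0].index

# 3. Randomly sample as many non-fraud indices as there are fraud samples
sampled_non_fraud = np.random.choice(non_fraud_indices, fraud_count, replace=False)

# 4. Find the indices of the fraud samples
fraud_indices = df[df["Class"] == 1].index

# 5. Concatenate the fraud indices with the sampled non-fraud ones
under_sample_indices = np.concatenate([fraud_indices, sampled_non_fraud])

# 6. Get the balanced dataframe
under_sample = df.loc[under_sample_indices]
```

After this, `under_sample` contains the same number of fraud and non-fraud rows.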
Visualising Undersampled Data
The below is the class distribution of the under_sample dataframe.
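A sketch of the plotting code, assuming the `under_sample` dataframe from the previous step; a small stand-in dataframe is built here so the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the balanced dataframe produced by the undersampling step.
under_sample = pd.DataFrame({"Class": [0] * 20 + [1] * 20})

# Bar plot of how many rows each class has after undersampling.
counts = under_sample["Class"].value_counts()
counts.plot(kind="bar", title="Class distribution after undersampling")
plt.xlabel("Class")
plt.ylabel("Count")
plt.savefig("class_distribution.png")
```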
In the above plot, you can observe that classes are distributed evenly now.
Running Logistic Regression on Undersampled Data
Once we have the undersampled data, we need to train on it.
The below code runs logistic regression on the undersampled data.
Once we have trained the model, we can verify it using accuracy and recall scores, as we did in the last post.
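A sketch of the training and scoring, assuming the balanced `under_sample` dataframe from earlier; a toy two-class dataframe stands in for it here so the snippet is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the balanced undersampled data: one feature, two classes.
rng = np.random.default_rng(0)
under_sample = pd.DataFrame({
    "amount": np.concatenate([rng.normal(20, 5, 100), rng.normal(80, 5, 100)]),
    "Class": [0] * 100 + [1] * 100,
})

# Split features and label, then hold out a test set.
X = under_sample.drop("Class", axis=1)
y = under_sample["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train logistic regression on the balanced data.
model = LogisticRegression()
model.fit(X_train, y_train)

# Verify using accuracy and recall.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
```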
The result is
As you can observe from the result, our recall has improved a lot. It was 61% when the data was unbalanced, but now it's 92%. This means our model is pretty good at identifying fraud.
The accuracy score has gone down because we undersampled the data. This is fine in our case, because misclassifying some non-fraud transactions as fraud doesn't do much harm.
Whenever we undersample data, the training data size reduces significantly. So even though the model works well on the balanced data, we need to make sure it generalises well. The below code calculates the scores for the full data using the above model.
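A sketch of scoring the undersample-trained model on the full, imbalanced data. The toy imbalanced dataframe and the quick undersampling step here stand in for the real dataset and the model trained above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced data standing in for the full dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": np.concatenate([rng.normal(20, 5, 980), rng.normal(80, 5, 20)]),
    "Class": [0] * 980 + [1] * 20,
})

# Build a balanced training subset (undersampling), as in the earlier step.
fraud = df[df["Class"] == 1]
non_fraud = df[df["Class"] == 0].sample(len(fraud), random_state=0)
balanced = pd.concat([fraud, non_fraud])

# Train on the balanced subset.
model = LogisticRegression()
model.fit(balanced.drop("Class", axis=1), balanced["Class"])

# Score the model on the full, imbalanced data.
X_full, y_full = df.drop("Class", axis=1), df["Class"]
full_pred = model.predict(X_full)
print("full-data accuracy:", accuracy_score(y_full, full_pred))
print("full-data recall:", recall_score(y_full, full_pred))
```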
The result is
As you can observe from the results, the accuracy score is still good when we predict on the unbalanced data. This confirms that our model generalises well even though it was trained on undersampled data.
Using Class Weight
In the above code, we handled class imbalance explicitly. But scikit-learn's logistic regression has an option named class_weight which, when specified, handles class imbalance implicitly.
The below code shows how to do the same
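A minimal sketch using `class_weight="balanced"`, which weights classes inversely to their frequency, so no explicit undersampling is needed. The toy imbalanced dataframe stands in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Toy imbalanced data: 980 normal transactions, 20 fraudulent ones.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": np.concatenate([rng.normal(20, 5, 980), rng.normal(80, 5, 20)]),
    "Class": [0] * 980 + [1] * 20,
})
X, y = df.drop("Class", axis=1), df["Class"]

# class_weight="balanced" reweights the loss instead of dropping samples.
model = LogisticRegression(class_weight="balanced")
model.fit(X, y)
print("recall:", recall_score(y, model.predict(X)))
```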
The result is
In this post, we learned how to handle class imbalance using the undersampling technique.
In our next post, we will discuss how to implement undersampling in Spark.