Handling imbalanced data

  • A very common issue in classification tasks is class imbalance: one class is heavily outnumbered by the other (the same idea extends to multi-class problems). Strictly speaking, a dataset is imbalanced whenever the ratio of the two classes is not 1:1. A slight imbalance is often not much of a problem, but in some industries and problems we can encounter ratios of 100:1, 1000:1, or even worse.
y_train.value_counts(normalize=True)
  • In this recipe, the default class makes up only 1.98% of the entire sample. In such cases, gathering more data (especially of the default class) might simply not be feasible, and we need to resort to techniques that help us understand and avoid the accuracy paradox. The accuracy paradox refers to a case in which using accuracy as the evaluation metric creates the impression of a very good classifier (a score of 90%, or even 99.9%), while in reality the score simply reflects the distribution of the classes. That is why, in cases of class imbalance, it is highly advisable to use evaluation metrics that account for it, such as precision/recall, F1 score, or Cohen's kappa.
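To make the accuracy paradox concrete, the following minimal sketch scores a "classifier" that always predicts the majority class. The labels are synthetic (an assumption for illustration, roughly mirroring the 2% default rate above), and the metrics are computed by hand so the snippet has no dependencies; in practice, scikit-learn's accuracy_score, recall_score, and cohen_kappa_score would be the usual choice.

```python
# Synthetic labels (an assumption for illustration): 1 = default, 0 = no default.
# 20 positives out of 1,000 observations gives a 2% minority class.
y_true = [1] * 20 + [0] * 980

# A "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Confusion-matrix counts.
n = len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / n
recall = tp / (tp + fn) if (tp + fn) else 0.0

# Cohen's kappa: observed agreement corrected for the agreement expected by chance.
p_observed = accuracy
p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected < 1 else 0.0

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, kappa={kappa:.3f}")
```

The accuracy of 98% looks excellent, yet recall and Cohen's kappa are both zero: the model never identifies a single default, and its agreement with the true labels is no better than chance.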

How to do it...

Execute the following steps to handle class imbalance:

  1. Import the libraries:
from imblearn.over_sampling import RandomOverSampler
  2. Oversample the data:
ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
y_train_ros.value_counts(normalize=True)

How it works...

  • In Step 1, we loaded the required libraries.
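  • In Step 2, we used the RandomOverSampler class from the imblearn library to randomly oversample the minority class, that is, to sample its observations with replacement until it matches the size of the majority class. The last line confirms that both classes now appear in equal proportions.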
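The idea behind random oversampling can be sketched in a few lines of plain Python. This is an illustration of the technique under simplifying assumptions (list inputs, a hypothetical random_oversample helper), not imblearn's actual implementation:

```python
import random
from collections import Counter


def random_oversample(X, y, seed=42):
    """Hypothetical helper: duplicate randomly chosen rows of each smaller
    class until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample with replacement, so the same row can be drawn repeatedly.
        extra = rng.choices(idx, k=target - count)
        X_res.extend(X[i] for i in extra)
        y_res.extend(y[i] for i in extra)
    return X_res, y_res


# Toy data (an assumption for illustration): 2 minority rows, 98 majority rows.
X = [[i] for i in range(100)]
y = [1] * 2 + [0] * 98

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Because the new rows are exact copies of existing minority observations, oversampling balances the class counts without adding any new information. For the same reason, it is usually applied to the training set only, so that the evaluation data keeps the original class distribution.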