Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed; often you'll have a large amount of data/observations for one class (referred to as the majority class), and much fewer observations for one or more other classes (referred to as the minority classes).
- Balanced dataset: In a balanced dataset, the classes in the target column are distributed approximately equally.
- Imbalanced dataset: In an imbalanced dataset, the classes in the target column are distributed very unequally.
- If there are 900 ‘Yes’ and 100 ‘No’ records, the dataset is imbalanced, because the two classes are distributed very unequally.
- If there are 550 ‘Yes’ and 450 ‘No’ records, the dataset is balanced, because the two classes are distributed approximately equally.
- Hence, in an imbalanced dataset there is a significant difference between the sample sizes of the classes.
- Algorithms may become biased towards the majority class and tend to predict it for most inputs.
- Minority class observations can look like noise to the model and end up being ignored.
- Accuracy becomes misleading on an imbalanced dataset: a model that always predicts the majority class can still score high accuracy.
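
To make the misleading-accuracy point concrete, here is a minimal sketch using the 900 ‘Yes’ / 100 ‘No’ split from the example above. The "model" is hypothetical: it ignores its input and always predicts the majority class. The helper names `accuracy` and `recall` are illustrative and do not come from any particular library.

```python
# Hypothetical labels: 900 'Yes' and 100 'No', matching the example above.
y_true = ["Yes"] * 900 + ["No"] * 100

# A stand-in "model" that always predicts the majority class.
y_pred = ["Yes"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of records whose true class is `label` that were predicted as `label`."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in relevant) / len(relevant)


acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, "No")

print(f"accuracy       = {acc:.2f}")              # 0.90
print(f"recall on 'No' = {minority_recall:.2f}")  # 0.00
```

The accuracy looks good at 90%, yet not a single minority record is identified, which is why a per-class metric such as recall is worth checking alongside accuracy.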
- In under-sampling, we reduce the sample size of the majority class to match the sample size of the minority class.
- Example: Take an imbalanced training dataset with 1000 records.
- Before under-sampling, target class ‘Yes’ = 900 records
- Before under-sampling, target class ‘No’ = 100 records
- After under-sampling, target class ‘Yes’ = 100 records
- After under-sampling, target class ‘No’ = 100 records
- Now, both classes have the same sample size.
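- Advantage: less computation is needed, because the training set becomes smaller.
- Disadvantage: some important patterns may be lost when records are dropped.
- It is mainly beneficial for very large datasets (for example, millions of records), where enough data remains after dropping records.
- Note: Under-sampling should only be used when we have a huge number of records.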
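
The under-sampling example above (900 ‘Yes’ and 100 ‘No’ reduced to 100 each) can be sketched with the Python standard library alone. This is a minimal illustration of random under-sampling, not a reference implementation; the function name `random_under_sample`, the `label_of` parameter, and the `(record_id, label)` record layout are assumptions made for this example.

```python
import random
from collections import Counter


def random_under_sample(records, label_of, seed=0):
    """Randomly drop records so that every class matches the smallest class size."""
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)

    # The target size is the size of the smallest (minority) class.
    target = min(len(group) for group in by_class.values())

    sampled = []
    for group in by_class.values():
        # Sample without replacement, so no record appears twice.
        sampled.extend(rng.sample(group, target))

    rng.shuffle(sampled)
    return sampled


# Hypothetical training set of (record_id, label) pairs: 900 'Yes', 100 'No'.
records = [(i, "Yes") for i in range(900)] + [(i, "No") for i in range(900, 1000)]

balanced = random_under_sample(records, label_of=lambda rec: rec[1])
counts = Counter(label for _, label in balanced)
print(counts["Yes"], counts["No"])  # 100 100
```

Note that 800 of the 900 ‘Yes’ records are discarded, which is exactly where the risk of losing patterns comes from.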
- In over-sampling, we increase the sample size of the minority class by replicating its records to match the sample size of the majority class.
- Example: Take the same imbalanced training dataset with 1000 records.
- Before over-sampling, target class ‘Yes’ = 900 records
- Before over-sampling, target class ‘No’ = 100 records
- After over-sampling, target class ‘Yes’ = 900 records
- After over-sampling, target class ‘No’ = 900 records
- Advantage: no records are dropped, so no patterns are lost, which can help model performance.
- Disadvantage: replicating records can lead to overfitting.
- Disadvantage: more computation is needed, because the training set becomes larger.
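
The over-sampling example above (100 ‘No’ records replicated up to 900) can be sketched in the same spirit. This is a minimal illustration of random over-sampling by replication, using only the standard library; the function name `random_over_sample`, the `label_of` parameter, and the `(record_id, label)` record layout are assumptions made for this example.

```python
import random
from collections import Counter


def random_over_sample(records, label_of, seed=0):
    """Replicate randomly chosen records until every class matches the largest class size."""
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)

    # The target size is the size of the largest (majority) class.
    target = max(len(group) for group in by_class.values())

    sampled = []
    for group in by_class.values():
        # Keep every original record, so nothing is lost.
        sampled.extend(group)
        # Top up with copies drawn with replacement from the same class.
        extra = target - len(group)
        sampled.extend(rng.choices(group, k=extra))

    rng.shuffle(sampled)
    return sampled


# Hypothetical training set of (record_id, label) pairs: 900 'Yes', 100 'No'.
records = [(i, "Yes") for i in range(900)] + [(i, "No") for i in range(900, 1000)]

oversampled = random_over_sample(records, label_of=lambda rec: rec[1])
counts = Counter(label for _, label in oversampled)
print(counts["Yes"], counts["No"])  # 900 900
print(len(set(oversampled)))        # 1000 distinct records, no new information
```

The resampled set has 1800 rows but still only 1000 distinct records: the 800 added rows are exact copies, which is why replication can encourage overfitting.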
- Which technique to choose depends on the dataset.
- If we have a huge dataset, choose under-sampling; otherwise, go with over-sampling.
- Tree-based models often handle an imbalanced dataset better than non-tree-based models because of their hierarchical structure. Examples include:
- Decision Trees
- Random Forests
- Gradient Boosted Trees
- Anomaly or outlier detection algorithms are ‘one-class classification’ algorithms that help identify outliers (rare data points) in the dataset.
- In an imbalanced dataset, treat the majority class records as Normal data and the minority class records as Outlier data.
- These algorithms are trained on Normal data.
- A trained model can then predict whether a new record is Normal or an Outlier.
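
The train-on-Normal, predict-Normal-or-Outlier workflow can be illustrated with a deliberately simple sketch. The detector below is a toy stand-in, not one of the standard anomaly detection algorithms: it learns only the mean and standard deviation of a single numeric feature from Normal data and flags values that lie more than a chosen number of standard deviations away. The class name `ZScoreDetector`, the threshold of 3.0, and the sample values are all assumptions made for this example.

```python
from statistics import mean, stdev


class ZScoreDetector:
    """Toy one-class model: learns the mean and spread of Normal data only."""

    def __init__(self, threshold=3.0):
        # Number of standard deviations beyond which a value counts as an outlier.
        self.threshold = threshold

    def fit(self, normal_values):
        """Train on majority-class (Normal) values only."""
        self.mu = mean(normal_values)
        self.sigma = stdev(normal_values)
        return self

    def predict(self, value):
        """Return 'Normal' or 'Outlier' for a new value."""
        z = abs(value - self.mu) / self.sigma
        return "Outlier" if z > self.threshold else "Normal"


# Hypothetical majority-class measurements used for training.
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]

detector = ZScoreDetector().fit(normal_data)

pred_typical = detector.predict(10.05)  # close to the training data
pred_rare = detector.predict(14.0)      # far from anything seen in training

print(pred_typical)  # Normal
print(pred_rare)     # Outlier
```

The minority class is never shown to the model during training; it is recognised only because it does not resemble the Normal data.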