What is Imbalanced Data?


Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed: there is a large amount of data for one class (the majority class) and far fewer observations for one or more other classes (the minority classes).

Handling Imbalanced Data for Classification


Balanced vs Imbalanced Dataset


  • Balanced Dataset: In a balanced dataset, the classes in the target column are approximately equally distributed.
  • Imbalanced Dataset: In an imbalanced dataset, the class distribution in the target column is highly unequal.

Case 1:

  • If there are 900 ‘Yes’ and 100 ‘No’ records, the dataset is imbalanced because the two classes are highly unequally distributed.

Case 2:

  • If there are 550 ‘Yes’ and 450 ‘No’ records, the dataset is balanced because the two classes are approximately equally distributed.
  • Hence, in an imbalanced dataset there is a significant difference between the sample sizes of the classes.

Problems with an Imbalanced Dataset:


  • Algorithms become biased towards the majority class and tend to predict the majority class for most inputs.
  • Minority-class observations look like noise to the model and get ignored.
  • An imbalanced dataset yields a misleading accuracy score, as the sketch below shows.
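
To see why accuracy misleads, here is a minimal sketch (the dataset is synthetic and only illustrative): a dummy model that always predicts the majority class on a 90/10 split scores roughly 90% accuracy while catching zero minority records.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced dataset: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# A "model" that always predicts the majority class
majority_only = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = majority_only.predict(X)

print(accuracy_score(y, preds))  # about 0.90, looks good on paper
print(recall_score(y, preds))    # 0.0, every minority record is missed
```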

Techniques to Deal with an Imbalanced Dataset:


Under Sampling:


  • In this technique, we reduce the sample size of the majority class to match the sample size of the minority class (see the code sketch after the example below).
  • Example: Let’s take an imbalanced training dataset with 1000 records.

Before Under Sampling:


  • Target class ‘Yes’ = 900 records
  • Target class ‘No’ = 100 records

After Under Sampling:


  • Target class ‘Yes’ = 100 records
  • Target class ‘No’ = 100 records
  • Now, both classes have the same sample size.
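
A minimal sketch of random under-sampling using the imbalanced-learn library (an assumption: it is installed via `pip install imbalanced-learn`; the data is synthetic to mirror the 900/100 example above).

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the 900-'Yes' / 100-'No' training data above
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Randomly drop majority-class records until both classes match
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print(Counter(y))      # before: roughly {0: 900, 1: 100}
print(Counter(y_res))  # after: equal counts for both classes
```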

Pros:

  • Low computation power needed.

Cons:

  • Important patterns may be lost when records are dropped.
  • Only beneficial for huge datasets with millions of records.
  • Note: under sampling should only be done on datasets with a huge number of records.

Over Sampling:


  • In this technique, we increase the sample size of the minority class by replicating its records until it matches the sample size of the majority class (see the code sketch after the example below).
  • Example: Let’s take the same imbalanced training dataset with 1000 records.

Before Over Sampling:


  • Target class ‘Yes’ = 900 records
  • Target class ‘No’ = 100 records

After Over Sampling:


  • Target class ‘Yes’ = 900 records
  • Target class ‘No’ = 900 records
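
A sketch of random over-sampling (replication) with the same imbalanced-learn library; SMOTE, which synthesizes new minority points instead of duplicating existing ones, is a common alternative from the same package.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler  # SMOTE lives in this module too

# Synthetic stand-in for the 900-'Yes' / 100-'No' training data above
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Replicate minority-class records until both classes match
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print(Counter(y_res))  # equal counts for both classes
```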

Pros:

  • No records are dropped, so no patterns are lost, which helps model performance.

Cons:

  • Replication of the data can lead to overfitting.
  • High computation power needed.

So, Which One to Choose: ‘Under Sampling’ or ‘Over Sampling’?


  • It depends on the dataset.
  • If you have a huge dataset, choose ‘Under Sampling’; otherwise, go with ‘Over Sampling’.

Using Tree-Based Models:


  • Tree-based models deal with imbalanced datasets more easily than non-tree-based models because their hierarchical splitting can isolate minority-class regions (see the sketch after the list below).

Different Tree-Based Models are:


  • Decision Trees
  • Random Forests
  • Gradient Boosted Trees
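
A sketch of one such model on skewed data. The dataset is synthetic, and `class_weight="balanced"` is an optional scikit-learn parameter that additionally re-weights classes inversely to their frequency; it is a common pairing, not something the text above prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic 90/10 dataset standing in for the imbalanced training data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" re-weights classes inversely to their frequency
model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=42)
model.fit(X_tr, y_tr)

# Per-class precision/recall is more informative than plain accuracy here
print(classification_report(y_te, model.predict(X_te)))
```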

Using Anomaly Detection Algorithms:


  • Anomaly (outlier) detection algorithms are one-class classification algorithms that help identify outliers (rare data points) in a dataset.
  • In an imbalanced dataset, treat the majority-class records as ‘normal’ data and the minority-class records as ‘outlier’ data.
  • These algorithms are trained on the normal data only.
  • The trained model can then predict whether a new record is normal or an outlier, as in the sketch below.
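
A minimal sketch of this framing with scikit-learn's IsolationForest (the dataset and variable names are illustrative): train on majority-class records only, then flag records as normal (+1) or outlier (-1).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_normal = X[y == 0]  # majority class treated as "normal" data

# Train the detector on normal data only
detector = IsolationForest(random_state=42).fit(X_normal)

# predict() returns +1 for normal records and -1 for outliers
flags = detector.predict(X)
print(np.mean(flags[y == 1] == -1))  # fraction of minority records flagged as outliers
```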

How to Detect an Imbalanced Class
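
A common first check is the class distribution of the target column; here is a minimal pandas sketch (the DataFrame and column name are hypothetical stand-ins).

```python
import pandas as pd

# Hypothetical target column mirroring the 900/100 example above
df = pd.DataFrame({"target": ["Yes"] * 900 + ["No"] * 100})

counts = df["target"].value_counts()
print(counts)            # Yes: 900, No: 100
print(counts / len(df))  # proportions 0.9 vs 0.1; a heavy skew signals imbalance
```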