How to deal with imbalanced data in machine learning, practically?

4 min read · Sep 30, 2024

Imbalanced data is common in certain classification problems, and if not handled appropriately it hurts both model performance and usability. Several well-known methods exist to deal with imbalanced data, but they often fail to deliver the expected results. Some lesser-known methods can provide more satisfactory outcomes.

What is imbalanced data: Data imbalance occurs when one class dominates the dataset while another is rare. For example, if a bank records 1,000 transactions and only 50 of them are fraudulent, the data is imbalanced.

What is the issue with imbalanced data: Machine learning algorithms are quick to optimize by taking the easiest path. If the data is severely imbalanced (e.g., 99.99% majority class and 0.01% minority class), the model does not see enough examples of the minority class. It simply learns to predict the majority class and fails to generalize to the minority class.

Apart from this, such a model will also report an accuracy of 99.99%, which gives a false sense of good performance, while its precision and recall on the minority class are close to zero.
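The accuracy illusion is easy to demonstrate. A minimal sketch with made-up fraud data (990 legitimate, 10 fraudulent transactions) and a trivial "model" that always predicts the majority class:

```python
import numpy as np

# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()          # 0.99 -> looks great
recall_minority = y_pred[y_true == 1].mean()  # 0.0  -> catches no fraud

print(accuracy, recall_minority)
```

The model scores 99% accuracy while catching zero fraudulent transactions, which is exactly why accuracy alone is misleading here.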

What are some known ways to solve the class imbalance problem

  1. Under-sampling: A simple technique in which samples from the majority class are randomly removed to reduce the class imbalance.

Pros: Very simple technique to implement.

Cons: Removing samples discards information the model could have learned from, and training on a much smaller dataset can lead to overfitting and poor generalization. The removal of samples may also shift the data distribution away from the actual population.

Figure: Under-sampling
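Random under-sampling can be sketched in a few lines of NumPy. This toy example (950 majority vs. 50 minority samples, made up for illustration) keeps only as many majority samples as there are minority samples:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 950 majority (0) vs 50 minority (1) samples.
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

# Randomly keep only as many majority samples as there are minority samples.
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
idx = np.concatenate([keep_maj, min_idx])

X_res, y_res = X[idx], y[idx]
print(np.bincount(y_res))  # [50 50]
```

Note how 900 majority samples are simply thrown away, which is the loss of information the cons above describe.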

2. Over-sampling: This technique is the reverse of under-sampling. The data of the majority class is kept the same, while the minority class is randomly duplicated until the classes are balanced.

Pros: Very simple technique to implement, with no loss of samples.

Cons: Because the training distribution no longer matches the real class ratio, the model tends to produce more false positives. The model can also overfit, since the duplicated minority samples add no new information.

Figure: Over-sampling
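The mirror-image sketch for random over-sampling, again with made-up toy data: minority indices are drawn with replacement until both classes have 950 samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 950 majority (0) vs 50 minority (1) samples.
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

min_idx = np.where(y == 1)[0]
n_needed = (y == 0).sum() - len(min_idx)  # 900 extra minority copies

# Sample minority indices with replacement until classes are balanced.
extra = rng.choice(min_idx, size=n_needed, replace=True)
idx = np.concatenate([np.arange(len(y)), extra])

X_res, y_res = X[idx], y[idx]
```

Every one of the 900 added rows is an exact copy of one of the original 50 minority samples, which is where the overfitting risk comes from.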

3. Mix-sampling: Mix-sampling is the combination of under-sampling and over-sampling. Part of the majority class is randomly removed, while the minority class is partially over-sampled.

Figure: Mix-sampling
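A mix-sampling sketch combining the two techniques above. The target size of 300 per class is an arbitrary middle-ground assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy imbalanced data: 950 majority (0) vs 50 minority (1) samples.
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

target = 300  # assumed middle-ground size for both classes
maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

keep_maj = rng.choice(maj_idx, size=target, replace=False)  # under-sample
keep_min = rng.choice(min_idx, size=target, replace=True)   # over-sample

idx = np.concatenate([keep_maj, keep_min])
X_res, y_res = X[idx], y[idx]
```

This discards less majority data than pure under-sampling and duplicates minority samples fewer times than pure over-sampling.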

4. SMOTE (Synthetic Minority Over-sampling Technique): In SMOTE we generate additional, synthetic data for the minority class. Rather than duplicating points, SMOTE interpolates between a minority sample and one of its nearest minority neighbors, which behaves like adding a small perturbation. For example, from a child's data point of 95 cm and 40 kg, SMOTE might generate nearby points such as 96 cm and 41 kg, or 94 cm and 39 kg.

Pros: Reduces the overfitting problem of plain over-sampling, with no loss of data as in under-sampling.

Cons: The synthetic points stay within the existing minority distribution, so the model may still generalize poorly to truly novel minority cases. Also, the training data and real data end up with different majority-to-minority ratios, which can lead the model to produce false positives.

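A minimal sketch of SMOTE's core interpolation step, using made-up (height, weight) points echoing the example above. Real implementations (e.g., imbalanced-learn's `SMOTE`) add more machinery, but the essential idea is this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minority samples: (height cm, weight kg), echoing the article's example.
X_min = np.array([[95.0, 40.0], [100.0, 42.0], [92.0, 38.0], [98.0, 41.0]])

def smote_sample(X, k=2, n_new=4, rng=rng):
    """Generate synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)   # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]    # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                     # random position on the segment
        new_points.append(X[i] + gap * (X[j] - X[i]))
    return np.array(new_points)

synthetic = smote_sample(X_min)
```

Each synthetic point lies on the line segment between two real minority samples, so the new heights and weights stay within the range of the originals.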

Lesser known ways to solve the class imbalance problem

  1. Ensemble by dividing the data in the majority class: In this method the majority class is divided into subsets, with the full minority class included in each subset. A separate model is trained on each subset, and the final prediction is an ensemble of the predictions from all these models.
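This method can be sketched with scikit-learn. The data, the choice of 9 subsets, and the use of logistic regression are all illustrative assumptions; the pattern is what matters:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Toy data: 900 majority vs 100 minority samples with shifted means.
X_maj = rng.normal(0.0, 1.0, size=(900, 2))
X_min = rng.normal(1.5, 1.0, size=(100, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 900 + [1] * 100)

# Split the majority class into 9 subsets; pair each with the full minority class.
maj_idx = rng.permutation(np.where(y == 0)[0])
models = []
for subset in np.array_split(maj_idx, 9):
    idx = np.concatenate([subset, np.where(y == 1)[0]])
    models.append(LogisticRegression().fit(X[idx], y[idx]))

# Final prediction: average the probabilities from all subset models.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
y_pred = (proba >= 0.5).astype(int)
```

Each subset model sees a balanced 100-vs-100 training set, so no individual model can ignore the minority class, while the ensemble still uses every majority sample.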

2. Tweaking the loss function: In this method the loss is modified to add a larger penalty when the minority class is predicted incorrectly. This forces the model to focus more on the minority class, and typically improves minority-class recall (often at some cost to precision).
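In scikit-learn, this loss tweak is available as the `class_weight` parameter; `class_weight="balanced"` re-weights errors inversely to class frequency. A sketch on made-up overlapping data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Toy data: 950 majority vs 50 minority samples with overlapping distributions.
X_maj = rng.normal(0.0, 1.0, size=(950, 2))
X_min = rng.normal(1.0, 1.0, size=(50, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 950 + [1] * 50)

# class_weight="balanced" scales the loss so minority errors cost ~19x more.
plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

def recall(model):
    """Fraction of true minority samples the model actually catches."""
    return model.predict(X[y == 1]).mean()

print(recall(plain), recall(weighted))
```

The weighted model catches noticeably more minority samples than the plain one, at the price of more false positives on the majority class.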

Conclusion: We have seen what imbalanced data is and how it causes problems for models, along with both commonly known and lesser-known techniques for solving class imbalance. In the next part we will apply these techniques in practice and see how they affect model performance.

Written by Shivam Agarwal

Shivam is an accomplished analytics professional and algo trader, sharing expertise in algo trading, data science, and AI through insightful publications.