A practical guide to identifying and managing class imbalance in your datasets, with techniques adapted for training and evaluation.
How to Handle Class Imbalances in Your Dataset
This practical guide explains how to identify and solve class imbalance problems in your computer vision datasets. Class imbalance occurs when certain categories are heavily underrepresented compared to others, which can lead to biased models that perform poorly on minority classes.
Identifying the Imbalance
Before applying solutions, it's important to precisely quantify the imbalance in your dataset.
Class Distribution Analysis
Use Techsolut's dataset analysis tool to visualize the class distribution:
- Go to the "Dataset" tab of your project
- Click on "Statistics and Analysis"
- Check the "Class Distribution" graph
- Examine the specific metrics:
  - Imbalance Ratio: the ratio between the most frequent and least frequent class
  - Imbalance Factor: a combined measure that takes all classes into account
Critical Thresholds
Generally, a dataset is considered significantly imbalanced when:
- The ratio between majority and minority classes exceeds 10:1
- Some classes represent less than 5% of the total examples
- The class distribution strongly deviates from an expected uniform or natural distribution
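If you want to double-check these figures outside Techsolut, the computation takes a few lines of Python; a minimal sketch where the label list is hypothetical:

```python
# Quantify imbalance from a list of per-example class labels.
from collections import Counter

labels = ["ok", "ok", "ok", "scratch", "ok", "dent", "ok", "ok"]  # hypothetical
counts = Counter(labels)
total = sum(counts.values())

# Imbalance ratio: most frequent class count / least frequent class count
imbalance_ratio = max(counts.values()) / min(counts.values())

for cls, n in counts.most_common():
    print(f"{cls}: {n} examples ({100 * n / total:.1f}%)")
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
```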
Data-Level Strategies
1. Oversampling Minority Classes
Oversampling artificially increases the number of examples in underrepresented classes.
How to Apply It in Techsolut:
- In the "Data Preparation" tab, select "Class Balancing"
- Choose the "Oversampling" method
- Select the classes to oversample or use "Auto" for all minority classes
- Configure the options (see the sketch at the end of this subsection):
  - Simple Method: exact reproduction of existing examples
  - SMOTE: generation of new synthetic examples by interpolation
  - Targeted Augmentation: intensive augmentation applied only to minority classes
Advantages:
- Preserves all available examples
- Requires no changes to the model or loss function (a purely data-level fix)
- Particularly effective when minority classes have few examples
Disadvantages:
- Risk of overfitting on minority classes
- May extend total training time (more examples per epoch)
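As a library-level illustration (not Techsolut's implementation), the imbalanced-learn package exposes both the simple duplication method and SMOTE. Note that SMOTE interpolates in feature space, so `X` is assumed to hold feature vectors such as embeddings rather than raw images:

```python
# A hedged sketch using imbalanced-learn (pip install imbalanced-learn).
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # hypothetical feature vectors
y = np.array([0] * 180 + [1] * 20)      # 9:1 imbalance

# Simple method: exact duplication of minority examples
X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: new synthetic minority examples interpolated between neighbors
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_smote))             # classes are now balanced
```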
2. Undersampling Majority Classes
Undersampling reduces the number of examples in overrepresented classes.
How to Apply:
- In the "Data Preparation" tab, select "Class Balancing"
- Choose the "Undersampling" method
- Configure the options (see the sketch at the end of this subsection):
  - Random: random selection of a subset of examples
  - Clustering: representative selection based on clusters
  - ENN (Edited Nearest Neighbors): removal of ambiguous or non-representative examples
Advantages:
- Reduces training time
- Can improve generalization by removing redundant examples
- Limits bias towards majority classes
Disadvantages:
- Potential loss of useful information
- May reduce performance if deleted data was informative
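The same imbalanced-learn library covers all three undersampling options above; again a hedged sketch on hypothetical feature vectors:

```python
import numpy as np
from imblearn.under_sampling import (ClusterCentroids, EditedNearestNeighbours,
                                     RandomUnderSampler)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # hypothetical feature vectors
y = np.array([0] * 180 + [1] * 20)      # 9:1 imbalance

# Random: keep a random subset of the majority class
X_r, y_r = RandomUnderSampler(random_state=0).fit_resample(X, y)
# Clustering: replace majority examples with representative cluster centroids
X_c, y_c = ClusterCentroids(random_state=0).fit_resample(X, y)
# ENN: drop examples whose nearest neighbors disagree with their label
X_e, y_e = EditedNearestNeighbours().fit_resample(X, y)
```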
3. Combination of Techniques
For extreme imbalances, a hybrid approach is often most effective.
SMOTE-Tomek Technique:
- Application of SMOTE to generate synthetic examples of minority classes
- Followed by Tomek Links to eliminate pairs of examples from different classes that are close in feature space
How to Apply:
- In the "Advanced Preparation" tab, select "Hybrid Balancing"
- Choose "SMOTE-Tomek" or "SMOTEBoost"
- Configure the target balancing ratio
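At the library level, the same hybrid is available as imbalanced-learn's SMOTETomek; a hedged sketch:

```python
import numpy as np
from imblearn.combine import SMOTETomek

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # hypothetical feature vectors
y = np.array([0] * 180 + [1] * 20)

# SMOTE first generates synthetic minority examples, then Tomek links
# remove close cross-class pairs that blur the decision boundary.
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X, y)
```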
Algorithm-Level Strategies
1. Class Weighting in the Loss Function
Assign greater weight to errors made on minority classes.
How to Apply:
- In the "Training" tab, "Advanced Configuration" section
- Enable "Class Weighting"
- Choose the weighting method:
  - Inversely Proportional: weight = 1 / (class frequency)
  - Logarithmic: weight = log(N / class frequency)
  - Effective Number: weights based on the effective number of examples
  - Custom: manually specify weights per class
Example of Automatic Configuration:
```python
# Class proportions in the dataset:
# Class A: 70%, Class B: 20%, Class C: 10%
proportions = {"A": 0.70, "B": 0.20, "C": 0.10}

# Inverse weighting: weight = 1 / class frequency
raw_weights = {c: 1.0 / p for c, p in proportions.items()}
# Class A: 1.43, Class B: 5.00, Class C: 10.00

# Normalized for stability (weights sum to 1)
total = sum(raw_weights.values())
weights = {c: w / total for c, w in raw_weights.items()}
print(weights)  # Class A: 0.087, Class B: 0.304, Class C: 0.609
```
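If your training loop is in PyTorch (an assumption; the guide itself is framework-agnostic), such weights are typically passed straight to the loss function:

```python
# A hedged sketch: the tensor holds the normalized weights computed above.
import torch
import torch.nn as nn

class_weights = torch.tensor([0.087, 0.304, 0.609])
criterion = nn.CrossEntropyLoss(weight=class_weights)
```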
2. Specialized Loss Functions
Some loss functions are specifically designed to handle imbalances.
Options Available in Techsolut:
- Focal Loss
  - Attenuates the contribution of well-classified examples to focus on difficult ones
  - Parameters: gamma (modulation factor) and alpha (class weighting)
  - Recommended for object detection problems with extreme imbalance (see the sketch after this list)
- Dice Loss
  - Based on the Dice coefficient, less sensitive to imbalance
  - Particularly suited to segmentation, where spatial imbalance is common
- Asymmetric Loss
  - An asymmetric loss that treats false positives and false negatives differently
  - Useful when one type of error is more critical than the other
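As an illustration of the first option, here is a minimal PyTorch sketch of focal loss for multiclass classification; it follows the common formulation FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) and is not Techsolut's internal implementation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    # log-probability of the true class for each example
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # (1 - pt)**gamma shrinks the loss of already well-classified examples
    return (-alpha * (1 - pt) ** gamma * log_pt).mean()

# Hypothetical usage: 8 examples, 3 classes
loss = focal_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)))
```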
3. Sampling Techniques During Training
Modify the batch sampling strategy to balance class representation.
Available Options:
- Class-weighted Sampling
  - Configuration: enable "Balanced Sampling" in the training parameters
  - Each batch contains a similar number of examples from each class (see the sketch after this list)
- Two-phase Learning
  - Phase 1: training on a balanced dataset
  - Phase 2: fine-tuning on the real distribution with a reduced learning rate
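Outside the UI, class-weighted sampling maps naturally onto PyTorch's built-in WeightedRandomSampler; a hedged sketch with a hypothetical label tensor:

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0] * 180 + [1] * 20)   # hypothetical training labels
class_counts = torch.bincount(labels).float()

# Draw each example with probability inversely proportional to the
# frequency of its class, so batches come out roughly balanced.
example_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(example_weights, num_samples=len(labels),
                                replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)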
Evaluation Strategies
It's crucial to use appropriate evaluation metrics for imbalanced datasets.
Recommended Metrics
- F1-score, Precision, and Recall
  - More informative than accuracy in an imbalanced context
  - Available per class to analyze performance on minority classes
- Precision-Recall (PR) Curve
  - More informative than the ROC curve for highly imbalanced datasets
  - Accessible in the "Evaluation" tab > "PR Curves"
- Normalized Confusion Matrix
  - Normalized by row (recall per class) or by column (precision per class)
  - Helps identify specific confusions between classes
- Cohen's Kappa and Matthews Correlation Coefficient
  - Metrics that account for agreement expected by chance
  - Useful for globally evaluating performance across all classes (a scikit-learn sketch follows this list)
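If you export predictions, most of these metrics are also available in scikit-learn; a hedged sketch with hypothetical label arrays:

```python
from sklearn.metrics import (classification_report, cohen_kappa_score,
                             confusion_matrix, matthews_corrcoef)

y_true = [0, 0, 0, 0, 1, 1, 2, 0, 1, 2]   # hypothetical ground truth
y_pred = [0, 0, 1, 0, 1, 1, 2, 0, 0, 2]   # hypothetical predictions

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred))
# Row-normalized confusion matrix (recall per class)
print(confusion_matrix(y_true, y_pred, normalize="true"))
# Chance-corrected global metrics
print(cohen_kappa_score(y_true, y_pred))
print(matthews_corrcoef(y_true, y_pred))
```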
Advanced Approaches
1. Curriculum Learning
Order the training to gradually expose the model to more difficult examples.
How to Implement:
- In the "Advanced Strategies" tab, enable "Curriculum Learning"
- Define the progression (sketched in code below):
  - Start with a balanced dataset
  - Progressively introduce the real imbalance
  - Finish with the natural distribution
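One way to script such a progression is to interpolate per-class sampling probabilities from uniform to the natural class frequencies over the epochs; a hedged sketch where the frequencies and epoch count are hypothetical:

```python
import numpy as np

natural_freqs = np.array([0.7, 0.2, 0.1])               # hypothetical
uniform = np.full_like(natural_freqs, 1.0 / len(natural_freqs))
num_epochs = 10

for epoch in range(num_epochs):
    t = epoch / (num_epochs - 1)        # 0 -> balanced, 1 -> natural
    class_probs = (1 - t) * uniform + t * natural_freqs
    # feed class_probs into a weighted sampler for this epoch
    print(epoch, np.round(class_probs, 3))
```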
2. Classifier Cascades
Use a multi-step approach to separately handle different levels of imbalance.
Implementation Example:
- First level: binary classifier "rare class vs rest"
- Second level: fine classification on examples identified as potentially from the rare class
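A minimal sketch of what inference through such a cascade could look like; everything here (the model objects, their scikit-learn-style interface, the threshold value) is a hypothetical assumption:

```python
def cascade_predict(x, binary_model, fine_model, threshold=0.3):
    # Level 1: "rare class vs rest", with a deliberately low threshold
    # so that few rare candidates are missed (favor recall at this stage)
    p_rare = binary_model.predict_proba([x])[0][1]
    if p_rare < threshold:
        return "common"
    # Level 2: fine-grained classification, run only on rare candidates
    return fine_model.predict([x])[0]
```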
3. Targeted Transfer Learning
Leverage pre-trained models with particular attention to minority classes.
Recommended Technique:
- Two-phase fine-tuning (sketched in code below):
  - Phase 1: unfreeze only the last layers and train on a balanced dataset
  - Phase 2: complete fine-tuning with class weighting
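A hedged PyTorch/torchvision sketch of the two phases; the specific backbone and the 3-class head are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")   # assumed backbone

# Phase 1: freeze the backbone, train only a fresh head on a balanced dataset
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 3)      # new head stays trainable

# ... train phase 1 here ...

# Phase 2: unfreeze everything and fine-tune with class weighting
for param in model.parameters():
    param.requires_grad = True
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.087, 0.304, 0.609]))
```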
Validation and Testing
Essential Stratification
Stratification ensures a proportional distribution of classes between training, validation, and test sets.
Techsolut Configuration:
- In "Data Division", enable "Stratification"
- Use "Maintain Class Distribution" to preserve original proportions
- Or use "Balance Validation" to over-represent minority classes in validation
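For teams scripting their own splits, the scikit-learn equivalent of stratification is a one-liner; `X` and `y` are hypothetical features and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = np.array([0] * 180 + [1] * 20)

# stratify=y keeps class proportions identical across the two splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```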
Adapted Cross Validation
For classes with very few examples, use stratified cross-validation.
Recommended Configuration:
- Stratified K-Fold: 5 to 10 folds depending on dataset size
- Multiple repetitions (3 to 5) for greater robustness
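This configuration maps directly onto scikit-learn's RepeatedStratifiedKFold; a hedged sketch on hypothetical data:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = np.array([0] * 180 + [1] * 20)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    pass  # train on X[train_idx], evaluate on X[val_idx]
```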
Practical Example: Industrial Anomaly Detection
Scenario: Detection of rare defects (0.5% of examples) on manufactured products.
Recommended Approach:
- Data Preparation:
  - Augmentation of defect images with rotations, mirrors, and contrast variations
  - Synthesis of new examples with conditional GANs
  - Moderate undersampling of normal examples by clustering
- Model Configuration:
  - Architecture: EfficientDet or RetinaNet (both designed with class imbalance in mind)
  - Focal Loss with gamma=2.0, alpha=0.75
  - Bias initialization to initially favor rare classes (see the sketch after this list)
- Training Strategy:
  - Two-phase learning
  - Balanced batch sampling
  - Checkpoint callbacks driven by recall on the minority classes
- Evaluation:
  - Main metric: class-weighted F1-score
  - Confidence threshold optimized to maximize recall without degrading precision too much
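The bias initialization mentioned under Model Configuration is often done as in the RetinaNet paper; a hedged sketch where the head's shape and the prior value are illustrative assumptions:

```python
# With a rare-class prior pi, setting the final classification bias to
# -log((1 - pi) / pi) makes the initial predicted probability
# sigmoid(bias) = pi, which keeps early training from being swamped by
# the overwhelming majority of easy negative examples.
import math
import torch.nn as nn

pi = 0.01                                               # assumed prior
cls_head = nn.Conv2d(256, 9, kernel_size=3, padding=1)  # hypothetical head
nn.init.constant_(cls_head.bias, -math.log((1 - pi) / pi))
```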
Conclusion
Class imbalance is a ubiquitous challenge in real-world computer vision applications. A systematic approach combining techniques at the data, algorithm, and evaluation levels generally yields the best results.
Remember that there is no universal solution: experiment with different approaches and use Techsolut's analysis features to determine the optimal strategy for your specific case.
Log your experiments in the "Notes and Experiments" tab to keep track of methods tried and their relative performances.