
A practical guide to identifying and managing class imbalances in your datasets, with suitable techniques for training and evaluation.

kafu, 20/04/2025

How to Handle Class Imbalances in Your Dataset

This practical guide explains how to identify and solve class imbalance problems in your computer vision datasets. Class imbalance occurs when certain categories are heavily underrepresented compared to others, which can lead to biased models that perform poorly on minority classes.

Identifying the Imbalance

Before applying solutions, it's important to precisely quantify the imbalance in your dataset.

Class Distribution Analysis

Use Techsolut's dataset analysis tool to visualize the class distribution:

  1. Go to the "Dataset" tab of your project
  2. Click on "Statistics and Analysis"
  3. Check the "Class Distribution" graph
  4. Examine specific metrics:
  5. Imbalance Ratio: ratio between the most frequent and least frequent class
  6. Imbalance Factor: combined measure taking into account all classes

Critical Thresholds

Generally, a dataset is considered significantly imbalanced when:
- The ratio between majority and minority classes exceeds 10:1
- Some classes represent less than 5% of the total examples
- The class distribution strongly deviates from an expected uniform or natural distribution
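A minimal sketch (plain Python, assuming per-class counts exported from the "Statistics and Analysis" panel; the class names and counts are hypothetical) of how these thresholds can be checked:

from collections import Counter

# Hypothetical per-class counts exported from the statistics panel
counts = Counter({"scratch": 4200, "dent": 530, "crack": 45})

total = sum(counts.values())
imbalance_ratio = max(counts.values()) / min(counts.values())
rare_classes = [c for c, n in counts.items() if n / total < 0.05]

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"Classes under 5% of the data: {rare_classes}")

if imbalance_ratio > 10 or rare_classes:
    print("Significantly imbalanced: consider the strategies below.")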

Data-Level Strategies

1. Oversampling Minority Classes

Oversampling artificially increases the number of examples in underrepresented classes.

How to Apply It in Techsolut:

  1. In the "Data Preparation" tab, select "Class Balancing"
  2. Choose the "Oversampling" method
  3. Select the classes to oversample or use "Auto" for all minority classes
  4. Configure the options:
  5. Simple Method: exact reproduction of existing examples
  6. SMOTE: generation of new synthetic examples by interpolation
  7. Targeted Augmentation: intensive application of augmentation only on minority classes

Advantages:

  • Preserves all available examples
  • Doesn't increase training time per epoch when implemented by reweighting sampling probabilities rather than physically duplicating examples
  • Particularly effective when minority classes have few examples

Disadvantages:

  • Risk of overfitting on minority classes
  • May extend total training time (more examples per epoch)
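Outside the interface, the SMOTE option described above can be sketched with the imbalanced-learn library (an illustrative stand-in, not the Techsolut backend; the toy features replace real image features):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset standing in for extracted image features
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)
print("Before:", Counter(y))

# SMOTE synthesizes new minority examples by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After: ", Counter(y_res))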

2. Undersampling Majority Classes

Undersampling reduces the number of examples in overrepresented classes.

How to Apply:

  1. In the "Data Preparation" tab, select "Class Balancing"
  2. Choose the "Undersampling" method
  3. Configure the options:
  4. Random: random selection of a subset of examples
  5. Clustering: representative selection based on clusters
  6. ENN (Edited Nearest Neighbors): removal of ambiguous or non-representative examples

Advantages:

  • Reduces training time
  • Can improve generalization by removing redundant examples
  • Limits bias towards majority classes

Disadvantages:

  • Potential loss of useful information
  • May reduce performance if deleted data was informative
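A comparable sketch for the undersampling side, again using imbalanced-learn as an illustrative stand-in (Random and ENN have direct equivalents there; the clustering variant is omitted):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))

# Random: keep a random subset of the majority class
X_rand, y_rand = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Random undersampling:", Counter(y_rand))

# ENN: drop examples whose nearest neighbors disagree with their label
X_enn, y_enn = EditedNearestNeighbours().fit_resample(X, y)
print("ENN cleaning:", Counter(y_enn))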

3. Combination of Techniques

For extreme imbalances, a hybrid approach is often most effective.

SMOTE-Tomek Technique:

  1. Application of SMOTE to generate synthetic examples of minority classes
  2. Followed by Tomek Links to eliminate pairs of examples from different classes that are close in feature space

How to Apply:

  1. In the "Advanced Preparation" tab, select "Hybrid Balancing"
  2. Choose "SMOTE-Tomek" or "SMOTEBoost"
  3. Configure the target balancing ratio
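The hybrid variant also exists in imbalanced-learn as SMOTETomek, used here purely as an illustration of the technique:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)

# SMOTE first synthesizes minority examples, then Tomek links between
# opposite-class neighbors are removed to clean the class boundary
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))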

Algorithm-Level Strategies

1. Class Weighting in the Loss Function

Assign greater weight to errors made on minority classes.

How to Apply:

  1. In the "Training" tab, "Advanced Configuration" section
  2. Enable "Class Weighting"
  3. Choose the weighting method:
  4. Inversely Proportional: weight = 1 / (class frequency)
  5. Logarithmic: weight = log(N / class frequency)
  6. Effective Number: weights based on effective number of examples
  7. Custom: manually specify weights per class

Example of Automatic Configuration:

# Class proportions in the dataset
# Class A: 70%, Class B: 20%, Class C: 10%
proportions = {"A": 0.70, "B": 0.20, "C": 0.10}

# Inverse weighting: weight = 1 / (class frequency)
# Class A: 1/0.7 = 1.43, Class B: 1/0.2 = 5.0, Class C: 1/0.1 = 10.0
weights = {c: 1.0 / p for c, p in proportions.items()}

# Normalized so the weights sum to 1, for numerical stability
# Class A: 0.087, Class B: 0.304, Class C: 0.609
total = sum(weights.values())
weights = {c: w / total for c, w in weights.items()}
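If you train outside the platform, such weights are typically passed to the loss function. A minimal PyTorch sketch (the framework choice is an assumption, made only for illustration):

import torch
import torch.nn as nn

# Normalized weights from the example above, in class-index order (A, B, C)
class_weights = torch.tensor([0.087, 0.304, 0.609])

# Cross-entropy weighted so that errors on rare classes cost more
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 3)             # batch of 8 predictions over 3 classes
targets = torch.randint(0, 3, (8,))    # ground-truth class indices
loss = criterion(logits, targets)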

2. Specialized Loss Functions

Some loss functions are specifically designed to handle imbalances.

Options Available in Techsolut:

  1. Focal Loss
     • Attenuates the contribution of well-classified examples to focus on difficult ones
     • Parameters: gamma (modulation factor) and alpha (class weighting)
     • Recommended for object detection problems with extreme imbalance (a minimal implementation is sketched after this list)

  2. Dice Loss
     • Based on the Dice coefficient, less sensitive to imbalance
     • Particularly suited to segmentation, where spatial imbalance is common

  3. Asymmetric Loss
     • Asymmetric variant that treats false positives and false negatives differently
     • Useful when one type of error is more critical than the other
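As a reference, here is a minimal focal loss in PyTorch (the standard formulation, not the platform's internal code):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    # Down-weights well-classified examples by (1 - p_t) ** gamma
    # alpha is an optional per-class weight tensor
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")
    p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)), gamma=2.0)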

3. Sampling Techniques During Training

Modify the batch sampling strategy to balance class representation.

Available Options:

  1. Class-weighted Sampling
     • Configuration: enable "Balanced Sampling" in the training parameters
     • Each batch contains a similar number of examples from each class (sketched after this list)

  2. Two-phase Learning
     • Phase 1: training on a balanced dataset
     • Phase 2: fine-tuning on the real distribution with a reduced learning rate
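Class-weighted sampling can be sketched with PyTorch's WeightedRandomSampler (an illustration of the mechanism, not the platform's implementation):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy labels: 90 examples of class 0, 10 of class 1
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(torch.randn(100, 16), labels)

# Draw each example with probability inversely proportional to its class frequency
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=20, sampler=sampler)  # batches come out roughly balanced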

Evaluation Strategies

It's crucial to use appropriate evaluation metrics for imbalanced datasets.

  1. F1-score, Precision, and Recall
     • More informative than accuracy in an imbalanced context
     • Available per class to analyze performance on minority classes

  2. Precision-Recall (PR) Curve
     • More informative than the ROC curve for highly imbalanced datasets
     • Accessible in the "Evaluation" tab > "PR Curves"

  3. Normalized Confusion Matrix
     • Normalization by row (recall per class) or by column (precision per class)
     • Helps identify specific confusions between classes

  4. Cohen's Kappa and Matthews Correlation Coefficient
     • Metrics that account for agreement expected by chance
     • Useful for globally evaluating performance across all classes
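If you evaluate exported predictions outside the platform, all of these metrics are available in scikit-learn; a short sketch on hypothetical predictions:

from sklearn.metrics import (classification_report, confusion_matrix,
                             cohen_kappa_score, matthews_corrcoef)

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]   # hypothetical imbalanced ground truth
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 0, 2]   # hypothetical model predictions

print(classification_report(y_true, y_pred))               # precision, recall, F1 per class
print(confusion_matrix(y_true, y_pred, normalize="true"))   # row-normalized = recall per class
print("Kappa:", cohen_kappa_score(y_true, y_pred))
print("MCC:  ", matthews_corrcoef(y_true, y_pred))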

Advanced Approaches

1. Curriculum Learning

Order the training to gradually expose the model to more difficult examples.

How to Implement:

  1. In the "Advanced Strategies" tab, enable "Curriculum Learning"
  2. Define the progression:
  3. Start with a balanced dataset
  4. Progressively introduce the real imbalance
  5. Finish with the natural distribution
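One possible way to schedule that progression, reusing the weighted-sampler idea from earlier (purely illustrative, not the platform's implementation):

import torch
from torch.utils.data import WeightedRandomSampler

def curriculum_weights(labels, epoch, total_epochs):
    # Interpolate per-example sampling weights from fully balanced (first epoch)
    # to the natural distribution (last epoch)
    counts = torch.bincount(labels).float()
    balanced = (1.0 / counts)[labels]          # equal probability per class
    natural = torch.ones_like(balanced)        # original distribution
    t = epoch / max(total_epochs - 1, 1)       # 0 -> balanced, 1 -> natural
    return (1.0 - t) * balanced / balanced.sum() + t * natural / natural.sum()

labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
for epoch in range(5):
    weights = curriculum_weights(labels, epoch, total_epochs=5)
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    # ... build a DataLoader with this sampler and train one epoch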

2. Classifier Cascades

Use a multi-step approach to separately handle different levels of imbalance.

Implementation Example:

  1. First level: binary classifier "rare class vs rest"
  2. Second level: fine classification on examples identified as potentially from the rare class
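A cascade of this kind can be sketched with two chained scikit-learn classifiers (an illustrative layout on toy data; any pair of models would do):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy data: label 0 = "rest", labels 1 and 2 = two rare defect types
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.94, 0.03, 0.03], random_state=0)

# First level: binary "rare vs rest", thresholded low to favor recall
stage1 = LogisticRegression(max_iter=1000).fit(X, (y > 0).astype(int))
# Second level: fine classification, trained only on the rare examples
stage2 = RandomForestClassifier(random_state=0).fit(X[y > 0], y[y > 0])

flagged = stage1.predict_proba(X)[:, 1] >= 0.2   # permissive first-level threshold
pred = np.zeros(len(X), dtype=int)               # default: "rest"
pred[flagged] = stage2.predict(X[flagged])       # refine only the flagged examples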

3. Targeted Transfer Learning

Leverage pre-trained models with particular attention to minority classes.

  1. Two-phase fine-tuning (sketched below):
     • Phase 1: unfreeze only the last layers and train on a balanced dataset
     • Phase 2: complete fine-tuning with class weighting
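A minimal PyTorch/torchvision sketch of the two phases (the backbone, layer names, and class count are assumptions made for illustration; a recent torchvision is required for the weights argument):

import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)   # hypothetical project with 5 classes

# Phase 1: freeze everything except the classification head, train on balanced data
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# ... train phase 1 ...

# Phase 2: unfreeze the whole network, train with class-weighted loss and a lower learning rate
for param in model.parameters():
    param.requires_grad = True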

Validation and Testing

Essential Stratification

Stratification ensures a proportional distribution of classes between training, validation, and test sets.

Techsolut Configuration:

  1. In "Data Division", enable "Stratification"
  2. Use "Maintain Class Distribution" to preserve original proportions
  3. Or use "Balance Validation" to over-represent minority classes in validation

Adapted Cross Validation

For very minority classes, use stratified cross validation.

  1. Stratified K-Fold: 5 to 10 folds depending on dataset size
  2. Repeat the procedure 3 to 5 times for more robust estimates
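Both recommendations are covered by scikit-learn's RepeatedStratifiedKFold (sketch on toy labels):

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([0] * 95 + [1] * 5)               # very rare minority class
X = np.random.randn(100, 8)

# 5 stratified folds repeated 3 times = 15 splits, each validation fold
# keeping roughly one minority example
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    pass   # fit and evaluate the model on each split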

Practical Example: Industrial Anomaly Detection

Scenario: Detection of rare defects (0.5% of examples) on manufactured products.

  1. Data Preparation:
     • Augmentation of defect images with rotations, mirrors, and contrast variations
     • Synthesis of new examples with conditional GANs
     • Moderate undersampling of normal examples by clustering

  2. Model Configuration:
     • Architecture: EfficientDet or RetinaNet (designed for imbalances)
     • Focal Loss with gamma=2.0, alpha=0.75
     • Bias initialization to initially favor rare classes

  3. Training Strategy:
     • Two-phase learning
     • Balanced batch sampling
     • Callbacks that save checkpoints based on the recall of minority classes

  4. Evaluation:
     • Main metric: class-weighted F1-score
     • Confidence threshold optimized to maximize recall without degrading precision too much (sketched below)
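That last threshold-selection step can be sketched from a precision-recall curve (the scores and the 0.5 precision floor below are hypothetical; in practice they would come from the trained detector and the business constraint):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical defect scores and ground truth from a validation set
y_true = np.array([0] * 95 + [1] * 5)
scores = np.concatenate([np.random.beta(2, 8, 95), np.random.beta(6, 3, 5)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the threshold with the highest recall among those keeping precision >= 0.5
valid = precision[:-1] >= 0.5                  # precision/recall have one extra trailing point
if valid.any():
    idx = np.argmax(np.where(valid, recall[:-1], -1.0))
    print(f"Threshold {thresholds[idx]:.2f}: precision {precision[idx]:.2f}, recall {recall[idx]:.2f}")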

Conclusion

Class imbalance is an omnipresent challenge in real computer vision applications. A systematic approach combining techniques at the data, algorithm, and evaluation levels generally yields the best results.

Remember that there is no universal solution - experiment with different approaches and use Techsolut's analysis features to determine the optimal strategy for your specific case.

Log your experiments in the "Notes and Experiments" tab to keep track of methods tried and their relative performances.
