
A practical guide to identifying and managing class imbalances in your datasets, with suitable techniques for training and evaluation.

kafu, 20/04/2025

How to Handle Class Imbalances in Your Dataset

This practical guide explains how to identify and solve class imbalance problems in your computer vision datasets. Class imbalance occurs when certain categories are heavily underrepresented compared to others, which can lead to biased models that perform poorly on minority classes.

Identifying the Imbalance

Before applying solutions, it's important to precisely quantify the imbalance in your dataset.

Class Distribution Analysis

Use Techsolut's dataset analysis tool to visualize the class distribution:

  1. Go to the "Dataset" tab of your project
  2. Click on "Statistics and Analysis"
  3. Check the "Class Distribution" graph
  4. Examine specific metrics:
  5. Imbalance Ratio: ratio between the most frequent and least frequent class
  6. Imbalance Factor: combined measure taking into account all classes

Critical Thresholds

Generally, a dataset is considered significantly imbalanced when:
- The ratio between majority and minority classes exceeds 10:1
- Some classes represent less than 5% of the total examples
- The class distribution strongly deviates from an expected uniform or natural distribution
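A minimal sketch (plain Python, assuming per-class counts exported from the "Statistics and Analysis" panel; the class names and counts are hypothetical) of how these thresholds can be checked:

from collections import Counter

# Hypothetical per-class counts exported from the statistics panel
counts = Counter({"scratch": 4200, "dent": 530, "crack": 45})

total = sum(counts.values())
imbalance_ratio = max(counts.values()) / min(counts.values())
rare_classes = [c for c, n in counts.items() if n / total < 0.05]

print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"Classes under 5% of the data: {rare_classes}")

if imbalance_ratio > 10 or rare_classes:
    print("Significantly imbalanced: consider the strategies below.")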

Data-Level Strategies

1. Oversampling Minority Classes

Oversampling artificially increases the number of examples in underrepresented classes.

How to Apply It in Techsolut:

  1. In the "Data Preparation" tab, select "Class Balancing"
  2. Choose the "Oversampling" method
  3. Select the classes to oversample or use "Auto" for all minority classes
  4. Configure the options:
  5. Simple Method: exact reproduction of existing examples
  6. SMOTE: generation of new synthetic examples by interpolation
  7. Targeted Augmentation: intensive application of augmentation only on minority classes

Advantages:

  • Preserves all available examples
  • Doesn't increase training time per epoch when implemented by reweighting sampling probabilities rather than physically duplicating examples
  • Particularly effective when minority classes have few examples

Disadvantages:

  • Risk of overfitting on minority classes
  • May extend total training time (more examples per epoch)
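Outside the interface, the SMOTE option described above can be sketched with the imbalanced-learn library (an illustrative stand-in, not the Techsolut backend; the toy features replace real image features):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset standing in for extracted image features
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)
print("Before:", Counter(y))

# SMOTE synthesizes new minority examples by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After: ", Counter(y_res))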

2. Undersampling Majority Classes

Undersampling reduces the number of examples in overrepresented classes.

How to Apply:

  1. In the "Data Preparation" tab, select "Class Balancing"
  2. Choose the "Undersampling" method
  3. Configure the options:
  4. Random: random selection of a subset of examples
  5. Clustering: representative selection based on clusters
  6. ENN (Edited Nearest Neighbors): removal of ambiguous or non-representative examples

Advantages:

  • Reduces training time
  • Can improve generalization by removing redundant examples
  • Limits bias towards majority classes

Disadvantages:

  • Potential loss of useful information
  • May reduce performance if deleted data was informative
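A comparable sketch for the undersampling side, again using imbalanced-learn as an illustrative stand-in (Random and ENN have direct equivalents there; the clustering variant is omitted):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, EditedNearestNeighbours

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))

# Random: keep a random subset of the majority class
X_rand, y_rand = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Random undersampling:", Counter(y_rand))

# ENN: drop examples whose nearest neighbors disagree with their label
X_enn, y_enn = EditedNearestNeighbours().fit_resample(X, y)
print("ENN cleaning:", Counter(y_enn))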

3. Combination of Techniques

For extreme imbalances, a hybrid approach is often most effective.

SMOTE-Tomek Technique:

  1. Application of SMOTE to generate synthetic examples of minority classes
  2. Followed by Tomek Links to eliminate pairs of examples from different classes that are close in feature space

How to Apply:

  1. In the "Advanced Preparation" tab, select "Hybrid Balancing"
  2. Choose "SMOTE-Tomek" or "SMOTEBoost"
  3. Configure the target balancing ratio
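The hybrid variant also exists in imbalanced-learn as SMOTETomek, used here purely as an illustration of the technique:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)

# SMOTE first synthesizes minority examples, then Tomek links between
# opposite-class neighbors are removed to clean the class boundary
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))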

Algorithm-Level Strategies

1. Class Weighting in the Loss Function

Assign greater weight to errors made on minority classes.

How to Apply:

  1. In the "Training" tab, "Advanced Configuration" section
  2. Enable "Class Weighting"
  3. Choose the weighting method:
  4. Inversely Proportional: weight = 1 / (class frequency)
  5. Logarithmic: weight = log(N / class frequency)
  6. Effective Number: weights based on effective number of examples
  7. Custom: manually specify weights per class

Example of Automatic Configuration:

# Class proportions in the dataset
# Class A: 70%, Class B: 20%, Class C: 10%
proportions = {"A": 0.70, "B": 0.20, "C": 0.10}

# Inverse weighting: weight = 1 / (class frequency)
# Class A: 1/0.7 = 1.43, Class B: 1/0.2 = 5.0, Class C: 1/0.1 = 10.0
weights = {c: 1.0 / p for c, p in proportions.items()}

# Normalized so the weights sum to 1, for numerical stability
# Class A: 0.087, Class B: 0.304, Class C: 0.609
total = sum(weights.values())
weights = {c: w / total for c, w in weights.items()}
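If you train outside the platform, such weights are typically passed to the loss function. A minimal PyTorch sketch (the framework choice is an assumption, made only for illustration):

import torch
import torch.nn as nn

# Normalized weights from the example above, in class-index order (A, B, C)
class_weights = torch.tensor([0.087, 0.304, 0.609])

# Cross-entropy weighted so that errors on rare classes cost more
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 3)             # batch of 8 predictions over 3 classes
targets = torch.randint(0, 3, (8,))    # ground-truth class indices
loss = criterion(logits, targets)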

2. Specialized Loss Functions

Some loss functions are specifically designed to handle imbalances.

Options Available in Techsolut:

  1. Focal Loss
     • Attenuates the contribution of well-classified examples to focus on difficult ones
     • Parameters: gamma (modulation factor) and alpha (class weighting)
     • Recommended for object detection problems with extreme imbalance (a minimal implementation is sketched after this list)

  2. Dice Loss
     • Based on the Dice coefficient, less sensitive to imbalance
     • Particularly suited to segmentation, where spatial imbalance is common

  3. Asymmetric Loss
     • Asymmetric variant that treats false positives and false negatives differently
     • Useful when one type of error is more critical than the other
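As a reference, here is a minimal focal loss in PyTorch (the standard formulation, not the platform's internal code):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    # Down-weights well-classified examples by (1 - p_t) ** gamma
    # alpha is an optional per-class weight tensor
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets, weight=alpha, reduction="none")
    p_t = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)), gamma=2.0)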

3. Sampling Techniques During Training

Modify the batch sampling strategy to balance class representation.

Available Options:

  1. Class-weighted Sampling
     • Configuration: enable "Balanced Sampling" in the training parameters
     • Each batch contains a similar number of examples from each class (sketched after this list)

  2. Two-phase Learning
     • Phase 1: training on a balanced dataset
     • Phase 2: fine-tuning on the real distribution with a reduced learning rate
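Class-weighted sampling can be sketched with PyTorch's WeightedRandomSampler (an illustration of the mechanism, not the platform's implementation):

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy labels: 90 examples of class 0, 10 of class 1
labels = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(torch.randn(100, 16), labels)

# Draw each example with probability inversely proportional to its class frequency
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=20, sampler=sampler)  # batches come out roughly balanced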

Evaluation Strategies

It's crucial to use appropriate evaluation metrics for imbalanced datasets.

  1. F1-score, Precision, and Recall
     • More informative than accuracy in an imbalanced context
     • Available per class to analyze performance on minority classes

  2. Precision-Recall (PR) Curve
     • More informative than the ROC curve for highly imbalanced datasets
     • Accessible in the "Evaluation" tab > "PR Curves"

  3. Normalized Confusion Matrix
     • Normalization by row (recall per class) or by column (precision per class)
     • Helps identify specific confusions between classes

  4. Cohen's Kappa and Matthews Correlation Coefficient
     • Metrics that account for agreement expected by chance
     • Useful for globally evaluating performance across all classes
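If you evaluate exported predictions outside the platform, all of these metrics are available in scikit-learn; a short sketch on hypothetical predictions:

from sklearn.metrics import (classification_report, confusion_matrix,
                             cohen_kappa_score, matthews_corrcoef)

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]   # hypothetical imbalanced ground truth
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 0, 2]   # hypothetical model predictions

print(classification_report(y_true, y_pred))               # precision, recall, F1 per class
print(confusion_matrix(y_true, y_pred, normalize="true"))   # row-normalized = recall per class
print("Kappa:", cohen_kappa_score(y_true, y_pred))
print("MCC:  ", matthews_corrcoef(y_true, y_pred))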

Advanced Approaches

1. Curriculum Learning

Order the training to gradually expose the model to more difficult examples.

How to Implement:

  1. In the "Advanced Strategies" tab, enable "Curriculum Learning"
  2. Define the progression:
  3. Start with a balanced dataset
  4. Progressively introduce the real imbalance
  5. Finish with the natural distribution
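One possible way to schedule that progression, reusing the weighted-sampler idea from earlier (purely illustrative, not the platform's implementation):

import torch
from torch.utils.data import WeightedRandomSampler

def curriculum_weights(labels, epoch, total_epochs):
    # Interpolate per-example sampling weights from fully balanced (first epoch)
    # to the natural distribution (last epoch)
    counts = torch.bincount(labels).float()
    balanced = (1.0 / counts)[labels]          # equal probability per class
    natural = torch.ones_like(balanced)        # original distribution
    t = epoch / max(total_epochs - 1, 1)       # 0 -> balanced, 1 -> natural
    return (1.0 - t) * balanced / balanced.sum() + t * natural / natural.sum()

labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
for epoch in range(5):
    weights = curriculum_weights(labels, epoch, total_epochs=5)
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    # ... build a DataLoader with this sampler and train one epoch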

2. Classifier Cascades

Use a multi-step approach to separately handle different levels of imbalance.

Implementation Example:

  1. First level: binary classifier "rare class vs rest"
  2. Second level: fine classification on examples identified as potentially from the rare class
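A cascade of this kind can be sketched with two chained scikit-learn classifiers (an illustrative layout on toy data; any pair of models would do):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy data: label 0 = "rest", labels 1 and 2 = two rare defect types
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.94, 0.03, 0.03], random_state=0)

# First level: binary "rare vs rest", thresholded low to favor recall
stage1 = LogisticRegression(max_iter=1000).fit(X, (y > 0).astype(int))
# Second level: fine classification, trained only on the rare examples
stage2 = RandomForestClassifier(random_state=0).fit(X[y > 0], y[y > 0])

flagged = stage1.predict_proba(X)[:, 1] >= 0.2   # permissive first-level threshold
pred = np.zeros(len(X), dtype=int)               # default: "rest"
pred[flagged] = stage2.predict(X[flagged])       # refine only the flagged examples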

3. Targeted Transfer Learning

Leverage pre-trained models with particular attention to minority classes.

  1. Two-phase fine-tuning (sketched below):
     • Phase 1: unfreeze only the last layers and train on a balanced dataset
     • Phase 2: complete fine-tuning with class weighting
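A minimal PyTorch/torchvision sketch of the two phases (the backbone, layer names, and class count are assumptions made for illustration; a recent torchvision is required for the weights argument):

import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)   # hypothetical project with 5 classes

# Phase 1: freeze everything except the classification head, train on balanced data
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# ... train phase 1 ...

# Phase 2: unfreeze the whole network, train with class-weighted loss and a lower learning rate
for param in model.parameters():
    param.requires_grad = True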

Validation and Testing

Essential Stratification

Stratification ensures a proportional distribution of classes between training, validation, and test sets.

Techsolut Configuration:

  1. In "Data Division", enable "Stratification"
  2. Use "Maintain Class Distribution" to preserve original proportions
  3. Or use "Balance Validation" to over-represent minority classes in validation

Adapted Cross Validation

For very minority classes, use stratified cross validation.

  1. Stratified K-Fold: 5 to 10 folds depending on dataset size
  2. Repeat the procedure 3 to 5 times for more robust estimates
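Both recommendations are covered by scikit-learn's RepeatedStratifiedKFold (sketch on toy labels):

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

y = np.array([0] * 95 + [1] * 5)               # very rare minority class
X = np.random.randn(100, 8)

# 5 stratified folds repeated 3 times = 15 splits, each validation fold
# keeping roughly one minority example
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for train_idx, val_idx in cv.split(X, y):
    pass   # fit and evaluate the model on each split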

Practical Example: Industrial Anomaly Detection

Scenario: Detection of rare defects (0.5% of examples) on manufactured products.

  1. Data Preparation:
     • Augmentation of defect images with rotations, mirrors, and contrast variations
     • Synthesis of new examples with conditional GANs
     • Moderate undersampling of normal examples by clustering

  2. Model Configuration:
     • Architecture: EfficientDet or RetinaNet (designed for imbalances)
     • Focal Loss with gamma=2.0, alpha=0.75
     • Bias initialization to initially favor rare classes

  3. Training Strategy:
     • Two-phase learning
     • Balanced batch sampling
     • Callbacks that save checkpoints based on the recall of minority classes

  4. Evaluation:
     • Main metric: class-weighted F1-score
     • Confidence threshold optimized to maximize recall without degrading precision too much (sketched below)
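That last threshold-selection step can be sketched from a precision-recall curve (the scores and the 0.5 precision floor below are hypothetical; in practice they would come from the trained detector and the business constraint):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical defect scores and ground truth from a validation set
y_true = np.array([0] * 95 + [1] * 5)
scores = np.concatenate([np.random.beta(2, 8, 95), np.random.beta(6, 3, 5)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the threshold with the highest recall among those keeping precision >= 0.5
valid = precision[:-1] >= 0.5                  # precision/recall have one extra trailing point
if valid.any():
    idx = np.argmax(np.where(valid, recall[:-1], -1.0))
    print(f"Threshold {thresholds[idx]:.2f}: precision {precision[idx]:.2f}, recall {recall[idx]:.2f}")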

Conclusion

Class imbalance is an omnipresent challenge in real computer vision applications. A systematic approach combining techniques at the data, algorithm, and evaluation levels generally yields the best results.

Remember that there is no universal solution - experiment with different approaches and use Techsolut's analysis features to determine the optimal strategy for your specific case.

Log your experiments in the "Notes and Experiments" tab to keep track of methods tried and their relative performances.
