Common Errors During Model Installation and Deployment

Troubleshooting guide for common errors encountered during model installation and deployment in production.

kafu 20/04/2025 20 views


This troubleshooting guide covers the most frequently encountered errors when installing the Techsolut platform and deploying computer vision models in production. For each problem, we provide a precise diagnosis and step-by-step solutions.

Installation Problems

Error: "Missing Python Dependencies"

Symptoms

  • Error messages like ModuleNotFoundError: No module named 'X'
  • Installation aborts partway through with dependency errors
  • Version conflicts between packages

Solutions

  1. Use the Recommended Virtual Environment
    python -m venv techsolut_env
    source techsolut_env/bin/activate   # Linux/Mac
    techsolut_env\Scripts\activate      # Windows

  2. Install All Dependencies with requirements.txt
    pip install --upgrade pip setuptools wheel
    pip install -r requirements.txt

  3. In Case of Version Conflicts
    • Create a clean new virtual environment
    • Install packages in the order specified in the documentation
    • Use the --no-deps option for problematic packages, then manually install their dependencies

  4. For Issues with Binaries (PyTorch, TensorFlow)
    • Check CUDA compatibility if using a GPU
    • Install the specific version compatible with your hardware:
      pip install torch==X.X.X+cu11X -f https://download.pytorch.org/whl/torch_stable.html
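
Before reinstalling anything, it can help to see exactly which modules are missing from the active environment. The following is a minimal sketch using only the Python standard library. The module names in the usage lines are illustrative assumptions, not the actual contents of Techsolut's requirements.txt, and both helper functions are hypothetical names introduced here.

```python
import importlib.util
from importlib import metadata


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical list: replace it with the modules your installation needs
required = ["torch", "numpy", "flask"]
print("Missing modules:", find_missing_modules(required))
print("pip version:", installed_version("pip"))
```

Run it inside the activated virtual environment: a module reported as missing there is the one that would trigger ModuleNotFoundError at startup.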

Error: "CUDA Not Available"

Symptoms

  • Messages indicating CUDA not available or No CUDA GPUs are available
  • Very slow performance during training
  • Errors when launching GPU tasks

Solutions

  1. Check NVIDIA Driver Installation
    nvidia-smi
    If this command fails, reinstall the NVIDIA drivers.

  2. Check CUDA Version
    nvcc --version
    Make sure it's compatible with your PyTorch/TensorFlow version.

  3. Reinstall PyTorch with the Appropriate CUDA Support
    pip uninstall torch
    pip install torch==X.X.X+cuXXX -f https://download.pytorch.org/whl/torch_stable.html

  4. Test CUDA Availability
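    import torch

    print(torch.cuda.is_available())   # True if PyTorch can use a CUDA device
    print(torch.cuda.device_count())   # number of visible GPUs
    if torch.cuda.is_available():      # get_device_name raises when no GPU is visible
        print(torch.cuda.get_device_name(0))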

  5. If CUDA Is Not Available on Your Machine
    • Configure Techsolut to use CPU only
    • Or use our remote computing option on our GPU servers
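
How Techsolut itself is switched to CPU-only mode is not shown in this guide. In your own PyTorch code, a common pattern is to choose the device at startup and fall back to the CPU when no GPU is visible. The sketch below illustrates that pattern; pick_device is a hypothetical helper, not part of the Techsolut API.

```python
def pick_device(prefer_gpu=True):
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
# With PyTorch installed, the model and its inputs would then be moved with
# model.to(device) and inputs.to(device).
```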

Error: "Database Connection Failure"

Symptoms

  • Error messages like OperationalError: unable to open database file
  • Unable to start the application
  • Failure during database creation or migration

Solutions

  1. Check Permissions on the Database Folder
    ls -la /path/to/db/folder
    chmod -R 755 /path/to/db/folder

  2. Verify Connection Configuration
    • Make sure the information in config.py is correct
    • For PostgreSQL/MySQL, check that the service is active and accessible

  3. Reset the Database (if possible)
    flask db reset     # Warning: this deletes all existing data
    flask db upgrade

  4. For Remote Databases
    • Check that the firewall allows connections
    • Test the connection with a standard SQL client
    • Check your host's limitations (quotas, number of connections)
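
The message "unable to open database file" comes from SQLite, and it usually points at the file path or its permissions rather than at the database contents. The sketch below checks the usual suspects with the standard library only. The function name is hypothetical and the path you pass in is whatever your config.py points to.

```python
import os
import sqlite3


def diagnose_sqlite_path(db_path):
    """Return a list of human-readable problems found for a SQLite file path."""
    problems = []
    folder = os.path.dirname(os.path.abspath(db_path))
    if not os.path.isdir(folder):
        problems.append(f"folder does not exist: {folder}")
        return problems
    # SQLite creates journal files next to the database, so the folder
    # itself must be writable, not just the file
    if not os.access(folder, os.W_OK | os.X_OK):
        problems.append(f"folder is not writable: {folder}")
    if os.path.exists(db_path):
        if not os.access(db_path, os.R_OK | os.W_OK):
            problems.append(f"file is not readable and writable: {db_path}")
        elif not problems:
            try:
                # A trivial query confirms the file really opens as a database
                conn = sqlite3.connect(db_path)
                conn.execute("SELECT 1")
                conn.close()
            except sqlite3.Error as exc:
                problems.append(f"sqlite3 could not open the file: {exc}")
    return problems
```

An empty list means the path looks usable. The connection test only runs when the file already exists, so the check never creates a database as a side effect.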

Deployment Problems

Error: "Insufficient Memory During Inference"

Symptoms

  • CUDA out of memory errors
  • Application that crashes when processing images
  • Performance that degrades over time

Solutions

  1. Reduce Batch Size
    • In config.py, set BATCH_SIZE to a smaller value
    • For the API, limit the number of simultaneous requests

  2. Optimize the Model for Inference
    • Use model quantization:
      from techsolut.optimization import quantize_model
      quantized_model = quantize_model(model, quantization_type='dynamic')

  3. Use Memory-Efficient Inference Mode
    • Enable gradient-free mode:
      import torch
      with torch.no_grad():  # no autograd graph is kept, which lowers memory use
          predictions = model(inputs)
    • Use our optimized inference function:
      from techsolut.inference import efficient_predict
      results = efficient_predict(model, data, max_batch_size=4)

  4. Free GPU Memory Regularly
    • After each inference of a large batch:
      import torch
      torch.cuda.empty_cache()  # releases cached blocks that no live tensor still uses
    • For long-running services, schedule periodic restarts
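
The idea behind capping the batch size can be sketched independently of any framework: split the inputs into chunks and call the model once per chunk, so that only one chunk is in memory at a time. This is an illustration of the principle, not the implementation of efficient_predict, whose internals are not shown here. predict_in_chunks and fake_model are hypothetical names.

```python
def predict_in_chunks(predict_fn, items, max_batch_size=4):
    """Call predict_fn on slices of items no larger than max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    results = []
    for start in range(0, len(items), max_batch_size):
        chunk = items[start:start + max_batch_size]
        # Only this chunk is passed to the model at once
        results.extend(predict_fn(chunk))
    return results


# Stand-in for a real model call: returns one "prediction" per input
def fake_model(batch):
    return [x * 2 for x in batch]


print(predict_in_chunks(fake_model, list(range(10)), max_batch_size=4))
```

Results come back in input order, so callers do not need to know that the work was split.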

Error: "High Latency in Production"

Symptoms

  • Very slow response time in production
  • Acceptable performance in development but not in production
  • Request timeouts

Solutions

  1. Measure and Identify the Bottleneck
    from techsolut.profiling import profile_inference
    profile_results = profile_inference(model, sample_input)
    print(profile_results)

  2. Optimize Image Preprocessing
    • Use GPU resizing if possible
    • Preprocess images in batches
    • Use our optimized pipeline:
      from techsolut.preprocessing import FastImageProcessor
      processor = FastImageProcessor(device='cuda')

  3. Implement Request Batching
    • Collect requests for a short interval
    • Process them together rather than individually
    • Use our automatic batching middleware:
      from techsolut.serving import BatchingMiddleware
      app = BatchingMiddleware(app, batch_size=16, timeout=0.1)

  4. Use Result Caching
    • For repetitive or similar inputs
    • Configure caching in config.py
    • Or use a solution like Redis for distributed caching
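
To make the caching idea concrete, here is a minimal in-process sketch that keys results on a hash of the raw input bytes and evicts the least recently used entry when full. It assumes byte-identical inputs: "similar" images would need a different key, such as a perceptual hash. PredictionCache is a hypothetical class, unrelated to the caching options in config.py.

```python
import hashlib
from collections import OrderedDict


class PredictionCache:
    """Small in-memory LRU cache keyed by the SHA-256 of the raw input bytes."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, image_bytes, predict_fn):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        result = predict_fn(image_bytes)
        self._entries[key] = result
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return result


cache = PredictionCache(max_entries=2)
# len() stands in for a real (expensive) model call
print(cache.get_or_compute(b"raw image bytes", len))
print(cache.get_or_compute(b"raw image bytes", len), "hits:", cache.hits)
```

A cache like this lives in a single process. With several workers, each one keeps its own copy, which is the case where a shared store such as Redis becomes useful.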

Error: "Incompatible Model Format"

Symptoms

  • Errors when loading the model in production
  • Messages like Error loading model or Unsupported op
  • Inconsistencies between dev/prod results

Solutions

  1. Check Version Compatibility
    • Make sure PyTorch/TensorFlow versions are identical in development and production
    • Use the same CPU/GPU architecture in dev and prod if possible

  2. Convert the Model to a Standard Format
    • Export to ONNX for better portability:
      from techsolut.export import convert_to_onnx
      convert_to_onnx(model, 'model.onnx', input_shape=[1, 3, 224, 224])

  3. Verify Model Integrity
    • Compare checksums of model files
    • Use our verification tool:
      techsolut-cli verify-model path/to/model.pth

  4. Use the Correct Model Version
    • Check that you're not using a training checkpoint instead of the final model
    • For Techsolut models, use the dedicated export:
      model.export(format='production', optimized=True)
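
Comparing checksums needs nothing beyond the standard library. The sketch below computes a SHA-256 digest in chunks, so large model files are never loaded into memory at once. The function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_model_file(dev_path, prod_path):
    """Return True when both files have byte-identical contents."""
    return file_sha256(dev_path) == file_sha256(prod_path)
```

Compute the digest on the development machine, then on the production host after the transfer. Differing digests mean the file was truncated, corrupted, or replaced by another version along the way.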

Integration Problems

Error: "Integration Failure with Existing Systems"

Symptoms

  • Errors when exchanging data with other systems
  • Format incompatibilities between Techsolut and your systems
  • Data synchronization issues

Solutions

  1. Use Integration Adapters
    • Check our adapter library in /techsolut/integrations/
    • Install the adapter specific to your system:
      pip install techsolut-adapter-erp

  2. Correctly Configure Webhooks
    • Check webhook URLs and formats
    • Test with our diagnostic tool:
      techsolut-cli test-webhook http://your-system.com/webhook

  3. Use Compatibility Mode
    • Enable it in integration settings
    • Specify your external system version
    • Use automatic format converters

  4. Check Integration-Specific Logs
    • Enable detailed logging in config.py
    • Examine /logs/integration.log for specific error messages
    • Use our integration diagnostic tool:
      techsolut-cli diagnose-integration --system=SAP --level=verbose
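
A webhook can also be checked by hand, which helps separate URL or format mistakes from problems on the receiving system. The sketch below validates the URL and builds a JSON POST request with the standard library. The payload is a made-up placeholder, since the real event format depends on your integration, and build_webhook_request is a hypothetical helper.

```python
import json
import urllib.request
from urllib.parse import urlparse


def build_webhook_request(url, payload):
    """Validate a webhook URL and return a JSON POST request for it."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError("webhook URL has no host")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload, for illustration only
request = build_webhook_request("http://your-system.com/webhook", {"event": "ping"})
print(request.get_method(), request.full_url)
# To actually send it:
#   with urllib.request.urlopen(request, timeout=5) as response:
#       print(response.status)
```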

Error: "API Authentication Problems"

Symptoms

  • 401 or 403 errors during API calls
  • Tokens unexpectedly expiring
  • Intermittent authentication issues

Solutions

  1. Check API Key Configuration
    • Make sure the keys in .env or secrets.yaml are correct
    • Regenerate API keys if necessary via the admin portal

  2. Correctly Configure Authentication
    • For OAuth2, ensure the flow is correctly implemented
    • Use our authentication helper:
      from techsolut.auth import OAuth2Helper
      auth = OAuth2Helper(client_id, client_secret, redirect_uri)

  3. Manage Token Renewal
    • Implement automatic token refresh logic
    • Use our token management middleware:
      from techsolut.auth import TokenRefreshMiddleware
      app.wsgi_app = TokenRefreshMiddleware(app.wsgi_app)

  4. Test End-to-End Authentication
    • Use our API test tool:
      techsolut-cli test-auth --endpoint=prediction --api-key=your_key
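
If you implement token refresh yourself, the core logic is small: remember when the token expires and fetch a new one slightly before that moment. The sketch below shows one way to do it. TokenCache is a hypothetical class, not the TokenRefreshMiddleware above, and fetch_token stands for whatever call your OAuth2 flow uses to obtain a token and its lifetime.

```python
import time


class TokenCache:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(self, fetch_token, margin_seconds=30, clock=time.monotonic):
        # fetch_token must return a (token, lifetime_in_seconds) pair
        self._fetch_token = fetch_token
        self._margin = margin_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refresh when there is no token yet or expiry is within the margin
        if self._token is None or now >= self._expires_at - self._margin:
            token, lifetime = self._fetch_token()
            self._token = token
            self._expires_at = now + lifetime
        return self._token
```

Refreshing a little early avoids the intermittent 401 errors that appear when a token expires between the moment it is read and the moment the request reaches the server.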

Solutions to Specific Errors

Error: "ImportError: libcudnn.so.X: cannot open shared object file"

This error indicates that the cuDNN (CUDA Deep Neural Network) shared library cannot be found by the dynamic loader.

Solution

# Check installed CUDA version
nvcc --version

# Install the corresponding cuDNN version
# Download from NVIDIA website and install:
tar -xzvf cudnn-X.X-linux-x64-v8.X.X.X.tgz
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

# Update environment variables
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
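
After installing cuDNN, you can confirm from Python that the dynamic loader now finds the library, without starting the whole application. This is a minimal sketch using ctypes from the standard library. cudnn_status is a hypothetical helper, and the result depends on your system's library search path.

```python
import ctypes
import ctypes.util


def cudnn_status():
    """Report whether the dynamic loader can locate and load libcudnn."""
    name = ctypes.util.find_library("cudnn")
    if name is None:
        return "libcudnn not found on the loader search path"
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        return f"found {name} but loading failed: {exc}"
    return f"libcudnn loaded: {name}"


print(cudnn_status())
```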

Error: "RuntimeError: CUDA error: device-side assert triggered"

This often cryptic error typically indicates a problem with input dimensions or indices, for example a class index that is outside the range the model expects.

Solution

# Enable CUDA debugging information: kernel launches become synchronous,
# so the traceback points at the call that actually failed.
# Set this before the first CUDA call (safest: before importing torch).
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch

# Check input tensor dimensions
print(f"Input shape: {input_tensor.shape}")

# Verify class indices are valid
num_classes = model.module.num_classes if hasattr(model, 'module') else model.num_classes
if (targets >= num_classes).any():
    raise ValueError(f"Target class indices exceed number of classes ({num_classes})")

# Check for NaN values
if torch.isnan(input_tensor).any():
    raise ValueError("Input contains NaN values")

Error: "OSError: Unable to load weights from pytorch checkpoint file"

This error occurs when the structure of the loaded model doesn't match that of the checkpoint.

Solution

import torch

# Load the checkpoint once; map_location='cpu' avoids needing the GPU it was saved from
state_dict = torch.load('model_checkpoint.pth', map_location='cpu')

# Display key differences (computed before any filtering, so both sets are meaningful)
model_keys = set(model.state_dict().keys())
checkpoint_keys = set(state_dict.keys())
print(f"Missing keys: {model_keys - checkpoint_keys}")
print(f"Unexpected keys: {checkpoint_keys - model_keys}")

# Loading with strict=False to ignore non-critical mismatches
# (missing or unexpected keys are skipped; shape mismatches still raise)
model.load_state_dict(state_dict, strict=False)

# Use our model reconciliation tool
from techsolut.utils import reconcile_state_dict
fixed_state_dict = reconcile_state_dict(state_dict, model.state_dict())
model.load_state_dict(fixed_state_dict)
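
One frequent cause of mismatched keys, not specific to Techsolut, is a checkpoint saved from a model wrapped in torch.nn.DataParallel: every key then starts with "module.". If the key comparison shows that pattern, removing the prefix is often enough. The sketch below is a plain dictionary operation. strip_prefix is a hypothetical helper and is not a description of what reconcile_state_dict does.

```python
def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with prefix removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Plain values stand in for tensors; the function only touches the keys
checkpoint = {"module.conv1.weight": 1, "module.fc.bias": 2}
print(sorted(strip_prefix(checkpoint)))
```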

How to Get Additional Help

If you encounter a problem not covered by this guide:

  1. Check Detailed Logs
    • Enable debug logging in config.py
    • Examine /logs/debug.log for detailed error messages

  2. Generate a Diagnostic Report
    techsolut-cli generate-diagnostic-report --output=diagnostic.zip

  3. Contact Our Technical Support
    • Send the diagnostic report to support@techsolut.fr
    • Include a detailed description of the problem
    • Mention the steps you've already tried

  4. Check Our Knowledge Base
    • Visit support.techsolut.fr
    • Search for similar issues
    • Check recent updates and release notes

  5. Join Our Community
    • Ask your questions on our community forum
    • Participate in weekly troubleshooting webinars
    • Check previous discussions on similar problems
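
When describing a problem to support or on the forum, a short summary of your environment saves a round of questions. The sketch below collects one with the standard library. It is a hypothetical complement to the diagnostic report, not a replacement for it, and the package names are examples to adapt.

```python
import json
import platform
import sys
from importlib import metadata


def environment_summary(distributions=("torch", "numpy")):
    """Collect basic environment facts worth including in a support request."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


print(json.dumps(environment_summary(), indent=2))
```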