Overview
The SGIVU ML service implements a robust model management system that handles versioning, persistence, metadata tracking, and model lifecycle operations. This ensures reproducibility, traceability, and seamless model updates.

Model Registry
The Model Registry is the central component for managing ML artifacts, implemented through the ModelRegistryPort interface with concrete implementations for file system and database storage.
Architecture
The service can operate in file-only mode (no database) or database-backed mode for enterprise deployments with centralized storage.
Model Versioning
Version Format
Models are versioned using timestamp-based identifiers: 20260306_143022 represents a model trained on March 6, 2026 at 14:30:22.
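Version strings in this format can be produced with a single strftime pattern. A minimal sketch (the helper name is illustrative, not the service's actual function):

```python
from datetime import datetime, timezone


def generate_version(now=None) -> str:
    """Build a timestamp-based model version like 20260306_143022."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d_%H%M%S")


# The example version from the text:
print(generate_version(datetime(2026, 3, 6, 14, 30, 22)))  # 20260306_143022
```

Because the fields run from year down to second, plain string sorting equals chronological sorting, which is what makes the "latest version" lookup trivial.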
Why Timestamp Versioning?
- Chronological Ordering: Versions are naturally sorted by training time
- No Conflicts: Concurrent training jobs won’t collide (down to second precision)
- Reproducibility: Easy to identify when a model was created
- Simplicity: No need for separate version number management
Version Generation
Versions are automatically generated during model save.

Model Persistence
File-Based Storage
When using file system storage (configured via MODEL_DIR):
Model Artifact (model.joblib)
The trained sklearn Pipeline object serialized with joblib:
- Preprocessor: ColumnTransformer with encoders and scalers
- Model: Best estimator (LinearRegression, RandomForest, or XGBoost)
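As a sketch of this artifact round trip (column names and data are illustrative, not the service's real schema), a small Pipeline can be serialized and restored with joblib:

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature layout: one categorical and one numeric column.
X = pd.DataFrame({"segment": ["a", "b", "a", "b"],
                  "lag_1": [10.0, 20.0, 12.0, 18.0]})
y = [11.0, 21.0, 13.0, 19.0]

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ("num", StandardScaler(), ["lag_1"]),
])
pipeline = Pipeline([("preprocessor", pre), ("model", LinearRegression())])
pipeline.fit(X, y)

joblib.dump(pipeline, "model.joblib")   # what the registry stores
restored = joblib.load("model.joblib")  # what the prediction path loads back
print(restored.predict(X.head(1)))
```

Serializing the whole Pipeline (preprocessor plus estimator) is what guarantees that prediction-time inputs receive exactly the same encoding and scaling as at training time.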
Metadata File (metadata.json)
Metadata Field Descriptions
- Unique model identifier (timestamp-based)
- ISO 8601 timestamp with timezone
- Name of the target variable (usually sales_count)
- List of feature names, in the order expected by the model
- Performance metrics from test set evaluation:
  - rmse: Root Mean Squared Error
  - mae: Mean Absolute Error
  - mape: Mean Absolute Percentage Error
  - r2: R-squared score
  - residual_std: Used for prediction intervals
- Comparison of all evaluated models with their metrics
- Number of samples used for training
- Number of samples used for evaluation
- Total dataset size
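Putting the fields together, a hypothetical metadata.json could look like the following. The key names and values here are assumptions for illustration only; consult an actual metadata.json from the service for the real keys:

```python
import json

# Illustrative only: key names are guesses based on the field descriptions above.
metadata = {
    "version": "20260306_143022",
    "created_at": "2026-03-06T14:30:22+00:00",
    "target": "sales_count",
    "features": ["segment", "lag_1", "lag_7"],
    "metrics": {"rmse": 4.2, "mae": 3.1, "mape": 0.12,
                "r2": 0.81, "residual_std": 4.0},
    "candidates": {
        "LinearRegression": {"rmse": 5.0},
        "RandomForest": {"rmse": 4.2},
        "XGBoost": {"rmse": 4.4},
    },
    "n_train": 800,
    "n_test": 200,
    "n_total": 1000,
}
print(json.dumps(metadata, indent=2))
```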
Database Storage
For production deployments, models can be stored in PostgreSQL:

Schema
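The exact PostgreSQL schema is project-specific; as an assumed sketch (illustrative table and column names, using SQLite here as a stand-in for PostgreSQL), the storage shape can be modeled like this:

```python
import sqlite3  # stand-in for PostgreSQL, for illustration only

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ml_models (
        version     TEXT PRIMARY KEY,   -- timestamp-based identifier
        created_at  TEXT NOT NULL,      -- ISO 8601 timestamp
        artifact    BLOB NOT NULL,      -- joblib-serialized pipeline
        metadata    TEXT NOT NULL       -- metadata.json contents
    )
""")
conn.execute(
    "INSERT INTO ml_models VALUES (?, ?, ?, ?)",
    ("20260306_143022", "2026-03-06T14:30:22+00:00", b"...", "{}"),
)
# Timestamp versions sort lexicographically, so "latest" is a simple ORDER BY.
latest = conn.execute(
    "SELECT version FROM ml_models ORDER BY version DESC LIMIT 1"
).fetchone()
print(latest[0])
```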
Benefits of Database Storage
- Centralized Storage: Single source of truth for all model versions
- Easy Querying: SQL queries for model comparison and analysis
- Automatic Backups: Models included in the database backup strategy
- Scalability: Handle large model collections without filesystem concerns
Model Lifecycle
Training and Registration
When a new model is trained, the following code flow applies (from app/application/services/training_service.py:88-94):
Loading for Prediction
When making predictions (code from app/application/services/prediction_service.py:177-181):
Model Replacement
The service always uses the latest model by version. Older versions are retained for auditing but not used for predictions unless explicitly loaded.
- Old model: Remains in storage with its version
- New model: Saved with newer version timestamp
- Predictions: Automatically switch to new model on next request
Feature Snapshots
For reproducibility, the service can persist feature datasets alongside models.

Purpose
Feature snapshots enable:
- Prediction without raw data: Use pre-computed features
- Faster inference: No need to rebuild features from transactions
- Reproducibility: Ensure predictions use exact training feature distributions
- Debugging: Compare features across model versions
Database Schema
Usage
Features are automatically saved during training (if feature_repository is configured):
Prediction Logging
The service can log all prediction requests and responses for:
- Auditing: Track who requested what predictions
- Monitoring: Detect usage patterns and anomalies
- Model evaluation: Compare predictions to actual outcomes
- Debugging: Investigate prediction issues
Database Schema
Logged Information
Querying Prediction Logs
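With prediction logs in a table, per-version analysis becomes plain SQL. A sketch against an assumed prediction_logs table (table and column names are illustrative, using SQLite as a stand-in for PostgreSQL):

```python
import sqlite3  # stand-in for PostgreSQL, for illustration only

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prediction_logs (
        created_at    TEXT,
        model_version TEXT,
        segment       TEXT,
        prediction    REAL
    )
""")
conn.executemany(
    "INSERT INTO prediction_logs VALUES (?, ?, ?, ?)",
    [
        ("2026-03-06T15:00:00", "20260306_143022", "a", 12.5),
        ("2026-03-06T15:05:00", "20260306_143022", "b", 20.0),
        ("2026-03-07T09:00:00", "20260307_081500", "a", 13.0),
    ],
)
# Request volume and average prediction per model version.
rows = conn.execute("""
    SELECT model_version, COUNT(*) AS n, AVG(prediction) AS avg_pred
    FROM prediction_logs
    GROUP BY model_version
    ORDER BY model_version
""").fetchall()
print(rows)
```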
Model Comparison
Compare performance across model versions to track improvements.

Via API

The latest-model endpoint returns metadata with a candidates field showing all evaluated models:
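Given the candidates field, picking the best model by a chosen metric is straightforward. A sketch with illustrative values (these are not real SGIVU results):

```python
# Hypothetical candidates field from model metadata.
candidates = {
    "LinearRegression": {"rmse": 5.0, "r2": 0.72},
    "RandomForest": {"rmse": 4.2, "r2": 0.81},
    "XGBoost": {"rmse": 4.4, "r2": 0.79},
}

# Lowest RMSE wins.
best = min(candidates, key=lambda name: candidates[name]["rmse"])
print(best)  # RandomForest
```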
Via Database
Visualization Example
Model Rollback
If a new model performs poorly in production, you can roll back in several ways:

Option 1: Filesystem Rollback

Rename the directory so that an older version becomes the “latest”.

Option 2: Explicit Version Loading

Modify the registry to load a specific version instead of the latest:
The best solution is usually to retrain with corrected data.

Monitoring and Observability
Health Checks
Verify model availability:
- 200 OK: Model is available
- 500 Error with "No hay modelos disponibles" (“no models available”): No trained model
Metrics to Track
Model Performance Metrics
- RMSE: Track over time, alert if > threshold
- R²: Should be > 0.70 for good models
- MAPE: Percentage error, aim for < 15%
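These thresholds can be checked mechanically after each evaluation. A sketch using the stated limits (the threshold values are the document's examples, not fixed policy; the function name is illustrative):

```python
import math


def metric_alerts(y_true, y_pred, rmse_threshold=5.0):
    """Flag metrics breaching the guideline thresholds (R² > 0.70, MAPE < 15%)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mape = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    alerts = []
    if rmse > rmse_threshold:
        alerts.append(f"RMSE too high: {rmse:.2f}")
    if r2 <= 0.70:
        alerts.append(f"R² too low: {r2:.2f}")
    if mape >= 0.15:
        alerts.append(f"MAPE too high: {mape:.1%}")
    return alerts
```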
Operational Metrics
- Training frequency: How often are models retrained?
- Training duration: Is it increasing over time?
- Model size: Disk/memory usage per version
- Prediction latency: Response time for forecasts
Business Metrics
- Prediction accuracy: Compare forecasts to actuals
- Coverage: % of segments with sufficient training data
- Usage: Predictions per segment/day
- Confidence: Are predictions consistently within CI bounds?
Alerting
Set up alerts for threshold breaches on the metrics above (for example, RMSE above its threshold, training failures, or elevated prediction latency).

Best Practices
Version Retention Policy
- Keep: Last 10 versions or 90 days of models
- Archive: Older versions to cold storage
- Delete: Models older than 1 year (after compliance review)
Model Documentation
Store additional documentation with each version:
- Training notebook/script
- Data quality report
- Feature importance analysis
- Business context (e.g., “trained after holiday season”)
A/B Testing
For major model changes, run A/B tests:
- Route 10% of traffic to new model
- Compare predictions and user feedback
- Gradually increase traffic if successful
Reproducibility
Ensure models can be recreated:
- Pin dependency versions (requirements.txt)
- Store the feature engineering code version
- Save random seeds in metadata
- Document hyperparameters
API Reference
For API operations related to model management, see:

Get Latest Model
Retrieve current model metadata
Retrain Model
Trigger new model training
Troubleshooting
Model not found error
Error: ModelNotTrainedError: Aún no existe un modelo entrenado. (“No trained model exists yet.”)
Cause: No model versions exist in MODEL_DIR or the database.
Solution:
- Run initial training via /v1/ml/retrain
- Check that the MODEL_DIR path is correct
- Verify database connectivity if using DB storage
Model deserialization errors
Error: ValueError: unsupported pickle protocol, or module import errors
Cause: Model was trained with different Python/library versions.
Solution:
- Ensure a consistent environment (use Docker)
- Pin dependency versions
- Retrain the model in the current environment
Predictions differ after retrain
Cause: Different training data, features, or model selection.
Expected behavior: Models evolve as data changes.
To investigate:
- Compare the candidates field in metadata
- Check whether a different algorithm was selected
- Review training data date ranges
- Compare feature distributions
Slow model loading
Cause: Large model files or network latency (DB storage).
Solutions:
- Cache loaded model in memory (current implementation loads on each prediction)
- Use file storage instead of DB for faster access
- Implement model preloading during service startup
Next Steps
Training Process
Learn how models are trained
Prediction API
Use models for forecasting
Deployment Guide
Deploy SGIVU to production
Monitoring Guide
Set up model monitoring