Overview
The SGIVU ML training pipeline transforms raw transaction data into accurate demand forecasting models. The process involves feature engineering, model training with multiple algorithms, evaluation, and versioning.
Pipeline Architecture
Feature Engineering
The feature engineering pipeline is implemented in app/infrastructure/ml/feature_engineering.py and transforms raw transaction data into ML-ready features.
Input Data Requirements
Raw transaction data must include:
- Unique identifier for tracking vehicle lifecycle
- Vehicle category (CAR, MOTORCYCLE)
- Vehicle manufacturer
- Vehicle model name
- Specific trim/version (cannot be empty)
- Transaction type: SALE or PURCHASE
- Sale transaction price
- Purchase transaction price
- Record creation timestamp
- Record update timestamp
Data Normalization
Before feature engineering, all categorical data is normalized. Normalization ensures consistent segmentation by handling case variations, typos, and whitespace differences.
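As a minimal sketch of what such normalization might look like, the helper below trims, collapses internal whitespace, and uppercases a label. The name normalize_category and these exact rules are assumptions for illustration; the real logic in feature_engineering.py may differ, and typo correction is not covered here.

```python
import re

def normalize_category(value):
    """Hypothetical helper: trim, collapse whitespace, and uppercase a label."""
    if value is None:
        return ""
    # Collapse runs of whitespace, strip both ends, and unify the case.
    return re.sub(r"\s+", " ", str(value)).strip().upper()

raw_brands = ["  toyota", "Toyota ", "TOYOTA", "toyota  corolla"]
normalized = [normalize_category(b) for b in raw_brands]
print(normalized)
```

With rules like these, the first three spellings collapse into a single segment key instead of three separate ones.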
Feature Categories
The pipeline generates four types of features:
1. Categorical Features
Segmentation dimensions that identify the vehicle: category, brand, model, and line.
2. Business Metrics
Aggregated monthly metrics per segment:
| Feature | Description | Business Insight |
|---|---|---|
| purchases_count | New inventory acquisitions per month | Supply side activity |
| avg_margin | Average profit per sale | Profitability indicator |
| avg_sale_price | Mean selling price | Price point trends |
| avg_purchase_price | Mean acquisition cost | Cost trends |
| avg_days_inventory | Average days from purchase to sale | Inventory velocity |
| inventory_rotation | Sales-to-purchase ratio | Turnover efficiency |
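To illustrate how some of these metrics could be derived, here is a small sketch that groups toy transactions by segment and month. The row layout (segment, month, type, price) and the function name monthly_metrics are assumptions made for the example, not the project's actual schema.

```python
from collections import defaultdict
from statistics import mean

SEGMENT = "CAR|TOYOTA|COROLLA|XLE"  # illustrative segment key

# Toy transactions; the field names here are assumptions.
transactions = [
    {"segment": SEGMENT, "month": "2024-01", "type": "PURCHASE", "price": 15000.0},
    {"segment": SEGMENT, "month": "2024-01", "type": "PURCHASE", "price": 16000.0},
    {"segment": SEGMENT, "month": "2024-01", "type": "SALE", "price": 18000.0},
]

def monthly_metrics(rows):
    """Aggregate per (segment, month): counts, mean prices, sales-to-purchase ratio."""
    groups = defaultdict(lambda: {"SALE": [], "PURCHASE": []})
    for row in rows:
        groups[(row["segment"], row["month"])][row["type"]].append(row["price"])
    result = {}
    for key, prices in groups.items():
        sales, purchases = prices["SALE"], prices["PURCHASE"]
        result[key] = {
            "purchases_count": len(purchases),
            "avg_sale_price": mean(sales) if sales else None,
            "avg_purchase_price": mean(purchases) if purchases else None,
            # Guard against division by zero when a month has no purchases.
            "inventory_rotation": len(sales) / len(purchases) if purchases else None,
        }
    return result

metrics = monthly_metrics(transactions)
print(metrics[(SEGMENT, "2024-01")])
```

Metrics that pair a sale with its original purchase (avg_margin, avg_days_inventory) need a per-vehicle join and are left out of this sketch.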
3. Time-Series Features
Lagged values and rolling statistics capture temporal patterns:
| Feature | Window | Purpose |
|---|---|---|
| lag_1 | 1 month | Recent trend signal |
| lag_3 | 3 months | Quarterly pattern |
| lag_6 | 6 months | Semi-annual seasonality |
| rolling_mean_3 | 3-month average | Short-term smoothing |
| rolling_mean_6 | 6-month average | Long-term trend |
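The lag and rolling features in the table can be sketched on a plain list of monthly sales. This version assumes the rolling mean covers only the months before the current one, so the target month never leaks into its own feature; whether the project's implementation includes the current month is not stated in the source.

```python
def lag(series, k):
    """Value k steps back for each position; None where history is too short."""
    return [series[i - k] if i >= k else None for i in range(len(series))]

def rolling_mean(series, window):
    """Mean of the `window` values preceding each position (current month excluded)."""
    out = []
    for i in range(len(series)):
        past = series[max(0, i - window):i]
        out.append(sum(past) / window if len(past) == window else None)
    return out

monthly_sales = [4, 6, 5, 7, 8, 6, 9]
lag_1 = lag(monthly_sales, 1)
lag_3 = lag(monthly_sales, 3)
rolling_mean_3 = rolling_mean(monthly_sales, 3)
print(lag_1, lag_3, rolling_mean_3)
```

The leading None values show why a minimum history length matters: a 6-month lag leaves the first six rows of each segment without a usable feature.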
4. Temporal Features
Cyclical encoding captures seasonality. Using sin and cos ensures that December (12) and January (1) are recognized as adjacent months, capturing year-end seasonality patterns.
Feature Engineering Output
The build_feature_table method produces a monthly aggregated dataset.
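One part of that dataset is the cyclical month encoding described under Temporal Features. The sketch below maps a month number to a point on the unit circle and compares distances between months; encode_month and distance are hypothetical helper names, and the column names used by the real feature table are not shown in the source.

```python
import math

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))

# December-January is as close as January-February; December-June is farthest.
print(distance(12, 1), distance(1, 2), distance(12, 6))
```

With a raw month number, December and January would sit 11 units apart; in the encoded space they are exactly as close as any other pair of neighboring months.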
Model Training
The training process is orchestrated by the TrainingService (app/application/services/training_service.py:46-94).
Training Workflow
Train/Test Split
The split respects temporal ordering to prevent data leakage.
Example: With 12 months of data:
- Training set: First 9-10 months
- Test set: Last 2-3 months
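A temporal split of this kind can be sketched as follows. The function name temporal_split and the 20% test fraction are assumptions for illustration; the source only states that the last 2-3 of 12 months end up in the test set.

```python
def temporal_split(rows, test_fraction=0.2):
    """Split rows so the test set holds only the most recent months."""
    months = sorted({row["month"] for row in rows})
    n_test = max(1, round(len(months) * test_fraction))
    cutoff = months[-n_test]  # first month that belongs to the test set
    train = [r for r in rows if r["month"] < cutoff]
    test = [r for r in rows if r["month"] >= cutoff]
    return train, test

# ISO-style "YYYY-MM" strings sort chronologically, so string comparison is safe.
rows = [{"month": f"2024-{m:02d}", "sales": m} for m in range(1, 13)]
train, test = temporal_split(rows)
print(len(train), len(test))
```

Unlike a random shuffle, no training row is ever later than a test row, so lag features in the training set cannot contain information from the evaluation period.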
Preprocessing Pipeline
Before model fitting, data passes through sklearn preprocessing:
- Categorical variables: One-hot encoded (creates binary columns per category)
- Numeric variables:
  - Missing values imputed with median
  - Standardized to zero mean and unit variance
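To show what each preprocessing step computes, here is a dependency-free sketch of the three transformations. It is not the project's pipeline: in scikit-learn the equivalent building blocks are OneHotEncoder, SimpleImputer(strategy="median") and StandardScaler, and how the service wires them together is not shown in the source.

```python
from statistics import mean, median, pstdev

def one_hot(values):
    """One binary column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def impute_and_scale(values):
    """Fill None with the median, then standardize to zero mean and unit variance."""
    fill = median(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)  # population std, as StandardScaler uses
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]

categories, encoded = one_hot(["TOYOTA", "MAZDA", "TOYOTA"])
scaled = impute_and_scale([10.0, None, 30.0])
print(categories, encoded, scaled)
```

In a real pipeline the median, mean, and standard deviation must be learned from the training set only and then reused on the test set.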
Model Candidates
Three algorithms are evaluated:
- Linear Regression
- Random Forest
- XGBoost
Linear Regression strengths:
- Fast training and prediction
- Interpretable coefficients
- Good baseline performance
Linear Regression limitations:
- Assumes linear relationships
- Limited expressiveness for complex patterns
Model Evaluation
All candidates are evaluated on the test set:
RMSE (Root Mean Squared Error)
Penalizes large errors more heavily. Same units as the target variable. Lower is better. Primary metric for model selection.
MAE (Mean Absolute Error)
Average absolute difference between predictions and actuals. Lower is better. More interpretable than RMSE.
MAPE (Mean Absolute Percentage Error)
Percentage error, scale-independent. Lower is better. Example: 0.087 = 8.7% average error.
R² (Coefficient of Determination)
Proportion of variance explained by the model. Closer to 1.0 is better. Example: 0.89 = model explains 89% of variance.
Residual Std Dev
Standard deviation of prediction errors. Used to calculate the upper and lower bounds of prediction confidence intervals.
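The five metrics above can be computed from two lists of numbers. This is a sketch under two assumptions not confirmed by the source: the residual standard deviation divides by n (population form), and MAPE skips months whose actual value is zero, where the percentage error is undefined.

```python
import math

def evaluate(actual, predicted):
    """Compute RMSE, MAE, MAPE, R² and residual std dev for one candidate."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mean_actual = sum(actual) / n
    mean_error = sum(errors) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # Skip zero actuals, where the percentage error is undefined (assumption).
    pct = [abs(e / a) for a, e in zip(actual, errors) if a != 0]
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": sum(pct) / len(pct) if pct else float("nan"),
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        "residual_std": math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n),
    }

scores = evaluate([10, 20, 30, 40], [12, 18, 33, 37])
print(scores)
```

When the mean error is zero, as in this toy example, RMSE and the residual standard deviation coincide; a gap between them signals a systematic bias in the predictions.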
Model Selection and Refit
After evaluation, the best model is retrained on the full dataset. Refitting on the full dataset gives the model access to all available information for production predictions.
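The select-then-refit step can be sketched with two toy candidates standing in for the real algorithms. MeanModel and LastValueModel are hypothetical classes invented for this example; the point is only the control flow: score each candidate on the held-out months, pick the lowest RMSE, then fit a fresh instance on all of the data.

```python
import math

class MeanModel:
    """Toy candidate: always predicts the training mean."""
    def fit(self, y):
        self.value = sum(y) / len(y)
        return self
    def predict(self, n):
        return [self.value] * n

class LastValueModel:
    """Toy candidate: always predicts the last training value."""
    def fit(self, y):
        self.value = y[-1]
        return self
    def predict(self, n):
        return [self.value] * n

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

history = [5, 6, 7, 8, 9, 10, 11, 12]
train, test = history[:6], history[6:]

candidates = {"mean": MeanModel, "last_value": LastValueModel}
scores = {
    name: rmse(test, cls().fit(train).predict(len(test)))
    for name, cls in candidates.items()
}
best_name = min(scores, key=scores.get)
# Refit a fresh instance of the winner on train and test combined.
final_model = candidates[best_name]().fit(history)
print(best_name, scores)
```

The metrics reported for the winner still come from the held-out evaluation; the refitted model has no unseen data left to score against.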
Prediction Generation
Once trained, the model generates multi-horizon forecasts iteratively.
Iterative Forecasting
The _forecast method in PredictionService (app/application/services/prediction_service.py:266-307) implements this loop.
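The general shape of iterative forecasting can be sketched as a loop in which each prediction is appended to the series and becomes a lag for the next step. This is not the body of _forecast, which is not reproduced in the source; forecast, toy_model, and the two features used are assumptions for illustration.

```python
def forecast(history, horizon, predict_one):
    """Predict `horizon` months ahead, feeding each prediction back as history."""
    series = list(history)  # needs at least 3 observed months
    predictions = []
    for _ in range(horizon):
        features = {
            "lag_1": series[-1],
            "rolling_mean_3": sum(series[-3:]) / 3,
        }
        value = predict_one(features)
        predictions.append(value)
        series.append(value)  # the next step sees this prediction as lag_1
    return predictions

def toy_model(features):
    """Stand-in for a trained model: a fixed blend of two features."""
    return 0.5 * features["lag_1"] + 0.5 * features["rolling_mean_3"]

predictions = forecast([6.0, 9.0, 12.0], horizon=2, predict_one=toy_model)
print(predictions)
```

Because later steps are built on earlier predictions rather than observations, errors can compound as the horizon grows.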
Future Feature Construction
The build_future_row method creates features for months beyond the training data.
Key Insight: Future predictions use recent historical averages for business metrics (prices, margins) and automatically update lag features as new predictions are made.
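A rough sketch of that idea is shown below. It is a hypothetical stand-in, not the actual build_future_row: the row keys, the 3-month averaging window, and the signature are all assumptions chosen to show how the next month's calendar features, averaged business metrics, and lag value could be assembled.

```python
import math
from statistics import mean

def build_future_row(history, window=3):
    """Hypothetical stand-in: derive next month's features from past rows."""
    last = history[-1]
    year, month = last["year"], last["month"]
    # Roll over to January of the next year after December.
    year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    recent = history[-window:]
    angle = 2 * math.pi * month / 12
    return {
        "year": year,
        "month": month,
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
        # Future business metrics are unknown, so carry recent averages forward.
        "avg_sale_price": mean(r["avg_sale_price"] for r in recent),
        "avg_margin": mean(r["avg_margin"] for r in recent),
        "lag_1": last["sales"],
    }

history = [
    {"year": 2024, "month": 10, "sales": 5, "avg_sale_price": 18000.0, "avg_margin": 2000.0},
    {"year": 2024, "month": 11, "sales": 6, "avg_sale_price": 19000.0, "avg_margin": 2500.0},
    {"year": 2024, "month": 12, "sales": 7, "avg_sale_price": 20000.0, "avg_margin": 3000.0},
]
future_row = build_future_row(history)
print(future_row)
```

To forecast several months ahead, the predicted sales for this row would be appended to the history before the next row is built.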
Confidence Intervals
Confidence bounds are calculated using normal distribution assumptions:
| Confidence Level | Z-score | Interpretation |
|---|---|---|
| 80% | 1.28 | ±1.28σ contains 80% of values |
| 90% | 1.64 | ±1.64σ contains 90% of values |
| 95% | 1.96 | ±1.96σ contains 95% of values |
| 99% | 2.58 | ±2.58σ contains 99% of values |
max(0, ...) ensures demand predictions never go negative.
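As a sketch of how such bounds could be computed, the function below derives the z-score from the requested confidence level and floors both bounds at zero. The name confidence_bounds and its signature are assumptions; the source only confirms the normal assumption, the z-scores in the table, and the max(0, ...) clamp.

```python
from statistics import NormalDist

def confidence_bounds(prediction, residual_std, confidence=0.95):
    """Symmetric normal interval around a prediction, floored at zero."""
    # Two-sided interval: 95% confidence leaves 2.5% in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = max(0.0, prediction - z * residual_std)
    upper = max(0.0, prediction + z * residual_std)
    return lower, upper

lower, upper = confidence_bounds(3.0, 2.5)
z_scores = {
    level: round(NormalDist().inv_cdf(0.5 + level / 2), 2)
    for level in (0.80, 0.90, 0.95, 0.99)
}
print(lower, upper, z_scores)
```

For a low-volume segment like this one, the unclamped lower bound would be negative, which is why the clamp matters in practice.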
Training via API
Triggering Retraining
Retraining can be triggered programmatically.
Automated Retraining
For production environments, consider scheduled retraining.
Best Practices
Data Quality
Ensure Complete Line Information
All transactions must have a non-empty line field. This is mandatory for segmentation.
Minimum History Requirements
At least 6 months of data per segment (configurable via MIN_HISTORY_MONTHS). More is better for capturing seasonality.
Consistent Naming
Use consistent brand/model/line naming. The normalization pipeline handles some variations, but major inconsistencies should be cleaned upstream.
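These data-quality rules can be checked before a training run. The sketch below is a hypothetical pre-flight check, not part of the service: validate_training_rows, the row keys, and the English error message are assumptions, and the default of 6 months is taken from the text above.

```python
MIN_HISTORY_MONTHS = 6  # default stated in the text; the real value comes from settings

def validate_training_rows(rows, min_months=MIN_HISTORY_MONTHS):
    """Hypothetical check: reject empty lines, report segments with short history."""
    if any(not str(row.get("line") or "").strip() for row in rows):
        raise ValueError("Every transaction needs a non-empty 'line' value.")
    months_by_segment = {}
    for row in rows:
        segment = (row["brand"], row["model"], row["line"])
        months_by_segment.setdefault(segment, set()).add(row["month"])
    # Map each under-represented segment to its number of distinct months.
    return {s: len(m) for s, m in months_by_segment.items() if len(m) < min_months}

rows = [
    {"brand": "TOYOTA", "model": "COROLLA", "line": "XLE", "month": f"2024-{m:02d}"}
    for m in range(1, 7)
] + [
    {"brand": "MAZDA", "model": "3", "line": "TOURING", "month": f"2024-{m:02d}"}
    for m in range(1, 3)
]
short_segments = validate_training_rows(rows)
print(short_segments)
```

Running a check like this upstream surfaces thin segments before they reach training.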
Training Frequency
Model Monitoring
Track these indicators for model health:
- Metric degradation: Is RMSE increasing over time?
- Prediction accuracy: Compare predictions to actuals from previous months
- Coverage: Are new vehicle segments being added that lack training data?
- Residual patterns: Are errors systematic or random?
Feature Engineering Customization
Extend features for your specific use case.
Troubleshooting
Training fails with 'line' missing error
Error: ValueError: La columna 'line' es obligatoria para entrenar el modelo. (In English: "The 'line' column is required to train the model.")
Cause: Input data is missing the line column or has empty values.
Solution:
- Ensure all transactions include the line field
- Backfill historical data with line information
- Use a default value (e.g., "STANDARD") for records without specific trim info
Insufficient history error
Error: ValueError: Se requieren al menos 6 meses para entrenar. (In English: "At least 6 months are required to train.")
Cause: Not enough historical months in the dataset.
Solution:
- Adjust the MIN_HISTORY_MONTHS setting (not recommended below 3)
- Wait for more data to accumulate
- Use synthetic/demo data for testing
Poor model performance (high RMSE)
Symptoms: RMSE > 10, MAPE > 0.30, R² < 0.50
Possible causes:
- Insufficient training data (< 12 months)
- High variance in sales patterns
- Missing important features
- Data quality issues (outliers, errors)
Solutions:
- Collect more historical data
- Add external features (promotions, seasonality indicators)
- Review data for anomalies
- Consider segment-specific models for heterogeneous products
Training takes too long
Cause: Large dataset or complex models (XGBoost with many estimators)
Solutions:
- Reduce n_estimators in XGBoost/RandomForest
- Sample data for faster iteration during development
- Use more powerful compute resources
- Consider incremental learning approaches
Next Steps
Model Management
Learn about versioning and model lifecycle
Prediction API
Use trained models for forecasting
ML Service Overview
Complete ML service architecture
Infrastructure
Deploy ML service to production