The Sales Forecasting System with ML + Prophet project delivers a hybrid time-series solution for sales prediction, combining seasonal and holiday effects with exogenous factors to improve inventory and planning decisions. It builds Prophet models for trend/seasonality decomposition and XGBoost models for feature-driven regression (weather, promotions, marketing spend), compares their performance, and stores outputs in Snowflake fact tables for analytics. The system achieves roughly 92% accuracy (MAPE < 8%), reduces overstock by 35%, handles 1M+ records daily, and was developed from January to November 2025 for retail/e-commerce use.
The architecture follows an end-to-end pipeline: historical sales data is ingested from source systems, merged with exogenous features (weather APIs, promo calendars, marketing logs) via Python/Pandas, forecast with Prophet for seasonality and holidays and with XGBoost for regression, compared via metrics and visuals, and persisted in Snowflake schemas (raw, processed, fact tables) for querying and BI integration. This design supports scalability, automation via scripts or Airflow, and hybrid model selection for dynamic markets.
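The merge step of the pipeline can be sketched with Pandas. This is a minimal illustration with made-up toy frames (`sales`, `weather`, `promos` and their columns are hypothetical names, not taken from the actual codebase), joining exogenous features onto sales on the date key:

```python
import pandas as pd

# Hypothetical daily sales plus two exogenous sources, all keyed on date.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]),
    "units_sold": [120, 95, 180],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03"]),
    "temp_c": [4.2, 6.1, 3.8],
    "rain_mm": [0.0, 2.5, 0.0],
})
promos = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-03"]),
    "promo_flag": [1],
})

# Left-join exogenous features onto sales on the date key; dates with
# no promo row mean "no promotion", so fill those with 0.
merged = (
    sales
    .merge(weather, on="date", how="left")
    .merge(promos, on="date", how="left")
)
merged["promo_flag"] = merged["promo_flag"].fillna(0).astype(int)
print(merged)
```

Left joins keep every sales row, so feature gaps surface as NaNs that can be validated or filled rather than silently dropping observations.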
The system uses Python for scripting and integration, Facebook Prophet for time-series forecasting, XGBoost for gradient-boosted regression, and Snowflake for cloud data warehousing and storage. Supporting libraries include Pandas for merging and processing, Scikit-learn for evaluation metrics (MAE/RMSE/MAPE), and the Snowflake Connector for Python for ingestion; external APIs supply data such as weather.
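The evaluation metrics named above are straightforward to compute; a small self-contained sketch using Scikit-learn for MAE/RMSE and a manual MAPE (the `actual`/`forecast` arrays are made-up sample values, and the MAPE formula assumes no zero actuals):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actuals and forecasts for one product over five days.
actual = np.array([100.0, 120.0, 90.0, 110.0, 105.0])
forecast = np.array([98.0, 125.0, 85.0, 108.0, 110.0])

mae = mean_absolute_error(actual, forecast)
rmse = np.sqrt(mean_squared_error(actual, forecast))
# MAPE as a percentage; undefined if any actual is zero.
mape = np.mean(np.abs((actual - forecast) / actual)) * 100

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%")
```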
The forecasting layer employs Prophet as an additive time-series model with yearly/weekly seasonality, custom holidays (e.g., a Black Friday holidays dataframe), and extra regressors, and XGBoost as a regressor over lagged sales, weather (temperature/rain), promotion flags, marketing spend, and holiday binaries, trained with the reg:squarederror objective, a 0.05 learning rate, and 1000 estimators. Comparison uses train/test splits, cross-validation metrics (MAE/RMSE), and visual checks of forecasts versus actuals; averaging the two models' forecasts (the hybrid) reaches the ~92% accuracy figure. Features are merged on date keys with validation.
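Prophet takes custom holidays as a dataframe with `holiday` and `ds` columns. A minimal sketch of the Black Friday table, the stated XGBoost hyperparameters, and the hybrid averaging step; the model-fitting calls are omitted to keep the snippet dependency-light, so `prophet_fc` and `xgb_fc` are placeholder arrays standing in for real model outputs:

```python
import numpy as np
import pandas as pd

# Prophet-style custom holiday table: one row per Black Friday date,
# with windows extending the effect one day before and after.
black_friday = pd.DataFrame({
    "holiday": "black_friday",
    "ds": pd.to_datetime(["2023-11-24", "2024-11-29", "2025-11-28"]),
    "lower_window": -1,
    "upper_window": 1,
})

# XGBoost hyperparameters as described in the text.
xgb_params = {
    "objective": "reg:squarederror",
    "learning_rate": 0.05,
    "n_estimators": 1000,
}

# Hybrid step: average the two models' forecasts for the same dates.
prophet_fc = np.array([100.0, 140.0, 90.0])   # placeholder Prophet output
xgb_fc = np.array([110.0, 150.0, 100.0])      # placeholder XGBoost output
hybrid_fc = (prophet_fc + xgb_fc) / 2
print(hybrid_fc)
```

In the real pipeline the holidays frame would be passed as `Prophet(holidays=black_friday)` and the params dict unpacked into `xgboost.XGBRegressor(**xgb_params)`.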
Data processing ingests sales data from CSV files or SQL sources, merges exogenous sources (weather APIs, promos, spend) using Pandas time-joins, handles missing values, outliers, and normalization, and formats inputs for each model (ds/y columns for Prophet, feature matrices for XGBoost). Forecasts are generated (e.g., a 365-day future dataframe with pre-filled regressors), compared, and batch-inserted into Snowflake fact tables (date, product_id, forecast_prophet/xgboost, actual) via the connector, supporting daily updates, versioning, and efficient handling of large datasets.
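Building the 365-day future frame with pre-filled regressors can be sketched as follows. The dates, the promo calendar, and the weather fill strategy are all illustrative assumptions (a real pipeline would pull the promo calendar and weather forecasts from the source systems described above):

```python
import pandas as pd

# Last observed date in the hypothetical training data.
last_date = pd.Timestamp("2025-11-30")

# Prophet-style future frame: 365 daily rows starting the day after the
# last observation, with regressors pre-filled in advance (promo dates
# known from the calendar, weather defaulted to a seasonal mean).
future = pd.DataFrame({
    "ds": pd.date_range(last_date + pd.Timedelta(days=1), periods=365, freq="D")
})
known_promos = pd.to_datetime(["2025-12-26", "2026-01-15"])
future["promo_flag"] = future["ds"].isin(known_promos).astype(int)
future["temp_c"] = 12.0  # placeholder: seasonal-average fill

print(future.head())
```

Every regressor added to the Prophet model must be present in the future frame, which is why the pre-fill step precedes prediction.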
Testing includes unit tests for merging and model functions, integration tests for pipeline flow, accuracy checks (MAPE < 10%, plus RMSE/MAE on held-out test sets), and load tests at 1M+ records. Deployment is automated via Python scripts with cron or Airflow, connects to Snowflake for storage and querying, follows a phased rollout with validation scripts, and supports rollback to earlier model versions if issues arise.
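A unit test for the merge step might look like the following sketch. The helper `merge_on_date` and the test's expectations are illustrative, not the project's actual test suite:

```python
import pandas as pd

def merge_on_date(sales: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline helper: left-join features onto sales by date."""
    return sales.merge(features, on="date", how="left")

def test_merge_preserves_sales_rows():
    sales = pd.DataFrame({"date": pd.to_datetime(["2025-01-01", "2025-01-02"]),
                          "y": [10, 20]})
    feats = pd.DataFrame({"date": pd.to_datetime(["2025-01-01"]),
                          "promo_flag": [1]})
    out = merge_on_date(sales, feats)
    assert len(out) == len(sales)             # no rows dropped or duplicated
    assert "promo_flag" in out.columns        # feature column attached
    assert pd.isna(out.loc[1, "promo_flag"])  # unmatched date stays NaN

test_merge_preserves_sales_rows()
print("merge test passed")
```

Under pytest the `test_` function would be discovered automatically; the explicit call at the end just makes the sketch runnable as a script.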
Post-deployment, forecast accuracy and drift are monitored via daily metrics in Snowflake, along with pipeline runs and feature alignment, targeting >99% uptime and under 30 minutes of daily processing. Maintenance includes quarterly retraining on new data, monthly data-quality and validation checks, and cost controls (elastic Snowflake compute), with alerts on high MAPE deviations that trigger reviews.
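The MAPE-deviation alert can be sketched as a simple threshold check over the daily accuracy log. The `daily` frame stands in for a query against the Snowflake fact tables, and the 10% threshold is an assumed alert level, not a documented setting:

```python
import pandas as pd

# Hypothetical daily accuracy log, as it might be queried from Snowflake.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2025-12-01", "2025-12-02", "2025-12-03"]),
    "actual": [1000.0, 950.0, 1100.0],
    "forecast": [1030.0, 940.0, 1300.0],
})

MAPE_ALERT_THRESHOLD = 10.0  # percent; assumed alert level

# Absolute percentage error per day; flag days above the threshold.
daily["ape"] = (daily["actual"] - daily["forecast"]).abs() / daily["actual"] * 100
alerts = daily[daily["ape"] > MAPE_ALERT_THRESHOLD]
for row in alerts.itertuples():
    print(f"ALERT {row.date.date()}: APE {row.ape:.1f}% exceeds threshold")
```

In production the flagged rows would feed whatever alerting channel the team uses (email, Slack, an Airflow callback) to trigger the review described above.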