The Global Retail Sales Lakehouse project delivers a full end-to-end data platform on Azure using the Medallion Architecture (Bronze, Silver, Gold). It ingests multi-region sales data from more than 50 global locations, processes it using Azure Data Factory, Azure Databricks, and Delta Lake, and provides curated analytics through Azure Synapse and Power BI. The solution enables real-time and batch processing, maintains 99.9% accuracy at the Gold layer, and gives the business a unified global view with self-service analytics.
The architecture follows a streamlined flow: data is ingested into ADLS Gen2 through ADF pipelines, stored as raw files in the Bronze layer, transformed and cleansed in the Silver layer with SCD-2 logic, and finally aggregated in the Gold layer for business reporting. Azure SQL Data Warehouse and Power BI consume this curated Gold data for dashboards and insights. This design ensures scalability, governance, and cost efficiency across the entire pipeline.
Bronze Layer: Stores all raw incoming CSV/Parquet data in its original format.
Silver Layer: Standardizes, cleanses, and builds SCD-2 dimensions and conformed facts for downstream reliability.
Gold Layer: Produces curated, aggregated tables such as store-level daily sales and product hierarchy views, optimized for fast reporting and real-time dashboards.
Daily Ingestion: ADF pipelines capture new files from REST and SFTP sources and land them into Bronze storage, validating schema and row counts.
Transformation: Databricks transforms Bronze data into trustworthy Silver datasets with cleansing, SCD-2 logic, and upserts into Delta tables.
Aggregation: PySpark jobs compute store-level metrics and load Gold fact tables. This three-pipeline setup ensures accuracy, automation, and transparency.
Testing includes unit validation, integration checks, data quality testing, and performance tuning to ensure Gold queries run in under five seconds. Deployment uses a blue-green strategy, performing a clean cutover from the legacy environment with a monitored 48-hour stabilization window. Rollback is supported by switching Power BI back to the old warehouse if needed.
After going live, the platform is monitored to maintain over 99% pipeline success rates, keep cluster utilization below 80%, ensure data freshness within four hours, and manage cloud cost under $12K per month. Alerts and dashboards support proactive issue detection and reporting.