Logistics Supply Chain ETL with 3rd-Party APIs

1. Executive Summary

The Logistics Supply Chain ETL with 3rd-Party APIs project is a robust pipeline that integrates shipment, tracking, and delivery data from carrier APIs such as ShipStation, DHL, and FedEx into a unified MySQL analytics warehouse. Python jobs pull data from the REST APIs, normalize heterogeneous responses into a star schema (one fact table plus dimensions), orchestrate incremental updates via Apache Airflow DAGs, and build SLA monitoring tables for on-time delivery analytics. The platform handles thousands of daily events per carrier, runs incremental loads, and was delivered in roughly six months, ahead of schedule. The result gives operations teams clear visibility into shipment performance, delays, and carrier reliability.

2. Architecture Overview

The system follows a classic API-driven ETL architecture with orchestration for reliability:

Data Sources: REST APIs from ShipStation, DHL, and FedEx.

Ingestion Layer: Python scripts (requests-based) pulling shipment, tracking, and delivery events.

Processing Layer: Python + pandas normalize and standardize fields into a common model.

Storage Layer: MySQL database with fact_shipment and dim_carrier plus SLA tables.

Orchestration Layer: Airflow DAGs schedule and manage incremental ETL runs per carrier. Tasks are logically split per carrier and step (extract → transform → load → SLA refresh).
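The per-carrier extract → transform → load → SLA refresh chain described above might be wired as in the following sketch. The DAG id, schedule, and the helper functions (`extract_shipments`, `transform_shipments`, `load_shipments`, `refresh_sla`) and their module path are illustrative assumptions, not the project's actual layout:

```python
# Hypothetical Airflow DAG for one carrier (ShipStation); the other
# carriers would get structurally identical DAGs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed project package; names are placeholders for this sketch.
from etl.shipstation import (
    extract_shipments, transform_shipments, load_shipments, refresh_sla,
)

with DAG(
    dag_id="shipstation_etl",
    start_date=datetime(2025, 6, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2},  # retry logic per the tech stack above
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_shipments)
    transform = PythonOperator(task_id="transform", python_callable=transform_shipments)
    load = PythonOperator(task_id="load", python_callable=load_shipments)
    sla = PythonOperator(task_id="refresh_sla", python_callable=refresh_sla)

    # extract → transform → load → SLA refresh
    extract >> transform >> load >> sla
```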

3. Technology Stack

  • Programming: Python 3.x
  • APIs / HTTP: requests library (ShipStation, DHL, FedEx REST APIs)
  • Data Processing: pandas (JSON flattening and normalization)
  • Database: MySQL with SQLAlchemy (relational, star-schema analytics)
  • Orchestration: Apache Airflow (DAGs, scheduling, retry logic)
  • Infrastructure: Docker, MySQL Workbench
  • Connectivity: airflow.providers.mysql, TLS encryption

4. Data & Warehousing Model

Star Schema in MySQL:

dim_carrier: carrier_id, carrier_name, api_endpoint, etc.

fact_shipment: shipment_id, carrier_id, status, expected_delivery, actual_delivery, and other shipment attributes.

SLA Monitoring Tables: Derived tables (e.g., sla_monitoring) aggregate metrics such as total shipments per carrier, on-time vs late deliveries, and on-time percentage. Indexes on IDs and dates support efficient querying.
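A minimal DDL sketch of the star schema follows. Only the columns named in this document are shown; real tables would carry more attributes. The types are written in MySQL style, and the demo executes against SQLite (which accepts the same syntax) so it runs without a database server:

```python
import sqlite3

DDL = """
CREATE TABLE dim_carrier (
    carrier_id   INTEGER PRIMARY KEY,
    carrier_name VARCHAR(64) NOT NULL,
    api_endpoint VARCHAR(255)
);

CREATE TABLE fact_shipment (
    shipment_id       VARCHAR(64) PRIMARY KEY,
    carrier_id        INTEGER NOT NULL REFERENCES dim_carrier(carrier_id),
    status            VARCHAR(32),
    expected_delivery DATETIME,
    actual_delivery   DATETIME
);

-- Indexes on IDs and dates support the SLA queries.
CREATE INDEX idx_fact_carrier  ON fact_shipment (carrier_id);
CREATE INDEX idx_fact_expected ON fact_shipment (expected_delivery);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```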

5. ETL Processing

Extract: Python jobs call APIs using secure credentials. Incremental pulls are based on timestamps or last_updated fields.
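An incremental pull reduces to building request parameters from the last successful watermark. In this sketch the parameter names follow ShipStation's shipment-listing convention; other carriers use different names, so each API client maps them per carrier. The bearer-token header and the `SHIPSTATION_API_KEY` variable are illustrative, not the carriers' actual auth schemes:

```python
import os
from datetime import datetime

def incremental_params(last_run: datetime, page_size: int = 500) -> dict:
    """Build query parameters for an incremental API pull.

    Only records modified since the previous run's watermark are
    requested, which keeps daily loads small.
    """
    return {
        "modifyDateStart": last_run.strftime("%Y-%m-%dT%H:%M:%S"),
        "pageSize": page_size,
    }

def auth_headers() -> dict:
    # Credentials come from the environment, never from source control.
    token = os.environ.get("SHIPSTATION_API_KEY", "")
    return {"Authorization": f"Bearer {token}"}
```

The extraction job would pass both to `requests.get(url, params=..., headers=...)` and persist the new watermark only after a successful load.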

Transform: pandas normalizes nested JSON. A mapping layer standardizes fields across carriers (e.g., standardizing status names and delivery timestamps).
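The flattening-plus-mapping step might look like this. The sample payload, field names, and status codes are invented for illustration; real carrier responses differ in shape, which is exactly what the mapping layer absorbs:

```python
import pandas as pd

# Invented sample payload standing in for one carrier's API response.
raw = [
    {"shipmentId": "S1", "trackingStatus": {"code": "DL", "desc": "Delivered"}},
    {"shipmentId": "S2", "trackingStatus": {"code": "IT", "desc": "In Transit"}},
]

# Flatten nested JSON into dotted columns like trackingStatus.code.
df = pd.json_normalize(raw)

# Mapping layer: carrier-specific codes -> common status vocabulary.
STATUS_MAP = {"DL": "delivered", "IT": "in_transit"}
df["status"] = df["trackingStatus.code"].map(STATUS_MAP)

# Rename to the warehouse's common model and keep only mapped columns.
df = df.rename(columns={"shipmentId": "shipment_id"})[["shipment_id", "status"]]
```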

Load: Batch loads to MySQL via SQLAlchemy using UPSERT patterns for incremental updates. Facts and dimensions are handled in separate flows.
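The upsert pattern can be sketched as follows. In MySQL the statement would use `INSERT ... ON DUPLICATE KEY UPDATE` (issued through SQLAlchemy in the real pipeline); the demo uses SQLite's equivalent `ON CONFLICT` clause so it runs without a MySQL server, and the column list is trimmed for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_shipment (shipment_id TEXT PRIMARY KEY, status TEXT)"
)

# MySQL form:
#   INSERT INTO fact_shipment (shipment_id, status) VALUES (%s, %s)
#   ON DUPLICATE KEY UPDATE status = VALUES(status);
UPSERT = """
INSERT INTO fact_shipment (shipment_id, status) VALUES (?, ?)
ON CONFLICT(shipment_id) DO UPDATE SET status = excluded.status
"""

# First batch inserts both rows; the next incremental run updates S1
# in place instead of duplicating it.
conn.executemany(UPSERT, [("S1", "in_transit"), ("S2", "in_transit")])
conn.execute(UPSERT, ("S1", "delivered"))
```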

SLA Monitoring: SQL queries compute performance metrics and populate SLA monitoring tables, refreshed via Airflow DAGs.
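A representative SLA aggregation is shown below: total shipments, on-time vs. late counts, and the on-time percentage per carrier. The sample rows are invented, and SQLite stands in for MySQL so the sketch is self-contained; the scheduled Airflow task would instead write the result into a table such as sla_monitoring:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_shipment (
    shipment_id       TEXT PRIMARY KEY,
    carrier_id        INTEGER,
    expected_delivery TEXT,
    actual_delivery   TEXT
);
INSERT INTO fact_shipment VALUES
    ('S1', 1, '2025-09-01', '2025-08-31'),
    ('S2', 1, '2025-09-02', '2025-09-05'),
    ('S3', 1, '2025-09-03', '2025-09-03');
""")

# On-time = delivered on or before the expected date. Boolean
# comparisons evaluate to 0/1 in both SQLite and MySQL, so SUM()
# counts matches directly.
SLA_QUERY = """
SELECT carrier_id,
       COUNT(*)                                  AS total_shipments,
       SUM(actual_delivery <= expected_delivery) AS on_time,
       SUM(actual_delivery >  expected_delivery) AS late,
       ROUND(100.0 * SUM(actual_delivery <= expected_delivery) / COUNT(*), 1)
           AS on_time_pct
FROM fact_shipment
GROUP BY carrier_id
"""
row = conn.execute(SLA_QUERY).fetchone()
```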

6. Project Timeline (6 Months)

Project Start: June 1, 2025 | Duration: ~6 months (Delivered ahead of schedule)

  • Jun 1 – Jun 15, 2025 — Kickoff: Requirements and API onboarding.
  • Jun 16 – Jul 1, 2025 — Design: Architecture and star schema design.
  • Jul 2 – Jul 31, 2025 — Ingestion: Python API scripts for ShipStation, DHL, FedEx.
  • Aug 1 – Sep 15, 2025 — Transformation: Normalization and cross-carrier mapping.
  • Sep 16 – Oct 31, 2025 — Warehousing: MySQL schema (facts, dims, SLA tables).
  • Nov 1 – Nov 13, 2025 — Orchestration: Airflow DAG construction.
  • Nov 14 – Dec 15, 2025 — Testing & Deployment: Integration tests and rollout.

7. Testing & Deployment

Testing: Unit Tests with Pytest for API wrappers; Integration Tests with mock APIs and end-to-end Airflow runs; Performance Tests verifying that ~10k events process in under 10 minutes.
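One of the Pytest-style unit tests might resemble the sketch below. The `normalize_status` function and the fake response class are assumptions standing in for the project's real API-wrapper code; the fake lets the mapping logic be tested without a live carrier API:

```python
def normalize_status(carrier_code: str) -> str:
    # Hypothetical cross-carrier status mapping under test.
    return {"DL": "delivered", "IT": "in_transit"}.get(carrier_code, "unknown")

class FakeResponse:
    """Stands in for requests.Response so tests need no live API."""

    def __init__(self, payload):
        self._payload = payload
        self.status_code = 200

    def json(self):
        return self._payload

def test_normalize_status():
    resp = FakeResponse({"trackingStatus": {"code": "DL"}})
    code = resp.json()["trackingStatus"]["code"]
    assert normalize_status(code) == "delivered"
    assert normalize_status("??") == "unknown"
```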

Deployment: Airflow via Docker/Kubernetes; MySQL cloud-hosted; all configurations managed via Git and environment variables.

8. Monitoring & Maintenance

Monitoring: Airflow UI for DAG health; MySQL logs for query performance. Success target: ~99% record-level reconciliation accuracy between the carrier APIs and MySQL.

Maintenance: Daily incremental refresh, monthly backups, and rotation of API keys. Estimated Cost: ~$200/month for infrastructure.

9. Roles & Responsibilities

Methodology: Agile with 2-week sprints, starting with a POC for one carrier.

  • 🚀 ETL Lead (1): End-to-end design and API integration strategy.
  • ⚙️ Python Developers (2): Implementation of API clients and normalization logic.
  • 🏗️ DBA (1): Schema design, indexing, and performance tuning.