Finance Ledger Harmonization System

1. Executive Summary

The Finance Ledger Harmonization System is a 12-week ETL and data warehousing project designed to ingest, harmonize, and consolidate multi-company financial ledgers into a unified PostgreSQL data warehouse. The solution processes General Ledger (GL) and Trial Balance (TB) files, standardizes disparate charts of accounts using configurable mapping rules, applies Slowly Changing Dimension Type 2 (SCD-2) logic for historical tracking, and populates a financial star schema consisting of fact_financial and dim_account tables. Technologies used include PySpark for large-scale transformations, Apache Airflow for orchestration, Hive/Parquet for intermediate storage, and PostgreSQL for the analytical warehouse. The solution reliably ingests up to 10 million records per cycle, offers strong auditability, supports daily and weekly batch processing, and enables BI reporting through tools such as Power BI and Tableau.

2. Architecture Overview

The architecture follows a modular ETL design: Airflow triggers ingestion of multi-company financial files into Hive/Parquet landing zones, PySpark performs harmonization and SCD-2 transformations, and final results are loaded into a PostgreSQL warehouse structured for analytical queries. Airflow handles orchestration and monitoring; Hive acts as an auditable interim storage zone; PySpark applies complex mapping and SCD-2 rules; PostgreSQL hosts the star schema for BI consumption. This architecture ensures data integrity, scalability, historical preservation, and efficient financial reporting.
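
To make the orchestration flow concrete, the following is a minimal DAG sketch, assuming Airflow 2.x and a hypothetical ledger_pipeline module whose callables wrap the ingestion, PySpark transformation, and warehouse-load steps; the task names and schedule are illustrative, not the project's actual DAG.

```python
# Minimal orchestration sketch: landing zone -> PySpark transforms -> warehouse load.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module wrapping the Spark/JDBC steps described above.
from ledger_pipeline import ingest_ledgers, harmonize_and_scd2, load_warehouse

with DAG(
    dag_id="finance_ledger_harmonization",
    start_date=datetime(2025, 8, 1),
    schedule_interval="@daily",   # daily/weekly batch cadence per the summary
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_to_hive_landing", python_callable=ingest_ledgers)
    transform = PythonOperator(task_id="harmonize_and_scd2", python_callable=harmonize_and_scd2)
    load = PythonOperator(task_id="load_postgres_star_schema", python_callable=load_warehouse)

    ingest >> transform >> load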

3. Technology Stack

  • Orchestration: Apache Airflow
  • Processing: PySpark (mapping, SCD-2, transformations)
  • Storage (Intermediate): Hive + Parquet
  • Data Warehouse: PostgreSQL (star schema)
  • Validation: Python, Pandas, custom schema checks
  • Security: SSL for JDBC, role-based access
  • Monitoring: Airflow UI, Prometheus, Slack alerts
  • CI/CD: Jenkins + Git

4. Data Model

dim_account (SCD-2): Holds the harmonized chart of accounts with historical versions. Columns: account_id, natural_account, company_id, account_name, account_type, effective_date, expiry_date, is_current, version. Supports versioned history for audit and compliance.
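
A sketch of the dim_account structure expressed as a PySpark schema follows; the data types are illustrative assumptions, and the actual warehouse DDL may differ.

```python
from pyspark.sql.types import (
    StructType, StructField, LongType, StringType, DateType, BooleanType, IntegerType
)

# Illustrative typing of the dim_account columns listed above.
dim_account_schema = StructType([
    StructField("account_id", LongType(), nullable=False),        # surrogate key
    StructField("natural_account", StringType(), nullable=False),  # source GL account code
    StructField("company_id", StringType(), nullable=False),
    StructField("account_name", StringType(), nullable=True),
    StructField("account_type", StringType(), nullable=True),      # e.g. asset, liability, expense
    StructField("effective_date", DateType(), nullable=False),     # SCD-2 validity start
    StructField("expiry_date", DateType(), nullable=True),         # null/open-ended for current rows
    StructField("is_current", BooleanType(), nullable=False),
    StructField("version", IntegerType(), nullable=False),
])
```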

fact_financial: Contains financial facts per period (balances, debits, credits). Columns: fact_id, account_id, period_date, debit_amount, credit_amount, balance, transaction_type, source_company. Enables period-end reporting and transactional detail analysis.
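
The kind of period-end query this star schema supports is sketched below, assuming the column names listed above; the connection string is a placeholder and real credentials would come from the BI tool or a secrets store.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN for the analytical warehouse.
engine = create_engine("postgresql://report_user:***@warehouse-host:5432/finance_dw")

period_end = pd.read_sql(
    """
    SELECT d.company_id,
           d.account_type,
           SUM(f.debit_amount)  AS total_debits,
           SUM(f.credit_amount) AS total_credits,
           SUM(f.balance)       AS closing_balance
      FROM fact_financial f
      JOIN dim_account d
        ON d.account_id = f.account_id
       AND d.is_current = TRUE            -- current chart-of-accounts view
     WHERE f.period_date = DATE '2025-09-30'
     GROUP BY d.company_id, d.account_type
    """,
    engine,
)
```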

Data Volumes: Up to 10M records per cycle; 5GB processed in under 30 minutes.

5. ETL Processing

Extract: Airflow ingests GL and Trial Balance (TB) files, delivered as CSV, Excel, or API extracts, from SFTP, S3, or local sources.
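
A minimal extraction sketch follows, assuming the files have already been pulled from SFTP/S3 into a landing path; file names, paths, and the choice to route Excel through pandas are illustrative assumptions.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ledger_extract").enableHiveSupport().getOrCreate()

# CSV trial balance files: read directly with Spark, keeping the raw layout for auditing.
tb_raw = spark.read.csv("/landing/trial_balance/*.csv", header=True, inferSchema=True)

# Excel GL extracts: small enough per file to read with pandas, then convert to Spark.
gl_pdf = pd.read_excel("/landing/gl/company_a_gl_2025-09.xlsx")
gl_raw = spark.createDataFrame(gl_pdf)

# Persist untouched landing copies to the Hive/Parquet zone for traceability.
tb_raw.write.mode("overwrite").parquet("/warehouse/landing/trial_balance/")
gl_raw.write.mode("overwrite").parquet("/warehouse/landing/gl/")
```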

Transform: PySpark applies chart-of-accounts harmonization via configurable JSON/YAML mapping rules and implements SCD-2 by hashing tracked attributes, generating new versions, and closing superseded ones (see the sketch below). Data quality checks cover duplicates, schema mismatches, and balancing (debits = credits).
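
The following is a simplified sketch of that SCD-2 step, assuming `incoming` holds the harmonized chart of accounts for the current cycle and `dim_current` holds the current dim_account rows; the tracked-attribute list and join keys are assumptions based on the data model above, and surrogate-key/version assignment is omitted.

```python
from pyspark.sql import functions as F

# Attributes whose changes trigger a new dimension version (illustrative subset).
TRACKED = ["account_name", "account_type"]

def apply_scd2(incoming, dim_current, batch_date):
    """Return (new_versions, closed_versions) for dim_account."""
    add_hash = lambda df: df.withColumn("attr_hash", F.sha2(F.concat_ws("||", *TRACKED), 256))
    new_df, cur_df = add_hash(incoming).alias("new"), add_hash(dim_current).alias("cur")

    key = (F.col("new.natural_account") == F.col("cur.natural_account")) & \
          (F.col("new.company_id") == F.col("cur.company_id"))
    joined = new_df.join(cur_df, key, "left")

    # Brand-new accounts, or accounts whose tracked attributes changed.
    changed = joined.filter(
        F.col("cur.attr_hash").isNull() | (F.col("new.attr_hash") != F.col("cur.attr_hash"))
    )

    # Open a new version for each changed/new account.
    new_versions = (changed.select("new.*").drop("attr_hash")
                    .withColumn("effective_date", F.lit(batch_date))
                    .withColumn("expiry_date", F.lit(None).cast("date"))
                    .withColumn("is_current", F.lit(True)))

    # Close out the superseded versions.
    closed = (changed.filter(F.col("cur.account_id").isNotNull())
              .select("cur.*").drop("attr_hash")
              .withColumn("expiry_date", F.lit(batch_date))
              .withColumn("is_current", F.lit(False)))

    return new_versions, closed
```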

Load: Dimensional tables are loaded into PostgreSQL via JDBC, and fact_financial is written as incremental snapshots.

Intermediate Storage: Parquet layers are retained in Hive for auditing and reprocessing. This pipeline ensures idempotency, traceability, and reliable cross-company financial standardization.
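
A sketch of the load step described above follows; the JDBC URL, credentials, paths, and the dim_account_staging table are placeholders and the routing of SCD-2 output through a staging table is an assumption.

```python
from pyspark.sql import DataFrame

# Hypothetical connection settings; in practice these come from Airflow connections/secrets.
PG_URL = "jdbc:postgresql://warehouse-host:5432/finance_dw?ssl=true&sslmode=require"
PG_PROPS = {"user": "etl_user", "password": "***", "driver": "org.postgresql.Driver"}

def load_warehouse(fact_df: DataFrame, dim_updates: DataFrame) -> None:
    # Audit/reprocessing copy in the Hive zone, partitioned for selective reloads.
    fact_df.write.mode("overwrite") \
        .partitionBy("period_date", "source_company") \
        .parquet("/warehouse/curated/fact_financial/")

    # Incremental fact snapshot appended to the PostgreSQL star schema over SSL JDBC.
    fact_df.write.jdbc(url=PG_URL, table="fact_financial", mode="append", properties=PG_PROPS)

    # SCD-2 output (new + closed versions) lands in a staging table; closing existing
    # rows is then done with a SQL merge/update inside PostgreSQL, since plain JDBC
    # appends cannot update rows in place.
    dim_updates.write.jdbc(url=PG_URL, table="dim_account_staging", mode="overwrite",
                           properties=PG_PROPS)
```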

6. Project Timeline (12 Weeks)

Start: Aug 1, 2025 | Dev Complete: Sep 26, 2025 | Go-Live: Oct 31, 2025

  • Weeks 1–2 — Planning & Design: Requirements gathering, data modeling, architecture diagrams.
  • Weeks 3–8 — Development: Build Airflow ingestion DAGs, PySpark harmonization jobs, SCD-2 engine, PostgreSQL loaders.
  • Weeks 9–10 — Testing: Unit testing (PyTest), end-to-end DAG runs, performance validation with 15GB datasets (Oct 10, 2025 sign-off).
  • Weeks 11–12 — Deployment & Handover: Production deployment, documentation, DBA integration, client training.

7. Testing & Deployment

Testing included schema validations, reconciliation checks (debits = credits), duplicate detection, performance testing on 15GB datasets (processed in 25 minutes), and user acceptance testing (UAT). Airflow DAGs were tested in isolated dev environments before production rollout. Deployment used CI/CD via Jenkins, SSL-secured JDBC connections, and Airflow environment promotion (dev → staging → prod). Rollback is supported through Hive staging backups and PostgreSQL snapshots.
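
The reconciliation and duplicate checks can be expressed as PyTest cases along the lines below; the helper function, column names, and rounding tolerance are illustrative assumptions, not the project's actual test suite.

```python
import pandas as pd

TOLERANCE = 0.01  # currency rounding tolerance for the balancing check

def reconcile(df: pd.DataFrame) -> pd.DataFrame:
    """Sum debits and credits per company for a debits = credits check."""
    return df.groupby("source_company")[["debit_amount", "credit_amount"]].sum()

def test_debits_equal_credits():
    sample = pd.DataFrame({
        "source_company": ["A", "A", "B", "B"],
        "debit_amount":  [100.0, 50.0, 200.0, 0.0],
        "credit_amount": [0.0, 150.0, 0.0, 200.0],
    })
    totals = reconcile(sample)
    assert (totals["debit_amount"] - totals["credit_amount"]).abs().max() <= TOLERANCE

def test_no_duplicate_fact_rows():
    sample = pd.DataFrame({
        "account_id": [1, 2],
        "period_date": ["2025-09-30", "2025-09-30"],
        "source_company": ["A", "B"],
    })
    assert not sample.duplicated(subset=["account_id", "period_date", "source_company"]).any()
```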

8. Monitoring & Maintenance

Monitoring includes Airflow DAG tracking (success/failure rates, retries), Prometheus metrics for Spark executor usage, and Slack notifications for pipeline failures. Data quality reports are generated per cycle. Security controls: role-based access, SSL encryption, anonymized dev data. Scalability is achieved through Spark dynamic allocation and Parquet partitioning by period/company. The system maintains 99.9% reliability and provides transparent audit trails.
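
One way to wire the Slack failure notifications is an Airflow on_failure_callback posting to an incoming webhook, as sketched below; the environment variable name and message format are assumptions.

```python
import os
import requests

def notify_slack_failure(context):
    """Airflow on_failure_callback: post the failing task and run details to Slack."""
    ti = context["task_instance"]
    message = (
        ":red_circle: Ledger pipeline failure\n"
        f"DAG: {ti.dag_id} | Task: {ti.task_id} | Run: {context['run_id']}"
    )
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)

# Usage: attach via operator or DAG default_args, e.g.
# default_args = {"on_failure_callback": notify_slack_failure}
```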

9. Roles & Responsibilities

Total Project Team: 6 members | Methodology: Agile Sprints

  • 🚀 PySpark Developers (3): Build transformations, harmonization logic, SCD-2, loaders
  • 🏗️ DBA: Manage PostgreSQL schemas, indexing, query optimization
  • 📊 Project Manager: Delivery, client communication, sprint planning
  • ⚙️ Airflow Engineer: DAG design, scheduling, monitoring

Tools used: Git, Jenkins, Airflow, Spark, PostgreSQL, Hive