Telecom CDR Warehouse

1. Executive Summary

The Telecom CDR Warehouse is a big data pipeline built on the Hadoop ecosystem to ingest, process, and store more than 10 million call detail records (CDRs) per day. It uses Spark for heavy ETL (duration, routing, and billing metrics), Hive for partitioned warehousing, and Oozie for automated daily workflows. The system supports efficient querying on date and call type and was delivered in about 7 months, on time. The warehouse enables telecom teams to improve billing accuracy, network optimization, and operational analytics on top of a robust, fault-tolerant data platform.

2. Architecture Overview

The solution uses a distributed Hadoop architecture optimized for high-volume CDR processing:

Data Sources: Telecom switches exporting CDRs (CSV/JSON) into landing zones.

Ingestion Layer: Files land into HDFS under /raw/cdrs/ via Flume or direct uploads.

Processing Layer: Apache Spark jobs (on YARN) perform ETL to derive metrics like duration and billing.

Storage Layer: Hive tables over HDFS, partitioned by date and call_type (voice/SMS/data) using ORC format.

Orchestration Layer: Apache Oozie workflows and coordinators schedule and automate daily ETL runs.
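The ingestion layer above lands raw CSV exports from telecom switches. As a minimal sketch of what parsing one such row might look like, assuming a hypothetical column order (`call_id`, `subscriber_id`, `start_time`, `end_time`, `call_type`, `destination`) since real switch exports vary by vendor:

```python
import csv
from io import StringIO

# Hypothetical CDR column order for the raw CSV feed; actual switch
# exports differ by vendor, so treat these field names as an assumption.
CDR_FIELDS = ["call_id", "subscriber_id", "start_time", "end_time",
              "call_type", "destination"]

def parse_cdr_line(line):
    """Parse one raw CSV line from a switch export into a dict."""
    row = next(csv.reader(StringIO(line)))
    if len(row) != len(CDR_FIELDS):
        raise ValueError(f"expected {len(CDR_FIELDS)} fields, got {len(row)}")
    return dict(zip(CDR_FIELDS, row))

record = parse_cdr_line(
    "c1,sub42,2025-05-01T10:00:00,2025-05-01T10:02:30,voice,intl")
```

In the real pipeline this parsing happens inside Spark (e.g., via Spark's CSV reader) rather than line by line; the sketch only illustrates the assumed record shape.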

3. Technology Stack

  • Storage: Hadoop HDFS (Distributed file system)
  • Processing / ETL: Apache Spark (Scala/Python, Spark SQL) on YARN
  • Warehousing / Query: Apache Hive (SQL-like layer over HDFS)
  • Workflow Orchestration: Apache Oozie
  • Resource Management: YARN and ZooKeeper
  • Formats & SerDes: CSV/JSON, Hive SerDe, and ORC for optimized storage

4. Data & Warehousing Model

Raw Layer (HDFS): Directory: /raw/cdrs/. Contains raw files from telecom switches.

Processed Layer (HDFS): Directory: /processed/metrics/. Spark ETL outputs with derived metrics.

Hive Tables: cdrs_raw (external table over the raw layer) and cdrs_processed. Partitioning by date and call_type, combined with ORC storage and bucketing on subscriber_id, enables efficient reporting and ad-hoc queries.
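A plain-Python sketch of how a processed record maps onto this layout: the partition directory is derived from `date` and `call_type`, and a bucket is assigned from `subscriber_id`. The bucket count (16) and the CRC32 hash are illustrative assumptions; Hive uses its own hashing scheme internally.

```python
import zlib

NUM_BUCKETS = 16  # illustrative; the real bucket count is a tuning choice

def partition_path(record):
    """HDFS directory for a record under the processed Hive table."""
    return (f"/processed/metrics/date={record['date']}"
            f"/call_type={record['call_type']}")

def bucket_id(subscriber_id, num_buckets=NUM_BUCKETS):
    """Deterministic bucket assignment for a subscriber (illustrative hash)."""
    return zlib.crc32(subscriber_id.encode()) % num_buckets

path = partition_path({"date": "2025-05-01", "call_type": "sms"})
```

The key property to preserve is determinism: the same subscriber_id must always land in the same bucket so that bucketed joins and sampling work.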

5. ETL Processing

Extract: CDR files delivered to HDFS via Flume. HDFS replication factor (e.g., 3) ensures fault tolerance.

Transform: Spark jobs compute duration (end_time - start_time), billing_amount based on routing (intl vs local), and aggregate totals per subscriber.
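The transform step can be sketched in plain Python (the production version runs as a Spark job). The per-minute tariffs and the `routing` field name are illustrative assumptions; real billing rules would come from the rating system.

```python
from datetime import datetime

# Illustrative tariffs per minute; real rates come from the rating system.
RATE_PER_MIN = {"local": 0.02, "intl": 0.25}

def transform(cdr):
    """Derive duration (seconds) and billing_amount from a parsed CDR."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(cdr["start_time"], fmt)
    end = datetime.strptime(cdr["end_time"], fmt)
    duration = int((end - start).total_seconds())
    rate = RATE_PER_MIN["intl" if cdr["routing"] == "intl" else "local"]
    billing = round(duration / 60 * rate, 4)
    return {**cdr, "duration": duration, "billing_amount": billing}

row = transform({"subscriber_id": "sub42", "routing": "intl",
                 "start_time": "2025-05-01T10:00:00",
                 "end_time": "2025-05-01T10:02:30"})
# 150 seconds at the intl rate -> 2.5 min * 0.25 = 0.625
```

Per-subscriber aggregation then sums `duration` and `billing_amount` grouped by `subscriber_id`, which in Spark SQL is a straightforward `GROUP BY`.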

Load: Transformed data is written to Hive tables using dynamic partitioning. ORC settings are tuned for high-performance retrieval.
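A minimal sketch of what dynamic partitioning does at load time: rows are routed to a (date, call_type) partition derived from the data itself, rather than to partitions named in advance. Hive/Spark handle this natively; the plain-Python version below only illustrates the routing logic.

```python
from collections import defaultdict

def group_by_partition(rows):
    """Route rows to (date, call_type) partitions derived from the data."""
    parts = defaultdict(list)
    for r in rows:
        parts[(r["date"], r["call_type"])].append(r)
    return dict(parts)

rows = [
    {"date": "2025-05-01", "call_type": "voice", "subscriber_id": "a"},
    {"date": "2025-05-01", "call_type": "sms",   "subscriber_id": "b"},
    {"date": "2025-05-02", "call_type": "voice", "subscriber_id": "a"},
]
parts = group_by_partition(rows)
```

Each key in `parts` corresponds to one Hive partition directory (e.g., `date=2025-05-01/call_type=voice`), which is why dynamic partitioning needs no pre-created partition list.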

6. Project Timeline (7 Months)

Project Start: May 1, 2025 | Duration: ~7 months (Delivered on time)

  • May 1 – May 15, 2025: Kickoff and cluster provisioning.
  • May 16 – Jun 1, 2025: Architecture finalization and HDFS layout design.
  • Jun 2 – Jun 30, 2025: Ingestion implementation via Flume/direct uploads.
  • Jul 1 – Aug 15, 2025: Spark ETL development for metrics (duration, billing).
  • Aug 16 – Sep 30, 2025: Hive warehousing setup and ORC tuning.
  • Oct 1 – Oct 31, 2025: Oozie workflow and coordinator configuration.
  • Nov 1 – Nov 30, 2025: Integration testing and production rollout.

7. Testing & Deployment

Testing:

  • Unit tests for individual transformations.
  • End-to-end integration tests from the landing zone to Hive.
  • Performance tests validating 10M+ records processed in under 1 hour.
  • Metrics accuracy verified at ~99%.
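A minimal sketch of what a unit test for the duration transformation might look like; the `call_duration` helper and timestamp format are assumptions, not the project's actual test code.

```python
from datetime import datetime

def call_duration(start_time, end_time, fmt="%Y-%m-%dT%H:%M:%S"):
    """Call duration in whole seconds from ISO-like timestamps."""
    return int((datetime.strptime(end_time, fmt)
                - datetime.strptime(start_time, fmt)).total_seconds())

def test_duration_basic():
    assert call_duration("2025-05-01T10:00:00", "2025-05-01T10:02:30") == 150

def test_duration_zero_length_call():
    assert call_duration("2025-05-01T10:00:00", "2025-05-01T10:00:00") == 0

test_duration_basic()
test_duration_zero_length_call()
```

In practice such tests would run under pytest against the Spark transformation functions, with the distributed pieces exercised separately in the integration suite.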

Deployment: Spark jobs are deployed as packaged JARs and Python files; Oozie workflows are scheduled for daily triggers on the Hadoop cluster.

8. Monitoring & Maintenance

Monitoring: Cluster health via Ambari/Ganglia; YARN/Spark UI for resource utilization; Oozie logs for retry management.

Maintenance: Daily log cleanup, quarterly cluster tuning (memory/executors), and HDFS capacity monitoring. Estimated Cost: ~$1,000/month for a 10-node cluster.

9. Roles & Responsibilities

Methodology: Waterfall with iterative testing for large-scale stability.

  • 🚀 Big Data Architect (1): Architecture design, data model, and performance strategy.
  • ⚙️ Spark Developers (2): ETL implementation and Hive integration.
  • 🛠️ Ops Engineer (1): Cluster provisioning, security (Kerberos), and Oozie operations.