The Telecom CDR Warehouse is a big data pipeline built on the Hadoop ecosystem to ingest, process, and store millions of daily call detail records (CDRs). It uses Spark for heavy ETL (duration, routing, billing metrics), Hive for partitioned warehousing, and Oozie for automated daily workflows. The system scales to 10+ million CDRs/day, supports efficient querying on date and call type, and was delivered in about 7 months, on time. This warehouse enables telecom teams to improve billing accuracy, network optimization, and operational analytics on top of a robust, fault-tolerant data platform.
The solution uses a distributed Hadoop architecture optimized for high-volume CDR processing:
Data Sources: Telecom switches exporting CDRs (CSV/JSON) into landing zones.
Ingestion Layer: CDR files land in HDFS under /raw/cdrs/ via Apache Flume or direct uploads.
Processing Layer: Apache Spark jobs (on YARN) perform ETL to derive metrics like duration and billing.
Storage Layer: Hive tables over HDFS, partitioned by date and call_type (voice/SMS/data) using ORC format.
Orchestration Layer: Apache Oozie workflows and coordinators schedule and automate daily ETL runs.
Raw Layer (HDFS): Directory: /raw/cdrs/. Contains raw files from telecom switches.
Processed Layer (HDFS): Directory: /processed/metrics/. Spark ETL outputs with derived metrics.
Hive Tables: cdrs_raw (external table over the raw layer) and cdrs_processed. Partitioning by date and call_type, ORC storage, and bucketing on subscriber_id together enable efficient reporting and ad-hoc queries.
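A minimal sketch of the two tables described above; the bucket count and any columns beyond those named in the text (subscriber_id, start_time, end_time, call_type) are assumptions, not the project's actual schema.

```sql
-- Raw external table over the landing zone (schema is illustrative).
CREATE EXTERNAL TABLE cdrs_raw (
  subscriber_id STRING,
  start_time    TIMESTAMP,
  end_time      TIMESTAMP,
  routing       STRING
)
PARTITIONED BY (dt STRING, call_type STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/raw/cdrs/';

-- Processed table: ORC storage, partitioned by date and call type,
-- bucketed on subscriber_id (bucket count 32 is an assumed tuning value).
CREATE TABLE cdrs_processed (
  subscriber_id  STRING,
  duration_sec   BIGINT,
  billing_amount DECIMAL(10,2)
)
PARTITIONED BY (dt STRING, call_type STRING)
CLUSTERED BY (subscriber_id) INTO 32 BUCKETS
STORED AS ORC;
```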
Extract: CDR files are delivered to HDFS via Flume. An HDFS replication factor of 3 (the default) provides fault tolerance against node failures.
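The Flume leg of the extract step could look like the agent sketch below; the agent name, local spool directory, and channel sizing are all illustrative assumptions, not the project's actual configuration.

```properties
# Minimal Flume agent sketch: a spooling-directory source watches the
# switch export folder and an HDFS sink writes into the raw landing zone.
agent.sources = cdr-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

agent.sources.cdr-src.type = spooldir
agent.sources.cdr-src.spoolDir = /var/landing/cdrs
agent.sources.cdr-src.channels = mem-ch

agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = /raw/cdrs/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs-sink.channel = mem-ch
```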
Transform: Spark jobs compute call duration (end_time - start_time), billing_amount based on routing class (international vs. local), and aggregated totals per subscriber.
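The per-record transform logic can be sketched in plain Python; the rate table, field names, and timestamp format here are hypothetical stand-ins, since the real rules run inside Spark over the full CDR set.

```python
from datetime import datetime

# Assumed per-minute tariffs -- the actual rating rules are not specified.
RATES_PER_MIN = {"local": 0.05, "intl": 0.5}

def derive_metrics(record: dict) -> dict:
    """Compute duration and billing_amount for a single CDR."""
    fmt = "%Y-%m-%d %H:%M:%S"
    start = datetime.strptime(record["start_time"], fmt)
    end = datetime.strptime(record["end_time"], fmt)
    duration_sec = int((end - start).total_seconds())
    rate = RATES_PER_MIN[record["routing"]]
    billing_amount = round(duration_sec / 60 * rate, 2)
    return {**record,
            "duration_sec": duration_sec,
            "billing_amount": billing_amount}

cdr = {"subscriber_id": "S1", "routing": "intl",
       "start_time": "2025-05-01 10:00:00",
       "end_time": "2025-05-01 10:03:00"}
print(derive_metrics(cdr)["duration_sec"])    # 180
print(derive_metrics(cdr)["billing_amount"])  # 1.5
```

In the real pipeline the same logic would be expressed as Spark column expressions or a mapped function so it parallelizes across the cluster.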
Load: Transformed data is written to Hive tables using dynamic partitioning. ORC settings are tuned for high-performance retrieval.
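Dynamic partitioning has to be enabled per Hive session before such a load; the statements below are a sketch, and the staging table name and column list are illustrative.

```sql
-- Required session settings for a dynamic-partition insert.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns (dt, call_type) come last in the SELECT so Hive
-- routes each row to its partition automatically.
INSERT OVERWRITE TABLE cdrs_processed PARTITION (dt, call_type)
SELECT subscriber_id, duration_sec, billing_amount, dt, call_type
FROM cdrs_staging;
```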
Project Start: May 1, 2025 | Duration: ~7 months (Delivered on time)
Testing: Unit tests for transformations; end-to-end integration tests from the landing zone to Hive; performance tests validating 10M+ records processed in under 1 hour; metric accuracy verified at ~99%.
Deployment: Spark jobs deployed as packaged JARs/Python files; Oozie workflows scheduled to trigger daily on the Hadoop cluster.
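The daily trigger maps naturally onto an Oozie coordinator; this is a minimal sketch, and the app name, start/end window, and workflow path are assumed values.

```xml
<!-- Daily coordinator that launches the CDR ETL workflow; all names
     and paths here are illustrative, not the project's actual config. -->
<coordinator-app name="cdr-daily-etl" frequency="${coord:days(1)}"
                 start="2025-05-01T02:00Z" end="2026-05-01T02:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${nameNode}/apps/cdr-etl/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```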
Monitoring: Cluster health via Ambari/Ganglia; YARN/Spark UI for resource utilization; Oozie logs for retry management.
Maintenance: Daily log cleanup, quarterly cluster tuning (memory/executors), and HDFS capacity monitoring. Estimated Cost: ~$1,000/month for a 10-node cluster.
Methodology: Waterfall with iterative testing for large-scale stability.