The Social Media Sentiment ETL Engine is a real-time data pipeline that monitors sentiment for multiple brands across platforms like Twitter (X) and Reddit. It ingests live streams via Kafka, applies NLP-based sentiment analysis in Python, aggregates trends with Spark Structured Streaming, and stores both raw and aggregated results in MongoDB for dashboard visualization. The system handles up to 10,000 events per minute, delivers near real-time (<5s) end-to-end latency, and supports historical trend analysis for brand monitoring. The project was delivered over 8 months, ahead of an original 10-month estimate, with strong error handling, monitoring, and production-ready deployment.
The architecture is a streaming, microservices-style pipeline with separate layers for ingestion, processing, storage, and visualization:
Data Sources: Twitter and Reddit APIs (streaming endpoints, English-language posts mentioning configured brands).
Ingestion Layer: Kafka producers read the API streams and publish JSON events into topics such as twitter-stream and reddit-stream.
Processing Layer: Python consumers apply NLP sentiment scoring (e.g., VADER); Spark Structured Streaming performs windowed aggregation (e.g., hourly/daily averages per brand).
Storage Layer: MongoDB collections store raw sentiment scores and aggregated metrics.
Presentation Layer: A web dashboard (Streamlit / Flask / Dash) queries MongoDB and visualizes brand sentiment trends over time.
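The sentiment step in the processing layer can be sketched with a toy lexicon-based scorer. The real pipeline uses VADER; the tiny lexicon and normalization below are simplified stand-ins for illustration only, not VADER's actual algorithm:

```python
# Toy lexicon-based sentiment scorer illustrating the shape of the
# processing step. The lexicon and normalization are simplified
# stand-ins for VADER, not its real implementation.
import math

TOY_LEXICON = {
    "love": 3.0, "great": 2.5, "good": 1.5,
    "bad": -1.5, "terrible": -2.5, "hate": -3.0,
}

def score_text(text: str) -> float:
    """Return a sentiment score normalized into [-1, +1]."""
    raw = sum(TOY_LEXICON.get(w, 0.0) for w in text.lower().split())
    # Squash the unbounded raw sum into [-1, +1], similar in spirit
    # to VADER's compound-score normalization.
    return raw / math.sqrt(raw * raw + 15.0)

def analyze_event(event: dict) -> dict:
    """Attach a sentiment_score field to a raw JSON event."""
    return {**event, "sentiment_score": round(score_text(event["text"]), 4)}
```

In production, `score_text` would be replaced by VADER's `SentimentIntensityAnalyzer`, and `analyze_event` would run inside the Kafka consumer loop before the event is handed to Spark.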
Raw Sentiment Layer (MongoDB): Stores individual analyzed posts with fields like: { text, brand, timestamp, sentiment_score, platform, metadata }.
Aggregated Layer (MongoDB): Stores time-windowed metrics: { brand, timestamp, avg_sentiment, aggregate_count }.
Logical "Warehouse": MongoDB acts as a semi-structured warehouse in which sentiment scores play the role of facts, while brand, platform, and time window serve as dimensions. Indexes on brand and timestamp keep dashboard queries fast.
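The two document shapes above can be sketched with plain-Python builders (the helper names are hypothetical; the field names follow the layers described):

```python
# Builders for the two MongoDB layers described above.
# Helper names are illustrative, not the project's actual code.
from datetime import datetime, timezone

def raw_doc(text, brand, platform, sentiment_score, metadata=None):
    """Build a raw-layer document: one analyzed post."""
    return {
        "text": text,
        "brand": brand,
        "platform": platform,
        "timestamp": datetime.now(timezone.utc),
        "sentiment_score": sentiment_score,
        "metadata": metadata or {},
    }

def aggregate_docs(raw_docs, window_start):
    """Collapse raw documents into one aggregated-layer doc per brand."""
    by_brand = {}
    for d in raw_docs:
        by_brand.setdefault(d["brand"], []).append(d["sentiment_score"])
    return [
        {
            "brand": brand,
            "timestamp": window_start,
            "avg_sentiment": sum(scores) / len(scores),
            "aggregate_count": len(scores),
        }
        for brand, scores in sorted(by_brand.items())
    ]
```

With PyMongo, the compound index backing the dashboard queries would be created with `collection.create_index([("brand", 1), ("timestamp", -1)])`.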
Extract: Real-time pull from Twitter and Reddit APIs; Kafka producers publish JSON events (text + brand + metadata).
Transform: Python consumers apply VADER sentiment scoring (-1 to +1); Spark Structured Streaming performs windowed aggregations with watermarking for late data.
Load: Spark writes raw and aggregated outputs into MongoDB using the MongoDB Spark Connector. TTL indexes handle raw data retention (e.g., 30 days) while aggregates are kept for historical analysis.
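The watermarking behavior in the Transform step can be illustrated in plain Python: events older than the watermark (the maximum event time seen minus the allowed lateness) are dropped rather than aggregated. This is a simplified single-process model of what Spark Structured Streaming does, not the pipeline's actual code:

```python
# Simplified model of Spark's watermarked tumbling-window average.
# Illustrative only; the real pipeline uses Structured Streaming.
from datetime import datetime, timedelta

class WindowedAverager:
    def __init__(self, window: timedelta, lateness: timedelta):
        self.window = window
        self.lateness = lateness
        self.max_event_time = datetime.min
        self.state = {}  # (brand, window_start) -> [sum, count]

    def add(self, brand: str, event_time: datetime, score: float) -> bool:
        """Ingest one event; return False if dropped as too late."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.lateness
        if event_time < watermark:
            return False  # late beyond the watermark: discarded
        # Truncate event time down to its tumbling-window start.
        width = int(self.window.total_seconds())
        start = datetime.fromtimestamp(int(event_time.timestamp()) // width * width)
        key = (brand, start)
        acc = self.state.setdefault(key, [0.0, 0])
        acc[0] += score
        acc[1] += 1
        return True

    def averages(self):
        """Per-(brand, window) average sentiment, as in the aggregated layer."""
        return {k: s / c for k, (s, c) in self.state.items()}
```

On the Load side, a 30-day TTL for the raw collection corresponds to a PyMongo index such as `raw.create_index("timestamp", expireAfterSeconds=30 * 24 * 3600)`.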
Project Start: January 1, 2025 | Duration: 8 Months (Delivered ahead of schedule)
Testing: Unit Tests with Pytest; Performance Tests (JMeter) validating <5s latency at 10,000 events/min; Sentiment accuracy verified at ~85% via manual review.
Deployment: Local development via Docker Compose; Production rollout on Kubernetes with rolling updates and GitHub Actions for CI/CD pipelines.
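A local development stack of the kind Docker Compose would run could look like the following sketch (image tags, ports, and service names are assumptions, not the project's actual compose file):

```yaml
# Illustrative docker-compose.yml for local development.
services:
  zookeeper:
    image: bitnami/zookeeper:3.9
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
  kafka:
    image: bitnami/kafka:3.6
    depends_on: [zookeeper]
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
    ports:
      - "9092:9092"
  mongodb:
    image: mongo:7
    ports:
      - "27017:27017"
```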
Monitoring: Prometheus/Grafana or ELK stack for alerts on high error rates or Kafka lag.
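A Kafka-lag alert of the kind described could be expressed as a Prometheus rule like this sketch (the metric name assumes the common kafka_exporter; the threshold and labels are placeholders):

```yaml
# Illustrative Prometheus alerting rule for Kafka consumer lag.
groups:
  - name: sentiment-pipeline
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum(kafka_consumergroup_lag) by (consumergroup) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lag above 10k messages"
```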
Maintenance: Weekly MongoDB backups, configuration-driven brand management via YAML, and auto-scaling for Kafka/Spark clusters. Estimated Cost: ~$500/month on cloud infrastructure.
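The configuration-driven brand management could take a shape like the following YAML sketch (keys and values are assumptions about the described config, not the project's actual file):

```yaml
# Illustrative brands.yml for configuration-driven brand management.
brands:
  - name: acme
    keywords: ["acme", "#acme"]
    platforms: [twitter, reddit]
  - name: globex
    keywords: ["globex"]
    platforms: [twitter]
```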
Methodology: Agile with 2-week sprints, regular sprint demos, and ongoing risk management.