Social Media Sentiment ETL Engine

1. Executive Summary

The Social Media Sentiment ETL Engine is a real-time data pipeline that monitors sentiment for multiple brands across platforms such as Twitter (X) and Reddit. It ingests live streams via Kafka, applies NLP-based sentiment analysis in Python, aggregates trends with Spark Structured Streaming, and stores both raw and aggregated results in MongoDB for dashboard visualization. The system handles up to 10,000 events per minute, achieves near real-time end-to-end latency (under 5 seconds), and supports historical trend analysis for brand monitoring. The project was delivered in 8 months, ahead of the original 10-month estimate, with robust error handling, monitoring, and a production-ready deployment.

2. Architecture Overview

The architecture is a streaming, microservices-style pipeline with separate layers for ingestion, processing, storage, and visualization:

Data Sources: Twitter and Reddit APIs (streaming endpoints, English-language posts mentioning configured brands).

Ingestion Layer: Kafka producers consume API streams and publish JSON events into topics like twitter-stream and reddit-stream.

Processing Layer: Python consumers apply NLP sentiment scoring (e.g., VADER); Spark Structured Streaming performs windowed aggregation (e.g., hourly/daily averages per brand).

Storage Layer: MongoDB collections store raw sentiment scores and aggregated metrics.

Presentation Layer: A web dashboard (Streamlit / Flask / Dash) queries MongoDB and visualizes brand sentiment trends over time.
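Concretely, the ingestion-to-storage hop can be sketched as a single consumer loop. This is a minimal illustration, not the production code: the broker address, database and collection names, and the choice to score inside the consumer are assumptions, and it requires kafka-python, vaderSentiment, and pymongo plus running Kafka and MongoDB services.

```python
import json

def run_consumer(topic: str = "twitter-stream") -> None:
    """Consume raw events, attach a VADER compound score, persist to MongoDB."""
    # Third-party imports kept local so the sketch reads top to bottom.
    from kafka import KafkaConsumer
    from pymongo import MongoClient
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    raw = MongoClient("mongodb://localhost:27017").sentiment.sentiment_raw
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b),
    )
    for message in consumer:
        event = message.value
        # VADER's compound score is already normalized to [-1, +1].
        event["sentiment_score"] = analyzer.polarity_scores(event["text"])["compound"]
        raw.insert_one(event)
```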

3. Technology Stack

  • Language: Python 3.x
  • Streaming / Messaging: Apache Kafka
  • Processing & Aggregation: Apache Spark Structured Streaming
  • NLP / Sentiment: NLTK / TextBlob / VADER (vaderSentiment)
  • Database / Warehouse: MongoDB (sentiment_raw, sentiment_aggregates)
  • APIs / SDKs: Tweepy (Twitter), PRAW (Reddit), kafka-python, pymongo
  • Infrastructure & Ops: Docker, Kubernetes, Prometheus + Grafana / ELK
  • CI/CD: GitHub Actions

4. Data & Warehousing Model

Raw Sentiment Layer (MongoDB): Stores individual analyzed posts with fields like: { text, brand, timestamp, sentiment_score, platform, metadata }.

Aggregated Layer (MongoDB): Stores time-windowed metrics: { brand, timestamp, avg_sentiment, aggregate_count }.

Logical "Warehouse": MongoDB acts as a semi-structured warehouse, with sentiment scores as fact-like records and brand, platform, and time window as dimension-like attributes. Compound indexes on brand and timestamp keep dashboard queries fast.
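To make the two layers concrete, here are illustrative documents and an index specification. The field names follow the schemas above; the values and the index ordering are assumptions.

```python
from datetime import datetime, timezone

# Illustrative raw-layer document (one analyzed post).
raw_doc = {
    "text": "Loving the new release from BrandX!",
    "brand": "BrandX",
    "timestamp": datetime(2025, 3, 1, 12, 30, tzinfo=timezone.utc),
    "sentiment_score": 0.72,  # VADER compound score in [-1, +1]
    "platform": "twitter",
    "metadata": {"post_id": "abc123", "lang": "en"},
}

# Illustrative aggregated-layer document (one hourly window per brand).
agg_doc = {
    "brand": "BrandX",
    "timestamp": datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc),
    "avg_sentiment": 0.41,
    "aggregate_count": 182,
}

# Compound index spec for dashboard queries; with pymongo this would be
# passed as db.sentiment_aggregates.create_index(AGG_INDEX).
AGG_INDEX = [("brand", 1), ("timestamp", -1)]
```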

5. ETL Processing

Extract: Real-time pull from Twitter and Reddit APIs; Kafka producers publish JSON events (text + brand + metadata).
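A minimal sketch of the producer side. Topic naming follows the twitter-stream/reddit-stream convention above; the broker address is an assumption, and kafka-python is required for the publish step.

```python
import json
from datetime import datetime, timezone

def build_event(text: str, brand: str, platform: str, **metadata) -> bytes:
    """Package a matched post as a JSON Kafka event (text + brand + metadata)."""
    event = {
        "text": text,
        "brand": brand,
        "platform": platform,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,
    }
    return json.dumps(event).encode("utf-8")

def publish(event: bytes, topic: str = "twitter-stream") -> None:
    # Sketch only: in production a single long-lived producer is reused.
    from kafka import KafkaProducer  # requires kafka-python and a running broker
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send(topic, event)
    producer.flush()
```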

Transform: Python consumers apply VADER sentiment scoring (compound scores in the range -1 to +1); Spark Structured Streaming performs windowed aggregations, using watermarking to handle late-arriving data.
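The aggregation itself runs in Spark (a groupBy over a one-hour window with a watermark); as a pure-Python sketch of the same hourly-window semantics, assuming events arrive as (brand, timestamp, score) tuples:

```python
from collections import defaultdict
from datetime import datetime, timezone

def hourly_averages(events):
    """Bucket (brand, timestamp, score) events into hourly windows and average.

    Mirrors what the Spark job does with groupBy(window("timestamp", "1 hour"));
    late-data handling via watermarks is Spark's job, not shown here.
    """
    buckets = defaultdict(list)
    for brand, ts, score in events:
        window_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(brand, window_start)].append(score)
    return {
        key: {"avg_sentiment": sum(scores) / len(scores),
              "aggregate_count": len(scores)}
        for key, scores in buckets.items()
    }
```

In the Spark job, the same grouping is expressed with `window(col("timestamp"), "1 hour")` plus `withWatermark`, which also bounds the streaming state kept for late events.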

Load: Spark writes raw and aggregated outputs into MongoDB using the MongoDB Spark Connector. TTL indexes handle raw data retention (e.g., 30 days) while aggregates are kept for historical analysis.
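The retention rule can be stated as a TTL index on the raw collection; a sketch, using the collection names from the model above (note that TTL indexes require the timestamp field to be stored as a BSON date):

```python
# 30-day retention for sentiment_raw via a MongoDB TTL index.
RAW_TTL_SECONDS = 30 * 24 * 60 * 60  # 2,592,000 s

# With pymongo this becomes (sketch, not run here):
#   db.sentiment_raw.create_index("timestamp", expireAfterSeconds=RAW_TTL_SECONDS)
# sentiment_aggregates gets plain (non-TTL) indexes only, so historical
# trend data is preserved indefinitely.
```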

6. Project Timeline (8 Months)

Project Start: January 1, 2025 | Duration: 8 Months (Delivered ahead of schedule)

  • Jan 1 – Jan 15, 2025 — Kickoff: Requirements gathering and metrics definition.
  • Jan 16 – Feb 1, 2025 — Design: Architecture, Kafka topics, and MongoDB schemas.
  • Feb 2 – Feb 28, 2025 — Ingestion: Kafka setup and Twitter/Reddit producers.
  • Mar 1 – Apr 15, 2025 — Processing: NLP services and Spark Streaming jobs.
  • Apr 16 – May 31, 2025 — Storage & Dashboard: MongoDB integration and Streamlit/Flask app.
  • Jun 1 – Jul 15, 2025 — Testing: Performance and end-to-end tests.
  • Jul 16 – Aug 1, 2025 — Deployment: Production rollout via Kubernetes.

7. Testing & Deployment

Testing: Unit tests with Pytest; performance tests with JMeter validating the <5s latency target at 10,000 events/min; sentiment accuracy of ~85% verified via manual review.
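A sketch of what a Pytest unit test for the scoring step might look like. The `score_text` helper here is a hypothetical stand-in for the consumers' VADER wrapper, included only so the test file is self-contained:

```python
# test_sentiment.py — run with `pytest`.

def score_text(text: str) -> float:
    """Stub scorer; the real wrapper would return VADER's compound score."""
    positive, negative = {"love", "great"}, {"hate", "awful"}
    words = set(text.lower().split())
    return (len(words & positive) - len(words & negative)) / max(len(words), 1)

def test_score_is_bounded():
    # Scores must stay in VADER's [-1, +1] range for any input.
    for text in ["I love this brand", "awful experience", "neutral words only"]:
        assert -1.0 <= score_text(text) <= 1.0

def test_polarity_direction():
    assert score_text("I love this") > 0
    assert score_text("this is awful") < 0
```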

Deployment: Local development via Docker Compose; Production rollout on Kubernetes with rolling updates and GitHub Actions for CI/CD pipelines.
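A minimal Docker Compose sketch for local development (image tags, ports, and service names are illustrative assumptions, not the project's actual compose file):

```yaml
services:
  kafka:
    image: bitnami/kafka:latest   # illustrative; any Kafka distribution works
    ports: ["9092:9092"]
  mongo:
    image: mongo:7
    ports: ["27017:27017"]
  pipeline:
    build: .                      # the Python consumers / Spark driver
    depends_on: [kafka, mongo]
```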

8. Monitoring & Maintenance

Monitoring: Prometheus/Grafana or ELK stack for alerts on high error rates or Kafka lag.

Maintenance: Weekly MongoDB backups, configuration-driven brand management via YAML, and auto-scaling for Kafka/Spark clusters. Estimated Cost: ~$500/month on cloud infrastructure.
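A sketch of what the YAML brand configuration might look like (keys and values are illustrative assumptions, not the project's actual schema):

```yaml
brands:
  - name: BrandX
    keywords: ["brandx", "#brandx"]
    platforms: [twitter, reddit]
  - name: BrandY
    keywords: ["brandy"]
    platforms: [twitter]
```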

9. Roles & Responsibilities

Methodology: Agile with 2-week sprints, regular sprint demos, and ongoing risk management.

  • 🚀 Lead Developer (1): Architecture, Spark jobs, and Kafka integration.
  • ⚙️ Data Engineers (2): Ingestion pipelines, NLP consumers, and MongoDB integration.
  • 🧠 Data Scientist (1): NLP model tuning and sentiment accuracy validation.