The Smart Document Classifier for Enterprise project delivers an ML-powered solution that automates the categorization of business documents (invoices, purchase orders, contracts, HR records), reducing manual effort by 80% and achieving 95%+ accuracy. It ingests documents through ETL pipelines into a data warehouse, classifies them using TF-IDF with Logistic Regression/SVM or BERT embeddings, exposes predictions via FastAPI REST endpoints, and deploys with Docker for scalability. The system handles unstructured text, integrates secure warehousing (PostgreSQL/Snowflake), and supports compliance requirements; it was completed over roughly three months, from November 2025 to February 2026, as a showcase for client adoption in document management.
The architecture follows a four-component flow. Data ingestion/ETL extracts content from sources (e.g., PDFs via PyPDF2), transforms it (cleaning, metadata extraction), and loads it into warehouse schemas. The ML component classifies the preprocessed text with traditional (TF-IDF + linear classifiers) or advanced (BERT) models. The API layer (FastAPI) provides endpoints for document upload and inference, and the deployment layer runs in Docker with logging-based monitoring. This design targets efficiency for high-volume document streams, security for sensitive data, and integration with enterprise systems and workflows.
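As a minimal sketch of the API layer, the endpoint below shows the upload-to-inference shape; the route name `/classify`, the `Prediction` schema, and the `classify_text` helper are illustrative assumptions, not the project's actual contract.

```python
# Minimal FastAPI sketch of the inference endpoint. Route name, schema,
# and the classify_text helper are illustrative assumptions.
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI(title="Smart Document Classifier")

class Prediction(BaseModel):
    category: str      # e.g., "invoice", "contract"
    confidence: float  # model probability for the predicted class

def classify_text(text: str) -> Prediction:
    # Placeholder: the real service would call the trained
    # TF-IDF + LR/SVM or BERT pipeline loaded at startup.
    return Prediction(category="invoice", confidence=0.97)

@app.post("/classify", response_model=Prediction)
async def classify(file: UploadFile) -> Prediction:
    raw = await file.read()
    # Real ingestion would route PDFs through the ETL text extractor;
    # plain-text uploads are assumed here for brevity.
    return classify_text(raw.decode("utf-8", errors="ignore"))
```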
The system uses Python 3.10+ for development, Scikit-Learn for the TF-IDF and Logistic Regression/SVM models, Hugging Face Transformers for BERT embeddings, FastAPI for the REST API, and Docker for containerization. Additional libraries include Pandas and PyPDF2 for ETL, and Psycopg2/SQLAlchemy for warehousing (PostgreSQL/Snowflake); Pytest and Locust cover testing, Airflow handles ETL orchestration, and ELK/Prometheus provide monitoring.
The classification model uses TF-IDF vectorization with Logistic Regression or SVM (linear kernel, tuned via GridSearchCV) as the baseline, or BERT embeddings (mean pooling over bert-base-uncased) feeding a classifier for more advanced handling. Models are trained on labeled corpora (e.g., RVL-CDIP) with an 80/20 train/test split, tokenization, and stop-word removal, achieving 95% accuracy and F1. Features include text cleaning (lowercasing, whitespace stripping) and metadata (date, sender); outputs are category labels (invoice, contract, etc.) with confidence scores, evaluated via standard metrics and a confusion matrix.
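A sketch of the baseline along these lines, assuming the caller supplies `texts` and `labels` from the labeled corpus; the hyperparameter grid, scoring choice, and function name are illustrative, and the SVM variant would swap in `sklearn.svm.LinearSVC`.

```python
# Baseline sketch: TF-IDF features + Logistic Regression tuned with
# GridSearchCV. texts/labels come from the labeled corpus (e.g.,
# RVL-CDIP); the parameter grid is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

def train_baseline(texts: list[str], labels: list[str]) -> GridSearchCV:
    """Fit the TF-IDF + Logistic Regression baseline with an 80/20 split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42)
    pipeline = Pipeline([
        # lowercasing and stop-word removal mirror the cleaning step
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(
        pipeline,
        param_grid={"tfidf__ngram_range": [(1, 1), (1, 2)],
                    "clf__C": [0.1, 1.0, 10.0]},
        cv=5, scoring="f1_macro")
    search.fit(X_train, y_train)
    # Held-out metrics and per-class breakdown for the evaluation step.
    print(classification_report(y_test, search.predict(X_test)))
    return search
```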
Data processing extracts text from PDFs and emails using PyPDF2, transforms it (cleaning, metadata enrichment, classification calls), and loads the resulting DataFrames into warehouse tables (doc_id, text, category, metadata_json, timestamp) via Psycopg2 or pandas to_sql. The ETL is orchestrated by Python scripts or Airflow, stores raw document blobs in S3, and exposes SQL views/queries for analytics; encryption protects sensitive data, and the pipeline scales to high document volumes.
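A sketch of one ETL load under stated assumptions: the `documents` table name, connection string, and metadata fields are illustrative, while the column names follow the schema above.

```python
# ETL sketch: extract text from a PDF, clean it, attach metadata, and
# load one row into the warehouse. Table name and connection string
# are illustrative; columns follow the schema in the text.
import json
from datetime import datetime, timezone

import pandas as pd
from PyPDF2 import PdfReader
from sqlalchemy import create_engine

def extract_text(pdf_path: str) -> str:
    """Concatenate extracted text across all pages of a PDF."""
    reader = PdfReader(pdf_path)
    return " ".join(page.extract_text() or "" for page in reader.pages)

def load_document(pdf_path: str, doc_id: str, category: str) -> None:
    text = extract_text(pdf_path).lower().strip()  # basic cleaning
    row = pd.DataFrame([{
        "doc_id": doc_id,
        "text": text,
        "category": category,  # output of the classification call
        "metadata_json": json.dumps({"source": pdf_path}),
        "timestamp": datetime.now(timezone.utc),
    }])
    engine = create_engine("postgresql+psycopg2://user:pass@host/docs")
    row.to_sql("documents", engine, if_exists="append", index=False)
```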
Testing includes unit tests (Pytest for ETL, models, and API), integration tests covering the full flow from ingestion to classification, accuracy checks (95% on benchmarks), and load tests (Locust for concurrency). Deployment builds and runs Docker images (python:3.10-slim base, uvicorn server), orchestrates with Kubernetes for scaling, follows a phased rollout with JWT authentication and encryption, and supports rollback via container versioning if issues arise.
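A sketch of one API unit test with Pytest and FastAPI's TestClient, assuming the app from the endpoint sketch above lives at a hypothetical `app.main` module; the payload and category set are illustrative.

```python
# Pytest sketch for the API layer using FastAPI's TestClient.
# The app.main module path and expected categories are assumptions.
from fastapi.testclient import TestClient

from app.main import app  # hypothetical module path

client = TestClient(app)

def test_classify_returns_category_and_confidence():
    response = client.post(
        "/classify",
        files={"file": ("invoice.txt", b"invoice no. 123 total due 500")})
    assert response.status_code == 200
    body = response.json()
    assert body["category"] in {"invoice", "po", "contract", "hr"}
    assert 0.0 <= body["confidence"] <= 1.0
```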
Post-deployment, accuracy and errors are monitored via ELK logs and Prometheus metrics, alongside ETL runs and API uptime, targeting >99% availability and low latency. Maintenance includes quarterly retraining on new data, monthly security/compliance audits, and cost controls (auto-scaling), with alerts on classification failures to trigger reviews.
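A sketch of the metrics side with prometheus_client, assuming hypothetical metric names and a stubbed model call; alert rules on the failure counter would live in Prometheus itself.

```python
# Monitoring sketch with prometheus_client: count classifications and
# failures, and track inference latency. Metric names are illustrative.
from prometheus_client import Counter, Histogram, start_http_server

CLASSIFICATIONS = Counter(
    "classifier_requests_total", "Classification requests", ["category"])
FAILURES = Counter(
    "classifier_failures_total", "Classification failures")
LATENCY = Histogram(
    "classifier_latency_seconds", "Inference latency in seconds")

def classify_text(text: str) -> str:
    # Stub standing in for the real model call.
    return "invoice"

def classify_with_metrics(text: str) -> str:
    with LATENCY.time():
        try:
            category = classify_text(text)
        except Exception:
            FAILURES.inc()  # alert rules would fire on this counter
            raise
    CLASSIFICATIONS.labels(category=category).inc()
    return category

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus scraping
```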