An end-to-end Machine Learning web application for intelligent spam detection
Live Demo • Documentation • Report Bug • Request Feature
- Overview
- Features
- Live Demo
- Machine Learning Models
- Technology Stack
- Installation
- Usage
- Project Structure
- Model Performance
- Deployment
- Future Enhancements
- Contributing
- Author
- License
- Acknowledgements
The Spam Detection System is a comprehensive machine learning application that classifies text messages and emails as either Spam or Ham (legitimate). Built with a focus on demonstrating the complete ML lifecycle, this project encompasses:
- Data Exploration & Analysis
- Text Preprocessing & NLP
- Feature Engineering
- Model Training & Evaluation
- Production Deployment
This application showcases real-world implementation of multiple ML algorithms with an interactive web interface, making it ideal for understanding practical machine learning workflows.
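As a rough illustration of the text preprocessing stage, here is a minimal sketch using only the standard library. The stop-word set, the regular expressions, and the `preprocess` function name are assumptions made for this example; the project's own pipeline relies on NLTK and SpaCy and may differ in every step.

```python
import re

# Hypothetical stop-word set for illustration only; a real pipeline would
# more likely take its list from NLTK or SpaCy.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "for", "at",
              "we", "you", "your", "have", "here"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text: str) -> list[str]:
    """Lowercase, strip URLs, keep alphabetic tokens, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove links before tokenizing
    tokens = TOKEN_RE.findall(text)     # digits and punctuation are discarded
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("URGENT! You have won $1,000,000! Click here to claim your prize NOW!"))
# -> ['urgent', 'won', 'click', 'claim', 'prize', 'now']
```

Discarding digits and punctuation keeps the sketch short, but such characters can themselves be useful spam signals, so a production pipeline might count them as features instead.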
- Real-time Spam Classification - Instant predictions on user-input text
- Multiple ML Models - Compare predictions from 4 different algorithms
- Performance Metrics - Comprehensive evaluation with accuracy, precision, recall, and F1-score
- Visual Analytics - Interactive charts for model comparison and feature distributions
- Optimized Performance - Cached model loading for fast responses
- Advanced NLP preprocessing pipeline
- Serialized model persistence for efficient deployment
- Clean, intuitive Streamlit UI
- Cloud-ready with version-controlled dependencies
- Production-grade error handling and validation
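The cached model loading and serialized persistence mentioned above could look roughly like the sketch below. It uses `functools.lru_cache` so that it runs without Streamlit; inside a Streamlit app the `st.cache_resource` decorator serves the same purpose. Whether `spam_detection_ui.py` uses either mechanism is an assumption, and `load_model` is a hypothetical helper name.

```python
import pickle
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_model(path: str):
    """Unpickle a model once; later calls with the same path reuse the cached object."""
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with Path(path).open("rb") as fh:
        return pickle.load(fh)
```

A call such as `load_model("logistic_regression_saga_model.pkl")` would read the file on first use only, so repeated predictions avoid the cost of deserializing the model again.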
Experience the application in action:
Streamlit Cloud Deployment
Try these sample inputs:
Spam Example:
URGENT! You have won $1,000,000! Click here to claim your prize NOW!
Ham Example:
Hey, are we still meeting for coffee tomorrow at 3pm?
The system employs four distinct machine learning models, each optimized for text classification:
| Model | Algorithm | File | Key Strength |
|---|---|---|---|
| Logistic Regression | SAGA Solver | logistic_regression_saga_model.pkl | Fast training, interpretable coefficients |
| Linear SVC | Calibrated Classifier | linearsvc_calibrated_model.pkl | Excellent for high-dimensional text data |
| Random Forest | Ensemble Method | random_forest_model.pkl | Robust to overfitting, feature importance |
| Neural Network | MLP Classifier | neural_network_mlp_model.pkl | Captures complex non-linear patterns |
- preprocessed_data.pkl - Cleaned and processed training dataset
- model_results.pkl - Performance metrics for all models
- feature_distributions.png - Visualization of feature importance
- model_comparison.png - Comparative analysis charts
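Comparing the four models on the same input can be sketched as follows. The sketch assumes each loaded `.pkl` file is a fitted scikit-learn estimator exposing `predict()` and that all models accept the same feature rows; the vectorization step is omitted because this README does not describe it. `compare_predictions` and the `AlwaysSpam` stand-in are hypothetical names used only to keep the example runnable.

```python
from typing import Any, Mapping, Sequence


def compare_predictions(models: Mapping[str, Any],
                        features: Sequence[Any]) -> dict[str, list]:
    """Run every model on the same feature rows and collect the predicted labels."""
    results = {}
    for name, model in models.items():
        # Fitted scikit-learn estimators return one label per input row.
        results[name] = list(model.predict(features))
    return results


class AlwaysSpam:
    """Stand-in for a fitted estimator, used only to make the sketch runnable."""

    def predict(self, rows):
        return ["spam" for _ in rows]


print(compare_predictions({"stub": AlwaysSpam()}, [[0.1], [0.2]]))
# -> {'stub': ['spam', 'spam']}
```

In the real application the `models` mapping would hold the four unpickled estimators keyed by display name, so disagreements between them are visible side by side.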
| Category | Technologies |
|---|---|
| Language | Python 3.10.8 |
| ML Framework | scikit-learn 1.2.2 |
| NLP | NLTK, SpaCy |
| Data Processing | NumPy, Pandas, SciPy |
| Visualization | Matplotlib, Seaborn, Plotly |
| Web Framework | Streamlit |
| Deployment | Streamlit Cloud |
- Python 3.10.8
- Anaconda/Miniconda (recommended) or pip
- Git
# Clone the repository
git clone https://github.com/sayendranadh/spam_project.git
cd spam_project
# Create a new conda environment
conda create -n spam_env python=3.10.8 -y
# Activate the environment
conda activate spam_env
# Install dependencies
pip install -r requirements.txt

Alternatively, with pip and venv:

# Clone the repository
git clone https://github.com/sayendranadh/spam_project.git
cd spam_project
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# Ensure your environment is activated
conda activate spam_env # or: source venv/bin/activate
# Launch the Streamlit application
streamlit run spam_detection_ui.py

The application will open automatically in your default browser at http://localhost:8501.
- Enter Text: Type or paste the message you want to classify
- Select Model: Choose from the available ML models (or compare all)
- Get Prediction: Click "Classify" to see results
- View Metrics: Explore performance statistics and visualizations
To retrain models with your own data:
# Step 1: Exploratory Data Analysis
python step1_data_exploration.py
# Step 2: Feature Engineering
python step2_feature_engineering.py
# Step 3: Data Preprocessing
python step3_preprocessing.py
# Step 4: Model Training
python step4_model_training.py
# Run the UI with new models
streamlit run spam_detection_ui.py

spam_project/
│
├── spam_detection_ui.py                  # Main Streamlit application (ENTRY POINT)
│
├── ML Pipeline Scripts
│   ├── step1_data_exploration.py         # EDA and data insights
│   ├── step2_feature_engineering.py      # Text preprocessing & feature extraction
│   ├── step3_preprocessing.py            # Data cleaning pipeline
│   └── step4_model_training.py           # Model training & evaluation
│
├── Utilities
│   └── fix_preprocessed_data.py          # Data consistency fixes
│
├── Model Artifacts (*.pkl)
│   ├── logistic_regression_saga_model.pkl
│   ├── linearsvc_calibrated_model.pkl
│   ├── random_forest_model.pkl
│   ├── neural_network_mlp_model.pkl
│   ├── preprocessed_data.pkl
│   └── model_results.pkl
│
├── Visualizations
│   ├── model_comparison.png              # Model performance charts
│   └── feature_distributions.png         # Feature importance plots
│
├── Configuration Files
│   ├── requirements.txt                  # Python dependencies
│   └── runtime.txt                       # Python version for deployment
│
└── README.md                             # Project documentation
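The four pipeline scripts in the tree above are meant to run in order. A small driver like the one below could chain them and stop at the first failure. This is a sketch, not part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, and the scripts are assumed to take no command-line arguments.

```python
import subprocess
import sys

# Script names as listed in the project structure, in execution order.
PIPELINE_SCRIPTS = [
    "step1_data_exploration.py",
    "step2_feature_engineering.py",
    "step3_preprocessing.py",
    "step4_model_training.py",
]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    """Return one argv list per pipeline step, using the given interpreter."""
    return [[python, script] for script in PIPELINE_SCRIPTS]


def run_pipeline() -> None:
    """Run each step in order; a non-zero exit code aborts the remaining steps."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling `run_pipeline()` from the repository root, inside the activated environment, would then regenerate the model artifacts before the UI is restarted.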
All models are evaluated using standard classification metrics:
- Accuracy - Overall prediction correctness
- Precision - Spam prediction reliability
- Recall - Spam detection coverage
- F1-Score - Harmonic mean of precision and recall
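To make the four metrics concrete, the sketch below computes them from scratch for a binary spam/ham problem, treating spam as the positive class. The function name and the toy labels are invented for illustration; the project itself most likely relies on `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`).

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # how reliable a spam prediction is
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # how much spam is actually caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 2 true positives, 1 false negative, 1 false positive, 1 true negative.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(classification_metrics(y_true, y_pred))
```

For these toy labels accuracy is 3/5, while precision, recall, and F1 all equal 2/3.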
View the model_comparison.png file for detailed performance visualizations showing:
- Model accuracy comparison
- Precision-Recall trade-offs
- Confusion matrices
- ROC curves
- requirements.txt - All Python dependencies
- runtime.txt - Python version specification
runtime.txt
python-3.10.8
- Prepare Repository
  git add .
  git commit -m "Deploy to Streamlit Cloud"
  git push origin main
- Configure Streamlit Cloud
  - Visit share.streamlit.io
  - Click "New app"
  - Select your repository: spam_project
  - Branch: main
  - Main file: spam_detection_ui.py
- Deploy
  - Click "Deploy!"
  - Wait for build completion
  - Your app will be live at: https://[app-name].streamlit.app/
Python: 3.10.8
scikit-learn: 1.2.2

Using different versions may cause compatibility issues, for example when unpickling the saved models. The requirements.txt file pins exact versions to ensure consistency.
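A quick way to confirm that an environment matches these pins is a check like the one below. It is a sketch under stated assumptions: `find_mismatches` and `installed_version` are hypothetical helpers, and only the two versions named in this README are checked, not the full contents of requirements.txt.

```python
import sys
from importlib import metadata

# Versions stated in this README.
PINNED_PYTHON = (3, 10, 8)
PINNED_PACKAGES = {"scikit-learn": "1.2.2"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(python_version=tuple(sys.version_info[:3]),
                    lookup=installed_version) -> list[str]:
    """List human-readable differences between the environment and the pins."""
    problems = []
    if tuple(python_version) != PINNED_PYTHON:
        found = ".".join(map(str, python_version))
        wanted = ".".join(map(str, PINNED_PYTHON))
        problems.append(f"Python {found} != {wanted}")
    for package, wanted in PINNED_PACKAGES.items():
        found = lookup(package)
        if found != wanted:
            problems.append(f"{package} {found or 'not installed'} != {wanted}")
    return problems
```

Running `print(find_mismatches())` in the activated environment prints an empty list when both pins match. The `lookup` parameter exists only so the check can be exercised without installing anything.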
- Deep Learning Models
  - LSTM networks for sequential text analysis
  - Transformer-based models (BERT, DistilBERT)
- Model Interpretability
  - SHAP values for feature importance
  - LIME for local interpretability
- API Development
  - RESTful API with FastAPI
  - Swagger/OpenAPI documentation
- Data Management
  - Database integration (PostgreSQL/MongoDB)
  - User feedback collection system
- Advanced Features
  - Real-time email ingestion
  - Batch processing capabilities
  - Multi-language support
  - Custom model training interface
Have ideas for improvements? Check out the Contributing section below.
Contributions are what make the open-source community amazing! Any contributions you make are greatly appreciated.
- Fork the Project
  git clone https://github.com/sayendranadh/spam_project.git
- Create a Feature Branch
  git checkout -b feature/AmazingFeature
- Commit Changes
  git commit -m 'Add some AmazingFeature'
- Push to Branch
  git push origin feature/AmazingFeature
- Open a Pull Request
- Write clear, descriptive commit messages
- Follow PEP 8 style guidelines for Python code
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting PR
Sayendranadh
- Final Year B.Tech Student
- Aspiring Data Scientist / Machine Learning Engineer
- GitHub: @sayendranadh
- LinkedIn: Connect with me
- Email: sayendranadh2005@gmail.com
Contributors
Special thanks to:
- I.Vishnu Varma - Project Contributor
- P.Sai Charan - Project Contributor
If you find this project helpful:
- Star this repository
- Share it with others
- Connect on LinkedIn
This project is licensed under the MIT License - see the LICENSE file for details.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software.
Special thanks to:
- scikit-learn - Comprehensive ML library and documentation
- Streamlit - Amazing framework for ML web apps
- Streamlit Cloud - Free hosting for data apps
- NLTK - Natural Language Toolkit
- SpaCy - Industrial-strength NLP
- Open-source NLP community for datasets and research
- Kaggle Spam Classification Datasets
- scikit-learn Text Classification Tutorial
- Streamlit Documentation