This is a machine learning project that classifies emails and SMS messages as spam or not spam using Natural Language Processing (NLP). It uses Multinomial Naive Bayes, which assumes that the words in a message are independent of one another given the class. That assumption rarely holds in practice, but the model still scored well: 97.10% accuracy and 100% precision. To get there, I applied several Exploratory Data Analysis (EDA) and feature engineering methods to find useful features for the model, and split the dataset into 80% training and 20% testing.
I also used Docker to containerize the project, so anyone can build and run the Docker image locally without installing the dependencies by hand. The model is also hosted on Hugging Face, which made it easy to connect the model backend to the Streamlit frontend.
Shrish Mishra
This project implements a spam detection system using multiple machine learning algorithms. The system processes text messages, transforms them using TF-IDF vectorization, and classifies them as spam or legitimate (ham) messages.
- Source: SMS Spam Collection Dataset
- Total Messages: 5,572 messages
- After Preprocessing: 5,169 messages (after removing 403 duplicates)
- Distribution:
  - Ham (Not Spam): 4,516 messages (87.37%)
  - Spam: 653 messages (12.63%)
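The counts above are internally consistent, which a few lines of Python can confirm. This is only an arithmetic check on the figures reported in this README; the variable names are illustrative and do not come from the notebook.

```python
# Figures as reported above; variable names are illustrative.
total_messages = 5572
duplicates_removed = 403
ham, spam = 4516, 653

# Messages left after deduplication should equal ham + spam.
remaining = total_messages - duplicates_removed
assert remaining == ham + spam

# Class shares as percentages of the deduplicated dataset.
ham_share = 100 * ham / remaining
spam_share = 100 * spam / remaining
print(f"remaining={remaining}  ham={ham_share:.2f}%  spam={spam_share:.2f}%")
```

The imbalance between the two classes is one reason to look at precision alongside accuracy when comparing models.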
The project analyzes the following text features:
- Number of characters
- Number of words
- Number of sentences
- Transformed text (after preprocessing)
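The length features listed above could be computed roughly as follows. This is a dependency-free approximation that uses regular expressions; the notebook may well use NLTK's tokenizers instead, and the function name `text_features` and the dictionary keys are hypothetical.

```python
import re


def text_features(message: str) -> dict:
    """Return simple length features for one message (regex-based approximation)."""
    # Count word-like runs and individual punctuation marks as tokens.
    tokens = re.findall(r"\w+|[^\w\s]", message)
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", message.strip()) if s]
    return {
        "num_characters": len(message),
        "num_words": len(tokens),
        "num_sentences": len(sentences),
    }


features = text_features("Free entry! Text WIN to 80082 now. Call today.")
print(features)
```

Counts produced by a real tokenizer can differ slightly, for example around abbreviations or contractions.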
The transform_text function in Spam_detection.ipynb and app.py performs the following steps:
- Lowercase Conversion: Converts all text to lowercase
- Tokenization: Breaks text into individual words using nltk.word_tokenize()
- Alphanumeric Filtering: Removes special characters and keeps only alphanumeric tokens
- Stop Words Removal: Removes common English stop words and punctuation
- Stemming: Reduces words to their root form using Porter Stemmer
TF-IDF Vectorizer (Term Frequency-Inverse Document Frequency)
- Maximum features: 3,000
- Converts text into numerical vectors
- Saved as: vectorizer.pkl
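A small sketch of this vectorization step, using scikit-learn's `TfidfVectorizer` with the same feature cap. The toy corpus is made up for illustration, and saving with `pickle` is an assumption based on the `.pkl` file name and the library list.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-preprocessed messages (illustrative, not taken from the dataset).
corpus = [
    "free prize call",
    "see you at lunch",
    "claim free txt prize",
    "call me later",
]

# Keep at most the 3,000 most frequent terms, as in the project.
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per message
print(X.shape)

# Round-trip through pickle in memory; writing these bytes to a file
# would produce something like vectorizer.pkl.
restored = pickle.loads(pickle.dumps(vectorizer))
new_row = restored.transform(["free prize"])
```

The fitted vectorizer has to be saved together with the model, because new messages must be mapped onto the same vocabulary the model was trained on.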
The project evaluates 10 different classification algorithms:
| Algorithm | Accuracy | Precision |
|---|---|---|
| Multinomial Naive Bayes (NB) | 97.10% | 100.00% |
| K-Nearest Neighbors (KN) | 90.52% | 100.00% |
| Random Forest (RF) | 97.58% | 98.29% |
| Support Vector Classifier (SVC) | 97.58% | 97.48% |
| Extra Trees Classifier (ETC) | 97.49% | 97.46% |
| Logistic Regression (LR) | 95.84% | 97.03% |
| Gradient Boosting (GBDT) | 94.68% | 91.92% |
| Bagging Classifier (BgC) | 95.84% | 86.82% |
| AdaBoost | 92.46% | 84.88% |
| Decision Tree (DT) | 92.75% | 81.19% |
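A comparison like the one in the table can be produced with a short loop over classifiers. The sketch below uses three of the ten algorithms and a tiny made-up dataset, so its scores say nothing about the real results; the hyperparameters are scikit-learn defaults and are an assumption, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only (1 = spam, 0 = ham).
train_texts = ["free prize claim", "win free cash", "lunch at noon", "see you tonight",
               "claim your prize now", "call me when home"]
train_y = [1, 1, 0, 0, 1, 0]
test_texts = ["free cash prize", "see you at lunch"]
test_y = [1, 0]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

classifiers = {
    "NB": MultinomialNB(),
    "LR": LogisticRegression(),
    "DT": DecisionTreeClassifier(random_state=2),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_y)
    pred = clf.predict(X_test)
    # Precision is the share of predicted spam that really is spam.
    results[name] = (
        accuracy_score(test_y, pred),
        precision_score(test_y, pred, zero_division=0),
    )

for name, (acc, prec) in results.items():
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f}")
```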
Multinomial Naive Bayes was selected as the final model due to:
- High accuracy: 97.10%
- Perfect precision: 100.00%
- No false positives (0 legitimate messages classified as spam)
- Confusion Matrix: [[896 0] [30 108]]
- Model saved as: model.pkl
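The reported scores follow directly from the confusion matrix. The check below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, with ham = 0 and spam = 1), which matches the zero false positives noted above.

```python
# Confusion matrix as reported above: rows = true class, columns = predicted class.
confusion = [[896, 0],
             [30, 108]]
(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted spam that is really spam
recall = tp / (tp + fn)     # share of real spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

Under that layout, the 30 in the bottom-left cell counts spam messages that were classified as ham, so perfect precision here comes with some missed spam.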
spam_detection_model/
├── app.py                  # Streamlit web application
├── Spam_detection.ipynb    # Jupyter notebook with full analysis
├── spam.csv                # Dataset
├── model.pkl               # Trained Multinomial Naive Bayes model
└── vectorizer.pkl          # TF-IDF vectorizer
- Python 3.9.6
- Libraries:
  - pandas
  - numpy
  - nltk
  - scikit-learn
  - matplotlib
  - seaborn
  - wordcloud
  - streamlit
  - pickle
- Clone the repository
- Install required packages:
pip install pandas numpy nltk scikit-learn matplotlib seaborn wordcloud streamlit
- Download NLTK data:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
Run the web application:
streamlit run app.py
The application provides a simple interface where you can:
- Enter an email or SMS message
- Click "Predict"
- See if the message is classified as "Spam" or "Not Spam"
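Behind the "Predict" button, the flow is presumably: preprocess the message, vectorize it, and ask the model for a label. The sketch below shows that flow without Streamlit. The helper `classify_message` is hypothetical, and the vectorizer and model are stand-ins trained on made-up data; in the real app they would be loaded from `vectorizer.pkl` and `model.pkl`, and `preprocess` would be `transform_text`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def classify_message(message, vectorizer, model, preprocess=str.lower):
    """Return "Spam" or "Not Spam" for a single message."""
    cleaned = preprocess(message)
    features = vectorizer.transform([cleaned])  # one-row sparse matrix
    label = model.predict(features)[0]          # 1 = spam, 0 = ham
    return "Spam" if label == 1 else "Not Spam"


# Stand-ins trained on toy data (illustrative only).
texts = ["free prize claim now", "win free cash prize",
         "are we meeting for lunch", "see you at home tonight"]
labels = [1, 1, 0, 0]
vectorizer = TfidfVectorizer().fit(texts)
model = MultinomialNB().fit(vectorizer.transform(texts), labels)

print(classify_message("Claim your FREE prize", vectorizer, model))
```

Passing the vectorizer and model in as arguments keeps the function easy to test with small stand-ins like these.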
Open Spam_detection.ipynb in Jupyter Notebook to:
- Explore the complete data analysis
- View visualizations (word clouds, histograms, correlation heatmaps)
- Train and evaluate different models
- Modify and experiment with the code
- Data Loading: Read spam.csv with ISO-8859-1 encoding
- Data Cleaning:
  - Remove unnecessary columns
  - Rename columns to 'target' and 'text'
  - Encode labels (ham=0, spam=1)
  - Remove 403 duplicate messages
- Feature Engineering: Extract character count, word count, and sentence count
- Text Preprocessing: Apply the transform_text function
- Vectorization: Convert text to TF-IDF features (3000 features)
- Train-Test Split: 80% training, 20% testing (random_state=2)
- Model Training: Train 10 different classifiers
- Evaluation: Compare accuracy and precision scores
- Model Selection: Choose Multinomial Naive Bayes as the final model
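The loading, cleaning, and splitting steps can be sketched with pandas and scikit-learn. The original column names (`v1`, `v2`, `Unnamed: 2`) are an assumption about the raw CSV, and the in-memory data is made up so the example runs without `spam.csv`. For brevity the split is applied to raw text here, whereas the workflow above vectorizes first.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for spam.csv; column names are assumed, rows are made up.
raw_csv = io.StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,See you at lunch,\n"
    "spam,Free prize call now,\n"
    "ham,See you at lunch,\n"
    "ham,Call me later,\n"
    "spam,Claim your free txt prize,\n"
    "ham,Running late today,\n"
)

# With the real file: pd.read_csv("spam.csv", encoding="ISO-8859-1")
df = pd.read_csv(raw_csv)
df = df.drop(columns=["Unnamed: 2"])                      # remove unnecessary columns
df = df.rename(columns={"v1": "target", "v2": "text"})    # rename to target/text
df["target"] = df["target"].map({"ham": 0, "spam": 1})    # encode labels
df = df.drop_duplicates(keep="first")                     # drop repeated messages

# 80/20 split with the same random_state as the project.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=2
)
print(len(df), len(X_train), len(X_test))
```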
- Spam messages are significantly longer than ham messages
- Average characters:
  - Ham: 70.46
  - Spam: 137.89
- Average words:
  - Ham: 17.12
  - Spam: 27.67
- Most frequent words in spam: "call", "free", "txt", "claim", "prize"
Run locally with:
streamlit run app.py
You can also run the app inside Docker, which is useful for consistent environments and for deploying to servers or CI.
Build the Docker image locally (from the repository root):
docker build -t spam-detection-app:latest .
Run the container and map Streamlit's port 8501 to your host:
docker run --rm -p 8501:8501 -v "$PWD":/app spam-detection-app:latest
Or use Docker Compose (recommended for development):
docker compose up --build
Example: run the container and pass a private HF token (if needed):
docker run --rm -p 8501:8501 -e HF_TOKEN="$HF_TOKEN" -v "$PWD":/app spam-detection-app:latest
- Add more sophisticated preprocessing techniques
- Implement deep learning models (LSTM, BERT)
- Add multilingual support
- Enhance the web interface with more features
- Add confidence scores and probability display
- Implement feedback mechanism for model improvement
- Deploy to cloud platform (Heroku, AWS, etc.)
This project is open source and available for educational purposes.
For questions or suggestions, please contact Shrish Mishra at shrish409@gmail.com.