A complete SMS spam-classification pipeline using classic machine-learning algorithms: it ingests the publicly available SMS spam dataset, applies text preprocessing (cleaning, tokenization, vectorization), then trains and evaluates three classifiers: a Naïve Bayes model, a Decision Tree, and a Random Forest.

sameer-at-git/SMS-Spam-Classification-using-Naive-Bayes-Decision-Tree-and-Random-Forest

Dataset Python 3.6 library

Project Overview

• Built a machine learning model that classifies an SMS as SPAM or HAM (normal) from its text, using Natural Language Processing.<br/> • Engineered features such as word_count, contains_currency_symbol, and contains_number from the SMS text.

How will this project help?

• This project helps filter spam SMS messages out of a phone's inbox.

Resources Used

• Packages: pandas, numpy, sklearn, matplotlib, seaborn, nltk.<br/> • Dataset by UCI Machine Learning on Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset

Exploratory Data Analysis (EDA)

• Checked the dataset for NaN values <br/> • Plotted a count plot of the SMS labels (Spam vs. Ham)
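
The two EDA checks above can be sketched with pandas as follows. This is a minimal illustration on a toy DataFrame, not the repository's notebook code; the column names `label` and `message` are assumptions.

```python
import pandas as pd

# Toy stand-in for the SMS dataset. The column names "label" and "message"
# are assumptions made for this sketch, not taken from the repository.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "message": ["See you at 5", "WIN a free prize now!!!", None, "Ok lol"],
})

# Count missing values per column.
nan_counts = df.isna().sum()

# Class counts: the numbers a Spam vs. Ham count plot would display.
label_counts = df["label"].value_counts()

print(nan_counts)
print(label_counts)
```

With seaborn installed, `sns.countplot(x="label", data=df)` would draw the same class counts as a bar chart.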

Feature Engineering

• Handled the imbalanced dataset using oversampling <br/> [Figure: SpamVsHam] <br/> • Created new features from existing features, e.g. word_count, contains_currency_symbol, contains_numbers, etc.<br/> [Figure: word_count] <br/> [Figure: currency_numbers]
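
As a rough sketch of these two steps, the snippet below derives the three named features from a message and balances the classes by random oversampling. The README does not say which oversampling method or which currency symbols were used, so `CURRENCY_SYMBOLS`, `extract_features`, and `oversample` are hypothetical names and choices made only for illustration.

```python
import random
import re

# Assumed set of currency symbols; the repository's exact list is not shown.
CURRENCY_SYMBOLS = "$£€¥₹"


def extract_features(message):
    """Return the three hand-crafted features for one SMS."""
    return {
        "word_count": len(message.split()),
        "contains_currency_symbol": int(any(ch in CURRENCY_SYMBOLS for ch in message)),
        "contains_number": int(bool(re.search(r"\d", message))),
    }


def oversample(rows, label_key="label", seed=0):
    """Randomly duplicate minority-class rows until every class is the same size."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement to reach the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rows = [
    {"label": "ham", "message": "See you at 5"},
    {"label": "ham", "message": "Ok lol"},
    {"label": "ham", "message": "Call me later"},
    {"label": "spam", "message": "WIN £1000 cash now"},
]
balanced = oversample(rows)
features = extract_features("WIN £1000 cash now")
print(len(balanced), features)
```

Sampling with replacement is the simplest form of oversampling; other variants exist, and the one used in the project may differ.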

Data Cleaning

• Removing special characters and numbers using regular expressions <br/> • Converting the entire SMS to lower case <br/> • Tokenizing the SMS into words <br/> • Removing the stop words <br/> • Lemmatizing the words <br/> • Joining the lemmatized words <br/> • Building a corpus of messages
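
The cleaning steps listed above might look roughly like this. To keep the sketch runnable without downloading NLTK corpora, it uses a tiny hard-coded stop-word set and an identity function in place of a real lemmatizer; both are assumptions, and `clean_message` and `build_corpus` are hypothetical helper names.

```python
import re

# Tiny illustrative stop-word set; an assumption standing in for NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "at", "to", "you", "your", "have", "of", "for"}


def clean_message(message, stop_words=STOP_WORDS, lemmatize=lambda word: word):
    """Apply the cleaning steps to a single SMS and return the cleaned string."""
    # 1. Replace everything except letters with spaces (drops digits and special characters).
    text = re.sub(r"[^a-zA-Z]", " ", message)
    # 2. Lower-case the message.
    text = text.lower()
    # 3. Tokenize on whitespace.
    tokens = text.split()
    # 4. Remove stop words.
    tokens = [t for t in tokens if t not in stop_words]
    # 5. Lemmatize each remaining word (identity by default in this sketch).
    tokens = [lemmatize(t) for t in tokens]
    # 6. Join the words back into one string.
    return " ".join(tokens)


def build_corpus(messages, **kwargs):
    """Clean every message and collect the results into a corpus."""
    return [clean_message(m, **kwargs) for m in messages]


corpus = build_corpus(["You have WON a £1000 prize!!!", "See you at 5 :)"])
print(corpus)
```

With NLTK and its `stopwords` and `wordnet` corpora downloaded, one could pass `stop_words=set(stopwords.words("english"))` and `lemmatize=WordNetLemmatizer().lemmatize` instead of the defaults.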

Model Building and Evaluation

Metric: F1-Score <br/> • Multinomial Naive Bayes: 0.943 <br/> • Decision Tree: 0.98 <br/> • Random Forest: 0.994 <br/> • Voting (Decision Tree + Multinomial Naive Bayes): 0.98 <br/> [Figure: matrix] <br/> Note: Evaluation scores were obtained using cross-validation.
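
A cross-validated F1 comparison of these four models could be set up along the following lines with scikit-learn. This is only a sketch: the TF-IDF vectorizer, the hyperparameters, the default voting mode, and `cv=2` are assumptions, and the eight-message toy corpus is placeholder data, so the printed scores will not match the figures reported above.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy cleaned corpus (1 = spam, 0 = ham); placeholder data, not the real dataset.
corpus = [
    "win free prize now", "free cash claim now", "urgent win cash prize",
    "claim free ticket now", "see you at lunch", "call me later today",
    "are you coming home", "thanks for the lunch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hyperparameters here are illustrative assumptions.
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "Voting (Decision Tree + Multinomial Naive Bayes)": VotingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("mnb", MultinomialNB()),
        ]
    ),
}

scores = {}
for name, model in models.items():
    # Vectorizing inside the pipeline keeps each fold's vocabulary
    # fitted on that fold's training messages only.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    fold_scores = cross_val_score(pipeline, corpus, labels, cv=2, scoring="f1")
    scores[name] = fold_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

On the real dataset a larger number of folds (for example `cv=10`) would give more stable estimates than the two folds used here for the toy data.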

Model Prediction

[Figure: Prediction]

Do ⭐ the repository if it helped you in any way.
