• Built a machine learning model that classifies an SMS as spam or ham (normal) from its text, using Natural Language Processing.<br/>
• Engineered features like word_count, contains_currency_symbol, and contains_number from the text SMS.
• The model can be used to filter spam out of a phone's SMS inbox.
• Packages: pandas, numpy, sklearn, matplotlib, seaborn, nltk.<br/>
• Dataset by UCI Machine Learning on Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset
• Checking the dataset for NaN values <br/>
• Plotting a countplot of the SMS labels (spam vs. ham)
• Handling the imbalanced dataset using oversampling <br/>
<br/>
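
The oversampling step could look roughly like the sketch below. It is a minimal, standard-library illustration of random oversampling, not necessarily the method used in this project; the function name `random_oversample`, the seed, and the toy messages are assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(messages, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest one."""
    rng = random.Random(seed)  # fixed seed (assumed) so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_messages, out_labels = list(messages), list(labels)
    for label, count in counts.items():
        # All original messages carrying this label
        pool = [m for m, l in zip(messages, labels) if l == label]
        # Draw with replacement until this class reaches the target size
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_messages.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_messages, out_labels

# Toy data (hypothetical), heavily skewed towards ham like the real dataset
messages = ["hi there", "see you at 5", "lunch?", "ok", "WIN a free prize now"]
labels = ["ham", "ham", "ham", "ham", "spam"]

balanced_messages, balanced_labels = random_oversample(messages, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact copies, it is usually safer to oversample only the training portion of the data, so that duplicates of one message do not end up on both sides of a split.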
• Creating new features from existing features e.g. word_count, contains_currency_symbol, contains_numbers, etc.<br/>
<br/>
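
As an illustration of the engineered features, here is a minimal sketch using only the standard library. The feature names follow the ones listed at the top of this README; the exact set of currency symbols and the function name `extract_features` are assumptions.

```python
import re

# Assumed set of currency symbols; the project may check a different set
CURRENCY_SYMBOLS = ("$", "£", "€", "¥", "₹")

def extract_features(message):
    """Return the three hand-crafted features for one SMS."""
    return {
        # Number of whitespace-separated tokens
        "word_count": len(message.split()),
        # 1 if any currency symbol appears in the text, else 0
        "contains_currency_symbol": int(any(s in message for s in CURRENCY_SYMBOLS)),
        # 1 if the text contains at least one digit, else 0
        "contains_number": int(bool(re.search(r"\d", message))),
    }

print(extract_features("WIN £1000 cash now"))
# {'word_count': 4, 'contains_currency_symbol': 1, 'contains_number': 1}
```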

• Removing special characters and numbers using regular expressions <br/>
• Converting the entire SMS to lowercase <br/>
• Tokenizing the SMS into words <br/>
• Removing the stop words <br/>
• Lemmatizing the words <br/>
• Joining the lemmatized words <br/>
• Building a corpus of messages
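
The cleaning steps above can be sketched as a single function. This is a standard-library approximation: the stop-word list is a tiny illustrative one and the default lemmatizer does nothing, both assumptions made so the example runs without downloads. With nltk installed and its corpora downloaded, `nltk.corpus.stopwords.words("english")` and `WordNetLemmatizer().lemmatize` could be passed in instead.

```python
import re

# Tiny illustrative stop-word list (assumption); nltk provides a fuller one
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "to", "you", "for", "of", "and", "in", "on"})

def identity(word):
    """Placeholder lemmatizer that returns the word unchanged."""
    return word

def preprocess(message, stop_words=STOP_WORDS, lemmatize=identity):
    # 1. Replace everything except letters (special characters, digits) with spaces
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)
    # 2. Convert to lowercase
    lowered = letters_only.lower()
    # 3. Tokenize on whitespace
    tokens = lowered.split()
    # 4. Remove stop words
    kept = [t for t in tokens if t not in stop_words]
    # 5. Lemmatize each remaining word
    lemmas = [lemmatize(t) for t in kept]
    # 6. Join the words back into one string
    return " ".join(lemmas)

def build_corpus(messages, **kwargs):
    """Apply the cleaning steps to every message."""
    return [preprocess(m, **kwargs) for m in messages]

corpus = build_corpus(["You are a WINNER!! Call 0800-123 to claim the prize."])
print(corpus)
# ['winner call claim prize']
```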
Metric: F1-Score <br/>
• Multinomial Naive Bayes: 0.943 <br/>
• Decision Tree: 0.98 <br/>
• Random Forest: 0.994 <br/>
• Voting (Decision Tree + Multinomial Naive Bayes): 0.98 <br/>
<br/>
Note: Evaluation scores are obtained using cross validation.
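
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand for a single set of predictions; the labels are made up for illustration, and treating spam as the positive class is an assumption. scikit-learn's `sklearn.metrics.f1_score` computes the same quantity, and a cross-validated score averages it over the folds.

```python
def f1_score_binary(y_true, y_pred, positive="spam"):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and predictions
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
# 0.667
```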
Do ⭐ the repository if it helped you in any way.