An end-to-end NLP project that performs sentiment analysis on user reviews of the ChatGPT application. This repository contains the code for data analysis, machine learning model training, and a deployed interactive web application.
Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment expressed in a given text. This project aims to analyze user reviews of a ChatGPT application and classify them as Positive, Neutral, or Negative. The goal is to gain insights into customer satisfaction, identify common concerns, and ultimately enhance the application's user experience.
- Language: Python 3.9+
- Data Manipulation: Pandas, NumPy
- Data Visualization: Matplotlib, Seaborn, WordCloud
- NLP: NLTK
- Machine Learning: Scikit-learn
- Web App Framework: Streamlit
The project follows a standard machine learning pipeline:
- Data Loading: The `chatgpt_style_reviews_dataset.xlsx` file is loaded.
- Exploratory Data Analysis (EDA): Visualizations are generated to understand distributions, trends, and key phrases.
- Data Preprocessing: Text data is cleaned, tokenized, and lemmatized, and stopwords are removed.
- Feature Engineering: Cleaned text is converted into numerical features using TF-IDF vectorization.
- Model Training: Several models (Logistic Regression, Naive Bayes, etc.) are trained and the best one is selected after hyperparameter tuning.
- Deployment: The trained model and vectorizer are saved and served via an interactive Streamlit web application.
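The preprocessing, feature engineering, training, and saving steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy reviews, the `clean_text` helper, and the `predict_sentiment` function are hypothetical, lemmatization is left out, and scikit-learn's built-in English stopword list stands in for the NLTK one so the example runs without extra downloads.

```python
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews; the real project reads them from the .xlsx file.
reviews = [
    "I love this app, it is amazing and helpful",
    "Great answers and a very smooth experience",
    "It is okay, nothing special",
    "Average app, works sometimes",
    "Terrible, it keeps crashing",
    "Worst update ever, very slow and buggy",
]
labels = ["Positive", "Positive", "Neutral", "Neutral", "Negative", "Negative"]


def clean_text(text: str) -> str:
    """Lowercase the text and keep letters only (no lemmatization here)."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# TF-IDF turns the cleaned text into numeric features.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform([clean_text(r) for r in reviews])

model = LogisticRegression(max_iter=1000)
model.fit(features, labels)

# Serialize both artifacts, as the deployment step does with .pkl files.
model_bytes = pickle.dumps(model)
vectorizer_bytes = pickle.dumps(vectorizer)


def predict_sentiment(text: str) -> str:
    """Reload the saved artifacts and classify a single review."""
    restored_model = pickle.loads(model_bytes)
    restored_vectorizer = pickle.loads(vectorizer_bytes)
    vector = restored_vectorizer.transform([clean_text(text)])
    return str(restored_model.predict(vector)[0])


print(predict_sentiment("The app is slow and keeps crashing"))
```

The vectorizer is saved alongside the model because new text must be transformed with the same vocabulary and IDF weights that were learned during training.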
Follow these steps to run the project locally:
- Clone the Repository:

      git clone https://github.com/itz-Mayank/AI_Echo.git
      cd AI_Echo

- Create a Virtual Environment (Recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

- Install Dependencies:

      pip install -r requirements.txt

- Download Dataset:
  - Place your dataset (e.g., `chatgpt_style_reviews_dataset.xlsx`) in the root directory of the project.

- Run the Training Pipeline (Optional):
  - To retrain the model and generate the EDA plots, run the main notebook, `AI_echo.ipynb`. You can open it in Jupyter and run all cells or, if Jupyter is installed, execute it from the command line:

      jupyter nbconvert --to notebook --execute AI_echo.ipynb

- Run the Streamlit App:
  - Ensure the `sentiment_model.pkl` and `tfidf_vectorizer.pkl` files are in the `Models` folder.

      streamlit run app.py

  Your browser will open with the running application.
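Before launching the app, it can help to confirm that both model files are where the app expects them. The following is a small optional sketch. The `missing_artifacts` helper is hypothetical and not part of the repository, and it assumes the script is run from the project root.

```python
from pathlib import Path

# File names taken from the setup steps above.
REQUIRED_ARTIFACTS = ("sentiment_model.pkl", "tfidf_vectorizer.pkl")


def missing_artifacts(models_dir: str = "Models") -> list[str]:
    """Return the names of required files that are absent from models_dir."""
    folder = Path(models_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print("Missing from Models/:", ", ".join(missing))
    else:
        print("All model artifacts found; run: streamlit run app.py")
```

If either file is reported missing, running the training pipeline step should regenerate it.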
├── app.py
├── AI_echo.ipynb
├── requirements.txt
├── Models/
│   ├── sentiment_model.pkl
│   └── tfidf_vectorizer.pkl
├── eda_visualizations/
└── chatgpt_style_reviews_dataset.xlsx
The final model is a Logistic Regression classifier trained on the provided 50-row dataset. The model was evaluated on the same data it was trained on to measure its memorization capability, as per the project requirements.
The model reached 98% accuracy on this data (49 of 50 reviews classified correctly), showing that it fits this specific dataset almost perfectly.
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Negative | 0.95 | 1.00 | 0.98 | 20 |
| Neutral | 1.00 | 0.92 | 0.96 | 13 |
| Positive | 1.00 | 1.00 | 1.00 | 17 |
| Accuracy | | | 0.98 | 50 |
| Macro Avg | 0.98 | 0.97 | 0.98 | 50 |
| Weighted Avg | 0.98 | 0.98 | 0.98 | 50 |
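To make the arithmetic behind the report concrete, the sketch below recomputes the per-class metrics from a confusion matrix. The matrix is an assumption: the project does not publish one, but a single Neutral review predicted as Negative is a pattern consistent with the reported precision, recall, and support values.

```python
labels = ["Negative", "Neutral", "Positive"]

# Assumed confusion matrix (rows = true class, columns = predicted class):
# one Neutral review predicted as Negative, everything else correct.
confusion = [
    [20, 0, 0],
    [1, 12, 0],
    [0, 0, 17],
]


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1, support) for the class at index."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum = support
    precision = true_positive / predicted_total
    recall = true_positive / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual_total


report = {label: per_class_metrics(confusion, i) for i, label in enumerate(labels)}
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

for label, (p, r, f1, support) in report.items():
    print(f"{label:<9} {p:.2f} {r:.2f} {f1:.2f} {support}")
print(f"Accuracy  {accuracy:.2f} {total}")
```

Under this assumption, Negative precision is 20/21 ≈ 0.95 and Neutral recall is 12/13 ≈ 0.92, matching the table.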
Note: This high performance reflects the model's ability to memorize a small, specific dataset. For a model that can generalize to new, unseen reviews, a much larger and more diverse dataset would be required.
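For comparison, generalization is usually estimated on reviews the model never saw during training. The sketch below is one possible approach and is not part of this project: it uses hypothetical toy reviews, a stratified hold-out split, and a vectorizer fitted on the training portion only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data; a real evaluation needs far more reviews.
texts = [
    "love it", "really great app", "amazing and helpful", "excellent answers",
    "it is okay", "average experience", "nothing special", "works sometimes",
    "terrible app", "keeps crashing", "very slow and buggy", "worst update ever",
]
labels = ["Positive"] * 4 + ["Neutral"] * 4 + ["Negative"] * 4

# Hold out 25% of the reviews, keeping class proportions similar in both parts.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)  # fit on training text only
test_features = vectorizer.transform(test_texts)        # reuse the learned vocabulary

model = LogisticRegression(max_iter=1000).fit(train_features, train_labels)

train_accuracy = accuracy_score(train_labels, model.predict(train_features))
test_accuracy = accuracy_score(test_labels, model.predict(test_features))
print(f"train accuracy: {train_accuracy:.2f}, held-out accuracy: {test_accuracy:.2f}")
```

A large gap between the two numbers is the usual sign that a model has memorized its training data rather than learned patterns that transfer.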
Created and developed by Mayank Meghwal.