🧪 Toxic Comment Classifier

🧪 Toxic Comment Classifier is a small machine learning project that detects toxic comments in both English and Russian.
It uses classical NLP techniques (TF-IDF features with a Logistic Regression classifier) to classify comments as they are entered.

Whether you're building a moderation system or just exploring NLP, this project is a great starting point.

📦 Description

The script:

  • Downloads and extracts English and Russian toxic comment datasets.
  • Merges them into training and testing sets.
  • Uses TfidfVectorizer to convert text into numerical features.
  • Trains a logistic regression model.
  • Saves the model and vectorizer to model.pkl.
  • Allows the user to input a comment and checks if it is toxic.
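The steps above can be sketched as follows. This is a minimal illustration, not the contents of main.py: the toy comments, the `classify` helper, and the temporary file location are assumptions made for the example, and the real script trains on the downloaded datasets instead.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the merged English/Russian data (1 = toxic, 0 = not toxic).
texts = [
    "You're stupid and nobody likes you!",
    "Ты идиот",
    "Shut up, you moron",
    "Have a great day!",
    "Хорошего дня!",
    "Thanks for the helpful answer",
]
labels = [1, 1, 1, 0, 0, 0]

# Convert text into TF-IDF features and fit the classifier.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Save the model and vectorizer together, then load them back.
# A temporary directory is used here so the sketch does not touch a real model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump((model, vectorizer), f)
    with open(path, "rb") as f:
        loaded_model, loaded_vectorizer = pickle.load(f)


def classify(comment: str) -> str:
    """Return a label for one comment (hypothetical helper)."""
    features = loaded_vectorizer.transform([comment])
    return "Toxic" if loaded_model.predict(features)[0] == 1 else "Kindness"


print(classify("Have a great day!"))
```

Saving the vectorizer alongside the model matters because new comments must be transformed with the same vocabulary the classifier was trained on.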

🗃 Datasets Used

Jigsaw Toxic Comment Classification Challenge:
https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge
Russian Language Toxic Comments:
https://www.kaggle.com/datasets/blackmoon/russian-language-toxic-comments

📁 Project Structure

.
└── Toxic Comment Classifier AI
    ├── dataset
    ├── .gitattributes
    ├── .gitignore
    ├── LICENSE
    ├── main.py
    ├── model.pkl
    ├── README.md
    └── requirements.txt

🛠 Installation

Clone the repository:

git clone https://github.com/pashudzu/ToxicCommentClassificationAI.git
cd ToxicCommentClassificationAI

Install dependencies:

pip install -r requirements.txt

Run the script:

python main.py

🔍 Example Usage

| Comment | Classification |
| --- | --- |
| "You're stupid and nobody likes you!" | ❌ Toxic |
| "Have a great day!" | ✅ Kindness |

📈 Model Performance

The script prints the model's accuracy score on the test set after training.
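As a rough sketch of how such an accuracy score can be computed with scikit-learn, the example below holds out part of a toy dataset and scores the predictions. The data, the 75/25 split, and the variable names are assumptions for illustration; the script's actual split and numbers are not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data for illustration only (1 = toxic, 0 = not toxic).
texts = [
    "You're stupid and nobody likes you!",
    "Shut up, you moron",
    "You are a worthless idiot",
    "Nobody wants you here, loser",
    "Have a great day!",
    "Thanks for the helpful answer",
    "I really enjoyed this article",
    "Good luck with your project",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a quarter of the comments for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)

predictions = model.predict(vectorizer.transform(X_test))
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

On a dataset this small the number is not meaningful; the point is only the shape of the evaluation step.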

🧠 Technologies Used

  • Python 3
  • scikit-learn
  • NLTK
  • pickle
  • TF-IDF vectorization
  • Logistic Regression

📌 Notes

  • ✅ Supports both English and Russian comments.
  • 🧪 Uses only the toxic label (binary classification).
  • 💾 Saves the trained model and skips retraining on later runs if a saved model exists.
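The "skip retraining if a saved model exists" behavior could be implemented along these lines. This is a sketch under assumptions: `load_or_train` and `fake_train` are hypothetical names, and the placeholder dictionary stands in for the real model and vectorizer.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Load pickled artifacts from `path` if present, otherwise train and save them."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    artifacts = train_fn()
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)
    return artifacts


# Demonstration with a stand-in training function that counts its calls.
train_calls = []


def fake_train():
    train_calls.append(1)
    return {"model": "placeholder", "vectorizer": "placeholder"}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    first = load_or_train(path, fake_train)   # trains and saves
    second = load_or_train(path, fake_train)  # loads from disk, no training

print(len(train_calls))
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.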

📜 License

This project is licensed under the MIT License. Use it freely.

Made with ❤️ by pashudzu