🧪 Toxic Comment Classifier is a simple yet effective machine learning project that detects toxic comments in both English and Russian.
It uses classical NLP techniques (TF-IDF + Logistic Regression) for real-time text classification.
Whether you're building a moderation system or just exploring NLP, this project is a great starting point.
The script:
- Downloads and extracts English and Russian toxic comment datasets.
- Merges them into training and testing sets.
- Uses TfidfVectorizer to convert text into numerical features.
- Trains a logistic regression model.
- Saves the model and vectorizer to model.pkl.
- Allows the user to input a comment and checks if it is toxic.
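The training and prediction steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `main.py`: the toy texts and the `(model, vectorizer)` tuple stored in `model.pkl` are assumptions for the example.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the merged English/Russian training data.
texts = ["You're stupid and nobody likes you!", "Have a great day!"]
labels = [1, 0]  # 1 = toxic, 0 = not toxic

# Convert text into TF-IDF features and fit a logistic regression model.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(X, labels)

# Persist the model and vectorizer together so later runs can skip training.
with open("model.pkl", "wb") as f:
    pickle.dump((model, vectorizer), f)

# Classify a comment entered by the user.
comment = "Have a nice evening!"
prediction = model.predict(vectorizer.transform([comment]))[0]
print("Toxic" if prediction == 1 else "Not toxic")
```

Saving the vectorizer alongside the model matters: a comment must be transformed with the same vocabulary the model was trained on.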
Jigsaw Toxic Comment Classification Challenge:
https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge
Russian Language Toxic Comments:
https://www.kaggle.com/datasets/blackmoon/russian-language-toxic-comments
```
.
└── Toxic Comment Classifier AI
    ├── dataset
    ├── .gitattributes
    ├── .gitignore
    ├── LICENSE
    ├── main.py
    ├── model.pkl
    ├── README.md
    └── requirements.txt
```
Clone the repository:

```shell
git clone https://github.com/pashudzu/ToxicCommentClassificationAI.git
cd ToxicCommentClassificationAI
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Run the script:

```shell
python main.py
```
| Comment | Classification |
|---|---|
| "You're stupid and nobody likes you!" | ❌ Toxic |
| "Have a great day!" | ✅ Not toxic |
The model prints the accuracy score after training.
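The printed score is plain classification accuracy on a held-out split, which can be computed like this (the toy corpus and split parameters here are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the merged Kaggle datasets (hypothetical).
texts = ["you are awful", "have a great day", "nobody likes you", "thanks a lot",
         "you are terrible", "what a lovely idea", "go away loser", "nice work"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)

# Accuracy on the held-out split, printed after training.
acc = accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))
print(f"Accuracy: {acc:.2f}")
```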
- Python 3
- scikit-learn
- NLTK
- pickle (standard library)
- TF-IDF vectorization
- Logistic Regression
- ✅ Supports both English and Russian comments.
- 🧪 Uses only the `toxic` label (binary classification).
- 💾 Saves the trained model to `model.pkl` and reuses it on later runs, so retraining is skipped when a saved model exists.
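The load-or-train caching behavior can be sketched as below; `load_or_train` and its toy arguments are hypothetical names, but the `model.pkl` path matches the project's.

```python
import os
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

MODEL_PATH = "model.pkl"

def load_or_train(texts, labels):
    """Reuse the saved model if it exists; otherwise train and save one."""
    if os.path.exists(MODEL_PATH):
        with open(MODEL_PATH, "rb") as f:
            return pickle.load(f)  # (model, vectorizer) tuple
    vectorizer = TfidfVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)
    with open(MODEL_PATH, "wb") as f:
        pickle.dump((model, vectorizer), f)
    return model, vectorizer

# First call trains and saves; subsequent calls just load from disk.
model, vectorizer = load_or_train(
    ["you are awful", "have a nice day"], [1, 0])
```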
This project is licensed under the MIT License. Use it freely.

