This repository contains implementations for the Bengali Multi-task Hate Speech Identification shared task at the BLP Workshop @ IJCNLP-AACL 2025. The project addresses hate speech detection in Bengali across three related subtasks: hate type classification (1A), target identification (1B), and multi-task analysis (1C). The implementations range from recurrent baselines (BiLSTM, LSTM with attention) to transformer models trained with K-fold cross-validation, text normalization, and adversarial training.
- Objective: Model experimentation, architecture exploration, and hyperparameter tuning
- Data: Training and validation datasets provided by organizers
- Focus: Testing various approaches and techniques to identify best-performing models
- Metrics: Validation F1 scores on development set
- Objective: Final model evaluation on unseen test data
- Data: Hidden test set released during evaluation period
- Focus: Deploying best models from developmental phase with refined configurations
- Metrics: Test F1 scores on official evaluation set
Multi-class classification of Bengali text into: Abusive, Sexism, Religious Hate, Political Hate, Profane, or None.
- BiLSTM - F1 Score: 56.25%
- LSTM with Attention - F1 Score: 55.18%
- XLM-RoBERTa-large - F1 Score: 72.81%
- MuRIL-large-cased - F1 Score: 71.02%
- BanglaBERT (csebuetnlp) - F1 Score: 70.74%
- BanglaBERT-large (csebuetnlp) - F1 Score: 70.51%
- XLM-RoBERTa-base - F1 Score: 70.50%
- DistilBERT-multilingual - F1 Score: 68.03%
- MuRIL-large-cased with K-Fold - F1 Score: 73.61%
- XLM-RoBERTa-large with K-Fold - F1 Score: 73.45%
- BanglaBERT with K-Fold - F1 Score: 73.29%
- BanglaBERT with Normalizer - F1 Score: 74.32%
- MuRIL-large-cased with Normalizer - F1 Score: 73.73%
- XLM-RoBERTa-large with Normalizer - F1 Score: 73.29%
- BanglaBERT with K-Fold + FGM - F1 Score: 73.87%
- MuRIL-large-cased with K-Fold + FGM - F1 Score: 73.68%
- BanglaBERT + K-Fold + FGM + Normalizer - F1 Score: 74.88% ⭐ (Best Development Score)
- MuRIL-large-cased + K-Fold + FGM + Normalizer - F1 Score: 73.81%
- BanglaBERT + K-Fold + FGM + Normalizer - Test F1: 72.33% ⭐ (Best Test Score)
- BanglaBERT + K-Fold + FGM - Test F1: 72.17%
- MuRIL-large-cased + K-Fold + Normalizer - Test F1: 72.30%
- BanglaBERT + K-Fold - Test F1: 72.05%
- MuRIL-large-cased + K-Fold + FGM - Test F1: 71.90%
- MuRIL-large-cased + K-Fold - Test F1: 71.88%
- XLM-RoBERTa-large + K-Fold - Test F1: 71.72%
- XLM-RoBERTa-large + K-Fold + Normalizer - Test F1: 71.57%
- MuRIL-large-cased + K-Fold + FGM + Normalizer - Test F1: 71.31%
- BanglaBERT + K-Fold + Normalizer - Test F1: 71.14%
- BanglaBERT (Base) - Test F1: 70.31%
Classification of hate speech targets into: Individuals, Organizations, Communities, or Society.
- Traditional deep learning approaches implemented (scores pending)
- BanglaBERT - F1 Score: 72.09%
- MuRIL-large-cased - F1 Score: 71.93%
- XLM-RoBERTa-large - F1 Score: 71.38%
- MuRIL-large-cased with K-Fold - F1 Score: 74.96% ⭐ (Best Development Score)
- BanglaBERT with K-Fold - F1 Score: 73.69%
- XLM-RoBERTa-large with K-Fold - F1 Score: 71.53%
- BanglaBERT with Normalizer - F1 Score: 74.72%
- MuRIL-large-cased with Normalizer - F1 Score: 74.48%
- XLM-RoBERTa-large with Normalizer - F1 Score: 72.39%
- XLM-RoBERTa-large with K-Fold + FGM - F1 Score: 74.20%
- BanglaBERT with K-Fold + FGM - F1 Score: 74.12%
- MuRIL-large-cased with K-Fold + FGM - F1 Score: 73.89%
- BanglaBERT + K-Fold + FGM + Normalizer - F1 Score: 74.64%
- MuRIL-large-cased + K-Fold + FGM + Normalizer - F1 Score: 74.56%
- XLM-RoBERTa-large + K-Fold + FGM + Normalizer - F1 Score: 74.32%
- XLM-RoBERTa-large - Test F1: 71.23%
- MuRIL-large-cased - Test F1: 70.93%
- BanglaBERT - Test F1: 70.25%
- MuRIL-large-cased + K-Fold - Test F1: 73.44%
- BanglaBERT + K-Fold - Test F1: 71.85%
- XLM-RoBERTa-large + K-Fold - Test F1: 68.07%
- MuRIL-large-cased + K-Fold + Normalizer - Test F1: 73.44%
- BanglaBERT + K-Fold + Normalizer - Test F1: 72.89%
- XLM-RoBERTa-large + K-Fold + Normalizer - Test F1: 71.66%
- XLM-RoBERTa-large + K-Fold + FGM - Test F1: 73.28%
- MuRIL-large-cased + K-Fold + FGM - Test F1: 72.92%
- BanglaBERT + K-Fold + FGM - Test F1: 72.25%
- BanglaBERT + K-Fold + FGM + Normalizer - Test F1: 73.12% ⭐
- MuRIL-large-cased + K-Fold + FGM + Normalizer - Test F1: 72.95% ⭐
- XLM-RoBERTa-large + K-Fold + FGM + Normalizer - Test F1: 72.17%
Multi-task classification combining hate type (Abusive, Sexism, Religious Hate, Political Hate, Profane, None), severity (Little to None, Mild, Severe), and target group (Individuals, Organizations, Communities, Society).
- Basic transformer implementations (scores pending)
- Standard K-Fold implementations (scores pending)
All using BanglaBERT (csebuetnlp) with different adversarial training techniques:
- BanglaBERT + FreeLB - F1 Score: 74.52% ⭐ (Best Development Score)
- BanglaBERT + Simple FreeLB - F1 Score: 73.91%
- BanglaBERT + GAT - F1 Score: 73.79%
- BanglaBERT + FGM - F1 Score: 73.75%
- Text normalization implementations (scores pending)
- Comprehensive technique combinations (scores pending)
- BanglaBERT + K-Fold + Normalizer - Test F1: 73.00%
- BanglaBERT + FreeLB + K-Fold - Test F1: 72.00%
- FGM (Fast Gradient Method): Simple and efficient adversarial perturbations
- AWP (Adversarial Weight Perturbation): Weight-space adversarial training
- FreeLB: Free large-batch adversarial training for improved generalization
- Simple FreeLB: Streamlined version of FreeLB
- GAT (Geometry-Aware Training): Advanced geometry-aware adversarial training
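
As a rough illustration of the FGM entry in the list above, the sketch below computes the perturbation `r = epsilon * g / ||g||` for a toy linear classifier in plain Python. The function names, the `epsilon` value, and the toy embedding and weights are assumptions made for this example; they are not taken from the repository's notebooks, which apply the idea to transformer embeddings.

```python
import math

def fgm_perturbation(grad, epsilon=1.0):
    """Return the FGM perturbation r = epsilon * g / ||g||_2 for a gradient vector."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [epsilon * g / norm for g in grad]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(embedding, weights, label):
    """Binary cross-entropy of a toy linear classifier and its gradient w.r.t. the input."""
    p = sigmoid(sum(w * x for w, x in zip(weights, embedding)))
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    grad = [(p - label) * w for w in weights]
    return loss, grad

embedding = [0.5, -1.0, 2.0]   # hypothetical token embedding
weights = [0.3, 0.8, -0.5]     # hypothetical classifier weights
label = 1

clean_loss, grad = loss_and_input_grad(embedding, weights, label)
delta = fgm_perturbation(grad, epsilon=0.5)
adversarial = [x + d for x, d in zip(embedding, delta)]
adv_loss, _ = loss_and_input_grad(adversarial, weights, label)
```

In adversarial training the loss on the perturbed input is typically added to the clean loss before the optimizer step, and the perturbation is removed afterwards.
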
normalize(
    text,
    unicode_norm="NFKC",           # Compatibility decomposition, then canonical composition
    punct_replacement=None,        # Preserve original punctuation
    url_replacement=None,          # Preserve URLs
    emoji_replacement=None,        # Preserve emojis
    apply_unicode_norm_last=True   # Apply Unicode normalization as the final step
)

- Attention-Based Pooling Head: Dynamic token weighting for better representation
- Multi-Head Classification: Custom classification layers for Bengali text
- Enhanced Dropout Strategies: Improved regularization techniques
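
The attention-based pooling idea in the list above can be sketched without any deep learning framework: each token gets a score, the scores are turned into weights with a softmax, and the sentence vector is the weighted sum of token vectors. This is a minimal sketch under assumptions: `score_weights` stands in for a learned scoring vector, and the function is not the repository's pooling head.

```python
import math

def attention_pool(token_vectors, score_weights, mask):
    """Softmax-weighted average of token vectors; masked (padding) tokens get zero weight."""
    scores = [sum(w * h for w, h in zip(score_weights, vec)) for vec in token_vectors]
    # Subtract the largest unmasked score for numerical stability.
    max_score = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - max_score) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(token_vectors[0])
    pooled = [sum(a * vec[i] for a, vec in zip(attn, token_vectors)) for i in range(dim)]
    return pooled, attn

tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row stands in for a padding token
mask = [1, 1, 0]
pooled, attn = attention_pool(tokens, score_weights=[2.0, 0.0], mask=mask)
```

Compared with taking only the first token's representation, this lets the classifier weight tokens differently per input, which is what "dynamic token weighting" refers to.
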
- K-Fold Implementation: 5-fold cross-validation for robust evaluation
- Stratified Sampling: Maintaining class distribution across folds
- Ensemble Averaging: Combining predictions from multiple folds
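
The three items above (5-fold split, stratification, and averaging fold predictions) can be illustrated with a dependency-free sketch. The helper names and the toy labels and probabilities are hypothetical; the repository lists scikit-learn's stratified K-Fold for the actual split, which also supports shuffling.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each example index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def average_probabilities(fold_probs):
    """Average the per-class probabilities predicted by each fold's model for one example."""
    n_models = len(fold_probs)
    n_classes = len(fold_probs[0])
    return [sum(p[c] for p in fold_probs) / n_models for c in range(n_classes)]

labels = ["None"] * 10 + ["Abusive"] * 5   # hypothetical label column
folds = stratified_folds(labels, k=5)

fold_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]  # hypothetical outputs of three fold models
mean_probs = average_probabilities(fold_probs)
predicted_class = max(range(len(mean_probs)), key=mean_probs.__getitem__)
```

Each fold serves once as the validation split while the remaining folds are used for training; at test time the fold models' probabilities are averaged before taking the argmax.
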
| Subtask | Model | F1 Score | Technique |
|---|---|---|---|
| 1A | BanglaBERT | 74.88% | K-Fold + FGM + Normalizer |
| 1B | MuRIL-large-cased | 74.96% | K-Fold Cross Validation |
| 1C | BanglaBERT | 74.52% | FreeLB Adversarial Training |
| Subtask | Model | Dev F1 | Test F1 | Performance Drop |
|---|---|---|---|---|
| 1A | BanglaBERT + K-Fold + FGM + Normalizer | 74.88% | 72.33% | -2.55% |
| 1B | MuRIL-large-cased + K-Fold | 74.96% | 73.44% | -1.52% |
| 1C | BanglaBERT (Dev: FreeLB; Test: K-Fold + Normalizer) | 74.52% | 73.00% | -1.52% |
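
The "Performance Drop" column in the table above is the test score minus the development score, so the values are differences in percentage points rather than relative changes. A short check of that arithmetic, using only the numbers from the table (the dictionary layout is just for illustration):

```python
# (development F1, test F1) per subtask, copied from the table above
results = {
    "1A": (74.88, 72.33),
    "1B": (74.96, 73.44),
    "1C": (74.52, 73.00),
}

# Difference in percentage points, rounded to two decimals
gaps = {task: round(test - dev, 2) for task, (dev, test) in results.items()}
```
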
| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|---|---|---|---|
| Base LLM | 70.31% | - | - |
| + K-Fold | 72.05% | 71.88% | 71.72% |
| + K-Fold + Normalizer | 71.14% | 72.30% | 71.57% |
| + K-Fold + FGM | 72.17% | 71.90% | - |
| + K-Fold + FGM + Normalizer | 72.33% ⭐ | 71.31% | - |
| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|---|---|---|---|
| Base LLM | 70.25% | 70.93% | 71.23% |
| + K-Fold | 71.85% | 73.44% ⭐ | 68.07% |
| + K-Fold + Normalizer | 72.89% | 73.44% ⭐ | 71.66% |
| + K-Fold + FGM | 72.25% | 72.92% | 73.28% |
| + K-Fold + FGM + Normalizer | 73.12% | 72.95% | 72.17% |
| Approach | BanglaBERT | Development | Test |
|---|---|---|---|
| K-Fold + Normalizer | ✅ | - | 73.00% ⭐ |
| K-Fold + FreeLB | ✅ | 74.52% | 72.00% |
| Simple FreeLB | ✅ | 73.91% | - |
| GAT | ✅ | 73.79% | - |
| FGM | ✅ | 73.75% | - |
- Generalization Gap: 1-3% performance drop from development to test across all subtasks
- Most Stable: K-Fold + Normalizer combinations showed the best consistency (especially in Subtask 1C)
- Overfitting Risk: Single models without cross-validation showed higher variance
- Best Generalization:
  - Subtask 1A: Adversarial training methods (FGM + Normalizer)
  - Subtask 1B: Combined approaches (K-Fold + FGM + Normalizer)
  - Subtask 1C: Normalization techniques (performance drop of 1.52 points, tied with Subtask 1B for the smallest)
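- K-Fold Cross Validation: Roughly 1.5-3 point gains over the single-model baseline for BanglaBERT and MuRIL-large-cased; XLM-RoBERTa-large gained under 1 point on development and dropped on the Subtask 1B test set (71.23% to 68.07%)
- Text Normalization: Up to about 1 point on top of K-Fold in most settings, but not uniformly (on the Subtask 1A test set, BanglaBERT fell from 72.05% to 71.14%)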
- Adversarial Training: 0.5-1.5% improvement with better robustness
- Combined Techniques: Best overall performance with stacked improvements
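- Transformer Superiority: Roughly 12-18 F1 points over traditional deep learning on Subtask 1A (base transformers at 68.03-72.81% versus BiLSTM and LSTM with Attention at 55.18-56.25%)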
- BanglaBERT (csebuetnlp): Specialized Bengali language model
- MuRIL-large-cased: Multilingual model with strong Bengali support
- XLM-RoBERTa (base & large): Cross-lingual transformer variants
- DistilBERT-multilingual: Lightweight multilingual model
- Enhanced Tokenization: Bengali-specific preprocessing pipelines
- Dynamic Padding: Efficient batch processing strategies
- Label Smoothing: Improved training stability
- Learning Rate Scheduling: Optimized training convergence
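
Dynamic padding, mentioned in the list above, means padding each batch only to the length of its longest sequence instead of a fixed global maximum. The sketch below shows the idea in plain Python; `pad_batch`, `pad_id`, and the token ids are illustrative assumptions. With Hugging Face Transformers, `DataCollatorWithPadding` provides this behavior.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id sequences to the longest length in this batch and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

batch = [[101, 7, 8, 102], [101, 9, 102]]  # hypothetical token ids
input_ids, attention_mask = pad_batch(batch)
```
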
Shared_Task1_HateSpeech/
├── subtask1A/                                  # Hate speech type classification
│   ├── Developmental Phase/
│   │   ├── DL Models/                          # BiLSTM, LSTM-Attention
│   │   ├── LLMs/                               # Base transformer models
│   │   ├── LLMS with K Fold CV/                # K-Fold implementations
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   ├── LLMS_KFolds_attacks_normalizer/
│   │   └── Various classification heads/
│   └── Evaluation Phase/                       # Final test submissions
├── subtask1B/                                  # Hate speech target classification
│   ├── Developmental Phase/
│   │   ├── DL Models/
│   │   ├── LLMs/
│   │   ├── LLMS with K Fold CV/
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   └── LLMS_KFolds_attacks_normalizer/
│   └── Evaluation Phase/
│       ├── LLMs/
│       ├── LLMS with K Fold CV/
│       ├── K Folds with normalizer/
│       ├── LLMs_KFolds_adversarial attacks/
│       └── LLMS_KFolds_attacks_normalizer/
└── subtask1C/                                  # Multi-task hate speech analysis
    ├── Developmental Phase/
    │   ├── LLMs/
    │   ├── LLMS with K Fold CV/
    │   ├── LLMs with adversarial attacks and K Fold CV/
    │   ├── LLMs with K Fold CV and normalizer/
    │   └── K Fold CV with attacks and normalizer/
    └── Evaluation Phase/
        ├── LLMs/
        ├── LLMS with K Fold CV/
        ├── LLMs with adversarial attacks and K Fold CV/
        ├── LLMs with K Fold CV and normalizer/
        └── K Fold CV with attacks and normalizer/
- Model directories: `v{f1_score}_{model_name}`
  - Example: `v0.7488_banglabert-fgm` = 74.88% F1 score using BanglaBERT with FGM
- Each directory contains:
  - Jupyter notebook (.ipynb) with complete implementation
  - Dataset file (subtask_1X.tsv)
  - Model checkpoints and outputs
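
Because the score is encoded in the directory name, results can be collected by parsing names rather than opening notebooks. The helper below is a hypothetical sketch that assumes every model directory follows the `v{f1_score}_{model_name}` pattern exactly; it is not part of the repository.

```python
import re

# 'v', a decimal F1 score, an underscore, then the model name
DIR_PATTERN = re.compile(r"^v(?P<f1>\d+\.\d+)_(?P<model>.+)$")

def parse_model_dir(name):
    """Split a name like 'v0.7488_banglabert-fgm' into (F1 in percent, model name)."""
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a model directory name: {name!r}")
    return round(float(match.group("f1")) * 100, 2), match.group("model")

f1_percent, model_name = parse_model_dir("v0.7488_banglabert-fgm")
```
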
- Baseline Models: 55-68% F1 (Deep Learning approaches)
- Base Transformers: 68-73% F1 (Standard LLM implementations)
- K-Fold Enhancement: 70-74% F1 (Cross-validation improvements)
- Normalization Boost: 73-75% F1 (Text preprocessing optimization)
- Adversarial Training: 73-75% F1 (Robustness improvements)
- Combined Excellence: 74-75% F1 (Best technique combinations)
- Average Performance Drop: 1-3% on unseen test data
- Most Stable Approaches: K-Fold + Normalizer combinations
- Highest Risk: Single model implementations without regularization
- Best Generalization: Models with adversarial training components
- Deep Learning: PyTorch, TensorFlow
- Transformers: Hugging Face Transformers library
- Text Processing: Custom Bengali normalizers, NLTK
- Evaluation: Scikit-learn, Custom metrics implementations
- Adversarial: Custom FGM, AWP, FreeLB implementations
- Cross-Validation: Stratified K-Fold with scikit-learn
- GPU Acceleration: CUDA-enabled training
- Mixed Precision: For memory efficiency
- Gradient Accumulation: Effective batch size optimization
- Early Stopping: Preventing overfitting
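
Early stopping from the list above can be sketched as a small counter over validation scores: training stops once the score has failed to improve for a set number of epochs. The class name, the `patience` and `min_delta` parameters, and the scores are assumptions for illustration, not the notebooks' settings.

```python
class EarlyStopping:
    """Signal a stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
val_f1_per_epoch = [0.70, 0.72, 0.71, 0.715, 0.73]  # hypothetical validation scores
stopped_at = None
for epoch, score in enumerate(val_f1_per_epoch, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
```

In this toy run training stops at epoch 4, so the later 0.73 is never reached; a larger `patience` trades extra compute for the chance to recover from such dips.
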
- Bengali-Specific Normalization: NFKC Unicode with preservation strategies
- Advanced Adversarial Training: Multiple adversarial techniques comparison
- Custom Attention Heads: Learnable pooling mechanisms
- Robust Cross-Validation: Stratified K-Fold with ensemble strategies
- Multi-Phase Evaluation: Systematic development vs evaluation analysis
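
To make the NFKC item in the list above concrete: NFKC maps compatibility characters to their plain equivalents and then recomposes canonical sequences, so visually identical Bengali strings end up with the same code points. The sketch uses only Python's standard `unicodedata` module and illustrates the Unicode step alone, not the repository's full normalizer.

```python
import unicodedata

def nfkc(text):
    """Apply NFKC: compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", text)

decomposed_o = "\u0995\u09c7\u09be"   # ka + vowel sign e + vowel sign aa
composed_o = "\u0995\u09cb"           # ka + vowel sign o (renders identically)
same_after_nfkc = nfkc(decomposed_o) == nfkc(composed_o)

fullwidth_digits = nfkc("\uff11\uff12\uff13")  # compatibility forms of 1, 2, 3
```

Without this step, the two spellings of the same syllable would be tokenized differently even though a reader cannot tell them apart.
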
- Language-Specific Approaches: Bengali text requires specialized preprocessing
- Adversarial Robustness: Significant impact on generalization
- Cross-Validation Importance: Critical for reliable performance estimation
- Model Ensemble Benefits: Combining techniques yields optimal results
- Navigate to desired subtask directory
- Choose appropriate approach folder
- Open corresponding Jupyter notebook
- Ensure required dependencies are installed
- Execute cells sequentially for complete pipeline
- Each notebook contains complete training pipeline
- Data preprocessing and normalization included
- Model evaluation and metrics calculation automated
- Results saved with performance indicators
- Multi-Modal Approaches: Incorporating contextual information
- Advanced Ensembling: Sophisticated model combination strategies
- Real-Time Processing: Optimized inference pipelines
- Transfer Learning: Cross-task knowledge transfer
- Data Augmentation: Synthetic data generation for Bengali
- Explainability: Understanding model decision processes
- Fairness Analysis: Bias detection and mitigation
- Cross-Lingual Transfer: Knowledge sharing across languages
- Domain Adaptation: Generalization to different text domains
- Competition: Bengali Multi-task Hate Speech Identification Shared Task
- Workshop: BLP Workshop @ IJCNLP-AACL 2025
- Website: https://multihate.github.io/
- Evaluation Metrics:
- Subtask 1A & 1B: Micro-F1
- Subtask 1C: Weighted Micro-F1
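
For reference, micro-F1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below is a plain-Python illustration with hypothetical labels; it is not the official scorer, and it does not attempt the weighted variant used for Subtask 1C, whose weighting is not specified here.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for cls in classes:
        for truth, pred in zip(y_true, y_pred):
            if pred == cls and truth == cls:
                tp += 1
            elif pred == cls:
                fp += 1
            elif truth == cls:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["None", "Abusive", "Profane", "None"]         # hypothetical gold labels
y_pred = ["None", "Abusive", "None", "Political Hate"]  # hypothetical predictions
score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

When every example has exactly one label, as in Subtasks 1A and 1B, micro-F1 equals accuracy, which the last two lines confirm on the toy data.
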
id text label
Labels: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
id text label
Labels: Individuals, Organizations, Communities, Society
id text hate_type hate_severity to_whom
- hate_type: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
- hate_severity: Little to None, Mild, Severe
- to_whom: Individuals, Organizations, Communities, Society
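
A file in the Subtask 1C layout above can be read with the standard `csv` module. The two sample rows are hypothetical placeholders, not dataset content, and `read_rows` is an illustrative helper name.

```python
import csv
import io

# Hypothetical sample in the Subtask 1C column layout; text values are neutral placeholders.
sample_tsv = (
    "id\ttext\thate_type\thate_severity\tto_whom\n"
    "101\tplaceholder comment A\tAbusive\tMild\tIndividuals\n"
    "102\tplaceholder comment B\tProfane\tSevere\tSociety\n"
)

def read_rows(handle):
    """Read tab-separated rows into dictionaries keyed by the header names."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))

rows = read_rows(io.StringIO(sample_tsv))
hate_types = [row["hate_type"] for row in rows]
```

`csv.QUOTE_NONE` keeps quotation marks inside comments as literal text. If the files are loaded with pandas instead, it is worth checking that the literal label `None` is not parsed as a missing value; passing `keep_default_na=False` to `read_csv` disables the default missing-value strings.
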
This work represents comprehensive exploration of Bengali hate speech detection for the BLP Workshop @ IJCNLP-AACL 2025 shared task, contributing to the advancement of multilingual NLP and social media content moderation.
- Md Arid Hasan, PhD Student, The University of Toronto
- Firoj Alam, Senior Scientist, Qatar Computing Research Institute
- Md Fahad Hossain, Lecturer, Daffodil International University
- Usman Naseem, Assistant Professor, Macquarie University
- Syed Ishtiaque Ahmed, Associate Professor, The University of Toronto
Note: This repository demonstrates state-of-the-art approaches for Bengali hate speech detection across multiple classification tasks, with particular emphasis on robust evaluation methodology and practical implementation strategies for the official shared task.