Shared Task 1: Hate Speech Detection in Bengali

Project Overview

This repository contains implementations for the Bengali Multi-task Hate Speech Identification shared task at the BLP Workshop @ IJCNLP-AACL 2025. It covers three related subtasks: hate type classification (1A), target identification (1B), and multi-task analysis (1C). The approaches range from recurrent baselines (BiLSTM, LSTM with Attention) to fine-tuned transformer models trained with K-Fold cross-validation, text normalization, and adversarial training.

Competition Phases

🔬 Developmental Phase

Objective: Model experimentation, architecture exploration, and hyperparameter tuning
Data: Training and validation datasets provided by organizers
Focus: Testing various approaches and techniques to identify best-performing models
Metrics: Validation F1 scores on development set

🏆 Evaluation Phase

Objective: Final model evaluation on unseen test data
Data: Hidden test set released during evaluation period
Focus: Deploying best models from developmental phase with refined configurations
Metrics: Test F1 scores on official evaluation set

Repository Structure

Subtask 1A - Hate Speech Type Classification

Multi-class classification of Bengali text into: Abusive, Sexism, Religious Hate, Political Hate, Profane, or None.

📊 Developmental Phase Results

Deep Learning Models

BiLSTM - F1 Score: 56.25%
LSTM with Attention - F1 Score: 55.18%

Large Language Models (LLMs)

XLM-RoBERTa-large - F1 Score: 72.81%
MuRIL-large-cased - F1 Score: 71.02%
BanglaBERT (csebuetnlp) - F1 Score: 70.74%
BanglaBERT-large (csebuetnlp) - F1 Score: 70.51%
XLM-RoBERTa-base - F1 Score: 70.50%
DistilBERT-multilingual - F1 Score: 68.03%

LLMs with K-Fold Cross Validation

MuRIL-large-cased with K-Fold - F1 Score: 73.61%
XLM-RoBERTa-large with K-Fold - F1 Score: 73.45%
BanglaBERT with K-Fold - F1 Score: 73.29%

K-Fold with Text Normalizer

BanglaBERT with Normalizer - F1 Score: 74.32%
MuRIL-large-cased with Normalizer - F1 Score: 73.73%
XLM-RoBERTa-large with Normalizer - F1 Score: 73.29%

LLMs with Adversarial Training (K-Fold + FGM)

BanglaBERT with K-Fold + FGM - F1 Score: 73.87%
MuRIL-large-cased with K-Fold + FGM - F1 Score: 73.68%

Advanced Combined Approaches (K-Fold + FGM + Normalizer)

BanglaBERT + K-Fold + FGM + Normalizer - F1 Score: 74.88% ⭐ (Best Development Score)
MuRIL-large-cased + K-Fold + FGM + Normalizer - F1 Score: 73.81%

🎯 Evaluation Phase Results

BanglaBERT + K-Fold + FGM + Normalizer - Test F1: 72.33% ⭐ (Best Test Score)
MuRIL-large-cased + K-Fold + Normalizer - Test F1: 72.30%
BanglaBERT + K-Fold + FGM - Test F1: 72.17%
BanglaBERT + K-Fold - Test F1: 72.05%
MuRIL-large-cased + K-Fold + FGM - Test F1: 71.90%
MuRIL-large-cased + K-Fold - Test F1: 71.88%
XLM-RoBERTa-large + K-Fold - Test F1: 71.72%
XLM-RoBERTa-large + K-Fold + Normalizer - Test F1: 71.57%
MuRIL-large-cased + K-Fold + FGM + Normalizer - Test F1: 71.31%
BanglaBERT + K-Fold + Normalizer - Test F1: 71.14%
BanglaBERT (Base) - Test F1: 70.31%

Subtask 1B - Hate Speech Target Classification

Classification of hate speech targets into: Individuals, Organizations, Communities, or Society.

📊 Developmental Phase Results

Deep Learning Models

Traditional deep learning approaches implemented (scores pending)

Large Language Models (LLMs)

BanglaBERT - F1 Score: 72.09%
MuRIL-large-cased - F1 Score: 71.93%
XLM-RoBERTa-large - F1 Score: 71.38%

LLMs with K-Fold Cross Validation

MuRIL-large-cased with K-Fold - F1 Score: 74.96% ⭐ (Best Development Score)
BanglaBERT with K-Fold - F1 Score: 73.69%
XLM-RoBERTa-large with K-Fold - F1 Score: 71.53%

K-Fold with Text Normalizer

BanglaBERT with Normalizer - F1 Score: 74.72%
MuRIL-large-cased with Normalizer - F1 Score: 74.48%
XLM-RoBERTa-large with Normalizer - F1 Score: 72.39%

LLMs with K-Fold and Adversarial Attacks (FGM)

XLM-RoBERTa-large with K-Fold + FGM - F1 Score: 74.20%
BanglaBERT with K-Fold + FGM - F1 Score: 74.12%
MuRIL-large-cased with K-Fold + FGM - F1 Score: 73.89%

Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)

BanglaBERT + K-Fold + FGM + Normalizer - F1 Score: 74.64%
MuRIL-large-cased + K-Fold + FGM + Normalizer - F1 Score: 74.56%
XLM-RoBERTa-large + K-Fold + FGM + Normalizer - F1 Score: 74.32%

🎯 Evaluation Phase Results

Base LLMs (without K-Fold)

XLM-RoBERTa-large - Test F1: 71.23%
MuRIL-large-cased - Test F1: 70.93%
BanglaBERT - Test F1: 70.25%

LLMs with K-Fold Cross Validation

MuRIL-large-cased + K-Fold - Test F1: 73.44%
BanglaBERT + K-Fold - Test F1: 71.85%
XLM-RoBERTa-large + K-Fold - Test F1: 68.07%

K-Fold with Text Normalizer

MuRIL-large-cased + K-Fold + Normalizer - Test F1: 73.44%
BanglaBERT + K-Fold + Normalizer - Test F1: 72.89%
XLM-RoBERTa-large + K-Fold + Normalizer - Test F1: 71.66%

LLMs with K-Fold and Adversarial Attacks (FGM)

XLM-RoBERTa-large + K-Fold + FGM - Test F1: 73.28%
MuRIL-large-cased + K-Fold + FGM - Test F1: 72.92%
BanglaBERT + K-Fold + FGM - Test F1: 72.25%

Advanced Combined Approaches (K-Fold + FGM + Normalizer)

BanglaBERT + K-Fold + FGM + Normalizer - Test F1: 73.12% ⭐
MuRIL-large-cased + K-Fold + FGM + Normalizer - Test F1: 72.95% ⭐
XLM-RoBERTa-large + K-Fold + FGM + Normalizer - Test F1: 72.17%

Subtask 1C - Multi-task Hate Speech Analysis

Multi-task classification combining hate type (Abusive, Sexism, Religious Hate, Political Hate, Profane, None), severity (Little to None, Mild, Severe), and target group (Individuals, Organizations, Communities, Society).

📊 Developmental Phase Results

Base LLMs

Basic transformer implementations (scores pending)

LLMs with K-Fold Cross Validation

Standard K-Fold implementations (scores pending)

LLMs with Adversarial Training and K-Fold

All runs use BanglaBERT (csebuetnlp) with different adversarial techniques:

BanglaBERT + FreeLB - F1 Score: 74.52% ⭐ (Best Development Score)
BanglaBERT + Simple FreeLB - F1 Score: 73.91%
BanglaBERT + GAT - F1 Score: 73.79%
BanglaBERT + FGM - F1 Score: 73.75%

LLMs with K-Fold and Normalizer

Text normalization implementations (scores pending)

Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)

Comprehensive technique combinations (scores pending)

🎯 Evaluation Phase Results

LLMs with K-Fold and Normalizer

BanglaBERT + K-Fold + Normalizer - Test F1: 73.00%

LLMs with Adversarial Training and K-Fold

BanglaBERT + FreeLB + K-Fold - Test F1: 72.00%

Technical Implementation Details

Advanced Training Techniques

Adversarial Training Methods

FGM (Fast Gradient Method): Simple and efficient adversarial perturbations
AWP (Adversarial Weight Perturbation): Weight-space adversarial training
FreeLB: Free large-batch adversarial training for improved generalization
Simple FreeLB: Streamlined version of FreeLB
GAT (Geometry-Aware Training): Advanced geometry-aware adversarial training
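
As an illustration of the FGM entry above, the following is a minimal sketch of the perturbation step only, assuming the L2-normalised form r = epsilon * g / ||g|| that is common in NLP implementations. The vectors, the epsilon value, and the function names are hypothetical and are not taken from the notebooks in this repository.

```python
import math

def fgm_perturbation(grad, epsilon=1.0):
    """Return r = epsilon * g / ||g||_2 for one gradient vector."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0 or math.isnan(norm):
        # No usable direction, so leave the embedding unchanged.
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]

def attack(embedding, grad, epsilon=1.0):
    """Shift an embedding along the gradient, i.e. towards higher loss."""
    r = fgm_perturbation(grad, epsilon)
    return [e + d for e, d in zip(embedding, r)]

embedding = [0.5, -1.0, 2.0]   # toy embedding vector (made-up values)
grad = [3.0, 0.0, -4.0]        # toy loss gradient w.r.t. that embedding
perturbed = attack(embedding, grad, epsilon=0.5)
```

In a training loop this step usually sits between two backward passes: gradients are computed on the clean batch, the embedding weights are perturbed, a second forward and backward pass accumulates the adversarial gradients, and the original weights are restored before the optimizer step.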

Text Normalization Pipeline

normalize(
    text,
    unicode_norm="NFKC",          # Compatibility decomposition, then canonical composition
    punct_replacement=None,        # Preserve original punctuation
    url_replacement=None,          # Preserve URLs
    emoji_replacement=None,        # Preserve emojis
    apply_unicode_norm_last=True   # Apply normalization as final step
)
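
To make the `unicode_norm="NFKC"` setting concrete, here is a small sketch using only Python's standard `unicodedata` module. It shows what NFKC does to a few characters. It does not reproduce the other options of the `normalize(...)` call above, and the sample strings are chosen purely for illustration.

```python
import unicodedata

samples = {
    "ligature": "\ufb01",         # LATIN SMALL LIGATURE FI
    "fullwidth": "\uff21",        # FULLWIDTH LATIN CAPITAL LETTER A
    "bengali_o": "\u09c7\u09be",  # BENGALI VOWEL SIGN E followed by VOWEL SIGN AA
}

# NFKC first applies compatibility decomposition, then canonical composition.
normalized = {name: unicodedata.normalize("NFKC", text)
              for name, text in samples.items()}

for name, text in normalized.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

The ligature and the fullwidth letter are folded to plain ASCII, while the two-part Bengali vowel sign is composed into the single code point U+09CB, so visually identical strings end up with the same byte sequence before tokenization.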

Custom Model Architectures

Attention-Based Pooling Head: Dynamic token weighting for better representation
Multi-Head Classification: Custom classification layers for Bengali text
Enhanced Dropout Strategies: Improved regularization techniques
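
The attention-based pooling head can be pictured with the dependency-free sketch below. It assumes the simplest form of the idea: one learnable scoring vector, a softmax restricted to non-padding tokens, and a weighted sum of token vectors. The function name, the scoring vector, and the toy values are hypothetical; the actual heads in the notebooks may differ.

```python
import math

def attention_pool(hidden_states, attention_mask, weight):
    """Masked softmax over per-token scores, then a weighted sum of token vectors.

    Assumes at least one position in attention_mask is 1.
    """
    scores = [sum(w * h for w, h in zip(weight, vec)) for vec in hidden_states]
    kept = [s for s, m in zip(scores, attention_mask) if m]
    top = max(kept)  # subtract the max for numerical stability
    # Padding positions contribute exactly zero weight.
    exps = [math.exp(s - top) if m else 0.0
            for s, m in zip(scores, attention_mask)]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * vec[d] for a, vec in zip(alphas, hidden_states))
              for d in range(dim)]
    return pooled, alphas

hidden = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
pooled, alphas = attention_pool(hidden, mask, weight=[0.0, 0.0])
```

With an all-zero scoring vector the two real tokens receive equal weight and the padding row is ignored, which is the behaviour of masked mean pooling; training the scoring vector lets the weights move away from uniform.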

Cross-Validation Strategy

K-Fold Implementation: 5-fold cross-validation for robust evaluation
Stratified Sampling: Maintaining class distribution across folds
Ensemble Averaging: Combining predictions from multiple folds
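
The three items above can be sketched without any dependencies as follows. This is only an illustration of the idea: the repository lists scikit-learn's stratified K-Fold for the real splits, and the round-robin assignment, function names, and toy probabilities below are assumptions made for the example.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign example indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def average_probabilities(per_fold_probs):
    """Average class probabilities across fold models, then take the argmax."""
    n_models = len(per_fold_probs)
    n_examples = len(per_fold_probs[0])
    n_classes = len(per_fold_probs[0][0])
    preds = []
    for i in range(n_examples):
        mean = [sum(p[i][c] for p in per_fold_probs) / n_models
                for c in range(n_classes)]
        preds.append(max(range(n_classes), key=mean.__getitem__))
    return preds

labels = ["None"] * 10 + ["Abusive"] * 5
folds = stratified_folds(labels, k=5)

# Toy outputs of three fold models for two test examples and two classes.
per_fold_probs = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.4, 0.6]],
    [[0.2, 0.8], [0.6, 0.4]],
]
preds = average_probabilities(per_fold_probs)
```

Each fold here keeps the 2:1 class ratio of the full label list, and averaging probabilities (rather than voting on hard labels) lets a confident fold model outweigh two uncertain ones.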

Performance Analysis

📈 Best Performing Models by Phase

Developmental Phase Champions:

Subtask	Model	F1 Score	Technique
1A	BanglaBERT	74.88%	K-Fold + FGM + Normalizer
1B	MuRIL-large-cased	74.96%	K-Fold Cross Validation
1C	BanglaBERT	74.52%	FreeLB Adversarial Training

Evaluation Phase Performance:

Subtask	Model	Dev F1	Test F1	Performance Drop
1A	BanglaBERT + K-Fold + FGM + Normalizer	74.88%	72.33%	-2.55%
1B	MuRIL-large-cased + K-Fold	74.96%	73.44%	-1.52%
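1C	BanglaBERT + K-Fold + Normalizer	74.52%	73.00%	-1.52%
Note: The 1C row pairs the best development score (BanglaBERT + FreeLB, 74.52%) with the best test score (BanglaBERT + K-Fold + Normalizer, 73.00%). For FreeLB alone the change is 74.52% to 72.00%, i.e. -2.52%.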
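
The Performance Drop column is a plain difference in percentage points (test minus development), not a relative change. A short check of that arithmetic, using the values exactly as listed in the table:

```python
# (development F1, test F1) per subtask, copied from the table above.
results = {
    "1A": (74.88, 72.33),
    "1B": (74.96, 73.44),
    "1C": (74.52, 73.00),
}

# Difference in percentage points, rounded to two decimals.
drops = {task: round(test - dev, 2) for task, (dev, test) in results.items()}
print(drops)
```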

Best Test Phase Models (Subtask 1A):

Approach	BanglaBERT	MuRIL-large	XLM-RoBERTa-large
Base LLM	70.31%	-	-
+ K-Fold	72.05%	71.88%	71.72%
+ K-Fold + Normalizer	71.14%	72.30%	71.57%
+ K-Fold + FGM	72.17%	71.90%	-
+ K-Fold + FGM + Normalizer	72.33% ⭐	71.31%	-

Best Test Phase Models (Subtask 1B):

Approach	BanglaBERT	MuRIL-large	XLM-RoBERTa-large
Base LLM	70.25%	70.93%	71.23%
+ K-Fold	71.85%	73.44% ⭐	68.07%
+ K-Fold + Normalizer	72.89%	73.44% ⭐	71.66%
+ K-Fold + FGM	72.25%	72.92%	73.28%
+ K-Fold + FGM + Normalizer	73.12%	72.95%	72.17%

Best Test Phase Models (Subtask 1C):

Approach	BanglaBERT	Development	Test
K-Fold + Normalizer	✅	-	73.00% ⭐
K-Fold + FreeLB	✅	74.52%	72.00%
Simple FreeLB	✅	73.91%	-
GAT	✅	73.79%	-
FGM	✅	73.75%	-

Key Performance Insights

Development vs Evaluation Observations:

Generalization Gap: Roughly 1-3 percentage points lost from development to test for the best configuration of each subtask
Most Stable: K-Fold + Normalizer combinations showed best consistency (especially in Subtask 1C)
Overfitting Risk: Single models without cross-validation showed higher variance
Best Generalization:
- Subtask 1A: Adversarial training methods (FGM + Normalizer)
- Subtask 1B: Combined approaches (K-Fold + FGM + Normalizer)
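- Subtask 1C: Normalization techniques (best test score, 73.00%; the -1.52% drop is measured against the FreeLB development score because no development score is listed for this configuration)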

Technical Effectiveness:

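K-Fold Cross Validation: Development F1 gains of 0.2 to 3.0 points over the single-model baseline (e.g. Subtask 1A MuRIL-large-cased 71.02% → 73.61%), smallest for XLM-RoBERTa-large
Text Normalization: Up to 1.0 point on top of K-Fold (BanglaBERT gains 1.03 points on both 1A and 1B); neutral or slightly negative for some MuRIL and XLM-RoBERTa runs
Adversarial Training: Mixed on development data, from -1.1 to +2.7 points on top of K-Fold (largest gain: Subtask 1B XLM-RoBERTa-large, 71.53% → 74.20%)
Combined Techniques: Best overall score on Subtask 1A (74.88% development, 72.33% test)
Transformer Superiority: 12-19 points over the BiLSTM/LSTM baselines on Subtask 1A (56.25% vs 68.03-74.88%)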

Model Architecture Details

Transformer Models Utilized

BanglaBERT (csebuetnlp): Specialized Bengali language model
MuRIL-large-cased: Multilingual model with strong Bengali support
XLM-RoBERTa (base & large): Cross-lingual transformer variants
DistilBERT-multilingual: Lightweight multilingual model

Custom Implementations

Enhanced Tokenization: Bengali-specific preprocessing pipelines
Dynamic Padding: Efficient batch processing strategies
Label Smoothing: Improved training stability
Learning Rate Scheduling: Optimized training convergence
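
Two of the items above, dynamic padding and label smoothing, are simple enough to sketch in a few lines. The version below is a minimal illustration under stated assumptions: the padding id 0, the token ids, and epsilon = 0.1 are made-up values, and the smoothing formula is the common "(1 - epsilon) on the true class plus epsilon spread uniformly" variant, which may not match the notebooks exactly.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

def smooth_labels(true_class, n_classes, epsilon=0.1):
    """Soft target: (1 - epsilon) on the true class, epsilon spread uniformly."""
    uniform = epsilon / n_classes
    return [(1.0 - epsilon) + uniform if c == true_class else uniform
            for c in range(n_classes)]

input_ids, attention_mask = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
target = smooth_labels(true_class=2, n_classes=6)  # six hate-type classes
```

Padding per batch rather than to a fixed maximum length avoids spending compute on padding tokens when most comments are short.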

File Organization

Directory Structure:

Shared_Task1_HateSpeech/
├── subtask1A/                    # Hate speech type classification
│   ├── Developmental Phase/
│   │   ├── DL Models/           # BiLSTM, LSTM-Attention
│   │   ├── LLMs/                # Base transformer models
│   │   ├── LLMS with K Fold CV/ # K-Fold implementations
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   ├── LLMS_KFolds_attacks_normalizer/
│   │   └── Various classification heads/
│   └── Evaluation Phase/        # Final test submissions
├── subtask1B/                   # Hate speech target classification
│   ├── Developmental Phase/
│   │   ├── DL Models/
│   │   ├── LLMs/
│   │   ├── LLMS with K Fold CV/
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   └── LLMS_KFolds_attacks_normalizer/
│   └── Evaluation Phase/
│       ├── LLMs/
│       ├── LLMS with K Fold CV/
│       ├── K Folds with normalizer/
│       ├── LLMs_KFolds_adversarial attacks/
│       └── LLMS_KFolds_attacks_normalizer/
└── subtask1C/                   # Multi-task hate speech analysis
    ├── Developmental Phase/
    │   ├── LLMs/
    │   ├── LLMS with K Fold CV/
    │   ├── LLMs with adversarial attacks and K Fold CV/
    │   ├── LLMs with K Fold CV and normalizer/
    │   └── K Fold CV with attacks and normalizer/
    └── Evaluation Phase/
        ├── LLMs/
        ├── LLMS with K Fold CV/
        ├── LLMs with adversarial attacks and K Fold CV/
        ├── LLMs with K Fold CV and normalizer/
        └── K Fold CV with attacks and normalizer/

Naming Convention:

Model directories: v{f1_score}_{model_name}
- Example: v0.7488_banglabert-fgm = 74.88% F1 score using BanglaBERT with FGM
Each directory contains:
- Jupyter notebook (.ipynb) with complete implementation
- Dataset file (subtask_1X.tsv)
- Model checkpoints and outputs
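
The naming convention can be parsed mechanically, for example to build a results table from directory names. The sketch below assumes every run directory follows `v{f1_score}_{model_name}` exactly as in the example above; the pattern and the function name are hypothetical helpers, not part of the repository.

```python
import re

# 'v', a decimal F1 score, an underscore, then the model/technique label.
DIR_PATTERN = re.compile(r"^v(?P<f1>\d+\.\d+)_(?P<model>.+)$")

def parse_run_dir(name):
    """Split a run directory name into (F1 in percent, model label)."""
    match = DIR_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a run directory name: {name!r}")
    f1_percent = round(float(match.group("f1")) * 100, 2)
    return f1_percent, match.group("model")

f1, model = parse_run_dir("v0.7488_banglabert-fgm")
print(f1, model)
```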

Performance Evolution

Developmental Phase Progression:

Baseline Models: 55-56% F1 (BiLSTM and LSTM with Attention)
Base Transformers: 68-73% F1 (Standard LLM implementations)
K-Fold Enhancement: 71-75% F1 (Cross-validation improvements)
Normalization Boost: 72-75% F1 (Text preprocessing optimization)
Adversarial Training: 73-75% F1 (Robustness improvements)
Combined Excellence: 74-75% F1 (Best technique combinations)

Development → Evaluation Trends:

Average Performance Drop: 1-3% on unseen test data
Most Stable Approaches: K-Fold + Normalizer combinations
Highest Risk: Single model implementations without regularization
Best Generalization: Models with adversarial training components

Technologies and Frameworks

Core Technologies:

Deep Learning: PyTorch, TensorFlow
Transformers: Hugging Face Transformers library
Text Processing: Custom Bengali normalizers, NLTK
Evaluation: Scikit-learn, Custom metrics implementations
Adversarial: Custom FGM, AWP, FreeLB implementations
Cross-Validation: Stratified K-Fold with scikit-learn

Hardware and Training:

GPU Acceleration: CUDA-enabled training
Mixed Precision: For memory efficiency
Gradient Accumulation: Effective batch size optimization
Early Stopping: Preventing overfitting

Key Contributions

Novel Techniques Implemented:

Bengali-Specific Normalization: NFKC Unicode with preservation strategies
Advanced Adversarial Training: Multiple adversarial techniques comparison
Custom Attention Heads: Learnable pooling mechanisms
Robust Cross-Validation: Stratified K-Fold with ensemble strategies
Multi-Phase Evaluation: Systematic development vs evaluation analysis

Research Insights:

Language-Specific Approaches: Bengali text requires specialized preprocessing
Adversarial Robustness: Significant impact on generalization
Cross-Validation Importance: Critical for reliable performance estimation
Model Ensemble Benefits: Combining techniques yields optimal results

Usage Instructions

Running Experiments:

Navigate to desired subtask directory
Choose appropriate approach folder
Open corresponding Jupyter notebook
Ensure required dependencies are installed
Execute cells sequentially for complete pipeline

Model Training:

Each notebook contains complete training pipeline
Data preprocessing and normalization included
Model evaluation and metrics calculation automated
Results saved with performance indicators

Future Work

Potential Improvements:

Multi-Modal Approaches: Incorporating contextual information
Advanced Ensembling: Sophisticated model combination strategies
Real-Time Processing: Optimized inference pipelines
Transfer Learning: Cross-task knowledge transfer
Data Augmentation: Synthetic data generation for Bengali

Research Directions:

Explainability: Understanding model decision processes
Fairness Analysis: Bias detection and mitigation
Cross-Lingual Transfer: Knowledge sharing across languages
Domain Adaptation: Generalization to different text domains

Official Task Information

Task Details

Competition: Bengali Multi-task Hate Speech Identification Shared Task
Workshop: BLP Workshop @ IJCNLP-AACL 2025
Website: https://multihate.github.io/
Evaluation Metrics:
- Subtask 1A & 1B: Micro-F1
- Subtask 1C: Weighted Micro-F1
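
For reference, micro-F1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below is a plain-Python illustration, not the organizers' scorer, and it does not cover the weighting used for Subtask 1C, which is not specified here. The label sequences are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ["Abusive", "None", "Profane", "None"]
y_pred = ["Abusive", "None", "None", "None"]
score = micro_f1(y_true, y_pred)
```

When every example has exactly one gold label and one predicted label, as in Subtasks 1A and 1B, each error counts once as a false positive and once as a false negative, so micro-F1 equals plain accuracy (3 of 4 correct in this toy case).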

Data Format

Subtask 1A

id    text    label

Labels: Abusive, Sexism, Religious Hate, Political Hate, Profane, None

Subtask 1B

id    text    label

Labels: Individuals, Organizations, Communities, Society

Subtask 1C

id    text    hate_type    hate_severity    to_whom

hate_type: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
hate_severity: Little to None, Mild, Severe
to_whom: Individuals, Organizations, Communities, Society
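
A file in the Subtask 1C layout can be read and validated with the standard `csv` module, as in the sketch below. It assumes tab-separated columns with a header row, as the `.tsv` extension suggests; the sample row, the placeholder text, and the function name are made up for illustration.

```python
import csv
import io

ALLOWED = {
    "hate_type": {"Abusive", "Sexism", "Religious Hate", "Political Hate",
                  "Profane", "None"},
    "hate_severity": {"Little to None", "Mild", "Severe"},
    "to_whom": {"Individuals", "Organizations", "Communities", "Society"},
}

def read_subtask_1c(handle):
    """Read tab-separated rows and reject labels outside the documented sets."""
    # QUOTE_NONE keeps stray quotation marks in social-media text literal.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        for column, allowed in ALLOWED.items():
            if row[column] not in allowed:
                raise ValueError(f"unexpected {column}: {row[column]!r}")
        rows.append(row)
    return rows

sample = (
    "id\ttext\thate_type\thate_severity\tto_whom\n"
    "1\t<comment text>\tAbusive\tMild\tIndividuals\n"
)
rows = read_subtask_1c(io.StringIO(sample))
```

For a real file, pass an open handle instead of the `io.StringIO` sample, for example `open(path, encoding="utf-8", newline="")`.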

Citation and Acknowledgments

This work is an exploration of Bengali hate speech detection for the BLP Workshop @ IJCNLP-AACL 2025 shared task, contributing to multilingual NLP and social media content moderation.

Organizers

Md Arid Hasan, PhD Student, The University of Toronto
Firoj Alam, Senior Scientist, Qatar Computing Research Institute
Md Fahad Hossain, Lecturer, Daffodil International University
Usman Naseem, Assistant Professor, Macquarie University
Syed Ishtiaque Ahmed, Associate Professor, The University of Toronto

Note: This repository demonstrates state-of-the-art approaches for Bengali hate speech detection across multiple classification tasks, with particular emphasis on robust evaluation methodology and practical implementation strategies for the official shared task.
