SyedT1/Shared_Task1_HateSpeech

Shared Task 1: Hate Speech Detection in Bengali

Project Overview

This repository contains comprehensive implementations for the Bengali Multi-task Hate Speech Identification shared task at BLP Workshop @IJCNLP-AACL 2025. The project addresses the complex problem of detecting and understanding hate speech in Bengali across three related subtasks: hate type classification, target identification, and multi-task analysis. The implementation explores various machine learning approaches from traditional deep learning to state-of-the-art transformer models with advanced training techniques.

Competition Phases

🔬 Developmental Phase

  • Objective: Model experimentation, architecture exploration, and hyperparameter tuning
  • Data: Training and validation datasets provided by organizers
  • Focus: Testing various approaches and techniques to identify best-performing models
  • Metrics: Validation F1 scores on development set

🏆 Evaluation Phase

  • Objective: Final model evaluation on unseen test data
  • Data: Hidden test set released during evaluation period
  • Focus: Deploying best models from developmental phase with refined configurations
  • Metrics: Test F1 scores on official evaluation set

Repository Structure

Subtask 1A - Hate Speech Type Classification

Multi-class classification of Bengali text into: Abusive, Sexism, Religious Hate, Political Hate, Profane, or None.

📊 Developmental Phase Results

Deep Learning Models
  • BiLSTM - F1 Score: 56.25%
  • LSTM with Attention - F1 Score: 55.18%
Large Language Models (LLMs)
  • XLM-RoBERTa-large - F1 Score: 72.81%
  • MuRIL-large-cased - F1 Score: 71.02%
  • BanglaBERT (csebuetnlp) - F1 Score: 70.74%
  • BanglaBERT-large (csebuetnlp) - F1 Score: 70.51%
  • XLM-RoBERTa-base - F1 Score: 70.50%
  • DistilBERT-multilingual - F1 Score: 68.03%
LLMs with K-Fold Cross Validation
  • MuRIL-large-cased with K-Fold - F1 Score: 73.61%
  • XLM-RoBERTa-large with K-Fold - F1 Score: 73.45%
  • BanglaBERT with K-Fold - F1 Score: 73.29%
K-Fold with Text Normalizer
  • BanglaBERT with Normalizer - F1 Score: 74.32%
  • MuRIL-large-cased with Normalizer - F1 Score: 73.73%
  • XLM-RoBERTa-large with Normalizer - F1 Score: 73.29%
LLMs with Adversarial Training (K-Fold + FGM)
  • BanglaBERT with K-Fold + FGM - F1 Score: 73.87%
  • MuRIL-large-cased with K-Fold + FGM - F1 Score: 73.68%
Advanced Combined Approaches (K-Fold + FGM + Normalizer)
  • BanglaBERT + K-Fold + FGM + Normalizer - F1 Score: 74.88% ⭐ (Best Development Score)
  • MuRIL-large-cased + K-Fold + FGM + Normalizer - F1 Score: 73.81%

🎯 Evaluation Phase Results

  • BanglaBERT + K-Fold + FGM + Normalizer - Test F1: 72.33% ⭐ (Best Test Score)
  • MuRIL-large-cased + K-Fold + Normalizer - Test F1: 72.30%
  • BanglaBERT + K-Fold + FGM - Test F1: 72.17%
  • BanglaBERT + K-Fold - Test F1: 72.05%
  • MuRIL-large-cased + K-Fold + FGM - Test F1: 71.90%
  • MuRIL-large-cased + K-Fold - Test F1: 71.88%
  • XLM-RoBERTa-large + K-Fold - Test F1: 71.72%
  • XLM-RoBERTa-large + K-Fold + Normalizer - Test F1: 71.57%
  • MuRIL-large-cased + K-Fold + FGM + Normalizer - Test F1: 71.31%
  • BanglaBERT + K-Fold + Normalizer - Test F1: 71.14%
  • BanglaBERT (Base) - Test F1: 70.31%

Subtask 1B - Hate Speech Target Classification

Classification of hate speech targets into: Individuals, Organizations, Communities, or Society.

📊 Developmental Phase Results

Deep Learning Models
  • Traditional deep learning approaches implemented (scores pending)
Large Language Models (LLMs)
  • BanglaBERT - F1 Score: 72.09%
  • MuRIL-large-cased - F1 Score: 71.93%
  • XLM-RoBERTa-large - F1 Score: 71.38%
LLMs with K-Fold Cross Validation
  • MuRIL-large-cased with K-Fold - F1 Score: 74.96% ⭐ (Best Development Score)
  • BanglaBERT with K-Fold - F1 Score: 73.69%
  • XLM-RoBERTa-large with K-Fold - F1 Score: 71.53%
K-Fold with Text Normalizer
  • BanglaBERT with Normalizer - F1 Score: 74.72%
  • MuRIL-large-cased with Normalizer - F1 Score: 74.48%
  • XLM-RoBERTa-large with Normalizer - F1 Score: 72.39%
LLMs with K-Fold and Adversarial Attacks (FGM)
  • XLM-RoBERTa-large with K-Fold + FGM - F1 Score: 74.20%
  • BanglaBERT with K-Fold + FGM - F1 Score: 74.12%
  • MuRIL-large-cased with K-Fold + FGM - F1 Score: 73.89%
Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)
  • BanglaBERT + K-Fold + FGM + Normalizer - F1 Score: 74.64%
  • MuRIL-large-cased + K-Fold + FGM + Normalizer - F1 Score: 74.56%
  • XLM-RoBERTa-large + K-Fold + FGM + Normalizer - F1 Score: 74.32%

🎯 Evaluation Phase Results

Base LLMs (without K-Fold)
  • XLM-RoBERTa-large - Test F1: 71.23%
  • MuRIL-large-cased - Test F1: 70.93%
  • BanglaBERT - Test F1: 70.25%
LLMs with K-Fold Cross Validation
  • MuRIL-large-cased + K-Fold - Test F1: 73.44%
  • BanglaBERT + K-Fold - Test F1: 71.85%
  • XLM-RoBERTa-large + K-Fold - Test F1: 68.07%
K-Fold with Text Normalizer
  • MuRIL-large-cased + K-Fold + Normalizer - Test F1: 73.44%
  • BanglaBERT + K-Fold + Normalizer - Test F1: 72.89%
  • XLM-RoBERTa-large + K-Fold + Normalizer - Test F1: 71.66%
LLMs with K-Fold and Adversarial Attacks (FGM)
  • XLM-RoBERTa-large + K-Fold + FGM - Test F1: 73.28%
  • MuRIL-large-cased + K-Fold + FGM - Test F1: 72.92%
  • BanglaBERT + K-Fold + FGM - Test F1: 72.25%
Advanced Combined Approaches (K-Fold + FGM + Normalizer)
  • BanglaBERT + K-Fold + FGM + Normalizer - Test F1: 73.12% ⭐
  • MuRIL-large-cased + K-Fold + FGM + Normalizer - Test F1: 72.95% ⭐
  • XLM-RoBERTa-large + K-Fold + FGM + Normalizer - Test F1: 72.17%

Subtask 1C - Multi-task Hate Speech Analysis

Multi-task classification combining hate type (Abusive, Sexism, Religious Hate, Political Hate, Profane, None), severity (Little to None, Mild, Severe), and target group (Individuals, Organizations, Communities, Society).

📊 Developmental Phase Results

Base LLMs
  • Basic transformer implementations (scores pending)
LLMs with K-Fold Cross Validation
  • Standard K-Fold implementations (scores pending)
LLMs with Adversarial Training and K-Fold

All runs use BanglaBERT (csebuetnlp) with different adversarial techniques:

  • BanglaBERT + FreeLB - F1 Score: 74.52% ⭐ (Best Development Score)
  • BanglaBERT + Simple FreeLB - F1 Score: 73.91%
  • BanglaBERT + GAT - F1 Score: 73.79%
  • BanglaBERT + FGM - F1 Score: 73.75%
LLMs with K-Fold and Normalizer
  • Text normalization implementations (scores pending)
Advanced Combined Approaches (K-Fold + Adversarial + Normalizer)
  • Comprehensive technique combinations (scores pending)

🎯 Evaluation Phase Results

LLMs with K-Fold and Normalizer
  • BanglaBERT + K-Fold + Normalizer - Test F1: 73.00%
LLMs with Adversarial Training and K-Fold
  • BanglaBERT + FreeLB + K-Fold - Test F1: 72.00%

Technical Implementation Details

Advanced Training Techniques

Adversarial Training Methods

  • FGM (Fast Gradient Method): Simple and efficient adversarial perturbations
  • AWP (Adversarial Weight Perturbation): Weight-space adversarial training
  • FreeLB: Free large-batch adversarial training for improved generalization
  • Simple FreeLB: Streamlined version of FreeLB
  • GAT (Geometry-Aware Training): Advanced geometry-aware adversarial training

Text Normalization Pipeline

# Assumption: these keyword arguments match csebuetnlp's `normalizer` package
from normalizer import normalize

normalize(
    text,
    unicode_norm="NFKC",          # Compatibility decomposition, then canonical composition
    punct_replacement=None,        # Preserve original punctuation
    url_replacement=None,          # Preserve URLs
    emoji_replacement=None,        # Preserve emojis
    apply_unicode_norm_last=True   # Apply Unicode normalization as the final step
)
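
To show what the `NFKC` setting changes in Bengali text, the sketch below uses only Python's standard `unicodedata` module rather than the normalizer itself. It covers the Unicode step alone; the helper name `nfkc` is an assumption for illustration.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization using the standard library."""
    return unicodedata.normalize("NFKC", text)


# Two encodings of the same Bengali syllable: the vowel sign O can be typed
# as two code points (U+09C7 U+09BE) or as one precomposed sign (U+09CB).
split_o = "\u0995\u09c7\u09be"   # KA + E sign + AA sign
composed_o = "\u0995\u09cb"      # KA + O sign
print(nfkc(split_o) == nfkc(composed_o))

# Compatibility characters such as fullwidth digits map to their plain forms.
print(nfkc("\uff11\uff12\uff13"))
```

Without this step, visually identical strings can tokenize differently, which is one plausible reason normalization matters for the Bengali models here.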

Custom Model Architectures

  • Attention-Based Pooling Head: Dynamic token weighting for better representation
  • Multi-Head Classification: Custom classification layers for Bengali text
  • Enhanced Dropout Strategies: Improved regularization techniques
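
A minimal sketch of the attention-based pooling idea, assuming the head produces one score per token and that padding is excluded through a mask. In a real head the scores would come from a learned layer over the encoder outputs; here they are passed in directly, and the function name `attention_pool` is hypothetical.

```python
import math
from typing import List, Sequence


def attention_pool(tokens: Sequence[Sequence[float]],
                   scores: Sequence[float],
                   mask: Sequence[int]) -> List[float]:
    """Softmax the scores over non-padding tokens, then return the weighted sum of token vectors."""
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one token")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]   # last row stands in for padding
pooled = attention_pool(tokens, scores=[0.0, 0.0, 5.0], mask=[1, 1, 0])
print(pooled)
```

The masked token gets zero weight regardless of its score, so padding never leaks into the pooled representation.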

Cross-Validation Strategy

  • K-Fold Implementation: 5-fold cross-validation for robust evaluation
  • Stratified Sampling: Maintaining class distribution across folds
  • Ensemble Averaging: Combining predictions from multiple folds
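
The three points above can be sketched in plain Python: assign fold ids so each class is spread evenly, then average the per-fold class probabilities at prediction time. This is an illustrative stand-in, not the notebooks' code; the function names, the seed, and the toy labels are assumptions, and an actual run would more likely rely on scikit-learn's stratified splitter.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[str], k: int = 5, seed: int = 42) -> List[int]:
    """Assign each example a fold id so every class is spread evenly over k folds."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k  # round-robin within the class
    return fold_of


def average_fold_probs(fold_probs: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average class probabilities over folds; input is indexed [fold][example][class]."""
    n_folds = len(fold_probs)
    return [[sum(col) / n_folds for col in zip(*per_example)]
            for per_example in zip(*fold_probs)]


labels = ["None"] * 10 + ["Abusive"] * 5
folds = stratified_folds(labels, k=5)
avg = average_fold_probs([[[0.6, 0.4]], [[0.2, 0.8]]])  # two folds, one example
print(folds, avg)
```

The final label for each example is the class with the highest averaged probability.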

Performance Analysis

📈 Best Performing Models by Phase

Developmental Phase Champions:

| Subtask | Model | F1 Score | Technique |
|---|---|---|---|
| 1A | BanglaBERT | 74.88% | K-Fold + FGM + Normalizer |
| 1B | MuRIL-large-cased | 74.96% | K-Fold Cross Validation |
| 1C | BanglaBERT | 74.52% | FreeLB Adversarial Training |

Evaluation Phase Performance:

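| Subtask | Model | Dev F1 | Test F1 | Performance Drop (percentage points) |
|---|---|---|---|---|
| 1A | BanglaBERT + K-Fold + FGM + Normalizer | 74.88% | 72.33% | -2.55 |
| 1B | MuRIL-large-cased + K-Fold | 74.96% | 73.44% | -1.52 |
| 1C | BanglaBERT + K-Fold + Normalizer | 74.52%* | 73.00% | -1.52* |

* The 74.52% development score comes from the BanglaBERT + FreeLB run. No development score is reported for K-Fold + Normalizer, so the 1C drop is not a like-for-like comparison.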
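
The drop figures are the test score minus the development score, in percentage points. The short check below recomputes them from scores reported in this README. The helper name `performance_drop` is an assumption, and the 1C entry uses the FreeLB run because it is the only 1C configuration with both a development and a test score.

```python
def performance_drop(dev_f1: float, test_f1: float) -> float:
    """Return test minus dev F1 in percentage points, rounded to two decimals."""
    return round(test_f1 - dev_f1, 2)


# (development F1, test F1) pairs taken from the results sections
reported = {
    "1A BanglaBERT + K-Fold + FGM + Normalizer": (74.88, 72.33),
    "1B MuRIL-large-cased + K-Fold": (74.96, 73.44),
    "1C BanglaBERT + FreeLB + K-Fold": (74.52, 72.00),
}
drops = {name: performance_drop(dev, test) for name, (dev, test) in reported.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:+.2f}")
```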

Best Test Phase Models (Subtask 1A):

| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|---|---|---|---|
| Base LLM | 70.31% | - | - |
| + K-Fold | 72.05% | 71.88% | 71.72% |
| + K-Fold + Normalizer | 71.14% | 72.30% | 71.57% |
| + K-Fold + FGM | 72.17% | 71.90% | - |
| + K-Fold + FGM + Normalizer | 72.33% ⭐ | 71.31% | - |

Best Test Phase Models (Subtask 1B):

| Approach | BanglaBERT | MuRIL-large | XLM-RoBERTa-large |
|---|---|---|---|
| Base LLM | 70.25% | 70.93% | 71.23% |
| + K-Fold | 71.85% | 73.44% ⭐ | 68.07% |
| + K-Fold + Normalizer | 72.89% | 73.44% ⭐ | 71.66% |
| + K-Fold + FGM | 72.25% | 72.92% | 73.28% |
| + K-Fold + FGM + Normalizer | 73.12% | 72.95% | 72.17% |

Best Test Phase Models (Subtask 1C):

| Approach (BanglaBERT) | Development F1 | Test F1 |
|---|---|---|
| K-Fold + Normalizer | - | 73.00% ⭐ |
| K-Fold + FreeLB | 74.52% | 72.00% |
| Simple FreeLB | 73.91% | - |
| GAT | 73.79% | - |
| FGM | 73.75% | - |

Key Performance Insights

Development vs Evaluation Observations:

  • Generalization Gap: Scores typically dropped 1-3 percentage points from development to test across all subtasks
  • Most Stable: K-Fold + Normalizer combinations showed the best consistency (especially in Subtask 1C)
  • Overfitting Risk: Single models without cross-validation showed higher variance
  • Best Generalization:
    • Subtask 1A: Adversarial training methods (FGM + Normalizer)
    • Subtask 1B: Combined approaches (K-Fold + FGM + Normalizer)
    • Subtask 1C: Normalization techniques (best test score, 73.00%; the -1.52 point drop is measured against the FreeLB development score)

Technical Effectiveness:

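  • K-Fold Cross Validation: Gains of about 0.2 to 3.0 points over the base models on the development sets; on test, the one regression is XLM-RoBERTa-large on Subtask 1B (71.23% to 68.07%)
  • Text Normalization: About +1.0 point for BanglaBERT on the Subtask 1A and 1B development sets; mixed for the other models (-0.5 to +0.9 points)
  • Adversarial Training (FGM): Changes over K-Fold alone range from -1.1 to +2.7 points on development, with the largest gain for XLM-RoBERTa-large on Subtask 1B
  • Combined Techniques: Best development and test scores on Subtask 1A; on Subtask 1B, MuRIL-large-cased with K-Fold alone scored highest
  • Transformer Superiority: Roughly 12 to 18 points above the BiLSTM and LSTM-with-Attention baselines on the Subtask 1A development set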

Model Architecture Details

Transformer Models Utilized

  • BanglaBERT (csebuetnlp): Specialized Bengali language model
  • MuRIL-large-cased: Multilingual model with strong Bengali support
  • XLM-RoBERTa (base & large): Cross-lingual transformer variants
  • DistilBERT-multilingual: Lightweight multilingual model

Custom Implementations

  • Enhanced Tokenization: Bengali-specific preprocessing pipelines
  • Dynamic Padding: Efficient batch processing strategies
  • Label Smoothing: Improved training stability
  • Learning Rate Scheduling: Optimized training convergence
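
As an illustration of the label smoothing item above, the sketch below computes cross-entropy against a softened target that puts `1 - smoothing` on the true class and spreads `smoothing` uniformly over all classes. The function name and the 0.1 default are assumptions; the README does not state the value used.

```python
import math
from typing import Sequence


def smoothed_cross_entropy(probs: Sequence[float], target: int,
                           smoothing: float = 0.1) -> float:
    """Cross-entropy against a smoothed target distribution over len(probs) classes."""
    n_classes = len(probs)
    loss = 0.0
    for cls, p in enumerate(probs):
        # Every class gets smoothing / K; the true class also gets 1 - smoothing.
        q = smoothing / n_classes
        if cls == target:
            q += 1.0 - smoothing
        loss -= q * math.log(p)
    return loss


confident = [0.98, 0.01, 0.01]
print(smoothed_cross_entropy(confident, target=0, smoothing=0.0))
print(smoothed_cross_entropy(confident, target=0, smoothing=0.1))
```

With smoothing enabled, a very confident correct prediction still incurs a noticeable loss, which discourages the model from pushing probabilities to the extremes.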

File Organization

Directory Structure:

Shared_Task1_HateSpeech/
├── subtask1A/                    # Hate speech type classification
│   ├── Developmental Phase/
│   │   ├── DL Models/           # BiLSTM, LSTM-Attention
│   │   ├── LLMs/                # Base transformer models
│   │   ├── LLMS with K Fold CV/ # K-Fold implementations
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   ├── LLMS_KFolds_attacks_normalizer/
│   │   └── Various classification heads/
│   └── Evaluation Phase/        # Final test submissions
├── subtask1B/                   # Hate speech target classification
│   ├── Developmental Phase/
│   │   ├── DL Models/
│   │   ├── LLMs/
│   │   ├── LLMS with K Fold CV/
│   │   ├── K Folds with normalizer/
│   │   ├── LLMs_KFolds_adversarial attacks/
│   │   └── LLMS_KFolds_attacks_normalizer/
│   └── Evaluation Phase/
│       ├── LLMs/
│       ├── LLMS with K Fold CV/
│       ├── K Folds with normalizer/
│       ├── LLMs_KFolds_adversarial attacks/
│       └── LLMS_KFolds_attacks_normalizer/
└── subtask1C/                   # Multi-task hate speech analysis
    ├── Developmental Phase/
    │   ├── LLMs/
    │   ├── LLMS with K Fold CV/
    │   ├── LLMs with adversarial attacks and K Fold CV/
    │   ├── LLMs with K Fold CV and normalizer/
    │   └── K Fold CV with attacks and normalizer/
    └── Evaluation Phase/
        ├── LLMs/
        ├── LLMS with K Fold CV/
        ├── LLMs with adversarial attacks and K Fold CV/
        ├── LLMs with K Fold CV and normalizer/
        └── K Fold CV with attacks and normalizer/

Naming Convention:

  • Model directories: v{f1_score}_{model_name}
    • Example: v0.7488_banglabert-fgm = 74.88% F1 score using BanglaBERT with FGM
  • Each directory contains:
    • Jupyter notebook (.ipynb) with complete implementation
    • Dataset file (subtask_1X.tsv)
    • Model checkpoints and outputs
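
A small sketch of how the naming convention can be parsed, for example to rank run directories by score. The regular expression and the `parse_run_dir` helper are assumptions based on the single example given above.

```python
import re
from typing import NamedTuple


class RunDir(NamedTuple):
    f1: float
    model_name: str


_PATTERN = re.compile(r"^v(?P<f1>\d+\.\d+)_(?P<name>.+)$")


def parse_run_dir(dirname: str) -> RunDir:
    """Split a v{f1_score}_{model_name} directory name into its two parts."""
    match = _PATTERN.match(dirname)
    if match is None:
        raise ValueError(f"not a v{{f1_score}}_{{model_name}} directory: {dirname!r}")
    return RunDir(float(match["f1"]), match["name"])


run = parse_run_dir("v0.7488_banglabert-fgm")
print(f"{run.model_name}: {run.f1:.2%}")
```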

Performance Evolution

Developmental Phase Progression:

  1. Baseline Models: 55-68% F1 (Deep Learning approaches)
  2. Base Transformers: 68-73% F1 (Standard LLM implementations)
  3. K-Fold Enhancement: 70-74% F1 (Cross-validation improvements)
  4. Normalization Boost: 73-75% F1 (Text preprocessing optimization)
  5. Adversarial Training: 73-75% F1 (Robustness improvements)
  6. Combined Excellence: 74-75% F1 (Best technique combinations)

Development → Evaluation Trends:

  • Average Performance Drop: 1-3% on unseen test data
  • Most Stable Approaches: K-Fold + Normalizer combinations
  • Highest Risk: Single model implementations without regularization
  • Best Generalization: Models with adversarial training components

Technologies and Frameworks

Core Technologies:

  • Deep Learning: PyTorch, TensorFlow
  • Transformers: Hugging Face Transformers library
  • Text Processing: Custom Bengali normalizers, NLTK
  • Evaluation: Scikit-learn, Custom metrics implementations
  • Adversarial: Custom FGM, AWP, FreeLB implementations
  • Cross-Validation: Stratified K-Fold with scikit-learn

Hardware and Training:

  • GPU Acceleration: CUDA-enabled training
  • Mixed Precision: For memory efficiency
  • Gradient Accumulation: Effective batch size optimization
  • Early Stopping: Preventing overfitting
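
Early stopping, the last item above, can be sketched as a small counter over validation scores. The class below is a generic illustration, not the notebooks' code; the `patience` and `min_delta` parameters and the per-epoch scores are made up.

```python
class EarlyStopping:
    """Signal a stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score: float) -> bool:
        """Record one epoch's score; return True if training should stop."""
        if score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
val_f1_per_epoch = [0.70, 0.72, 0.71, 0.72, 0.73]   # made-up scores for illustration
stopped_at = next(
    (epoch for epoch, f1 in enumerate(val_f1_per_epoch) if stopper.step(f1)),
    None,
)
print(stopped_at, stopper.best)
```

Note that a tie with the best score counts as no improvement here, so training stops before the final epoch is reached.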

Key Contributions

Novel Techniques Implemented:

  1. Bengali-Specific Normalization: NFKC Unicode with preservation strategies
  2. Advanced Adversarial Training: Multiple adversarial techniques comparison
  3. Custom Attention Heads: Learnable pooling mechanisms
  4. Robust Cross-Validation: Stratified K-Fold with ensemble strategies
  5. Multi-Phase Evaluation: Systematic development vs evaluation analysis

Research Insights:

  • Language-Specific Approaches: Bengali text requires specialized preprocessing
  • Adversarial Robustness: Significant impact on generalization
  • Cross-Validation Importance: Critical for reliable performance estimation
  • Model Ensemble Benefits: Combining techniques yields optimal results

Usage Instructions

Running Experiments:

  1. Navigate to desired subtask directory
  2. Choose appropriate approach folder
  3. Open corresponding Jupyter notebook
  4. Ensure required dependencies are installed
  5. Execute cells sequentially for complete pipeline

Model Training:

  • Each notebook contains complete training pipeline
  • Data preprocessing and normalization included
  • Model evaluation and metrics calculation automated
  • Results saved with performance indicators

Future Work

Potential Improvements:

  • Multi-Modal Approaches: Incorporating contextual information
  • Advanced Ensembling: Sophisticated model combination strategies
  • Real-Time Processing: Optimized inference pipelines
  • Transfer Learning: Cross-task knowledge transfer
  • Data Augmentation: Synthetic data generation for Bengali

Research Directions:

  • Explainability: Understanding model decision processes
  • Fairness Analysis: Bias detection and mitigation
  • Cross-Lingual Transfer: Knowledge sharing across languages
  • Domain Adaptation: Generalization to different text domains

Official Task Information

Task Details

  • Competition: Bengali Multi-task Hate Speech Identification Shared Task
  • Workshop: BLP Workshop @ IJCNLP-AACL 2025
  • Website: https://multihate.github.io/
  • Evaluation Metrics:
    • Subtask 1A & 1B: Micro-F1
    • Subtask 1C: Weighted Micro-F1
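
For reference, plain micro-F1 pools true positives, false positives and false negatives over all classes before computing F1. The sketch below implements that definition in standard-library Python with made-up labels. It does not cover the weighting used for Subtask 1C, which is defined by the organizers and not described here.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1 once."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == cls and truth == cls:
                tp += 1
            elif pred == cls:
                fp += 1
            elif truth == cls:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


gold = ["None", "Abusive", "Profane", "None"]
pred = ["None", "Profane", "Profane", "Abusive"]
print(micro_f1(gold, pred))
```

When every example has exactly one label, each error counts once as a false positive and once as a false negative, so micro-F1 equals accuracy.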

Data Format

Subtask 1A

id    text    label

Labels: Abusive, Sexism, Religious Hate, Political Hate, Profane, None

Subtask 1B

id    text    label

Labels: Individuals, Organizations, Communities, Society

Subtask 1C

id    text    hate_type    hate_severity    to_whom
  • hate_type: Abusive, Sexism, Religious Hate, Political Hate, Profane, None
  • hate_severity: Little to None, Mild, Severe
  • to_whom: Individuals, Organizations, Communities, Society
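
A minimal sketch of reading a file in the Subtask 1C layout with Python's `csv` module. The sample rows are invented placeholders and the helper name `read_subtask_tsv` is an assumption; the real files come from the organizers.

```python
import csv
import io
from typing import Dict, List

# Made-up rows in the Subtask 1C column layout, for illustration only.
SAMPLE_TSV = (
    "id\ttext\thate_type\thate_severity\tto_whom\n"
    "1\tplaceholder text\tNone\tLittle to None\tSociety\n"
    "2\tanother placeholder\tAbusive\tMild\tIndividuals\n"
)


def read_subtask_tsv(handle) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_subtask_tsv(io.StringIO(SAMPLE_TSV))
print(rows[1]["hate_type"], rows[1]["to_whom"])
```

Whatever loader is used, it is worth checking that the literal label `None` survives as a string rather than being parsed as a missing value.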

Citation and Acknowledgments

This work represents a comprehensive exploration of Bengali hate speech detection for the BLP Workshop @ IJCNLP-AACL 2025 shared task, contributing to the advancement of multilingual NLP and social media content moderation.

Organizers

  • Md Arid Hasan, PhD Student, The University of Toronto
  • Firoj Alam, Senior Scientist, Qatar Computing Research Institute
  • Md Fahad Hossain, Lecturer, Daffodil International University
  • Usman Naseem, Assistant Professor, Macquarie University
  • Syed Ishtiaque Ahmed, Associate Professor, The University of Toronto

Note: This repository demonstrates state-of-the-art approaches for Bengali hate speech detection across multiple classification tasks, with particular emphasis on robust evaluation methodology and practical implementation strategies for the official shared task.
