This project demonstrates how to fine-tune a BERT-based model to classify text prompts as either benign or jailbreak. It uses the jackhhao/jailbreak-classification dataset and leverages Hugging Face's transformers and datasets libraries.
The goal is to develop a robust model that can differentiate between safe and potentially harmful (jailbreak) prompts, which is especially useful for content moderation and safety in LLM deployments.
- Name: `jackhhao/jailbreak-classification`
- Features:
  - `prompt`: the input text prompt.
  - `type`: label indicating whether the prompt is `benign` or `jailbreak`.
- Loaded the dataset using `load_dataset`.
- Filtered prompts by length (< 5000 characters).
- Tokenized prompts using the `bert-base-uncased` tokenizer.
- Encoded labels: `benign` → 0, `jailbreak` → 1.
- Removed unnecessary columns.
- Prepared PyTorch DataLoaders with dynamic padding.
- Base model: `bert-base-uncased`
- Task: sequence classification
- Fine-tuned using Hugging Face's training utilities (can be extended further).
Install the dependencies with `pip install datasets transformers torch`, then use Jupyter or any notebook environment to run `Jailbreak_classification.ipynb`.
Integrate Hugging Face's Trainer API or a custom training loop to fine-tune the model further on your machine.
- Accuracy: 98.4%
- F1 score: 98.54%
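For reference, metrics like these can be computed from held-out predictions. The sketch below uses scikit-learn, which is an assumption (the notebook may compute them differently); the label and prediction lists are hypothetical.

```python
# Sketch: compute accuracy and F1 from true labels vs. predictions.
# Labels use the same encoding as above: 0 = benign, 1 = jailbreak.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [0, 1, 1, 0, 0]  # hypothetical model predictions

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")  # 80.00%
print(f"F1 score: {f1_score(y_true, y_pred):.2%}")        # 80.00%
```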
├── Jailbreak_classification.ipynb   # Main notebook
└── README.md                        # Project documentation
Pathan Sharukh Khan
This project is open-source and available under the MIT License.