This project demonstrates how to fine-tune a BERT-based model to classify text prompts as either benign or jailbreak. It uses the jackhhao/jailbreak-classification dataset and leverages Hugging Face's transformers and datasets libraries.
The goal is to develop a robust model that can differentiate between safe and potentially harmful (jailbreak) prompts, which is especially useful for content moderation and safety in LLM deployments.
- Name: `jackhhao/jailbreak-classification`
- Features:
  - `prompt`: the input text prompt.
  - `type`: label indicating whether the prompt is `benign` or `jailbreak`.
- Loaded the dataset using `load_dataset`.
- Filtered prompts by length (< 5000 characters).
- Tokenized prompts using the `bert-base-uncased` tokenizer.
- Encoded labels: `benign` → 0, `jailbreak` → 1.
- Removed unnecessary columns.
- Prepared PyTorch DataLoaders with dynamic padding.
- Base model: `bert-base-uncased`
- Task: sequence classification
- Fine-tuned using Hugging Face's training utilities (can be extended further).
Install the dependencies with `pip install datasets transformers torch`, then use Jupyter or any notebook environment to run `Jailbreak_classification.ipynb`.
Integrate Hugging Face's Trainer API or a custom training loop to fine-tune the model further on your machine.
- Accuracy: 98.4%
- F1 score: 98.54%
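For reference, metrics like these can be computed from held-out predictions. The sketch below uses scikit-learn, which is an assumption (the notebook may compute them differently); the label and prediction lists are hypothetical.

```python
# Sketch: compute accuracy and F1 from true labels vs. predictions.
# Labels use the same encoding as above: 0 = benign, 1 = jailbreak.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1]  # hypothetical ground-truth labels
y_pred = [0, 1, 1, 0, 0]  # hypothetical model predictions

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")  # 80.00%
print(f"F1 score: {f1_score(y_true, y_pred):.2%}")        # 80.00%
```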
├── Jailbreak_classification.ipynb   # Main notebook
└── README.md                        # Project documentation
Pathan Sharukh Khan
This project is open-source and available under the MIT License.