Skip to content

sharukh010/jailbreak-classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 

Repository files navigation

Jailbreak Prompt Classification

This project demonstrates how to fine-tune a BERT-based model to classify text prompts as either benign or jailbreak. It uses the jackhhao/jailbreak-classification dataset and leverages Hugging Face's transformers and datasets libraries.

πŸ” Project Overview

The goal is to develop a robust model that can differentiate between safe and potentially harmful (jailbreak) prompts, which is especially useful for content moderation and safety in LLM deployments.

πŸ“¦ Dataset

  • Name: jackhhao/jailbreak-classification
  • Features:
    • prompt: The input text prompt.
    • type: Label indicating whether the prompt is benign or jailbreak.

🧰 Tools & Libraries

πŸ§ͺ Preprocessing Steps

  1. Loaded the dataset using load_dataset.
  2. Filtered prompts based on length (< 5000 characters).
  3. Tokenized prompts using bert-base-uncased.
  4. Encoded labels: benign β†’ 0, jailbreak β†’ 1.
  5. Removed unnecessary columns.
  6. Prepared PyTorch DataLoaders with dynamic padding.

🧠 Model

  • Base model: bert-base-uncased
  • Task: Sequence Classification
  • Fine-tuned using Hugging Face's training utilities (can be extended further).

πŸš€ Running the Project

1. Install Dependencies

pip install datasets transformers torch

2. Run the Notebook

Use Jupyter or any notebook environment to run Jailbreak_classification.ipynb.

3. (Optional) Train the Model

Integrate Hugging Face's Trainer API or a custom training loop to fine-tune the model further on your machine.

πŸ“ˆ Results

  • Accuracy : 98.4%
  • f1 score : 98.54%

πŸ“‚ File Structure

β”œβ”€β”€ Jailbreak_classification.ipynb   # Main notebook
β”œβ”€β”€ README.md                        # Project documentation

✍️ Author

Pathan Sharukh Khan

πŸ“œ License

This project is open-source and available under the MIT License.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors