AdversariaLLM

A comprehensive toolkit for evaluating and comparing continuous and discrete adversarial attacks on LLMs. This repository provides a unified framework for running various attack methods, generating adversarial prompts, and evaluating model safety and robustness.

🔧 Installation

  1. Clone the repository:
git clone https://github.com/LLM-QC/AdversariaLLM
cd AdversariaLLM

This repository supports two setup paths:

Option A: Pixi (recommended)

Pixi installs the environment and the local adversariallm package (editable) from pyproject.toml.

pixi install --locked

Run commands either with pixi run ...:

pixi run python run_attacks.py --help
pixi run pytest -q tests/test_attacks/test_direct.py

or activate the environment first:

pixi shell
python run_attacks.py --help

Option B: Classic pip / virtualenv / conda workflow

Use this if you prefer a traditional Python environment.

  1. Install dependencies:
pip install -r requirements.txt
  2. Install the package in development mode:
pip install -e .

🚀 Quick Start

Repository Root Path (root_dir)

By default, root_dir is inferred from the working directory where you run the Hydra script. If needed, you can override it explicitly:

python run_attacks.py root_dir=/absolute/path/to/repo ...

If you prefer a fixed setup, you can also hard-code root_dir in conf/paths.yaml.
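A hard-coded override could look like the fragment below (illustrative only; check the actual keys in conf/paths.yaml before editing, and keep any other existing entries as-is):

```yaml
# conf/paths.yaml (illustrative)
root_dir: /absolute/path/to/repo
```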

Running Basic Attacks

To evaluate a model with a single attack method:

python run_attacks.py -m \
    model=microsoft/Phi-3-mini-4k-instruct \
    dataset=adv_behaviors \
    datasets.adv_behaviors.idx="range(0,300)" \
    attack=gcg \
    hydra.launcher.timeout_min=240

Running Multiple Attacks (Sweep)

To compare multiple attack methods:

python run_attacks.py -m \
    model=microsoft/Phi-3-mini-4k-instruct \
    dataset=adv_behaviors \
    datasets.adv_behaviors.idx="range(0,300)" \
    attack=gcg,pair,autodan \
    hydra.launcher.timeout_min=240

This will launch 900 jobs (3 attacks × 300 prompts) and run GCG, PAIR, and AutoDAN against Phi-3 on all 300 prompts.
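The job count of a multirun sweep is the Cartesian product of the swept values. A minimal sketch of the arithmetic (the dictionary keys here are illustrative, not real config names):

```python
from math import prod

# Hypothetical sweep axes: 3 attacks, 300 prompt indices.
sweep = {"attack": ["gcg", "pair", "autodan"], "prompt_idx": range(300)}

# One job per element of the Cartesian product of all axes.
num_jobs = prod(len(list(v)) for v in sweep.values())
print(num_jobs)  # 900
```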

🎯 Supported Attack Methods

The framework supports various adversarial attack algorithms:

  • GCG - Greedy Coordinate Gradient attack (with various objectives, including REINFORCE)
  • PAIR - Prompt Automatic Iterative Refinement
  • AutoDAN - Automatic prompt generation
  • PGD - Projected Gradient Descent (continuous in embedding and indicator-space, with & without discretization)
  • Random Search - Baseline random optimization
  • Human Jailbreaks - Curated human-written prompts
  • Direct - Direct prompt testing without optimization
  • BEAST - Gradient-free discrete optimization
  • Best-of-N - Jailbreaking with simple string perturbations
  • Inpainting - Diffusion-based inpainting attacks (Implemented as transfer attacks)

📊 Evaluation and Judging

For a complete list of supported judges, see: JudgeZoo

Default Judge

By default, all completions are evaluated using StrongREJECT. You can change this by modifying the classifiers attribute in your config:

classifiers: ["strong_reject", "harmbench", "custom_judge"]

Running Judges Separately

python run_judges.py \
    judge=strong_reject

will judge all result files with strong_reject that have not yet been judged.

🔧 Advanced Usage

Custom Attack Parameters

You can override specific attack parameters:

python run_attacks.py -m \
    attack=gcg \
    attacks.gcg.num_steps=500 \
    attacks.gcg.search_width=512

Distributional Evaluation

Distributional evaluation allows you to assess the behavior of attacks across multiple sampled responses rather than a single deterministic output. This is particularly useful for measuring the robustness of safety mechanisms and understanding the distribution of model behaviors under adversarial conditions. Inspired by arXiv:2410.03523 and arXiv:2507.04446.

Specify Generation Parameters

generation_config:
  temperature: 0.7
  top_p: 1.0
  top_k: 0
  max_new_tokens: 256
  num_return_sequences: 50

Example: Basic Distributional Evaluation

To evaluate a model with multiple sampled responses:

python run_attacks.py -m \
    model=microsoft/Phi-3-mini-4k-instruct \
    dataset=adv_behaviors \
    datasets.adv_behaviors.idx="range(0,50)" \
    attack=gcg \
    attacks.gcg.generation_config.temperature=0.7 \
    attacks.gcg.generation_config.num_return_sequences=50 \
    attacks.gcg.generation_config.max_new_tokens=256

This will generate 50 diverse responses per prompt at temperature 0.7, allowing you to compute metrics like:

  • Expected harmfulness: E[h(Y)]
  • Success rate across samples
  • Distribution of refusal vs. compliance behaviors
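These metrics reduce to simple statistics over per-sample judge scores. A minimal sketch (the scores below are hypothetical; in practice they would come from the configured judge, e.g. StrongREJECT, on each sampled completion):

```python
def expected_harmfulness(scores):
    """Monte Carlo estimate of E[h(Y)] over sampled responses."""
    return sum(scores) / len(scores)

def success_rate(scores, threshold=0.5):
    """Fraction of samples the judge scores above a harmfulness threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical judge scores for 10 sampled responses to one prompt:
scores = [0.9, 0.1, 0.8, 0.0, 0.7, 0.2, 0.95, 0.05, 0.6, 0.3]

print(expected_harmfulness(scores))  # ~0.46
print(success_rate(scores))          # 0.5
```

With num_return_sequences=50 per prompt, the same functions apply per prompt, and prompt-level scores can then be averaged across the dataset.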

Example: Comparing Baseline vs. Distributional Attacks

Compare deterministic baseline (temperature=0.0) with distributional sampling:

# Baseline: deterministic evaluation
python run_attacks.py -m \
    model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    dataset=adv_behaviors \
    attack=pair \
    attacks.pair.generation_config.temperature=0.0 \
    attacks.pair.generation_config.num_return_sequences=1

# Distributional: sample-based evaluation
python run_attacks.py -m \
    model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    dataset=adv_behaviors \
    attack=pair \
    attacks.pair.generation_config.temperature=0.7 \
    attacks.pair.generation_config.num_return_sequences=50

📈 Results and Analysis

Results are saved in the configured output directory with the following structure:

outputs/
├── YYYY-MM-DD/HH-MM-SS/{i}/run.json
...
└── YYYY-MM-DD/HH-MM-SS/{i}/run.json
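For custom analysis, the run.json files can be collected by globbing the outputs tree. A sketch under the layout above (the fields inside run.json are hypothetical and depend on your attack configuration):

```python
import json
import tempfile
from pathlib import Path

def collect_runs(root):
    """Load every run.json under root, keyed by its path relative to root."""
    return {
        p.relative_to(root).as_posix(): json.loads(p.read_text())
        for p in sorted(Path(root).rglob("run.json"))
    }

# Self-contained demo with a temporary outputs/ tree:
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "2025-01-01" / "12-00-00" / "0"
    run_dir.mkdir(parents=True)
    (run_dir / "run.json").write_text(json.dumps({"attack": "gcg"}))
    runs = collect_runs(tmp)
    print(list(runs))  # ['2025-01-01/12-00-00/0/run.json']
```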

Visualization & Evaluation (WIP)

Generate plots and analysis with visualize_results.ipynb in evaluations/.

Used in

[1] Beyer, Tim, et al. "Fast Proxies for LLM Robustness Evaluation." arXiv preprint arXiv:2502.10487 (2025).
[2] Xhonneux, Sophie, et al. "A generative approach to LLM harmfulness detection with special red flag tokens." arXiv preprint arXiv:2502.16366 (2025).
[3] Beyer, Tim, et al. "LLM-safety Evaluations Lack Robustness." arXiv preprint arXiv:2503.02574 (2025).
[4] Beyer, Tim, et al. "Sampling-aware adversarial attacks against large language models." arXiv preprint arXiv:2507.04446 (2025).
[5] LΓΌdke, David, et al. "Diffusion LLMs are Natural Adversaries for any LLM." arXiv preprint arXiv:2511.00203 (2025).

🤝 Contributing

Contributions welcome!

πŸ“ Project Structure

llm-quick-check/
├── src/
│   ├── attacks/           # Attack implementations
│   │   ├── gcg.py        # GCG attack
│   │   ├── pair.py       # PAIR attack
│   │   ├── autodan.py    # AutoDAN attack
│   │   └── ...
│   ├── dataset/          # Dataset handling (modular)
│   │   ├── prompt_dataset.py      # Base dataset class
│   │   ├── adv_behaviors.py       # AdvBench behaviors
│   │   ├── jbb_behaviors.py       # JailbreakBench
│   │   ├── strong_reject.py       # StrongREJECT
│   │   ├── or_bench.py            # ORBench
│   │   ├── refusal_direction.py   # RefusalDirection
│   │   ├── xs_test.py             # XSTest
│   │   ├── alpaca.py              # Alpaca
│   │   ├── mmlu.py                # MMLU
│   │   └── ...
│   ├── io_utils/         # I/O utilities
│   ├── lm_utils/         # Language model utilities
│   └── types.py          # Type definitions
├── conf/                 # Configuration files
│   ├── config.yaml       # Main config
│   ├── attacks/          # Attack-specific configs
│   ├── datasets/         # Dataset configs
│   └── models/           # Model configs
├── run_attacks.py        # Main attack runner
├── run_judges.py         # Judge evaluation
├── run_sampling.py       # Sampling utilities
└── requirements.txt      # Dependencies

πŸ™ Acknowledgments

Please be sure to cite the underlying work if you build on it.


Citation

If you use this repository in your work or find it useful, please consider citing:

@article{beyer2025adversariallm,
  title={AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research},
  author={Beyer, Tim and Dornbusch, Jonas and Steimle, Jakob and Ladenburger, Moritz and Schwinn, Leo and G{\"u}nnemann, Stephan},
  journal={arXiv preprint arXiv:2511.04316},
  year={2025}
}
