AdversariaLLM

A comprehensive toolkit for evaluating and comparing continuous and discrete adversarial attacks on LLMs. This repository provides a unified framework for running various attack methods, generating adversarial prompts, and evaluating model safety and robustness.

🔧 Installation

  1. Clone the repository:
git clone https://github.com/LLM-QC/AdversariaLLM
cd AdversariaLLM

This repository supports two setup paths:

Option A: Pixi (recommended)

Pixi installs the environment and the local adversariallm package (editable) from pyproject.toml.

pixi install --locked

Run commands either with pixi run ...:

pixi run python run_attacks.py --help
pixi run pytest -q tests/test_attacks/test_direct.py

or activate the environment first:

pixi shell
python run_attacks.py --help

Option B: Classic pip / virtualenv / conda workflow

Use this if you prefer a traditional Python environment.

  1. Install dependencies:
pip install -r requirements.txt
  2. Install the package in development mode:
pip install -e .

🚀 Quick Start

Repository Root Path (root_dir)

By default, root_dir is inferred from the working directory where you run the Hydra script. If needed, you can override it explicitly:

python run_attacks.py root_dir=/absolute/path/to/repo ...

If you prefer a fixed setup, you can also hard-code root_dir in conf/paths.yaml.
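A hard-coded override could look like the fragment below (illustrative only; check the actual keys in conf/paths.yaml before editing, and keep any other existing entries as-is):

```yaml
# conf/paths.yaml (illustrative)
root_dir: /absolute/path/to/repo
```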

Running Basic Attacks

To evaluate a model with a single attack method:

python run_attacks.py -m \
    model=microsoft/Phi-3-mini-4k-instruct \
    dataset=adv_behaviors \
    datasets.adv_behaviors.idx="range(0,300)" \
    attack=gcg \
    hydra.launcher.timeout_min=240

Running Multiple Attacks (Sweep)

To compare multiple attack methods:

python run_attacks.py -m \
    model=microsoft/Phi-3-mini-4k-instruct \
    dataset=adv_behaviors \
    datasets.adv_behaviors.idx="range(0,300)" \
    attack=gcg,pair,autodan \
    hydra.launcher.timeout_min=240

This will launch 900 jobs (3 attacks × 300 prompts) and run GCG, PAIR, and AutoDAN against Phi-3 on all 300 prompts.
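The job count of a multirun sweep is the Cartesian product of the swept values. A minimal sketch of the arithmetic (the dictionary keys here are illustrative, not real config names):

```python
from math import prod

# Hypothetical sweep axes: 3 attacks, 300 prompt indices.
sweep = {"attack": ["gcg", "pair", "autodan"], "prompt_idx": range(300)}

# One job per element of the Cartesian product of all axes.
num_jobs = prod(len(list(v)) for v in sweep.values())
print(num_jobs)  # 900
```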

🎯 Supported Attack Methods

The framework supports various adversarial attack algorithms:

  • GCG - Greedy Coordinate Gradient attack (with various objectives, including REINFORCE)
  • PAIR - Prompt Automatic Iterative Refinement
  • AutoDAN - Automatic prompt generation
  • PGD - Projected Gradient Descent (continuous in embedding and indicator-space, with & without discretization)
  • Random Search - Baseline random optimization
  • Human Jailbreaks - Curated human-written prompts
  • Direct - Direct prompt testing without optimization
  • BEAST - Gradient-free discrete optimization
  • Best-of-N - Jailbreaking with simple string perturbations
  • Inpainting - Diffusion-based inpainting attacks (Implemented as transfer attacks)

📊 Evaluation and Judging

For a complete list of supported judges, see: JudgeZoo

Default Judge

By default, all completions are evaluated using StrongREJECT. You can change this by modifying the classifiers attribute in your config:

classifiers: ["strong_reject", "harmbench", "custom_judge"]

Running Judges Separately

python run_judges.py \
    judge=strong_reject

will judge all result files with strong_reject that have not yet been judged.

🔧 Advanced Usage

Custom Attack Parameters

You can override specific attack parameters:

python run_attacks.py -m \
    attack=gcg \
    attacks.gcg.num_steps=500 \
    attacks.gcg.search_width=512

Distributional Evaluation

Distributional evaluation allows you to assess the behavior of attacks across multiple sampled responses rather than a single deterministic output. This is particularly useful for measuring the robustness of safety mechanisms and understanding the distribution of model behaviors under adversarial conditions. Inspired by arXiv:2410.03523 and arXiv:2507.04446.

Specify Generation Parameters

generation_config:
  temperature: 0.7
  top_p: 1.0
  top_k: 0
  max_new_tokens: 256
  num_return_sequences: 50

Example: Basic Distributional Evaluation

To evaluate a model with multiple sampled responses:

python run_attacks.py -m \
    model=microsoft/Phi-3-mini-4k-instruct \
    dataset=adv_behaviors \
    datasets.adv_behaviors.idx="range(0,50)" \
    attack=gcg \
    attacks.gcg.generation_config.temperature=0.7 \
    attacks.gcg.generation_config.num_return_sequences=50 \
    attacks.gcg.generation_config.max_new_tokens=256

This will generate 50 diverse responses per prompt at temperature 0.7, allowing you to compute metrics like:

  • Expected harmfulness: E[h(Y)]
  • Success rate across samples
  • Distribution of refusal vs. compliance behaviors
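These metrics reduce to simple statistics over per-sample judge scores. A minimal sketch (the scores below are hypothetical; in practice they would come from the configured judge, e.g. StrongREJECT, on each sampled completion):

```python
def expected_harmfulness(scores):
    """Monte Carlo estimate of E[h(Y)] over sampled responses."""
    return sum(scores) / len(scores)

def success_rate(scores, threshold=0.5):
    """Fraction of samples the judge scores above a harmfulness threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical judge scores for 10 sampled responses to one prompt:
scores = [0.9, 0.1, 0.8, 0.0, 0.7, 0.2, 0.95, 0.05, 0.6, 0.3]

print(expected_harmfulness(scores))  # ~0.46
print(success_rate(scores))          # 0.5
```

With num_return_sequences=50 per prompt, the same functions apply per prompt, and prompt-level scores can then be averaged across the dataset.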

Example: Comparing Baseline vs. Distributional Attacks

Compare deterministic baseline (temperature=0.0) with distributional sampling:

# Baseline: deterministic evaluation
python run_attacks.py -m \
    model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    dataset=adv_behaviors \
    attack=pair \
    attacks.pair.generation_config.temperature=0.0 \
    attacks.pair.generation_config.num_return_sequences=1

# Distributional: sample-based evaluation
python run_attacks.py -m \
    model=meta-llama/Meta-Llama-3.1-8B-Instruct \
    dataset=adv_behaviors \
    attack=pair \
    attacks.pair.generation_config.temperature=0.7 \
    attacks.pair.generation_config.num_return_sequences=50

📈 Results and Analysis

Results are saved in the configured output directory with the following structure:

outputs/
├── YYYY-MM-DD/HH-MM-SS/{i}/run.json
...
└── YYYY-MM-DD/HH-MM-SS/{i}/run.json
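For custom analysis, the run.json files can be collected by globbing the outputs tree. A sketch under the layout above (the fields inside run.json are hypothetical and depend on your attack configuration):

```python
import json
import tempfile
from pathlib import Path

def collect_runs(root):
    """Load every run.json under root, keyed by its path relative to root."""
    return {
        p.relative_to(root).as_posix(): json.loads(p.read_text())
        for p in sorted(Path(root).rglob("run.json"))
    }

# Self-contained demo with a temporary outputs/ tree:
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "2025-01-01" / "12-00-00" / "0"
    run_dir.mkdir(parents=True)
    (run_dir / "run.json").write_text(json.dumps({"attack": "gcg"}))
    runs = collect_runs(tmp)
    print(list(runs))  # ['2025-01-01/12-00-00/0/run.json']
```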

Visualization & Evaluation (WIP)

Generate plots and analysis with visualize_results.ipynb in evaluations/.

Used in

[1] Beyer, Tim, et al. "Fast Proxies for LLM Robustness Evaluation." arXiv preprint arXiv:2502.10487 (2025).
[2] Xhonneux, Sophie, et al. "A generative approach to LLM harmfulness detection with special red flag tokens." arXiv preprint arXiv:2502.16366 (2025).
[3] Beyer, Tim, et al. "LLM-safety Evaluations Lack Robustness." arXiv preprint arXiv:2503.02574 (2025).
[4] Beyer, Tim, et al. "Sampling-aware adversarial attacks against large language models." arXiv preprint arXiv:2507.04446 (2025).
[5] LΓΌdke, David, et al. "Diffusion LLMs are Natural Adversaries for any LLM." arXiv preprint arXiv:2511.00203 (2025).

🤝 Contributing

Contributions welcome!

πŸ“ Project Structure

llm-quick-check/
├── src/
│   ├── attacks/           # Attack implementations
│   │   ├── gcg.py        # GCG attack
│   │   ├── pair.py       # PAIR attack
│   │   ├── autodan.py    # AutoDAN attack
│   │   └── ...
│   ├── dataset/          # Dataset handling (modular)
│   │   ├── prompt_dataset.py      # Base dataset class
│   │   ├── adv_behaviors.py       # AdvBench behaviors
│   │   ├── jbb_behaviors.py       # JailbreakBench
│   │   ├── strong_reject.py       # StrongREJECT
│   │   ├── or_bench.py            # ORBench
│   │   ├── refusal_direction.py   # RefusalDirection
│   │   ├── xs_test.py             # XSTest
│   │   ├── alpaca.py              # Alpaca
│   │   ├── mmlu.py                # MMLU
│   │   └── ...
│   ├── io_utils/         # I/O utilities
│   ├── lm_utils/         # Language model utilities
│   └── types.py          # Type definitions
├── conf/                 # Configuration files
│   ├── config.yaml       # Main config
│   ├── attacks/          # Attack-specific configs
│   ├── datasets/         # Dataset configs
│   └── models/           # Model configs
├── run_attacks.py        # Main attack runner
├── run_judges.py         # Judge evaluation
├── run_sampling.py       # Sampling utilities
└── requirements.txt      # Dependencies

πŸ™ Acknowledgments

Please be sure to cite the underlying work if you build on it.


Citation

If you use this repository in your work or find it useful, please consider citing:

@article{beyer2025adversariallm,
  title={AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research},
  author={Beyer, Tim and Dornbusch, Jonas and Steimle, Jakob and Ladenburger, Moritz and Schwinn, Leo and G{\"u}nnemann, Stephan},
  journal={arXiv preprint arXiv:2511.04316},
  year={2025}
}
