Skip to content

BarakMozesPro/secureflow-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SecureFlow Benchmark — LLM Safety Evaluation & Red-Teaming Framework

Python License Build

SecureFlow Benchmark is a modular LLM safety evaluation framework built for red-teaming language models. It uses plugin-based attack probes, prompt transformation buffs, and automated safety detectors to systematically find vulnerabilities in LLM deployments.

Architecture

graph LR
    A[Attack Probes] --> B[Prompt Buffs]
    B --> C[LLM Generator]
    C --> D[Response]
    D --> E[Safety Detectors]
    E --> F[Evaluator]
    F --> G[Safety Report]
Loading

Features

  • Plugin-based architecture with 4 extension points: probes, generators, detectors, and buffs
  • Attack probes for jailbreaking (DAN, TAP), encoding attacks, and prompt injection
  • LLM backends for OpenAI, HuggingFace, and REST APIs
  • Safety detectors for content analysis and package hallucination detection
  • Configurable evaluation harness with structured attempt data model
  • Report generation and analysis visualization

Quick Start

pip install secureflow-benchmark

Usage

import secureflow_benchmark
from secureflow_benchmark.probes import dan
from secureflow_benchmark.generators import openai as openai_gen
from secureflow_benchmark.evaluators import base as base_eval

# Set up a generator
generator = openai_gen.OpenAIGenerator(name="gpt-4o")

# Run DAN jailbreak probes
probe = dan.Dan_11_0()
attempts = probe.probe(generator)

# Evaluate results
evaluator = base_eval.Evaluator()
results = evaluator.evaluate(attempts)
print(f"Pass rate: {results.pass_rate:.2%}")

Project Structure

secureflow_benchmark/
├── __init__.py           # Package init and version
├── _config.py            # Configuration management
├── _plugins.py           # Plugin discovery system
├── configurable.py       # Base configurable class
├── attempt.py            # Attempt data model
├── cli.py                # Command-line interface
├── command.py            # Run orchestration commands
├── report.py             # Report generation
├── payloads.py           # Payload management
├── probes/               # Attack probes
│   ├── base.py           # Base probe class
│   ├── dan.py            # DAN jailbreak probes
│   ├── tap.py            # TAP jailbreak probes
│   ├── encoding.py       # Encoding attack probes
│   ├── latentinjection.py # Latent injection probes
│   ├── promptinject.py   # Prompt injection probes
│   ├── continuation.py   # Continuation probes
│   ├── grandma.py        # Grandma exploit probes
│   └── lmrc.py           # LMRC benchmark probes
├── generators/           # LLM backends
│   ├── base.py           # Base generator class
│   ├── openai.py         # OpenAI API backend
│   ├── huggingface.py    # HuggingFace backend
│   └── rest.py           # Generic REST API backend
├── detectors/            # Safety detectors
│   ├── base.py           # Base detector class
│   ├── unsafe_content.py # Unsafe content detection
│   ├── packagehallucination.py # Package hallucination
│   ├── mitigation.py     # Mitigation detection
│   └── always.py         # Always-pass/fail detectors
├── buffs/                # Prompt transformations
│   ├── base.py           # Base buff class
│   ├── encoding.py       # Encoding transformations
│   ├── lowercase.py      # Lowercase transformation
│   ├── paraphrase.py     # Paraphrase transformation
│   └── low_resource_languages.py # Language transforms
├── harnesses/            # Evaluation harnesses
├── evaluators/           # Result evaluators
└── analyze/              # Report analysis tools

What I Learned

Building this framework deepened my understanding of LLM red-teaming methodology — specifically the taxonomy of jailbreak attacks (DAN prompts, encoding-based bypasses, latent injection), how automated safety benchmarking pipelines work, and the design patterns needed to make such a system extensible across different model backends and attack strategies.

Credit

Built upon garak by NVIDIA (Apache 2.0 License). SecureFlow Benchmark is a focused subset and fork of garak, adapted for benchmarking and educational use.

License

Apache 2.0 — See LICENSE for details.

About

LLM safety evaluation framework with attack probes and safety detectors

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages