Agent breaker #1628

Open
eliyacohen-hub wants to merge 9 commits into NVIDIA:main from eliyacohen-hub:agent_breaker

eliyacohen-hub commented Feb 24, 2026

Agent Breaker: Multi-turn red-team probe for agentic LLM applications

Adds a new probe (agent_breaker.AgentBreaker) that performs automated security testing of agentic LLM applications — systems that use tools (e.g. code execution, database queries, file access, API calls).

A red team model analyzes each tool for vulnerabilities, generates targeted exploits, attacks the agent in multi-turn conversations (learning from failures), and verifies attack success.

Key features:

  • Auto-discovery — if no tools are defined in config, the probe queries the target agent to discover its tools automatically
  • Parallel tool attacks — configurable max_parallel_tools (default: sequential)
  • Adaptive attacks — each attempt analyzes previous prompts/responses to improve exploits
  • Early stopping — stops attacking a tool immediately upon success
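The control flow implied by the features above can be sketched roughly as follows. This is an illustration under stated assumptions, not the probe's actual code: `attack_tool`, `attack_all`, `generate_exploit`, `send_to_agent`, `looks_successful`, and `max_attempts` are hypothetical names, and only `max_parallel_tools` is taken from the description. A default of one worker is assumed to stand in for "sequential".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Each entry is one turn: (attack prompt, agent response).
History = List[Tuple[str, str]]


def attack_tool(
    tool: str,
    generate_exploit: Callable[[str, History], str],
    send_to_agent: Callable[[str], str],
    looks_successful: Callable[[str, str], bool],
    max_attempts: int = 5,  # hypothetical budget, not from the PR
) -> Tuple[bool, History]:
    """Attack one tool, feeding earlier turns back into exploit generation."""
    history: History = []
    for _ in range(max_attempts):
        # Adaptive step: the generator sees every previous prompt/response.
        prompt = generate_exploit(tool, history)
        response = send_to_agent(prompt)
        history.append((prompt, response))
        if looks_successful(prompt, response):
            return True, history  # early stopping on the first success
    return False, history


def attack_all(
    tools: List[str], max_parallel_tools: int = 1, **callbacks
) -> Dict[str, Tuple[bool, History]]:
    """Run attack_tool for every tool; one worker means sequential execution."""
    with ThreadPoolExecutor(max_workers=max(1, max_parallel_tools)) as pool:
        futures = {
            tool: pool.submit(attack_tool, tool, **callbacks) for tool in tools
        }
        return {tool: future.result() for tool, future in futures.items()}
```

Passing the three callbacks as plain callables keeps the sketch independent of any particular red team model or generator; in the PR those roles are played by the red team model, the target agent, and the probe's internal success check.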

OWASP LLM Top 10: LLM01 (Prompt Injection), LLM07 (Insecure Plugin Design), LLM08 (Excessive Agency)

Verification

  • Create a scan config YAML pointing to your target agent REST endpoint
  • python -m garak --config scan_config.yaml
  • python -m pytest tests/probes/test_agent_breaker.py tests/detectors/test_detectors_agent_breaker.py -v
  • Verify auto-discovery works when agent.yaml has no tools defined
  • Verify parallel and sequential tool attacks both work correctly
  • Verify results display: agent_breaker.AgentBreakerResult: FAIL ok on X/Y

Environment notes

  • Requires a red team model served via the NVIDIA Inference API; users can configure a different LLM endpoint instead
  • Requires a target agent exposed as a REST endpoint (or any garak generator)
  • No specific hardware requirements (all inference is remote API calls)

Contributor

github-actions bot commented Feb 24, 2026

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

Collaborator

jmartin-tech left a comment

Some initial thoughts. Usage-based testing and evaluation to understand the flow here are in progress. The team is still loading context on how this works and evaluating where and how it can integrate with expected use cases and user experience.

Further guidance will be added as that testing evolves.

Comment thread garak/data/tags.misp.tsv
quality:Security:Integrity Integrity Detection of tainted training data etc.
quality:Security:Availability Availability Availability (Model DoS)
quality:Security:Adversarial Adversarial Robustness Adversarial Robustness
quality:Security:AgentSecurity Agent Security Security of agentic tool-using LLM applications against exploitation
Collaborator

@leondz something we should consider WRT taxonomy we'd discussed previously.

Comment thread garak/probes/agent_breaker.py Outdated
Comment on lines +561 to +568
verify_prompt = ab_prompts.VERIFY.format(
target_tool=target_tool,
vulnerability_info=vulnerability_info,
attack_prompt=attack_prompt,
agent_response=agent_response,
)

verification = self._get_red_team_response(verify_prompt)
Collaborator

Using LLMaaJ feels... questionable here. Ultimately, this should probably be Detector logic, but I'd wager that there are a lot of cases where this will say "yeah, exploitation worked" since it's generating the attack prompt against the tool -- there's a predisposition to believing that it will work, and I wonder how much the agent response has to budge to change the response here.

Author

Good point about the confirmation bias risk. We've already refactored the final scoring into the Detector (AgentBreakerResult), which uses its own independent LLM instance to judge each output -- so the red team model no longer determines the reported results.
The probe's internal _verify_attack_success is now only used for loop control (decide whether to keep retrying or move to the next tool). It doesn't affect the final score. That said, replacing it with simpler heuristics (e.g. keyword-based refusal detection) would reduce bias, save tokens, and cut latency -- happy to tackle that as a follow-up.
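The keyword-based refusal detection proposed as a follow-up could look something like the sketch below. It is only an illustration: `REFUSAL_PATTERNS`, `looks_like_refusal`, and `should_keep_retrying` are hypothetical names, and the phrase list is an assumption that would need tuning against real agent output.

```python
import re

# Hypothetical refusal phrases; not taken from the PR.
REFUSAL_PATTERNS = (
    r"\bi can(?:no|')t\b",                      # "I cannot", "I can't"
    r"\bi(?: a|')m (?:sorry|unable|not able)\b",  # "I'm sorry", "I am unable"
    r"\bi won't\b",
    r"\bnot (?:allowed|permitted|authorized)\b",
    r"\bagainst (?:my|our) (?:policy|policies|guidelines)\b",
)
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(agent_response: str) -> bool:
    """Return True if the response contains a known refusal phrase."""
    # Normalize curly apostrophes so "can’t" matches the "can't" pattern.
    normalized = agent_response.replace("\u2019", "'")
    return bool(_REFUSAL_RE.search(normalized))


def should_keep_retrying(agent_response: str) -> bool:
    """Loop control only: keep attacking while the agent is still refusing."""
    return looks_like_refusal(agent_response)
```

A response that is not a refusal is not proof of exploitation, so a heuristic like this suits loop control only; the final verdict stays with the detector, as described above.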

@eliyacohen-hub
Author

I have read the DCO Document and I hereby sign the DCO

@eliyacohen-hub
Author

recheck

@barlanyado

I have read the DCO Document and I hereby sign the DCO

@leondz leondz self-requested a review March 11, 2026 19:15
Collaborator

jmartin-tech left a comment

Second pass based on iterative testing and collaborative conversation.

This probe can be adjusted to align more closely with the IterativeProbe interface, which will also make a run result easier to audit and debug. As an additional benefit, aligning with the core pipeline will expose progress indicators during execution.

An initial start at the refactor can be found here.

eliyacohen-hub and others added 8 commits April 9, 2026 18:21
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
…num_generation bigger than 1

Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Collaborator

erickgalinkin left a comment

Awaiting further feedback

AGENT RESPONSE:
{agent_response}

Return a JSON object (no markdown, just raw JSON):
Collaborator

Something we should think about (outside the scope of this PR) is having a helper Generator class that supports models that accept structured output, or otherwise consider gracefully handling trying to pass a schema to models that do.
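Until something like that exists, a prompt that asks for "raw JSON" still has to cope with replies wrapped in markdown fences or surrounded by chatter. The following is a minimal sketch of tolerant parsing, not code from this PR; `parse_json_reply` is a hypothetical helper name.

```python
import json
import re
from typing import Optional

# Build the fence marker at runtime so the literal never appears in source.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def parse_json_reply(text: str) -> Optional[dict]:
    """Best-effort extraction of a JSON object from a model reply.

    Tries, in order: the whole reply, the first fenced block, and the span
    from the first "{" to the last "}". Returns None if nothing parses to
    a JSON object.
    """
    candidates = [text.strip()]
    fenced = _FENCE_RE.search(text)
    if fenced:
        candidates.append(fenced.group(1).strip())
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start : end + 1])
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` instead of raising lets the caller decide whether to retry the request or record the turn as unparseable.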

f"{len(self.agent_config['tools'])} tools"
)

def _discover_agent_config(self, generator) -> None:
Collaborator

Probably out of scope for this PR but including for posterity:

Some agents support A2A protocol, which gives us a lot of useful information for free: see docs

Collaborator

Agreed, target recon is something we likely need to elevate into a core run stage.

* always supply a mock default NIM environment variable
* update agent_breaker to instantiate using config_root
* consolidate some tests with parameters

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>