Agent breaker by eliyacohen-hub · Pull Request #1628 · NVIDIA/garak

eliyacohen-hub · 2026-02-24T12:09:47Z

Agent Breaker: Multi-turn red-team probe for agentic LLM applications

Adds a new probe (agent_breaker.AgentBreaker) that performs automated security testing of agentic LLM applications — systems that use tools (e.g. code execution, database queries, file access, API calls).

A red team model analyzes each tool for vulnerabilities, generates targeted exploits, attacks the agent in multi-turn conversations (learning from failures), and verifies attack success.

Key features:

Auto-discovery — if no tools are defined in config, the probe queries the target agent to discover its tools automatically
Parallel tool attacks — configurable max_parallel_tools (default: sequential)
Adaptive attacks — each attempt analyzes previous prompts/responses to improve exploits
Early stopping — stops attacking a tool immediately upon success
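
For readers who want the control flow behind the last three features, here is a minimal sketch. It is not the probe's actual code: `attack_tool`, `attack_all`, and the `craft`, `send`, and `succeeded` callables are hypothetical names, and only `max_parallel_tools` comes from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # (attack prompt, agent response) pairs


def attack_tool(
    tool: str,
    craft: Callable[[str, History], str],
    send: Callable[[str], str],
    succeeded: Callable[[str, str], bool],
    max_attempts: int = 5,  # hypothetical budget, not a documented default
) -> bool:
    """Attack one tool, adapting each prompt to the failures so far."""
    history: History = []
    for _ in range(max_attempts):
        prompt = craft(tool, history)    # adaptive: sees earlier attempts
        response = send(prompt)
        if succeeded(prompt, response):  # early stopping on first success
            return True
        history.append((prompt, response))
    return False


def attack_all(
    tools: List[str],
    craft: Callable[[str, History], str],
    send: Callable[[str], str],
    succeeded: Callable[[str, str], bool],
    max_parallel_tools: int = 1,
) -> Dict[str, bool]:
    """Attack every tool; max_parallel_tools=1 is the sequential case."""
    with ThreadPoolExecutor(max_workers=max_parallel_tools) as pool:
        results = pool.map(lambda t: attack_tool(t, craft, send, succeeded), tools)
        return dict(zip(tools, results))
```

Each tool keeps its own history, so parallel workers do not share state; the only shared resources are the target and red team endpoints behind `send` and `craft`.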

OWASP LLM Top 10: LLM01 (Prompt Injection), LLM07 (Insecure Plugin Design), LLM08 (Excessive Agency)

Verification

Create a scan config YAML pointing to your target agent REST endpoint
python -m garak --config scan_config.yaml
python -m pytest tests/probes/test_agent_breaker.py tests/detectors/test_detectors_agent_breaker.py -v
Verify auto-discovery works when agent.yaml has no tools defined
Verify parallel and sequential tool attacks both work correctly
Verify results display: agent_breaker.AgentBreakerResult: FAIL ok on X/Y

Environment notes

Requires a red team model via the NVIDIA Inference API; the user can switch it to another LLM endpoint
Requires a target agent exposed as a REST endpoint (or any garak generator)
No specific hardware requirements (all inference is remote API calls)

github-actions · 2026-02-24T12:10:00Z

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

jmartin-tech

Some initial thoughts. Usage-based testing and evaluation to understand the flow here is in progress. The team is still loading context on how this works and evaluating where and how it can integrate with expected use cases and user experience.

Further guidance will be added as that testing evolves.

erickgalinkin · 2026-03-05T14:29:22Z

 quality:Security:Integrity	Integrity	Detection of tainted training data etc.
 quality:Security:Availability	Availability	Availability (Model DoS)
 quality:Security:Adversarial	Adversarial Robustness	Adversarial Robustness
+quality:Security:AgentSecurity	Agent Security	Security of agentic tool-using LLM applications against exploitation


@leondz something we should consider WRT taxonomy we'd discussed previously.

erickgalinkin · 2026-03-05T15:53:49Z

+        verify_prompt = ab_prompts.VERIFY.format(
+            target_tool=target_tool,
+            vulnerability_info=vulnerability_info,
+            attack_prompt=attack_prompt,
+            agent_response=agent_response,
+        )
+
+        verification = self._get_red_team_response(verify_prompt)


Using LLMaaJ feels... questionable here. Ultimately, this should probably be Detector logic, but I'd wager that there are a lot of cases where this will say "yeah, exploitation worked" since it's generating the attack prompt against the tool -- there's a predisposition to believing that it will work, and I wonder how much the agent response has to budge to change the response here.

Good point about the confirmation bias risk. We've already refactored the final scoring into the Detector (AgentBreakerResult), which uses its own independent LLM instance to judge each output -- so the red team model no longer determines the reported results.
The probe's internal _verify_attack_success is now only used for loop control (decide whether to keep retrying or move to the next tool). It doesn't affect the final score. That said, replacing it with simpler heuristics (e.g. keyword-based refusal detection) would reduce bias, save tokens, and cut latency -- happy to tackle that as a follow-up.
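
A minimal sketch of the keyword-based refusal heuristic mentioned as a follow-up. It is an illustration only: the phrase list and the `looks_like_refusal` name are assumptions, not code from this PR, and a refusal match only tells the loop to keep retrying. It does not decide the reported score.

```python
import re

# Hypothetical phrase list; it would need tuning against real agent output.
# Only straight apostrophes are handled here.
REFUSAL_PATTERNS = [
    r"\bI can(?:no|')t\b",               # "I cannot", "I can't"
    r"\bI(?: a|')m (?:sorry|unable)\b",  # "I am sorry", "I'm unable"
    r"\bnot (?:able|allowed|permitted) to\b",
    r"\bI won't\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def looks_like_refusal(agent_response: str) -> bool:
    """Return True if the response contains a common refusal phrase."""
    return bool(_REFUSAL_RE.search(agent_response))
```

A non-refusal is not proof of exploitation, so this can only replace the loop-control check, with the Detector remaining the source of truth.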

eliyacohen-hub · 2026-03-09T20:19:08Z

I have read the DCO Document and I hereby sign the DCO

eliyacohen-hub · 2026-03-09T20:20:27Z

recheck

barlanyado · 2026-03-11T18:59:06Z

I have read the DCO Document and I hereby sign the DCO

jmartin-tech

Second pass based on iterative testing and collaborative conversation.

This probe can be adjusted to align more closely with the IterativeProbe interface, which will also make a run result easier to audit and debug. As an additional benefit, aligning with the core pipeline will expose progress indicators during execution.

An initial start at the refactor can be found here.

erickgalinkin

Awaiting further feedback

erickgalinkin · 2026-04-14T16:15:23Z

+  AGENT RESPONSE:
+  {agent_response}
+
+  Return a JSON object (no markdown, just raw JSON):


Something we should think about (outside the scope of this PR) is having a helper Generator class that supports models that accept structured output, or otherwise consider gracefully handling trying to pass a schema to models that do.
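
Until such a helper exists, one way to handle models that ignore the "no markdown, just raw JSON" instruction is a tolerant parser. The sketch below is an assumption, not part of this PR: `parse_json_reply` is a hypothetical name, and the `success` key in the usage note is only an example field.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # markdown code-fence marker, built here to keep this block readable
_FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$", re.IGNORECASE)


def parse_json_reply(text: str) -> Optional[dict]:
    """Extract a JSON object from a model reply, or return None."""
    # Strip a leading/trailing markdown fence if the model added one.
    cleaned = _FENCE_RE.sub("", text.strip())
    try:
        obj = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span inside surrounding prose.
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            obj = json.loads(cleaned[start : end + 1])
        except json.JSONDecodeError:
            return None
    return obj if isinstance(obj, dict) else None
```

A caller would treat `None` as "no verdict" rather than as a failed or successful attack, for example `verdict = parse_json_reply(reply) or {"success": False}` if a conservative default is wanted.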

erickgalinkin · 2026-04-14T16:17:49Z

+            f"{len(self.agent_config['tools'])} tools"
+        )
+
+    def _discover_agent_config(self, generator) -> None:


Probably out of scope for this PR but including for posterity:

Some agents support A2A protocol, which gives us a lot of useful information for free: see docs

Agreed, target recon is something we likely need to elevate into a core run stage.
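
As a rough illustration of what A2A recon could provide: an A2A agent card lists the agent's skills, which could seed the tool list without querying the agent in conversation. The sketch below assumes a card that has already been fetched and parsed, and that it carries a `skills` array with `id`, `name`, and `description` fields. The `tools_from_agent_card` helper and the per-tool dict shape are hypothetical; only the `tools` key appears in this PR's diff.

```python
from typing import Any, Dict, List


def tools_from_agent_card(card: Dict[str, Any]) -> Dict[str, List[Dict[str, str]]]:
    """Map an A2A agent card's skills to a {'tools': [...]} config."""
    tools = []
    for skill in card.get("skills", []):
        name = skill.get("name") or skill.get("id")
        if not name:
            continue  # skip malformed entries rather than guess a name
        tools.append({"name": name, "description": skill.get("description", "")})
    return {"tools": tools}


# Made-up card for illustration; not taken from a real agent.
sample_card = {
    "name": "Example agent",
    "skills": [
        {"id": "sql_query", "name": "SQL query", "description": "Runs read-only SQL."},
        {"id": "file_read", "description": "Reads files from the workspace."},
    ],
}
agent_config = tools_from_agent_card(sample_card)
```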

jmartin-tech reviewed Mar 2, 2026

erickgalinkin requested changes Mar 5, 2026

eliyacohen-hub requested review from erickgalinkin and jmartin-tech March 9, 2026 20:21

eliyacohen-hub force-pushed the agent_breaker branch from 47f06aa to 7242b9c Compare March 9, 2026 21:15

github-actions bot added a commit that referenced this pull request Mar 9, 2026

@eliyacohen-hub has signed the CLA in #1628

f1627c4

leondz self-requested a review March 11, 2026 19:15

jmartin-tech requested changes Mar 11, 2026

github-actions bot added a commit that referenced this pull request Mar 18, 2026

@eliyacohen-hub has signed the CLA in #1628

2bb0373

jmartin-tech mentioned this pull request Mar 23, 2026

MCP Server Security Scanning: OWASP MCP Top 10 coverage #1639

Open

eliyacohen-hub force-pushed the agent_breaker branch from e12d874 to cd7e609 Compare April 9, 2026 15:19

eliyacohen-hub and others added 8 commits April 9, 2026 18:21

feat: add agent_breaker probe for agentic LLM security testing

2bd2f73

Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

exploit system prompt improvement

7dd86bd

Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

use gpt-oss as the attacker and detector models

1edb94e

Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

lint agent_breaker

57bda0b

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

refactor to follow IterativeProbe patterns

0b839b6

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com> Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

no hardcoded message with the class name

6b19f70

Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

fix attacker flow to be iterative per tool

99d0746

Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

fix comments: same detector for the probe and the detector + support …

9b2cd36

…num_generation bigger than 1 Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>

eliyacohen-hub force-pushed the agent_breaker branch from cd7e609 to 9b2cd36 Compare April 9, 2026 15:22

erickgalinkin reviewed Apr 14, 2026

refactor tests to support nim detectors loaded by probes

c5f42b1

* always supply a mock default NIM environment variable
* update agent_breaker to instantiate using config_root
* consolidate some tests with parameters

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
