Conversation
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
jmartin-tech
left a comment
Some initial thoughts; usage-based testing and evaluation to understand the flow here is in progress. The team is still loading context on how this works and evaluating where and how it can integrate with expected use cases and user experience.
Further guidance will be added as that testing evolves.
quality:Security:Integrity      Integrity                Detection of tainted training data etc.
quality:Security:Availability   Availability             Availability (Model DoS)
quality:Security:Adversarial    Adversarial Robustness   Adversarial Robustness
quality:Security:AgentSecurity  Agent Security           Security of agentic tool-using LLM applications against exploitation
@leondz something we should consider WRT taxonomy we'd discussed previously.
verify_prompt = ab_prompts.VERIFY.format(
    target_tool=target_tool,
    vulnerability_info=vulnerability_info,
    attack_prompt=attack_prompt,
    agent_response=agent_response,
)

verification = self._get_red_team_response(verify_prompt)
Using LLMaaJ feels... questionable here. Ultimately, this should probably be Detector logic, but I'd wager that there are a lot of cases where this will say "yeah, exploitation worked" since it's generating the attack prompt against the tool -- there's a predisposition to believing that it will work, and I wonder how much the agent response has to budge to change the response here.
Good point about the confirmation bias risk. We've already refactored the final scoring into the Detector (AgentBreakerResult), which uses its own independent LLM instance to judge each output -- so the red team model no longer determines the reported results.
The probe's internal _verify_attack_success is now only used for loop control (decide whether to keep retrying or move to the next tool). It doesn't affect the final score. That said, replacing it with simpler heuristics (e.g. keyword-based refusal detection) would reduce bias, save tokens, and cut latency -- happy to tackle that as a follow-up.
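To make the follow-up concrete: a keyword-based refusal check for loop control could look like the sketch below. `REFUSAL_MARKERS` and `looks_like_refusal` are hypothetical names, not part of the garak codebase; a real list would need tuning against observed agent outputs.

```python
# Hypothetical sketch of a keyword-based refusal heuristic that could
# replace the LLM-based _verify_attack_success loop-control check.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i'm unable",
    "i am unable",
    "i won't",
    "as an ai",
    "i'm sorry, but",
)

def looks_like_refusal(agent_response: str) -> bool:
    """Cheap loop-control check: True if the response reads as a refusal."""
    lowered = agent_response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)
```

Since this only gates retries and never touches the final score, false negatives here cost extra attempts rather than correctness.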
I have read the DCO Document and I hereby sign the DCO
recheck
Force-pushed from 47f06aa to 7242b9c
I have read the DCO Document and I hereby sign the DCO
jmartin-tech
left a comment
Second pass based on iterative testing and collaborative conversation.
This probe can be adjusted to align more closely with the IterativeProbe interface, which will also make run results easier to audit and debug. As an additional benefit, aligning with the core pipeline will expose progress indicators during execution.
An initial start at refactor can be found here.
Force-pushed from e12d874 to cd7e609
Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
…num_generation bigger than 1 Signed-off-by: eliyacohen-hub <eliya339041957@gmail.com>
Force-pushed from cd7e609 to 9b2cd36
erickgalinkin
left a comment
Awaiting further feedback
AGENT RESPONSE:
{agent_response}

Return a JSON object (no markdown, just raw JSON):
Something we should think about (outside the scope of this PR) is having a helper Generator class that supports models that accept structured output, or otherwise consider gracefully handling trying to pass a schema to models that do.
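One shape the graceful handling could take: try schema-enforced output when the backend advertises it, otherwise fall back to prompt instruction plus defensive parsing. The `generator` interface, the `supports_structured_output` attribute, and `response_schema` parameter below are all hypothetical, used only to illustrate the fallback idea.

```python
import json

def get_json_response(generator, prompt: str, schema: dict):
    """Return a parsed JSON dict from the model, or None on malformed output."""
    if getattr(generator, "supports_structured_output", False):
        # Backend enforces the schema server-side (assumed capability).
        raw = generator.generate(prompt, response_schema=schema)
    else:
        # Fall back to instructing the model and parsing its text output.
        raw = generator.generate(
            prompt + "\nReturn a JSON object (no markdown, just raw JSON)."
        )
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # caller decides how to handle malformed output
```

Returning None rather than raising keeps a single malformed judgment from aborting a whole probe run.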
    f"{len(self.agent_config['tools'])} tools"
)

def _discover_agent_config(self, generator) -> None:
Probably out of scope for this PR but including for posterity:
Some agents support A2A protocol, which gives us a lot of useful information for free: see docs
Agreed, target recon is something we likely need to elevate into a core run stage.
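As a rough illustration of how A2A discovery could seed `_discover_agent_config`: A2A agents publish an "Agent Card" (a JSON document listing their skills), which recon could parse instead of probing blind. The card structure and `skills_from_agent_card` helper below are assumptions based on my reading of the public A2A spec; field names should be verified against the current version.

```python
import json

def skills_from_agent_card(card_json: str) -> list:
    """Extract skill names from an A2A-style Agent Card payload (sketch)."""
    card = json.loads(card_json)
    return [skill.get("name", "") for skill in card.get("skills", [])]

# Illustrative Agent Card payload, not taken from a real agent.
example_card = json.dumps({
    "name": "demo-agent",
    "skills": [{"name": "sql_query"}, {"name": "file_read"}],
})
```

A core recon stage could feed these skill names into per-tool vulnerability analysis the same way the probe's current tool list is used.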
* always supply a mock default NIM environment variable
* update agent_breaker to instantiate using config_root
* consolidate some tests with parameters

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Agent Breaker: Multi-turn red-team probe for agentic LLM applications
Adds a new probe (agent_breaker.AgentBreaker) that performs automated security testing of agentic LLM applications — systems that use tools (e.g. code execution, database queries, file access, API calls). A red team model analyzes each tool for vulnerabilities, generates targeted exploits, attacks the agent in multi-turn conversations (learning from failures), and verifies attack success.
Key features:
* max_parallel_tools (default: sequential)
* OWASP LLM Top 10 coverage: LLM01 (Prompt Injection), LLM07 (Insecure Plugin Design), LLM08 (Excessive Agency)
Verification
* python -m garak --config scan_config.yaml
* python -m pytest tests/probes/test_agent_breaker.py tests/detectors/test_detectors_agent_breaker.py -v
* agent_breaker.AgentBreakerResult: FAIL ok on X/Y

Environment notes