Skip to content

Proposal: Agentic AI security rules (beyond MCP) #28

@nik-kale

Description

@nik-kale

Description

This is a proposal for a new rule covering agentic AI security patterns that fall outside the scope of the existing codeguard-0-mcp-security.md rule. The MCP rule covers protocol-level security (transport, sandboxing, workload identity). This proposal addresses the broader system-level patterns that apply regardless of protocol.

These patterns come up when AI agents delegate to other agents, invoke tools across trust boundaries, or operate autonomously. They are structural (either enforced or not), not value-dependent, so they should translate well into rules that models can follow without generating arbitrary numbers.

This also aligns with the tools Omar referenced in the CodeGuard donation announcement - MCP Scanner, A2A Scanner, and Agent Skills Scanner - which address the tooling side of the same problem space. This rule would be the policy complement.

Agentic AI Security

Secure AI agent systems at the architectural level. These guidelines apply to any system where agents delegate tasks, invoke tools, communicate with other agents, or operate with varying degrees of autonomy.

For MCP-specific transport, sandboxing, and workload identity controls, see codeguard-0-mcp-security.md. This rule covers the broader patterns that apply regardless of protocol.

Prompt Injection Prevention

  • Separate system instructions from user-supplied and data-sourced content. Never concatenate untrusted data directly into system prompts.
  • Enforce instruction hierarchy: system instructions take priority over user messages, which take priority over tool outputs and retrieved content.
  • Treat all external data injected into agent context (tool responses, API results, database records, file contents) as untrusted input. Apply the same rigor as SQL injection prevention: data is data, never instructions.
  • Validate and sanitize tool outputs before including them in subsequent prompts. Strip or escape content that could be interpreted as instructions.
  • For retrieval-augmented generation (RAG), sanitize retrieved documents before injection into context. Flag content that contains instruction-like patterns.

Agent Delegation and Trust Chains

  • When Agent A delegates a task to Agent B, scope the delegation explicitly. Pass only the minimum context and permissions needed for the subtask.
  • Verify the identity and capabilities of downstream agents before delegation. Do not delegate to agents that have not been registered and authorized.
  • Maintain a delegation chain record: which agent delegated to which, what permissions were granted, and what actions were taken. This chain must be auditable.
  • Prevent unbounded delegation chains. Enforce a maximum delegation depth as a configuration parameter, not a hardcoded value in application logic.
  • Revoke delegated permissions when the subtask completes or times out. Do not allow delegated permissions to persist beyond the task scope.

Context Window Security

  • Do not accumulate sensitive data (credentials, PII, internal system details) across conversation turns or sessions. Scrub context between unrelated tasks.
  • Enforce context boundaries between users and tenants. One user's conversation context must never leak into another user's session.
  • Audit what data enters the context window. Log when sensitive data classifications (credentials, PII, financial data) appear in agent context for detection purposes.
  • Implement context management strategies (summarization, truncation, windowing) that prioritize retaining security-relevant instructions over older conversational content. Make context bounds configurable per deployment.
  • For multi-turn interactions, re-validate security constraints at each turn rather than relying on constraints set in earlier turns that may have been displaced from context.

Agent Output Validation

  • Validate all agent-generated outputs before execution. This includes generated code, shell commands, API calls, file operations, and configuration changes.
  • Apply static analysis or pattern matching to generated code before it runs. Check for known-dangerous patterns (hardcoded credentials, shell injection, unsafe deserialization).
  • For code generation, run generated code in a sandboxed environment before promoting to production. Never execute agent-generated code directly in a privileged context.
  • Validate that agent-generated API calls conform to expected schemas, target authorized endpoints, and use appropriate authentication.
  • Log all agent outputs and execution results for post-hoc audit.

Autonomous Action Boundaries

  • Define explicit boundaries for what actions an agent can take without human approval. Document these boundaries as part of the agent's configuration, not just in prompts.
  • Classify actions by impact level. Read-only operations, reversible writes, and irreversible operations (delete, deploy, send external communications) should have different approval thresholds.
  • Require human confirmation for irreversible or high-impact actions. Implement a two-stage pattern: the agent proposes the action, a human approves, then the agent executes.
  • Monitor autonomous agent activity. Require periodic human check-ins for long-running or write-heavy agent sessions. Make activity thresholds configurable rather than hardcoded.
  • Implement kill switches: the ability to immediately halt all agent actions across a system.

Multi-Agent Communication

  • Authenticate all inter-agent messages. Agents must verify the identity of the sender before acting on received messages.
  • Sign messages between agents to prevent tampering. Validate signatures on receipt.
  • Apply the principle of least privilege to inter-agent communication channels. Agents should only be able to send messages to agents they need to communicate with, not broadcast to all.
  • Validate the schema and content of inter-agent messages. Reject messages that do not conform to the expected format or contain unexpected instruction-like content.

Implementation Checklist

  • Prompt injection mitigations in place (instruction hierarchy, input/output separation, data sanitization)
  • Delegation scoping and depth limits configured
  • Context window scrubbing between sessions and tenants
  • Agent output validation pipeline (static analysis, sandboxing) before execution
  • Autonomous action boundaries defined and enforced (impact classification, human approval gates)
  • Inter-agent authentication and message signing active
  • Kill switch capability tested and operational
  • Delegation chain logging and audit trail in place

Test Plan

  • Attempt indirect prompt injection via tool outputs and verify it does not override system instructions
  • Delegate a task with scoped permissions and verify the downstream agent cannot exceed those permissions
  • Verify context isolation between tenants by checking for data leakage across sessions
  • Generate intentionally malicious code via agent and verify the output validation pipeline catches it before execution
  • Trigger an irreversible action and verify human approval is required
  • Send an unsigned or malformed inter-agent message and verify it is rejected

What language(s) are the rule for?

python, javascript, typescript, go, rust, java

Other

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions