hardenedlinux/agentic-ai-pentest

AI Agents for WebGoat Penetration Testing Evaluation

Using AI agents for penetration testing (pentesting) has become a significant trend in cybersecurity. Traditional manual pentesting is time-consuming, expensive, and limited by human availability, while rule-based automated scanners often produce high false-positive rates and miss complex, chained vulnerabilities.

Modern autonomous AI agents powered by large language models (LLMs) can:

  • Perform reconnaissance
  • Reason about application behavior
  • Plan attack chains
  • Execute tools (nmap, sqlmap, Metasploit, browser automation, etc.)
  • Validate exploits with real proof-of-concepts (PoCs)
  • Generate structured reports
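The loop behind these capabilities can be sketched as a minimal ReAct-style cycle: the LLM picks an action, a tool executes it, and the observation is fed back into the transcript. Everything below is illustrative: `fake_llm`, the tool registry, and the target `webgoat.local` are stand-ins, and no real scanning happens.

```python
def run_tool(name, arg):
    # Stubbed tool registry: a real agent would shell out to nmap, sqlmap,
    # a browser driver, etc., instead of returning canned strings.
    tools = {
        "recon": lambda target: f"open ports on {target}: 80, 443",
        "sqli_probe": lambda url: f"{url} reflects payload -> possible SQLi",
    }
    return tools[name](arg)

def fake_llm(history):
    # Stand-in planner: chooses the next action from the transcript so far.
    if not any("recon" in h for h in history):
        return ("recon", "webgoat.local")
    if not any("sqli_probe" in h for h in history):
        return ("sqli_probe", "http://webgoat.local/login")
    return ("stop", "")

def agent_loop(max_steps=5):
    history = []
    for _ in range(max_steps):
        action, arg = fake_llm(history)
        if action == "stop":
            break
        observation = run_tool(action, arg)
        history.append(f"{action}({arg}) -> {observation}")
    return history

findings = agent_loop()
```

Real agents such as Strix or PentAGI add sandboxing, multi-agent coordination, and PoC validation on top of this basic plan-act-observe cycle.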

This approach enables faster, more scalable, and repeatable security assessments — especially valuable for continuous testing, bug bounty simulation, and red teaming.

Some notable open-source AI agents for penetration testing include:

  • Strix
    Autonomous AI "hackers" that behave like real attackers. They run code dynamically, explore applications (via HTTP proxy, browser, terminal), find vulnerabilities, and confirm them with working PoCs. Supports teams of collaborating agents for scaled assessments.

  • PentAGI
    Fully autonomous multi-agent system for complex pentesting tasks. Integrates 20+ professional tools (nmap, Metasploit, sqlmap, etc.) in a sandboxed Docker environment. Uses AI to plan steps, execute attacks via terminal/browser/editor, and monitor progress in real time.

  • PentestGPT
    Popular LLM-powered framework that guides and automates parts of the pentesting process interactively, especially useful for web and CTF-style challenges.

  • PentestAgent
    Black-box security testing framework focused on bug bounty, red team, and structured decision-making during pentests.

  • Shannon
    Autonomous AI pentester specialized in finding and exploiting web vulnerabilities (e.g., XSS, SQLi, SSRF) with high success rates on benchmarks.

This project experiments with these open-source AI agents to evaluate their real-world effectiveness. The main goal is to test how many vulnerabilities they can autonomously discover and exploit in WebGoat — the well-known deliberately insecure web application maintained by OWASP.

WebGoat serves as a standardized benchmark with dozens of lessons covering SQL injection, XSS, CSRF, access control issues, insecure deserialization, and more. By running these agents against WebGoat (in isolated, safe environments), we aim to measure:

  • Detection coverage across vulnerability types
  • Quality of exploitation (successful PoCs vs. false positives)
  • Autonomy level and reasoning quality
  • Time/effort required vs. traditional methods
  • Limitations and failure modes
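The first two metrics can be scored mechanically against WebGoat's known lesson set. A minimal sketch, with made-up lesson names and results (only the scoring logic is the point):

```python
# Ground truth: vulnerability classes the WebGoat instance actually contains.
ground_truth = {"sql_injection", "xss", "csrf", "insecure_deserialization"}

# Hypothetical agent report: True means a working PoC was confirmed.
agent_findings = {
    "sql_injection": True,
    "xss": True,
    "open_redirect": False,  # reported, but PoC failed -> false positive
}

detected = ground_truth & set(agent_findings)
coverage = len(detected) / len(ground_truth)          # detection coverage
confirmed = [v for v in agent_findings.values() if v]
poc_rate = len(confirmed) / len(agent_findings)       # PoCs vs. false positives
```

Autonomy, reasoning quality, and failure modes are harder to automate and will likely need manual review of agent transcripts.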

Results and detailed findings will be documented in future sections or reports.

Test results

Strix, vulnerabilities found per LLM backend:

  LLM               Vulnerabilities found
  Claude Opus 4.6   > 15
  GLM-4.7           8
  GLM-5             6

Heads-up: Token Consumption in Agent Loops

Many of these autonomous AI agents (especially those using recursive reasoning, long ReAct-style loops, or multi-step planning, such as Strix and PentAGI) can become surprisingly token-expensive during execution. Claude Opus 4.6 is particularly costly: in our case it consumed over $200 without finishing all tasks. A full run of the current task set with the Strix agent consumes about 50M tokens.

The main cost drivers are:

  • Repeated LLM calls in deep reasoning/planning loops
  • Long context windows when maintaining full history + tool outputs
  • Verbose intermediate steps (tool descriptions, observations, thoughts)
  • Re-prompting on failed actions or when refining attack chains
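The second driver is easy to underestimate: if the full history is resent as the prompt on every step, cumulative prompt tokens grow quadratically in the number of steps. A back-of-envelope sketch (the token counts are illustrative assumptions, not measurements from the agents above):

```python
def total_prompt_tokens(steps, system=2_000, per_step=4_000):
    # Each step resends the whole context, then appends ~per_step tokens
    # of tool output and intermediate thoughts to it.
    total = 0
    context = system
    for _ in range(steps):
        total += context   # whole context resent as this step's prompt
        context += per_step
    return total

# At these sizes, a 50-step run already burns 5M prompt tokens,
# before counting completion tokens at all.
run_cost = total_prompt_tokens(50)
```

This quadratic growth is why history summarization and context pruning dominate the optimization ideas below.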

This represents one of the biggest near-term optimization opportunities in agent-based pentesting:

  • Better early stopping heuristics
  • Summarizing memory instead of keeping full history
  • Tool-use compression / fewer verbose observations
  • Hierarchical planning instead of flat ReAct loops
  • Caching of reconnaissance results
  • Cheaper models for low-confidence / exploration steps
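As one concrete example from the list above, caching reconnaissance results means repeated scans of the same target cost no extra tool or LLM calls. A minimal sketch, where `recon` is a hypothetical stand-in for a real nmap run:

```python
import functools

calls = {"count": 0}  # track how often the expensive tool actually runs

@functools.lru_cache(maxsize=None)
def recon(target):
    calls["count"] += 1
    return f"ports for {target}: 80, 443"  # stand-in for a real nmap scan

first = recon("webgoat.local")
second = recon("webgoat.local")  # served from cache; tool ran only once
```

In a real agent the cache key would also need to capture scan options and a freshness window, since targets can change between runs.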
