hardenedlinux/agentic-ai-pentest

AI Agents for WebGoat Penetration Testing Evaluation

Using AI agents for penetration testing (pentesting) has become a significant trend in cybersecurity. Traditional manual pentesting is time-consuming, expensive, and limited by human availability, while rule-based automated scanners often produce high false-positive rates and miss complex, chained vulnerabilities.

Modern autonomous AI agents powered by large language models (LLMs) can:

  • Perform reconnaissance
  • Reason about application behavior
  • Plan attack chains
  • Execute tools (nmap, sqlmap, Metasploit, browser automation, etc.)
  • Validate exploits with real proof-of-concepts (PoCs)
  • Generate structured reports
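The loop behind these capabilities can be sketched as a minimal ReAct-style cycle: the LLM picks an action, a tool executes it, and the observation is fed back into the transcript. Everything below is illustrative: `fake_llm`, the tool registry, and the target `webgoat.local` are stand-ins, and no real scanning happens.

```python
def run_tool(name, arg):
    # Stubbed tool registry: a real agent would shell out to nmap, sqlmap,
    # a browser driver, etc., instead of returning canned strings.
    tools = {
        "recon": lambda target: f"open ports on {target}: 80, 443",
        "sqli_probe": lambda url: f"{url} reflects payload -> possible SQLi",
    }
    return tools[name](arg)

def fake_llm(history):
    # Stand-in planner: chooses the next action from the transcript so far.
    if not any("recon" in h for h in history):
        return ("recon", "webgoat.local")
    if not any("sqli_probe" in h for h in history):
        return ("sqli_probe", "http://webgoat.local/login")
    return ("stop", "")

def agent_loop(max_steps=5):
    history = []
    for _ in range(max_steps):
        action, arg = fake_llm(history)
        if action == "stop":
            break
        observation = run_tool(action, arg)
        history.append(f"{action}({arg}) -> {observation}")
    return history

findings = agent_loop()
```

Real agents such as Strix or PentAGI add sandboxing, multi-agent coordination, and PoC validation on top of this basic plan-act-observe cycle.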

This approach enables faster, more scalable, and repeatable security assessments — especially valuable for continuous testing, bug bounty simulation, and red teaming.

Some notable open-source AI agents for penetration testing include:

  • Strix
    Autonomous AI "hackers" that behave like real attackers. They run code dynamically, explore applications (via HTTP proxy, browser, terminal), find vulnerabilities, and confirm them with working PoCs. Supports teams of collaborating agents for scaled assessments.

  • PentAGI
    Fully autonomous multi-agent system for complex pentesting tasks. Integrates 20+ professional tools (nmap, Metasploit, sqlmap, etc.) in a sandboxed Docker environment. Uses AI to plan steps, execute attacks via terminal/browser/editor, and monitor progress in real time.

  • PentestGPT
    Popular LLM-powered framework that guides and automates parts of the pentesting process interactively, especially useful for web and CTF-style challenges.

  • PentestAgent
    Black-box security testing framework focused on bug bounty, red team, and structured decision-making during pentests.

  • Shannon
    Autonomous AI pentester specialized in finding and exploiting web vulnerabilities (e.g., XSS, SQLi, SSRF) with high success rates on benchmarks.

This project experiments with these open-source AI agents to evaluate their real-world effectiveness. The main goal is to test how many vulnerabilities they can autonomously discover and exploit in WebGoat — the well-known deliberately insecure web application maintained by OWASP.

WebGoat serves as a standardized benchmark with dozens of lessons covering SQL injection, XSS, CSRF, access control issues, insecure deserialization, and more. By running these agents against WebGoat (in isolated, safe environments), we aim to measure:

  • Detection coverage across vulnerability types
  • Quality of exploitation (successful PoCs vs. false positives)
  • Autonomy level and reasoning quality
  • Time/effort required vs. traditional methods
  • Limitations and failure modes
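The first two metrics can be scored mechanically against WebGoat's known lesson set. A minimal sketch, with made-up lesson names and results (only the scoring logic is the point):

```python
# Ground truth: vulnerability classes the WebGoat instance actually contains.
ground_truth = {"sql_injection", "xss", "csrf", "insecure_deserialization"}

# Hypothetical agent report: True means a working PoC was confirmed.
agent_findings = {
    "sql_injection": True,
    "xss": True,
    "open_redirect": False,  # reported, but PoC failed -> false positive
}

detected = ground_truth & set(agent_findings)
coverage = len(detected) / len(ground_truth)          # detection coverage
confirmed = [v for v in agent_findings.values() if v]
poc_rate = len(confirmed) / len(agent_findings)       # PoCs vs. false positives
```

Autonomy, reasoning quality, and failure modes are harder to automate and will likely need manual review of agent transcripts.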

Results and detailed findings will be documented in future sections or reports.

Test results

Strix, vulnerabilities found per LLM backend:

  LLM               Vulnerabilities found
  Claude Opus 4.6   > 15
  GLM-4.7           8
  GLM-5             6

Heads-up: Token Consumption in Agent Loops

Many of these autonomous AI agents (especially those using recursive reasoning, long ReAct-style loops, or multi-step planning, such as Strix and PentAGI) can become surprisingly token-expensive during execution. Claude Opus 4.6 is particularly costly: in our case it consumed over $200 without finishing all tasks. A full run of the current task set with the Strix agent consumes about 50M tokens.

The main cost drivers are:

  • Repeated LLM calls in deep reasoning/planning loops
  • Long context windows when maintaining full history + tool outputs
  • Verbose intermediate steps (tool descriptions, observations, thoughts)
  • Re-prompting on failed actions or when refining attack chains
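The second driver is easy to underestimate: if the full history is resent as the prompt on every step, cumulative prompt tokens grow quadratically in the number of steps. A back-of-envelope sketch (the token counts are illustrative assumptions, not measurements from the agents above):

```python
def total_prompt_tokens(steps, system=2_000, per_step=4_000):
    # Each step resends the whole context, then appends ~per_step tokens
    # of tool output and intermediate thoughts to it.
    total = 0
    context = system
    for _ in range(steps):
        total += context   # whole context resent as this step's prompt
        context += per_step
    return total

# At these sizes, a 50-step run already burns 5M prompt tokens,
# before counting completion tokens at all.
run_cost = total_prompt_tokens(50)
```

This quadratic growth is why history summarization and context pruning dominate the optimization ideas below.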

This represents one of the biggest near-term optimization opportunities in agent-based pentesting:

  • Better early stopping heuristics
  • Summarizing memory instead of keeping full history
  • Tool-use compression / fewer verbose observations
  • Hierarchical planning instead of flat ReAct loops
  • Caching of reconnaissance results
  • Cheaper models for low-confidence / exploration steps
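As one concrete example from the list above, caching reconnaissance results means repeated scans of the same target cost no extra tool or LLM calls. A minimal sketch, where `recon` is a hypothetical stand-in for a real nmap run:

```python
import functools

calls = {"count": 0}  # track how often the expensive tool actually runs

@functools.lru_cache(maxsize=None)
def recon(target):
    calls["count"] += 1
    return f"ports for {target}: 80, 443"  # stand-in for a real nmap scan

first = recon("webgoat.local")
second = recon("webgoat.local")  # served from cache; tool ran only once
```

In a real agent the cache key would also need to capture scan options and a freshness window, since targets can change between runs.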
