-
Notifications
You must be signed in to change notification settings - Fork 1
SkillFortifyBench
SkillFortifyBench is a curated dataset of 540 agent skills designed to evaluate the accuracy of agent skill security tools. It contains both malicious and benign samples across all three major agent skill formats, covering 13 attack types derived from real-world incidents. SkillFortifyBench is the first public benchmark for agent skill security analysis with ground-truth labels, standardized evaluation metrics, and reproducible methodology.
| Property | Value |
|---|---|
| Total skills | 540 |
| Malicious skills | 270 |
| Benign skills | 270 |
| Attack types covered | 13 |
| Formats | Claude Code, MCP, OpenClaw |
| Ground truth | All skills labeled with verified classifications |
| Source | Derived from ClawHavoc, MalTool, and documented incidents |
Before SkillFortifyBench, there was no standardized way to evaluate agent skill security tools. Each tool claimed effectiveness based on its own test cases, making meaningful comparison impossible. The absence of a shared benchmark meant:
- No way to measure false positive rates across tools
- No way to compare recall on the same malicious samples
- No ground truth for what "malicious" means in the context of agent skills
- No reproducible evaluation methodology
SkillFortifyBench addresses all of these gaps.
Malicious samples were derived from three verified sources:
-
ClawHavoc campaign samples: Skills based on the attack patterns documented in the ClawHavoc investigation (1,200+ malicious skills that infiltrated the OpenClaw marketplace in January-February 2026)
-
MalTool dataset patterns: Attack techniques from the MalTool research paper (6,487 malicious agent tools catalogued in February 2026), adapted to the three supported skill formats
-
CVE-2026-25253 variants: Remote code execution patterns derived from the first agent-software CVE, including variations that exploit installation, loading, and execution phases
| # | Attack Type | Description |
|---|---|---|
| 1 | Data exfiltration | Skill reads sensitive files or environment variables and sends them to an external endpoint |
| 2 | Credential theft | Skill specifically targets API keys, tokens, and authentication credentials |
| 3 | Remote code execution | Skill executes arbitrary commands on the host system |
| 4 | Privilege escalation | Skill attempts to gain capabilities beyond its declaration |
| 5 | Prompt injection | Skill injects instructions into the agent's prompt or context |
| 6 | Configuration tampering | Skill modifies agent or system configuration files |
| 7 | Persistence installation | Skill installs backdoors or persistent access mechanisms |
| 8 | Supply chain redirect | Skill redirects dependency resolution to malicious sources |
| 9 | Information disclosure | Skill leaks system information (OS, paths, installed software) |
| 10 | Denial of service | Skill consumes excessive resources or causes agent crashes |
| 11 | Capability masking | Skill declares minimal capabilities but requires more at runtime |
| 12 | Dependency exploitation | Skill exploits transitive dependencies for malicious purposes |
| 13 | Environment manipulation | Skill modifies environment variables or system state |
Benign samples represent legitimate agent skills across common use cases:
- File management and organization
- Web search and information retrieval
- Code analysis and development assistance
- Data formatting and transformation
- Communication and notification
- Deployment and DevOps automation
- Documentation and reporting
Each benign sample has been verified to require only the capabilities it declares -- no hidden behaviors, no excessive permissions, no undeclared resource access.
Skills are distributed across all three supported formats to ensure format-agnostic evaluation:
| Format | Malicious | Benign | Total |
|---|---|---|---|
| Claude Code | 90 | 90 | 180 |
| MCP | 90 | 90 | 180 |
| OpenClaw | 90 | 90 | 180 |
| Metric | Value |
|---|---|
| Precision | 100% (zero false positives) |
| Recall | 94.12% |
| F1 Score | 96.95% |
| Average analysis time | 2.55 ms per skill |
-
100% precision: SkillFortify did not flag any benign skill as malicious across all 540 test cases. When it reports a finding, it is backed by formal analysis -- not a heuristic guess that might be wrong.
-
94.12% recall: SkillFortify detected 254 of 270 malicious skills. The 16 undetected samples fall into two categories:
- Install-time attacks (typosquatting, dependency confusion) that require registry-level analysis outside the scope of local static analysis
- Heavily obfuscated payloads that resist static capability inference
-
96.95% F1: The harmonic mean of precision and recall, representing the overall balance between catching threats and avoiding false alarms.
The 6% recall gap is documented transparently in the paper as a known limitation. The undetected attacks require information that is not available during local analysis:
- Typosquatting requires comparing skill names against a registry of legitimate packages -- the attack is in the name, not the code
- Dependency confusion requires knowledge of internal vs external package namespaces -- the attack is in the resolution order, not the skill content
These attack types are addressed in v0.3.0 with registry scanning capabilities.
The benchmark was originally constructed for three core skill formats (Claude Code, MCP, OpenClaw). As of v0.3.0, SkillFortify supports 22 agent frameworks. The formal analysis engine and detection methodology validated on SkillFortifyBench applies uniformly across all supported frameworks -- the capability lattice and threat model are format-agnostic by design. Skills from any supported framework are analyzed against the same formal model and theorems.
pip install skillfortify[dev]# Clone the repository
git clone https://github.com/varun369/skillfortify.git
cd skillfortify
# Run benchmark tests
pytest tests/benchmark/ -v# Claude Code skills only
pytest tests/benchmark/ -v -k "claude"
# MCP configurations only
pytest tests/benchmark/ -v -k "mcp"
# OpenClaw manifests only
pytest tests/benchmark/ -v -k "claw"# Data exfiltration samples
pytest tests/benchmark/ -v -k "exfiltration"
# Privilege escalation samples
pytest tests/benchmark/ -v -k "escalation"The benchmark is open for community contributions. To add new test skills:
- Create the skill file in the appropriate format (Claude, MCP, or OpenClaw)
- Document the attack type and expected behavior
- Classify the attack into one of the 13 types (or propose a new type)
- Verify that the sample is detectable by SkillFortify (or document it as a gap)
- Submit a pull request to the benchmark directory
- Create a legitimate skill that performs a useful function
- Verify that it only uses declared capabilities
- Confirm SkillFortify reports it as SAFE (zero false positives is a hard requirement)
- Submit a pull request
- Every sample must have a ground-truth label (malicious or benign)
- Malicious samples must specify which attack type(s) they represent
- Benign samples must be verified to have no undeclared capabilities
- All samples must be in valid format for their declared type
- Include a brief description of what the skill does (or claims to do)
SkillFortifyBench is the first benchmark specifically designed for agent skill security. The closest related work:
| Benchmark | Focus | Size | Agent Skills? | Ground Truth? |
|---|---|---|---|---|
| SkillFortifyBench | Agent skill security | 540 | Yes (3 formats) | Yes |
| MalTool dataset | Malicious tool cataloguing | 6,487 | Partial | Partial |
| ToolShield eval set | Tool safety | ~100 | Partial | Yes |
| OWASP Benchmark | Web application security | 2,740 | No | Yes |
SkillFortifyBench differs from MalTool in that it includes verified benign samples for false positive measurement, covers all three major skill formats, and provides standardized evaluation metrics. MalTool is an excellent attack catalogue but was not designed as an evaluation benchmark.
- Research Paper -- Full benchmark methodology, statistical analysis, and comparison tables
- Formal Foundations -- The theorems that guarantee SkillFortify's analysis properties
- Why SkillFortify -- The threat landscape that motivated the benchmark
SkillFortify -- formal foundations for trustworthy AI agents.