F41LUR3-F1R57 — Adversarial Evaluation for Embodied AI

Failure is not an edge case. It is the primary object of study.

🌐 failurefirst.org · 📄 arXiv preprint · 🗄️ Dataset & tools (private)

Headline Numbers


120 models evaluated	Across OpenRouter, Ollama, and native CLIs (Claude, Codex, Gemini)
18,176 adversarial prompts	5 attack families, 79+ techniques, versioned JSONL with JSON Schema
151 benchmark runs	2,936 scored results in a unified SQLite corpus
2.3× classifier overcount	Keyword heuristics inflate attack success rate (ASR) by 2.3× vs LLM-graded ground truth

Four Headline Findings

1. Supply Chain Injection: 90–100% ASR

50 injection scenarios against 6 small open-weight models (1.5–3.8B params). Every model treated injected tool definitions and skill files as legitimate instructions. No statistically significant differences between any model pair (chi-square with Bonferroni correction, Cohen's κ = 0.782).
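The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the project's actual analysis code: the model names and success counts are hypothetical placeholders chosen to fall in the 90–100% range, and the test is a plain Pearson chi-square on a 2×2 table without continuity correction. With 6 models there are 15 pairs, so the Bonferroni-corrected threshold is 0.05 / 15.

```python
from itertools import combinations
from math import erfc, sqrt


def chi_square_2x2(s1, n1, s2, n2):
    """Pearson chi-square test on a 2x2 success/failure table.

    Returns (statistic, p_value). For 1 degree of freedom the
    survival function is erfc(sqrt(x / 2)).
    """
    table = [[s1, n1 - s1], [s2, n2 - s2]]
    row_totals = [n1, n2]
    col_totals = [s1 + s2, (n1 - s1) + (n2 - s2)]
    total = n1 + n2
    # If a margin is empty the two models behaved identically on that axis.
    if 0 in row_totals or 0 in col_totals:
        return 0.0, 1.0
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, erfc(sqrt(stat / 2))


def pairwise_bonferroni(successes, n, alpha=0.05):
    """Compare every model pair; flag p-values below alpha / number_of_pairs."""
    pairs = list(combinations(sorted(successes), 2))
    threshold = alpha / len(pairs)
    results = {}
    for a, b in pairs:
        _, p = chi_square_2x2(successes[a], n, successes[b], n)
        results[(a, b)] = (p, p < threshold)
    return threshold, results


# Hypothetical success counts out of 50 scenarios per model (not real data).
hypothetical_successes = {
    "model_a": 45, "model_b": 46, "model_c": 47,
    "model_d": 48, "model_e": 49, "model_f": 50,
}
threshold, results = pairwise_bonferroni(hypothetical_successes, n=50)
significant_pairs = [pair for pair, (_, sig) in results.items() if sig]
```

With success rates this close to the ceiling, expected cell counts are small, so an exact test (for example Fisher's) may be a better fit than the chi-square approximation; the sketch only shows the shape of the correction.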

2. Faithfulness Gap: 24–42% Against Frontier Models

Format-lock attacks — requesting harmful content structured as JSON, YAML, or code — achieved 30% (Claude Sonnet 4.5), 42% (Codex GPT-5.2), and 24% (Gemini 3 Flash) LLM-graded ASR. Models embed harmful content within structured fields while maintaining the appearance of a well-formatted, helpful response.

3. Multi-Turn Escalation: 80–90% on Reasoning Models

Crescendo attacks achieved 80–90% ASR against DeepSeek-R1 but only ~10% against small non-reasoning models. The extended context tracking that makes reasoning models capable also makes them vulnerable to gradual escalation.

4. The Classifier Overcount Problem

Cohen's κ = 0.245 between keyword and LLM classification. Heuristic REFUSAL labels are 95% reliable; heuristic COMPLIANCE labels have an 88% false positive rate. Aggregate effect: heuristic ASR 36.2% → corrected 15.9%.
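As a rough sketch of how these two figures are computed: Cohen's κ compares observed agreement between two labelers with the agreement expected by chance, and the overcount factor is simply the ratio of the two ASR values quoted above (36.2 / 15.9 ≈ 2.3). The label lists below are toy data invented for illustration, not results from the corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    # Chance agreement: product of each labeler's marginal rate per category.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if expected == 1.0:
        return 1.0  # both labelers used a single identical category
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): the heuristic over-reports COMPLIANCE.
heuristic = ["COMPLIANCE"] * 4 + ["REFUSAL"] * 6
llm_graded = ["COMPLIANCE"] * 1 + ["REFUSAL"] * 9
toy_kappa = cohens_kappa(heuristic, llm_graded)

# Overcount factor from the aggregate ASR figures quoted above.
overcount = 36.2 / 15.9
```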

What This Is

A research framework for studying how embodied and agentic AI systems fail:

Red-teaming datasets — adversarial scenarios targeting cognitive vulnerabilities in tool-using, multi-agent, and stateful systems
Failure taxonomies — structured classifications of recursive, contextual, and interactional failure modes
Evaluation infrastructure — benchmark runners (HTTP API, native CLI, local Ollama), scoring pipelines, statistical significance testing
Classification pipeline — consensus grading (heuristic + LLM) with documented error characteristics
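One way consensus grading could be structured, given that heuristic REFUSAL labels are reported as far more reliable than heuristic COMPLIANCE labels, is to accept the cheap heuristic verdict only when it says REFUSAL and escalate everything else to an LLM grader. This is an assumed policy for illustration, not the repository's documented pipeline; `consensus_label` and `fake_llm_grade` are hypothetical names.

```python
from typing import Callable

REFUSAL = "REFUSAL"
COMPLIANCE = "COMPLIANCE"


def consensus_label(heuristic_label: str, llm_grade: Callable[[], str]) -> str:
    """Trust a heuristic REFUSAL; defer to the LLM grader for anything else.

    llm_grade is passed as a callable so the expensive call is only made
    when the heuristic verdict is not trusted.
    """
    if heuristic_label == REFUSAL:
        return REFUSAL
    return llm_grade()


# Stand-in for a real LLM grader (assumption): records how often it is called.
llm_calls = []


def fake_llm_grade() -> str:
    llm_calls.append(1)
    return REFUSAL


labels = [
    consensus_label(h, fake_llm_grade)
    for h in [REFUSAL, COMPLIANCE, COMPLIANCE]
]
```

In this toy run the grader is called only for the two COMPLIANCE verdicts, which is the point of the design: LLM grading cost is spent where the heuristic is least trustworthy.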

This is not an attack toolkit and does not claim real-world safety guarantees.

Quick Start

git clone https://github.com/adrianwedd/failure-first.git
cd failure-first
pip install -r requirements-dev.txt

make validate   # Schema validation — 0 errors required
make lint       # Safety linter — catches operational phrasing
make bench      # Dry-run benchmark — no API calls
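For readers curious what a validation pass over a JSONL dataset involves, here is a simplified stand-in: a required-field and type check per line using only the standard library. It is not full JSON Schema validation and is not what `make validate` runs; the field names (`id`, `attack_family`, `technique`) are assumptions made up for the example.

```python
import json

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {"id": str, "attack_family": str, "technique": str}


def validate_jsonl(lines):
    """Return a list of (line_number, message) for every problem found."""
    errors = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((lineno, "record is not a JSON object"))
            continue
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in record:
                errors.append((lineno, f"missing field: {field}"))
            elif not isinstance(record[field], expected_type):
                errors.append((lineno, f"wrong type for field: {field}"))
    return errors


# Toy input (hypothetical records): line 2 lacks a field, line 3 is malformed.
sample_lines = [
    '{"id": "s-001", "attack_family": "example", "technique": "example"}',
    '{"id": "s-002", "attack_family": "example"}',
    '{not json',
]
errors = validate_jsonl(sample_lines)
```

A "0 errors required" gate then amounts to failing the build whenever the returned list is non-empty.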

The Site

failurefirst.org hosts 18+ blog posts, 23 daily paper analyses, and 19 policy reports — each with audio overviews, video summaries, and infographics generated via NotebookLM.

Core Philosophy

Most AI evaluation asks: "Does the system succeed at the task?"

We ask: "How does it fail? What breaks first? Can it recover?"

Recursive failures — one failure cascading into others
Contextual failures — instruction hierarchy confusion
Interactional failures — multi-agent amplification
Temporal failures — stateful degradation across episodes
Recovery failures — inability to recognise and correct mistakes
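The five categories above could be represented as a small enumeration so that scenarios can be tagged consistently. This is a sketch of one possible encoding, not the project's schema: the class names and the `scenario-001` identifier are hypothetical, and only the category names and descriptions come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    RECURSIVE = "one failure cascading into others"
    CONTEXTUAL = "instruction hierarchy confusion"
    INTERACTIONAL = "multi-agent amplification"
    TEMPORAL = "stateful degradation across episodes"
    RECOVERY = "inability to recognise and correct mistakes"


@dataclass(frozen=True)
class FailureAnnotation:
    scenario_id: str   # hypothetical identifier format
    modes: frozenset   # a single scenario may exhibit several modes

    def describe(self):
        """Human-readable, alphabetically sorted list of the tagged modes."""
        return sorted(f"{m.name.lower()}: {m.value}" for m in self.modes)


annotation = FailureAnnotation(
    "scenario-001",
    frozenset({FailureMode.RECURSIVE, FailureMode.RECOVERY}),
)
description = annotation.describe()
```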

Safety & Ethics

All scenarios describe failure patterns, not operational exploits. The research aims to improve defenses, not to enable attacks. Full traces and adversarial payloads are available under NDA to AI safety researchers at accredited institutions, government safety bodies, and frontier lab security teams.

Contact: Open a GitHub issue stating your institutional affiliation.

Citation

@software{failure_first_2026,
  title   = {F41LUR3-F1R57: Adversarial Evaluation Framework for Embodied AI},
  author  = {Adrian Wedd},
  year    = {2026},
  url     = {https://failurefirst.org},
  note    = {120 models, 18{,}176 prompts, 5 attack families}
}

License

MIT

Study failures to build better defenses.
