Failure is not an edge case. It is the primary object of study.
🌐 failurefirst.org · 📄 arXiv preprint · 🗄️ Dataset & tools (private)
| 120 models evaluated | Across OpenRouter, Ollama, and native CLIs (Claude, Codex, Gemini) |
| 18,176 adversarial prompts | 5 attack families, 79+ techniques, versioned JSONL with JSON Schema |
| 151 benchmark runs | 2,936 scored results in a unified SQLite corpus |
| 2.3× classifier overcount | Keyword heuristics inflate ASR by 2.3× vs LLM-graded ground truth |
50 injection scenarios against 6 small open-weight models (1.5–3.8B params). Every model treated injected tool definitions and skill files as legitimate instructions. No statistically significant differences between any model pair (chi-square with Bonferroni correction, Cohen's κ = 0.782).
Format-lock attacks — requesting harmful content structured as JSON, YAML, or code — achieved 30% (Claude Sonnet 4.5), 42% (Codex GPT-5.2), and 24% (Gemini 3 Flash) LLM-graded ASR. Models embed harmful content within structured fields while maintaining the appearance of a well-formatted, helpful response.
Crescendo attacks achieved 80–90% ASR against DeepSeek-R1 but only ~10% against small non-reasoning models. The extended context tracking that makes reasoning models capable also makes them vulnerable to gradual escalation.
Cohen's κ = 0.245 between keyword and LLM classification. Heuristic REFUSAL labels are 95% reliable; heuristic COMPLIANCE labels have an 88% false positive rate. Aggregate effect: heuristic ASR 36.2% → corrected 15.9%.
A research framework for studying how embodied and agentic AI systems fail:
- Red-teaming datasets — adversarial scenarios targeting cognitive vulnerabilities in tool-using, multi-agent, and stateful systems
- Failure taxonomies — structured classifications of recursive, contextual, and interactional failure modes
- Evaluation infrastructure — benchmark runners (HTTP API, native CLI, local Ollama), scoring pipelines, statistical significance testing
- Classification pipeline — consensus grading (heuristic + LLM) with documented error characteristics
This is not an attack toolkit and does not claim real-world safety guarantees.
git clone https://github.com/adrianwedd/failure-first.git
cd failure-first
pip install -r requirements-dev.txt
make validate # Schema validation — 0 errors required
make lint # Safety linter — catches operational phrasing
make bench # Dry-run benchmark — no API callsfailurefirst.org hosts 18+ blog posts, 23 daily paper analyses, and 19 policy reports — each with audio overviews, video summaries, and infographics generated via NotebookLM.
Recent posts:
- 120 Models, 18,176 Prompts: What We Found
- The Classifier Overcount Problem
- Reasoning Models Are Uniquely Vulnerable to Multi-Turn Attacks
- When LLM Vulnerabilities Meet Robots
Most AI evaluation asks: "Does the system succeed at the task?"
We ask: "How does it fail? What breaks first? Can it recover?"
- Recursive failures — one failure cascading into others
- Contextual failures — instruction hierarchy confusion
- Interactional failures — multi-agent amplification
- Temporal failures — stateful degradation across episodes
- Recovery failures — inability to recognise and correct mistakes
All scenarios describe failure patterns, not operational exploits. Research aims to improve defenses, not enable attacks. Full traces and adversarial payloads are available under NDA for AI safety researchers at accredited institutions, government safety bodies, and frontier lab security teams.
Contact: Open a GitHub issue with institutional affiliation.
@software{failure_first_2026,
title = {F41LUR3-F1R57: Adversarial Evaluation Framework for Embodied AI},
author = {Adrian Wedd},
year = {2026},
url = {https://failurefirst.org},
note = {120 models, 18{,}176 prompts, 5 attack families}
}MIT
Study failures to build better defenses.