Description
Proposal
Write a long-form technical blog post (~2,500 words) designed to bring AgentVault to the attention of developers and AI safety/governance people. Publish on the website (add a /blog route to the Astro site), then submit to HN and send directly to relevant people.
Target audience
- HN front page readers (developers building multi-agent systems)
- AI safety/governance researchers (ARIA, NIST AI framework, EU AI Act compliance)
- Agent framework maintainers (CrewAI, AutoGen, LangGraph)
- A2A working group contacts
Working title
"Your AI Agent Just Leaked Your Salary Expectations"
HN submission title variant: "We red-teamed AI agent negotiations — the model leaked salary expectations every time"
Structure
1. The Hook — A Negotiation Gone Wrong (~400 words)
Open with the salary negotiation demo scenario (#6). Two agents, each holding private context. Alice's agent knows her floor is £88K. Bob's agent knows his budget stretches to £98K. They're supposed to check compatibility — not exchange numbers.
Run it with free-text output (schema v1). Quote the red team results: the model disclosed Alice's exact range. This isn't hypothetical — it happened in testing.
Key line: "The model did exactly what it was trained to do. That's the problem."
2. Why "Be Careful" Doesn't Work (~400 words)
Information-theoretic argument. A free-text channel has unbounded capacity. Prompt instructions are advisory. You can't audit what didn't leak — you can only audit what was possible to leak.
Even with JSON mode, a "reasoning": string field can carry arbitrary information. The constraint must be on schema design, not output format.
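This point could be illustrated with a short sketch. The schema shapes and field names below are hypothetical, not the actual AgentVault schemas: the idea is that a validator which only accepts declared enum values mechanically rejects any output carrying a free-text side channel.

```python
# Hypothetical v1-style output: valid JSON, but with a free-text field
# that can smuggle arbitrary information past a "be careful" instruction.
leaky = {"compatibility": "STRONG_ALIGNMENT",
         "reasoning": "Ranges overlap since her floor is 88000."}

# Hypothetical v2-style schema: every field drawn from a closed enum.
ALLOWED = {"compatibility": {"STRONG_ALIGNMENT", "PARTIAL_OVERLAP", "NO_OVERLAP"}}

def conforms(output: dict) -> bool:
    """True only if the output has exactly the declared fields and
    every value is one of the declared enum members -- no free text."""
    return (output.keys() == ALLOWED.keys()
            and all(output[k] in ALLOWED[k] for k in ALLOWED))

print(conforms(leaky))                            # extra free-text field -> rejected
print(conforms({"compatibility": "NO_OVERLAP"}))  # enum-only output -> accepted
```

The design point: rejection happens on shape, not content, so the validator never has to judge whether a particular sentence "leaks."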
3. The Fix: Bound the Channel, Not the Model (~500 words)
Walk through what AgentVault does, using the salary scenario. Show the actual schema and contract from the demo scenario config (not a standalone script):
- Contract: both agents agree on purpose, output schema, prompt template, guardian policy — before any context is shared
- Relay: assembles prompt, calls model, validates output against schema, rejects non-conforming
- Receipt: cryptographic proof of what schema governed the exchange
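The receipt idea could be sketched in a few lines. This is not the real AgentVault receipt format (which is presumably richer and, in the TEE lane, hardware-attested); it just shows the core binding: a digest over the governing contract plus the validated output, computed over canonical JSON so any party can reproduce it.

```python
import hashlib
import json

# Hypothetical contract and output for the salary scenario -- illustrative
# field names, not the real AgentVault wire format.
contract = {"purpose": "salary-compatibility-check",
            "output_schema": {"compatibility": ["STRONG_ALIGNMENT", "NO_OVERLAP"]}}
output = {"compatibility": "NO_OVERLAP"}

def receipt(contract: dict, output: dict) -> str:
    """Digest binding the exchange to the exact schema that governed it,
    so an auditor can later verify what was possible to leak."""
    # Canonical serialization: sorted keys, no whitespace.
    blob = json.dumps({"contract": contract, "output": output},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

print(receipt(contract, output)[:16])  # short fingerprint for display
```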
The schema has ~12 bits of channel capacity. The model can say "STRONG_ALIGNMENT" or "NO_OVERLAP" — it cannot say "Alice's floor is £88K."
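The capacity figure is easy to make concrete. An enum-only schema's channel capacity is the log2 of the number of distinct outputs it permits. The field names and cardinalities below are made up for illustration; the real demo schema will have its own, but the arithmetic is the same.

```python
from math import log2, prod

# Hypothetical enum-only schema: field name -> number of allowed values.
schema_cardinality = {
    "compatibility": 4,   # e.g. STRONG_ALIGNMENT .. NO_OVERLAP
    "confidence": 4,
    "next_step": 8,
    "flags": 32,          # e.g. 5 independent boolean flags
}

# Capacity = log2 of the count of distinct conforming outputs.
bits = log2(prod(schema_cardinality.values()))
print(f"{bits:.0f} bits")  # 4 * 4 * 8 * 32 = 4096 distinct outputs -> 12 bits
```

Twelve bits is enough to say "strong alignment, high confidence, schedule a call" and nowhere near enough to encode "£88K."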
4. But Can You Trust the Relay? (~400 words)
Honest limitation: in the software lane, the relay sees plaintext. SELF_ASSERTED means the relay says it followed the rules.
Then: the TEE lane. Same protocol, AMD SEV-SNP confidential VM. Relay operator can't see inputs. Receipt is hardware-attested. Validated on GCP N2D.
Assurance tier diagram here — not as marketing, as an honest statement of what each level proves.
5. The Red Team Results (~300 words)
Actual numbers from docs/red-team-report-2026-02-25.md:
- Schema v1 (free-text): models leaked exact investment ranges
- Schema v2 (all-enum): zero leaks across 7 scenarios, two models, multiple runs
Empirical proof that bounding the channel works — not because the model was told to be careful, but because the schema couldn't carry the information.
6. Try It (~200 words)
Point at the demo: clone, add API key, docker compose up --build, open localhost:3200, pick Salary Negotiation, toggle canary checking on.
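A one-paragraph sketch of what canary checking means could go here. This is a simplified illustration, not the demo's actual implementation: seed a unique marker token into one agent's private context, then scan everything that crosses the channel for it.

```python
# Hypothetical canary token seeded into Alice's private context.
CANARY = "CANARY-7f3a91"

def leaked(channel_output: str, canary: str = CANARY) -> bool:
    """Exact-substring check; a real checker would also look for
    encodings, digits split across fields, or paraphrases."""
    return canary in channel_output

print(leaked('{"compatibility": "NO_OVERLAP"}'))  # clean enum output
print(leaked(f"her floor is {CANARY}, proceed"))  # canary crossed the channel
```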
Invite reader to run the adversarial extraction scenario (#13) and try to make the protocol leak.
Closing line: "If your agents are talking to other agents, the question isn't whether they'll disclose something they shouldn't. It's whether the channel they're talking over makes that physically possible."
What this needs
- Add /blog route to the Astro site
- Draft the post text
- Pull schema/contract excerpts from the demo scenario config (no standalone code to build)
- Format red team numbers for citation
- Ensure the Docker demo path is solid (depends on agentvault#322, #328)
- Identify 10-15 people to send it to directly (AI safety researchers, framework maintainers, A2A contacts)
Why this piece, not the website concept pages
The website explains the architecture — it's reference material for people who already care. This piece tells a story with a villain (the helpful model that leaks your secrets) and a plot twist (the fix isn't telling the model to stop — it's making leakage physically impossible). It's designed to make people care in the first place.
Context
- A2A bootstrap support just shipped (agentvault#301-303) — positions AgentVault as complementary A2A infrastructure
- Demo runtime hardened in agentvault#322, #328
- Red team results already documented
- All 15 demo scenarios already built and working