
Blog post: "Your AI Agent Just Leaked Your Salary Expectations" #67

@tobytkershaw

Description


Proposal

Write a long-form technical blog post (~2,500 words) designed to bring AgentVault to the attention of developers and AI safety/governance people. Publish on the website (add a /blog route to the Astro site), then submit to HN and send directly to relevant people.

Target audience

  • HN front page readers (developers building multi-agent systems)
  • AI safety/governance researchers (ARIA, NIST AI framework, EU AI Act compliance)
  • Agent framework maintainers (CrewAI, AutoGen, LangGraph)
  • A2A working group contacts

Working title

"Your AI Agent Just Leaked Your Salary Expectations"

HN submission title variant: "We red-teamed AI agent negotiations — the model leaked salary expectations every time"

Structure

1. The Hook — A Negotiation Gone Wrong (~400 words)

Open with the salary negotiation demo scenario (#6). Two agents, each holding private context. Alice's agent knows her floor is £88K. Bob's agent knows his budget stretches to £98K. They're supposed to check compatibility — not exchange numbers.

Run it with free-text output (schema v1). Quote the red team results: the model disclosed Alice's exact range. This isn't hypothetical — it happened in testing.

Key line: "The model did exactly what it was trained to do. That's the problem."

2. Why "Be Careful" Doesn't Work (~400 words)

Information-theoretic argument. A free-text channel has unbounded capacity. Prompt instructions are advisory. You can't audit what didn't leak — you can only audit what was possible to leak.

Even with JSON mode, a free-text `"reasoning"` string field can carry arbitrary information. The constraint has to live in the schema design, not the output format.
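To make the point concrete, here is a minimal sketch (hypothetical field names and a hand-rolled validator, not AgentVault's actual schemas or relay code) of why an enum-only schema closes the side channel that a free-text field leaves open:

```python
# Hypothetical illustration only -- not AgentVault's real schema or validator.
# Every output field is restricted to a closed enum; any string-typed field
# would reintroduce an unbounded channel.

ENUM_SCHEMA = {
    "compatibility": {"STRONG_ALIGNMENT", "PARTIAL_OVERLAP", "NO_OVERLAP"},
    "confidence": {"LOW", "MEDIUM", "HIGH"},
}

def validate(output: dict, schema: dict) -> bool:
    """Accept output only if every field is a declared enum member."""
    if set(output) != set(schema):
        return False
    return all(output[k] in schema[k] for k in schema)

# Conforming output: carries only a few bits.
assert validate({"compatibility": "NO_OVERLAP", "confidence": "HIGH"}, ENUM_SCHEMA)

# A model trying to smuggle a number has nowhere to put it:
# the extra field fails validation outright.
leak = {"compatibility": "NO_OVERLAP", "confidence": "HIGH",
        "reasoning": "Alice's floor is £88K"}
assert not validate(leak, ENUM_SCHEMA)
```

The same check rejects a conforming field whose value falls outside the enum, so "be careful" never enters into it: non-conforming output simply never leaves the relay.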

3. The Fix: Bound the Channel, Not the Model (~500 words)

Walk through what AgentVault does, using the salary scenario. Show the actual schema and contract from the demo scenario config (not a standalone script):

  • Contract: both agents agree on purpose, output schema, prompt template, guardian policy — before any context is shared
  • Relay: assembles prompt, calls model, validates output against schema, rejects non-conforming
  • Receipt: cryptographic proof of what schema governed the exchange
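The receipt step could be illustrated with a short sketch: a digest that binds the governing schema to the validated output, so an auditor can later verify which schema constrained the exchange. This is a hypothetical construction for the post (SHA-256 over canonicalized JSON); AgentVault's actual receipt format and signing scheme are not shown here.

```python
import hashlib
import json

def canonical(obj) -> bytes:
    """Deterministic JSON encoding so the digest is reproducible."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def make_receipt(schema: dict, output: dict) -> str:
    """Bind schema and output together in one digest (illustrative only)."""
    return hashlib.sha256(canonical({"schema": schema, "output": output})).hexdigest()

schema = {"compatibility": ["STRONG_ALIGNMENT", "PARTIAL_OVERLAP", "NO_OVERLAP"]}
output = {"compatibility": "NO_OVERLAP"}
receipt = make_receipt(schema, output)

# Any change to the schema or the output changes the digest,
# so the receipt pins down exactly what governed the exchange.
assert receipt != make_receipt(schema, {"compatibility": "STRONG_ALIGNMENT"})
```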

The schema has ~12 bits of channel capacity. The model can say "STRONG_ALIGNMENT" or "NO_OVERLAP" — it cannot say "Alice's floor is £88K."
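The capacity arithmetic is worth showing in the post. With hypothetical field names (the real schema's fields will differ), six enum fields of four options each give exactly 12 bits, far too little to encode a salary figure:

```python
import math

# Hypothetical enum fields for illustration; the demo schema's actual
# fields differ, but the arithmetic is the point:
# capacity = sum over fields of log2(number of enum options).
fields = {
    "compatibility": 4,
    "salary_band_overlap": 4,
    "seniority_fit": 4,
    "location_fit": 4,
    "next_step": 4,
    "confidence": 4,
}

bits = sum(math.log2(n) for n in fields.values())
print(bits)  # 12.0 -- i.e. 4**6 = 4096 distinct possible outputs, total

# Compare: a single free-text sentence runs to hundreds of bits,
# more than enough to spell out "Alice's floor is £88K".
```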

4. But Can You Trust the Relay? (~400 words)

Honest limitation: in the software lane, the relay sees plaintext. SELF_ASSERTED means the relay says it followed the rules.

Then: the TEE lane. Same protocol, AMD SEV-SNP confidential VM. Relay operator can't see inputs. Receipt is hardware-attested. Validated on GCP N2D.

Assurance tier diagram here — not as marketing, as an honest statement of what each level proves.

5. The Red Team Results (~300 words)

Actual numbers from docs/red-team-report-2026-02-25.md:

  • Schema v1 (free-text): models leaked exact investment ranges
  • Schema v2 (all-enum): zero leaks across seven scenarios, two models, and multiple runs

Empirical proof that bounding the channel works — not because the model was told to be careful, but because the schema couldn't carry the information.

6. Try It (~200 words)

Point at the demo: clone the repo, add an API key, run `docker compose up --build`, open `localhost:3200`, pick Salary Negotiation, and toggle canary checking on.

Invite reader to run the adversarial extraction scenario (#13) and try to make the protocol leak.
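Canary checking itself deserves a two-line explanation in the post. The idea, sketched below with hypothetical code (the demo's actual implementation may differ), is to plant a random marker inside the private context; any appearance of that marker in the output is hard evidence of leakage:

```python
import secrets

# Illustrative canary check -- not the demo's actual implementation.
# A random token is planted in the private context; the output is then
# scanned for it. A hit proves the channel carried private material.

canary = f"CANARY-{secrets.token_hex(8)}"
private_context = f"Alice's salary floor is £88K. [{canary}]"

def leaked(output: str) -> bool:
    """True if the planted canary appears in the agent's output."""
    return canary in output

assert not leaked('{"compatibility": "NO_OVERLAP"}')
assert leaked(f"As discussed: {private_context}")
```

A canary only catches verbatim copying, of course, which is exactly why the post's argument rests on channel capacity rather than on leak detection.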

Closing line: "If your agents are talking to other agents, the question isn't whether they'll disclose something they shouldn't. It's whether the channel they're talking over makes that physically possible."

What this needs

  • Add /blog route to the Astro site
  • Draft the post text
  • Pull schema/contract excerpts from the demo scenario config (no standalone code to build)
  • Format red team numbers for citation
  • Ensure the Docker demo path is solid (depends on agentvault#322, #328)
  • Identify 10-15 people to send it to directly (AI safety researchers, framework maintainers, A2A contacts)

Why this piece, not the website concept pages

The website explains the architecture — it's reference material for people who already care. This piece tells a story with a villain (the helpful model that leaks your secrets) and a plot twist (the fix isn't telling the model to stop — it's making leakage physically impossible). It's designed to make people care in the first place.

Context

  • A2A bootstrap support just shipped (agentvault#301-303) — positions AgentVault as complementary A2A infrastructure
  • Demo runtime hardened in agentvault#322, #328
  • Red team results already documented
  • All 15 demo scenarios already built and working
