Add app-development benchmark config and CLI integration by mongodben · Pull Request #9 · mongodb/ai-benchmarks

mongodben · 2026-03-18T20:30:03Z

Changes

Add benchmark config for app_development and register it in the benchmark CLI
Load 104 eval cases from datasets/app-development.yml with dataset splits: all, mongodb_optimal, db_agnostic
Add system prompt variants in prompts.ts: none, generic_coding_assistant, mongodb_recommended, system_architect, stack_agnostic
Each prompt variant registers as a separate task in the CLI (e.g. simple_prompt_completion, prompt_system_architect)
Wire the subject model through the Braintrust proxy; the judge model uses gpt-5.4
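A rough sketch of how the dataset splits and per-variant task registration could fit together. The PR's actual types and helpers are not shown here, so `EvalCase`, its `category` field, `selectSplit`, `taskName`, and `buildTasks` are all hypothetical names. The naming rule (`none` maps to `simple_prompt_completion`, every other variant to `prompt_<variant>`) is only inferred from the two example task names above and may not match the real CLI.

```typescript
// Hypothetical sketch: none of these names are confirmed by the PR.
type Split = "all" | "mongodb_optimal" | "db_agnostic";

// Prompt variant names as listed in the PR description.
const PROMPT_VARIANTS = [
  "none",
  "generic_coding_assistant",
  "mongodb_recommended",
  "system_architect",
  "stack_agnostic",
] as const;
type PromptVariant = (typeof PROMPT_VARIANTS)[number];

interface EvalCase {
  name: string;
  prompt: string;
  // Assumption: each case carries a tag that decides its split membership.
  category: Exclude<Split, "all">;
}

// "all" returns every case; any other split filters on the assumed tag.
function selectSplit(cases: EvalCase[], split: Split): EvalCase[] {
  return split === "all" ? cases : cases.filter((c) => c.category === split);
}

// Assumed naming rule, inferred from the two example task names in the PR.
function taskName(variant: PromptVariant): string {
  return variant === "none" ? "simple_prompt_completion" : `prompt_${variant}`;
}

interface BenchmarkTask {
  name: string;
  variant: PromptVariant;
}

// One task per prompt variant, so each can be selected separately in a CLI.
function buildTasks(): BenchmarkTask[] {
  return PROMPT_VARIANTS.map((variant) => ({ name: taskName(variant), variant }));
}

// Tiny in-memory stand-in for the cases loaded from the YAML dataset.
const sampleCases: EvalCase[] = [
  { name: "case-a", prompt: "Build a todo app", category: "mongodb_optimal" },
  { name: "case-b", prompt: "Build a product catalog", category: "mongodb_optimal" },
  { name: "case-c", prompt: "Build a ledger service", category: "db_agnostic" },
];

console.log(buildTasks().map((t) => t.name));
console.log(selectSplit(sampleCases, "db_agnostic").map((c) => c.name));
```

Keeping the variant list as a single `as const` array means the task list and the `PromptVariant` type cannot drift apart when a variant is added.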

The subject model calls .chat() through the Braintrust proxy because .responses() has translation issues for non-OpenAI providers (Claude, Gemini)
generic_coding_assistant prompt uses "production-ready" language to encourage models to include a real database
system_architect variant focuses on design reasoning over code, giving classifiers more signal to analyze
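To illustrate the notes above, here is a minimal sketch of a Chat Completions style request aimed at a proxy, with the system message omitted when no prompt variant text is supplied. The base URL is an assumption about the Braintrust proxy endpoint, and `buildChatRequest`, its option names, and the model string are hypothetical; the PR's real wiring is not shown and may use an SDK client instead.

```typescript
// Assumption: Braintrust proxy base URL; verify against your own setup.
const BRAINTRUST_PROXY_BASE_URL = "https://api.braintrust.dev/v1/proxy";

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds a Chat Completions style request for the proxy.
function buildChatRequest(opts: {
  apiKey: string;
  model: string;
  systemPrompt?: string; // left undefined for the "none" variant
  userPrompt: string;
}): ChatRequest {
  const messages: ChatMessage[] = [];
  // Only include a system message when the variant actually supplies one.
  if (opts.systemPrompt) {
    messages.push({ role: "system", content: opts.systemPrompt });
  }
  messages.push({ role: "user", content: opts.userPrompt });

  return {
    url: `${BRAINTRUST_PROXY_BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${opts.apiKey}`,
    },
    body: JSON.stringify({ model: opts.model, messages }),
  };
}

// Example: a request with a system prompt; the model name is a placeholder.
const exampleRequest = buildChatRequest({
  apiKey: "YOUR_API_KEY",
  model: "some-subject-model",
  systemPrompt: "You are a system architect. Explain your design reasoning.",
  userPrompt: "Design a backend for a recipe sharing app.",
});
console.log(exampleRequest.url);
```

In a runtime with a global `fetch`, the returned object could be passed along as `fetch(exampleRequest.url, exampleRequest)`; the sketch stops short of sending anything so it runs without network access or credentials.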

Generated with Claude Code

vercel · 2026-03-18T20:30:08Z


Project	Deployment	Actions	Updated (UTC)
ai-benchmarks	Error		Mar 18, 2026 8:30pm

mongodben added 2 commits March 18, 2026 14:09

create harness

529520d

benchmark in harness

3b58790

mongodben temporarily deployed to test-ci with GitHub Actions on March 18, 2026 20:30 (inactive)

mongodben merged commit b3f241b into main Mar 18, 2026
3 of 4 checks passed