Add app-development benchmark: task function, classifiers, and sample metrics #8

Merged
mongodben merged 1 commit into main from EAI-1614
Mar 18, 2026
mongodben commented Mar 18, 2026

Jira: https://jira.mongodb.org/browse/EAI-1614

Changes

  • Implement makeGenerateAppResponseTask — the main task function that orchestrates the app-development eval pipeline:
    1. generate response
    2. classify stack
    3. self-reflect
    4. analyze database choice
  • Add sampleSize param to support running multiple samples per eval case, to account for model non-determinism
  • Add pass@k, pass%k, and pass^k sample metrics to mongodb-rag-core/eval for computing success probabilities across samples
  • Update AppDevelopmentTaskOutput to use a { samples: AppDevelopmentSample[] } structure
  • Update PrimaryDatabaseIsMongoDb and MentionsMongoDbInGeneration scorers to return all three pass*k metrics per dimension
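
For readers unfamiliar with the sample metrics, here is a minimal sketch of two of them, assuming the commonly used combinatorial definitions: pass@k as the probability that at least one of k samples drawn from n succeeds, and pass^k as the probability that all k succeed. The function names and signatures are hypothetical and are not taken from mongodb-rag-core/eval/sampleMetrics, whose implementation may differ. pass%k is left out because this PR description does not define it.

```typescript
// Hypothetical helpers for illustration; not the mongodb-rag-core API.

// Binomial coefficient C(n, k), computed iteratively. Returns 0 when k is out of range.
function binomial(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

function assertValid(n: number, c: number, k: number): void {
  if (!Number.isInteger(n) || !Number.isInteger(c) || !Number.isInteger(k)) {
    throw new RangeError("n, c, and k must be integers");
  }
  if (n < 1 || c < 0 || c > n || k < 1 || k > n) {
    throw new RangeError("expected 0 <= c <= n and 1 <= k <= n");
  }
}

// pass@k: probability that at least one of k samples (drawn without
// replacement from n samples, c of which passed) is a pass.
function passAtK(n: number, c: number, k: number): number {
  assertValid(n, c, k);
  return 1 - binomial(n - c, k) / binomial(n, k);
}

// pass^k: probability that all k drawn samples are passes.
function passHatK(n: number, c: number, k: number): number {
  assertValid(n, c, k);
  return binomial(c, k) / binomial(n, k);
}
```

Under these definitions, both functions reduce to the plain pass rate c / n when k = 1, which is consistent with the note that the metrics collapse to the same value when sampleSize=1.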

Notes

  • pass@k, pass%k, pass^k live in mongodb-rag-core/eval/sampleMetrics so they're reusable across benchmark packages
  • When sampleSize=1 (default), all three metrics collapse to the same value — no behavioral change for existing usage
  • The task function runs steps 2 (classify) and 3 (self-reflect) in parallel, then runs step 4 (analyze) sequentially, since it depends on the classified database
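
The step ordering described above could be sketched roughly as follows. This is an illustration only: the `PipelineSteps` interface, the `Sample` shape, and the `runOneSample` and `runTask` names are hypothetical stand-ins, not the signatures of makeGenerateAppResponseTask or AppDevelopmentSample. Running the samples concurrently is also an assumption; the PR does not say how samples are scheduled.

```typescript
// Hypothetical types for illustration; the real task and sample types may differ.
interface PipelineSteps {
  generate(prompt: string): Promise<string>;
  classifyStack(response: string): Promise<{ database: string }>;
  selfReflect(response: string): Promise<string>;
  analyzeDatabaseChoice(response: string, database: string): Promise<string>;
}

interface Sample {
  response: string;
  database: string;
  reflection: string;
  databaseAnalysis: string;
}

async function runOneSample(prompt: string, steps: PipelineSteps): Promise<Sample> {
  // Step 1: generate the response.
  const response = await steps.generate(prompt);

  // Steps 2 and 3 only need the response, so they can run in parallel.
  const [stack, reflection] = await Promise.all([
    steps.classifyStack(response),
    steps.selfReflect(response),
  ]);

  // Step 4 needs the classified database, so it waits for step 2.
  const databaseAnalysis = await steps.analyzeDatabaseChoice(response, stack.database);

  return { response, database: stack.database, reflection, databaseAnalysis };
}

// Runs sampleSize independent samples for one eval case.
// Assumption: samples are independent and may run concurrently.
async function runTask(
  prompt: string,
  steps: PipelineSteps,
  sampleSize: number = 1
): Promise<{ samples: Sample[] }> {
  const samples = await Promise.all(
    Array.from({ length: sampleSize }, () => runOneSample(prompt, steps))
  );
  return { samples };
}
```

With the default sampleSize of 1, the output holds a single-element samples array, so downstream scorers can treat the one-sample and multi-sample cases uniformly.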

Generated with Claude Code

vercel bot commented Mar 18, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
ai-benchmarks Error Error Mar 18, 2026 3:12pm

@mongodben mongodben merged commit 36e544e into main Mar 18, 2026
3 of 4 checks passed