(EAI-1237): NL2AtlasSearch prompt optimization + claude code optimization by mongodben · Pull Request #3 · mongodb/ai-benchmarks

mongodben · 2025-10-07T20:59:34Z

Jira: https://jira.mongodb.org/browse/EAI-1237

Changes

- NL2AtlasSearch prompt optimization
  - Added 'maximalist' prompt with thorough guidance
  - Added 'optimized' prompt with guidance for maximum benchmark performance
- Claude Code repo setup (you can reasonably ignore this in the review)
- Bug fixes to benchmark CLI

Notes

GPT-5 results for different prompting strategies (Braintrust link):

[
  // maximal prompt
  {
    "name": "nl_to_atlas_search?experimentType=agentic_prompt_maximal_recommendation&model=gpt-5&datasets=simple_english_wikipedia",
    "eXNeON": 0.8574313418984933,
    "NDCG@10": 0.7086369322198238,
    "NonEmptyArrayOutput": 0.891156462585034,
    "SearchOperatorUsed": 0.9319727891156463,
    "SuccessfulExecution": 0.8979591836734694,
    "num_examples": "153",
    "error_rate": 0,
    "num_errors": "5",
    "duration": 250.30080919830422,
    "llm_duration": 25.0676878828614,
    "prompt_tokens": 56558,
    "completion_tokens": 16582,
    "total_tokens": 73140,
    "last_updated": 1760040236675,
    "metadata": {
      "task": "agentic_prompt_maximal_recommendation",
      "model": "gpt-5",
      "dataset": "simple_english_wikipedia"
    },
    "id": "39cc52a9-a5c6-492e-add0-e7aa2a08bcbb",
    "Model": "gpt-5",
    "Task": "agentic_prompt_maximal_recommendation"
  },
  // simple prompt
  {
    "name": "nl_to_atlas_search?experimentType=agentic&model=gpt-5&datasets=simple_english_wikipedia-5c9789a3",
    "eXNeON": 0.8434604329994794,
    "NDCG@10": 0.7275832285965571,
    "NonEmptyArrayOutput": 0.8503401360544217,
    "SearchOperatorUsed": 0.891156462585034,
    "SuccessfulExecution": 0.9047619047619048,
    "null": null,
    "num_examples": "153",
    "error_rate": 0,
    "num_errors": "5",
    "duration": 187.33616447919295,
    "llm_duration": 21.070697993086004,
    "prompt_tokens": 39163,
    "completion_tokens": 11531,
    "total_tokens": 50694,
    "last_updated": 1760039408254,
    "metadata": {
      "task": "agentic",
      "model": "gpt-5",
      "dataset": "simple_english_wikipedia"
    },
    "description": null,
    "id": "fd5350f4-cd3f-4ffc-9958-1bedd804cc77",
    "dataset": null,
    "tags": null,
    "Model": "gpt-5",
    "Task": "agentic"
  },
  // optimized prompt
  {
    "name": "nl_to_atlas_search?experimentType=agentic_prompt_optimized_recommendation&model=gpt-5&datasets=simple_english_wikipedia-a79dff15",
    "eXNeON": 0.8805962495947711,
    "NDCG@10": 0.7475505612929915,
    "NonEmptyArrayOutput": 0.9139072847682119,
    "SearchOperatorUsed": 0.9403973509933775,
    "SuccessfulExecution": 0.9205298013245033,
    "null": null,
    "num_examples": "153",
    "error_rate": 0,
    "num_errors": "1",
    "duration": 169.67061183170267,
    "llm_duration": 18.546299081233357,
    "prompt_tokens": 42477,
    "completion_tokens": 13022,
    "total_tokens": 55499,
    "last_updated": 1760037818891,
    "metadata": {
      "task": "agentic_prompt_optimized_recommendation",
      "model": "gpt-5",
      "dataset": "simple_english_wikipedia"
    },
    "description": null,
    "id": "f26e22c5-bee0-40bf-927f-5b800f7cf665",
    "dataset": null,
    "tags": null,
    "Model": "gpt-5",
    "Task": "agentic_prompt_optimized_recommendation"
  }
]
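To make the three runs easier to compare, here is a minimal sketch that computes each prompt's NDCG@10 change and token cost relative to the simple prompt. The metric values are copied from the results above and rounded to four places. The `ExperimentSummary` type and the `compareToBaseline` helper are illustrative names, not part of the benchmark code.

```typescript
// Selected metrics copied from the GPT-5 results above (rounded to 4 places).
interface ExperimentSummary {
  label: string;
  ndcgAt10: number;
  successfulExecution: number;
  totalTokens: number;
}

interface BaselineComparison {
  label: string;
  ndcgDelta: number; // NDCG@10 minus the baseline's NDCG@10
  tokenRatio: number; // total tokens divided by the baseline's total tokens
}

const experiments: ExperimentSummary[] = [
  { label: "maximal", ndcgAt10: 0.7086, successfulExecution: 0.898, totalTokens: 73140 },
  { label: "simple", ndcgAt10: 0.7276, successfulExecution: 0.9048, totalTokens: 50694 },
  { label: "optimized", ndcgAt10: 0.7476, successfulExecution: 0.9205, totalTokens: 55499 },
];

// Hypothetical helper: compares every run against one baseline run and
// returns the rows sorted by NDCG@10 improvement, best first.
function compareToBaseline(
  all: ExperimentSummary[],
  baselineLabel: string
): BaselineComparison[] {
  const baseline = all.find((e) => e.label === baselineLabel);
  if (baseline === undefined) {
    throw new Error(`No experiment labeled "${baselineLabel}"`);
  }
  const rows = all.map((e) => ({
    label: e.label,
    ndcgDelta: e.ndcgAt10 - baseline.ndcgAt10,
    tokenRatio: e.totalTokens / baseline.totalTokens,
  }));
  return [...rows].sort((a, b) => b.ndcgDelta - a.ndcgDelta);
}

for (const row of compareToBaseline(experiments, "simple")) {
  console.log(
    `${row.label}: NDCG@10 delta ${row.ndcgDelta.toFixed(4)}, ` +
      `token ratio ${row.tokenRatio.toFixed(2)}x`
  );
}
```

On these numbers, the optimized prompt ranks first on NDCG@10 while using fewer tokens than the maximal prompt, and the maximal prompt scores below the simple prompt on NDCG@10 despite using the most tokens.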

hschawe

i've got a couple questions before approving

hschawe · 2025-10-09T21:16:18Z

packages/benchmarks/src/textToDriver/generateDriverCode/languagePrompts/atlasSearch.ts

-You may use the available tools to help you explore the database, generate the query, think about the problem, and submit the final solution.
+const tools = `<tools>

-<tool name="${thinkToolName}">


have you noticed any performance loss in tool calling for non-gpt models after removing the tool descriptions here?

i haven't, no, but i also haven't specifically looked into that

since the "tool instructions in tool description" guidance came from openai, i'm concerned that this change could cause performance losses in non-gpt models that are unrelated to the "mongodb knowledge" of those models. something to keep in mind when running these benchmarks in the future

yea valid point. worth measuring (in the future)

like we could be stacking the deck in openai's favor w/ this approach
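For readers following this thread, the two placements under discussion can be sketched roughly as below. This is an assumption-laden illustration, not the code in `atlasSearch.ts`: the `ToolSpec` shape, the `"think"` value, the description text, and both builder functions are hypothetical. Only the `thinkToolName` identifier and the `<tools>` / `<tool name="...">` markup appear in the diff.

```typescript
// Hypothetical tool shape for illustration; not the repo's or any SDK's type.
interface ToolSpec {
  name: string;
  description: string;
}

// Assumed value; the PR only shows the identifier `thinkToolName`.
const thinkToolName = "think";

const thinkTool: ToolSpec = {
  name: thinkToolName,
  // Assumed wording for the sake of the example.
  description: "Write out your reasoning before generating or submitting a query.",
};

// Placement A: usage guidance lives only in each tool's `description`,
// so the system prompt carries no per-tool section.
function buildPromptToolGuidanceInDescriptions(basePrompt: string): string {
  return basePrompt;
}

// Placement B: the same guidance is also rendered into the system prompt
// inside a <tools> block, one <tool> element per tool.
function buildPromptWithToolSection(basePrompt: string, tools: ToolSpec[]): string {
  const toolBlocks = tools
    .map((t) => `<tool name="${t.name}">\n${t.description}\n</tool>`)
    .join("\n");
  return `${basePrompt}\n\n<tools>\n${toolBlocks}\n</tools>`;
}

const base = "You generate MongoDB Atlas Search queries.";
console.log(buildPromptToolGuidanceInDescriptions(base));
console.log(buildPromptWithToolSection(base, [thinkTool]));
```

Running the same benchmark with both placements across GPT and non-GPT models would be one way to measure the concern raised here.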

hschawe · 2025-10-09T21:17:27Z

packages/benchmarks/src/textToDriver/generateDriverCode/mongoDbMcpAgent.ts

@@ -131,7 +131,6 @@ export async function makeMongoDbMcpAgent({
  mcpToolSet[thinkToolName] = thinkTool;


do you think we should exclude this when using reasoning models?

no, i think it should be kept 1. for consistency and 2. b/c the reasoning model might still derive value from explicitly writing thoughts out (this was mentioned as useful in a blog post i read)
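If the question ever needs to be measured rather than argued, one option is a flag that defaults to keeping the think tool, as decided here, but lets a benchmark run exclude it. The sketch below is hypothetical: `buildToolSet`, `includeThinkTool`, the `ToolSet` type, and the tool contents are assumptions. Only the `mcpToolSet[thinkToolName] = thinkTool` assignment comes from the diff.

```typescript
// Hypothetical minimal tool-set type; the real agent uses its SDK's types.
type ToolSet = Record<string, { description: string }>;

const thinkToolName = "think"; // assumed value
const thinkTool = { description: "Write out your reasoning step by step." }; // assumed

interface BuildToolSetOptions {
  // Hypothetical flag. Defaults to true so every model gets the same tools.
  includeThinkTool?: boolean;
}

function buildToolSet(mcpTools: ToolSet, options: BuildToolSetOptions = {}): ToolSet {
  const { includeThinkTool = true } = options;
  // Copy so the caller's MCP tool map is not mutated.
  const mcpToolSet: ToolSet = { ...mcpTools };
  if (includeThinkTool) {
    mcpToolSet[thinkToolName] = thinkTool;
  }
  return mcpToolSet;
}

const mcpTools: ToolSet = { find: { description: "Run a find query." } };
console.log(Object.keys(buildToolSet(mcpTools)));
console.log(Object.keys(buildToolSet(mcpTools, { includeThinkTool: false })));
```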

hschawe · 2025-10-09T21:24:48Z

packages/benchmarks/src/textToDriver/nltoAtlasSearchBenchmarkConfig.ts

+          systemPrompt: atlasSearchAgentPromptWithOptimizedRecommendation,
+          maxSteps: ATLAS_SEARCH_AGENT_MAX_STEPS,
+          mongoClient,
+          mongoDbMcpClient: mcpClient,


nitpick-y, but consider renaming mcpClient to mongoDbMcpClient for simplicity

Ben Perlmutter added 30 commits August 12, 2025 15:38

stub out various commponents

dee417d

stub out implementation

f37e244

Executor refactor

33a830e

execute aggregation

a77bb1d

all works except rewrite

691a1c9

pipeline working e2e

3b8c417

Remove rewrite step

6cb984b

working pipeline

d39f469

add subsequent PR eval

b53713c

fix build errs

c32a21c

fix broken test

0118346

Agent improvements on sample data

6e4ef17

checkpoint w/ ai sdk update

1ef64c8

checkpoint w/ ai sdk update

32a775d

validate model name

248f98f

handle google gemini tool calling

7643787

improve scorers

749a5e5

prompt tweaks

840ea1e

tool tweaks and tracing

58b4c00

Fix tests

c5c546a

add antipattern notes

bc014e1

start refining pipeline

2099075

sloppy checkpoint

3bfeb19

Merge remote-tracking branch 'upstream/main' into EAI-1231

ba82640

update lock

682400a

config

ae4cb1a

Merge remote-tracking branch 'upstream/main' into EAI-1231

814075f

checkpoint

78c6a9c

Merge remote-tracking branch 'upstream/main' into EAI-1231

e548af9

gen full benchmark

555e5d4

Ben Perlmutter added 7 commits August 26, 2025 11:54

more robust err handling

f065f5c

first draft of prompt

645d85f

Merge remote-tracking branch 'upstream/main' into EAI-1231

2d8186c

Merge branch 'EAI-1231' into EA-1237

8fe70bc

Merge remote-tracking branch 'upstream/main' into EAI-1237

755eb2c

prompt tweaks

eaa4122

fix atlas search prompt

e410446

mongodben temporarily deployed to test-ci October 7, 2025 20:59 — with GitHub Actions Inactive

Merge remote-tracking branch 'upstream/main' into EAI-1237

679a917

mongodben temporarily deployed to test-ci October 7, 2025 21:03 — with GitHub Actions Inactive

mongodben changed the title ~~Eai 1237~~ (EAI-1237): NL2AtlasSearch prompt optimization Oct 7, 2025

Ben Perlmutter added 4 commits October 8, 2025 10:59

claude code set up

c17d861

Add claude 4.5 + remove dupes

c914b4e

prompt refinement + bug fixes

6b08d62

prompt refinements

f567815

mongodben temporarily deployed to test-ci October 8, 2025 21:26 — with GitHub Actions Inactive

Ben Perlmutter added 2 commits October 9, 2025 11:17

add optimized prompt example

0236518

refine prompts

544add5

mongodben temporarily deployed to test-ci October 9, 2025 18:45 — with GitHub Actions Inactive

mongodben changed the title ~~(EAI-1237): NL2AtlasSearch prompt optimization~~ (EAI-1237): NL2AtlasSearch prompt optimization + claude code optimization Oct 9, 2025

mongodben marked this pull request as ready for review October 9, 2025 18:47

fix runner at root

475a6c8

mongodben temporarily deployed to test-ci October 9, 2025 19:19 — with GitHub Actions Inactive

hschawe self-requested a review October 9, 2025 21:29

hschawe reviewed Oct 9, 2025

variable rename

80c0d95

mongodben temporarily deployed to test-ci October 10, 2025 14:31 — with GitHub Actions Inactive

mongodben requested a review from hschawe October 10, 2025 14:31

hschawe approved these changes Oct 10, 2025

mongodben merged commit a6bf10a into main Oct 10, 2025
1 check passed

mongodben merged 51 commits into main from EAI-1237

2 participants