fix: update benchmark questions for CLI and databases #13

Merged
Meldiron merged 9 commits into main from fix-benchmark-questions
Mar 5, 2026

Conversation

@ChiragAgg5k
Member

@ChiragAgg5k ChiragAgg5k commented Mar 4, 2026

Summary

  • Update cli-4 correct answer to appwrite.config.json (replaces appwrite.json as the main config file)
  • Replace all appwrite.json references with appwrite.config.json across existing questions
  • Add new questions:
    • cli-21: appwrite.json legacy fallback behavior
    • cli-22/23: appwrite types command and --strict flag
    • cli-24: --with-variables flag for pushing environment variables
    • db-21: Integer attribute min/max limits (64-bit signed)
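
The precedence that cli-4 and cli-21 are meant to test can be sketched roughly as follows. This is a minimal illustration, assuming the CLI prefers appwrite.config.json and falls back to the legacy appwrite.json only when the newer file is absent; `resolveConfigFile` is a hypothetical helper and not part of the Appwrite CLI.

```typescript
// Hypothetical helper: not part of the Appwrite CLI.
// Candidate file names in the assumed precedence order, newest first.
const CONFIG_CANDIDATES = ["appwrite.config.json", "appwrite.json"];

// Returns the first candidate present in the project, or null if none is.
function resolveConfigFile(filesInProject: string[]): string | null {
  for (const candidate of CONFIG_CANDIDATES) {
    if (filesInProject.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return null;
}
```

With both files present, the sketch picks appwrite.config.json; with only the legacy file present, it picks appwrite.json.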

Test plan

  • Verify benchmark builds successfully
  • Run benchmark questions to confirm new questions render correctly

Summary by CodeRabbit

  • New Features

    • Added eight new CLI benchmark questions covering configuration precedence, types, strict mode, missing env vars, defaults, generation, and run vs push.
    • Added three new database benchmark questions.
  • Updates

    • Standardized configuration filename references across CLI questions and updated prompts/answers for consistency.
    • Updated model integration to a chat-based SDK, improving generation reliability and response handling.

- Update cli-4 to use appwrite.config.json as the main config file
- Replace all appwrite.json references with appwrite.config.json
- Add cli-21: appwrite.json legacy fallback behavior
- Add cli-22/23: appwrite types command and --strict flag
- Add cli-24: --with-variables flag for pushing env variables
- Add db-21: integer attribute min/max limits (64-bit signed)
@appwrite

appwrite bot commented Mar 4, 2026

Appwrite Arena

Project ID: appwrite-arena

Sites (1)
Site | Status | Logs | Preview | QR
Arena (arena) | Ready | View Logs | Preview URL | QR Code

Tip

GraphQL API works alongside REST and WebSocket protocols

@coderabbitai

coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaced references to appwrite.json with appwrite.config.json across CLI question entries and large dataset JSON files; updated related prompts, workflows, and rubric text. Added eight new CLI MCQs (cli-21 to cli-28) and three new database MCQs (db-21 to db-23). Integrated the OpenRouter SDK (@openrouter/sdk) and replaced direct HTTP fetch calls with SDK calls in benchmark/src/judge.ts and benchmark/src/runner.ts, including adjustments to request/response shapes. Renamed interface fields from tool_calls/tool_call_id to camelCase (toolCalls/toolCallId) and updated ApiResponse accordingly. No exported function signatures were removed or added.
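
The snake_case to camelCase rename described in the walkthrough might look like the sketch below. The `WireMessage`, `Message`, and `ToolCall` shapes are assumptions inferred from the field names alone; they are not the actual interfaces in benchmark/src/runner.ts.

```typescript
// Assumed shapes: only the field names tool_calls/tool_call_id and
// toolCalls/toolCallId come from the walkthrough.
interface ToolCall {
  id: string;
  name: string;
}

interface WireMessage {
  role: string;
  content: string | null;
  tool_calls?: ToolCall[];
  tool_call_id?: string;
}

interface Message {
  role: string;
  content: string | null;
  toolCalls?: ToolCall[];
  toolCallId?: string;
}

// Copies a snake_case message into the camelCase shape, omitting absent fields.
function toCamelCaseMessage(wire: WireMessage): Message {
  const message: Message = { role: wire.role, content: wire.content };
  if (wire.tool_calls !== undefined) message.toolCalls = wire.tool_calls;
  if (wire.tool_call_id !== undefined) message.toolCallId = wire.tool_call_id;
  return message;
}
```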

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and accurately summarizes the main change: updating benchmark questions related to CLI and databases configuration file references and adding new questions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-benchmark-questions

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@benchmark/src/questions/cli.ts`:
- Around line 304-311: The question `cli-24` incorrectly marks
`--with-variables` as the missing push flag; update the question's choices array
and the `correctAnswer` value in benchmark/src/questions/cli.ts so the correct
push flag is selected (replace the current correctAnswer "B" with "A" to point
to "--env" or otherwise substitute the actual push-specific flag), and ensure
the question text refers to `appwrite push functions` (not `run`) to match
Appwrite CLI behavior by editing the `choices` array and the `correctAnswer`
field for that question.
- Around line 125-127: The CLI command strings in the question/ rubric use the
incorrect singular form "appwrite init function" and "appwrite run function";
update those literal strings to the official plural form "appwrite init
functions" and "appwrite run functions" so the answer text and the rubric (the
two strings shown in the diff) match official documentation and will grade
correctly; locate and replace the occurrences of "appwrite init function" and
"appwrite run function" in the variables that hold the answer text and rubric to
use "functions" instead.
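
A check for the singular command forms flagged in these comments could be sketched as below. `findSingularCommands` is a hypothetical lint helper for question and rubric strings, not something that exists in the repository, and it assumes the plural forms are the documented ones, as the review states.

```typescript
// Hypothetical lint helper for question and rubric strings.
// Matches "appwrite init|run|push function" but not the plural "functions",
// because \b cannot match between "function" and a following "s".
const SINGULAR_COMMAND = /\bappwrite\s+(?:init|run|push)\s+function\b/g;

function findSingularCommands(text: string): string[] {
  const matches = text.match(SINGULAR_COMMAND);
  return matches === null ? [] : matches.slice();
}
```

Run over every answer, rubric, and question string, a non-empty result would point at text that still needs the plural form.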

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eaea98c4-bb13-4e7d-90ea-b1f39e711844

📥 Commits

Reviewing files that changed from the base of the PR and between 3e47c9d and 3bf6fec.

📒 Files selected for processing (2)
  • benchmark/src/questions/cli.ts
  • benchmark/src/questions/databases.ts

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
benchmark/src/questions/cli.ts (3)

360-360: ⚠️ Potential issue | 🟡 Minor

Use plural command names in cli-28 prompt for accuracy.

Line 360 currently uses singular command forms; align with the documented plural forms to avoid teaching incorrect syntax.

What are the official Appwrite CLI command forms: `appwrite run function` vs `appwrite run functions`, and `appwrite push function` vs `appwrite push functions`?
Proposed fix
-			"What's the key difference between `appwrite run function` and `appwrite push function`?",
+			"What's the key difference between `appwrite run functions` and `appwrite push functions`?",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/questions/cli.ts` at line 360, The prompt string currently uses
singular CLI command names ("appwrite run function" and "appwrite push
function"); update the prompt in the questions CLI (the string "What's the key
difference between `appwrite run function` and `appwrite push function`?") to
use the documented plural forms ("appwrite run functions" and "appwrite push
functions") so it matches official Appwrite CLI syntax and avoids teaching
incorrect commands.

304-311: ⚠️ Potential issue | 🟠 Major

cli-24 ties --with-variables to the wrong command context.

Line 304 asks about a missing push flag, but --with-variables is used with local run (appwrite run functions), not push.

In Appwrite CLI docs, is `--with-variables` a flag for `appwrite run functions` or `appwrite push functions`?
Proposed fix
-			"A function works locally but fails after pushing because environment variables are missing. What flag was likely missing from the push command?",
+			"A function works in cloud but fails during local execution because environment variables are missing. What flag was likely missing from the run command?",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/questions/cli.ts` around lines 304 - 311, The question maps the
run-only flag "--with-variables" to a push context; update the
choices/correctAnswer so the push flag is correct: change the correctAnswer from
"B" to the option representing "--env" and ensure the question text still asks
about a missing push flag; modify the `correctAnswer` value in the choices block
(and adjust any test metadata that references `cli-24` if present) so `--env` is
the correct answer for the push command scenario.

125-127: ⚠️ Potential issue | 🟠 Major

Use documented plural CLI commands in answer/rubric (functions, not function).

Line 125 and Line 127 still use singular forms, which can misgrade correct answers that follow official CLI syntax.

Appwrite docs: confirm exact commands for function scaffolding and local run (is it `appwrite init functions` and `appwrite run functions`?).
Proposed fix
-			"Run appwrite init function to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run function, then deploy with appwrite push functions.",
+			"Run appwrite init functions to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run functions, then deploy with appwrite push functions.",
...
-			"Must mention: 1) appwrite init function to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run function, 5) Deployment with appwrite push functions",
+			"Must mention: 1) appwrite init functions to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run functions, 5) Deployment with appwrite push functions",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/questions/cli.ts` around lines 125 - 127, Update the literal
CLI command text to use the documented plural form: replace "appwrite init
function" with "appwrite init functions" and "appwrite run function" with
"appwrite run functions" in the question and rubric strings (the entries that
currently contain those phrases and the `rubric` string shown). Ensure both the
user-facing instruction string and the rubric mention the plural commands so
grading matches official Appwrite CLI usage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 32f50e5e-e89e-4bad-9817-605cd11fd477

📥 Commits

Reviewing files that changed from the base of the PR and between 3bf6fec and bf82989.

📒 Files selected for processing (2)
  • benchmark/src/questions/cli.ts
  • benchmark/src/questions/databases.ts

"appwrite.json",
],
-correctAnswer: "D",
+correctAnswer: "C",
Contributor

Let's not re-order; god knows what it will do to existing results.
We can rename the correct answer, but let's not re-order.
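
One way to follow this request is to change a choice's text in place so that every letter keeps its position. A minimal sketch, assuming a question shape with a `choices` array and a letter-valued `correctAnswer`; `McqQuestion` and `replaceChoiceText` are illustrative names, not taken from the repository.

```typescript
// Assumed question shape, for illustration only.
interface McqQuestion {
  id: string;
  choices: string[];
  correctAnswer: string; // "A", "B", "C", or "D"
}

// Returns a copy of the question with one choice's text replaced.
// The order of choices and the correctAnswer letter are left untouched.
function replaceChoiceText(question: McqQuestion, letter: string, newText: string): McqQuestion {
  const index = letter.charCodeAt(0) - "A".charCodeAt(0);
  if (letter.length !== 1 || index < 0 || index >= question.choices.length) {
    throw new Error("Unknown choice letter: " + letter);
  }
  const choices = question.choices.slice();
  choices[index] = newText;
  return { id: question.id, choices: choices, correctAnswer: question.correctAnswer };
}
```

Because positions never move, results recorded against the letters A to D for earlier runs stay comparable.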

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
benchmark/src/questions/cli.ts (3)

125-127: ⚠️ Potential issue | 🟠 Major

Use official plural command names in the answer and rubric.

Line 125 and Line 127 still use singular forms (appwrite init function, appwrite run function), which can misgrade correct responses that follow documented CLI commands.

Suggested fix
-			"Run appwrite init function to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run function, then deploy with appwrite push functions.",
+			"Run appwrite init functions to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run functions, then deploy with appwrite push functions.",
...
-			"Must mention: 1) appwrite init function to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run function, 5) Deployment with appwrite push functions",
+			"Must mention: 1) appwrite init functions to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run functions, 5) Deployment with appwrite push functions",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/questions/cli.ts` around lines 125 - 127, Update the CLI
command wording from singular to the official plural forms: change occurrences
of "appwrite init function" to "appwrite init functions" and "appwrite run
function" to "appwrite run functions" (these appear in the prompt text and in
the rubric strings near the "Run appwrite..." and "Local development..." lines)
so the prompt and rubric use the documented command names consistently and won't
misgrade correct responses.

304-311: ⚠️ Potential issue | 🟠 Major

cli-24 maps --with-variables to the wrong command context.

The prompt frames this as a missing flag on push, but --with-variables is a run functions flag. This makes the question factually incorrect.

Suggested fix (keeps choice order and current correctAnswer)
-			"A function works locally but fails after pushing because environment variables are missing. What flag was likely missing from the push command?",
+			"A function works after deployment but fails during local execution because environment variables are missing. What flag was likely missing from the run command?",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/questions/cli.ts` around lines 304 - 311, The question text
(cli-24) incorrectly asks about a missing push flag while the correct answer
(--with-variables) is a run-functions flag; update the prompt wording to ask
about the run (or "run functions") command instead of push so the existing
choices and correctAnswer ("B") remain valid, e.g., change the question string
that mentions "push" to reference "run" or "run functions" while leaving the
choices array and correctAnswer intact.

360-360: ⚠️ Potential issue | 🟡 Minor

Use plural command forms in the cli-28 question text.

The question at line 360 uses singular forms (appwrite run function / appwrite push function), but the Appwrite CLI documentation specifies plural forms (appwrite run functions / appwrite push functions). This mismatch has caused confusion in the community.

Suggested fix
-			"What's the key difference between `appwrite run function` and `appwrite push function`?",
+			"What's the key difference between `appwrite run functions` and `appwrite push functions`?",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/questions/cli.ts` at line 360, The question string "What's the
key difference between `appwrite run function` and `appwrite push function`?"
uses singular CLI commands; update that string to use the correct plural CLI
forms from the docs by replacing `appwrite run function` with `appwrite run
functions` and `appwrite push function` with `appwrite push functions` (i.e.,
change the quoted question text in benchmark/src/questions/cli.ts to "What's the
key difference between `appwrite run functions` and `appwrite push
functions`?").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 64e2669d-f698-4b6b-9da8-5036cf5e111a

📥 Commits

Reviewing files that changed from the base of the PR and between bf82989 and 6015c6e.

📒 Files selected for processing (2)
  • benchmark/src/questions/cli.ts
  • benchmark/src/questions/databases.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • benchmark/src/questions/databases.ts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmark/src/runner.ts (1)

359-361: ⚠️ Potential issue | 🟠 Major

Do not mutate shared systemPrompt inside the processing loop.

Line 360 appends MCQ instructions directly to systemPrompt, so each MCQ permanently modifies later prompts (including non-MCQ), which can skew results.

🐛 Proposed fix
-        if (question.type === "mcq") {
-          systemPrompt += "\nTo answer, you must call the tool corresponding to the correct answer (e.g. answer_a, answer_b, answer_c, or answer_d). Do not respond with text.";
-        }
-
-        processQuestion(question, model, systemPrompt, tools, skillsMap, debug).then((result) => {
+        const questionSystemPrompt =
+          question.type === "mcq"
+            ? `${systemPrompt}\nTo answer, you must call the tool corresponding to the correct answer (e.g. answer_a, answer_b, answer_c, or answer_d). Do not respond with text.`
+            : systemPrompt;
+
+        processQuestion(question, model, questionSystemPrompt, tools, skillsMap, debug).then((result) => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 359 - 361, The code mutates the shared
systemPrompt inside the processing loop when question.type === "mcq", causing
subsequent prompts to be incorrectly modified; instead, create a per-iteration
prompt variable (e.g., let prompt = systemPrompt or const promptBase =
systemPrompt) and append the MCQ-specific instruction to that local prompt when
question.type === "mcq", then use that local prompt for downstream calls; do not
modify the original systemPrompt variable so non-MCQ questions remain unaffected
(look for usages of systemPrompt in the runner processing loop and where
question.type is checked).
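
The effect of deriving a per-question prompt instead of appending to the shared one can be seen in a small standalone sketch. `promptFor`, the placeholder base prompt, and the "free-form" type name are assumptions for illustration; only the MCQ instruction text is taken from the diff.

```typescript
const MCQ_INSTRUCTION =
  "\nTo answer, you must call the tool corresponding to the correct answer (e.g. answer_a, answer_b, answer_c, or answer_d). Do not respond with text.";

// Returns the prompt for one question without touching the shared base prompt.
function promptFor(basePrompt: string, questionType: string): string {
  return questionType === "mcq" ? basePrompt + MCQ_INSTRUCTION : basePrompt;
}

const basePrompt = "You are an Appwrite expert."; // placeholder text
const prompts = ["mcq", "mcq", "free-form"].map((type) => promptFor(basePrompt, type));
```

Both MCQ prompts carry the instruction exactly once, and the non-MCQ prompt is identical to the base prompt regardless of question order.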

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0cc80fed-0c20-4dd1-88a3-2a3a04bc3776

📥 Commits

Reviewing files that changed from the base of the PR and between 6015c6e and f91b595.

⛔ Files ignored due to path filters (1)
  • benchmark/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (3)
  • benchmark/package.json
  • benchmark/src/judge.ts
  • benchmark/src/runner.ts

Comment on lines +19 to +20
"dependencies": {
"@openrouter/sdk": "^0.9.11"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

OpenRouter TypeScript SDK official documentation version pinning recommendations

💡 Result:

OpenRouter’s official TypeScript SDK is explicitly labeled beta, and the maintainers warn that there may be breaking changes between versions without a major-version bump—so they recommend pinning to a specific @openrouter/sdk version (i.e., don’t rely on floating ranges like ^ if you want reproducible installs). [1][2]

Practical implication: install/update intentionally (e.g., npm i @openrouter/sdk@<exact-version>) and keep that exact version in your package.json so repeated installs don’t unexpectedly pick up breaking changes. [1]


🌐 Web query:

OpenRouter SDK beta status GitHub releases

💡 Result:

  • Beta status: OpenRouter’s official SDKs are explicitly labeled “OpenRouter SDK (Beta)” (TypeScript + Python), and both repos state they’re in beta with possible breaking changes between versions even without a major version bump (they recommend pinning versions). [1][2]
  • GitHub releases (latest as of 2026-03-04):
    • TypeScript SDK (OpenRouterTeam/typescript-sdk): latest release v0.9.11 published Feb 23, 2026. [3]
    • Python SDK (OpenRouterTeam/python-sdk): latest release v0.7.11 published Feb 23, 2026. [4]

Sources
[1] (github.com)
[2] (github.com)
[3] (github.com)
[4] (github.com)


Pin @openrouter/sdk to an exact version for reproducible benchmark runs.

The OpenRouter SDK is explicitly labeled as beta, with potential breaking changes between versions even without a major-version bump, and the maintainers recommend pinning to an exact version. For a 0.x release, the caret range ^0.9.11 admits any later 0.9.x patch release (>=0.9.11 <0.10.0), and any such update could alter benchmark scores; change it to 0.9.11 for consistent results across runs.

🔧 Proposed change
   "dependencies": {
-    "@openrouter/sdk": "^0.9.11"
+    "@openrouter/sdk": "0.9.11"
   }
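
To make the range concrete: under npm's caret rule, a range allows changes that do not modify the left-most non-zero component, so ^0.9.11 matches >=0.9.11 <0.10.0. The sketch below is a hand-rolled illustration of that rule for plain MAJOR.MINOR.PATCH versions; it ignores pre-release tags and is not the `semver` package's implementation.

```typescript
type Version = [number, number, number];

// Parses "MAJOR.MINOR.PATCH"; pre-release and build tags are not supported.
function parseVersion(text: string): Version {
  const parts = text.split(".").map(Number);
  if (parts.length !== 3 || parts.some((n) => isNaN(n) || n < 0 || Math.floor(n) !== n)) {
    throw new Error("Expected MAJOR.MINOR.PATCH, got: " + text);
  }
  return [parts[0], parts[1], parts[2]];
}

// Returns -1, 0, or 1 like a standard comparator.
function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

// Caret rule: the upper bound bumps the left-most non-zero component of base.
function satisfiesCaret(candidate: string, base: string): boolean {
  const c = parseVersion(candidate);
  const b = parseVersion(base);
  let upper: Version;
  if (b[0] > 0) upper = [b[0] + 1, 0, 0];
  else if (b[1] > 0) upper = [0, b[1] + 1, 0];
  else upper = [0, 0, b[2] + 1];
  return compareVersions(c, b) >= 0 && compareVersions(c, upper) < 0;
}
```

Under this rule a hypothetical 0.9.12 would satisfy ^0.9.11 while 0.10.0 would not, which is why an unpinned install can still drift between benchmark runs.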
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/package.json` around lines 19 - 20, Update the dependency
declaration in package.json to pin `@openrouter/sdk` to an exact release instead
of a caret range: replace the current version spec "^0.9.11" for the dependency
"@openrouter/sdk" with "0.9.11" to ensure reproducible benchmark runs.

Comment on lines +62 to +72
+const chatGenerationParams: Record<string, unknown> = {
   model: model.openRouterId,
   temperature: TEMPERATURE,
   messages,
 };
 if (tools && tools.length > 0) {
-  body.tools = tools;
-}
-
-const response = await fetch(OPENROUTER_API_URL, {
-  method: "POST",
-  headers: {
-    "Content-Type": "application/json",
-    Authorization: `Bearer ${apiKey}`,
-  },
-  body: JSON.stringify(body),
-});
-
-if (!response.ok) {
-  const text = await response.text();
-  throw new Error(`OpenRouter API error (${response.status}): ${text}`);
+  chatGenerationParams.tools = tools;
 }
 
-return (await response.json()) as ApiResponse;
+const response = await openrouter.chat.send({ chatGenerationParams } as Parameters<typeof openrouter.chat.send>[0]);
+return response as unknown as ApiResponse;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and check package.json to see the OpenRouter SDK version
fd "package.json" -t f | head -5

Repository: appwrite/arena

Length of output: 93


🏁 Script executed:

# Check the file in question
cat -n benchmark/src/runner.ts | sed -n '50,85p'

Repository: appwrite/arena

Length of output: 1173


🏁 Script executed:

# Check the OpenRouter SDK version in benchmark/package.json
cat benchmark/package.json | grep -A 2 -B 2 "@openrouter"

Repository: appwrite/arena

Length of output: 121


🏁 Script executed:

# Also check the imports at the top of runner.ts to understand how openrouter is used
head -30 benchmark/src/runner.ts

Repository: appwrite/arena

Length of output: 722


🌐 Web query:

@openrouter/sdk v0.9.11 OpenRouter.chat.send TypeScript signature types parameters response

💡 Result:

@openrouter/sdk@0.9.11: openRouter.chat.send TypeScript types

Method overloads

// non-streaming (explicitly stream?: false)
send(
  request: operations.SendChatCompletionRequestRequest & {
    chatGenerationParams: { stream?: false | undefined };
  },
  options?: RequestOptions
): Promise<models.ChatResponse>;

// streaming (stream: true)
send(
  request: operations.SendChatCompletionRequestRequest & {
    chatGenerationParams: { stream: true };
  },
  options?: RequestOptions
): Promise<EventStream<models.ChatStreamingResponseChunk>>;

// general type (union)
send(
  request: operations.SendChatCompletionRequestRequest,
  options?: RequestOptions
): Promise<operations.SendChatCompletionRequestResponse>;

[1]


Parameters

request: operations.SendChatCompletionRequestRequest

type SendChatCompletionRequestRequest = {
  httpReferer?: string;
  xTitle?: string;
  chatGenerationParams: models.ChatGenerationParams;
};

[2]

chatGenerationParams: models.ChatGenerationParams (core fields)

messages is required; most other fields are optional (routing, sampling, tools, etc.). Key ones you’ll commonly use:

  • messages: Array<Message>
  • model?: string / models?: Array<string>
  • stream?: boolean (drives whether you get an EventStream)
  • temperature?: number | null
  • topP?: number | null
  • maxCompletionTokens?: number | null (and maxTokens?: number | null deprecated)
  • tools?: Array<ToolDefinitionJson>, toolChoice?: ToolChoiceOption
    [3]

options?: RequestOptions

Second optional argument passed through the SDK request pipeline. [1]


Response types

Non-streaming response (models.ChatResponse)

type ChatResponse = {
  id: string;
  choices: Array<ChatResponseChoice>;
  created: number;
  model: string;
  object: "chat.completion";
  systemFingerprint?: string | null;
  usage?: ChatGenerationTokenUsage;
};

[4]

Streaming response (EventStream<models.ChatStreamingResponseChunk>)

Each streamed chunk is:

type ChatStreamingResponseChunk = {
  id: string;
  choices: Array<ChatStreamingChoice>;
  created: number;
  model: string;
  object: "chat.completion.chunk";
  systemFingerprint?: string | null;
  error?: { message: string; code: number };
  usage?: ChatGenerationTokenUsage;
};

[5]

Union response type

type SendChatCompletionRequestResponse =
  | models.ChatResponse
  | EventStream<models.ChatStreamingResponseChunk>;

[2]


🏁 Script executed:

# Check for ApiResponse type definition
rg "type ApiResponse|interface ApiResponse" benchmark/src/ -A 5

Repository: appwrite/arena

Length of output: 317


Use proper SDK types instead of Record<string, unknown> to maintain type safety.

Lines 62-72 bypass SDK typing through Record<string, unknown> and casts, which prevents TypeScript from catching payload or response drift. Replace Record<string, unknown> with models.ChatGenerationParams from the SDK and use the proper response type (models.ChatResponse) instead of the double-cast workaround.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 62 - 72, Replace the ad-hoc typing and
casts around chatGenerationParams and the response: change the
chatGenerationParams variable from Record<string, unknown> to the SDK type
models.ChatGenerationParams, assign tools conditionally as before, and call
openrouter.chat.send with the correctly typed object (no wrapping cast). Also
replace the double-cast return (ApiResponse) with the SDK response type
models.ChatResponse and return that directly; update any variable
names/signatures that referenced ApiResponse to use models.ChatResponse so
TypeScript enforces payload/response shapes for openrouter.chat.send.
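As a rough illustration of the suggestion, the sketch below builds the payload with a concrete type instead of Record<string, unknown>. `ChatGenerationParams` here is a local stand-in with a few of the fields from the SDK summary; the `Message` and `ToolDefinitionJson` shapes and the `buildParams` helper are assumptions made for the example, not the SDK's definitions.

```typescript
// Local stand-ins; the real shapes come from the SDK's `models` namespace.
type Message = { role: "system" | "user" | "assistant"; content: string };
type ToolDefinitionJson = { type: "function"; function: { name: string } };

type ChatGenerationParams = {
  messages: Array<Message>;
  model?: string;
  stream?: boolean;
  temperature?: number | null;
  tools?: Array<ToolDefinitionJson>;
};

// Hypothetical helper: with a concrete type, a misspelled or mistyped field
// becomes a compile error instead of being passed through silently.
function buildParams(
  model: string,
  messages: Array<Message>,
  tools: Array<ToolDefinitionJson>,
): ChatGenerationParams {
  const params: ChatGenerationParams = { model, messages, stream: false };
  // Attach tools only when there are any.
  if (tools.length > 0) {
    params.tools = tools;
  }
  return params;
}
```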

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmark/src/runner.ts (1)

358-364: ⚠️ Potential issue | 🟠 Major

Avoid mutating shared systemPrompt inside the question loop.

Lines 360-362 append the MCQ instructions to the outer systemPrompt, so every later question inherits them: the text is repeated once per preceding MCQ and is also applied to non-MCQ questions.

💡 Proposed fix
-        if (question.type === "mcq") {
-          systemPrompt += "\nTo answer, you must call the tool corresponding to the correct answer (e.g. answer_a, answer_b, answer_c, or answer_d). Do not respond with text.";
-        }
-
-        processQuestion(question, model, systemPrompt, tools, skillsMap, debug).then((result) => {
+        const promptForQuestion =
+          question.type === "mcq"
+            ? `${systemPrompt}\nTo answer, you must call the tool corresponding to the correct answer (e.g. answer_a, answer_b, answer_c, or answer_d). Do not respond with text.`
+            : systemPrompt;
+
+        processQuestion(question, model, promptForQuestion, tools, skillsMap, debug).then((result) => {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/src/runner.ts` around lines 358 - 364, The loop mutates the shared
systemPrompt when question.type === "mcq", causing later questions to
accumulate MCQ instructions; instead create a per-question prompt copy (e.g.,
const promptForQuestion), append the MCQ instruction to that copy only, and
pass the copy to processQuestion(question, model, promptForQuestion, tools,
skillsMap, debug) (referencing question.type, systemPrompt, and
processQuestion) so the outer systemPrompt remains unchanged.
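The difference between the two approaches can be sketched in isolation. The `Question` type is a hypothetical stand-in (the "free-form" variant is an assumption; only "mcq" appears in the diff), and `promptsWithMutation` exists only to reproduce the reported behaviour for comparison.

```typescript
type Question = { id: string; type: "mcq" | "free-form" };

const MCQ_SUFFIX =
  "\nTo answer, you must call the tool corresponding to the correct answer (e.g. answer_a, answer_b, answer_c, or answer_d). Do not respond with text.";

// Per-question prompt: the shared base prompt is never modified.
function promptFor(question: Question, systemPrompt: string): string {
  return question.type === "mcq" ? systemPrompt + MCQ_SUFFIX : systemPrompt;
}

// Reproduces the reported bug: one prompt variable is reassigned inside the
// loop, so the suffix accumulates and leaks into later questions.
function promptsWithMutation(questions: Question[], systemPrompt: string): string[] {
  const seen: string[] = [];
  for (const q of questions) {
    if (q.type === "mcq") systemPrompt += MCQ_SUFFIX;
    seen.push(systemPrompt);
  }
  return seen;
}
```

Running both over an MCQ, a non-MCQ, and a second MCQ shows the leak: the mutated version hands the non-MCQ question the MCQ instructions and gives the second MCQ two copies.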
🧹 Nitpick comments (1)
src/data/results-without-skills.json (1)

5653-5827: Consider de-duplicating repeated question payloads across model blocks.

The same question definitions are copied in multiple sections. A single canonical source (then reference/expand at generation time) would reduce drift and patch size for future benchmark edits.

Also applies to: 8652-8827, 11651-11826

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/results-without-skills.json` around lines 5653 - 5827, The file
contains duplicated question objects (e.g., questionId "db-21", "db-22",
"cli-21", "cli-22", etc.) repeated across multiple blocks; extract all unique
question definitions into a single canonical questions array (keyed by
questionId) and replace repeated full payloads with references (e.g., by
questionId) in the model/result blocks, updating any code that consumes this
JSON to resolve references at generation time; ensure unique keys (questionId)
remain unchanged and update serialization/deserialization routines to expand
referenced entries when producing the final output.
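One possible shape for this, sketched under assumptions: the field names `questionId`, `question`, `correctAnswer`, and `score` come from the JSON in this PR, while the `QuestionDef` and `StoredResult` types and both helpers are hypothetical.

```typescript
type QuestionDef = { questionId: string; question: string; correctAnswer: string };
type StoredResult = { questionId: string; score: number };
type ExpandedResult = StoredResult & QuestionDef;

// Index the canonical definitions once, rejecting duplicate ids.
function indexQuestions(defs: QuestionDef[]): Map<string, QuestionDef> {
  const byId = new Map<string, QuestionDef>();
  for (const def of defs) {
    if (byId.has(def.questionId)) {
      throw new Error(`Duplicate questionId: ${def.questionId}`);
    }
    byId.set(def.questionId, def);
  }
  return byId;
}

// Expand a stored result by looking up its definition at generation time.
function expand(result: StoredResult, byId: Map<string, QuestionDef>): ExpandedResult {
  const def = byId.get(result.questionId);
  if (!def) {
    throw new Error(`Unknown questionId: ${result.questionId}`);
  }
  return { ...def, ...result };
}
```

Each model block would then store only `questionId` plus its own result fields, and a wording fix to a question would touch one entry instead of one per model.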
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/data/results-with-skills.json`:
- Around line 5649-5651: The judgeReasoning field in the JSON (the
"judgeReasoning" value associated with this result) incorrectly flags
"appwrite.config.json" as wrong; update that string to reflect the new canonical
filename (or regenerate the benchmark so judgeReasoning and runDate are
consistent) so the reasoning no longer penalizes answers that use
appwrite.config.json; locate the relevant "judgeReasoning" entry in the result
record and either edit the text to acknowledge the new filename or re-run the
benchmark generation to produce an updated judgeReasoning.

In `@src/data/results-without-skills.json`:
- Around line 2699-2700: The ground-truth strings under the "correctAnswer" and
"rubric" fields use singular CLI commands; update them to the official plural
forms: replace "appwrite init function" → "appwrite init functions", "appwrite
run function" → "appwrite run functions", and "appwrite push function" →
"appwrite push functions" wherever they appear (e.g., in the JSON entries
containing the keys correctAnswer and rubric referenced in the diff) so the
expected answers match documented CLI syntax.
- Around line 5753-5758: Update the JSON entries whose "question" field
reads "A function works locally but fails after pushing because environment
variables are missing. What flag was likely missing from the push command?" so
that they describe the run command instead, e.g. "A function fails during local
execution because environment variables are missing. What flag was likely
missing from the run command?". Apply this to all three occurrences (the
identical entries at the three affected locations) so the context matches the
Appwrite CLI flag (--with-variables for appwrite run functions).

---

Outside diff comments:
In `@benchmark/src/runner.ts`:
- Around line 358-364: The loop mutates the shared systemPrompt when
question.type === "mcq", causing later questions to accumulate MCQ instructions;
instead create a per-question prompt copy (e.g., const promptForQuestion),
append the MCQ instruction to that copy only, and pass the copy to
processQuestion(question, model, promptForQuestion, tools, skillsMap, debug)
(referencing question.type, systemPrompt, and processQuestion) so the outer
systemPrompt remains unchanged.

---

Nitpick comments:
In `@src/data/results-without-skills.json`:
- Around line 5653-5827: The file contains duplicated question objects (e.g.,
questionId "db-21", "db-22", "cli-21", "cli-22", etc.) repeated across multiple
blocks; extract all unique question definitions into a single canonical
questions array (keyed by questionId) and replace repeated full payloads with
references (e.g., by questionId) in the model/result blocks, updating any code
that consumes this JSON to resolve references at generation time; ensure unique
keys (questionId) remain unchanged and update serialization/deserialization
routines to expand referenced entries when producing the final output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b682286-7ac7-48b0-aa3d-742630e08744

📥 Commits

Reviewing files that changed from the base of the PR and between f91b595 and 32945d9.

📒 Files selected for processing (4)
  • benchmark/src/runner.ts
  • benchmark/src/types.ts
  • src/data/results-with-skills.json
  • src/data/results-without-skills.json

Comment on lines 5649 to 5651
"correct": true,
"score": 0.7,
"judgeReasoning": "The answer covers most of the required rubric points: (1) appwrite init functions for scaffolding, (2) runtime selection is mentioned in the config options, (3) appwrite.json configuration options are detailed (though incorrectly named as appwrite.config.json - the actual file is appwrite.json), (4) local development with 'appwrite run functions' is mentioned in step 7 (though the reference uses 'appwrite run function' singular), and (5) deployment with 'appwrite push functions' is covered. The answer is comprehensive and detailed. The main issues are: the incorrect claim that the file is named 'appwrite.config.json' rather than 'appwrite.json', and the local development command is mentioned only briefly at the end rather than as a primary workflow step. The execute permissions configuration is also missing from the appwrite.json example. Despite these issues, all 5 rubric points are addressed, earning a reasonably high score, but the incorrect filename claim is a notable error."

⚠️ Potential issue | 🟡 Minor

Stale judgeReasoning contradicts the PR's changes.

The judgeReasoning text states:

"(though incorrectly named as appwrite.config.json - the actual file is appwrite.json)"

This is now incorrect since appwrite.config.json is the new canonical filename. If these benchmark results are being regenerated (as suggested by the updated runDate), this reasoning should also reflect the updated understanding. Otherwise, it penalizes models for giving what is now the correct answer.

Consider regenerating the benchmark results with the updated questions to get consistent judge reasoning, or manually correct this text if regeneration is not feasible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/results-with-skills.json` around lines 5649 - 5651, The
judgeReasoning field in the JSON (the "judgeReasoning" value associated with
this result) incorrectly flags "appwrite.config.json" as wrong; update that
string to reflect the new canonical filename (or regenerate the benchmark so
judgeReasoning and runDate are consistent) so the reasoning no longer penalizes
answers that use appwrite.config.json; locate the relevant "judgeReasoning"
entry in the result record and either edit the text to acknowledge the new
filename or re-run the benchmark generation to produce an updated
judgeReasoning.
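If regeneration is not practical, a quick scan can at least list the affected records. This is a heuristic sketch only: `JudgedResult` is a hypothetical type using the `questionId` and `judgeReasoning` field names from the results files, and the phrases are copied from the stale reasoning quoted above, so differently worded stale entries would be missed.

```typescript
type JudgedResult = { questionId: string; judgeReasoning: string };

// Phrases taken verbatim from the stale judgeReasoning quoted in this comment.
const STALE_PHRASES = [
  "the actual file is appwrite.json",
  "rather than 'appwrite.json'",
];

// Returns the ids of results whose reasoning still treats appwrite.json
// as the canonical filename.
function findStaleReasoning(results: JudgedResult[]): string[] {
  return results
    .filter((r) => STALE_PHRASES.some((p) => r.judgeReasoning.includes(p)))
    .map((r) => r.questionId);
}
```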

Comment on lines +2699 to +2700
"correctAnswer": "Run appwrite init function to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run function, then deploy with appwrite push functions.",
"rubric": "Must mention: 1) appwrite init function to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run function, 5) Deployment with appwrite push functions",

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify current command forms in official docs
echo "=== Checking appwrite init functions ==="
curl -fsSL https://appwrite.io/docs/tooling/command-line/functions 2>/dev/null | grep -o 'appwrite init function[s]*' | head -n 5 || echo "Pattern not found in init functions page"

echo ""
echo "=== Checking appwrite run functions ==="
curl -fsSL https://appwrite.io/docs/products/functions/develop-locally 2>/dev/null | grep -o 'appwrite run function[s]*' | head -n 5 || echo "Pattern not found in run functions page"

echo ""
echo "=== Checking appwrite push functions ==="
curl -fsSL https://appwrite.io/docs/products/functions/develop-locally 2>/dev/null | grep -o 'appwrite push function[s]*' | head -n 5 || echo "Pattern not found in push functions page"

Repository: appwrite/arena

Length of output: 440


Update CLI commands to use official plural forms in ground-truth text.

Official Appwrite documentation uses plural command syntax: appwrite init functions, appwrite run functions, and appwrite push functions. The JSON file currently uses singular forms (init function, run function, push function), which will mis-score otherwise-correct responses that follow the documented CLI syntax.

Proposed fix
- "correctAnswer": "Run appwrite init function to scaffold, ... develop locally with appwrite run function, then deploy with appwrite push functions."
+ "correctAnswer": "Run appwrite init functions to scaffold, ... develop locally with appwrite run functions, then deploy with appwrite push functions."

- "rubric": "Must mention: 1) appwrite init function to scaffold, ... 4) Local development with appwrite run function, 5) Deployment with appwrite push functions"
+ "rubric": "Must mention: 1) appwrite init functions to scaffold, ... 4) Local development with appwrite run functions, 5) Deployment with appwrite push functions"

- "question": "What's the key difference between `appwrite run function` and `appwrite push function`?"
+ "question": "What's the key difference between `appwrite run functions` and `appwrite push functions`?"

Affects lines: 2699-2700, 5646-5647, 5817, 8645-8646, 8816, 11572-11573, 11815

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- "correctAnswer": "Run appwrite init function to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run function, then deploy with appwrite push functions.",
- "rubric": "Must mention: 1) appwrite init function to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run function, 5) Deployment with appwrite push functions",
+ "correctAnswer": "Run appwrite init functions to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run functions, then deploy with appwrite push functions.",
+ "rubric": "Must mention: 1) appwrite init functions to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run functions, 5) Deployment with appwrite push functions",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/results-without-skills.json` around lines 2699 - 2700, The
ground-truth strings under the "correctAnswer" and "rubric" fields use singular
CLI commands; update them to the official plural forms: replace "appwrite init
function" → "appwrite init functions", "appwrite run function" → "appwrite run
functions", and "appwrite push function" → "appwrite push functions" wherever
they appear (e.g., in the JSON entries containing the keys correctAnswer and
rubric referenced in the diff) so the expected answers match documented CLI
syntax.
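Because the same strings recur at several line ranges, the replacement could be applied mechanically. A minimal sketch, assuming a hypothetical helper run over each affected string; the word boundary after `function` is what leaves already-plural commands untouched.

```typescript
// Rewrites singular `appwrite init|run|push function` to the plural form.
// `\b` after "function" does not match inside "functions", so commands
// that are already plural are left as they are.
function pluralizeFunctionCommands(text: string): string {
  return text.replace(
    /\bappwrite (init|run|push) function\b/g,
    "appwrite $1 functions",
  );
}
```

The rewrite is idempotent, so running it over a file that has already been fixed changes nothing.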

Comment on lines +5753 to +5758
"question": "A function works locally but fails after pushing because environment variables are missing. What flag was likely missing from the push command?",
"choices": [
"--env",
"--with-variables",
"--include-env",
"--push-variables"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Try to fetch documentation about --with-variables from Appwrite docs
# Check for local development documentation
echo "=== Checking Appwrite develop-locally documentation ==="
curl -fsSL "https://appwrite.io/docs/products/functions/develop-locally" 2>/dev/null | grep -i "with-variables" -A 3 -B 3 || echo "No --with-variables found or page unreachable"

echo ""
echo "=== Checking Appwrite CLI functions documentation ==="
curl -fsSL "https://appwrite.io/docs/tooling/command-line/functions" 2>/dev/null | grep -i "with-variables" -A 3 -B 3 || echo "No --with-variables found or page unreachable"

echo ""
echo "=== Checking for push command context ==="
curl -fsSL "https://appwrite.io/docs/tooling/command-line/functions" 2>/dev/null | grep -i "push" -A 3 -B 3 | head -n 30 || echo "Could not fetch push documentation"

Repository: appwrite/arena

Length of output: 50373


🏁 Script executed:

#!/bin/bash
# Also check the actual JSON file to verify the other locations mentioned
echo "=== Checking lines 8752-8758 in results-without-skills.json ==="
sed -n '8752,8758p' src/data/results-without-skills.json

echo ""
echo "=== Checking lines 11751-11757 in results-without-skills.json ==="
sed -n '11751,11757p' src/data/results-without-skills.json

Repository: appwrite/arena

Length of output: 824


Change push command context to run command across three locations.

The Appwrite documentation confirms --with-variables is a parameter for appwrite run functions (local development), not for push commands. The current question incorrectly frames this flag as missing from a push command, which encodes incorrect troubleshooting logic.

Affected locations

Lines 5753-5758, 8752-8758, 11751-11757 all contain identical questions with the same error.

- "question": "A function works locally but fails after pushing because environment variables are missing. What flag was likely missing from the push command?"
+ "question": "A function fails during local execution because environment variables are missing. What flag was likely missing from the run command?"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- "question": "A function works locally but fails after pushing because environment variables are missing. What flag was likely missing from the push command?",
- "choices": [
- "--env",
- "--with-variables",
- "--include-env",
- "--push-variables"
+ "question": "A function fails during local execution because environment variables are missing. What flag was likely missing from the run command?",
+ "choices": [
+ "--env",
+ "--with-variables",
+ "--include-env",
+ "--push-variables"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/data/results-without-skills.json` around lines 5753 - 5758, Update the
JSON entries whose "question" field reads "A function works locally but
fails after pushing because environment variables are missing. What flag was
likely missing from the push command?" so that they describe the run command
instead, e.g. "A function fails during local execution because environment
variables are missing. What flag was likely missing from the run command?".
Apply this to all three occurrences (the identical entries at the three
affected locations) so the context matches the Appwrite CLI flag
(--with-variables for appwrite run functions).

@Meldiron Meldiron merged commit ce51ff6 into main Mar 5, 2026
2 checks passed