85 changes: 85 additions & 0 deletions benchmark/package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions benchmark/package.json
@@ -15,5 +15,8 @@
"devDependencies": {
"@types/bun": "latest",
"typescript": "^5.7.2"
+ },
+ "dependencies": {
+ "@openrouter/sdk": "^0.9.11"
Comment on lines +19 to +20

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

OpenRouter TypeScript SDK official documentation version pinning recommendations

💡 Result:

OpenRouter’s official TypeScript SDK is explicitly labeled beta, and the maintainers warn that there may be breaking changes between versions without a major-version bump—so they recommend pinning to a specific @openrouter/sdk version (i.e., don’t rely on floating ranges like ^ if you want reproducible installs). [1][2]

Practical implication: install/update intentionally (e.g., npm i @openrouter/sdk@<exact-version>) and keep that exact version in your package.json so repeated installs don’t unexpectedly pick up breaking changes. [1]


🌐 Web query:

OpenRouter SDK beta status GitHub releases

💡 Result:

  • Beta status: OpenRouter’s official SDKs are explicitly labeled “OpenRouter SDK (Beta)” (TypeScript + Python), and both repos state they’re in beta with possible breaking changes between versions even without a major version bump (they recommend pinning versions). [1][2]
  • GitHub releases (latest as of 2026-03-04):
    • TypeScript SDK (OpenRouterTeam/typescript-sdk): latest release v0.9.11 published Feb 23, 2026. [3]
    • Python SDK (OpenRouterTeam/python-sdk): latest release v0.7.11 published Feb 23, 2026. [4]

Sources
[1] (github.com)
[2] (github.com)
[3] (github.com)
[4] (github.com)


Pin @openrouter/sdk to an exact version for reproducible benchmark runs.

The OpenRouter SDK is explicitly labeled as beta, and its maintainers warn that breaking changes may ship between versions without a major-version bump; they recommend pinning to an exact version. For a 0.x release, the caret range ^0.9.11 resolves to >=0.9.11 <0.10.0, so a fresh install can pick up any later 0.9.x release, and any such release could alter benchmark scores. Change the spec to 0.9.11 for consistent results across runs.

🔧 Proposed change
   "dependencies": {
-    "@openrouter/sdk": "^0.9.11"
+    "@openrouter/sdk": "0.9.11"
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/package.json` around lines 19 - 20, Update the dependency
declaration in package.json to pin `@openrouter/sdk` to an exact release instead
of a caret range: replace the current version spec "^0.9.11" for the dependency
"@openrouter/sdk" with "0.9.11" to ensure reproducible benchmark runs.
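
To make the range concrete, here is a minimal TypeScript sketch of npm's documented caret rule for plain x.y.z versions (pre-release tags are not handled). The helper names `parseVersion`, `caretUpperBound`, and `satisfiesCaret` are illustrative only; they are not part of npm, Bun, or this PR.

```typescript
// Sketch of the caret rule: the leftmost non-zero component is held fixed.
// Assumption: versions are plain "x.y.z" strings without pre-release tags.
type Version = [number, number, number];

function parseVersion(v: string): Version {
  const parts = v.split(".").map(Number);
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`Not a plain x.y.z version: ${v}`);
  }
  return [parts[0], parts[1], parts[2]];
}

// Exclusive upper bound of the range `^version`.
function caretUpperBound(version: string): string {
  const [major, minor, patch] = parseVersion(version);
  if (major > 0) return `${major + 1}.0.0`;
  if (minor > 0) return `0.${minor + 1}.0`;
  return `0.0.${patch + 1}`;
}

// True when `candidate` falls inside `^base`.
function satisfiesCaret(candidate: string, base: string): boolean {
  const cmp = (a: Version, b: Version): number =>
    a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
  const c = parseVersion(candidate);
  const lower = parseVersion(base);
  const upper = parseVersion(caretUpperBound(base));
  return cmp(c, lower) >= 0 && cmp(c, upper) < 0;
}

console.log(caretUpperBound("0.9.11"));          // upper bound of ^0.9.11
console.log(satisfiesCaret("0.9.12", "0.9.11")); // a later 0.9.x release
console.log(satisfiesCaret("0.10.0", "0.9.11")); // the next 0.x minor
```

Under this rule a later 0.9.x release is accepted by ^0.9.11 while 0.10.0 is not, which is why an exact spec is the only way to rule out drift from package.json alone.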

}
}
1 change: 1 addition & 0 deletions benchmark/src/config.ts
@@ -10,6 +10,7 @@ export const MODELS: ModelConfig[] = [
name: "Gemini 3.1 Pro (Preview)",
provider: "Google",
openRouterId: "google/gemini-3.1-pro-preview",
+ openRouterProviderOrder: ["google-ai-studio"],
},
{
id: "gpt-5-3-codex",
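
The diff does not show where `openRouterProviderOrder` is consumed. As a rough sketch, assuming it is forwarded to OpenRouter's provider routing (the REST API takes a `provider` object with an `order` list), the mapping could look like the following. The `ModelConfig` shape is inferred from the fields visible in this hunk, `providerRouting` is a hypothetical helper, and the `id` value is a placeholder because the real one lies outside the hunk.

```typescript
// Shape inferred from the fields visible in config.ts; the real ModelConfig may differ.
interface ModelConfig {
  id: string;
  name: string;
  provider: string;
  openRouterId: string;
  openRouterProviderOrder?: string[];
}

// Hypothetical helper: returns a REST-style routing fragment for a model,
// or an empty object when no provider order is configured.
function providerRouting(model: ModelConfig): { provider?: { order: string[] } } {
  const order = model.openRouterProviderOrder;
  return order && order.length > 0 ? { provider: { order: [...order] } } : {};
}

const gemini: ModelConfig = {
  id: "example-gemini", // placeholder, not the id used in the PR
  name: "Gemini 3.1 Pro (Preview)",
  provider: "Google",
  openRouterId: "google/gemini-3.1-pro-preview",
  openRouterProviderOrder: ["google-ai-studio"],
};

// Models without an order contribute nothing to the request body.
const body = { model: gemini.openRouterId, ...providerRouting(gemini) };
console.log(JSON.stringify(body));
```

Keeping the field optional means existing entries in `MODELS` need no change, and only models that must be served by a specific upstream carry the extra routing hint.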
30 changes: 30 additions & 0 deletions benchmark/src/index.ts
@@ -147,6 +147,29 @@ interface ModelProgress {
results: QuestionResult[];
}

+ function sanitizeResults(models: Record<string, ModelProgress>): boolean {
+ let fixed = false;
+ for (const [modelId, model] of Object.entries(models)) {
+ const seen = new Map<string, number>();
+ const duplicates: string[] = [];
+ for (let i = 0; i < model.results.length; i++) {
+ const qid = model.results[i].questionId;
+ if (seen.has(qid)) {
+ duplicates.push(qid);
+ }
+ seen.set(qid, i);
+ }
+ if (duplicates.length > 0) {
+ // Keep last occurrence of each questionId
+ const keepIndices = new Set(seen.values());
+ model.results = model.results.filter((_, i) => keepIndices.has(i));
+ console.log(` Sanitize: removed ${duplicates.length} duplicate(s) from ${modelId}: ${[...new Set(duplicates)].join(", ")}`);
+ fixed = true;
+ }
+ }
+ return fixed;
+ }
+
function getResultsPath(mode: string): string {
return resolve(import.meta.dir, `../../src/data/results-${mode}.json`);
}
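
For readers following the `sanitizeResults` hunk, the keep-last de-duplication it performs can be illustrated in isolation. This is a simplified sketch, not the PR's code: `keepLastByQuestion` and the sample data are hypothetical, and the sketch returns a new array instead of mutating a `ModelProgress` entry.

```typescript
// Keep only the last occurrence of each questionId, preserving the relative
// order of the surviving entries.
function keepLastByQuestion<T extends { questionId: string }>(results: T[]): T[] {
  const lastIndex = new Map<string, number>();
  results.forEach((r, i) => lastIndex.set(r.questionId, i));
  return results.filter((r, i) => lastIndex.get(r.questionId) === i);
}

// Hypothetical sample: "cli-1" was answered twice, the later entry wins.
const sample = [
  { questionId: "cli-1", score: 0 },
  { questionId: "cli-2", score: 1 },
  { questionId: "cli-1", score: 1 },
];

const cleaned = keepLastByQuestion(sample);
console.log(cleaned.map((r) => `${r.questionId}:${r.score}`).join(", "));
```

Keeping the last entry instead of the first favors the most recent run of a question, which matches the "Keep last occurrence" comment in the hunk.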
@@ -172,6 +195,7 @@ function loadExistingResults(mode: string): Record<string, ModelProgress> {
correct: d.correct,
score: d.score,
judgeReasoning: d.judgeReasoning,
+ ...(d.modComment ? { modComment: d.modComment } : {}),
})),
};
}
@@ -214,6 +238,7 @@ function saveResults(
correct: r.correct,
score: r.score,
judgeReasoning: r.judgeReasoning,
+ ...(r.modComment ? { modComment: r.modComment } : {}),
};
});
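
The conditional spread added to both `loadExistingResults` and `saveResults` keeps `modComment` out of the serialized record unless it holds a non-empty value. A small sketch of the pattern, using a hypothetical `StoredResult` shape with fewer fields than the real one:

```typescript
// Hypothetical, reduced record shape; the PR's stored result has more fields.
interface StoredResult {
  questionId: string;
  score: number;
  modComment?: string;
}

function toStored(r: { questionId: string; score: number; modComment?: string }): StoredResult {
  return {
    questionId: r.questionId,
    score: r.score,
    // Spread an object with the key only when the value is truthy;
    // undefined and "" both leave the key out entirely.
    ...(r.modComment ? { modComment: r.modComment } : {}),
  };
}

console.log(JSON.stringify(toStored({ questionId: "cli-1", score: 1 })));
console.log(JSON.stringify(toStored({ questionId: "cli-2", score: 0, modComment: "rubric updated" })));
```

Writing `modComment: r.modComment` directly would still serialize the same way for `undefined`, but it would persist empty strings and leave an explicit `undefined` key on the in-memory object; the spread avoids both.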

@@ -273,6 +298,11 @@ async function main() {
);
}

+ if (sanitizeResults(models)) {
+ saveResults(models, mode);
+ console.log(`Sanitized results saved.`);
+ }
+
let skillsMap: Map<string, SkillInfo> | undefined;
let tools: Tool[] | undefined;
if (mode === "with-skills") {
27 changes: 8 additions & 19 deletions benchmark/src/judge.ts
@@ -1,8 +1,11 @@
- import { JUDGE_MODEL, OPENROUTER_API_URL, TEMPERATURE } from "./config";
+ import { OpenRouter } from "@openrouter/sdk";
+ import { JUDGE_MODEL, TEMPERATURE } from "./config";
import type { Question } from "./types";

const apiKey = process.env.OPENROUTER_API_KEY;

+ const openrouter = new OpenRouter({ apiKey });
+
interface JudgeResult {
score: number;
reasoning: string;
@@ -28,31 +31,17 @@ Score the model's answer from 0.0 to 1.0 where:
Respond in this exact JSON format:
{"score": <number>, "reasoning": "<brief explanation>"}`;

- const response = await fetch(OPENROUTER_API_URL, {
- method: "POST",
- headers: {
- "Content-Type": "application/json",
- Authorization: `Bearer ${apiKey}`,
- },
- body: JSON.stringify({
+ const data = await openrouter.chat.send({
+ chatGenerationParams: {
model: JUDGE_MODEL,
temperature: TEMPERATURE,
messages: [
{ role: "system", content: systemPrompt },
{ role: "user", content: modelAnswer },
],
- }),
+ },
});
-
- if (!response.ok) {
- const text = await response.text();
- throw new Error(`Judge API error (${response.status}): ${text}`);
- }
-
- const data = (await response.json()) as {
- choices: Array<{ message: { content: string } }>;
- };
- const content = data.choices[0]?.message?.content ?? "";
+ const content = (data as { choices: Array<{ message: { content: string } }> }).choices[0]?.message?.content ?? "";

try {
const jsonMatch = content.match(/\{[\s\S]*\}/);
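
The rest of the `try` block is collapsed in this view. As a rough sketch of how a judge reply could be turned into a `JudgeResult` with the same regex, assuming a fallback score of 0 and clamping to the 0.0 to 1.0 range from the prompt (both are assumptions, not behavior confirmed by the diff), with `parseJudgeContent` as a hypothetical helper name:

```typescript
interface JudgeResult {
  score: number;
  reasoning: string;
}

// Hypothetical helper: extract the first-to-last brace span, parse it, and
// normalize the fields. Fallback values are assumptions for illustration.
function parseJudgeContent(content: string): JudgeResult {
  const jsonMatch = content.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    return { score: 0, reasoning: "No JSON object found in judge response" };
  }
  try {
    const parsed = JSON.parse(jsonMatch[0]) as { score?: unknown; reasoning?: unknown };
    const raw =
      typeof parsed.score === "number" && Number.isFinite(parsed.score) ? parsed.score : 0;
    return {
      score: Math.min(1, Math.max(0, raw)), // clamp to the range the prompt asks for
      reasoning: typeof parsed.reasoning === "string" ? parsed.reasoning : "",
    };
  } catch {
    return { score: 0, reasoning: "Judge response was not valid JSON" };
  }
}

console.log(parseJudgeContent('Here you go: {"score": 0.5, "reasoning": "partially correct"}'));
console.log(parseJudgeContent("I cannot grade this."));
```

Note that the greedy regex spans from the first `{` to the last `}`, so a reply containing two separate JSON objects would fail to parse and land in the fallback branch.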
126 changes: 119 additions & 7 deletions benchmark/src/questions/cli.ts
@@ -51,7 +51,7 @@ export const cliQuestions: Question[] = [
"appwrite.config.js",
".appwriterc",
"appwrite.yaml",
- "appwrite.json",
+ "appwrite.config.json",
],
correctAnswer: "D",
},
@@ -120,17 +120,17 @@ export const cliQuestions: Question[] = [
category: "cli",
type: "free-form",
question:
- "Describe the complete workflow for creating, configuring, and deploying an Appwrite Function using the CLI. Include key configuration options in appwrite.json.",
+ "Describe the complete workflow for creating, configuring, and deploying an Appwrite Function using the CLI. Include key configuration options in appwrite.config.json.",
correctAnswer:
- "Run appwrite init function to scaffold, choose runtime and template, configure appwrite.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run function, then deploy with appwrite push functions.",
+ "Run appwrite init function to scaffold, choose runtime and template, configure appwrite.config.json with function settings (name, runtime, execute permissions, variables, schedule, etc.), develop locally with appwrite run function, then deploy with appwrite push functions.",
rubric:
- "Must mention: 1) appwrite init function to scaffold, 2) Runtime selection, 3) appwrite.json configuration options, 4) Local development with appwrite run function, 5) Deployment with appwrite push functions",
+ "Must mention: 1) appwrite init function to scaffold, 2) Runtime selection, 3) appwrite.config.json configuration options, 4) Local development with appwrite run function, 5) Deployment with appwrite push functions",
},
{
id: "cli-11",
category: "cli",
type: "mcq",
- question: "What command initializes the CLI with your Appwrite project and creates appwrite.json?",
+ question: "What command initializes the CLI with your Appwrite project and creates appwrite.config.json?",
choices: [
"appwrite init project",
"appwrite setup",
@@ -169,7 +169,7 @@ export const cliQuestions: Question[] = [
id: "cli-14",
category: "cli",
type: "mcq",
- question: "What does the appwrite.json file represent?",
+ question: "What does the appwrite.config.json file represent?",
choices: [
"Only function configurations",
"User credentials only",
@@ -225,7 +225,7 @@ export const cliQuestions: Question[] = [
choices: [
"Pushes code to a Git repository",
"Uploads environment variables only",
- "Deploys tracked resources (e.g. functions, collections) from appwrite.json to your Appwrite project",
+ "Deploys tracked resources (e.g. functions, collections) from appwrite.config.json to your Appwrite project",
"Syncs local config with the server and overwrites server state",
],
correctAnswer: "C",
@@ -254,4 +254,116 @@ export const cliQuestions: Question[] = [
rubric:
"Must mention: 1) data as JSON string with double quotes, 2) permissions as array (space-separated in CLI), 3) Example or correct syntax for databases create-document",
},
+ {
+ id: "cli-21",
+ category: "cli",
+ type: "mcq",
+ question:
+ "What happens if an Appwrite project contains an appwrite.json file but no appwrite.config.json?",
+ choices: [
+ "The CLI throws an error and requires appwrite.config.json",
+ "The CLI falls back to appwrite.json for legacy backwards compatibility",
+ "The CLI ignores it and uses default settings",
+ "The CLI automatically migrates appwrite.json to appwrite.config.json",
+ ],
+ correctAnswer: "B",
+ },
+ {
+ id: "cli-22",
+ category: "cli",
+ type: "mcq",
+ question:
+ "What does the appwrite types command do?",
+ choices: [
+ "Generates typed models for your Appwrite project's collections and attributes",
+ "Lists all data types supported by Appwrite databases",
+ "Converts documents between different data formats",
+ "Validates the types defined in appwrite.config.json",
+ ],
+ correctAnswer: "A",
+ },
+ {
+ id: "cli-23",
+ category: "cli",
+ type: "mcq",
+ question:
+ "What does the --strict flag do in the appwrite types command?",
+ choices: [
+ "Enforces type-safe null checks in generated code",
+ "Throws errors for missing or invalid collection attributes",
+ "Automatically converts field names to follow language conventions",
+ "Disables generation of optional fields",
+ ],
+ correctAnswer: "C",
+ },
+ {
+ id: "cli-24",
+ category: "cli",
+ type: "mcq",
+ question:
+ "A function works locally but fails after pushing because environment variables are missing. What flag was likely missing from the push command?",
+ choices: [
+ "--env",
+ "--with-variables",
+ "--include-env",
+ "--push-variables",
+ ],
+ correctAnswer: "B",
+ },
+ {
+ id: "cli-25",
+ category: "cli",
+ type: "mcq",
+ question:
+ "When defining an attribute in appwrite.config.json, what happens if 'type' is 'string' or 'varchar' but 'size' is not defined?",
+ choices: [
+ "The attribute is created with a default size of 255",
+ "The attribute is created as an unlimited text field",
+ "The CLI automatically calculates the size based on sample data",
+ "The CLI throws a validation error because 'size' is required for string/varchar types",
+ ],
+ correctAnswer: "D",
+ },
+ {
+ id: "cli-26",
+ category: "cli",
+ type: "mcq",
+ question:
+ "In appwrite.config.json, when defining an attribute with 'required' set to true, what must the 'default' property be set to?",
+ choices: [
+ "The default can be any value matching the type",
+ "The default must be set to an empty string or 0",
+ "The default must be null",
+ "The default property is optional and can be omitted",
+ ],
+ correctAnswer: "C",
+ },
+ {
+ id: "cli-27",
+ category: "cli",
+ type: "mcq",
+ question:
+ "What does the 'appwrite generate' command do?",
+ choices: [
+ "Generates a new Appwrite project from a template",
+ "Creates boilerplate code for functions and collections",
+ "Generates a type-safe SDK from your Appwrite project configuration",
+ "Generates API documentation for your project",
+ ],
+ correctAnswer: "C",
+ },
+ {
+ id: "cli-28",
+ category: "cli",
+ type: "mcq",
+ question:
+ "What's the key difference between `appwrite run function` and `appwrite push function`?",
+ choices: [
+ "`run` executes locally with Docker emulation, `push` deploys to Appwrite cloud",
+ "`run` deploys to staging, `push` deploys to production",
+ "`run` is for testing, `push` is for CI/CD pipelines only",
+ "`run` requires internet connection, `push` works offline",
+ ],
+ correctAnswer: "A",
+ },
];
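
With eight new MCQ entries added by hand, a small consistency check over the question bank may be worth considering. The sketch below is an assumption-laden illustration: the `Question` union is inferred from the fields visible in this diff (the real type lives in `./types` and is not shown), and `validateQuestions` is a hypothetical helper that flags duplicate ids and answer letters that do not map to a choice.

```typescript
// Inferred from the entries in cli.ts; the actual Question type may differ.
type Question =
  | { id: string; category: string; type: "mcq"; question: string; choices: string[]; correctAnswer: string }
  | { id: string; category: string; type: "free-form"; question: string; correctAnswer: string; rubric: string };

// Hypothetical check: returns a list of human-readable problems, empty when clean.
function validateQuestions(questions: Question[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();
  for (const q of questions) {
    if (seenIds.has(q.id)) problems.push(`${q.id}: duplicate id`);
    seenIds.add(q.id);
    if (q.type === "mcq") {
      // Map "A" to 0, "B" to 1, and so on; anything else is out of range.
      const index = q.correctAnswer.length === 1 ? q.correctAnswer.charCodeAt(0) - 65 : -1;
      if (index < 0 || index >= q.choices.length) {
        problems.push(`${q.id}: correctAnswer "${q.correctAnswer}" does not match any choice`);
      }
    }
  }
  return problems;
}

// Hypothetical sample entries, not taken from the PR.
const sampleQuestions: Question[] = [
  { id: "demo-1", category: "cli", type: "mcq", question: "Q1?", choices: ["a", "b", "c", "d"], correctAnswer: "D" },
  { id: "demo-2", category: "cli", type: "mcq", question: "Q2?", choices: ["a", "b"], correctAnswer: "C" },
  { id: "demo-1", category: "cli", type: "free-form", question: "Q3?", correctAnswer: "x", rubric: "y" },
];

console.log(validateQuestions(sampleQuestions));
```

A check like this could run at the start of a benchmark run so that a mistyped answer letter fails fast instead of silently scoring every model as wrong on that question.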