Perstack: Demo Catalog

Demo repository for Perstack. Runs the same task across different providers and compares results.

bash-gaming

Game generation demo using the bash-gaming expert defined in perstack.toml — a multi-agent team that autonomously designs, implements, tests, and packages CLI games.

  • Definition: perstack.toml
  • Providers and models:
    • Anthropic: Opus 4.6, Sonnet 4.6
    • OpenAI: GPT-5.4, GPT-5 mini
    • Google: Gemini 3.1 Pro, Gemini 3 Flash
    • Fireworks:
      • Kimi K2.5
      • MiniMax M2.5

Expert Topology

bash-gaming (coordinator)
├── @bash-gaming/plan
├── @bash-gaming/build
└── @bash-gaming/verify

Each expert has a defaultModelTier (high or middle). Perstack automatically selects the appropriate model within the provider based on this tier. The --model flag sets the base model; delegates may use a different model from the same provider depending on their tier.

| Expert | Model Tier | Role |
| --- | --- | --- |
| bash-gaming (coordinator) | high | Coordinates the entire task and delegates to the appropriate experts. |
| @bash-gaming/plan | middle | Expands requirements, defines dual-mode API contract, npm package structure. |
| @bash-gaming/build | high | Implements Ink+React TUI, --ai JSON mode, game logic, npm packaging. |
| @bash-gaming/verify | middle | Validates npx runnability, AI mode deterministic output, TUI playthrough tests. |
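The tier assignments above could be encoded along these lines. This is a hypothetical sketch, not the actual perstack.toml (which lives in the repo): only the expert names and the defaultModelTier values come from this README, and every other key is illustrative.

```toml
# Hypothetical sketch of the expert topology. Only the expert names and
# defaultModelTier values are taken from this README; the section layout
# and the "delegates" key are illustrative assumptions.
[experts.bash-gaming]
defaultModelTier = "high"
delegates = ["@bash-gaming/plan", "@bash-gaming/build", "@bash-gaming/verify"]

[experts."@bash-gaming/plan"]
defaultModelTier = "middle"

[experts."@bash-gaming/build"]
defaultModelTier = "high"

[experts."@bash-gaming/verify"]
defaultModelTier = "middle"
```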

Task (Query)

Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item.

Execution Command

Replace <provider>, <model>, and <result-dir> for each run.

docker run --pull always --rm -it \
  --env-file .env \
  -v ./<result-dir>:/workspace \
  -v ./bash-gaming/perstack.toml:/definitions/perstack.toml:ro \
  perstack/perstack start bash-gaming \
    --config /definitions/perstack.toml \
    --provider <provider> \
    --model <model> \
    "Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item."

Evaluation Criteria

For evaluation, I focused on just three things:

  1. Does the expert adhere to my instructions?
  2. Is the outcome verified and actually working? (npx . in the result directory)
  3. Is the API cost affordable?

Why these three? Even if the harness architecture is solid, an agent still has to be judged on instruction adherence, minimum quality assurance, and cost efficiency.

Results

| Provider | Models Used | Directory | Steps | Duration | Input Tokens | Cached | Output Tokens | Cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Anthropic | Opus 4.6, Sonnet 4.6 | anthropic/ | 173 | 51m 18s | 13.8M | 13.3M (96.07%) | 213.9K | $15.24 |
| Fireworks | Kimi K2.5 | fireworks-kimi-k2p5/ | 324 | 1h 46m | 20.6M | 19.0M (92.13%) | 189.1K | ~$3.43 |
| Google | Gemini 3.1 Pro, Gemini 3 Flash | google/ | 163 | 16m 31s | 2.9M | 2.1M (72.69%) | 46.2K | ~$1.76 |
| OpenAI | GPT-5.4, GPT-5 mini | openai/ | 118 | 12m 24s | 2.4M | 2.2M (89.56%) | 42.6K | ~$1.80 |
| Fireworks | MiniMax M2.5 | fireworks-minimax-m2p5/ | 59 | 5m 49s | 1.0M | 844.4K (83.31%) | 39.7K | ~$0.13 |
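The percentages in the Cached column are just cached input tokens divided by total input tokens. As a sanity check on the Google row (a sketch; the table values are rounded to 0.1M, so the recomputed rate lands near, not exactly on, the reported figure):

```shell
# Cache hit rate = cached input tokens / total input tokens.
# Google row from the results table: 2.1M cached out of 2.9M input (rounded).
rate=$(awk 'BEGIN { printf "%.2f", 2.1 / 2.9 * 100 }')
echo "${rate}%"   # prints 72.41%, near the reported 72.69%
```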

Thoughts

  • 3 out of 5 providers followed the full plan → build → verify pipeline and produced verified working output, with no provider-specific tuning. The topology was defined once and ran as-is.
  • Claude produced the richest output with flawless instruction adherence. It also achieved the highest cache hit rate (96%) of any provider, but pricing still pushed the total to roughly 4× the next most expensive run (Kimi K2.5) and more than 8× the Gemini and GPT runs.
  • Kimi K2.5 produced excellent output at $3.43 and was the most faithful to delegation. I'd rank it well above GPT and Gemini in both instruction adherence and quality.
  • Gemini followed the full pipeline and produced a verified working game. But it's buggier than GPT's output and almost unplayable.
  • GPT was the fastest and cheapest, but skipped the verify step entirely. It called build three times instead of following the pipeline.
  • MiniMax M2.5 ignored instructions entirely and made a browser-based HTML game. Instruction adherence is a challenge, but the newest version, M2.7, was recently announced with adherence improvements, so I'm looking forward to it.

The full execution logs for every run are in the repo, so you can see exactly what each model did and reproduce it yourself.


Anthropic — Opus 4.6 + Sonnet 4.6

cd bash-gaming/opus-4-6-anthropic && npm install && npx .

Anthropic title

Anthropic gameplay

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan and verify to Sonnet 4.6 (middle tier), and build to Opus 4.6 (high tier). High instruction adherence — produced a fully functional Ink TUI dungeon crawler with all requested features.

Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
| --- | --- | --- | --- | --- | --- |
| bash-gaming (coordinator) | Opus 4.6 | 24.7K | | 24.7K | 2.1K |
| @bash-gaming/plan | Sonnet 4.6 | 7.1M | 6.9M | 200K | 142.6K |
| bash-gaming (coordinator) | Opus 4.6 | 8.4K | 8.4K | | 11.8K |
| @bash-gaming/build | Opus 4.6 | 4.6M | 4.5M | 100K | 36.1K |
| bash-gaming (coordinator) | Opus 4.6 | 90.0K | 65.8K | 24.2K | 1.0K |
| @bash-gaming/verify | Sonnet 4.6 | 1.6M | 1.5M | 100K | 16.8K |
| bash-gaming (coordinator) | Opus 4.6 | 370.8K | 351.8K | 19.0K | 3.5K |

By model:

| Model | Cost |
| --- | --- |
| Opus 4.6 | $9.22 |
| Sonnet 4.6 | $6.02 |
| Total | $15.24 |

View execution history:

cd bash-gaming/opus-4-6-anthropic && npx perstack log --job omd3cvzndvtvbpma0tut38ac

OpenAI — GPT-5.4 + GPT-5 mini

cd bash-gaming/gpt-5-4-openai && npm install && npx .

OpenAI gameplay

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan to GPT-5 mini (middle tier). However, it never delegated to @bash-gaming/verify — instead it called @bash-gaming/build three times, bypassing the defined plan→build→verify pipeline. The result is functional, but the expert topology was not fully followed.

Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
| --- | --- | --- | --- | --- | --- |
| bash-gaming (coordinator) | GPT-5.4 | 5.8K | 1.2K | 4.6K | 541 |
| @bash-gaming/plan | GPT-5 mini | 3.9K | 1.0K | 2.9K | 4.6K |
| bash-gaming (coordinator) | GPT-5.4 | 5.3K | 5.3K | | 209 |
| @bash-gaming/build | GPT-5.4 | 1.6M | 1.5M | 100K | 22.5K |
| bash-gaming (coordinator) | GPT-5.4 | 85.9K | 46.6K | 39.3K | 1.2K |
| @bash-gaming/build | GPT-5.4 | 503.1K | 427.5K | 75.6K | 9.3K |
| bash-gaming (coordinator) | GPT-5.4 | 59.0K | 54.3K | 4.7K | 375 |
| @bash-gaming/build | GPT-5.4 | 99.8K | 82.4K | 17.4K | 3.0K |
| bash-gaming (coordinator) | GPT-5.4 | 99.3K | 94.8K | 4.5K | 892 |

By model:

| Model | Pricing (input / cached / output per 1M) | Cost |
| --- | --- | --- |
| GPT-5.4 | $2.50 / $0.25 / $15.00 | ~$1.75 |
| GPT-5 mini | $0.25 / $0.025 / $2.00 | < $0.01 |
| Total | | ~$1.80 |
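The ~$1.75 figure can be sanity-checked from the per-run table above. Summing the GPT-5.4 rows gives roughly 246K uncached input, 2.212M cached input, and 38K output tokens (my own sums of the rounded table values), and applying the listed per-1M rates:

```shell
# GPT-5.4 pricing per 1M tokens: $2.50 uncached input, $0.25 cached input, $15.00 output.
# Token totals (in millions) summed from the GPT-5.4 rows of the per-run table.
cost=$(awk 'BEGIN { printf "%.2f", 0.246 * 2.50 + 2.212 * 0.25 + 0.038 * 15.00 }')
echo "\$${cost}"   # prints $1.74, in line with the reported ~$1.75
```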

View execution history:

cd bash-gaming/gpt-5-4-openai && npx perstack log --job cbtj0h7h3hm6o92m12awertf

Google — Gemini 3.1 Pro + Gemini 3 Flash

cd bash-gaming/google/wizardry-crawler && npm install && npx .

Google gameplay

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator correctly followed the plan→build→verify pipeline, routing plan and verify to Gemini 3 Flash (middle tier). Functional but buggy — the game runs and is playable, but exhibits noticeable gameplay issues.

Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
| --- | --- | --- | --- | --- | --- |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 1.1K | | 1.1K | 354 |
| @bash-gaming/plan | Gemini 3 Flash | 394.9K | 229.7K | 165.2K | 12.5K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 12.2K | | 12.2K | 531 |
| @bash-gaming/build | Gemini 3.1 Pro | 492.2K | 379.8K | 112.4K | 12.0K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 3.9K | | 3.9K | 48 |
| @bash-gaming/verify | Gemini 3 Flash | 289.0K | 60.0K | 229.0K | 4.4K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 18.4K | | 18.4K | 347 |
| @bash-gaming/build | Gemini 3.1 Pro | 1.6M | 1.4M | 200K | 15.1K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 46.1K | | 46.1K | 952 |

By model:

| Model | Pricing (input / cached / output per 1M) | Cost |
| --- | --- | --- |
| Gemini 3.1 Pro Preview | $2.00 / $0.20 / $12.00 | ~$1.50 |
| Gemini 3 Flash Preview | $0.50 / $0.05 / $3.00 | ~$0.26 |
| Total | | ~$1.76 |

View execution history:

cd bash-gaming/google/wizardry-crawler && npx perstack log --job j3oa726f86kyqyqc75nekbbu

Fireworks — Kimi K2.5

cd bash-gaming/fireworks-kimi-k2p5 && npm install && npx .

Kimi K2.5 title

Kimi K2.5 gameplay

Run with perstack@0.0.136. Kimi K2.5 performed micro-agent orchestration as expected, leveraging delegation across the expert topology to design, implement, test, and iteratively improve the deliverables. Two follow-up requests were made to address environment-specific issues (the game was functional within the harness).

Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
| --- | --- | --- | --- | --- |
| bash-gaming (coordinator) | 952 | | 952 | 504 |
| @bash-gaming/plan | 955.8K | 794.1K | 161.7K | 69.5K |
| bash-gaming (coordinator) | 2.5K | 512 | 2.0K | 432 |
| @bash-gaming/build | 12.2M | 11.7M | 500K | 83.1K |
| bash-gaming (coordinator) | 3.9K | 2.0K | 1.9K | 307 |
| @bash-gaming/verify | 1.8M | 1.6M | 200K | 8.6K |
| bash-gaming (coordinator) | 56.9K | 26.1K | 30.8K | 1.3K |
| *(follow-up 1)* | | | | |
| bash-gaming (coordinator) | 57.9K | 26.6K | 31.3K | 566 |
| @bash-gaming/build | 882.5K | 710.7K | 171.8K | 6.9K |
| bash-gaming (coordinator) | 1.7M | 1.5M | 200K | 3.6K |
| *(follow-up 2)* | | | | |
| bash-gaming (coordinator) | 306.1K | 205.3K | 100.8K | 700 |
| @bash-gaming/build | 2.2M | 2.1M | 100K | 13.0K |
| bash-gaming (coordinator) | 416.1K | 394.2K | 21.9K | 695 |

All runs used Kimi K2.5 ($0.60 input / $0.10 cached / $3.00 output per 1M tokens). Total: ~$3.43

View execution history:

cd bash-gaming/fireworks-kimi-k2p5 && npx perstack log --job iaitgzq7vdn92fwmu16pzm18

Fireworks — MiniMax M2.5

MiniMax M2.5 screenshot

Run with perstack@0.0.136. MiniMax M2.5 ignored the expert instructions — it produced a single-file browser-based HTML game (labyrinth.html) instead of an npx-installable CLI game with Ink TUI and AI mode. No npm package structure, no TypeScript, no tests. The coordinator delegated to plan and verify but skipped @bash-gaming/build entirely — the HTML file was written directly during the plan phase.

Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
| --- | --- | --- | --- | --- |
| bash-gaming (coordinator) | 14.6K | 10.1K | 4.5K | 1.5K |
| @bash-gaming/plan | 172.9K | 111.2K | 61.7K | 29.3K |
| bash-gaming (coordinator) | 3.4K | 2.3K | 1.1K | 346 |
| @bash-gaming/verify | 755.9K | 663.1K | 92.8K | 5.5K |
| bash-gaming (coordinator) | 66.7K | 57.6K | 9.1K | 2.9K |

All runs used MiniMax M2.5 ($0.30 input / $0.03 cached / $1.20 output per 1M tokens). Total: ~$0.13
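Since every run used a single model, the total is a one-liner. Summing the table rows gives roughly 169K uncached input, 844K cached input, and 40K output tokens (my own sums of the rounded table values; the last cent differs from the reported ~$0.13 because of that rounding):

```shell
# MiniMax M2.5 pricing per 1M tokens: $0.30 uncached input, $0.03 cached input, $1.20 output.
# Token totals (in millions) summed from the per-run table; values are rounded.
cost=$(awk 'BEGIN { printf "%.2f", 0.1692 * 0.30 + 0.8443 * 0.03 + 0.0395 * 1.20 }')
echo "\$${cost}"   # prints $0.12, within a cent of the reported ~$0.13
```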

View execution history:

cd bash-gaming/fireworks-minimax-m2p5 && npx perstack log --job pxpbam5i9zliguib8ivk2itw
