Demo repository for Perstack. Runs the same task across different providers and compares results.
Game generation demo using the bash-gaming expert defined in perstack.toml — a multi-agent team that autonomously designs, implements, tests, and packages CLI games.
- Definition: `perstack.toml`
- Providers and models:
  - Anthropic: Opus 4.6, Sonnet 4.6
  - OpenAI: GPT-5.4, GPT-5 mini
  - Google: Gemini 3.1 Pro, Gemini 3 Flash
  - Fireworks: Kimi K2.5, MiniMax M2.5
```
bash-gaming (coordinator)
├── @bash-gaming/plan
├── @bash-gaming/build
└── @bash-gaming/verify
```
Each expert has a defaultModelTier (high or middle). Perstack automatically selects the appropriate model within the provider based on this tier. The --model flag sets the base model; delegates may use a different model from the same provider depending on their tier.
| Expert | Model Tier | Role |
|---|---|---|
| bash-gaming (coordinator) | high | Coordinates the entire task and delegates to the appropriate experts. |
| @bash-gaming/plan | middle | Expands requirements, defines dual-mode API contract, npm package structure. |
| @bash-gaming/build | high | Implements Ink+React TUI, --ai JSON mode, game logic, npm packaging. |
| @bash-gaming/verify | middle | Validates npx runnability, AI mode deterministic output, TUI playthrough tests. |
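The tier-to-model mapping described above can be sketched as follows. This is illustrative, not Perstack's actual implementation: the function name and data shape are assumptions, while the provider/model pairs come from the list in this README (the Fireworks runs each used a single model for both tiers).

```typescript
// Hypothetical sketch of tier-based model selection: the --model flag fixes
// the provider's base model, and each delegate's defaultModelTier picks the
// concrete model within that provider.
type Tier = "high" | "middle";

const modelTiers: Record<string, Record<Tier, string>> = {
  anthropic: { high: "opus-4.6", middle: "sonnet-4.6" },
  openai: { high: "gpt-5.4", middle: "gpt-5-mini" },
  google: { high: "gemini-3.1-pro", middle: "gemini-3-flash" },
  // Fireworks runs used one model for every tier, e.g. Kimi K2.5:
  "fireworks-kimi": { high: "kimi-k2.5", middle: "kimi-k2.5" },
};

// Given the provider chosen via --provider and an expert's defaultModelTier,
// return the model that expert runs on.
function selectModel(provider: string, tier: Tier): string {
  return modelTiers[provider][tier];
}

console.log(selectModel("anthropic", "middle")); // sonnet-4.6
```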
Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item.
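The death penalty in the prompt is the kind of mechanic the build expert has to implement. A minimal sketch (not taken from any of the generated games; all names are illustrative):

```typescript
// Death rule from the task prompt: dying in the dungeon removes
// one random equipped item from the character.
interface Item {
  name: string;
}

interface Character {
  equipped: Item[];
}

// Remove and return one random equipped item; null if nothing is equipped.
function loseRandomEquippedItem(character: Character): Item | null {
  if (character.equipped.length === 0) return null;
  const index = Math.floor(Math.random() * character.equipped.length);
  const [lost] = character.equipped.splice(index, 1);
  return lost;
}
```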
Replace `--provider`, `--model`, and the result directory for each run.
```sh
docker run --pull always --rm -it \
  --env-file .env \
  -v ./<result-dir>:/workspace \
  -v ./bash-gaming/perstack.toml:/definitions/perstack.toml:ro \
  perstack/perstack start bash-gaming \
    --config /definitions/perstack.toml \
    --provider <provider> \
    --model <model> \
    "Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item."
```

For evaluation, I focused on just three things:
- Does the expert adhere to my instructions?
- Is the outcome verified and actually working? (`npx .` in the result directory)
- Is the API cost affordable?
Why these three? Because even if the harness architecture is solid, an agent needs to be evaluated on instruction adherence, minimum quality assurance, and cost efficiency.
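The second criterion leans on the dual-mode contract the plan expert defines: the same binary renders an interactive TUI by default and emits machine-readable, deterministic JSON under `--ai`, so the verify expert can assert on its output. A minimal hypothetical sketch of that contract (names and state shape are mine, not from the generated games):

```typescript
// Dual-mode entry point: --ai prints one JSON document and exits;
// otherwise the interactive TUI would launch.
interface GameState {
  floor: number;
  hp: number;
}

function initialState(): GameState {
  // Fixed starting state keeps --ai output deterministic and testable.
  return { floor: 1, hp: 20 };
}

function main(argv: string[]): string {
  if (argv.includes("--ai")) {
    // AI mode: machine-readable, no terminal UI.
    return JSON.stringify(initialState());
  }
  // TUI mode would mount the Ink+React app here instead.
  return "launching TUI...";
}

console.log(main(process.argv.slice(2)));
```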
| Provider | Models Used | Adherence | Works | Directory | Steps | Duration | Input Tokens | Cached | Output Tokens | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Opus 4.6, Sonnet 4.6 | ✅ | ✅ | anthropic/ | 173 | 51m 18s | 13.8M | 13.3M (96.07%) | 213.9K | $15.24 |
| Fireworks | Kimi K2.5 | ✅ | ✅ | fireworks-kimi-k2p5/ | 324 | 1h 46m | 20.6M | 19.0M (92.13%) | 189.1K | ~$3.43 |
| Google | Gemini 3.1 Pro, Gemini 3 Flash | ✅ | ✅ | google/ | 163 | 16m 31s | 2.9M | 2.1M (72.69%) | 46.2K | ~$1.76 |
| OpenAI | GPT-5.4, GPT-5 mini | ❌ | ✅ | openai/ | 118 | 12m 24s | 2.4M | 2.2M (89.56%) | 42.6K | ~$1.80 |
| Fireworks | MiniMax M2.5 | ❌ | ❌ | fireworks-minimax-m2p5/ | 59 | 5m 49s | 1.0M | 844.4K (83.31%) | 39.7K | ~$0.13 |
- 3 out of 5 providers followed the full plan → build → verify pipeline and produced verified working output, with no provider-specific tuning. The topology was defined once and ran as-is.
- Claude produced the richest output with flawless instruction adherence. It also achieved the highest cache hit rate (96%) among all providers, but pricing still pushed the total to more than 4× the next most expensive run and roughly 8× the Gemini and GPT runs.
- Kimi K2.5 produced excellent output at $3.43 and was the most faithful to delegation. I'd rank it well above GPT and Gemini in both instruction adherence and quality.
- Gemini followed the full pipeline and produced a verified working game. But it's buggier than GPT's output and almost unplayable.
- GPT was the fastest and cheapest, but skipped the verify step entirely. It called build three times instead of following the pipeline.
- MiniMax M2.5 ignored instructions entirely and made a browser-based HTML game. Instruction adherence is a challenge, but the newest version, M2.7, was recently announced with adherence improvements, so I'm looking forward to it.
The full execution logs for every run are in the repo, so you can see exactly what each model did and reproduce it yourself.
```sh
cd bash-gaming/opus-4-6-anthropic && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan and verify to Sonnet 4.6 (middle tier), and build to Opus 4.6 (high tier). High instruction adherence — produced a fully functional Ink TUI dungeon crawler with all requested features.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Opus 4.6 | 24.7K | — | 24.7K | 2.1K |
| @bash-gaming/plan | Sonnet 4.6 | 7.1M | 6.9M | 200K | 142.6K |
| bash-gaming (coordinator) | Opus 4.6 | 8.4K | — | 8.4K | 11.8K |
| @bash-gaming/build | Opus 4.6 | 4.6M | 4.5M | 100K | 36.1K |
| bash-gaming (coordinator) | Opus 4.6 | 90.0K | 65.8K | 24.2K | 1.0K |
| @bash-gaming/verify | Sonnet 4.6 | 1.6M | 1.5M | 100K | 16.8K |
| bash-gaming (coordinator) | Opus 4.6 | 370.8K | 351.8K | 19.0K | 3.5K |
By model:
| Model | Cost |
|---|---|
| Opus 4.6 | $9.22 |
| Sonnet 4.6 | $6.02 |
| Total | $15.24 |
View execution history:
```sh
cd bash-gaming/opus-4-6-anthropic && npx perstack log --job omd3cvzndvtvbpma0tut38ac
```

```sh
cd bash-gaming/gpt-5-4-openai && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan to GPT-5 mini (middle tier). However, it never delegated to @bash-gaming/verify — instead it called @bash-gaming/build three times, bypassing the defined plan→build→verify pipeline. The result is functional, but the expert topology was not fully followed.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | GPT-5.4 | 5.8K | 1.2K | 4.6K | 541 |
| @bash-gaming/plan | GPT-5 mini | 3.9K | 1.0K | 2.9K | 4.6K |
| bash-gaming (coordinator) | GPT-5.4 | 5.3K | — | 5.3K | 209 |
| @bash-gaming/build | GPT-5.4 | 1.6M | 1.5M | 100K | 22.5K |
| bash-gaming (coordinator) | GPT-5.4 | 85.9K | 46.6K | 39.3K | 1.2K |
| @bash-gaming/build | GPT-5.4 | 503.1K | 427.5K | 75.6K | 9.3K |
| bash-gaming (coordinator) | GPT-5.4 | 59.0K | 54.3K | 4.7K | 375 |
| @bash-gaming/build | GPT-5.4 | 99.8K | 82.4K | 17.4K | 3.0K |
| bash-gaming (coordinator) | GPT-5.4 | 99.3K | 94.8K | 4.5K | 892 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| GPT-5.4 | $2.50 / $0.25 / $15.00 | ~$1.75 |
| GPT-5 mini | $0.25 / $0.025 / $2.00 | < $0.01 |
| Total | | ~$1.80 |
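The totals above follow from a simple formula: cost = uncached·p_in + cached·p_cached + output·p_out, with prices per 1M tokens. A sketch that reproduces the ~$1.75 GPT-5.4 figure from the pricing table above (the token totals are my sums over the per-run rows):

```typescript
// Cost of a run given token counts (in millions) and per-1M-token prices.
function runCost(
  uncachedM: number,
  cachedM: number,
  outputM: number,
  pIn: number,
  pCached: number,
  pOut: number,
): number {
  return uncachedM * pIn + cachedM * pCached + outputM * pOut;
}

// GPT-5.4 totals summed from the per-run table:
// ~0.251M uncached input, ~2.207M cached input, ~0.038M output.
const gpt54 = runCost(0.251, 2.207, 0.038, 2.5, 0.25, 15.0);
console.log(gpt54.toFixed(2)); // 1.75
```

Cached input is billed at a tenth of the uncached price here, which is why the high cache hit rates in these runs matter so much for the final bill.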
View execution history:
```sh
cd bash-gaming/gpt-5-4-openai && npx perstack log --job cbtj0h7h3hm6o92m12awertf
```

```sh
cd bash-gaming/google/wizardry-crawler && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator correctly followed the plan→build→verify pipeline, routing plan and verify to Gemini 3 Flash (middle tier). Functional but buggy — the game runs and is playable, but exhibits noticeable gameplay issues.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Gemini 3.1 Pro | 1.1K | — | 1.1K | 354 |
| @bash-gaming/plan | Gemini 3 Flash | 394.9K | 229.7K | 165.2K | 12.5K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 12.2K | — | 12.2K | 531 |
| @bash-gaming/build | Gemini 3.1 Pro | 492.2K | 379.8K | 112.4K | 12.0K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 3.9K | — | 3.9K | 48 |
| @bash-gaming/verify | Gemini 3 Flash | 289.0K | 60.0K | 229.0K | 4.4K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 18.4K | — | 18.4K | 347 |
| @bash-gaming/build | Gemini 3.1 Pro | 1.6M | 1.4M | 200K | 15.1K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 46.1K | — | 46.1K | 952 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / $0.20 / $12.00 | ~$1.50 |
| Gemini 3 Flash Preview | $0.50 / $0.05 / $3.00 | ~$0.26 |
| Total | | ~$1.76 |
View execution history:
```sh
cd bash-gaming/google/wizardry-crawler && npx perstack log --job j3oa726f86kyqyqc75nekbbu
```

```sh
cd bash-gaming/fireworks-kimi-k2p5 && npm install && npx .
```

Run with perstack@0.0.136. Kimi K2.5 performed micro-agent orchestration as expected, leveraging delegation across the expert topology to design, implement, test, and iteratively improve the deliverables. Two follow-up requests were made to address environment-specific issues (the game was functional within the harness).
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 952 | — | 952 | 504 |
| @bash-gaming/plan | 955.8K | 794.1K | 161.7K | 69.5K |
| bash-gaming (coordinator) | 2.5K | 512 | 2.0K | 432 |
| @bash-gaming/build | 12.2M | 11.7M | 500K | 83.1K |
| bash-gaming (coordinator) | 3.9K | 2.0K | 1.9K | 307 |
| @bash-gaming/verify | 1.8M | 1.6M | 200K | 8.6K |
| bash-gaming (coordinator) | 56.9K | 26.1K | 30.8K | 1.3K |
| (follow-up 1) | — | — | — | — |
| bash-gaming (coordinator) | 57.9K | 26.6K | 31.3K | 566 |
| @bash-gaming/build | 882.5K | 710.7K | 171.8K | 6.9K |
| bash-gaming (coordinator) | 1.7M | 1.5M | 200K | 3.6K |
| (follow-up 2) | — | — | — | — |
| bash-gaming (coordinator) | 306.1K | 205.3K | 100.8K | 700 |
| @bash-gaming/build | 2.2M | 2.1M | 100K | 13.0K |
| bash-gaming (coordinator) | 416.1K | 394.2K | 21.9K | 695 |
All runs used Kimi K2.5 ($0.60 / $0.10 / $3.00 per 1M). Total: ~$3.43.
View execution history:
```sh
cd bash-gaming/fireworks-kimi-k2p5 && npx perstack log --job iaitgzq7vdn92fwmu16pzm18
```

Run with perstack@0.0.136. MiniMax M2.5 ignored the expert instructions — it produced a single-file browser-based HTML game (labyrinth.html) instead of an npx-installable CLI game with Ink TUI and AI mode. No npm package structure, no TypeScript, no tests. The coordinator delegated to plan and verify but skipped @bash-gaming/build entirely — the HTML file was written directly during the plan phase.
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 14.6K | 10.1K | 4.5K | 1.5K |
| @bash-gaming/plan | 172.9K | 111.2K | 61.7K | 29.3K |
| bash-gaming (coordinator) | 3.4K | 2.3K | 1.1K | 346 |
| @bash-gaming/verify | 755.9K | 663.1K | 92.8K | 5.5K |
| bash-gaming (coordinator) | 66.7K | 57.6K | 9.1K | 2.9K |
All runs used MiniMax M2.5 ($0.30 / $0.03 / $1.20 per 1M). Total: ~$0.13.
View execution history:
```sh
cd bash-gaming/fireworks-minimax-m2p5 && npx perstack log --job pxpbam5i9zliguib8ivk2itw
```





