Demo repository for Perstack. Runs the same task across different providers and compares results.
Game generation demo using the bash-gaming expert defined in perstack.toml — a multi-agent team that autonomously designs, implements, tests, and packages CLI games.
- Definition: `perstack.toml`
- Providers and models:
  - Anthropic: Opus 4.6, Sonnet 4.6
  - OpenAI: GPT-5.4, GPT-5 mini
  - Google: Gemini 3.1 Pro, Gemini 3 Flash
  - Fireworks: Kimi K2.5, MiniMax M2.5
```
bash-gaming (coordinator)
├── @bash-gaming/plan
├── @bash-gaming/build
└── @bash-gaming/verify
```
Each expert has a defaultModelTier (high or middle). Perstack automatically selects the appropriate model within the provider based on this tier. The --model flag sets the base model; delegates may use a different model from the same provider depending on their tier.
| Expert | Model Tier | Role |
|---|---|---|
| bash-gaming (coordinator) | high | Coordinates the entire task and delegates to the appropriate experts. |
| @bash-gaming/plan | middle | Expands requirements, defines dual-mode API contract, npm package structure. |
| @bash-gaming/build | high | Implements Ink+React TUI, --ai JSON mode, game logic, npm packaging. |
| @bash-gaming/verify | middle | Validates npx runnability, AI mode deterministic output, TUI playthrough tests. |
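The tier-to-model mapping described above can be sketched as follows. This is illustrative, not Perstack's actual implementation: the function name and data shape are assumptions, while the provider/model pairs come from the list in this README (the Fireworks runs each used a single model for both tiers).

```typescript
// Hypothetical sketch of tier-based model selection: the --model flag fixes
// the provider's base model, and each delegate's defaultModelTier picks the
// concrete model within that provider.
type Tier = "high" | "middle";

const modelTiers: Record<string, Record<Tier, string>> = {
  anthropic: { high: "opus-4.6", middle: "sonnet-4.6" },
  openai: { high: "gpt-5.4", middle: "gpt-5-mini" },
  google: { high: "gemini-3.1-pro", middle: "gemini-3-flash" },
  // Fireworks runs used one model for every tier, e.g. Kimi K2.5:
  "fireworks-kimi": { high: "kimi-k2.5", middle: "kimi-k2.5" },
};

// Given the provider chosen via --provider and an expert's defaultModelTier,
// return the model that expert runs on.
function selectModel(provider: string, tier: Tier): string {
  return modelTiers[provider][tier];
}

console.log(selectModel("anthropic", "middle")); // sonnet-4.6
```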
Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item.
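The death penalty in the prompt is the kind of mechanic the build expert has to implement. A minimal sketch (not taken from any of the generated games; all names are illustrative):

```typescript
// Death rule from the task prompt: dying in the dungeon removes
// one random equipped item from the character.
interface Item {
  name: string;
}

interface Character {
  equipped: Item[];
}

// Remove and return one random equipped item; null if nothing is equipped.
function loseRandomEquippedItem(character: Character): Item | null {
  if (character.equipped.length === 0) return null;
  const index = Math.floor(Math.random() * character.equipped.length);
  const [lost] = character.equipped.splice(index, 1);
  return lost;
}
```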
Replace `--provider`, `--model`, and the result directory for each run.
```sh
docker run --pull always --rm -it \
  --env-file .env \
  -v ./<result-dir>:/workspace \
  -v ./bash-gaming/perstack.toml:/definitions/perstack.toml:ro \
  perstack/perstack start bash-gaming \
    --config /definitions/perstack.toml \
    --provider <provider> \
    --model <model> \
    "Create a Wizardry-like dungeon crawler in a fixed 10-floor labyrinth with complex layouts, traps, fixed room encounters, and random battles. Include special-effect gear drops, leveling, and a skill tree for one playable character. Balance difficulty around build optimization. Death in the dungeon causes loss of one random equipped item."
```

For evaluation, I focused on just three things:
- Does the expert adhere to my instructions?
- Is the outcome verified and actually working? (`npx .` in the result directory)
- Is the API cost affordable?
Why these three? Because even if the harness architecture is solid, an agent needs to be evaluated on instruction adherence, minimum quality assurance, and cost efficiency.
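The second criterion leans on the dual-mode contract the plan expert defines: the same binary renders an interactive TUI by default and emits machine-readable, deterministic JSON under `--ai`, so the verify expert can assert on its output. A minimal hypothetical sketch of that contract (names and state shape are mine, not from the generated games):

```typescript
// Dual-mode entry point: --ai prints one JSON document and exits;
// otherwise the interactive TUI would launch.
interface GameState {
  floor: number;
  hp: number;
}

function initialState(): GameState {
  // Fixed starting state keeps --ai output deterministic and testable.
  return { floor: 1, hp: 20 };
}

function main(argv: string[]): string {
  if (argv.includes("--ai")) {
    // AI mode: machine-readable, no terminal UI.
    return JSON.stringify(initialState());
  }
  // TUI mode would mount the Ink+React app here instead.
  return "launching TUI...";
}

console.log(main(process.argv.slice(2)));
```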
| Provider | Models Used | Adherence | Works | Directory | Steps | Duration | Input Tokens | Cached | Output Tokens | Cost |
|---|---|---|---|---|---|---|---|---|---|---|
| Anthropic | Opus 4.6, Sonnet 4.6 | ✅ | ✅ | anthropic/ | 173 | 51m 18s | 13.8M | 13.3M (96.07%) | 213.9K | $15.24 |
| Fireworks | Kimi K2.5 | ✅ | ✅ | fireworks-kimi-k2p5/ | 324 | 1h 46m | 20.6M | 19.0M (92.13%) | 189.1K | ~$3.43 |
| Google | Gemini 3.1 Pro, Gemini 3 Flash | ✅ | ✅ | google/ | 163 | 16m 31s | 2.9M | 2.1M (72.69%) | 46.2K | ~$1.76 |
| OpenAI | GPT-5.4, GPT-5 mini | ❌ | ✅ | openai/ | 118 | 12m 24s | 2.4M | 2.2M (89.56%) | 42.6K | ~$1.80 |
| Fireworks | MiniMax M2.5 | ❌ | ❌ | fireworks-minimax-m2p5/ | 59 | 5m 49s | 1.0M | 844.4K (83.31%) | 39.7K | ~$0.13 |
- 3 out of 5 providers followed the full plan → build → verify pipeline and produced verified working output, with no provider-specific tuning. The topology was defined once and ran as-is.
- Claude produced the richest output with flawless instruction adherence. It also achieved the highest cache hit rate (96%) among all providers, but pricing still pushed the total to more than 4× the next most expensive run and roughly 8× the Gemini and GPT runs.
- Kimi K2.5 produced excellent output at $3.43 and was the most faithful to delegation. I'd rank it well above GPT and Gemini in both instruction adherence and quality.
- Gemini followed the full pipeline and produced a verified working game. But it's buggier than GPT's output and almost unplayable.
- GPT was the fastest and cheapest, but skipped the verify step entirely. It called build three times instead of following the pipeline.
- MiniMax M2.5 ignored instructions entirely and made a browser-based HTML game. Instruction adherence is a challenge, but the newest version, M2.7, was recently announced with adherence improvements, so I'm looking forward to it.
The full execution logs for every run are in the repo, so you can see exactly what each model did and reproduce it yourself.
```sh
cd bash-gaming/opus-4-6-anthropic && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan and verify to Sonnet 4.6 (middle tier), and build to Opus 4.6 (high tier). High instruction adherence — produced a fully functional Ink TUI dungeon crawler with all requested features.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Opus 4.6 | 24.7K | — | 24.7K | 2.1K |
| @bash-gaming/plan | Sonnet 4.6 | 7.1M | 6.9M | 200K | 142.6K |
| bash-gaming (coordinator) | Opus 4.6 | 8.4K | — | 8.4K | 11.8K |
| @bash-gaming/build | Opus 4.6 | 4.6M | 4.5M | 100K | 36.1K |
| bash-gaming (coordinator) | Opus 4.6 | 90.0K | 65.8K | 24.2K | 1.0K |
| @bash-gaming/verify | Sonnet 4.6 | 1.6M | 1.5M | 100K | 16.8K |
| bash-gaming (coordinator) | Opus 4.6 | 370.8K | 351.8K | 19.0K | 3.5K |
By model:
| Model | Cost |
|---|---|
| Opus 4.6 | $9.22 |
| Sonnet 4.6 | $6.02 |
| Total | $15.24 |
View execution history:
```sh
cd bash-gaming/opus-4-6-anthropic && npx perstack log --job omd3cvzndvtvbpma0tut38ac
```

```sh
cd bash-gaming/gpt-5-4-openai && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator routed plan to GPT-5 mini (middle tier). However, it never delegated to @bash-gaming/verify — instead it called @bash-gaming/build three times, bypassing the defined plan→build→verify pipeline. The result is functional, but the expert topology was not fully followed.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | GPT-5.4 | 5.8K | 1.2K | 4.6K | 541 |
| @bash-gaming/plan | GPT-5 mini | 3.9K | 1.0K | 2.9K | 4.6K |
| bash-gaming (coordinator) | GPT-5.4 | 5.3K | — | 5.3K | 209 |
| @bash-gaming/build | GPT-5.4 | 1.6M | 1.5M | 100K | 22.5K |
| bash-gaming (coordinator) | GPT-5.4 | 85.9K | 46.6K | 39.3K | 1.2K |
| @bash-gaming/build | GPT-5.4 | 503.1K | 427.5K | 75.6K | 9.3K |
| bash-gaming (coordinator) | GPT-5.4 | 59.0K | 54.3K | 4.7K | 375 |
| @bash-gaming/build | GPT-5.4 | 99.8K | 82.4K | 17.4K | 3.0K |
| bash-gaming (coordinator) | GPT-5.4 | 99.3K | 94.8K | 4.5K | 892 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| GPT-5.4 | $2.50 / $0.25 / $15.00 | ~$1.75 |
| GPT-5 mini | $0.25 / $0.025 / $2.00 | < $0.01 |
| Total | | ~$1.80 |
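The totals above follow from a simple formula: cost = uncached·p_in + cached·p_cached + output·p_out, with prices per 1M tokens. A sketch that reproduces the ~$1.75 GPT-5.4 figure from the pricing table above (the token totals are my sums over the per-run rows):

```typescript
// Cost of a run given token counts (in millions) and per-1M-token prices.
function runCost(
  uncachedM: number,
  cachedM: number,
  outputM: number,
  pIn: number,
  pCached: number,
  pOut: number,
): number {
  return uncachedM * pIn + cachedM * pCached + outputM * pOut;
}

// GPT-5.4 totals summed from the per-run table:
// ~0.251M uncached input, ~2.207M cached input, ~0.038M output.
const gpt54 = runCost(0.251, 2.207, 0.038, 2.5, 0.25, 15.0);
console.log(gpt54.toFixed(2)); // 1.75
```

Cached input is billed at a tenth of the uncached price here, which is why the high cache hit rates in these runs matter so much for the final bill.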
View execution history:
```sh
cd bash-gaming/gpt-5-4-openai && npx perstack log --job cbtj0h7h3hm6o92m12awertf
```

```sh
cd bash-gaming/google/wizardry-crawler && npm install && npx .
```

Run with perstack@0.0.136. Completed in a single query with no follow-up requests. The coordinator correctly followed the plan→build→verify pipeline, routing plan and verify to Gemini 3 Flash (middle tier). Functional but buggy — the game runs and is playable, but exhibits noticeable gameplay issues.
Cost breakdown by run
| Run | Model | Input | Cached | Uncached | Output |
|---|---|---|---|---|---|
| bash-gaming (coordinator) | Gemini 3.1 Pro | 1.1K | — | 1.1K | 354 |
| @bash-gaming/plan | Gemini 3 Flash | 394.9K | 229.7K | 165.2K | 12.5K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 12.2K | — | 12.2K | 531 |
| @bash-gaming/build | Gemini 3.1 Pro | 492.2K | 379.8K | 112.4K | 12.0K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 3.9K | — | 3.9K | 48 |
| @bash-gaming/verify | Gemini 3 Flash | 289.0K | 60.0K | 229.0K | 4.4K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 18.4K | — | 18.4K | 347 |
| @bash-gaming/build | Gemini 3.1 Pro | 1.6M | 1.4M | 200K | 15.1K |
| bash-gaming (coordinator) | Gemini 3.1 Pro | 46.1K | — | 46.1K | 952 |
By model:
| Model | Pricing (input / cached / output per 1M) | Cost |
|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / $0.20 / $12.00 | ~$1.50 |
| Gemini 3 Flash Preview | $0.50 / $0.05 / $3.00 | ~$0.26 |
| Total | | ~$1.76 |
View execution history:
```sh
cd bash-gaming/google/wizardry-crawler && npx perstack log --job j3oa726f86kyqyqc75nekbbu
```

```sh
cd bash-gaming/fireworks-kimi-k2p5 && npm install && npx .
```

Run with perstack@0.0.136. Kimi K2.5 performed micro-agent orchestration as expected, leveraging delegation across the expert topology to design, implement, test, and iteratively improve the deliverables. Two follow-up requests were made to address environment-specific issues (the game was functional within the harness).
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 952 | — | 952 | 504 |
| @bash-gaming/plan | 955.8K | 794.1K | 161.7K | 69.5K |
| bash-gaming (coordinator) | 2.5K | 512 | 2.0K | 432 |
| @bash-gaming/build | 12.2M | 11.7M | 500K | 83.1K |
| bash-gaming (coordinator) | 3.9K | 2.0K | 1.9K | 307 |
| @bash-gaming/verify | 1.8M | 1.6M | 200K | 8.6K |
| bash-gaming (coordinator) | 56.9K | 26.1K | 30.8K | 1.3K |
| (follow-up 1) | — | — | — | — |
| bash-gaming (coordinator) | 57.9K | 26.6K | 31.3K | 566 |
| @bash-gaming/build | 882.5K | 710.7K | 171.8K | 6.9K |
| bash-gaming (coordinator) | 1.7M | 1.5M | 200K | 3.6K |
| (follow-up 2) | — | — | — | — |
| bash-gaming (coordinator) | 306.1K | 205.3K | 100.8K | 700 |
| @bash-gaming/build | 2.2M | 2.1M | 100K | 13.0K |
| bash-gaming (coordinator) | 416.1K | 394.2K | 21.9K | 695 |
All runs used Kimi K2.5 ($0.60 / $0.10 / $3.00 per 1M). Total: ~$3.43.
View execution history:
```sh
cd bash-gaming/fireworks-kimi-k2p5 && npx perstack log --job iaitgzq7vdn92fwmu16pzm18
```

Run with perstack@0.0.136. MiniMax M2.5 ignored the expert instructions — it produced a single-file browser-based HTML game (labyrinth.html) instead of an npx-installable CLI game with Ink TUI and AI mode. No npm package structure, no TypeScript, no tests. The coordinator delegated to plan and verify but skipped @bash-gaming/build entirely — the HTML file was written directly during the plan phase.
Cost breakdown by run
| Run | Input | Cached | Uncached | Output |
|---|---|---|---|---|
| bash-gaming (coordinator) | 14.6K | 10.1K | 4.5K | 1.5K |
| @bash-gaming/plan | 172.9K | 111.2K | 61.7K | 29.3K |
| bash-gaming (coordinator) | 3.4K | 2.3K | 1.1K | 346 |
| @bash-gaming/verify | 755.9K | 663.1K | 92.8K | 5.5K |
| bash-gaming (coordinator) | 66.7K | 57.6K | 9.1K | 2.9K |
All runs used MiniMax M2.5 ($0.30 / $0.03 / $1.20 per 1M). Total: ~$0.13.
View execution history:
```sh
cd bash-gaming/fireworks-minimax-m2p5 && npx perstack log --job pxpbam5i9zliguib8ivk2itw
```





