A web platform for playing ARC-AGI-3 games — interactive reasoning puzzles on a 64×64 pixel grid with 16 colours. No instructions are given; players (human or AI) must discover the rules, controls, and goals by experimenting.
Live at arc3.sonpham.net
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # fill in API keys you want to use
python server/app.py # local dev serverOpen http://localhost:5000 in your browser to play.
python agent.py --game ls20 # play one game
python agent.py --model gemini-2.5-flash --game ft09 # override model
python agent.py --max-steps 400 # custom step limit
python agent.py --list-models # check available models
python agent.py --show-config # print resolved configpython batch_runner.py --games ls20 --concurrency 1 --max-steps 10 # smoke test
python batch_runner.py --games all --concurrency 4 # all games
python batch_runner.py --games fd01,ft09 --repeat 3 --concurrency 4 # specific games
python batch_runner.py --resume <batch_id> # resume interruptedAll game-playing logic runs in the browser: game environment execution (Pyodide or server-proxied), LLM calls (BYOK / Puter.js / Copilot), REPL / code execution, agent memory, and scaffolding logic. The server is stateless except for auth and session persistence.
Entry points:
server/app.py— Flask application (63 routes)agent.py— Autonomous CLI agentbatch_runner.py— Batch game runner
Service layer (server/services/):
auth_service.py— Magic link, Google OAuth, Copilot auth, API keyssession_service.py— Session resume, branch, import, OBS eventsgame_service.py— Game start, step, reset, undosocial_service.py— Comments, leaderboard, contributorsllm_admin_service.py— Model listing, BYOK key management
Database layer:
db.py— Connection pooling, schema init, migrationsdb_sessions.py— Session CRUDdb_auth.py— Users, tokens, magic linksdb_llm.py— LLM call loggingdb_tools.py— Tool execution loggingdb_exports.py— File export/import
LLM providers:
llm_providers.py— Router: model ID → provider callllm_providers_openai.py— OpenAI + LM Studio (OpenAI-compatible)llm_providers_anthropic.py— Anthropic Claudellm_providers_google.py— Google Geminillm_providers_copilot.py— GitHub Copilot (device flow)
Agent modules:
agent.py— Game-playing orchestratoragent_llm.py— LLM decision logicagent_response_parsing.py— Parse LLM responses → actionsagent_history.py— Action history management
Infrastructure:
models.py— Model registry (42 models across 10 providers)constants.py— Shared constantsexceptions.py— Structured error classes
Global-scope scripts loaded via <script> tags. Load order matters.
Core: state.js, engine.js, reasoning.js
UI: ui.js, ui-grid.js, ui-models.js, ui-tabs.js, ui-tokens.js
LLM: llm.js, llm-config.js, llm-controls.js, llm-executor.js, llm-reasoning.js, llm-timeline.js
Scaffolding: scaffolding.js, scaffolding-linear.js, scaffolding-rlm.js, scaffolding-three-system.js, scaffolding-agent-spawn.js, scaffolding-world-model.js
Session: session.js, session-storage.js, session-persistence.js, session-replay.js, session-views.js, session-views-grid.js, session-views-history.js
Observatory: observatory.js, obs-page.js, obs-scrubber.js, obs-session-loader.js, obs-swimlane.js
Human play: human.js, human-game.js, human-input.js, human-render.js, human-session.js, human-social.js
Other: arena.js, leaderboard.js, memory-inspector.js, share-page.js, draw-page.js, dev.js
templates/index.html— Main SPA (game player + agent UI)templates/obs.html— Observatory (session replay viewer)templates/share.html— Public share/replay pagetemplates/arena.html— Game arenatemplates/draw.html— Grid drawing tooltemplates/ab01.html— Antibody standalone page
The CLI agent is configured via config.yaml with four blocks:
| Setting | Default | Effect |
|---|---|---|
full_grid |
true |
Full RLE-compressed 64×64 grid |
change_map |
true |
Cells changed since last action |
color_histogram |
false |
Count of each colour |
region_map |
false |
Connected-component regions (BFS flood-fill) |
history_length |
0 |
Recent moves in prompt (0 = all) |
reasoning_trace |
false |
Include LLM reasoning in history |
max_context_tokens |
100000 |
Token budget for context |
memory_injection |
true |
Inject hard-memory facts |
memory_injection_max_chars |
1500 |
Max chars of memory to inject |
| Setting | Default | Effect |
|---|---|---|
executor_model |
gemini-2.5-flash |
Main action model |
condenser_model |
null |
History condensation model (null = reuse executor) |
reflector_model |
null |
Post-game reflection model (null = reuse executor) |
temperature |
0.3 |
Sampling temperature |
max_tokens |
2048 |
Max output tokens |
planning_horizon |
5 |
Max actions per LLM call |
reflection_max_tokens |
1024 |
Max tokens for reflection passes |
planner_model |
null |
Scaffolding planner model |
monitor_model |
null |
Scaffolding monitor model |
world_model_model |
null |
Scaffolding world model |
| Setting | Default | Effect |
|---|---|---|
hard_memory_file |
memory/MEMORY.md |
Cross-session persistent facts |
session_log_file |
memory/sessions.json |
Structured session log |
allow_inline_memory_writes |
true |
Agent can save facts mid-game |
reflect_after_game |
true |
Reflection pass after each game |
condense_every |
0 |
Summarise history every N steps (0 = off) |
condense_threshold |
0 |
Force condensation above N entries (0 = off) |
| Setting | Default | Effect |
|---|---|---|
mode |
single |
single or three_system |
planner_max_turns |
10 |
Max REPL iterations for planner |
world_model_max_turns |
5 |
Max REPL iterations for world model |
world_model_update_every |
5 |
Steps between world model updates |
max_plan_length |
15 |
Maximum plan steps |
min_plan_length |
3 |
Minimum plan steps |
| Provider | Example model | Env key | Cost |
|---|---|---|---|
| Anthropic | claude-sonnet-4-6 |
ANTHROPIC_API_KEY |
Paid |
| Gemini | gemini-2.5-flash |
GEMINI_API_KEY |
~Free |
| OpenAI | openai/o4-mini |
OPENAI_API_KEY |
Paid |
| Groq | groq/llama-3.3-70b-versatile |
GROQ_API_KEY |
Free |
| Mistral | mistral/mistral-small-latest |
MISTRAL_API_KEY |
Free |
| Cloudflare | cf/llama-3.3-70b-instruct |
CLOUDFLARE_API_KEY + CLOUDFLARE_ACCOUNT_ID |
Free |
| HuggingFace | hf/qwen2.5-72b-instruct |
HUGGINGFACE_API_KEY |
Free |
| Copilot | copilot/gpt-4.1 |
None (OAuth) | Free w/ subscription |
| Puter | puter/gpt-4o |
None (client-side) | Free |
| LM Studio | lmstudio/qwen3.5-35b |
None (local) | Free |
| Ollama | ollama/llama3.1 |
None (local) | Free |
Full model list: see models.py MODEL_REGISTRY or run python agent.py --list-models.
External harnesses can stream game sessions to arc3.sonpham.net in real-time. Live sessions appear in Browse Sessions → AI Sessions with a 🔴 Live badge.
import requests, websockets, json, asyncio
# 1. Register a session
resp = requests.post("https://arc3.sonpham.net/api/sessions/stream/register", json={
"game_id": "ls20",
"harness": "my-harness",
"agents": [{"id": "main", "model": "gemini-2.5-flash", "role": "executor"}]
})
data = resp.json()
ws_url = data["ws_url"] # wss://arc3.sonpham.net/ws/stream/{id}?token=...
view_url = data["view_url"] # share this link — anyone can watch live
# 2. Stream events over WebSocket
async def run():
async with websockets.connect(ws_url) as ws:
await ws.send(json.dumps({"v":1, "event":"session_start", "game_id":"ls20", ...}))
# After each LLM call:
await ws.send(json.dumps({"v":1, "event":"llm_call", "agent_id":"main", ...}))
# After each game action (grid state is required):
await ws.send(json.dumps({"v":1, "event":"act", "action":"UP", "grid":[[...]], ...}))
# When done:
await ws.send(json.dumps({"v":1, "event":"session_end", "result":"WIN", ...}))
asyncio.run(run())| Method | Path | Description |
|---|---|---|
POST |
/api/sessions/stream/register |
Register a new live session. Returns session_id, stream_token, ws_url, view_url. |
WS |
/ws/stream/{session_id}?token={stream_token} |
Push events as JSON messages. One event per message. |
GET |
/api/sessions/live |
List all currently streaming sessions (game, model, steps, viewers). |
GET |
/api/sessions/{id}/obs-events?live=true |
SSE stream for viewers — replays history then tails live events. |
POST |
/api/sessions/upload |
Upload a completed .arc3log (JSONL) file for post-hoc replay. |
All events share an envelope: {"v": 1, "t": "<ISO-8601>", "elapsed_s": <float>, "session_id": "...", "game_id": "...", "event": "<type>"}.
| Event | Required fields | Purpose |
|---|---|---|
session_start |
harness, agents[] |
Harness metadata — emitted once at start |
llm_call |
agent_id, model, response |
One per LLM API call. Optional: coordinates_mentioned[] for hover highlight |
act |
action, grid[][], step_num |
One per game action. grid = full 64×64 state after action |
memory_write |
file, content |
Full memory file content at time of write |
tool_call |
tool, code, output |
REPL / tool executions |
agent_message |
from_agent, to_agent, content |
Inter-agent communication |
session_end |
result, total_steps, total_cost |
Final summary — emitted once at end |
Full specification with examples: docs/SESSION-LOG-API.md
- Railway: auto-deploys from
masterbranch - Procfile:
gunicorn server.app:app --bind 0.0.0.0:$PORT --workers 1 --threads 8 - Railway Volume: persistent disk at
/datafor SQLite - Env vars:
SERVER_MODE=prod,DB_DATA_DIR=/data - Staging: push to
stagingbranch first, never directly tomaster
python tests/test_providers.py # all provider API paths
python tests/test_providers.py groq # single provider
python -c "from server.app import app; import db; import agent; import batch_runner; print('OK')"
python batch_runner.py --games ls20 --concurrency 1 --max-steps 5 # smoke test