refactor: add failure conditions to plan, further slim plan instruction
Plan now defines 5 things: domain constraints, realistic usage, test
queries, evaluation (success + failure conditions + restart points),
and role division.
Key changes:
- New "Failure Conditions" section in plan output: hard reject rules
derived from domain constraints, recording which expert's work caused
the failure and where to restart
- Success Criteria no longer includes "what failure looks like" (moved
to dedicated Failure Conditions section)
- Failure conditions are not the inverse of success criteria — they are
domain-specific rules that require deep understanding of constraints
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
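As an illustration of the new section, a hypothetical plan.md written under the "pure game logic with no I/O" constraint (the constraint and the console.log rule come from the diff below; the bullet layout and restart wording are illustrative, not from the actual template):

```markdown
### Failure Conditions

- Engine code contains console.log or any other I/O call.
  - What is wrong: violates the "pure game logic with no I/O" constraint.
  - Caused by: the engine expert.
  - Restart point: redo the engine expert's work.
```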
definitions/create-expert/perstack.toml — 28 additions, 54 deletions
```diff
@@ -15,7 +15,7 @@
 
 [experts."create-expert"]
 defaultModelTier = "high"
-version = "1.0.9"
+version = "1.0.11"
 description = "Creates and modifies Perstack expert definitions in perstack.toml"
 instruction = """
 You are the coordinator for creating and modifying Perstack expert definitions. perstack.toml is the single source of truth — your job is to produce or modify it according to the user's request.
```
```diff
@@ -65,67 +65,41 @@
-Analyzes the user's request, designs test scenarios with verification methods, and architects the expert system.
+Analyzes the user's request and produces plan.md: domain constraints, test queries, verification methods, and role architecture.
 Provide: (1) what the expert should do, (2) path to existing perstack.toml if one exists.
-Writes plan.md covering test queries, verification methods, domain knowledge, and delegation architecture.
 """
 instruction = """
-Your job is to deeply understand what the user needs and produce a plan that downstream delegates can execute against. The plan's core value is two things: (1) concrete test queries that exercise the expert's full range, and (2) correct verification methods for each query.
+Analyze the user's request and produce plan.md. The plan defines five things:
 
-## Investigation
+1. **What domain constraints exist** — rules the LLM cannot derive on its own
+2. **What realistic usage looks like** — concrete scenarios and test queries
+3. **What to execute** — the actual queries to run against the expert
+4. **How to evaluate results** — success conditions, failure conditions, and where to restart on failure
+5. **What role division follows from the above** — who does the work, who verifies it
 
-Before writing the plan:
-- If an existing perstack.toml path was provided, read it to understand the current state
-- Read relevant workspace files to understand the domain
-
-## Domain Knowledge Extraction
-
-Extract the constraints, values, and quality bars embedded in the user's request. Every word choice is a signal — "polished" implies no placeholders, "well-tested" implies automated playthroughs, "run anywhere" implies cross-platform npx. Convert implicit values into explicit rules the expert can follow. Focus on what makes THIS expert's output different from a generic attempt.
-
-Domain knowledge is NOT generic facts the LLM already knows, general best practices, or step-by-step procedures.
-
-## Verification Thinking
-
-For each test query, think carefully about how an independent person would verify the result. Not by reading the code — by running it. Ask:
-
-- What commands would you execute to confirm it works?
-- What output would you expect to see?
-- What would a failure look like?
-
-This thinking naturally leads to architectural separation between executors and verifiers. In the real world, the person who did the work is never the person who signs off on it. The same applies here: experts that produce artifacts (code, files, configs) must be verified by a separate expert that builds, runs, and executes those artifacts to confirm they actually work. Without this separation, the executor's reasoning biases the quality judgment.
+Before writing the plan, read existing perstack.toml (if provided) and relevant workspace files to understand the domain.
 
 ## Output: plan.md
 
-Write plan.md with the following sections:
-
 ### Expert Purpose
-One paragraph defining the expert's wedge — what it does, for whom, and why it is valuable.
+One paragraph: what it does, for whom, what makes it different from a generic attempt.
+
+### Domain Knowledge
+Constraints and rules unique to this expert, extracted from the user's request. Every word choice is a signal — "polished" means no placeholders, "well-tested" means automated playthroughs, "run anywhere" means cross-platform. Only include what the LLM wouldn't know without being told. Do not include code snippets, file paths, library recommendations, or step-by-step procedures.
 
-### Use Case Analysis
-Concrete scenarios where this expert would be used. Include the user's context, their goal, and what a successful outcome looks like.
+### Use Cases
+2-3 concrete scenarios: who uses this expert, what they ask for, what success looks like.
 
 ### 3 Test Queries
-A numbered list of 3 realistic queries that would actually be sent to this expert. These must:
-- Cover the full range of the expert's capabilities
-- Include simple and complex cases
-- Include at least one edge case
-- Be specific enough to evaluate (not vague like "do something")
+Realistic queries that would actually be sent to the expert. Cover simple, complex, and edge cases.
 
 ### Success Criteria
-For each of the 3 test queries, define:
-- What the correct output looks like (concrete, observable conditions)
-- How to verify it actually works (specific commands to run, expected results)
-- What a failure looks like (so the verifier knows when to reject)
-
-### Domain Knowledge
-The specific constraints and rules the expert's instruction must contain. Only include knowledge the LLM cannot derive on its own. Keep it focused.
-
-### Architecture Design
-
-#### Delegation Tree
-Visual tree showing coordinator → delegate relationships. Explain the cohesion rationale for each grouping.
+For each test query:
+- What correct output looks like (observable conditions)
+- What commands to run to verify it works
 
-Every tree that includes experts producing work must include a separate verifier expert with exec capability. The verifier does not review code — it builds, runs, and executes the output to confirm it works. This is the same principle as real-world quality assurance: the person who did the work is not the person who signs off on it.
+### Failure Conditions
+Conditions derived from domain constraints that mean the work must be rejected. These are not the inverse of success criteria — they are hard reject rules that come from deeply understanding the domain. For each failure condition: what specifically is wrong, which expert's work caused it, and where to restart. Example: if the user requires "pure game logic with no I/O," then engine code containing console.log is a failure condition that requires redoing the engine expert's work.
 
-#### Expert Definitions
-For each expert: name, one-line purpose, and role (executor or verifier).
+### Architecture
+Delegation tree with role assignments. Every expert that produces artifacts needs a separate verifier expert that builds, runs, and executes the output to confirm it works — the person who did the work is not the person who signs off on it. For each expert: name, one-line purpose, executor or verifier.
 
 After writing plan.md, attemptCompletion with the file path.
 """
```
```diff
@@ -150,7 +124,7 @@ pick = [
 
 [experts."@create-expert/build"]
 defaultModelTier = "low"
-version = "1.0.9"
+version = "1.0.11"
 description = """
 Orchestrates the write → test → verify → improve cycle for perstack.toml.
 Provide: path to plan.md (containing requirements, architecture, test queries, and success criteria).
```

```diff
 Writes or modifies a perstack.toml definition from plan.md requirements and architecture.
 Provide: (1) path to plan.md, (2) optionally path to existing perstack.toml to preserve, (3) optionally feedback from a failed test to address.
```
```diff
@@ -322,7 +296,7 @@ pick = [
 
 [experts."@create-expert/verify-test"]
 defaultModelTier = "low"
-version = "1.0.9"
+version = "1.0.11"
 description = """
 Verifies test-expert results by inspecting produced artifacts, executing them, and reviewing the definition against plan.md.
 Provide: (1) the test-expert's factual report (query, what was produced, errors), (2) the success criteria from plan.md, (3) path to plan.md (for semantic review of instructions), (4) path to perstack.toml.
```
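All three version bumps above edit entries that share one TOML shape. A rough sketch of that shape, assembled from the fields visible in the diff (the expert name and string contents here are illustrative, not copied from the actual file):

```toml
[experts."@create-expert/some-expert"]  # hypothetical expert name
defaultModelTier = "low"  # model tier the expert runs on, per the entries above
version = "1.0.11"
description = """
One-line purpose, plus what callers must provide.
"""
instruction = """
The expert's operating instructions, referencing plan.md sections.
"""
```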