feat: add must/should priority to verification signals (#790)

FL4TLiN3 · claude · web-flow · commit 3b4283a0726b · 2026-03-14T01:17:39.000Z
All signals were treated equally — any failure caused CONTINUE and looping.
But "app runs" and "all tests pass" have different user impact. Add priority:

- must: failure blocks completion (user cannot use the artifact)
- should: failure reported as known limitation (artifact is usable)

Changes:
- Design Principle 5: signal priority (must/should)
- Plan: Verification Signals require must/should per signal
- Plan self-check: new item 4 — every signal has a priority
- verify-test: should failures no longer cause CONTINUE; only must failures and structural check failures do
- Bump to 1.0.20
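
As a rough illustration of the new verdict rule, here is a minimal Python sketch. It is not part of this change: verify-test is an LLM expert driven by the instruction text in perstack.toml, and the names `SignalResult` and `decide_verdict` are hypothetical. Counting a should signal that fails to reproduce as a known limitation is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple


@dataclass
class SignalResult:
    """Hypothetical record of one verification signal's outcome."""
    name: str
    priority: Literal["must", "should"]
    passed: bool       # result of the Step 1 check
    reproduced: bool   # result of the Step 2 reproducibility check


def decide_verdict(
    signals: List[SignalResult], structural_ok: bool
) -> Tuple[str, List[str], List[str]]:
    """Return (verdict, blocking must failures, known limitations)."""
    failed = [s for s in signals if not (s.passed and s.reproduced)]
    # Must failures block completion.
    blocking = [s.name for s in failed if s.priority == "must"]
    # Should failures are reported but never change the verdict.
    known_limitations = [s.name for s in failed if s.priority == "should"]
    # Structural check failures still cause CONTINUE, as before.
    verdict = "PASS" if not blocking and structural_ok else "CONTINUE"
    return verdict, blocking, known_limitations


results = [
    SignalResult("app runs", "must", passed=True, reproduced=True),
    SignalResult("all tests pass", "should", passed=False, reproduced=False),
]
print(decide_verdict(results, structural_ok=True))
```

With these inputs the failing should signal appears only in the known-limitations list, so the verdict is PASS. Before this change the same result would have caused CONTINUE.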

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/definitions/create-expert/perstack.toml b/definitions/create-expert/perstack.toml
@@ -57,12 +57,13 @@
 #    - Without this boundary, plan bloat leaks directly into instructions.
 #
 # 5. Verification Signal Design
-#    - Success checks and reject rules are both expressed as hard signals:
-#      a command with a deterministic expected result.
+#    - Each signal is classified as must (blocks completion) or should
+#      (reported but does not block). Must signals protect core usability;
+#      should signals cover polish and secondary quality.
 #    - Reject signals are not the inverse of success signals — they detect
 #      domain-specific anti-patterns that indicate fundamental failure.
-#    - Each signal specifies: what to run, what to expect, and where to
-#      restart if it fails.
+#    - Each signal specifies: what to run, what to expect, and priority
+#      (must/should).
 #
 # 6. Instruction Content = Domain Constraints Only
 #    - An instruction should contain ONLY what the LLM cannot derive on
@@ -88,7 +89,7 @@
 
 [experts."create-expert"]
 defaultModelTier = "high"
-version = "1.0.19"
+version = "1.0.20"
 description = "Creates and modifies Perstack expert definitions in perstack.toml"
 instruction = """
 You are the coordinator for creating and modifying Perstack expert definitions. perstack.toml is the single source of truth — your job is to produce or modify it according to the user's request.
@@ -133,7 +134,7 @@ pick = ["readTextFile", "exec", "attemptCompletion"]
 
 [experts."@create-expert/plan"]
 defaultModelTier = "high"
-version = "1.0.19"
+version = "1.0.20"
 description = """
 Analyzes the user's request and produces plan.md: domain constraints, test query, verification signals, and role architecture.
 Provide: (1) what the expert should do, (2) path to existing perstack.toml if one exists.
@@ -164,10 +165,12 @@ Constraints and rules unique to this expert, extracted from the user's request.
 One comprehensive, realistic query that exercises the expert's full capability. Design the query so that its verification signals can cover all domain constraints from the Domain Knowledge section. Coverage comes from signal design depth, not from running multiple queries.
 
 ### Verification Signals
-Hard signals for the test query — verification checks whose results do not depend on LLM judgment:
+Hard signals for the test query — verification checks whose results do not depend on LLM judgment. Each signal specifies:
 - The exact command to run (deterministic, repeatable)
 - The expected result (specific output, presence/absence of content, numeric threshold)
-- Why this checks ground truth, not a proxy
+- Priority: **must** (failure blocks completion — the user cannot use the artifact) or **should** (failure is reported but does not block — the artifact is usable with known limitations)
+
+Must signals protect core usability — can the user run the artifact and get the primary value? Should signals cover polish, testing, and secondary quality.
 
 Include both positive signals (artifact works correctly) and reject signals (domain-specific anti-patterns are absent). Reject signals are not the inverse of positive signals — they detect fundamental failures derived from deeply understanding the domain.
 
@@ -193,9 +196,10 @@ Re-read plan.md and verify each rule. If any check fails, fix plan.md before att
 1. **Section names exact match**: plan.md uses exactly these section names and no others — "Expert Purpose", "Domain Knowledge", "Use Cases", "Test Query", "Verification Signals", "Architecture". Extra sections confuse downstream experts.
 2. **Single test query**: "Test Query" section contains exactly one query, not multiple.
 3. **Every signal is a command**: each entry in "Verification Signals" specifies a concrete command to execute and its expected result. Entries that describe what to observe or what correct output "looks like" without a command are not signals — rewrite them.
-4. **No soft language in signals**: signals contain no phrases like "verify that", "check that", "should be", "looks correct", "works properly". Each signal is: run X → expect Y.
-5. **Domain constraint coverage**: every constraint in "Domain Knowledge" is exercised by at least one signal. List which signal covers which constraint.
-6. **Architecture is names only**: "Architecture" section contains expert name, one-line purpose, and role (executor/verifier) per expert. No deliverables, no constraints, no implementation details.
+4. **Every signal has a priority**: each signal is marked as **must** (blocks completion) or **should** (reported, does not block). At least one must signal exists. Must signals protect core usability — can the user run the artifact and get the primary value?
+5. **No soft language in signals**: signals contain no phrases like "verify that", "check that", "should be", "looks correct", "works properly". Each signal is: run X → expect Y.
+6. **Domain constraint coverage**: every constraint in "Domain Knowledge" is exercised by at least one signal. List which signal covers which constraint.
+7. **Architecture is names only**: "Architecture" section contains expert name, one-line purpose, and role (executor/verifier) per expert. No deliverables, no constraints, no implementation details.
 
 After writing plan.md, attemptCompletion with the file path.
 """
@@ -220,7 +224,7 @@ pick = [
 
 [experts."@create-expert/build"]
 defaultModelTier = "low"
-version = "1.0.19"
+version = "1.0.20"
 description = """
 Orchestrates the write → review → test → verify cycle for perstack.toml.
 Provide: path to plan.md (containing requirements, architecture, test query, and verification signals).
@@ -281,7 +285,7 @@ pick = ["readTextFile", "exec", "todo", "attemptCompletion"]
 
 [experts."@create-expert/write-definition"]
 defaultModelTier = "low"
-version = "1.0.19"
+version = "1.0.20"
 description = """
 Writes or modifies a perstack.toml definition from plan.md requirements and architecture.
 Provide: (1) path to plan.md, (2) optionally path to existing perstack.toml to preserve, (3) optionally feedback from a failed test to address.
@@ -384,7 +388,7 @@ pick = [
 
 [experts."@create-expert/review-definition"]
 defaultModelTier = "low"
-version = "1.0.19"
+version = "1.0.20"
 description = """
 Reviews perstack.toml against plan.md for domain knowledge alignment and instruction quality.
 Provide: (1) path to plan.md, (2) path to perstack.toml.
@@ -433,7 +437,7 @@ pick = ["readTextFile", "todo", "attemptCompletion"]
 
 [experts."@create-expert/verify-test"]
 defaultModelTier = "low"
-version = "1.0.19"
+version = "1.0.20"
 description = """
 Executes hard signal checks against test-expert's results, verifies their reproducibility, and checks the definition structure.
 Provide: (1) the test-expert's factual report (query, what was produced, errors), (2) the verification signals from plan.md, (3) path to perstack.toml.
@@ -477,12 +481,12 @@ Report each as PASS/FAIL with the command output as evidence.
 
 ## Verdicts
 
-- **PASS** — all signals pass in Step 1, all signals reproduce in Step 2, all structural checks pass in Step 3.
-- **CONTINUE** — any signal failed, any signal did not reproduce, or any structural check failed. Include: which check failed, expected vs actual, specific fix needed.
+- **PASS** — all must signals pass and reproduce. Should signal results are reported but do not affect the verdict.
+- **CONTINUE** — any must signal failed, any must signal did not reproduce, or any structural check failed. Include: which check failed, expected vs actual, specific fix needed.
 
-Default to CONTINUE when any check lacks a clear PASS.
+Should signal failures are included in the report as known limitations but never cause CONTINUE.
 
-attemptCompletion with: verdict, per-signal results from Step 1, reproducibility results from Step 2, structural check results from Step 3, and (if CONTINUE) specific fix feedback.
+attemptCompletion with: verdict, per-signal results (with must/should labels) from Step 1, reproducibility results from Step 2, structural check results from Step 3, should-signal failures as known limitations, and (if CONTINUE) specific fix feedback for must failures only.
 """
 
 [experts."@create-expert/verify-test".skills."@perstack/base"]
@@ -498,7 +502,7 @@ pick = ["readTextFile", "exec", "todo", "attemptCompletion"]
 
 [experts."@create-expert/test-expert"]
 defaultModelTier = "low"
-version = "1.0.19"
+version = "1.0.20"
 description = """
 Executes a single test query against a Perstack expert definition and reports what happened.
 Provide: (1) path to perstack.toml, (2) the test query to execute, (3) the coordinator expert name to test.