feat(skills): add skill-comply — automated behavioral compliance measurement #724

shimo4228 wants to merge 2 commits into affaan-m:main from …
Conversation
Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs, runs scenarios at 3 strictness levels, classifies tool calls via LLM, and produces self-contained reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

A new compliance measurement skill (…

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant Run as run.py
participant SpecGen as spec_generator
participant ScenarioGen as scenario_generator
participant Runner as runner
participant Classifier as classifier
participant Grader as grader
participant Report as report
User->>Run: execute(skill_path, --model, --output)
Run->>SpecGen: generate_spec(skill_path)
SpecGen->>SpecGen: read skill + invoke Claude CLI
SpecGen-->>Run: ComplianceSpec YAML
Run->>ScenarioGen: generate_scenarios(spec_yaml)
ScenarioGen->>ScenarioGen: invoke Claude CLI (scenario_generator)
ScenarioGen-->>Run: [Scenario...]
loop for each Scenario
Run->>Runner: run_scenario(scenario, model)
Runner->>Runner: create sandbox, git init, run setup commands
Runner->>Runner: invoke Claude CLI (stream-json)
Runner->>Runner: parse stream -> ObservationEvents
Runner-->>Run: ScenarioRun(observations)
Run->>Classifier: classify_events(spec, observations)
Classifier->>Classifier: build prompt + invoke Claude CLI
Classifier-->>Run: {step_id: [event_idx...]}
Run->>Grader: grade(spec, observations, classifier_model)
Grader->>Grader: enforce temporal order, detect steps, compute compliance
Grader-->>Run: ComplianceResult
end
Run->>Report: generate_report(skill_path, spec, results)
Report->>Report: aggregate, build tables, timelines, recommendations
Report-->>Run: markdown_report
Run-->>User: saved report path + compliance summary
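The pipeline in the diagram can be sketched as a minimal Python driver. This is an illustrative stub, not the PR's code: the module and function names mirror the diagram, but every body below is a hypothetical stand-in for the real spec generation, sandboxed runs, and grading.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    level_name: str


@dataclass
class ComplianceResult:
    compliance_rate: float


def generate_spec(skill_path: str) -> dict:
    # Would read the skill and invoke the Claude CLI; stubbed here.
    return {"skill": skill_path, "steps": ["read", "test", "commit"]}


def generate_scenarios(spec: dict) -> list[Scenario]:
    # One scenario per strictness level, as in the diagram.
    return [Scenario(level) for level in ("loose", "normal", "strict")]


def run_scenario(scenario: Scenario, model: str) -> list[dict]:
    # Would sandbox, run `claude -p` with stream-json, and parse events.
    return [{"tool": "Bash", "input": "pytest"}]


def grade(spec: dict, observations: list[dict]) -> ComplianceResult:
    # Would classify events per step and enforce temporal order.
    return ComplianceResult(compliance_rate=1.0)


def run_pipeline(skill_path: str, model: str) -> list[tuple[str, ComplianceResult]]:
    spec = generate_spec(skill_path)
    results = []
    for scenario in generate_scenarios(spec):
        observations = run_scenario(scenario, model)
        results.append((scenario.level_name, grade(spec, observations)))
    return results
```

The real run.py additionally writes a markdown report aggregating these per-level results.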
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Greptile Summary

This PR adds …

Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant run.py
participant spec_generator.py
participant scenario_generator.py
participant runner.py
participant classifier.py
participant grader.py
participant report.py
User->>run.py: skill-comply <path>
run.py->>spec_generator.py: generate_spec(skill_path)
spec_generator.py->>spec_generator.py: claude -p spec_generator.md
spec_generator.py-->>run.py: ComplianceSpec
run.py->>scenario_generator.py: generate_scenarios(skill_path, spec_yaml)
scenario_generator.py->>scenario_generator.py: claude -p scenario_generator.md
scenario_generator.py-->>run.py: list[Scenario] (3 levels)
loop For each Scenario
run.py->>runner.py: run_scenario(scenario, model)
runner.py->>runner.py: _setup_sandbox (git init + setup_commands)
runner.py->>runner.py: claude -p stream-json
runner.py->>runner.py: _parse_stream_json → ObservationEvents
runner.py-->>run.py: ScenarioRun
run.py->>grader.py: grade(spec, observations)
grader.py->>classifier.py: classify_events(spec, trace)
classifier.py->>classifier.py: claude -p classifier.md
classifier.py-->>grader.py: dict[step_id, list[int]]
grader.py->>grader.py: _check_temporal_order per step
grader.py-->>run.py: ComplianceResult
end
run.py->>report.py: generate_report(skill_path, spec, results)
report.py-->>run.py: Markdown string
run.py-->>User: results/<skill>.md
Last reviewed commit: "fix(skill-comply): a..."
23 issues found across 23 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/skill-comply/scripts/run.py">
<violation number="1" location="skills/skill-comply/scripts/run.py:108">
P2: Writing to a user-provided `--output` path can fail because parent directories are not created before `write_text()`.</violation>
<violation number="2" location="skills/skill-comply/scripts/run.py:112">
P2: Summary calculation divides by zero when no scenarios are generated, causing a runtime crash instead of a graceful error.</violation>
</file>
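Both run.py findings admit small defensive fixes. The sketch below uses hypothetical helper names (`write_report`, `summarize`) modeled on the described behavior, not the PR's actual code:

```python
from pathlib import Path


def write_report(output: Path, report: str) -> None:
    # Create parent directories so a user-provided --output path
    # like results/deep/skill.md does not fail on write_text().
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(report, encoding="utf-8")


def summarize(passed: int, total: int) -> str:
    # Guard the zero-scenario case instead of dividing by zero.
    if total == 0:
        return "no scenarios generated"
    return f"{passed / total:.0%} compliant"
```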
<file name="AGENTS.md">
<violation number="1" location="AGENTS.md:3">
P2: The changed summary line introduces an inconsistent skill total within AGENTS.md (117 vs 115), leaving documentation internally contradictory.</violation>
</file>
<file name="README.md">
<violation number="1" location="README.md:1074">
P2: README contains conflicting hardcoded skill counts (117 vs 102), indicating documentation drift and inconsistent user-facing metrics.</violation>
</file>
<file name="skills/skill-comply/scripts/classifier.py">
<violation number="1" location="skills/skill-comply/scripts/classifier.py:24">
P1: Classification parse failures are silently treated as empty matches, causing classifier errors to be scored as real non-compliance.</violation>
<violation number="2" location="skills/skill-comply/scripts/classifier.py:32">
P2: Prompt construction can crash on non-string trace payloads because `event.input`/`event.output` are sliced without coercion.</violation>
<violation number="3" location="skills/skill-comply/scripts/classifier.py:73">
P1: `_parse_classification` can throw uncaught `AttributeError` when LLM returns valid JSON that is not an object.</violation>
</file>
<file name="skills/skill-comply/prompts/scenario_generator.md">
<violation number="1" location="skills/skill-comply/prompts/scenario_generator.md:19">
P2: Ambiguous `{id}` token in setup command example can be executed literally; `setup_commands` are run verbatim and `cwd` is already a per-scenario sandbox, so this risks shared `/tmp/skill-comply-sandbox/{id}` directories and cross-scenario leakage.</violation>
<violation number="2" location="skills/skill-comply/prompts/scenario_generator.md:52">
P1: Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated `setup_commands` that are later executed.</violation>
</file>
<file name="skills/skill-comply/scripts/parser.py">
<violation number="1" location="skills/skill-comply/scripts/parser.py:49">
P2: `parse_trace` only checks `exists()`, so directory paths pass validation and fail later with `IsADirectoryError` instead of an explicit file-path validation error.</violation>
<violation number="2" location="skills/skill-comply/scripts/parser.py:87">
P2: Detector step references are not validated against declared step IDs, so invalid after_step/before_step values can silently alter grading behavior or force failures.</violation>
</file>
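Both parser.py findings lend themselves to small validation helpers. This sketch assumes a spec shape where each step is a mapping with an `id` and an optional `detector` carrying `after_step`/`before_step`; the actual field names in the PR may differ:

```python
from pathlib import Path


def require_file(path: Path) -> None:
    # exists() alone lets directories through; fail early and clearly.
    if not path.is_file():
        raise FileNotFoundError(f"Not a readable file: {path}")


def validate_step_refs(steps: list[dict]) -> None:
    # Cross-check detector after_step/before_step against declared IDs
    # so a typo fails loudly instead of silently altering grading.
    known = {s["id"] for s in steps}
    for step in steps:
        detector = step.get("detector", {})
        for key in ("after_step", "before_step"):
            ref = detector.get(key)
            if ref is not None and ref not in known:
                raise ValueError(
                    f"Step {step['id']!r}: {key} references unknown step {ref!r}"
                )
```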
<file name="skills/skill-comply/prompts/classifier.md">
<violation number="1" location="skills/skill-comply/prompts/classifier.md:9">
P1: Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.</violation>
</file>
<file name="skills/skill-comply/scripts/scenario_generator.py">
<violation number="1" location="skills/skill-comply/scripts/scenario_generator.py:44">
P2: Unbounded prompt content is passed as a single CLI argument, which can exceed OS argv limits and cause subprocess invocation failure on larger skills/specs.</violation>
<violation number="2" location="skills/skill-comply/scripts/scenario_generator.py:60">
P2: LLM-generated YAML is parsed and immediately indexed without validating the top-level type or required keys. If the LLM returns malformed YAML or missing fields, this raises runtime KeyError/TypeError and crashes scenario generation. Add explicit schema/type checks (and optionally retries) before indexing.</violation>
</file>
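One way to satisfy the second finding is to validate the parsed object's shape before any indexing. The required field names below (`id`, `level_name`, `prompt`) are assumptions about the scenario schema, not taken from the PR:

```python
def validate_scenarios(parsed: object) -> list[dict]:
    """Validate the top-level shape of LLM-generated scenario YAML.

    Checks structure up front so malformed output raises one clear
    error instead of a KeyError/TypeError deep in the scenario loop.
    """
    if not isinstance(parsed, dict):
        raise ValueError(
            f"Expected a mapping at top level, got {type(parsed).__name__}"
        )
    scenarios = parsed.get("scenarios")
    if not isinstance(scenarios, list) or not scenarios:
        raise ValueError("YAML must contain a non-empty 'scenarios' list")
    required = {"id", "level_name", "prompt"}  # assumed field names
    for i, s in enumerate(scenarios):
        if not isinstance(s, dict):
            raise ValueError(f"scenarios[{i}] is not a mapping")
        missing = required - s.keys()
        if missing:
            raise ValueError(f"scenarios[{i}] missing keys: {sorted(missing)}")
    return scenarios
```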
<file name="skills/skill-comply/scripts/report.py">
<violation number="1" location="skills/skill-comply/scripts/report.py:114">
P2: Markdown table cells use unescaped dynamic text, so `|`/newlines in scenario names, step metadata, or failure reasons can corrupt report tables.</violation>
</file>
<file name="skills/skill-comply/scripts/grader.py">
<violation number="1" location="skills/skill-comply/scripts/grader.py:75">
P2: Classifier indices are only checked against the upper bound; negative indices from LLM output will silently map to the end of the list, producing incorrect evidence. Index order is also unnormalized, so non-monotonic output changes which candidate is selected first. Validate bounds (>=0) and normalize ordering before mapping to events.</violation>
</file>
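The grader finding suggests normalizing classifier indices before mapping them to events. A minimal sketch (hypothetical helper, not the PR's code):

```python
def normalize_indices(indices: list[int], n_events: int) -> list[int]:
    # Drop negative and out-of-range indices (a negative index would
    # otherwise silently wrap to the end of the event list), dedupe,
    # and sort so candidate order no longer depends on LLM output order.
    valid = {i for i in indices if 0 <= i < n_events}
    return sorted(valid)
```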
<file name="skills/skill-comply/scripts/utils.py">
<violation number="1" location="skills/skill-comply/scripts/utils.py:9">
P1: YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.</violation>
</file>
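One robust alternative to first/last-line fence stripping is to search for a fenced block anywhere in the response and fall back to the raw text. This is a sketch, not the PR's `extract_yaml`:

```python
import re


def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, even when the fenced block is
    surrounded by prose rather than being the whole response."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("Empty model output; expected YAML content")
    # Prefer the first fenced block anywhere in the response;
    # the fence may carry a "yaml"/"yml" language tag or none.
    match = re.search(r"```(?:ya?ml)?\s*\n(.*?)\n```", stripped, re.DOTALL)
    if match:
        return match.group(1).strip()
    return stripped
```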
<file name="skills/skill-comply/tests/test_parser.py">
<violation number="1" location="skills/skill-comply/tests/test_parser.py:28">
P2: Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.</violation>
<violation number="2" location="skills/skill-comply/tests/test_parser.py:51">
P2: Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.</violation>
</file>
<file name="skills/skill-comply/scripts/runner.py">
<violation number="1" location="skills/skill-comply/scripts/runner.py:53">
P2: `subprocess.run` is not checked for success, so non‑zero Claude exits (auth/model errors, CLI failures) still get parsed and graded as if they were valid runs, risking misclassification of runtime failures as compliance failures.</violation>
<violation number="2" location="skills/skill-comply/scripts/runner.py:67">
P2: Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and `_setup_sandbox` then deletes that directory, causing destructive overwrites of scenario artifacts.</violation>
<violation number="3" location="skills/skill-comply/scripts/runner.py:79">
P2: Setup commands ignore failures; if `git init` or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.</violation>
<violation number="4" location="skills/skill-comply/scripts/runner.py:83">
P1: Commands from LLM-generated scenarios are executed directly without validation or sandboxing, enabling arbitrary command execution if scenario content is untrusted.</violation>
</file>
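The runner findings point toward three small hardenings: collision-proof sandbox names, an allowlist for LLM-supplied setup commands, and checked subprocess calls. A sketch under those assumptions (the allowlist contents are illustrative, not from the PR):

```python
import hashlib
import shlex
import subprocess
from pathlib import Path

ALLOWED_COMMANDS = {"mkdir", "touch", "git", "echo"}  # assumed allowlist


def safe_sandbox_dir(base: Path, scenario_id: str) -> Path:
    # Append a short hash of the raw ID so distinct scenarios that
    # sanitize to the same name cannot collide and overwrite each other.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in scenario_id)
    digest = hashlib.sha256(scenario_id.encode()).hexdigest()[:8]
    return base / f"{safe}-{digest}"


def run_setup_command(cmd: str, cwd: Path) -> None:
    argv = shlex.split(cmd)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"Setup command not allowlisted: {cmd!r}")
    # check=True surfaces failures instead of proceeding in a
    # half-built sandbox; timeout bounds runaway commands.
    subprocess.run(argv, cwd=cwd, check=True, capture_output=True, timeout=60)
```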
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
- Prompts should be realistic — something a developer would actually ask

Skill content:
P1: Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated setup_commands that are later executed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/prompts/scenario_generator.md, line 52:
<comment>Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated `setup_commands` that are later executed.</comment>
<file context>
@@ -0,0 +1,62 @@
+- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
+- Prompts should be realistic — something a developer would actually ask
+
+Skill content:
+
+---
</file context>
{steps_description}

Tool calls (numbered):
{tool_calls}
P1: Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/prompts/classifier.md, line 9:
<comment>Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.</comment>
<file context>
@@ -0,0 +1,24 @@
+{steps_description}
+
+Tool calls (numbered):
+{tool_calls}
+
+Respond with ONLY a JSON object mapping step_id to a list of matching tool call numbers.
</file context>
def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, stripping markdown fences if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
P1: YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/utils.py, line 9:
<comment>YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.</comment>
<file context>
@@ -0,0 +1,13 @@
+def extract_yaml(text: str) -> str:
+ """Extract YAML from LLM output, stripping markdown fences if present."""
+ lines = text.strip().splitlines()
+ if lines and lines[0].startswith("```"):
+ lines = lines[1:]
+ if lines and lines[-1].startswith("```"):
</file context>
def test_nonexistent_file_raises(self) -> None:
    with pytest.raises(FileNotFoundError):
        parse_trace(Path("/nonexistent/trace.jsonl"))
P2: Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/tests/test_parser.py, line 51:
<comment>Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.</comment>
<file context>
@@ -0,0 +1,90 @@
+
+ def test_nonexistent_file_raises(self) -> None:
+ with pytest.raises(FileNotFoundError):
+ parse_trace(Path("/nonexistent/trace.jsonl"))
+
+
</file context>
def test_events_sorted_by_timestamp(self) -> None:
    events = parse_trace(FIXTURES / "compliant_trace.jsonl")
    timestamps = [e.timestamp for e in events]
    assert timestamps == sorted(timestamps)
P2: Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/tests/test_parser.py, line 28:
<comment>Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.</comment>
<file context>
@@ -0,0 +1,90 @@
+ def test_events_sorted_by_timestamp(self) -> None:
+ events = parse_trace(FIXTURES / "compliant_trace.jsonl")
+ timestamps = [e.timestamp for e in events]
+ assert timestamps == sorted(timestamps)
+
+ def test_event_fields(self) -> None:
</file context>
def _safe_sandbox_dir(scenario_id: str) -> Path:
    """Sanitize scenario ID and ensure path stays within sandbox base."""
    safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
P2: Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and _setup_sandbox then deletes that directory, causing destructive overwrites of scenario artifacts.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/runner.py, line 67:
<comment>Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and `_setup_sandbox` then deletes that directory, causing destructive overwrites of scenario artifacts.</comment>
<file context>
@@ -0,0 +1,155 @@
+
+def _safe_sandbox_dir(scenario_id: str) -> Path:
+ """Sanitize scenario ID and ensure path stays within sandbox base."""
+ safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
+ path = SANDBOX_BASE / safe_id
+ path.resolve().relative_to(SANDBOX_BASE.resolve())
</file context>
| @@ -0,0 +1,155 @@ | |||
| """Run scenarios via claude -p and parse tool calls from stream-json output.""" | |||
P2: Setup commands ignore failures; if git init or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/runner.py, line 79:
<comment>Setup commands ignore failures; if `git init` or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.</comment>
<file context>
@@ -0,0 +1,155 @@
+ shutil.rmtree(sandbox_dir)
+ sandbox_dir.mkdir(parents=True)
+
+ subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
+
+ for cmd in scenario.setup_commands:
</file context>
Actionable comments posted: 6
🧹 Nitpick comments (10)
skills/skill-comply/prompts/classifier.md (1)
19-20: Generalize filename rules to avoid task-specific bias. Using calculator.py-specific examples may overfit classification. Prefer generic rules (e.g., test-path patterns vs non-test paths) so behavior stays stable across domains.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/prompts/classifier.md` around lines 19-20: the current rule examples use specific filenames ("test_calculator.py", "calculator.py"), which risks task-specific bias. Update the classifier prompt to use generic filename/path patterns instead (e.g., test path patterns like "tests/", "test_*.py", "*_test.py" versus implementation file patterns) so the guidance applies broadly across domains. Find the lines that mention "test_calculator.py" and "calculator.py", replace them with these generic patterns, and add a short explanatory phrase (e.g., "files under tests/ or prefixed with test_ are tests") to keep behavior stable.

skills/skill-comply/.gitignore (1)
7-7: Consider committing uv.lock for reproducible environments. Ignoring uv.lock can make test and dependency resolution less deterministic across machines/CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/.gitignore` at line 7: remove uv.lock from the .gitignore so the lockfile is committed for reproducible installs. Delete the entry that currently lists "uv.lock", add the generated uv.lock to the repository, and commit it (regenerate with your package manager if needed) so dependency resolution is deterministic across machines/CI.

skills/skill-comply/prompts/scenario_generator.md (1)
19-19: Avoid hardcoded absolute sandbox paths in generated scenarios. Embedding /tmp/skill-comply-sandbox/... in the prompt template can make scenario setup brittle across environments. Prefer a placeholder tied to the runner-provided sandbox root.

As per coding guidelines: "Do not use hardcoded values; use constants or configuration instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/prompts/scenario_generator.md` at line 19: replace the hardcoded "/tmp/skill-comply-sandbox/{id}/..." path in the scenario template string with a runner-provided sandbox root placeholder (e.g., "{sandbox_root}" or "${SANDBOX_ROOT}") so the mkdir command becomes "mkdir -p {sandbox_root}/{id}/src {sandbox_root}/{id}/tests". Update scenario_generator.md where that string appears and ensure the consuming runner injects the sandbox root at runtime instead of using the absolute /tmp path.

skills/skill-comply/scripts/spec_generator.py (1)
50-53: Add validation for empty Claude output before parsing. Unlike scenario_generator.py (lines 53-54), this function doesn't validate that result.stdout is non-empty before extraction. If Claude returns empty output with exit code 0, extract_yaml returns an empty string, leading to confusing parse errors during retries.

Suggested fix

 if result.returncode != 0:
     raise RuntimeError(f"claude -p failed: {result.stderr}")
+if not result.stdout.strip():
+    raise RuntimeError("claude -p returned empty output")
+
 raw_yaml = extract_yaml(result.stdout)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/spec_generator.py` around lines 50-53: the Claude call checks returncode but doesn't validate result.stdout before calling extract_yaml. Add a guard after the returncode check to verify result.stdout is non-empty (e.g., if not result.stdout.strip():) and raise a RuntimeError stating that Claude returned empty output, including any available result.stderr. Follow the same pattern as scenario_generator.py so extract_yaml is only invoked with non-empty input.

skills/skill-comply/scripts/run.py (1)
96-100: Replace print() with logger.info() for consistency. Lines 96 and 100 use print() while the rest of the module uses logger.info(). This creates inconsistent output behavior (e.g., if logging is redirected).

As per coding guidelines: "Avoid using print() statements in Python code; use the logging module instead."

Suggested fix

 for scenario in scenarios:
-    print(f"  Running {scenario.level_name}...", end="", flush=True)
+    logger.info("  Running %s...", scenario.level_name)
     run = run_scenario(scenario, model=args.model)
     result = grade(spec, list(run.observations))
     graded_results.append((scenario.level_name, result, list(run.observations)))
-    print(f" {result.compliance_rate:.0%}")
+    logger.info("  %s: %.0f%%", scenario.level_name, result.compliance_rate * 100)
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/run.py` around lines 96-100: replace the two print() calls around running scenarios with logger.info() so output follows the module's logging configuration. Use logger.info to log the "Running {scenario.level_name}..." message (preserving the same text, combining messages if needed to account for the no-newline behavior) and log the result line with the formatted compliance_rate. Update the calls near run_scenario, grade, and graded_results handling so logs are consistent with the module logger.

skills/skill-comply/scripts/classifier.py (1)
70-78: Consider logging when classification parsing fails. Returning {} on JSONDecodeError/ValueError silently masks LLM output issues. This could lead to false compliance results (all steps appear as "not detected") without any indication of the underlying problem.

Suggested improvement

+import logging
+
+logger = logging.getLogger(__name__)
+
 def _parse_classification(text: str) -> dict[str, list[int]]:
     ...
     try:
         parsed = json.loads(cleaned)
         # Validate structure
         return {
             k: [int(i) for i in v]
             for k, v in parsed.items()
             if isinstance(v, list)
         }
     except (json.JSONDecodeError, ValueError):
+        logger.warning("Failed to parse classification output: %s", cleaned[:200])
         return {}
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/classifier.py` around lines 70-78: the except block that currently catches json.JSONDecodeError/ValueError after json.loads(cleaned) silently returns {} and masks LLM output issues. Update that except block (where parsed and cleaned are used) to log the exception details and the raw/cleaned LLM output before returning {} so failures are visible — include the exception message (and ideally the stack trace) plus the cleaned string in the log to aid debugging.

skills/skill-comply/scripts/runner.py (1)
16-16: Use the tempfile module instead of a hardcoded /tmp path. Hardcoding /tmp is flagged by static analysis (S108) and may not work correctly on all systems or in containerized environments.

Suggested fix

+import tempfile
+
-SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
+SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox"
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/runner.py` at line 16: replace the hardcoded SANDBOX_BASE = Path("/tmp/skill-comply-sandbox") with a tempfile-based solution. Import tempfile and either set SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox" for a platform-safe shared temp location, or (preferably) create a unique per-run directory with tempfile.mkdtemp() (e.g., SANDBOX_BASE = Path(tempfile.mkdtemp(prefix="skill-comply-"))) and ensure any created temp directory is cleaned up appropriately. Update runner.py to import tempfile and adjust any code that assumes a persistent /tmp path.

skills/skill-comply/scripts/scenario_generator.py (1)
56-68: Add error handling for YAML parsing failures. If the LLM returns malformed YAML or a missing scenarios key, yaml.safe_load or parsed["scenarios"] will raise exceptions with unclear error messages. Unlike spec_generator.py, which has retry logic, this function lacks any YAML error handling.

Suggested fix

 raw_yaml = extract_yaml(result.stdout)
-parsed = yaml.safe_load(raw_yaml)
-
-scenarios: list[Scenario] = []
-for s in parsed["scenarios"]:
+try:
+    parsed = yaml.safe_load(raw_yaml)
+    if not parsed or "scenarios" not in parsed:
+        raise RuntimeError("LLM output missing 'scenarios' key")
+except yaml.YAMLError as e:
+    raise RuntimeError(f"Failed to parse scenario YAML: {e}") from e
+
+scenarios: list[Scenario] = []
+for s in parsed["scenarios"]:
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/scenario_generator.py` around lines 56-68: wrap the YAML extraction and parsing in a try/except that catches yaml.YAMLError and TypeError/KeyError around yaml.safe_load and parsed["scenarios"]. Validate that parsed is a dict and contains the "scenarios" key before iterating, and on failure either raise a clear exception that includes the raw_yaml (or a truncated preview) and the original error, or trigger the same retry path used by spec_generator.py. Update the scenario construction block (where extract_yaml, yaml.safe_load, parsed["scenarios"], and Scenario(...) are used) to log/raise a descriptive error that aids debugging when the LLM returns malformed or missing YAML.

skills/skill-comply/scripts/grader.py (1)
88-92: Clarify: only the first valid event is used as evidence per step. The loop breaks after finding the first event that passes temporal checks (line 92). This means if multiple events could satisfy a step, only the first is recorded as evidence. If this is intentional (e.g., to enforce single-detection semantics), consider adding a brief comment. If multiple events should contribute, remove the break.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/grader.py` around lines 88-92: the loop over candidates currently stops at the first event that passes temporal checks (the break after matched.append(event)), which enforces single-evidence-per-step. Decide and apply one of two fixes: if single-detection is intentional, add a short clarifying comment above the loop referencing the relevant names (for example: "Only the first valid event is used as evidence for a step — see _check_temporal_order and matched") to make the behavior explicit; otherwise remove the break so the loop evaluates all candidates and appends every event for which _check_temporal_order(step, event, resolved, classified) returns None.

skills/skill-comply/scripts/parser.py (1)
79-101: Consider wrapping key accesses with clearer error messages. While the scoring section has explicit validation (lines 92-93), other required keys like id, name, steps, etc. will raise generic KeyError exceptions if missing. For a developer-facing tool, the current behavior is acceptable, but clearer error messages would improve debuggability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/parser.py` around lines 79 - 101, The parser currently accesses required keys like "id", "name", "source_rule", "version", and "steps" directly which yields generic KeyError if missing; add explicit validation before building Steps and returning ComplianceSpec (e.g., check raw contains "steps", "id", "name", "source_rule", "version") and raise KeyError with clear messages like "Missing 'id' in compliance spec" or "Missing 'steps' array in compliance spec"; also validate each step has required fields before creating Step/Detector to provide targeted errors; update the block that builds steps and the return of ComplianceSpec to use these explicit checks so errors are informative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Line 3: Update the inconsistent skill count in AGENTS.md: the header line
reads "117 skills" but the Project Structure block still shows "115"; open the
Project Structure section and change the "115" figure to "117" so both in-file
references (the top header "117 skills" and the Project Structure count) match
exactly.
In `@skills/skill-comply/scripts/parser.py`:
- Around line 74-76: parse_spec currently calls path.read_text() without
verifying the file exists (unlike parse_trace), which can produce a less clear
error; update parse_spec to check path.exists() (or path.is_file()) before
reading and raise a clear FileNotFoundError with the same style/message used in
parse_trace if the file is missing, then proceed to call yaml.safe_load as
before (refer to the parse_spec function and compare behavior to parse_trace).
In `@skills/skill-comply/scripts/report.py`:
- Line 52: The string passed to lines.append uses an unnecessary f-string prefix
despite having no placeholders; remove the leading "f" from the call to
lines.append(f"| Recommendation | All steps above threshold — no hook promotion
needed |") in report.py (locate the lines.append invocation that writes the
Recommendation row) so it becomes a plain string literal.
- Around line 39-40: The two calls to lines.append use f-strings with no
placeholders; change lines.append(f"| Metric | Value |") and
lines.append(f"|--------|-------|") to plain strings (lines.append("| Metric |
Value |") and lines.append("|--------|-------|")) so static analysis stops
flagging useless f-strings; update any similar lines.append occurrences in
report.py that use f"" without interpolation.
In `@skills/skill-comply/scripts/runner.py`:
- Around line 81-83: The loop in runner.py currently executes LLM-provided
scenario.setup_commands by calling shlex.split(cmd) and subprocess.run(...)
which is unsafe; replace this with validation against an allowlist of permitted
executables/flags (e.g., only allow "mkdir", "touch", "echo" and safe arg
patterns) before calling subprocess.run, reject or log and skip any command that
doesn't match the allowlist, and/or require explicit user confirmation for
non-allowlisted commands; additionally consider executing allowed commands in a
hardened sandbox (restricted cwd, unprivileged user, or container) using the
existing sandbox_dir and ensure all subprocess.run invocations set shell=False,
check=True, timeout, and capture_output/log stdout/stderr for auditing (refer to
scenario.setup_commands, shlex.split, subprocess.run, and sandbox_dir).
In `@skills/skill-comply/scripts/utils.py`:
- Around line 6-13: The extract_yaml function should validate and fail fast:
trim the input, raise ValueError("Empty model output; expected YAML content") if
empty, then strip leading/trailing code-fence lines only if present (remove
opening ``` and closing ``` together when both exist), build yaml_text =
"\n".join(lines).strip(), and raise ValueError("No YAML content found in model
output") if yaml_text is empty before returning it; update the extract_yaml
implementation to perform these checks and raise clear errors rather than
silently returning empty/invalid strings.
---
Nitpick comments:
In `@skills/skill-comply/.gitignore`:
- Line 7: Remove uv.lock from the .gitignore so the lockfile is committed for
reproducible installs; update the .gitignore entry that currently lists
"uv.lock" to delete that line, add the generated uv.lock to the repository, and
commit the lockfile (regenerate it with your package manager if needed) so
dependency resolution is deterministic across machines/CI.
In `@skills/skill-comply/prompts/classifier.md`:
- Around line 19-20: The current rule examples use specific filenames
("test_calculator.py", "calculator.py") which risks task-specific bias; update
the classifier prompt in classifier.md to use generic filename/path patterns
instead (e.g., test path patterns like "tests/", "test_*.py", "*_test.py" for
test files versus non-test paths or implementation file patterns for
implementation files) so the guidance applies broadly across domains; find the
lines that mention "test_calculator.py" and "calculator.py" and replace them
with these generic patterns and a short explanatory phrase (e.g., "files under
tests/ or prefixed with test_ are tests") to keep behavior stable.
In `@skills/skill-comply/prompts/scenario_generator.md`:
- Line 19: Replace the hardcoded "/tmp/skill-comply-sandbox/{id}/..." path in
the scenario template string with a runner-provided sandbox root placeholder
(e.g., "{sandbox_root}" or "${SANDBOX_ROOT}") so the mkdir command becomes
"mkdir -p {sandbox_root}/{id}/src {sandbox_root}/{id}/tests"; update the
scenario_generator.md template where that exact string appears and ensure the
consuming runner injects the sandbox root constant/config value at runtime
instead of using the absolute /tmp path.
In `@skills/skill-comply/scripts/classifier.py`:
- Around line 70-78: The except block that currently catches
json.JSONDecodeError/ValueError after json.loads(cleaned) silently returns {}
and masks LLM output issues; update that except block (where parsed and cleaned
are used) to log the exception details and the raw/cleaned LLM output before
returning {} so failures are visible — include the exception message (and
ideally stack/trace) plus the cleaned string in the log to aid debugging.
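A minimal sketch of the logged-fallback pattern this comment asks for (function name, logger name, and message format are illustrative):

```python
import json
import logging

logger = logging.getLogger("skill_comply.classifier")

def parse_classification(cleaned: str) -> dict:
    """Parse the classifier's JSON reply; log and fall back instead of failing silently."""
    try:
        parsed = json.loads(cleaned)
        if not isinstance(parsed, dict):
            raise ValueError(f"expected JSON object, got {type(parsed).__name__}")
        return parsed
    except (json.JSONDecodeError, ValueError) as exc:
        # Surface the failure and the offending payload instead of masking it.
        logger.warning("Malformed classification output: %s; raw=%r", exc, cleaned[:500])
        return {}

ok = parse_classification('{"step-1": [0, 2]}')
bad = parse_classification("not json at all")
non_dict = parse_classification("[1, 2, 3]")
```

The caller still receives `{}` on failure, but the warning log now records both the exception and the raw output for debugging.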
In `@skills/skill-comply/scripts/grader.py`:
- Around line 88-92: The loop over candidates currently stops at the first event
that passes temporal checks (the break after matched.append(event)), which
enforces single-evidence-per-step; decide and apply one of two fixes: if
single-detection is intentional, add a short clarifying comment above the loop
referencing the variables/functions (for example: "Only the first valid event is
used as evidence for a step — see _check_temporal_order and matched") to make
the behavior explicit; otherwise remove the break so the loop continues to
evaluate all candidates and append every event for which
_check_temporal_order(step, event, resolved, classified) returns None.
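The two behaviors can be compared side by side; a small illustrative sketch (the names are hypothetical stand-ins, not the grader's real signatures):

```python
def collect_evidence(candidates, passes_temporal_check, first_only=False):
    """Gather evidence events for a step.

    first_only=True mirrors the current break-on-first behavior; False keeps
    every candidate that survives the temporal check.
    """
    matched = []
    for event in candidates:
        if passes_temporal_check(event):
            matched.append(event)
            if first_only:
                break  # single-evidence-per-step: stop at the first valid event
    return matched

candidates = [1, 2, 3, 4]
is_valid = lambda e: e % 2 == 0  # stand-in for _check_temporal_order returning None
single = collect_evidence(candidates, is_valid, first_only=True)
every = collect_evidence(candidates, is_valid)
```

Either choice is defensible; the comment's point is that the code should state which one is intended.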
In `@skills/skill-comply/scripts/parser.py`:
- Around line 79-101: The parser currently accesses required keys like "id",
"name", "source_rule", "version", and "steps" directly which yields generic
KeyError if missing; add explicit validation before building Steps and returning
ComplianceSpec (e.g., check raw contains "steps", "id", "name", "source_rule",
"version") and raise KeyError with clear messages like "Missing 'id' in
compliance spec" or "Missing 'steps' array in compliance spec"; also validate
each step has required fields before creating Step/Detector to provide targeted
errors; update the block that builds steps and the return of ComplianceSpec to
use these explicit checks so errors are informative.
In `@skills/skill-comply/scripts/run.py`:
- Around line 96-100: Replace the two print() calls around running scenarios
with logger.info() so output follows the module's logging configuration: use
logger.info to log the "Running {scenario.level_name}..." message (preserve the
same text and no newline behavior by combining messages if needed) and log the
result line with the formatted compliance_rate; update the calls near
run_scenario, grade, and graded_results handling to use logger.info instead of
print so logs are consistent with the module logger.
In `@skills/skill-comply/scripts/runner.py`:
- Line 16: Replace the hardcoded SANDBOX_BASE =
Path("/tmp/skill-comply-sandbox") with a tempfile-based solution: import
tempfile and either set SANDBOX_BASE = Path(tempfile.gettempdir()) /
"skill-comply-sandbox" for a platform-safe shared temp location, or (preferably)
create a unique per-run directory with tempfile.mkdtemp() (e.g., SANDBOX_BASE =
Path(tempfile.mkdtemp(prefix="skill-comply-"))) and ensure any created temp
directory is cleaned up appropriately; update runner.py to import tempfile and
adjust any code that assumes a persistent /tmp path.
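A minimal sketch of the tempfile-based variant (constant and helper names are assumptions):

```python
import tempfile
from pathlib import Path

# Platform-safe shared base instead of a hardcoded /tmp path.
SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox"

def make_sandbox_dir() -> Path:
    """Create a unique per-run sandbox directory under the shared base."""
    SANDBOX_BASE.mkdir(parents=True, exist_ok=True)
    return Path(tempfile.mkdtemp(prefix="skill-comply-", dir=SANDBOX_BASE))

first = make_sandbox_dir()
second = make_sandbox_dir()
```

`mkdtemp` guarantees a fresh directory per call, so two runs (or two scenarios with colliding IDs) can never clobber each other's workspace.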
In `@skills/skill-comply/scripts/scenario_generator.py`:
- Around line 56-68: Wrap the YAML extraction and parsing in a try/except that
catches yaml.YAMLError and TypeError/KeyError around yaml.safe_load and
parsed["scenarios"]; validate that parsed is a dict and contains the "scenarios"
key before iterating, and on failure either raise a clear exception that
includes the raw_yaml (or a truncated preview) and the original error or trigger
the same retry path used by spec_generator.py; update the scenario construction
block (where extract_yaml, yaml.safe_load, parsed["scenarios"], and
Scenario(...) are used) to log/raise a descriptive error that aids debugging
when the LLM returns malformed or missing YAML.
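A sketch of the structural validation this prompt describes, separated from the YAML parsing step so it stays self-contained (function name and error messages are illustrative):

```python
def extract_scenarios(parsed, raw_preview: str) -> list:
    """Validate the parsed payload before building Scenario objects."""
    if not isinstance(parsed, dict):
        raise ValueError(
            f"Expected a mapping with a 'scenarios' key, got "
            f"{type(parsed).__name__}; raw output began: {raw_preview[:200]!r}"
        )
    if "scenarios" not in parsed:
        raise ValueError(
            f"Missing 'scenarios' key; raw output began: {raw_preview[:200]!r}"
        )
    return parsed["scenarios"]

good = extract_scenarios({"scenarios": [{"id": "s1"}]}, "scenarios:\n  - id: s1")
try:
    extract_scenarios(["not", "a", "dict"], "- not\n- a\n- dict")
    failure = None
except ValueError as exc:
    failure = str(exc)
```

In the real module this would sit between `yaml.safe_load` and the `Scenario(...)` construction loop, with `yaml.YAMLError` caught around the load itself.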
In `@skills/skill-comply/scripts/spec_generator.py`:
- Around line 50-53: The Claude call in spec_generator.py checks returncode but
doesn't validate result.stdout before calling extract_yaml; add a guard after
the returncode check to verify result.stdout is non-empty (e.g., if not
result.stdout.strip():) and raise a RuntimeError including the fact that Claude
returned empty output and any available result.stderr; ensure this uses the same
pattern as scenario_generator.py (checking result.stdout before calling
extract_yaml) so extract_yaml is only invoked with non-empty input.
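The guard pattern can be sketched generically (the function name is hypothetical, and `sys.executable` stands in for the Claude CLI so the sketch stays runnable):

```python
import subprocess
import sys

def run_cli(cmd: list) -> str:
    """Run a command, guarding both the exit code and empty stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed (rc={result.returncode}): {result.stderr[:300]}")
    if not result.stdout.strip():
        raise RuntimeError(f"Command returned empty output; stderr: {result.stderr[:300]}")
    return result.stdout

out = run_cli([sys.executable, "-c", "print('id: spec')"])
try:
    run_cli([sys.executable, "-c", "pass"])  # exits 0 but prints nothing
    empty_error = None
except RuntimeError as exc:
    empty_error = str(exc)
```

With this guard in place, `extract_yaml` is only ever invoked with non-empty input.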
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a55fef57-6794-46b5-9f49-f8fa75b7f64d
📒 Files selected for processing (23)
- AGENTS.md
- README.md
- skills/skill-comply/.gitignore
- skills/skill-comply/SKILL.md
- skills/skill-comply/fixtures/compliant_trace.jsonl
- skills/skill-comply/fixtures/noncompliant_trace.jsonl
- skills/skill-comply/fixtures/tdd_spec.yaml
- skills/skill-comply/prompts/classifier.md
- skills/skill-comply/prompts/scenario_generator.md
- skills/skill-comply/prompts/spec_generator.md
- skills/skill-comply/pyproject.toml
- skills/skill-comply/scripts/__init__.py
- skills/skill-comply/scripts/classifier.py
- skills/skill-comply/scripts/grader.py
- skills/skill-comply/scripts/parser.py
- skills/skill-comply/scripts/report.py
- skills/skill-comply/scripts/run.py
- skills/skill-comply/scripts/runner.py
- skills/skill-comply/scripts/scenario_generator.py
- skills/skill-comply/scripts/spec_generator.py
- skills/skill-comply/scripts/utils.py
- skills/skill-comply/tests/test_grader.py
- skills/skill-comply/tests/test_parser.py
```python
lines.append(f"| Metric | Value |")
lines.append(f"|--------|-------|")
```
Remove extraneous f prefix from strings without placeholders.
Static analysis correctly identifies f-strings without any placeholders. These should be plain strings.
Proposed fix
```diff
- lines.append(f"| Metric | Value |")
- lines.append(f"|--------|-------|")
+ lines.append("| Metric | Value |")
+ lines.append("|--------|-------|")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
lines.append("| Metric | Value |")
lines.append("|--------|-------|")
```
🧰 Tools
🪛 Ruff (0.15.6)
[error] 39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 40-40: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/report.py` around lines 39 - 40, The two calls to
lines.append use f-strings with no placeholders; change lines.append(f"| Metric
| Value |") and lines.append(f"|--------|-------|") to plain strings
(lines.append("| Metric | Value |") and lines.append("|--------|-------|")) so
static analysis stops flagging useless f-strings; update any similar
lines.append occurrences in report.py that use f"" without interpolation.
```python
    step_names = ", ".join(promote_steps)
    lines.append(f"| Recommendation | **Promote {step_names} to hooks** |")
else:
    lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
```
Remove extraneous f prefix.
This string has no placeholders.
Proposed fix
```diff
- lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
+ lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
```

📝 Committable suggestion

```python
lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
```
🧰 Tools
🪛 Ruff (0.15.6)
[error] 52-52: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/report.py` at line 52, The string passed to
lines.append uses an unnecessary f-string prefix despite having no placeholders;
remove the leading "f" from the call to lines.append(f"| Recommendation | All
steps above threshold — no hook promotion needed |") in report.py (locate the
lines.append invocation that writes the Recommendation row) so it becomes a
plain string literal.
```python
for cmd in scenario.setup_commands:
    parts = shlex.split(cmd)
    subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
```
Security risk: executing LLM-generated commands without validation.
setup_commands originate from scenario_generator.py, which parses them from LLM output. Executing arbitrary shell commands from LLM responses is dangerous—a malicious or hallucinated command could delete files, exfiltrate data, or compromise the system.
Consider:
- Allowlisting permitted commands (e.g., `mkdir`, `touch`, `echo`)
- Running setup commands in a more restricted sandbox
- Adding user confirmation before executing
Example allowlist approach
```diff
+ALLOWED_SETUP_COMMANDS = frozenset({"mkdir", "touch", "echo", "cat"})
+
 def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
     """Create sandbox directory and run setup commands."""
     if sandbox_dir.exists():
         shutil.rmtree(sandbox_dir)
     sandbox_dir.mkdir(parents=True)
     subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
     for cmd in scenario.setup_commands:
         parts = shlex.split(cmd)
+        if parts and parts[0] not in ALLOWED_SETUP_COMMANDS:
+            raise ValueError(f"Disallowed setup command: {parts[0]!r}")
         subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
```

🧰 Tools
🪛 Ruff (0.15.6)
[error] 83-83: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/runner.py` around lines 81 - 83, The loop in
runner.py currently executes LLM-provided scenario.setup_commands by calling
shlex.split(cmd) and subprocess.run(...) which is unsafe; replace this with
validation against an allowlist of permitted executables/flags (e.g., only allow
"mkdir", "touch", "echo" and safe arg patterns) before calling subprocess.run,
reject or log and skip any command that doesn't match the allowlist, and/or
require explicit user confirmation for non-allowlisted commands; additionally
consider executing allowed commands in a hardened sandbox (restricted cwd,
unprivileged user, or container) using the existing sandbox_dir and ensure all
subprocess.run invocations set shell=False, check=True, timeout, and
capture_output/log stdout/stderr for auditing (refer to scenario.setup_commands,
shlex.split, subprocess.run, and sandbox_dir).
````python
def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, stripping markdown fences if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)
````
Harden YAML extraction for malformed/mixed LLM output.
extract_yaml assumes fences are only at the start/end and never validates extracted content. For untrusted model output, this can silently pass invalid payloads and defer failure downstream with unclear errors.
Proposed fix
````diff
 def extract_yaml(text: str) -> str:
     """Extract YAML from LLM output, stripping markdown fences if present."""
-    lines = text.strip().splitlines()
-    if lines and lines[0].startswith("```"):
-        lines = lines[1:]
-    if lines and lines[-1].startswith("```"):
-        lines = lines[:-1]
-    return "\n".join(lines)
+    stripped = text.strip()
+    if not stripped:
+        raise ValueError("Empty model output; expected YAML content")
+
+    lines = stripped.splitlines()
+    if lines and lines[0].startswith("```"):
+        lines = lines[1:]
+    if lines and lines[-1].startswith("```"):
+        lines = lines[:-1]
+
+    yaml_text = "\n".join(lines).strip()
+    if not yaml_text:
+        raise ValueError("No YAML content found in model output")
+    return yaml_text
````

As per coding guidelines: "Never trust external data (API responses, user input, file content)", "Always validate all user input before processing at system boundaries", and "Fail fast with clear error messages when validation fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/utils.py` around lines 6 - 13, The extract_yaml
function should validate and fail fast: trim the input, raise ValueError("Empty
model output; expected YAML content") if empty, then strip leading/trailing
code-fence lines only if present (remove opening ``` and closing ``` together
when both exist), build yaml_text = "\n".join(lines).strip(), and raise
ValueError("No YAML content found in model output") if yaml_text is empty before
returning it; update the extract_yaml implementation to perform these checks and
raise clear errors rather than silently returning empty/invalid strings.
- AGENTS.md: fix stale skill count (115 → 117) in project structure
- run.py: replace remaining print() with logger, add zero-division guard, create parent dirs for --output path
- runner.py: add returncode check for claude subprocess, clarify relative_to path traversal validation
- parser.py: use is_file() instead of exists(), catch KeyError for missing trace fields, add file check in parse_spec
- classifier.py: log warnings on malformed classification output, guard against non-dict JSON responses
- grader.py: filter negative indices from LLM classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Analysis Failed
Troubleshooting
```python
for cmd in scenario.setup_commands:
    parts = shlex.split(cmd)
    subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
```
Setup command failures are silently ignored
subprocess.run return values are not checked inside the setup_commands loop. If any setup command fails (e.g., npm install, pip install some-dep, or even a mistyped shell command), the failure is swallowed entirely. The agent then runs inside a broken environment and may score 0% compliance for reasons completely unrelated to the skill being tested — producing a false negative with no diagnostic information.
For consistency with every other subprocess call in this codebase (all of which check returncode), the result should be inspected:
```diff
 for cmd in scenario.setup_commands:
     parts = shlex.split(cmd)
-    subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+    proc = subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+    if proc.returncode != 0:
+        raise RuntimeError(
+            f"Setup command {cmd!r} failed (rc={proc.returncode}): "
+            f"{proc.stderr.decode(errors='replace')[:300]}"
+        )
```
```python
tool_calls = "\n".join(
    f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
    for i, event in enumerate(trace)
)
```
Literal newlines in `event.output` break the per-tool-call prompt format
The tool_calls string is assembled with "\n".join(...), using newlines as record separators. However, event.output — which is a raw string from tool results such as Bash — frequently contains embedded newlines (e.g., pytest output, cat output, compiler errors). When those newlines appear inside a [{i}] … entry, the classifier LLM receives a malformed listing where a single tool call spans multiple lines, making it indistinguishable from multiple separate entries. This can cause the classifier to miscount indices or skip events, silently corrupting classification results.
event.input is safe because inputs come from json.dumps(tool_input), which escapes newlines as \n. But event.output is a raw str(output_content) that preserves literal newlines.
Fix by replacing newlines before building the joined string:
```diff
 tool_calls = "\n".join(
-    f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
+    f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200].replace(chr(10), ' ').replace(chr(13), '')}"
     for i, event in enumerate(trace)
 )
```
```python
from scripts.parser import ObservationEvent
from scripts.scenario_generator import Scenario


SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
```
Sandbox directories are never cleaned up and IDs can collide across skills
Two related issues with the sandbox path strategy:
- No cleanup after successful runs. `_setup_sandbox` destroys and recreates the directory at the start of a run, but there is no cleanup at the end. After running skill-comply against several skills, `/tmp/skill-comply-sandbox/` accumulates stale directories indefinitely. Only a subsequent run against the same scenario ID would clean it up.
- Scenario IDs are LLM-generated and can collide across different skills. The LLM tends to produce generic names (e.g., `scenario_level_1`, `supportive`). When two different skills happen to produce the same `scenario.id`, they map to the same sandbox path. Running both concurrently (or just checking the residue from a previous run) can produce cross-contaminated environments and misleading compliance results.
Both issues can be addressed by incorporating a run-specific prefix and using a try/finally for cleanup in run_scenario:
```python
import uuid

SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")

def run_scenario(...) -> ScenarioRun:
    sandbox_dir = _safe_sandbox_dir(f"{scenario.id}_{uuid.uuid4().hex[:8]}")
    _setup_sandbox(sandbox_dir, scenario)
    try:
        result = subprocess.run(...)
        ...
        return ScenarioRun(...)
    finally:
        shutil.rmtree(sandbox_dir, ignore_errors=True)
```
Actionable comments posted: 2
♻️ Duplicate comments (1)
skills/skill-comply/scripts/runner.py (1)
85-89: ⚠️ Potential issue | 🔴 Critical
Reject and fail fast on generated setup commands.
`scenario.setup_commands` is generated external data, but Lines 87-89 split and execute it verbatim on the host. Lines 85-89 also ignore exit status, so a malicious or failed setup command can either escape the intended workspace or leave the scenario half-prepared while grading continues.
🔒 Suggested hardening
```diff
+def _run_setup_command(cmd: str, *, cwd: Path) -> None:
+    parts = shlex.split(cmd)
+    if not parts:
+        raise ValueError("Empty setup command")
+    if parts[0] not in ALLOWED_SETUP_COMMANDS:
+        raise ValueError(f"Disallowed setup command: {parts[0]!r}")
+    subprocess.run(
+        parts,
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+        timeout=30,
+        check=True,
+    )
+
 def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
     """Create sandbox directory and run setup commands."""
     if sandbox_dir.exists():
         shutil.rmtree(sandbox_dir)
     sandbox_dir.mkdir(parents=True)
-    subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
+    subprocess.run(
+        ["git", "init"],
+        cwd=sandbox_dir,
+        capture_output=True,
+        text=True,
+        timeout=30,
+        check=True,
+    )
     for cmd in scenario.setup_commands:
-        parts = shlex.split(cmd)
-        subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+        _run_setup_command(cmd, cwd=sandbox_dir)
```

As per coding guidelines: "Never trust external data (API responses, user input, file content)" and "Always handle errors explicitly at every level and never silently swallow errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/runner.py` around lines 85 - 89, The loop that shlex.splits and subprocess.run(s) each entry of scenario.setup_commands (using shlex.split(cmd) and subprocess.run(parts, cwd=sandbox_dir, capture_output=True)) must be hardened: validate or whitelist each command string before executing, disallow shell metacharacters or unexpected binaries, and refuse unknown/unsafe commands; execute with subprocess.run(..., check=True, shell=False, stdin=subprocess.DEVNULL, timeout=...) so failures raise and stop processing, pass an explicit restricted env and cwd=sandbox_dir, and catch subprocess.CalledProcessError/TimeoutError to log the error and fail fast (raise or return a non-zero status) from the runner function instead of silently ignoring return codes.
🧹 Nitpick comments (1)
skills/skill-comply/scripts/classifier.py (1)
10-12: Import placement violates PEP 8 ordering.
The local import at line 12 is placed after module-level code (the `logger` assignment). Per PEP 8, all imports should be grouped at the top of the file before any other statements.
♻️ Suggested fix
```diff
 import logging
 import subprocess
 from pathlib import Path
+
+from scripts.parser import ComplianceSpec, ObservationEvent

 logger = logging.getLogger(__name__)

-from scripts.parser import ComplianceSpec, ObservationEvent
-
 PROMPTS_DIR = Path(__file__).parent.parent / "prompts"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/classifier.py` around lines 10 - 12, Move the local import from scripts.parser (ComplianceSpec, ObservationEvent) above any module-level statements so imports are grouped at the top of the file; specifically, place the "from scripts.parser import ComplianceSpec, ObservationEvent" line before the logger = logging.getLogger(__name__) assignment in classifier.py to comply with PEP 8 import ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/skill-comply/scripts/runner.py`:
- Around line 16-17: The sandbox directory is currently deterministic and shared
(SANDBOX_BASE and ScenarioRun.sandbox_dir), which risks cross-run clobbering;
change creation to generate a unique per-run sandbox (e.g., use uuid4 or
tempfile.mkdtemp) by adding a helper like make_sandbox_dir(safe_id) that returns
Path(tempfile.mkdtemp(prefix=f"{safe_id}-", dir=SANDBOX_BASE)), update wherever
Sandbox is constructed to call that helper (replace direct
Path("/tmp/skill-comply-sandbox") + safe_id usage), and stop deleting the entire
shared base — only remove the specific per-run directory
(ScenarioRun.sandbox_dir) at teardown to avoid wiping other runs.
- Around line 103-149: The loop that parses stream-json lines must validate and
reject malformed events instead of silently skipping or assuming dicts: after
json.loads(line) ensure the result is a dict (raise or log and continue with a
clear error if not), replace msg.get("type") usage with a guarded access only
after validation, and validate that "content" for assistant messages is a list
before iterating (same validation already done for user messages) to avoid
AttributeError; update the parsing around the variables msg/msg_type/content and
the pending/ObservationEvent creation so malformed scalars/arrays/nulls are
detected and handled with a clear failure or logged rejection rather than
causing downstream crashes.
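A minimal sketch of the dict-validation the prompt asks for (function name is illustrative):

```python
import json

def split_stream_messages(stdout: str):
    """Partition stream-json lines into well-formed dict messages and rejects.

    json.loads happily returns scalars, arrays, and None, none of which
    support .get(); checking the type up front avoids AttributeError deep
    inside event parsing.
    """
    valid, rejected = [], []
    for line in stdout.strip().splitlines():
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if isinstance(msg, dict):
            valid.append(msg)
        else:
            rejected.append(line)
    return valid, rejected

stream = '{"type": "assistant"}\n["array"]\nnull\nnot json\n{"type": "user"}'
valid, rejected = split_stream_messages(stream)
```

The rejected lines can then be logged for auditing rather than silently dropped.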
---
Duplicate comments:
In `@skills/skill-comply/scripts/runner.py`:
- Around line 85-89: The loop that shlex.splits and subprocess.run(s) each entry
of scenario.setup_commands (using shlex.split(cmd) and subprocess.run(parts,
cwd=sandbox_dir, capture_output=True)) must be hardened: validate or whitelist
each command string before executing, disallow shell metacharacters or
unexpected binaries, and refuse unknown/unsafe commands; execute with
subprocess.run(..., check=True, shell=False, stdin=subprocess.DEVNULL,
timeout=...) so failures raise and stop processing, pass an explicit restricted
env and cwd=sandbox_dir, and catch subprocess.CalledProcessError/TimeoutError to
log the error and fail fast (raise or return a non-zero status) from the runner
function instead of silently ignoring return codes.
---
Nitpick comments:
In `@skills/skill-comply/scripts/classifier.py`:
- Around line 10-12: Move the local import from scripts.parser (ComplianceSpec,
ObservationEvent) above any module-level statements so imports are grouped at
the top of the file; specifically, place the "from scripts.parser import
ComplianceSpec, ObservationEvent" line before the logger =
logging.getLogger(__name__) assignment in classifier.py to comply with PEP 8
import ordering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8986065a-ccdf-431e-b620-da7400638c8f
📒 Files selected for processing (6)
- AGENTS.md
- skills/skill-comply/scripts/classifier.py
- skills/skill-comply/scripts/grader.py
- skills/skill-comply/scripts/parser.py
- skills/skill-comply/scripts/run.py
- skills/skill-comply/scripts/runner.py
🚧 Files skipped from review as they are similar to previous changes (4)
- skills/skill-comply/scripts/run.py
- skills/skill-comply/scripts/grader.py
- AGENTS.md
- skills/skill-comply/scripts/parser.py
```python
SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
ALLOWED_MODELS = frozenset({"haiku", "sonnet", "opus"})
```
Use a per-run sandbox directory.
The path is deterministic (/tmp/skill-comply-sandbox/<sanitized id>), and Lines 81-83 delete it before every run. Rerunning the same scenario—or two distinct IDs that normalize to the same safe_id—can wipe another run's workspace and make ScenarioRun.sandbox_dir point at different contents than the run actually produced. A unique suffix (uuid4/mkdtemp) on the sanitized prefix avoids both collisions and cross-run clobbering.
Also applies to: 70-76, 81-83
🧰 Tools
🪛 Ruff (0.15.6)
[error] 16-16: Probable insecure usage of temporary file or directory: "/tmp/skill-comply-sandbox"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/runner.py` around lines 16 - 17, The sandbox
directory is currently deterministic and shared (SANDBOX_BASE and
ScenarioRun.sandbox_dir), which risks cross-run clobbering; change creation to
generate a unique per-run sandbox (e.g., use uuid4 or tempfile.mkdtemp) by
adding a helper like make_sandbox_dir(safe_id) that returns
Path(tempfile.mkdtemp(prefix=f"{safe_id}-", dir=SANDBOX_BASE)), update wherever
Sandbox is constructed to call that helper (replace direct
Path("/tmp/skill-comply-sandbox") + safe_id usage), and stop deleting the entire
shared base — only remove the specific per-run directory
(ScenarioRun.sandbox_dir) at teardown to avoid wiping other runs.
```python
for line in stdout.strip().splitlines():
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        continue

    msg_type = msg.get("type")

    if msg_type == "assistant":
        content = msg.get("message", {}).get("content", [])
        for block in content:
            if block.get("type") == "tool_use":
                tool_use_id = block.get("id", "")
                tool_input = block.get("input", {})
                input_str = (
                    json.dumps(tool_input)[:5000]
                    if isinstance(tool_input, dict)
                    else str(tool_input)[:5000]
                )
                pending[tool_use_id] = {
                    "tool": block.get("name", "unknown"),
                    "input": input_str,
                    "order": event_counter,
                }
                event_counter += 1

    elif msg_type == "user":
        content = msg.get("message", {}).get("content", [])
        if isinstance(content, list):
            for block in content:
                tool_use_id = block.get("tool_use_id", "")
                if tool_use_id in pending:
                    info = pending.pop(tool_use_id)
                    output_content = block.get("content", "")
                    if isinstance(output_content, list):
                        output_str = json.dumps(output_content)[:5000]
                    else:
                        output_str = str(output_content)[:5000]

                    events.append(ObservationEvent(
                        timestamp=f"T{info['order']:04d}",
                        event="tool_complete",
                        tool=info["tool"],
                        session=msg.get("session_id", "unknown"),
                        input=info["input"],
                        output=output_str,
                    ))
```
🧩 Analysis chain
🏁 Script executed:

```shell
cd skills/skill-comply && find . -name "runner.py" -type f
```

Repository: affaan-m/everything-claude-code
Length of output: 94

🏁 Script executed:

```shell
wc -l skills/skill-comply/scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 116

🏁 Script executed:

```shell
sed -n '90,160p' skills/skill-comply/scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 2799

🏁 Script executed:

```shell
cd skills/skill-comply && rg "_parse_stream_json" -B 3 -A 3 scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 497

🏁 Script executed:

```shell
cd skills/skill-comply && rg "parse_stream_json\|ObservationEvent" scripts/ -l
```

Repository: affaan-m/everything-claude-code
Length of output: 57

🏁 Script executed:

```shell
cd skills/skill-comply && head -50 scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 1453

🏁 Script executed:

```shell
cd skills/skill-comply && find . -name "*test*.py" -o -name "test_*" -type f | head -20
```

Repository: affaan-m/everything-claude-code
Length of output: 120

🏁 Script executed:

```shell
cd skills/skill-comply && rg "def test.*parse" . 2>/dev/null || echo "No tests found"
```

Repository: affaan-m/everything-claude-code
Length of output: 292

🏁 Script executed:

```shell
cd skills/skill-comply && grep -n "JSONDecodeError\|silent\|malformed" scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 115

🏁 Script executed:

```shell
cd skills/skill-comply && sed -n '1,30p' scripts/runner.py && echo "---" && sed -n '75,95p' scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 1425

🏁 Script executed:

```shell
cd skills/skill-comply && cat tests/test_parser.py | head -100
```

Repository: affaan-m/everything-claude-code
Length of output: 3212

🏁 Script executed:

```shell
cd skills/skill-comply && rg "stream-json\|JSONDecodeError" tests/ -B 2 -A 2
```

Repository: affaan-m/everything-claude-code
Length of output: 57

🏁 Script executed:

```shell
cd skills/skill-comply && rg "stream-json" . --type py -B 2 -A 5
```

Repository: affaan-m/everything-claude-code
Length of output: 1722

🏁 Script executed:

```shell
cd skills/skill-comply && sed -n '103,149p' scripts/runner.py | cat -n
```

Repository: affaan-m/everything-claude-code
Length of output: 2420
🏁 Script executed:
python3 << 'EOF'
import json
# Test what happens if JSON decodes to non-dict
test_cases = [
'["array"]',
'"string"',
'123',
'null',
'{"valid": "dict"}',
]
for case in test_cases:
try:
msg = json.loads(case)
print(f"Parsed {case:20} -> type {type(msg).__name__:8} -> msg.get('type') raises?", end=" ")
try:
result = msg.get("type")
print(f"No, got {result}")
except AttributeError as e:
print(f"Yes: {e}")
except json.JSONDecodeError:
print(f"Failed to parse {case}")
EOFRepository: affaan-m/everything-claude-code
Length of output: 617
🏁 Script executed:
cd skills/skill-comply && rg "observations\|ScenarioRun" scripts/runner.py -A 2 -B 2 | head -40Repository: affaan-m/everything-claude-code
Length of output: 57
Reject malformed stream-json events instead of silently dropping them.

Lines 103-107 silently skip JSON decode failures, and the parser assumes every event is dict-shaped. If `claude` emits a JSON scalar, array, or `null`, line 107's `msg.get("type")` crashes with `AttributeError: '<type>' object has no attribute 'get'`. The parser also iterates `content` without validating it is a list for `"assistant"` events (line 110), yet performs that check for `"user"` events (line 129); this inconsistent error handling masks malformed data and corrupts the trace used for compliance grading.
🧪 Suggested validation

```diff
-    for line in stdout.strip().splitlines():
+    for line_no, line in enumerate(stdout.splitlines(), start=1):
+        if not line.strip():
+            continue
         try:
-            msg = json.loads(line)
-        except json.JSONDecodeError:
-            continue
+            raw_msg = json.loads(line)
+        except json.JSONDecodeError as exc:
+            raise ValueError(
+                f"Invalid stream-json event at line {line_no}"
+            ) from exc
+
+        if not isinstance(raw_msg, dict):
+            raise ValueError(
+                f"Invalid stream-json event at line {line_no}: expected object"
+            )
+        msg = raw_msg

         msg_type = msg.get("type")

         if msg_type == "assistant":
-            content = msg.get("message", {}).get("content", [])
+            content = msg.get("message", {}).get("content", [])
+            if not isinstance(content, list):
+                raise ValueError(
+                    f"Invalid stream-json event at line {line_no}: content must be list"
+                )
```

Per coding guidelines: "Never trust external data (API responses, user input, file content)" and "Fail fast with clear error messages when validation fails".
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/runner.py` around lines 103-149, the loop that
parses stream-json lines must validate and reject malformed events instead of
silently skipping or assuming dicts: after json.loads(line) ensure the result is
a dict (raise, or log and continue, with a clear error if not), replace the
msg.get("type") usage with a guarded access only after validation, and validate
that "content" for assistant messages is a list before iterating (the same
validation already done for user messages) to avoid AttributeError; update the
parsing around the variables msg/msg_type/content and the
pending/ObservationEvent creation so malformed scalars/arrays/nulls are detected
and handled with a clear failure or logged rejection rather than causing
downstream crashes.
```
Summary

Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs from skill files, runs scenarios at 3 prompt strictness levels, classifies tool calls via LLM, and produces self-contained Markdown reports.

- `claude -p` in sandboxed directories

What's included

- `skills/skill-comply/SKILL.md`
- `scripts/`
- `prompts/`
- `tests/fixtures/`

Changes from #662

All 10 CodeRabbit review comments addressed:

- `_extract_yaml` to shared `utils.py` module
- `print()` with `logging` module in `run.py`
- `exists()` → `is_file()` to reject directories
- `any` → `Any` (uppercase)
- `result.stdout` in scenario generator
- `NameError` in `finally` block (spec_generator)
- `json.JSONDecodeError` handling in trace parser
- `scoring` section validation in spec parser
- `MAX_SKILL_FILE_SIZE` constant

Testing

- `uv run pytest -v` — 25 tests pass
- `npx markdownlint "skills/skill-comply/**/*.md"` — 0 errors
- `node scripts/ci/catalog.js --text` — counts match (28 agents, 117 skills, 59 commands)
- `node scripts/ci/validate-skills.js` — 117 skills validated

Checklist

🤖 Generated with Claude Code
Summary by cubic

Adds `skill-comply`: an automated way to measure if skills, rules, and agent definitions are actually followed. It generates behavioral specs, runs 3 pressure scenarios, classifies tool calls, checks ordering, and outputs a self-contained Markdown report.

New Features

- `.md` files into observable steps
- `claude -p` with stream-json trace parsing

Bug Fixes

- `--output`, added return-code checks for `claude`, and stricter sandbox path validation
- `is_file()`), better errors for missing trace fields, and spec file existence guard

Written for commit d72e3d1. Summary will update on new commits.
Summary by CodeRabbit
New Features
Documentation