feat(skills): add skill-comply — automated behavioral compliance measurement #724

shimo4228 wants to merge 2 commits into affaan-m:main from …
Conversation
Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs, runs scenarios at 3 strictness levels, classifies tool calls via LLM, and produces self-contained reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

A new compliance measurement skill (…

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant Run as run.py
participant SpecGen as spec_generator
participant ScenarioGen as scenario_generator
participant Runner as runner
participant Classifier as classifier
participant Grader as grader
participant Report as report
User->>Run: execute(skill_path, --model, --output)
Run->>SpecGen: generate_spec(skill_path)
SpecGen->>SpecGen: read skill + invoke Claude CLI
SpecGen-->>Run: ComplianceSpec YAML
Run->>ScenarioGen: generate_scenarios(spec_yaml)
ScenarioGen->>ScenarioGen: invoke Claude CLI (scenario_generator)
ScenarioGen-->>Run: [Scenario...]
loop for each Scenario
Run->>Runner: run_scenario(scenario, model)
Runner->>Runner: create sandbox, git init, run setup commands
Runner->>Runner: invoke Claude CLI (stream-json)
Runner->>Runner: parse stream -> ObservationEvents
Runner-->>Run: ScenarioRun(observations)
Run->>Classifier: classify_events(spec, observations)
Classifier->>Classifier: build prompt + invoke Claude CLI
Classifier-->>Run: {step_id: [event_idx...]}
Run->>Grader: grade(spec, observations, classifier_model)
Grader->>Grader: enforce temporal order, detect steps, compute compliance
Grader-->>Run: ComplianceResult
end
Run->>Report: generate_report(skill_path, spec, results)
Report->>Report: aggregate, build tables, timelines, recommendations
Report-->>Run: markdown_report
Run-->>User: saved report path + compliance summary
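The pipeline in the diagram can be sketched as a minimal Python driver. This is an illustrative stub, not the PR's code: the module and function names mirror the diagram, but every body below is a hypothetical stand-in for the real spec generation, sandboxed runs, and grading.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    level_name: str


@dataclass
class ComplianceResult:
    compliance_rate: float


def generate_spec(skill_path: str) -> dict:
    # Would read the skill and invoke the Claude CLI; stubbed here.
    return {"skill": skill_path, "steps": ["read", "test", "commit"]}


def generate_scenarios(spec: dict) -> list[Scenario]:
    # One scenario per strictness level, as in the diagram.
    return [Scenario(level) for level in ("loose", "normal", "strict")]


def run_scenario(scenario: Scenario, model: str) -> list[dict]:
    # Would sandbox, run `claude -p` with stream-json, and parse events.
    return [{"tool": "Bash", "input": "pytest"}]


def grade(spec: dict, observations: list[dict]) -> ComplianceResult:
    # Would classify events per step and enforce temporal order.
    return ComplianceResult(compliance_rate=1.0)


def run_pipeline(skill_path: str, model: str) -> list[tuple[str, ComplianceResult]]:
    spec = generate_spec(skill_path)
    results = []
    for scenario in generate_scenarios(spec):
        observations = run_scenario(scenario, model)
        results.append((scenario.level_name, grade(spec, observations)))
    return results
```

The real run.py additionally writes a markdown report aggregating these per-level results.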
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Greptile Summary

This PR adds …

Key changes:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant run.py
participant spec_generator.py
participant scenario_generator.py
participant runner.py
participant classifier.py
participant grader.py
participant report.py
User->>run.py: skill-comply <path>
run.py->>spec_generator.py: generate_spec(skill_path)
spec_generator.py->>spec_generator.py: claude -p spec_generator.md
spec_generator.py-->>run.py: ComplianceSpec
run.py->>scenario_generator.py: generate_scenarios(skill_path, spec_yaml)
scenario_generator.py->>scenario_generator.py: claude -p scenario_generator.md
scenario_generator.py-->>run.py: list[Scenario] (3 levels)
loop For each Scenario
run.py->>runner.py: run_scenario(scenario, model)
runner.py->>runner.py: _setup_sandbox (git init + setup_commands)
runner.py->>runner.py: claude -p stream-json
runner.py->>runner.py: _parse_stream_json → ObservationEvents
runner.py-->>run.py: ScenarioRun
run.py->>grader.py: grade(spec, observations)
grader.py->>classifier.py: classify_events(spec, trace)
classifier.py->>classifier.py: claude -p classifier.md
classifier.py-->>grader.py: dict[step_id, list[int]]
grader.py->>grader.py: _check_temporal_order per step
grader.py-->>run.py: ComplianceResult
end
run.py->>report.py: generate_report(skill_path, spec, results)
report.py-->>run.py: Markdown string
run.py-->>User: results/<skill>.md
Last reviewed commit: "fix(skill-comply): a..."
23 issues found across 23 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="skills/skill-comply/scripts/run.py">
<violation number="1" location="skills/skill-comply/scripts/run.py:108">
P2: Writing to a user-provided `--output` path can fail because parent directories are not created before `write_text()`.</violation>
<violation number="2" location="skills/skill-comply/scripts/run.py:112">
P2: Summary calculation divides by zero when no scenarios are generated, causing a runtime crash instead of a graceful error.</violation>
</file>
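Both run.py findings admit small defensive fixes. The sketch below uses hypothetical helper names (`write_report`, `summarize`) modeled on the described behavior, not the PR's actual code:

```python
from pathlib import Path


def write_report(output: Path, report: str) -> None:
    # Create parent directories so a user-provided --output path
    # like results/deep/skill.md does not fail on write_text().
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_text(report, encoding="utf-8")


def summarize(passed: int, total: int) -> str:
    # Guard the zero-scenario case instead of dividing by zero.
    if total == 0:
        return "no scenarios generated"
    return f"{passed / total:.0%} compliant"
```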
<file name="AGENTS.md">
<violation number="1" location="AGENTS.md:3">
P2: The changed summary line introduces an inconsistent skill total within AGENTS.md (117 vs 115), leaving documentation internally contradictory.</violation>
</file>
<file name="README.md">
<violation number="1" location="README.md:1074">
P2: README contains conflicting hardcoded skill counts (117 vs 102), indicating documentation drift and inconsistent user-facing metrics.</violation>
</file>
<file name="skills/skill-comply/scripts/classifier.py">
<violation number="1" location="skills/skill-comply/scripts/classifier.py:24">
P1: Classification parse failures are silently treated as empty matches, causing classifier errors to be scored as real non-compliance.</violation>
<violation number="2" location="skills/skill-comply/scripts/classifier.py:32">
P2: Prompt construction can crash on non-string trace payloads because `event.input`/`event.output` are sliced without coercion.</violation>
<violation number="3" location="skills/skill-comply/scripts/classifier.py:73">
P1: `_parse_classification` can throw uncaught `AttributeError` when LLM returns valid JSON that is not an object.</violation>
</file>
<file name="skills/skill-comply/prompts/scenario_generator.md">
<violation number="1" location="skills/skill-comply/prompts/scenario_generator.md:19">
P2: Ambiguous `{id}` token in setup command example can be executed literally; `setup_commands` are run verbatim and `cwd` is already a per-scenario sandbox, so this risks shared `/tmp/skill-comply-sandbox/{id}` directories and cross-scenario leakage.</violation>
<violation number="2" location="skills/skill-comply/prompts/scenario_generator.md:52">
P1: Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated `setup_commands` that are later executed.</violation>
</file>
<file name="skills/skill-comply/scripts/parser.py">
<violation number="1" location="skills/skill-comply/scripts/parser.py:49">
P2: `parse_trace` only checks `exists()`, so directory paths pass validation and fail later with `IsADirectoryError` instead of an explicit file-path validation error.</violation>
<violation number="2" location="skills/skill-comply/scripts/parser.py:87">
P2: Detector step references are not validated against declared step IDs, so invalid after_step/before_step values can silently alter grading behavior or force failures.</violation>
</file>
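Both parser.py findings lend themselves to small validation helpers. This sketch assumes a spec shape where each step is a mapping with an `id` and an optional `detector` carrying `after_step`/`before_step`; the actual field names in the PR may differ:

```python
from pathlib import Path


def require_file(path: Path) -> None:
    # exists() alone lets directories through; fail early and clearly.
    if not path.is_file():
        raise FileNotFoundError(f"Not a readable file: {path}")


def validate_step_refs(steps: list[dict]) -> None:
    # Cross-check detector after_step/before_step against declared IDs
    # so a typo fails loudly instead of silently altering grading.
    known = {s["id"] for s in steps}
    for step in steps:
        detector = step.get("detector", {})
        for key in ("after_step", "before_step"):
            ref = detector.get(key)
            if ref is not None and ref not in known:
                raise ValueError(
                    f"Step {step['id']!r}: {key} references unknown step {ref!r}"
                )
```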
<file name="skills/skill-comply/prompts/classifier.md">
<violation number="1" location="skills/skill-comply/prompts/classifier.md:9">
P1: Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.</violation>
</file>
<file name="skills/skill-comply/scripts/scenario_generator.py">
<violation number="1" location="skills/skill-comply/scripts/scenario_generator.py:44">
P2: Unbounded prompt content is passed as a single CLI argument, which can exceed OS argv limits and cause subprocess invocation failure on larger skills/specs.</violation>
<violation number="2" location="skills/skill-comply/scripts/scenario_generator.py:60">
P2: LLM-generated YAML is parsed and immediately indexed without validating the top-level type or required keys. If the LLM returns malformed YAML or missing fields, this raises runtime KeyError/TypeError and crashes scenario generation. Add explicit schema/type checks (and optionally retries) before indexing.</violation>
</file>
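One way to satisfy the second finding is to validate the parsed object's shape before any indexing. The required field names below (`id`, `level_name`, `prompt`) are assumptions about the scenario schema, not taken from the PR:

```python
def validate_scenarios(parsed: object) -> list[dict]:
    """Validate the top-level shape of LLM-generated scenario YAML.

    Checks structure up front so malformed output raises one clear
    error instead of a KeyError/TypeError deep in the scenario loop.
    """
    if not isinstance(parsed, dict):
        raise ValueError(
            f"Expected a mapping at top level, got {type(parsed).__name__}"
        )
    scenarios = parsed.get("scenarios")
    if not isinstance(scenarios, list) or not scenarios:
        raise ValueError("YAML must contain a non-empty 'scenarios' list")
    required = {"id", "level_name", "prompt"}  # assumed field names
    for i, s in enumerate(scenarios):
        if not isinstance(s, dict):
            raise ValueError(f"scenarios[{i}] is not a mapping")
        missing = required - s.keys()
        if missing:
            raise ValueError(f"scenarios[{i}] missing keys: {sorted(missing)}")
    return scenarios
```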
<file name="skills/skill-comply/scripts/report.py">
<violation number="1" location="skills/skill-comply/scripts/report.py:114">
P2: Markdown table cells use unescaped dynamic text, so `|`/newlines in scenario names, step metadata, or failure reasons can corrupt report tables.</violation>
</file>
<file name="skills/skill-comply/scripts/grader.py">
<violation number="1" location="skills/skill-comply/scripts/grader.py:75">
P2: Classifier indices are only checked against the upper bound; negative indices from LLM output will silently map to the end of the list, producing incorrect evidence. Index order is also unnormalized, so non-monotonic output changes which candidate is selected first. Validate bounds (>=0) and normalize ordering before mapping to events.</violation>
</file>
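The grader finding suggests normalizing classifier indices before mapping them to events. A minimal sketch (hypothetical helper, not the PR's code):

```python
def normalize_indices(indices: list[int], n_events: int) -> list[int]:
    # Drop negative and out-of-range indices (a negative index would
    # otherwise silently wrap to the end of the event list), dedupe,
    # and sort so candidate order no longer depends on LLM output order.
    valid = {i for i in indices if 0 <= i < n_events}
    return sorted(valid)
```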
<file name="skills/skill-comply/scripts/utils.py">
<violation number="1" location="skills/skill-comply/scripts/utils.py:9">
P1: YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.</violation>
</file>
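One robust alternative to first/last-line fence stripping is to search for a fenced block anywhere in the response and fall back to the raw text. This is a sketch, not the PR's `extract_yaml`:

```python
import re


def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, even when the fenced block is
    surrounded by prose rather than being the whole response."""
    stripped = text.strip()
    if not stripped:
        raise ValueError("Empty model output; expected YAML content")
    # Prefer the first fenced block anywhere in the response;
    # the fence may carry a "yaml"/"yml" language tag or none.
    match = re.search(r"```(?:ya?ml)?\s*\n(.*?)\n```", stripped, re.DOTALL)
    if match:
        return match.group(1).strip()
    return stripped
```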
<file name="skills/skill-comply/tests/test_parser.py">
<violation number="1" location="skills/skill-comply/tests/test_parser.py:28">
P2: Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.</violation>
<violation number="2" location="skills/skill-comply/tests/test_parser.py:51">
P2: Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.</violation>
</file>
<file name="skills/skill-comply/scripts/runner.py">
<violation number="1" location="skills/skill-comply/scripts/runner.py:53">
P2: `subprocess.run` is not checked for success, so non‑zero Claude exits (auth/model errors, CLI failures) still get parsed and graded as if they were valid runs, risking misclassification of runtime failures as compliance failures.</violation>
<violation number="2" location="skills/skill-comply/scripts/runner.py:67">
P2: Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and `_setup_sandbox` then deletes that directory, causing destructive overwrites of scenario artifacts.</violation>
<violation number="3" location="skills/skill-comply/scripts/runner.py:79">
P2: Setup commands ignore failures; if `git init` or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.</violation>
<violation number="4" location="skills/skill-comply/scripts/runner.py:83">
P1: Commands from LLM-generated scenarios are executed directly without validation or sandboxing, enabling arbitrary command execution if scenario content is untrusted.</violation>
</file>
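The runner findings point toward three small hardenings: collision-proof sandbox names, an allowlist for LLM-supplied setup commands, and checked subprocess calls. A sketch under those assumptions (the allowlist contents are illustrative, not from the PR):

```python
import hashlib
import shlex
import subprocess
from pathlib import Path

ALLOWED_COMMANDS = {"mkdir", "touch", "git", "echo"}  # assumed allowlist


def safe_sandbox_dir(base: Path, scenario_id: str) -> Path:
    # Append a short hash of the raw ID so distinct scenarios that
    # sanitize to the same name cannot collide and overwrite each other.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in scenario_id)
    digest = hashlib.sha256(scenario_id.encode()).hexdigest()[:8]
    return base / f"{safe}-{digest}"


def run_setup_command(cmd: str, cwd: Path) -> None:
    argv = shlex.split(cmd)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise ValueError(f"Setup command not allowlisted: {cmd!r}")
    # check=True surfaces failures instead of proceeding in a
    # half-built sandbox; timeout bounds runaway commands.
    subprocess.run(argv, cwd=cwd, check=True, capture_output=True, timeout=60)
```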
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
- Prompts should be realistic — something a developer would actually ask

Skill content:
P1: Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated setup_commands that are later executed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/prompts/scenario_generator.md, line 52:
<comment>Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated `setup_commands` that are later executed.</comment>
<file context>
@@ -0,0 +1,62 @@
+- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
+- Prompts should be realistic — something a developer would actually ask
+
+Skill content:
+
+---
</file context>
{steps_description}

Tool calls (numbered):
{tool_calls}
P1: Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/prompts/classifier.md, line 9:
<comment>Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.</comment>
<file context>
@@ -0,0 +1,24 @@
+{steps_description}
+
+Tool calls (numbered):
+{tool_calls}
+
+Respond with ONLY a JSON object mapping step_id to a list of matching tool call numbers.
</file context>
def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, stripping markdown fences if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
P1: YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/utils.py, line 9:
<comment>YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.</comment>
<file context>
@@ -0,0 +1,13 @@
+def extract_yaml(text: str) -> str:
+ """Extract YAML from LLM output, stripping markdown fences if present."""
+ lines = text.strip().splitlines()
+ if lines and lines[0].startswith("```"):
+ lines = lines[1:]
+ if lines and lines[-1].startswith("```"):
</file context>
def test_nonexistent_file_raises(self) -> None:
    with pytest.raises(FileNotFoundError):
        parse_trace(Path("/nonexistent/trace.jsonl"))
P2: Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/tests/test_parser.py, line 51:
<comment>Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.</comment>
<file context>
@@ -0,0 +1,90 @@
+
+ def test_nonexistent_file_raises(self) -> None:
+ with pytest.raises(FileNotFoundError):
+ parse_trace(Path("/nonexistent/trace.jsonl"))
+
+
</file context>
def test_events_sorted_by_timestamp(self) -> None:
    events = parse_trace(FIXTURES / "compliant_trace.jsonl")
    timestamps = [e.timestamp for e in events]
    assert timestamps == sorted(timestamps)
P2: Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/tests/test_parser.py, line 28:
<comment>Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.</comment>
<file context>
@@ -0,0 +1,90 @@
+ def test_events_sorted_by_timestamp(self) -> None:
+ events = parse_trace(FIXTURES / "compliant_trace.jsonl")
+ timestamps = [e.timestamp for e in events]
+ assert timestamps == sorted(timestamps)
+
+ def test_event_fields(self) -> None:
</file context>
def _safe_sandbox_dir(scenario_id: str) -> Path:
    """Sanitize scenario ID and ensure path stays within sandbox base."""
    safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
P2: Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and _setup_sandbox then deletes that directory, causing destructive overwrites of scenario artifacts.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/runner.py, line 67:
<comment>Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and `_setup_sandbox` then deletes that directory, causing destructive overwrites of scenario artifacts.</comment>
<file context>
@@ -0,0 +1,155 @@
+
+def _safe_sandbox_dir(scenario_id: str) -> Path:
+ """Sanitize scenario ID and ensure path stays within sandbox base."""
+ safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
+ path = SANDBOX_BASE / safe_id
+ path.resolve().relative_to(SANDBOX_BASE.resolve())
</file context>
| @@ -0,0 +1,155 @@ | |||
| """Run scenarios via claude -p and parse tool calls from stream-json output.""" | |||
P2: Setup commands ignore failures; if git init or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/runner.py, line 79:
<comment>Setup commands ignore failures; if `git init` or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.</comment>
<file context>
@@ -0,0 +1,155 @@
+ shutil.rmtree(sandbox_dir)
+ sandbox_dir.mkdir(parents=True)
+
+ subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
+
+ for cmd in scenario.setup_commands:
</file context>
Actionable comments posted: 6
🧹 Nitpick comments (10)
skills/skill-comply/prompts/classifier.md (1)
19-20: Generalize filename rules to avoid task-specific bias. Using calculator.py-specific examples may overfit classification. Prefer generic rules (e.g., test-path patterns vs non-test paths) so behavior stays stable across domains.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/prompts/classifier.md` around lines 19-20: the current rule examples use specific filenames ("test_calculator.py", "calculator.py"), which risks task-specific bias. Update the classifier prompt to use generic filename/path patterns instead (e.g., test path patterns like "tests/", "test_*.py", "*_test.py" versus implementation file patterns) so the guidance applies broadly across domains. Find the lines that mention "test_calculator.py" and "calculator.py", replace them with these generic patterns, and add a short explanatory phrase (e.g., "files under tests/ or prefixed with test_ are tests") to keep behavior stable.

skills/skill-comply/.gitignore (1)
7-7: Consider committing uv.lock for reproducible environments. Ignoring uv.lock can make test and dependency resolution less deterministic across machines/CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/.gitignore` at line 7: remove uv.lock from the .gitignore so the lockfile is committed for reproducible installs. Delete the entry that currently lists "uv.lock", add the generated uv.lock to the repository, and commit it (regenerate with your package manager if needed) so dependency resolution is deterministic across machines/CI.

skills/skill-comply/prompts/scenario_generator.md (1)
19-19: Avoid hardcoded absolute sandbox paths in generated scenarios. Embedding /tmp/skill-comply-sandbox/... in the prompt template can make scenario setup brittle across environments. Prefer a placeholder tied to the runner-provided sandbox root.

As per coding guidelines: "Do not use hardcoded values; use constants or configuration instead".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/prompts/scenario_generator.md` at line 19: replace the hardcoded "/tmp/skill-comply-sandbox/{id}/..." path in the scenario template string with a runner-provided sandbox root placeholder (e.g., "{sandbox_root}" or "${SANDBOX_ROOT}") so the mkdir command becomes "mkdir -p {sandbox_root}/{id}/src {sandbox_root}/{id}/tests". Update scenario_generator.md where that string appears and ensure the consuming runner injects the sandbox root at runtime instead of using the absolute /tmp path.

skills/skill-comply/scripts/spec_generator.py (1)
50-53: Add validation for empty Claude output before parsing. Unlike scenario_generator.py (lines 53-54), this function doesn't validate that result.stdout is non-empty before extraction. If Claude returns empty output with exit code 0, extract_yaml returns an empty string, leading to confusing parse errors during retries.

Suggested fix

 if result.returncode != 0:
     raise RuntimeError(f"claude -p failed: {result.stderr}")
+if not result.stdout.strip():
+    raise RuntimeError("claude -p returned empty output")
+
 raw_yaml = extract_yaml(result.stdout)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/spec_generator.py` around lines 50-53: the Claude call checks returncode but doesn't validate result.stdout before calling extract_yaml. Add a guard after the returncode check to verify result.stdout is non-empty (e.g., if not result.stdout.strip():) and raise a RuntimeError stating that Claude returned empty output, including any available result.stderr. Follow the same pattern as scenario_generator.py so extract_yaml is only invoked with non-empty input.

skills/skill-comply/scripts/run.py (1)
96-100: Replace print() with logger.info() for consistency. Lines 96 and 100 use print() while the rest of the module uses logger.info(). This creates inconsistent output behavior (e.g., if logging is redirected).

As per coding guidelines: "Avoid using print() statements in Python code; use the logging module instead."

Suggested fix

 for scenario in scenarios:
-    print(f"  Running {scenario.level_name}...", end="", flush=True)
+    logger.info("  Running %s...", scenario.level_name)
     run = run_scenario(scenario, model=args.model)
     result = grade(spec, list(run.observations))
     graded_results.append((scenario.level_name, result, list(run.observations)))
-    print(f" {result.compliance_rate:.0%}")
+    logger.info("  %s: %.0f%%", scenario.level_name, result.compliance_rate * 100)
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/run.py` around lines 96-100: replace the two print() calls around running scenarios with logger.info() so output follows the module's logging configuration. Use logger.info to log the "Running {scenario.level_name}..." message (preserving the same text, combining messages if needed to account for the no-newline behavior) and log the result line with the formatted compliance_rate. Update the calls near run_scenario, grade, and graded_results handling so logs are consistent with the module logger.

skills/skill-comply/scripts/classifier.py (1)
70-78: Consider logging when classification parsing fails. Returning {} on JSONDecodeError/ValueError silently masks LLM output issues. This could lead to false compliance results (all steps appear as "not detected") without any indication of the underlying problem.

Suggested improvement

+import logging
+
+logger = logging.getLogger(__name__)
+
 def _parse_classification(text: str) -> dict[str, list[int]]:
     ...
     try:
         parsed = json.loads(cleaned)
         # Validate structure
         return {
             k: [int(i) for i in v]
             for k, v in parsed.items()
             if isinstance(v, list)
         }
     except (json.JSONDecodeError, ValueError):
+        logger.warning("Failed to parse classification output: %s", cleaned[:200])
         return {}
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/classifier.py` around lines 70-78: the except block that currently catches json.JSONDecodeError/ValueError after json.loads(cleaned) silently returns {} and masks LLM output issues. Update that except block (where parsed and cleaned are used) to log the exception details and the raw/cleaned LLM output before returning {} so failures are visible — include the exception message (and ideally the stack trace) plus the cleaned string in the log to aid debugging.

skills/skill-comply/scripts/runner.py (1)
16-16: Use the tempfile module instead of a hardcoded /tmp path. Hardcoding /tmp is flagged by static analysis (S108) and may not work correctly on all systems or in containerized environments.

Suggested fix

+import tempfile
+
-SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
+SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox"
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/runner.py` at line 16: replace the hardcoded SANDBOX_BASE = Path("/tmp/skill-comply-sandbox") with a tempfile-based solution. Import tempfile and either set SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox" for a platform-safe shared temp location, or (preferably) create a unique per-run directory with tempfile.mkdtemp() (e.g., SANDBOX_BASE = Path(tempfile.mkdtemp(prefix="skill-comply-"))) and ensure any created temp directory is cleaned up appropriately. Update runner.py to import tempfile and adjust any code that assumes a persistent /tmp path.

skills/skill-comply/scripts/scenario_generator.py (1)
56-68: Add error handling for YAML parsing failures. If the LLM returns malformed YAML or a missing scenarios key, yaml.safe_load or parsed["scenarios"] will raise exceptions with unclear error messages. Unlike spec_generator.py, which has retry logic, this function lacks any YAML error handling.

Suggested fix

 raw_yaml = extract_yaml(result.stdout)
-parsed = yaml.safe_load(raw_yaml)
-
-scenarios: list[Scenario] = []
-for s in parsed["scenarios"]:
+try:
+    parsed = yaml.safe_load(raw_yaml)
+    if not parsed or "scenarios" not in parsed:
+        raise RuntimeError("LLM output missing 'scenarios' key")
+except yaml.YAMLError as e:
+    raise RuntimeError(f"Failed to parse scenario YAML: {e}") from e
+
+scenarios: list[Scenario] = []
+for s in parsed["scenarios"]:
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/scenario_generator.py` around lines 56-68: wrap the YAML extraction and parsing in a try/except that catches yaml.YAMLError and TypeError/KeyError around yaml.safe_load and parsed["scenarios"]. Validate that parsed is a dict and contains the "scenarios" key before iterating, and on failure either raise a clear exception that includes the raw_yaml (or a truncated preview) and the original error, or trigger the same retry path used by spec_generator.py. Update the scenario construction block (where extract_yaml, yaml.safe_load, parsed["scenarios"], and Scenario(...) are used) to log/raise a descriptive error that aids debugging when the LLM returns malformed or missing YAML.

skills/skill-comply/scripts/grader.py (1)
88-92: Clarify: only the first valid event is used as evidence per step. The loop breaks after finding the first event that passes temporal checks (line 92). This means if multiple events could satisfy a step, only the first is recorded as evidence. If this is intentional (e.g., to enforce single-detection semantics), consider adding a brief comment. If multiple events should contribute, remove the break.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/grader.py` around lines 88-92: the loop over candidates currently stops at the first event that passes temporal checks (the break after matched.append(event)), which enforces single-evidence-per-step. Decide and apply one of two fixes: if single-detection is intentional, add a short clarifying comment above the loop referencing the relevant names (for example: "Only the first valid event is used as evidence for a step — see _check_temporal_order and matched") to make the behavior explicit; otherwise remove the break so the loop evaluates all candidates and appends every event for which _check_temporal_order(step, event, resolved, classified) returns None.

skills/skill-comply/scripts/parser.py (1)
79-101: Consider wrapping key accesses with clearer error messages. While the scoring section has explicit validation (lines 92-93), other required keys like id, name, steps, etc. will raise generic KeyError exceptions if missing. For a developer-facing tool, the current behavior is acceptable, but clearer error messages would improve debuggability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/parser.py` around lines 79 - 101, The parser currently accesses required keys like "id", "name", "source_rule", "version", and "steps" directly which yields generic KeyError if missing; add explicit validation before building Steps and returning ComplianceSpec (e.g., check raw contains "steps", "id", "name", "source_rule", "version") and raise KeyError with clear messages like "Missing 'id' in compliance spec" or "Missing 'steps' array in compliance spec"; also validate each step has required fields before creating Step/Detector to provide targeted errors; update the block that builds steps and the return of ComplianceSpec to use these explicit checks so errors are informative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Line 3: Update the inconsistent skill count in AGENTS.md: the header line
reads "117 skills" but the Project Structure block still shows "115"; open the
Project Structure section and change the "115" figure to "117" so both in-file
references (the top header "117 skills" and the Project Structure count) match
exactly.
In `@skills/skill-comply/scripts/parser.py`:
- Around line 74-76: parse_spec currently calls path.read_text() without
verifying the file exists (unlike parse_trace), which can produce a less clear
error; update parse_spec to check path.exists() (or path.is_file()) before
reading and raise a clear FileNotFoundError with the same style/message used in
parse_trace if the file is missing, then proceed to call yaml.safe_load as
before (refer to the parse_spec function and compare behavior to parse_trace).
In `@skills/skill-comply/scripts/report.py`:
- Line 52: The string passed to lines.append uses an unnecessary f-string prefix
despite having no placeholders; remove the leading "f" from the call to
lines.append(f"| Recommendation | All steps above threshold — no hook promotion
needed |") in report.py (locate the lines.append invocation that writes the
Recommendation row) so it becomes a plain string literal.
- Around line 39-40: The two calls to lines.append use f-strings with no
placeholders; change lines.append(f"| Metric | Value |") and
lines.append(f"|--------|-------|") to plain strings (lines.append("| Metric |
Value |") and lines.append("|--------|-------|")) so static analysis stops
flagging useless f-strings; update any similar lines.append occurrences in
report.py that use f"" without interpolation.
In `@skills/skill-comply/scripts/runner.py`:
- Around line 81-83: The loop in runner.py currently executes LLM-provided
scenario.setup_commands by calling shlex.split(cmd) and subprocess.run(...)
which is unsafe; replace this with validation against an allowlist of permitted
executables/flags (e.g., only allow "mkdir", "touch", "echo" and safe arg
patterns) before calling subprocess.run, reject or log and skip any command that
doesn't match the allowlist, and/or require explicit user confirmation for
non-allowlisted commands; additionally consider executing allowed commands in a
hardened sandbox (restricted cwd, unprivileged user, or container) using the
existing sandbox_dir and ensure all subprocess.run invocations set shell=False,
check=True, timeout, and capture_output/log stdout/stderr for auditing (refer to
scenario.setup_commands, shlex.split, subprocess.run, and sandbox_dir).
In `@skills/skill-comply/scripts/utils.py`:
- Around line 6-13: The extract_yaml function should validate and fail fast:
trim the input, raise ValueError("Empty model output; expected YAML content") if
empty, then strip leading/trailing code-fence lines only if present (remove
opening ``` and closing ``` together when both exist), build yaml_text =
"\n".join(lines).strip(), and raise ValueError("No YAML content found in model
output") if yaml_text is empty before returning it; update the extract_yaml
implementation to perform these checks and raise clear errors rather than
silently returning empty/invalid strings.
---
Nitpick comments:
In `@skills/skill-comply/.gitignore`:
- Line 7: Remove uv.lock from the .gitignore so the lockfile is committed for
reproducible installs; update the .gitignore entry that currently lists
"uv.lock" to delete that line, add the generated uv.lock to the repository, and
commit the lockfile (regenerate it with your package manager if needed) so
dependency resolution is deterministic across machines/CI.
In `@skills/skill-comply/prompts/classifier.md`:
- Around line 19-20: The current rule examples use specific filenames
("test_calculator.py", "calculator.py") which risks task-specific bias; update
the classifier prompt in classifier.md to use generic filename/path patterns
instead (e.g., test path patterns like "tests/", "test_*.py", "*_test.py" for
test files versus non-test paths or implementation file patterns for
implementation files) so the guidance applies broadly across domains; find the
lines that mention "test_calculator.py" and "calculator.py" and replace them
with these generic patterns and a short explanatory phrase (e.g., "files under
tests/ or prefixed with test_ are tests") to keep behavior stable.
In `@skills/skill-comply/prompts/scenario_generator.md`:
- Line 19: Replace the hardcoded "/tmp/skill-comply-sandbox/{id}/..." path in
the scenario template string with a runner-provided sandbox root placeholder
(e.g., "{sandbox_root}" or "${SANDBOX_ROOT}") so the mkdir command becomes
"mkdir -p {sandbox_root}/{id}/src {sandbox_root}/{id}/tests"; update the
scenario_generator.md template where that exact string appears and ensure the
consuming runner injects the sandbox root constant/config value at runtime
instead of using the absolute /tmp path.
In `@skills/skill-comply/scripts/classifier.py`:
- Around line 70-78: The except block that currently catches
json.JSONDecodeError/ValueError after json.loads(cleaned) silently returns {}
and masks LLM output issues; update that except block (where parsed and cleaned
are used) to log the exception details and the raw/cleaned LLM output before
returning {} so failures are visible — include the exception message (and
ideally stack/trace) plus the cleaned string in the log to aid debugging.
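A minimal sketch of the logged-fallback pattern this comment asks for (function name, logger name, and message format are illustrative):

```python
import json
import logging

logger = logging.getLogger("skill_comply.classifier")

def parse_classification(cleaned: str) -> dict:
    """Parse the classifier's JSON reply; log and fall back instead of failing silently."""
    try:
        parsed = json.loads(cleaned)
        if not isinstance(parsed, dict):
            raise ValueError(f"expected JSON object, got {type(parsed).__name__}")
        return parsed
    except (json.JSONDecodeError, ValueError) as exc:
        # Surface the failure and the offending payload instead of masking it.
        logger.warning("Malformed classification output: %s; raw=%r", exc, cleaned[:500])
        return {}

ok = parse_classification('{"step-1": [0, 2]}')
bad = parse_classification("not json at all")
non_dict = parse_classification("[1, 2, 3]")
```

The caller still receives `{}` on failure, but the warning log now records both the exception and the raw output for debugging.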
In `@skills/skill-comply/scripts/grader.py`:
- Around line 88-92: The loop over candidates currently stops at the first event
that passes temporal checks (the break after matched.append(event)), which
enforces single-evidence-per-step; decide and apply one of two fixes: if
single-detection is intentional, add a short clarifying comment above the loop
referencing the variables/functions (for example: "Only the first valid event is
used as evidence for a step — see _check_temporal_order and matched") to make
the behavior explicit; otherwise remove the break so the loop continues to
evaluate all candidates and append every event for which
_check_temporal_order(step, event, resolved, classified) returns None.
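The two behaviors can be compared side by side; a small illustrative sketch (the names are hypothetical stand-ins, not the grader's real signatures):

```python
def collect_evidence(candidates, passes_temporal_check, first_only=False):
    """Gather evidence events for a step.

    first_only=True mirrors the current break-on-first behavior; False keeps
    every candidate that survives the temporal check.
    """
    matched = []
    for event in candidates:
        if passes_temporal_check(event):
            matched.append(event)
            if first_only:
                break  # single-evidence-per-step: stop at the first valid event
    return matched

candidates = [1, 2, 3, 4]
is_valid = lambda e: e % 2 == 0  # stand-in for _check_temporal_order returning None
single = collect_evidence(candidates, is_valid, first_only=True)
every = collect_evidence(candidates, is_valid)
```

Either choice is defensible; the comment's point is that the code should state which one is intended.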
In `@skills/skill-comply/scripts/parser.py`:
- Around line 79-101: The parser currently accesses required keys like "id",
"name", "source_rule", "version", and "steps" directly which yields generic
KeyError if missing; add explicit validation before building Steps and returning
ComplianceSpec (e.g., check raw contains "steps", "id", "name", "source_rule",
"version") and raise KeyError with clear messages like "Missing 'id' in
compliance spec" or "Missing 'steps' array in compliance spec"; also validate
each step has required fields before creating Step/Detector to provide targeted
errors; update the block that builds steps and the return of ComplianceSpec to
use these explicit checks so errors are informative.
In `@skills/skill-comply/scripts/run.py`:
- Around line 96-100: Replace the two print() calls around running scenarios
with logger.info() so output follows the module's logging configuration: use
logger.info to log the "Running {scenario.level_name}..." message (preserve the
same text and no newline behavior by combining messages if needed) and log the
result line with the formatted compliance_rate; update the calls near
run_scenario, grade, and graded_results handling to use logger.info instead of
print so logs are consistent with the module logger.
In `@skills/skill-comply/scripts/runner.py`:
- Line 16: Replace the hardcoded SANDBOX_BASE =
Path("/tmp/skill-comply-sandbox") with a tempfile-based solution: import
tempfile and either set SANDBOX_BASE = Path(tempfile.gettempdir()) /
"skill-comply-sandbox" for a platform-safe shared temp location, or (preferably)
create a unique per-run directory with tempfile.mkdtemp() (e.g., SANDBOX_BASE =
Path(tempfile.mkdtemp(prefix="skill-comply-"))) and ensure any created temp
directory is cleaned up appropriately; update runner.py to import tempfile and
adjust any code that assumes a persistent /tmp path.
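A minimal sketch of the tempfile-based variant (constant and helper names are assumptions):

```python
import tempfile
from pathlib import Path

# Platform-safe shared base instead of a hardcoded /tmp path.
SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox"

def make_sandbox_dir() -> Path:
    """Create a unique per-run sandbox directory under the shared base."""
    SANDBOX_BASE.mkdir(parents=True, exist_ok=True)
    return Path(tempfile.mkdtemp(prefix="skill-comply-", dir=SANDBOX_BASE))

first = make_sandbox_dir()
second = make_sandbox_dir()
```

`mkdtemp` guarantees a fresh directory per call, so two runs (or two scenarios with colliding IDs) can never clobber each other's workspace.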
In `@skills/skill-comply/scripts/scenario_generator.py`:
- Around line 56-68: Wrap the YAML extraction and parsing in a try/except that
catches yaml.YAMLError and TypeError/KeyError around yaml.safe_load and
parsed["scenarios"]; validate that parsed is a dict and contains the "scenarios"
key before iterating, and on failure either raise a clear exception that
includes the raw_yaml (or a truncated preview) and the original error or trigger
the same retry path used by spec_generator.py; update the scenario construction
block (where extract_yaml, yaml.safe_load, parsed["scenarios"], and
Scenario(...) are used) to log/raise a descriptive error that aids debugging
when the LLM returns malformed or missing YAML.
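A sketch of the structural validation this prompt describes, separated from the YAML parsing step so it stays self-contained (function name and error messages are illustrative):

```python
def extract_scenarios(parsed, raw_preview: str) -> list:
    """Validate the parsed payload before building Scenario objects."""
    if not isinstance(parsed, dict):
        raise ValueError(
            f"Expected a mapping with a 'scenarios' key, got "
            f"{type(parsed).__name__}; raw output began: {raw_preview[:200]!r}"
        )
    if "scenarios" not in parsed:
        raise ValueError(
            f"Missing 'scenarios' key; raw output began: {raw_preview[:200]!r}"
        )
    return parsed["scenarios"]

good = extract_scenarios({"scenarios": [{"id": "s1"}]}, "scenarios:\n  - id: s1")
try:
    extract_scenarios(["not", "a", "dict"], "- not\n- a\n- dict")
    failure = None
except ValueError as exc:
    failure = str(exc)
```

In the real module this would sit between `yaml.safe_load` and the `Scenario(...)` construction loop, with `yaml.YAMLError` caught around the load itself.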
In `@skills/skill-comply/scripts/spec_generator.py`:
- Around line 50-53: The Claude call in spec_generator.py checks returncode but
doesn't validate result.stdout before calling extract_yaml; add a guard after
the returncode check to verify result.stdout is non-empty (e.g., if not
result.stdout.strip():) and raise a RuntimeError including the fact that Claude
returned empty output and any available result.stderr; ensure this uses the same
pattern as scenario_generator.py (checking result.stdout before calling
extract_yaml) so extract_yaml is only invoked with non-empty input.
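The guard pattern can be sketched generically (the function name is hypothetical, and `sys.executable` stands in for the Claude CLI so the sketch stays runnable):

```python
import subprocess
import sys

def run_cli(cmd: list) -> str:
    """Run a command, guarding both the exit code and empty stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed (rc={result.returncode}): {result.stderr[:300]}")
    if not result.stdout.strip():
        raise RuntimeError(f"Command returned empty output; stderr: {result.stderr[:300]}")
    return result.stdout

out = run_cli([sys.executable, "-c", "print('id: spec')"])
try:
    run_cli([sys.executable, "-c", "pass"])  # exits 0 but prints nothing
    empty_error = None
except RuntimeError as exc:
    empty_error = str(exc)
```

With this guard in place, `extract_yaml` is only ever invoked with non-empty input.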
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a55fef57-6794-46b5-9f49-f8fa75b7f64d
📒 Files selected for processing (23)
- AGENTS.md
- README.md
- skills/skill-comply/.gitignore
- skills/skill-comply/SKILL.md
- skills/skill-comply/fixtures/compliant_trace.jsonl
- skills/skill-comply/fixtures/noncompliant_trace.jsonl
- skills/skill-comply/fixtures/tdd_spec.yaml
- skills/skill-comply/prompts/classifier.md
- skills/skill-comply/prompts/scenario_generator.md
- skills/skill-comply/prompts/spec_generator.md
- skills/skill-comply/pyproject.toml
- skills/skill-comply/scripts/__init__.py
- skills/skill-comply/scripts/classifier.py
- skills/skill-comply/scripts/grader.py
- skills/skill-comply/scripts/parser.py
- skills/skill-comply/scripts/report.py
- skills/skill-comply/scripts/run.py
- skills/skill-comply/scripts/runner.py
- skills/skill-comply/scripts/scenario_generator.py
- skills/skill-comply/scripts/spec_generator.py
- skills/skill-comply/scripts/utils.py
- skills/skill-comply/tests/test_grader.py
- skills/skill-comply/tests/test_parser.py
```python
lines.append(f"| Metric | Value |")
lines.append(f"|--------|-------|")
```
Remove extraneous f prefix from strings without placeholders.
Static analysis correctly identifies f-strings without any placeholders. These should be plain strings.
Proposed fix
```diff
- lines.append(f"| Metric | Value |")
- lines.append(f"|--------|-------|")
+ lines.append("| Metric | Value |")
+ lines.append("|--------|-------|")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
lines.append("| Metric | Value |")
lines.append("|--------|-------|")
```
🧰 Tools
🪛 Ruff (0.15.6)
[error] 39-39: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 40-40: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/report.py` around lines 39 - 40, The two calls to
lines.append use f-strings with no placeholders; change lines.append(f"| Metric
| Value |") and lines.append(f"|--------|-------|") to plain strings
(lines.append("| Metric | Value |") and lines.append("|--------|-------|")) so
static analysis stops flagging useless f-strings; update any similar
lines.append occurrences in report.py that use f"" without interpolation.
```python
    step_names = ", ".join(promote_steps)
    lines.append(f"| Recommendation | **Promote {step_names} to hooks** |")
else:
    lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
```
Remove extraneous f prefix.
This string has no placeholders.
Proposed fix
```diff
- lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
+ lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
```

📝 Committable suggestion

```python
lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
```
🧰 Tools
🪛 Ruff (0.15.6)
[error] 52-52: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/report.py` at line 52, The string passed to
lines.append uses an unnecessary f-string prefix despite having no placeholders;
remove the leading "f" from the call to lines.append(f"| Recommendation | All
steps above threshold — no hook promotion needed |") in report.py (locate the
lines.append invocation that writes the Recommendation row) so it becomes a
plain string literal.
```python
for cmd in scenario.setup_commands:
    parts = shlex.split(cmd)
    subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
```
Security risk: executing LLM-generated commands without validation.
setup_commands originate from scenario_generator.py, which parses them from LLM output. Executing arbitrary shell commands from LLM responses is dangerous—a malicious or hallucinated command could delete files, exfiltrate data, or compromise the system.
Consider:
- Allowlisting permitted commands (e.g., `mkdir`, `touch`, `echo`)
- Running setup commands in a more restricted sandbox
- Adding user confirmation before executing
Example allowlist approach
```diff
+ALLOWED_SETUP_COMMANDS = frozenset({"mkdir", "touch", "echo", "cat"})
+
 def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
     """Create sandbox directory and run setup commands."""
     if sandbox_dir.exists():
         shutil.rmtree(sandbox_dir)
     sandbox_dir.mkdir(parents=True)
     subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
     for cmd in scenario.setup_commands:
         parts = shlex.split(cmd)
+        if parts and parts[0] not in ALLOWED_SETUP_COMMANDS:
+            raise ValueError(f"Disallowed setup command: {parts[0]!r}")
         subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
```

🧰 Tools
🪛 Ruff (0.15.6)
[error] 83-83: subprocess call: check for execution of untrusted input
(S603)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/runner.py` around lines 81 - 83, The loop in
runner.py currently executes LLM-provided scenario.setup_commands by calling
shlex.split(cmd) and subprocess.run(...) which is unsafe; replace this with
validation against an allowlist of permitted executables/flags (e.g., only allow
"mkdir", "touch", "echo" and safe arg patterns) before calling subprocess.run,
reject or log and skip any command that doesn't match the allowlist, and/or
require explicit user confirmation for non-allowlisted commands; additionally
consider executing allowed commands in a hardened sandbox (restricted cwd,
unprivileged user, or container) using the existing sandbox_dir and ensure all
subprocess.run invocations set shell=False, check=True, timeout, and
capture_output/log stdout/stderr for auditing (refer to scenario.setup_commands,
shlex.split, subprocess.run, and sandbox_dir).
````python
def extract_yaml(text: str) -> str:
    """Extract YAML from LLM output, stripping markdown fences if present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)
````
Harden YAML extraction for malformed/mixed LLM output.
extract_yaml assumes fences are only at the start/end and never validates extracted content. For untrusted model output, this can silently pass invalid payloads and defer failure downstream with unclear errors.
Proposed fix
````diff
 def extract_yaml(text: str) -> str:
     """Extract YAML from LLM output, stripping markdown fences if present."""
-    lines = text.strip().splitlines()
-    if lines and lines[0].startswith("```"):
-        lines = lines[1:]
-    if lines and lines[-1].startswith("```"):
-        lines = lines[:-1]
-    return "\n".join(lines)
+    stripped = text.strip()
+    if not stripped:
+        raise ValueError("Empty model output; expected YAML content")
+
+    lines = stripped.splitlines()
+    if lines and lines[0].startswith("```"):
+        lines = lines[1:]
+    if lines and lines[-1].startswith("```"):
+        lines = lines[:-1]
+
+    yaml_text = "\n".join(lines).strip()
+    if not yaml_text:
+        raise ValueError("No YAML content found in model output")
+    return yaml_text
````

As per coding guidelines: "Never trust external data (API responses, user input, file content)", "Always validate all user input before processing at system boundaries", and "Fail fast with clear error messages when validation fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/utils.py` around lines 6 - 13, The extract_yaml
function should validate and fail fast: trim the input, raise ValueError("Empty
model output; expected YAML content") if empty, then strip leading/trailing
code-fence lines only if present (remove opening ``` and closing ``` together
when both exist), build yaml_text = "\n".join(lines).strip(), and raise
ValueError("No YAML content found in model output") if yaml_text is empty before
returning it; update the extract_yaml implementation to perform these checks and
raise clear errors rather than silently returning empty/invalid strings.
- AGENTS.md: fix stale skill count (115 → 117) in project structure
- run.py: replace remaining print() with logger, add zero-division guard, create parent dirs for --output path
- runner.py: add returncode check for claude subprocess, clarify relative_to path traversal validation
- parser.py: use is_file() instead of exists(), catch KeyError for missing trace fields, add file check in parse_spec
- classifier.py: log warnings on malformed classification output, guard against non-dict JSON responses
- grader.py: filter negative indices from LLM classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Analysis Failed
Troubleshooting
```python
for cmd in scenario.setup_commands:
    parts = shlex.split(cmd)
    subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
```
Setup command failures are silently ignored
subprocess.run return values are not checked inside the setup_commands loop. If any setup command fails (e.g., npm install, pip install some-dep, or even a mistyped shell command), the failure is swallowed entirely. The agent then runs inside a broken environment and may score 0% compliance for reasons completely unrelated to the skill being tested — producing a false negative with no diagnostic information.
For consistency with every other subprocess call in this codebase (all of which check returncode), the result should be inspected:
```diff
 for cmd in scenario.setup_commands:
     parts = shlex.split(cmd)
-    subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+    proc = subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+    if proc.returncode != 0:
+        raise RuntimeError(
+            f"Setup command {cmd!r} failed (rc={proc.returncode}): "
+            f"{proc.stderr.decode(errors='replace')[:300]}"
+        )
```
```python
tool_calls = "\n".join(
    f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
    for i, event in enumerate(trace)
)
```
Literal newlines in `event.output` break the per-tool-call prompt format
The tool_calls string is assembled with "\n".join(...), using newlines as record separators. However, event.output — which is a raw string from tool results such as Bash — frequently contains embedded newlines (e.g., pytest output, cat output, compiler errors). When those newlines appear inside a [{i}] … entry, the classifier LLM receives a malformed listing where a single tool call spans multiple lines, making it indistinguishable from multiple separate entries. This can cause the classifier to miscount indices or skip events, silently corrupting classification results.
event.input is safe because inputs come from json.dumps(tool_input), which escapes newlines as \n. But event.output is a raw str(output_content) that preserves literal newlines.
Fix by replacing newlines before building the joined string:
```diff
 tool_calls = "\n".join(
-    f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
+    f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200].replace(chr(10), ' ').replace(chr(13), '')}"
     for i, event in enumerate(trace)
 )
```
```python
from scripts.parser import ObservationEvent
from scripts.scenario_generator import Scenario


SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
```
Sandbox directories are never cleaned up and IDs can collide across skills
Two related issues with the sandbox path strategy:
- No cleanup after successful runs. `_setup_sandbox` destroys and recreates the directory at the start of a run, but there is no cleanup at the end. After running skill-comply against several skills, `/tmp/skill-comply-sandbox/` accumulates stale directories indefinitely. Only a subsequent run against the same scenario ID would clean it up.
- Scenario IDs are LLM-generated and can collide across different skills. The LLM tends to produce generic names (e.g., `scenario_level_1`, `supportive`). When two different skills happen to produce the same `scenario.id`, they map to the same sandbox path. Running both concurrently (or just checking the residue from a previous run) can produce cross-contaminated environments and misleading compliance results.
Both issues can be addressed by incorporating a run-specific prefix and using a try/finally for cleanup in run_scenario:
```python
import uuid

SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")

def run_scenario(...) -> ScenarioRun:
    sandbox_dir = _safe_sandbox_dir(f"{scenario.id}_{uuid.uuid4().hex[:8]}")
    _setup_sandbox(sandbox_dir, scenario)
    try:
        result = subprocess.run(...)
        ...
        return ScenarioRun(...)
    finally:
        shutil.rmtree(sandbox_dir, ignore_errors=True)
```
Actionable comments posted: 2
♻️ Duplicate comments (1)
skills/skill-comply/scripts/runner.py (1)
85-89: ⚠️ Potential issue | 🔴 Critical
Reject and fail fast on generated setup commands.
`scenario.setup_commands` is generated external data, but Lines 87-89 split and execute it verbatim on the host. Lines 85-89 also ignore exit status, so a malicious or failed setup command can either escape the intended workspace or leave the scenario half-prepared while grading continues.
🔒 Suggested hardening
```diff
+def _run_setup_command(cmd: str, *, cwd: Path) -> None:
+    parts = shlex.split(cmd)
+    if not parts:
+        raise ValueError("Empty setup command")
+    if parts[0] not in ALLOWED_SETUP_COMMANDS:
+        raise ValueError(f"Disallowed setup command: {parts[0]!r}")
+    subprocess.run(
+        parts,
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+        timeout=30,
+        check=True,
+    )
+
 def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
     """Create sandbox directory and run setup commands."""
     if sandbox_dir.exists():
         shutil.rmtree(sandbox_dir)
     sandbox_dir.mkdir(parents=True)
-    subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
+    subprocess.run(
+        ["git", "init"],
+        cwd=sandbox_dir,
+        capture_output=True,
+        text=True,
+        timeout=30,
+        check=True,
+    )
     for cmd in scenario.setup_commands:
-        parts = shlex.split(cmd)
-        subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+        _run_setup_command(cmd, cwd=sandbox_dir)
```

As per coding guidelines: "Never trust external data (API responses, user input, file content)" and "Always handle errors explicitly at every level and never silently swallow errors".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/runner.py` around lines 85 - 89, The loop that shlex.splits and subprocess.run(s) each entry of scenario.setup_commands (using shlex.split(cmd) and subprocess.run(parts, cwd=sandbox_dir, capture_output=True)) must be hardened: validate or whitelist each command string before executing, disallow shell metacharacters or unexpected binaries, and refuse unknown/unsafe commands; execute with subprocess.run(..., check=True, shell=False, stdin=subprocess.DEVNULL, timeout=...) so failures raise and stop processing, pass an explicit restricted env and cwd=sandbox_dir, and catch subprocess.CalledProcessError/TimeoutError to log the error and fail fast (raise or return a non-zero status) from the runner function instead of silently ignoring return codes.
🧹 Nitpick comments (1)
skills/skill-comply/scripts/classifier.py (1)
10-12: Import placement violates PEP 8 ordering.
The local import at line 12 is placed after module-level code (the `logger` assignment). Per PEP 8, all imports should be grouped at the top of the file before any other statements.
♻️ Suggested fix
```diff
 import logging
 import subprocess
 from pathlib import Path
+
+from scripts.parser import ComplianceSpec, ObservationEvent

 logger = logging.getLogger(__name__)

-from scripts.parser import ComplianceSpec, ObservationEvent
-
 PROMPTS_DIR = Path(__file__).parent.parent / "prompts"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/skill-comply/scripts/classifier.py` around lines 10 - 12, Move the local import from scripts.parser (ComplianceSpec, ObservationEvent) above any module-level statements so imports are grouped at the top of the file; specifically, place the "from scripts.parser import ComplianceSpec, ObservationEvent" line before the logger = logging.getLogger(__name__) assignment in classifier.py to comply with PEP 8 import ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/skill-comply/scripts/runner.py`:
- Around line 16-17: The sandbox directory is currently deterministic and shared
(SANDBOX_BASE and ScenarioRun.sandbox_dir), which risks cross-run clobbering;
change creation to generate a unique per-run sandbox (e.g., use uuid4 or
tempfile.mkdtemp) by adding a helper like make_sandbox_dir(safe_id) that returns
Path(tempfile.mkdtemp(prefix=f"{safe_id}-", dir=SANDBOX_BASE)), update wherever
Sandbox is constructed to call that helper (replace direct
Path("/tmp/skill-comply-sandbox") + safe_id usage), and stop deleting the entire
shared base — only remove the specific per-run directory
(ScenarioRun.sandbox_dir) at teardown to avoid wiping other runs.
- Around line 103-149: The loop that parses stream-json lines must validate and
reject malformed events instead of silently skipping or assuming dicts: after
json.loads(line) ensure the result is a dict (raise or log and continue with a
clear error if not), replace msg.get("type") usage with a guarded access only
after validation, and validate that "content" for assistant messages is a list
before iterating (same validation already done for user messages) to avoid
AttributeError; update the parsing around the variables msg/msg_type/content and
the pending/ObservationEvent creation so malformed scalars/arrays/nulls are
detected and handled with a clear failure or logged rejection rather than
causing downstream crashes.
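A minimal sketch of the dict-validation the prompt asks for (function name is illustrative):

```python
import json

def split_stream_messages(stdout: str):
    """Partition stream-json lines into well-formed dict messages and rejects.

    json.loads happily returns scalars, arrays, and None, none of which
    support .get(); checking the type up front avoids AttributeError deep
    inside event parsing.
    """
    valid, rejected = [], []
    for line in stdout.strip().splitlines():
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if isinstance(msg, dict):
            valid.append(msg)
        else:
            rejected.append(line)
    return valid, rejected

stream = '{"type": "assistant"}\n["array"]\nnull\nnot json\n{"type": "user"}'
valid, rejected = split_stream_messages(stream)
```

The rejected lines can then be logged for auditing rather than silently dropped.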
---
Duplicate comments:
In `@skills/skill-comply/scripts/runner.py`:
- Around line 85-89: The loop that shlex.splits and subprocess.run(s) each entry
of scenario.setup_commands (using shlex.split(cmd) and subprocess.run(parts,
cwd=sandbox_dir, capture_output=True)) must be hardened: validate or whitelist
each command string before executing, disallow shell metacharacters or
unexpected binaries, and refuse unknown/unsafe commands; execute with
subprocess.run(..., check=True, shell=False, stdin=subprocess.DEVNULL,
timeout=...) so failures raise and stop processing, pass an explicit restricted
env and cwd=sandbox_dir, and catch subprocess.CalledProcessError/TimeoutError to
log the error and fail fast (raise or return a non-zero status) from the runner
function instead of silently ignoring return codes.
---
Nitpick comments:
In `@skills/skill-comply/scripts/classifier.py`:
- Around line 10-12: Move the local import from scripts.parser (ComplianceSpec,
ObservationEvent) above any module-level statements so imports are grouped at
the top of the file; specifically, place the "from scripts.parser import
ComplianceSpec, ObservationEvent" line before the logger =
logging.getLogger(__name__) assignment in classifier.py to comply with PEP 8
import ordering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8986065a-ccdf-431e-b620-da7400638c8f
📒 Files selected for processing (6)
- AGENTS.md
- skills/skill-comply/scripts/classifier.py
- skills/skill-comply/scripts/grader.py
- skills/skill-comply/scripts/parser.py
- skills/skill-comply/scripts/run.py
- skills/skill-comply/scripts/runner.py
🚧 Files skipped from review as they are similar to previous changes (4)
- skills/skill-comply/scripts/run.py
- skills/skill-comply/scripts/grader.py
- AGENTS.md
- skills/skill-comply/scripts/parser.py
```python
SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
ALLOWED_MODELS = frozenset({"haiku", "sonnet", "opus"})
```
Use a per-run sandbox directory.
The path is deterministic (/tmp/skill-comply-sandbox/<sanitized id>), and Lines 81-83 delete it before every run. Rerunning the same scenario—or two distinct IDs that normalize to the same safe_id—can wipe another run's workspace and make ScenarioRun.sandbox_dir point at different contents than the run actually produced. A unique suffix (uuid4/mkdtemp) on the sanitized prefix avoids both collisions and cross-run clobbering.
Also applies to: 70-76, 81-83
🧰 Tools
🪛 Ruff (0.15.6)
[error] 16-16: Probable insecure usage of temporary file or directory: "/tmp/skill-comply-sandbox"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/runner.py` around lines 16 - 17, The sandbox
directory is currently deterministic and shared (SANDBOX_BASE and
ScenarioRun.sandbox_dir), which risks cross-run clobbering; change creation to
generate a unique per-run sandbox (e.g., use uuid4 or tempfile.mkdtemp) by
adding a helper like make_sandbox_dir(safe_id) that returns
Path(tempfile.mkdtemp(prefix=f"{safe_id}-", dir=SANDBOX_BASE)), update wherever
Sandbox is constructed to call that helper (replace direct
Path("/tmp/skill-comply-sandbox") + safe_id usage), and stop deleting the entire
shared base — only remove the specific per-run directory
(ScenarioRun.sandbox_dir) at teardown to avoid wiping other runs.
```python
for line in stdout.strip().splitlines():
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        continue

    msg_type = msg.get("type")

    if msg_type == "assistant":
        content = msg.get("message", {}).get("content", [])
        for block in content:
            if block.get("type") == "tool_use":
                tool_use_id = block.get("id", "")
                tool_input = block.get("input", {})
                input_str = (
                    json.dumps(tool_input)[:5000]
                    if isinstance(tool_input, dict)
                    else str(tool_input)[:5000]
                )
                pending[tool_use_id] = {
                    "tool": block.get("name", "unknown"),
                    "input": input_str,
                    "order": event_counter,
                }
                event_counter += 1

    elif msg_type == "user":
        content = msg.get("message", {}).get("content", [])
        if isinstance(content, list):
            for block in content:
                tool_use_id = block.get("tool_use_id", "")
                if tool_use_id in pending:
                    info = pending.pop(tool_use_id)
                    output_content = block.get("content", "")
                    if isinstance(output_content, list):
                        output_str = json.dumps(output_content)[:5000]
                    else:
                        output_str = str(output_content)[:5000]

                    events.append(ObservationEvent(
                        timestamp=f"T{info['order']:04d}",
                        event="tool_complete",
                        tool=info["tool"],
                        session=msg.get("session_id", "unknown"),
                        input=info["input"],
                        output=output_str,
                    ))
```
🧩 Analysis chain
🏁 Script executed:

```shell
cd skills/skill-comply && find . -name "runner.py" -type f
```

Repository: affaan-m/everything-claude-code
Length of output: 94

🏁 Script executed:

```shell
wc -l skills/skill-comply/scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 116

🏁 Script executed:

```shell
sed -n '90,160p' skills/skill-comply/scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 2799

🏁 Script executed:

```shell
cd skills/skill-comply && rg "_parse_stream_json" -B 3 -A 3 scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 497

🏁 Script executed:

```shell
cd skills/skill-comply && rg "parse_stream_json\|ObservationEvent" scripts/ -l
```

Repository: affaan-m/everything-claude-code
Length of output: 57

🏁 Script executed:

```shell
cd skills/skill-comply && head -50 scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 1453

🏁 Script executed:

```shell
cd skills/skill-comply && find . -name "*test*.py" -o -name "test_*" -type f | head -20
```

Repository: affaan-m/everything-claude-code
Length of output: 120

🏁 Script executed:

```shell
cd skills/skill-comply && rg "def test.*parse" . 2>/dev/null || echo "No tests found"
```

Repository: affaan-m/everything-claude-code
Length of output: 292

🏁 Script executed:

```shell
cd skills/skill-comply && grep -n "JSONDecodeError\|silent\|malformed" scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 115

🏁 Script executed:

```shell
cd skills/skill-comply && sed -n '1,30p' scripts/runner.py && echo "---" && sed -n '75,95p' scripts/runner.py
```

Repository: affaan-m/everything-claude-code
Length of output: 1425

🏁 Script executed:

```shell
cd skills/skill-comply && cat tests/test_parser.py | head -100
```

Repository: affaan-m/everything-claude-code
Length of output: 3212

🏁 Script executed:

```shell
cd skills/skill-comply && rg "stream-json\|JSONDecodeError" tests/ -B 2 -A 2
```

Repository: affaan-m/everything-claude-code
Length of output: 57

🏁 Script executed:

```shell
cd skills/skill-comply && rg "stream-json" . --type py -B 2 -A 5
```

Repository: affaan-m/everything-claude-code
Length of output: 1722

🏁 Script executed:

```shell
cd skills/skill-comply && sed -n '103,149p' scripts/runner.py | cat -n
```

Repository: affaan-m/everything-claude-code
Length of output: 2420
🏁 Script executed:
python3 << 'EOF'
import json
# Test what happens if JSON decodes to non-dict
test_cases = [
'["array"]',
'"string"',
'123',
'null',
'{"valid": "dict"}',
]
for case in test_cases:
try:
msg = json.loads(case)
print(f"Parsed {case:20} -> type {type(msg).__name__:8} -> msg.get('type') raises?", end=" ")
try:
result = msg.get("type")
print(f"No, got {result}")
except AttributeError as e:
print(f"Yes: {e}")
except json.JSONDecodeError:
print(f"Failed to parse {case}")
EOFRepository: affaan-m/everything-claude-code
Length of output: 617
🏁 Script executed:
cd skills/skill-comply && rg "observations\|ScenarioRun" scripts/runner.py -A 2 -B 2 | head -40Repository: affaan-m/everything-claude-code
Length of output: 57
Reject malformed stream-json events instead of silently dropping them.

Lines 103-107 silently skip JSON decode failures, and the parser assumes every event is dict-shaped. If `claude` emits a JSON scalar, array, or `null`, line 107's `msg.get("type")` crashes with `AttributeError: '<type>' object has no attribute 'get'`. The parser also iterates `content` without validating it is a list for `"assistant"` events (line 110), yet performs that check for `"user"` events (line 129); this inconsistent error handling masks malformed data and corrupts the trace used for compliance grading.
🧪 Suggested validation

```diff
-    for line in stdout.strip().splitlines():
+    for line_no, line in enumerate(stdout.splitlines(), start=1):
+        if not line.strip():
+            continue
         try:
-            msg = json.loads(line)
-        except json.JSONDecodeError:
-            continue
+            raw_msg = json.loads(line)
+        except json.JSONDecodeError as exc:
+            raise ValueError(
+                f"Invalid stream-json event at line {line_no}"
+            ) from exc
+
+        if not isinstance(raw_msg, dict):
+            raise ValueError(
+                f"Invalid stream-json event at line {line_no}: expected object"
+            )
+        msg = raw_msg

         msg_type = msg.get("type")

         if msg_type == "assistant":
-            content = msg.get("message", {}).get("content", [])
+            content = msg.get("message", {}).get("content", [])
+            if not isinstance(content, list):
+                raise ValueError(
+                    f"Invalid stream-json event at line {line_no}: content must be list"
+                )
```

Per coding guidelines: "Never trust external data (API responses, user input, file content)" and "Fail fast with clear error messages when validation fails".
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.
In `@skills/skill-comply/scripts/runner.py` around lines 103-149, the loop that
parses stream-json lines must validate and reject malformed events instead of
silently skipping or assuming dicts: after json.loads(line) ensure the result is
a dict (raise, or log and continue, with a clear error if not), replace the
msg.get("type") usage with a guarded access only after validation, and validate
that "content" for assistant messages is a list before iterating (the same
validation already done for user messages) to avoid AttributeError; update the
parsing around the variables msg/msg_type/content and the
pending/ObservationEvent creation so malformed scalars/arrays/nulls are detected
and handled with a clear failure or logged rejection rather than causing
downstream crashes.
```
Summary

Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs from skill files, runs scenarios at 3 prompt strictness levels, classifies tool calls via LLM, and produces self-contained Markdown reports.

- `claude -p` in sandboxed directories

What's included

- `skills/skill-comply/SKILL.md`
- `scripts/`
- `prompts/`
- `tests/fixtures/`

Changes from #662

All 10 CodeRabbit review comments addressed:

- `_extract_yaml` to shared `utils.py` module
- `print()` with `logging` module in `run.py`
- `exists()` → `is_file()` to reject directories
- `any` → `Any` (uppercase)
- `result.stdout` in scenario generator
- `NameError` in `finally` block (spec_generator)
- `json.JSONDecodeError` handling in trace parser
- `scoring` section validation in spec parser
- `MAX_SKILL_FILE_SIZE` constant

Testing

- `uv run pytest -v` — 25 tests pass
- `npx markdownlint "skills/skill-comply/**/*.md"` — 0 errors
- `node scripts/ci/catalog.js --text` — counts match (28 agents, 117 skills, 59 commands)
- `node scripts/ci/validate-skills.js` — 117 skills validated

Checklist

🤖 Generated with Claude Code
Summary by cubic

Adds `skill-comply`: an automated way to measure if skills, rules, and agent definitions are actually followed. It generates behavioral specs, runs 3 pressure scenarios, classifies tool calls, checks ordering, and outputs a self-contained Markdown report.

New Features

- `.md` files into observable steps
- `claude -p` with stream-json trace parsing

Bug Fixes

- `--output`, added return-code checks for `claude`, and stricter sandbox path validation
- `is_file()`), better errors for missing trace fields, and spec file existence guard

Written for commit d72e3d1. Summary will update on new commits.
Summary by CodeRabbit
New Features
Documentation