Skip to content

feat(skills): add skill-comply — automated behavioral compliance measurement#724

Open
shimo4228 wants to merge 2 commits intoaffaan-m:mainfrom
shimo4228:feat/skills/skill-comply-v2
Open

feat(skills): add skill-comply — automated behavioral compliance measurement#724
shimo4228 wants to merge 2 commits intoaffaan-m:mainfrom
shimo4228:feat/skills/skill-comply-v2

Conversation

@shimo4228
Copy link
Contributor

@shimo4228 shimo4228 commented Mar 20, 2026

Replaces #662 (closed due to CI issues — now rebased on latest main with all CodeRabbit comments addressed).

Summary

Automated compliance measurement for skills, rules, and agent definitions. Generates behavioral specs from skill files, runs scenarios at 3 prompt strictness levels, classifies tool calls via LLM, and produces self-contained Markdown reports.

  • spec generation — extracts observable behavioral sequences from skill/rule files
  • scenario generation — creates 3 pressure scenarios (explicit → implicit → adversarial)
  • execution — runs scenarios via claude -p in sandboxed directories
  • LLM classification — classifies tool call traces against spec steps
  • grading — temporal ordering checks + compliance rate calculation
  • reporting — self-contained Markdown with per-step evidence and hook promotion recommendations

What's included

Path Description
skills/skill-comply/SKILL.md Skill definition
scripts/ 9 Python modules (run, spec_generator, scenario_generator, runner, classifier, grader, parser, report, utils)
prompts/ 3 LLM prompt templates
tests/ 25 pytest tests (grader + parser)
fixtures/ Test data (YAML spec, JSONL traces)

Changes from #662

All 10 CodeRabbit review comments addressed:

  • Extracted duplicate _extract_yaml to shared utils.py module
  • Replaced print() with logging module in run.py
  • Fixed file validation: exists()is_file() to reject directories
  • Fixed type hint: anyAny (uppercase)
  • Added guard against empty result.stdout in scenario generator
  • Fixed potential NameError in finally block (spec_generator)
  • Optimized O(n²) → O(n) in grader with precomputed set
  • Added json.JSONDecodeError handling in trace parser
  • Added scoring section validation in spec parser
  • Removed unused MAX_SKILL_FILE_SIZE constant

Testing

  • uv run pytest -v — 25 tests pass
  • npx markdownlint "skills/skill-comply/**/*.md" — 0 errors
  • node scripts/ci/catalog.js --text — counts match (28 agents, 117 skills, 59 commands)
  • node scripts/ci/validate-skills.js — 117 skills validated

Checklist

🤖 Generated with Claude Code


Summary by cubic

Adds skill-comply: an automated way to measure if skills, rules, and agent definitions are actually followed. It generates behavioral specs, runs 3 pressure scenarios, classifies tool calls, checks ordering, and outputs a self-contained Markdown report.

  • New Features

    • Spec generation from skill/rule .md files into observable steps.
    • Scenario generation at 3 levels: supportive, neutral, competing.
    • Sandboxed execution via claude -p with stream-json trace parsing.
    • LLM classification + deterministic ordering checks and compliance rate.
    • Markdown report with spec, prompts, timelines, and hook promotion suggestions.
    • Adds prompts, fixtures, and 25 tests for parser/grader.
  • Bug Fixes

    • Docs: fixed stale skill counts in README/AGENTS (now 28 agents, 117 skills, 59 commands).
    • CLI/Runner: switched to logging, created parent dirs for --output, added return-code checks for claude, and stricter sandbox path validation.
    • Parser: stricter file checks (is_file()), better errors for missing trace fields, and spec file existence guard.
    • Classifier/Grader: warnings on malformed/non-dict JSON, ignore invalid/negative event indices for robust grading.

Written for commit d72e3d1. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Added "skill‑comply": an automated compliance tool that generates three strictness-level scenarios, evaluates agent behavior against expected workflows, and produces human-readable compliance reports with scores and hook-promotion recommendations.
  • Documentation

    • Updated catalog counts: total skills increased from 116 to 117.

…urement

Automated compliance measurement for skills, rules, and agent definitions.
Generates behavioral specs, runs scenarios at 3 strictness levels,
classifies tool calls via LLM, and produces self-contained reports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ecc-tools
Copy link
Contributor

ecc-tools bot commented Mar 20, 2026

Analyzing 5000 commits...

@ecc-tools
Copy link
Contributor

ecc-tools bot commented Mar 20, 2026

Analysis Failed

Not Found - https://docs.github.com/rest/git/refs#get-a-reference

Troubleshooting
Cause Resolution
Large repository Analysis may timeout on repos with extensive history
API rate limits Wait 15 minutes before retrying
Network issues Queue timeout is 15 minutes; retry may succeed
Permissions Verify app has Contents: Read access

Retry: /ecc-tools analyze


Report Issue | ECC Tools

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 20, 2026

📝 Walkthrough

Walkthrough

A new compliance measurement skill (skill-comply) is added and documentation counts updated. The skill auto-generates expected behavioral sequences from Markdown, creates test scenarios at three strictness levels, executes scenarios via Claude CLI capturing tool-call traces, classifies and grades compliance, and emits detailed Markdown reports.

Changes

Cohort / File(s) Summary
Documentation Metadata
AGENTS.md, README.md
Updated reported total skills from 116→117 (and related internal count from 115→117 in AGENTS.md); other documented counts unchanged.
skill-comply Setup & Configuration
skills/skill-comply/.gitignore, skills/skill-comply/SKILL.md, skills/skill-comply/pyproject.toml
Added SKILL.md, gitignore entries for Python artifacts, and project metadata (requires-python >=3.11, dependency pyyaml>=6.0, pytest dev group).
skill-comply Fixtures
skills/skill-comply/fixtures/*
Added compliant_trace.jsonl, noncompliant_trace.jsonl, and tdd_spec.yaml fixtures representing traces and a five-step TDD workflow spec with scoring threshold.
skill-comply Prompts
skills/skill-comply/prompts/*
Added classifier.md, scenario_generator.md, and spec_generator.md prompt templates for LLM-driven classification, scenario generation (3 levels), and spec extraction (YAML).
Parsing & Models
skills/skill-comply/scripts/parser.py, skills/skill-comply/scripts/utils.py
Added immutable dataclasses (ObservationEvent, Detector, Step, ComplianceSpec) and parsing utilities for JSONL traces and YAML specs; helper extract_yaml.
Spec & Scenario Generators
skills/skill-comply/scripts/spec_generator.py, skills/skill-comply/scripts/scenario_generator.py
Added generators that call Claude CLI to produce ComplianceSpec and Scenario instances; include retry and YAML extraction/validation.
Execution & Trace Extraction
skills/skill-comply/scripts/runner.py
Added scenario execution that creates sandboxes, runs setup commands, invokes Claude with --output-format stream-json, parses stream-json to ObservationEvents, and returns ScenarioRun.
LLM Classification & Grading
skills/skill-comply/scripts/classifier.py, skills/skill-comply/scripts/grader.py
Added classify_events (LLM-backed JSON classification) and grade (temporal-aware grading producing StepResult and ComplianceResult dataclasses).
Reporting & CLI
skills/skill-comply/scripts/report.py, skills/skill-comply/scripts/run.py
Added Markdown report generator and run.py orchestrator CLI (spec generation → scenario generation → optional execution → grading → report output).
Tests
skills/skill-comply/tests/test_parser.py, skills/skill-comply/tests/test_grader.py
Added tests covering parsing of fixtures and grading logic with mocked classification for compliant, noncompliant, and empty traces.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Run as run.py
    participant SpecGen as spec_generator
    participant ScenarioGen as scenario_generator
    participant Runner as runner
    participant Classifier as classifier
    participant Grader as grader
    participant Report as report

    User->>Run: execute(skill_path, --model, --output)
    Run->>SpecGen: generate_spec(skill_path)
    SpecGen->>SpecGen: read skill + invoke Claude CLI
    SpecGen-->>Run: ComplianceSpec YAML

    Run->>ScenarioGen: generate_scenarios(spec_yaml)
    ScenarioGen->>ScenarioGen: invoke Claude CLI (scenario_generator)
    ScenarioGen-->>Run: [Scenario...]

    loop for each Scenario
        Run->>Runner: run_scenario(scenario, model)
        Runner->>Runner: create sandbox, git init, run setup commands
        Runner->>Runner: invoke Claude CLI (stream-json)
        Runner->>Runner: parse stream -> ObservationEvents
        Runner-->>Run: ScenarioRun(observations)

        Run->>Classifier: classify_events(spec, observations)
        Classifier->>Classifier: build prompt + invoke Claude CLI
        Classifier-->>Run: {step_id: [event_idx...]}

        Run->>Grader: grade(spec, observations, classifier_model)
        Grader->>Grader: enforce temporal order, detect steps, compute compliance
        Grader-->>Run: ComplianceResult
    end

    Run->>Report: generate_report(skill_path, spec, results)
    Report->>Report: aggregate, build tables, timelines, recommendations
    Report-->>Run: markdown_report
    Run-->>User: saved report path + compliance summary
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • affaan-m

Poem

🐰 I stitched a spec from markdown light,
I birthed three tests — supportive, neutral, fight,
Claude danced through sandboxes, traced each call,
I graded steps and wrote the final hall:
One hundred seventeen skills now measured bright. 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 34.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main addition: a new 'skill-comply' feature for automated behavioral compliance measurement of skills and related entities.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR adds skill-comply, a new automated compliance measurement system for skills, rules, and agent definitions. The tool generates behavioral specs from .md skill/rule files via LLM, produces three pressure-level scenarios (supportive → neutral → adversarial), executes them through claude -p in sandboxed directories, classifies tool call traces against spec steps using a second LLM call, checks temporal ordering deterministically, and emits a self-contained Markdown report with per-step evidence and hook promotion recommendations. The implementation is well-structured across 9 Python modules with 25 pytest tests and addresses all 10 CodeRabbit comments from the prior PR #662.

Key changes:

  • New skill definition: skills/skill-comply/SKILL.md registers the skill with the catalog (count updated to 117)
  • 9 Python modules: run.py (CLI), spec_generator.py, scenario_generator.py, runner.py, classifier.py, grader.py, parser.py, report.py, utils.py
  • 3 LLM prompt templates for spec generation, scenario generation, and event classification
  • 25 pytest tests covering parser and grader with mocked LLM classification
  • Two issues found: Setup command failures in _setup_sandbox are silently ignored (could produce misleading 0% compliance for broken environments), and event.output embedded newlines can corrupt the classifier prompt's per-tool-call record format

Confidence Score: 3/5

  • Safe to merge with minor fixes — no data loss or security risk, but two logic bugs can silently corrupt compliance scores in realistic use.
  • The core architecture is clean and the majority of previous CodeRabbit comments are properly addressed. However, two bugs remain that directly affect output correctness: (1) silent swallowing of failed setup commands means the agent can run in a broken sandbox and score 0% for the wrong reason, and (2) unescaped newlines in tool output strings can corrupt the classifier LLM prompt, again silently producing wrong classification results. Neither is a security issue, but both can produce misleading reports which undermine the tool's primary purpose. These are straightforward one-line fixes.
  • skills/skill-comply/scripts/runner.py (setup command error handling) and skills/skill-comply/scripts/classifier.py (output newline escaping) need attention before merge.

Important Files Changed

Filename Overview
skills/skill-comply/scripts/runner.py Runs scenarios via claude -p in sandboxed directories and parses stream-json output into observation events. Two issues: setup command failures are silently ignored (could produce false-negative compliance scores), and sandbox directories are never cleaned up after runs with a potential ID collision across different skills.
skills/skill-comply/scripts/classifier.py LLM-based classifier that maps tool call traces to spec step IDs. The tool_calls prompt string is built with newlines as record separators, but event.output may contain literal embedded newlines (e.g., from Bash/pytest output), which breaks the per-entry format and can cause incorrect index assignments.
skills/skill-comply/scripts/grader.py Grades traces against compliance specs using LLM classification + deterministic temporal ordering checks. Logic is sound; O(n²)→O(n) optimisation for required_ids is correctly implemented with a precomputed set.
skills/skill-comply/scripts/parser.py Parses JSONL observation traces and YAML compliance specs into typed dataclasses. Both JSONDecodeError and KeyError are now caught with line-number context; scoring section validation added. Clean and well-tested.
skills/skill-comply/scripts/run.py CLI entry point orchestrating spec generation, scenario generation, execution, and report writing. Fully migrated to logging; is_file() validation in place. No major issues.
skills/skill-comply/scripts/spec_generator.py Generates compliance specs from skill files via LLM with retry logic on YAML parse errors. NameError in finally block fixed; extract_yaml extracted to shared utils.py.
skills/skill-comply/scripts/scenario_generator.py Generates three pressure-level scenarios from skill + condensed spec YAML. Empty-output guard added; returncode checked. Clean implementation.
skills/skill-comply/scripts/report.py Generates self-contained Markdown compliance reports with per-scenario tables, tool call timelines, and hook promotion recommendations. Pipe characters in cell content are escaped correctly.
skills/skill-comply/tests/test_grader.py 25 pytest tests covering compliant, noncompliant, and edge-case grading scenarios using mocked LLM classification. Good coverage of temporal ordering constraints.
skills/skill-comply/scripts/utils.py Shared extract_yaml utility for stripping markdown fences from LLM output. Correctly extracted from duplicate usage in spec_generator and classifier.

Sequence Diagram

sequenceDiagram
    participant User
    participant run.py
    participant spec_generator.py
    participant scenario_generator.py
    participant runner.py
    participant classifier.py
    participant grader.py
    participant report.py

    User->>run.py: skill-comply <path>
    run.py->>spec_generator.py: generate_spec(skill_path)
    spec_generator.py->>spec_generator.py: claude -p spec_generator.md
    spec_generator.py-->>run.py: ComplianceSpec

    run.py->>scenario_generator.py: generate_scenarios(skill_path, spec_yaml)
    scenario_generator.py->>scenario_generator.py: claude -p scenario_generator.md
    scenario_generator.py-->>run.py: list[Scenario] (3 levels)

    loop For each Scenario
        run.py->>runner.py: run_scenario(scenario, model)
        runner.py->>runner.py: _setup_sandbox (git init + setup_commands)
        runner.py->>runner.py: claude -p stream-json
        runner.py->>runner.py: _parse_stream_json → ObservationEvents
        runner.py-->>run.py: ScenarioRun

        run.py->>grader.py: grade(spec, observations)
        grader.py->>classifier.py: classify_events(spec, trace)
        classifier.py->>classifier.py: claude -p classifier.md
        classifier.py-->>grader.py: dict[step_id, list[int]]
        grader.py->>grader.py: _check_temporal_order per step
        grader.py-->>run.py: ComplianceResult
    end

    run.py->>report.py: generate_report(skill_path, spec, results)
    report.py-->>run.py: Markdown string
    run.py-->>User: results/<skill>.md
Loading

Last reviewed commit: "fix(skill-comply): a..."

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

23 issues found across 23 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="skills/skill-comply/scripts/run.py">

<violation number="1" location="skills/skill-comply/scripts/run.py:108">
P2: Writing to a user-provided `--output` path can fail because parent directories are not created before `write_text()`.</violation>

<violation number="2" location="skills/skill-comply/scripts/run.py:112">
P2: Summary calculation divides by zero when no scenarios are generated, causing a runtime crash instead of a graceful error.</violation>
</file>

<file name="AGENTS.md">

<violation number="1" location="AGENTS.md:3">
P2: The changed summary line introduces an inconsistent skill total within AGENTS.md (117 vs 115), leaving documentation internally contradictory.</violation>
</file>

<file name="README.md">

<violation number="1" location="README.md:1074">
P2: README contains conflicting hardcoded skill counts (117 vs 102), indicating documentation drift and inconsistent user-facing metrics.</violation>
</file>

<file name="skills/skill-comply/scripts/classifier.py">

<violation number="1" location="skills/skill-comply/scripts/classifier.py:24">
P1: Classification parse failures are silently treated as empty matches, causing classifier errors to be scored as real non-compliance.</violation>

<violation number="2" location="skills/skill-comply/scripts/classifier.py:32">
P2: Prompt construction can crash on non-string trace payloads because `event.input`/`event.output` are sliced without coercion.</violation>

<violation number="3" location="skills/skill-comply/scripts/classifier.py:73">
P1: `_parse_classification` can throw uncaught `AttributeError` when LLM returns valid JSON that is not an object.</violation>
</file>

<file name="skills/skill-comply/prompts/scenario_generator.md">

<violation number="1" location="skills/skill-comply/prompts/scenario_generator.md:19">
P2: Ambiguous `{id}` token in setup command example can be executed literally; `setup_commands` are run verbatim and `cwd` is already a per-scenario sandbox, so this risks shared `/tmp/skill-comply-sandbox/{id}` directories and cross-scenario leakage.</violation>

<violation number="2" location="skills/skill-comply/prompts/scenario_generator.md:52">
P1: Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated `setup_commands` that are later executed.</violation>
</file>

<file name="skills/skill-comply/scripts/parser.py">

<violation number="1" location="skills/skill-comply/scripts/parser.py:49">
P2: `parse_trace` only checks `exists()`, so directory paths pass validation and fail later with `IsADirectoryError` instead of an explicit file-path validation error.</violation>

<violation number="2" location="skills/skill-comply/scripts/parser.py:87">
P2: Detector step references are not validated against declared step IDs, so invalid after_step/before_step values can silently alter grading behavior or force failures.</violation>
</file>

<file name="skills/skill-comply/prompts/classifier.md">

<violation number="1" location="skills/skill-comply/prompts/classifier.md:9">
P1: Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.</violation>
</file>

<file name="skills/skill-comply/scripts/scenario_generator.py">

<violation number="1" location="skills/skill-comply/scripts/scenario_generator.py:44">
P2: Unbounded prompt content is passed as a single CLI argument, which can exceed OS argv limits and cause subprocess invocation failure on larger skills/specs.</violation>

<violation number="2" location="skills/skill-comply/scripts/scenario_generator.py:60">
P2: LLM-generated YAML is parsed and immediately indexed without validating the top-level type or required keys. If the LLM returns malformed YAML or missing fields, this raises runtime KeyError/TypeError and crashes scenario generation. Add explicit schema/type checks (and optionally retries) before indexing.</violation>
</file>

<file name="skills/skill-comply/scripts/report.py">

<violation number="1" location="skills/skill-comply/scripts/report.py:114">
P2: Markdown table cells use unescaped dynamic text, so `|`/newlines in scenario names, step metadata, or failure reasons can corrupt report tables.</violation>
</file>

<file name="skills/skill-comply/scripts/grader.py">

<violation number="1" location="skills/skill-comply/scripts/grader.py:75">
P2: Classifier indices are only checked against the upper bound; negative indices from LLM output will silently map to the end of the list, producing incorrect evidence. Index order is also unnormalized, so non-monotonic output changes which candidate is selected first. Validate bounds (>=0) and normalize ordering before mapping to events.</violation>
</file>

<file name="skills/skill-comply/scripts/utils.py">

<violation number="1" location="skills/skill-comply/scripts/utils.py:9">
P1: YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.</violation>
</file>

<file name="skills/skill-comply/tests/test_parser.py">

<violation number="1" location="skills/skill-comply/tests/test_parser.py:28">
P2: Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.</violation>

<violation number="2" location="skills/skill-comply/tests/test_parser.py:51">
P2: Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.</violation>
</file>

<file name="skills/skill-comply/scripts/runner.py">

<violation number="1" location="skills/skill-comply/scripts/runner.py:53">
P2: `subprocess.run` is not checked for success, so non‑zero Claude exits (auth/model errors, CLI failures) still get parsed and graded as if they were valid runs, risking misclassification of runtime failures as compliance failures.</violation>

<violation number="2" location="skills/skill-comply/scripts/runner.py:67">
P2: Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and `_setup_sandbox` then deletes that directory, causing destructive overwrites of scenario artifacts.</violation>

<violation number="3" location="skills/skill-comply/scripts/runner.py:79">
P2: Setup commands ignore failures; if `git init` or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.</violation>

<violation number="4" location="skills/skill-comply/scripts/runner.py:83">
P1: Commands from LLM-generated scenarios are executed directly without validation or sandboxing, enabling arbitrary command execution if scenario content is untrusted.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
- Prompts should be realistic — something a developer would actually ask

Skill content:
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated setup_commands that are later executed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/prompts/scenario_generator.md, line 52:

<comment>Prompt template embeds untrusted skill/spec text without guardrails, enabling prompt-injection influence over generated `setup_commands` that are later executed.</comment>

<file context>
@@ -0,0 +1,62 @@
+- setup_commands should create a minimal sandbox (dirs, pyproject.toml, etc.)
+- Prompts should be realistic — something a developer would actually ask
+
+Skill content:
+
+---
</file context>
Fix with Cubic

{steps_description}

Tool calls (numbered):
{tool_calls}
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/prompts/classifier.md, line 9:

<comment>Raw untrusted tool-call content is interpolated directly into the classifier prompt without explicit inert-data handling, creating a prompt-injection risk that can skew compliance classification.</comment>

<file context>
@@ -0,0 +1,24 @@
+{steps_description}
+
+Tool calls (numbered):
+{tool_calls}
+
+Respond with ONLY a JSON object mapping step_id to a list of matching tool call numbers.
</file context>
Fix with Cubic

def extract_yaml(text: str) -> str:
"""Extract YAML from LLM output, stripping markdown fences if present."""
lines = text.strip().splitlines()
if lines and lines[0].startswith("```"):
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/utils.py, line 9:

<comment>YAML extraction is brittle because markdown fences are removed only when they are the first/last lines of the whole response, leaving prose/fences in common LLM outputs and breaking YAML parsing.</comment>

<file context>
@@ -0,0 +1,13 @@
+def extract_yaml(text: str) -> str:
+    """Extract YAML from LLM output, stripping markdown fences if present."""
+    lines = text.strip().splitlines()
+    if lines and lines[0].startswith("```"):
+        lines = lines[1:]
+    if lines and lines[-1].startswith("```"):
</file context>
Fix with Cubic


def test_nonexistent_file_raises(self) -> None:
with pytest.raises(FileNotFoundError):
parse_trace(Path("/nonexistent/trace.jsonl"))
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/tests/test_parser.py, line 51:

<comment>Missing-file test relies on a hardcoded absolute path, making it environment-dependent instead of using a test-controlled nonexistent path.</comment>

<file context>
@@ -0,0 +1,90 @@
+
+    def test_nonexistent_file_raises(self) -> None:
+        with pytest.raises(FileNotFoundError):
+            parse_trace(Path("/nonexistent/trace.jsonl"))
+
+
</file context>
Fix with Cubic

def test_events_sorted_by_timestamp(self) -> None:
events = parse_trace(FIXTURES / "compliant_trace.jsonl")
timestamps = [e.timestamp for e in events]
assert timestamps == sorted(timestamps)
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/tests/test_parser.py, line 28:

<comment>Sorting test is vacuous because it uses an already sorted fixture, so it may not catch regressions where parser stops sorting.</comment>

<file context>
@@ -0,0 +1,90 @@
+    def test_events_sorted_by_timestamp(self) -> None:
+        events = parse_trace(FIXTURES / "compliant_trace.jsonl")
+        timestamps = [e.timestamp for e in events]
+        assert timestamps == sorted(timestamps)
+
+    def test_event_fields(self) -> None:
</file context>
Fix with Cubic


def _safe_sandbox_dir(scenario_id: str) -> Path:
"""Sanitize scenario ID and ensure path stays within sandbox base."""
safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and _setup_sandbox then deletes that directory, causing destructive overwrites of scenario artifacts.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/runner.py, line 67:

<comment>Lossy scenario ID sanitization can map distinct scenarios to the same sandbox directory, and `_setup_sandbox` then deletes that directory, causing destructive overwrites of scenario artifacts.</comment>

<file context>
@@ -0,0 +1,155 @@
+
+def _safe_sandbox_dir(scenario_id: str) -> Path:
+    """Sanitize scenario ID and ensure path stays within sandbox base."""
+    safe_id = re.sub(r"[^a-zA-Z0-9\-_]", "_", scenario_id)
+    path = SANDBOX_BASE / safe_id
+    path.resolve().relative_to(SANDBOX_BASE.resolve())
</file context>
Fix with Cubic

@@ -0,0 +1,155 @@
"""Run scenarios via claude -p and parse tool calls from stream-json output."""
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Setup commands ignore failures; if git init or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At skills/skill-comply/scripts/runner.py, line 79:

<comment>Setup commands ignore failures; if `git init` or a setup command fails, the scenario still runs in a partially prepared sandbox, masking setup errors and skewing results.</comment>

<file context>
@@ -0,0 +1,155 @@
+        shutil.rmtree(sandbox_dir)
+    sandbox_dir.mkdir(parents=True)
+
+    subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
+
+    for cmd in scenario.setup_commands:
</file context>
Fix with Cubic

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (10)
skills/skill-comply/prompts/classifier.md (1)

19-20: Generalize filename rules to avoid task-specific bias.

Using calculator.py-specific examples may overfit classification. Prefer generic rules (e.g., test-path patterns vs non-test paths) so behavior stays stable across domains.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/prompts/classifier.md` around lines 19 - 20, The current
rule examples use specific filenames ("test_calculator.py", "calculator.py")
which risks task-specific bias; update the classifier prompt in classifier.md to
use generic filename/path patterns instead (e.g., test path patterns like
"tests/", "test_*.py", "*_test.py" for test files versus non-test paths or
implementation file patterns for implementation files) so the guidance applies
broadly across domains; find the lines that mention "test_calculator.py" and
"calculator.py" and replace them with these generic patterns and a short
explanatory phrase (e.g., "files under tests/ or prefixed with test_ are tests")
to keep behavior stable.
skills/skill-comply/.gitignore (1)

7-7: Consider committing uv.lock for reproducible environments.

Ignoring uv.lock can make test and dependency resolution less deterministic across machines/CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/.gitignore` at line 7, Remove uv.lock from the .gitignore
so the lockfile is committed for reproducible installs; update the .gitignore
entry that currently lists "uv.lock" to delete that line, add the generated
uv.lock to the repository, and commit the lockfile (regenerate it with your
package manager if needed) so dependency resolution is deterministic across
machines/CI.
skills/skill-comply/prompts/scenario_generator.md (1)

19-19: Avoid hardcoded absolute sandbox paths in generated scenarios.

Embedding /tmp/skill-comply-sandbox/... in the prompt template can make scenario setup brittle across environments. Prefer a placeholder tied to runner-provided sandbox root.

As per coding guidelines: "Do not use hardcoded values; use constants or configuration instead".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/prompts/scenario_generator.md` at line 19, Replace the
hardcoded "/tmp/skill-comply-sandbox/{id}/..." path in the scenario template
string with a runner-provided sandbox root placeholder (e.g., "{sandbox_root}"
or "${SANDBOX_ROOT}") so the mkdir command becomes "mkdir -p
{sandbox_root}/{id}/src {sandbox_root}/{id}/tests"; update the
scenario_generator.md template where that exact string appears and ensure the
consuming runner injects the sandbox root constant/config value at runtime
instead of using the absolute /tmp path.
skills/skill-comply/scripts/spec_generator.py (1)

50-53: Add validation for empty Claude output before parsing.

Unlike scenario_generator.py (Line 53-54), this function doesn't validate that result.stdout is non-empty before extraction. If Claude returns empty output with exit code 0, extract_yaml returns an empty string, leading to confusing parse errors during retries.

Suggested fix
     if result.returncode != 0:
         raise RuntimeError(f"claude -p failed: {result.stderr}")

+    if not result.stdout.strip():
+        raise RuntimeError("claude -p returned empty output")
+
     raw_yaml = extract_yaml(result.stdout)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/spec_generator.py` around lines 50 - 53, The
Claude call in spec_generator.py checks returncode but doesn't validate
result.stdout before calling extract_yaml; add a guard after the returncode
check to verify result.stdout is non-empty (e.g., if not result.stdout.strip():)
and raise a RuntimeError including the fact that Claude returned empty output
and any available result.stderr; ensure this uses the same pattern as
scenario_generator.py (checking result.stdout before calling extract_yaml) so
extract_yaml is only invoked with non-empty input.
skills/skill-comply/scripts/run.py (1)

96-100: Replace print() with logger.info() for consistency.

Lines 96 and 100 use print() while the rest of the module uses logger.info(). This creates inconsistent output behavior (e.g., if logging is redirected).

As per coding guidelines: "Avoid using print() statements in Python code; use the logging module instead."

Suggested fix
     for scenario in scenarios:
-        print(f"       Running {scenario.level_name}...", end="", flush=True)
+        logger.info("       Running %s...", scenario.level_name)
         run = run_scenario(scenario, model=args.model)
         result = grade(spec, list(run.observations))
         graded_results.append((scenario.level_name, result, list(run.observations)))
-        print(f" {result.compliance_rate:.0%}")
+        logger.info("       %s: %.0f%%", scenario.level_name, result.compliance_rate * 100)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/run.py` around lines 96 - 100, Replace the two
print() calls around running scenarios with logger.info() so output follows the
module's logging configuration: use logger.info to log the "Running
{scenario.level_name}..." message (preserve the same text and no newline
behavior by combining messages if needed) and log the result line with the
formatted compliance_rate; update the calls near run_scenario, grade, and
graded_results handling to use logger.info instead of print so logs are
consistent with the module logger.
skills/skill-comply/scripts/classifier.py (1)

70-78: Consider logging when classification parsing fails.

Returning {} on JSONDecodeError/ValueError silently masks LLM output issues. This could lead to false compliance results (all steps appear as "not detected") without any indication of the underlying problem.

Suggested improvement
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _parse_classification(text: str) -> dict[str, list[int]]:
     ...
     try:
         parsed = json.loads(cleaned)
         # Validate structure
         return {
             k: [int(i) for i in v]
             for k, v in parsed.items()
             if isinstance(v, list)
         }
     except (json.JSONDecodeError, ValueError):
+        logger.warning("Failed to parse classification output: %s", cleaned[:200])
         return {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/classifier.py` around lines 70 - 78, The except
block that currently catches json.JSONDecodeError/ValueError after
json.loads(cleaned) silently returns {} and masks LLM output issues; update that
except block (where parsed and cleaned are used) to log the exception details
and the raw/cleaned LLM output before returning {} so failures are visible —
include the exception message (and ideally stack/trace) plus the cleaned string
in the log to aid debugging.
skills/skill-comply/scripts/runner.py (1)

16-16: Use tempfile module instead of hardcoded /tmp path.

Hardcoding /tmp is flagged by static analysis (S108) and may not work correctly on all systems or in containerized environments.

Suggested fix
+import tempfile
+
-SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
+SANDBOX_BASE = Path(tempfile.gettempdir()) / "skill-comply-sandbox"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/runner.py` at line 16, Replace the hardcoded
SANDBOX_BASE = Path("/tmp/skill-comply-sandbox") with a tempfile-based solution:
import tempfile and either set SANDBOX_BASE = Path(tempfile.gettempdir()) /
"skill-comply-sandbox" for a platform-safe shared temp location, or (preferably)
create a unique per-run directory with tempfile.mkdtemp() (e.g., SANDBOX_BASE =
Path(tempfile.mkdtemp(prefix="skill-comply-"))) and ensure any created temp
directory is cleaned up appropriately; update runner.py to import tempfile and
adjust any code that assumes a persistent /tmp path.
skills/skill-comply/scripts/scenario_generator.py (1)

56-68: Add error handling for YAML parsing failures.

If the LLM returns malformed YAML or missing scenarios key, yaml.safe_load or parsed["scenarios"] will raise exceptions with unclear error messages. Unlike spec_generator.py which has retry logic, this function lacks any YAML error handling.

Suggested fix
     raw_yaml = extract_yaml(result.stdout)
-    parsed = yaml.safe_load(raw_yaml)
-
-    scenarios: list[Scenario] = []
-    for s in parsed["scenarios"]:
+    try:
+        parsed = yaml.safe_load(raw_yaml)
+        if not parsed or "scenarios" not in parsed:
+            raise RuntimeError("LLM output missing 'scenarios' key")
+    except yaml.YAMLError as e:
+        raise RuntimeError(f"Failed to parse scenario YAML: {e}") from e
+
+    scenarios: list[Scenario] = []
+    for s in parsed["scenarios"]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/scenario_generator.py` around lines 56 - 68, Wrap
the YAML extraction and parsing in a try/except that catches yaml.YAMLError and
TypeError/KeyError around yaml.safe_load and parsed["scenarios"]; validate that
parsed is a dict and contains the "scenarios" key before iterating, and on
failure either raise a clear exception that includes the raw_yaml (or a
truncated preview) and the original error or trigger the same retry path used by
spec_generator.py; update the scenario construction block (where extract_yaml,
yaml.safe_load, parsed["scenarios"], and Scenario(...) are used) to log/raise a
descriptive error that aids debugging when the LLM returns malformed or missing
YAML.
skills/skill-comply/scripts/grader.py (1)

88-92: Clarify: Only the first valid event is used as evidence per step.

The loop breaks after finding the first event that passes temporal checks (line 92). This means if multiple events could satisfy a step, only the first is recorded as evidence. If this is intentional (e.g., to enforce single-detection semantics), consider adding a brief comment. If multiple events should contribute, remove the break.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/grader.py` around lines 88 - 92, The loop over
candidates currently stops at the first event that passes temporal checks (the
break after matched.append(event)), which enforces single-evidence-per-step;
decide and apply one of two fixes: if single-detection is intentional, add a
short clarifying comment above the loop referencing the variables/functions (for
example: "Only the first valid event is used as evidence for a step — see
_check_temporal_order and matched") to make the behavior explicit; otherwise
remove the break so the loop continues to evaluate all candidates and append
every event for which _check_temporal_order(step, event, resolved, classified)
returns None.
skills/skill-comply/scripts/parser.py (1)

79-101: Consider wrapping key accesses with clearer error messages.

While the scoring section has explicit validation (lines 92-93), other required keys like id, name, steps, etc. will raise generic KeyError exceptions if missing. For a developer-facing tool, the current behavior is acceptable, but clearer error messages would improve debuggability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/parser.py` around lines 79 - 101, The parser
currently accesses required keys like "id", "name", "source_rule", "version",
and "steps" directly which yields generic KeyError if missing; add explicit
validation before building Steps and returning ComplianceSpec (e.g., check raw
contains "steps", "id", "name", "source_rule", "version") and raise KeyError
with clear messages like "Missing 'id' in compliance spec" or "Missing 'steps'
array in compliance spec"; also validate each step has required fields before
creating Step/Detector to provide targeted errors; update the block that builds
steps and the return of ComplianceSpec to use these explicit checks so errors
are informative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Line 3: Update the inconsistent skill count in AGENTS.md: the header line
reads "117 skills" but the Project Structure block still shows "115"; open the
Project Structure section and change the "115" figure to "117" so both in-file
references (the top header "117 skills" and the Project Structure count) match
exactly.

In `@skills/skill-comply/scripts/parser.py`:
- Around line 74-76: parse_spec currently calls path.read_text() without
verifying the file exists (unlike parse_trace), which can produce a less clear
error; update parse_spec to check path.exists() (or path.is_file()) before
reading and raise a clear FileNotFoundError with the same style/message used in
parse_trace if the file is missing, then proceed to call yaml.safe_load as
before (refer to the parse_spec function and compare behavior to parse_trace).

In `@skills/skill-comply/scripts/report.py`:
- Line 52: The string passed to lines.append uses an unnecessary f-string prefix
despite having no placeholders; remove the leading "f" from the call to
lines.append(f"| Recommendation | All steps above threshold — no hook promotion
needed |") in report.py (locate the lines.append invocation that writes the
Recommendation row) so it becomes a plain string literal.
- Around line 39-40: The two calls to lines.append use f-strings with no
placeholders; change lines.append(f"| Metric | Value |") and
lines.append(f"|--------|-------|") to plain strings (lines.append("| Metric |
Value |") and lines.append("|--------|-------|")) so static analysis stops
flagging useless f-strings; update any similar lines.append occurrences in
report.py that use f"" without interpolation.

In `@skills/skill-comply/scripts/runner.py`:
- Around line 81-83: The loop in runner.py currently executes LLM-provided
scenario.setup_commands by calling shlex.split(cmd) and subprocess.run(...)
which is unsafe; replace this with validation against an allowlist of permitted
executables/flags (e.g., only allow "mkdir", "touch", "echo" and safe arg
patterns) before calling subprocess.run, reject or log and skip any command that
doesn't match the allowlist, and/or require explicit user confirmation for
non-allowlisted commands; additionally consider executing allowed commands in a
hardened sandbox (restricted cwd, unprivileged user, or container) using the
existing sandbox_dir and ensure all subprocess.run invocations set shell=False,
check=True, timeout, and capture_output/log stdout/stderr for auditing (refer to
scenario.setup_commands, shlex.split, subprocess.run, and sandbox_dir).

In `@skills/skill-comply/scripts/utils.py`:
- Around line 6-13: The extract_yaml function should validate and fail fast:
trim the input, raise ValueError("Empty model output; expected YAML content") if
empty, then strip leading/trailing code-fence lines only if present (remove
opening ``` and closing ``` together when both exist), build yaml_text =
"\n".join(lines).strip(), and raise ValueError("No YAML content found in model
output") if yaml_text is empty before returning it; update the extract_yaml
implementation to perform these checks and raise clear errors rather than
silently returning empty/invalid strings.

---

Nitpick comments:
In `@skills/skill-comply/.gitignore`:
- Line 7: Remove uv.lock from the .gitignore so the lockfile is committed for
reproducible installs; update the .gitignore entry that currently lists
"uv.lock" to delete that line, add the generated uv.lock to the repository, and
commit the lockfile (regenerate it with your package manager if needed) so
dependency resolution is deterministic across machines/CI.

In `@skills/skill-comply/prompts/classifier.md`:
- Around line 19-20: The current rule examples use specific filenames
("test_calculator.py", "calculator.py") which risks task-specific bias; update
the classifier prompt in classifier.md to use generic filename/path patterns
instead (e.g., test path patterns like "tests/", "test_*.py", "*_test.py" for
test files versus non-test paths or implementation file patterns for
implementation files) so the guidance applies broadly across domains; find the
lines that mention "test_calculator.py" and "calculator.py" and replace them
with these generic patterns and a short explanatory phrase (e.g., "files under
tests/ or prefixed with test_ are tests") to keep behavior stable.

In `@skills/skill-comply/prompts/scenario_generator.md`:
- Line 19: Replace the hardcoded "/tmp/skill-comply-sandbox/{id}/..." path in
the scenario template string with a runner-provided sandbox root placeholder
(e.g., "{sandbox_root}" or "${SANDBOX_ROOT}") so the mkdir command becomes
"mkdir -p {sandbox_root}/{id}/src {sandbox_root}/{id}/tests"; update the
scenario_generator.md template where that exact string appears and ensure the
consuming runner injects the sandbox root constant/config value at runtime
instead of using the absolute /tmp path.

In `@skills/skill-comply/scripts/classifier.py`:
- Around line 70-78: The except block that currently catches
json.JSONDecodeError/ValueError after json.loads(cleaned) silently returns {}
and masks LLM output issues; update that except block (where parsed and cleaned
are used) to log the exception details and the raw/cleaned LLM output before
returning {} so failures are visible — include the exception message (and
ideally stack/trace) plus the cleaned string in the log to aid debugging.

In `@skills/skill-comply/scripts/grader.py`:
- Around line 88-92: The loop over candidates currently stops at the first event
that passes temporal checks (the break after matched.append(event)), which
enforces single-evidence-per-step; decide and apply one of two fixes: if
single-detection is intentional, add a short clarifying comment above the loop
referencing the variables/functions (for example: "Only the first valid event is
used as evidence for a step — see _check_temporal_order and matched") to make
the behavior explicit; otherwise remove the break so the loop continues to
evaluate all candidates and append every event for which
_check_temporal_order(step, event, resolved, classified) returns None.

In `@skills/skill-comply/scripts/parser.py`:
- Around line 79-101: The parser currently accesses required keys like "id",
"name", "source_rule", "version", and "steps" directly which yields generic
KeyError if missing; add explicit validation before building Steps and returning
ComplianceSpec (e.g., check raw contains "steps", "id", "name", "source_rule",
"version") and raise KeyError with clear messages like "Missing 'id' in
compliance spec" or "Missing 'steps' array in compliance spec"; also validate
each step has required fields before creating Step/Detector to provide targeted
errors; update the block that builds steps and the return of ComplianceSpec to
use these explicit checks so errors are informative.

In `@skills/skill-comply/scripts/run.py`:
- Around line 96-100: Replace the two print() calls around running scenarios
with logger.info() so output follows the module's logging configuration: use
logger.info to log the "Running {scenario.level_name}..." message (preserve the
same text and no newline behavior by combining messages if needed) and log the
result line with the formatted compliance_rate; update the calls near
run_scenario, grade, and graded_results handling to use logger.info instead of
print so logs are consistent with the module logger.

In `@skills/skill-comply/scripts/runner.py`:
- Line 16: Replace the hardcoded SANDBOX_BASE =
Path("/tmp/skill-comply-sandbox") with a tempfile-based solution: import
tempfile and either set SANDBOX_BASE = Path(tempfile.gettempdir()) /
"skill-comply-sandbox" for a platform-safe shared temp location, or (preferably)
create a unique per-run directory with tempfile.mkdtemp() (e.g., SANDBOX_BASE =
Path(tempfile.mkdtemp(prefix="skill-comply-"))) and ensure any created temp
directory is cleaned up appropriately; update runner.py to import tempfile and
adjust any code that assumes a persistent /tmp path.

In `@skills/skill-comply/scripts/scenario_generator.py`:
- Around line 56-68: Wrap the YAML extraction and parsing in a try/except that
catches yaml.YAMLError and TypeError/KeyError around yaml.safe_load and
parsed["scenarios"]; validate that parsed is a dict and contains the "scenarios"
key before iterating, and on failure either raise a clear exception that
includes the raw_yaml (or a truncated preview) and the original error or trigger
the same retry path used by spec_generator.py; update the scenario construction
block (where extract_yaml, yaml.safe_load, parsed["scenarios"], and
Scenario(...) are used) to log/raise a descriptive error that aids debugging
when the LLM returns malformed or missing YAML.

In `@skills/skill-comply/scripts/spec_generator.py`:
- Around line 50-53: The Claude call in spec_generator.py checks returncode but
doesn't validate result.stdout before calling extract_yaml; add a guard after
the returncode check to verify result.stdout is non-empty (e.g., if not
result.stdout.strip():) and raise a RuntimeError including the fact that Claude
returned empty output and any available result.stderr; ensure this uses the same
pattern as scenario_generator.py (checking result.stdout before calling
extract_yaml) so extract_yaml is only invoked with non-empty input.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a55fef57-6794-46b5-9f49-f8fa75b7f64d

📥 Commits

Reviewing files that changed from the base of the PR and between 7ccfda9 and ff3cfc4.

📒 Files selected for processing (23)
  • AGENTS.md
  • README.md
  • skills/skill-comply/.gitignore
  • skills/skill-comply/SKILL.md
  • skills/skill-comply/fixtures/compliant_trace.jsonl
  • skills/skill-comply/fixtures/noncompliant_trace.jsonl
  • skills/skill-comply/fixtures/tdd_spec.yaml
  • skills/skill-comply/prompts/classifier.md
  • skills/skill-comply/prompts/scenario_generator.md
  • skills/skill-comply/prompts/spec_generator.md
  • skills/skill-comply/pyproject.toml
  • skills/skill-comply/scripts/__init__.py
  • skills/skill-comply/scripts/classifier.py
  • skills/skill-comply/scripts/grader.py
  • skills/skill-comply/scripts/parser.py
  • skills/skill-comply/scripts/report.py
  • skills/skill-comply/scripts/run.py
  • skills/skill-comply/scripts/runner.py
  • skills/skill-comply/scripts/scenario_generator.py
  • skills/skill-comply/scripts/spec_generator.py
  • skills/skill-comply/scripts/utils.py
  • skills/skill-comply/tests/test_grader.py
  • skills/skill-comply/tests/test_parser.py

Comment on lines +39 to +40
lines.append(f"| Metric | Value |")
lines.append(f"|--------|-------|")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from strings without placeholders.

Static analysis correctly identifies f-strings without any placeholders. These should be plain strings.

Proposed fix
-    lines.append(f"| Metric | Value |")
-    lines.append(f"|--------|-------|")
+    lines.append("| Metric | Value |")
+    lines.append("|--------|-------|")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
lines.append(f"| Metric | Value |")
lines.append(f"|--------|-------|")
lines.append("| Metric | Value |")
lines.append("|--------|-------|")
🧰 Tools
🪛 Ruff (0.15.6)

[error] 39-39: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 40-40: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/report.py` around lines 39 - 40, The two calls to
lines.append use f-strings with no placeholders; change lines.append(f"| Metric
| Value |") and lines.append(f"|--------|-------|") to plain strings
(lines.append("| Metric | Value |") and lines.append("|--------|-------|")) so
static analysis stops flagging useless f-strings; update any similar
lines.append occurrences in report.py that use f"" without interpolation.

step_names = ", ".join(promote_steps)
lines.append(f"| Recommendation | **Promote {step_names} to hooks** |")
else:
lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix.

This string has no placeholders.

Proposed fix
-        lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
+        lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
lines.append(f"| Recommendation | All steps above threshold — no hook promotion needed |")
lines.append("| Recommendation | All steps above threshold — no hook promotion needed |")
🧰 Tools
🪛 Ruff (0.15.6)

[error] 52-52: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/report.py` at line 52, The string passed to
lines.append uses an unnecessary f-string prefix despite having no placeholders;
remove the leading "f" from the call to lines.append(f"| Recommendation | All
steps above threshold — no hook promotion needed |") in report.py (locate the
lines.append invocation that writes the Recommendation row) so it becomes a
plain string literal.

Comment on lines +81 to +83
for cmd in scenario.setup_commands:
parts = shlex.split(cmd)
subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Security risk: executing LLM-generated commands without validation.

setup_commands originate from scenario_generator.py, which parses them from LLM output. Executing arbitrary shell commands from LLM responses is dangerous—a malicious or hallucinated command could delete files, exfiltrate data, or compromise the system.

Consider:

  1. Allowlisting permitted commands (e.g., mkdir, touch, echo)
  2. Running setup commands in a more restricted sandbox
  3. Adding user confirmation before executing
Example allowlist approach
+ALLOWED_SETUP_COMMANDS = frozenset({"mkdir", "touch", "echo", "cat"})
+
 def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
     """Create sandbox directory and run setup commands."""
     if sandbox_dir.exists():
         shutil.rmtree(sandbox_dir)
     sandbox_dir.mkdir(parents=True)

     subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)

     for cmd in scenario.setup_commands:
         parts = shlex.split(cmd)
+        if parts and parts[0] not in ALLOWED_SETUP_COMMANDS:
+            raise ValueError(f"Disallowed setup command: {parts[0]!r}")
         subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
🧰 Tools
🪛 Ruff (0.15.6)

[error] 83-83: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/runner.py` around lines 81 - 83, The loop in
runner.py currently executes LLM-provided scenario.setup_commands by calling
shlex.split(cmd) and subprocess.run(...) which is unsafe; replace this with
validation against an allowlist of permitted executables/flags (e.g., only allow
"mkdir", "touch", "echo" and safe arg patterns) before calling subprocess.run,
reject or log and skip any command that doesn't match the allowlist, and/or
require explicit user confirmation for non-allowlisted commands; additionally
consider executing allowed commands in a hardened sandbox (restricted cwd,
unprivileged user, or container) using the existing sandbox_dir and ensure all
subprocess.run invocations set shell=False, check=True, timeout, and
capture_output/log stdout/stderr for auditing (refer to scenario.setup_commands,
shlex.split, subprocess.run, and sandbox_dir).

Comment on lines +6 to +13
def extract_yaml(text: str) -> str:
"""Extract YAML from LLM output, stripping markdown fences if present."""
lines = text.strip().splitlines()
if lines and lines[0].startswith("```"):
lines = lines[1:]
if lines and lines[-1].startswith("```"):
lines = lines[:-1]
return "\n".join(lines)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Harden YAML extraction for malformed/mixed LLM output.

extract_yaml assumes fences are only at the start/end and never validates extracted content. For untrusted model output, this can silently pass invalid payloads and defer failure downstream with unclear errors.

Proposed fix
 def extract_yaml(text: str) -> str:
     """Extract YAML from LLM output, stripping markdown fences if present."""
-    lines = text.strip().splitlines()
-    if lines and lines[0].startswith("```"):
-        lines = lines[1:]
-    if lines and lines[-1].startswith("```"):
-        lines = lines[:-1]
-    return "\n".join(lines)
+    stripped = text.strip()
+    if not stripped:
+        raise ValueError("Empty model output; expected YAML content")
+
+    lines = stripped.splitlines()
+    if lines and lines[0].startswith("```"):
+        lines = lines[1:]
+        if lines and lines[-1].startswith("```"):
+            lines = lines[:-1]
+
+    yaml_text = "\n".join(lines).strip()
+    if not yaml_text:
+        raise ValueError("No YAML content found in model output")
+    return yaml_text

As per coding guidelines: "Never trust external data (API responses, user input, file content)", "Always validate all user input before processing at system boundaries", and "Fail fast with clear error messages when validation fails".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/utils.py` around lines 6 - 13, The extract_yaml
function should validate and fail fast: trim the input, raise ValueError("Empty
model output; expected YAML content") if empty, then strip leading/trailing
code-fence lines only if present (remove opening ``` and closing ``` together
when both exist), build yaml_text = "\n".join(lines).strip(), and raise
ValueError("No YAML content found in model output") if yaml_text is empty before
returning it; update the extract_yaml implementation to perform these checks and
raise clear errors rather than silently returning empty/invalid strings.

- AGENTS.md: fix stale skill count (115 → 117) in project structure
- run.py: replace remaining print() with logger, add zero-division guard,
  create parent dirs for --output path
- runner.py: add returncode check for claude subprocess, clarify
  relative_to path traversal validation
- parser.py: use is_file() instead of exists(), catch KeyError for
  missing trace fields, add file check in parse_spec
- classifier.py: log warnings on malformed classification output,
  guard against non-dict JSON responses
- grader.py: filter negative indices from LLM classification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ecc-tools
Copy link
Contributor

ecc-tools bot commented Mar 20, 2026

Analyzing 5000 commits...

@ecc-tools
Copy link
Contributor

ecc-tools bot commented Mar 20, 2026

Analysis Failed

Not Found - https://docs.github.com/rest/git/refs#get-a-reference

Troubleshooting
Cause Resolution
Large repository Analysis may timeout on repos with extensive history
API rate limits Wait 15 minutes before retrying
Network issues Queue timeout is 15 minutes; retry may succeed
Permissions Verify app has Contents: Read access

Retry: /ecc-tools analyze


Report Issue | ECC Tools

Comment on lines +87 to +89
for cmd in scenario.setup_commands:
parts = shlex.split(cmd)
subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Setup command failures are silently ignored

subprocess.run return values are not checked inside the setup_commands loop. If any setup command fails (e.g., npm install, pip install some-dep, or even a mistyped shell command), the failure is swallowed entirely. The agent then runs inside a broken environment and may score 0% compliance for reasons completely unrelated to the skill being tested — producing a false negative with no diagnostic information.

For consistency with every other subprocess call in this codebase (all of which check returncode), the result should be inspected:

Suggested change
for cmd in scenario.setup_commands:
parts = shlex.split(cmd)
subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
for cmd in scenario.setup_commands:
parts = shlex.split(cmd)
proc = subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
if proc.returncode != 0:
raise RuntimeError(
f"Setup command {cmd!r} failed (rc={proc.returncode}): "
f"{proc.stderr.decode(errors='replace')[:300]}"
)

Comment on lines +34 to +37
tool_calls = "\n".join(
f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
for i, event in enumerate(trace)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Literal newlines in output break the per-tool-call prompt format

The tool_calls string is assembled with "\n".join(...), using newlines as record separators. However, event.output — which is a raw string from tool results such as Bash — frequently contains embedded newlines (e.g., pytest output, cat output, compiler errors). When those newlines appear inside a [{i}] … entry, the classifier LLM receives a malformed listing where a single tool call spans multiple lines, making it indistinguishable from multiple separate entries. This can cause the classifier to miscount indices or skip events, silently corrupting classification results.

event.input is safe because inputs come from json.dumps(tool_input), which escapes newlines as \n. But event.output is a raw str(output_content) that preserves literal newlines.

Fix by replacing newlines before building the joined string:

Suggested change
tool_calls = "\n".join(
f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200]}"
for i, event in enumerate(trace)
)
tool_calls = "\n".join(
f"[{i}] {event.tool}: input={event.input[:500]} output={event.output[:200].replace(chr(10), ' ').replace(chr(13), '')}"
for i, event in enumerate(trace)
)

from scripts.parser import ObservationEvent
from scripts.scenario_generator import Scenario

SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Sandbox directories are never cleaned up and IDs can collide across skills

Two related issues with the sandbox path strategy:

  1. No cleanup after successful runs. _setup_sandbox destroys and recreates the directory at the start of a run, but there is no cleanup at the end. After running skill-comply against several skills, /tmp/skill-comply-sandbox/ accumulates stale directories indefinitely. Only a subsequent run against the same scenario ID would clean it up.

  2. Scenario IDs are LLM-generated and can collide across different skills. The LLM tends to produce generic names (e.g., scenario_level_1, supportive). When two different skills happen to produce the same scenario.id, they map to the same sandbox path. Running both concurrently (or just checking the residue from a previous run) can produce cross-contaminated environments and misleading compliance results.

Both issues can be addressed by incorporating a run-specific prefix and using a try/finally for cleanup in run_scenario:

import uuid

SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")

def run_scenario(...) -> ScenarioRun:
    sandbox_dir = _safe_sandbox_dir(f"{scenario.id}_{uuid.uuid4().hex[:8]}")
    _setup_sandbox(sandbox_dir, scenario)
    try:
        result = subprocess.run(...)
        ...
        return ScenarioRun(...)
    finally:
        shutil.rmtree(sandbox_dir, ignore_errors=True)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
skills/skill-comply/scripts/runner.py (1)

85-89: ⚠️ Potential issue | 🔴 Critical

Reject and fail fast on generated setup commands.

scenario.setup_commands is generated external data, but Lines 87-89 split and execute it verbatim on the host. Lines 85-89 also ignore exit status, so a malicious or failed setup command can either escape the intended workspace or leave the scenario half-prepared while grading continues.

🔒 Suggested hardening
+def _run_setup_command(cmd: str, *, cwd: Path) -> None:
+    parts = shlex.split(cmd)
+    if not parts:
+        raise ValueError("Empty setup command")
+    if parts[0] not in ALLOWED_SETUP_COMMANDS:
+        raise ValueError(f"Disallowed setup command: {parts[0]!r}")
+    subprocess.run(
+        parts,
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+        timeout=30,
+        check=True,
+    )
+
 def _setup_sandbox(sandbox_dir: Path, scenario: Scenario) -> None:
     """Create sandbox directory and run setup commands."""
     if sandbox_dir.exists():
         shutil.rmtree(sandbox_dir)
     sandbox_dir.mkdir(parents=True)
 
-    subprocess.run(["git", "init"], cwd=sandbox_dir, capture_output=True)
+    subprocess.run(
+        ["git", "init"],
+        cwd=sandbox_dir,
+        capture_output=True,
+        text=True,
+        timeout=30,
+        check=True,
+    )
 
     for cmd in scenario.setup_commands:
-        parts = shlex.split(cmd)
-        subprocess.run(parts, cwd=sandbox_dir, capture_output=True)
+        _run_setup_command(cmd, cwd=sandbox_dir)

As per coding guidelines: "Never trust external data (API responses, user input, file content)" and "Always handle errors explicitly at every level and never silently swallow errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/runner.py` around lines 85 - 89, The loop that
shlex.splits and subprocess.run(s) each entry of scenario.setup_commands (using
shlex.split(cmd) and subprocess.run(parts, cwd=sandbox_dir,
capture_output=True)) must be hardened: validate or whitelist each command
string before executing, disallow shell metacharacters or unexpected binaries,
and refuse unknown/unsafe commands; execute with subprocess.run(..., check=True,
shell=False, stdin=subprocess.DEVNULL, timeout=...) so failures raise and stop
processing, pass an explicit restricted env and cwd=sandbox_dir, and catch
subprocess.CalledProcessError/TimeoutError to log the error and fail fast (raise
or return a non-zero status) from the runner function instead of silently
ignoring return codes.
🧹 Nitpick comments (1)
skills/skill-comply/scripts/classifier.py (1)

10-12: Import placement violates PEP 8 ordering.

The local import at line 12 is placed after module-level code (the logger assignment). Per PEP 8, all imports should be grouped at the top of the file before any other statements.

♻️ Suggested fix
 import logging
 import subprocess
 from pathlib import Path
+
+from scripts.parser import ComplianceSpec, ObservationEvent
 
 logger = logging.getLogger(__name__)
 
-from scripts.parser import ComplianceSpec, ObservationEvent
-
 PROMPTS_DIR = Path(__file__).parent.parent / "prompts"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/classifier.py` around lines 10 - 12, Move the
local import from scripts.parser (ComplianceSpec, ObservationEvent) above any
module-level statements so imports are grouped at the top of the file;
specifically, place the "from scripts.parser import ComplianceSpec,
ObservationEvent" line before the logger = logging.getLogger(__name__)
assignment in classifier.py to comply with PEP 8 import ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/skill-comply/scripts/runner.py`:
- Around line 16-17: The sandbox directory is currently deterministic and shared
(SANDBOX_BASE and ScenarioRun.sandbox_dir), which risks cross-run clobbering;
change creation to generate a unique per-run sandbox (e.g., use uuid4 or
tempfile.mkdtemp) by adding a helper like make_sandbox_dir(safe_id) that returns
Path(tempfile.mkdtemp(prefix=f"{safe_id}-", dir=SANDBOX_BASE)), update wherever
Sandbox is constructed to call that helper (replace direct
Path("/tmp/skill-comply-sandbox") + safe_id usage), and stop deleting the entire
shared base — only remove the specific per-run directory
(ScenarioRun.sandbox_dir) at teardown to avoid wiping other runs.
- Around line 103-149: The loop that parses stream-json lines must validate and
reject malformed events instead of silently skipping or assuming dicts: after
json.loads(line) ensure the result is a dict (raise or log and continue with a
clear error if not), replace msg.get("type") usage with a guarded access only
after validation, and validate that "content" for assistant messages is a list
before iterating (same validation already done for user messages) to avoid
AttributeError; update the parsing around the variables msg/msg_type/content and
the pending/ObservationEvent creation so malformed scalars/arrays/nulls are
detected and handled with a clear failure or logged rejection rather than
causing downstream crashes.

---

Duplicate comments:
In `@skills/skill-comply/scripts/runner.py`:
- Around line 85-89: The loop that shlex.splits and subprocess.run(s) each entry
of scenario.setup_commands (using shlex.split(cmd) and subprocess.run(parts,
cwd=sandbox_dir, capture_output=True)) must be hardened: validate or whitelist
each command string before executing, disallow shell metacharacters or
unexpected binaries, and refuse unknown/unsafe commands; execute with
subprocess.run(..., check=True, shell=False, stdin=subprocess.DEVNULL,
timeout=...) so failures raise and stop processing, pass an explicit restricted
env and cwd=sandbox_dir, and catch subprocess.CalledProcessError/TimeoutError to
log the error and fail fast (raise or return a non-zero status) from the runner
function instead of silently ignoring return codes.

---

Nitpick comments:
In `@skills/skill-comply/scripts/classifier.py`:
- Around line 10-12: Move the local import from scripts.parser (ComplianceSpec,
ObservationEvent) above any module-level statements so imports are grouped at
the top of the file; specifically, place the "from scripts.parser import
ComplianceSpec, ObservationEvent" line before the logger =
logging.getLogger(__name__) assignment in classifier.py to comply with PEP 8
import ordering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8986065a-ccdf-431e-b620-da7400638c8f

📥 Commits

Reviewing files that changed from the base of the PR and between ff3cfc4 and d72e3d1.

📒 Files selected for processing (6)
  • AGENTS.md
  • skills/skill-comply/scripts/classifier.py
  • skills/skill-comply/scripts/grader.py
  • skills/skill-comply/scripts/parser.py
  • skills/skill-comply/scripts/run.py
  • skills/skill-comply/scripts/runner.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • skills/skill-comply/scripts/run.py
  • skills/skill-comply/scripts/grader.py
  • AGENTS.md
  • skills/skill-comply/scripts/parser.py

Comment on lines +16 to +17
SANDBOX_BASE = Path("/tmp/skill-comply-sandbox")
ALLOWED_MODELS = frozenset({"haiku", "sonnet", "opus"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use a per-run sandbox directory.

The path is deterministic (/tmp/skill-comply-sandbox/<sanitized id>), and Lines 81-83 delete it before every run. Rerunning the same scenario—or two distinct IDs that normalize to the same safe_id—can wipe another run's workspace and make ScenarioRun.sandbox_dir point at different contents than the run actually produced. A unique suffix (uuid4/mkdtemp) on the sanitized prefix avoids both collisions and cross-run clobbering.

Also applies to: 70-76, 81-83

🧰 Tools
🪛 Ruff (0.15.6)

[error] 16-16: Probable insecure usage of temporary file or directory: "/tmp/skill-comply-sandbox"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/runner.py` around lines 16 - 17, The sandbox
directory is currently deterministic and shared (SANDBOX_BASE and
ScenarioRun.sandbox_dir), which risks cross-run clobbering; change creation to
generate a unique per-run sandbox (e.g., use uuid4 or tempfile.mkdtemp) by
adding a helper like make_sandbox_dir(safe_id) that returns
Path(tempfile.mkdtemp(prefix=f"{safe_id}-", dir=SANDBOX_BASE)), update wherever
Sandbox is constructed to call that helper (replace direct
Path("/tmp/skill-comply-sandbox") + safe_id usage), and stop deleting the entire
shared base — only remove the specific per-run directory
(ScenarioRun.sandbox_dir) at teardown to avoid wiping other runs.

Comment on lines +103 to +149
for line in stdout.strip().splitlines():
try:
msg = json.loads(line)
except json.JSONDecodeError:
continue

msg_type = msg.get("type")

if msg_type == "assistant":
content = msg.get("message", {}).get("content", [])
for block in content:
if block.get("type") == "tool_use":
tool_use_id = block.get("id", "")
tool_input = block.get("input", {})
input_str = (
json.dumps(tool_input)[:5000]
if isinstance(tool_input, dict)
else str(tool_input)[:5000]
)
pending[tool_use_id] = {
"tool": block.get("name", "unknown"),
"input": input_str,
"order": event_counter,
}
event_counter += 1

elif msg_type == "user":
content = msg.get("message", {}).get("content", [])
if isinstance(content, list):
for block in content:
tool_use_id = block.get("tool_use_id", "")
if tool_use_id in pending:
info = pending.pop(tool_use_id)
output_content = block.get("content", "")
if isinstance(output_content, list):
output_str = json.dumps(output_content)[:5000]
else:
output_str = str(output_content)[:5000]

events.append(ObservationEvent(
timestamp=f"T{info['order']:04d}",
event="tool_complete",
tool=info["tool"],
session=msg.get("session_id", "unknown"),
input=info["input"],
output=output_str,
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd skills/skill-comply && find . -name "runner.py" -type f

Repository: affaan-m/everything-claude-code

Length of output: 94


🏁 Script executed:

wc -l skills/skill-comply/scripts/runner.py

Repository: affaan-m/everything-claude-code

Length of output: 116


🏁 Script executed:

sed -n '90,160p' skills/skill-comply/scripts/runner.py

Repository: affaan-m/everything-claude-code

Length of output: 2799


🏁 Script executed:

cd skills/skill-comply && rg "_parse_stream_json" -B 3 -A 3 scripts/runner.py

Repository: affaan-m/everything-claude-code

Length of output: 497


🏁 Script executed:

cd skills/skill-comply && rg "parse_stream_json\|ObservationEvent" scripts/ -l

Repository: affaan-m/everything-claude-code

Length of output: 57


🏁 Script executed:

cd skills/skill-comply && head -50 scripts/runner.py

Repository: affaan-m/everything-claude-code

Length of output: 1453


🏁 Script executed:

cd skills/skill-comply && find . -name "*test*.py" -o -name "test_*" -type f | head -20

Repository: affaan-m/everything-claude-code

Length of output: 120


🏁 Script executed:

cd skills/skill-comply && rg "def test.*parse" . 2>/dev/null || echo "No tests found"

Repository: affaan-m/everything-claude-code

Length of output: 292


🏁 Script executed:

cd skills/skill-comply && grep -n "JSONDecodeError\|silent\|malformed" scripts/runner.py

Repository: affaan-m/everything-claude-code

Length of output: 115


🏁 Script executed:

cd skills/skill-comply && sed -n '1,30p' scripts/runner.py && echo "---" && sed -n '75,95p' scripts/runner.py

Repository: affaan-m/everything-claude-code

Length of output: 1425


🏁 Script executed:

cd skills/skill-comply && cat tests/test_parser.py | head -100

Repository: affaan-m/everything-claude-code

Length of output: 3212


🏁 Script executed:

cd skills/skill-comply && rg "stream-json\|JSONDecodeError" tests/ -B 2 -A 2

Repository: affaan-m/everything-claude-code

Length of output: 57


🏁 Script executed:

cd skills/skill-comply && rg "stream-json" . --type py -B 2 -A 5

Repository: affaan-m/everything-claude-code

Length of output: 1722


🏁 Script executed:

cd skills/skill-comply && sed -n '103,149p' scripts/runner.py | cat -n

Repository: affaan-m/everything-claude-code

Length of output: 2420


🏁 Script executed:

python3 << 'EOF'
import json

# Test what happens if JSON decodes to non-dict
test_cases = [
    '["array"]',
    '"string"',
    '123',
    'null',
    '{"valid": "dict"}',
]

for case in test_cases:
    try:
        msg = json.loads(case)
        print(f"Parsed {case:20} -> type {type(msg).__name__:8} -> msg.get('type') raises?", end=" ")
        try:
            result = msg.get("type")
            print(f"No, got {result}")
        except AttributeError as e:
            print(f"Yes: {e}")
    except json.JSONDecodeError:
        print(f"Failed to parse {case}")
EOF

Repository: affaan-m/everything-claude-code

Length of output: 617


🏁 Script executed:

cd skills/skill-comply && rg "observations\|ScenarioRun" scripts/runner.py -A 2 -B 2 | head -40

Repository: affaan-m/everything-claude-code

Length of output: 57


Reject malformed stream-json events instead of silently dropping them.

Lines 103-107 silently skip JSON decode failures, and the parser assumes all events are dict-shaped. If claude outputs a JSON scalar, array, or null, line 107's msg.get("type") will crash with AttributeError: '<type>' object has no attribute 'get'. Additionally, the parser iterates content without validating it's a list for "assistant" events (line 110), while doing so for "user" events (line 129)—inconsistent error handling that masks malformed data and corrupts the trace used for compliance grading.

🧪 Suggested validation
-    for line in stdout.strip().splitlines():
+    for line_no, line in enumerate(stdout.splitlines(), start=1):
+        if not line.strip():
+            continue
         try:
-            msg = json.loads(line)
-        except json.JSONDecodeError:
-            continue
+            raw_msg = json.loads(line)
+        except json.JSONDecodeError as exc:
+            raise ValueError(
+                f"Invalid stream-json event at line {line_no}"
+            ) from exc
+
+        if not isinstance(raw_msg, dict):
+            raise ValueError(
+                f"Invalid stream-json event at line {line_no}: expected object"
+            )
+        msg = raw_msg

         msg_type = msg.get("type")

         if msg_type == "assistant":
-            content = msg.get("message", {}).get("content", [])
+            content = msg.get("message", {}).get("content", [])
+            if not isinstance(content, list):
+                raise ValueError(
+                    f"Invalid stream-json event at line {line_no}: content must be list"
+                )

Per coding guidelines: "Never trust external data (API responses, user input, file content)" and "Fail fast with clear error messages when validation fails".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/skill-comply/scripts/runner.py` around lines 103 - 149, The loop that
parses stream-json lines must validate and reject malformed events instead of
silently skipping or assuming dicts: after json.loads(line) ensure the result is
a dict (raise or log and continue with a clear error if not), replace
msg.get("type") usage with a guarded access only after validation, and validate
that "content" for assistant messages is a list before iterating (same
validation already done for user messages) to avoid AttributeError; update the
parsing around the variables msg/msg_type/content and the
pending/ObservationEvent creation so malformed scalars/arrays/nulls are detected
and handled with a clear failure or logged rejection rather than causing
downstream crashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant