Add flaw tracer for root-cause analysis of plan reports #534

Open
neoneye wants to merge 24 commits into main from feature/flaw-tracer

neoneye commented Apr 5, 2026

Summary

  • Adds worker_plan_internal/flaw_tracer/ package — a CLI tool that traces flaws in PlanExe reports upstream through the pipeline DAG to find where they originated
  • Static registry maps all 70 pipeline stages with their output files, upstream dependencies, and source code paths
  • Three-phase recursive algorithm: (1) identify flaws via LLM, (2) trace each upstream with dedup, (3) analyze source code at origin
  • Produces both JSON and markdown reports, sorted by trace depth (deepest root cause first)
  • 46 new tests, all passing. No regressions in the broader test suite (314 passed)

Usage

python -m worker_plan_internal.flaw_tracer \
    --dir /path/to/output \
    --file 030-report.html \
    --flaw "The budget appears unvalidated..." \
    --verbose
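
The flags in the command above could be parsed along these lines. This is a minimal sketch, not the PR's actual CLI code: the flag names come from the usage example, while `build_parser`, the help strings, and which flags are required are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        prog="python -m worker_plan_internal.flaw_tracer",
        description="Trace a flaw in a report upstream through the pipeline.",
    )
    # Assumption: the three value flags are mandatory.
    parser.add_argument("--dir", required=True, help="pipeline output directory")
    parser.add_argument("--file", required=True, help="file in which the flaw was observed")
    parser.add_argument("--flaw", required=True, help="free-text description of the flaw")
    parser.add_argument("--verbose", action="store_true", help="print progress details")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dir, args.file, args.flaw, args.verbose)
```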

Test plan

  • All 46 flaw_tracer tests pass
  • All 314 worker_plan tests pass (no regressions)
  • CLI --help works
  • Module is importable
  • Manual test with a real output directory and LLM (post-merge)

🤖 Generated with Claude Code

neoneye and others added 24 commits April 5, 2026 22:52
Static DAG registry mapping all 48 PlanExe pipeline stages to their
output files, upstream dependencies, and source code paths. Includes
lookup functions (find_stage_by_filename, get_upstream_files,
get_source_code_paths) and 14 passing tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
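
A static registry of the kind this commit describes might look roughly like the sketch below. The three lookup function names come from the commit message; their signatures, the `Stage` dataclass, and the stage entries are assumptions for illustration (only `030-report.html` and the two `identify_purpose.py` paths appear elsewhere in this PR).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical registry entry for one pipeline stage."""
    name: str
    output_files: tuple[str, ...]
    upstream: tuple[str, ...] = ()          # names of upstream stages
    source_code_paths: tuple[str, ...] = ()


# Illustrative entries only; the real registry covers every pipeline stage.
STAGES = (
    Stage("setup", ("001-plan.txt",), (), ("stages/setup.py",)),
    Stage(
        "identify_purpose",
        ("002-purpose.json",),
        ("setup",),
        ("stages/identify_purpose.py", "assume/identify_purpose.py"),
    ),
    Stage("report", ("030-report.html",), ("identify_purpose",), ("stages/report.py",)),
)
_BY_NAME = {stage.name: stage for stage in STAGES}


def find_stage_by_filename(filename: str) -> Stage | None:
    """Return the stage that produces `filename`, or None if unknown."""
    for stage in STAGES:
        if filename in stage.output_files:
            return stage
    return None


def get_upstream_files(stage_name: str) -> list[str]:
    """Collect the output files of every direct upstream stage."""
    files: list[str] = []
    for upstream_name in _BY_NAME[stage_name].upstream:
        files.extend(_BY_NAME[upstream_name].output_files)
    return files


def get_source_code_paths(stage_name: str) -> tuple[str, ...]:
    """Return the source files that implement the given stage."""
    return _BY_NAME[stage_name].source_code_paths
```

Keeping the registry as plain immutable data makes it easy to test, at the cost of having to update it by hand when the pipeline changes.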
FlawTracer orchestrates three-phase flaw tracing through the pipeline DAG:
- Phase 1: LLM-based flaw identification in starting file
- Phase 2: Recursive upstream tracing with deduplication and max depth
- Phase 3: Source code analysis at flaw origin stages

Tests mock the LLM-calling methods to verify tracing logic, deduplication,
depth limits, multi-flaw handling, and depth-sorted output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
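
The Phase 2 recursion with deduplication, a depth limit, and depth-sorted output can be sketched as follows. This is not the `FlawTracer` implementation: the `UPSTREAM` graph, the function names, and the `has_flaw` callback (a stand-in for the LLM check, much as the tests mock the LLM-calling methods) are all hypothetical.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical DAG: each stage maps to its direct upstream stages.
UPSTREAM = {
    "report": ["budget", "risks"],
    "budget": ["assumptions"],
    "risks": ["assumptions"],
    "assumptions": [],
}


def trace_upstream(
    stage: str,
    flaw: str,
    has_flaw: Callable[[str, str], bool],
    max_depth: int = 5,
    depth: int = 0,
    checked: set[tuple[str, str]] | None = None,
) -> tuple[str, int]:
    """Follow a flaw upstream and return (origin_stage, depth_reached)."""
    if checked is None:
        checked = set()
    if depth < max_depth:
        for parent in UPSTREAM.get(stage, []):
            key = (parent, flaw)
            if key in checked:
                continue  # dedup: never ask about the same (stage, flaw) twice
            checked.add(key)
            if has_flaw(parent, flaw):
                # First match wins: follow this branch and ignore siblings.
                return trace_upstream(parent, flaw, has_flaw, max_depth, depth + 1, checked)
    # No upstream stage shows the flaw (or the depth limit was hit).
    return stage, depth


def sort_deepest_first(traces: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Order traces so the deepest root cause comes first."""
    return sorted(traces, key=lambda trace: trace[1], reverse=True)
```

Passing the check in as a callable keeps the traversal testable without any LLM access.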
Previously, _analyze_source_code was only called when no upstream origin
was found (the fallback path). When _trace_upstream successfully identified
a deeper origin, Phase 3 was skipped entirely. Now Phase 3 runs whenever
an origin stage is known, regardless of how it was determined.

Also removes unused imports (json in tracer.py, MagicMock and json in
test_tracer.py) and adds a test verifying Phase 3 is called at a deep
upstream origin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ze types

- Remove unused `field` import from dataclasses
- Remove unused `source_code_base` parameter from FlawTracer.__init__()
  (registry handles source code path resolution via its own _SOURCE_BASE)
- Replace `Optional[X]` with `X | None` using `from __future__ import annotations`
- Add clarifying comments for dedup strategy and first-match-wins logic
- Remove dead `mock_analysis` variable and unused `SourceCodeAnalysisResult`
  import from test_tracer.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Appends a JSONL line for each significant event during tracing:
phase1_start/done, trace_flaw_start/done, upstream_check, upstream_found,
origin_found, phase3_start, trace_complete.

Monitor progress with: tail -f events.jsonl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
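
Appending one JSON object per line is what makes the log usable with `tail -f`. A minimal sketch of such a logger follows; the event names come from the commit message, but `log_event`, `read_events`, and the record fields are assumptions.

```python
import json
from pathlib import Path


def log_event(path: Path, event: str, **fields) -> None:
    """Append a single JSON object as one line (JSONL)."""
    record = {"event": event, **fields}
    # Append mode keeps earlier events intact, so readers can follow the file.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def read_events(path: Path) -> list[dict]:
    """Parse the log back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `log_event(Path("events.jsonl"), "upstream_found", stage="budget", depth=1)` would append one line, where `stage` and `depth` are illustrative field names.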
Format: 2026-04-05T23:40:03Z

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
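
A timestamp in the format shown above (UTC, second precision, `Z` suffix) can be produced with the standard library. The helper name `utc_timestamp` is hypothetical; the commit does not show how the timestamp is actually generated.

```python
from __future__ import annotations

from datetime import datetime, timezone


def utc_timestamp(now: datetime | None = None) -> str:
    """Format a moment as e.g. 2026-04-05T23:40:03Z (UTC, whole seconds)."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC first so the trailing "Z" is always truthful.
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```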
Documents what works, what needs fixing (Phase 1 anchoring,
loose upstream checks, long evidence quotes), and test run results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 prompt now requires the user's specific flaw as the first
result, with additional flaws limited to the same problem family.

Phase 2 prompt now requires causal mechanism (not just topical
overlap) and limits evidence quotes to 200 characters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shows "stages/identify_purpose.py" and "assume/identify_purpose.py"
instead of "identify_purpose.py" twice.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
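
One simple way to get labels like these is to keep the parent directory along with the file name. The sketch below assumes that approach; `short_label` and the longer example path are hypothetical, and the PR may build its labels differently.

```python
from pathlib import PurePosixPath


def short_label(path: str) -> str:
    """Return 'parent/filename' so files with the same name stay distinguishable."""
    parts = PurePosixPath(path).parts
    # Keep at most the last two components; a bare filename is returned as-is.
    return "/".join(parts[-2:])


# Hypothetical full paths, for illustration only.
print(short_label("worker_plan_internal/stages/identify_purpose.py"))
print(short_label("worker_plan_internal/assume/identify_purpose.py"))
```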
Phase 3 always blames the prompt, but some flaws stem from inherent domain
complexity. Future improvement: classify root causes as prompt-fixable,
domain complexity, or missing input data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xity, missing_input

Phase 3 now categorizes each root cause so suggestions are honest:
- prompt_fixable: the prompt has a gap that can be edited
- domain_complexity: inherently uncertain/contentious, no prompt change resolves it
- missing_input: the user's plan didn't provide enough detail

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
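
The three categories map naturally onto an enum that rejects anything else the LLM might return. This is a sketch under that assumption: the category values come from the commit message, while `RootCauseCategory` and `parse_category` are hypothetical names.

```python
from enum import Enum


class RootCauseCategory(str, Enum):
    """The three root-cause categories listed in the commit message."""
    PROMPT_FIXABLE = "prompt_fixable"
    DOMAIN_COMPLEXITY = "domain_complexity"
    MISSING_INPUT = "missing_input"


def parse_category(raw: str) -> RootCauseCategory:
    """Normalize a raw string and map it to a category.

    Raises ValueError for anything outside the three known values.
    """
    return RootCauseCategory(raw.strip().lower())
```

Failing loudly on an unknown value is one design choice; falling back to a default category would be another, with the risk of hiding malformed LLM output.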
…esults

README: document category field, events.jsonl, updated examples and typical run stats.
AGENTS: move Phase 3 to fixed, add India census v3 results, update what-works-well.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README: add Tips section (start from self_audit, trust chains over
suggestions, check category, results are non-deterministic) and
Limitations section (LLM subjectivity, first-match-wins, static
registry, text-only, diagnostic not prescriptive).

AGENTS: add non-determinism and registry drift as MEDIUM open issues,
add honest assessment section with guidance on what to trust.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>