Add flaw tracer for root-cause analysis of plan reports #534
Open
Static DAG registry mapping all 48 PlanExe pipeline stages to their output files, upstream dependencies, and source code paths. Includes lookup functions (find_stage_by_filename, get_upstream_files, get_source_code_paths) and 14 passing tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
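A minimal sketch of what such a static registry could look like. Only the three lookup function names come from the commit message; the `Stage` dataclass, its fields, the return shapes, and the three example stages and file names are hypothetical stand-ins for the real 48-stage table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage (hypothetical shape, not the PR's actual class)."""
    name: str
    output_files: tuple[str, ...]
    upstream: tuple[str, ...]       # names of stages this stage depends on
    source_paths: tuple[str, ...]   # source files that implement the stage


# Hypothetical three-stage excerpt; the real registry covers 48 stages.
STAGES = (
    Stage("identify_purpose", ("purpose.md",), (),
          ("stages/identify_purpose.py",)),
    Stage("make_assumptions", ("assumptions.md",), ("identify_purpose",),
          ("stages/make_assumptions.py",)),
    Stage("self_audit", ("self_audit.md",), ("make_assumptions",),
          ("stages/self_audit.py",)),
)
_BY_NAME = {stage.name: stage for stage in STAGES}


def find_stage_by_filename(filename: str) -> Stage | None:
    """Return the stage that produces `filename`, or None if unknown."""
    for stage in STAGES:
        if filename in stage.output_files:
            return stage
    return None


def get_upstream_files(stage_name: str) -> list[tuple[str, str]]:
    """Return (upstream stage name, output file) pairs for direct dependencies."""
    pairs = []
    for upstream_name in _BY_NAME[stage_name].upstream:
        for output_file in _BY_NAME[upstream_name].output_files:
            pairs.append((upstream_name, output_file))
    return pairs


def get_source_code_paths(stage_name: str) -> list[str]:
    """Return the source files registered for a stage."""
    return list(_BY_NAME[stage_name].source_paths)
```

Keeping the table as plain module-level data makes the lookups trivial to test, at the cost of having to update it by hand whenever the pipeline changes.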
FlawTracer orchestrates three-phase flaw tracing through the pipeline DAG:
- Phase 1: LLM-based flaw identification in starting file
- Phase 2: Recursive upstream tracing with deduplication and max depth
- Phase 3: Source code analysis at flaw origin stages

Tests mock the LLM-calling methods to verify tracing logic, deduplication, depth limits, multi-flaw handling, and depth-sorted output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
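The Phase 2 recursion can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the `UPSTREAM` graph, the `trace_upstream` signature, and the `has_flaw` callback (standing in for the LLM check) are all hypothetical. It shows deduplication (each stage is checked at most once per flaw), a depth limit, and a first-match-wins descent.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical DAG: stage name -> names of its direct upstream stages.
UPSTREAM = {
    "report": ["self_audit", "assumptions"],
    "self_audit": ["assumptions"],
    "assumptions": ["purpose"],
    "purpose": [],
}


def trace_upstream(
    stage: str,
    flaw: str,
    has_flaw: Callable[[str, str], bool],
    max_depth: int = 5,
    _depth: int = 0,
    _seen: set[str] | None = None,
) -> list[str]:
    """Return the chain of stages from `stage` back toward the flaw's origin."""
    seen = _seen if _seen is not None else set()
    seen.add(stage)
    chain = [stage]
    if _depth >= max_depth:
        return chain  # depth limit reached: treat this stage as the origin
    for parent in UPSTREAM.get(stage, []):
        if parent in seen:
            continue  # dedup: never ask about the same stage twice
        seen.add(parent)
        if has_flaw(parent, flaw):
            # First match wins: follow this parent and ignore later siblings.
            return chain + trace_upstream(
                parent, flaw, has_flaw, max_depth, _depth + 1, seen
            )
    return chain  # no upstream stage carries the flaw, so it starts here
```

In a test, `has_flaw` can be a plain function or lambda, which mirrors the approach of mocking the LLM-calling methods so that only the traversal logic is exercised.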
Previously, _analyze_source_code was only called when no upstream origin was found (the fallback path). When _trace_upstream successfully identified a deeper origin, Phase 3 was skipped entirely. Now Phase 3 runs whenever an origin stage is known, regardless of how it was determined. Also removes unused imports (json in tracer.py, MagicMock and json in test_tracer.py) and adds a test verifying Phase 3 is called at a deep upstream origin. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ze types
- Remove unused `field` import from dataclasses
- Remove unused `source_code_base` parameter from FlawTracer.__init__() (registry handles source code path resolution via its own _SOURCE_BASE)
- Replace `Optional[X]` with `X | None` using `from __future__ import annotations`
- Add clarifying comments for dedup strategy and first-match-wins logic
- Remove dead `mock_analysis` variable and unused `SourceCodeAnalysisResult` import from test_tracer.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Appends a JSONL line for each significant event during tracing: phase1_start/done, trace_flaw_start/done, upstream_check, upstream_found, origin_found, phase3_start, trace_complete.

Monitor progress with: `tail -f events.jsonl`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
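A minimal sketch of an append-only JSONL event log of this kind. The `log_event` helper and the `timestamp` and `event` field names are assumptions; only the event names and the `events.jsonl` file name come from the commit message. The timestamp here is UTC with a trailing `Z`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_event(path, event, **data):
    """Append one JSON object describing `event` as a single line of `path`."""
    record = {
        # UTC timestamp such as 2026-04-05T23:40:03Z (field name is assumed).
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event": event,
        **data,
    }
    # Open in append mode for each event so `tail -f` sees lines immediately.
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Usage would look like `log_event("events.jsonl", "upstream_check", stage="self_audit")`. Because each line is a complete JSON object, a partially finished run still leaves a readable log.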
Format: 2026-04-05T23:40:03Z Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents what works, what needs fixing (Phase 1 anchoring, loose upstream checks, long evidence quotes), and test run results. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 1 prompt now requires the user's specific flaw as the first result, with additional flaws limited to the same problem family. Phase 2 prompt now requires causal mechanism (not just topical overlap) and limits evidence quotes to 200 characters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shows "stages/identify_purpose.py" and "assume/identify_purpose.py" instead of "identify_purpose.py" twice. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
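One way to get this kind of disambiguated display, sketched under assumptions: the `display_names` helper and the full example paths are hypothetical, and the actual change may simply always show the parent directory. Here the parent directory is added only when two paths share a file name.

```python
from collections import Counter
from pathlib import PurePosixPath


def display_names(paths):
    """Return short labels, prefixing the parent directory on name clashes."""
    names = [PurePosixPath(p).name for p in paths]
    counts = Counter(names)
    labels = []
    for path, name in zip(paths, names):
        parent = PurePosixPath(path).parent.name
        if counts[name] > 1 and parent:
            labels.append(f"{parent}/{name}")  # e.g. stages/identify_purpose.py
        else:
            labels.append(name)
    return labels
```

For example, `display_names(["pkg/stages/identify_purpose.py", "pkg/assume/identify_purpose.py"])` yields the two distinct labels quoted in the commit message.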
Phase 3 always blames the prompt, but some flaws are inherent domain complexity. Future improvement: classify root causes into prompt-fixable, domain complexity, and missing input data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…xity, missing_input
Phase 3 now categorizes each root cause so suggestions are honest:
- prompt_fixable: the prompt has a gap that can be edited
- domain_complexity: inherently uncertain/contentious, no prompt change resolves it
- missing_input: the user's plan didn't provide enough detail

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
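The three categories map naturally onto an enum. The sketch below is an assumption about shape only: the category values come from the commit message, while `RootCauseCategory`, `parse_category`, and `prompt_edit_recommended` are hypothetical names.

```python
from enum import Enum


class RootCauseCategory(str, Enum):
    """Root-cause categories named in the commit message."""
    PROMPT_FIXABLE = "prompt_fixable"
    DOMAIN_COMPLEXITY = "domain_complexity"
    MISSING_INPUT = "missing_input"


def parse_category(raw: str) -> RootCauseCategory:
    """Normalize a category string; raises ValueError on unknown values."""
    return RootCauseCategory(raw.strip().lower())


def prompt_edit_recommended(category: RootCauseCategory) -> bool:
    """Only prompt_fixable root causes call for editing the stage prompt."""
    return category is RootCauseCategory.PROMPT_FIXABLE
```

Rejecting unknown strings with `ValueError` is a deliberate choice in this sketch: if a model returns a category outside the three agreed values, failing loudly is safer than silently treating the flaw as prompt-fixable.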
…esults
README: document category field, events.jsonl, updated examples and typical run stats.
AGENTS: move Phase 3 to fixed, add India census v3 results, update what-works-well.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README: add Tips section (start from self_audit, trust chains over suggestions, check category, results are non-deterministic) and Limitations section (LLM subjectivity, first-match-wins, static registry, text-only, diagnostic not prescriptive).
AGENTS: add non-determinism and registry drift as MEDIUM open issues, add honest assessment section with guidance on what to trust.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
worker_plan_internal/flaw_tracer/ package: a CLI tool that traces flaws in PlanExe reports upstream through the pipeline DAG to find where they originated
Usage
Test plan
--help works
🤖 Generated with Claude Code