diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 00000000..c5a810a9
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,88 @@
+# OpenPlanter
+
+A recursive-language-model investigation agent with a terminal UI. OpenPlanter ingests heterogeneous datasets (corporate registries, campaign finance records, lobbying disclosures, government contracts), resolves entities across them, and surfaces non-obvious connections through evidence-backed analysis. It operates autonomously with file I/O, shell execution, web search, and recursive sub-agent delegation. It is an open-source alternative to Palantir for investigative journalists, NGOs, OSINT analysts, and researchers.
+
+## Commands
+
+- **Install**: `pip install -e .`
+- **Test (offline)**: `python -m pytest tests/ --ignore=tests/test_live_models.py --ignore=tests/test_integration_live.py`
+- **Test (full, requires API keys)**: `python -m pytest tests/`
+- **Run (TUI)**: `openplanter-agent --workspace DIR`
+- **Run (headless)**: `openplanter-agent --task "OBJECTIVE" --workspace DIR`
+- **Configure keys**: `openplanter-agent --configure-keys`
+- **Docker**: `docker compose up` (mounts `./workspace`, reads `.env`)
+
+## Structure
+
+- `agent/` -- Core agent package (entry point, engine, tools, TUI)
+ - `__main__.py` -- CLI entry point and REPL
+ - `engine.py` -- Recursive language model engine (sub-agent spawning via `subtask`/`execute`)
+ - `runtime.py` -- Session persistence and lifecycle
+ - `model.py` -- Provider-agnostic LLM abstraction (OpenAI, Anthropic, OpenRouter, Cerebras)
+ - `builder.py` -- Engine/model factory
+ - `tools.py` -- 19 workspace tools (file ops, shell, web search, planning, delegation)
+ - `tool_defs.py` -- Tool JSON schemas
+ - `prompts.py` -- System prompt construction (investigation methodology, entity resolution protocol)
+ - `config.py` -- `AgentConfig` dataclass with env var resolution
+ - `credentials.py` -- Credential management (5-tier priority: CLI > env > .env > workspace store > user store)
+ - `tui.py` -- Rich terminal UI with prompt_toolkit
+ - `demo.py` -- Demo mode (entity/path censoring)
+ - `patching.py` -- File patching utilities
+ - `settings.py` -- Persistent workspace settings
+ - `replay_log.py` -- Session replay logging
+- `tests/` -- Unit and integration tests (~8,600 LOC, 24 test files)
+ - `test_live_models.py`, `test_integration_live.py` -- Live API tests (skip in CI)
+ - All other `test_*.py` -- Offline unit tests
+- `skills/openplanter/` -- Investigation methodology skill for Claude Code
+ - `scripts/` -- Stdlib-only Python scripts (entity resolver, cross-reference, evidence chain, confidence scorer, workspace init)
+ - `references/` -- Entity resolution patterns, investigation methodology, output templates
+
+## Conventions
+
+- **Python 3.10+** required. No runtime dependencies beyond `rich`, `prompt_toolkit`, and `pyfiglet`.
+- **TUI**: `rich` for rendering, `prompt_toolkit` for input. No curses.
+- **Skill scripts**: Python stdlib only. Zero external dependencies. Located in `skills/openplanter/scripts/`.
+- **Dataclasses with `slots=True`**: All config and data containers use `@dataclass(slots=True)`.
+- **Provider abstraction**: `model.py` handles all LLM providers behind a unified interface. Never import provider SDKs directly in other modules.
+- **Env var naming**: All runtime settings accept `OPENPLANTER_*` prefix (e.g. `OPENPLANTER_MAX_DEPTH=8`). API keys also accept standard names (e.g. `OPENAI_API_KEY`).
+- **Session data**: Stored in `.openplanter/` within the workspace directory.
+- **Type hints**: Use `from __future__ import annotations` for deferred evaluation. Union syntax: `str | None`, not `Optional[str]`.
+- **Test isolation**: Live API tests are in dedicated files (`test_live_models.py`, `test_integration_live.py`) so they can be excluded with `--ignore`.
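+
+A minimal sketch of the dataclass and typing conventions above (the class and fields here are hypothetical, for illustration only):
+
+```python
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+
+@dataclass(slots=True)
+class ExampleSettings:
+    """Illustrative config container following the project conventions."""
+
+    workspace: str
+    max_depth: int = 8
+    api_key: str | None = None  # union syntax, not Optional[str]
+
+
+settings = ExampleSettings(workspace="./workspace")
+```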
+
+## Provider Configuration
+
+| Provider | Default Model | Env Var | Base URL Override |
+|----------|---------------|---------|-------------------|
+| OpenAI | `gpt-5.2` | `OPENAI_API_KEY` | `OPENPLANTER_OPENAI_BASE_URL` |
+| Anthropic | `claude-opus-4-6` | `ANTHROPIC_API_KEY` | `OPENPLANTER_ANTHROPIC_BASE_URL` |
+| OpenRouter | `anthropic/claude-sonnet-4-5` | `OPENROUTER_API_KEY` | `OPENPLANTER_OPENROUTER_BASE_URL` |
+| Cerebras | `qwen-3-235b-a22b-instruct-2507` | `CEREBRAS_API_KEY` | `OPENPLANTER_CEREBRAS_BASE_URL` |
+
+**Service keys**: `EXA_API_KEY` (Exa web search), `VOYAGE_API_KEY` (Voyage embeddings).
+
+Key resolution priority (highest wins):
+1. CLI flags (`--openai-api-key`, etc.)
+2. Environment variables (`OPENAI_API_KEY` or `OPENPLANTER_OPENAI_API_KEY`)
+3. `.env` file in the workspace
+4. Workspace credential store (`.openplanter/credentials.json`)
+5. User credential store (`~/.openplanter/credentials.json`)
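+
+The resolution order amounts to a first-non-empty merge. A sketch under that assumption (the function and data below are illustrative, not the actual `credentials.py` API):
+
+```python
+def resolve_key(name: str, sources: list[dict[str, str]]) -> str | None:
+    """Return the first non-empty value for `name`, highest-priority source first."""
+    for source in sources:
+        value = source.get(name)
+        if value:
+            return value
+    return None
+
+
+# Ordered highest priority first: CLI flags, env vars, .env file,
+# workspace store, user store (empty dicts mean "not set there").
+sources = [
+    {},                                    # CLI flags
+    {"OPENAI_API_KEY": "sk-from-env"},     # environment variables
+    {"OPENAI_API_KEY": "sk-from-dotenv"},  # .env file
+    {},                                    # workspace store
+    {},                                    # user store
+]
+key = resolve_key("OPENAI_API_KEY", sources)
+```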
+
+## Boundaries
+
+- **Always**: Run `python -m pytest tests/ --ignore=tests/test_live_models.py --ignore=tests/test_integration_live.py` before committing
+- **Always**: Maintain provider abstraction -- all LLM calls go through `model.py`
+- **Always**: Keep skill scripts stdlib-only (no pip dependencies in `skills/`)
+- **Ask**: Before adding new runtime dependencies to `pyproject.toml`
+- **Ask**: Before changing the tool schema format in `tool_defs.py` (affects all providers)
+- **Ask**: Before modifying credential resolution order in `credentials.py`
+- **Never**: Commit API keys, `.env` files, or `credentials.json`
+- **Never**: Import provider-specific SDKs outside `model.py`
+- **Never**: Break the `--ignore` convention for live tests (CI must run without API keys)
+
+## Troubleshooting
+
+- **No API keys found**: Run `openplanter-agent --configure-keys` or set env vars. Keys are resolved from 5 sources (see Provider Configuration).
+- **Docker can't find keys**: Copy `.env.example` to `.env` and fill in keys. The container reads `.env` via `env_file` in `docker-compose.yml`.
+- **Tests fail with import errors**: Ensure editable install with `pip install -e .` from the project root.
+- **Live tests fail**: Expected if no API keys are set. Use `--ignore=tests/test_live_models.py --ignore=tests/test_integration_live.py` to skip them.
+- **Session state corruption**: Delete `.openplanter/` in the workspace directory to reset.
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 00000000..cf249b5c
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,337 @@
+# Architecture
+
+This document describes the high-level architecture of OpenPlanter, a recursive
+language model investigation agent with a terminal UI. It is intended for
+developers extending the codebase and AI agents navigating it. Read this first,
+then use symbol search (`Cmd+T` / `osgrep`) to locate specifics.
+
+## Bird's Eye View
+
+OpenPlanter solves a specific problem: given a workspace full of heterogeneous
+datasets (corporate registries, campaign finance records, lobbying disclosures,
+government contracts), resolve entities across them and surface non-obvious
+connections through evidence-backed analysis.
+
+The core paradigm is a **recursive language model agent loop**. A user submits
+an objective. The engine feeds it to an LLM with tool definitions. The LLM
+returns tool calls (read files, run shell commands, search the web, spawn
+sub-agents). The engine executes them, appends observations, and loops until
+the LLM produces a final text answer or the step budget is exhausted.
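+
+Stripped of streaming, recursion, and budget handling, that loop has roughly this shape (`Turn`, `ScriptedModel`, and `solve` are illustrative stand-ins, not the real engine API):
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Turn:
+    text: str = ""
+    tool_calls: list = field(default_factory=list)
+
+
+class ScriptedModel:
+    """Stand-in model: one tool-call turn, then a final text answer."""
+
+    def __init__(self) -> None:
+        self.turns = [Turn(tool_calls=["read_file"]), Turn(text="done")]
+
+    def complete(self, conversation: list) -> Turn:
+        return self.turns.pop(0)
+
+
+def solve(model, dispatch, objective: str, max_steps: int = 50) -> str:
+    """Minimal agent loop: call the model, run its tool calls, repeat."""
+    conversation = [{"role": "user", "content": objective}]
+    for _ in range(max_steps):
+        turn = model.complete(conversation)
+        if not turn.tool_calls:
+            return turn.text                      # no tools requested: final answer
+        for call in turn.tool_calls:
+            observation = dispatch(call)          # execute the tool
+            conversation.append({"role": "tool", "content": observation})
+    return "step budget exhausted"
+
+
+answer = solve(ScriptedModel(), lambda call: "ran " + call, "find the link")
+```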
+
+Key design principles:
+
+- **Provider-agnostic.** OpenAI, Anthropic, OpenRouter, and Cerebras are
+ first-class providers behind a shared `BaseModel` protocol. No SDK
+ dependencies -- all HTTP is raw `urllib`.
+- **Recursive delegation.** The top-level agent can spawn sub-agents via
+ `subtask` (same or lower-tier model) and `execute` (leaf executor, cheapest
+ model). Sub-agents share workspace state but get independent conversations.
+- **Minimal runtime dependencies.** Only three PyPI packages: `rich`,
+  `prompt_toolkit`, and `pyfiglet` (all for the TUI). The agent core,
+  model layer, and tools use only the Python standard library.
+- **Workspace-sandboxed.** All file operations are confined to the workspace
+ root. Path traversal is blocked at the tool layer.
+
+```
+ +----------------+
+ |   CLI / TUI    |  __main__.py, tui.py
+ +--------+-------+
+          | objective
+ +--------v-------+
+ | SessionRuntime |  runtime.py -- persistence, replay logging
+ +--------+-------+
+          | solve()
+ +--------v-------+
+ |   RLMEngine    |  engine.py -- recursive step loop
+ |  +-----------+ |
+ |  | BaseModel | |  model.py -- provider-agnostic LLM protocol
+ |  +-----------+ |
+ |  +-----------+ |
+ |  |   Tools   | |  tools.py -- workspace I/O, shell, web search
+ |  +-----------+ |
+ +----------------+
+```
+
+## High-Level Data Flow
+
+```
+User objective (text)
+ |
+ v
+__main__.main() -- parse args, load credentials, build engine
+ |
+ v
+SessionRuntime.solve() -- open/resume session, wrap event callbacks
+ |
+ v
+RLMEngine.solve_with_context() -- enter recursive step loop
+ |
+ +-> model.complete(conversation) -- LLM API call (SSE streaming)
+ | returns ModelTurn { tool_calls[], text, tokens }
+ |
+ +-> if tool_calls: dispatch each via _apply_tool_call()
+ | +- file tools -> WorkspaceTools methods
+ | +- shell tools -> subprocess with timeout
+ | +- web tools -> Exa API
+ | +- subtask -> _solve_recursive(depth+1, same/lower model)
+ | +- execute -> _solve_recursive(depth+1, cheapest model)
+ | +- think -> no-op (recorded as observation)
+ |
+ +-> append observations to conversation, loop
+ |
+ +-> if no tool_calls + text present: return final answer
+```
+
+## Codemap
+
+### `agent/` -- Core Agent Package (~6,200 lines)
+
+| File | Lines | Purpose |
+|------|------:|---------|
+| `engine.py` | 935 | Recursive step loop (`RLMEngine`), tool dispatch, context condensation, sub-agent spawning, acceptance criteria judging, budget warnings, plan injection |
+| `model.py` | 1020 | Provider-agnostic LLM abstraction: `BaseModel` protocol, `OpenAICompatibleModel`, `AnthropicModel`, SSE streaming, model listing APIs, `ScriptedModel` for tests |
+| `tools.py` | 845 | Workspace-sandboxed tool implementations: file I/O, shell execution (fg/bg), ripgrep search, repo map with symbol extraction, Exa web search, parallel write conflict detection |
+| `tui.py` | 820 | Rich terminal UI: ASCII splash art, thinking display with streaming, step tree rendering, slash commands (`/model`, `/reasoning`, `/status`), `RichREPL` main loop |
+| `__main__.py` | 585 | CLI entry point: argparse, credential loading cascade, provider resolution, engine construction, headless task mode, plain REPL fallback |
+| `tool_defs.py` | 537 | Provider-neutral JSON schemas for all 19 tools, converters to OpenAI and Anthropic formats, strict-mode enforcement for OpenAI |
+| `prompts.py` | 350 | System prompt assembly: base prompt (epistemic discipline, hard rules, data ingestion), recursive REPL section, acceptance criteria section, demo mode section |
+| `runtime.py` | 345 | Session lifecycle: `SessionStore` (create/resume/list sessions, persist state, append JSONL events, write artifacts), `SessionRuntime` (wraps engine with persistence) |
+| `credentials.py` | 270 | Credential management: `CredentialBundle`, `CredentialStore` (workspace-level), `UserCredentialStore` (`~/.openplanter/`), `.env` parsing, interactive prompting |
+| `patching.py` | 260 | Codex-style patch parser and applier: `AddFileOp`, `DeleteFileOp`, `UpdateFileOp`, whitespace-normalized subsequence matching |
+| `builder.py` | 195 | Engine/model factory: provider inference from model name, model construction, `build_engine()`, `build_model_factory()` for sub-agent creation |
+| `settings.py` | 115 | Persistent workspace defaults: `PersistentSettings`, `SettingsStore` (`.openplanter/settings.json`), per-provider model defaults |
+| `demo.py` | 110 | Demo mode: `DemoCensor` replaces workspace path segments with block characters, `DemoRenderHook` intercepts Rich renderables before display |
+| `config.py` | 103 | `AgentConfig` dataclass with ~30 fields, `from_env()` factory reading `OPENPLANTER_*` environment variables |
+| `replay_log.py` | 95 | `ReplayLogger`: delta-encoded JSONL log of every LLM API call for replay/debugging, child loggers for subtask conversations |
+| `__init__.py` | 35 | Public API re-exports |
+
+### `tests/` -- Test Suite (~8,000 lines, 25 files)
+
+Tests run under pytest, with no test dependencies beyond pytest itself. `conftest.py` provides
+`_tc()` shorthand for `ToolCall` creation and `mock_openai_stream` /
+`mock_anthropic_stream` helpers that convert non-streaming response dicts into
+SSE event lists for monkey-patching. Key test files:
+
+- `test_user_stories.py` (1115 lines) -- end-to-end user story scenarios
+- `test_model_complex.py` (836 lines) -- provider model edge cases
+- `test_engine_complex.py` (645 lines) -- recursive delegation, budget, judging
+- `test_integration.py` (642 lines) -- full solve cycles with `ScriptedModel`
+- `test_patching.py` / `test_patching_complex.py` -- Codex patch format
+- `test_live_models.py` / `test_integration_live.py` -- live API tests (skipped by default)
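+
+A simplified picture of what the stream helpers do: turn a complete response dict into the SSE data lines a streaming parser consumes. The real `conftest.py` helpers also handle tool calls and provider differences; this sketch is illustrative only.
+
+```python
+import json
+
+
+def to_sse_events(response: dict) -> list[str]:
+    """Convert a non-streaming chat response into minimal SSE data lines."""
+    content = response["choices"][0]["message"]["content"]
+    events = []
+    for word in content.split():  # crude word-level chunking, enough for tests
+        delta = {"choices": [{"delta": {"content": word + " "}}]}
+        events.append("data: " + json.dumps(delta))
+    events.append("data: [DONE]")
+    return events
+
+
+events = to_sse_events({"choices": [{"message": {"content": "final answer"}}]})
+```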
+
+### `skills/openplanter/` -- Claude Code Skill
+
+Investigation methodology extracted for use as a Claude Code skill. Contains:
+- `SKILL.md` -- Epistemic framework, entity resolution protocol, Admiralty
+ confidence tiers, ACH methodology, output standards
+- `scripts/` -- Python stdlib-only helpers: `init_workspace.py`,
+ `entity_resolver.py`, `cross_reference.py`, `evidence_chain.py`,
+ `confidence_scorer.py`
+- `references/` -- Entity resolution patterns, investigation methodology,
+ output templates
+
+## Named Entities
+
+### Core Types
+
+- `RLMEngine` -- The recursive language model engine. Central class that owns
+ the step loop, tool dispatch, sub-agent spawning, and budget management.
+- `BaseModel` -- Protocol defining the LLM interface: `create_conversation`,
+ `complete`, `append_assistant_turn`, `append_tool_results`.
+- `OpenAICompatibleModel` -- Covers OpenAI, OpenRouter, and Cerebras providers.
+ Handles SSE streaming, reasoning effort, strict tool schemas.
+- `AnthropicModel` -- Anthropic-specific model with thinking/adaptive mode,
+ content blocks, and tool_use blocks.
+- `Conversation` -- Opaque message list wrapper. Provider-specific internals
+ hidden behind a common interface.
+- `ModelTurn` -- One assistant response: tool calls, text, stop reason, token
+ counts, raw response for round-tripping.
+- `ToolCall` / `ToolResult` -- Request/response pair for tool invocations.
+- `WorkspaceTools` -- All 19 tool implementations, workspace-sandboxed.
+- `AgentConfig` -- ~30-field dataclass for all runtime configuration.
+- `ExternalContext` -- Accumulates observations across recursive calls for
+ cross-depth context sharing.
+- `SessionRuntime` -- Wraps `RLMEngine` with session persistence, event
+ logging, and replay capture.
+- `SessionStore` -- Filesystem-backed session storage under `.openplanter/sessions/`.
+- `CredentialBundle` -- Six API keys (OpenAI, Anthropic, OpenRouter, Cerebras,
+ Exa, Voyage) with merge and serialization logic.
+- `PersistentSettings` -- Workspace-level defaults for model and reasoning effort.
+- `ReplayLogger` -- Delta-encoded JSONL logger for LLM call replay.
+
+### Key Functions
+
+- `build_engine()` in `builder.py` -- Constructs `RLMEngine` from `AgentConfig`.
+- `build_model_factory()` -- Returns a callable that creates models by name,
+ used by the engine to spawn sub-agents at different tiers.
+- `build_system_prompt()` in `prompts.py` -- Assembles the system prompt from
+ base + optional recursive/acceptance/demo sections.
+- `get_tool_definitions()` in `tool_defs.py` -- Returns filtered tool schemas
+ based on mode (recursive vs flat, with/without acceptance criteria).
+- `_solve_recursive()` -- The inner step loop in `RLMEngine`. Manages the
+ conversation, dispatches tool calls, handles budget warnings, context
+ condensation, and plan injection.
+- `_model_tier()` -- Maps model names to capability tiers (1=opus, 2=sonnet,
+ 3=haiku) for delegation policy enforcement.
+- `_lowest_tier_model()` -- Returns the cheapest model name for `execute` calls.
+- `infer_provider_for_model()` -- Regex-based provider inference from model name.
+
+## Architectural Invariants
+
+1. **All file access is workspace-sandboxed.** `WorkspaceTools._resolve_path()`
+ raises `ToolError` if a resolved path escapes the workspace root. There are
+ no exceptions to this.
+
+2. **Existing files cannot be overwritten without being read first.**
+ `write_file()` blocks writes to existing files not in `_files_read`. This
+ prevents the LLM from destroying workspace data by hallucinating content.
+
+3. **Sub-agents can only delegate DOWN the model tier chain.** `subtask()`
+ enforces that the requested model's tier is >= the current model's tier
+ (opus -> sonnet -> haiku, never haiku -> opus). This prevents cost explosions.
+
+4. **No SDK dependencies for LLM providers.** All HTTP is raw
+ `urllib.request` with manual JSON serialization and SSE parsing. This is
+ deliberate -- it eliminates version conflicts and keeps the dependency
+ footprint minimal.
+
+5. **Shell commands cannot use heredocs or interactive programs.** Runtime
+ policy in `WorkspaceTools._check_shell_policy()` blocks `<< EOF` syntax
+ and programs like `vim`, `nano`, `less`, `top`. These would hang the
+ non-interactive environment.
+
+6. **Identical shell commands are blocked after 2 repetitions at the same
+ depth.** `_runtime_policy_check()` in the engine prevents infinite retry
+ loops.
+
+7. **Tool definitions are the single source of truth.** `TOOL_DEFINITIONS` in
+ `tool_defs.py` is the canonical list. `to_openai_tools()` and
+ `to_anthropic_tools()` are pure converters. Tool behavior in `engine.py`
+ must match the schemas in `tool_defs.py`.
+
+8. **Session state is append-only JSONL.** Events are appended to
+ `events.jsonl`, never rewritten. State snapshots go to `state.json`.
+ Replay logs go to `replay.jsonl` with delta encoding.
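+
+Invariant 1 can be sketched as follows (`resolve_path` is an illustrative stand-in for `WorkspaceTools._resolve_path()`, not its actual code):
+
+```python
+from pathlib import Path
+
+
+class ToolError(Exception):
+    """Raised when a tool request violates workspace policy."""
+
+
+def resolve_path(workspace_root: str, candidate: str) -> Path:
+    """Resolve `candidate` inside the workspace; reject paths that escape it."""
+    root = Path(workspace_root).resolve()
+    target = (root / candidate).resolve()
+    if target != root and root not in target.parents:
+        raise ToolError("path escapes workspace: " + candidate)
+    return target
+```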
+
+## Layer Boundaries
+
+### CLI Entry (`__main__.py`) -> Engine (`engine.py`)
+
+The CLI parses arguments, resolves credentials through a 5-level cascade
+(CLI flags > env vars > `.env` file > workspace store > user store), builds
+an `AgentConfig`, and calls `build_engine()`. It never touches the model or
+tools directly. The `ChatContext` dataclass bundles `SessionRuntime`, `AgentConfig`,
+and `SettingsStore` for the TUI layer.
+
+### Engine (`engine.py`) -> Model (`model.py`)
+
+The engine interacts with models exclusively through the `BaseModel` protocol.
+It never constructs HTTP requests or parses provider-specific responses.
+Streaming deltas are forwarded via the `on_content_delta` callback, which
+the engine installs only for depth-0 calls.
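+
+The callback handoff can be pictured like this (a toy stand-in; the real `BaseModel.complete` signature differs):
+
+```python
+from typing import Callable
+
+
+class StreamingStub:
+    """Toy model showing how content deltas reach the engine via a callback."""
+
+    def complete(self, conversation: list,
+                 on_content_delta: Callable[[str], None] | None = None) -> str:
+        for chunk in ["Working", " on", " it"]:  # parsed from SSE in the real layer
+            if on_content_delta is not None:
+                on_content_delta(chunk)          # installed for depth-0 calls only
+        return "Working on it"
+
+
+received: list[str] = []
+final = StreamingStub().complete([], on_content_delta=received.append)
+```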
+
+### Engine (`engine.py`) -> Tools (`tools.py`)
+
+Tool dispatch happens in `_apply_tool_call()`, a ~200-line method that
+pattern-matches on tool name and delegates to `WorkspaceTools` methods.
+Tools return `(is_final: bool, observation: str)`. The engine clips
+observations to `max_observation_chars` and appends them to the conversation.
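+
+The `(is_final, observation)` contract, in sketch form (tool names and the clipping limit here are illustrative):
+
+```python
+def apply_tool_call(name: str, args: dict,
+                    max_observation_chars: int = 2000) -> tuple[bool, str]:
+    """Toy dispatcher mirroring the (is_final, observation) contract."""
+    if name == "final_answer":
+        return True, args.get("text", "")
+    if name == "read_file":
+        observation = "contents of " + args["path"]
+    else:
+        observation = "unknown tool: " + name
+    return False, observation[:max_observation_chars]  # clip before appending
+```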
+
+### Session Layer (`runtime.py`) -> Engine
+
+`SessionRuntime.solve()` wraps `RLMEngine.solve_with_context()` with event
+persistence, replay logging, and patch artifact capture. It manages the
+`ExternalContext` across multiple `solve()` calls within a session.
+
+## Cross-Cutting Concerns
+
+### Credential Management
+
+Six API keys flow through a 5-level resolution cascade defined in
+`_load_credentials()` in `__main__.py`:
+
+1. CLI flags (`--openai-api-key`, etc.)
+2. Environment variables (`OPENAI_API_KEY` or `OPENPLANTER_OPENAI_API_KEY`)
+3. `.env` file in workspace root
+4. Workspace credential store (`.openplanter/credentials.json`)
+5. User credential store (`~/.openplanter/credentials.json`)
+
+Higher levels override lower. Credential files are chmod 600.
+
+### Demo Mode
+
+When `--demo` is active, three mechanisms cooperate:
+1. `DemoCensor` replaces workspace path segments (excluding generic parts
+ like "Users", "Desktop") with block characters in all TUI output.
+2. `DemoRenderHook` intercepts Rich renderables before display.
+3. The system prompt instructs the LLM to censor entity names in its own
+ output using block characters.
+
+### Context Condensation
+
+When input tokens exceed 75% of the model's context window, the engine calls
+`condense_conversation()` on the model. Both `OpenAICompatibleModel` and
+`AnthropicModel` implement this by replacing old tool result contents with
+`[earlier tool output condensed]`, preserving required IDs for API compliance.
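+
+In sketch form (the message shape and `keep_last` cutoff are illustrative; the real implementations are provider-specific):
+
+```python
+def condense(messages: list[dict], keep_last: int = 2) -> list[dict]:
+    """Replace older tool-result contents while preserving their IDs."""
+    cutoff = len(messages) - keep_last
+    condensed = []
+    for i, msg in enumerate(messages):
+        if msg.get("role") == "tool" and i < cutoff:
+            msg = {**msg, "content": "[earlier tool output condensed]"}  # id kept
+        condensed.append(msg)
+    return condensed
+
+
+history = [
+    {"role": "tool", "tool_call_id": "a", "content": "500 lines of CSV"},
+    {"role": "assistant", "content": "Cross-referencing..."},
+    {"role": "tool", "tool_call_id": "b", "content": "fresh output"},
+]
+trimmed = condense(history)
+```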
+
+### Budget Management
+
+The engine injects timestamp, step counter, and context usage tags into the
+first tool result of each step. When the step budget falls below 50%, a
+warning is appended. Below 25%, a critical warning demands immediate output.
+The system prompt reinforces these constraints.
+
+### Parallel Execution
+
+`subtask` and `execute` tool calls are dispatched in parallel via
+`ThreadPoolExecutor`. A parallel write conflict detector in `WorkspaceTools`
+prevents sibling sub-agents from writing to the same file. All other tools
+run sequentially.
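+
+A minimal sketch of the parallel dispatch (the real engine also routes each objective to a sub-agent and runs the write conflict check):
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+
+def run_subtasks(objectives: list[str]) -> list[str]:
+    """Dispatch sibling sub-agent objectives in parallel; results keep input order."""
+    with ThreadPoolExecutor(max_workers=4) as pool:
+        return list(pool.map(lambda obj: "result: " + obj, objectives))
+
+
+results = run_subtasks(["trace donations", "map subsidiaries"])
+```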
+
+### Session Persistence
+
+Each session lives under `.openplanter/sessions/{session_id}/` and contains:
+- `metadata.json` -- timestamps, workspace path
+- `state.json` -- serialized `ExternalContext` observations
+- `events.jsonl` -- append-only trace of all objectives, steps, and results
+- `replay.jsonl` -- delta-encoded LLM call log for exact replay
+- `artifacts/` -- captured patches and other artifacts
+- `*.plan.md` -- investigation plans (newest is auto-injected into context)
+
+### Acceptance Criteria
+
+When enabled, `subtask` and `execute` require an `acceptance_criteria`
+parameter. After the child agent completes, a lightweight judge model
+(cheapest tier) evaluates the result against the criteria and appends
+`PASS` or `FAIL`. The system prompt includes the IMPLEMENT-THEN-VERIFY
+pattern to enforce uncorrelated verification.
+
+## Common Questions
+
+**Where do I add a new tool?**
+1. Add the schema to `TOOL_DEFINITIONS` in `tool_defs.py`.
+2. Implement the method in `WorkspaceTools` in `tools.py`.
+3. Add the dispatch case in `_apply_tool_call()` in `engine.py`.
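+
+For example, a hypothetical `word_count` tool might look like this at each step (the schema shape is illustrative; copy the exact format from existing entries in `tool_defs.py`):
+
+```python
+# Step 1: schema in tool_defs.py (shape is illustrative)
+WORD_COUNT_DEF = {
+    "name": "word_count",
+    "description": "Count words in a workspace file.",
+    "parameters": {
+        "type": "object",
+        "properties": {"path": {"type": "string"}},
+        "required": ["path"],
+    },
+}
+
+
+# Step 2: implementation (a WorkspaceTools method in reality; a function here)
+def word_count(text: str) -> int:
+    return len(text.split())
+
+
+# Step 3: dispatch case (mirrors _apply_tool_call; read_text is a stand-in)
+def dispatch(name: str, args: dict, read_text) -> str:
+    if name == "word_count":
+        return str(word_count(read_text(args["path"])))
+    raise ValueError("unknown tool: " + name)
+```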
+
+**Where do I add a new LLM provider?**
+1. Add the model class in `model.py` (implement `BaseModel` protocol).
+2. Add provider regex to `builder.py` and extend `build_engine()` / `build_model_factory()`.
+3. Add credential fields to `CredentialBundle`, `AgentConfig`, and `__main__.py`.
+4. Add default model to `PROVIDER_DEFAULT_MODELS` in `config.py`.
+
+**Where is the system prompt?**
+`prompts.py`. It is assembled from four sections: `SYSTEM_PROMPT_BASE` (always),
+`RECURSIVE_SECTION` (if recursive mode), `ACCEPTANCE_CRITERIA_SECTION` (if
+acceptance criteria enabled), `DEMO_SECTION` (if demo mode).
+
+**How does sub-agent model routing work?**
+`_model_tier()` assigns tiers (1=opus, 2=sonnet, 3=haiku). `subtask` can
+request any model at equal or lower tier. `execute` always uses
+`_lowest_tier_model()` (haiku for Claude, same model for OpenAI). The
+`model_factory` in `RLMEngine` creates model instances on demand, cached by
+`(model_name, reasoning_effort)` tuple.
+
+**How do I run tests?**
+`python -m pytest tests/` (skip live tests with `--ignore=tests/test_live_models.py
+--ignore=tests/test_integration_live.py`). Tests use `ScriptedModel` and
+monkey-patched `_http_stream_sse` -- no API keys required.
diff --git a/CLAUDE.md b/CLAUDE.md
index 2856e623..9f17fe3a 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,3 +1,63 @@
+# OpenPlanter
+
+Recursive LLM investigation agent for entity resolution across heterogeneous datasets (corporate registries, campaign finance, lobbying disclosures). Builds evidence chains through autonomous sub-agent delegation. Supports 4 providers: OpenAI, Anthropic, OpenRouter, Cerebras.
+
+## Stack
+
+Python 3.10+, rich, prompt_toolkit, pyfiglet. Skill scripts use stdlib only (zero deps).
+
+## Commands
+
+- **Install**: `pip install -e .`
+- **Run**: `openplanter-agent --workspace DIR`
+- **Headless**: `openplanter-agent --task "objective" --workspace DIR`
+- **Test**: `python -m pytest tests/ --ignore=tests/test_live_models.py --ignore=tests/test_integration_live.py`
+- **Docker**: `docker compose up`
+
+## Structure
+
+- `agent/` -- core engine, provider abstraction, tools, TUI
+- `tests/` -- unit and integration tests
+- `skills/openplanter/` -- Claude Code skill (stdlib-only scripts)
+- `skills/openplanter/.claude/agents/` -- agent definitions (investigation-agent, verifier-agent)
+
+## Key Files
+
+| File | Purpose |
+|------|---------|
+| `agent/engine.py` | Recursive investigation engine |
+| `agent/model.py` | Provider-agnostic LLM abstraction |
+| `agent/tools.py` | 19 workspace tools (file I/O, shell, web, delegation) |
+| `agent/prompts.py` | System prompt construction |
+| `agent/tui.py` | Rich terminal UI |
+
+## Conventions
+
+- Agent code uses rich/prompt_toolkit for TUI
+- Skill scripts use Python stdlib only -- no third-party imports
+- `.claude/settings.json` disables AI commit attribution
+
+## Provider Config
+
+| Provider | Env Var |
+|----------|---------|
+| OpenAI | `OPENAI_API_KEY` |
+| Anthropic | `ANTHROPIC_API_KEY` |
+| OpenRouter | `OPENROUTER_API_KEY` |
+| Cerebras | `CEREBRAS_API_KEY` |
+
+Additional: `EXA_API_KEY` (web search), `VOYAGE_API_KEY` (embeddings), `FEC_API_KEY` (FEC API, optional). All keys support `OPENPLANTER_` prefix.
+
+## On-Demand References
+
+Read these when working on the skill or investigation methodology:
+- `skills/openplanter/SKILL.md` -- full methodology, entity resolution protocol, confidence tiers
+- `skills/openplanter/references/public-records-apis.md` -- API endpoints, auth, rate limits
+- `skills/openplanter/references/entity-resolution-patterns.md` -- normalization tables, suffix maps
+- `skills/openplanter/references/investigation-methodology.md` -- epistemic framework, ACH, swarm roles
+- `skills/openplanter/.claude/agents/investigation-agent.md` -- investigation agent decision tree
+- `skills/openplanter/.claude/agents/verifier-agent.md` -- independent verification protocol
+
## Stop Checklist
Before completing a task, check each item:
diff --git a/skills/openplanter/.claude/agents/investigation-agent.md b/skills/openplanter/.claude/agents/investigation-agent.md
new file mode 100644
index 00000000..cfe70eca
--- /dev/null
+++ b/skills/openplanter/.claude/agents/investigation-agent.md
@@ -0,0 +1,89 @@
+---
+name: investigation-agent
+description: "OpenPlanter investigation agent for cross-dataset entity resolution, evidence chain construction, and structured OSINT analysis. Routes between skill scripts and full RLM delegation based on task complexity."
+tools:
+ - Bash
+ - Read
+ - Write
+ - Edit
+ - Glob
+ - Grep
+ - Task
+ - WebFetch
+---
+
+# Investigation Agent
+
+You are an OpenPlanter investigation agent. Your job is to cross-reference datasets, resolve entities, build evidence chains, and produce confidence-scored findings using the OpenPlanter methodology.
+
+## Decision Tree: Scripts vs. RLM Delegation
+
+**Use skill scripts directly** when:
+- 1-2 datasets to cross-reference
+- Entity resolution + cross-referencing only
+- No web research needed
+- Fewer than 20 reasoning steps
+
+**Delegate to RLM** (`delegate_to_rlm.py`) when:
+- 3+ datasets require cross-referencing
+- Web search is required for entity enrichment
+- Iterative exploration with hypothesis refinement
+- 20+ reasoning steps or multi-stage investigation
+
+**Use the full pipeline** (`investigate.py`) when:
+- End-to-end investigation from raw data to findings report
+- Multiple phases need orchestration
+
+## Available Scripts
+
+All scripts are in `~/.claude/skills/openplanter/scripts/`. Run via `python3`.
+
+| Script | Purpose |
+|--------|---------|
+| `init_workspace.py` | Create workspace directory structure |
+| `entity_resolver.py` | Fuzzy entity matching → `entities/canonical.json` |
+| `cross_reference.py` | Link records across datasets → `findings/cross-references.json` |
+| `evidence_chain.py` | Validate evidence chain structure |
+| `confidence_scorer.py` | Score findings by Admiralty tiers |
+| `dataset_fetcher.py` | Download bulk public datasets (SEC, FEC, OFAC, LDA) |
+| `web_enrich.py` | Enrich entities via Exa neural search |
+| `scrape_records.py` | Fetch entity records from government APIs |
+| `delegate_to_rlm.py` | Spawn full OpenPlanter agent for complex tasks |
+| `investigate.py` | Run full pipeline: collect → resolve → enrich → analyze → report |
+
+## RLM Delegation — Provider-Agnostic
+
+The RLM agent supports any LLM provider. Provider is auto-inferred from the model name:
+
+```bash
+# Anthropic (default)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model claude-sonnet-4-5-20250929
+
+# OpenAI
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model gpt-4o
+
+# OpenRouter (any model via slash routing)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model anthropic/claude-sonnet-4-5
+
+# Cerebras (specify --provider when model name lacks provider substring)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model qwen-3-235b-a22b-instruct-2507 --provider cerebras
+```
+
+API keys pass through environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`, `CEREBRAS_API_KEY` (or `OPENPLANTER_`-prefixed variants).
+
+## Epistemic Rules
+
+1. Ground truth comes from files, not memory. Read data before modifying.
+2. Success does not mean correctness. Verify outcomes, not exit codes.
+3. Three failures = wrong approach. Change strategy entirely.
+4. Produce artifacts early. Write a first draft, then iterate.
+5. Implementation and verification must be uncorrelated.
+
+## Confidence Tiers (Admiralty System)
+
+| Tier | Criteria |
+|------|----------|
+| Confirmed | 2+ independent sources, hard signal match (EIN, phone) |
+| Probable | Strong single source, high fuzzy match (>0.85) |
+| Possible | Circumstantial only, moderate match (0.55-0.84) |
+| Unresolved | Contradictory evidence, insufficient data |
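+
+A rough translation of the table into code, with thresholds taken from the rows above (the real `confidence_scorer.py` logic may differ):
+
+```python
+def confidence_tier(sources: int, hard_signal: bool,
+                    fuzzy: float, contradicted: bool) -> str:
+    """Map evidence features onto the Admiralty-style tiers in the table."""
+    if contradicted:
+        return "Unresolved"            # contradictory evidence trumps everything
+    if sources >= 2 and hard_signal:
+        return "Confirmed"             # 2+ independent sources, hard signal match
+    if fuzzy > 0.85:
+        return "Probable"              # strong single source, high fuzzy match
+    if 0.55 <= fuzzy <= 0.84:
+        return "Possible"              # circumstantial, moderate match
+    return "Unresolved"                # insufficient data
+```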
diff --git a/skills/openplanter/.claude/agents/verifier-agent.md b/skills/openplanter/.claude/agents/verifier-agent.md
new file mode 100644
index 00000000..7e385853
--- /dev/null
+++ b/skills/openplanter/.claude/agents/verifier-agent.md
@@ -0,0 +1,84 @@
+---
+name: verifier-agent
+description: "Read-only verification agent for OpenPlanter investigations. Validates evidence chains, spot-checks entity resolution, and verifies confidence scores independently from analysis agents."
+tools:
+ - Bash
+ - Read
+ - Glob
+ - Grep
+---
+
+# Verifier Agent
+
+You are an independent verification agent. You receive investigation output files and validation criteria. You have **no shared context** from the analysis phase — this is deliberate, to maintain uncorrelated verification per OpenPlanter's epistemic framework.
+
+## Verification Protocol
+
+1. **Load output files fresh.** Read `entities/canonical.json`, `findings/cross-references.json`, `evidence/chains.json`, and `evidence/scoring-log.json` from the workspace.
+
+2. **Spot-check N random records** against raw source data in `datasets/`. For each:
+ - Verify the entity name appears in the claimed source file
+ - Verify the linking fields match
+ - Verify the match score is plausible
+
+3. **Verify row counts** match expectations:
+ - Count entities in canonical.json vs. raw dataset rows
+ - Count cross-references vs. entities with multi-source presence
+ - Check that no records were silently dropped
+
+4. **Run validation scripts** (note: `evidence_chain.py` writes `evidence/validation-report.json` as a side effect; `confidence_scorer.py` must use `--dry-run` to avoid mutating workspace):
+ ```bash
+ python3 ~/.claude/skills/openplanter/scripts/evidence_chain.py /path/to/workspace
+ python3 ~/.claude/skills/openplanter/scripts/confidence_scorer.py /path/to/workspace --dry-run
+ ```
+
+5. **Report raw output only.** Do not interpret or justify findings — report what you observe.
+
+## What to Check
+
+- **Entity resolution**: Are high-confidence matches actually the same entity? Sample 5 random confirmed matches and verify manually.
+- **Cross-references**: Do cross-referenced records actually share the claimed entity? Check linking fields.
+- **Evidence chains**: Are all hops documented with source records? Is the weakest link correctly identified?
+- **Confidence scores**: Do scores align with the Admiralty criteria? Are hard signal conflicts flagged as unresolved?
+- **Provenance**: Do all datasets have provenance metadata (source URL, timestamp, checksum)?
+
+## Anti-Bias Checks
+
+- **Confirmation bias**: Score hypotheses by inconsistency count, not confirmation count
+- **Circular reporting**: Verify independence of collection paths before counting corroborations
+- **Anchoring**: Do not pre-judge based on entity names or known associations
+
+## Output Format
+
+```markdown
+## Verification Report
+
+**Workspace:** /path/to/workspace
+**Verified:** YYYY-MM-DD HH:MM UTC
+
+### Entity Resolution
+- Records spot-checked: N
+- Correct matches: N
+- False positives: N
+- Issues: [list]
+
+### Cross-References
+- Records spot-checked: N
+- Valid links: N
+- Broken links: N
+- Issues: [list]
+
+### Evidence Chains
+- Chains validated: N
+- Valid: N
+- Invalid: N
+- Issues: [list]
+
+### Confidence Scores
+- Scores checked: N
+- Correctly assigned: N
+- Misassigned: N
+- Issues: [list]
+
+### Verdict: PASS / FAIL / PARTIAL
+```
diff --git a/skills/openplanter/README.md b/skills/openplanter/README.md
new file mode 100644
index 00000000..16fa1948
--- /dev/null
+++ b/skills/openplanter/README.md
@@ -0,0 +1,228 @@
+# OpenPlanter — Claude Code Skill
+
+A [Claude Code](https://docs.anthropic.com/en/docs/claude-code) skill that extracts OpenPlanter's investigation methodology into standalone, zero-dependency Python scripts. Resolve entities, cross-reference datasets, build evidence chains, and score confidence tiers — all from Claude Code's terminal.
+
+## Why a Skill?
+
+OpenPlanter is a full agent with TUI, LLM providers, recursive sub-tasks, and session management. The skill distills its **methodology** — entity resolution, evidence chain construction, confidence scoring — into lightweight scripts that run inside Claude Code without spinning up the full agent.
+
+Use the skill when you want OpenPlanter's analytical tradecraft in any Claude Code session, on any dataset, without configuring providers or launching the TUI.
+
+## Installation
+
+Copy the `skills/openplanter/` directory into your Claude Code skills folder:
+
+```bash
+# From this repo
+cp -r skills/openplanter ~/.claude/skills/openplanter
+```
+
+Or clone and symlink:
+
+```bash
+git clone https://github.com/ShinMegamiBoson/OpenPlanter.git
+ln -s "$(pwd)/OpenPlanter/skills/openplanter" ~/.claude/skills/openplanter
+```
+
+Claude Code discovers skills automatically from `~/.claude/skills/`.
+
+## Quick Start
+
+```bash
+# 1. Initialize a workspace
+python3 ~/.claude/skills/openplanter/scripts/init_workspace.py /tmp/investigation
+
+# 2. Add datasets
+cp campaign_finance.csv lobbying.json /tmp/investigation/datasets/
+
+# 3. Resolve entities across datasets
+python3 ~/.claude/skills/openplanter/scripts/entity_resolver.py /tmp/investigation
+
+# 4. Cross-reference linked records
+python3 ~/.claude/skills/openplanter/scripts/cross_reference.py /tmp/investigation
+
+# 5. Validate evidence chains
+python3 ~/.claude/skills/openplanter/scripts/evidence_chain.py /tmp/investigation
+
+# 6. Score confidence
+python3 ~/.claude/skills/openplanter/scripts/confidence_scorer.py /tmp/investigation
+```
+
+## Scripts
+
+All scripts use Python stdlib only. Zero external dependencies. Python 3.10+. External tools (Exa, Firecrawl, OpenPlanter agent) are invoked as subprocesses when needed.
+
+### Core Analysis
+
+| Script | Purpose |
+|--------|---------|
+| `init_workspace.py` | Create workspace directory structure (`datasets/`, `entities/`, `findings/`, `evidence/`, `plans/`) |
+| `entity_resolver.py` | Fuzzy entity matching with Union-Find clustering. Produces `entities/canonical.json` |
+| `cross_reference.py` | Link records across datasets using the canonical entity map. Produces `findings/cross-references.json` |
+| `evidence_chain.py` | Validate evidence chain structure (hops, corroboration, source records) |
+| `confidence_scorer.py` | Re-score findings by confidence tier. Updates JSON files in-place |
+
+### Data Collection & Enrichment
+
+| Script | Purpose |
+|--------|---------|
+| `dataset_fetcher.py` | Download bulk public datasets (SEC, FEC, OFAC, OpenSanctions, LDA) with provenance metadata |
+| `web_enrich.py` | Enrich resolved entities via Exa neural search (requires `exa-search` skill + `EXA_API_KEY`) |
+| `scrape_records.py` | Fetch entity-specific records from government APIs (SEC EDGAR, FEC, Senate LDA, USAspending) |
+
+### Orchestration & Delegation
+
+| Script | Purpose |
+|--------|---------|
+| `investigate.py` | Run full pipeline end-to-end: collect → resolve → enrich → analyze → report |
+| `delegate_to_rlm.py` | Spawn full OpenPlanter RLM agent for complex investigations (provider-agnostic) |
+
+### Entity Resolver
+
+Normalize → Block → Compare → Score → Cluster → Review:
+
+```bash
+# Default threshold (0.85)
+python3 scripts/entity_resolver.py /path/to/workspace
+
+# Lower threshold for wider matching
+python3 scripts/entity_resolver.py /path/to/workspace --threshold 0.70
+
+# Specify which columns contain entity names
+python3 scripts/entity_resolver.py /path/to/workspace --name-columns "name,contributor_name,registrant"
+```
+
+Name normalization includes: Unicode NFKD decomposition, diacritic stripping, legal suffix removal (LLC, Inc, Corp, Ltd, etc.), ampersand canonicalization (`&` → `and`), noise word removal (`the`, `a`, `an`, `of`), punctuation stripping, whitespace collapse.
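+
+A rough sketch of that normalization chain (illustrative only, with an abridged suffix list; not `entity_resolver.py`'s actual code):
+
+```python
+import re
+import unicodedata
+
+SUFFIXES = {"llc", "inc", "corp", "ltd", "co"}   # abridged legal-suffix list
+NOISE = {"the", "a", "an", "of"}                 # noise words
+
+def normalize_name(name: str) -> str:
+    s = unicodedata.normalize("NFKD", name)                    # NFKD decomposition
+    s = "".join(c for c in s if not unicodedata.combining(c))  # strip diacritics
+    s = s.lower().replace("&", " and ")                        # ampersand -> "and"
+    s = re.sub(r"[^\w\s]", " ", s)                             # strip punctuation
+    tokens = [t for t in s.split() if t not in NOISE and t not in SUFFIXES]
+    return " ".join(tokens)                                    # collapse whitespace
+```
+
+For example, "The Café & Sons, LLC" and "Cafe and Sons" both normalize to `cafe and sons`.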
+
+Blocking uses first-3-character keys with sorted-neighborhood cross-block comparison to reduce O(N^2) pairwise cost.
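+
+One plausible shape of that blocking step (a hedged sketch of the idea, not the script's exact algorithm):
+
+```python
+from collections import defaultdict
+
+def candidate_pairs(names: list[str], window: int = 5) -> set:
+    """Pair names that share a first-3-char block, plus a sorted-neighborhood
+    pass that catches near-misses across block boundaries."""
+    pairs = set()
+    blocks = defaultdict(list)
+    for n in names:
+        blocks[n[:3]].append(n)                  # first-3-character blocking key
+    for members in blocks.values():
+        for i, a in enumerate(members):
+            for b in members[i + 1:]:
+                pairs.add(tuple(sorted((a, b))))
+    ordered = sorted(names)                      # sorted-neighborhood pass
+    for i, a in enumerate(ordered):
+        for b in ordered[i + 1:i + window]:
+            pairs.add(tuple(sorted((a, b))))
+    return pairs
+```
+
+Only pairs emitted here are passed to the expensive similarity comparison, which is how the pipeline avoids scoring all O(N^2) combinations.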
+
+### Cross-Reference
+
+Requires `entities/canonical.json` (run entity resolver first):
+
+```bash
+# All datasets
+python3 scripts/cross_reference.py /path/to/workspace
+
+# Specific datasets only
+python3 scripts/cross_reference.py /path/to/workspace --datasets campaign.csv lobby.json
+
+# Require 3+ datasets for a match
+python3 scripts/cross_reference.py /path/to/workspace --min-datasets 3
+```
+
+### Confidence Scorer
+
+```bash
+# Dry run (show changes without modifying)
+python3 scripts/confidence_scorer.py /path/to/workspace --dry-run
+
+# Score and update in-place
+python3 scripts/confidence_scorer.py /path/to/workspace
+```
+
+## Confidence Tiers
+
+Based on the Admiralty System (NATO AJP-2.1):
+
+| Tier | Criteria |
+|------|----------|
+| **Confirmed** | 2+ independent sources with different collection paths AND high similarity (≥0.85); or hard signal match (EIN, phone) across sources |
+| **Probable** | Strong single source (official record); 2+ sources with moderate similarity; or hard signal with ≥0.70 similarity |
+| **Possible** | Circumstantial evidence only; moderate fuzzy match (0.55–0.84); single-source chain with ≤3 hops |
+| **Unresolved** | Contradictory evidence; conflicting hard identifiers; insufficient data; weak chain |
+
+Hard signals (EIN, TIN, phone, email) are verified for **agreement**, not just presence. Conflicting identifiers across variants force "unresolved" status.
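+
+The agreement check can be sketched as follows (hypothetical field names; a simplification, not the scorer's actual code):
+
+```python
+def hard_signal_status(records: list[dict], field: str = "ein") -> str:
+    """Verify a hard identifier for agreement, not mere presence:
+    conflicting values across entity variants force 'unresolved'."""
+    values = {r[field] for r in records if r.get(field)}
+    present = sum(1 for r in records if r.get(field))
+    if len(values) > 1:
+        return "unresolved"    # conflicting identifiers across variants
+    if len(values) == 1 and present >= 2:
+        return "confirmed"     # same identifier corroborated in 2+ records
+    return "inconclusive"      # identifier absent, or present only once
+```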
+
+## Workspace Structure
+
+```
+investigation/
+├── datasets/
+│ ├── *.csv, *.json # Source files
+│ ├── bulk/{source}/ # Bulk downloads (dataset_fetcher.py)
+│ └── scraped/{source}/ # API results (scrape_records.py)
+├── entities/
+│ ├── canonical.json # Entity resolution output
+│ └── enriched.json # Web enrichment (web_enrich.py)
+├── findings/
+│ ├── cross-references.json
+│ └── summary.md # Pipeline report (investigate.py)
+├── evidence/
+│ ├── chains.json
+│ └── scoring-log.json
+└── plans/
+ └── plan.md
+```
+
+## Methodology
+
+The skill encodes OpenPlanter's epistemic discipline:
+
+1. **Ground truth comes from files, not memory.** Read actual data before modifying.
+2. **Success does not mean correctness.** Verify outcomes, not exit codes.
+3. **Three failures = wrong approach.** Change strategy entirely.
+4. **Produce artifacts early.** Write a working first draft, then iterate.
+5. **Implementation and verification must be uncorrelated.** The agent that performs analysis must not be its sole verifier.
+
+See `SKILL.md` for the full methodology reference, including:
+- Entity Resolution Protocol (6-stage pipeline)
+- Evidence Chain Construction (Admiralty grading)
+- Analysis of Competing Hypotheses (ACH)
+- Key Assumptions Check
+- Multi-agent investigation patterns
+
+## Reference Documents
+
+| File | Contents |
+|------|----------|
+| `SKILL.md` | Full methodology — entity resolution, evidence chains, confidence scoring, ACH, multi-agent patterns |
+| `references/entity-resolution-patterns.md` | Normalization tables, suffix maps, address canonicalization |
+| `references/investigation-methodology.md` | Epistemic framework extracted from OpenPlanter's `prompts.py` |
+| `references/output-templates.md` | JSON/Markdown templates for investigation deliverables |
+| `references/public-records-apis.md` | API endpoints, auth, rate limits for SEC, FEC, LDA, OFAC, USAspending |
+
+## Integration Modes
+
+| Mode | When | How |
+|------|------|-----|
+| **Methodology Only** | 1-2 datasets, local analysis | Core scripts directly |
+| **Web-Enriched** | Need external data, public records | `dataset_fetcher.py` + `web_enrich.py` + `scrape_records.py` |
+| **Full RLM Delegation** | Complex investigations, 3+ datasets, web research | `delegate_to_rlm.py` (provider-agnostic: Anthropic, OpenAI, OpenRouter, Cerebras, Ollama) |
+| **Full Pipeline** | End-to-end with reporting | `investigate.py /path/to/workspace --phases all` |
+
+### Required API Keys
+
+| Key | Used By | Required |
+|-----|---------|----------|
+| `EXA_API_KEY` | `web_enrich.py` | For web enrichment |
+| `FEC_API_KEY` | `scrape_records.py` | Optional (`DEMO_KEY` fallback) |
+| `ANTHROPIC_API_KEY` | `delegate_to_rlm.py` | If using Anthropic models |
+| `OPENAI_API_KEY` | `delegate_to_rlm.py` | If using OpenAI models |
+| `OPENROUTER_API_KEY` | `delegate_to_rlm.py` | If using OpenRouter models |
+| `CEREBRAS_API_KEY` | `delegate_to_rlm.py` | If using Cerebras models |
+
+## Relation to the OpenPlanter Agent
+
+| | Agent (`openplanter-agent`) | Skill (`skills/openplanter/`) |
+|---|---|---|
+| Runtime | Full TUI + recursive sub-agents | Claude Code terminal |
+| Dependencies | `rich`, `prompt_toolkit`, LLM providers | Python stdlib only |
+| Entity resolution | LLM-assisted (via tool calls) | `difflib.SequenceMatcher` |
+| Web search | Exa API (built-in) | `web_enrich.py` (Exa via subprocess) |
+| Public records | Via shell tools | `scrape_records.py` + `dataset_fetcher.py` (urllib) |
+| Session management | Built-in persistence | Claude Code sessions |
+| Confidence scoring | Inline in prompts | Standalone scorer script |
+| Delegation | `subtask`/`execute` tools | `delegate_to_rlm.py` bridge |
+
+The skill complements the agent. Use the agent for full autonomous investigations. Use the skill for targeted analysis within existing Claude Code workflows. Use `delegate_to_rlm.py` to bridge both worlds.
+
+## Multi-Agent Investigation
+
+For complex investigations requiring parallel workstreams, the skill composes with [**minoan-swarm**](https://github.com/tdimino/claude-code-minoan/tree/main/skills/planning-productivity/minoan-swarm), a Claude Code skill for coordinating multi-agent teams around a shared task list. Separate the verifier agent from analysis agents to maintain uncorrelated verification.
+
+See `SKILL.md` § Multi-Agent Investigation and `references/investigation-methodology.md` for the swarm role template.
+
+## Contributing
+
+Scripts should remain zero-dependency (Python stdlib only). Add new scripts to `scripts/`, update `SKILL.md`, and ensure `--help` works.
diff --git a/skills/openplanter/SKILL.md b/skills/openplanter/SKILL.md
new file mode 100644
index 00000000..6424cafd
--- /dev/null
+++ b/skills/openplanter/SKILL.md
@@ -0,0 +1,362 @@
+---
+name: openplanter
+description: "Dataset investigation and entity resolution using OpenPlanter methodology. This skill should be used when cross-referencing heterogeneous datasets, performing entity resolution, building evidence chains with confidence tiers, or conducting structured investigations requiring epistemic discipline."
+argument-hint: "[workspace-path or investigation-query]"
+---
+
+# OpenPlanter — Investigation Methodology Skill
+
+Epistemic framework for cross-dataset investigation, entity resolution, and evidence-backed analysis. Extracted from the [OpenPlanter](https://github.com/ShinMegamiBoson/OpenPlanter) recursive investigation agent and enriched with professional OSINT tradecraft (Admiralty System, ACH, FollowTheMoney schema, intelligence cycle methodology).
+
+Claude Code already has the tools. This skill provides the **methodology**.
+
+## When to Use
+
+- Cross-referencing heterogeneous datasets (corporate registries, campaign finance, lobbying, property records, contracts)
+- Entity resolution across datasets with inconsistent naming
+- Building evidence chains with provenance and confidence tiers
+- Structured OSINT investigations requiring epistemic discipline
+- Any analysis where claims need to trace to cited source records
+
+## Quick Start
+
+```bash
+# 1. Initialize workspace
+python3 ~/.claude/skills/openplanter/scripts/init_workspace.py /path/to/investigation
+
+# 2. Drop datasets into datasets/
+cp campaign_finance.csv lobbying.json corporate_registry.csv /path/to/investigation/datasets/
+
+# 3. Write an investigation plan
+# → plans/plan.md (see references/output-templates.md for plan template)
+
+# 4. Resolve entities across datasets
+python3 ~/.claude/skills/openplanter/scripts/entity_resolver.py /path/to/investigation
+
+# 5. Cross-reference linked entities
+python3 ~/.claude/skills/openplanter/scripts/cross_reference.py /path/to/investigation
+
+# 6. Validate evidence chains
+python3 ~/.claude/skills/openplanter/scripts/evidence_chain.py /path/to/investigation
+
+# 7. Score confidence
+python3 ~/.claude/skills/openplanter/scripts/confidence_scorer.py /path/to/investigation
+```
+
+## Investigation Methodology
+
+### Epistemic Discipline
+
+Assume nothing about the environment until confirmed firsthand. These principles prevent the most common investigation failures:
+
+1. **Ground truth comes from files, not memory.** Read actual data before modifying it, and read actual error messages before diagnosing. Model memory of data structure is unreliable—reading the file takes seconds, recovering from a wrong assumption takes minutes.
+2. **Empty output is ambiguous.** If a command returns empty, cross-check with `ls -la` and `wc -c` before concluding a file is actually empty, because output capture mechanisms can silently lose data.
+3. **Success does not mean correctness.** A command that "succeeds" may have done nothing. Check actual outcomes, not just exit codes. After downloading, verify with `ls` and `wc -c`. After extraction, verify expected files exist.
+4. **Verify round-trip correctness.** After any data transformation (parsing, linking, aggregation), check the result from the consumer's perspective—load the output, spot-check records, verify row counts. Transformations that silently drop records are the most common source of wrong conclusions.
+5. **Three failures = wrong approach.** If a command fails 3 times, change strategy entirely. Repeating an identical command expecting different results wastes context window.
+6. **Produce artifacts early.** Write a working first draft of findings as soon as the requirements are clear, then iterate. An imperfect deliverable beats a perfect analysis with no output. If 3+ steps have passed without writing any output, stop and write—even if incomplete.
+
+For the full epistemic framework (including data ingestion rules and hard rules from OpenPlanter's prompts.py), see `references/investigation-methodology.md`.
+
+### Entity Resolution Protocol
+
+**Pipeline: Normalize → Block → Compare → Score → Cluster → Review**
+
+Adapted from OpenPlanter's prompts.py, enriched with Middesk, OpenSanctions, and ICIJ patterns.
+
+1. **Normalize** — Apply canonical key transformation: Unicode NFKD + diacritic stripping, case folding, legal suffix canonicalization (LLC/Inc/Corp/Ltd → canonical forms), punctuation removal, ampersand normalization (`&` → `and`). See `references/entity-resolution-patterns.md` for complete normalization tables and suffix maps.
+
+2. **Block** — Reduce O(N^2) comparisons using blocking keys: first 3 characters of normalized name + state/jurisdiction, phonetic key (Soundex/Double Metaphone), token overlap via inverted index.
+
+3. **Compare** — Pairwise similarity with cascading checks: `real_quick_ratio()` → `quick_ratio()` → `ratio()`, short-circuiting as soon as a cheap upper bound falls below threshold. Pass `autojunk=False`: difflib's automatic junk heuristic marks frequently occurring characters as junk on sequences of 200+ characters, which can depress ratios when comparing long concatenated fields. Include token set comparison for word-order variants ("Apple Inc" vs "Inc Apple").
+
+4. **Score** — Multi-signal weighted model:
+ - Hard signals: TIN/EIN exact match (1.0), identical phone E.164 (0.8), identical email (0.8)
+ - Soft signals: name similarity (0.5), address fuzzy (0.2), state match (0.1)
+ - Hard disqualifiers: TIN mismatch when both present (-0.5), country mismatch (-0.5)
+
+5. **Cluster** — Group via transitive closure using Union-Find. Gate closure so all pairwise scores exceed threshold—this prevents chain errors where A≈B and B≈C but A≉C. Exclude registered agent addresses from triggering transitive closure alone, because thousands of entities share the same registered agent.
+
+6. **Review** — Flag by confidence band:
+ - Score >= 0.85: auto-match (confirmed)
+ - Score 0.70-0.84: queue for review (probable)
+ - Score 0.55-0.69: include in wide net (possible)
+ - Score < 0.55: discard
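+
+The gated transitive closure in step 5 can be sketched with a plain Union-Find (an illustrative sketch assuming a symmetric `score(i, j)` callable; not `entity_resolver.py`'s implementation):
+
+```python
+class UnionFind:
+    def __init__(self, n: int):
+        self.parent = list(range(n))
+    def find(self, x: int) -> int:
+        while self.parent[x] != x:
+            self.parent[x] = self.parent[self.parent[x]]  # path halving
+            x = self.parent[x]
+        return x
+
+def cluster(n: int, score, threshold: float = 0.85) -> list[int]:
+    """Merge two clusters only if EVERY cross-cluster pair clears the
+    threshold. This gate prevents chain errors (A~B, B~C, but not A~C)."""
+    uf = UnionFind(n)
+    members = {i: {i} for i in range(n)}          # root -> member set
+    for i in range(n):
+        for j in range(i + 1, n):
+            ri, rj = uf.find(i), uf.find(j)
+            if ri == rj or score(i, j) < threshold:
+                continue
+            if all(score(a, b) >= threshold
+                   for a in members[ri] for b in members[rj]):
+                merged = members.pop(ri) | members.pop(rj)
+                uf.parent[ri] = rj
+                members[rj] = merged
+    return [uf.find(i) for i in range(n)]
+```
+
+With an ungated closure, two high pairwise scores would pull a third weakly matching record into the same cluster; the gate keeps it out.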
+
+### Evidence Chain Construction
+
+Every claim traces to a specific record in a specific dataset. This is what separates investigation from speculation.
+
+```
+Claim
+ └── Evidence Item
+ ├── type: document | record | image | testimony
+ ├── source_ref → Source
+ │ ├── url / file path
+ │ ├── collection_timestamp
+ │ ├── source_type: primary | secondary | tertiary | official | unofficial
+ │ └── reliability_grade: A–F (Admiralty)
+ ├── credibility_grade: 1–6 (Admiralty)
+ ├── corroboration_status: single | corroborated | contradicted | unresolvable
+ └── match_details (if cross-reference)
+ ├── fields_matched: [name, address, ein]
+ ├── match_type: exact | fuzzy | address-based
+ └── link_strength: weakest criterion in the chain
+```
+
+**Key principles:**
+- Distinguish direct evidence (A appears in record X), circumstantial evidence (A's address matches B's address), and absence of evidence (no disclosure found)
+- Document every hop in a multi-step chain with source record, linking field, and match quality
+- Link strength = weakest criterion in the chain (a chain is only as strong as its weakest link)
+- Track source lineage to detect circular reporting—Source B citing Source A is not independent corroboration
+
+### Confidence Tiers
+
+Based on the Admiralty System (NATO AJP-2.1), adapted for dataset investigation:
+
+| Tier | Criteria | Required Evidence | Admiralty Equivalent |
+|------|----------|-------------------|---------------------|
+| **Confirmed** | 2+ independent sources with different collection paths; hard signal match (EIN, phone); or official record with verifiable provenance | Independent corroboration required | A1–B1 |
+| **Probable** | Strong single source (official record); high fuzzy match (>0.85) on name + address + state; consistent with known patterns | Single strong source acceptable | B2–C2 |
+| **Possible** | Circumstantial evidence only; moderate fuzzy match (0.55-0.84); consistent but not yet corroborated; requires additional investigation | Hypothesis supported but not confirmed | C3–D3 |
+| **Unresolved** | Contradictory evidence; insufficient data; single weak source; or unable to verify | Cannot determine with available evidence | D4–F6 |
+
+**Sherman Kent probability mapping:**
+- Confirmed ≈ "almost certain" (93-99%)
+- Probable ≈ "likely" (75-85%)
+- Possible ≈ "chances about even" (45-55%)
+- Unresolved ≈ insufficient basis for estimate
+
+### Verification Principle
+
+**Implementation and verification must be uncorrelated.** An agent that performs an analysis introduces systematic bias when self-verifying—it "knows" what the answer should be and unconsciously confirms it. Use the implement-then-verify pattern:
+
+```
+Step 1: Perform entity resolution and cross-referencing (analysis agent)
+Step 2: Read the result files
+Step 3: Independent verification (separate agent or separate pass with no shared context from step 1):
+ - Load output files fresh
+ - Spot-check N random records against raw source data
+ - Verify row counts match expectations
+ - Run validation script
+ - Report raw output only
+```
+
+The verification executor has no context from the analysis executor. It runs commands and reports output, making its evidence independent.
+
+**Anti-bias checks:**
+- **Confirmation bias**: Score hypotheses by inconsistency count (ACH), not confirmation count, because disconfirming evidence is more diagnostic
+- **Anchoring**: Do not rank hypotheses until evidence collection is complete
+- **Circular reporting**: Track source lineage; verify independence of collection paths before counting corroborations
+- **Satisficing**: Require a minimum of three competing hypotheses before scoring—this prevents premature commitment to the first plausible explanation
+
+### Analysis Output Standards
+
+Include in all investigation deliverables:
+
+1. **Methodology section**: Sources used, entity resolution approach, linking logic, known limitations
+2. **Confidence breakdown**: Count of findings per tier (confirmed/probable/possible/unresolved)
+3. **Evidence appendix**: Every hop, every source record cited, every match score
+4. **Structured output**: JSON for machine-readable (`findings/`), Markdown for human-readable (`findings/`)
+5. **Provenance**: For each dataset—source URL/path, access timestamp, transformations applied
+
+See `references/output-templates.md` for ready-to-use templates (investigation plans, summaries, evidence chains).
+
+## Integration Modes
+
+The skill operates in three modes based on investigation complexity:
+
+| Mode | When | Scripts |
+|------|------|---------|
+| **Methodology Only** | Simple tasks, 1-2 datasets, local analysis | `entity_resolver.py`, `cross_reference.py`, `evidence_chain.py`, `confidence_scorer.py` |
+| **Web-Enriched** | Need external data, public records, entity enrichment | Above + `dataset_fetcher.py`, `web_enrich.py`, `scrape_records.py`, + 6 specialized fetchers |
+| **Full RLM Delegation** | Complex multi-step investigations, 3+ datasets, 20+ reasoning steps | `delegate_to_rlm.py` → full OpenPlanter agent (provider-agnostic, session-resumable) |
+
+**One-command pipeline** for any mode: `investigate.py /path/to/workspace --phases all`
+
+### RLM Delegation — Provider-Agnostic
+
+The RLM agent auto-detects the LLM provider from the model name. Works with any provider the agent supports:
+
+```bash
+# Anthropic (default)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model claude-sonnet-4-5-20250929
+
+# OpenAI
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model gpt-4o
+
+# OpenRouter (any model via slash routing)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model anthropic/claude-sonnet-4-5
+
+# Ollama (local inference, air-gapped)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --provider ollama --model llama3
+
+# Cerebras (model name doesn't contain "cerebras", so specify --provider)
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --model qwen-3-235b-a22b-instruct-2507 --provider cerebras
+
+# Resume a previous investigation session
+python3 scripts/delegate_to_rlm.py --resume abc123 --workspace DIR
+
+# List saved sessions
+python3 scripts/delegate_to_rlm.py --list-sessions --workspace DIR
+
+# Control reasoning depth
+python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR --reasoning-effort high
+
+# List available models for a provider
+python3 scripts/delegate_to_rlm.py --list-models --provider ollama
+```
+
+Provider auto-detection: `claude-*` → anthropic, `gpt-*/o1-*/o3-*` → openai, `org/model` → openrouter, `*cerebras*` → cerebras, `llama*/qwen*/mistral*/gemma*` → ollama. For models without a recognizable prefix, pass `--provider` explicitly.
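+
+Those rules amount to something like the following (a sketch of the stated detection order, not `delegate_to_rlm.py`'s actual implementation):
+
+```python
+def detect_provider(model: str) -> str | None:
+    """Best-effort provider detection from a model name; returns None
+    when no rule matches (caller should then require --provider)."""
+    m = model.lower()
+    if m.startswith("claude"):
+        return "anthropic"
+    if m.startswith(("gpt", "o1", "o3")):
+        return "openai"
+    if "/" in m:                     # org/model slug routes via OpenRouter
+        return "openrouter"
+    if "cerebras" in m:
+        return "cerebras"
+    if m.startswith(("llama", "qwen", "mistral", "gemma")):
+        return "ollama"
+    return None
+```
+
+Note how the Cerebras example above needs `--provider`: a `qwen-*` name matches the Ollama rule first, so auto-detection alone would guess wrong.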
+
+API keys pass through environment variables: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`, `CEREBRAS_API_KEY` (or `OPENPLANTER_`-prefixed variants). Ollama requires no API key. Set `OPENPLANTER_REPO` to override local clone discovery.
+
+## Scripts Reference
+
+All scripts use Python stdlib only. Zero external dependencies. External tools (Exa, Firecrawl, OpenPlanter agent) are invoked as subprocesses.
+
+### Core Analysis
+
+| Script | Purpose | Example |
+|--------|---------|---------|
+| `init_workspace.py` | Create investigation workspace structure | `python3 scripts/init_workspace.py /tmp/investigation` |
+| `entity_resolver.py` | Fuzzy entity matching + canonical map | `python3 scripts/entity_resolver.py /tmp/investigation --threshold 0.85` |
+| `cross_reference.py` | Cross-dataset record linking | `python3 scripts/cross_reference.py /tmp/investigation` |
+| `evidence_chain.py` | Validate evidence chain structure | `python3 scripts/evidence_chain.py /tmp/investigation` |
+| `confidence_scorer.py` | Score findings by confidence tier | `python3 scripts/confidence_scorer.py /tmp/investigation` |
+
+### Data Collection & Enrichment
+
+| Script | Purpose | Example |
+|--------|---------|---------|
+| `dataset_fetcher.py` | Download bulk public datasets (SEC, FEC, OFAC, LDA, OpenSanctions) | `python3 scripts/dataset_fetcher.py /tmp/investigation --sources sec,fec` |
+| `web_enrich.py` | Enrich entities via Exa neural search | `python3 scripts/web_enrich.py /tmp/investigation --categories company,news` |
+| `scrape_records.py` | Fetch entity records from government APIs | `python3 scripts/scrape_records.py /tmp/investigation --entities "Acme Corp" --sources sec,fec` |
+
+### Specialized Data Fetchers
+
+Individual scripts for targeted government and public data sources. All use Python stdlib only, produce JSON + provenance sidecar, and support `--dry-run` and `--list`.
+
+| Script | Data Source | Auth | Key Linking Fields |
+|--------|-------------|------|--------------------|
+| `fetch_census.py` | US Census Bureau ACS 5-Year | Optional `CENSUS_API_KEY` | Geography (state, county, ZIP) |
+| `fetch_epa.py` | EPA ECHO Facility Search | None | `registry_id`, lat/lon, SIC/NAICS |
+| `fetch_icij.py` | ICIJ Offshore Leaks Database | None | `icij_id`, entity name, jurisdiction |
+| `fetch_osha.py` | OSHA DOL Enforcement | None | `activity_nr`, `estab_name`, SIC |
+| `fetch_propublica990.py` | ProPublica Nonprofit Explorer (IRS 990) | None | `ein`, org name, NTEE code |
+| `fetch_sam.py` | SAM.gov Entity Registration | `SAM_GOV_API_KEY` | UEI, CAGE code, NAICS |
+
+**Usage pattern** (all fetchers follow the same interface):
+```bash
+python3 scripts/fetch_sam.py /tmp/investigation --query "Raytheon" --state CT
+python3 scripts/fetch_epa.py /tmp/investigation --state TX --query "Refinery"
+python3 scripts/fetch_icij.py /tmp/investigation --entity "Mossack" --type intermediary
+python3 scripts/fetch_propublica990.py /tmp/investigation --ein 237327340
+python3 scripts/fetch_census.py /tmp/investigation --state 36 --county "*"
+python3 scripts/fetch_osha.py /tmp/investigation --sic 2911 --state TX
+```
+
+### Orchestration & Delegation
+
+| Script | Purpose | Example |
+|--------|---------|---------|
+| `investigate.py` | Run full pipeline end-to-end | `python3 scripts/investigate.py /tmp/investigation --phases all` |
+| `delegate_to_rlm.py` | Spawn full OpenPlanter agent (session-resumable, provider-agnostic) | `python3 scripts/delegate_to_rlm.py --objective "..." --workspace DIR` |
+
+### Knowledge Graph
+
+| Script | Purpose | Example |
+|--------|---------|---------|
+| `wiki_graph_query.py` | Query OpenPlanter wiki knowledge graph (read-only) | `python3 scripts/wiki_graph_query.py /tmp/investigation --entity "Raytheon" --neighbors` |
+
+Supports entity lookup, neighbor traversal, BFS path finding, full-text search, and graph statistics. Reads NetworkX node-link JSON graphs produced by OpenPlanter's wiki_graph.py during delegated investigations.
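+
+Because the graphs are plain node-link JSON, they can also be read without NetworkX. A minimal stdlib-only neighbor lookup (assuming the conventional `nodes`/`links` layout; a sketch, not `wiki_graph_query.py` itself):
+
+```python
+import json
+
+def neighbors(graph_path: str, node_id: str) -> set:
+    """Stdlib-only neighbor lookup over a node-link JSON graph."""
+    with open(graph_path) as f:
+        g = json.load(f)
+    edges = g.get("links") or g.get("edges") or []   # key varies by exporter
+    out = set()
+    for e in edges:
+        if e["source"] == node_id:
+            out.add(e["target"])
+        elif e["target"] == node_id:
+            out.add(e["source"])
+    return out
+```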
+
+## Skill Integration
+
+OpenPlanter methodology composes with existing Claude Code skills:
+
+| Investigation Task | Skill | Integration Pattern |
+|---|---|---|
+| Web research for entity enrichment | `exa-search` | `web_enrich.py` calls `exa_search.py` as subprocess |
+| Scrape JS-heavy public records portals | `Firecrawl` | `firecrawl scrape URL --only-main-content` |
+| Structured government APIs | Built-in | `scrape_records.py` queries SEC, FEC, LDA, USAspending via `urllib` |
+| Bulk dataset downloads | Built-in | `dataset_fetcher.py` fetches SEC, FEC, OFAC, OpenSanctions, LDA |
+| Defense contractor lookup | Built-in | `fetch_sam.py` queries SAM.gov by name/UEI/CAGE/NAICS |
+| Environmental compliance | Built-in | `fetch_epa.py` queries EPA ECHO for facilities + violations |
+| Nonprofit/dark money flows | Built-in | `fetch_propublica990.py` queries IRS 990 data via ProPublica |
+| Offshore entity chains | Built-in | `fetch_icij.py` queries Panama/Paradise/Pandora Papers |
+| Workplace safety records | Built-in | `fetch_osha.py` queries DOL enforcement data |
+| Demographics/economic context | Built-in | `fetch_census.py` queries Census ACS 5-Year estimates |
+| Knowledge graph query | Built-in | `wiki_graph_query.py` reads OpenPlanter wiki graphs |
+| Local RAG over large document corpora | `rlama` | Create collection from `datasets/`, query semantically |
+| Parallel investigation threads | `minoan-swarm` | Elat Research Swarm with domain-split investigators |
+| Academic/legal research | `academic-research` | Case law, regulatory filings, citations |
+| Twitter/social media OSINT | `twitter` | `x-search` for entity mentions, `bird` for profile data |
+| Daimonic timeline curation | `worldwarwatcher-update` | Mazkir ha-Milḥamat entity resolution for non-military domains |
+
+### US Public Records Datasets
+
+Key datasets and their linking keys for cross-reference investigations:
+
+| Dataset | Access | Linking Key | Script | Format |
+|---|---|---|---|---|
+| FEC Campaign Finance | `api.open.fec.gov` + bulk CSV | `committee_id`, contributor name | `dataset_fetcher.py` | CSV, JSON API |
+| Senate LDA Lobbying | `lda.senate.gov/api` | Registrant name, Client name | `dataset_fetcher.py` | JSON API, XML |
+| SEC EDGAR | `data.sec.gov` + EFTS search | `CIK` (Central Index Key) | `dataset_fetcher.py` | JSON, XBRL |
+| SAM.gov Entity Registration | `api.sam.gov` | UEI, CAGE code, NAICS | `fetch_sam.py` | JSON API |
+| EPA ECHO Facilities | `echodata.epa.gov` | FRS Registry ID, lat/lon | `fetch_epa.py` | JSON API |
+| ProPublica 990 (IRS) | `projects.propublica.org` | EIN, org name | `fetch_propublica990.py` | JSON API |
+| ICIJ Offshore Leaks | `offshoreleaks.icij.org` | Node ID, entity name, jurisdiction | `fetch_icij.py` | JSON API |
+| OSHA Inspections | `enforcedata.dol.gov` | Activity number, SIC code | `fetch_osha.py` | JSON API |
+| US Census ACS | `api.census.gov` | State/county/ZIP FIPS | `fetch_census.py` | JSON API |
+| OFAC Sanctions | `treasury.gov/ofac` (or OpenSanctions) | Name + aliases, identifiers | `dataset_fetcher.py` | CSV, XML |
+| State Corporate Registries | Per-state (or OpenCorporates API) | State registration number | — | Varies |
+| Property Records | County-level (or ATTOM/CoreLogic) | Parcel ID (APN), owner name | — | CSV, shapefile |
+
+**Cross-dataset linking challenge**: No universal corporate ID exists in US public records. The standard approach is to normalize names, fuzzy-match, filter by jurisdiction/address, and anchor on known IDs (CIK, committee_id, EIN, UEI, CAGE, FRS ID) when available.
+
+## Multi-Agent Investigation
+
+For complex investigations requiring parallel workstreams, use `minoan-swarm`. Separate the verifier agent from analysis agents to maintain uncorrelated verification—the verifier receives only output files and verification criteria, with no shared context from the analysis phase.
+
+See `references/investigation-methodology.md` for the full swarm role template (keret, kothar, resheph, anat, shapash).
+
+## Structured Analytical Techniques
+
+For complex scenarios with multiple possible explanations, apply Analysis of Competing Hypotheses (ACH) and Key Assumptions Check. These techniques are detailed in `references/investigation-methodology.md`.
+
+**ACH summary**: Build a matrix of hypotheses vs. evidence. Score by **inconsistency count** (fewest I markers wins), not confirmation count. This counteracts confirmation bias because disconfirming evidence is more diagnostic than supporting evidence. Identify linchpin evidence whose reclassification would change the conclusion.
+
+## Deep References
+
+- `references/investigation-methodology.md` — Full epistemic framework, ACH procedure, Key Assumptions Check, multi-agent swarm template
+- `references/entity-resolution-patterns.md` — Complete normalization tables, suffix maps, address canonicalization
+- `references/output-templates.md` — JSON/Markdown templates for investigation plans, summaries, and evidence chains
+- `references/public-records-apis.md` — API endpoints, auth, rate limits, linking keys for SEC, FEC, LDA, OFAC, USAspending, Census, EPA, ICIJ, OSHA, ProPublica 990, SAM.gov
+
+## OpenPlanter Tool to Claude Code Mapping
+
+| OpenPlanter Tool | Claude Code Equivalent |
+|---|---|
+| `list_files` | `Glob`, `Bash(ls)` |
+| `read_file` | `Read` |
+| `write_file` | `Write` |
+| `search_files` | `Grep` |
+| `edit_file` | `Edit` |
+| `apply_patch` | `Edit` |
+| `run_shell` | `Bash` |
+| `web_search` | `exa-search` skill |
+| `fetch_url` | `Firecrawl` skill / `WebFetch` |
+| `subtask` | `Task` tool (minoan-swarm) |
+| `execute` | `Task` tool (haiku model) |
+| `think` | Native reasoning |
+| `wiki_graph` | `wiki_graph_query.py` (read-only) |
+| `fetch_sam` | `fetch_sam.py` |
+| `fetch_epa` | `fetch_epa.py` |
+| `fetch_icij` | `fetch_icij.py` |
+| `fetch_osha` | `fetch_osha.py` |
+| `fetch_990` | `fetch_propublica990.py` |
+| `fetch_census` | `fetch_census.py` |
+| `resume_session` | `delegate_to_rlm.py --resume` |
+
+No capability gap. The methodology is what matters, not the tooling.
diff --git a/skills/openplanter/references/entity-resolution-patterns.md b/skills/openplanter/references/entity-resolution-patterns.md
new file mode 100644
index 00000000..65f8b48d
--- /dev/null
+++ b/skills/openplanter/references/entity-resolution-patterns.md
@@ -0,0 +1,283 @@
+# Entity Resolution Patterns
+
+Comprehensive normalization rules, suffix tables, and matching patterns for cross-dataset entity resolution.
+
+## Name Normalization Pipeline
+
+Apply these steps in order to every entity name before comparison:
+
+```
+1. Unicode NFKD decomposition → strip diacritics (category 'Mn')
+2. NFKC normalization (handles ligatures, fullwidth chars)
+3. Lowercase
+4. Expand/normalize ampersand: & → and
+5. Extract and canonicalize legal suffix (separate from base name)
+6. Strip noise words (The, A, An — for organization names only)
+7. Remove punctuation except hyphens in hyphenated names
+8. Collapse whitespace
+9. Strip leading/trailing whitespace
+```
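+
+A minimal sketch of this pipeline in Python (illustrative only; `normalize_name` and `NOISE_WORDS` are not names from the OpenPlanter codebase):
+
+```python
+import re
+import unicodedata
+
+NOISE_WORDS = {"the", "a", "an"}  # stripped for organization names only
+
+def normalize_name(name: str, is_org: bool = True) -> str:
+    # Steps 1-2: NFKD decomposition, strip combining marks (category Mn), NFKC
+    s = unicodedata.normalize("NFKD", name)
+    s = "".join(c for c in s if unicodedata.category(c) != "Mn")
+    s = unicodedata.normalize("NFKC", s)
+    s = s.lower()                       # step 3: lowercase
+    s = s.replace("&", " and ")         # step 4: ampersand
+    # step 5 (suffix canonicalization) is handled separately; see next section
+    if is_org:                          # step 6: noise words
+        s = " ".join(t for t in s.split() if t not in NOISE_WORDS)
+    s = re.sub(r"[^\w\s-]", "", s)      # step 7: punctuation except hyphens
+    return re.sub(r"\s+", " ", s).strip()  # steps 8-9: whitespace
+```
+
+For example, `normalize_name("The Société Générale & Co.")` yields `societe generale and co`.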
+
+## Corporate Suffix Canonical Mapping
+
+Suffix canonicalization is the single most critical normalization step. All variants must map to a single canonical form.
+
+| Raw Input Variants | Canonical Form |
+|---|---|
+| `LLC`, `L.L.C.`, `Limited Liability Company`, `limited liability co` | `LLC` |
+| `Inc`, `Inc.`, `Incorporated`, `Incorp.` | `INC` |
+| `Corp`, `Corp.`, `Corporation` | `CORP` |
+| `Ltd`, `Ltd.`, `Limited`, `Limted` (common typo) | `LTD` |
+| `Co`, `Co.`, `Company`, `Compny` | `CO` |
+| `LP`, `L.P.`, `Limited Partnership` | `LP` |
+| `LLP`, `L.L.P.`, `Limited Liability Partnership` | `LLP` |
+| `PLC`, `P.L.C.`, `Public Limited Company` | `PLC` |
+| `GmbH`, `Gesellschaft mit beschraenkter Haftung` | `GMBH` |
+| `SA`, `S.A.`, `Societe Anonyme`, `Sociedad Anonima` | `SA` |
+| `AG`, `Aktiengesellschaft` | `AG` |
+| `BV`, `B.V.`, `Besloten Vennootschap` | `BV` |
+| `SRL`, `S.R.L.`, `Sociedad de Responsabilidad Limitada` | `SRL` |
+| `Pty`, `Pty Ltd`, `Proprietary Limited` | `PTY` |
+| `PC`, `P.C.`, `Professional Corporation` | `PC` |
+| `PA`, `P.A.`, `Professional Association` | `PA` |
+| `NV`, `N.V.`, `Naamloze Vennootschap` | `NV` |
+| `SARL`, `S.A.R.L.` | `SARL` |
+
+**Regex patterns for extraction:**
+
+```python
+SUFFIX_PATTERNS = [
+ (r'\blimited\s+liability\s+company\b', 'LLC'),
+ (r'\bl\.?l\.?c\.?\b', 'LLC'),
+ (r'\bincorporated\b', 'INC'),
+ (r'\binc\.?\b', 'INC'),
+ (r'\bcorporation\b', 'CORP'),
+ (r'\bcorp\.?\b', 'CORP'),
+ (r'\blimited\s+liability\s+partnership\b', 'LLP'),
+ (r'\bl\.?l\.?p\.?\b', 'LLP'),
+ (r'\blimited\s+partnership\b', 'LP'),
+ (r'\bl\.?p\.?\b', 'LP'),
+ (r'\blimited\b', 'LTD'),
+ (r'\bltd\.?\b', 'LTD'),
+ (r'\bcompany\b', 'CO'),
+ (r'\bco\.?\b', 'CO'),
+ (r'\bpublic\s+limited\s+company\b', 'PLC'),
+ (r'\bp\.?l\.?c\.?\b', 'PLC'),
+ (r'\bprofessional\s+corporation\b', 'PC'),
+ (r'\bp\.?c\.?\b', 'PC'),
+ (r'\bprofessional\s+association\b', 'PA'),
+ (r'\bp\.?a\.?\b', 'PA'),
+]
+```
+
+**Important**: Match the longest pattern first (e.g., "Limited Liability Company" before "Limited" or "Company").
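+
+A sketch of longest-first extraction (the pattern list is trimmed here so the example stands alone; `split_suffix` is an illustrative name):
+
+```python
+import re
+
+# Trimmed pattern list; order matters (longest form first).
+PATTERNS = [
+    (r'\blimited\s+liability\s+company\b', 'LLC'),
+    (r'\bl\.?l\.?c\.?\b', 'LLC'),
+    (r'\bincorporated\b', 'INC'),
+    (r'\binc\.?\b', 'INC'),
+    (r'\bcorporation\b', 'CORP'),
+    (r'\bcorp\.?\b', 'CORP'),
+    (r'\blimited\b', 'LTD'),
+    (r'\bltd\.?\b', 'LTD'),
+]
+
+def split_suffix(name: str) -> tuple:
+    """Return (base_name, canonical_suffix); first matching pattern wins."""
+    s = name.lower()
+    for pat, canon in PATTERNS:
+        m = re.search(pat, s)
+        if m:
+            base = (s[:m.start()] + s[m.end():]).strip(" ,.")
+            return re.sub(r"\s+", " ", base), canon
+    return s.strip(), None
+```
+
+`split_suffix("Acme Widgets Limited Liability Company")` returns `("acme widgets", "LLC")` rather than falling through to `LTD` or `CO`.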
+
+## Person Name Normalization
+
+### Nickname Expansion Table
+
+| Nickname | Canonical |
+|---|---|
+| Bill, Billy, Willy | William |
+| Bob, Bobby, Rob, Robbie | Robert |
+| Jim, Jimmy, Jamie | James |
+| Mike, Mikey | Michael |
+| Tom, Tommy | Thomas |
+| Dick, Rick, Rich | Richard |
+| Jack, Jackie | John |
+| Ted, Teddy, Ed, Eddie, Ned | Edward |
+| Joe, Joey | Joseph |
+| Dan, Danny | Daniel |
+| Dave, Davy | David |
+| Pat, Patty | Patrick / Patricia |
+| Liz, Beth, Betty, Betsy, Eliza | Elizabeth |
+| Kate, Katie, Kathy, Cathy | Katherine / Catherine |
+| Maggie, Meg, Peggy | Margaret |
+| Sue, Suzy | Susan |
+| Jenny, Jen | Jennifer |
+| Chris | Christopher / Christine |
+| Alex | Alexander / Alexandra |
+| Sam | Samuel / Samantha |
+| Tony | Anthony |
+| Chuck | Charles |
+| Larry | Lawrence |
+| Jerry | Gerald / Jerome |
+| Harry | Harold / Henry |
+| Hank | Henry |
+| Steve | Stephen / Steven |
+| Andy | Andrew |
+| Tim | Timothy |
+| Jeff | Jeffrey |
+| Ron | Ronald |
+
+### Person Name Rules
+
+| Issue | Rule |
+|---|---|
+| Middle name/initial | Match on `first + last` only as fallback; middle initial mismatch = warning not failure |
+| Name order | Try both `First Last` and `Last, First` orderings |
+| Titles/honorifics | Strip: Dr., Mr., Mrs., Ms., Prof., Jr., Sr., II, III, IV, Esq. |
+| Suffixes | Separate: Jr., Sr., II, III, IV, Esq., MD, PhD, CPA |
+| Hyphenated names | Treat as single unit; also try each component separately |
+| Transliteration variants | Arabic: Mohamed/Muhammad/Mohammed; Chinese: Wei/Wai; Russian: Sergey/Sergei |
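+
+The rules above might be combined as follows (a sketch; `person_key` and the trimmed nickname dict are illustrative):
+
+```python
+NICKNAMES = {"bill": "william", "bob": "robert", "jim": "james",
+             "mike": "michael", "liz": "elizabeth", "ted": "edward"}
+TITLES_AND_SUFFIXES = {"dr", "mr", "mrs", "ms", "prof",
+                       "jr", "sr", "ii", "iii", "iv", "esq", "md", "phd", "cpa"}
+
+def person_key(name: str) -> tuple:
+    """Reduce a person name to its (first, last) fallback key."""
+    if "," in name:                      # handle "Last, First" ordering
+        last, _, first = name.partition(",")
+        name = f"{first} {last}"
+    tokens = [t.strip(".").lower() for t in name.split()]
+    tokens = [t for t in tokens if t not in TITLES_AND_SUFFIXES]
+    first, last = tokens[0], tokens[-1]  # middle tokens ignored at this tier
+    return NICKNAMES.get(first, first), last
+```
+
+Under this scheme `person_key("Dr. Bill A. Smith Jr.")` equals `person_key("Smith, William")`; a middle-initial mismatch surfaces only as a warning at a higher tier.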
+
+## Address Normalization
+
+### USPS Canonical Abbreviations
+
+| Full Form | Canonical | Full Form | Canonical |
+|---|---|---|---|
+| Street | ST | Suite | STE |
+| Avenue | AVE | Apartment | APT |
+| Boulevard | BLVD | Unit | UNIT |
+| Drive | DR | Building | BLDG |
+| Road | RD | Floor | FL |
+| Lane | LN | Room | RM |
+| Court | CT | Department | DEPT |
+| Place | PL | Highway | HWY |
+| Circle | CIR | Parkway | PKWY |
+| Way | WAY | Expressway | EXPY |
+| Terrace | TER | Turnpike | TPKE |
+| Trail | TRL | Crossing | XING |
+
+### Directional Abbreviations
+
+| Full | Canonical |
+|---|---|
+| North | N |
+| South | S |
+| East | E |
+| West | W |
+| Northeast | NE |
+| Northwest | NW |
+| Southeast | SE |
+| Southwest | SW |
+
+### State Abbreviations
+
+Always normalize to 2-letter USPS code. Example: `New York` → `NY`, `California` → `CA`.
+
+### Address Normalization Pipeline
+
+```
+1. Uppercase entire address
+2. Replace directional words with abbreviations
+3. Replace street type words with abbreviations
+4. Replace unit type words with abbreviations
+5. Remove periods and commas
+6. Normalize spacing (collapse multiple spaces)
+7. Remove # symbol before unit numbers
+```
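+
+A sketch of this pipeline (dictionaries trimmed; step order is adjusted slightly so punctuation is dropped before the whole-word lookups):
+
+```python
+import re
+
+STREET = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
+          "DRIVE": "DR", "ROAD": "RD", "SUITE": "STE", "APARTMENT": "APT"}
+DIRECTIONAL = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
+               "NORTHEAST": "NE", "NORTHWEST": "NW",
+               "SOUTHEAST": "SE", "SOUTHWEST": "SW"}
+
+def normalize_address(addr: str) -> str:
+    s = re.sub(r"[.,]", "", addr.upper())      # uppercase; drop periods/commas
+    words = [DIRECTIONAL.get(w, STREET.get(w, w)) for w in s.split()]
+    s = " ".join(words).replace("#", "")       # abbreviate; drop unit '#'
+    return re.sub(r"\s+", " ", s).strip()      # collapse spacing
+```
+
+For example, `normalize_address("1209 North Orange Street, Wilmington, DE 19801")` yields `1209 N ORANGE ST WILMINGTON DE 19801`.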
+
+## Common Abbreviation Expansions
+
+| Abbreviation | Full Form |
+|---|---|
+| Intl, Int'l | International |
+| Mgmt | Management |
+| Svcs, Svc | Services, Service |
+| Assoc, Assocs | Associates |
+| Bros | Brothers |
+| Natl, Nat'l | National |
+| Dept | Department |
+| Grp | Group |
+| Hldgs | Holdings |
+| Fin, Finl | Financial |
+| Tech | Technology / Technologies |
+| Sys | Systems |
+| Dev | Development |
+| Mfg | Manufacturing |
+| Dist | Distribution / Distributing |
+| Prop, Props | Properties |
+| Invt, Inv | Investment(s) |
+| Ins | Insurance |
+| Rlty | Realty |
+| Pharm | Pharmaceutical(s) |
+| Engr, Eng | Engineering |
+| Consult | Consulting |
+| Comm, Comms | Communications |
+| Ent, Entmt | Entertainment |
+| Advsr, Adv | Advisor(s) / Advisory |
+
+## Fuzzy Matching Thresholds
+
+### Recommended Thresholds (difflib.SequenceMatcher)
+
+| Score Range | Classification | Action |
+|---|---|---|
+| >= 0.95 | Near-certain (exact after normalization) | Auto-merge |
+| 0.85–0.94 | High confidence | Auto-merge with audit log |
+| 0.70–0.84 | Probable match | Queue for review |
+| 0.55–0.69 | Possible match | Surface as candidate, do not auto-merge |
+| < 0.55 | No match | Discard |
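+
+Applied directly with the standard library (a sketch; the tier labels are illustrative):
+
+```python
+from difflib import SequenceMatcher
+
+def classify_match(a: str, b: str) -> str:
+    """Map a similarity score for two already-normalized names to an action."""
+    score = SequenceMatcher(None, a, b).ratio()
+    if score >= 0.95:
+        return "auto-merge"
+    if score >= 0.85:
+        return "auto-merge+audit"
+    if score >= 0.70:
+        return "review-queue"
+    if score >= 0.55:
+        return "candidate"
+    return "discard"
+```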
+
+### Signal Weights for Composite Scoring
+
+| Signal | Weight | Notes |
+|---|---|---|
+| TIN/EIN exact match | 1.0 | Near-deterministic; hard signal |
+| Phone E.164 exact match | 0.8 | Hard signal |
+| Email exact match | 0.8 | Hard signal |
+| Name token set ratio | 0.5 | Primary soft signal |
+| Name Jaro-Winkler | 0.3 | Supplementary |
+| Name phonetic match | 0.2 | Supplementary |
+| Address exact match | 0.4 | Moderate signal |
+| Address fuzzy match | 0.2 | Weak signal |
+| State/jurisdiction match | 0.1 | Weak signal |
+| Suffix mismatch | -0.1 | Penalty |
+| Country mismatch | -0.5 | Strong disqualifier |
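+
+The weights translate directly into a composite scorer (a sketch; the signal names and the `[0, 1]` strength convention are illustrative):
+
+```python
+WEIGHTS = {
+    "ein_exact": 1.0, "phone_exact": 0.8, "email_exact": 0.8,
+    "name_token_set": 0.5, "name_jaro_winkler": 0.3, "name_phonetic": 0.2,
+    "address_exact": 0.4, "address_fuzzy": 0.2, "jurisdiction": 0.1,
+    "suffix_mismatch": -0.1, "country_mismatch": -0.5,
+}
+
+def composite_score(signals: dict) -> float:
+    """signals maps signal name -> strength in [0, 1].
+    Penalty signals fire with strength 1.0 when the mismatch is present."""
+    return sum(WEIGHTS[k] * v for k, v in signals.items())
+```
+
+An EIN match with a suffix mismatch still scores 0.9, reflecting that hard identifiers dominate soft-signal penalties.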
+
+## Edge Cases
+
+### Known Registered Agent Addresses (Do NOT use for positive matching)
+
+These addresses each host 10,000+ registered entities and should be flagged:
+
+- `1209 Orange Street, Wilmington, DE 19801` — The Corporation Trust Company (~250K entities)
+- `2711 Centerville Road, Suite 400, Wilmington, DE 19808` — Corporation Service Company
+- `850 New Burton Road, Suite 201, Dover, DE 19904` — Registered Agents Inc
+- `1013 Centre Road, Suite 403-B, Wilmington, DE 19805` — The Company Corporation
+- `311 S State Street, Dover, DE 19901` — Legalinc Corporate Services
+- `1100 H Street NW, Suite 840, Washington, DC 20005` — CT Corporation System (DC)
+
+When either entity's address is a known RA address, exclude the address signal from positive matching.
+
+### DBA (Doing Business As) Handling
+
+A single legal entity may file under multiple trade names:
+- `McDonald's Corporation` operates as `McDonald's`
+- `Alphabet Inc` operates as `Google`
+- `Meta Platforms Inc` operates as `Facebook`, `Instagram`, `WhatsApp`
+
+Strategy: maintain a `legal_name ↔ dba_name` mapping and try both when matching. If a DBA match scores >= 0.85, link it to the parent legal entity.
+
+### Shell Company Detection Signals
+
+- Generic name pattern: `[Word] Holdings LLC`, `[Word] Ventures LLC`, `[Word] Capital Partners LLC`
+- Known RA address (see above)
+- No officers listed, no phone, no web presence
+- Recently formed (within 90 days of investigated transaction)
+- Same formation date as multiple other entities with same RA
+
+**Rule**: Require >= 2 independent signals before classifying an entity as a shell company. Never trust name or address alone.
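+
+A sketch of the rule (signal names are illustrative):
+
+```python
+SHELL_SIGNALS = {"generic_name", "known_ra_address", "no_footprint",
+                 "recent_formation", "batch_formation"}
+
+def is_probable_shell(flags: set) -> bool:
+    """Flag as a probable shell only on >= 2 independent signals."""
+    return len(flags & SHELL_SIGNALS) >= 2
+```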
+
+### Regex Patterns for Entity Extraction
+
+```python
+# EIN / TIN (XX-XXXXXXX)
+EIN_PAT = r'\b(\d{2}-\d{7})\b'
+
+# US Phone (flexible)
+PHONE_PAT = r'(?:\+?1[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?)\d{3}[-.\s]?\d{4}'
+
+# Email
+EMAIL_PAT = r'[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}'
+
+# Dollar amounts
+DOLLAR_PAT = r'\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)\b'
+
+# Dates (ISO, US, written)
+DATE_ISO = r'\b\d{4}[-/]\d{2}[-/]\d{2}\b'
+DATE_US = r'\b\d{1,2}/\d{1,2}/\d{2,4}\b'
+DATE_WRITTEN = r'\b\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}\b'
+```
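+
+Example usage of the patterns above:
+
+```python
+import re
+
+EIN_PAT = r'\b(\d{2}-\d{7})\b'
+DOLLAR_PAT = r'\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)\b'
+
+text = "Acme Corp (EIN 12-3456789) paid $50,000.00 on 2025-03-15."
+eins = re.findall(EIN_PAT, text)        # ['12-3456789']
+amounts = re.findall(DOLLAR_PAT, text)  # ['50,000.00']
+```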
diff --git a/skills/openplanter/references/investigation-methodology.md b/skills/openplanter/references/investigation-methodology.md
new file mode 100644
index 00000000..795cf566
--- /dev/null
+++ b/skills/openplanter/references/investigation-methodology.md
@@ -0,0 +1,212 @@
+# Investigation Methodology — Full Reference
+
+Verbatim extraction from OpenPlanter `agent/prompts.py` (SYSTEM_PROMPT_BASE, RECURSIVE_SECTION, ACCEPTANCE_CRITERIA_SECTION), enriched with professional OSINT tradecraft.
+
+## Epistemic Discipline (from prompts.py)
+
+You are a skeptical professional. Assume nothing about the environment is what you'd expect until you've confirmed it firsthand.
+
+- Empty output is information about the capture mechanism, not about the file or command. Cross-check: if `cat file` returns empty, run `ls -la file` and `wc -c file` before concluding the file is actually empty.
+- A command that "succeeds" may have done nothing. Check actual outcomes, not just exit codes. After downloading a file, verify with ls and wc -c. After extracting an archive, verify the expected files exist.
+- Your memory of how data is structured is unreliable. Read the actual file before modifying it. Read actual error messages before diagnosing. Read actual data files before producing output.
+- Existing files in the workspace are ground truth placed there by the task. They contain data and logic you cannot reliably reconstruct from memory. Read them. Do not overwrite them with content from your training data.
+- If a command returns empty output, do NOT assume it failed. The output capture mechanism can lose data. Re-run the command once, or cross-check with `wc -c` before concluding the file/command produced nothing.
+- If THREE consecutive commands all return empty, assume systematic capture failure. Switch strategy: redirect output to a file, then read the file.
+
+## Hard Rules (from prompts.py)
+
+1. NEVER overwrite existing files with content generated from memory. Read first.
+2. Always write required output files before finishing—partial results beat no results.
+3. If a command fails 3 times, your approach is wrong. Change strategy entirely.
+4. Never repeat an identical command expecting different results.
+5. Preserve exact precision in numeric output. Never round, truncate, or reformat numbers unless explicitly asked.
+6. When the task asks you to "report" or "output" a result, ALWAYS write it to a structured file (results.json, findings.md, output.csv) in the workspace root.
+
+## Data Ingestion and Management (from prompts.py)
+
+- Ingest and verify before analyzing. For any new dataset: run wc -l, head -20, and sample queries to confirm format, encoding, and completeness before proceeding.
+- Preserve original source files; create derived versions separately. Never modify raw data in place.
+- When fetching APIs, paginate properly, verify completeness (compare returned count to expected total), and cache results to local files for repeatability.
+- Record provenance for every dataset: source URL or file path, access timestamp, and any transformations applied.
+
+## Entity Resolution and Cross-Dataset Linking (from prompts.py, enriched)
+
+- Handle name variants systematically: fuzzy matching, case normalization, suffix handling (LLC, Inc, Corp, Ltd), and whitespace/punctuation normalization.
+- Build entity maps: create a canonical entity file mapping all observed name variants to resolved canonical identities. Update it as new evidence appears.
+- Document linking logic explicitly. When linking entities across datasets, record which fields matched, the match type (exact, fuzzy, address-based), and confidence. Link strength = weakest criterion in the chain.
+- Flag uncertain matches separately from confirmed matches. Use explicit confidence tiers (confirmed, probable, possible, unresolved).
+
+### Enrichment: Record Linkage Pipeline
+
+```
+Raw Dataset A ──┐
+ ├─→ [Normalize] ─→ [Block] ─→ [Compare Pairs] ─→ [Score] ─→ [Cluster] ─→ Linked Records
+Raw Dataset B ──┘
+```
+
+**Blocking strategies** (reduce O(N^2)):
+- Exact block key: `(first_3_chars_of_name, state)`
+- Phonetic key: Double Metaphone of name
+- Sorted Neighborhood: sort by key, compare within sliding window
+- Token overlap: inverted index on name tokens, retrieve candidates sharing >=1 token
+- Multiple blocking passes with different keys (union of candidate sets) maximize recall
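+
+A sketch of the exact-key strategy (`block_key` and the record shape are illustrative):
+
+```python
+import itertools
+
+def block_key(record: dict) -> tuple:
+    """Exact block key: first 3 chars of the normalized name + state."""
+    return (record["name"][:3], record["state"])
+
+def candidate_pairs(records: list) -> set:
+    """Compare only records sharing a block key, not all O(N^2) pairs."""
+    blocks = {}
+    for i, r in enumerate(records):
+        blocks.setdefault(block_key(r), []).append(i)
+    pairs = set()
+    for ids in blocks.values():
+        pairs.update(itertools.combinations(ids, 2))
+    return pairs
+```
+
+A second pass with a phonetic key would be unioned into the same candidate set.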
+
+**Transitive closure caveats:**
+- Records A=B and B=C do NOT automatically imply A=C without explicit closure
+- Gate closure: all pairwise scores must exceed threshold to prevent chain errors
+- Shell companies sharing registered agent addresses should NOT trigger transitive closure on address alone
+- Known registered-agent addresses (e.g., 1209 Orange St, Wilmington, DE — 250K+ entities) must be flagged and excluded from address-based matching
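+
+A sketch of threshold-gated clustering with union-find. Note this gates each merged edge, the weaker form of the rule; the strict form additionally verifies every pairwise score inside a finished cluster:
+
+```python
+class UnionFind:
+    def __init__(self, n):
+        self.parent = list(range(n))
+    def find(self, x):
+        while self.parent[x] != x:
+            self.parent[x] = self.parent[self.parent[x]]  # path halving
+            x = self.parent[x]
+        return x
+    def union(self, a, b):
+        self.parent[self.find(a)] = self.find(b)
+
+def cluster(n, scored_pairs, threshold=0.85):
+    """Merge only pairs whose own score clears the threshold, so
+    A=B and B=C never chain through a sub-threshold link."""
+    uf = UnionFind(n)
+    for a, b, score in scored_pairs:
+        if score >= threshold:
+            uf.union(a, b)
+    return uf
+```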
+
+## Evidence Chains and Source Citation (from prompts.py)
+
+- Every claim must trace to a specific record in a specific dataset. No unsourced assertions.
+- Build evidence chains: when connecting entity A to entity C through entity B, document each hop—the source record, the linking field, and the match quality.
+- Distinguish direct evidence (A appears in record X), circumstantial evidence (A's address matches B's address), and absence of evidence (no disclosure found).
+- Structure findings as: claim → evidence → source → confidence level. Readers must be able to verify any claim by following the chain back to raw data.
+
+### Enrichment: Source Reliability Assessment (Admiralty System)
+
+**Source Reliability (A–F):**
+
+| Grade | Label | Definition |
+|---|---|---|
+| A | Completely reliable | No doubt about authenticity; history of complete reliability |
+| B | Usually reliable | Minor doubt; history of valid information in most instances |
+| C | Fairly reliable | Doubt; provided valid information in the past |
+| D | Not usually reliable | Significant doubt; valid information only occasionally |
+| E | Unreliable | History of invalid information |
+| F | Cannot be judged | Insufficient basis to evaluate; new or unassessed source |
+
+**Information Credibility (1–6):**
+
+| Grade | Label | Definition |
+|---|---|---|
+| 1 | Confirmed | Confirmed by other independent sources; consistent with known information |
+| 2 | Probably true | Not confirmed; logical; consistent with other intelligence |
+| 3 | Possibly true | Not confirmed; reasonably logical; agrees with some other intelligence |
+| 4 | Doubtful | Not confirmed; possible but illogical; no other information on subject |
+| 5 | Improbable | Not confirmed; contradicted by other intelligence |
+| 6 | Cannot be judged | No basis exists for evaluating validity |
+
+### Enrichment: Circular Reporting Detection
+
+Circular reporting is the most dangerous OSINT failure mode: a single source cited through multiple intermediaries appears as multiple independent confirmations.
+
+**Detection protocol:**
+1. Track every source's upstream lineage
+2. Build a directed graph of "cites" relationships
+3. Independence = no path from confirmer back to original source
+4. Before counting a corroboration, verify independence of collection path
+5. Flag any two "confirming" sources that share a common ancestor
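+
+Steps 2-5 reduce to an ancestor check on the citation graph (a sketch; the `cites` adjacency shape is illustrative):
+
+```python
+def shares_ancestor(cites: dict, a: str, b: str) -> bool:
+    """True if two 'confirming' sources trace back to a common origin.
+    cites maps each source to the sources it cites (upstream lineage)."""
+    def ancestors(src):
+        seen, stack = set(), [src]
+        while stack:
+            for up in cites.get(stack.pop(), []):
+                if up not in seen:
+                    seen.add(up)
+                    stack.append(up)
+        return seen
+    return bool((ancestors(a) | {a}) & (ancestors(b) | {b}))
+```
+
+Two sources that both cite the same wire report share an ancestor, so they count as one confirmation, not two.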
+
+## Analysis Output Standards (from prompts.py)
+
+- Write findings to structured files (JSON for machine-readable, Markdown for human-readable), not just text answers.
+- Include a methodology section in every deliverable: sources used, entity resolution approach, linking logic, and known limitations.
+- Produce both a summary (key findings, confidence levels) and a detailed evidence appendix (every hop, every source record cited).
+- Ground all narrative in cited evidence. No speculation without explicit "hypothesis" or "unconfirmed" labels.
+
+## Planning (from prompts.py, adapted)
+
+For nontrivial objectives (multi-step analysis, cross-dataset investigation, complex data pipeline), your FIRST action should be to create an analysis plan.
+
+The plan should include:
+1. Data sources and expected formats
+2. Entity resolution strategy
+3. Cross-dataset linking approach
+4. Evidence chain construction method
+5. Expected deliverables and output format
+6. Risks and limitations
+
+Skip planning for trivial objectives (single lookups, direct questions).
+
+## Execution Tactics (from prompts.py)
+
+1. Produce analysis artifacts early, then refine. Write a working first draft of the output file as soon as you understand the requirements, then iterate. An imperfect deliverable beats a perfect analysis with no output.
+2. Never destroy what you built. After verifying something works, remove only verification artifacts (test files, temp data). Do not overwrite the thing you were asked to create.
+3. Verify round-trip correctness. After any data transformation, check the result from the consumer's perspective—load the output file, spot-check records, verify row counts—before declaring success.
+4. Prefer tool defaults and POSIX portability. Use default options unless you have clear evidence otherwise.
+5. Break long-running commands into small steps. Process files incrementally.
+
+## Verification Principle (from prompts.py)
+
+Implementation and verification must be UNCORRELATED. An agent that performs an analysis must NOT be the sole verifier of that analysis—its self-assessment is inherently biased.
+
+Use the IMPLEMENT-THEN-VERIFY pattern:
+
+```
+Step 1: Perform analysis (analysis agent)
+Step 2: Read results
+Step 3: VERIFY independently (different agent, no context from step 1):
+ - Run exact commands and return raw output only
+ - Load output files fresh, spot-check records
+ - Verify counts, structure, completeness
+ - Report pass/fail with raw evidence
+```
+
+The verification executor has NO context from the analysis executor. It simply runs commands and reports output. This makes its evidence independent.
+
+**Writing good acceptance criteria:**
+
+GOOD (independently checkable):
+- "Entity linkage report contains 5+ cross-dataset matches with source citations"
+- "findings.md contains a Methodology section and an Evidence Appendix section"
+- `python3 -c 'import json; d=json.load(open("out.json")); print(len(d))'` outputs >= 10
+
+BAD (not independently checkable):
+- "Analysis should be thorough"
+- "All entities resolved"
+- "Results are accurate and complete"
+
+## Structured Analytical Techniques
+
+### Analysis of Competing Hypotheses (ACH) — Heuer (1999)
+
+The definitive anti-confirmation-bias technique from the CIA.
+
+1. **Hypothesis Generation** — Brainstorm all mutually exclusive, collectively exhaustive hypotheses. Include null hypothesis and adversarial deception hypothesis.
+2. **List Evidence** — All information items: facts, assumptions, and absence-of-evidence.
+3. **Build Matrix** — Hypotheses as columns, evidence as rows. Assign: C (consistent), I (inconsistent), N/A.
+4. **Focus on Discriminators** — Evidence that is C for some hypotheses and I for others. Items that are C for ALL hypotheses have no discriminatory value.
+5. **Score by Inconsistency** — The hypothesis with the fewest I markers is most likely—NOT the one with the most C markers. This reversal prevents confirmation bias.
+6. **Identify Linchpins** — Evidence items whose reclassification would change the conclusion. Flag for priority verification.
+7. **Report with Uncertainty** — Communicate confidence level and key uncertainties. State which evidence would falsify the preferred hypothesis.
+8. **Set Reassessment Triggers** — Define what new information would trigger re-analysis.
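+
+Steps 3 and 5 can be mechanized (a sketch; the matrix shape and hypothesis labels are illustrative):
+
+```python
+def ach_rank(matrix: dict) -> list:
+    """matrix: hypothesis -> {evidence_id: 'C' | 'I' | 'NA'}.
+    Rank by fewest inconsistencies, NOT most confirmations."""
+    scores = [(h, sum(1 for v in row.values() if v == "I"))
+              for h, row in matrix.items()]
+    return sorted(scores, key=lambda t: t[1])
+
+matrix = {
+    "H1: self-dealing":   {"e1": "C", "e2": "C", "e3": "I"},
+    "H2: coincidence":    {"e1": "C", "e2": "I", "e3": "I"},
+    "H3: clerical error": {"e1": "I", "e2": "I", "e3": "C"},
+}
+# H1 ranks first with a single inconsistency.
+```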
+
+### Key Assumptions Check (KAC)
+
+1. State the working hypothesis clearly
+2. List all stated assumptions explicitly
+3. Surface unstated assumptions: "What must be true for this to hold?"
+4. Challenge each: "What would need to be false for this to fail?"
+5. Rate: essential / contestable / unsupported
+6. Eliminate non-essential assumptions; flag contestable ones for disclosure
+
+### Devil's Advocacy
+
+Assign one analysis pass to argue against the prevailing conclusion with maximum rigor. In automated systems: run a second LLM pass with a "steelman the opposite" system prompt. Apply against every intermediate finding, not just the final one.
+
+### Team A / Team B
+
+Two independent analysis passes with different assumptions. Useful when stakes are high. In automated systems: two LLM runs with different system prompts (one confirmatory, one skeptical).
+
+## Anti-Bias Checklist
+
+Apply before finalizing any investigation:
+
+- [ ] **Confirmation bias**: Scored by inconsistency count (ACH), not confirmation count?
+- [ ] **Anchoring**: Hypotheses ranked only AFTER evidence collection complete?
+- [ ] **Circular reporting**: Source lineage tracked; independence verified for all corroborations?
+- [ ] **Satisficing**: Minimum 3 competing hypotheses considered before scoring?
+- [ ] **Availability heuristic**: Evidence weighted by source quality, not recency or vividness?
+- [ ] **Vividness bias**: Testimonial evidence held to higher corroboration standard than structural evidence?
+
+## Intelligence Cycle (5-Phase)
+
+1. **Planning and Direction** — Define the intelligence requirement precisely. Set collection priorities. Map available sources. Define what a satisfactory answer looks like.
+2. **Collection** — Execute against the plan. Timestamp and archive every collected item. Tag with provisional Admiralty source reliability grade.
+3. **Processing** — Convert raw data to analyzable format. Extract entities. Deduplicate and crosslink.
+4. **Analysis and Production** — Apply SATs. Build evidence chains. Apply confidence tiers. Identify gaps.
+5. **Dissemination** — Produce report with explicit confidence markings. Distinguish raw findings from analytical judgments. Define revision triggers.
+
+The cycle is iterative: Phase 5 produces new questions that re-enter Phase 1.
diff --git a/skills/openplanter/references/output-templates.md b/skills/openplanter/references/output-templates.md
new file mode 100644
index 00000000..8cc31b9c
--- /dev/null
+++ b/skills/openplanter/references/output-templates.md
@@ -0,0 +1,266 @@
+# Output Templates
+
+Ready-to-use templates for investigation deliverables.
+
+## Investigation Summary (Markdown)
+
+```markdown
+# Investigation Summary: [Title]
+
+**Date**: [YYYY-MM-DD]
+**Investigator**: [name / agent]
+**Workspace**: [path]
+
+## Executive Summary
+
+[2-3 sentence overview of key findings and confidence level]
+
+## Methodology
+
+### Data Sources
+| Dataset | Records | Format | Collection Date | Source Reliability |
+|---------|---------|--------|-----------------|-------------------|
+| [name] | [N] | CSV/JSON | [date] | [A-F] |
+
+### Entity Resolution Approach
+- Normalization: [Unicode NFKD + suffix canonicalization + ...]
+- Blocking: [first_3_chars + state]
+- Similarity: [difflib.SequenceMatcher, threshold 0.85/0.70]
+- Clustering: [Union-Find with threshold-gated transitive closure]
+
+### Linking Logic
+- Primary linking fields: [name, EIN, address]
+- Match types used: [exact, fuzzy, address-based]
+- Cross-reference strategy: [description]
+
+### Known Limitations
+- [Dataset gaps]
+- [Entity resolution edge cases]
+- [Temporal coverage limitations]
+
+## Key Findings
+
+### Finding 1: [Title]
+**Confidence**: [Confirmed/Probable/Possible/Unresolved]
+
+[Description of finding grounded in cited evidence]
+
+**Evidence chain**:
+1. [Source record A] → [field match] → [Source record B] (score: 0.92)
+2. [Source record B] → [address match] → [Source record C] (score: 0.78)
+
+### Finding 2: [Title]
+...
+
+## Confidence Breakdown
+
+| Tier | Count | Percentage |
+|------|-------|------------|
+| Confirmed | [N] | [%] |
+| Probable | [N] | [%] |
+| Possible | [N] | [%] |
+| Unresolved | [N] | [%] |
+
+## Anti-Bias Checks Applied
+- [x] Confirmation bias: scored by inconsistency (ACH)
+- [x] Circular reporting: source lineage verified
+- [x] Satisficing: 3+ hypotheses considered
+- [ ] Independent verification: [pending/completed]
+
+## Recommendations
+1. [Next steps]
+2. [Additional data to collect]
+3. [Hypotheses to investigate further]
+
+## Evidence Appendix
+
+[Detailed evidence chains for each finding — see evidence/chains.json]
+```
+
+## Entity Map (JSON)
+
+```json
+{
+ "metadata": {
+ "created": "2026-02-20T12:00:00Z",
+ "workspace": "/path/to/investigation",
+ "datasets_processed": ["campaign.csv", "lobby.json", "registry.csv"],
+ "total_entities": 142,
+ "resolution_threshold": 0.85
+ },
+ "entities": [
+ {
+ "canonical_id": "entity-001",
+ "canonical_name": "Acme Corporation",
+ "entity_type": "organization",
+ "variants": [
+ {
+ "name": "ACME CORP LLC",
+ "source": "campaign.csv",
+ "row": 42,
+ "similarity": 0.91
+ },
+ {
+ "name": "Acme Corporation",
+ "source": "registry.csv",
+ "row": 17,
+ "similarity": 1.0
+ },
+ {
+ "name": "Acme Corp.",
+ "source": "lobby.json",
+ "path": "$.registrants[3].name",
+ "similarity": 0.95
+ }
+ ],
+ "identifiers": {
+ "ein": "12-3456789",
+ "state_reg": "DE-1234567",
+ "fec_committee_id": null
+ },
+ "properties": {
+ "jurisdiction": "DE",
+ "address": "1234 Main St, Wilmington, DE 19801",
+ "industry": "Technology"
+ },
+ "confidence": "confirmed",
+ "confidence_basis": "EIN exact match across 2 datasets + name similarity > 0.90"
+ }
+ ]
+}
+```
+
+## Evidence Chain (JSON)
+
+```json
+{
+ "metadata": {
+ "investigation": "Campaign Finance Cross-Reference",
+ "created": "2026-02-20T12:00:00Z",
+ "total_chains": 23
+ },
+ "chains": [
+ {
+ "chain_id": "chain-001",
+ "claim": "Acme Corporation made campaign contributions while simultaneously lobbying on related legislation",
+ "confidence": "probable",
+ "admiralty_grade": "B2",
+ "hops": [
+ {
+ "hop": 1,
+ "from_entity": "Acme Corp LLC",
+ "from_dataset": "campaign.csv",
+ "from_record": {"row": 42, "fields": {"contributor_name": "ACME CORP LLC", "amount": "$50,000", "recipient": "Committee XYZ", "date": "2025-03-15"}},
+ "to_entity": "Acme Corporation",
+ "to_dataset": "registry.csv",
+ "to_record": {"row": 17, "fields": {"name": "Acme Corporation", "ein": "12-3456789", "status": "Active"}},
+ "link_field": "name",
+ "match_type": "fuzzy",
+ "match_score": 0.91,
+ "link_strength": "strong"
+ },
+ {
+ "hop": 2,
+ "from_entity": "Acme Corporation",
+ "from_dataset": "registry.csv",
+ "from_record": {"row": 17},
+ "to_entity": "Acme Corp.",
+ "to_dataset": "lobby.json",
+ "to_record": {"path": "$.registrants[3]", "fields": {"name": "Acme Corp.", "client": "Self", "issue": "Technology regulation"}},
+ "link_field": "name + ein",
+ "match_type": "exact_ein + fuzzy_name",
+ "match_score": 0.95,
+ "link_strength": "strong"
+ }
+ ],
+ "corroboration": {
+ "status": "corroborated",
+ "independent_sources": 3,
+ "circular_check": "passed",
+ "source_lineage": ["campaign.csv (FEC bulk 2025-Q1)", "registry.csv (DE SOS 2025-01)", "lobby.json (Senate LDA 2025-Q1)"]
+ },
+ "key_assumptions": [
+ "ACME CORP LLC and Acme Corporation are the same legal entity (supported by EIN match)",
+ "Lobbying activity is related to campaign contributions (temporal correlation within same quarter)"
+ ],
+ "falsification_conditions": [
+ "EIN 12-3456789 belongs to a different entity than Acme Corporation in DE registry",
+ "Lobbying issue area has no connection to recipient committee's jurisdiction"
+ ]
+ }
+ ]
+}
+```
+
+## Cross-Reference Report (Markdown)
+
+```markdown
+# Cross-Reference Report
+
+**Datasets**: campaign.csv, lobby.json, registry.csv
+**Date**: 2026-02-20
+**Total cross-references found**: 47
+
+## Match Summary
+
+| Dataset Pair | Matches | Avg Score | Confirmed | Probable | Possible |
+|---|---|---|---|---|---|
+| campaign ↔ registry | 23 | 0.89 | 15 | 6 | 2 |
+| campaign ↔ lobby | 18 | 0.82 | 8 | 7 | 3 |
+| lobby ↔ registry | 31 | 0.91 | 22 | 7 | 2 |
+
+## Highest-Confidence Cross-References
+
+| Entity | Campaign | Lobby | Registry | Score | Confidence |
+|---|---|---|---|---|---|
+| Acme Corp | Row 42 ($50K to Cmte XYZ) | Reg #3 (Tech regulation) | Row 17 (DE, Active) | 0.95 | Confirmed |
+| Beta Holdings | Row 108 ($25K to Cmte ABC) | Reg #7 (Energy policy) | Row 89 (NV, Active) | 0.88 | Probable |
+
+## Entities Appearing in All Datasets
+
+[List of entities with records in all loaded datasets — highest investigation priority]
+
+## Entities with Conflicting Information
+
+[List of entities where datasets contradict each other — requires investigation]
+
+## Unresolved Matches (Manual Review Required)
+
+| Entity A | Dataset | Entity B | Dataset | Score | Issue |
+|---|---|---|---|---|---|
+| Smith & Co | campaign.csv:205 | Smith Company LLC | registry.csv:312 | 0.72 | Possible DBA; confirm EIN |
+```
+
+## Workspace README (for init_workspace.py)
+
+```markdown
+# Investigation Workspace
+
+## Structure
+
+    datasets/  — Raw source data (CSV, JSON). Never modify originals.
+    entities/  — Resolved entity maps (canonical.json)
+    findings/  — Analysis outputs (cross-references, summaries)
+    evidence/  — Evidence chains with full provenance
+    plans/     — Investigation plans and methodology docs
+
+## Workflow
+
+1. Drop datasets into `datasets/`
+2. Write investigation plan in `plans/plan.md`
+3. Run entity resolution: `entity_resolver.py`
+4. Run cross-referencing: `cross_reference.py`
+5. Validate evidence: `evidence_chain.py`
+6. Score confidence: `confidence_scorer.py`
+7. Review findings in `findings/`
+
+## Provenance
+
+Record for every dataset:
+- Source URL or file path
+- Access/download timestamp
+- Any transformations applied
+- Admiralty source reliability grade (A-F)
+```
diff --git a/skills/openplanter/references/public-records-apis.md b/skills/openplanter/references/public-records-apis.md
new file mode 100644
index 00000000..22debd03
--- /dev/null
+++ b/skills/openplanter/references/public-records-apis.md
@@ -0,0 +1,208 @@
+# Public Records API Reference
+
+Quick reference for US government APIs used by `scrape_records.py` and `dataset_fetcher.py`. All endpoints are accessed via `urllib.request` (stdlib only).
+
+## SEC EDGAR
+
+**Entity submissions** (filing history, metadata):
+```
+GET https://data.sec.gov/submissions/CIK{cik_padded_10}.json
+Header: User-Agent: OpenPlanter/1.0 openplanter@investigation.local
+```
+- **Auth**: User-Agent with name + email (mandatory, no API key)
+- **Rate**: ~10 requests/sec
+- **Linking keys**: `cik` (Central Index Key), `tickers`, `exchanges`
+- **CIK lookup**: `https://www.sec.gov/files/company_tickers.json` (JSON map of all tickers → CIK)
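+
+A minimal `urllib.request` sketch for the submissions endpoint (helper names here are illustrative, not functions from `scrape_records.py`; the User-Agent value is the mandatory one shown above):
+
+```python
+import json
+import urllib.request
+
+EDGAR_UA = "OpenPlanter/1.0 openplanter@investigation.local"
+
+
+def submissions_request(cik: str) -> urllib.request.Request:
+    """Build a submissions request; EDGAR rejects requests without a User-Agent."""
+    padded = str(int(cik)).zfill(10)  # CIK must be zero-padded to 10 digits
+    url = f"https://data.sec.gov/submissions/CIK{padded}.json"
+    return urllib.request.Request(url, headers={"User-Agent": EDGAR_UA})
+
+
+def fetch_submissions(cik: str) -> dict:
+    """Fetch filing history for a CIK (network call)."""
+    with urllib.request.urlopen(submissions_request(cik), timeout=30) as resp:
+        return json.load(resp)
+```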
+
+**Full-text search** (EDGAR EFTS):
+```
+GET https://efts.sec.gov/LATEST/search-index?q={query}&dateRange=custom&startdt=2020-01-01&forms=10-K,10-Q,8-K,DEF+14A&hits.hits.total=true
+```
+
+## FEC (Federal Election Commission)
+
+**Committee name search**:
+```
+GET https://api.open.fec.gov/v1/names/committees/?q={name}&api_key={key}
+```
+- **Auth**: Free API key at [api.open.fec.gov](https://api.open.fec.gov/developers/). `DEMO_KEY` available (1000 req/hr)
+- **Rate**: 1000 requests/hr per key
+- **Linking keys**: `committee_id`, `committee_name`, `treasurer_name`
+
+**Individual contributions (Schedule A)**:
+```
+GET https://api.open.fec.gov/v1/schedules/schedule_a/?contributor_name={name}&api_key={key}&per_page=100
+```
+
+**Bulk downloads** (used by `dataset_fetcher.py`):
+```
+https://www.fec.gov/files/bulk-downloads/2024/committee_master_2024.csv
+```
+
+## Senate LDA (Lobbying Disclosure Act)
+
+**Registrant search**:
+```
+GET https://lda.senate.gov/api/v1/registrants/?name={name}&format=json&page_size=25
+```
+- **Auth**: None
+- **Rate**: ~1 request/sec (polite)
+- **Linking keys**: `id`, `name`, `house_registrant_id`
+- **Pagination**: `{next, previous, count, results}` — follow `next` URL
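+
+The pagination pattern can be sketched as a cursor-following generator (`fetch_json` and `iter_results` are illustrative names; the injectable `fetch` parameter keeps the logic testable without network access):
+
+```python
+from __future__ import annotations
+
+import json
+import urllib.request
+from collections.abc import Callable, Iterator
+
+
+def fetch_json(url: str) -> dict:
+    """Fetch one page of results (network call)."""
+    with urllib.request.urlopen(url, timeout=30) as resp:
+        return json.load(resp)
+
+
+def iter_results(first_url: str, fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
+    """Yield result objects across all pages by following the `next` URL."""
+    url: str | None = first_url
+    while url:
+        page = fetch(url)
+        yield from page.get("results", [])
+        url = page.get("next")  # None on the last page
+```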
+
+**Filing search**:
+```
+GET https://lda.senate.gov/api/v1/filings/?filing_type=1&registrant_name={name}&format=json&page_size=25
+```
+
+## OFAC SDN (Treasury Sanctions)
+
+**Bulk download** (used by `dataset_fetcher.py`):
+```
+https://www.treasury.gov/ofac/downloads/sdn.csv
+```
+- **Auth**: None
+- **Format**: Pipe-delimited CSV
+- **Linking keys**: `uid`, `name`, `sdnType`, `programs`
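+
+A parsing sketch for the pipe-delimited format. The column layout is an assumption from the published SDN file spec (no header row, `-0-` as the empty-field placeholder) — verify positions against the current spec before relying on them:
+
+```python
+import csv
+import io
+
+
+def parse_sdn(text: str) -> list[dict[str, str]]:
+    """Parse pipe-delimited SDN rows into {uid, name, sdn_type, programs} dicts."""
+    rows = []
+    for rec in csv.reader(io.StringIO(text), delimiter="|"):
+        if len(rec) < 4:
+            continue
+        # -0- is the SDN file's placeholder for an empty field
+        clean = ["" if v.strip() == "-0-" else v.strip() for v in rec]
+        rows.append(
+            {"uid": clean[0], "name": clean[1], "sdn_type": clean[2], "programs": clean[3]}
+        )
+    return rows
+```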
+
+## OpenSanctions
+
+**Bulk download** (used by `dataset_fetcher.py`):
+```
+https://data.opensanctions.org/datasets/latest/sanctions/targets.simple.csv
+```
+- **Auth**: None (non-commercial use)
+- **Linking keys**: `id`, `name`, `countries`, `identifiers`
+
+## USAspending.gov
+
+**Award search** (POST JSON):
+```
+POST https://api.usaspending.gov/api/v2/search/spending_by_award/
+Content-Type: application/json
+
+{
+ "filters": {
+ "keyword": "entity name",
+ "time_period": [{"start_date": "2020-01-01", "end_date": "2026-12-31"}]
+ },
+ "fields": ["Award ID", "Recipient Name", "Award Amount", "Awarding Agency"],
+ "page": 1,
+ "limit": 25,
+ "sort": "Award Amount",
+ "order": "desc"
+}
+```
+- **Auth**: None
+- **Rate**: Liberal (no published limit)
+- **Linking keys**: `recipient_name`, `award_id`, `awarding_agency`
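+
+A request-builder sketch for the POST payload above (an illustrative helper, not part of `dataset_fetcher.py`):
+
+```python
+import json
+import urllib.request
+
+AWARD_SEARCH_URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
+
+
+def award_search_request(keyword: str, start: str, end: str) -> urllib.request.Request:
+    """Build the JSON POST request; the payload mirrors the example above."""
+    payload = {
+        "filters": {
+            "keyword": keyword,
+            "time_period": [{"start_date": start, "end_date": end}],
+        },
+        "fields": ["Award ID", "Recipient Name", "Award Amount", "Awarding Agency"],
+        "page": 1,
+        "limit": 25,
+        "sort": "Award Amount",
+        "order": "desc",
+    }
+    return urllib.request.Request(
+        AWARD_SEARCH_URL,
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+```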
+
+## US Census Bureau (ACS)
+
+**American Community Survey 5-Year Estimates**:
+```
+GET https://api.census.gov/data/{year}/acs/acs5?get={variables}&for={geography}&key={key}
+```
+- **Auth**: Optional `CENSUS_API_KEY` (free at [api.census.gov](https://api.census.gov/data/key_signup.html)). Works without key at lower rate.
+- **Rate**: ~500 req/day without key, higher with key
+- **Linking keys**: `state`, `county`, `zip code tabulation area`
+- **Script**: `fetch_census.py`
+- **Key variables**: B01003_001E (population), B19013_001E (median income), B15003_022E (bachelor's degree)
+
+**Geography codes**:
+```
+for=state:* # All states
+for=county:*&in=state:36 # All counties in NY (FIPS 36)
+for=zip+code+tabulation+area:12508 # Single ZIP
+```
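+
+URL construction can be sketched as follows (an illustrative helper — `fetch_census.py` presumably builds something similar; the Census API accepts literal `:` and `*` in geography clauses):
+
+```python
+def acs5_url(year: int, variables: list[str], for_geo: str, in_geo: str = "", key: str = "") -> str:
+    """Build an ACS 5-year query URL; the key is optional per the note above."""
+    url = f"https://api.census.gov/data/{year}/acs/acs5?get={','.join(variables)}&for={for_geo}"
+    if in_geo:
+        url += f"&in={in_geo}"
+    if key:
+        url += f"&key={key}"
+    return url
+```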
+
+## EPA ECHO (Enforcement & Compliance History)
+
+**Facility search**:
+```
+GET https://echodata.epa.gov/echo/echo_rest_services.get_facilities?output=JSON&p_fn={name}&p_st={state}
+```
+- **Auth**: None (free public API)
+- **Rate**: ~2 req/sec recommended
+- **Linking keys**: `RegistryId` (FRS ID), `FacilityName`, `SICCodes`, `NAICSCodes`, `Lat`, `Lon`
+- **Script**: `fetch_epa.py`
+- **Response format**: `{"Results": {"Facilities": [...]}}`
+
+**Program flags**: `AirFlag` (CAA), `CWAFlag` (CWA), `RCRAFlag` (RCRA), `TRIFlag` (TRI)
+
+## ICIJ Offshore Leaks Database
+
+**Entity search**:
+```
+GET https://offshoreleaks.icij.org/api/v1/search?q={query}&type={entity|officer|intermediary|address}&limit=100
+```
+- **Auth**: None (free public database)
+- **Rate**: ~1 req/sec recommended (429 on burst)
+- **Linking keys**: `id` (ICIJ node ID), `name`, `jurisdiction`, `country_codes`, `sourceID`
+- **Script**: `fetch_icij.py`
+- **Datasets**: Panama Papers (2016), Paradise Papers (2017), Pandora Papers (2021), Offshore Leaks (2013), Bahamas Leaks (2016)
+
+## OSHA Inspection Data (DOL Enforcement)
+
+**Inspection search**:
+```
+GET https://enforcedata.dol.gov/api/enforcement?dataset=inspection&$filter={filter}&$top=25&$orderby=open_date desc
+```
+- **Auth**: None (free public API)
+- **Rate**: ~2 req/sec recommended
+- **Linking keys**: `activity_nr`, `estab_name`, `sic_code`, `naics_code`, `site_state`
+- **Script**: `fetch_osha.py`
+- **Filter syntax**: `estab_name eq 'Name' and site_state eq 'TX' and sic_code eq '2911'`
+
+## ProPublica Nonprofit Explorer (IRS 990)
+
+**Organization search**:
+```
+GET https://projects.propublica.org/nonprofits/api/v2/search.json?q={name}&state[id]={state}&ntee[id]={code}
+```
+- **Auth**: None (free public API)
+- **Rate**: ~1 req/sec recommended
+- **Linking keys**: `ein`, `name`, `ntee_code`, `state`
+- **Script**: `fetch_propublica990.py`
+
+**Direct EIN lookup** (includes filing data + officer compensation):
+```
+GET https://projects.propublica.org/nonprofits/api/v2/organizations/{ein}.json
+```
+
+## SAM.gov Entity Registration
+
+**Entity search**:
+```
+GET https://api.sam.gov/entity-information/v3/entities?api_key={key}&legalBusinessName={name}&registrationStatus=A
+```
+- **Auth**: `SAM_GOV_API_KEY` required (free at [api.data.gov](https://api.data.gov/signup/))
+- **Rate**: 1000 req/day
+- **Linking keys**: `ueiSAM` (Unique Entity ID), `cageCode`, `legalBusinessName`, `naicsCode`
+- **Script**: `fetch_sam.py`
+- **Sections**: `includeSections=entityRegistration,coreData`
+
+## Cross-Dataset Linking Strategy
+
+No universal corporate ID exists in US public records. Standard approach:
+
+1. **Normalize entity names** (strip legal suffixes, case fold, Unicode NFKD)
+2. **Fuzzy match** across datasets (`difflib.SequenceMatcher`, threshold ≥ 0.85)
+3. **Filter by jurisdiction** (state, address)
+4. **Anchor on hard IDs** when available: CIK (SEC), committee_id (FEC), EIN (IRS/ProPublica), UEI (SAM.gov), FRS ID (EPA), CAGE code (DoD)
+5. **Score confidence** using Admiralty tiers
+6. **Cross-link via NAICS/SIC** when name matching is ambiguous
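+
+Steps 1–2 can be sketched with the stdlib (the suffix list is illustrative and far shorter than a production resolver would use):
+
+```python
+import re
+import unicodedata
+from difflib import SequenceMatcher
+
+# Illustrative subset of legal suffixes; a real resolver needs a longer list
+SUFFIXES = r"\b(llc|inc|corp(oration)?|co|ltd|lp|llp)\b\.?"
+
+
+def normalize(name: str) -> str:
+    """Step 1: NFKD-fold to ASCII, case fold, strip legal suffixes and punctuation."""
+    s = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
+    s = re.sub(SUFFIXES, " ", s.lower())
+    s = re.sub(r"[^a-z0-9 ]+", " ", s)
+    return " ".join(s.split())
+
+
+def name_match(a: str, b: str, threshold: float = 0.85) -> tuple[bool, float]:
+    """Step 2: fuzzy-compare normalized names; 0.85 is the threshold above."""
+    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
+    return score >= threshold, round(score, 2)
+```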
+
+## Environment Variables
+
+| Variable | Used By | Required |
+|----------|---------|----------|
+| `FEC_API_KEY` | `scrape_records.py` | No (`DEMO_KEY` fallback) |
+| `EXA_API_KEY` | `web_enrich.py` | Yes (for Exa search) |
+| `CENSUS_API_KEY` | `fetch_census.py` | No (works without, lower rate) |
+| `SAM_GOV_API_KEY` | `fetch_sam.py` | Yes (free at api.data.gov) |
+| `ANTHROPIC_API_KEY` | `delegate_to_rlm.py` | If using Anthropic models |
+| `OPENAI_API_KEY` | `delegate_to_rlm.py` | If using OpenAI models |
+| `OPENROUTER_API_KEY` | `delegate_to_rlm.py` | If using OpenRouter models |
+| `CEREBRAS_API_KEY` | `delegate_to_rlm.py` | If using Cerebras models |
+| `OPENPLANTER_REPO` | `delegate_to_rlm.py` | No (auto-discovers local clone) |
diff --git a/skills/openplanter/scripts/confidence_scorer.py b/skills/openplanter/scripts/confidence_scorer.py
new file mode 100644
index 00000000..8790428c
--- /dev/null
+++ b/skills/openplanter/scripts/confidence_scorer.py
@@ -0,0 +1,368 @@
+#!/usr/bin/env python3
+"""Score investigation findings by confidence tier.
+
+Reads findings files in findings/ and evidence chains in evidence/,
+re-scores based on evidence chain strength, source diversity, and
+corroboration status. Updates confidence fields in-place.
+
+Scoring rules:
+ - 2+ independent sources with different collection paths → confirmed
+ - Strong single source (official record) or high match score → probable
+ - Circumstantial evidence only or moderate match → possible
+ - Insufficient data or contradictory evidence → unresolved
+
+Uses Python stdlib only — zero external dependencies.
+
+Usage:
+ python3 confidence_scorer.py /path/to/investigation
+ python3 confidence_scorer.py /path/to/investigation --dry-run
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import defaultdict
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+CONFIDENCE_TIERS = ["confirmed", "probable", "possible", "unresolved"]
+
+
+def score_entity(entity: dict[str, Any]) -> tuple[str, str]:
+ """Score an entity from canonical.json based on source diversity and variant count.
+
+ Returns (new_confidence, reason).
+ """
+ sources = entity.get("sources", [])
+ variants = entity.get("variants", [])
+
+ unique_sources = len(set(sources))
+ variant_count = len(variants)
+
+ # Check for hard signals: same identifier value across 2+ sources
+ has_hard_signal = False
+ has_hard_disqualifier = False
+ similarities = []
+ hard_signal_values: dict[str, list[tuple[str, str]]] = defaultdict(list)
+ for v in variants:
+ sim = v.get("similarity", 0)
+ if isinstance(sim, (int, float)):
+ similarities.append(sim)
+ fields = v.get("fields", {})
+ source = v.get("source", "")
+ for field_name in ("ein", "tin", "phone", "email"):
+ val = str(fields.get(field_name, "")).strip()
+ if val:
+ hard_signal_values[field_name].append((val, source))
+
+ for field_name, entries in hard_signal_values.items():
+ vals = [v for v, _ in entries]
+ srcs = [s for _, s in entries]
+ if len(set(vals)) == 1 and len(set(srcs)) >= 2:
+ has_hard_signal = True
+ elif len(set(vals)) > 1:
+ has_hard_disqualifier = True
+
+ avg_similarity = sum(similarities) / len(similarities) if similarities else 0
+ max_similarity = max(similarities) if similarities else 0
+
+ # Hard disqualifier overrides everything
+    if has_hard_disqualifier:
+        return "unresolved", "conflicting hard identifiers across variants"
+
+ # Scoring logic
+ if unique_sources >= 2 and (has_hard_signal or avg_similarity >= 0.90):
+ return "confirmed", (
+ f"{unique_sources} independent sources"
+ + (", hard signal match" if has_hard_signal else "")
+ + f", avg similarity {avg_similarity:.2f}"
+ )
+ elif unique_sources >= 2 and avg_similarity >= 0.85:
+ return "confirmed", (
+ f"{unique_sources} sources, avg similarity {avg_similarity:.2f}"
+ )
+ elif unique_sources >= 2 or (has_hard_signal and avg_similarity >= 0.70):
+ return "probable", (
+ f"{unique_sources} sources"
+ + (", hard signal" if has_hard_signal else "")
+ + f", avg similarity {avg_similarity:.2f}"
+ )
+ elif variant_count >= 2 and avg_similarity >= 0.70:
+ return "possible", (
+ f"{variant_count} variants, avg similarity {avg_similarity:.2f}"
+ )
+ elif variant_count >= 2:
+ return "possible", f"{variant_count} variants, low similarity {avg_similarity:.2f}"
+ else:
+ return "unresolved", f"single record, similarity {max_similarity:.2f}"
+
+
+def score_cross_reference(xref: dict[str, Any]) -> tuple[str, str]:
+ """Score a cross-reference based on dataset count and match quality."""
+ dataset_count = xref.get("dataset_count", 0)
+ pairs = xref.get("cross_reference_pairs", [])
+
+ if not pairs:
+ return "unresolved", "no cross-reference pairs"
+
+ # Analyze pair quality
+ exact_matches = 0
+ total_fields = 0
+ for pair in pairs:
+ mf = pair.get("matching_fields", {})
+ total_fields += len(mf)
+ exact_matches += pair.get("exact_match_count", 0)
+
+ match_rate = exact_matches / total_fields if total_fields > 0 else 0
+
+ if dataset_count >= 3 and match_rate >= 0.5:
+ return "confirmed", (
+ f"{dataset_count} datasets, {exact_matches}/{total_fields} "
+ f"exact field matches ({match_rate:.0%})"
+ )
+ elif dataset_count >= 2 and match_rate >= 0.3:
+ return "probable", (
+ f"{dataset_count} datasets, {match_rate:.0%} field match rate"
+ )
+ elif dataset_count >= 2:
+ return "possible", (
+ f"{dataset_count} datasets, low field match rate {match_rate:.0%}"
+ )
+ else:
+ return "unresolved", f"only {dataset_count} dataset(s)"
+
+
+def score_evidence_chain(chain: dict[str, Any]) -> tuple[str, str]:
+ """Score an evidence chain based on hop quality and corroboration."""
+ hops = chain.get("hops", [])
+ corr = chain.get("corroboration", {})
+
+ if not hops:
+ return "unresolved", "no evidence hops"
+
+ # Analyze hop quality
+ scores = []
+ for hop in hops:
+ s = hop.get("match_score")
+ if s is not None:
+ try:
+ scores.append(float(s))
+ except (ValueError, TypeError):
+ pass
+
+    min_score = min(scores) if scores else 0
+
+ # Check corroboration
+ corr_status = corr.get("status", "single")
+ independent_sources = corr.get("independent_sources", 1)
+ if isinstance(independent_sources, str):
+ try:
+ independent_sources = int(independent_sources)
+ except ValueError:
+ independent_sources = 1
+
+    # Contradicted evidence is disqualifying; otherwise link strength = weakest hop
+    if corr_status == "contradicted":
+        return "unresolved", "contradicted by other evidence"
+    if corr_status == "corroborated" and independent_sources >= 2 and min_score >= 0.85:
+        return "confirmed", (
+            f"corroborated by {independent_sources} sources, "
+            f"min hop score {min_score:.2f}"
+        )
+    elif (
+        corr_status == "corroborated"
+        or (independent_sources >= 2 and min_score >= 0.70)
+    ):
+        return "probable", (
+            f"{'corroborated' if corr_status == 'corroborated' else 'multi-source'}, "
+            f"min hop score {min_score:.2f}"
+        )
+    elif min_score >= 0.55 and len(hops) <= 3:
+        return "possible", (
+            f"single source chain, {len(hops)} hops, min score {min_score:.2f}"
+        )
+    else:
+        return "unresolved", f"weak chain, min score {min_score:.2f}"
+
+
+def process_file(
+ filepath: Path, dry_run: bool
+) -> tuple[dict[str, int], list[dict[str, Any]]]:
+ """Process a single JSON findings file. Returns (tier_counts, changes)."""
+ tier_counts: dict[str, int] = defaultdict(int)
+ changes: list[dict[str, Any]] = []
+
+ try:
+ data = json.loads(filepath.read_text(encoding="utf-8"))
+ except (json.JSONDecodeError, UnicodeDecodeError) as e:
+ print(f" Error reading {filepath.name}: {e}")
+ return dict(tier_counts), changes
+
+ modified = False
+
+ # Detect file type and score accordingly
+ if isinstance(data, dict):
+ # Canonical entity map
+ entities = data.get("entities", [])
+        if entities and "canonical_name" in entities[0]:
+ for entity in entities:
+ old = entity.get("confidence", "unresolved")
+ new, reason = score_entity(entity)
+ tier_counts[new] += 1
+ if old != new:
+ changes.append(
+ {
+ "id": entity.get("canonical_id", "?"),
+ "name": entity.get("canonical_name", "?"),
+ "old": old,
+ "new": new,
+ "reason": reason,
+ }
+ )
+ entity["confidence"] = new
+ entity["confidence_basis"] = reason
+ modified = True
+
+ # Cross-references
+ xrefs = data.get("cross_references", [])
+ for xref in xrefs:
+ old = xref.get("confidence", "unresolved")
+ new, reason = score_cross_reference(xref)
+ tier_counts[new] += 1
+ if old != new:
+ changes.append(
+ {
+ "id": xref.get("entity_id", "?"),
+ "name": xref.get("entity_name", "?"),
+ "old": old,
+ "new": new,
+ "reason": reason,
+ }
+ )
+ xref["confidence"] = new
+ xref["confidence_basis"] = reason
+ modified = True
+
+ # Evidence chains
+ chains = data.get("chains", data.get("evidence_chains", []))
+ for chain in chains:
+ old = chain.get("confidence", "unresolved")
+ new, reason = score_evidence_chain(chain)
+ tier_counts[new] += 1
+ if old != new:
+ changes.append(
+ {
+ "id": chain.get("chain_id", "?"),
+ "name": chain.get("claim", "?")[:60],
+ "old": old,
+ "new": new,
+ "reason": reason,
+ }
+ )
+ chain["confidence"] = new
+ chain["confidence_basis"] = reason
+ modified = True
+
+ # Write back if modified and not dry-run
+ if modified and not dry_run:
+ filepath.write_text(
+ json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
+ )
+
+ return dict(tier_counts), changes
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Re-score investigation findings by confidence tier"
+ )
+ parser.add_argument("workspace", type=Path, help="Path to investigation workspace")
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would change without modifying files",
+ )
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+
+ # Find all JSON files in entities/, findings/, evidence/
+ search_dirs = [
+ workspace / "entities",
+ workspace / "findings",
+ workspace / "evidence",
+ ]
+ json_files: list[Path] = []
+ for d in search_dirs:
+ if d.exists():
+ json_files.extend(d.glob("*.json"))
+
+ if not json_files:
+ print("No JSON files found in entities/, findings/, or evidence/")
+ sys.exit(0)
+
+ total_tiers: dict[str, int] = defaultdict(int)
+ all_changes: list[dict[str, Any]] = []
+
+ mode = "DRY RUN" if args.dry_run else "SCORING"
+ print(f"Confidence scoring ({mode}):\n")
+
+ for fp in sorted(json_files):
+ tiers, changes = process_file(fp, args.dry_run)
+ for tier, count in tiers.items():
+ total_tiers[tier] += count
+ all_changes.extend(changes)
+
+ status = f"{sum(tiers.values())} items"
+ if changes:
+ status += f", {len(changes)} re-scored"
+ print(f" {fp.relative_to(workspace)}: {status}")
+
+ # Report changes
+ if all_changes:
+ print(f"\nConfidence changes ({len(all_changes)}):")
+ for c in all_changes:
+ arrow = f"{c['old']} -> {c['new']}"
+ print(f" [{arrow:30s}] {c['name']}")
+ print(f" Reason: {c['reason']}")
+ else:
+ print("\nNo confidence changes needed.")
+
+ # Summary
+    print("\nConfidence breakdown:")
+ for tier in CONFIDENCE_TIERS:
+ count = total_tiers.get(tier, 0)
+ print(f" {tier}: {count}")
+
+ total = sum(total_tiers.values())
+ if total > 0:
+ confirmed = total_tiers.get("confirmed", 0)
+ print(f"\n Total items: {total}")
+ print(f" Confirmed rate: {confirmed/total:.0%}")
+
+ # Write scoring log
+ if not args.dry_run:
+ now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+ log = {
+ "metadata": {
+ "scored_at": now,
+ "workspace": str(workspace),
+ "files_processed": len(json_files),
+ },
+ "summary": dict(total_tiers),
+ "changes": all_changes,
+ }
+ log_dir = workspace / "evidence"
+ log_dir.mkdir(exist_ok=True)
+ log_path = log_dir / "scoring-log.json"
+ log_path.write_text(json.dumps(log, indent=2), encoding="utf-8")
+ print(f"\n Scoring log: {log_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/cross_reference.py b/skills/openplanter/scripts/cross_reference.py
new file mode 100644
index 00000000..eb10a394
--- /dev/null
+++ b/skills/openplanter/scripts/cross_reference.py
@@ -0,0 +1,308 @@
+#!/usr/bin/env python3
+"""Cross-dataset record linking.
+
+Loads the canonical entity map from entities/canonical.json and finds
+records across datasets that share canonical entities. Outputs a
+cross-reference report to findings/cross-references.json.
+
+Uses Python stdlib only — zero external dependencies.
+
+Usage:
+ python3 cross_reference.py /path/to/investigation
+ python3 cross_reference.py /path/to/investigation --datasets campaign.csv lobby.json
+ python3 cross_reference.py /path/to/investigation --min-datasets 2
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import sys
+from collections import defaultdict
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+
+def load_canonical_map(workspace: Path) -> dict[str, Any]:
+ """Load the canonical entity map."""
+ path = workspace / "entities" / "canonical.json"
+ if not path.exists():
+ print(f"Error: {path} not found. Run entity_resolver.py first.", file=sys.stderr)
+ sys.exit(1)
+ return json.loads(path.read_text(encoding="utf-8"))
+
+
+def load_dataset_records(filepath: Path) -> list[dict[str, str]]:
+ """Load all records from a CSV or JSON file."""
+ if filepath.suffix == ".csv":
+ return _load_csv_records(filepath)
+ elif filepath.suffix == ".json":
+ return _load_json_records(filepath)
+ return []
+
+
+def _load_csv_records(filepath: Path) -> list[dict[str, str]]:
+    for encoding in ("utf-8-sig", "utf-8", "latin-1"):
+        records: list[dict[str, str]] = []
+        try:
+            with open(filepath, newline="", encoding=encoding) as f:
+                sample = f.read(4096)
+                f.seek(0)
+                try:
+                    dialect = csv.Sniffer().sniff(sample[:2048], delimiters=",\t|;")
+                except csv.Error:
+                    dialect = csv.excel
+                reader = csv.DictReader(f, dialect=dialect)
+                for row_num, row in enumerate(reader, start=2):
+                    rec = {k: (v or "").strip() for k, v in row.items()}
+                    rec["__source_file"] = filepath.name
+                    rec["__source_location"] = f"row:{row_num}"
+                    records.append(rec)
+            return records
+        except UnicodeDecodeError:
+            # Discard partially decoded rows and retry with the next encoding
+            continue
+    print(f"  Error: Cannot decode {filepath.name}")
+    return []
+
+
+def _load_json_records(filepath: Path) -> list[dict[str, str]]:
+ records = []
+ try:
+ data = json.loads(filepath.read_text(encoding="utf-8"))
+ except (json.JSONDecodeError, UnicodeDecodeError):
+ return records
+
+ items: list[dict] = []
+ path_prefix = "$"
+ if isinstance(data, list):
+ items = [d for d in data if isinstance(d, dict)]
+ elif isinstance(data, dict):
+ for key in ("data", "results", "records", "items", "registrants", "filings"):
+ if key in data and isinstance(data[key], list):
+ items = [d for d in data[key] if isinstance(d, dict)]
+ path_prefix = f"$.{key}"
+ break
+ else:
+ items = [data]
+
+ for idx, item in enumerate(items):
+ rec = {k: str(v).strip() for k, v in item.items() if v is not None}
+ rec["__source_file"] = filepath.name
+ rec["__source_location"] = f"{path_prefix}[{idx}]"
+ records.append(rec)
+ return records
+
+
+def find_cross_references(
+ canonical_map: dict[str, Any],
+ all_records: dict[str, list[dict[str, str]]],
+ min_datasets: int = 2,
+ dataset_filter: list[str] | None = None,
+) -> list[dict[str, Any]]:
+ """Find records across datasets that share canonical entities."""
+ entities = canonical_map.get("entities", [])
+ cross_refs: list[dict[str, Any]] = []
+
+ for entity in entities:
+ canonical_name = entity["canonical_name"]
+ canonical_id = entity["canonical_id"]
+ variants = entity.get("variants", [])
+ sources = entity.get("sources", [])
+
+ if dataset_filter:
+ sources = [s for s in sources if s in dataset_filter]
+
+ if len(sources) < min_datasets:
+ continue
+
+ # Collect matching records from each dataset
+ dataset_records: dict[str, list[dict[str, Any]]] = defaultdict(list)
+ for variant in variants:
+ src = variant["source"]
+ if dataset_filter and src not in dataset_filter:
+ continue
+
+ # Find the actual record in the loaded data
+ loc = variant.get("location", "")
+ for rec in all_records.get(src, []):
+ if rec.get("__source_location") == loc:
+ dataset_records[src].append(
+ {
+ "variant_name": variant["name"],
+ "location": loc,
+ "similarity": variant.get("similarity", 0),
+ "fields": {
+ k: v
+ for k, v in rec.items()
+ if not k.startswith("__") and v
+ },
+ }
+ )
+ break
+
+ if len(dataset_records) >= min_datasets:
+ # Build cross-reference entry
+ pairs: list[dict[str, Any]] = []
+ ds_list = sorted(dataset_records.keys())
+ for i in range(len(ds_list)):
+ for j in range(i + 1, len(ds_list)):
+ ds_a, ds_b = ds_list[i], ds_list[j]
+ for rec_a in dataset_records[ds_a]:
+ for rec_b in dataset_records[ds_b]:
+ # Find common fields
+ common_fields = set(rec_a["fields"].keys()) & set(
+ rec_b["fields"].keys()
+ )
+ matching_fields = {
+ f: {
+ "a": rec_a["fields"][f],
+ "b": rec_b["fields"][f],
+ "match": rec_a["fields"][f].lower().strip()
+ == rec_b["fields"][f].lower().strip(),
+ }
+ for f in common_fields
+ if not f.startswith("__")
+ }
+ pairs.append(
+ {
+ "dataset_a": ds_a,
+ "record_a": rec_a,
+ "dataset_b": ds_b,
+ "record_b": rec_b,
+ "matching_fields": matching_fields,
+ "common_field_count": len(matching_fields),
+ "exact_match_count": sum(
+ 1
+ for v in matching_fields.values()
+ if v["match"]
+ ),
+ }
+ )
+
+ cross_refs.append(
+ {
+ "entity_id": canonical_id,
+ "entity_name": canonical_name,
+ "datasets": ds_list,
+ "dataset_count": len(ds_list),
+ "records_by_dataset": {
+ ds: len(recs) for ds, recs in dataset_records.items()
+ },
+ "cross_reference_pairs": pairs,
+ "confidence": entity.get("confidence", "unresolved"),
+ }
+ )
+
+ return cross_refs
+
+
+def generate_summary(
+ cross_refs: list[dict[str, Any]], datasets: list[str]
+) -> dict[str, Any]:
+ """Generate summary statistics."""
+ by_confidence = defaultdict(int)
+ by_dataset_pair: dict[str, int] = defaultdict(int)
+
+ for xref in cross_refs:
+ by_confidence[xref["confidence"]] += 1
+ for pair in xref.get("cross_reference_pairs", []):
+ key = f"{pair['dataset_a']} <-> {pair['dataset_b']}"
+ by_dataset_pair[key] += 1
+
+ return {
+ "total_cross_references": len(cross_refs),
+ "by_confidence": dict(by_confidence),
+ "by_dataset_pair": dict(by_dataset_pair),
+ "datasets_analyzed": datasets,
+ "entities_in_all_datasets": sum(
+ 1 for x in cross_refs if x["dataset_count"] == len(datasets)
+ ),
+ }
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Cross-reference records across datasets using canonical entity map"
+ )
+ parser.add_argument("workspace", type=Path, help="Path to investigation workspace")
+ parser.add_argument(
+ "--datasets",
+ type=str,
+ nargs="*",
+ default=None,
+ help="Specific dataset filenames to cross-reference (default: all)",
+ )
+ parser.add_argument(
+ "--min-datasets",
+ type=int,
+ default=2,
+ help="Minimum number of datasets an entity must appear in (default: 2)",
+ )
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+ datasets_dir = workspace / "datasets"
+
+ # Load canonical map
+ print("Loading canonical entity map...")
+ canonical_map = load_canonical_map(workspace)
+ entity_count = len(canonical_map.get("entities", []))
+ print(f" {entity_count} canonical entities loaded")
+
+ # Load dataset records
+ data_files = list(datasets_dir.glob("*.csv")) + list(datasets_dir.glob("*.json"))
+ if args.datasets:
+ data_files = [f for f in data_files if f.name in args.datasets]
+
+ all_records: dict[str, list[dict[str, str]]] = {}
+ print("Loading dataset records:")
+ for fp in sorted(data_files):
+ recs = load_dataset_records(fp)
+ all_records[fp.name] = recs
+ print(f" {fp.name}: {len(recs)} records")
+
+ # Find cross-references
+ print(f"\nCross-referencing (min {args.min_datasets} datasets)...")
+ cross_refs = find_cross_references(
+ canonical_map, all_records, args.min_datasets, args.datasets
+ )
+
+ # Generate summary
+ summary = generate_summary(cross_refs, [f.name for f in data_files])
+
+ # Write output
+ now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+ output = {
+ "metadata": {
+ "created": now,
+ "workspace": str(workspace),
+ "min_datasets": args.min_datasets,
+ "dataset_filter": args.datasets,
+ },
+ "summary": summary,
+ "cross_references": cross_refs,
+ }
+
+ findings_dir = workspace / "findings"
+ findings_dir.mkdir(exist_ok=True)
+ out_path = findings_dir / "cross-references.json"
+ out_path.write_text(json.dumps(output, indent=2, ensure_ascii=False), encoding="utf-8")
+
+ print(f"\nCross-reference report: {out_path}")
+ print(f" Total cross-references: {summary['total_cross_references']}")
+ print(f" Entities in all datasets: {summary['entities_in_all_datasets']}")
+ if summary["by_confidence"]:
+ print(" By confidence:")
+ for tier, count in sorted(summary["by_confidence"].items()):
+ print(f" {tier}: {count}")
+ if summary["by_dataset_pair"]:
+ print(" By dataset pair:")
+ for pair, count in sorted(
+ summary["by_dataset_pair"].items(), key=lambda x: -x[1]
+ ):
+ print(f" {pair}: {count}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/dataset_fetcher.py b/skills/openplanter/scripts/dataset_fetcher.py
new file mode 100644
index 00000000..72300be5
--- /dev/null
+++ b/skills/openplanter/scripts/dataset_fetcher.py
@@ -0,0 +1,261 @@
+#!/usr/bin/env python3
+"""Download common public datasets for OSINT investigations.
+
+Fetches bulk datasets from US government portals and open data sources
+into a workspace's datasets/bulk/ directory with provenance metadata.
+
+Uses Python stdlib only — zero external dependencies.
+
+Supported sources:
+ sec — SEC EDGAR company tickers (CIK lookup table)
+ fec — FEC committee master file (current cycle)
+ ofac — OFAC SDN list (Treasury sanctions)
+ sanctions — OpenSanctions simplified sanctions CSV
+ lda — Senate LDA registrant list (current year)
+
+Usage:
+ python3 dataset_fetcher.py /path/to/investigation --sources sec,fec,ofac
+ python3 dataset_fetcher.py /path/to/investigation --sources all
+ python3 dataset_fetcher.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import os
+import sys
+import time
+import urllib.error
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+# ---------------------------------------------------------------------------
+# Source registry — each entry defines a downloadable dataset
+# ---------------------------------------------------------------------------
+
+SOURCES: dict[str, dict] = {
+ "sec": {
+ "name": "SEC EDGAR Company Tickers",
+ "description": "Ticker-to-CIK mapping for all SEC-registered entities",
+ "url": "https://www.sec.gov/files/company_tickers.json",
+ "filename": "company_tickers.json",
+ "format": "json",
+ "linking_keys": ["cik", "ticker"],
+ "headers": {"User-Agent": "OpenPlanter/1.0 openplanter@investigation.local"},
+ },
+ "fec": {
+ "name": "FEC Committee Master File",
+ "description": "All registered political committees (current election cycle)",
+ "url": "https://www.fec.gov/files/bulk-downloads/2024/committee_master_2024.csv",
+ "filename": "committee_master.csv",
+ "format": "csv",
+ "linking_keys": ["committee_id", "committee_name", "treasurer_name"],
+ "headers": {},
+ },
+ "ofac": {
+ "name": "OFAC SDN List",
+ "description": "Treasury Specially Designated Nationals and Blocked Persons",
+ "url": "https://www.treasury.gov/ofac/downloads/sdn.csv",
+ "filename": "sdn.csv",
+ "format": "csv",
+ "linking_keys": ["uid", "name", "sdnType", "programs"],
+ "headers": {},
+ },
+ "sanctions": {
+ "name": "OpenSanctions Simplified",
+ "description": "Consolidated sanctions targets (non-commercial use)",
+ "url": "https://data.opensanctions.org/datasets/latest/sanctions/targets.simple.csv",
+ "filename": "sanctions_targets.csv",
+ "format": "csv",
+ "linking_keys": ["id", "name", "countries", "identifiers"],
+ "headers": {},
+ },
+ "lda": {
+ "name": "Senate LDA Registrants",
+ "description": "Lobbying registrants from Senate Lobbying Disclosure Act filings",
+ "url": "https://lda.senate.gov/api/v1/registrants/?format=json&page_size=100",
+ "filename": "lda_registrants.json",
+ "format": "json",
+ "linking_keys": ["id", "name", "house_registrant_id"],
+ "headers": {},
+ "paginated": True,
+ },
+}
+
+
+def _sha256(data: bytes) -> str:
+ return hashlib.sha256(data).hexdigest()
+
+
+def _download(url: str, headers: dict[str, str], timeout: int = 60) -> bytes:
+ """Download a URL and return raw bytes."""
+ req = urllib.request.Request(url, headers=headers)
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ return resp.read()
+
+
+def _download_paginated(
+ base_url: str, headers: dict[str, str], timeout: int = 60, max_pages: int = 50
+) -> list[dict]:
+ """Download paginated JSON API (Senate LDA style: {next, results})."""
+ all_results: list[dict] = []
+ url: str | None = base_url
+ page = 0
+
+ while url and page < max_pages:
+ req = urllib.request.Request(url, headers=headers)
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+
+ results = data.get("results", [])
+ all_results.extend(results)
+ url = data.get("next")
+ page += 1
+
+ if url:
+ time.sleep(0.5) # polite rate limiting
+
+ if url:
+ print(f" WARNING: Truncated at {max_pages} pages; more data available")
+
+ return all_results
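The `{next, results}` accumulation loop above can be exercised offline by swapping the HTTP request for a dict lookup. A minimal sketch with fabricated page names and payloads:

```python
# Simulated three-page API in the Senate LDA style: each page carries
# "results" plus a "next" key (None on the last page).
pages = {
    "p1": {"results": [{"id": 1}, {"id": 2}], "next": "p2"},
    "p2": {"results": [{"id": 3}], "next": "p3"},
    "p3": {"results": [{"id": 4}], "next": None},
}

def fetch_all(start: str, max_pages: int = 50) -> list[dict]:
    """Accumulate results across pages, bounded by max_pages."""
    out: list[dict] = []
    url, page = start, 0
    while url and page < max_pages:
        data = pages[url]  # stand-in for the HTTP request
        out.extend(data["results"])
        url = data["next"]
        page += 1
    return out

records = fetch_all("p1")                 # follows "next" to the end
truncated = fetch_all("p1", max_pages=2)  # stops early, like the warning path
```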
+
+
+def fetch_source(
+ source_id: str,
+ workspace: Path,
+ timeout: int = 120,
+ dry_run: bool = False,
+) -> dict:
+ """Fetch a single source and write to datasets/bulk/{source_id}/."""
+ spec = SOURCES[source_id]
+ dest_dir = workspace / "datasets" / "bulk" / source_id
+ dest_dir.mkdir(parents=True, exist_ok=True)
+
+ dest_file = dest_dir / spec["filename"]
+ provenance_file = dest_dir / "provenance.json"
+
+ if dry_run:
+ print(f" [dry-run] Would download: {spec['url']}")
+ print(f" [dry-run] Destination: {dest_file}")
+ return {"source": source_id, "status": "dry-run"}
+
+ print(f" Downloading {spec['name']}...")
+ print(f" URL: {spec['url']}")
+
+ now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+
+ try:
+ if spec.get("paginated"):
+ results = _download_paginated(
+ spec["url"], spec.get("headers", {}), timeout=timeout
+ )
+ data = json.dumps(results, indent=2).encode("utf-8")
+ else:
+ data = _download(spec["url"], spec.get("headers", {}), timeout=timeout)
+
+ dest_file.write_bytes(data)
+
+ provenance = {
+ "source_id": source_id,
+ "name": spec["name"],
+ "description": spec["description"],
+ "url": spec["url"],
+ "format": spec["format"],
+ "linking_keys": spec["linking_keys"],
+ "download_timestamp": now,
+ "file": spec["filename"],
+ "size_bytes": len(data),
+ "sha256": _sha256(data),
+ }
+ provenance_file.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ print(f" Saved: {dest_file} ({len(data):,} bytes)")
+ return {"source": source_id, "status": "ok", "size": len(data)}
+
+ except urllib.error.URLError as e:
+ msg = f"Download failed: {e}"
+ print(f" ERROR: {msg}", file=sys.stderr)
+ return {"source": source_id, "status": "error", "error": msg}
+ except Exception as e:
+ msg = f"Unexpected error: {e}"
+ print(f" ERROR: {msg}", file=sys.stderr)
+ return {"source": source_id, "status": "error", "error": msg}
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Download public datasets for OSINT investigations"
+ )
+ parser.add_argument("workspace", type=Path, help="Path to investigation workspace")
+ parser.add_argument(
+ "--sources",
+ type=str,
+ default="all",
+ help="Comma-separated source IDs, or 'all' (default: all)",
+ )
+ parser.add_argument(
+ "--list",
+ action="store_true",
+ dest="list_sources",
+ help="List available sources and exit",
+ )
+ parser.add_argument(
+ "--dry-run",
+ action="store_true",
+ help="Show what would be downloaded without fetching",
+ )
+ parser.add_argument(
+ "--timeout",
+ type=int,
+ default=120,
+ help="HTTP timeout in seconds (default: 120)",
+ )
+ args = parser.parse_args()
+
+ if args.list_sources:
+ print("Available sources:\n")
+ for sid, spec in SOURCES.items():
+ print(f" {sid:12s} {spec['name']}")
+ print(f" {'':<12s} {spec['description']}")
+ print(f" {'':<12s} Format: {spec['format']} Keys: {', '.join(spec['linking_keys'])}")
+ print()
+ return
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"Error: workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ if args.sources == "all":
+ source_ids = list(SOURCES.keys())
+ else:
+ source_ids = [s.strip() for s in args.sources.split(",")]
+ unknown = [s for s in source_ids if s not in SOURCES]
+ if unknown:
+ print(
+ f"Error: unknown source(s): {', '.join(unknown)}\n"
+ f"Available: {', '.join(SOURCES.keys())}",
+ file=sys.stderr,
+ )
+ sys.exit(1)
+
+ print(f"Fetching {len(source_ids)} source(s) into {workspace / 'datasets' / 'bulk'}/\n")
+ results = []
+ for sid in source_ids:
+ result = fetch_source(sid, workspace, timeout=args.timeout, dry_run=args.dry_run)
+ results.append(result)
+ print()
+
+ ok = sum(1 for r in results if r["status"] == "ok")
+ errs = sum(1 for r in results if r["status"] == "error")
+ print(f"Done: {ok} succeeded, {errs} failed out of {len(results)} source(s)")
+
+ if errs:
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
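The provenance record makes each download auditable after the fact. A minimal verification sketch (fabricated payload, hypothetical file name) that re-hashes a file on disk against its recorded `sha256` and `size_bytes`:

```python
import hashlib
import tempfile
from pathlib import Path

# Fabricated payload standing in for a downloaded dataset file.
data = b'{"ticker": "ACME", "cik": 1234}'

with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "company_tickers.json"
    dest.write_bytes(data)
    # Subset of the provenance fields written by fetch_source().
    provenance = {
        "file": dest.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    # An auditor re-hashes the file and compares against the record.
    on_disk = dest.read_bytes()
    verified = (
        hashlib.sha256(on_disk).hexdigest() == provenance["sha256"]
        and len(on_disk) == provenance["size_bytes"]
    )
```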
diff --git a/skills/openplanter/scripts/delegate_to_rlm.py b/skills/openplanter/scripts/delegate_to_rlm.py
new file mode 100644
index 00000000..a4126497
--- /dev/null
+++ b/skills/openplanter/scripts/delegate_to_rlm.py
@@ -0,0 +1,567 @@
+#!/usr/bin/env python3
+"""Bridge script: delegate an investigation to the OpenPlanter RLM agent.
+
+Spawns the full OpenPlanter recursive language model agent as a subprocess
+for complex investigations that exceed what the skill scripts can handle.
+The RLM agent uses tiered model delegation to minimize cost while maintaining
+investigation quality. Provider-agnostic: works with any LLM provider the
+agent supports (Anthropic, OpenAI, OpenRouter, Cerebras, Ollama) — auto-
+inferred from the model name or set explicitly.
+
+Use the skill scripts when the job fits in one pass: one or two datasets,
+entity resolution plus cross-referencing, no web research. Delegate to the
+RLM when it needs 3+ datasets, web search, iterative exploration, or 20+
+reasoning steps.
+
+Returns JSON to stdout with investigation results and session metadata.
+
+Uses Python stdlib only — zero external dependencies.
+
+Provider auto-detection examples:
+ claude-* → anthropic
+ gpt-*, o1-*, o3-* → openai
+  org/model → openrouter (slash = OpenRouter routing)
+ llama*cerebras → cerebras
+ llama3*, qwen* → ollama (local inference)
+
+API keys pass through environment variables (checked by the agent):
+ ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, CEREBRAS_API_KEY
+ (or OPENPLANTER_-prefixed variants)
+ Ollama requires no API key (local).
+
+Supports session management:
+ --resume SESSION_ID Resume a saved investigation session
+ --list-sessions List all saved sessions in workspace
+ --list-models Show available models for the current provider
+ --reasoning-effort Control reasoning depth (low/medium/high)
+
+Usage:
+ python3 delegate_to_rlm.py --objective "Cross-reference campaign finance
+ with lobbying disclosures" --workspace /path/to/investigation
+ python3 delegate_to_rlm.py --objective "..." --workspace DIR --model claude-sonnet-4-5-20250929
+ python3 delegate_to_rlm.py --objective "..." --workspace DIR --model gpt-4o --provider openai
+ python3 delegate_to_rlm.py --objective "..." --workspace DIR --model anthropic/claude-sonnet-4-5
+ python3 delegate_to_rlm.py --objective "..." --workspace DIR --max-steps 30 --timeout 300
+ python3 delegate_to_rlm.py --resume abc123 --workspace DIR
+ python3 delegate_to_rlm.py --list-sessions --workspace DIR
+ python3 delegate_to_rlm.py --list-models --provider ollama
+ python3 delegate_to_rlm.py --objective "..." --workspace DIR --provider ollama --model llama3
+ python3 delegate_to_rlm.py --objective "..." --workspace DIR --reasoning-effort high
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import shutil
+import subprocess
+import sys
+import time
+from pathlib import Path
+
+# Repo directory found by find_agent_command(), used as cwd for -m agent
+_AGENT_REPO: Path | None = None
+
+
+def find_agent_command() -> list[str] | None:
+ """Locate the openplanter-agent command.
+
+ Checks in order:
+    1. openplanter-agent on PATH (console script from pip install -e .)
+ 2. python -m agent from known repo locations (local clone)
+
+ When using ``-m agent``, sets ``_AGENT_REPO`` so subprocess.run
+ can pass the correct ``cwd``.
+
+    Local clone discovery covers common checkout locations plus the
+    OPENPLANTER_REPO environment override.
+ """
+ global _AGENT_REPO
+
+    # Check PATH first (console script installed by pip install -e .)
+ if shutil.which("openplanter-agent"):
+ _AGENT_REPO = None
+ return ["openplanter-agent"]
+
+ # Check common repo locations — local clone discovery
+ candidates = [
+ Path.home() / "Desktop" / "Programming" / "OpenPlanter",
+ Path.home() / "OpenPlanter",
+ Path.home() / "src" / "OpenPlanter",
+ Path.home() / "projects" / "OpenPlanter",
+ Path.cwd().parent,
+ Path.cwd().parent.parent,
+ ]
+ # Also check OPENPLANTER_REPO env var for explicit override
+ env_repo = os.environ.get("OPENPLANTER_REPO")
+ if env_repo:
+ candidates.insert(0, Path(env_repo))
+
+ for repo in candidates:
+ if (repo / "agent" / "__main__.py").exists():
+ _AGENT_REPO = repo
+ return [sys.executable, "-m", "agent"]
+
+ return None
+
+
+def _infer_provider(model: str) -> str:
+ """Infer the LLM provider from the model name.
+
+ Mirrors the logic in agent/builder.py:infer_provider_for_model() so we
+ can set a sensible default without importing the agent package.
+ """
+ if "/" in model:
+ return "openrouter"
+ lower = model.lower()
+ if lower.startswith("claude"):
+ return "anthropic"
+ if any(lower.startswith(p) for p in ("gpt", "o1-", "o1", "o3-", "o3", "o4-", "o4", "chatgpt")):
+ return "openai"
+ if "cerebras" in lower:
+ return "cerebras"
+ # Local models typically served via Ollama
+ if any(lower.startswith(p) for p in ("llama", "qwen", "mistral", "gemma", "phi", "deepseek")):
+ return "ollama"
+ # Fall back to auto — let the agent figure it out
+ return "auto"
+
+
+def build_command(
+ objective: str,
+ workspace: str,
+ model: str = "claude-sonnet-4-5-20250929",
+ provider: str = "auto",
+ max_steps: int = 50,
+ max_depth: int = 3,
+ recursive: bool = True,
+ acceptance_criteria: bool = True,
+ reasoning_effort: str | None = None,
+ resume_session: str | None = None,
+) -> list[str]:
+ """Build the openplanter-agent CLI command.
+
+ Provider is auto-inferred from the model name if set to "auto".
+ Supports any model/provider combination the OpenPlanter agent supports:
+ Anthropic (claude-*), OpenAI (gpt-*, o1-*, o3-*), OpenRouter (org/model),
+    Cerebras (*cerebras* in the name), Ollama (llama*, qwen*, other local models).
+ """
+ agent_cmd = find_agent_command()
+ if not agent_cmd:
+ raise RuntimeError(
+ "openplanter-agent not found. Install with: "
+ "pip install -e /path/to/OpenPlanter\n"
+ "Or set OPENPLANTER_REPO=/path/to/repo"
+ )
+
+ resolved_provider = provider if provider != "auto" else _infer_provider(model)
+
+ # Resume mode: simpler command, just session ID + workspace
+ if resume_session:
+ cmd = [
+ *agent_cmd,
+ "--resume", resume_session,
+ "--workspace", workspace,
+ "--headless",
+ ]
+ if resolved_provider != "auto":
+ cmd.extend(["--provider", resolved_provider])
+ return cmd
+
+ cmd = [
+ *agent_cmd,
+ "--task", objective,
+ "--workspace", workspace,
+ "--model", model,
+ "--max-steps", str(max_steps),
+ "--max-depth", str(max_depth),
+ "--headless",
+ ]
+ # Only pass --provider if we resolved to a specific one
+ if resolved_provider != "auto":
+ cmd.extend(["--provider", resolved_provider])
+ if recursive:
+ cmd.append("--recursive")
+ if acceptance_criteria:
+ cmd.append("--acceptance-criteria")
+ if reasoning_effort:
+ cmd.extend(["--reasoning-effort", reasoning_effort])
+ return cmd
+
+
+def parse_output(stdout: str) -> tuple[str, list[str]]:
+ """Separate trace lines from the final answer."""
+ lines = stdout.strip().split("\n")
+ trace_lines = []
+ answer_lines = []
+ in_answer = False
+
+ for line in lines:
+ if line.startswith("trace>"):
+ trace_lines.append(line)
+ in_answer = False
+ elif line.startswith(" ") and not in_answer and not trace_lines:
+ # Startup info lines (Provider, Model, etc.)
+ continue
+ else:
+ answer_lines.append(line)
+ in_answer = True
+
+ answer = "\n".join(answer_lines).strip()
+ return answer, trace_lines
+
+
+def collect_session_artifacts(workspace: Path) -> dict:
+ """Find the most recent session and collect its metadata."""
+ session_dir = workspace / ".openplanter" / "sessions"
+ if not session_dir.exists():
+ return {}
+
+ sessions = sorted(
+ [d for d in session_dir.iterdir() if d.is_dir()],
+ key=lambda p: p.stat().st_mtime,
+ reverse=True,
+ )
+ if not sessions:
+ return {}
+
+ latest = sessions[0]
+ artifacts: dict = {
+ "session_id": latest.name,
+ "session_path": str(latest),
+ }
+
+ # Read metadata
+ meta_path = latest / "metadata.json"
+ if meta_path.exists():
+ try:
+ artifacts["metadata"] = json.loads(meta_path.read_text(encoding="utf-8"))
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # List artifact files
+ art_dir = latest / "artifacts"
+ if art_dir.exists():
+ artifacts["artifact_files"] = [
+ str(p.relative_to(latest))
+ for p in art_dir.rglob("*")
+ if p.is_file()
+ ]
+
+ # Read investigation plan if exists
+ plans = sorted(
+ latest.glob("*.plan.md"),
+ key=lambda p: p.stat().st_mtime,
+ reverse=True,
+ )
+ if plans:
+ try:
+ artifacts["plan"] = plans[0].read_text(encoding="utf-8")[:5000]
+ except OSError:
+ pass
+
+ return artifacts
+
+
+def collect_output_files(workspace: Path) -> list[str]:
+ """List investigation output files in the workspace."""
+ output_files = []
+ for subdir in ["findings", "entities", "evidence"]:
+ d = workspace / subdir
+ if d.exists():
+ for f in d.iterdir():
+ if f.is_file() and f.name != ".gitkeep":
+ output_files.append(str(f.relative_to(workspace)))
+ return output_files
+
+
+def list_sessions(workspace: str) -> list[dict]:
+ """List saved investigation sessions in the workspace."""
+ ws = Path(workspace).resolve()
+ session_dir = ws / ".openplanter" / "sessions"
+ if not session_dir.exists():
+ return []
+
+ sessions = []
+ for d in sorted(session_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True):
+ if not d.is_dir():
+ continue
+ entry: dict = {
+ "session_id": d.name,
+ "created": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(d.stat().st_ctime)),
+ "modified": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(d.stat().st_mtime)),
+ }
+ meta_path = d / "metadata.json"
+ if meta_path.exists():
+ try:
+ meta = json.loads(meta_path.read_text(encoding="utf-8"))
+ entry["objective"] = meta.get("objective", "")
+ entry["status"] = meta.get("status", "")
+ entry["model"] = meta.get("model", "")
+ entry["steps_taken"] = meta.get("steps_taken", 0)
+ except (json.JSONDecodeError, OSError):
+ pass
+ sessions.append(entry)
+ return sessions
+
+
+def list_models(provider: str = "ollama", timeout: int = 10) -> list[dict]:
+ """List available models for a provider.
+
+ Currently supports Ollama (local) by querying its API.
+ For cloud providers, returns a curated list of recommended models.
+ """
+ if provider == "ollama":
+ import urllib.error
+ import urllib.request
+ ollama_url = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
+ try:
+ req = urllib.request.Request(
+ f"{ollama_url}/api/tags",
+ headers={"Accept": "application/json"},
+ )
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+ return [
+ {"name": m.get("name", ""), "size": m.get("size", 0)}
+ for m in data.get("models", [])
+ ]
+ except (urllib.error.URLError, OSError):
+ return [{"error": f"Ollama not reachable at {ollama_url}"}]
+
+ # Curated recommendations for cloud providers
+ recommendations = {
+ "anthropic": [
+ {"name": "claude-sonnet-4-5-20250929", "note": "Default, best cost/quality"},
+ {"name": "claude-opus-4-6", "note": "Maximum capability"},
+ {"name": "claude-haiku-4-5-20251001", "note": "Fast, low cost"},
+ ],
+ "openai": [
+ {"name": "gpt-4o", "note": "Default"},
+ {"name": "o3", "note": "Reasoning model"},
+ ],
+ "openrouter": [
+ {"name": "anthropic/claude-sonnet-4-5", "note": "Via OpenRouter"},
+ {"name": "meta-llama/llama-3.1-70b-instruct", "note": "Open-weight"},
+ ],
+ }
+ return recommendations.get(provider, [{"note": f"Unknown provider: {provider}"}])
+
+
+def run_delegation(
+ objective: str,
+ workspace: str,
+ model: str = "claude-sonnet-4-5-20250929",
+ provider: str = "auto",
+ max_steps: int = 50,
+ max_depth: int = 3,
+ timeout: int = 600,
+ recursive: bool = True,
+ acceptance_criteria: bool = True,
+ reasoning_effort: str | None = None,
+ resume_session: str | None = None,
+) -> dict:
+ """Run the OpenPlanter agent and return structured results."""
+ ws = Path(workspace).resolve()
+
+ try:
+ cmd = build_command(
+ objective=objective,
+ workspace=str(ws),
+ model=model,
+ provider=provider,
+ max_steps=max_steps,
+ max_depth=max_depth,
+ recursive=recursive,
+ acceptance_criteria=acceptance_criteria,
+ reasoning_effort=reasoning_effort,
+ resume_session=resume_session,
+ )
+ except RuntimeError as e:
+ return {
+ "status": "error",
+ "answer": None,
+ "error": str(e),
+ "elapsed_sec": 0,
+ }
+
+ t0 = time.monotonic()
+ try:
+ result = subprocess.run(
+ cmd,
+ capture_output=True,
+ text=True,
+ timeout=timeout,
+ env={**os.environ},
+ cwd=str(_AGENT_REPO) if _AGENT_REPO else None,
+ )
+ except subprocess.TimeoutExpired:
+ return {
+ "status": "timeout",
+ "answer": None,
+ "error": f"RLM agent exceeded {timeout}s timeout",
+ "elapsed_sec": round(time.monotonic() - t0, 2),
+ }
+ except FileNotFoundError:
+ return {
+ "status": "error",
+ "answer": None,
+ "error": "openplanter-agent command not found",
+ "elapsed_sec": round(time.monotonic() - t0, 2),
+ }
+
+ elapsed = round(time.monotonic() - t0, 2)
+ answer, trace_lines = parse_output(result.stdout)
+
+ return {
+ "status": "success" if result.returncode == 0 else "error",
+ "answer": answer,
+ "exit_code": result.returncode,
+ "stderr": result.stderr[:2000] if result.stderr else None,
+ "trace_steps": len(trace_lines),
+ "artifacts": collect_session_artifacts(ws),
+ "output_files": collect_output_files(ws),
+ "elapsed_sec": elapsed,
+ }
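The capture/timeout handling above is plain `subprocess.run`. A minimal self-contained sketch of the same pattern, using the current interpreter as a stand-in for the agent command:

```python
import subprocess
import sys
import time

# Run a trivial child process with the same capture/timeout pattern as
# run_delegation(); the command here just prints a trace line and an answer.
t0 = time.monotonic()
try:
    result = subprocess.run(
        [sys.executable, "-c", "print('trace> step 1'); print('done')"],
        capture_output=True,
        text=True,
        timeout=10,
    )
    status = "success" if result.returncode == 0 else "error"
except subprocess.TimeoutExpired:
    status = "timeout"
elapsed = round(time.monotonic() - t0, 2)
```

On timeout, `subprocess.run` raises rather than returning, which is why the status has to be set in both branches.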
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Delegate an investigation to the OpenPlanter RLM agent"
+ )
+ parser.add_argument(
+ "--objective",
+ help="Investigation objective (what the agent should accomplish)",
+ )
+ parser.add_argument(
+ "--workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument(
+ "--model", default="claude-sonnet-4-5-20250929",
+ help="Model name for the top-level agent. Provider is auto-inferred "
+ "from the name: claude-* → anthropic, gpt-*/o1-* → openai, "
+ "org/model → openrouter, llama*/qwen* → ollama. "
+ "(default: claude-sonnet-4-5-20250929)",
+ )
+ parser.add_argument(
+ "--provider", default="auto",
+ help="LLM provider: auto, anthropic, openai, openrouter, cerebras, "
+ "ollama. 'auto' infers from the model name. (default: auto)",
+ )
+ parser.add_argument(
+ "--max-steps", type=int, default=50,
+ help="Maximum steps per agent call (default: 50)",
+ )
+ parser.add_argument(
+ "--max-depth", type=int, default=3,
+ help="Maximum recursion depth for sub-agents (default: 3)",
+ )
+ parser.add_argument(
+ "--timeout", type=int, default=600,
+ help="Wall-clock timeout in seconds (default: 600)",
+ )
+ parser.add_argument(
+ "--no-recursive", action="store_true",
+ help="Disable recursive sub-agent delegation",
+ )
+ parser.add_argument(
+ "--no-acceptance-criteria", action="store_true",
+ help="Disable acceptance criteria judging",
+ )
+ parser.add_argument(
+ "--reasoning-effort", choices=["low", "medium", "high"],
+ help="Control reasoning depth (low/medium/high)",
+ )
+ parser.add_argument(
+ "--resume", dest="resume_session",
+ help="Resume a saved investigation session by ID",
+ )
+ parser.add_argument(
+ "--list-sessions", action="store_true",
+ help="List all saved investigation sessions in workspace",
+ )
+ parser.add_argument(
+ "--list-models", action="store_true",
+ help="Show available models for the current provider",
+ )
+ parser.add_argument(
+ "--check", action="store_true",
+ help="Check if openplanter-agent is available and exit",
+ )
+ args = parser.parse_args()
+
+ # Info-only commands (no workspace/objective required)
+ if args.check:
+ cmd = find_agent_command()
+ if cmd:
+ print(json.dumps({"available": True, "command": cmd}))
+ else:
+ print(json.dumps({"available": False, "error": "openplanter-agent not found"}))
+ sys.exit(1)
+ return
+
+ if args.list_models:
+ provider = args.provider if args.provider != "auto" else _infer_provider(args.model)
+ models = list_models(provider)
+ print(json.dumps(models, indent=2))
+ return
+
+ if args.list_sessions:
+ if not args.workspace:
+ print(json.dumps({"error": "--workspace required for --list-sessions"}))
+ sys.exit(1)
+ sessions = list_sessions(str(args.workspace.resolve()))
+ print(json.dumps(sessions, indent=2))
+ return
+
+ # Resume mode: objective not required
+ if args.resume_session:
+ if not args.workspace:
+ print(json.dumps({"error": "--workspace required for --resume"}))
+ sys.exit(1)
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(json.dumps({"status": "error", "error": f"Workspace does not exist: {workspace}"}))
+ sys.exit(1)
+ result = run_delegation(
+ objective="", # Not used in resume mode
+ workspace=str(workspace),
+ model=args.model,
+ provider=args.provider,
+ timeout=args.timeout,
+ resume_session=args.resume_session,
+ )
+ print(json.dumps(result, indent=2))
+ sys.exit(0 if result["status"] == "success" else 1)
+
+ # Standard delegation: objective + workspace required
+ if not args.objective:
+ parser.error("--objective is required (unless using --resume, --list-sessions, or --list-models)")
+ if not args.workspace:
+ parser.error("--workspace is required")
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(
+ json.dumps({"status": "error", "error": f"Workspace does not exist: {workspace}"}),
+ )
+ sys.exit(1)
+
+ result = run_delegation(
+ objective=args.objective,
+ workspace=str(workspace),
+ model=args.model,
+ provider=args.provider,
+ max_steps=args.max_steps,
+ max_depth=args.max_depth,
+ timeout=args.timeout,
+ recursive=not args.no_recursive,
+ acceptance_criteria=not args.no_acceptance_criteria,
+ reasoning_effort=args.reasoning_effort,
+ )
+ print(json.dumps(result, indent=2))
+ sys.exit(0 if result["status"] == "success" else 1)
+
+
+if __name__ == "__main__":
+ main()
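Callers typically parse the JSON this script prints to stdout. A minimal consumer sketch against a fabricated result payload carrying the same keys (`status`, `answer`, `exit_code`, `trace_steps`, `elapsed_sec`):

```python
import json

# Fabricated stdout from a delegation run; values are illustrative.
raw = json.dumps({
    "status": "success",
    "answer": "Entity X appears in both FEC and LDA filings.",
    "exit_code": 0,
    "trace_steps": 12,
    "elapsed_sec": 41.3,
})

result = json.loads(raw)
# Mirror the script's own convention: exit 0 only on "success".
ok = result["status"] == "success" and result["exit_code"] == 0
```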
diff --git a/skills/openplanter/scripts/entity_resolver.py b/skills/openplanter/scripts/entity_resolver.py
new file mode 100644
index 00000000..a8ffab76
--- /dev/null
+++ b/skills/openplanter/scripts/entity_resolver.py
@@ -0,0 +1,481 @@
+#!/usr/bin/env python3
+"""Fuzzy entity matching and canonical map builder.
+
+Reads all CSV/JSON files in a workspace's datasets/ directory, extracts
+entity names, normalizes them, performs pairwise fuzzy matching, and
+outputs a canonical entity map to entities/canonical.json.
+
+Uses Python stdlib only — zero external dependencies.
+
+Usage:
+ python3 entity_resolver.py /path/to/investigation
+ python3 entity_resolver.py /path/to/investigation --threshold 0.80
+ python3 entity_resolver.py /path/to/investigation --name-columns "name,contributor_name,registrant"
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import difflib
+import json
+import re
+import sys
+import unicodedata
+from collections import defaultdict
+from dataclasses import asdict, dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+
+# --- Name Normalization ---
+
+# Legal suffix patterns — ordered longest first for greedy matching
+SUFFIX_PATTERNS: list[tuple[re.Pattern, str]] = [
+ (re.compile(r"\blimited\s+liability\s+company\b", re.I), ""),
+ (re.compile(r"\blimited\s+liability\s+partnership\b", re.I), ""),
+ (re.compile(r"\blimited\s+partnership\b", re.I), ""),
+ (re.compile(r"\bpublic\s+limited\s+company\b", re.I), ""),
+ (re.compile(r"\bprofessional\s+corporation\b", re.I), ""),
+ (re.compile(r"\bprofessional\s+association\b", re.I), ""),
+ (re.compile(r"\bincorporated\b", re.I), ""),
+ (re.compile(r"\bcorporation\b", re.I), ""),
+ (re.compile(r"\bcompany\b", re.I), ""),
+ (re.compile(r"\blimited\b", re.I), ""),
+ (re.compile(r"\bl\.?l\.?c\.?\b", re.I), ""),
+ (re.compile(r"\bl\.?l\.?p\.?\b", re.I), ""),
+ (re.compile(r"\bl\.?p\.?\b", re.I), ""),
+ (re.compile(r"\bp\.?l\.?c\.?\b", re.I), ""),
+ (re.compile(r"\bp\.?c\.?\b", re.I), ""),
+ (re.compile(r"\bp\.?a\.?\b", re.I), ""),
+ (re.compile(r"\binc\.?\b", re.I), ""),
+ (re.compile(r"\bcorp\.?\b", re.I), ""),
+ (re.compile(r"\bltd\.?\b", re.I), ""),
+ (re.compile(r"\bco\.?\b", re.I), ""),
+]
+
+NOISE_WORDS = re.compile(r"\b(?:the|a|an|of)\b", re.I)
+PUNCT = re.compile(r"[^\w\s-]")
+MULTI_SPACE = re.compile(r"\s+")
+
+# Common name columns to auto-detect
+DEFAULT_NAME_COLUMNS = [
+ "name",
+ "entity_name",
+ "company_name",
+ "organization_name",
+ "org_name",
+ "contributor_name",
+ "registrant_name",
+ "registrant",
+ "client_name",
+ "client",
+ "vendor_name",
+ "vendor",
+ "employer_name",
+ "employer",
+ "recipient_name",
+ "recipient",
+ "lobbyist_name",
+ "lobbyist",
+ "owner_name",
+ "owner",
+ "grantee",
+ "contractor",
+]
+
+
+def strip_diacritics(s: str) -> str:
+ decomposed = unicodedata.normalize("NFKD", s)
+ stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
+ return unicodedata.normalize("NFKC", stripped)
+
+
+def canonical_key(name: str) -> str:
+ """Produce a normalized key for entity deduplication."""
+ s = strip_diacritics(name)
+ s = s.lower()
+    # Normalize ampersand so "A&B" and "A and B" share a key
+ s = s.replace("&", " and ")
+ # Strip legal suffixes
+ for pat, repl in SUFFIX_PATTERNS:
+ s = pat.sub(repl, s)
+ # Remove noise words (not "and" — it's a real token in entity names)
+ s = NOISE_WORDS.sub("", s)
+ # Strip punctuation except hyphens
+ s = PUNCT.sub(" ", s)
+ # Collapse whitespace
+ s = MULTI_SPACE.sub(" ", s).strip()
+ return s
+
+
+def entity_similarity(a: str, b: str, cutoff: float = 0.5) -> float:
+ """Cascading similarity check using difflib.SequenceMatcher.
+
+ Stages: real_quick_ratio (O(1)) → quick_ratio (O(min(N,M))) → ratio (O(N*M)).
+ Returns 0.0 if below cutoff at any stage.
+ """
+ ka, kb = canonical_key(a), canonical_key(b)
+ if ka == kb:
+ return 1.0
+ if not ka or not kb:
+ return 0.0
+ sm = difflib.SequenceMatcher(None, ka, kb, autojunk=False)
+ if sm.real_quick_ratio() < cutoff:
+ return 0.0
+ if sm.quick_ratio() < cutoff:
+ return 0.0
+ return round(sm.ratio(), 4)
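The normalize-then-cascade approach can be seen end to end with a toy normalizer. The sketch below is a deliberately reduced stand-in for `canonical_key` (a handful of suffixes, not the full table), paired with the same `difflib` gating:

```python
import difflib
import re

def mini_key(name: str) -> str:
    # Toy normalizer: lowercase, expand "&", strip a few legal suffixes,
    # drop punctuation, collapse whitespace.
    s = name.lower().replace("&", " and ")
    s = re.sub(r"\b(?:llc|inc|corp|ltd|co)\b\.?", "", s)
    s = re.sub(r"[^\w\s-]", " ", s)
    return re.sub(r"\s+", " ", s).strip()

a, b = mini_key("Acme Holdings, LLC"), mini_key("ACME HOLDINGS INC.")
sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
# The cascade: cheap upper bounds first, full ratio only if both pass.
score = 0.0
if sm.real_quick_ratio() >= 0.5 and sm.quick_ratio() >= 0.5:
    score = round(sm.ratio(), 4)
```

Both inputs normalize to the same key, so the cascade short-circuits to a perfect score without ever needing the quadratic `ratio()` on dissimilar pairs.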
+
+
+# --- Data Loading ---
+
+
+@dataclass
+class EntityRecord:
+ name: str
+ normalized: str
+ source_file: str
+ source_location: str # "row:42" or "$.path[3].name"
+ fields: dict[str, str] = field(default_factory=dict)
+
+
+@dataclass
+class CanonicalEntity:
+ canonical_id: str
+ canonical_name: str
+ variants: list[dict[str, Any]] = field(default_factory=list)
+ sources: list[str] = field(default_factory=list)
+ confidence: str = "unresolved"
+
+
+def detect_name_columns(headers: list[str], explicit: list[str] | None) -> list[str]:
+ """Find which columns contain entity names."""
+    if explicit:
+        explicit_lower = {e.lower() for e in explicit}
+        return [h for h in headers if h.lower() in explicit_lower]
+ lower_headers = {h.lower(): h for h in headers}
+ found = []
+ for candidate in DEFAULT_NAME_COLUMNS:
+ if candidate in lower_headers:
+ found.append(lower_headers[candidate])
+ return found
+
+
+def load_csv(filepath: Path, name_columns: list[str] | None) -> list[EntityRecord]:
+ """Load entities from a CSV file."""
+ records: list[EntityRecord] = []
+    # Try encodings in order; latin-1 decodes any byte stream as a last resort
+ for encoding in ("utf-8-sig", "utf-8", "latin-1"):
+ try:
+ with open(filepath, newline="", encoding=encoding) as f:
+ sample = f.read(4096)
+ f.seek(0)
+ try:
+ dialect = csv.Sniffer().sniff(sample[:2048], delimiters=",\t|;")
+ except csv.Error:
+ dialect = csv.excel
+ reader = csv.DictReader(f, dialect=dialect)
+ if reader.fieldnames is None:
+ return records
+ cols = detect_name_columns(list(reader.fieldnames), name_columns)
+ if not cols:
+ print(f" Warning: No name columns found in {filepath.name}")
+ return records
+ for row_num, row in enumerate(reader, start=2):
+ for col in cols:
+ val = (row.get(col) or "").strip()
+ if val and len(val) > 1:
+ records.append(
+ EntityRecord(
+ name=val,
+ normalized=canonical_key(val),
+ source_file=filepath.name,
+ source_location=f"row:{row_num}",
+ fields={
+ k: (v or "").strip()
+ for k, v in row.items()
+ if v and v.strip()
+ },
+ )
+ )
+ return records
+ except UnicodeDecodeError:
+ continue
+ print(f" Error: Cannot decode {filepath.name}")
+ return records
+
+
+def load_json(filepath: Path, name_columns: list[str] | None) -> list[EntityRecord]:
+ """Load entities from a JSON file (array of objects or nested)."""
+ records: list[EntityRecord] = []
+ try:
+ data = json.loads(filepath.read_text(encoding="utf-8"))
+ except (json.JSONDecodeError, UnicodeDecodeError) as e:
+ print(f" Error reading {filepath.name}: {e}")
+ return records
+
+ if isinstance(data, list):
+ items = data
+ path_prefix = "$"
+ elif isinstance(data, dict):
+ # Try common wrapper patterns
+ for key in ("data", "results", "records", "items", "registrants", "filings"):
+ if key in data and isinstance(data[key], list):
+ items = data[key]
+ path_prefix = f"$.{key}"
+ break
+ else:
+ items = [data]
+ path_prefix = "$"
+ else:
+ return records
+
+ for idx, item in enumerate(items):
+ if not isinstance(item, dict):
+ continue
+ cols = detect_name_columns(list(item.keys()), name_columns)
+ if not cols and idx == 0:
+ print(f" Warning: No name columns found in {filepath.name}")
+ for col in cols:
+ val = str(item.get(col, "")).strip()
+ if val and len(val) > 1:
+ records.append(
+ EntityRecord(
+ name=val,
+ normalized=canonical_key(val),
+ source_file=filepath.name,
+ source_location=f"{path_prefix}[{idx}].{col}",
+ fields={
+ k: str(v).strip()
+ for k, v in item.items()
+ if v is not None and str(v).strip()
+ },
+ )
+ )
+ return records
+
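The wrapper-detection logic above hinges on Python's `for`/`else`: the `else` branch runs only when the loop finishes without `break`. A minimal self-contained sketch of the same pattern, with an invented payload:

```python
# Invented sample payload wrapped in a "results" key, as many APIs do.
data = {"results": [{"name": "Acme Corp"}, {"name": "Acme LLC"}]}

for key in ("data", "results", "records", "items"):
    if key in data and isinstance(data[key], list):
        items = data[key]
        path_prefix = f"$.{key}"
        break
else:
    # No known wrapper key: treat the whole object as one record.
    items = [data]
    path_prefix = "$"
```

Here the `results` key matches, so `items` holds the two inner records and the `else` branch never runs.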
+
+# --- Entity Resolution ---
+
+
+class UnionFind:
+ """Disjoint-set data structure for entity clustering."""
+
+ def __init__(self) -> None:
+ self.parent: dict[int, int] = {}
+ self.rank: dict[int, int] = {}
+
+ def find(self, x: int) -> int:
+ if x not in self.parent:
+ self.parent[x] = x
+ self.rank[x] = 0
+ if self.parent[x] != x:
+ self.parent[x] = self.find(self.parent[x]) # path compression
+ return self.parent[x]
+
+ def union(self, x: int, y: int) -> None:
+ rx, ry = self.find(x), self.find(y)
+ if rx == ry:
+ return
+ # union by rank
+ if self.rank[rx] < self.rank[ry]:
+ rx, ry = ry, rx
+ self.parent[ry] = rx
+ if self.rank[rx] == self.rank[ry]:
+ self.rank[rx] += 1
+
+ def clusters(self, indices: list[int]) -> dict[int, list[int]]:
+ groups: dict[int, list[int]] = defaultdict(list)
+ for i in indices:
+ groups[self.find(i)].append(i)
+ return dict(groups)
+
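A quick self-contained sanity check of the clustering behavior (the class is inlined so the snippet runs on its own; the record indices and name comments are illustrative):

```python
from collections import defaultdict

class UnionFind:
    """Disjoint-set with path compression and union by rank."""

    def __init__(self) -> None:
        self.parent: dict[int, int] = {}
        self.rank: dict[int, int] = {}

    def find(self, x: int) -> int:
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

    def clusters(self, indices: list[int]) -> dict[int, list[int]]:
        groups: dict[int, list[int]] = defaultdict(list)
        for i in indices:
            groups[self.find(i)].append(i)
        return dict(groups)

uf = UnionFind()
uf.union(0, 1)  # e.g. "ACME CORP" matched "Acme Corp."
uf.union(1, 2)  # transitively pulls in "ACME CORPORATION"
clusters = uf.clusters([0, 1, 2, 3])  # record 3 never matched anything
```

Transitivity is the point: records 0 and 2 were never compared directly, yet they end up in the same cluster, while record 3 stays a singleton.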
+
+def resolve_entities(
+ records: list[EntityRecord], threshold: float
+) -> list[CanonicalEntity]:
+ """Group records into canonical entities using fuzzy matching."""
+ if not records:
+ return []
+
+ # --- Blocking: group by first 3 chars of normalized name ---
+ blocks: dict[str, list[int]] = defaultdict(list)
+ for idx, rec in enumerate(records):
+        key = rec.normalized[:3]  # slicing is safe for names shorter than 3 chars
+ blocks[key].append(idx)
+
+ # --- Pairwise comparison within blocks ---
+ uf = UnionFind()
+ comparisons = 0
+ matches = 0
+
+ for block_key, indices in blocks.items():
+ for i in range(len(indices)):
+ for j in range(i + 1, len(indices)):
+ a_idx, b_idx = indices[i], indices[j]
+ comparisons += 1
+ score = entity_similarity(
+ records[a_idx].name, records[b_idx].name, cutoff=threshold
+ )
+ if score >= threshold:
+ uf.union(a_idx, b_idx)
+ matches += 1
+
+ # Also compare across adjacent blocks (sorted neighborhood window=1)
+ sorted_blocks = sorted(blocks.keys())
+ for k in range(len(sorted_blocks) - 1):
+ key_a, key_b = sorted_blocks[k], sorted_blocks[k + 1]
+ # Only compare first few from each block to limit cost
+ for a_idx in blocks[key_a][:5]:
+ for b_idx in blocks[key_b][:5]:
+ comparisons += 1
+ score = entity_similarity(
+ records[a_idx].name, records[b_idx].name, cutoff=threshold
+ )
+ if score >= threshold:
+ uf.union(a_idx, b_idx)
+ matches += 1
+
+ # --- Build canonical entities from clusters ---
+ clusters = uf.clusters(list(range(len(records))))
+ entities: list[CanonicalEntity] = []
+
+ for cluster_id, (root, members) in enumerate(
+ sorted(clusters.items(), key=lambda x: -len(x[1])), start=1
+ ):
+ # Pick the most common name variant as canonical
+ name_counts: dict[str, int] = defaultdict(int)
+ for idx in members:
+ name_counts[records[idx].name] += 1
+ canonical_name = max(name_counts, key=name_counts.get) # type: ignore[arg-type]
+
+ variants = []
+ sources = set()
+ for idx in members:
+ rec = records[idx]
+ sources.add(rec.source_file)
+ variants.append(
+ {
+ "name": rec.name,
+ "source": rec.source_file,
+ "location": rec.source_location,
+ "similarity": entity_similarity(canonical_name, rec.name),
+ }
+ )
+
+ # Determine confidence based on source diversity AND match quality
+ unique_sources = len(sources)
+        sims = [
+            v.get("similarity", 0)
+            for v in variants
+            if isinstance(v.get("similarity"), (int, float))
+        ]
+ avg_sim = sum(sims) / len(sims) if sims else 0
+
+ if unique_sources >= 2 and avg_sim >= 0.85:
+ confidence = "confirmed"
+ elif unique_sources >= 2:
+ confidence = "probable"
+ elif len(members) >= 3:
+ confidence = "probable"
+ elif len(members) >= 2:
+ confidence = "possible"
+ else:
+ confidence = "unresolved"
+
+ entities.append(
+ CanonicalEntity(
+ canonical_id=f"entity-{cluster_id:04d}",
+ canonical_name=canonical_name,
+ variants=variants,
+ sources=sorted(sources),
+ confidence=confidence,
+ )
+ )
+
+ print(f" Comparisons: {comparisons:,}")
+ print(f" Matches found: {matches:,}")
+ print(f" Canonical entities: {len(entities):,}")
+ return entities
+
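To make the blocking trade-off concrete, here is a self-contained sketch using `difflib.SequenceMatcher` as a stand-in for `entity_similarity` (which is defined elsewhere in this module); the names are invented:

```python
from collections import defaultdict
from difflib import SequenceMatcher

names = ["acme corp", "acme corp.", "zenith llc", "acne corp"]

# Block on the first 3 characters: "zenith llc" is never scored against
# the acme variants, and "acne corp" lands in its own block.
blocks: dict[str, list[int]] = defaultdict(list)
for idx, name in enumerate(names):
    blocks[name[:3]].append(idx)

pairs = [
    (indices[i], indices[j])
    for indices in blocks.values()
    for i in range(len(indices))
    for j in range(i + 1, len(indices))
]

def score(a: str, b: str) -> float:
    # Stand-in similarity; the real scorer may weight tokens differently.
    return SequenceMatcher(None, a, b).ratio()

matched = [(i, j) for i, j in pairs if score(names[i], names[j]) >= 0.85]
```

The adjacent-block pass in `resolve_entities` exists precisely to catch near-misses like the `acn`/`acm` split above, which strict prefix blocking would otherwise drop.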
+
+# --- Main ---
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Resolve entities across datasets in an investigation workspace"
+ )
+ parser.add_argument("workspace", type=Path, help="Path to investigation workspace")
+ parser.add_argument(
+ "--threshold",
+ type=float,
+ default=0.85,
+ help="Similarity threshold for matching (default: 0.85)",
+ )
+ parser.add_argument(
+ "--name-columns",
+ type=str,
+ default=None,
+ help="Comma-separated list of column names containing entity names",
+ )
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+ datasets_dir = workspace / "datasets"
+ if not datasets_dir.exists():
+ print(f"Error: {datasets_dir} does not exist", file=sys.stderr)
+ sys.exit(1)
+
+    name_cols = (
+        [c.strip() for c in args.name_columns.split(",")]
+        if args.name_columns
+        else None
+    )
+
+ # Load all datasets
+ all_records: list[EntityRecord] = []
+ data_files = list(datasets_dir.glob("*.csv")) + list(datasets_dir.glob("*.json"))
+
+ if not data_files:
+ print(f"No CSV or JSON files found in {datasets_dir}")
+ sys.exit(0)
+
+ print(f"Loading datasets from {datasets_dir}:")
+ for fp in sorted(data_files):
+ print(f" {fp.name}...", end=" ")
+ if fp.suffix == ".csv":
+ recs = load_csv(fp, name_cols)
+ elif fp.suffix == ".json":
+ recs = load_json(fp, name_cols)
+ else:
+ recs = []
+ all_records.extend(recs)
+ print(f"{len(recs)} entities")
+
+ print(f"\nTotal entity records: {len(all_records):,}")
+ print(f"Resolving with threshold: {args.threshold}")
+
+ entities = resolve_entities(all_records, args.threshold)
+
+ # Write output
+ now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+ output = {
+ "metadata": {
+ "created": now,
+ "workspace": str(workspace),
+ "datasets_processed": sorted({r.source_file for r in all_records}),
+ "total_entities": len(entities),
+ "total_records": len(all_records),
+ "resolution_threshold": args.threshold,
+ },
+ "entities": [asdict(e) for e in entities],
+ }
+
+    out_path = workspace / "entities" / "canonical.json"
+    out_path.parent.mkdir(parents=True, exist_ok=True)  # workspace may lack entities/
+    out_path.write_text(json.dumps(output, indent=2, ensure_ascii=False), encoding="utf-8")
+ print(f"\nCanonical map written: {out_path}")
+
+ # Summary
+ by_confidence = defaultdict(int)
+ for e in entities:
+ by_confidence[e.confidence] += 1
+ print("\nConfidence breakdown:")
+ for tier in ("confirmed", "probable", "possible", "unresolved"):
+ count = by_confidence.get(tier, 0)
+ print(f" {tier}: {count}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/evidence_chain.py b/skills/openplanter/scripts/evidence_chain.py
new file mode 100644
index 00000000..ea6aafa6
--- /dev/null
+++ b/skills/openplanter/scripts/evidence_chain.py
@@ -0,0 +1,284 @@
+#!/usr/bin/env python3
+"""Validate evidence chain structure in investigation findings.
+
+Checks each claim in findings/ for:
+ - Evidence items with source records
+ - Dataset references
+ - Confidence tier assignment
+ - Required fields (claim, evidence, source, confidence)
+ - Evidence chain completeness (every hop documented)
+
+Reports pass/fail per claim, missing fields, and broken chains.
+
+Uses Python stdlib only — zero external dependencies.
+
+Usage:
+ python3 evidence_chain.py /path/to/investigation
+ python3 evidence_chain.py /path/to/investigation --strict
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+VALID_CONFIDENCE = {"confirmed", "probable", "possible", "unresolved"}
+VALID_MATCH_TYPES = {"exact", "fuzzy", "address-based", "ein", "phone", "email", "fuzzy_name", "exact_ein"}
+VALID_CORROBORATION = {"single", "corroborated", "contradicted", "unresolvable"}
+
+REQUIRED_CHAIN_FIELDS = {"claim", "confidence", "hops"}
+REQUIRED_HOP_FIELDS = {"from_entity", "from_dataset", "to_entity", "to_dataset", "match_type"}
+
+
+@dataclass
+class ValidationResult:
+ chain_id: str
+ claim: str
+ status: str = "pass" # pass, warn, fail
+ issues: list[str] = field(default_factory=list)
+ warnings: list[str] = field(default_factory=list)
+
+ def add_issue(self, msg: str) -> None:
+ self.issues.append(msg)
+ self.status = "fail"
+
+ def add_warning(self, msg: str) -> None:
+ self.warnings.append(msg)
+ if self.status == "pass":
+ self.status = "warn"
+
+
+def validate_hop(hop: dict[str, Any], hop_num: int, strict: bool) -> list[str]:
+ """Validate a single hop in an evidence chain."""
+ issues: list[str] = []
+
+ # Check required fields
+ for f in REQUIRED_HOP_FIELDS:
+ if f not in hop or not hop[f]:
+ issues.append(f"Hop {hop_num}: missing required field '{f}'")
+
+ # Check match type
+ mt = hop.get("match_type", "")
+ if mt and mt not in VALID_MATCH_TYPES:
+ # Allow compound types like "exact_ein + fuzzy_name"
+ parts = [p.strip() for p in mt.replace("+", ",").split(",")]
+ for part in parts:
+ if part and part not in VALID_MATCH_TYPES:
+ issues.append(f"Hop {hop_num}: unknown match_type '{part}'")
+
+ # Check match score if present
+ score = hop.get("match_score")
+ if score is not None:
+ try:
+ s = float(score)
+ if not 0.0 <= s <= 1.0:
+ issues.append(f"Hop {hop_num}: match_score {s} out of range [0, 1]")
+ except (ValueError, TypeError):
+ issues.append(f"Hop {hop_num}: match_score '{score}' is not numeric")
+
+ # Check link strength
+ if strict and "link_strength" not in hop:
+ issues.append(f"Hop {hop_num}: missing link_strength (required in strict mode)")
+
+ # Check source records
+ if strict:
+ if "from_record" not in hop:
+ issues.append(f"Hop {hop_num}: missing from_record (required in strict mode)")
+ if "to_record" not in hop:
+ issues.append(f"Hop {hop_num}: missing to_record (required in strict mode)")
+
+ return issues
+
+
+def validate_chain(chain: dict[str, Any], strict: bool) -> ValidationResult:
+ """Validate an evidence chain."""
+ chain_id = chain.get("chain_id", "unknown")
+ claim = chain.get("claim", "(no claim)")
+ result = ValidationResult(chain_id=chain_id, claim=claim)
+
+ # Check required top-level fields
+ for f in REQUIRED_CHAIN_FIELDS:
+ if f not in chain or not chain[f]:
+ result.add_issue(f"Missing required field: '{f}'")
+
+ # Validate confidence tier
+ confidence = chain.get("confidence", "")
+ if confidence and confidence not in VALID_CONFIDENCE:
+ result.add_issue(f"Invalid confidence tier: '{confidence}'")
+
+ # Validate hops
+ hops = chain.get("hops", [])
+ if not hops:
+ result.add_issue("No evidence hops — claim has no supporting evidence chain")
+ else:
+ for i, hop in enumerate(hops, start=1):
+ hop_issues = validate_hop(hop, i, strict)
+ for issue in hop_issues:
+ result.add_issue(issue)
+
+ # Check chain continuity — each hop's to_entity should be the next hop's from_entity
+ for i in range(len(hops) - 1):
+ to_entity = hops[i].get("to_entity", "")
+ from_entity = hops[i + 1].get("from_entity", "")
+ if to_entity and from_entity and to_entity != from_entity:
+ result.add_warning(
+ f"Chain break: hop {i+1} ends at '{to_entity}' "
+ f"but hop {i+2} starts at '{from_entity}'"
+ )
+
+ # Validate corroboration (if present)
+ corr = chain.get("corroboration", {})
+ if corr:
+ status = corr.get("status", "")
+ if status and status not in VALID_CORROBORATION:
+ result.add_warning(f"Unknown corroboration status: '{status}'")
+ if status == "corroborated":
+ sources = corr.get("independent_sources", 0)
+ if isinstance(sources, int) and sources < 2:
+ result.add_warning(
+ f"Corroborated but independent_sources={sources} (need >=2)"
+ )
+
+ # Check key assumptions (warn if missing in strict mode)
+ if strict:
+ if "key_assumptions" not in chain or not chain["key_assumptions"]:
+ result.add_warning("No key_assumptions documented")
+ if "falsification_conditions" not in chain or not chain["falsification_conditions"]:
+ result.add_warning("No falsification_conditions documented")
+
+ return result
+
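For reference, a minimal chain that would pass these checks in non-strict mode (all entity names and dataset filenames invented):

```python
chain = {
    "chain_id": "chain-001",
    "claim": "Acme Holdings and Acme LLC share a registered agent",
    "confidence": "probable",
    "hops": [
        {
            "from_entity": "Acme Holdings",
            "from_dataset": "corporate_registry.csv",
            "to_entity": "Acme LLC",
            "to_dataset": "lobbying_disclosures.json",
            "match_type": "fuzzy_name",
            "match_score": 0.91,
        }
    ],
}

# Mirrors the validator's required-field and range checks.
assert {"claim", "confidence", "hops"} <= chain.keys()
assert all(
    {"from_entity", "from_dataset", "to_entity", "to_dataset", "match_type"}
    <= hop.keys()
    for hop in chain["hops"]
)
assert 0.0 <= chain["hops"][0]["match_score"] <= 1.0
```

Strict mode would additionally flag this chain for missing `link_strength`, `from_record`/`to_record`, `key_assumptions`, and `falsification_conditions`.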
+
+def validate_findings_file(filepath: Path, strict: bool) -> list[ValidationResult]:
+ """Validate all evidence chains in a findings file."""
+ results: list[ValidationResult] = []
+
+ try:
+ data = json.loads(filepath.read_text(encoding="utf-8"))
+ except (json.JSONDecodeError, UnicodeDecodeError) as e:
+ r = ValidationResult(chain_id="file", claim=filepath.name)
+ r.add_issue(f"Cannot parse file: {e}")
+ return [r]
+
+ # Handle different file structures
+ chains: list[dict[str, Any]] = []
+
+ if isinstance(data, list):
+ chains = [c for c in data if isinstance(c, dict)]
+ elif isinstance(data, dict):
+ # Check common wrapper patterns
+ for key in ("chains", "evidence_chains", "findings", "cross_references"):
+ if key in data and isinstance(data[key], list):
+ chains = [c for c in data[key] if isinstance(c, dict)]
+ break
+ else:
+ # Single chain
+ chains = [data]
+
+ if not chains:
+ r = ValidationResult(chain_id="file", claim=filepath.name)
+ r.add_warning("No evidence chains found in file")
+ results.append(r)
+ return results
+
+ for chain in chains:
+ results.append(validate_chain(chain, strict))
+
+ return results
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Validate evidence chain structure in investigation findings"
+ )
+ parser.add_argument("workspace", type=Path, help="Path to investigation workspace")
+ parser.add_argument(
+ "--strict",
+ action="store_true",
+ help="Strict mode: require link_strength, source records, assumptions",
+ )
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+
+ # Check both findings/ and evidence/ directories
+ search_dirs = [workspace / "findings", workspace / "evidence"]
+ json_files: list[Path] = []
+ for d in search_dirs:
+ if d.exists():
+ json_files.extend(d.glob("*.json"))
+
+ if not json_files:
+ print("No JSON files found in findings/ or evidence/")
+ sys.exit(0)
+
+ total_pass = 0
+ total_warn = 0
+ total_fail = 0
+ all_results: list[ValidationResult] = []
+
+ print(f"Validating evidence chains (strict={'yes' if args.strict else 'no'}):\n")
+
+ for fp in sorted(json_files):
+ results = validate_findings_file(fp, args.strict)
+ all_results.extend(results)
+ print(f" {fp.relative_to(workspace)}:")
+
+ for r in results:
+ icon = {"pass": "+", "warn": "~", "fail": "!"}[r.status]
+ print(f" [{icon}] {r.chain_id}: {r.claim[:60]}")
+ for issue in r.issues:
+ print(f" FAIL: {issue}")
+ for warning in r.warnings:
+ print(f" WARN: {warning}")
+
+ if r.status == "pass":
+ total_pass += 1
+ elif r.status == "warn":
+ total_warn += 1
+ else:
+ total_fail += 1
+
+ # Write validation report
+ report = {
+ "metadata": {
+ "workspace": str(workspace),
+ "strict_mode": args.strict,
+ "files_checked": len(json_files),
+ "chains_validated": len(all_results),
+ },
+ "summary": {
+ "pass": total_pass,
+ "warn": total_warn,
+ "fail": total_fail,
+ },
+ "results": [
+ {
+ "chain_id": r.chain_id,
+ "claim": r.claim,
+ "status": r.status,
+ "issues": r.issues,
+ "warnings": r.warnings,
+ }
+ for r in all_results
+ ],
+ }
+
+ evidence_dir = workspace / "evidence"
+ evidence_dir.mkdir(exist_ok=True)
+ report_path = evidence_dir / "validation-report.json"
+ report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
+
+ print(f"\n{'='*50}")
+ print(f" Pass: {total_pass} Warn: {total_warn} Fail: {total_fail}")
+ print(f" Report: {report_path}")
+
+ if total_fail > 0:
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/fetch_census.py b/skills/openplanter/scripts/fetch_census.py
new file mode 100644
index 00000000..a7b9e66e
--- /dev/null
+++ b/skills/openplanter/scripts/fetch_census.py
@@ -0,0 +1,246 @@
+#!/usr/bin/env python3
+"""Fetch US Census Bureau American Community Survey (ACS) data.
+
+Queries the Census Bureau API for demographic and economic data by geography.
+Useful for investigations involving population context, income distribution,
+housing patterns, and demographic profiling of areas linked to entities.
+
+Uses Python stdlib only — zero external dependencies.
+
+API: https://api.census.gov/data/{year}/acs/acs5
+Auth: Optional API key (CENSUS_API_KEY env var) — works without key at lower rate limits.
+Rate limit: ~500 req/day without key, unlimited with key.
+
+Usage:
+ python3 fetch_census.py /path/to/investigation --state 36 --county 027
+ python3 fetch_census.py /path/to/investigation --state 36 --place "New York city"
+ python3 fetch_census.py /path/to/investigation --zipcode 10001
+ python3 fetch_census.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import os
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+# ACS 5-year variables of investigative interest
+DEFAULT_VARIABLES = [
+ "NAME", # Geography name
+ "B01003_001E", # Total population
+ "B19013_001E", # Median household income
+ "B25077_001E", # Median home value
+ "B25064_001E", # Median gross rent
+ "B23025_002E", # In labor force
+ "B23025_005E", # Unemployed
+ "B15003_022E", # Bachelor's degree
+ "B15003_023E", # Master's degree
+ "B15003_025E", # Doctorate degree
+ "B02001_002E", # White alone
+ "B02001_003E", # Black alone
+ "B03003_003E", # Hispanic or Latino
+]
+
+VARIABLE_LABELS = {
+ "NAME": "Geography Name",
+ "B01003_001E": "Total Population",
+ "B19013_001E": "Median Household Income ($)",
+ "B25077_001E": "Median Home Value ($)",
+ "B25064_001E": "Median Gross Rent ($)",
+ "B23025_002E": "In Labor Force",
+ "B23025_005E": "Unemployed",
+ "B15003_022E": "Bachelor's Degree",
+ "B15003_023E": "Master's Degree",
+ "B15003_025E": "Doctorate Degree",
+ "B02001_002E": "White Alone",
+ "B02001_003E": "Black or African American Alone",
+ "B03003_003E": "Hispanic or Latino",
+}
+
+BASE_URL = "https://api.census.gov/data"
+DEFAULT_YEAR = 2022  # Most recent ACS 5-year release at the time of writing
+
+
+def fetch_acs(
+ year: int,
+ variables: list[str],
+ state: str | None = None,
+ county: str | None = None,
+ zipcode: str | None = None,
+ api_key: str | None = None,
+ timeout: int = 30,
+) -> list[dict]:
+ """Query ACS 5-year estimates."""
+ url = f"{BASE_URL}/{year}/acs/acs5"
+ params: dict[str, str] = {
+ "get": ",".join(variables),
+ }
+
+ # Build geography
+ if zipcode:
+ params["for"] = f"zip code tabulation area:{zipcode}"
+ elif county and state:
+ params["for"] = f"county:{county}"
+ params["in"] = f"state:{state}"
+ elif state:
+        params["for"] = "county:*"
+ params["in"] = f"state:{state}"
+ else:
+ params["for"] = "us:*"
+
+ if api_key:
+ params["key"] = api_key
+
+ query = urllib.parse.urlencode(params)
+ full_url = f"{url}?{query}"
+
+ req = urllib.request.Request(full_url, headers={"User-Agent": "OpenPlanter/1.0"})
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            body = resp.read()
+    except urllib.error.HTTPError as e:
+        print(f"ERROR: Census API returned {e.code}: {e.reason}", file=sys.stderr)
+        raise
+    # The API signals "no data for this geography" with an empty 204
+    # response, which urllib treats as success, so guard before parsing.
+    if not body.strip():
+        print("No data available for this geography", file=sys.stderr)
+        return []
+    data = json.loads(body.decode("utf-8"))
+
+ if not data or len(data) < 2:
+ return []
+
+ headers = data[0]
+ records = []
+ for row in data[1:]:
+ record = dict(zip(headers, row))
+ records.append(record)
+
+ return records
+
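The header-row handling above reflects the Census API's response shape: a JSON array whose first element is the column names. A small sketch of the same transformation (the population figure is invented):

```python
# Simulated ACS payload: header row first, then one row per geography.
data = [
    ["NAME", "B01003_001E", "state", "county"],
    ["Dutchess County, New York", "295000", "36", "027"],
]
headers = data[0]
records = [dict(zip(headers, row)) for row in data[1:]]
# Note: the API returns every value as a string, including counts.
```

Downstream consumers should expect string values and geography columns (`state`, `county`) appended after the requested variables.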
+
+def write_results(
+ workspace: Path,
+ records: list[dict],
+ query_params: dict,
+ year: int,
+) -> Path:
+ """Write results to workspace with provenance."""
+ out_dir = workspace / "datasets" / "bulk" / "census"
+ out_dir.mkdir(parents=True, exist_ok=True)
+
+ # Build filename from query params
+ parts = [f"acs5-{year}"]
+ if query_params.get("state"):
+ parts.append(f"state{query_params['state']}")
+ if query_params.get("county"):
+ parts.append(f"county{query_params['county']}")
+ if query_params.get("zipcode"):
+ parts.append(f"zip{query_params['zipcode']}")
+ filename = "-".join(parts) + ".json"
+
+ content = json.dumps(records, indent=2)
+ out_path = out_dir / filename
+ out_path.write_text(content, encoding="utf-8")
+
+ # Provenance sidecar
+ provenance = {
+ "source_id": "census",
+ "name": f"US Census ACS 5-Year Estimates ({year})",
+ "url": f"{BASE_URL}/{year}/acs/acs5",
+ "format": "json",
+ "linking_keys": ["state", "county", "zip code tabulation area", "NAME"],
+ "query_params": query_params,
+ "download_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "file": filename,
+ "size_bytes": len(content.encode("utf-8")),
+ "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
+ "record_count": len(records),
+        "variables": sorted(records[0].keys()) if records else [],  # columns actually returned
+ }
+ prov_path = out_dir / "provenance.json"
+ prov_path.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ return out_path
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Fetch US Census ACS demographic data for investigation context"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--state", help="FIPS state code (e.g., 36 for NY)")
+ parser.add_argument("--county", help="FIPS county code (e.g., 027 for Dutchess)")
+ parser.add_argument("--zipcode", help="ZIP code tabulation area")
+ parser.add_argument(
+ "--year", type=int, default=DEFAULT_YEAR,
+ help=f"ACS year (default: {DEFAULT_YEAR})",
+ )
+ parser.add_argument(
+ "--variables", help="Comma-separated ACS variable codes (default: demographic set)",
+ )
+ parser.add_argument("--timeout", type=int, default=30, help="Request timeout (default: 30s)")
+ parser.add_argument("--dry-run", action="store_true", help="Print URL without fetching")
+ parser.add_argument("--list", dest="list_vars", action="store_true", help="List available variables")
+
+ args = parser.parse_args()
+
+ if args.list_vars:
+ print("Default ACS 5-Year Variables:")
+ for code, label in VARIABLE_LABELS.items():
+ print(f" {code:16s} {label}")
+ return
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+    variables = (
+        [v.strip() for v in args.variables.split(",")]
+        if args.variables
+        else DEFAULT_VARIABLES
+    )
+ api_key = os.environ.get("CENSUS_API_KEY")
+
+ query_params = {
+ "state": args.state,
+ "county": args.county,
+ "zipcode": args.zipcode,
+ "year": args.year,
+ }
+
+ if args.dry_run:
+ print(f"Would fetch ACS {args.year} data:")
+ print(f" Variables: {len(variables)}")
+ print(f" State: {args.state or 'all'}")
+ print(f" County: {args.county or 'all'}")
+ print(f" ZIP: {args.zipcode or 'n/a'}")
+ print(f" API key: {'set' if api_key else 'not set (rate-limited)'}")
+ return
+
+ print(f"Fetching Census ACS {args.year} data...")
+ records = fetch_acs(
+ year=args.year,
+ variables=variables,
+ state=args.state,
+ county=args.county,
+ zipcode=args.zipcode,
+ api_key=api_key,
+ timeout=args.timeout,
+ )
+
+ if not records:
+ print("No records returned", file=sys.stderr)
+ sys.exit(1)
+
+ out_path = write_results(workspace, records, query_params, args.year)
+ print(f"Census: {len(records)} records → {out_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/fetch_epa.py b/skills/openplanter/scripts/fetch_epa.py
new file mode 100644
index 00000000..1402aea5
--- /dev/null
+++ b/skills/openplanter/scripts/fetch_epa.py
@@ -0,0 +1,232 @@
+#!/usr/bin/env python3
+"""Fetch EPA ECHO facility data for environmental compliance investigations.
+
+Queries the EPA Enforcement and Compliance History Online (ECHO) API for
+facility records, violations, inspections, and enforcement actions.
+Critical for infrastructure investigations — links facility IDs to
+geographic coordinates, SIC codes, and compliance history.
+
+Uses Python stdlib only — zero external dependencies.
+
+API: https://echo.epa.gov/tools/data-downloads
+ECHO API: https://echodata.epa.gov/echo/dfr_rest_services.get_facility_info
+Auth: None required (free public API).
+Rate limit: Undocumented, ~2 req/sec recommended.
+
+Usage:
+ python3 fetch_epa.py /path/to/investigation --query "Acme Chemical"
+ python3 fetch_epa.py /path/to/investigation --zipcode 12508
+ python3 fetch_epa.py /path/to/investigation --state NY --city Beacon
+ python3 fetch_epa.py /path/to/investigation --registry-id 110070416170
+ python3 fetch_epa.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+import time
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+# ECHO Detailed Facility Report API
+ECHO_BASE = "https://echodata.epa.gov/echo"
+FACILITY_SEARCH = f"{ECHO_BASE}/echo_rest_services.get_facilities"
+FACILITY_INFO = f"{ECHO_BASE}/dfr_rest_services.get_facility_info"
+
+# FRS (Facility Registry Service) for cross-referencing
+FRS_BASE = "https://ofmpub.epa.gov/frs_public2/frs_rest_services.get_facilities"
+
+
+def search_facilities(
+ query: str | None = None,
+ state: str | None = None,
+ city: str | None = None,
+ zipcode: str | None = None,
+ registry_id: str | None = None,
+ page_size: int = 25,
+ timeout: int = 30,
+) -> list[dict]:
+ """Search ECHO for facilities matching criteria."""
+ params: dict[str, str] = {
+ "output": "JSON",
+ "p_act": "Y", # Active facilities
+ }
+ if query:
+ params["p_fn"] = query # Facility name
+ if state:
+ params["p_st"] = state
+ if city:
+ params["p_ct"] = city
+ if zipcode:
+ params["p_zip"] = zipcode
+ if registry_id:
+ params["p_frs"] = registry_id
+
+ url = f"{FACILITY_SEARCH}?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={"User-Agent": "OpenPlanter/1.0"})
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+ except urllib.error.HTTPError as e:
+ print(f"ERROR: ECHO API returned {e.code}: {e.reason}", file=sys.stderr)
+ raise
+
+ # ECHO wraps results in Results.Facilities
+ results = data.get("Results", {})
+ facilities = results.get("Facilities", [])
+
+ records = []
+ for fac in facilities:
+ record = {
+ "registry_id": fac.get("RegistryId", ""),
+ "facility_name": fac.get("FacilityName", ""),
+ "street": fac.get("Street", ""),
+ "city": fac.get("City", ""),
+ "state": fac.get("State", ""),
+ "zip": fac.get("Zip", ""),
+ "county": fac.get("County", ""),
+ "lat": fac.get("Lat", ""),
+ "lon": fac.get("Lon", ""),
+ "sic_codes": fac.get("SICCodes", ""),
+ "naics_codes": fac.get("NAICSCodes", ""),
+ "facility_type": fac.get("FacilityType", ""),
+ "air_flag": fac.get("AirFlag", ""),
+ "water_flag": fac.get("CWAFlag", ""),
+ "rcra_flag": fac.get("RCRAFlag", ""),
+ "tri_flag": fac.get("TRIFlag", ""),
+ "current_violations": fac.get("CurrVioFlag", ""),
+ "qtrs_in_nc": fac.get("QtrsInNC", ""),
+ "inspection_count": fac.get("InspectionCount", ""),
+ "formal_action_count": fac.get("FormalActionCount", ""),
+ }
+ records.append(record)
+
+ return records
+
+
+def write_results(
+ workspace: Path,
+ records: list[dict],
+ query_params: dict,
+) -> Path:
+ """Write results to workspace with provenance."""
+ out_dir = workspace / "datasets" / "scraped" / "epa"
+ out_dir.mkdir(parents=True, exist_ok=True)
+
+ # Build filename
+ parts = ["echo"]
+ if query_params.get("query"):
+ safe = query_params["query"][:40].replace(" ", "_").replace("/", "_")
+ parts.append(safe)
+ if query_params.get("state"):
+ parts.append(query_params["state"])
+ if query_params.get("zipcode"):
+ parts.append(f"zip{query_params['zipcode']}")
+ filename = "-".join(parts) + ".json"
+
+ content = json.dumps(records, indent=2)
+ out_path = out_dir / filename
+ out_path.write_text(content, encoding="utf-8")
+
+ provenance = {
+ "source_id": "epa",
+ "name": "EPA ECHO Facility Search",
+ "url": FACILITY_SEARCH,
+ "format": "json",
+ "linking_keys": ["registry_id", "facility_name", "sic_codes", "naics_codes", "lat", "lon"],
+ "query_params": query_params,
+ "download_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "file": filename,
+ "size_bytes": len(content.encode("utf-8")),
+ "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
+ "record_count": len(records),
+ }
+ prov_path = out_dir / "provenance.json"
+ prov_path.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ return out_path
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Fetch EPA ECHO facility data for environmental compliance investigations"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--query", "-q", help="Facility name search")
+ parser.add_argument("--state", help="State abbreviation (e.g., NY)")
+ parser.add_argument("--city", help="City name")
+ parser.add_argument("--zipcode", help="ZIP code")
+ parser.add_argument("--registry-id", help="EPA FRS Registry ID")
+ parser.add_argument("--timeout", type=int, default=30, help="Request timeout (default: 30s)")
+ parser.add_argument("--dry-run", action="store_true", help="Print query without fetching")
+ parser.add_argument("--list", dest="list_info", action="store_true", help="Show available fields")
+
+ args = parser.parse_args()
+
+ if args.list_info:
+ print("EPA ECHO Facility Fields:")
+ print(" registry_id EPA Facility Registry Service ID")
+ print(" facility_name Official facility name")
+ print(" sic_codes Standard Industrial Classification")
+ print(" naics_codes North American Industry Classification")
+ print(" lat, lon Geographic coordinates")
+ print(" current_violations Active violation flag")
+ print(" qtrs_in_nc Quarters in non-compliance")
+ print(" inspection_count Total inspections")
+ print(" formal_action_count Formal enforcement actions")
+ print("\nProgram flags: air_flag, water_flag (CWA), rcra_flag, tri_flag")
+ return
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ if not any([args.query, args.state, args.city, args.zipcode, args.registry_id]):
+ print("ERROR: Provide at least one search criterion", file=sys.stderr)
+ sys.exit(1)
+
+ query_params = {
+ "query": args.query,
+ "state": args.state,
+ "city": args.city,
+ "zipcode": args.zipcode,
+ "registry_id": args.registry_id,
+ }
+
+ if args.dry_run:
+ print("Would search EPA ECHO:")
+ for k, v in query_params.items():
+ if v:
+ print(f" {k}: {v}")
+ return
+
+ print("Searching EPA ECHO facilities...")
+ records = search_facilities(
+ query=args.query,
+ state=args.state,
+ city=args.city,
+ zipcode=args.zipcode,
+ registry_id=args.registry_id,
+ timeout=args.timeout,
+ )
+
+ if not records:
+ print("No facilities found", file=sys.stderr)
+ sys.exit(1)
+
+ out_path = write_results(workspace, records, query_params)
+ print(f"EPA: {len(records)} facilities → {out_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/fetch_icij.py b/skills/openplanter/scripts/fetch_icij.py
new file mode 100644
index 00000000..b60ad9fa
--- /dev/null
+++ b/skills/openplanter/scripts/fetch_icij.py
@@ -0,0 +1,278 @@
+#!/usr/bin/env python3
+"""Fetch ICIJ Offshore Leaks Database records.
+
+Queries the International Consortium of Investigative Journalists' Offshore
+Leaks Database for entities, officers, intermediaries, and addresses linked
+to offshore structures. Covers Panama Papers, Paradise Papers, Pandora Papers,
+and Offshore Leaks datasets.
+
+Critical for sanctions evasion investigations — links entity names to
+offshore jurisdictions, registered agents, and beneficial ownership chains.
+
+Uses Python stdlib only — zero external dependencies.
+
+API: https://offshoreleaks.icij.org/search (public search)
+ https://offshoreleaks-data.icij.org/offshoreleaks/search (JSON API)
+Auth: None required (free public database).
+Rate limit: Undocumented, ~1 req/sec recommended.
+
+Usage:
+ python3 fetch_icij.py /path/to/investigation --entity "Acme Holdings"
+ python3 fetch_icij.py /path/to/investigation --entity "Mossack" --type intermediary
+ python3 fetch_icij.py /path/to/investigation --entity "Iran" --jurisdiction
+ python3 fetch_icij.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+SEARCH_URL = "https://offshoreleaks.icij.org/api/v1/search"
+
+# Entity types in the ICIJ database
+ENTITY_TYPES = {
+ "entity": "Offshore entities (companies, trusts, foundations)",
+ "officer": "Officers and beneficial owners",
+ "intermediary": "Intermediaries (law firms, banks, agents)",
+ "address": "Registered addresses",
+}
+
+# Datasets available
+DATASETS = {
+ "panama_papers": "Mossack Fonseca leak (2016)",
+ "paradise_papers": "Appleby + Asiaciti Trust (2017)",
+ "pandora_papers": "14 offshore service providers (2021)",
+ "offshore_leaks": "Original ICIJ leak (2013)",
+ "bahamas_leaks": "Bahamas corporate registry (2016)",
+}
+
+
+def search_icij(
+ query: str,
+ entity_type: str | None = None,
+ country: str | None = None,
+ jurisdiction: str | None = None,
+ dataset: str | None = None,
+ limit: int = 100,
+ timeout: int = 30,
+) -> list[dict]:
+ """Search the ICIJ Offshore Leaks Database."""
+ params: dict[str, str] = {
+ "q": query,
+ "limit": str(limit),
+ }
+ if entity_type:
+ params["type"] = entity_type
+ if country:
+ params["country"] = country
+ if jurisdiction:
+ params["jurisdiction"] = jurisdiction
+ if dataset:
+ params["dataset"] = dataset
+
+ url = f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={
+ "User-Agent": "OpenPlanter/1.0 (OSINT research tool)",
+ "Accept": "application/json",
+ })
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ raw = resp.read().decode("utf-8")
+ except urllib.error.HTTPError as e:
+ if e.code == 429:
+ print("Rate limited by ICIJ — wait 60s and retry", file=sys.stderr)
+ sys.exit(1)
+ print(f"ERROR: ICIJ API returned {e.code}: {e.reason}", file=sys.stderr)
+ raise
+
+ try:
+ data = json.loads(raw)
+ except json.JSONDecodeError:
+ # Fallback: try scraping the public search HTML for structured data
+ print("WARN: JSON parse failed, ICIJ may require browser access", file=sys.stderr)
+ return _fallback_search(query, entity_type, timeout)
+
+ # Handle different response shapes
+ if isinstance(data, list):
+ return data
+ if isinstance(data, dict):
+ return data.get("results", data.get("data", [data]))
+
+ return []
+
+
+def _fallback_search(query: str, entity_type: str | None, timeout: int) -> list[dict]:
+ """Fallback: scrape the public search page for basic results."""
+ params = {"q": query}
+ if entity_type:
+ params["e"] = entity_type
+ url = f"https://offshoreleaks.icij.org/search?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={"User-Agent": "OpenPlanter/1.0"})
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ html = resp.read().decode("utf-8")
+ except urllib.error.HTTPError:
+ return []
+
+ # Extract JSON-LD or structured data if present
+ import re
+    ld_match = re.search(r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>', html, re.DOTALL)
+ if ld_match:
+ try:
+ return json.loads(ld_match.group(1))
+ except json.JSONDecodeError:
+ pass
+
+ # Return empty — the web interface may require JavaScript
+ print("WARN: ICIJ web interface may require JavaScript rendering", file=sys.stderr)
+ return []
+
+
+def normalize_records(records: list[dict]) -> list[dict]:
+ """Normalize ICIJ records to a consistent schema."""
+ normalized = []
+ for rec in records:
+ entry = {
+ "icij_id": rec.get("id", rec.get("node_id", "")),
+ "name": rec.get("name", rec.get("entity_name", "")),
+ "type": rec.get("type", rec.get("entity_type", "")),
+ "jurisdiction": rec.get("jurisdiction", rec.get("jurisdiction_description", "")),
+ "country": rec.get("country_codes", rec.get("countries", "")),
+ "dataset": rec.get("dataset", rec.get("sourceID", "")),
+ "address": rec.get("address", rec.get("registered_address", "")),
+ "incorporation_date": rec.get("incorporation_date", ""),
+ "inactivation_date": rec.get("inactivation_date", ""),
+ "status": rec.get("status", ""),
+ "service_provider": rec.get("service_provider", ""),
+ "linked_to": rec.get("connected_to", rec.get("linked_entities", [])),
+ "note": rec.get("note", rec.get("internal_id", "")),
+ }
+ normalized.append(entry)
+ return normalized
+
+
+def write_results(
+ workspace: Path,
+ records: list[dict],
+ query_params: dict,
+) -> Path:
+ """Write results to workspace with provenance."""
+ out_dir = workspace / "datasets" / "scraped" / "icij"
+ out_dir.mkdir(parents=True, exist_ok=True)
+
+ safe_query = query_params["query"][:40].replace(" ", "_").replace("/", "_")
+ entity_type = query_params.get("entity_type") or "all"
+ filename = f"icij-{safe_query}-{entity_type}.json"
+
+ content = json.dumps(records, indent=2)
+ out_path = out_dir / filename
+ out_path.write_text(content, encoding="utf-8")
+
+ provenance = {
+ "source_id": "icij",
+ "name": "ICIJ Offshore Leaks Database",
+ "url": SEARCH_URL,
+ "format": "json",
+ "linking_keys": ["icij_id", "name", "jurisdiction", "country", "dataset"],
+ "query_params": query_params,
+ "download_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "file": filename,
+ "size_bytes": len(content.encode("utf-8")),
+ "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
+ "record_count": len(records),
+ }
+ prov_path = out_dir / "provenance.json"
+ prov_path.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ return out_path
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Search ICIJ Offshore Leaks Database for offshore entities and ownership"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--entity", "-e", required=False, help="Entity name to search")
+ parser.add_argument(
+ "--type", "-t", dest="entity_type",
+ choices=list(ENTITY_TYPES.keys()),
+ help="Filter by entity type",
+ )
+ parser.add_argument("--country", help="Filter by country code (e.g., IR, RU, CN)")
+ parser.add_argument("--jurisdiction", help="Filter by jurisdiction")
+ parser.add_argument("--dataset", choices=list(DATASETS.keys()), help="Filter by leak dataset")
+ parser.add_argument("--limit", type=int, default=100, help="Max results (default: 100)")
+ parser.add_argument("--timeout", type=int, default=30, help="Request timeout (default: 30s)")
+ parser.add_argument("--dry-run", action="store_true", help="Print query without fetching")
+ parser.add_argument("--list", dest="list_info", action="store_true", help="Show entity types and datasets")
+
+ args = parser.parse_args()
+
+ if args.list_info:
+ print("ICIJ Entity Types:")
+ for k, v in ENTITY_TYPES.items():
+ print(f" {k:15s} {v}")
+ print("\nDatasets:")
+ for k, v in DATASETS.items():
+ print(f" {k:20s} {v}")
+ return
+
+ if not args.entity:
+ print("ERROR: --entity is required", file=sys.stderr)
+ sys.exit(1)
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ query_params = {
+ "query": args.entity,
+ "entity_type": args.entity_type,
+ "country": args.country,
+ "jurisdiction": args.jurisdiction,
+ "dataset": args.dataset,
+ }
+
+ if args.dry_run:
+ print("Would search ICIJ Offshore Leaks:")
+ for k, v in query_params.items():
+ if v:
+ print(f" {k}: {v}")
+ return
+
+ print(f"Searching ICIJ Offshore Leaks for '{args.entity}'...")
+ records = search_icij(
+ query=args.entity,
+ entity_type=args.entity_type,
+ country=args.country,
+ jurisdiction=args.jurisdiction,
+ dataset=args.dataset,
+ limit=args.limit,
+ timeout=args.timeout,
+ )
+
+ if not records:
+ print("No records found", file=sys.stderr)
+ sys.exit(1)
+
+ normalized = normalize_records(records)
+ out_path = write_results(workspace, normalized, query_params)
+ print(f"ICIJ: {len(normalized)} records → {out_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/fetch_osha.py b/skills/openplanter/scripts/fetch_osha.py
new file mode 100644
index 00000000..a388dacf
--- /dev/null
+++ b/skills/openplanter/scripts/fetch_osha.py
@@ -0,0 +1,271 @@
+#!/usr/bin/env python3
+"""Fetch OSHA inspection and violation data.
+
+Queries the Occupational Safety and Health Administration enforcement data
+for workplace inspection records, violations, and penalties. Links
+establishment names to SIC codes, inspection types, and penalty amounts.
+
+Uses Python stdlib only — zero external dependencies.
+
+API: https://enforcedata.dol.gov/homePage/api_dataset
+ OSHA dataset via DOL Enforcement API
+Auth: None required (free public API).
+Rate limit: Undocumented, ~2 req/sec recommended.
+
+Usage:
+ python3 fetch_osha.py /path/to/investigation --query "Acme Manufacturing"
+ python3 fetch_osha.py /path/to/investigation --state NY --sic 2911
+ python3 fetch_osha.py /path/to/investigation --establishment "BP" --state TX
+ python3 fetch_osha.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+# DOL Enforcement Data API
+DOL_BASE = "https://enforcedata.dol.gov/api/enforcement"
+OSHA_DATASET = "inspection"
+
+# Key SIC codes for OpenPlanter investigations
+RELEVANT_SIC_CODES = {
+ "1311": "Crude petroleum and natural gas",
+ "1381": "Drilling oil and gas wells",
+ "2911": "Petroleum refining",
+ "3312": "Steel works, blast furnaces",
+ "3489": "Ordnance and accessories NEC",
+    "3699": "Electrical machinery and equipment NEC",
+ "3724": "Aircraft engines and engine parts",
+ "3761": "Guided missiles and space vehicles",
+    "4412": "Deep sea foreign transportation of freight",
+    "4911": "Electric services",
+    "4922": "Natural gas transmission",
+ "4923": "Natural gas transmission and distribution",
+ "4924": "Natural gas distribution",
+ "4953": "Refuse systems (hazardous waste)",
+}
+
+
+def search_inspections(
+ query: str | None = None,
+ state: str | None = None,
+ sic: str | None = None,
+ limit: int = 25,
+ timeout: int = 30,
+) -> list[dict]:
+ """Search OSHA inspection records."""
+ # DOL API uses a specific query format
+ filters = []
+ if query:
+ filters.append(f"estab_name eq '{query}'")
+ if state:
+ filters.append(f"site_state eq '{state}'")
+ if sic:
+ filters.append(f"sic_code eq '{sic}'")
+
+ params: dict[str, str] = {
+ "dataset": OSHA_DATASET,
+ "$top": str(limit),
+ "$orderby": "open_date desc",
+ }
+ if filters:
+ params["$filter"] = " and ".join(filters)
+
+ url = f"{DOL_BASE}?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={
+ "User-Agent": "OpenPlanter/1.0",
+ "Accept": "application/json",
+ })
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+ except urllib.error.HTTPError as e:
+ if e.code == 400:
+ # Try alternative: OSHA public search
+ return _fallback_search(query, state, sic, limit, timeout)
+ print(f"ERROR: DOL API returned {e.code}: {e.reason}", file=sys.stderr)
+ raise
+
+ # DOL API returns {"d": {"results": [...]}} or flat array
+ if isinstance(data, dict):
+ results = data.get("d", data).get("results", data.get("data", []))
+ elif isinstance(data, list):
+ results = data
+ else:
+ results = []
+
+ records = []
+ for item in results:
+ record = {
+ "activity_nr": str(item.get("activity_nr", "")),
+ "estab_name": item.get("estab_name", ""),
+ "site_address": item.get("site_address", ""),
+ "site_city": item.get("site_city", ""),
+ "site_state": item.get("site_state", ""),
+ "site_zip": item.get("site_zip", ""),
+ "sic_code": item.get("sic_code", ""),
+ "naics_code": item.get("naics_code", ""),
+ "insp_type": item.get("insp_type", ""),
+ "open_date": item.get("open_date", ""),
+ "close_case_date": item.get("close_case_date", ""),
+ "total_violations": item.get("total_violations", 0),
+ "total_serious": item.get("total_serious", 0),
+ "total_willful": item.get("total_willful", 0),
+ "total_repeat": item.get("total_repeat", 0),
+ "total_penalty": item.get("total_current_penalty", item.get("total_penalty", 0)),
+ "nr_in_estab": item.get("nr_in_estab", ""),
+ }
+ records.append(record)
+
+ return records
+
+
+def _fallback_search(
+ query: str | None,
+ state: str | None,
+ sic: str | None,
+ limit: int,
+ timeout: int,
+) -> list[dict]:
+ """Fallback: try the OSHA public establishment search."""
+ if not query:
+ return []
+
+ params = {
+ "p_search_type": "A",
+ "p_search_text": query,
+ "p_format": "json",
+ }
+ if state:
+ params["p_state"] = state
+
+ url = f"https://www.osha.gov/pls/imis/establishment.search?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={"User-Agent": "OpenPlanter/1.0"})
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ raw = resp.read().decode("utf-8")
+ except urllib.error.HTTPError:
+ return []
+
+ try:
+ return json.loads(raw) if raw.strip().startswith("[") else []
+ except json.JSONDecodeError:
+ return []
+
+
+def write_results(
+ workspace: Path,
+ records: list[dict],
+ query_params: dict,
+) -> Path:
+ """Write results to workspace with provenance."""
+ out_dir = workspace / "datasets" / "scraped" / "osha"
+ out_dir.mkdir(parents=True, exist_ok=True)
+
+ parts = ["osha"]
+ if query_params.get("query"):
+ safe = query_params["query"][:40].replace(" ", "_").replace("/", "_")
+ parts.append(safe)
+ if query_params.get("state"):
+ parts.append(query_params["state"])
+ if query_params.get("sic"):
+ parts.append(f"sic{query_params['sic']}")
+ filename = "-".join(parts) + ".json"
+
+ content = json.dumps(records, indent=2)
+ out_path = out_dir / filename
+ out_path.write_text(content, encoding="utf-8")
+
+ provenance = {
+ "source_id": "osha",
+ "name": "OSHA Inspection Data",
+ "url": DOL_BASE,
+ "format": "json",
+ "linking_keys": ["activity_nr", "estab_name", "sic_code", "naics_code", "site_state"],
+ "query_params": query_params,
+ "download_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "file": filename,
+ "size_bytes": len(content.encode("utf-8")),
+ "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
+ "record_count": len(records),
+ }
+ prov_path = out_dir / "provenance.json"
+ prov_path.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ return out_path
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Fetch OSHA inspection and violation data"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--query", "-q", help="Establishment name search")
+ parser.add_argument("--state", help="State abbreviation (e.g., TX)")
+ parser.add_argument("--sic", help="SIC code (e.g., 2911 for petroleum refining)")
+ parser.add_argument("--limit", type=int, default=25, help="Max results (default: 25)")
+ parser.add_argument("--timeout", type=int, default=30, help="Request timeout (default: 30s)")
+ parser.add_argument("--dry-run", action="store_true", help="Print query without fetching")
+ parser.add_argument("--list", dest="list_info", action="store_true", help="Show relevant SIC codes")
+
+ args = parser.parse_args()
+
+ if args.list_info:
+        print("Relevant SIC Codes for OpenPlanter:")
+ for code, desc in sorted(RELEVANT_SIC_CODES.items()):
+ print(f" {code} {desc}")
+ return
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ if not any([args.query, args.state, args.sic]):
+ print("ERROR: Provide at least one search criterion", file=sys.stderr)
+ sys.exit(1)
+
+ query_params = {
+ "query": args.query,
+ "state": args.state,
+ "sic": args.sic,
+ }
+
+ if args.dry_run:
+ print("Would search OSHA inspections:")
+ for k, v in query_params.items():
+ if v:
+ print(f" {k}: {v}")
+ return
+
+ print("Searching OSHA inspections...")
+ records = search_inspections(
+ query=args.query,
+ state=args.state,
+ sic=args.sic,
+ limit=args.limit,
+ timeout=args.timeout,
+ )
+
+ if not records:
+ print("No inspections found", file=sys.stderr)
+ sys.exit(1)
+
+ out_path = write_results(workspace, records, query_params)
+ print(f"OSHA: {len(records)} inspections → {out_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/fetch_propublica990.py b/skills/openplanter/scripts/fetch_propublica990.py
new file mode 100644
index 00000000..142964ca
--- /dev/null
+++ b/skills/openplanter/scripts/fetch_propublica990.py
@@ -0,0 +1,272 @@
+#!/usr/bin/env python3
+"""Fetch nonprofit tax filing data from ProPublica Nonprofit Explorer.
+
+Queries the ProPublica Nonprofit Explorer API for IRS Form 990 data —
+organizational finances, compensation, program expenses, and governance.
+Links EINs to organization names, revenue, and executive compensation.
+
+Critical for investigations involving nonprofits, foundations, think tanks,
+and dark money flows in defense/intelligence circles.
+
+Uses Python stdlib only — zero external dependencies.
+
+API: https://projects.propublica.org/nonprofits/api/v2
+Auth: None required (free public API).
+Rate limit: Undocumented, ~1 req/sec recommended.
+
+Usage:
+ python3 fetch_propublica990.py /path/to/investigation --query "Heritage Foundation"
+ python3 fetch_propublica990.py /path/to/investigation --ein 237327340
+ python3 fetch_propublica990.py /path/to/investigation --query "defense" --state DC --ntee U
+ python3 fetch_propublica990.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import sys
+import time
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+PP_BASE = "https://projects.propublica.org/nonprofits/api/v2"
+
+# NTEE codes relevant to defense/intelligence/policy investigations
+RELEVANT_NTEE = {
+ "Q": "International, Foreign Affairs, and National Security",
+ "R": "Civil Rights, Social Action, Advocacy",
+ "S": "Community Improvement, Capacity Building",
+ "U": "Science and Technology Research Institutes",
+ "W": "Public, Society Benefit — Multipurpose and Other",
+ "X": "Religion Related, Spiritual Development",
+}
+
+
+def search_organizations(
+ query: str,
+ state: str | None = None,
+ ntee: str | None = None,
+ page: int = 0,
+ timeout: int = 30,
+) -> list[dict]:
+ """Search ProPublica Nonprofit Explorer."""
+ params: dict[str, str] = {
+ "q": query,
+ "page": str(page),
+ }
+ if state:
+ params["state[id]"] = state
+ if ntee:
+ params["ntee[id]"] = ntee
+
+ url = f"{PP_BASE}/search.json?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={
+ "User-Agent": "OpenPlanter/1.0",
+ "Accept": "application/json",
+ })
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+ except urllib.error.HTTPError as e:
+ print(f"ERROR: ProPublica API returned {e.code}: {e.reason}", file=sys.stderr)
+ raise
+
+ organizations = data.get("organizations", [])
+ records = []
+ for org in organizations:
+ record = {
+ "ein": str(org.get("ein", "")),
+ "name": org.get("name", ""),
+ "city": org.get("city", ""),
+ "state": org.get("state", ""),
+ "ntee_code": org.get("ntee_code", ""),
+ "subsection_code": org.get("subsection_code", ""),
+ "classification_codes": org.get("classification_codes", ""),
+ "ruling_date": org.get("ruling_date", ""),
+ "deductibility_code": org.get("deductibility_code", ""),
+ "foundation_code": org.get("foundation_code", ""),
+ "activity_codes": org.get("activity_codes", ""),
+ "organization_code": org.get("organization_code", ""),
+ "exempt_organization_status_code": org.get("exempt_organization_status_code", ""),
+ "tax_period": org.get("tax_period", ""),
+ "asset_amount": org.get("asset_amount", 0),
+ "income_amount": org.get("income_amount", 0),
+ "revenue_amount": org.get("revenue_amount", 0),
+ "score": org.get("score", 0),
+ }
+ records.append(record)
+
+ return records
+
+
+def get_organization(ein: str, timeout: int = 30) -> dict | None:
+ """Get detailed organization data by EIN."""
+ url = f"{PP_BASE}/organizations/{ein}.json"
+ req = urllib.request.Request(url, headers={
+ "User-Agent": "OpenPlanter/1.0",
+ "Accept": "application/json",
+ })
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+ except urllib.error.HTTPError as e:
+ if e.code == 404:
+ return None
+ print(f"ERROR: ProPublica API returned {e.code}: {e.reason}", file=sys.stderr)
+ raise
+
+ org = data.get("organization", {})
+ filings = data.get("filings_with_data", [])
+
+ record = {
+ "ein": str(org.get("ein", "")),
+ "name": org.get("name", ""),
+ "address": org.get("address", ""),
+ "city": org.get("city", ""),
+ "state": org.get("state", ""),
+ "zipcode": org.get("zipcode", ""),
+ "ntee_code": org.get("ntee_code", ""),
+ "subsection_code": org.get("subsection_code", ""),
+ "total_revenue": org.get("total_revenue", 0),
+ "total_expenses": org.get("total_expenses", 0),
+ "total_assets": org.get("total_assets", 0),
+ "tax_period": org.get("tax_period", ""),
+ "filing_count": len(filings),
+ "latest_filing": {},
+ }
+
+ if filings:
+ latest = filings[0]
+ record["latest_filing"] = {
+ "tax_prd": latest.get("tax_prd", ""),
+ "tax_prd_yr": latest.get("tax_prd_yr", ""),
+ "totrevenue": latest.get("totrevenue", 0),
+ "totfuncexpns": latest.get("totfuncexpns", 0),
+ "totassetsend": latest.get("totassetsend", 0),
+ "totliabend": latest.get("totliabend", 0),
+ "compnsatncurrofcrs": latest.get("compnsatncurrofcrs", 0),
+ "pdf_url": latest.get("pdf_url", ""),
+ }
+
+ return record
+
+
+def write_results(
+ workspace: Path,
+ records: list[dict],
+ query_params: dict,
+) -> Path:
+ """Write results to workspace with provenance."""
+ out_dir = workspace / "datasets" / "scraped" / "propublica990"
+ out_dir.mkdir(parents=True, exist_ok=True)
+
+ if query_params.get("ein"):
+ filename = f"990-ein{query_params['ein']}.json"
+ else:
+ safe = query_params.get("query", "search")[:40].replace(" ", "_").replace("/", "_")
+ filename = f"990-{safe}.json"
+
+ content = json.dumps(records, indent=2)
+ out_path = out_dir / filename
+ out_path.write_text(content, encoding="utf-8")
+
+ provenance = {
+ "source_id": "propublica990",
+ "name": "ProPublica Nonprofit Explorer (IRS 990)",
+ "url": PP_BASE,
+ "format": "json",
+ "linking_keys": ["ein", "name", "ntee_code", "state"],
+ "query_params": query_params,
+ "download_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "file": filename,
+ "size_bytes": len(content.encode("utf-8")),
+ "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
+ "record_count": len(records),
+ }
+ prov_path = out_dir / "provenance.json"
+ prov_path.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ return out_path
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Fetch nonprofit IRS 990 data from ProPublica Nonprofit Explorer"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--query", "-q", help="Organization name search")
+ parser.add_argument("--ein", help="Employer Identification Number (direct lookup)")
+ parser.add_argument("--state", help="State abbreviation (e.g., DC)")
+ parser.add_argument("--ntee", help="NTEE classification code (e.g., Q, U)")
+ parser.add_argument("--page", type=int, default=0, help="Results page (default: 0)")
+ parser.add_argument("--timeout", type=int, default=30, help="Request timeout (default: 30s)")
+ parser.add_argument("--dry-run", action="store_true", help="Print query without fetching")
+ parser.add_argument("--list", dest="list_info", action="store_true", help="Show relevant NTEE codes")
+
+ args = parser.parse_args()
+
+ if args.list_info:
+ print("Relevant NTEE Codes:")
+ for code, desc in sorted(RELEVANT_NTEE.items()):
+ print(f" {code} {desc}")
+ return
+
+ if not args.query and not args.ein:
+ print("ERROR: Provide --query or --ein", file=sys.stderr)
+ sys.exit(1)
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ query_params = {
+ "query": args.query,
+ "ein": args.ein,
+ "state": args.state,
+ "ntee": args.ntee,
+ }
+
+ if args.dry_run:
+ print("Would search ProPublica 990:")
+ for k, v in query_params.items():
+ if v:
+ print(f" {k}: {v}")
+ return
+
+ if args.ein:
+ print(f"Fetching 990 for EIN {args.ein}...")
+ record = get_organization(args.ein, timeout=args.timeout)
+ if not record:
+ print(f"No organization found for EIN {args.ein}", file=sys.stderr)
+ sys.exit(1)
+ records = [record]
+ else:
+ print(f"Searching nonprofits for '{args.query}'...")
+ records = search_organizations(
+ query=args.query,
+ state=args.state,
+ ntee=args.ntee,
+ page=args.page,
+ timeout=args.timeout,
+ )
+
+ if not records:
+ print("No records found", file=sys.stderr)
+ sys.exit(1)
+
+ out_path = write_results(workspace, records, query_params)
+ print(f"ProPublica 990: {len(records)} organizations → {out_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/fetch_sam.py b/skills/openplanter/scripts/fetch_sam.py
new file mode 100644
index 00000000..f76b7ab3
--- /dev/null
+++ b/skills/openplanter/scripts/fetch_sam.py
@@ -0,0 +1,282 @@
+#!/usr/bin/env python3
+"""Fetch SAM.gov entity registration data.
+
+Queries the System for Award Management (SAM.gov) Entity Management API
+for federal contractor registrations. Links DUNS/UEI numbers to entity
+names, CAGE codes, NAICS codes, and federal contract eligibility.
+
+Critical for defense contractor investigations — every entity doing
+business with the US government must register in SAM.gov.
+
+Uses Python stdlib only — zero external dependencies.
+
+API: https://api.sam.gov/entity-information/v3/entities
+Auth: SAM_GOV_API_KEY env var required (free registration at api.data.gov).
+Rate limit: 1000 req/day with key.
+
+Usage:
+ python3 fetch_sam.py /path/to/investigation --query "Raytheon"
+ python3 fetch_sam.py /path/to/investigation --uei "ABCDEF123456"
+ python3 fetch_sam.py /path/to/investigation --cage "1ABC2"
+ python3 fetch_sam.py /path/to/investigation --naics 336411 --state CT
+ python3 fetch_sam.py /path/to/investigation --list
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import os
+import sys
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+SAM_BASE = "https://api.sam.gov/entity-information/v3/entities"
+
+# NAICS codes for defense/infrastructure investigations
+RELEVANT_NAICS = {
+ "336411": "Aircraft manufacturing",
+ "336412": "Aircraft engine and engine parts manufacturing",
+ "336414": "Guided missile and space vehicle manufacturing",
+ "336415": "Guided missile and space vehicle propulsion",
+ "336419": "Other guided missile and space vehicle parts",
+ "332993": "Ammunition manufacturing",
+ "332994": "Small arms, ordnance, and accessories",
+ "334511": "Search, detection, navigation instruments",
+ "334519": "Other measuring and controlling devices",
+ "541330": "Engineering services",
+ "541511": "Custom computer programming services",
+ "541512": "Computer systems design services",
+ "541519": "Other computer related services",
+ "541690": "Other scientific and technical consulting",
+ "541715": "R&D in physical, engineering, and life sciences",
+ "561210": "Facilities support services",
+ "561612": "Security guards and patrol services",
+ "562211": "Hazardous waste treatment and disposal",
+ "324110": "Petroleum refineries",
+ "486110": "Pipeline transportation of crude oil",
+ "486210": "Pipeline transportation of natural gas",
+}
+
+
+def search_entities(
+ query: str | None = None,
+ uei: str | None = None,
+ cage: str | None = None,
+ naics: str | None = None,
+ state: str | None = None,
+ country: str | None = None,
+ api_key: str | None = None,
+ limit: int = 25,
+ timeout: int = 30,
+) -> list[dict]:
+ """Search SAM.gov entity registrations."""
+ if not api_key:
+ api_key = os.environ.get("SAM_GOV_API_KEY") or os.environ.get("SAM_API_KEY")
+ if not api_key:
+ print(
+ "ERROR: SAM_GOV_API_KEY required. Register free at https://api.data.gov",
+ file=sys.stderr,
+ )
+ sys.exit(1)
+
+ params: dict[str, str] = {
+ "api_key": api_key,
+ "registrationStatus": "A", # Active
+ "includeSections": "entityRegistration,coreData",
+ "page": "0",
+ "size": str(limit),
+ }
+ if query:
+ params["legalBusinessName"] = query
+ if uei:
+ params["ueiSAM"] = uei
+ if cage:
+ params["cageCode"] = cage
+ if naics:
+ params["naicsCode"] = naics
+ if state:
+ params["physicalAddressStateCode"] = state
+ if country:
+ params["physicalAddressCountryCode"] = country
+
+ url = f"{SAM_BASE}?{urllib.parse.urlencode(params)}"
+ req = urllib.request.Request(url, headers={
+ "User-Agent": "OpenPlanter/1.0",
+ "Accept": "application/json",
+ })
+
+ try:
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ data = json.loads(resp.read().decode("utf-8"))
+ except urllib.error.HTTPError as e:
+ if e.code == 403:
+ print("ERROR: Invalid SAM.gov API key", file=sys.stderr)
+ elif e.code == 429:
+ print("ERROR: SAM.gov rate limit exceeded (1000/day)", file=sys.stderr)
+ else:
+ print(f"ERROR: SAM.gov API returned {e.code}: {e.reason}", file=sys.stderr)
+ raise
+
+ entities = data.get("entityData", [])
+ total = data.get("totalRecords", 0)
+
+ records = []
+ for entity in entities:
+ reg = entity.get("entityRegistration", {})
+ core = entity.get("coreData", {})
+ phys_addr = core.get("physicalAddress", {})
+ bus_types = core.get("businessTypes", {})
+ naics_list = core.get("naicsCode", [])
+
+ record = {
+ "uei": reg.get("ueiSAM", ""),
+ "cage_code": reg.get("cageCode", ""),
+ "legal_business_name": reg.get("legalBusinessName", ""),
+ "dba_name": reg.get("dbaName", ""),
+ "registration_status": reg.get("registrationStatus", ""),
+ "registration_date": reg.get("registrationDate", ""),
+ "expiration_date": reg.get("expirationDate", ""),
+ "activation_date": reg.get("activationDate", ""),
+ "entity_type": reg.get("entityType", ""),
+ "entity_structure": reg.get("entityStructure", ""),
+ "exclusion_status": reg.get("exclusionStatusFlag", ""),
+ "address_line1": phys_addr.get("addressLine1", ""),
+ "city": phys_addr.get("city", ""),
+ "state": phys_addr.get("stateOrProvinceCode", ""),
+ "zip": phys_addr.get("zipCode", ""),
+ "country": phys_addr.get("countryCode", ""),
+ "naics_codes": [n.get("naicsCode", "") for n in naics_list] if isinstance(naics_list, list) else [],
+ "primary_naics": core.get("primaryNaics", ""),
+ "business_type_list": bus_types.get("businessTypeList", []),
+ "sba_business_types": bus_types.get("sbaBusinessTypeList", []),
+ }
+ records.append(record)
+
+ if total > limit:
+ print(f" (showing {len(records)} of {total} total)", file=sys.stderr)
+
+ return records
+
+
+def write_results(
+ workspace: Path,
+ records: list[dict],
+ query_params: dict,
+) -> Path:
+ """Write results to workspace with provenance."""
+ out_dir = workspace / "datasets" / "scraped" / "sam"
+ out_dir.mkdir(parents=True, exist_ok=True)
+
+ parts = ["sam"]
+ if query_params.get("query"):
+ safe = query_params["query"][:40].replace(" ", "_").replace("/", "_")
+ parts.append(safe)
+ if query_params.get("uei"):
+ parts.append(f"uei{query_params['uei']}")
+ if query_params.get("cage"):
+ parts.append(f"cage{query_params['cage']}")
+ if query_params.get("naics"):
+ parts.append(f"naics{query_params['naics']}")
+ filename = "-".join(parts) + ".json"
+
+ content = json.dumps(records, indent=2)
+ out_path = out_dir / filename
+ out_path.write_text(content, encoding="utf-8")
+
+ provenance = {
+ "source_id": "sam",
+ "name": "SAM.gov Entity Registration",
+ "url": SAM_BASE,
+ "format": "json",
+ "linking_keys": ["uei", "cage_code", "legal_business_name", "naics_codes", "state"],
+ "query_params": query_params,
+ "download_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "file": filename,
+ "size_bytes": len(content.encode("utf-8")),
+ "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
+ "record_count": len(records),
+ }
+ prov_path = out_dir / "provenance.json"
+ prov_path.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ return out_path
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Fetch SAM.gov federal contractor registration data"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--query", "-q", help="Legal business name search")
+ parser.add_argument("--uei", help="Unique Entity Identifier (UEI)")
+ parser.add_argument("--cage", help="CAGE code")
+ parser.add_argument("--naics", help="NAICS code (e.g., 336411)")
+ parser.add_argument("--state", help="State abbreviation (e.g., CT)")
+ parser.add_argument("--country", default="USA", help="Country code (default: USA)")
+ parser.add_argument("--limit", type=int, default=25, help="Max results (default: 25)")
+ parser.add_argument("--timeout", type=int, default=30, help="Request timeout (default: 30s)")
+ parser.add_argument("--dry-run", action="store_true", help="Print query without fetching")
+ parser.add_argument("--list", dest="list_info", action="store_true", help="Show relevant NAICS codes")
+
+ args = parser.parse_args()
+
+ if args.list_info:
+ print("Relevant NAICS Codes for Defense/Infrastructure:")
+ for code, desc in sorted(RELEVANT_NAICS.items()):
+ print(f" {code} {desc}")
+ return
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ if not any([args.query, args.uei, args.cage, args.naics]):
+ print("ERROR: Provide at least one search criterion", file=sys.stderr)
+ sys.exit(1)
+
+ query_params = {
+ "query": args.query,
+ "uei": args.uei,
+ "cage": args.cage,
+ "naics": args.naics,
+ "state": args.state,
+ "country": args.country,
+ }
+
+ if args.dry_run:
+ print("Would search SAM.gov:")
+ for k, v in query_params.items():
+ if v:
+ print(f" {k}: {v}")
+ return
+
+ print("Searching SAM.gov entity registrations...")
+ records = search_entities(
+ query=args.query,
+ uei=args.uei,
+ cage=args.cage,
+ naics=args.naics,
+ state=args.state,
+ country=args.country,
+ limit=args.limit,
+ timeout=args.timeout,
+ )
+
+ if not records:
+ print("No entities found", file=sys.stderr)
+ sys.exit(1)
+
+ out_path = write_results(workspace, records, query_params)
+ print(f"SAM.gov: {len(records)} entities → {out_path}")
+
+
+if __name__ == "__main__":
+ main()
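The `write_results` helper above pairs every dataset with a `provenance.json` sidecar recording a sha256 digest and byte count. A downstream consumer can recheck that sidecar before trusting a file; a minimal sketch (`verify_provenance` is illustrative, not part of the scripts):

```python
import hashlib
import json
from pathlib import Path


def verify_provenance(out_dir: Path) -> bool:
    """Recompute the hash and size recorded in provenance.json and compare."""
    prov = json.loads((out_dir / "provenance.json").read_text(encoding="utf-8"))
    data = (out_dir / prov["file"]).read_bytes()
    return (
        hashlib.sha256(data).hexdigest() == prov["sha256"]
        and len(data) == prov["size_bytes"]
    )
```

Any mismatch signals the dataset was modified after download, which should disqualify it from evidence chains until re-fetched.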
diff --git a/skills/openplanter/scripts/init_workspace.py b/skills/openplanter/scripts/init_workspace.py
new file mode 100644
index 00000000..82daa803
--- /dev/null
+++ b/skills/openplanter/scripts/init_workspace.py
@@ -0,0 +1,193 @@
+#!/usr/bin/env python3
+"""Initialize an OpenPlanter investigation workspace.
+
+Creates the standard directory structure for dataset investigation:
+ datasets/ — Raw source data (CSV, JSON). Never modify originals.
+ entities/ — Resolved entity maps (canonical.json)
+ findings/ — Analysis outputs (cross-references, summaries)
+ evidence/ — Evidence chains with full provenance
+ plans/ — Investigation plans and methodology docs
+
+Usage:
+ python3 init_workspace.py /path/to/investigation
+ python3 init_workspace.py /path/to/investigation --plan "Cross-reference campaign finance with lobbying"
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from datetime import datetime, timezone
+from pathlib import Path
+
+DIRS = ["datasets", "entities", "findings", "evidence", "plans"]
+
+README = """\
+# Investigation Workspace
+
+Created: {timestamp}
+
+## Structure
+
+```
+datasets/ Raw source data (CSV, JSON). Never modify originals.
+entities/ Resolved entity maps (canonical.json)
+findings/ Analysis outputs (cross-references, summaries)
+evidence/ Evidence chains with full provenance
+plans/ Investigation plans and methodology docs
+```
+
+## Workflow
+
+1. Drop datasets into `datasets/`
+2. Write investigation plan in `plans/plan.md`
+3. Run entity resolution: `entity_resolver.py <workspace>`
+4. Run cross-referencing: `cross_reference.py <workspace>`
+5. Validate evidence chains: `evidence_chain.py <workspace>`
+6. Score confidence: `confidence_scorer.py <workspace>`
+7. Review findings in `findings/`
+
+## Provenance
+
+Record for every dataset:
+- Source URL or file path
+- Access/download timestamp
+- Any transformations applied
+- Admiralty source reliability grade (A-F)
+
+## Confidence Tiers
+
+| Tier | Criteria |
+|------|----------|
+| Confirmed | 2+ independent sources; hard signal match (EIN, phone) |
+| Probable | Strong single source; high fuzzy match (>0.85) on name + address |
+| Possible | Circumstantial only; moderate match (0.55-0.84) |
+| Unresolved | Contradictory evidence; insufficient data |
+"""
+
+PLAN_TEMPLATE = """\
+# Investigation Plan: {title}
+
+**Date**: {date}
+**Objective**: {objective}
+
+## Data Sources
+
+| Dataset | Format | Expected Records | Linking Fields |
+|---------|--------|-----------------|----------------|
+| | CSV/JSON | ~N | name, address, ... |
+
+## Entity Resolution Strategy
+
+- Primary matching fields: [name, EIN, address]
+- Blocking key: [first_3_chars + state]
+- Similarity threshold: 0.85 (confirmed), 0.70 (probable)
+
+## Cross-Dataset Linking Approach
+
+- Link datasets via: [field matching, fuzzy name, shared address]
+- Expected match rate: ~N%
+
+## Evidence Chain Construction
+
+- Confidence model: 4-tier (confirmed/probable/possible/unresolved)
+- Minimum corroboration: 2 independent sources for confirmed
+
+## Expected Deliverables
+
+- [ ] Entity canonical map (entities/canonical.json)
+- [ ] Cross-reference report (findings/cross-references.json)
+- [ ] Investigation summary (findings/summary.md)
+- [ ] Evidence appendix (evidence/chains.json)
+
+## Risks and Limitations
+
+- [Known data quality issues]
+- [Missing datasets]
+- [Entity resolution edge cases]
+"""
+
+
+def init_workspace(workspace: Path, plan_title: str | None = None) -> None:
+ workspace = workspace.resolve()
+
+ if workspace.exists() and any(workspace.iterdir()):
+ # Check if already initialized
+ if (workspace / "entities").exists():
+ print(f"Workspace already initialized: {workspace}")
+ return
+ # Non-empty but not an investigation workspace — proceed carefully
+ print(f"Warning: {workspace} is not empty. Creating investigation subdirectories.")
+
+ workspace.mkdir(parents=True, exist_ok=True)
+
+ for d in DIRS:
+ (workspace / d).mkdir(exist_ok=True)
+ # Add .gitkeep to empty dirs
+ gitkeep = workspace / d / ".gitkeep"
+ if not gitkeep.exists():
+ gitkeep.touch()
+
+ # Write README
+ now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+ readme_path = workspace / "README.md"
+ if not readme_path.exists():
+ readme_path.write_text(README.format(timestamp=now), encoding="utf-8")
+
+ # Initialize empty canonical entity map
+ canonical_path = workspace / "entities" / "canonical.json"
+ if not canonical_path.exists():
+ canonical_path.write_text(
+ json.dumps(
+ {
+ "metadata": {
+ "created": now,
+ "workspace": str(workspace),
+ "datasets_processed": [],
+ "total_entities": 0,
+ "resolution_threshold": 0.85,
+ },
+ "entities": [],
+ },
+ indent=2,
+ ),
+ encoding="utf-8",
+ )
+
+ # Write plan template if requested
+ if plan_title:
+ plan_path = workspace / "plans" / "plan.md"
+ if not plan_path.exists():
+ today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
+ plan_path.write_text(
+ PLAN_TEMPLATE.format(
+ title=plan_title, date=today, objective=plan_title
+ ),
+ encoding="utf-8",
+ )
+ print(f" Plan template: {plan_path}")
+
+ print(f"Workspace initialized: {workspace}")
+ for d in DIRS:
+ print(f" {d}/")
+ print(f" README.md")
+ print(f" entities/canonical.json")
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Initialize an OpenPlanter investigation workspace"
+ )
+ parser.add_argument("workspace", type=Path, help="Path to workspace directory")
+ parser.add_argument(
+ "--plan",
+ type=str,
+ default=None,
+ help="Investigation title — creates a plan template in plans/plan.md",
+ )
+ args = parser.parse_args()
+ init_workspace(args.workspace, args.plan)
+
+
+if __name__ == "__main__":
+ main()
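The README that `init_workspace.py` writes defines four confidence tiers keyed to fuzzy-match scores. How a resolver might map a name comparison onto those tiers can be sketched with `difflib.SequenceMatcher` (the function and its threshold placement are illustrative, assuming the README's cutoffs):

```python
import difflib


def tier_for_match(a: str, b: str, sources: int = 1, hard_signal: bool = False) -> str:
    """Map a name comparison onto the workspace README's four tiers (illustrative)."""
    score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if hard_signal and sources >= 2:
        return "confirmed"   # 2+ independent sources with a hard signal (EIN, phone)
    if score > 0.85:
        return "probable"    # high fuzzy match on name
    if score >= 0.55:
        return "possible"    # moderate, circumstantial match
    return "unresolved"      # contradictory or insufficient evidence
```

Note the hard-signal check runs first: corroborated exact identifiers outrank any string-similarity score.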
diff --git a/skills/openplanter/scripts/investigate.py b/skills/openplanter/scripts/investigate.py
new file mode 100644
index 00000000..91967c4e
--- /dev/null
+++ b/skills/openplanter/scripts/investigate.py
@@ -0,0 +1,415 @@
+#!/usr/bin/env python3
+"""Master orchestrator: run a full investigation pipeline end-to-end.
+
+Chains the OpenPlanter skill scripts in the correct order:
+ collect → resolve → enrich → analyze → report
+
+Each phase can be run independently or as part of the full pipeline.
+Produces a structured findings summary in findings/summary.md.
+
+Uses Python stdlib only — zero external dependencies.
+Invokes sibling scripts as subprocesses.
+
+Usage:
+ python3 investigate.py /path/to/workspace --objective "Investigate X"
+ python3 investigate.py /path/to/workspace --phases collect,resolve,analyze
+ python3 investigate.py /path/to/workspace --phases all
+ python3 investigate.py /path/to/workspace --phases report --dry-run
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+# ---------------------------------------------------------------------------
+# Script locations (sibling scripts in the same directory)
+# ---------------------------------------------------------------------------
+
+_SCRIPT_DIR = Path(__file__).resolve().parent
+
+PHASE_SCRIPTS = {
+ "init": _SCRIPT_DIR / "init_workspace.py",
+ "fetch": _SCRIPT_DIR / "dataset_fetcher.py",
+ "scrape": _SCRIPT_DIR / "scrape_records.py",
+ "resolve": _SCRIPT_DIR / "entity_resolver.py",
+ "crossref": _SCRIPT_DIR / "cross_reference.py",
+ "enrich": _SCRIPT_DIR / "web_enrich.py",
+ "evidence": _SCRIPT_DIR / "evidence_chain.py",
+ "score": _SCRIPT_DIR / "confidence_scorer.py",
+}
+
+# Phase groups map to pipeline stages
+PHASE_GROUPS = {
+ "collect": ["fetch", "scrape"],
+ "resolve": ["resolve", "crossref"],
+ "enrich": ["enrich"],
+ "analyze": ["evidence", "score"],
+ "report": [], # handled separately
+}
+
+ALL_PHASES = ["collect", "resolve", "enrich", "analyze", "report"]
+
+
+def run_script(script: Path, args: list[str], timeout: int = 300) -> dict:
+ """Run a sibling script and capture its result."""
+ cmd = [sys.executable, str(script), *args]
+ name = script.stem
+
+ print(f" Running: {name}")
+ t0 = time.monotonic()
+
+ try:
+ result = subprocess.run(
+ cmd,
+ capture_output=True,
+ text=True,
+ timeout=timeout,
+ env={**os.environ},
+ )
+ elapsed = round(time.monotonic() - t0, 2)
+
+ if result.stdout:
+ # Indent output
+ for line in result.stdout.strip().split("\n"):
+ print(f" {line}")
+
+ if result.returncode != 0 and result.stderr:
+ for line in result.stderr.strip().split("\n")[:5]:
+ print(f" ERROR: {line}")
+
+ return {
+ "script": name,
+ "status": "ok" if result.returncode == 0 else "error",
+ "exit_code": result.returncode,
+ "elapsed_sec": elapsed,
+ }
+ except subprocess.TimeoutExpired:
+ return {
+ "script": name,
+ "status": "timeout",
+ "elapsed_sec": round(time.monotonic() - t0, 2),
+ }
+ except FileNotFoundError:
+ return {
+ "script": name,
+ "status": "error",
+ "error": f"Script not found: {script}",
+ "elapsed_sec": 0,
+ }
+
+
+def run_phase(
+ phase: str,
+ workspace: Path,
+ timeout: int,
+ extra_args: dict[str, list[str]] | None = None,
+) -> list[dict]:
+ """Run all scripts in a phase group."""
+ print(f"\n{'='*60}")
+ print(f" Phase: {phase.upper()}")
+ print(f"{'='*60}\n")
+
+ results = []
+ scripts = PHASE_GROUPS.get(phase, [])
+
+ for script_key in scripts:
+ script = PHASE_SCRIPTS.get(script_key)
+ if not script or not script.exists():
+ print(f" Skipping {script_key}: script not found")
+ results.append({
+ "script": script_key,
+ "status": "skipped",
+ "reason": "not found",
+ })
+ continue
+
+ args = [str(workspace)]
+ # Add any extra args for this script
+ if extra_args and script_key in extra_args:
+ args.extend(extra_args[script_key])
+
+ result = run_script(script, args, timeout=timeout)
+ results.append(result)
+ print()
+
+ return results
+
+
+def generate_report(workspace: Path) -> str:
+ """Generate a findings summary from all investigation outputs."""
+ lines = [
+ f"# Investigation Summary",
+ f"",
+ f"**Generated:** {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')}",
+ f"**Workspace:** `{workspace}`",
+ f"",
+ ]
+
+ # Entity resolution summary
+ canon_path = workspace / "entities" / "canonical.json"
+ if canon_path.exists():
+ try:
+ data = json.loads(canon_path.read_text(encoding="utf-8"))
+ entities = data.get("entities", data) if isinstance(data, dict) else data
+ lines.append(f"## Entity Resolution")
+ lines.append(f"")
+ lines.append(f"- **Canonical entities:** {len(entities)}")
+ # Count by confidence
+ conf_counts: dict[str, int] = {}
+ for e in entities:
+ c = e.get("confidence", "unresolved")
+ conf_counts[c] = conf_counts.get(c, 0) + 1
+ for tier, count in sorted(conf_counts.items()):
+ lines.append(f" - {tier}: {count}")
+ lines.append(f"")
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # Cross-reference summary
+ xref_path = workspace / "findings" / "cross-references.json"
+ if xref_path.exists():
+ try:
+ data = json.loads(xref_path.read_text(encoding="utf-8"))
+ xrefs = data.get("cross_references", data) if isinstance(data, dict) else data
+ if isinstance(xrefs, list):
+ lines.append(f"## Cross-References")
+ lines.append(f"")
+ lines.append(f"- **Cross-referenced entities:** {len(xrefs)}")
+ multi_source = sum(
+ 1 for x in xrefs
+ if len(x.get("sources", x.get("datasets", []))) >= 2
+ )
+ lines.append(f"- **Multi-source matches:** {multi_source}")
+ lines.append(f"")
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # Evidence chains summary
+ chains_path = workspace / "evidence" / "chains.json"
+ if chains_path.exists():
+ try:
+ data = json.loads(chains_path.read_text(encoding="utf-8"))
+ chains = data.get("chains", data) if isinstance(data, dict) else data
+ if isinstance(chains, list):
+ lines.append(f"## Evidence Chains")
+ lines.append(f"")
+ lines.append(f"- **Chains validated:** {len(chains)}")
+ lines.append(f"")
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # Confidence scoring summary
+ scoring_path = workspace / "evidence" / "scoring-log.json"
+ if scoring_path.exists():
+ try:
+ data = json.loads(scoring_path.read_text(encoding="utf-8"))
+ if isinstance(data, dict):
+ lines.append(f"## Confidence Scoring")
+ lines.append(f"")
+ for key, val in data.items():
+ if key != "timestamp":
+ lines.append(f"- **{key}:** {val}")
+ lines.append(f"")
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # Enrichment summary
+ enriched_path = workspace / "entities" / "enriched.json"
+ if enriched_path.exists():
+ try:
+ data = json.loads(enriched_path.read_text(encoding="utf-8"))
+ meta = data.get("enrichment_metadata", {})
+ enriched = data.get("entities", [])
+ lines.append(f"## Web Enrichment")
+ lines.append(f"")
+ lines.append(f"- **Entities enriched:** {meta.get('entities_enriched', len(enriched))}")
+ lines.append(f"- **Categories:** {', '.join(meta.get('categories', []))}")
+ total_results = sum(len(e.get("search_results", [])) for e in enriched)
+ lines.append(f"- **Total search results:** {total_results}")
+ lines.append(f"")
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # Scraped records summary
+ scraped_prov = workspace / "datasets" / "scraped" / "provenance.json"
+ if scraped_prov.exists():
+ try:
+ data = json.loads(scraped_prov.read_text(encoding="utf-8"))
+ results = data.get("results", [])
+ lines.append(f"## Public Records")
+ lines.append(f"")
+ lines.append(f"- **API queries:** {len(results)}")
+ ok = sum(1 for r in results if r.get("status") == "ok")
+ lines.append(f"- **Matches found:** {ok}")
+ lines.append(f"")
+ except (json.JSONDecodeError, OSError):
+ pass
+
+ # Dataset inventory
+ ds_dir = workspace / "datasets"
+ if ds_dir.exists():
+ files = [f for f in ds_dir.rglob("*") if f.is_file() and f.name != ".gitkeep"]
+ if files:
+ lines.append(f"## Dataset Inventory")
+ lines.append(f"")
+ lines.append(f"- **Total files:** {len(files)}")
+ total_size = sum(f.stat().st_size for f in files)
+ if total_size > 1_000_000:
+ lines.append(f"- **Total size:** {total_size / 1_000_000:.1f} MB")
+ else:
+ lines.append(f"- **Total size:** {total_size / 1_000:.1f} KB")
+ lines.append(f"")
+
+ # Methodology
+ lines.extend([
+ f"## Methodology",
+ f"",
+ f"This investigation used the OpenPlanter skill pipeline:",
+ f"1. Entity resolution via fuzzy matching (difflib.SequenceMatcher)",
+ f"2. Cross-referencing across datasets using canonical entity map",
+ f"3. Evidence chain validation with hop tracking",
+ f"4. Confidence scoring using Admiralty System tiers (NATO AJP-2.1)",
+ f"",
+ f"See `entities/canonical.json` for the full entity map and "
+ f"`findings/cross-references.json` for detailed cross-references.",
+ ])
+
+ return "\n".join(lines)
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Run OpenPlanter investigation pipeline"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace",
+ )
+ parser.add_argument(
+ "--objective", type=str,
+ help="Investigation objective (included in report)",
+ )
+ parser.add_argument(
+ "--phases", type=str, default="all",
+ help=f"Comma-separated phases or 'all' (default: all). "
+ f"Available: {', '.join(ALL_PHASES)}",
+ )
+ parser.add_argument(
+ "--threshold", type=float, default=0.85,
+ help="Entity resolution similarity threshold (default: 0.85)",
+ )
+ parser.add_argument(
+ "--fetch-sources", type=str, default="sec",
+ help="Dataset sources for collect phase (default: sec). "
+ "Options: sec, fec, ofac, sanctions, lda, or 'all'",
+ )
+ parser.add_argument(
+ "--timeout", type=int, default=300,
+ help="Timeout per script in seconds (default: 300)",
+ )
+ parser.add_argument(
+ "--dry-run", action="store_true",
+ help="Show what would run without executing",
+ )
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+
+ # Auto-init workspace if it doesn't exist
+ if not workspace.exists():
+ print(f"Initializing workspace: {workspace}")
+ init_script = PHASE_SCRIPTS["init"]
+ if init_script.exists():
+ run_script(init_script, [str(workspace)], timeout=30)
+ else:
+ workspace.mkdir(parents=True)
+ for d in ["datasets", "entities", "findings", "evidence", "plans"]:
+ (workspace / d).mkdir(exist_ok=True)
+
+ # Resolve phases
+ if args.phases == "all":
+ phases = ALL_PHASES
+ else:
+ phases = [p.strip() for p in args.phases.split(",")]
+ unknown = [p for p in phases if p not in ALL_PHASES]
+ if unknown:
+ print(f"Error: unknown phase(s): {', '.join(unknown)}\n"
+ f"Available: {', '.join(ALL_PHASES)}", file=sys.stderr)
+ sys.exit(1)
+
+ # Extra args per script
+ extra_args: dict[str, list[str]] = {
+ "fetch": ["--sources", args.fetch_sources],
+ "resolve": ["--threshold", str(args.threshold)],
+ }
+
+ print(f"OpenPlanter Investigation Pipeline")
+ print(f"Workspace: {workspace}")
+ if args.objective:
+ print(f"Objective: {args.objective}")
+ print(f"Phases: {', '.join(phases)}")
+ print()
+
+ if args.dry_run:
+ for phase in phases:
+ scripts = PHASE_GROUPS.get(phase, [])
+ for sk in scripts:
+ script = PHASE_SCRIPTS.get(sk)
+ print(f" [dry-run] Phase {phase}: would run {sk} ({script})")
+ if phase == "report":
+ print(f" [dry-run] Phase report: would generate findings/summary.md")
+ return
+
+ t_start = time.monotonic()
+ all_results: list[dict] = []
+
+ for phase in phases:
+ if phase == "report":
+ # Generate report
+ print(f"\n{'='*60}")
+ print(f" Phase: REPORT")
+ print(f"{'='*60}\n")
+
+ report = generate_report(workspace)
+ if args.objective:
+ report = report.replace(
+ "# Investigation Summary",
+ f"# Investigation Summary\n\n**Objective:** {args.objective}",
+ )
+
+ report_path = workspace / "findings" / "summary.md"
+ report_path.parent.mkdir(parents=True, exist_ok=True)
+ report_path.write_text(report, encoding="utf-8")
+ print(f" Wrote: {report_path}")
+ all_results.append({"script": "report", "status": "ok"})
+ else:
+ results = run_phase(phase, workspace, args.timeout, extra_args)
+ all_results.extend(results)
+
+ elapsed = round(time.monotonic() - t_start, 2)
+
+ print(f"\n{'='*60}")
+ print(f" Pipeline Complete ({elapsed}s)")
+ print(f"{'='*60}\n")
+
+ ok = sum(1 for r in all_results if r.get("status") == "ok")
+ errs = sum(1 for r in all_results if r.get("status") == "error")
+ skipped = sum(1 for r in all_results if r.get("status") == "skipped")
+ print(f" {ok} succeeded, {errs} errors, {skipped} skipped")
+
+ if errs:
+ print("\nErrors:")
+ for r in all_results:
+ if r.get("status") == "error":
+ print(f" - {r['script']}: {r.get('error', 'non-zero exit')}")
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
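The phase resolution in `investigate.py` above splits into two steps: validate the `--phases` value against `ALL_PHASES`, then expand each phase group into its ordered script keys. Pulled out as a standalone function (constants mirrored from the script; `expand_phases` itself is a sketch, not part of the file):

```python
# Mirrored from investigate.py.
PHASE_GROUPS = {
    "collect": ["fetch", "scrape"],
    "resolve": ["resolve", "crossref"],
    "enrich": ["enrich"],
    "analyze": ["evidence", "score"],
    "report": [],  # handled inline by generate_report()
}
ALL_PHASES = ["collect", "resolve", "enrich", "analyze", "report"]


def expand_phases(spec: str) -> list[str]:
    """Resolve a --phases value into the ordered list of script keys to run."""
    phases = ALL_PHASES if spec == "all" else [p.strip() for p in spec.split(",")]
    unknown = [p for p in phases if p not in ALL_PHASES]
    if unknown:
        raise ValueError(f"unknown phase(s): {', '.join(unknown)}")
    return [key for p in phases for key in PHASE_GROUPS[p]]
```

For example, `--phases collect,analyze` runs `fetch`, `scrape`, `evidence`, and `score` while skipping resolution and enrichment entirely.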
diff --git a/skills/openplanter/scripts/scrape_records.py b/skills/openplanter/scripts/scrape_records.py
new file mode 100644
index 00000000..b99fa1ab
--- /dev/null
+++ b/skills/openplanter/scripts/scrape_records.py
@@ -0,0 +1,394 @@
+#!/usr/bin/env python3
+"""Fetch entity-specific records from public records APIs.
+
+Queries structured government APIs (SEC EDGAR, FEC, Senate LDA, USAspending)
+for entity-specific records using urllib. For JavaScript-heavy portals (state
+SOS sites), optionally delegates to Firecrawl as a subprocess.
+
+Uses Python stdlib only — zero external dependencies.
+
+Supported API sources:
+ sec — SEC EDGAR entity submissions (CIK lookup + filing history)
+ fec — FEC individual/committee contributions (requires FEC_API_KEY)
+ lda — Senate LDA lobbying filings by registrant/client name
+ spend — USAspending.gov award search by recipient name
+
+Usage:
+ python3 scrape_records.py /path/to/workspace --entities "Acme Corp"
+ python3 scrape_records.py /path/to/workspace --sources sec,fec,lda
+ python3 scrape_records.py /path/to/workspace --all-entities --sources sec
+ python3 scrape_records.py /path/to/workspace --entities "Acme" --dry-run
+"""
+from __future__ import annotations
+
+import argparse
+import hashlib
+import json
+import os
+import re
+import sys
+import time
+import urllib.error
+import urllib.parse
+import urllib.request
+from datetime import datetime, timezone
+from pathlib import Path
+
+# ---------------------------------------------------------------------------
+# API source definitions
+# ---------------------------------------------------------------------------
+
+_USER_AGENT = "OpenPlanter/1.0 openplanter@investigation.local"
+
+# SEC EDGAR: rate limit ~10 req/sec, requires User-Agent with email
+_SEC_FULLTEXT = "https://efts.sec.gov/LATEST/search-index?q={query}&dateRange=custom&startdt=2020-01-01&forms=10-K,10-Q,8-K,DEF+14A&hits.hits.total=true"
+_SEC_SUBMISSIONS = "https://data.sec.gov/submissions/CIK{cik}.json"
+_SEC_TICKERS = "https://www.sec.gov/files/company_tickers.json"
+
+# FEC: free API key at api.open.fec.gov, 1000 requests/hr
+_FEC_COMMITTEES = "https://api.open.fec.gov/v1/names/committees/?q={query}&api_key={key}"
+_FEC_CANDIDATES = "https://api.open.fec.gov/v1/names/candidates/?q={query}&api_key={key}"
+
+# Senate LDA: no auth, ~1 req/sec polite
+_LDA_FILINGS = "https://lda.senate.gov/api/v1/filings/?filing_type=1&registrant_name={query}&format=json&page_size=25"
+_LDA_REGISTRANTS = "https://lda.senate.gov/api/v1/registrants/?name={query}&format=json&page_size=25"
+
+# USAspending: no auth, liberal rate limits
+_SPEND_SEARCH = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
+
+
+def _sha256(data: bytes) -> str:
+ return hashlib.sha256(data).hexdigest()
+
+
+def _http_get(url: str, headers: dict[str, str] | None = None, timeout: int = 30) -> bytes:
+ """GET request via urllib, return raw bytes."""
+ hdrs = {"User-Agent": _USER_AGENT}
+ if headers:
+ hdrs.update(headers)
+ req = urllib.request.Request(url, headers=hdrs)
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ return resp.read()
+
+
+def _http_post_json(url: str, payload: dict, timeout: int = 30) -> bytes:
+ """POST JSON via urllib, return raw bytes."""
+ data = json.dumps(payload).encode("utf-8")
+ req = urllib.request.Request(
+ url,
+ data=data,
+ headers={
+ "User-Agent": _USER_AGENT,
+ "Content-Type": "application/json",
+ },
+ )
+ with urllib.request.urlopen(req, timeout=timeout) as resp:
+ return resp.read()
+
+
+# ---------------------------------------------------------------------------
+# Source-specific fetchers
+# ---------------------------------------------------------------------------
+
+def fetch_sec(entity_name: str, dest_dir: Path, timeout: int) -> dict:
+ """Look up entity on SEC EDGAR and fetch submissions."""
+ # Step 1: Find CIK via company tickers JSON
+ print(f" SEC: Looking up CIK for '{entity_name}'")
+ try:
+ raw = _http_get(_SEC_TICKERS, timeout=timeout)
+ tickers = json.loads(raw)
+ except (urllib.error.URLError, json.JSONDecodeError) as e:
+ return {"source": "sec", "status": "error", "error": f"Ticker lookup failed: {e}"}
+
+ # Search for matching entity (case-insensitive substring)
+ query_lower = entity_name.lower()
+ matches = []
+ for _key, entry in tickers.items():
+ title = entry.get("title", "").lower()
+        if title and (query_lower in title or title in query_lower):
+ matches.append(entry)
+
+ if not matches:
+ return {"source": "sec", "status": "no_match", "entity": entity_name}
+
+ # Fetch submissions for best match
+ best = matches[0]
+ cik = str(best.get("cik_str", best.get("cik", ""))).zfill(10)
+ print(f" SEC: Found CIK {cik} ({best.get('title', '')})")
+
+ try:
+ sub_url = _SEC_SUBMISSIONS.format(cik=cik)
+ sub_raw = _http_get(sub_url, timeout=timeout)
+ submissions = json.loads(sub_raw)
+ except (urllib.error.URLError, json.JSONDecodeError) as e:
+ return {"source": "sec", "status": "error", "error": f"Submissions fetch failed: {e}"}
+
+ # Write to file
+ out_file = dest_dir / "sec" / f"{cik}.json"
+ out_file.parent.mkdir(parents=True, exist_ok=True)
+ out_file.write_bytes(sub_raw)
+
+ recent_filings = submissions.get("filings", {}).get("recent", {})
+ filing_count = len(recent_filings.get("accessionNumber", []))
+
+ return {
+ "source": "sec",
+ "status": "ok",
+ "entity": entity_name,
+ "cik": cik,
+ "company_name": submissions.get("name", ""),
+ "filing_count": filing_count,
+ "file": str(out_file.relative_to(dest_dir.parent.parent)),
+ }
+
+
+def fetch_fec(entity_name: str, dest_dir: Path, timeout: int) -> dict:
+ """Search FEC for committees/candidates matching entity name."""
+ api_key = os.environ.get("FEC_API_KEY", "DEMO_KEY")
+ if api_key == "DEMO_KEY":
+        print("  FEC: Using DEMO_KEY (severely rate-limited). Set FEC_API_KEY for the full 1000 req/hr.")
+
+ print(f" FEC: Searching committees for '{entity_name}'")
+ encoded = urllib.parse.quote(entity_name)
+
+ try:
+ url = _FEC_COMMITTEES.format(query=encoded, key=api_key)
+ raw = _http_get(url, timeout=timeout)
+ data = json.loads(raw)
+ except (urllib.error.URLError, json.JSONDecodeError) as e:
+ # Strip API key from error message to prevent leaking credentials
+ err_msg = str(e).replace(api_key, "***")
+ return {"source": "fec", "status": "error", "error": f"Committee search failed: {err_msg}"}
+
+ results = data.get("results", [])
+
+ # Write results
+ safe_name = re.sub(r"[^\w\-]", "_", entity_name.lower())[:60]
+ out_file = dest_dir / "fec" / f"{safe_name}.json"
+ out_file.parent.mkdir(parents=True, exist_ok=True)
+ out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
+
+ return {
+ "source": "fec",
+ "status": "ok" if results else "no_match",
+ "entity": entity_name,
+ "match_count": len(results),
+ "file": str(out_file.relative_to(dest_dir.parent.parent)),
+ }
+
+
+def fetch_lda(entity_name: str, dest_dir: Path, timeout: int) -> dict:
+ """Search Senate LDA for lobbying filings by registrant name."""
+ print(f" LDA: Searching registrants for '{entity_name}'")
+ encoded = urllib.parse.quote(entity_name)
+
+ try:
+ url = _LDA_REGISTRANTS.format(query=encoded)
+ raw = _http_get(url, timeout=timeout)
+ data = json.loads(raw)
+ except (urllib.error.URLError, json.JSONDecodeError) as e:
+ return {"source": "lda", "status": "error", "error": f"LDA search failed: {e}"}
+
+ results = data.get("results", [])
+
+ safe_name = re.sub(r"[^\w\-]", "_", entity_name.lower())[:60]
+ out_file = dest_dir / "lda" / f"{safe_name}.json"
+ out_file.parent.mkdir(parents=True, exist_ok=True)
+ out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
+
+ return {
+ "source": "lda",
+ "status": "ok" if results else "no_match",
+ "entity": entity_name,
+ "match_count": len(results),
+ "file": str(out_file.relative_to(dest_dir.parent.parent)),
+ }
+
+
+def fetch_spend(entity_name: str, dest_dir: Path, timeout: int) -> dict:
+ """Search USAspending.gov for awards to entity."""
+ print(f" USAspending: Searching awards for '{entity_name}'")
+
+ payload = {
+ "filters": {
+ "keyword": entity_name,
+ "time_period": [{"start_date": "2020-01-01", "end_date": "2026-12-31"}],
+ },
+ "fields": [
+ "Award ID", "Recipient Name", "Award Amount",
+ "Awarding Agency", "Award Type", "Start Date",
+ ],
+ "page": 1,
+ "limit": 25,
+ "sort": "Award Amount",
+ "order": "desc",
+ }
+
+ try:
+ raw = _http_post_json(_SPEND_SEARCH, payload, timeout=timeout)
+ data = json.loads(raw)
+ except (urllib.error.URLError, json.JSONDecodeError) as e:
+ return {"source": "spend", "status": "error", "error": f"USAspending search failed: {e}"}
+
+ results = data.get("results", [])
+
+ safe_name = re.sub(r"[^\w\-]", "_", entity_name.lower())[:60]
+ out_file = dest_dir / "spend" / f"{safe_name}.json"
+ out_file.parent.mkdir(parents=True, exist_ok=True)
+ out_file.write_text(json.dumps(results, indent=2), encoding="utf-8")
+
+ return {
+ "source": "spend",
+ "status": "ok" if results else "no_match",
+ "entity": entity_name,
+ "match_count": len(results),
+ "file": str(out_file.relative_to(dest_dir.parent.parent)),
+ }
+
+
+# ---------------------------------------------------------------------------
+# Source registry
+# ---------------------------------------------------------------------------
+
+SOURCES = {
+ "sec": {"name": "SEC EDGAR", "func": fetch_sec, "auth": "User-Agent header (built-in)"},
+ "fec": {"name": "FEC API", "func": fetch_fec, "auth": "FEC_API_KEY (DEMO_KEY fallback)"},
+ "lda": {"name": "Senate LDA", "func": fetch_lda, "auth": "None"},
+ "spend": {"name": "USAspending", "func": fetch_spend, "auth": "None"},
+}
+
+
+def load_entity_names(workspace: Path) -> list[str]:
+ """Load canonical entity names from entities/canonical.json."""
+ canon_path = workspace / "entities" / "canonical.json"
+ if not canon_path.exists():
+ return []
+ try:
+ data = json.loads(canon_path.read_text(encoding="utf-8"))
+ entities = data.get("entities", data) if isinstance(data, dict) else data
+        return [
+            e.get("canonical_name", "")
+            for e in entities
+            if isinstance(e, dict) and e.get("canonical_name")
+        ]
+ except (json.JSONDecodeError, OSError):
+ return []
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Fetch entity-specific records from public records APIs"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace",
+ )
+ parser.add_argument(
+ "--entities", nargs="+",
+ help="Entity names to search for",
+ )
+ parser.add_argument(
+ "--all-entities", action="store_true",
+ help="Search for all entities in entities/canonical.json",
+ )
+ parser.add_argument(
+ "--sources", type=str, default="sec,fec,lda,spend",
+ help=f"Comma-separated API sources (default: all). "
+ f"Available: {', '.join(SOURCES.keys())}",
+ )
+ parser.add_argument(
+ "--list", action="store_true", dest="list_sources",
+ help="List available API sources and exit",
+ )
+ parser.add_argument(
+ "--timeout", type=int, default=30,
+ help="HTTP timeout per request in seconds (default: 30)",
+ )
+ parser.add_argument(
+ "--delay", type=float, default=1.0,
+ help="Delay between API calls in seconds (default: 1.0)",
+ )
+ parser.add_argument(
+ "--dry-run", action="store_true",
+ help="Show what would be fetched without making API calls",
+ )
+ args = parser.parse_args()
+
+ if args.list_sources:
+ print("Available API sources:\n")
+ for sid, spec in SOURCES.items():
+ print(f" {sid:8s} {spec['name']}")
+ print(f" {'':<8s} Auth: {spec['auth']}")
+ print()
+ return
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"Error: workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ # Resolve entities
+ entity_names: list[str] = []
+ if args.entities:
+ entity_names = args.entities
+ elif args.all_entities:
+ entity_names = load_entity_names(workspace)
+ if not entity_names:
+            print("No canonical entities found. Run entity_resolver.py first "
+                  "or use --entities.", file=sys.stderr)
+ sys.exit(1)
+ else:
+ print("Error: specify --entities or --all-entities", file=sys.stderr)
+ sys.exit(1)
+
+ # Resolve sources
+    source_ids = [s.strip() for s in args.sources.split(",") if s.strip()]
+ unknown = [s for s in source_ids if s not in SOURCES]
+ if unknown:
+ print(f"Error: unknown source(s): {', '.join(unknown)}\n"
+ f"Available: {', '.join(SOURCES.keys())}", file=sys.stderr)
+ sys.exit(1)
+
+ dest_dir = workspace / "datasets" / "scraped"
+ dest_dir.mkdir(parents=True, exist_ok=True)
+
+ print(f"Fetching records for {len(entity_names)} entity/entities "
+ f"from {len(source_ids)} source(s)\n")
+
+ if args.dry_run:
+ for name in entity_names:
+ for sid in source_ids:
+ print(f" [dry-run] Would query {SOURCES[sid]['name']}: {name}")
+ return
+
+ all_results: list[dict] = []
+ for name in entity_names:
+ print(f" Entity: {name}")
+ for sid in source_ids:
+ fetch_fn = SOURCES[sid]["func"]
+ result = fetch_fn(name, dest_dir, args.timeout)
+ result["query_entity"] = name
+ all_results.append(result)
+
+ if args.delay > 0:
+ time.sleep(args.delay)
+ print()
+
+ # Write provenance log
+ provenance = {
+ "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "entities_queried": entity_names,
+ "sources": source_ids,
+ "results": all_results,
+ }
+ prov_file = dest_dir / "provenance.json"
+ prov_file.write_text(json.dumps(provenance, indent=2), encoding="utf-8")
+
+ ok = sum(1 for r in all_results if r["status"] == "ok")
+ no_match = sum(1 for r in all_results if r["status"] == "no_match")
+ errs = sum(1 for r in all_results if r["status"] == "error")
+ print(f"Done: {ok} matched, {no_match} no match, {errs} errors "
+ f"out of {len(all_results)} queries")
+ print(f"Results in: {dest_dir}")
+
+ if errs:
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/web_enrich.py b/skills/openplanter/scripts/web_enrich.py
new file mode 100644
index 00000000..e48a9fb3
--- /dev/null
+++ b/skills/openplanter/scripts/web_enrich.py
@@ -0,0 +1,313 @@
+#!/usr/bin/env python3
+"""Enrich resolved entities with web search results via Exa.
+
+Reads entities/canonical.json, searches for each entity using the Exa neural
+search API (via the exa-search skill subprocess), and writes enriched records
+to entities/enriched.json. Provider-agnostic — works with any Exa-compatible
+search backend.
+
+Uses Python stdlib only — zero external dependencies.
+Exa search is invoked as a subprocess (exa_search.py from exa-search skill).
+
+Usage:
+ python3 web_enrich.py /path/to/investigation
+ python3 web_enrich.py /path/to/investigation --entities "Acme Corp" "Beta LLC"
+ python3 web_enrich.py /path/to/investigation --categories company,news --limit 5
+ python3 web_enrich.py /path/to/investigation --dry-run
+"""
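+# Output shape written to entities/enriched.json (illustrative sketch; the
+# keys mirror the dicts built in enrich_entity() and main(), the values are
+# placeholders, not real data):
+#
+#   {
+#     "enrichment_metadata": {
+#       "timestamp": "...", "categories": ["company", "news"],
+#       "limit_per_category": 5, "entities_searched": 2,
+#       "entities_enriched": 2, "exa_script": "..."
+#     },
+#     "entities": [
+#       {
+#         "canonical_id": "...", "canonical_name": "...",
+#         "search_results": [
+#           {"title": "...", "url": "...", "category": "company",
+#            "score": 0.0, "published_date": "..."}
+#         ],
+#         "summaries": [{"url": "...", "title": "...", "summary": "..."}],
+#         "enrichment_timestamp": "..."
+#       }
+#     ]
+#   }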
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import shutil
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+
+# ---------------------------------------------------------------------------
+# Exa search skill locations
+# ---------------------------------------------------------------------------
+
+_EXA_SEARCH_CANDIDATES = [
+ Path.home() / ".claude" / "skills" / "exa-search" / "scripts" / "exa_search.py",
+ Path.home() / "Desktop" / "Programming" / "claude-code-minoan" / "skills"
+ / "web-search" / "exa-search" / "scripts" / "exa_search.py",
+]
+
+_EXA_CONTENTS_CANDIDATES = [
+ Path.home() / ".claude" / "skills" / "exa-search" / "scripts" / "exa_contents.py",
+]
+
+
+def _find_script(candidates: list[Path], name: str) -> Path | None:
+ """Find the first existing script from candidate paths."""
+ for p in candidates:
+ if p.exists():
+ return p
+    # Fall back to a PATH lookup
+    which_path = shutil.which(name)
+    if which_path is not None:
+        return Path(which_path)
+    return None
+
+
+def _run_exa_search(
+ query: str,
+ category: str = "company",
+ num_results: int = 5,
+ extra_args: list[str] | None = None,
+) -> list[dict]:
+ """Run exa_search.py as a subprocess and return parsed results."""
+ script = _find_script(_EXA_SEARCH_CANDIDATES, "exa_search.py")
+ if not script:
+ return []
+
+ cmd = [
+ sys.executable, str(script),
+ query,
+ "--category", category,
+ "-n", str(num_results),
+ "--json",
+ ]
+ if extra_args:
+ cmd.extend(extra_args)
+
+ try:
+ result = subprocess.run(
+ cmd,
+ capture_output=True,
+ text=True,
+ timeout=60,
+ env={**os.environ},
+ )
+ if result.returncode != 0:
+ return []
+ return json.loads(result.stdout)
+ except (subprocess.TimeoutExpired, json.JSONDecodeError, FileNotFoundError):
+ return []
+
+
+def _run_exa_contents(
+ urls: list[str],
+ max_chars: int = 3000,
+ summary: bool = True,
+) -> list[dict]:
+ """Run exa_contents.py to get page content for given URLs."""
+ script = _find_script(_EXA_CONTENTS_CANDIDATES, "exa_contents.py")
+ if not script:
+ return []
+
+ cmd = [
+ sys.executable, str(script),
+ *urls,
+ "--max-chars", str(max_chars),
+ ]
+ if summary:
+ cmd.append("--summary")
+
+ try:
+ result = subprocess.run(
+ cmd,
+ capture_output=True,
+ text=True,
+ timeout=120,
+ env={**os.environ},
+ )
+ if result.returncode != 0:
+ return []
+ return json.loads(result.stdout)
+ except (subprocess.TimeoutExpired, json.JSONDecodeError, FileNotFoundError):
+ return []
+
+
+def load_canonical(workspace: Path) -> list[dict]:
+ """Load canonical entities from entities/canonical.json."""
+ canon_path = workspace / "entities" / "canonical.json"
+ if not canon_path.exists():
+ print(f"Error: {canon_path} not found. Run entity_resolver.py first.",
+ file=sys.stderr)
+ return []
+ try:
+ data = json.loads(canon_path.read_text(encoding="utf-8"))
+ return data.get("entities", data) if isinstance(data, dict) else data
+ except (json.JSONDecodeError, OSError) as e:
+ print(f"Error reading {canon_path}: {e}", file=sys.stderr)
+ sys.exit(1)
+
+
+def enrich_entity(
+ entity: dict,
+ categories: list[str],
+ limit: int,
+ delay: float,
+) -> dict:
+ """Search for an entity across categories and return enrichment data."""
+ name = entity.get("canonical_name", "")
+ if not name:
+ return {}
+
+ enrichment: dict = {
+ "canonical_id": entity.get("canonical_id", ""),
+ "canonical_name": name,
+ "search_results": [],
+ "summaries": [],
+ "enrichment_timestamp": datetime.now(timezone.utc).strftime(
+ "%Y-%m-%dT%H:%M:%SZ"
+ ),
+ }
+
+ all_urls: list[str] = []
+
+ for category in categories:
+ print(f" Searching [{category}]: {name}")
+ results = _run_exa_search(name, category=category, num_results=limit)
+
+ for r in results:
+ entry = {
+ "title": r.get("title", ""),
+ "url": r.get("url", ""),
+ "category": category,
+ "score": r.get("score", 0),
+ "published_date": r.get("publishedDate", ""),
+ }
+ enrichment["search_results"].append(entry)
+ if r.get("url"):
+ all_urls.append(r["url"])
+
+ if delay > 0:
+ time.sleep(delay)
+
+ # Get summaries for top URLs (limit to 3 to conserve API calls)
+ top_urls = all_urls[:3]
+ if top_urls:
+ print(f" Fetching summaries for {len(top_urls)} URL(s)")
+ contents = _run_exa_contents(top_urls, max_chars=3000, summary=True)
+ for c in contents:
+ enrichment["summaries"].append({
+ "url": c.get("url", ""),
+ "title": c.get("title", ""),
+ "summary": c.get("summary", c.get("text", ""))[:2000],
+ })
+
+ return enrichment
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Enrich resolved entities with Exa web search results"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace",
+ )
+ parser.add_argument(
+ "--entities", nargs="+",
+ help="Specific entity names to enrich (default: all canonical entities)",
+ )
+ parser.add_argument(
+ "--categories", type=str, default="company,news",
+ help="Comma-separated Exa search categories (default: company,news). "
+ "Options: company, research paper, news, pdf, github, tweet, "
+ "personal site, people, financial report",
+ )
+ parser.add_argument(
+ "--limit", type=int, default=5,
+ help="Max results per category per entity (default: 5)",
+ )
+ parser.add_argument(
+ "--delay", type=float, default=0.5,
+ help="Delay between API calls in seconds (default: 0.5)",
+ )
+ parser.add_argument(
+ "--dry-run", action="store_true",
+ help="Show what would be searched without making API calls",
+ )
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"Error: workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ # Check Exa availability
+ exa_script = _find_script(_EXA_SEARCH_CANDIDATES, "exa_search.py")
+ if not exa_script and not args.dry_run:
+ print(
+ "Error: exa_search.py not found. Install the exa-search skill:\n"
+ " cp -r exa-search ~/.claude/skills/exa-search\n"
+ " export EXA_API_KEY=your-key",
+ file=sys.stderr,
+ )
+ sys.exit(1)
+
+ if not os.environ.get("EXA_API_KEY") and not args.dry_run:
+ print(
+ "Warning: EXA_API_KEY not set. Search may fail.",
+ file=sys.stderr,
+ )
+
+    categories = [c.strip() for c in args.categories.split(",") if c.strip()]
+ entities = load_canonical(workspace)
+ if not entities:
+        print("No canonical entities found. Run entity_resolver.py first.",
+              file=sys.stderr)
+ sys.exit(1)
+
+ # Filter to specific entities if requested
+ if args.entities:
+ target_names = {n.lower() for n in args.entities}
+ entities = [
+ e for e in entities
+ if e.get("canonical_name", "").lower() in target_names
+ ]
+ if not entities:
+            print(f"No matching entities found for: {args.entities}",
+                  file=sys.stderr)
+ sys.exit(1)
+
+ print(f"Enriching {len(entities)} entity/entities across "
+ f"{len(categories)} category/categories\n")
+
+ if args.dry_run:
+ for e in entities:
+ name = e.get("canonical_name", "unknown")
+ for cat in categories:
+ print(f" [dry-run] Would search [{cat}]: {name}")
+        print("\n  [dry-run] Would write: entities/enriched.json")
+ return
+
+ enriched = []
+ for i, entity in enumerate(entities, 1):
+ name = entity.get("canonical_name", "unknown")
+ print(f" [{i}/{len(entities)}] {name}")
+ result = enrich_entity(entity, categories, args.limit, args.delay)
+ if result:
+ enriched.append(result)
+ print()
+
+ # Write output
+ output_path = workspace / "entities" / "enriched.json"
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+
+ output = {
+ "enrichment_metadata": {
+ "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
+ "categories": categories,
+ "limit_per_category": args.limit,
+ "entities_searched": len(entities),
+ "entities_enriched": len(enriched),
+ "exa_script": str(exa_script) if exa_script else None,
+ },
+ "entities": enriched,
+ }
+ output_path.write_text(json.dumps(output, indent=2), encoding="utf-8")
+ print(f"Wrote {len(enriched)} enriched entities to {output_path}")
+
+ total_results = sum(len(e.get("search_results", [])) for e in enriched)
+ total_summaries = sum(len(e.get("summaries", [])) for e in enriched)
+ print(f"Total: {total_results} search results, {total_summaries} summaries")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/skills/openplanter/scripts/wiki_graph_query.py b/skills/openplanter/scripts/wiki_graph_query.py
new file mode 100644
index 00000000..a068e81c
--- /dev/null
+++ b/skills/openplanter/scripts/wiki_graph_query.py
@@ -0,0 +1,282 @@
+#!/usr/bin/env python3
+"""Query an OpenPlanter wiki knowledge graph (read-only).
+
+Reads a NetworkX-compatible knowledge graph JSON file produced by
+OpenPlanter's wiki_graph.py during delegated investigations. Supports
+entity lookup, neighbor traversal, path finding, and subgraph export.
+
+Use this to inspect investigation results after a delegated run without
+requiring NetworkX or the full OpenPlanter agent.
+
+Uses Python stdlib only — zero external dependencies.
+
+Graph format: Node-link JSON (NetworkX node_link_data export):
+ {"directed": true, "nodes": [...], "links": [...]}
+
+Usage:
+ python3 wiki_graph_query.py /path/to/investigation --entity "Raytheon"
+ python3 wiki_graph_query.py /path/to/investigation --entity "Raytheon" --neighbors
+ python3 wiki_graph_query.py /path/to/investigation --path "Raytheon" "OFAC SDN"
+ python3 wiki_graph_query.py /path/to/investigation --stats
+ python3 wiki_graph_query.py /path/to/investigation --types
+ python3 wiki_graph_query.py /path/to/investigation --search "missile"
+"""
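+# Minimal example of the node-link JSON this script accepts (illustrative;
+# the field names match what the query helpers below read -- "id"/"label"/
+# "type" on nodes, "source"/"target"/"relation" on links; the entity names
+# and relation are made up):
+#
+#   {
+#     "directed": true,
+#     "nodes": [
+#       {"id": "acme_corp", "label": "Acme Corp", "type": "company"},
+#       {"id": "beta_llc", "label": "Beta LLC", "type": "company"}
+#     ],
+#     "links": [
+#       {"source": "acme_corp", "target": "beta_llc", "relation": "owns"}
+#     ]
+#   }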
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import defaultdict
+from pathlib import Path
+
+
+def load_graph(workspace: Path) -> dict:
+ """Find and load the wiki knowledge graph from the workspace."""
+ # Check common locations for the graph file
+ candidates = [
+ workspace / ".openplanter" / "wiki" / "graph.json",
+ workspace / "wiki" / "graph.json",
+ workspace / "knowledge_graph.json",
+ workspace / "entity_graph.json",
+ ]
+ # Also check sessions for the most recent graph
+ session_dir = workspace / ".openplanter" / "sessions"
+ if session_dir.exists():
+ for d in sorted(session_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True):
+ if d.is_dir():
+ candidates.append(d / "wiki" / "graph.json")
+ candidates.append(d / "graph.json")
+
+ for path in candidates:
+ if path.exists():
+ try:
+ data = json.loads(path.read_text(encoding="utf-8"))
+ if "nodes" in data:
+ data["_source_path"] = str(path)
+ return data
+ except (json.JSONDecodeError, OSError):
+ continue
+
+ return {}
+
+
+def get_nodes(graph: dict) -> list[dict]:
+ """Extract nodes from graph data."""
+ return graph.get("nodes", [])
+
+
+def get_links(graph: dict) -> list[dict]:
+ """Extract links/edges from graph data."""
+ return graph.get("links", graph.get("edges", []))
+
+
+def find_entity(graph: dict, name: str) -> list[dict]:
+ """Find nodes matching an entity name (case-insensitive substring)."""
+ name_lower = name.lower()
+ matches = []
+ for node in get_nodes(graph):
+ node_id = str(node.get("id", "")).lower()
+ node_label = str(node.get("label", node.get("name", ""))).lower()
+ if name_lower in node_id or name_lower in node_label:
+ matches.append(node)
+ return matches
+
+
+def get_neighbors(graph: dict, node_id: str) -> dict:
+ """Get all neighbors (inbound + outbound) of a node."""
+ outbound = []
+ inbound = []
+ for link in get_links(graph):
+ src = str(link.get("source", ""))
+ tgt = str(link.get("target", ""))
+ rel = link.get("relation", link.get("label", link.get("type", "")))
+ if src == node_id:
+ outbound.append({"target": tgt, "relation": rel})
+ elif tgt == node_id:
+ inbound.append({"source": src, "relation": rel})
+ return {"outbound": outbound, "inbound": inbound}
+
+
+def find_path(graph: dict, start: str, end: str, max_depth: int = 6) -> list[list[str]]:
+    """BFS to find shortest path(s) between two node IDs.
+
+    Returned paths interleave node ids with relation markers,
+    e.g. ["A", "--[owns]-->", "B", "--[funds]-->", "C"].
+    """
+ # Build adjacency list
+ adj: dict[str, list[tuple[str, str]]] = defaultdict(list)
+ for link in get_links(graph):
+ src = str(link.get("source", ""))
+ tgt = str(link.get("target", ""))
+ rel = link.get("relation", link.get("label", ""))
+ adj[src].append((tgt, rel))
+ adj[tgt].append((src, rel)) # Treat as undirected for pathfinding
+
+ # BFS
+ queue: list[list[str]] = [[start]]
+ visited = {start}
+ paths = []
+
+ while queue:
+ path = queue.pop(0)
+ if len(path) > max_depth * 2: # Each step adds node + relation
+ continue
+ current = path[-1]
+ for neighbor, rel in adj.get(current, []):
+ if neighbor == end:
+ paths.append(path + [f"--[{rel}]-->", neighbor])
+ continue
+ if neighbor not in visited:
+ visited.add(neighbor)
+ queue.append(path + [f"--[{rel}]-->", neighbor])
+
+ return paths
+
+
+def graph_stats(graph: dict) -> dict:
+ """Compute basic graph statistics."""
+ nodes = get_nodes(graph)
+ links = get_links(graph)
+
+ # Count node types
+ type_counts: dict[str, int] = defaultdict(int)
+ for node in nodes:
+ ntype = node.get("type", node.get("entity_type", "unknown"))
+ type_counts[str(ntype)] += 1
+
+ # Count relation types
+ rel_counts: dict[str, int] = defaultdict(int)
+ for link in links:
+ rel = link.get("relation", link.get("label", link.get("type", "unknown")))
+ rel_counts[str(rel)] += 1
+
+ # Degree distribution
+ degree: dict[str, int] = defaultdict(int)
+ for link in links:
+ degree[str(link.get("source", ""))] += 1
+ degree[str(link.get("target", ""))] += 1
+
+ top_nodes = sorted(degree.items(), key=lambda x: x[1], reverse=True)[:10]
+
+ return {
+ "source_path": graph.get("_source_path", ""),
+ "node_count": len(nodes),
+ "link_count": len(links),
+ "directed": graph.get("directed", False),
+ "node_types": dict(type_counts),
+ "relation_types": dict(rel_counts),
+ "top_connected_nodes": [{"id": n, "degree": d} for n, d in top_nodes],
+ }
+
+
+def search_graph(graph: dict, term: str) -> list[dict]:
+ """Search all node and link fields for a term."""
+ term_lower = term.lower()
+ results = []
+
+ for node in get_nodes(graph):
+ for key, val in node.items():
+ if term_lower in str(val).lower():
+                results.append({
+                    "type": "node",
+                    "id": node.get("id"),
+                    "match_field": key,
+                    "data": node,
+                })
+ break
+
+ for link in get_links(graph):
+ for key, val in link.items():
+ if term_lower in str(val).lower():
+ results.append({
+ "type": "link",
+ "source": link.get("source"),
+ "target": link.get("target"),
+ "match_field": key,
+ "data": link,
+ })
+ break
+
+ return results
+
+
+def main() -> None:
+ parser = argparse.ArgumentParser(
+ description="Query an OpenPlanter wiki knowledge graph (read-only)"
+ )
+ parser.add_argument(
+ "workspace", type=Path,
+ help="Path to investigation workspace directory",
+ )
+ parser.add_argument("--entity", "-e", help="Find nodes matching entity name")
+ parser.add_argument("--neighbors", "-n", action="store_true", help="Show neighbors of matched entity")
+ parser.add_argument("--path", nargs=2, metavar=("START", "END"), help="Find path between two entities")
+ parser.add_argument("--stats", action="store_true", help="Show graph statistics")
+ parser.add_argument("--types", action="store_true", help="List all entity types")
+ parser.add_argument("--search", "-s", help="Search all fields for a term")
+ parser.add_argument("--graph-file", type=Path, help="Explicit path to graph JSON file")
+
+ args = parser.parse_args()
+
+ workspace = args.workspace.resolve()
+ if not workspace.exists():
+ print(f"ERROR: Workspace does not exist: {workspace}", file=sys.stderr)
+ sys.exit(1)
+
+ if args.graph_file:
+ try:
+ graph = json.loads(args.graph_file.read_text(encoding="utf-8"))
+ except (json.JSONDecodeError, OSError) as e:
+ print(f"ERROR: Cannot read graph file: {e}", file=sys.stderr)
+ sys.exit(1)
+ else:
+ graph = load_graph(workspace)
+
+ if not graph or "nodes" not in graph:
+ print("No wiki knowledge graph found in workspace", file=sys.stderr)
+ print("Run an OpenPlanter investigation first to generate one", file=sys.stderr)
+ sys.exit(1)
+
+ if args.stats:
+ print(json.dumps(graph_stats(graph), indent=2))
+ return
+
+ if args.types:
+ type_counts: dict[str, int] = defaultdict(int)
+ for node in get_nodes(graph):
+ ntype = node.get("type", node.get("entity_type", "unknown"))
+ type_counts[str(ntype)] += 1
+ for t, c in sorted(type_counts.items(), key=lambda x: x[1], reverse=True):
+ print(f" {t:30s} {c}")
+ return
+
+ if args.search:
+ results = search_graph(graph, args.search)
+ if not results:
+ print(f"No matches for '{args.search}'", file=sys.stderr)
+ sys.exit(1)
+ print(json.dumps(results, indent=2))
+ return
+
+ if args.path:
+ start, end = args.path
+ paths = find_path(graph, start, end)
+ if not paths:
+ print(f"No path found between '{start}' and '{end}'", file=sys.stderr)
+ sys.exit(1)
+ for i, p in enumerate(paths[:5]):
+ print(f"Path {i+1}: {' '.join(p)}")
+ return
+
+ if args.entity:
+ matches = find_entity(graph, args.entity)
+ if not matches:
+ print(f"No entities matching '{args.entity}'", file=sys.stderr)
+ sys.exit(1)
+
+ if args.neighbors:
+ for m in matches:
+ nid = str(m.get("id", ""))
+ nbrs = get_neighbors(graph, nid)
+ print(json.dumps({"entity": m, "neighbors": nbrs}, indent=2))
+ else:
+ print(json.dumps(matches, indent=2))
+ return
+
+ # Default: show stats
+ print(json.dumps(graph_stats(graph), indent=2))
+
+
+if __name__ == "__main__":
+ main()