Merged
94 changes: 94 additions & 0 deletions skills/openclaw-native/context-assembly-scorer/SKILL.md
@@ -0,0 +1,94 @@
---
name: context-assembly-scorer
version: "1.0"
category: openclaw-native
description: Scores how well the current context represents the full conversation — detects information blind spots, stale summaries, and coverage gaps that cause the agent to forget critical details.
stateful: true
cron: "0 */4 * * *"
---

# Context Assembly Scorer

## What it does

When an agent compacts context, it loses information. But how much? And which information? Context Assembly Scorer answers these questions by measuring **coverage**: the fraction of important topics in the full conversation history that are still represented in the current assembled context.

This skill is inspired by the context assembly system in [lossless-claw](https://github.com/Martian-Engineering/lossless-claw), which selects the summaries to include in each turn's context so as to maximize information coverage.

## When to invoke

- Automatically every 4 hours (cron) — silent coverage check
- Before starting a task that depends on prior context — verify nothing critical is missing
- After compaction — measure information loss
- When the agent says "I don't remember" — diagnose why

## Coverage dimensions

| Dimension | What it measures | Weight |
|---|---|---|
| Topic coverage | % of conversation topics present in current context | 2x |
| Recency bias | Whether recent context is over-represented vs. older important context | 1.5x |
| Entity continuity | Named entities (files, people, APIs) mentioned in history that are missing from context | 2x |
| Decision retention | Architectural decisions and user preferences still accessible | 2x |
| Task continuity | Active/pending tasks that might be lost after compaction | 1.5x |
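
The weights above could be combined into a single overall score in several ways. The following is a minimal sketch that assumes a plain weighted mean; the function name `overall_score` and the formula itself are illustrative assumptions, not taken from `score.py`.

```python
# Weights copied from the "Coverage dimensions" table.
# The weighted-mean formula is an assumption, not the confirmed scoring rule.
WEIGHTS = {
    "topic_coverage": 2.0,
    "recency_bias": 1.5,
    "entity_continuity": 2.0,
    "decision_retention": 2.0,
    "task_continuity": 1.5,
}


def overall_score(dimensions: dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores (each 0-100)."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS)
    return weighted_sum / total_weight
```

With the per-dimension values from `example-state.yaml` (82.0, 65.5, 68.0, 75.0, 70.0) this sketch gives roughly 72.6, close to but not equal to the 72.3 recorded there, so the real scorer may combine the dimensions differently.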

## How to use

```bash
python3 score.py --score # Score current context assembly
python3 score.py --score --verbose # Detailed per-dimension breakdown
python3 score.py --blind-spots # List topics missing from context
python3 score.py --drift # Compare current vs. previous scores
python3 score.py --status # Last score summary
python3 score.py --format json # Machine-readable output
```

## Procedure

**Step 1 — Score context coverage**

```bash
python3 score.py --score
```

The scorer reads MEMORY.md (full history) and compares it against what's currently accessible. It outputs a coverage score from 0–100% with a letter grade.

**Step 2 — Find blind spots**

```bash
python3 score.py --blind-spots
```

This lists the specific topics, entities, and decisions that exist in full history but are missing from current context. These are what the agent has effectively "forgotten."
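
As a rough illustration of what a blind-spot listing involves, the sketch below treats an item as missing when its name does not occur in the context text, then orders the results by importance. The helper `find_blind_spots`, the case-insensitive substring match, and the sort order are all assumptions; the item keys (`type`, `name`, `importance`) follow the state schema.

```python
# Importance levels as listed in STATE_SCHEMA.yaml; lower number sorts first.
IMPORTANCE_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def find_blind_spots(history_items, context_text):
    """Return history items whose name is absent from the context text.

    Assumption: a case-insensitive substring match decides presence.
    """
    lowered = context_text.lower()
    missing = [
        item for item in history_items
        if item["name"].lower() not in lowered
    ]
    # sorted() is stable, so items of equal importance keep their input order.
    return sorted(missing, key=lambda item: IMPORTANCE_ORDER[item["importance"]])
```

A substring match is deliberately crude: a paraphrased decision would be reported as missing even if its meaning survives in a summary.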

**Step 3 — Track drift over time**

```bash
python3 score.py --drift
```

Shows how coverage has changed across the last 20 scores. Use it to check whether compaction is losing progressively more information.
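
A trend line for the drift report might be derived as in the sketch below. It assumes `score_history` is ordered newest first (as in `example-state.yaml`) and that the trend compares the latest score with the one before it; the function name `drift` and the labels other than "declining" are hypothetical.

```python
def drift(score_history):
    """Compare the newest score with the previous one.

    score_history: list of dicts with an 'overall' key, newest first.
    Returns a (label, delta) tuple; delta is in percentage points.
    """
    if len(score_history) < 2:
        return "insufficient data", 0.0
    delta = round(score_history[0]["overall"] - score_history[1]["overall"], 1)
    if delta < 0:
        label = "declining"
    elif delta > 0:
        label = "improving"
    else:
        label = "stable"
    return label, delta
```

Applied to the three scores in the example state (72.3, 85.1, 91.2), this returns `("declining", -12.8)`, which agrees with the trend shown in the example walkthrough.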

## Grading

| Grade | Coverage | Meaning |
|---|---|---|
| A | 90–100% | Excellent — minimal information loss |
| B | 75–89% | Good — minor gaps, unlikely to cause issues |
| C | 60–74% | Fair — some important context missing |
| D | 40–59% | Poor — significant blind spots |
| F | 0–39% | Critical — agent is operating with major gaps |
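
The grade bands can be expressed as a small lookup. This is a sketch only: the table gives whole-number ranges, so the code assumes each band's lower bound acts as a threshold (a score of 89.5 therefore grades as B). The function name `grade` is illustrative.

```python
def grade(coverage: float) -> str:
    """Map a 0-100 coverage percentage to a letter grade."""
    if not 0 <= coverage <= 100:
        raise ValueError(f"coverage must be within 0-100, got {coverage}")
    # Lower bounds taken from the grading table, checked from highest to lowest.
    for threshold, letter in ((90, "A"), (75, "B"), (60, "C"), (40, "D")):
        if coverage >= threshold:
            return letter
    return "F"
```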

## State

Coverage scores and blind spot history are stored in `~/.openclaw/skill-state/context-assembly-scorer/state.yaml`.

Fields: `last_score_at`, `current_score`, `blind_spots`, `score_history`.
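
To show how these four fields relate, here is a hedged sketch of an in-memory state update. It keeps `score_history` newest first and caps it at 20 entries, as the schema describes; the function `record_score`, the use of `len(blind_spots)` for `blind_spot_count`, and the omission of YAML serialization are assumptions made for illustration.

```python
HISTORY_LIMIT = 20  # the schema describes a rolling log of the last 20 scores


def record_score(state, current_score, blind_spots, scored_at):
    """Return a new state dict with the latest score prepended to the history.

    state: previous state dict (may be empty)
    current_score: dict containing at least 'overall' and 'grade'
    blind_spots: list of blind-spot dicts
    scored_at: ISO 8601 timestamp string
    """
    entry = {
        "scored_at": scored_at,
        "overall": current_score["overall"],
        "grade": current_score["grade"],
        # Assumption: the count is simply the length of the stored list.
        "blind_spot_count": len(blind_spots),
    }
    history = [entry] + list(state.get("score_history", []))
    return {
        "last_score_at": scored_at,
        "current_score": current_score,
        "blind_spots": blind_spots,
        "score_history": history[:HISTORY_LIMIT],
    }
```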

## Notes

- Read-only — does not modify context or memory
- Topic extraction uses keyword clustering, not LLM calls
- Entity detection uses regex patterns for file paths, URLs, class names, API endpoints
- Decision detection looks for markers: "decided", "chose", "prefer", "always", "never"
- Recency bias is measured as the ratio of recent-vs-old entry representation
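
The regex-based detection described in these notes might look something like the sketch below. The patterns are illustrative guesses, not the ones in `score.py`: only file paths and URLs are covered (class names and API endpoints are left out), and the decision pattern also accepts inflections of "prefer" such as "prefers", which is an assumption.

```python
import re

# Decision markers from the notes above.
DECISION_RE = re.compile(
    r"\b(?:decided|chose|prefer\w*|always|never)\b", re.IGNORECASE
)
# Hypothetical entity patterns: absolute file paths and http(s) URLs.
URL_RE = re.compile(r"https?://[^\s)>\]]+")
# The lookbehind keeps path matching from firing inside a URL.
PATH_RE = re.compile(r"(?<![\w:/])/(?:[\w.-]+/)+[\w.-]+")


def is_decision(line: str) -> bool:
    """Return True if the line contains one of the decision markers."""
    return bool(DECISION_RE.search(line))


def find_entities(text: str) -> dict[str, list[str]]:
    """Extract file paths and URLs, trimming trailing sentence punctuation."""
    def clean(match: str) -> str:
        return match.rstrip(".,;")

    return {
        "paths": [clean(m) for m in PATH_RE.findall(text)],
        "urls": [clean(m) for m in URL_RE.findall(text)],
    }
```

Keyword markers like these are cheap but imprecise: "never" in an unrelated sentence would still be flagged as a decision.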
31 changes: 31 additions & 0 deletions skills/openclaw-native/context-assembly-scorer/STATE_SCHEMA.yaml
@@ -0,0 +1,31 @@
version: "1.0"
description: Context coverage scores, blind spot tracking, and drift history.
fields:
  last_score_at:
    type: datetime
  current_score:
    type: object
    fields:
      overall: { type: float, description: "0-100 coverage percentage" }
      grade: { type: string }
      topic_coverage: { type: float }
      recency_bias: { type: float }
      entity_continuity: { type: float }
      decision_retention: { type: float }
      task_continuity: { type: float }
  blind_spots:
    type: list
    description: Topics/entities missing from current context
    items:
      type: { type: enum, values: [topic, entity, decision, task] }
      name: { type: string }
      importance: { type: enum, values: [critical, high, medium, low] }
      last_seen: { type: string, description: "When this was last in context" }
  score_history:
    type: list
    description: Rolling log of past scores (last 20)
    items:
      scored_at: { type: datetime }
      overall: { type: float }
      grade: { type: string }
      blind_spot_count: { type: integer }
74 changes: 74 additions & 0 deletions skills/openclaw-native/context-assembly-scorer/example-state.yaml
@@ -0,0 +1,74 @@
# Example runtime state for context-assembly-scorer
last_score_at: "2026-03-16T16:00:08.000000"
current_score:
  overall: 72.3
  grade: C
  topic_coverage: 82.0
  recency_bias: 65.5
  entity_continuity: 68.0
  decision_retention: 75.0
  task_continuity: 70.0
blind_spots:
  - type: decision
    name: "Decided to use Jaccard similarity threshold of 0.7 for deduplication"
    importance: critical
    last_seen: "in full memory"
  - type: entity
    name: "/skills/openclaw-native/heartbeat-governor/governor.py"
    importance: high
    last_seen: "in full memory"
  - type: task
    name: "TODO: add --dry-run flag to radar.py before next release"
    importance: high
    last_seen: "in full memory"
  - type: entity
    name: "https://github.com/Neirth/OpenLobster"
    importance: medium
    last_seen: "in full memory"
score_history:
  - scored_at: "2026-03-16T16:00:08.000000"
    overall: 72.3
    grade: C
    blind_spot_count: 12
  - scored_at: "2026-03-16T12:00:05.000000"
    overall: 85.1
    grade: B
    blind_spot_count: 5
  - scored_at: "2026-03-16T08:00:03.000000"
    overall: 91.2
    grade: A
    blind_spot_count: 2
# ── Walkthrough ──────────────────────────────────────────────────────────────
# Cron runs every 4 hours: python3 score.py --score --verbose
#
# Context Assembly Score — 2026-03-16 16:00
# ───────────────────────────────────────────────────────
# Overall: 72.3% Grade: C
# Topic coverage: 82.0% (2x weight)
# Recency bias: 65.5% (1.5x weight)
# Entity continuity: 68.0% (2x weight)
# Decision retention: 75.0% (2x weight)
# Task continuity: 70.0% (1.5x weight)
#
# Memory stats:
# Topics: 284 unique | Entities: 47
# Decisions: 12 | Tasks: 8
# Blind spots: 12
#
# python3 score.py --blind-spots
#
# Blind Spots — 12 items missing from context
# ───────────────────────────────────────────────────────
# !! [CRITICAL] decision: Decided to use Jaccard similarity threshold...
# ! [ HIGH] entity: /skills/openclaw-native/heartbeat-governor/...
# ! [ HIGH] task: TODO: add --dry-run flag to radar.py...
#
# python3 score.py --drift
#
# Coverage Drift — 3 data points
# ───────────────────────────────────────────────────────
# 2026-03-16T16:00 [=======---] 72.3% (C) 12 blind spots
# 2026-03-16T12:00 [=========-] 85.1% (B) 5 blind spots
# 2026-03-16T08:00 [=========+] 91.2% (A) 2 blind spots
#
# Trend: declining (-12.8%)