Automated AI skill testing and improvement, optimized for token efficiency.
Your AI skill works most of the time. But some inputs fail unpredictably, and you don't know why. The usual fix is manual trial and error: tweak the prompt, try it out, hope it's better.
skill-optimizer replaces that guesswork with a measurable loop: define what "good output" looks like, run automated tests, get a pass rate, make one targeted change, see if it improves. Every decision is backed by a number, not a feeling.
The catch with existing tools is cost: re-running all test cases after every change, grading outputs one by one, passing full context to every agent call — it adds up fast. This project fixes that without cutting corners on quality.
| Strategy | Mechanism | Savings |
|---|---|---|
| Partial re-eval | After a change, re-run only failing cases first. Skip full suite if partial doesn't improve. | ~50% on re-tests |
| Batch grading | All outputs graded in one LLM call (matrix output), not N calls | ~(N-1)/N per eval round |
| Structured context | Improver receives JSON-schema failure metadata, not narrative history | ~60% on improver input |
| Diff format | Changes expressed as unified diffs, not repeated full file | ~70% on change context |
| Failure clustering | Group failures by root cause; fix the cause, not each symptom | Fewer iterations needed |
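For example, the batch grader's single-call matrix output might be shaped like this (an assumed shape for illustration; the real schema is defined in references/schemas.md):

```typescript
// Assumed shape of the grader's matrix output: one verdict per
// (test case, checklist item) pair, produced in a single LLM call
// instead of N separate grading calls.
interface GradingMatrix {
  caseIds: string[];
  checklistItems: string[];
  results: boolean[][]; // results[caseIdx][itemIdx]
}

// A case passes only if it satisfies every checklist item.
function passRate(m: GradingMatrix): number {
  const passed = m.results.filter((row) => row.every(Boolean)).length;
  return passed / m.caseIds.length;
}
```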
Context compression is only applied to metadata (failure clusters, changelog), never to skill content itself. The improver agent always receives the verbatim text of the skill section it needs to reason about. This is the safe boundary: compact schemas for structured data, full text for nuanced prose.
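That boundary can be sketched as a TypeScript shape (field names are assumptions for illustration, not the real schema; see references/schemas.md):

```typescript
// Illustrative shape of the improver's input: compact structured
// metadata for failure clusters, verbatim text for the skill section.
// These names are hypothetical, not the project's actual schema.
interface FailureCluster {
  rootCause: string;             // one-line hypothesis shared by the cluster
  failedCaseIds: string[];       // cases exhibiting this root cause
  failedChecklistItems: string[];
  skillSection: string;          // which section to load verbatim
}

interface ImproverInput {
  clusters: FailureCluster[];
  skillSectionText: string;      // full, uncompressed skill prose
  currentPassRate: number;
}
```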
Analysis mode: a static 6-dimension audit. No test cases needed. Cheap.
Tell Claude: "Analyze the [skill-name] skill"
Outputs a scored diagnostic report covering structural clarity, instruction completeness, edge case coverage, consistency, actionability, and quality assurance.
Optimization mode: an eval-driven loop with automatic improvements.
Tell Claude: "Optimize the [skill-name] skill"
Claude collects test cases and a checklist, runs a baseline, then iterates: analyze failures → propose one change → partial re-eval → full re-eval if partial improves → keep or revert.
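The keep-or-revert step at the end of each iteration can be sketched as follows (a schematic under assumed names, not the actual orchestration logic in SKILL.md):

```typescript
// Schematic of the partial-then-full evaluation gate. The partial
// re-eval re-runs only previously failing cases; the full suite runs
// only if at least one of them now passes.
function decide(
  newlyPassingInPartial: number, // failing cases that now pass
  baselinePassRate: number,
  runFullEval: () => number      // full-suite pass rate (costly)
): "keep" | "revert" {
  if (newlyPassingInPartial === 0) return "revert"; // skip the full suite
  const fullPassRate = runFullEval();
  return fullPassRate > baselinePassRate ? "keep" : "revert";
}
```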
Estimate cost before committing to a run:
```shell
npx ts-node src/cli/token-estimate.ts path/to/SKILL.md 5 4 10
```

```
Token Cost Estimate
══════════════════════════════════════════════════
Skill file:        820 tokens
Test cases:        5
Checklist items:   4
Max iterations:    10

Analysis only:     ~1,120 tokens

Optimization run:
  Baseline eval:   ~2,350 tokens
  Per iteration:   ~2,700 tokens
    partial eval:    ~940
    full eval×60%:   ~1,410
    improver:        ~350
  Total (est.):    ~29,350 tokens

Naive approach:    ~52,000 tokens
Estimated savings: ~44%

✓ Low token cost. Good to proceed.
```
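The totals in the sample report follow a simple sum, reconstructed here from the printed breakdown (the real estimator may weight terms differently):

```typescript
// Per-iteration cost is the sum of its parts, and the run total is the
// baseline eval plus max_iterations iterations. Figures taken from the
// sample estimate above.
const perIteration = 940 + 1410 + 350; // partial + full×60% + improver
const total = 2350 + perIteration * 10; // baseline + 10 iterations
// perIteration === 2700 and total === 29350, matching the report.
```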
Cache prompt outputs by content hash. If the skill and test case input haven't changed since last run, reuse the cached output:
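A minimal sketch of such a content-hash key, using Node's built-in crypto (hypothetical; the real cache.ts may derive its keys differently):

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache key: identical skill text + case input always
// produce the same key, so the cached output can be reused.
function cacheKey(skillText: string, caseInput: string): string {
  return createHash("sha256")
    .update(skillText)
    .update("\0") // separator so ("ab","c") and ("a","bc") differ
    .update(caseInput)
    .digest("hex");
}
```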
```shell
# Check cache before running a case
npx ts-node src/cli/cache.ts check SKILL.md case-2 .cache/

# Store output after running
npx ts-node src/cli/cache.ts store SKILL.md case-2 output.md .cache/

# View cache stats
npx ts-node src/cli/cache.ts stats .cache/

# Clear all cached entries
npx ts-node src/cli/cache.ts clear .cache/
```

Browse optimization runs interactively:

```shell
cd src/viewer
npm install
npm run dev
```

Drop a `changelog.json` from any `{name}-workspace/` directory to explore:
- Run summary (baseline → final pass rate, tokens spent)
- Pass-rate sparkline across iterations
- Per-iteration cards with verdict, delta, and change type
- Diff viewer for each iteration's proposed change
```
skill-optimizer/
├── SKILL.md                  Main orchestration skill (read by Claude)
├── agents/
│   ├── analyzer.md           6-dimension static analysis (single call)
│   ├── grader.md             Batch grading — all cases in one call
│   ├── improver.md           Failure analysis + unified diff proposal
│   ├── eval_designer.md      Checklist design with token-aware guidance
│   └── case_extractor.md     Test case extraction from history
├── references/
│   ├── eval_guide.md         How to write effective Yes/No checklists
│   └── schemas.md            JSON schemas for all output files
├── src/
│   ├── shared/
│   │   └── types.ts          Shared TypeScript types
│   ├── cli/
│   │   ├── token-estimate.ts Pre-run cost estimator
│   │   └── cache.ts          Hash-based output cache
│   └── viewer/
│       ├── App.tsx           React eval viewer
│       ├── main.tsx          Entry point
│       ├── index.html
│       ├── vite.config.ts
│       └── package.json
├── package.json
└── tsconfig.json
```
Set these when telling Claude to optimize a skill, e.g. "optimize with `auto_apply` off":
| Parameter | Default | Description |
|---|---|---|
| `auto_apply` | `true` | `false` = pause for approval before each change |
| `target_pass_rate` | `0.9` | Stop when reached |
| `max_iterations` | `10` | Hard stop |
| `stall_limit` | `3` | Stop after N consecutive no-improvement rounds |
| `runs_per_case` | `1` | Set to `3` for flaky skills (uses majority vote) |
| `partial_eval` | `true` | `false` = always run full eval (more accurate, more tokens) |
| `max_cases` | `7` | Warn if user provides more |
After a run:
```
{name}-workspace/
├── analysis_report.md        Phase 0 diagnostic (if run)
├── analysis_report.json
├── optimization_report.md    Full history with token costs
├── changelog.json            Load this in the eval viewer
└── iter-{N}/
    ├── diff.patch            Change applied this iteration
    ├── token_log.json        Per-iteration cost breakdown
    └── eval-{id}/
        ├── output.md         Skill output for this test case
        └── grading.json      Checklist results
```
Works with Claude Code, OpenCode, or any AI coding assistant that can read markdown files and execute TypeScript.
MIT