skill-optimizer

Automated AI skill testing and improvement, optimized for token efficiency.


The Problem

Your AI skill works most of the time. But some inputs fail unpredictably, and you don't know why. The usual fix is manual trial and error: tweak the prompt, try it out, hope it's better.

skill-optimizer replaces that guesswork with a measurable loop: define what "good output" looks like, run automated tests, get a pass rate, make one targeted change, see if it improves. Every decision is backed by a number, not a feeling.

The catch with existing tools is cost: re-running all test cases after every change, grading outputs one by one, passing full context to every agent call — it adds up fast. This project fixes that without cutting corners on quality.


Token-Saving Strategies

| Strategy | Mechanism | Savings |
|---|---|---|
| Partial re-eval | After a change, re-run only failing cases first; skip the full suite if the partial run doesn't improve. | ~50% on re-tests |
| Batch grading | All outputs graded in one LLM call (matrix output), not N calls. | ~(N−1)/N per eval round |
| Structured context | Improver receives JSON-schema failure metadata, not narrative history. | ~60% on improver input |
| Diff format | Changes expressed as unified diffs, not the repeated full file. | ~70% on change context |
| Failure clustering | Group failures by root cause; fix the cause, not each symptom. | Fewer iterations needed |
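
To make the batch-grading row concrete, here is a sketch of what a matrix-shaped grading result could look like, with a pass-rate helper. The shape and names are illustrative assumptions; the authoritative schemas live in references/schemas.md.

```typescript
// Hypothetical shape of a batch-grading result: one LLM call returns a
// matrix of verdicts, rows = test cases, columns = checklist items.
interface GradingMatrix {
  cases: string[];        // test case ids
  checklist: string[];    // checklist item labels
  verdicts: boolean[][];  // verdicts[caseIdx][itemIdx]
}

// A case passes only if every checklist item for it passes.
function passRate(m: GradingMatrix): number {
  const passed = m.verdicts.filter(row => row.every(Boolean)).length;
  return passed / m.cases.length;
}

const example: GradingMatrix = {
  cases: ["case-1", "case-2"],
  checklist: ["correct format", "no hallucination"],
  verdicts: [[true, true], [true, false]],
};
console.log(passRate(example)); // 0.5
```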

On Semantic Safety

Context compression is only applied to metadata (failure clusters, changelog), never to skill content itself. The improver agent always receives the verbatim text of the skill section it needs to reason about. This is the safe boundary: compact schemas for structured data, full text for nuanced prose.


Two Modes

Analysis Only

Static 6-dimension audit. No test cases needed. Cheap.

Tell Claude: "Analyze the [skill-name] skill"

Outputs a scored diagnostic report covering structural clarity, instruction completeness, edge case coverage, consistency, actionability, and quality assurance.
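
One plausible shape for that report's scores, sketched in TypeScript. The dimension names follow the list above; the numeric scale and the averaged overall score are assumptions, not the skill's actual rule, and the real schema is in references/schemas.md.

```typescript
// Hypothetical score shape for the 6-dimension audit.
const DIMENSIONS = [
  "structural_clarity",
  "instruction_completeness",
  "edge_case_coverage",
  "consistency",
  "actionability",
  "quality_assurance",
] as const;

type Dimension = typeof DIMENSIONS[number];
type AnalysisScores = Record<Dimension, number>; // e.g. 0–10 per dimension

// Overall score as a simple average (an assumption for illustration).
function overall(scores: AnalysisScores): number {
  const values = DIMENSIONS.map(d => scores[d]);
  return values.reduce((a, b) => a + b, 0) / values.length;
}

const sample: AnalysisScores = {
  structural_clarity: 8,
  instruction_completeness: 6,
  edge_case_coverage: 5,
  consistency: 9,
  actionability: 7,
  quality_assurance: 7,
};
console.log(overall(sample)); // 7
```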

Full Optimization

Eval-driven loop with auto-improvements.

Tell Claude: "Optimize the [skill-name] skill"

Claude collects test cases and a checklist, runs a baseline, then iterates: analyze failures → propose one change → partial re-eval → full re-eval if partial improves → keep or revert.
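
The keep-or-revert loop can be sketched as follows. This is a simplified stand-in that omits the partial/full eval split; `evaluate` and `propose` here are toy stand-ins for the real LLM agent calls.

```typescript
// Illustrative sketch of the keep-or-revert loop described above.
type Evaluate = (skill: string) => number; // pass rate in [0, 1]
type Propose = (skill: string) => string;  // returns a changed skill

function optimize(
  skill: string,
  evaluate: Evaluate,
  propose: Propose,
  opts = { maxIterations: 10, target: 0.9 }
): { skill: string; passRate: number } {
  let best = evaluate(skill); // baseline pass rate
  for (let i = 0; i < opts.maxIterations && best < opts.target; i++) {
    const candidate = propose(skill); // one targeted change
    const rate = evaluate(candidate);
    if (rate > best) {
      skill = candidate; // keep the change...
      best = rate;
    } // ...otherwise it is implicitly reverted
  }
  return { skill, passRate: best };
}

// Toy demo: each "v" appended to the skill raises the pass rate by 0.1.
const result = optimize(
  "skill",
  s => Math.min(1, 0.5 + 0.1 * (s.length - "skill".length)),
  s => s + "v"
);
console.log(result.passRate); // reaches the 0.9 target
```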


Tooling (TypeScript / Node.js)

Token Estimator

Estimate cost before committing to a run:

npx ts-node src/cli/token-estimate.ts path/to/SKILL.md 5 4 10

Token Cost Estimate
══════════════════════════════════════════════════
Skill file:        820 tokens
Test cases:        5
Checklist items:   4
Max iterations:    10

Analysis only:     ~1,120 tokens

Optimization run:
  Baseline eval:   ~2,350 tokens
  Per iteration:   ~2,700 tokens
    partial eval:  ~940
    full eval×60%: ~1,410
    improver:      ~350
  Total (est.):    ~29,350 tokens

Naive approach:    ~52,000 tokens
Estimated savings: ~44%

✓  Low token cost. Good to proceed.
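
The arithmetic behind that estimate can be reconstructed roughly as below; the real estimator in src/cli/token-estimate.ts may differ in detail. The full eval is weighted by the fraction of iterations expected to pass the partial gate (60% in the sample run).

```typescript
// Rough reconstruction of the optimization-run estimate shown above.
function estimateOptimizationTokens(
  baselineEval: number, // tokens for one full-suite eval
  partialEval: number,  // tokens for re-running failing cases only
  improver: number,     // tokens for one improver call
  maxIterations: number,
  fullEvalRate = 0.6    // fraction of iterations reaching the full eval
): number {
  const perIteration =
    partialEval + Math.round(baselineEval * fullEvalRate) + improver;
  return baselineEval + maxIterations * perIteration;
}

// Matches the sample run: 2,350 + 10 × (940 + 1,410 + 350) = 29,350.
console.log(estimateOptimizationTokens(2350, 940, 350, 10)); // 29350
```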

Eval Output Cache

Cache prompt outputs by content hash. If the skill and test case input haven't changed since last run, reuse the cached output:

# Check cache before running a case
npx ts-node src/cli/cache.ts check SKILL.md case-2 .cache/

# Store output after running
npx ts-node src/cli/cache.ts store SKILL.md case-2 output.md .cache/

# View cache stats
npx ts-node src/cli/cache.ts stats .cache/

# Clear all cached entries
npx ts-node src/cli/cache.ts clear .cache/
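
The core idea can be sketched in a few lines. This in-memory version is an illustration only; the real src/cli/cache.ts persists entries on disk under .cache/. The key hashes everything that could change the output: the skill text plus the test case input.

```typescript
import { createHash } from "crypto";

// Minimal in-memory sketch of a hash-keyed eval output cache.
class EvalCache {
  private entries = new Map<string, string>();

  private key(skill: string, caseInput: string): string {
    return createHash("sha256")
      .update(skill)
      .update("\0") // separator so (a, bc) and (ab, c) hash differently
      .update(caseInput)
      .digest("hex");
  }

  check(skill: string, caseInput: string): string | undefined {
    return this.entries.get(this.key(skill, caseInput));
  }

  store(skill: string, caseInput: string, output: string): void {
    this.entries.set(this.key(skill, caseInput), output);
  }
}

const cache = new EvalCache();
cache.store("skill v1", "case-2 input", "graded output");
console.log(cache.check("skill v1", "case-2 input")); // hit: "graded output"
console.log(cache.check("skill v2", "case-2 input")); // miss: undefined
```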

Eval Viewer (React + Vite)

Browse optimization runs interactively:

cd src/viewer
npm install
npm run dev

Drop a changelog.json from any {name}-workspace/ directory to explore:

  • Run summary (baseline → final pass rate, tokens spent)
  • Pass-rate sparkline across iterations
  • Per-iteration cards with verdict, delta, and change type
  • Diff viewer for each iteration's proposed change
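
A hypothetical shape for the changelog.json the viewer consumes, with the pass-rate series the sparkline is drawn from. Field names here are assumptions; the authoritative schema is in references/schemas.md.

```typescript
// Hypothetical shape of changelog.json as the viewer reads it.
interface IterationEntry {
  iteration: number;
  changeType: string;            // e.g. "clarify-instruction"
  verdict: "kept" | "reverted";
  passRate: number;              // pass rate after this iteration
  tokens: number;                // tokens spent this iteration
}

interface Changelog {
  baselinePassRate: number;
  iterations: IterationEntry[];
}

// The sparkline is the pass-rate series from baseline through each iteration.
function sparklineSeries(log: Changelog): number[] {
  return [log.baselinePassRate, ...log.iterations.map(i => i.passRate)];
}

const demo: Changelog = {
  baselinePassRate: 0.4,
  iterations: [
    { iteration: 1, changeType: "clarify-instruction", verdict: "kept", passRate: 0.6, tokens: 2700 },
    { iteration: 2, changeType: "add-edge-case", verdict: "reverted", passRate: 0.6, tokens: 2700 },
  ],
};
console.log(sparklineSeries(demo)); // [0.4, 0.6, 0.6]
```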

Project Structure

skill-optimizer/
├── SKILL.md                      Main orchestration skill (read by Claude)
├── agents/
│   ├── analyzer.md               6-dimension static analysis (single call)
│   ├── grader.md                 Batch grading — all cases in one call
│   ├── improver.md               Failure analysis + unified diff proposal
│   ├── eval_designer.md          Checklist design with token-aware guidance
│   └── case_extractor.md         Test case extraction from history
├── references/
│   ├── eval_guide.md             How to write effective Yes/No checklists
│   └── schemas.md                JSON schemas for all output files
├── src/
│   ├── shared/
│   │   └── types.ts              Shared TypeScript types
│   ├── cli/
│   │   ├── token-estimate.ts     Pre-run cost estimator
│   │   └── cache.ts              Hash-based output cache
│   └── viewer/
│       ├── App.tsx               React eval viewer
│       ├── main.tsx              Entry point
│       ├── index.html
│       ├── vite.config.ts
│       └── package.json
├── package.json
└── tsconfig.json

Configuration

Set these when telling Claude to optimize a skill, e.g. "optimize with auto_apply off":

| Parameter | Default | Description |
|---|---|---|
| auto_apply | true | false = pause for approval before each change |
| target_pass_rate | 0.9 | Stop once this pass rate is reached |
| max_iterations | 10 | Hard stop |
| stall_limit | 3 | Stop after N consecutive no-improvement rounds |
| runs_per_case | 1 | Set to 3 for flaky skills (uses majority vote) |
| partial_eval | true | false = always run full eval (more accurate, more tokens) |
| max_cases | 7 | Warn if the user provides more |
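
The table above, expressed as a TypeScript shape with its defaults. The interface itself is illustrative (Claude reads these settings from the prompt, not from code); the default values are the ones documented above.

```typescript
// Illustrative TypeScript mirror of the configuration table.
interface OptimizerConfig {
  auto_apply: boolean;
  target_pass_rate: number;
  max_iterations: number;
  stall_limit: number;
  runs_per_case: number;
  partial_eval: boolean;
  max_cases: number;
}

const defaults: OptimizerConfig = {
  auto_apply: true,
  target_pass_rate: 0.9,
  max_iterations: 10,
  stall_limit: 3,
  runs_per_case: 1,
  partial_eval: true,
  max_cases: 7,
};

// Overrides merge over defaults, e.g. "optimize with auto_apply off":
const config: OptimizerConfig = { ...defaults, auto_apply: false };
console.log(config.auto_apply); // false
```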

Workspace Output

After a run:

{name}-workspace/
├── analysis_report.md        Phase 0 diagnostic (if run)
├── analysis_report.json
├── optimization_report.md    Full history with token costs
├── changelog.json            Load this in the eval viewer
└── iter-{N}/
    ├── diff.patch            Change applied this iteration
    ├── token_log.json        Per-iteration cost breakdown
    └── eval-{id}/
        ├── output.md         Skill output for this test case
        └── grading.json      Checklist results

Compatibility

Works with Claude Code, OpenCode, or any AI coding assistant that can read markdown files and execute TypeScript.

License

MIT
