Summary
A systematic framework for mining GitHub repositories to extract agent skills and translate them into SKILL.md format, using dense retrieval to identify skills semantically. The output is SKILL.md-compatible and can be ingested directly by Zeph's skill registry.
Source: arXiv 2603.11808 — "Automating Skill Acquisition via Open-Source Repository Mining" (March 2026)
Technique
- Repository crawl: clone/index GitHub repos by topic/language
- Dense retrieval: embed README, docs, code summaries to identify procedural knowledge units
- SKILL.md generation: map extracted procedures to SKILL.md spec format (name, description, steps, examples)
- Quality filtering: dedup via embedding similarity, reject low-quality extractions
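The retrieval and dedup steps above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `embed` is a toy hashed bag-of-words stand-in for a real dense encoder, and `DIM`, `rank_units`, `dedup`, and the 0.9 threshold are hypothetical names and values chosen for the example.

```python
import hashlib
import math
import re

DIM = 256  # hypothetical vector width for the toy encoder


def embed(text: str) -> list[float]:
    """Toy stand-in for a dense encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        idx = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit-length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


def rank_units(query: str, units: list[str], top_k: int = 2) -> list[str]:
    """Return the text units most similar to a skill-oriented query."""
    q = embed(query)
    return sorted(units, key=lambda u: cosine(q, embed(u)), reverse=True)[:top_k]


def dedup(units: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a unit only if it is not too similar to one already kept."""
    kept: list[tuple[str, list[float]]] = []
    for unit in units:
        vec = embed(unit)
        if all(cosine(vec, other) < threshold for _, other in kept):
            kept.append((unit, vec))
    return [unit for unit, _ in kept]
```

In a real miner, `embed` would presumably be swapped for whatever encoder Zeph's embedding matcher already uses, so that similarity scores are comparable with the registry's own.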
Applicability to Zeph
HIGH. Zeph already implements the SKILL.md specification (zeph-skills parser, registry, embedding matcher). This paper proposes a batch pipeline to bootstrap the skill registry from external repositories — completely additive, no format translation required.
Current Zeph self-learning (MetaAgent + SkillRL, #1865) generates skills from session trajectories. This extends it with bulk acquisition from public knowledge. Together: session-driven refinement + corpus-bootstrapped initial skill set.
Implementation sketch
- Offline mining script: scripts/mine-skills.py or a zeph-skills-miner crate
- Input: GitHub search API queries (e.g., "devops automation", "data analysis workflows")
- Output: SKILL.md files written to a configurable skills directory
- Integration: existing skills.paths config picks them up on next startup (hot-reload compatible)
- Quality gate: cosine similarity dedup against existing registry before writing
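A rough sketch of the output and quality-gate steps, under several assumptions: the SKILL.md layout (front matter with `name` and `description`, then a numbered steps section) and the `<skills_dir>/<name>/SKILL.md` directory convention are guesses for illustration, not Zeph's confirmed format. `render_skill`, `write_if_novel`, and the 0.85 threshold are hypothetical, and the token-count cosine stands in for the registry's embedding similarity.

```python
import math
import re
from collections import Counter
from pathlib import Path


def bag_of_words(text: str) -> Counter:
    """Stand-in for an embedding: lower-cased token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def render_skill(name: str, description: str, steps: list[str]) -> str:
    """Render a skill in an assumed SKILL.md layout (front matter plus steps)."""
    lines = ["---", f"name: {name}", f"description: {description}", "---", "", "## Steps", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(lines) + "\n"


def write_if_novel(
    skills_dir: Path,
    name: str,
    description: str,
    steps: list[str],
    threshold: float = 0.85,
) -> bool:
    """Write the skill unless a near-duplicate already exists; return True if written."""
    text = render_skill(name, description, steps)
    candidate = bag_of_words(text)
    # Quality gate: compare against every skill already in the directory.
    for existing in skills_dir.glob("*/SKILL.md"):
        if cosine(candidate, bag_of_words(existing.read_text(encoding="utf-8"))) >= threshold:
            return False
    target = skills_dir / name / "SKILL.md"  # assumes `name` is already a safe slug
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return True
```

Pointing `skills_dir` at a directory listed in `skills.paths` would let the existing loader pick up the generated files. Running the gate before each write keeps repeated mining runs from re-adding skills that are already present.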
Extends #1865 (MetaAgent + SkillRL) — complementary acquisition path.