Add local inference service for task summarization #219
Conversation
Adds local GGUF model inference using llama.cpp via yzma for task summarization and branch name generation.

Key components:

- InferenceService: handles model loading and text generation
- ModelDownloader: downloads and caches GGUF models from HuggingFace
- LibraryDownloader: auto-downloads llama.cpp libraries for the current platform
- summarize command: CLI interface for generating summaries
- download command: pre-downloads the model and libraries
- REST API endpoint: POST /v1/inference/summarize

Critical fix: addSpecial=true must be used when tokenizing prompts for Gemma models so that the BOS token is included. Without it, the model produces incorrect outputs (it was echoing examples from the prompt instead of generating actual summaries).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
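The BOS issue called out above can be illustrated with a mock tokenizer. This is not yzma's actual API; `bosID` and the byte-level `tokenize` are stand-ins showing why the `addSpecial` flag matters for Gemma-family models, which expect a BOS token at the start of every prompt.

```go
package main

import "fmt"

// Illustrative only: yzma's real bindings differ. Gemma-family models
// expect a BOS (Beginning of Sequence) token first in every prompt.
const bosID = 2 // hypothetical BOS token id

// tokenize mimics the shape of llama.cpp-style tokenization: when
// addSpecial is true, the special BOS token is prepended.
func tokenize(prompt string, addSpecial bool) []int {
	// Stand-in for real subword tokenization: one id per byte.
	ids := make([]int, 0, len(prompt)+1)
	if addSpecial {
		ids = append(ids, bosID)
	}
	for _, b := range []byte(prompt) {
		ids = append(ids, int(b)+100) // offset past special ids
	}
	return ids
}

func main() {
	withBOS := tokenize("Summarize: fix login bug", true)
	withoutBOS := tokenize("Summarize: fix login bug", false)
	fmt.Println(withBOS[0] == bosID)    // true: model sees BOS first
	fmt.Println(withoutBOS[0] == bosID) // false: Gemma misbehaves
}
```

Without the BOS token the model never sees the sequence boundary it was trained on, which is consistent with the symptom described: it treated the few-shot examples in the prompt as the content to complete.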
Force-pushed the branch from f347259 to 8069e87.
- Truncate parts slice to max 3 elements before loop
- Add nolint comment for false-positive gosec warning
- Update golangci-lint version to 2.6.2 to match CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
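The slice truncation in the first bullet can be sketched as follows; `truncateParts` and `maxParts` are illustrative names, not the PR's actual identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateParts caps a slice at max elements before iterating,
// so the loop below has a bounded number of iterations.
func truncateParts(parts []string, max int) []string {
	if len(parts) > max {
		return parts[:max]
	}
	return parts
}

func main() {
	parts := strings.Split("fix/login/button/color/bug", "/")
	for _, p := range truncateParts(parts, 3) {
		fmt.Println(p) // prints fix, login, button on separate lines
	}
}
```

Truncating once before the loop (rather than breaking inside it) keeps the bound in one place, which also tends to satisfy linters that flag unbounded loops over user-derived input.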
Force-pushed the branch from 8069e87 to ec38067.
- Implement non-blocking background initialization for the inference service
- Add state management (initializing/ready/failed/disabled) with progress tracking
- Return 503 with status info while the model downloads in the background
- Add retry logic with exponential backoff (3 attempts)
- Use golang.org/x/sys/unix for cross-platform stderr suppression
- Clean up .gitignore (remove models/) and .goreleaser.yml (remove bundled libs)

The inference service now starts immediately and downloads libraries and the model in the background. Enable with the CATNIP_INFERENCE=1 environment variable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Specify stable versions (yarn@4, pnpm@9, npm@10) instead of letting corepack pick dev versions that may not be available.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed the branch from 58b9e28 to 1cfccbe.
Summary
Adds local GGUF model inference using llama.cpp via yzma, enabling on-device task summarization and git branch name generation with our fine-tuned Gemma 3 270M model.
Key Features
CLI Commands

- catnip summarize "task description" - Generate a task summary and branch name
- catnip download - Pre-download the model and llama.cpp libraries

REST API

- POST /v1/inference/summarize - Inference endpoint for programmatic access
- GET /v1/inference/status - Check inference service availability

Auto-downloading

- Models are cached in ~/.catnip/models/
- llama.cpp libraries are cached in ~/.catnip/lib/

Critical Bug Fix
Fixed inference producing incorrect outputs (always returning "Add Dark Mode" from examples instead of actual summaries).
Root cause: Missing BOS (Beginning of Sequence) token when tokenizing prompts for Gemma models.
Fix: Set addSpecial=true in the tokenization call to include the required special tokens.

Test plan
- catnip summarize produces varied, contextually appropriate outputs

🤖 Generated with Claude Code