Centralized queue system for AI model execution with VRAM management. Clients submit jobs via API; the system handles all VRAM coordination and Ollama execution automatically.
```
pip install flask platformdirs pyyaml requests
```

Ensure Ollama is running on localhost:11434.

Start the service:

```
systemctl --user start model-manager
```
Submit a job via the HTTP API:
```
curl -X POST http://localhost:5001/api/submit \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:1.5b", "prompt": "Hello world"}'
```
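The same submission can be scripted from Python using only the standard library. A minimal sketch; the helper names and the `"normal"` default priority are assumptions for illustration, not part of the documented API:

```python
import json
import urllib.request

BASE_URL = "http://localhost:5001"  # default Model Manager port

def build_job(model, prompt, priority="normal", images=None, metadata=None):
    """Assemble a payload with the fields accepted by /api/submit."""
    return {
        "model": model,
        "prompt": prompt,
        "priority": priority,
        "images": images or [],
        "metadata": metadata or {},
    }

def submit_job(payload, base_url=BASE_URL):
    """POST a job to /api/submit and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/api/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `submit_job(build_job("qwen2.5:1.5b", "Hello world"))` mirrors the curl call above.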
The ./cli tool provides commands for batch queries and vision analysis. It communicates with the running Model Manager service.
```
# Batch query: run a question against every line in a file
./cli batch-query items.txt "Is {item} a tool?"

# Full image analysis
./cli analyze photo.png

# Quick image Q&A
./cli quick photo.png "What color is the car?"

# Count objects in an image
./cli count photo.png "people"

# Introspection
./cli --version
./cli --print-defaults
./cli --print-resolved
./cli --print-config-schema
./cli --validate-config
```

Subcommands accept additional options (use ./cli <command> -h for details):
- batch-query supports --model, --priority, --timeout, --max-workers
- analyze supports --model, --focus, --action, --checklist, --prompt
| Command | Description |
|---|---|
| batch-query FILE QUESTION | Run multiple queries from a file with {item} placeholder |
| analyze FILE | Full image analysis with optional --focus, --action, --checklist, --prompt |
| quick FILE QUESTION | Quick Q&A about an image |
| count FILE OBJECT_TYPE | Count specific objects in an image |
| Endpoint | Method | Description |
|---|---|---|
| /api/submit | POST | Submit inference job |
| /api/job/<job_id> | GET | Get job status/result |
| /api/models | GET | List available models |
| /api/models/refresh | POST | Refresh model list from Ollama |
| /api/stats | GET | Queue and resource statistics |
| /api/health | GET | Health check |
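Job submission is asynchronous: /api/submit returns immediately, and /api/job/<job_id> is polled for the result. A sketch of a polling helper; the `status` field and the `"completed"`/`"failed"` terminal values are assumptions about the response schema, not documented guarantees:

```python
import json
import time
import urllib.request

# Assumed terminal states; the service's actual status values may differ.
TERMINAL_STATUSES = {"completed", "failed"}

def is_terminal(job):
    """True once a job record has reached a final state."""
    return job.get("status") in TERMINAL_STATUSES

def wait_for_job(job_id, base_url="http://localhost:5001",
                 poll_interval=1.0, timeout=120.0):
    """Poll GET /api/job/<job_id> until the job settles or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(f"{base_url}/api/job/{job_id}") as resp:
            job = json.load(resp)
        if is_terminal(job):
            return job
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```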
```
curl -X POST http://localhost:5001/api/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "prompt": "Analyze this process",
    "priority": "high",
    "images": [],
    "metadata": {}
  }'
```

Check a job's status:

```
curl http://localhost:5001/api/job/<job_id>
```

Copy config.example.yaml to config.local.yaml and edit:
```
cp config.example.yaml config.local.yaml
```

Settings include the Ollama URL, VRAM margins, scheduler strategy, queue size, and HTTP port. Storage directories (data_dir, cache_dir) are auto-detected via platformdirs but can be overridden in config.
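The settings above might look roughly like the following in config.local.yaml. This is an illustrative sketch only: the key names here are guesses, and config.example.yaml is the authoritative schema.

```yaml
# Hypothetical key names -- consult config.example.yaml for the real ones.
ollama_url: http://localhost:11434   # where the Ollama API listens
vram_margin_mb: 1024                 # headroom kept free on the GPU
scheduler_strategy: priority         # how batches are ordered
max_queue_size: 100                  # pending-job limit
http_port: 5001                      # external Flask API port
# data_dir: /custom/path/data        # override the platformdirs default
# cache_dir: /custom/path/cache
```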
Six core components with strict boundaries:
- HTTP API (Flask) — external interface on port 5001
- Internal API — client interface for job submission
- Queue Manager — priority-based job storage and batching
- VRAM Scheduler — load/unload decisions based on available GPU memory
- Execution Engine — Ollama API integration for inference
- Resource Monitor — VRAM state via nvidia-smi and Ollama /api/ps
A background scheduler loop runs every 100 ms, pulling batches from the queue, creating execution plans, and dispatching them to the engine.
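The loop just described can be sketched as follows. The component interfaces (`pull_batch`, `make_plan`, `dispatch`) are hypothetical names standing in for the Queue Manager, VRAM Scheduler, and Execution Engine; only the 100 ms cadence and the pull/plan/dispatch sequence come from the description above.

```python
import time

SCHEDULER_INTERVAL = 0.1  # 100 ms, per the loop description

def scheduler_loop(queue, scheduler, engine, running):
    """Illustrative background loop: pull a batch of jobs, plan VRAM
    loads/unloads for it, and dispatch the plan for execution."""
    while running():                          # e.g. an Event-backed flag
        batch = queue.pull_batch()            # priority-ordered jobs
        if batch:
            plan = scheduler.make_plan(batch)  # load/unload decisions
            engine.dispatch(plan)              # run inference via Ollama
        time.sleep(SCHEDULER_INTERVAL)
```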
MIT License. See LICENSE.