Ckrest/model-manager

Model Manager

Centralized queue system for AI model execution with VRAM management. Clients submit jobs via API; the system handles all VRAM coordination and Ollama execution automatically.

Installation

```shell
pip install flask platformdirs pyyaml requests
```

Quick Start

  1. Ensure Ollama is running on localhost:11434

  2. Start the service: systemctl --user start model-manager

  3. Submit a job via the HTTP API:

    curl -X POST http://localhost:5001/api/submit \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen2.5:1.5b", "prompt": "Hello world"}'
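The same submission can be scripted. A minimal Python sketch using only the standard library is shown below; the request fields mirror the curl example above, but the response schema (e.g. the name of the job-id field) is not specified in this README and may differ:

```python
import json
import urllib.request

BASE_URL = "http://localhost:5001"  # default HTTP port from this README

def build_payload(model, prompt, priority="normal", images=None, metadata=None):
    """Assemble the JSON body for /api/submit (field names mirror the curl example)."""
    return {
        "model": model,
        "prompt": prompt,
        "priority": priority,
        "images": images or [],
        "metadata": metadata or {},
    }

def submit(payload, base_url=BASE_URL):
    """POST the payload to /api/submit and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`submit(build_payload("qwen2.5:1.5b", "Hello world"))` is equivalent to the curl call above.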

CLI Tool

The ./cli tool provides commands for batch queries and vision analysis. It communicates with the running Model Manager service.

```shell
# Batch query: run a question against every line in a file
./cli batch-query items.txt "Is {item} a tool?"

# Full image analysis
./cli analyze photo.png

# Quick image Q&A
./cli quick photo.png "What color is the car?"

# Count objects in an image
./cli count photo.png "people"

# Introspection
./cli --version
./cli --print-defaults
./cli --print-resolved
./cli --print-config-schema
./cli --validate-config
```

Subcommands accept additional options (use ./cli <command> -h for details): batch-query supports --model, --priority, --timeout, --max-workers; analyze supports --model, --focus, --action, --checklist, --prompt, --timeout.

Subcommands

| Command | Description |
| --- | --- |
| `batch-query FILE QUESTION` | Run multiple queries from a file, using the `{item}` placeholder |
| `analyze FILE` | Full image analysis with optional `--focus`, `--action`, `--checklist`, `--prompt` |
| `quick FILE QUESTION` | Quick Q&A about an image |
| `count FILE OBJECT_TYPE` | Count specific objects in an image |
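The `{item}` substitution performed by `batch-query` can be sketched as follows. This is a simplification for illustration only; the real tool also handles model selection, concurrency, and timeouts:

```python
def expand_queries(template, lines):
    """Substitute each non-empty input line for the {item} placeholder."""
    return [template.format(item=line.strip()) for line in lines if line.strip()]

# Each line of items.txt becomes one query:
# expand_queries("Is {item} a tool?", ["hammer\n", "banana\n"])
# -> ["Is hammer a tool?", "Is banana a tool?"]
```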

HTTP API

| Endpoint | Method | Description |
| --- | --- | --- |
| `/api/submit` | POST | Submit an inference job |
| `/api/job/<job_id>` | GET | Get job status/result |
| `/api/models` | GET | List available models |
| `/api/models/refresh` | POST | Refresh the model list from Ollama |
| `/api/stats` | GET | Queue and resource statistics |
| `/api/health` | GET | Health check |

Submit Job

```shell
curl -X POST http://localhost:5001/api/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:1.5b",
    "prompt": "Analyze this process",
    "priority": "high",
    "images": [],
    "metadata": {}
  }'
```

Poll Result

```shell
curl http://localhost:5001/api/job/<job_id>
```
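A simple polling loop can wrap this endpoint. In the sketch below, `fetch_status` is any callable that returns the parsed status JSON for a job; the terminal state names (`"completed"`, `"failed"`) are assumptions, as the README does not document the status vocabulary:

```python
import time

def wait_for_result(fetch_status, job_id, timeout=120.0, interval=1.0):
    """Poll GET /api/job/<job_id> until the job reaches a terminal state.

    `fetch_status` takes a job id and returns the status dict; the state
    names checked here are illustrative, not documented by the service.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```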

Configuration

Copy config.example.yaml to config.local.yaml and edit:

```shell
cp config.example.yaml config.local.yaml
```

Settings include Ollama URL, VRAM margins, scheduler strategy, queue size, and HTTP port. Storage directories (data_dir, cache_dir) are auto-detected via platformdirs but can be overridden in config.
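A hypothetical `config.local.yaml` covering those settings might look like the sketch below. The key names here are illustrative guesses, not the real schema; `config.example.yaml` (or `./cli --print-config-schema`) is the authoritative reference:

```yaml
# Illustrative only -- consult config.example.yaml for the actual key names.
ollama_url: http://localhost:11434
http_port: 5001
vram_margin_mb: 512          # headroom kept free when loading models
scheduler_strategy: priority
max_queue_size: 100
# data_dir and cache_dir default to platformdirs locations when omitted:
# data_dir: /var/lib/model-manager
```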

Architecture

Six core components with strict boundaries:

  • HTTP API (Flask) — external interface on port 5001
  • Internal API — client interface for job submission
  • Queue Manager — priority-based job storage and batching
  • VRAM Scheduler — load/unload decisions based on available GPU memory
  • Execution Engine — Ollama API integration for inference
  • Resource Monitor — VRAM state via nvidia-smi and Ollama /api/ps

A background scheduler loop runs every 100 ms, pulling batches from the queue, creating execution plans, and dispatching them to the execution engine.
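The priority-based batching that the Queue Manager hands to this loop can be sketched with a standard heap. The priority names and batch semantics below are assumptions for illustration (jobs are FIFO within a priority level), not the actual implementation:

```python
import heapq
import itertools

_PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}  # assumed priority names
_counter = itertools.count()  # tie-breaker: FIFO within a priority level

def push(queue, job):
    """Insert a job dict into the heap, keyed by its priority."""
    rank = _PRIORITY_RANK.get(job.get("priority", "normal"), 1)
    heapq.heappush(queue, (rank, next(_counter), job))

def pull_batch(queue, max_jobs):
    """Pop up to max_jobs highest-priority jobs, as one scheduler tick might."""
    batch = []
    while queue and len(batch) < max_jobs:
        _, _, job = heapq.heappop(queue)
        batch.append(job)
    return batch
```

Because the heap key is `(rank, arrival_order)`, a `high` job submitted late still runs before earlier `normal` or `low` jobs, while jobs of equal priority keep submission order.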

License

MIT License. See LICENSE.
