The Ollama developer experience, the vLLM production power.
vAquila (via the vaq command) is an open-source AI model inference manager. It combines the absolute simplicity of a CLI with the production performance of vLLM and the isolation of Docker, all with smart and automated GPU management.
- The Problem: Ollama is amazing for local testing, but its architecture shows limits in production. vLLM is the undisputed king of production performance, but its deployment is often a hassle (manual VRAM calculation, Docker volume management, Out of Memory crashes).
- The Solution: vAquila orchestrates everything for you. Like an eagle soaring over your infrastructure, it analyzes your GPU state in real-time, calculates the perfect memory ratio, and deploys the vLLM Docker container invisibly and securely.
For the most up-to-date guide on how vAquila works and how to run it, please refer to the Docusaurus documentation:
Quick Links:
- Auto-VRAM: Automatic calculation of the
--gpu-memory-utilizationflag via NVML to prevent crashes. - One-Click Deployment: Download and run models via a simple
vaq run <hf-model>command. - Advanced Model Compatibility: Optional
trust_remote_codesupport for repositories that require custom model code. - Docker Orchestration: Invisible management of containers, exposed ports, and Hugging Face cache.
- Web UI: A local dashboard to manage models, containers, cache, and inference workflows.
- Language: Python 3.10+
- CLI: Typer
- Orchestration: Official Docker SDK for Python
- Hardware Monitoring:
nvidia-ml-py(NVML) - Inference Engine: vLLM
Built with ❤️ to make high-performance AI inference accessible to everyone.
