Run OpenBioLLM-8B locally on Apple Silicon (M1/M2) with quantized GGUF or MLX weights, and deploy it on a GPU server using vLLM + FastAPI.
- Mac local: llama.cpp (Metal) or MLX runners
- GPU: vLLM server + FastAPI proxy (OpenAI-style)
- Clean prompt templating and a consistent request/response interface (see the schema sketch after this list)
- Structured logging, health checks, simple config in YAML
- Uses uv (Python package manager) for fast, reproducible installs
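The local runners and the GPU proxy are meant to share one request/response shape. A rough sketch of what that could look like is below; only the `input` field appears in this README's test request, so every other field name and default here is an assumption. Check `src/openbiollm_inference/common/schema.py` for the real definitions.

```python
# Illustrative only -- the real models live in openbiollm_inference/common/schema.py.
from pydantic import BaseModel, Field


class GenerateRequest(BaseModel):
    # "input" matches the /generate test payload shown later in this README.
    input: str
    # The sampling knobs below are assumptions, not guaranteed fields.
    max_tokens: int = Field(default=512, ge=1)
    temperature: float = Field(default=0.2, ge=0.0, le=2.0)


class GenerateResponse(BaseModel):
    # Field name is an assumption; the proxy may use a different key.
    output: str
```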
Requirements:
- uv: https://docs.astral.sh/uv/ (install once: `pipx install uv` or `curl -LsSf https://astral.sh/uv/install.sh | sh`)
- Apple Silicon (macOS 13+), or a Linux GPU box (CUDA 12+ for the vLLM image)
- Model files:
  - GGUF: place into `models/gguf/` (e.g., `OpenBioLLM-Llama3-8B-Q4_K_M.gguf`)
  - HF model id for GPU (default: `aaditya/Llama3-OpenBioLLM-8B`)
Note: You are responsible for model licenses and usage compliance.
Setup:

```bash
uv venv .venv
source .venv/bin/activate

# pick the extra that matches your target:
uv pip install -e ".[mac-gguf]"   # llama.cpp / GGUF on Mac
uv pip install -e ".[mac-mlx]"    # MLX on Mac
uv pip install -e ".[gpu]"        # vLLM serving on a GPU box
```

Run locally with GGUF (llama.cpp, Metal):
- Put your `.gguf` at `models/gguf/OpenBioLLM-Llama3-8B-Q4_K_M.gguf`
- Edit `configs/local_gguf.yaml` if needed
- Run:

```bash
python -m openbiollm_inference.local_m1.run_gguf --text "Summarize: patient with fever and cough."
```
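Under the hood, the GGUF path boils down to a llama-cpp-python call similar to the sketch below. This is an illustration rather than the contents of `run_gguf.py`; the context size, GPU-layer setting, and prompt tags are assumptions.

```python
# Illustrative sketch -- see src/openbiollm_inference/local_m1/run_gguf.py for the real runner.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gguf/OpenBioLLM-Llama3-8B-Q4_K_M.gguf",
    n_ctx=4096,        # assumed context window
    n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon
    verbose=False,
)

# Prompt tags mirror configs/prompt_template.txt; adjust if your template differs.
prompt = "<|system|>You are a careful clinical assistant.<|user|>Summarize: patient with fever and cough."
result = llm(prompt, max_tokens=256, temperature=0.2)
print(result["choices"][0]["text"].strip())
```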
Or run the MLX variant:

```bash
python -m openbiollm_inference.local_m1.run_mlx --text "Summarize: patient with fever and cough."
```

GPU serving (vLLM + FastAPI): set up a proper NVIDIA host with CUDA drivers. Then:
```bash
cp .env.example .env
# edit .env if needed
docker compose -f docker/docker-compose.gpu.yml up --build
```

This starts:
- `vllm`: serves the OpenAI-compatible API
- `app`: the FastAPI proxy at http://localhost:8000
Notes:
- The `app` container installs only the `.[app]` extra (FastAPI, Uvicorn, httpx, etc.); the `vllm` dependency lives only in the `vllm` container.
- The proxy reads the system prompt from `configs/prompt_template.txt`; on startup it warns if the template lacks `<|system|>`/`<|user|>` tags.
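To make the data flow concrete, here is a rough sketch of what such a proxy can look like. It is not a copy of `serve_gpu/fastapi_app.py`: the environment variable names, the response field, and the exact forwarding logic are assumptions; only the template warning and the vLLM OpenAI-compatible endpoint come from this README.

```python
# Illustrative proxy sketch -- the real app lives in serve_gpu/fastapi_app.py.
import logging
import os
from pathlib import Path

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

logger = logging.getLogger("openbiollm.proxy")

# Assumed env variable names; the project reads its own settings from .env / YAML.
VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://vllm:8000/v1")
MODEL_ID = os.getenv("MODEL_ID", "aaditya/Llama3-OpenBioLLM-8B")

SYSTEM_PROMPT = Path("configs/prompt_template.txt").read_text(encoding="utf-8")

app = FastAPI()


class GenerateIn(BaseModel):
    input: str


@app.on_event("startup")
async def check_template() -> None:
    # Mirrors the startup warning described in the notes above.
    if "<|system|>" not in SYSTEM_PROMPT or "<|user|>" not in SYSTEM_PROMPT:
        logger.warning("prompt_template.txt is missing <|system|>/<|user|> tags")


@app.get("/health")
async def health() -> dict:
    return {"status": "ok"}


@app.post("/generate")
async def generate(req: GenerateIn) -> dict:
    # Forward the request to vLLM's OpenAI-compatible chat completions endpoint.
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": req.input},
        ],
        "temperature": 0.2,
    }
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(f"{VLLM_BASE_URL}/chat/completions", json=payload)
        resp.raise_for_status()
        data = resp.json()
    return {"output": data["choices"][0]["message"]["content"]}
```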
Test:
```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"input":"Give two differentials for fever in adults."}'
```
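The same request from Python, using httpx (any HTTP client works):

```python
# Equivalent of the curl request above.
import httpx

resp = httpx.post(
    "http://localhost:8000/generate",
    json={"input": "Give two differentials for fever in adults."},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```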
Project layout:

```
openbiollm-inference/
├─ README.md
├─ pyproject.toml
├─ .gitignore
├─ .env.example
├─ configs/
│  ├─ prompt_template.txt
│  ├─ local_gguf.yaml
│  └─ gpu_vllm.yaml
├─ models/
│  └─ gguf/
│     └─ .gitkeep
├─ docker/
│  ├─ Dockerfile.vllm
│  └─ docker-compose.gpu.yml
└─ src/
   └─ openbiollm_inference/
      ├─ __init__.py
      ├─ common/
      │  ├─ logging_setup.py
      │  ├─ schema.py
      │  └─ templates.py
      ├─ local_m1/
      │  ├─ run_gguf.py
      │  └─ run_mlx.py
      └─ serve_gpu/
         ├─ fastapi_app.py
         └─ vllm_server.py
```
- Keep your prompt template stable between local and GPU serving for consistent behavior.
- For production: add authentication in FastAPI (a minimal sketch follows this list), rate limits at NGINX, and an observability stack (Loki/Grafana).
- To extend: add a RAG service and a prompt registry with Redis caching.
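As a starting point for the authentication tip, an API-key check as a FastAPI dependency might look like this. The header name and env variable are illustrative; this is not part of the repo.

```python
# Illustrative API-key guard for the proxy; attach it to /generate as a dependency.
import os

from fastapi import Depends, FastAPI, Header, HTTPException

API_KEY = os.getenv("PROXY_API_KEY", "")  # assumed env variable name


async def require_api_key(x_api_key: str = Header(default="")) -> None:
    if not API_KEY or x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="invalid or missing API key")


app = FastAPI()


@app.post("/generate", dependencies=[Depends(require_api_key)])
async def generate(payload: dict) -> dict:
    # Forward to vLLM here, as in the proxy sketch above.
    return {"output": "stub"}
```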
Install dev deps and run tests:
```bash
uv pip install -e ".[app,dev]"
pytest -q
```
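If you need a starting point, a simple test can assert that the prompt template carries the tags the proxy expects (the file name and location below are illustrative, not existing repo files):

```python
# tests/test_prompt_template.py -- illustrative example.
from pathlib import Path


def test_prompt_template_has_required_tags():
    template = Path("configs/prompt_template.txt").read_text(encoding="utf-8")
    assert "<|system|>" in template
    assert "<|user|>" in template
```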