A high-performance batch processing API for large language models with support for both single-GPU and multi-GPU (8-GPU) deployments.
- 🚀 Intelligent Queue Management
- ⚡ Advanced Parallel Processing
- 🔄 Production-Ready Reliability
- 📊 Real-Time Analytics Dashboard
- 🎯 Enterprise-Scale Architecture
- 💾 Persistent Storage
docker compose -f docker-compose.yml up -d --build --scale worker=4
# Access the dashboard
curl http://localhost:8080/
For high-throughput production workloads, use the 24-GPU configuration with load balancing:
export NGINX_CONF=nginx-24gpu.conf
export REMOTE_CLUSTER_HOST_A=<ip_remote_cluster>
export REMOTE_CLUSTER_HOST_B=<ip_remote_cluster>
docker compose -f docker-compose-multi-gpu.yml up -d --build --scale worker=12
Features:
- See throughput per 15 minutes
- See uploads per 15 minutes
- See stats per 24 hours
- See and delete individual batches
Things you might want to check:
- MAX_WORKERS=128 and MAX_CONCURRENT_BATCHES=10 in the docker compose files of the batch API. If the batches are very large, consider lowering the number of concurrent batches, or vice versa.
- Find your most efficient number of workers; 128 worked well on an H100 running a small model.
- events { worker_connections 2048; } in the Nginx config. Make sure this value is larger than MAX_WORKERS.
- Consider uploading the model once for faster init on the 8 GPUs.
- There is no storage management system, so make sure you delete your batch files (in & out); a cleanup sketch is shown below.
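Since nothing cleans up after the API, a small scheduled cleanup keeps the batch-file volume from filling up. A minimal sketch, assuming the files live under ./batch_files as .jsonl and that a 7-day retention is acceptable (adjust the path to the volume defined in your compose file):
# Delete batch input/output files older than 7 days.
# ./batch_files and the 7-day window are assumptions — match them to your setup.
find ./batch_files -type f -name "*.jsonl" -mtime +7 -print -delete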
The 8-GPU deployment provides horizontal scaling with the following architecture, featuring multiple workers and a Redis queue for efficient batch processing:
graph TB
subgraph "Client Layer"
Client[Client Applications]
end
subgraph "API Layer"
BatchAPI[Batch API<br/>:8080]
end
subgraph "Queue Layer"
Queue[Redis Queue<br/>:6379]
end
subgraph "Worker Layer"
Worker1[Worker 1]
Worker2[Worker 2]
Worker3[Worker 3]
Worker4[Worker 4]
end
subgraph "Load Balancer Layer"
LB[Nginx Load Balancer<br/>:8000]
end
subgraph "vLLM Inference Layer"
GPU0[vLLM GPU-0<br/>Device: GPU 0]
GPU1[vLLM GPU-1<br/>Device: GPU 1]
GPU2[vLLM GPU-2<br/>Device: GPU 2]
GPU3[vLLM GPU-3<br/>Device: GPU 3]
GPU4[vLLM GPU-4<br/>Device: GPU 4]
GPU5[vLLM GPU-5<br/>Device: GPU 5]
GPU6[vLLM GPU-6<br/>Device: GPU 6]
GPU7[vLLM GPU-7<br/>Device: GPU 7]
end
subgraph "Storage Layer"
DB[(PostgreSQL<br/>Database)]
Models[(Model Cache<br/>HuggingFace)]
Files[(Batch Files<br/>Volume)]
end
Client --> BatchAPI
BatchAPI --> Queue
BatchAPI --> DB
BatchAPI --> Files
Queue --> Worker1
Queue --> Worker2
Queue --> Worker3
Queue --> Worker4
Worker1 --> LB
Worker2 --> LB
Worker3 --> LB
Worker4 --> LB
LB --> GPU0
LB --> GPU1
LB --> GPU2
LB --> GPU3
LB --> GPU4
LB --> GPU5
LB --> GPU6
LB --> GPU7
GPU0 --> Models
GPU1 --> Models
GPU2 --> Models
GPU3 --> Models
GPU4 --> Models
GPU5 --> Models
GPU6 --> Models
GPU7 --> Models
classDef gpu fill:#e1f5fe
classDef lb fill:#f3e5f5
classDef api fill:#e8f5e8
classDef storage fill:#fff3e0
classDef worker fill:#e8f5e8
classDef queue fill:#fff9c4
class GPU0,GPU1,GPU2,GPU3,GPU4,GPU5,GPU6,GPU7 gpu
class LB lb
class BatchAPI api
class DB,Models,Files storage
class Worker1,Worker2,Worker3,Worker4 worker
class Queue queue
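After bringing the stack up, you can check that every layer in the diagram is actually running. A quick sketch (the service name vllm-gpu-0 is illustrative; use the names defined in docker-compose-multi-gpu.yml):
# Show the status of all services in the multi-GPU stack
docker compose -f docker-compose-multi-gpu.yml ps
# Follow the logs of one vLLM instance (service name is an assumption)
docker compose -f docker-compose-multi-gpu.yml logs -f vllm-gpu-0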
To run with multiple workers for increased throughput, use the --scale option:
# Run with 4 workers (recommended for 8-GPU setup)
docker-compose -f docker-compose-multi-gpu.yml up --scale worker=4
# Or with 12 workers for maximum throughput
docker-compose -f docker-compose-multi-gpu.yml up --scale worker=12
The worker scaling provides:
- Horizontal scaling: Each worker processes batch jobs independently
- Queue-based distribution: Redis queue distributes jobs across available workers
- Load balancing: Workers share the load across all 8 GPU instances
- Fault tolerance: If a worker fails, other workers continue processing
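To confirm that jobs are actually being spread across workers, you can look at the Redis queue while a batch is running. A minimal sketch, assuming the Redis service is named redis and the queue key is batch_queue (the real key name depends on the worker code):
# Check Redis is reachable
docker compose -f docker-compose-multi-gpu.yml exec redis redis-cli PING
# Number of jobs currently waiting in the assumed queue key
docker compose -f docker-compose-multi-gpu.yml exec redis redis-cli LLEN batch_queue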
# Test vLLM endpoint directly
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<your_model_name>",
"prompt": "Why is open source important for the progress of AI?",
"max_tokens": 100,
"temperature": 0.3
}'
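Since every vLLM instance serves the standard OpenAI-compatible API, you can also ask the load balancer on port 8000 which models are available; this uses vLLM's built-in /v1/models route:
# List the models served behind the Nginx load balancer
curl http://localhost:8000/v1/models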
# Test batch API health
curl http://localhost:8080/health
The test suite runs individual endpoint tests plus 100 calls to OpenAI gpt-nano. We do not have a pytest suite for GPUs; we advise running test_large.py and test_api.py manually to check a GPU deployment. Since vLLM is OpenAI-compatible, we did not see the need for dedicated GPU tests.
# MAKE SURE YOU HAVE TEST_API_KEY set in .env
docker-compose -f docker-compose.test.yml up --build --scale worker=2
- scripts/check_batch.py – CLI to check a batch by its ID
- scripts/delete_x.py – CLI to delete files and/or batches
For CI/CD pipelines, it is recommended to update only the batch-api service, using this command:
docker compose -f <compose-file> up -d --no-deps --build batch-api
Contributions are welcome! Feel free to open an issue or submit a pull request.
- Author: Floris Fok
- 📧 Email: floris.fok@prosus.com
- 🔗 LinkedIn: floris-jan-fok

