A high-performance batch processing API for large language models with support for both single-GPU and multi-GPU (8-GPU) deployments.
- 🚀 Intelligent Queue Management
- ⚡ Advanced Parallel Processing
- 🔄 Production-Ready Reliability
- 📊 Real-Time Analytics Dashboard
- 🎯 Enterprise-Scale Architecture
- 💾 Persistent Storage
docker compose -f docker-compose.yml up -d --build --scale worker=4
# Access the dashboard
curl http://localhost:8080/
For high-throughput production workloads, use the 24-GPU configuration with load balancing:
export NGINX_CONF=nginx-24gpu.conf
export REMOTE_CLUSTER_HOST_A=<ip_remote_cluster>
export REMOTE_CLUSTER_HOST_B=<ip_remote_cluster>
docker compose -f docker-compose-multi-gpu.yml up -d --build --scale worker=12
Features:
- See throughput per 15 minutes
- See uploads per 15 minutes
- See stats per 24 hours
- See and delete individual batches
Things you might want to check:
- MAX_WORKERS=128 and MAX_CONCURRENT_BATCHES=10 in the docker compose files of the batch API. If the batches are very large, consider lowering the number of concurrent batches, or vice versa.
- Find your most efficient number of workers; 128 worked well on an H100 running a small model.
- events { worker_connections 2048; } in the Nginx config. Make sure this value is larger than MAX_WORKERS.
- Consider uploading the model once for faster init on the 8 GPUs.
- There is no storage management system, so make sure you delete your batch files (in & out); a cleanup sketch is shown below.
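Since nothing cleans up after the API, a small scheduled cleanup keeps the batch-file volume from filling up. A minimal sketch, assuming the files live under ./batch_files as .jsonl and that a 7-day retention is acceptable (adjust the path to the volume defined in your compose file):
# Delete batch input/output files older than 7 days.
# ./batch_files and the 7-day window are assumptions — match them to your setup.
find ./batch_files -type f -name "*.jsonl" -mtime +7 -print -delete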
The 8-GPU deployment provides horizontal scaling with the following architecture, featuring multiple workers and a Redis queue for efficient batch processing:
graph TB
subgraph "Client Layer"
Client[Client Applications]
end
subgraph "API Layer"
BatchAPI[Batch API<br/>:8080]
end
subgraph "Queue Layer"
Queue[Redis Queue<br/>:6379]
end
subgraph "Worker Layer"
Worker1[Worker 1]
Worker2[Worker 2]
Worker3[Worker 3]
Worker4[Worker 4]
end
subgraph "Load Balancer Layer"
LB[Nginx Load Balancer<br/>:8000]
end
subgraph "vLLM Inference Layer"
GPU0[vLLM GPU-0<br/>Device: GPU 0]
GPU1[vLLM GPU-1<br/>Device: GPU 1]
GPU2[vLLM GPU-2<br/>Device: GPU 2]
GPU3[vLLM GPU-3<br/>Device: GPU 3]
GPU4[vLLM GPU-4<br/>Device: GPU 4]
GPU5[vLLM GPU-5<br/>Device: GPU 5]
GPU6[vLLM GPU-6<br/>Device: GPU 6]
GPU7[vLLM GPU-7<br/>Device: GPU 7]
end
subgraph "Storage Layer"
DB[(PostgreSQL<br/>Database)]
Models[(Model Cache<br/>HuggingFace)]
Files[(Batch Files<br/>Volume)]
end
Client --> BatchAPI
BatchAPI --> Queue
BatchAPI --> DB
BatchAPI --> Files
Queue --> Worker1
Queue --> Worker2
Queue --> Worker3
Queue --> Worker4
Worker1 --> LB
Worker2 --> LB
Worker3 --> LB
Worker4 --> LB
LB --> GPU0
LB --> GPU1
LB --> GPU2
LB --> GPU3
LB --> GPU4
LB --> GPU5
LB --> GPU6
LB --> GPU7
GPU0 --> Models
GPU1 --> Models
GPU2 --> Models
GPU3 --> Models
GPU4 --> Models
GPU5 --> Models
GPU6 --> Models
GPU7 --> Models
classDef gpu fill:#e1f5fe
classDef lb fill:#f3e5f5
classDef api fill:#e8f5e8
classDef storage fill:#fff3e0
classDef worker fill:#e8f5e8
classDef queue fill:#fff9c4
class GPU0,GPU1,GPU2,GPU3,GPU4,GPU5,GPU6,GPU7 gpu
class LB lb
class BatchAPI api
class DB,Models,Files storage
class Worker1,Worker2,Worker3,Worker4 worker
class Queue queue
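After bringing the stack up, you can check that every layer in the diagram is actually running. A quick sketch (the service name vllm-gpu-0 is illustrative; use the names defined in docker-compose-multi-gpu.yml):
# Show the status of all services in the multi-GPU stack
docker compose -f docker-compose-multi-gpu.yml ps
# Follow the logs of one vLLM instance (service name is an assumption)
docker compose -f docker-compose-multi-gpu.yml logs -f vllm-gpu-0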
To run with multiple workers for increased throughput, use the --scale option:
# Run with 4 workers (recommended for 8-GPU setup)
docker-compose -f docker-compose-multi-gpu.yml up --scale worker=4
# Or with 12 workers for maximum throughput
docker-compose -f docker-compose-multi-gpu.yml up --scale worker=12
The worker scaling provides:
- Horizontal scaling: Each worker processes batch jobs independently
- Queue-based distribution: Redis queue distributes jobs across available workers
- Load balancing: Workers share the load across all 8 GPU instances
- Fault tolerance: If a worker fails, other workers continue processing
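To confirm that jobs are actually being spread across workers, you can look at the Redis queue while a batch is running. A minimal sketch, assuming the Redis service is named redis and the queue key is batch_queue (the real key name depends on the worker code):
# Check Redis is reachable
docker compose -f docker-compose-multi-gpu.yml exec redis redis-cli PING
# Number of jobs currently waiting in the assumed queue key
docker compose -f docker-compose-multi-gpu.yml exec redis redis-cli LLEN batch_queue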
# Test vLLM endpoint directly
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<your_model_name>",
"prompt": "Why is open source important for the progress of AI?",
"max_tokens": 100,
"temperature": 0.3
}'
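Since every vLLM instance serves the standard OpenAI-compatible API, you can also ask the load balancer on port 8000 which models are available; this uses vLLM's built-in /v1/models route:
# List the models served behind the Nginx load balancer
curl http://localhost:8000/v1/models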
# Test batch API health
curl http://localhost:8080/health
The test suite runs individual endpoint tests plus 100 calls to OpenAI gpt-nano. We do not have a pytest suite for GPUs; we advise running test_large.py and test_api.py manually to check a GPU deployment. Since vLLM is OpenAI-compatible, we did not see the need for dedicated GPU tests.
# MAKE SURE YOU HAVE TEST_API_KEY set in .env
docker-compose -f docker-compose.test.yml up --build --scale worker=2
- scripts/check_batch.py – CLI to check a batch by its ID
- scripts/delete_x.py – CLI to delete files and/or batches
For CI/CD pipelines, it is recommended to update only the batch-api service, using this command:
docker compose -f <compose-file> up -d --no-deps --build batch-api
Contributions are welcome! Feel free to open an issue or submit a pull request.
- Author: Floris Fok
- 📧 Email: floris.fok@prosus.com
- 🔗 LinkedIn: floris-jan-fok

