
Commit 426cf68

authored

Add Load Balancer deployment with two use cases (#191)

* Create docker-compose.yml
* Create haproxy.cfg
* Update README.md
* add rotation solution
* minor update
* Add docker-compose.env-override.yml to override UR_L0_USE_IMMEDIATE_COMMANDLISTS

1 parent fa0a0be commit 426cf68

File tree

5 files changed: +311 −1 lines changed

vllm/README.md

Lines changed: 85 additions & 1 deletion
@@ -23,7 +23,8 @@ llm-scaler-vllm is an extended and optimized version of vLLM, specifically adapt

2.7 [Finding maximum Context Length](#27-finding-maximum-context-length)
2.8 [Multi-Modal Webui](#28-multi-modal-webui)
2.9 [Multi-node Distributed Deployment (PP/TP)](#29-multi-node-distributed-deployment-pptp)
2.10 [BPE-Qwen Tokenizer](#210-bpe-qwen-tokenizer)
2.11 [Load Balancer Solution](#211-load-balancer-solution)
4. [Supported Models](#3-supported-models)
5. [Troubleshooting](#4-troubleshooting)
6. [Performance tuning](#5-performance-tuning)
@@ -2445,6 +2446,12 @@ To enable data parallelism, add:

```
--dp 2
```

> **Note**
> In addition to DP, a **load balancer–based deployment** is also supported as a drop-in alternative.
> It provides slightly better performance in some scenarios and supports periodic instance rotation for long-running services.
> See [Section 2.11 Load Balancer](#211-load-balancer-solution) for details.

---

### 2.7 Finding maximum Context Length
@@ -2743,6 +2750,83 @@ To enable it when launching the API server, add:

---

### 2.11 Load Balancer Solution

This section describes a **load balancer–based deployment** for vLLM using Docker Compose.
The load balancer routes traffic across multiple vLLM instances and exposes a single endpoint.

Once started, send requests to:

```
http://localhost:8000
```

#### Use Case 1: Drop-in Alternative to DP

Use this setup as a **drop-in alternative to DP** (data parallelism).

Compared to DP, the load balancer approach delivered **slightly better performance** in our testing and requires no DP-specific configuration.

Start the load balancer:

```bash
cd vllm/docker-compose/load_balancer
docker compose up -d
```

You can follow the logs in real time to monitor service status:

```bash
docker compose logs -f
```

After startup, all requests can be sent directly to:

```
http://localhost:8000
```

Stop / clean up:

```bash
docker compose down
```
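
As a quick smoke test, you can send an OpenAI-compatible chat completion request through the balanced endpoint; the model name `model` matches `--served-model-name` in the compose file, while the prompt and `max_tokens` value below are illustrative:

```shell
# Hypothetical smoke test; requires the stack above to be up and serving.
ENDPOINT="http://localhost:8000/v1/chat/completions"
PAYLOAD='{"model": "model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
curl -s "$ENDPOINT" -H "Content-Type: application/json" -d "$PAYLOAD" || true
```

HAProxy forwards the connection to whichever vLLM instance is next in the round-robin order, so repeated requests are spread across both backends.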
#### Use Case 2: Periodic vLLM Rotation (Long-Running Service)

Use this when vLLM runs for a long time and you want to restart instances periodically (e.g., once per day) to avoid degradation, without interrupting the service.

Start with rotation enabled:

```bash
cd vllm/docker-compose/load_balancer
chmod +x vllm_bootstrap_and_rotate.sh
bash vllm_bootstrap_and_rotate.sh
```

You can follow the logs in real time to monitor service status:

```bash
docker compose logs -f
```

Once started, requests continue to be served at:

```
http://localhost:8000
```

To stop the rotation and clean up resources:

```bash
docker compose down
crontab -l | grep -v "vllm_bootstrap_and_rotate.sh" | crontab -
```

> This stops all containers and removes the cron job that triggers periodic rotation.

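The `crontab -l | grep -v ... | crontab -` pipeline keeps every crontab entry except the rotation job. A minimal sketch of the same filtering on simulated input (the paths and the unrelated backup entry here are illustrative):

```shell
# Simulated crontab: one unrelated entry plus the rotation job.
CRON='0 0 * * * /usr/bin/backup
0 3 * * * /opt/load_balancer/vllm_bootstrap_and_rotate.sh >> /tmp/vllm_rotate.log 2>&1'

# Keep every line except the rotation entry, exactly as the pipeline above does.
echo "$CRON" | grep -v "vllm_bootstrap_and_rotate.sh"
```

Only the unrelated `backup` line survives the filter, so re-installing the filtered listing with `crontab -` removes just the rotation job.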
---
## 3. Supported Models
vllm/docker-compose/load_balancer/docker-compose.env-override.yml

Lines changed: 8 additions & 0 deletions

```yaml
services:
  vllm_1:
    environment:
      - UR_L0_USE_IMMEDIATE_COMMANDLISTS=1

  vllm_2:
    environment:
      - UR_L0_USE_IMMEDIATE_COMMANDLISTS=1
```
vllm/docker-compose/load_balancer/docker-compose.yml

Lines changed: 81 additions & 0 deletions

```yaml
services:
  vllm_1:
    image: intel/llm-scaler-vllm
    container_name: vllm_1
    privileged: true
    network_mode: host
    devices:
      - "/dev/dri:/dev/dri"
    shm_size: "32gb"
    working_dir: /llm
    entrypoint: >
      bash -lc "source /opt/intel/oneapi/setvars.sh --force &&
      python3 -m vllm.entrypoints.openai.api_server
      --model /llm/models/DeepSeek-R1-Distill-Qwen-7B
      --served-model-name model
      --dtype=float16
      --enforce-eager
      --port 8008
      --host 0.0.0.0
      --disable-log-requests
      --trust-remote-code
      --gpu-memory-util=0.9
      --no-enable-prefix-caching
      --max-num-batched-tokens=8192
      --max-model-len=32768
      --max-num-seqs 256
      --block-size 64
      -tp=1"
    environment:
      PWD: "/llm"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      ZE_AFFINITY_MASK: "2"
      VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT: "1"
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
    volumes:
      - /home/intel/LLM/:/llm/models/

  vllm_2:
    image: intel/llm-scaler-vllm
    container_name: vllm_2
    privileged: true
    network_mode: host
    devices:
      - "/dev/dri:/dev/dri"
    shm_size: "32gb"
    working_dir: /llm
    entrypoint: >
      bash -lc "source /opt/intel/oneapi/setvars.sh --force &&
      python3 -m vllm.entrypoints.openai.api_server
      --model /llm/models/DeepSeek-R1-Distill-Qwen-7B
      --served-model-name model
      --dtype=float16
      --enforce-eager
      --port 8009
      --host 0.0.0.0
      --disable-log-requests
      --trust-remote-code
      --gpu-memory-util=0.9
      --no-enable-prefix-caching
      --max-num-batched-tokens=8192
      --max-model-len=32768
      --max-num-seqs 256
      --block-size 64
      -tp=1"
    environment:
      PWD: "/llm"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
      ZE_AFFINITY_MASK: "3"
      VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT: "1"
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
    volumes:
      - /home/intel/LLM/:/llm/models/

  haproxy:
    image: haproxy:latest
    container_name: llm_haproxy
    network_mode: host
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg
    restart: always
```
vllm/docker-compose/load_balancer/haproxy.cfg

Lines changed: 25 additions & 0 deletions

```
global
    log stdout format raw local0
    maxconn 1024
    stats socket 0.0.0.0:9999 level admin

defaults
    log global
    option tcplog            # TCP log format
    option dontlognull
    timeout connect 5s
    timeout client 600s
    timeout server 600s

frontend llm_front
    bind *:8000
    mode tcp                 # explicit TCP mode
    default_backend vllm_backend

backend vllm_backend
    mode tcp                 # explicit TCP mode
    balance roundrobin
    default-server inter 2s rise 2 fall 3 slowstart 10s
    server vllm_1 127.0.0.1:8008 check
    server vllm_2 127.0.0.1:8009 check
```
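`balance roundrobin` alternates incoming connections between the two `server` entries. A toy sketch of the dispatch order (this is an illustration only, not HAProxy's implementation):

```shell
# Toy round-robin: connection i goes to backend (i mod 2).
BACKENDS=("127.0.0.1:8008" "127.0.0.1:8009")
for i in 0 1 2 3; do
  echo "connection $i -> ${BACKENDS[$((i % 2))]}"
done
```

The `check` keyword plus `inter 2s rise 2 fall 3` means each backend is probed every 2 seconds and is only taken out of (or back into) the rotation after consecutive failed or successful probes, so a restarting vLLM instance stops receiving traffic automatically.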
vllm/docker-compose/load_balancer/vllm_bootstrap_and_rotate.sh

Lines changed: 112 additions & 0 deletions

```bash
#!/usr/bin/env bash
set -e

# ==== Configuration ====
COMPOSE_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "${COMPOSE_DIR}"

HAPROXY_SOCK=127.0.0.1:9999
LOG_FILE=/tmp/vllm_rotate.log
# Rotate once per day at 03:00 (minute 0, hour 3)
CRON_CMD="0 3 * * * ${COMPOSE_DIR}/vllm_bootstrap_and_rotate.sh >> ${LOG_FILE} 2>&1"
echo "${CRON_CMD}"

# ==== Step 0: Ensure HAProxy and at least one vLLM instance are running ====
echo "==> Ensuring HAProxy + at least one vLLM is running..."

docker compose -f docker-compose.yml -f docker-compose.env-override.yml up -d haproxy

# Check whether the vLLM containers are running
VLLM_1_RUNNING=$(docker ps --filter "name=vllm_1" --filter "status=running" | grep -q vllm_1 && echo 1 || echo 0)
VLLM_2_RUNNING=$(docker ps --filter "name=vllm_2" --filter "status=running" | grep -q vllm_2 && echo 1 || echo 0)

if [[ "$VLLM_1_RUNNING" == "0" && "$VLLM_2_RUNNING" == "0" ]]; then
  echo "==> No vLLM running, starting vllm_1..."
  docker compose -f docker-compose.yml -f docker-compose.env-override.yml up -d vllm_1
else
  echo "==> At least one vLLM already running, skipping initial start"
fi

# ==== Step 0.5: Wait for the HAProxy admin socket to come up ====
echo "==> Waiting for HAProxy socket..."
for i in {1..20}; do
  if echo "show info" | socat -t 2 stdio TCP:${HAPROXY_SOCK} >/dev/null 2>&1; then
    break
  fi
  sleep 1
done

if ! echo "show info" | socat -t 2 stdio TCP:${HAPROXY_SOCK} >/dev/null 2>&1; then
  echo "[ERROR] HAProxy socket not ready"
  exit 1
fi

# ==== Step 1: Determine which vLLM instance is old and which is new ====
if docker ps --filter "name=vllm_1" --filter "status=running" | grep -q vllm_1; then
  OLD=vllm_1
  NEW=vllm_2
else
  OLD=vllm_2
  NEW=vllm_1
fi

METRIC_PORT_OLD=$([ "$OLD" == "vllm_1" ] && echo 8008 || echo 8009)
METRIC_PORT_NEW=$([ "$NEW" == "vllm_1" ] && echo 8008 || echo 8009)

# ==== Step 2: Start the new vLLM instance ====
echo "==> Starting new vLLM: ${NEW}"
docker compose -f docker-compose.yml -f docker-compose.env-override.yml up -d ${NEW}

# ==== Step 3: Wait for the new vLLM instance to become healthy ====
echo "==> Waiting for ${NEW} to be healthy..."
until curl -sf http://127.0.0.1:${METRIC_PORT_NEW}/health > /dev/null; do
  sleep 5
done

# ==== Step 4: Enable the new backend ====
echo "==> Enabling ${NEW} in HAProxy..."
echo "enable server vllm_backend/${NEW}" | socat stdio TCP:${HAPROXY_SOCK}
sleep 2

# ==== Step 5: Disable the old backend ====
echo "==> Disabling ${OLD} in HAProxy..."
echo "disable server vllm_backend/${OLD}" | socat stdio TCP:${HAPROXY_SOCK}

# ==== Step 6: Wait for the old vLLM instance to drain ====
echo "==> Waiting for old vLLM to drain..."
while true; do
  RUNNING=$(curl -s http://127.0.0.1:${METRIC_PORT_OLD}/metrics \
    | grep '^vllm:num_requests_running' | awk '{print $2}')
  WAITING=$(curl -s http://127.0.0.1:${METRIC_PORT_OLD}/metrics \
    | grep '^vllm:num_requests_waiting' | awk '{print $2}')

  if [[ "${RUNNING}" == "0.0" && "${WAITING}" == "0.0" ]]; then
    break
  fi
  sleep 5
done

# ==== Step 7: Stop the old vLLM instance ====
echo "==> Stopping old vLLM: ${OLD}"
docker stop ${OLD}

echo "==> Rotation complete ✅"

# Read the current crontab
CURRENT_CRON=$(crontab -l 2>/dev/null || true)

echo "=== Current crontab ==="
echo "$CURRENT_CRON"
echo "======================="

# Register the cron job if it is not already present
if ! echo "$CURRENT_CRON" | grep -F -q "${COMPOSE_DIR}/vllm_bootstrap_and_rotate.sh"; then
  echo "==> Cron not found, registering..."
  # Keep the existing cron entries and append the new one
  (echo "$CURRENT_CRON"; echo "$CRON_CMD") | crontab -
  echo "Cron registered:"
  crontab -l
else
  echo "==> Cron already registered, skipping"
fi
```
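Step 6 of the script declares the old instance drained when both Prometheus-style gauges read `0.0`. That parsing logic can be exercised on canned `/metrics` text (the sample values below are illustrative):

```shell
# Decide whether an instance is drained, given the text of its /metrics endpoint,
# using the same grep/awk extraction as the rotation script.
is_drained() {
  local metrics="$1" running waiting
  running=$(printf '%s\n' "$metrics" | grep '^vllm:num_requests_running' | awk '{print $2}')
  waiting=$(printf '%s\n' "$metrics" | grep '^vllm:num_requests_waiting' | awk '{print $2}')
  [[ "$running" == "0.0" && "$waiting" == "0.0" ]]
}

IDLE='vllm:num_requests_running 0.0
vllm:num_requests_waiting 0.0'
BUSY='vllm:num_requests_running 3.0
vllm:num_requests_waiting 1.0'

is_drained "$IDLE" && echo "idle: drained"
is_drained "$BUSY" || echo "busy: still serving"
```

Because the old backend was already disabled in HAProxy at Step 5, no new requests arrive while the script polls, so the counters can only fall toward zero.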
