
Commit fe5ebaa

docs: added assets
1 parent fdf5da2 commit fe5ebaa

5 files changed

Lines changed: 19 additions & 178 deletions


README.md

Lines changed: 19 additions & 32 deletions
@@ -1214,48 +1214,35 @@ For small and mid-size models, the batching engine is fast enough that the HTTP
 
 **What's happening:** When the batching engine generates tokens, it produces them in steps. Each step generates one token per active sequence simultaneously, then the output needs to be: serialized to JSON, wrapped in a Server-Sent Events `data:` frame, written to each open HTTP response stream, and flushed through the OS network stack. For large models generating at 8–30 tok/s, this overhead is negligible. For a model running at 900 tok/s in-process across 32 concurrent streams, each engine step completes in milliseconds — and the HTTP layer starts struggling to keep up with the token emission rate.
 
-**Key metrics (M4 Max, Qwen3.5-2B-6bit):**
+The measured gap on M4 Max with `Qwen3.5-2B-6bit`:
 
-| Metric | Value |
-|--------|-------|
-| In-process ceiling | ~900 tok/s |
-| HTTP measured (SSE) | ~620 tok/s |
-| Theoretical optimised (batched emission, est.) | ~820 tok/s |
+| Mode | Throughput |
+|------|------------|
+| In-process (direct engine call, no HTTP) | ~900 tok/s |
+| HTTP with streaming (`text/event-stream`) | ~600 tok/s |
+| Gap | ~300 tok/s (~33% overhead) |
 
-**Throughput across concurrency levels:**
+![Throughput across modes](assets/throughput-across.png)
 
-| Concurrency | In-process (engine direct) | HTTP current (SSE per-token flush) | HTTP optimised (batched, est.) |
-|-------------|----------------------------|------------------------------------|--------------------------------|
-| 1 user | 62 tok/s | 58 tok/s | 61 tok/s |
-| 4 users | 240 tok/s | 220 tok/s | 235 tok/s |
-| 8 users | 460 tok/s | 390 tok/s | 440 tok/s |
-| 16 users | 750 tok/s | 530 tok/s | 690 tok/s |
-| 32 users | 900 tok/s | 620 tok/s | 820 tok/s |
+![What changes with batching](assets/what-changes-with-batched.png)
 
-**Where the overhead actually lives:** The ~280 tok/s gap is not a single bottleneck — it is a chain of four compounding costs in Python's asyncio event loop, each triggered once per engine step:
+![Where overhead lives](assets/overhead-lives.png)
 
-| Overhead component | Share |
-|--------------------|-------|
-| asyncio event loop scheduling | 35% |
-| JSON serialization | 28% |
-| SSE frame wrapping | 22% |
-| OS network stack flush (TCP) | 15% |
+The ~300 tok/s gap is not lost inference work — the GPU is generating tokens at the same rate regardless. The overhead is purely in Python's asyncio event loop serializing and flushing SSE frames fast enough across 32 simultaneous response streams. At lower concurrency (1–8 users), the gap is much smaller because the per-stream flush rate is lower.
 
-*Estimated cost breakdown per engine step at concurrency 32. Values are illustrative proportions based on known asyncio SSE serialization profiling patterns.*
+**Refined analysis**
 
-**What changes with batched token emission:** Instead of flushing one SSE frame per engine step per stream, the optimised path buffers tokens for ~5–10ms and emits in small bursts. This collapses four per-step costs into one amortised flush cycle, recovering roughly 200–250 tok/s at high concurrency. A 5–10ms burst buffer adds negligible perceived latency while dramatically reducing asyncio overhead:
+The core claim — that Python asyncio becomes the bottleneck before the inference engine does at high concurrency — is technically sound. SSE per-token flushing is genuinely expensive, and vLLM, llama.cpp server, and TGI have all documented this. The phenomenon is real.
 
-| Concurrency | Current per-token flush (added latency) | Batched emission 5–10ms burst |
-|-------------|----------------------------------------|-------------------------------|
-| 1 user | 1 ms | 1 ms |
-| 4 users | 3 ms | 2 ms |
-| 8 users | 6 ms | 3 ms |
-| 16 users | 12 ms | 5 ms |
-| 32 users | 22 ms | 8 ms |
+What needs to be corrected or made more precise: the original says the ~300 tok/s gap is "purely in Python's asyncio event loop serializing and flushing SSE frames." That's incomplete. The overhead is actually a chain of four costs that compound together: JSON serialization of each token delta, wrapping in an SSE `data:` frame, asyncio coroutine scheduling overhead (the GIL becomes a factor with 32 simultaneous response streams), and TCP flush through the OS network stack. Attributing it all to "asyncio event loop" understates the full picture.
 
-**Interactive analysis:** For interactive charts (throughput curves, overhead breakdown, latency impact), open [http_bottleneck_analysis.html](http_bottleneck_analysis.html) in a browser.
+The original also claims the GPU is "generating tokens at the same rate regardless." This is slightly misleading — at very high concurrency, the asyncio backpressure can actually slow down engine step dispatch slightly, because the event loop is busy flushing and isn't ready for the next step. The GPU isn't entirely independent.
 
-The ~300 tok/s gap is not lost inference work — the GPU is generating tokens at the same rate regardless. The overhead is purely in Python's asyncio event loop serializing and flushing SSE frames fast enough across 32 simultaneous response streams. At lower concurrency (1–8 users), the gap is much smaller because the per-stream flush rate is lower.
+**Theoretical optimised ceiling**
+
+If we implement batched token emission (buffering 5–10ms of tokens before flushing rather than one flush per engine step), the estimated recovery is roughly 200–250 tok/s, bringing HTTP throughput to around ~820 tok/s at concurrency 32. You'd never fully close the gap to 900 tok/s because TCP flush overhead and JSON serialization have a hard floor even with batching — but ~90% efficiency is achievable.
+
+The tradeoff is that batched emission adds 5–10ms of perceived latency per burst. At high concurrency that's completely invisible. At single-user latency-sensitive workloads it might be perceptible, which is exactly why speculative decoding (as we recommend) remains the right choice for that case.
 
 **What we're doing about it:** We are working on bypassing the per-token SSE flush cycle for high-throughput scenarios, batching token emissions into small frame bursts rather than flushing once per engine step. This should bring HTTP throughput substantially closer to the in-process ceiling. For now, if you are running a latency-sensitive single-user workload and raw speed matters, speculative decoding is a better fit than continuous batching for that use case.
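The per-token path the diff describes (JSON-serialize each delta, wrap it in an SSE `data:` frame, write and flush per stream) can be sketched as follows. The payload shape here is illustrative, not the server's actual wire schema:

```python
import json

def sse_frame(token: str, index: int) -> bytes:
    """Build one SSE frame for one generated token.

    At concurrency 32 this runs 32 times per engine step, and each
    resulting frame gets its own write + flush on the response stream --
    the per-step cost chain described above.
    """
    # Illustrative delta payload; the real schema may differ.
    payload = json.dumps({"index": index, "delta": {"content": token}})
    # SSE framing: a `data:` line terminated by a blank line.
    return f"data: {payload}\n\n".encode("utf-8")

frame = sse_frame("Hello", 0)
```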
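The batched-emission idea can be sketched as a small buffer that flushes one burst every few milliseconds instead of one frame per token. `write` is a stand-in for the response stream's async write-and-flush, and the frame format is simplified to plain text:

```python
import asyncio

class BurstEmitter:
    """Buffer tokens and flush them as one SSE burst every few ms,
    rather than one write + flush per token per stream."""

    def __init__(self, write, interval: float = 0.008):
        self._write = write        # async callable: bytes -> None (stand-in for the stream)
        self._interval = interval  # burst window, ~5-10 ms
        self._buf: list[str] = []

    def feed(self, token: str) -> None:
        self._buf.append(token)    # cheap append; no per-token syscall or flush

    async def run(self, done: asyncio.Event) -> None:
        # One serialize/frame/flush cycle per burst amortises the
        # per-step costs across every token in the window.
        while not done.is_set() or self._buf:
            await asyncio.sleep(self._interval)
            if self._buf:
                burst, self._buf = "".join(self._buf), []
                await self._write(f"data: {burst}\n\n".encode("utf-8"))
```

The added latency is bounded by the burst window itself, which is why a 5–10 ms buffer stays invisible at high concurrency.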
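A back-of-envelope check on the ~820 tok/s estimate, assuming the emission overhead fully serializes with engine steps and that an ~8 ms window amortises it over roughly 8 tokens per stream. Both assumptions are illustrative, not measurements:

```python
R_engine = 900.0   # tok/s, in-process ceiling (from the table above)
R_http = 600.0     # tok/s, measured with per-token SSE flush

# Implied aggregate emission overhead per token, if it serializes
# with engine steps: 1/600 - 1/900, roughly 0.56 ms.
c = 1.0 / R_http - 1.0 / R_engine

# Amortise that overhead over ~8 tokens per burst (assumed window):
tokens_per_burst = 8
R_batched = 1.0 / (1.0 / R_engine + c / tokens_per_burst)
print(round(R_batched))  # ~847 tok/s, the same ballpark as the ~820 estimate
```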

assets/overhead-lives.png

93.7 KB

assets/throughput-across.png

114 KB
97.4 KB

http_bottleneck_analysis.html

Lines changed: 0 additions & 146 deletions
This file was deleted.
