|
| 1 | +# Feature Design Documentation |
| 2 | + |
| 3 | +## Why disaggregated-prefill? |
| 4 | + |
| 5 | +This feature addresses the need to optimize the **Time Per Output Token (TPOT)** and **Time To First Token (TTFT)** in large-scale inference tasks. The motivation is two-fold: |
| 6 | + |
| 7 | +1. **Adjusting Parallel Strategy and Instance Count for P and D Nodes** |
| 8 | + Using the disaggregated-prefill strategy, this feature allows the system to flexibly adjust the parallelization strategy (e.g., data parallelism (dp), tensor parallelism (tp), and expert parallelism (ep)) and the instance count for both P (Prefiller) and D (Decoder) nodes. This leads to better system performance tuning, particularly for **TTFT** and **TPOT**. |
| 9 | + |
| 10 | +2. **Optimizing TPOT** |
| 11 | + Without the disaggregated-prefill strategy, prefill tasks are inserted during decoding, which delays in-flight decode steps and makes **TPOT** hard to control. Disaggregated-prefill solves this by running prefill and decode on separate nodes. This also removes the need to find an optimal chunk size for chunked prefill and gives more reliable control over the time taken to generate each output token. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## Usage |
| 16 | + |
| 17 | +vLLM Ascend currently supports two types of connectors for handling KV cache management: |
| 18 | +- **MooncakeConnector**: D nodes pull KV cache from P nodes. |
| 19 | +- **MooncakeLayerwiseConnector**: P nodes push KV cache to D nodes layer by layer. |
| 20 | + |
| 21 | +For step-by-step deployment and configuration, refer to the following guide: |
| 22 | +[https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html](https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_node_pd_disaggregation_mooncake.html) |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## How It Works |
| 27 | + |
| 28 | +### 1. Design Approach |
| 29 | + |
| 30 | +Under disaggregated-prefill, a global proxy receives external requests and forwards prefill work to P nodes and decode work to D nodes. The KV cache (key-value cache) is exchanged between P and D nodes via peer-to-peer (P2P) communication. |
| 31 | + |
| 32 | +### 2. Implementation Design |
| 33 | + |
| 34 | +#### Mooncake Connector: |
| 35 | + |
| 36 | +1. The request is sent to the Proxy’s `_handle_completions` endpoint. |
| 37 | +2. The Proxy calls `select_prefiller` to choose a P node and forwards the request, configuring `kv_transfer_params` with `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`. |
| 38 | +3. After the P node’s scheduler finishes prefill, `update_from_output` invokes the scheduler-side connector’s `request_finished` to defer KV cache release, constructs `kv_transfer_params` with `do_remote_prefill=True`, and returns them to the Proxy. |
| 39 | +4. The Proxy calls `select_decoder` to choose a D node and forwards the request. |
| 40 | +5. On the D node, the scheduler marks the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, calls `kv_connector_no_forward` to pull the remote KV cache, then notifies the P node to release KV cache and proceeds with decoding to return the result. |
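The proxy side of steps 2 and 4 can be sketched roughly as follows. This is a minimal illustration, not the project's proxy code: the helper names `build_prefill_request` and `build_decode_request` are hypothetical, and placing `max_tokens` and `min_tokens` at the top level of the request body (rather than inside `kv_transfer_params`) is an assumption.

```python
import copy


def build_prefill_request(request: dict) -> dict:
    """Hypothetical helper: shape a client request for a P node (step 2)."""
    req = copy.deepcopy(request)  # never mutate the client's original request
    # Limit the P node to a single token so it effectively only runs prefill.
    req["max_tokens"] = 1
    req["min_tokens"] = 1
    # Tell the P node that decoding will happen on a remote D node.
    req["kv_transfer_params"] = {"do_remote_decode": True}
    return req


def build_decode_request(request: dict, prefill_response: dict) -> dict:
    """Hypothetical helper: shape the request forwarded to a D node (step 4)."""
    req = copy.deepcopy(request)  # keeps the client's original max_tokens
    # Forward whatever the P node returned (it sets do_remote_prefill=True).
    params = prefill_response.get("kv_transfer_params")
    if params:
        req["kv_transfer_params"] = params
    return req
```

Copying the request before modifying it keeps the client's original sampling limits available for the decode phase.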
| 41 | + |
| 42 | + |
| 43 | +#### Mooncake Layerwise Connector: |
| 44 | + |
| 45 | +1. The request is sent to the Proxy’s `_handle_completions` endpoint. |
| 46 | +2. The Proxy calls `select_decoder` to choose a D node and forwards the request, configuring `kv_transfer_params` with `do_remote_prefill=True` and setting the `metaserver` endpoint. |
| 47 | +3. On the D node, the scheduler uses `kv_transfer_params` to mark the request as `RequestStatus.WAITING_FOR_REMOTE_KVS`, pre-allocates KV cache, then calls `kv_connector_no_forward` to send a request to the metaserver and waits for the KV cache transfer to complete. |
| 48 | +4. The Proxy’s `metaserver` endpoint receives the request, calls `select_prefiller` to choose a P node, and forwards it with `kv_transfer_params` set to `do_remote_decode=True`, `max_tokens=1`, and `min_tokens=1`. |
| 49 | +5. During processing, the P node’s scheduler pushes the KV cache layer by layer; once all layers have been pushed, it releases the request and notifies the D node to begin decoding. |
| 50 | +6. The D node performs decoding and returns the result. |
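The ordering in step 5 (push every layer, then notify) can be shown with a toy sketch. The function `push_kv_layerwise` and its callback parameters are hypothetical stand-ins for the real transfer and notification paths, and layers are modeled as plain `bytes` purely for illustration.

```python
from typing import Callable, Sequence


def push_kv_layerwise(
    layers: Sequence[bytes],
    send_layer: Callable[[int, bytes], None],
    notify_decoder: Callable[[], None],
) -> None:
    """Toy model of step 5: push each layer's KV cache, then notify the D node."""
    for idx, layer in enumerate(layers):
        # Each layer is sent on its own rather than as one bulk transfer.
        send_layer(idx, layer)
    # Only after the final layer has been pushed may the D node start decoding.
    notify_decoder()
```

The point of the sketch is only the ordering guarantee: the notification is never issued before the last layer has been sent.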
| 51 | + |
| 52 | + |
| 53 | +### 3. Interface Design |
| 54 | + |
| 55 | +Taking MooncakeConnector as an example, the system is organized into three primary classes: |
| 56 | +- **MooncakeConnector**: Base class that provides core interfaces. |
| 57 | +- **MooncakeConnectorSchedule**: Interface for scheduling the connectors within the engine core, responsible for managing KV cache transfer requirements and completion. |
| 58 | +- **MooncakeConnectorWorker**: Interface for managing KV cache registration and transfer in worker processes. |
| 59 | + |
| 60 | +### 4. Specifications Design |
| 61 | + |
| 62 | +This feature is flexible and supports various configurations, including setups with MLA and GQA models. It is compatible with A2 and A3 hardware configurations and facilitates scenarios involving both equal and unequal TP setups across multiple P and D nodes. |
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +## DFX Analysis |
| 67 | + |
| 68 | +### 1. Config Parameter Validation |
| 69 | + |
| 70 | +Validate the KV transfer config by checking whether the `kv_connector` type is supported and whether `kv_connector_module_path` exists and is loadable. On transfer failures, emit clear error logs for diagnostics. |
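A check of this kind might look like the sketch below. It is an assumption-laden illustration: `validate_kv_transfer_config` and `SUPPORTED_CONNECTORS` are hypothetical names, the supported set simply mirrors the two connectors listed under Usage, and treating `kv_connector_module_path` as an importable dotted module path is an assumption.

```python
import importlib
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Assumption: mirrors the two connectors named in the Usage section.
SUPPORTED_CONNECTORS = {"MooncakeConnector", "MooncakeLayerwiseConnector"}


def validate_kv_transfer_config(
    kv_connector: str, kv_connector_module_path: Optional[str] = None
) -> None:
    """Hypothetical validator: raise ValueError on an unusable KV transfer config."""
    if kv_connector not in SUPPORTED_CONNECTORS:
        logger.error("Unsupported kv_connector: %s", kv_connector)
        raise ValueError(f"unsupported kv_connector: {kv_connector!r}")
    if kv_connector_module_path is not None:
        try:
            # Importing the module proves it both exists and is loadable.
            importlib.import_module(kv_connector_module_path)
        except ImportError as exc:
            logger.error("Cannot load %s: %s", kv_connector_module_path, exc)
            raise ValueError(
                f"cannot load kv_connector_module_path: {kv_connector_module_path!r}"
            ) from exc
```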
| 71 | + |
| 72 | +### 2. Port Conflict Detection |
| 73 | + |
| 74 | +Before startup, check each configured port (e.g., `rpc_port`, `metrics_port`, `http_port`/metaserver) by attempting to bind to it. If a port is already in use, fail fast and log an error. |
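The bind-based check described above could be implemented along these lines. This is a minimal sketch using only the standard `socket` module; the function names `port_is_free` and `check_ports` are hypothetical, and the default host is an assumption.

```python
import logging
import socket
from typing import Mapping

logger = logging.getLogger(__name__)


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def check_ports(ports: Mapping[str, int], host: str = "0.0.0.0") -> None:
    """Fail fast if any configured port is already in use."""
    busy = {name: port for name, port in ports.items() if not port_is_free(port, host)}
    if busy:
        logger.error("Ports already in use: %s", busy)
        raise RuntimeError(f"ports already in use: {busy}")
```

A bind probe is inherently racy, since another process may take the port between the check and the real bind, so it serves as an early diagnostic rather than a guarantee.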
| 75 | + |
| 76 | +### 3. PD Ratio Validation |
| 77 | + |
| 78 | +In non-symmetric PD scenarios, validate the P-to-D TP ratio against the expected ratio and scheduling constraints to ensure correct and reliable operation. |
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +## Limitations |
| 83 | + |
| 84 | +- Heterogeneous P and D nodes are not supported—for example, running P nodes on A2 and D nodes on A3. |
| 85 | + |
| 86 | +- In non-symmetric TP configurations, only cases where the P nodes have a higher TP degree than the D nodes and the P TP count is an integer multiple of the D TP count are supported (i.e., P_tp > D_tp and P_tp % D_tp = 0). |
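The TP constraint above translates directly into a small check. The function name `validate_tp_ratio` is hypothetical; the rule itself (equal TP, or `P_tp > D_tp` with `P_tp % D_tp == 0`) is taken from this document.

```python
def validate_tp_ratio(p_tp: int, d_tp: int) -> int:
    """Return the P-to-D TP ratio, or raise ValueError if the pair is unsupported."""
    if p_tp <= 0 or d_tp <= 0:
        raise ValueError("TP sizes must be positive integers")
    if p_tp == d_tp:
        return 1  # symmetric setups are supported
    # Non-symmetric: P must be larger and an integer multiple of D.
    if p_tp < d_tp or p_tp % d_tp != 0:
        raise ValueError(
            f"unsupported TP pair: P_tp={p_tp}, D_tp={d_tp} "
            "(need P_tp > D_tp and P_tp % D_tp == 0)"
        )
    return p_tp // d_tp
```

For example, `validate_tp_ratio(8, 4)` returns `2`, while `validate_tp_ratio(4, 8)` and `validate_tp_ratio(6, 4)` raise `ValueError`.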