
Conversation

@dyyoungg

Summary

This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both httpserver.manager and the CacheServer to replace sequential, "chatty" operations with a concurrent, batched approach, which significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.

Bottleneck Problem

The original implementation was inefficient due to two primary bottlenecks:

  1. Sequential Client-Side Processing: In httpserver.manager, the I/O work (reading files) and CPU-bound work (computing MD5 sums, create_shm) for each multimodal item were executed one after another.
  2. RPC Overhead: Per-argument serialization in the communication protocol was the dominant cost.
    The original exposed_alloc function signature was alloc(self, md5sum_list: list[str], token_num_list: list[int]). Although it handled lists, rpyc serializes each argument (md5sum_list and token_num_list) independently, which involves significant overhead:
    • rpyc has to traverse and serialize the entire structure of each list argument separately.
    • This per-argument serialization is computationally expensive, and the cost grows with the number of items in the lists (see the sketch below).
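
To illustrate the contrast, here is a minimal client-side sketch; the v2 payload layout shown (a pickled list of tuples) is an assumption, not necessarily the exact wire format used in this PR. Instead of passing Python lists that rpyc walks element by element, the v2 path serializes the whole batch into one opaque bytes blob:

```python
import pickle

# Illustrative inputs; names mirror the original signature, values are made up.
md5sum_list = ["a" * 32, "b" * 32]
token_num_list = [576, 576]

# v1-style call: rpyc traverses and serializes each list argument separately.
# cache_client.root.alloc(md5sum_list, token_num_list)

# v2-style call: pickle the whole batch once and send a single bytes argument,
# so rpyc only has to transport one opaque blob.
request_blob = pickle.dumps(list(zip(md5sum_list, token_num_list)))
# response_blob = cache_client.root.alloc_v2(request_blob)
# records = pickle.loads(response_blob)
```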

Solution (v2 Implementation)

✅ Concurrent Processing

  • The _alloc_multimodal_resources_v2 function now uses a ThreadPoolExecutor to read item data and compute MD5 sums concurrently, making full use of the available CPU cores.
  • Shared memory (SHM) creation is also parallelized with asyncio.gather to speed up resource setup. A sketch of the pattern follows this list.
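
A minimal sketch of the concurrent pattern, assuming hypothetical helpers read_item_data() and create_shm(); these names and the exact call sites differ from the PR's code:

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor


def _read_and_hash(item):
    data = read_item_data(item)                  # I/O-bound: read the raw bytes
    return data, hashlib.md5(data).hexdigest()   # CPU-bound: hash them


async def alloc_resources(items, workers=4):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Read and hash all items concurrently instead of one after another.
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, _read_and_hash, item) for item in items)
        )
    # Create shared-memory segments for all items in parallel as well
    # (create_shm is assumed to be an async helper here).
    await asyncio.gather(
        *(create_shm(item, data) for item, (data, _) in zip(items, results))
    )
    return [md5 for _, md5 in results]
```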

✅ New Batched RPC Interface

  • To eliminate RPC chattiness, new exposed_*_v2 methods have been added to the CacheServer: alloc_v2, release_v2, set_items_data_v2, get_items_data_v2, set_items_embed_v2, and get_items_embed_v2. Each takes a single serialized request and returns a single serialized response; a sketch of the interface follows.
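
What that surface might look like on the rpyc service, sketched with stubbed bodies; the bytes-in/bytes-out typing is inferred from the description rather than copied from the PR:

```python
import rpyc


class CacheServerService(rpyc.Service):
    # Original per-argument interface, kept for fallback.
    def exposed_alloc(self, md5sum_list, token_num_list): ...

    # Batched v2 interface: one serialized request in, one serialized response out.
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_release_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_set_items_data_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_get_items_data_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_set_items_embed_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_get_items_embed_v2(self, request_blob: bytes) -> bytes: ...
```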

✅ Server-Side Batch Handling

  • The CacheServer's new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response, which keeps the server-side logic efficient and cohesive (sketched below).
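
A minimal server-side sketch of one such endpoint, assuming a pickle-encoded list of (md5sum, token_num) tuples and a hypothetical internal allocator _alloc_one(); the PR's actual wire format and allocator calls may differ:

```python
import pickle

import rpyc


class CacheServerService(rpyc.Service):
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One deserialization step for the whole batch instead of letting rpyc
        # traverse each list argument separately.
        items = pickle.loads(bytes(request_blob))
        records = []
        for md5sum, token_num in items:
            # _alloc_one is a hypothetical per-item allocation routine.
            records.append(self._alloc_one(md5sum, token_num))
        # One serialized response covering the whole batch.
        return pickle.dumps(records)
```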

✅ Feature Toggle

  • Added --enable_concurrent_alloc and --concurrent_alloc_workers parameters to control the new concurrent allocation behavior. This allows a gradual rollout and an easy fallback to the original implementation if needed; a sketch of the wiring follows this list.
  • In audioserver/visualserver.manager, get_items_embed now defaults to the v2 implementation to reduce latency.
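
How the toggles could be wired up, as a sketch; the flag names come from this PR, but the defaults and help text here are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable_concurrent_alloc",
    action="store_true",
    help="Use the concurrent/batched v2 allocation path instead of the original one.",
)
parser.add_argument(
    "--concurrent_alloc_workers",
    type=int,
    default=4,
    help="Thread pool size used for reading items and computing MD5 sums.",
)
args = parser.parse_args()
```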

Performance Evaluation

I evaluated performance using images of the same size (644×364) during inference with our internal LLaVA-like model.
Testing Environment:

  • GPU: Single NVIDIA A100
  • CPU: Two Intel Xeon Platinum 8358P @ 2.60GHz (128 logical cores)
  • Parameters: concurrent_alloc_workers=4

The reported values are averages and may fluctuate slightly, but not significantly.

| Image num | cache_client.root.alloc time (ms), origin | cache_client.root.alloc time (ms), optimized | _alloc_multimodal_resources func time (ms), origin | _alloc_multimodal_resources func time (ms), optimized |
| --- | --- | --- | --- | --- |
| 4 | 43.4 | 1.60 | 129.8 | 5.95 |
| 8 | 43.2 | 1.49 | 129.8 | 5.90 |
| 16 | 42.2 | 1.00 | 128.0 | 5.96 |
| 32 | 42.4 | 1.16 | 130.6 | 10.0 |
| 64 | 44.0 | 1.58 | 141.2 | 15.9 |
| 128 | 44.6 | 2.17 | 142.6 | 26.4 |

@dyyoungg dyyoungg marked this pull request as ready for review August 22, 2025 06:29
