
Conversation

@dyyoungg

Summary

This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both httpserver.manager and the CacheServer to replace sequential, "chatty" operations with a concurrent, batched approach, which significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.

Bottleneck Problem

The original implementation was inefficient due to two primary bottlenecks:

  1. Sequential Client-Side Processing: In httpserver.manager, the I/O work (reading files) and CPU-bound work (computing MD5 sums, create_shm) for each multimodal item were executed one after another.
  2. RPC Overhead: Per-argument serialization in the communication protocol was the dominant cost.
    The original exposed_alloc function signature was alloc(self, md5sum_list: list[str], token_num_list: list[int]). Although it handled lists, rpyc serializes each argument (md5sum_list and token_num_list) independently, which involves significant overhead:
    • rpyc has to traverse and serialize the entire structure of each list argument separately.
    • This per-argument serialization is computationally expensive, and the cost grows with the number of items in the lists (see the sketch below).
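
To illustrate the contrast, here is a minimal client-side sketch; the v2 payload layout shown (a pickled list of tuples) is an assumption, not necessarily the exact wire format used in this PR. Instead of passing Python lists that rpyc walks element by element, the v2 path serializes the whole batch into one opaque bytes blob:

```python
import pickle

# Illustrative inputs; names mirror the original signature, values are made up.
md5sum_list = ["a" * 32, "b" * 32]
token_num_list = [576, 576]

# v1-style call: rpyc traverses and serializes each list argument separately.
# cache_client.root.alloc(md5sum_list, token_num_list)

# v2-style call: pickle the whole batch once and send a single bytes argument,
# so rpyc only has to transport one opaque blob.
request_blob = pickle.dumps(list(zip(md5sum_list, token_num_list)))
# response_blob = cache_client.root.alloc_v2(request_blob)
# records = pickle.loads(response_blob)
```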

Solution (v2 Implementation)

✅ Concurrent Processing

  • The _alloc_multimodal_resources_v2 function now uses a ThreadPoolExecutor to read item data and compute MD5 sums concurrently, making full use of the available CPU cores.
  • Shared memory (SHM) creation is also parallelized with asyncio.gather to speed up resource setup. A sketch of the pattern follows this list.
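
A minimal sketch of the concurrent pattern, assuming hypothetical helpers read_item_data() and create_shm(); these names and the exact call sites differ from the PR's code:

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor


def _read_and_hash(item):
    data = read_item_data(item)                  # I/O-bound: read the raw bytes
    return data, hashlib.md5(data).hexdigest()   # CPU-bound: hash them


async def alloc_resources(items, workers=4):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Read and hash all items concurrently instead of one after another.
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, _read_and_hash, item) for item in items)
        )
    # Create shared-memory segments for all items in parallel as well
    # (create_shm is assumed to be an async helper here).
    await asyncio.gather(
        *(create_shm(item, data) for item, (data, _) in zip(items, results))
    )
    return [md5 for _, md5 in results]
```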

✅ New Batched RPC Interface

  • To eliminate RPC chattiness, new exposed_*_v2 methods have been added to the CacheServer: alloc_v2, release_v2, set_items_data_v2, get_items_data_v2, set_items_embed_v2, and get_items_embed_v2. Each takes a single serialized request and returns a single serialized response; a sketch of the interface follows.
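
What that surface might look like on the rpyc service, sketched with stubbed bodies; the bytes-in/bytes-out typing is inferred from the description rather than copied from the PR:

```python
import rpyc


class CacheServerService(rpyc.Service):
    # Original per-argument interface, kept for fallback.
    def exposed_alloc(self, md5sum_list, token_num_list): ...

    # Batched v2 interface: one serialized request in, one serialized response out.
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_release_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_set_items_data_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_get_items_data_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_set_items_embed_v2(self, request_blob: bytes) -> bytes: ...
    def exposed_get_items_embed_v2(self, request_blob: bytes) -> bytes: ...
```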

✅ Server-Side Batch Handling

  • The CacheServer's new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response, which keeps the server-side logic efficient and cohesive (sketched below).
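
A minimal server-side sketch of one such endpoint, assuming a pickle-encoded list of (md5sum, token_num) tuples and a hypothetical internal allocator _alloc_one(); the PR's actual wire format and allocator calls may differ:

```python
import pickle

import rpyc


class CacheServerService(rpyc.Service):
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One deserialization step for the whole batch instead of letting rpyc
        # traverse each list argument separately.
        items = pickle.loads(bytes(request_blob))
        records = []
        for md5sum, token_num in items:
            # _alloc_one is a hypothetical per-item allocation routine.
            records.append(self._alloc_one(md5sum, token_num))
        # One serialized response covering the whole batch.
        return pickle.dumps(records)
```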

✅ Feature Toggle

  • Added --enable_concurrent_alloc and --concurrent_alloc_workers parameters to control the new concurrent allocation behavior. This allows a gradual rollout and an easy fallback to the original implementation if needed; a sketch of the wiring follows this list.
  • In audioserver/visualserver.manager, get_items_embed now defaults to the v2 implementation to reduce latency.
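
How the toggles could be wired up, as a sketch; the flag names come from this PR, but the defaults and help text here are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable_concurrent_alloc",
    action="store_true",
    help="Use the concurrent/batched v2 allocation path instead of the original one.",
)
parser.add_argument(
    "--concurrent_alloc_workers",
    type=int,
    default=4,
    help="Thread pool size used for reading items and computing MD5 sums.",
)
args = parser.parse_args()
```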

Performance Evaluation

I evaluated performance using images of the same size (644×364) during inference with our internal LLaVA-like model.
Testing Environment:

  • GPU: Single NVIDIA A100
  • CPU: Two Intel Xeon Platinum 8358P @ 2.60GHz (128 logical cores)
  • Parameters: concurrent_alloc_workers=4

The reported values are averages and may fluctuate slightly, but not significantly.

| Image num | cache_client.root.alloc time (ms), origin | cache_client.root.alloc time (ms), optimized | _alloc_multimodal_resources func time (ms), origin | _alloc_multimodal_resources func time (ms), optimized |
| --- | --- | --- | --- | --- |
| 4 | 43.4 | 1.60 | 129.8 | 5.95 |
| 8 | 43.2 | 1.49 | 129.8 | 5.90 |
| 16 | 42.2 | 1.00 | 128.0 | 5.96 |
| 32 | 42.4 | 1.16 | 130.6 | 10.0 |
| 64 | 44.0 | 1.58 | 141.2 | 15.9 |
| 128 | 44.6 | 2.17 | 142.6 | 26.4 |

@dyyoungg dyyoungg marked this pull request as ready for review August 22, 2025 06:29
