# Optimize multimodal resource allocation with concurrency and improved batch RPC #1017
## Summary
This PR introduces a comprehensive performance overhaul of the multimodal resource allocation pipeline. It refactors both the `httpserver.manager` and the server (`CacheServer`) to replace sequential, "chatty" operations with a concurrent, batched approach. This significantly reduces latency and improves throughput, especially for requests with a large number of multimodal items.

## Problem
The original implementation was inefficient due to two primary bottlenecks:

1. **Sequential per-item processing.** In `httpserver.manager`, I/O operations (reading files) and CPU-bound tasks (calculating MD5s, `create_shm`) were executed one after another for each multimodal item.
2. **Chatty RPC serialization.** The original `exposed_alloc` signature was `alloc(self, md5sum_list: list[str], token_num_list: list[int])`. Although it handled lists, rpyc serializes each argument (`md5sum_list` and `token_num_list`) independently, which incurs significant overhead; a sketch of the old call pattern follows.
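To make the overhead concrete, here is a hedged illustration of the original call pattern. The connection details are assumptions; `conn.root.alloc` maps to `exposed_alloc` under rpyc's naming convention.

```python
# Illustration only: the old per-argument call pattern (host/port assumed).
import rpyc

conn = rpyc.connect("localhost", 18861)

md5sum_list = ["d41d8cd98f00b204e9800998ecf8427e"] * 8  # one MD5 per item
token_num_list = [576] * 8                              # token count per item

# By default rpyc passes mutable objects such as lists by reference (netrefs),
# so every element the server touches can become its own network round trip:
# O(N) chatter for N multimodal items.
records = conn.root.alloc(md5sum_list, token_num_list)
```

## Solution (v2 Implementation)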
### ✅ Concurrent Processing

- `httpserver.manager` now uses a `ThreadPoolExecutor` to concurrently read item data and calculate MD5 sums, fully leveraging available CPU cores.
- `asyncio.gather` is used to expedite resource setup (see the sketch after this list).
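A minimal sketch of that pattern, assuming a hypothetical `load_and_hash` helper and per-item file paths; the real manager code is more involved.

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor

def load_and_hash(path: str) -> tuple[bytes, str]:
    # Hypothetical per-item worker: the file read is I/O-bound and the MD5
    # is CPU-bound, so both benefit from leaving the event loop.
    with open(path, "rb") as f:
        data = f.read()
    return data, hashlib.md5(data).hexdigest()

async def prepare_items(paths: list[str], workers: int = 4):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # gather() fans the items out across the pool and preserves order,
        # replacing the old one-after-another loop.
        return await asyncio.gather(
            *(loop.run_in_executor(pool, load_and_hash, p) for p in paths)
        )
```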
### ✅ New Batched RPC Interface

New `exposed_*_v2` methods have been added to the `CacheServer`: `alloc_v2`, `release_v2`, `set_items_data_v2`, `get_items_data_v2`, `set_items_embed_v2`, and `get_items_embed_v2`. Each takes a single serialized request blob instead of per-argument lists; a client-side sketch follows.
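The exact wire format is not shown here, so this sketch assumes the request blob is a pickled list of `(md5sum, token_num)` pairs.

```python
import pickle
import rpyc

conn = rpyc.connect("localhost", 18861)  # assumed host/port

md5sum_list = ["d41d8cd98f00b204e9800998ecf8427e"] * 8
token_num_list = [576] * 8

# bytes are passed by value in rpyc, so the whole batch crosses the wire
# in one transfer instead of one netref access per element.
request_blob = pickle.dumps(list(zip(md5sum_list, token_num_list)))
response_blob = conn.root.alloc_v2(request_blob)
records = pickle.loads(response_blob)
```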
### ✅ Server-Side Batch Handling

The `CacheServer`'s new v2 endpoints deserialize the request blob, process the batch of items internally, and return a single serialized response. This makes the server-side logic more efficient and cohesive; a sketch of the endpoint shape follows.
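A sketch under the same assumed blob format; `_alloc_one` is a hypothetical stand-in for the existing per-item allocation logic.

```python
import pickle
import rpyc

class CacheServer(rpyc.Service):
    def exposed_alloc_v2(self, request_blob: bytes) -> bytes:
        # One deserialize, one in-process batch loop, one serialized reply.
        items = pickle.loads(request_blob)  # [(md5sum, token_num), ...]
        results = [self._alloc_one(md5sum, n) for md5sum, n in items]
        return pickle.dumps(results)

    def _alloc_one(self, md5sum: str, token_num: int):
        # Hypothetical placeholder for the existing per-item allocation.
        raise NotImplementedError
```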
### ✅ Feature Toggle

New `--enable_concurrent_alloc` and `--concurrent_alloc_workers` parameters control the new concurrent allocation behavior. This allows for a gradual rollout and an easy fallback to the original implementation if needed. In `audioserver`/`visualserver.manager`, `get_items_embed` now defaults to the v2 implementation to reduce time.
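The flag names come from this PR; the wiring below is an assumed argparse sketch, and the default of 4 mirrors the evaluation setup rather than a documented default.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable_concurrent_alloc",
    action="store_true",
    help="Opt in to the concurrent, batched v2 allocation path.",
)
parser.add_argument(
    "--concurrent_alloc_workers",
    type=int,
    default=4,  # assumed default; matches the evaluation setup below
    help="Thread pool size for concurrent item reads and MD5 hashing.",
)
```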
## Performance Evaluation

I evaluated the performance using images of the same size (644×364) during inference with our internal LLaVA-like model.
Testing Environment:
- `concurrent_alloc_workers=4`

The reported values are averages and may fluctuate slightly, but not significantly.