Model
Orion-14B-Chat-Int4
Description
At the start of a conversation, GPU memory usage is about 9 GB; as the conversation continues, it grows to 13 GB.
When testing 4 concurrent sessions, memory usage grows to 22 GB and shows no sign of stopping. The memory is only released some time after the sessions have completely ended.
This makes it easy to trigger a crash.
Question
- Is there a way to prevent the rapid linear growth of GPU memory usage?
- Is this caused by enabling the cache policy? Can system memory (RAM) be used instead of GPU memory?
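
For reference, here is a rough sketch of how a key/value cache would grow with conversation length, if the cache is indeed the cause. The layer count, head count, head dimension, and 2-byte cache values below are assumptions for illustration only; they are not read from the Orion-14B-Chat-Int4 configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Assumed (hypothetical) model shape; replace with the values from the real config.
LAYERS, KV_HEADS, HEAD_DIM = 40, 40, 128

for tokens in (1024, 4096, 16384):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens) / 2**30
    print(f"{tokens:>6} cached tokens -> {gib:.2f} GiB")
```

Under these assumptions the estimate is linear in the number of cached tokens, so several long concurrent sessions would add up quickly. This would be consistent with the growth described above, but it does not confirm the cause.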