Add half-float (FP16) storage support for vectors #15549
Conversation
@Pulkitg64 the latency is the main concern IMO. We must copy the vectors onto the heap (we know this is expensive) and transform the bytes to floats. I wonder if all the cost is spent just decoding the vector? What does a flame graph tell you? Also, could you indicate your JVM version, etc.? See this interesting JEP update on the ever-incubating Vector API.
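A minimal sketch of the read path being described here, assuming a `MemorySegment`-backed vector store; the method and variable names are illustrative, not from the PR:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical decode step: copy the raw FP16 shorts onto the heap, then
// widen each half-float to FP32 before the similarity computation. Both
// steps (the copy and the per-element conversion) add cost on every read.
static float[] copyAndWiden(MemorySegment offHeapVector, int dimension) {
  short[] halfFloats = offHeapVector.toArray(ValueLayout.JAVA_SHORT); // heap copy
  float[] floats = new float[dimension];
  for (int i = 0; i < dimension; i++) {
    floats[i] = Float.float16ToFloat(halfFloats[i]); // FP16 -> FP32 (Java 20+)
  }
  return floats;
}
```

A flame graph should show whether the copy, the conversion loop, or the downstream dot product dominates.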
@Pulkitg64 also, thank you for doing an initial pass and benchmarking; it's important data :D. I wonder if we want a true element type vs. a new format? The element type has indeed expanded its various uses, but for many of them, Float16 isn't that much different than float (e.g. you still likely query and index with full-precision floats). This is just an idea. I am not 100% sold either way. Looking for discussion.
You need https://bugs.openjdk.org/browse/JDK-8370691 for this one to be performant.
Just look at the numbers on the PR: they benchmark the cosine and the dot product. Maybe try it out with the branch from that OpenJDK PR.
Thanks @benwtrent and @rmuir for the quick responses. Let me try to gather some more data to confirm whether the conversion is driving the regression.
Trying now.
Stop converting. Use the native fp16 type (and vector type); otherwise the code will be slow.
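Released JDKs do not yet expose lanewise FP16 arithmetic in the Vector API (presumably what the JDK-8370691 issue linked above addresses), so a fully conversion-free scorer cannot be sketched against a shipping API. As an interim illustration only, and not code from the PR, a fused scalar loop at least avoids materializing an intermediate `float[]` per vector:

```java
// Fused FP16 dot product: decode lazily inside the loop instead of first
// building a full float[] copy of each vector. Truly skipping the
// conversion would require the incubating Vector API Float16 support.
static float dotProductFp16(short[] a, short[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum += Float.float16ToFloat(a[i]) * Float.float16ToFloat(b[i]);
  }
  return sum;
}
```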

Description
This draft PR explores storing float vectors using 2 bytes (half-float/FP16) instead of 4 bytes (FP32), reducing vector disk usage by approximately 50%. The approach involves storing vectors on disk in half-float format while converting them back to full-float precision for dot-product computations during search and index merge operations. However, this conversion step introduces additional overhead during vector reads, resulting in slower indexing and search performance.
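A minimal sketch of the encode/decode round trip the description outlines, using `Float.floatToFloat16` / `Float.float16ToFloat` (available since Java 20); the helper names and the little-endian byte layout are assumptions, not the PR's actual codec code:

```java
// Encode: narrow each FP32 value to FP16, halving on-disk size (lossy).
static byte[] encodeFp16(float[] vector) {
  byte[] out = new byte[vector.length * Short.BYTES]; // 2 bytes per dimension
  for (int i = 0; i < vector.length; i++) {
    short h = Float.floatToFloat16(vector[i]); // assumed little-endian layout
    out[2 * i] = (byte) h;
    out[2 * i + 1] = (byte) (h >>> 8);
  }
  return out;
}

// Decode: widen FP16 back to FP32 for scoring; this per-read conversion is
// the overhead blamed for the slower search and indexing numbers below.
static float[] decodeFp16(byte[] bytes) {
  float[] vector = new float[bytes.length / Short.BYTES];
  for (int i = 0; i < vector.length; i++) {
    short h = (short) ((bytes[2 * i] & 0xFF) | ((bytes[2 * i + 1] & 0xFF) << 8));
    vector[i] = Float.float16ToFloat(h);
  }
  return vector;
}
```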
This is an early draft to gather community feedback on the viability and direction of this implementation.
TODO: Support for MemorySegmentVectorScorer with half-float vectors is yet to be implemented.
With no quantization, we see roughly a 100% increase in search latency. With 8-bit quantization we see no latency regression, but with 4-bit quantization we see about an 18% regression. Indexing rate drops by 20-25% across all quantization settings.