
Conversation

@Pulkitg64 (Contributor) commented Jan 5, 2026

Description

This draft PR explores storing float vectors using 2 bytes (half-float/FP16) instead of 4 bytes (FP32), reducing vector disk usage by approximately 50%. The approach involves storing vectors on disk in half-float format while converting them back to full-float precision for dot-product computations during search and index merge operations. However, this conversion step introduces additional overhead during vector reads, resulting in slower indexing and search performance.
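For context, here is a minimal sketch of the round trip described above, using the `Float.floatToFloat16` / `Float.float16ToFloat` conversions available since JDK 20. This is illustrative only, not the PR's actual code; the class and helper names are made up.

```java
// Illustrative sketch only: a float[] vector is persisted as one short (2 bytes)
// per dimension, then widened back to float[] before dot-product scoring.
import java.nio.ByteBuffer;

final class HalfFloatCodec {

  /** Encode a float32 vector as IEEE 754 binary16 bits: 2 bytes per dimension. */
  static byte[] encode(float[] vector) {
    ByteBuffer out = ByteBuffer.allocate(vector.length * Short.BYTES);
    for (float v : vector) {
      out.putShort(Float.floatToFloat16(v)); // lossy narrowing to FP16
    }
    return out.array();
  }

  /** Decode back to float32 for scoring; this is the extra read-time cost. */
  static float[] decode(byte[] bytes, int dimension) {
    ByteBuffer in = ByteBuffer.wrap(bytes);
    float[] vector = new float[dimension];
    for (int i = 0; i < dimension; i++) {
      vector[i] = Float.float16ToFloat(in.getShort());
    }
    return vector;
  }
}
```

The decode loop runs on every vector read during search and merge, which is where the overhead described above comes from.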

This is an early draft to gather community feedback on the viability and direction of this implementation.

TODO: Support for MemorySegmentVectorScorer with half-float vectors is not yet implemented.

Benchmark Results:

With no quantization, we are seeing around a 100% increase in latency. For 8-bit quantization, we are not seeing a latency regression, but for 4-bit we are seeing about an 18% latency regression. We are also seeing a 20-25% drop in indexing rate across all quantization levels.

| Encoding | recall | latency (ms) | quantized | index time (s) | index docs/s | index size (MB) | vec disk (MB) | vec RAM (MB) |
|---|---|---|---|---|---|---|---|---|
| float16 | 0.991 | 11.392 | no | 34.8 | 2873.81 | 206.22 | 390.625 | 390.625 |
| float16 | 0.981 | 4.337 | 8 bits | 41.55 | 2406.97 | 305.4 | 294.495 | 99.182 |
| float16 | 0.926 | 6.069 | 4 bits | 42.07 | 2376.93 | 256.58 | 245.667 | 50.354 |
| float32 | 0.991 | 4.942 | no | 28.93 | 3456.38 | 401.53 | 390.625 | 390.625 |
| float32 | 0.981 | 4.367 | 8 bits | 32.04 | 3121.49 | 500.71 | 489.807 | 99.182 |
| float32 | 0.926 | 5.343 | 4 bits | 32.12 | 3113.33 | 451.91 | 440.979 | 50.354 |

@benwtrent (Member)

@Pulkitg64 the latency is the main concern IMO. We must copy the vectors onto heap (we know this is expensive), transform the bytes to float32 (which is an additional cost), then do the float32 panama vector actions (which are super fast). I would expect this to also impact quantization query time for anything that must rescore (though, likely less of an impact as that would be fewer vectors to decode).
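To make those three costs concrete, here is a hedged sketch of the per-vector hot path being described (assumed shapes only, not Lucene's actual scorer code): the FP16 bits have already been copied onto heap as a `short[]`, are widened to `float[]`, and are then fed to a Panama float32 dot product.

```java
// Sketch of the query-time pipeline described above; names are illustrative.
// Requires --add-modules jdk.incubator.vector.
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class Fp16ScoringSketch {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  /** Cost 2: widen FP16 bits (already copied onto heap as short[]) to float32. */
  static float[] widen(short[] fp16Bits) {
    float[] out = new float[fp16Bits.length];
    for (int i = 0; i < fp16Bits.length; i++) {
      out[i] = Float.float16ToFloat(fp16Bits[i]);
    }
    return out;
  }

  /** Cost 3: the fast part, a Panama float32 dot product. */
  static float dotProduct(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    int bound = SPECIES.loopBound(a.length);
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc); // lane-wise a * b + acc
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) { // scalar tail
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

The widening loop in the middle is the step the float32 path does not have.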

I wonder if all the cost is spent just decoding the vector? What does a flame graph tell you?

Also, could you indicate your JVM, etc.?

See this interesting jep update on the ever incubating vector API:

https://openjdk.org/jeps/508

> Addition, subtraction, division, multiplication, square root, and fused multiply/add operations on Float16 values are now auto-vectorized on supporting x64 CPUs.
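As a rough illustration of the kind of loop that statement refers to, here is a speculative sketch against the incubating `jdk.incubator.vector.Float16` class. The method names follow the current incubator API and may change, and whether C2 actually vectorizes a particular loop (especially a reduction like this) depends on the JIT and CPU.

```java
// Speculative sketch against the incubating jdk.incubator.vector.Float16 API;
// exact method names may differ. Requires --add-modules jdk.incubator.vector.
import jdk.incubator.vector.Float16;

final class Float16DotSketch {
  /** Dot product kept entirely in FP16 arithmetic (no widening to float32). */
  static float dotProduct(short[] aBits, short[] bBits) {
    Float16 acc = Float16.valueOf(0.0f);
    for (int i = 0; i < aBits.length; i++) {
      Float16 a = Float16.shortBitsToFloat16(aBits[i]);
      Float16 b = Float16.shortBitsToFloat16(bBits[i]);
      acc = Float16.fma(a, b, acc); // a * b + acc, rounded once in binary16
    }
    return acc.floatValue();
  }
}
```

Note that accumulating in binary16 rather than float32 also changes rounding behavior, which is its own trade-off.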

@benwtrent (Member)

@Pulkitg64 also, thank you for doing an initial pass and benchmarking, it's important data :D

I wonder if we want a true element type vs. a new format?

The element type has indeed expanded its various uses, but for many of them, Float16 isn't that much different from float (e.g. you still likely query & index with float[], still use FloatVectorValues, etc.). The only difference is the on-disk representation (which... seems like a format thing).

This is just an idea. I am not 100% sold either way. Looking for discussion.

@rmuir (Member) commented Jan 5, 2026

You need https://bugs.openjdk.org/browse/JDK-8370691 for this one to be performant.

@rmuir (Member) commented Jan 5, 2026

Just look at the numbers on the PR. They benchmark the cosine and the dot product. Maybe try it out with the branch from that OpenJDK PR.

Code in o.a.l.internal.vectorization will be needed that takes advantage of the new Float16Vector or whatever the name ends up being. I would try to keep it looking as close to the existing 32-bit float code as possible.

@Pulkitg64 (Contributor, Author)

Thanks @benwtrent, @rmuir for such quick responses.

Let me try to gather some more data to confirm if the conversion is driving the regression.

> Just look at the numbers on the PR. They benchmark the cosine and the dot product. Maybe try it out with the branch from that OpenJDK PR.
>
> Code in o.a.l.internal.vectorization will be needed that takes advantage of the new Float16Vector or whatever the name ends up being. I would try to keep it looking as close to the existing 32-bit float code as possible.

Trying now

@Pulkitg64 (Contributor, Author) commented Jan 7, 2026

Here is the profiler output difference between the float16 and float32 benchmark runs with no quantization. Based on the comparison below, it is clear that the additional latency in the float16 run comes from reading the float16 vectors.

[Screenshot: profiler output comparison of the float16 and float32 runs, 2026-01-07]

> Also, could you indicate your JVM, etc.?

I am running these tests on an x86 machine with JDK 25:

java --version
openjdk 25.0.1 2025-10-21 LTS
OpenJDK Runtime Environment Corretto-25.0.1.9.1 (build 25.0.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-25.0.1.9.1 (build 25.0.1+9-LTS, mixed mode, sharing)

@rmuir (Member) commented Jan 7, 2026

Stop converting. Use the native fp16 type (and vector type), otherwise the code will be slow.
