Align vectors on disk for optimal performance on ARM CPUs #15341
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
I wrote a small JMH benchmark to "pad" float vectors on disk with some leading bytes and measure scoring throughput at each offset. My machine uses the 256-bit variant of Panama to score vectors, so I saw optimal performance when floats are aligned to 32 bytes -- but keeping it 64 here as the max case.
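For context, here is a minimal sketch of the padding arithmetic being discussed (the class and method names are my own illustration, not the Lucene API): how many zero bytes must be written so that the next value starts on an alignment boundary.

```java
// Hypothetical sketch (names not from Lucene) of the arithmetic behind
// padding a file pointer to an alignment boundary, mirroring the idea
// behind IndexOutput.alignFilePointer.
public class AlignmentPad {
    // Distance to the next multiple of `alignment` (0 if already aligned).
    public static long padBytes(long filePointer, int alignment) {
        return (alignment - (filePointer % alignment)) % alignment;
    }

    public static long align(long filePointer, int alignment) {
        return filePointer + padBytes(filePointer, alignment);
    }

    public static void main(String[] args) {
        // A pointer at 100 needs 28 pad bytes to reach the next 64-byte boundary.
        System.out.println(padBytes(100, 64)); // prints 28
        System.out.println(align(100, 64));    // prints 128
        System.out.println(padBytes(128, 64)); // prints 0
    }
}
```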
cc @mikemccand who found this^ byte-misalignment possibility offline!
Also noting that for byte vectors, I saw no impact of padding -- so I'm not changing their alignment in this PR.
Wow, alignment still matters, and it matters a lot (24 -> 33 ops/us)! Thank you @kaivalnp for testing. Was this an ARM CPU? It's frustrating how the CPU just silently runs slower ... but what else could it do. I wonder whether modern x86-64 (Intel, AMD) CPUs also show this effect. I'll test this PR on the nightly Lucene benchy box.
Yes, it was a Graviton3.
I tested on beast3 (the nightly benchmarking box). I applied this PR, built it, and ran the benchmarks. Net/net it seems like alignment of the mapped in-RAM (virtual address space) doesn't matter? I also tested a newer CPU (Raptor Lake) -- I'll post that shortly.
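A small arithmetic sketch (my illustration, not from the PR) of why the on-disk alignment carries over to the mapped virtual address at all: mmap bases are page-aligned (typically 4096 bytes), so a mapped byte's virtual address is congruent to its file offset modulo the page size, and any alignment that divides the page size is preserved.

```java
// Illustration with hypothetical values: mmap returns page-aligned base
// addresses, so virtualAddress % 64 == fileOffset % 64 whenever 64 divides
// the page size.
public class MmapAlignment {
    static final long PAGE_SIZE = 4096; // a typical page size; an assumption here

    // Virtual address of the byte at `fileOffset`, given a page-aligned base.
    static long virtualAddress(long pageAlignedBase, long fileOffset) {
        return pageAlignedBase + fileOffset;
    }

    public static void main(String[] args) {
        long base = 140_000_000_000L / PAGE_SIZE * PAGE_SIZE; // some page-aligned base
        long fileOffset = 64 * 123;                           // a 64-byte-aligned file offset
        long addr = virtualAddress(base, fileOffset);
        // Because base % 4096 == 0 and 64 divides 4096, the 64-alignment survives.
        System.out.println(addr % 64); // prints 0
    }
}
```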
The Raptor Lake box is an i9-13900K. Results: there might be some small mis-alignment penalty for float SIMD?
From the CPU's optimization guide:
Thanks @mikemccand, there doesn't seem to be any performance penalty on beast3 (the nightly benchmarking box) -- a Ryzen Threadripper 3990X. There's definitely some impact of alignment on the Raptor Lake box (i9-13900K), but it is lower than on my machine (<10%) -- so this alignment issue is mostly on Graviton, or ARM CPUs in general, as @rmuir shared?
I tried running an end-to-end indexing + search benchmark comparing the baseline against this PR (64-byte alignment): indexing was sped up by ~7.6%, while search was sped up by ~3.8%.
I see another action item from this benchmark: I wasn't aligning the output inside the merge function, which is used by HNSW-based vector formats for merging.
Thanks @rmuir. How does the Panama Vector API handle alignment? Does it have methods to allocate aligned on-heap or off-heap vectors? Hmm, it looks like...
Oh good catch! I wonder what other places might write the flat vectors? Is the alignment also (or maybe less) important for the quantized cases? (Your results above are for float vectors?) Maybe at least...
This reverts commit 5764ac8.
Hmm, this did not help for some reason (merge time increased with this PR's 64-byte alignment). I'll still add a commit + revert, so people can see what I tried, and comment if I'm missing something!
I think alignment is less important for quantized vectors (which are stored as byte vectors on disk) -- because none of the JMH benchmarks show non-trivial variation with padding?
Yeah, those benchmarks^ are for float vectors
I added some print statements to complain if non-64-byte-aligned addresses were used. Not committing because it may not be needed after this PR?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!
Sorry for the delay here. I ran benchmarks a few more times offline, and the differences in indexing and merge times were inconsistent across runs. The only constant is the speedup in search time (3-4%). Another recent run with 1M Cohere vectors, 768d, compared the baseline against this PR.
mikemccand
left a comment
Thanks @kaivalnp -- this looks good to me. Thank you for also improving JMH benchy so we can measure these alignment impacts.
I left a few small comments.
I suppose the added (up to 15) bytes won't matter in practice -- the padding is constant per .vec file, so any added bytes are amortized against the often massive .vec files in larger and larger segments as they are merged...
```java
}

private static long alignOutput(IndexOutput output, VectorEncoding encoding) throws IOException {
    return output.alignFilePointer(
```
Oh how nice that we already had a method to alignFilePointer -- I wonder who uses that today?
All usages (except here in the compound format) are in Lucene*VectorsWriter classes :)
That was my invention and was specifically made for CFS files. But keep in mind, we may/should make this configurable in CFS files, too, because if you have a CFS file wrapped around your vectors, it gets unaligned if you require 64 bytes.
The compound file seems to align all files to 8 bytes (Long.BYTES) inside the "blob" -- which works because all other alignFilePointer calls I see use factors of 8 (I only see 4-byte alignment, using Float.BYTES).
we may/should make this configurable in CFS files, too
Would this involve aligning all files to 64 bytes instead? (because we do not retain the intended alignment in the index, so we'd have to pick the LCM of all occurrences of alignFilePointer in the codebase?)
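To illustrate the LCM point numerically (my sketch, not code from the PR): a single CFS-wide alignment preserves every per-file alignment only if it is a common multiple of them, so the smallest safe choice is their least common multiple.

```java
// Sketch: the smallest CFS-wide alignment that preserves all per-file
// alignments is the LCM of those alignments.
public class AlignmentLcm {
    static long gcd(long a, long b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    static long lcm(long a, long b) {
        return a / gcd(a, b) * b;
    }

    public static void main(String[] args) {
        // Alignments requested via alignFilePointer today: 4 (Float.BYTES),
        // 8 (Long.BYTES), plus the 64 proposed in this PR.
        long safe = lcm(lcm(4, 8), 64);
        System.out.println(safe); // prints 64
    }
}
```

Since 4 and 8 both divide 64, aligning everything to 64 happens to be the LCM here, though it pads files that only needed 4 or 8.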
aligning all files to 64 bytes instead?
This is slightly wasteful because most files do not need the high alignment, but at the same time something like an extension-specific alignment (i.e. for .vec files) seems leaky.
```diff
- long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
  switch (fieldData.fieldInfo.getVectorEncoding()) {
+ VectorEncoding encoding = fieldData.fieldInfo.getVectorEncoding();
+ long vectorDataOffset = alignOutput(vectorData, encoding);
```
Hmm, should we not bother aligning if the vectors themselves won't stay aligned? I.e. the vector dimension is not 0 mod 16 or so?
Ah nice catch, we can at least align to 4 bytes (slightly better than no alignment) -- as it was before this PR
I've added a commit to apply 64 byte alignment when dimension % 16 == 0, and 4 byte alignment otherwise.
Looking back, I wonder if we can just keep it to 64 in all cases, for code simplicity? (vectors with arbitrary dimensions will implicitly be 4 byte aligned in this case).
No strong opinions though!
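The conditional rule from that commit could be sketched like this (a hypothetical helper of my own, not the actual patch): 64-byte alignment when the dimension is a multiple of 16 floats, falling back to the pre-existing 4-byte (Float.BYTES) alignment otherwise.

```java
// Hypothetical sketch of the conditional-alignment rule discussed above;
// not the actual Lucene code.
public class ChooseAlignment {
    static int alignmentFor(int dimension) {
        // 16 floats == 64 bytes: only then does every vector start 64-aligned.
        return dimension % 16 == 0 ? 64 : Float.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(alignmentFor(768)); // prints 64 (768 = 16 * 48)
        System.out.println(alignmentFor(300)); // prints 4
    }
}
```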
Yeah OK +1 to just keep it simple -- always align.
```java
    return total;
}

private static long alignOutput(IndexOutput output, VectorEncoding encoding) throws IOException {
```
Can we add a NOTE somewhere in the javadocs (maybe in the ...Format.java?) that this format tries to preserve alignment to a 64-byte boundary, because on at least ARM Neoverse this matters? And that means, to maximize performance, the incoming vectors (if they are float[]) should ideally have 0 mod 16 dimensionality?
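The "0 mod 16 dimensionality" point can be checked numerically (my sketch, not from the PR): vector i starts at base + i * dimension * Float.BYTES, so with a 64-byte-aligned base, every vector stays on a 64-byte boundary exactly when dimension * 4 is a multiple of 64, i.e. dimension % 16 == 0.

```java
// Sketch: with a 64-byte-aligned base, vector i begins at
// base + i * dimension * Float.BYTES; all vectors stay 64-aligned
// iff the per-vector byte length is itself a multiple of 64.
public class VectorOffsets {
    static boolean allVectorsAligned(int dimension, int numVectors) {
        long base = 0; // assume a 64-byte-aligned start, as this PR arranges
        for (int i = 0; i < numVectors; i++) {
            long offset = base + (long) i * dimension * Float.BYTES;
            if (offset % 64 != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allVectorsAligned(768, 10)); // prints true  (768 % 16 == 0)
        System.out.println(allVectorsAligned(300, 10)); // prints false (300 % 16 != 0)
    }
}
```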
Makes sense, added!
i.e. only applied when dimension is a multiple of 16. Also add Javadoc comment about the alignment.
```java
@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
public int size;

@Param({"0", "1", "2", "4", "6", "8", "16", "20", "32", "50", "64", "100", "128", "255", "256"})
```
Q: I added a bunch of values during testing, should we reduce while committing? (say 0, 1, 4, 64)
+1
Also maybe add a comment about why we have this alignment benchy if you didn't already?
Added a small comment!
…ectors". Also add comment about padBytes in VectorScorerBenchmark (used to capture performance impact of byte alignment).


Description
Today, float vectors are aligned to 4 bytes in a Lucene index, but with Panama we can work with up to 512 bits (== 64 bytes, or 16 floats) at the same time.
I wonder if we should change this alignment to 64 bytes, in order to get optimal vector search performance with Panama?