-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Separate Panama and Vector classes #15285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This PR |
|
Ran some luceneutil benchmarks on Cohere vectors, 768d for various vector similarities x quantization bits:
|
# Conflicts: # lucene/core/src/java25/org/apache/lucene/internal/vectorization/VectorizedVectorUtilSupport.java
|
|
||
| /** A vectorization provider that leverages the Panama Vector API. */ | ||
| final class PanamaVectorizationProvider extends VectorizationProvider { | ||
| final class VectorizedVectorizationProvider extends VectorizationProvider { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just keep the class name the same? The Panama name is correct here. Please don't change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the other classes. Everything which uses incubatoing APIs should keep its name with "Panama" (as it is called "Panama Vectorization" in the JEP).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I don't have strong opinions on this. Changed back to the original Panama* names :)
|
I am not able to do any close review here, so please don't merge this now. |
Maybe we could enhance Lucene's Edit: heh, and some comment about luceneutil's |
Haha true :)
Side note: I found this cool visualizer (https://jmh.morethan.io), which takes the JSON output of JMH (add For example, I re-ran a subset of functions and recorded their output in https://gist.github.com/kaivalnp/0424bd84326aebdecd10f8144fb46c73 Also found this GH action that automatically runs and compares JMH output: https://github.com/benchmark-action/github-action-benchmark, might be interesting to add to Lucene! |
|
@uschindler just wanted to ask, did you get a chance to look at these changes? Thanks! |
|
This change makes a lot of sense -- the FFM part of Panama is done incubating as of Java 22, so we should promote it out of Lucene's mrjar sources? @kaivalnp I think you've addressed all of @uschindler's comments? Can we take this out of draft now? Are there any other parts you still need to do? |
|
Thanks @mikemccand, IIUC Uwe's main concern was to not expose incubating / experimental APIs (i.e. the Vector API) publicly (which was true in this PR) -- I'm not sure what else to look out for.. One other potential issue I see is the renaming of public APIs -- like I'll pull out of draft! |
|
Hi, What's done here is just moving some glue classes to allow panama vector to access memory segments provided by MmapDir and possibly a new MemorySegment backed replacement of Bytebuffersdirectory to the main sources. This is needed and ok to do, but my main issue with this PR is the additional complexity just to achieve this! Why do we need all those additional abstractions with functional interfaces everywhere? I tried to understand this and every time I looked at this PR I gave up after 20 minutes starring at those horrible abstractions with generics. Let's have exactly one interface preferably without generics as glue part between panama vector in separate sourceSet. Please make it simpler or give a full explanation why we need all those extra generics and functional interfaces. The pr adds 400 extra lines of code instead of making it simpler! Another problem which makes reviewing harder is the additional renames of methods. Can we separate that to make it easier to get a glimpse what's going on? Sorry for the delay but it's busy here and that's too.much complexity for a quick review. Maybe @rmuir can also have a look. And finally: we have some benchmarks here, but those are too simple to show how the additional abstractions affect hotspot compiler of executed in real code. A micro benchmark like this has lead to problems suddenly only appearing on Mike's benchmarks. Because if you benchmark a little bit of code with abstractions, hotspot has an easy job to remove the abstractions. But in complex environments during query execution the additional abstractions can kill your performance! So because of this I am really afraid of this PR to go in in that current form. |
|
Thanks @uschindler -- sorry for my ignorance, indeed I see we already promoted |
|
Hi, This comment in issue:
This is an argument, but still I don't see a reason to write code on top of MemorySegment for non-vectorized code. It does not get simpler, because you still need two different implementations: One for MemorySegments and one from RandomAccessInput / IndexInput. The current code duplicates a lot of methods and has variants for (byte[],byte[]), (MemorySegment,byte[]), (MemorySegment,MemorySegment). Maybe for Lucene 11 the better idea would be to implement VectorUtil only with MemorySegment and throw away the byte[] impls. The code should be same speed (if Hotspot works correct). If you have byte[] code you can wrap it as a MemorySegment before calling VectorUtil. Provokative remark: In general, I tend to think that at some point we should throw away ByteBuffer and byte[] everywhere in our code and replace it by MemorySegment. This would also allow us to get rid of certain 31 bit limitations. Of course I am planning to submit a PR to replace the ByteBuffersDirectory by a MemorySegment backed variant! Keep in mind that MemorySegment also works on-heap! This would allow for example the current NRTCachingDirectory to vectorize like MMapDir. P.S.: Sorry if my comment yesterday was a bit harsh reagrding "horrible generics". Uwe |
Addresses #15284