How to not weigh part of a vector? #148
Replies: 2 comments
I think the main issue is your distance function: text embeddings are generally trained with cosine similarity, so cosine or inner product should work as expected and show meaningful semantic similarities. I believe that also explains why your solution works well with exact document matches but produces false similarities for cases without them. When you use cosine or inner product, the easiest solution to your first issue is using a zero vector for the part of the query embedding that corresponds to the document.
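To make the suggestion concrete, here is a minimal numpy sketch (toy 4-d vectors standing in for real 1024-d embeddings, all values hypothetical). It shows that under the inner product a zeroed title slice contributes nothing to the score, whereas under squared L2 an "empty" slice is not neutral:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-d stand-ins for real 1024-d embeddings (hypothetical values).
title_emb = rng.normal(size=4)
content_emb = rng.normal(size=4)
doc_vec = np.concatenate([title_emb, content_emb])

# Query that mentions no document: zero out the title slice instead of
# embedding an empty string.
query_content = rng.normal(size=4)
query_vec = np.concatenate([np.zeros(4), query_content])

# Inner product: the zeroed slice contributes exactly nothing.
assert np.isclose(doc_vec @ query_vec, content_emb @ query_content)

# Squared L2: the title slice still adds ||title_emb||^2 to the distance,
# so the document's title embedding keeps influencing the ranking.
l2 = np.sum((doc_vec - query_vec) ** 2)
assert np.isclose(
    l2, np.sum(title_emb**2) + np.sum((content_emb - query_content) ** 2)
)
```

The same holds for cosine similarity, since it is an inner product of normalized vectors.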
That seemed to do the trick!
I got here through the recent livestream, where @svonava explained how you concatenate multiple embeddings to boost several criteria. I have a few questions.
For context, I'm building a RAG system that lets users ask questions about various relatively large documents (such as medical guidelines and manuals). For a proof of concept, I want to boost results when the user specifically mentions a document (e.g. "What does the dementia guideline say about..."). To achieve this, I ask an LLM to process the question and return the document name if one is mentioned (in this case "dementia guideline") or nothing. I have also created my vectors as a concatenation of the embedding of the document's original title (e.g. "Guideline on Dementia") and the embedding of the document's contents.
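The concatenation described above can be sketched as follows; `embed` is a hypothetical stand-in for a real embedder (e.g. Mistral's), used only to make the shapes concrete:

```python
import numpy as np

DIM = 1024  # per-property embedding size (Mistral embeddings are 1024-d)

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real text embedder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)

def document_vector(title: str, contents: str) -> np.ndarray:
    # One slice per property: title embedding, then contents embedding.
    return np.concatenate([embed(title), embed(contents)])

vec = document_vector("Guideline on Dementia", "contents of the guideline")
assert vec.shape == (2 * DIM,)
```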
Issue
The issue I'm running into is the following: the proof of concept works well when an actual document is referenced, in other words when my search query contains a vector for the title part. However, when that part is empty (the embedding of an empty string), I notice the retrieved documents are strongly biased toward certain sources. My suspicion is that the 'empty' query embedding is not actually neutral, but is in fact 'closer' to some results than to others. How do you go about 'disabling' a certain property in your queries?
Number of dimensions
Another, more general question: if you keep adding properties with this concatenation method, the number of dimensions keeps growing, right? My embeddings are 1024-d, so in my proof of concept with two properties I'm already at 2048-d. What does adding more properties do to performance, and is there a practical limit?
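Not an answer from the thread, but one common way to keep each property's influence under control as the concatenation grows is to L2-normalize every slice and scale it by a weight; the inner product of two such concatenations then decomposes into a weighted sum of per-property cosine similarities. A sketch under that assumption (the weighting scheme itself is not from this discussion):

```python
import numpy as np

def weighted_concat(parts, weights):
    """L2-normalize each property slice, then scale it by its weight.

    With this layout, the inner product of two concatenations equals a
    weighted sum of per-property cosine similarities, so adding more
    properties doesn't let any single slice dominate by magnitude.
    """
    out = []
    for p, w in zip(parts, weights):
        p = np.asarray(p, dtype=float)
        n = np.linalg.norm(p)
        out.append(w * (p / n) if n > 0 else p)  # zero slices stay zero
    return np.concatenate(out)
```

For example, with weights `[1.0, 2.0]` applied on both the query and document side, the title property contributes its cosine similarity once and the second property contributes it four times over.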
Practically: I'm using a default ChromaDB setup (squared L2 distance). For embedding I'm using the Mistral embedder, which produces 1024-dimensional vectors. (I'm not using Superlinked yet.)
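For reference, ChromaDB's default HNSW distance is squared L2; a collection can be switched to cosine (or inner product) at creation time via the `hnsw:space` metadata key. The collection name below is hypothetical:

```python
import chromadb

client = chromadb.Client()  # in-memory client for a quick test
# Default space is "l2" (squared L2); with "cosine" or "ip", a zeroed
# query slice contributes nothing to the score.
collection = client.create_collection(
    name="guidelines",  # hypothetical collection name
    metadata={"hnsw:space": "cosine"},
)
```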