Conversation
To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.
Contributor: @kl-thamm if you want us to be able to merge your PR, you need to agree to the CLA.
Author: @antas-marcin Thanks! I agree to the CLA.
StefanBogdan approved these changes on Jun 29, 2023.
I had an issue with the t2v-transformers today:
I created embeddings with a sentence-transformers model in two ways: once using the sentence-transformers Python library and once using the t2v-transformers container.
The cosine distance between the resulting vectors was up to 0.16.
@antas-marcin quickly helped me by suggesting setting `T2V_TRANSFORMERS_DIRECT_TOKENIZE=true`. This reduced the cosine distance to almost 0.
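For reference, a minimal sketch of that comparison (the model name, the input text, and the container's local port and `/vectors` endpoint are assumptions, not taken from this PR):

```python
# Minimal sketch: compare an embedding from the sentence-transformers library
# with one from a locally running t2v-transformers container.
# Assumptions: the all-MiniLM-L6-v2 model, a container on port 8080, and a
# /vectors endpoint that accepts {"text": ...} and returns {"vector": [...]}.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

TEXT = "Weaviate is a vector database. It stores both objects and vectors."
MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model

# 1) Embed with the sentence-transformers library directly.
lib_vec = SentenceTransformer(MODEL).encode(TEXT)

# 2) Embed via the t2v-transformers container.
resp = requests.post("http://localhost:8080/vectors", json={"text": TEXT})
container_vec = np.array(resp.json()["vector"])

# Cosine distance = 1 - cosine similarity.
cos_sim = float(np.dot(lib_vec, container_vec)) / (
    np.linalg.norm(lib_vec) * np.linalg.norm(container_vec)
)
print(f"cosine distance: {1.0 - cos_sim:.4f}")
```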
When looking into what it does, I noticed two things:
1. The name `direct_tokenize` is misleading.
2. The behavior behind this setting is not documented.
Regarding 1:
In the context of this program, "tokenize" means splitting the input into sentences and using the transformers tokenizer.
I suggest changing `direct_tokenize` to `shall_split_in_sentences` or something similar. Actually, `shall_embed_sentence_per_sentence` might be even more precise, but that is a bit verbose. Other suggestions are very welcome; this is just the general idea. The environment variable would therefore become `T2V_SHALL_SPLIT_IN_SENTENCES` (see the commit).
Regarding 2:
This setting seems important to me and should be documented somewhere.
I don't know how to suggest edits for the documentation, so I am writing down what I think would be helpful here:
Environment Settings
`T2V_SHALL_SPLIT_IN_SENTENCES`: If not set, defaults to `true`. If set to `false`, the raw input is used.
By default, all t2v-transformers split the input into sentences using nltk with English punctuation and calculate the mean over the sentence embeddings. This makes it possible to embed inputs of arbitrary length, but it will produce unexpected results if your text does not have the expected punctuation.
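To make that concrete, here is a rough sketch of the described split-and-average behavior; this is not the container's actual code, and the model name is an assumption:

```python
# Rough sketch of the default behavior: split the input into sentences with
# nltk, embed each sentence, and average the resulting vectors.
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer

nltk.download("punkt", quiet=True)  # English sentence-splitting data
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed model

def vectorize(text: str, split_in_sentences: bool = True) -> np.ndarray:
    if split_in_sentences:
        sentences = nltk.sent_tokenize(text)  # relies on English punctuation
        return model.encode(sentences).mean(axis=0)  # mean over sentence vectors
    # Raw input: the whole text goes through the tokenizer at once and may be
    # truncated at the model's maximum sequence length.
    return model.encode(text)
```

With `split_in_sentences=True`, no single forward pass ever exceeds the model's sequence limit, which is how arbitrary-length input is supported.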
Embedding on a per-sentence level could, at least theoretically, degrade the embedding model's quality if the model produces better results with longer inputs.
(Also, could this be significantly slower, processing sentence by sentence rather than one larger input at once?)