Hello Weaviate team!
I wanted to share that Sentence Transformers now supports ONNX and OpenVINO backends. The integration is fairly robust in my opinion: it still allows every kind of pooling (i.e. not just mean pooling like the current ONNX Vectorizer), it can load a specific ONNX file if a repository contains several (e.g. optimized or quantized variants), and it even exports the model automatically if no ONNX (or OpenVINO) file exists yet.
Beyond that, if you're running on CPU, you may also benefit from the excellent OpenVINO int8 quantization options (there's a small sketch of that after the examples below).
There's documentation on these different backends here: https://sbert.net/docs/sentence_transformer/usage/efficiency.html, including benchmarks at the bottom.
But don't be alarmed, it's extremely simple:
# pip install sentence-transformers[onnx]
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    backend="onnx",
)

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
or
# pip install sentence-transformers[openvino]
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    backend="openvino",
)

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
Or, if a repository contains multiple model files, you can specify which one to load with model_kwargs:
# pip install sentence-transformers[onnx], or
# pip install sentence-transformers[onnx-gpu]
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512_vnni.onnx"},
)

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
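And in case you want to produce such quantized files yourself: here's a minimal sketch using the export helpers described in the efficiency docs linked above (export_dynamic_quantized_onnx_model and export_static_quantized_openvino_model). The output path and the exact file suffixes here are illustrative placeholders, so double-check the docs for the details:

# pip install sentence-transformers[onnx] sentence-transformers[openvino]
from sentence_transformers import (
    SentenceTransformer,
    export_dynamic_quantized_onnx_model,
    export_static_quantized_openvino_model,
)
from optimum.intel import OVQuantizationConfig

# Dynamic int8 quantization of the ONNX model; this is how a file like
# onnx/model_qint8_avx512_vnni.onnx is produced. The last argument is where
# the quantized file is saved (a local directory or a Hub repository id).
onnx_model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
export_dynamic_quantized_onnx_model(onnx_model, "avx512_vnni", "path/to/local/model")

# Static int8 quantization of the OpenVINO model (needs nncf via optimum-intel).
ov_model = SentenceTransformer("all-MiniLM-L6-v2", backend="openvino")
export_static_quantized_openvino_model(ov_model, OVQuantizationConfig(), "path/to/local/model")

The exported files can then be loaded with the same model_kwargs={"file_name": ...} pattern as above.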
All good if you're not interested in integrating this feature, but I figured it might have gone unnoticed.
In a future version, these ONNX and OpenVINO backends will also be introduced for CrossEncoder/Reranker models.
Full disclosure: I'm the maintainer of Sentence Transformers.