new: BAAI/BGE-M3 support with testing script #602
lucifertrj wants to merge 3 commits into qdrant:main
📝 Walkthrough
This pull request adds support for the BAAI/bge-m3 multilingual text embedding model to the FastEmbed library. A new model entry is registered in the …
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks: ✅ 3 passed
@joein Supported Models docs page needs to be updated
When will this merge?
I feel like the library is falling a bit behind in supporting models, sadly.
I agree.
JiwaniZakir left a comment
The DenseModelDescription in onnx_embedding.py advertises "8192 input tokens truncation" in the description, but there's no corresponding additional_kwargs (e.g., {"max_length": 8192}) or similar field to actually configure the tokenizer's max sequence length. Without this, the tokenizer will fall back to its default, likely 512 tokens, silently discarding the model's long-context capability and making the description misleading. It's worth checking how other long-context models in this registry handle that setting.
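A minimal sketch of the consistency check described above, in plain Python with no FastEmbed imports. The helper names (`advertised_max_tokens`, `truncation_is_consistent`) and the 512-token fallback are assumptions for illustration only; the tokenizer's real default should be confirmed against the FastEmbed code.

```python
import re
from typing import Optional

# Assumed fallback when no max length is configured. The comment above only
# says the default is "likely 512", so this value is a guess, not a fact.
ASSUMED_DEFAULT_MAX_LENGTH = 512


def advertised_max_tokens(description: str) -> Optional[int]:
    """Extract the token limit a description advertises, e.g. '8192 input tokens'."""
    match = re.search(r"(\d+)\s+input tokens", description)
    return int(match.group(1)) if match else None


def truncation_is_consistent(
    description: str, configured_max_length: Optional[int] = None
) -> bool:
    """Return True if the effective tokenizer limit covers the advertised one."""
    advertised = advertised_max_tokens(description)
    if advertised is None:
        return True  # nothing advertised, so nothing to contradict
    effective = configured_max_length or ASSUMED_DEFAULT_MAX_LENGTH
    return effective >= advertised
```

With a description containing "8192 input tokens truncation" and no configured length, the check fails under the assumed default. Passing an explicit 8192 makes it pass.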
The test entry in test_text_onnx_embeddings.py only checks 5 embedding dimensions, consistent with the rest of the suite, but there's no assertion that the output shape matches the declared dim=1024. Given the model is 2.27 GB and has an unusual external data file (model.onnx_data), a shape check would help catch loading issues early — for instance, if the wrong ONNX graph is loaded or the external data file is missing at inference time.
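The shape check suggested above could look roughly like the sketch below. `check_embedding_shape` is a hypothetical helper, not part of the existing test suite. It relies only on `len()`, so it works with plain lists as well as the NumPy arrays that `TextEmbedding.embed` yields.

```python
from typing import Iterable, Sequence


def check_embedding_shape(
    embeddings: Iterable[Sequence[float]], expected_dim: int, expected_count: int
) -> None:
    """Raise AssertionError if the count or per-vector dimension is wrong."""
    vectors = list(embeddings)  # embed() returns a generator, so materialize it
    if len(vectors) != expected_count:
        raise AssertionError(
            f"expected {expected_count} embeddings, got {len(vectors)}"
        )
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise AssertionError(
                f"embedding {index} has dim {len(vector)}, expected {expected_dim}"
            )


# Possible usage in a test, assuming the model from this PR is registered
# (this downloads the full model, so it is left commented out here):
#
# from fastembed import TextEmbedding
# model = TextEmbedding(model_name="BAAI/bge-m3")
# check_embedding_shape(model.embed(["hello world"]), expected_dim=1024, expected_count=1)
```

This would complement the existing comparison of the first five values rather than replace it.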

All Submissions:
New models submission:
New Model: BAAI/bge-m3
MIT-Licensed:
🔗 Colab Notebook:
Code Snippet for Canonical Values:
Output:
Added these values in:
tests/test_text_onnx_embeddings.py