new: BAAI/BGE-M3 support with testing script by lucifertrj · Pull Request #602 · qdrant/fastembed

lucifertrj · 2026-02-05T15:45:52Z

All Submissions:

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New models submission:

Have you added an explanation of why it's important to include this model?
Have you added tests for the new model? Were canonical values for tests computed via the original model?
Have you added the code snippet for how canonical values were computed?
Have you successfully run tests with your changes locally?

New Model: BAAI/bge-m3

MIT-Licensed:

Model Name	Dimension	Sequence Length	Introduction
BAAI/bge-m3	1024	8192	multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised

🔗 Colab Notebook:

Code Snippet for Canonical Values:

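import numpy as np
from fastembed import TextEmbedding

# Assumes a FastEmbed build that includes this PR, so "BAAI/bge-m3" is registered.
embedding_model = TextEmbedding(model_name="BAAI/bge-m3")
docs = ["hello world", "flag embedding"]
embeddings = list(embedding_model.embed(docs))
embeddings = np.stack(embeddings, axis=0)

canonical = np.round(embeddings[0, :5], 4)
print(f"Canonical vector values: {canonical}")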

Output:

Canonical vector values: [-0.0404  0.037  -0.029   0.0161 -0.0357]

Added these values in: tests/test_text_onnx_embeddings.py
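For readers unfamiliar with this kind of test, a canonical-value check typically compares the first few dimensions of the first embedding against the stored values within a tolerance. The sketch below is a minimal, self-contained illustration only: the helper name `matches_canonical`, the tolerance of 1e-3, and the synthetic embedding are assumptions, not the code in tests/test_text_onnx_embeddings.py.

```python
import numpy as np

# Canonical values reported above for BAAI/bge-m3 (first 5 dimensions of "hello world").
CANONICAL = np.array([-0.0404, 0.037, -0.029, 0.0161, -0.0357])


def matches_canonical(embedding: np.ndarray, canonical: np.ndarray, atol: float = 1e-3) -> bool:
    """Return True if the leading dimensions of `embedding` match `canonical` within `atol`."""
    head = embedding[: canonical.shape[0]]
    return bool(np.allclose(head, canonical, atol=atol))


# Synthetic stand-in for a real 1024-dimensional model output (assumption for illustration).
fake_embedding = np.zeros(1024)
fake_embedding[:5] = CANONICAL + 0.0002  # small numerical drift, inside the tolerance

print(matches_canonical(fake_embedding, CANONICAL))   # drift within tolerance
print(matches_canonical(np.zeros(1024), CANONICAL))   # clearly different vector
```

Comparing only a short prefix keeps the stored test data small, at the cost of not noticing problems in the remaining dimensions.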

Bgem3

coderabbitai · 2026-02-05T15:48:24Z

📝 Walkthrough

This pull request adds support for the BAAI/bge-m3 multilingual text embedding model to the FastEmbed library. A new model entry is registered in the supported_onnx_models list with metadata including embedding dimension (1024), model size (2.27 GB), license (MIT), and file paths for the ONNX model and associated resources. A corresponding canonical embedding vector is added to the test suite for verification purposes.
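To make the shape of such a registry entry concrete, here is a rough sketch using a stand-in dataclass. `ModelEntry`, its field names, and the `model_file` value are hypothetical and are not FastEmbed's actual classes or this PR's diff; only the model name, dimension, size, license, and the `model.onnx_data` file name come from this thread.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelEntry:
    """Hypothetical stand-in for a registry entry; not FastEmbed's real class."""

    model: str
    dim: int
    license: str
    size_in_gb: float
    model_file: str
    additional_files: list[str] = field(default_factory=list)


# dim, size and license are taken from the walkthrough above.
# model_file is an assumed name; model.onnx_data is the external data
# file mentioned later in the review.
BGE_M3 = ModelEntry(
    model="BAAI/bge-m3",
    dim=1024,
    license="mit",
    size_in_gb=2.27,
    model_file="model.onnx",  # hypothetical path
    additional_files=["model.onnx_data"],
)

# A registry keyed by model name makes lookups by name straightforward.
registry = {entry.model: entry for entry in [BGE_M3]}
print(registry["BAAI/bge-m3"].dim)
```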

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding support for the BAAI/BGE-M3 model with corresponding test coverage, which aligns perfectly with the changeset.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, providing model details, testing information, canonical values, and code snippets used to compute those values.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


lucifertrj · 2026-02-05T15:49:08Z

@joein Supported Models docs page needs to be updated

mohamad-tohidi · 2026-02-25T10:46:53Z

When will this be merged?

michelkluger · 2026-03-06T13:59:49Z

Sadly, I feel like the library is falling a bit behind in supporting new models.

mohamad-tohidi · 2026-03-06T16:05:54Z

I agree.

JiwaniZakir

The DenseModelDescription in onnx_embedding.py advertises "8192 input tokens truncation" in the description, but there's no corresponding additional_kwargs (e.g., {"max_length": 8192}) or similar field to actually configure the tokenizer's max sequence length. Without this, the tokenizer will fall back to its default, likely 512 tokens, silently discarding the model's long-context capability and making the description misleading. It's worth checking how other long-context models in this registry handle that setting.
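The effect described here can be illustrated without any tokenizer at all. The sketch below is a toy model of truncation, assuming the fallback limit really is 512 as the reviewer suspects; `truncate` and the placeholder token ids are hypothetical and do not reflect how FastEmbed configures its tokenizer.

```python
def truncate(token_ids: list[int], max_length: int) -> list[int]:
    """Keep at most `max_length` tokens, as a truncating tokenizer would."""
    return token_ids[:max_length]


# A hypothetical 3000-token document (token ids are placeholders).
document = list(range(3000))

default_cut = truncate(document, max_length=512)   # assumed fallback limit
long_cut = truncate(document, max_length=8192)     # limit advertised for bge-m3

dropped = len(document) - len(default_cut)
print(len(default_cut), len(long_cut))
print(f"tokens silently dropped at 512: {dropped}")
```

Nothing in the output of the truncated call signals that most of the input was discarded, which is why the mismatch between the description and the configured limit is easy to miss.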

The test entry in test_text_onnx_embeddings.py only checks 5 embedding dimensions, consistent with the rest of the suite, but there's no assertion that the output shape matches the declared dim=1024. Given the model is 2.27 GB and has an unusual external data file (model.onnx_data), a shape check would help catch loading issues early — for instance, if the wrong ONNX graph is loaded or the external data file is missing at inference time.
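A shape assertion of the kind suggested could look roughly like the following. This is a sketch under assumptions: `assert_embedding_shape` is a hypothetical helper, and the zero-filled array stands in for real model output so the example runs without downloading the 2.27 GB model.

```python
import numpy as np


def assert_embedding_shape(embeddings: np.ndarray, n_docs: int, dim: int) -> None:
    """Raise AssertionError unless `embeddings` is an (n_docs, dim) matrix."""
    assert embeddings.shape == (n_docs, dim), (
        f"expected {(n_docs, dim)}, got {embeddings.shape}"
    )


# Stand-in for np.stack(list(model.embed(docs))) with two documents.
stacked = np.zeros((2, 1024))
assert_embedding_shape(stacked, n_docs=2, dim=1024)
print("shape check passed:", stacked.shape)
```

In a real test the array would come from the model under test, so a wrong ONNX graph or an unexpected output head would fail here before the canonical values are even compared.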

lucifertrj added 3 commits February 5, 2026 17:56

bgem3 embed support added

faf0281

CANONICAL_VECTOR_VALUES for bge m3

92e87e9

Merge pull request #1 from lucifertrj/bgem3

7fefc94

Bgem3

JiwaniZakir reviewed Apr 1, 2026

lucifertrj wants to merge 3 commits into qdrant:main from lucifertrj:main


4 participants
