Commit 154f61d

Michael van den Berg and claude authored and committed

feat: switch default embedding model to BAAI/bge-m3 for multilingual support

bge-large-en-v1.5 is English-only. bge-m3 supports 100+ languages (including German) with the same 1024-dim output — no schema change needed. Also fix prefix logic: bge-m3 is symmetric (no query prefix), unlike the English BGE models, which require a retrieval instruction prefix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1 parent 638de2b

5 files changed: 24 additions & 14 deletions

.env.example (2 additions, 2 deletions)

```diff
@@ -13,9 +13,9 @@ POSTGRES_DB=graphrag
 POSTGRES_USER=graphrag
 
 # Embedding model (HuggingFace model ID)
-# BAAI/bge-large-en-v1.5 → 1024-dim, great quality
+# BAAI/bge-m3 → 1024-dim, great quality
 # Change here AND update vector(1024) in 02_schema.sql if using a different model
-EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
+EMBEDDING_MODEL=BAAI/bge-m3
 EMBEDDING_DEVICE=cuda  # set to "cpu" if no GPU is available
 EMBEDDING_BATCH_SIZE=32
```

README.md (2 additions, 2 deletions)

```diff
@@ -74,13 +74,13 @@ Key variables:
 | Variable | Default | Description |
 |---|---|---|
 | `POSTGRES_PASSWORD` | *(required)* | PostgreSQL password |
-| `EMBEDDING_MODEL` | `BAAI/bge-large-en-v1.5` | HuggingFace model ID |
+| `EMBEDDING_MODEL` | `BAAI/bge-m3` | HuggingFace model ID |
 | `EMBEDDING_DEVICE` | `cuda` | `cuda` or `cpu` |
 | `MCP_PORT` | `8000` | MCP server port |
 
 ## Embedding model
 
-Default: **`BAAI/bge-large-en-v1.5`** (1024-dim, asymmetric retrieval).
+Default: **`BAAI/bge-m3`** (1024-dim, multilingual, 100+ languages including German).
 Optimised for H100/H200/RTX 6000 Ada GPU hardware.
 
 If you change `EMBEDDING_MODEL` to a model with different output dimensions, update `vector(1024)` in `docker/postgres/init/02_schema.sql` accordingly and recreate the database volume.
```

docker/postgres/init/02_schema.sql (1 addition, 1 deletion)

```diff
@@ -24,7 +24,7 @@ CREATE TABLE graphrag.chunks (
     position INTEGER NOT NULL,  -- ordinal within document (0-based)
     content TEXT NOT NULL,      -- raw markdown text of the chunk
     token_count INTEGER,        -- approximate token count
-    -- BAAI/bge-large-en-v1.5 produces 1024-dimensional embeddings.
+    -- BAAI/bge-m3 produces 1024-dimensional embeddings.
     -- If you change EMBEDDING_MODEL to one with different dimensions,
     -- update this column type and recreate the index accordingly.
     embedding vector(1024) NOT NULL,
```
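The model default and the `vector(1024)` column must stay in sync, which the schema comment above can only document, not enforce. A minimal sketch of a startup guard (hypothetical — `check_embedding_dim` and `KNOWN_MODEL_DIMS` are not part of this commit) could fail fast on a mismatch; both models named in the diff emit 1024-dim vectors:

```python
# Hypothetical fail-fast guard for the model/schema dimension coupling.
# Both models in this commit emit 1024-dim vectors, matching vector(1024)
# in 02_schema.sql, so the default configuration passes.
SCHEMA_DIM = 1024  # must mirror the vector(1024) column type

KNOWN_MODEL_DIMS = {
    "BAAI/bge-m3": 1024,
    "BAAI/bge-large-en-v1.5": 1024,
}

def check_embedding_dim(model_name: str, schema_dim: int = SCHEMA_DIM) -> None:
    """Raise before any writes if a known model disagrees with the schema."""
    dim = KNOWN_MODEL_DIMS.get(model_name)
    if dim is not None and dim != schema_dim:
        raise ValueError(
            f"{model_name} emits {dim}-dim vectors but chunks.embedding is "
            f"vector({schema_dim}); update 02_schema.sql and recreate the "
            "database volume."
        )

check_embedding_dim("BAAI/bge-m3")  # passes: 1024 == 1024
```

Unknown models fall through silently here; a stricter variant could instead query the loaded model's output dimension at startup.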

src/graphrag/config.py (2 additions, 2 deletions)

```diff
@@ -11,10 +11,10 @@ class Settings(BaseSettings):
     postgres_password: str  # required — no default
 
     # ── Embeddings ────────────────────────────────────────────────────────────
-    # BAAI/bge-large-en-v1.5 → 1024-dim, high quality, asymmetric retrieval
+    # BAAI/bge-m3 → 1024-dim, high quality, symmetric retrieval (no query prefix)
     # If you change this to a model with different output dimensions, you must
     # also update the vector(1024) column in 02_schema.sql and recreate the DB.
-    embedding_model: str = "BAAI/bge-large-en-v1.5"
+    embedding_model: str = "BAAI/bge-m3"
     embedding_device: str = "cuda"  # "cpu" for CPU-only environments
     embedding_batch_size: int = 32
```

src/graphrag/embeddings/embedder.py (17 additions, 7 deletions)

```diff
@@ -2,12 +2,16 @@
 
 BGE asymmetric retrieval
 ────────────────────────
-BAAI/bge-large-en-v1.5 uses *asymmetric* embeddings for retrieval:
+BGE English models (bge-large-en-v1.5, bge-base-en-v1.5, …) use asymmetric
+embeddings for retrieval:
 - Document chunks are embedded as-is (no prefix).
 - Queries must be prefixed with the instruction string below.
 
-Skipping the query prefix significantly degrades recall. The prefix is applied
-automatically in ``embed_query()``. Never use ``embed()`` for query strings.
+BGE-M3 (multilingual, 100+ languages incl. German) does NOT use a query
+prefix — both queries and passages are embedded identically.
+
+The correct prefix behaviour is selected automatically in ``embed_query()``
+based on the model name. Never call ``embed()`` with raw query strings.
 """
 
 from __future__ import annotations
@@ -24,7 +28,7 @@
 class Embedder:
     def __init__(
         self,
-        model_name: str = "BAAI/bge-large-en-v1.5",
+        model_name: str = "BAAI/bge-m3",
         device: str = "cuda",
         batch_size: int = 32,
     ) -> None:
@@ -52,7 +56,7 @@ def embed(self, texts: list[str]) -> list[list[float]]:
         return [v.tolist() for v in vectors]
 
     def embed_query(self, text: str) -> list[float]:
-        """Embed a query string, applying the BGE retrieval prefix."""
+        """Embed a query string, applying a retrieval prefix where required."""
        prefixed = _bge_prefix(self._model_name, text)
        return self.embed([prefixed])[0]
 
@@ -62,7 +66,13 @@ def dimensions(self) -> int:
 
 
 def _bge_prefix(model_name: str, text: str) -> str:
-    """Apply query prefix for BGE-family models; pass through for others."""
-    if "bge" in model_name.lower():
+    """Apply query prefix where the model requires it.
+
+    - BGE English models (bge-large-en, bge-base-en, …): need the prefix.
+    - BGE-M3 (multilingual): symmetric — no prefix for either queries or docs.
+    - All other models: passed through unchanged.
+    """
+    name = model_name.lower()
+    if "bge" in name and "m3" not in name:
         return _BGE_QUERY_PREFIX + text
     return text
```
