JMTEB v2.0 is a major update to the Japanese Massive Text Embedding Benchmark that integrates with the MTEB (Massive Text Embedding Benchmark) framework.
JMTEB v2.0 provides:
- 🌐 MTEB Compatibility: Integration with MTEB tools, leaderboards, and ecosystem
- 🚀 MTEB as Evaluation Engine: Leverage MTEB's robust framework
- 🎯 Simpler API: Cleaner, more intuitive interface
- 📊 28 Japanese Datasets: Comprehensive evaluation across 5 task types
- ⚡ High Performance: Efficient caching and batch processing
> [!IMPORTANT]
> The leaderboard is now hosted on the MTEB Leaderboard (General Purpose → Language-specific → Japanese). We no longer maintain a separate leaderboard in this repository.
From source (Poetry):

```bash
# Clone the repository
git clone https://github.com/sbintuitions/JMTEB.git
cd JMTEB

# Default (v2.0 with OpenAI support)
poetry install

# With v1.x support
poetry install --extras v1

# With everything
poetry install --all-extras
```

> [!NOTE]
> The package is not yet available on PyPI. Please install from source using the commands above.
```python
from jmteb.v2 import JMTEBModel, JMTEBV2Evaluator
from jmteb.v2.tasks import get_jmteb_tasks, get_jmteb_lite_benchmark

# Create model
model = JMTEBModel.from_sentence_transformer("cl-nagoya/ruri-v3-30m")

# Get datasets - Full JMTEB
tasks = get_jmteb_tasks()  # All 28 JMTEB datasets

# Or use JMTEB-lite for faster evaluation
# lite_benchmark = get_jmteb_lite_benchmark()
# tasks = lite_benchmark.tasks

# Evaluate
evaluator = JMTEBV2Evaluator(
    model=model,
    tasks=tasks,
    save_path="results_v2",
)
results = evaluator.run()
```

```bash
# Evaluate all datasets
python -m jmteb.v2 --model_name cl-nagoya/ruri-v3-30m --save_path results_v2

# Evaluate specific datasets
python -m jmteb.v2 \
    --model_name cl-nagoya/ruri-v3-30m \
    --include JSTS JSICK JaqketRetrieval \
    --save_path results_v2

# Use prompts (e.g., for Ruri-v3 models)
python -m jmteb.v2 \
    --model_name cl-nagoya/ruri-v3-30m \
    --prompt_profile src/jmteb/configs/prompts/ruri-v3.yaml \
    --save_path results_v2
```

```
src/jmteb/v2/
├── __init__.py     # Main exports
├── __main__.py     # CLI entry point
├── adapters.py     # Model adapter (JMTEBModel)
├── evaluator.py    # Evaluation orchestrator
├── tasks.py        # Task definitions and utilities
└── utils.py        # Helper functions
```
Adapter that bridges models with MTEB's evaluation system.

```python
from jmteb.v2 import JMTEBModel

# From HuggingFace via SentenceTransformer
model = JMTEBModel.from_sentence_transformer("cl-nagoya/ruri-v3-30m")

# From MTEB's unified model interface (recommended)
model = JMTEBModel.from_mteb("cl-nagoya/ruri-v3-30m")
```

Orchestrates evaluation across multiple tasks.
```python
from jmteb.v2 import JMTEBV2Evaluator

evaluator = JMTEBV2Evaluator(
    model=model,
    tasks=tasks,
    save_path="results_v2",
    batch_size=32,
    task_batch_sizes={"JSTS": 128},  # Per-task overrides
    overwrite_cache=False,
    generate_summary=True,
)
results = evaluator.run()
```

Functions for working with JMTEB tasks.
```python
from jmteb.v2.tasks import (
    get_jmteb_benchmark,
    get_jmteb_tasks,
    get_task_by_name,
    get_task_category,
)

# Get all datasets
all_tasks = get_jmteb_tasks()

# Filter by task type
retrieval_tasks = get_jmteb_tasks(task_types=["Retrieval"])

# Get specific datasets
specific_tasks = get_jmteb_tasks(task_names=["JSTS", "JSICK"])

# Get full benchmark
benchmark = get_jmteb_benchmark()
```

JMTEB v2.0 includes 28 datasets across 5 task types.
- JMTEB (Full): Complete benchmark with full corpus sizes
- JMTEB-lite: Lightweight version with reduced corpus sizes for faster evaluation (~5x faster with high correlation to full JMTEB)
Both versions include the same 28 datasets:
- AmazonReviewsClassification
- AmazonCounterfactualClassification
- MassiveIntentClassification
- MassiveScenarioClassification
- JapaneseSentimentClassification
- SIB200Classification
- WRIMEClassification
- LivedoorNewsClustering.v2
- MewsC16JaClustering
- SIB200ClusteringS2S
- JSTS
- JSICK
- JaqketRetrieval (→ JaqketRetrievalLite in JMTEB-lite)
- MrTidyRetrieval (→ MrTyDiJaRetrievalLite in JMTEB-lite)
- JaGovFaqsRetrieval
- NLPJournalTitleAbsRetrieval.V2
- NLPJournalTitleIntroRetrieval.V2
- NLPJournalAbsIntroRetrieval.V2
- NLPJournalAbsArticleRetrieval.V2
- JaCWIRRetrieval (→ JaCWIRRetrievalLite in JMTEB-lite)
- MIRACLRetrieval (→ MIRACLJaRetrievalLite in JMTEB-lite)
- MintakaRetrieval
- MultiLongDocRetrieval
- ESCIReranking
- JQaRAReranking (→ JQaRARerankingLite in JMTEB-lite)
- JaCWIRReranking (→ JaCWIRRerankingLite in JMTEB-lite)
- MIRACLReranking
- MultiLongDocReranking
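The full-to-lite substitutions marked in the list above can be captured as a plain lookup table. A sketch in plain Python (the `FULL_TO_LITE` dict and `to_lite` helper are ours, derived from the list; they are not part of the jmteb API):

```python
# Full JMTEB dataset names mapped to their JMTEB-lite counterparts,
# as annotated in the dataset list above.
FULL_TO_LITE = {
    "JaqketRetrieval": "JaqketRetrievalLite",
    "MrTidyRetrieval": "MrTyDiJaRetrievalLite",
    "JaCWIRRetrieval": "JaCWIRRetrievalLite",
    "MIRACLRetrieval": "MIRACLJaRetrievalLite",
    "JQaRAReranking": "JQaRARerankingLite",
    "JaCWIRReranking": "JaCWIRRerankingLite",
}


def to_lite(task_name: str) -> str:
    """Return the JMTEB-lite name for a task; datasets without a
    lite variant are shared unchanged between the two benchmarks."""
    return FULL_TO_LITE.get(task_name, task_name)
```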
Configure prompts for models that require them (e.g., E5):

```yaml
# src/jmteb/configs/prompts/e5.yaml
query: "query: "
document: "passage: "
```

```python
from jmteb.v2.utils import load_prompts

prompts = load_prompts("src/jmteb/configs/prompts/e5.yaml")
model = JMTEBModel.from_sentence_transformer(
    "intfloat/multilingual-e5-base",
    prompts=prompts,
)
```

Different tasks have varying memory requirements (e.g., retrieval tasks with long documents need smaller batches to avoid OOM), so per-task batch size configuration helps optimize performance.
```yaml
# batch_sizes.yaml
JSTS: 128
JSICK: 128
JaqketRetrieval: 32
MIRACLRetrieval: 16
MultiLongDocRetrieval: 8
```

```python
from jmteb.v2.utils import load_batch_sizes

batch_sizes = load_batch_sizes("batch_sizes.yaml")
evaluator = JMTEBV2Evaluator(
    model=model,
    tasks=tasks,
    task_batch_sizes=batch_sizes,
)
```

Results are automatically cached to avoid re-evaluation:
```python
# First run: evaluates all tasks
evaluator.run()

# Second run: loads from cache
evaluator.run()  # Instant!

# Force re-evaluation
evaluator = JMTEBV2Evaluator(
    model=model,
    tasks=tasks,
    overwrite_cache=True,
)
evaluator.run()
```

Automatically generates summary.json with main scores:
```json
{
  "Classification": {
    "amazon_review_classification": {
      "main_metric": "accuracy",
      "main_score": 67.32,
      "eval_time (s)": "12.34"
    }
  },
  "STS": {
    "jsts": {
      "main_metric": "cosine_spearman",
      "main_score": 82.14,
      "eval_time (s)": "5.67"
    }
  }
}
```

Real-time progress updates during evaluation:
```
[1/28] Task: JSTS (batch_size=128)
--------------------------------------------------------------------------------
✓ Completed: JSTS (time: 5.67s)
  → Updated summary: STS/jsts = 82.14 (time: 5.67s)

[2/28] Task: JSICK (batch_size=128)
--------------------------------------------------------------------------------
✓ Loaded from cache: JSICK
  → Updated summary: STS/jsick = 76.89 (cached)
```
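The summary.json described above is easy to post-process once evaluation finishes. A minimal sketch in plain Python (the `summary` dict and `category_averages` helper are illustrative, mirroring the format shown earlier; neither is part of the jmteb API):

```python
# Illustrative summary in the format shown earlier.
summary = {
    "Classification": {
        "amazon_review_classification": {"main_metric": "accuracy", "main_score": 67.32},
    },
    "STS": {
        "jsts": {"main_metric": "cosine_spearman", "main_score": 82.14},
        "jsick": {"main_metric": "cosine_spearman", "main_score": 76.89},
    },
}


def category_averages(summary: dict) -> dict[str, float]:
    """Average the main scores of all datasets within each task category."""
    return {
        category: sum(r["main_score"] for r in results.values()) / len(results)
        for category, results in summary.items()
    }


print(category_averages(summary))
```

In practice you would load the dict with `json.load()` from `results_v2/<model_name>/summary.json` instead of constructing it inline.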
Implement your own model by following the encode interface:

```python
import numpy as np

from jmteb.v2 import JMTEBModel


class CustomModel:
    def encode(self, sentences: list[str], batch_size: int = 32, **kwargs) -> np.ndarray:
        # Your encoding logic here
        embeddings = your_model.encode(sentences)
        return np.array(embeddings)


# Wrap for JMTEB v2
model = JMTEBModel(sentence_transformer=CustomModel())
```

Evaluate subsets of datasets:
```python
from jmteb.v2.tasks import get_jmteb_tasks

# Only datasets from the STS task
sts_tasks = get_jmteb_tasks(task_types=["STS"])

# Specific retrieval datasets
retrieval_tasks = get_jmteb_tasks(
    task_names=["JaqketRetrieval", "MIRACLRetrieval"]
)

# All classification datasets
classification_tasks = get_jmteb_tasks(task_types=["Classification"])
```

Use FP16/BF16 for faster evaluation:
```python
import torch

model = JMTEBModel.from_sentence_transformer(
    "cl-nagoya/ruri-v3-30m",
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```

Or via CLI:

```bash
python -m jmteb.v2 \
    --model_name cl-nagoya/ruri-v3-30m \
    --bf16 true \
    --save_path results_v2
```

| Feature | v1.x | v2.0 |
|---|---|---|
| Evaluation Engine | Custom | MTEB |
| Dataset Count | 28 | 28 |
| Batch Configuration | Per-embedder | Global + Per-task |
| Prompt Support | Limited | Full |
| MTEB Compatibility | No | Yes |
```
results_v2/
└── model_name/
    ├── JSTS.json             # Individual dataset results
    ├── JSICK.json
    ├── JaqketRetrieval.json
    ├── ...
    └── summary.json          # Aggregated summary
```

Each dataset result file contains:

```json
{
  "test": [
    {
      "main_score": 0.8214,
      "metric1": value1,
      "metric2": value2,
      ...
    }
  ]
}
```

See MIGRATION_V2.md for a comprehensive migration guide.
Quick comparison:

v1.x:

```python
from jmteb.embedders import SentenceBertEmbedder

embedder = SentenceBertEmbedder(model_name_or_path="cl-nagoya/ruri-base")
# ... manual evaluator setup
```

v2.0:

```python
from jmteb.v2 import JMTEBModel, JMTEBV2Evaluator
from jmteb.v2.tasks import get_jmteb_tasks

model = JMTEBModel.from_sentence_transformer("cl-nagoya/ruri-v3-30m")
tasks = get_jmteb_tasks()
evaluator = JMTEBV2Evaluator(model=model, tasks=tasks)
evaluator.run()
```

- Use appropriate batch sizes: Larger models need smaller batches
- Enable caching: Don't overwrite unless necessary
- Use mixed precision: BF16 can significantly speed up evaluation
- Filter datasets: Evaluate only what you need for faster iteration
- Per-dataset batch sizes: Optimize for memory requirements
See the docs/examples/v2/ directory for complete examples:
- `v2_basic_evaluation.py`: Basic usage
- `v2_cli_evaluation.sh`: CLI examples
- `v2_jmteb_lite.py`: JMTEB-lite usage
- `v2_mteb_model_loader.py`: Using MTEB's model loader
For v1.x examples, see docs/examples/v1/.
- Python >= 3.10
- PyTorch >= 2.0
- MTEB >= 1.22.0
- sentence-transformers >= 5.0
- Other dependencies in `pyproject.toml`
If you use JMTEB v2.0, please cite both JMTEB and MTEB:

```bibtex
@inproceedings{li2026jmteb,
  author    = {Li, Shengzhe and Ohagi, Masaya and Ri, Ryokan and Fukuchi, Akihiko and Shibata, Tomohide and Kawahara, Daisuke},
  title     = {{JMTEB and JMTEB-lite: Japanese Massive Text Embedding Benchmark and Its Lightweight Version}},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference},
  month     = may,
  year      = {2026},
  address   = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association},
  note      = {to appear},
}

@article{muennighoff2022mteb,
  author  = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
  title   = {MTEB: Massive Text Embedding Benchmark},
  journal = {arXiv preprint arXiv:2210.07316},
  year    = {2022},
  url     = {https://arxiv.org/abs/2210.07316},
  doi     = {10.48550/ARXIV.2210.07316},
}
```

Same as JMTEB v1.x. See the LICENSE file.
Contributions are welcome! Please:
- Follow existing code style
- Add tests for new features
- Update documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See the `docs/` directory
- Migration Help: See `docs/MIGRATION_V2.md`
- MTEB team for the excellent evaluation framework
- All contributors to JMTEB v1.x
- Japanese NLP community for dataset contributions