A comprehensive benchmark system for evaluating LLM translation quality using a 3-stage workflow:
- Translation - Generate translations from multiple models
- Comparison - Use LLM-as-judge to compare translation pairs
- Dashboard - Compute Elo ratings and visualize results
This benchmark translates Japanese light novel samples using different LLM models and ranks them based on translation quality.
┌─────────────────────────────────────────────────────────────┐
│ Stage 1: Translation │
│ Tool: LinguaGacha (External) │
│ Input: samples/*.txt │
│ Output: raw_translations/*.txt │
│ Action: Use utils/wrap_lingua_gacha.py to format output │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 2: Comparison │
│ Script: compare_translations.sh │
│ Input: translated_results/*.translated.txt │
│ Output: comparison_results/*.json │
│ Purpose: LLM-as-judge pairwise comparison (win/loss/tie) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Stage 3: Dashboard │
│ Script: serve_dashboard.py │
│ Input: comparison_results/*.json │
│ Output: Local web dashboard (Elo + charts + drilldowns) │
│ Purpose: Calculate Elo ratings and explore results │
└─────────────────────────────────────────────────────────────┘
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- (Optional) Configure environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your API key and settings
  ```

- Set your API key(s) (choose one method):

  ```bash
  # Option 1: Export in current shell
  export API_KEY=your-api-key-here
  export JUDGE_API_KEY=your-judge-api-key-here

  # Option 2: Add to .env file (not tracked by git)
  echo "API_KEY=your-api-key-here" > .env
  echo "JUDGE_API_KEY=your-judge-api-key-here" >> .env
  source .env

  # Option 3: Pass directly via command line (see examples below)
  ```

The benchmark consists of 3 stages:
- Translation - Generate translations from all models
- Comparison - Use LLM-as-judge to compare translation pairs
- Ranking - Calculate Elo ratings and model rankings
This benchmark relies on LinguaGacha for the initial translation stage.

- Generate Translations: Use LinguaGacha to translate the files in samples/ from Japanese to Chinese (or your target language).
- Save Output: Save the raw translated .txt files to a temporary directory (e.g., raw_translations/).
- Format for Benchmark: Run the provided wrapper script to add the required metadata headers for Stage 2.

```bash
python utils/wrap_lingua_gacha.py \
  --input-dir raw_translations \
  --output-dir translated_results \
  --model "Your-Model-Name"
```

This will populate translated_results/ with files named sample.Your-Model-Name.translated.txt, ready for comparison.
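Conceptually, the wrapper prepends a metadata header and separator to each raw translation and renames it to the benchmark's naming scheme. The sketch below is an illustration only: the field names and separator are assumptions based on the output format documented later in this README, and the real utils/wrap_lingua_gacha.py may differ.

```python
from datetime import datetime, timezone
from pathlib import Path

def wrap_translation(src: Path, out_dir: Path, model: str) -> Path:
    """Prepend a hypothetical metadata header to a raw translation and
    save it as <sample>.<model>.translated.txt (sketch, not the real tool)."""
    text = src.read_text(encoding="utf-8")
    header = "\n".join([
        f"sample: {src.name}",
        f"model: {model}",
        f"timestamp: {datetime.now(timezone.utc).isoformat()}",
        "source_lang: ja",
        "target_lang: zh",
        f"translated_length: {len(text)}",
        "success: true",
    ])
    out = out_dir / f"{src.stem}.{model}.translated.txt"
    # Separator between metadata and translated text (assumed format)
    out.write_text(header + "\n---\n" + text, encoding="utf-8")
    return out
```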
Uses an LLM as a judge to perform scheduled pairwise comparisons of translations (round-robin by default).
```bash
# Configure judge API (OpenAI-compatible recommended)
export JUDGE_API_URL="http://localhost:8000/v1/chat/completions"
export JUDGE_MODEL="your-judge-model-here"
export JUDGE_API_KEY="your-judge-api-key-here"

# Compare translations using LLM-as-judge (round-robin schedule by default)
./compare_translations.sh

# Example output: comparison_results/
# - sample1.model1_vs_model2.json
# - sample1.model4_vs_model9.json
# - ... (one round per sample)
# - comparison_summary.json
```

Each comparison JSON includes an anonymous A/B mapping, a detailed category breakdown, and a final winner (A/B/tie).
Round-robin note: with N models (N even), each sample is judged as one round of N/2 pairwise comparisons, so covering every pair exactly once takes N-1 samples (e.g. 16 models → 15 samples → 120 total comparisons).
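The arithmetic above can be checked with the classic circle-method scheduler. This is a sketch of the scheduling idea, not necessarily what compare_translations.sh does internally:

```python
def round_robin_rounds(models):
    """Yield one list of pairings per round using the circle method.

    With an even number N of models this gives N-1 rounds of N/2
    pairings each, covering every unordered pair exactly once.
    """
    models = list(models)
    if len(models) % 2:
        models.append(None)  # bye slot so an odd field still schedules
    n = len(models)
    for _ in range(n - 1):
        pairs = [(models[i], models[n - 1 - i]) for i in range(n // 2)]
        yield [(a, b) for a, b in pairs if a is not None and b is not None]
        # Rotate every position except the first (the "fixed" model)
        models = [models[0]] + [models[-1]] + models[1:-1]

rounds = list(round_robin_rounds([f"model{i}" for i in range(16)]))
# 16 models -> 15 rounds of 8 pairings = 120 unique comparisons
```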
Serves a local, interactive dashboard that computes Elo from the current comparison_results/ and lets you:
- Switch between judge models (or combined results)
- View Elo leaderboard + histogram
- View a pairwise heatmap
- Expand chapters to see all pairs and judge outputs
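The Elo computation the dashboard performs can be sketched as a standard sequential update over win/loss/tie records. The K-factor, base rating, and processing order below are assumptions for illustration, not necessarily what serve_dashboard.py uses:

```python
def elo_ratings(outcomes, k=32, base=1000.0):
    """Compute Elo ratings from (model_a, model_b, score) records,
    where score is 1.0 for an A win, 0.0 for a B win, 0.5 for a tie."""
    ratings = {}
    for a, b, score in outcomes:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        # Expected score of A against B under the logistic Elo model
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        ratings[a] = ra + k * (score - expected_a)
        ratings[b] = rb + k * ((1.0 - score) - (1.0 - expected_a))
    return ratings

ratings = elo_ratings([
    ("m1", "m2", 1.0),   # m1 beats m2
    ("m2", "m3", 0.5),   # m2 ties m3
    ("m1", "m3", 1.0),   # m1 beats m3
])
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass is conserved across comparisons.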
Live Demo: https://bench.opensakura.com/
To run locally:

```bash
python serve_dashboard.py --host 127.0.0.1 --port 8001
# Open http://127.0.0.1:8001 in your browser
```

Basic Usage - Translate samples with a single model:

```bash
python benchmark_translate.py http://localhost:8000/translate --models gpt-4
```

Multiple Models - Translate with multiple models (comma-separated):

```bash
python benchmark_translate.py http://localhost:8000/translate --models gpt-4,claude-3-opus,gemini-pro
```

Custom Directories - Specify custom sample and output directories:

```bash
python benchmark_translate.py http://localhost:8000/translate \
  --models gpt-4 \
  --samples ./my_samples \
  --output ./results
```

Different Languages - Translate from English to Japanese:

```bash
python benchmark_translate.py http://localhost:8000/translate \
  --models gpt-4 \
  --source en \
  --target ja
```

We ran this benchmark once and the results are available in the translated_results/ and comparison_results/ folders. The dashboard is live at https://bench.opensakura.com/.
The run included the following models:
DeepSeek Family:
- DeepSeek-R1-0528
- DeepSeek-V3-250324
- DeepSeek-V3.1
- DeepSeek-V3.1-Terminus
- DeepSeek-V3.2
GLM Family:
- GLM-4.5
- GLM-4.6
- GLM-4.7
Qwen Family:
- Qwen3-30B-A3B-Instruct-2507
- Qwen3-30B-A3B-Thinking-2507
- Qwen3-235B-A22B-Instruct-2507
- Qwen3-235B-A22B-Thinking-2507
- Qwen3-Next-80B-A3B-Instruct
- Qwen3-Next-80B-A3B-Thinking
Other Models:
- Kimi-K2
- MiniMax-M2
- api_url (required): Translation API endpoint URL (e.g., http://localhost:8000/api/v1/translate)
- --models, -m (required): Comma-separated list of model names
- --samples, -s: Directory containing sample text files (default: samples)
- --output, -o: Output directory for results (default: translated_results)
- --source: Source language code (default: ja)
- --target: Target language code (default: en)
- --api-key, -k: API key for authentication (can also use the API_KEY environment variable)
- --api-key-header: Custom API key header name (default: X-API-Key)
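For illustration, these documented flags map onto an argparse parser roughly as follows. This is a sketch mirroring the flag list above, not the actual benchmark_translate.py source:

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented benchmark_translate.py flags (sketch)."""
    p = argparse.ArgumentParser(description="Translate benchmark samples via an HTTP API")
    p.add_argument("api_url", help="Translation API endpoint URL")
    p.add_argument("--models", "-m", required=True, help="Comma-separated model names")
    p.add_argument("--samples", "-s", default="samples", help="Sample text directory")
    p.add_argument("--output", "-o", default="translated_results", help="Output directory")
    p.add_argument("--source", default="ja", help="Source language code")
    p.add_argument("--target", default="en", help="Target language code")
    # Fall back to the API_KEY environment variable when the flag is omitted
    p.add_argument("--api-key", "-k", default=os.environ.get("API_KEY"), help="API key")
    p.add_argument("--api-key-header", default="X-API-Key", help="API key header name")
    return p

args = build_parser().parse_args(
    ["http://localhost:8000/translate", "--models", "gpt-4,claude-3-opus"]
)
models = args.models.split(",")
```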
The benchmark supports API key authentication in three ways:

- Command line argument:

  ```bash
  ./translate_all.sh
  # Or with custom script:
  ./translate_custom.sh --api-key your-api-key-here
  ```

- Environment variable (recommended):

  ```bash
  export API_KEY=your-api-key-here
  ./translate_all.sh
  ```

- Direct Python usage:

  ```bash
  python benchmark_translate.py http://localhost:8000/api/v1/translate \
    --models DeepSeek-V3.2 \
    --api-key your-api-key-here
  ```
Each translation result is saved as a separate file with the format:

`<sample_name>.<model_name>.translated.txt`

Each file contains:

- Metadata section with:
  - Sample file name
  - Model used
  - Timestamp
  - Source/target languages
  - Text lengths
  - Success status
  - Error details (if failed)
- Separator
- Translated text

Example output file: kakuyomu.1177354054893080355.ch1.gpt-4.translated.txt
After completion, a JSON summary is created at translated_results/benchmark_summary.json containing:
- Total number of translations
- Success/failure counts
- Details for each translation
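A minimal sketch of producing such a summary from per-translation result records. The field names beyond those listed above (e.g. the exact JSON keys) are assumptions:

```python
import json
from pathlib import Path

def write_summary(results, out_dir="translated_results"):
    """Aggregate per-translation result dicts (each with a boolean
    'success' field) into a benchmark_summary.json file (sketch)."""
    summary = {
        "total": len(results),
        "succeeded": sum(1 for r in results if r.get("success")),
        "failed": sum(1 for r in results if not r.get("success")),
        "details": results,
    }
    path = Path(out_dir) / "benchmark_summary.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary, indent=2, ensure_ascii=False),
                    encoding="utf-8")
    return summary
```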