⚠️ Contributors: I know you are watching this. I can't stop you from pretending to be a user and just passing by — but if you dare contribute without following the contribution guidelines, I will go full Linus on you and show you the dark side 👿
OpenSakura is a community project for domain-specific LLM translation. We build datasets, train models, run evaluations, and publish benchmarks. We try to do it in public, with receipts.
If you are here to:
- find a translation model that actually understands your domain
- compare models without trusting vibes
- download datasets and benchmarks
- learn how we evaluate quality (and why it is hard)
...you are in the right place.
- Benchmark dashboard (LLM-as-judge + Elo): https://bench.opensakura.com/
- Translation model arena: https://arena.opensakura.com/
- Hugging Face org (datasets + models): https://huggingface.co/OpenSakura
- GitHub org: https://github.com/OpenSakura
General translation is hard. Domain translation is harder. It is not just "translate words". It is "translate style". It is "translate character voice". It is "translate jokes without killing them". It is "keep names consistent across 200 chapters".
Domains we care about (not exclusive):
- light novels
- visual novels / galgames
- web novels
- fan translations with strong style constraints
- other niche domains where generic MT collapses into polite nonsense
If you have ever seen an otherwise-smart model translate a character's catchphrase into 17 different versions in the same chapter, you understand the mission.
OpenSakura is not a single model. It is a loop.
- Gather and curate data.
- Clean it.
- Align it (when possible).
- Train / fine-tune.
- Evaluate.
- Argue politely.
- Repeat.
Outputs you will see:
- datasets (usually on Hugging Face)
- models (usually on Hugging Face)
- evaluation tooling (usually on GitHub)
- dashboards (sometimes live)
The live benchmark is here: https://bench.opensakura.com/
It is built around a simple idea:
- generate translations from many models on the same prompts
- compare outputs pairwise using an LLM judge
- aggregate results into an Elo-style rating
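The pairwise-to-Elo aggregation above can be sketched in a few lines. This is a minimal illustration, not OpenSakura's actual pipeline; the K-factor, starting rating, and model names are made up for the example.

```python
# Minimal Elo update over pairwise judge verdicts (illustrative values).

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Start every model at the same rating and feed in judge verdicts.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
verdicts = [("model-a", "model-b", 1.0), ("model-a", "model-b", 0.5)]
for a, b, score in verdicts:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score)
```

Because each update moves both ratings by the same amount in opposite directions, the total rating mass is conserved; only relative strength shifts.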
Why Elo?
- it is easy to interpret
- it works well for pairwise preferences
- it gives you a "current snapshot" of relative strength
Important caveats:
- Elo is not a universal truth.
- LLM judges can be biased.
- changing the judge can change the entire ranking
- a leaderboard is an invitation to overfit.
So we treat the dashboard as:
- a decision aid
- a regression detector
- a public log of what we tested
Not as:
- a sacred ranking carved into GPU stone tablets
Arena-style evaluation is the fun version of benchmarking.
Instead of only offline batches, the arena supports:
- head-to-head comparisons on the same source text
- quick "which output is better" votes
- multiple evaluation modes (fluency, faithfulness, style, consistency)
- transparent model metadata (when available)
If the arena ever goes down, assume we are either:
- migrating databases
- fighting CSS
- fighting rate limits
- fighting our own ambitions
Most datasets are published under the Hugging Face org: https://huggingface.co/OpenSakura
Typical workflow:
- Find a dataset that matches your domain and language pair.
- Read the dataset card.
- Use the `datasets` library to load it.
Example (generic):
```python
from datasets import load_dataset

ds = load_dataset("OpenSakura/<dataset-id>")
print(ds)
```

Notes:
- Some datasets are large; streaming may be necessary.
- Some datasets are designed for alignment; others for SFT.
- Always check the license and redistribution rules.
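For the large datasets mentioned above, streaming keeps memory bounded instead of downloading everything up front. A minimal sketch, assuming the Hugging Face `datasets` library; the dataset id stays a placeholder as in the examples above:

```python
from itertools import islice

def preview_stream(dataset_iter, n=5):
    """Return the first n examples from any (possibly streaming) iterable."""
    return list(islice(dataset_iter, n))

# With Hugging Face datasets (requires network access):
#   from datasets import load_dataset
#   ds = load_dataset("OpenSakura/<dataset-id>", streaming=True)
#   rows = preview_stream(ds["train"], n=5)
```

`streaming=True` returns an iterable dataset, so you can inspect the first few rows without waiting for a multi-gigabyte download.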
Models (when published) also live on Hugging Face: https://huggingface.co/OpenSakura
Depending on the model, you may use:
- `transformers` for local inference
- `vllm` or other serving stacks for high-throughput usage
- an OpenAI-compatible server (if you are comparing models via a judge pipeline)
Example (generic):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("OpenSakura/<model-id>")
model = AutoModelForCausalLM.from_pretrained("OpenSakura/<model-id>")
```

Translation is prompt-sensitive. If the model card suggests a prompt format, use it. If you ignore the prompt format, the model will ignore your expectations.
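To make "prompt-sensitive" concrete, here is a sketch of building a translation prompt from a template. The template below is entirely hypothetical; always use the format from the specific model card instead.

```python
# Hypothetical prompt template -- check the model card for the real one.
PROMPT_TEMPLATE = (
    "Translate the following Japanese light-novel passage into English.\n"
    "Keep character voice and honorifics.\n\n"
    "Source:\n{source}\n\nTranslation:"
)

def build_prompt(source_text: str) -> str:
    """Fill the template with the passage to translate."""
    return PROMPT_TEMPLATE.format(source=source_text)

# With transformers (requires the model weights):
#   inputs = tok(build_prompt(passage), return_tensors="pt")
#   out = model.generate(**inputs, max_new_tokens=512)
#   print(tok.decode(out[0], skip_special_tokens=True))
```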
We care about more than literal correctness.
Common failure modes in domain translation:
- flattening character voice into neutral narration
- losing honorifics / register / relationship cues
- inconsistent named entities
- mistranslating invented terms
- hallucinating connective tissue to sound smooth
- "polite rewrite" that silently changes meaning
So evaluation tends to include:
- adequacy (did it keep meaning?)
- fluency (is it readable?)
- style faithfulness (does it feel like the same character?)
- terminology consistency
- formatting faithfulness (line breaks, dialogue markers, etc.)
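Terminology consistency is one of the easier criteria to check mechanically: count how many distinct renderings a translation uses for each glossary term. A small sketch; the glossary term and its variants are made up for the example.

```python
import re
from collections import Counter

def term_variants(translation: str, variants: list[str]) -> Counter:
    """Count occurrences of each known rendering of a glossary term."""
    counts = Counter()
    for v in variants:
        counts[v] = len(re.findall(re.escape(v), translation))
    return counts

text = "Onii-chan said hello. Big brother waved. Onii-chan laughed."
counts = term_variants(text, ["Onii-chan", "Big brother"])

# More than one variant with a nonzero count signals inconsistency.
inconsistent = sum(1 for c in counts.values() if c > 0) > 1
```

This catches exactly the "17 different versions of the catchphrase" failure mode, though real checks also need fuzzy matching for inflected forms.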
Some domains contain adult content. Some contain violence. Some contain very online dialogue.
We do not promise that every artifact is safe for every audience. We do promise to take labeling and documentation seriously.
If you are using models/datasets in production:
- run your own safety checks
- run your own red teaming
- assume failures will happen at the worst possible time
If you are using them for fun:
- also run your own sanity checks
- but you can complain loudly (politely) when something breaks
Things we want to keep pushing:
- better alignment tooling for messy source material
- higher-quality evaluation sets
- more judge diversity and bias analysis
- an arena that is actually useful (not just pretty)
- more language pairs
- more domains
Things we want to avoid:
- "leaderboard chasing" without understanding why scores move
- publishing unlicensed data
- turning the project into a single-model fandom
Q: Is the benchmark definitive?
A: No. It is evidence.

Q: Why use an LLM judge?
A: Because humans are expensive and tired. Also because "translation quality" is not one scalar number.

Q: Why do two good models trade wins?
A: Style. And because judges are not perfect.

Q: Can I rely on scores to pick the best model for my story?
A: Use the dashboard to shortlist, then test on your actual content.

Q: Will you publish more models?
A: That is the plan. Training takes time, compute, and many small arguments.

Q: Do you support my extremely specific niche domain?
A: Maybe. If not, we probably want to.
OpenSakura is built by people who:
- like translation
- like models
- dislike low-quality evals
- dislike silent regressions
- enjoy shipping things anyway
If you use OpenSakura artifacts and they help you, tell a friend. If they do not help you, tell us what failed (with examples).