⚠️ Contributors: I know you are watching this. I can't stop you from pretending to be a user and just passing by — but if you dare contribute without following the contribution guidelines, I will go full Linus on you and show you the dark side 👿
OpenSakura is a community project for domain-specific LLM translation. We build datasets, train models, run evaluations, and publish benchmarks. We try to do it in public, with receipts.
If you are here to:
- find a translation model that actually understands your domain
- compare models without trusting vibes
- download datasets and benchmarks
- learn how we evaluate quality (and why it is hard)
...you are in the right place.
- Benchmark dashboard (LLM-as-judge + Elo): https://bench.opensakura.com/
- Translation model arena: https://arena.opensakura.com/
- Hugging Face org (datasets + models): https://huggingface.co/OpenSakura
- GitHub org: https://github.com/OpenSakura
General translation is hard. Domain translation is harder. It is not just "translate words". It is "translate style". It is "translate character voice". It is "translate jokes without killing them". It is "keep names consistent across 200 chapters".
Domains we care about (not exclusive):
- light novels
- visual novels / galgames
- web novels
- fan translations with strong style constraints
- other niche domains where generic MT collapses into polite nonsense
If you have ever seen an otherwise-smart model translate a character's catchphrase into 17 different versions in the same chapter, you understand the mission.
OpenSakura is not a single model. It is a loop.
- Gather and curate data.
- Clean it.
- Align it (when possible).
- Train / fine-tune.
- Evaluate.
- Argue politely.
- Repeat.
Outputs you will see:
- datasets (usually on Hugging Face)
- models (usually on Hugging Face)
- evaluation tooling (usually on GitHub)
- dashboards (sometimes live)
The live benchmark is here: https://bench.opensakura.com/
It is built around a simple idea:
- generate translations from many models on the same prompts
- compare outputs pairwise using an LLM judge
- aggregate results into an Elo-style rating
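The pairwise-to-Elo aggregation above can be sketched in a few lines. This is a minimal illustration, not OpenSakura's actual pipeline; the K-factor, starting rating, and model names are made up for the example.

```python
# Minimal Elo update over pairwise judge verdicts (illustrative values).

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Start every model at the same rating and feed in judge verdicts.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
verdicts = [("model-a", "model-b", 1.0), ("model-a", "model-b", 0.5)]
for a, b, score in verdicts:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score)
```

Because each update moves both ratings by the same amount in opposite directions, the total rating mass is conserved; only relative strength shifts.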
Why Elo?
- it is easy to interpret
- it works well for pairwise preferences
- it gives you a "current snapshot" of relative strength
Important caveats:
- Elo is not a universal truth.
- LLM judges can be biased.
- changing the judge can change the entire ranking
- a leaderboard is an invitation to overfit.
So we treat the dashboard as:
- a decision aid
- a regression detector
- a public log of what we tested
Not as:
- a sacred ranking carved into GPU stone tablets
Arena-style evaluation is the fun version of benchmarking.
Instead of only offline batches, the arena supports:
- head-to-head comparisons on the same source text
- quick "which output is better" votes
- multiple evaluation modes (fluency, faithfulness, style, consistency)
- transparent model metadata (when available)
If the arena ever goes down, assume we are either:
- migrating databases
- fighting CSS
- fighting rate limits
- fighting our own ambitions
Most datasets are published under the Hugging Face org: https://huggingface.co/OpenSakura
Typical workflow:
- Find a dataset that matches your domain and language pair.
- Read the dataset card.
- Use the `datasets` library to load it.
Example (generic):
```python
from datasets import load_dataset

ds = load_dataset("OpenSakura/<dataset-id>")
print(ds)
```

Notes:
- Some datasets are large; streaming may be necessary.
- Some datasets are designed for alignment; others for SFT.
- Always check the license and redistribution rules.
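For the large datasets mentioned above, streaming keeps memory bounded instead of downloading everything up front. A minimal sketch, assuming the Hugging Face `datasets` library; the dataset id stays a placeholder as in the examples above:

```python
from itertools import islice

def preview_stream(dataset_iter, n=5):
    """Return the first n examples from any (possibly streaming) iterable."""
    return list(islice(dataset_iter, n))

# With Hugging Face datasets (requires network access):
#   from datasets import load_dataset
#   ds = load_dataset("OpenSakura/<dataset-id>", streaming=True)
#   rows = preview_stream(ds["train"], n=5)
```

`streaming=True` returns an iterable dataset, so you can inspect the first few rows without waiting for a multi-gigabyte download.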
Models (when published) also live on Hugging Face: https://huggingface.co/OpenSakura
Depending on the model, you may use:
- `transformers` for local inference
- `vllm` or other serving stacks for high-throughput usage
- an OpenAI-compatible server (if you are comparing models via a judge pipeline)
Example (generic):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("OpenSakura/<model-id>")
model = AutoModelForCausalLM.from_pretrained("OpenSakura/<model-id>")
```

Translation is prompt-sensitive. If the model card suggests a prompt format, use it. If you ignore the prompt format, the model will ignore your expectations.
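To make "prompt-sensitive" concrete, here is a sketch of building a translation prompt from a template. The template below is entirely hypothetical; always use the format from the specific model card instead.

```python
# Hypothetical prompt template -- check the model card for the real one.
PROMPT_TEMPLATE = (
    "Translate the following Japanese light-novel passage into English.\n"
    "Keep character voice and honorifics.\n\n"
    "Source:\n{source}\n\nTranslation:"
)

def build_prompt(source_text: str) -> str:
    """Fill the template with the passage to translate."""
    return PROMPT_TEMPLATE.format(source=source_text)

# With transformers (requires the model weights):
#   inputs = tok(build_prompt(passage), return_tensors="pt")
#   out = model.generate(**inputs, max_new_tokens=512)
#   print(tok.decode(out[0], skip_special_tokens=True))
```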
We care about more than literal correctness.
Common failure modes in domain translation:
- flattening character voice into neutral narration
- losing honorifics / register / relationship cues
- inconsistent named entities
- mistranslating invented terms
- hallucinating connective tissue to sound smooth
- "polite rewrite" that silently changes meaning
So evaluation tends to include:
- adequacy (did it keep meaning?)
- fluency (is it readable?)
- style faithfulness (does it feel like the same character?)
- terminology consistency
- formatting faithfulness (line breaks, dialogue markers, etc.)
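Terminology consistency is one of the easier criteria to check mechanically: count how many distinct renderings a translation uses for each glossary term. A small sketch; the glossary term and its variants are made up for the example.

```python
import re
from collections import Counter

def term_variants(translation: str, variants: list[str]) -> Counter:
    """Count occurrences of each known rendering of a glossary term."""
    counts = Counter()
    for v in variants:
        counts[v] = len(re.findall(re.escape(v), translation))
    return counts

text = "Onii-chan said hello. Big brother waved. Onii-chan laughed."
counts = term_variants(text, ["Onii-chan", "Big brother"])

# More than one variant with a nonzero count signals inconsistency.
inconsistent = sum(1 for c in counts.values() if c > 0) > 1
```

This catches exactly the "17 different versions of the catchphrase" failure mode, though real checks also need fuzzy matching for inflected forms.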
Some domains contain adult content. Some contain violence. Some contain very online dialogue.
We do not promise that every artifact is safe for every audience. We do promise to take labeling and documentation seriously.
If you are using models/datasets in production:
- run your own safety checks
- run your own red teaming
- assume failures will happen at the worst possible time
If you are using them for fun:
- also run your own sanity checks
- but you can complain loudly (politely) when something breaks
Things we want to keep pushing:
- better alignment tooling for messy source material
- higher-quality evaluation sets
- more judge diversity and bias analysis
- an arena that is actually useful (not just pretty)
- more language pairs
- more domains
Things we want to avoid:
- "leaderboard chasing" without understanding why scores move
- publishing unlicensed data
- turning the project into a single-model fandom
Q: Is the benchmark definitive?
A: No. It is evidence.

Q: Why use an LLM judge?
A: Because humans are expensive and tired. Also because "translation quality" is not one scalar number.

Q: Why do two good models trade wins?
A: Style. And because judges are not perfect.

Q: Can I rely on scores to pick the best model for my story?
A: Use the dashboard to shortlist, then test on your actual content.

Q: Will you publish more models?
A: That is the plan. Training takes time, compute, and many small arguments.

Q: Do you support my extremely specific niche domain?
A: Maybe. If not, we probably want to.
OpenSakura is built by people who:
- like translation
- like models
- dislike low-quality evals
- dislike silent regressions
- enjoy shipping things anyway
If you use OpenSakura artifacts and they help you, tell a friend. If they do not help you, tell us what failed (with examples).