
OpenSakura 🌸

⚠️ Contributors: I know you are watching this. I can't stop you from pretending to be a user and just passing by — but if you dare contribute without following the contribution guidelines, I will go full Linus on you and show you the dark side 👿

OpenSakura is a community project for domain-specific LLM translation. We build datasets, train models, run evaluations, and publish benchmarks. We try to do it in public, with receipts.

If you are here to:

  • find a translation model that actually understands your domain
  • compare models without trusting vibes
  • download datasets and benchmarks
  • learn how we evaluate quality (and why it is hard)

...you are in the right place.


What "Domain-Specific Translation" Means Here

General translation is hard. Domain translation is harder. It is not just "translate words". It is "translate style". It is "translate character voice". It is "translate jokes without killing them". It is "keep names consistent across 200 chapters".

Domains we care about (not exclusive):

  • light novels
  • visual novels / galgames
  • web novels
  • fan translations with strong style constraints
  • other niche domains where generic MT collapses into polite nonsense

If you have ever seen an otherwise-smart model translate a character's catchphrase into 17 different versions in the same chapter, you understand the mission.

What We Publish

OpenSakura is not a single model. It is a loop.

  1. Gather and curate data.
  2. Clean it.
  3. Align it (when possible).
  4. Train / fine-tune.
  5. Evaluate.
  6. Argue politely.
  7. Repeat.

Outputs you will see:

  • datasets (usually on Hugging Face)
  • models (usually on Hugging Face)
  • evaluation tooling (usually on GitHub)
  • dashboards (sometimes live)

The Benchmark Dashboard (Elo, But For Translation)

The live benchmark is here: https://bench.opensakura.com/

It is built around a simple idea:

  • generate translations from many models on the same prompts
  • compare outputs pairwise using an LLM judge
  • aggregate results into an Elo-style rating
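A single pairwise comparison can be sketched as a request to an OpenAI-compatible chat endpoint. This is an invented illustration, not the project's actual judge pipeline: the prompt wording, the `judge_request` helper, and the `judge-model` name are all hypothetical.

```python
import json

def judge_request(source, output_a, output_b, judge_model="judge-model"):
    """Build a hypothetical pairwise-judge request body for an
    OpenAI-compatible chat completions endpoint."""
    prompt = (
        "Source text:\n{src}\n\n"
        "Translation A:\n{a}\n\n"
        "Translation B:\n{b}\n\n"
        "Which translation is more faithful and better in style? "
        "Answer with exactly 'A', 'B', or 'TIE'."
    ).format(src=source, a=output_a, b=output_b)
    return {
        "model": judge_model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic verdicts make results reproducible
    }

payload = json.dumps(judge_request("猫が鳴いた。", "The cat meowed.", "A cat cried."))
```

In practice you would also randomize which output is labeled A and which is B, since LLM judges are known to show position bias.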

Why Elo?

  • it is easy to interpret
  • it works well for pairwise preferences
  • it gives you a "current snapshot" of relative strength
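The aggregation step can be sketched with the standard Elo update. This is a minimal illustration of the formula, not the dashboard's actual code; the K-factor of 32 is an arbitrary choice here.

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, outcome, k=32.0):
    """Apply one pairwise judge verdict.

    outcome: 1.0 if A's translation won, 0.0 if B's won, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (outcome - e_a),
            r_b + k * ((1.0 - outcome) - (1.0 - e_a)))

# Two models start equal; A wins one comparison.
a, b = update(1000.0, 1000.0, 1.0)  # a rises to 1016, b falls to 984
```

Because each verdict moves only the two models involved, the ratings are a running snapshot that shifts as new comparisons arrive.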

Important caveats:

  • Elo is not a universal truth.
  • LLM judges can be biased.
  • changing the judge changes the world.
  • a leaderboard is an invitation to overfit.

So we treat the dashboard as:

  • a decision aid
  • a regression detector
  • a public log of what we tested

Not as:

  • a sacred ranking carved into GPU stone tablets

The Arena

Arena-style evaluation is the fun version of benchmarking.

Instead of running only offline batches, the arena supports:

  • head-to-head comparisons on the same source text
  • quick "which output is better" votes
  • multiple evaluation modes (fluency, faithfulness, style, consistency)
  • transparent model metadata (when available)

https://arena.opensakura.com/

If the arena ever goes down, assume we are either:

  • migrating databases
  • fighting CSS
  • fighting rate limits
  • fighting our own ambitions

Using OpenSakura Datasets

Most datasets are published under the Hugging Face org: https://huggingface.co/OpenSakura

Typical workflow:

  1. Find a dataset that matches your domain and language pair.
  2. Read the dataset card.
  3. Load it with the `datasets` library.

Example (generic):

```python
from datasets import load_dataset

# <dataset-id> is a placeholder; pick a real id from the org page.
ds = load_dataset("OpenSakura/<dataset-id>")
print(ds)

# For very large datasets, stream instead of downloading everything:
# ds = load_dataset("OpenSakura/<dataset-id>", streaming=True)
```

Notes:

  • Some datasets are large; streaming may be necessary.
  • Some datasets are designed for alignment; others for SFT.
  • Always check the license and redistribution rules.

Using OpenSakura Models

Models (when published) also live on Hugging Face: https://huggingface.co/OpenSakura

Depending on the model, you may use:

  • transformers for local inference
  • vllm or other serving stacks for high-throughput usage
  • an OpenAI-compatible server (if you are comparing models via a judge pipeline)

Example (generic):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# <model-id> is a placeholder; pick a real id from the org page.
tok = AutoTokenizer.from_pretrained("OpenSakura/<model-id>")
model = AutoModelForCausalLM.from_pretrained("OpenSakura/<model-id>")
```

Translation is prompt-sensitive. If the model card suggests a prompt format, use it. If you ignore the prompt format, the model will ignore your expectations.
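As a sketch of what "use the prompt format" means in practice, here is a hypothetical template. The `TEMPLATE` string, the `build_prompt` helper, and the glossary convention are invented for illustration; the authoritative format lives in each model card.

```python
# Hypothetical prompt format -- check the model card for the real one.
TEMPLATE = (
    "Translate the following Japanese passage into English.\n"
    "Glossary (use these renderings consistently): {glossary}\n"
    "Source:\n{source}\n"
    "Translation:"
)

def build_prompt(source, glossary=None):
    """Render the template, injecting glossary pairs if provided."""
    terms = ", ".join(f"{src} -> {dst}" for src, dst in (glossary or {}).items())
    return TEMPLATE.format(glossary=terms or "(none)", source=source)

prompt = build_prompt("彼女は笑った。", {"先輩": "senpai"})
```

Pinning terminology in the prompt is one cheap way to fight the "17 versions of one catchphrase" failure mode described above.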

What "Good" Looks Like (In This Project)

We care about more than literal correctness.

Common failure modes in domain translation:

  • flattening character voice into neutral narration
  • losing honorifics / register / relationship cues
  • inconsistent named entities
  • mistranslating invented terms
  • hallucinating connective tissue to sound smooth
  • "polite rewrite" that silently changes meaning

So evaluation tends to include:

  • adequacy (did it keep meaning?)
  • fluency (is it readable?)
  • style faithfulness (does it feel like the same character?)
  • terminology consistency
  • formatting faithfulness (line breaks, dialogue markers, etc.)
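A toy version of the terminology-consistency check might look like this. The `find_variant_terms` helper and its inputs are invented for illustration and are not project tooling; real checks also need tokenization and inflection handling.

```python
def find_variant_terms(chapters, variants):
    """Flag chapters where a known non-canonical rendering appears.

    chapters: list of translated chapter strings
    variants: renderings to flag, e.g. ["Senior Sakura", "Miss Sakura"]
    Returns a list of (chapter_index, variant) pairs.
    """
    return [
        (i, v)
        for i, text in enumerate(chapters)
        for v in variants
        if v in text
    ]

chapters = [
    "Sakura-senpai smiled.",
    "Senior Sakura smiled.",  # inconsistent rendering of the same name
]
flags = find_variant_terms(chapters, ["Senior Sakura"])
```

Even a naive substring scan like this catches the most jarring regressions before a reader does.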

Safety, Content, And Reality

Some domains contain adult content. Some contain violence. Some contain very online dialogue.

We do not promise that every artifact is safe for every audience. We do promise to take labeling and documentation seriously.

If you are using models/datasets in production:

  • run your own safety checks
  • run your own red teaming
  • assume failures will happen at the worst possible time

If you are using them for fun:

  • also run your own sanity checks
  • but you can complain loudly (politely) when something breaks

Roadmap (Not A Contract)

Things we want to keep pushing:

  • better alignment tooling for messy source material
  • higher-quality evaluation sets
  • more judge diversity and bias analysis
  • an arena that is actually useful (not just pretty)
  • more language pairs
  • more domains

Things we want to avoid:

  • "leaderboard chasing" without understanding why scores move
  • publishing unlicensed data
  • turning the project into a single-model fandom

FAQ (Mildly Unhelpful, But Honest)

Q: Is the benchmark definitive?
A: No. It is evidence.

Q: Why use an LLM judge?
A: Because humans are expensive and tired. Also because "translation quality" is not one scalar number.

Q: Why do two good models trade wins?
A: Style. And because judges are not perfect.

Q: Can I rely on scores to pick the best model for my story?
A: Use the dashboard to shortlist. Then test on your actual content.

Q: Will you publish more models?
A: That is the plan. Training takes time, compute, and many small arguments.

Q: Do you support my extremely specific niche domain?
A: Maybe. If not, we probably want to.

Credits

OpenSakura is built by people who:

  • like translation
  • like models
  • dislike low-quality evals
  • dislike silent regressions
  • enjoy shipping things anyway

If you use OpenSakura artifacts and they help you, tell a friend. If they do not help you, tell us what failed (with examples).
